Unfortunately, I don't have a working solution for your exact problem, but the run manifests we introduced in Python Analytics SDK 0.2.0 could be helpful for solving problems like this. Run manifests are a lightweight mechanism that lets you track, from your own code via the Python or Scala Analytics SDK, which runs (directories inside enriched/archive) have already been processed by Spark (or other engines).
So you could end up with something similar to the following pseudo-code:
run_ids = list_runids()
for run_id in run_ids:
    if not manifest_contains(run_id):
        aggregated_data = process_enriched_data(run_id)
        add_to_manifest(run_id)
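To make the idea concrete, here is a minimal, self-contained sketch of that pattern. Note the assumptions: the manifest is just an in-memory set standing in for the SDK's DynamoDB-backed manifest, the archive is a plain dict standing in for S3, and `process_enriched_data` is a placeholder for whatever aggregation you actually need:

```python
# Sketch of the run-manifest pattern: process only run directories
# that have not been seen before, recording each one as we go.
# All names here are illustrative placeholders, not the SDK API.

def list_runids(archive):
    # The real SDK lists run=... directories under enriched/archive on S3;
    # here we just return the keys of a dict standing in for the bucket.
    return sorted(archive.keys())

def process_enriched_data(run_id, archive):
    # Placeholder aggregation: count the events in this run.
    return len(archive[run_id])

def process_new_runs(archive, manifest):
    results = {}
    for run_id in list_runids(archive):
        if run_id in manifest:
            continue  # already processed on a previous invocation
        results[run_id] = process_enriched_data(run_id, archive)
        manifest.add(run_id)  # record it so we never process it twice
    return results

archive = {
    "run=2017-01-01-00-00-00": ["ev1", "ev2"],
    "run=2017-01-02-00-00-00": ["ev3"],
}
manifest = {"run=2017-01-01-00-00-00"}  # processed in an earlier invocation
print(process_new_runs(archive, manifest))
# → {'run=2017-01-02-00-00-00': 1}
```

In the actual SDK the same roles are played (if I recall the 0.2.0 API correctly) by `list_runids` over an S3 client and a DynamoDB-backed `RunManifests` object from `snowplow_analytics_sdk.run_manifests`, so the structure of your real job should look very close to this loop.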
The end result will of course depend heavily on what you want to do with your data, but this is at least one approach we're currently exploring. I hope you find it useful.