Using the Python Tracker, should Tracker be a singleton, or recreated for each new event? Are there known memory leaks?

Context: We’ve been seeing steadily increasing memory usage on workers whose sole responsibility is firing these Snowplow events. We’re attempting to rule out misuse of the library as the culprit (something akin to a memory leak, perhaps).

1. Using the Python Tracker, should Tracker be a singleton, or recreated for each new event?

Suppose the Emitter is initialized once as a singleton. For each event fired, a new Tracker is created, the Subject is set on that Tracker, and tracker.track_self_describing_event() is called with the appropriate data. Is this an acceptable way to use Tracker?
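In rough pseudocode (the collector endpoint, schema URI, and payload here are placeholders):

```python
from snowplow_tracker import Emitter, Subject, Tracker, SelfDescribingJson

# Initialized once at worker startup and shared across all events
emitter = Emitter("collector.example.com")

def fire_event(user_id, payload):
    # A brand-new Tracker for every single event
    tracker = Tracker(emitter)
    tracker.set_subject(Subject().set_user_id(user_id))
    tracker.track_self_describing_event(
        SelfDescribingJson("iglu:com.example/my_event/jsonschema/1-0-0", payload)
    )
```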

The alternative is to keep Tracker as a long-lived instance and change the Subject set on it as needed (as the documentation appears to demonstrate). The concern here is the mutable state this introduces to the Tracker instance: without proper cleanup by any calling code that uses the Tracker, it could pass along Subject info from a previous event that is not desired.
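That is, something along these lines (reusing the singleton Emitter from the sketch above):

```python
# One long-lived Tracker shared by all callers
tracker = Tracker(emitter)

def fire_event(user_id, payload):
    # Mutates shared state on the Tracker before each event
    tracker.set_subject(Subject().set_user_id(user_id))
    tracker.track_self_describing_event(
        SelfDescribingJson("iglu:com.example/my_event/jsonschema/1-0-0", payload)
    )
    # Without explicit cleanup here, the next caller inherits this Subject
```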

2. Are there any known memory leaks aside from the above?

The other consideration is whether there are any known memory leaks in the Python Tracker library in general. We have some events that may have started passing along significantly larger amounts of data than before, potentially around the time the steady increase in memory usage began. We suspect the memory increase may be due to the library holding on to that data, and that we are only noticing it now because of how much data is being passed along.

Hello @shinee and welcome to the Snowplow community!

This is good timing! We’re just about to release Snowplow Python tracker v0.9.0 which also includes some changes related to what you describe.

1. Using the Python Tracker, should Tracker be a singleton, or recreated for each new event?

Having a single Tracker instance is the recommended approach for your use case. You can certainly have more than one Tracker, but that is intended for cases where you need different tracker parameters (e.g. a different tracker namespace).

However, as you said, the Subject, which is also a tracker parameter, is indeed mutable state. The upcoming 0.9.0 version addresses this by making it possible to set a different Subject per event without changing the Tracker's subject, which also makes tracking safe for multi-threaded applications.
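For illustration, a rough sketch of what that enables (assuming the per-event subject is passed to the track call via an event_subject argument):

```python
from snowplow_tracker import Emitter, Subject, Tracker, SelfDescribingJson

emitter = Emitter("collector.example.com")
tracker = Tracker(emitter)  # one long-lived instance

def fire_event(user_id, payload):
    # v0.9.0: the Subject applies to this event only; the Tracker's own
    # subject is left untouched, so no shared state is mutated
    tracker.track_self_describing_event(
        SelfDescribingJson("iglu:com.example/my_event/jsonschema/1-0-0", payload),
        event_subject=Subject().set_user_id(user_id),
    )
```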

2. Are there any known memory leaks aside from the above?

We are not aware of any reported memory leaks. If, as you describe, a new Tracker is created for every single event, that could explain the increasing memory usage you notice, if the Tracker instances are not released once they are no longer needed. That is why a single Tracker instance is the recommended approach for server applications.

We have some events that may have started passing along significantly larger amounts of data … We are suspicious of if the significant memory increase is due to holding on to that data

This is certainly worth looking into more closely. Generally, once the buffer is flushed it is also emptied, so I cannot think of any reference to the events that would not be released and so build up memory usage. It would be great if you could keep us posted on any further observations that might indicate this is not the case.
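While investigating, one way to rule out buffered payloads accumulating is to keep the buffer small or flush it explicitly, e.g. (assuming the current Emitter API, where the argument is buffer_size):

```python
from snowplow_tracker import Emitter

# Flush after every event so at most one payload sits in the buffer;
# the trade-off is one HTTP request per event
emitter = Emitter("collector.example.com", buffer_size=1)

# ...or flush manually at a checkpoint in the worker loop
emitter.flush()
```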

So, to wrap up, the better approach seems to be to have a single Tracker instance, and also to upgrade to v0.9.0 (once it is released), so that you can safely set a different Subject per event without having to create separate Tracker instances.


Thank you so much for the guidance, Ada! We will be eagerly awaiting the v0.9.0 release then 🙂

Hello @shinee !

Version 0.9.0 just released: Snowplow Python Tracker 0.9.0 released

The docs are updated too: Python Tracker - Snowplow Docs

Looking forward to your feedback!