Geolocation issue


#1

Hello,

Trying to identify web-users who enter our platform from different countries around the world (while travelling for example), I found some inconsistencies in the data. Precisely, I am looking at the attribute geo_country based on the unique domain_userid of the users. What I am getting is a large proportion of users who happen to be in two different countries at the same time, or in a time period of less than an hour.

Any ideas about what could have caused this issue?

Thank you,
Ioannis

P.S. The Snowplow JS web tracker is updated to its latest version.


#2

Hi @ioannis,

I have several hypotheses about it. But first of all, geo_country is populated by the IP Lookups enrichment from the MaxMind DB, based on the user’s IP address, rather than from their longitude/latitude (retrieved from the Geolocation API). So changes in geo_country clearly originate from rapid IP address changes.

And it’s very easy to spoof your IP address using VPNs, proxies, or anonymizers. Some anonymizers, such as Tor, can change their exit nodes during one session, so I would imagine geo_country can contain very distant countries in a short period of time. These tools can be more or less popular among your user base depending on the domain of your services. They’re more popular among technically advanced users and in Asian countries. I also think the cohort using anonymizers can be identified by their devices: it’s much more difficult to use these tools from mobile.

Another hypothesis is bots and crawlers. Not all bots are smart enough to correctly handle cookies and JS, but some certainly are.

The last hypothesis is anti-virus/adult-filter software installed on the user’s computer. This problem is especially widespread for users tracked with the JS tracker. In short, this software “intercepts” HTTP requests, fetches the response from the server itself, and checks whether there’s any “forbidden” content.

So, I think you can find an answer in the details of these events (many IPs from the same subnet? the same user agents? any other suspicious patterns?) or in the domain of your services (could it be that it’s popular among privacy-concerned users or even travelers?).
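
To make that kind of inspection concrete, here is a minimal Python sketch. It assumes the events have already been exported as a list of dicts keyed by the standard column names (domain_userid, geo_country, user_ipaddress, useragent, collector_tstamp), with collector_tstamp parsed into a datetime and user_ipaddress not anonymized. The function names and the /24 grouping are illustrative choices of mine, not anything Snowplow provides.

```python
from collections import defaultdict
from datetime import datetime, timedelta
import ipaddress


def subnet24(ip):
    """Return the /24 network containing the address, as a string (illustrative grouping)."""
    return str(ipaddress.ip_network(f"{ip}/24", strict=False))


def profile_users(events):
    """Collect distinct countries, IPs, /24 subnets and user agents per domain_userid."""
    stats = defaultdict(lambda: {"countries": set(), "ips": set(),
                                 "subnets": set(), "useragents": set()})
    for e in events:
        s = stats[e["domain_userid"]]
        s["countries"].add(e["geo_country"])
        s["ips"].add(e["user_ipaddress"])
        s["subnets"].add(subnet24(e["user_ipaddress"]))
        s["useragents"].add(e["useragent"])
    return stats


def country_jumps(events, window=timedelta(hours=1)):
    """Return (domain_userid, earlier_event, later_event) for consecutive events
    of the same user whose geo_country differs within `window`."""
    by_user = defaultdict(list)
    for e in events:
        by_user[e["domain_userid"]].append(e)
    jumps = []
    for uid, evs in by_user.items():
        evs.sort(key=lambda e: e["collector_tstamp"])
        for prev, cur in zip(evs, evs[1:]):
            gap = cur["collector_tstamp"] - prev["collector_tstamp"]
            if prev["geo_country"] != cur["geo_country"] and gap <= window:
                jumps.append((uid, prev, cur))
    return jumps
```

Users with many IPs but a single subnet and a single user agent would point towards a proxy pool, whereas many user agents behind one domain_userid would look more like a shared or replayed cookie.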


#3

Hi @anton

Thank you very much for your answer!

I am aware of the connection between the geo attributes and the IP address. The reason I use geo_country is that I found it more accurate compared to user_ipaddress, which can jump even within the same country/city. What confused me in the data is that I found a large proportion of users who seem to change countries and have page views on our platform in less than an hour. Since this percentage is around 30% in different samples, and given that I believe our users are not tech-savvy enough to use VPN technologies, I couldn’t explain it that way. However, your suggestion about mobile users proved to be correct, since for those users this percentage decreased significantly.

About the bots, I excluded them from the beginning, and concerning the anti-virus/adult-filter software, I am not sure how I can check that. Maybe you can give me some hints? Also, there are user cases where the subnet stays the same and others where it does not, and the same happens with the user agents.

Finally, and most importantly, I think that the problem I have is a bit broader. Specifically, some of the issues are:

  • the domain_sessionidx does not change every 30 mins, or two domain_sessionidx overlap at the same time
  • there are sessions with page pings and no page views
  • user cases where the dvce screen width and screen height change within the same session (and this is one of my doubts about whether the domain_userid is unique. How can the device change with the same cookie?)

An example is the user below:

[Moderator note: screenshot removed because it contained PII (IP addresses)]

This user had 167 page views on our platform yesterday within a time range of 21 hours, all of them in the same domain_sessionidx, in 4 different geo_countries, using 4 different user agents and 80 different IP addresses (some of them from the same subnet, some not), and their device screen width and screen height change over time. And it is not just an exception…

Overall, and excuse me for straying from the main topic of this discussion (maybe I should post a new one), I would like to understand whether this is something common in general, or whether something went wrong on my side when setting everything up. Is there a way to check that? Are there some key points that I should look for? I am new to Snowplow, so what do you suggest in order to start untangling the problems?

Thank you in advance, and I am looking forward to your suggestions!


#4

All the IPs that you’ve posted are Google-owned (and look like Google proxies), so I suspect you’ve run into a behaviour where a cookie is shared between multiple Google IP addresses that are being used as a proxy. If you look at the raw data, I’d be curious to see whether you are getting an X-Forwarded-For header, which may contain the actual client IP address rather than that of the proxy.

Edit: The IpAddressExtractor should be extracting the client IP (in X-Forwarded-For), though there may be something else in the headers to indicate it’s proxying a certain type of request.
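
For reference, this is roughly how an X-Forwarded-For value is conventionally read: the left-most entry is the address the first proxy saw, and each later entry is a proxy that forwarded the request. The Python below is only a sketch of that convention, not the actual IpAddressExtractor code, and the function names are hypothetical. Treating "-" as "no header" is also an assumption. Keep in mind that the header is supplied by the client or by proxies, so it can be forged.

```python
import ipaddress


def forwarding_chain(header_value):
    """Split an X-Forwarded-For value into its comma-separated entries, in order."""
    if not header_value:
        return []
    return [part.strip() for part in header_value.split(",") if part.strip()]


def client_ip_from_xff(header_value):
    """Return the left-most syntactically valid IP address, or None if there is none.

    A bare "-" is treated as a missing header (an assumption about placeholder values).
    """
    for candidate in forwarding_chain(header_value):
        if candidate == "-":
            continue
        try:
            return str(ipaddress.ip_address(candidate))
        except ValueError:
            # Entries such as "unknown" or "host:port" are skipped in this sketch.
            continue
    return None
```

If the extracted client IP is stable for a user while the connecting IP jumps between addresses, that would support the proxy explanation.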


#5

Hi @mike,

Thank you for your answer. Your idea sounds really interesting; however, I am not sure how I can check it. Could you explain a bit further?


#6

If you’re using the Scala stream collector:

  1. Have a look at the raw LZO files being sunk from the raw Kinesis stream into your configured S3 bucket
  2. Decode one of the files using lzop -d filename
  3. Either decode the collector payloads or look at their raw bytes. There’s a field in the CollectorPayload Thrift Schema (350) that will contain a list of headers sent with the request.
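
If you only want a quick look at the raw bytes without deserialising the Thrift records, a rough Python sketch like the one below could be enough. It assumes the headers are stored as plain "Name: value" strings inside the decoded file, which is worth verifying against your own data. The function names and the pattern are mine, not part of any Snowplow tooling.

```python
import re
from collections import Counter

# Matches "X-Forwarded-For:" followed by characters that can appear in a
# comma-separated list of IPv4/IPv6 addresses. Matching stops at the first
# other byte, such as the binary framing between serialized strings.
XFF_PATTERN = re.compile(rb"X-Forwarded-For: *([0-9A-Fa-f:., ]+)", re.IGNORECASE)


def xff_values(raw):
    """Return every X-Forwarded-For value found in a blob of raw bytes."""
    return [m.group(1).decode("ascii").strip() for m in XFF_PATTERN.finditer(raw)]


def count_xff_in_file(path):
    """Count how often each X-Forwarded-For value occurs in a decoded file."""
    with open(path, "rb") as fh:
        return Counter(xff_values(fh.read()))
```

Calling count_xff_in_file on the file produced by lzop -d in step 2 gives a quick tally of the forwarded addresses. Payloads without the header simply contribute nothing to the tally.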

#7

@mike,

I checked the IP headers (the x-forwarded-for field), and it seems that for the specific user case I provided above the header is also jumping a lot, so there is probably another proxy involved?

However, in most of the user cases where I have this problem, the ‘x-forwarded-for’ field is equal to ‘-’, so basically I don’t have the header information…

Thank you for your suggestion though!