It’s a Big World – Where Should We Start?

As word of Sisense has spread, our international customer base has grown. In order to capture more international customers, we now run marketing campaigns targeting companies outside the U.S.

Before we start a campaign, we make sure to prioritize the countries we advertise in.

Having our own flexible user analytics is an important piece. Google Analytics gives us basic location data, but we need to be able to slice and dice by our quality metrics and other characteristics.

Extracting Geographical Information from Domain Names

Our first step is figuring out where our international signups are coming from. While country code top-level domains (ccTLDs) don’t always make it clear which country a company is based in, they can get you pretty far. For example, companies with an vn ccTLD email domain very likely are based in Vietnam.

We collect email domain data from the email field of incoming signups. These domains are of the form @domain.cctld — everything after the ‘.’ is the ccTLD.

We find the ‘.’ index from the url by reversing the url to find the right-most ‘.’, and then extract all of the characters after the period using left.

  count(1),  case
    when email like '%.%'
      then reverse(left(reverse(email), 
        charindex('.', reverse(email)) - 1))
    else 'none/'
  end as tld_path
group by 2

We then use Google Developer’s TLD to Latitude / Longitude table to visualize the results. It can help to remove domains with misleading or lacking geographical information, such as lyco, and io.

TLD map

Extracting Geographical Information from IP Addresses

Now that our international marketing campaigns are bringing in new customers, we need to find data centers closer to our customers to reduce latency.

In many cases we know the IP address of the user. IP addresses are rich with geographical information, and there are a number of companies that provide detailed information, such as zipcodes or the address of the local provider.

A great open source tool for converting an IP address to a country is ip2nation. They offer a mapping from IP address to country in the form of a MySQL database, complete with ISO codes and latitude and longitude.

with ip_addresses as (
  select (split_part(ip_address, '.', 1)::bigint << 24) +
       (split_part(ip_address, '.', 2)::bigint << 16) +
       (split_part(ip_address, '.', 3)::bigint << 8) +
       (split_part(ip_address, '.', 4)::bigint) as ip_address
  from pings 
, countries as (
    join ip2nation on
      ip = (
          ip_address > ip

This query first converts the IP address to its integer equivalent. We then find the maximum value in the ip2nation database that our IP address is greater than, and extract the matching country code.

IP map

Depending on the type of your database, there are different methods for converting an IP address to the integer equivalent.

In our example running on Amazon Redshift, we are explicitly converting each part of the IP address into an integer and taking the sum.

select (split_part('', '.', 1)::bigint << 24) +
       (split_part('', '.', 2)::bigint << 16) +
       (split_part('', '.', 3)::bigint << 8) +
       (split_part('', '.', 4)::bigint);

Different databases have different ways of extracting information from ip addresses.

A popular solution in Postgres is to use inet functions. To get the integer equivalent, you can subtract against the 0 valued IP address.

select ''::inet - ''::inet;

MySQL directly supports IP address to integer with the inet_aton function.

select inet_aton('');

Bear in mind this method is not without pitfalls: proxies and VPNs will provide an IP address that appears to come from another location.

Other Data Sources that Can Hint at Geolocation

If you don’t have access to urls or IP addresses, you can still look for ways to infer international popularity. Raw text can be used to extract the language, and by searching for common dialects and identifying locations you can infer the country of origin. User self-reported physical addresses and demographic features are very popular to help build geographic profiles.

Are you extracting geographical information from your dataset? We’d love to hear how!