Implementing an efficient Geo IP location system in MySQL

Often an application needs to know where a user is physically located. The easiest way to figure that out is by looking up their IP address in a special database. It can all be implemented in MySQL, but I often see it done inefficiently. In this post I will show how to implement a complete solution that offers great performance.

Importing Geo IP data

First you will need a database mapping network addresses to real locations. There are various resources available, but I chose the one the nginx web server uses with its geoip module. GeoLite City comes in CSV format and is available for download free of charge from MaxMind.

The archive contains two files. GeoLiteCity-Blocks.csv lists all IP ranges and maps each one to the corresponding location identifier, while GeoLiteCity-Location.csv contains location details such as country, region, or city names. The contents of these files can be imported into two MySQL tables. I created the following structures:

CREATE TABLE `geoip_blocks` (
  `gbl_block_start` int(10) unsigned NOT NULL,
  `gbl_block_end` int(10) unsigned NOT NULL,
  `gbl_glc_id` int(10) unsigned NOT NULL,
  PRIMARY KEY (`gbl_block_start`,`gbl_block_end`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `geoip_locations` (
  `glc_id` int(10) unsigned NOT NULL,
  `glc_country` char(2) NOT NULL,
  `glc_region` varchar(2) NOT NULL,
  `glc_city` varchar(64) NOT NULL,
  `glc_zip` varchar(16) NOT NULL,
  `glc_latitude` decimal(7,4) NOT NULL,
  `glc_longitude` decimal(7,4) NOT NULL,
  PRIMARY KEY (`glc_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Please note that I used the INT UNSIGNED type for gbl_block_start and gbl_block_end, which will carry the network addresses. An IPv4 address can be represented either as text or as a 32-bit number. The database I have uses the latter, which is a good thing, because it makes the data much smaller and also easier to query.

The next step is preparing the CSV files for import. As it turns out, GeoLiteCity-Location.csv is not UTF-8 encoded, so a conversion is required, or otherwise the data will be broken once loaded into MySQL.

% file -ib GeoLiteCity-Location.csv                                                    
text/plain; charset=iso-8859-1

% iconv -f ISO-8859-1 -t UTF-8 GeoLiteCity-Location.csv > GeoLiteCity-Location-UTF8.csv

% file -ib GeoLiteCity-Location-UTF8.csv                                                    
text/plain; charset=utf-8

Then the data can be imported into the tables.

mysql> LOAD DATA INFILE '/home/maciek/GeoLiteCity_20120504/GeoLiteCity-Blocks.csv' 
       INTO TABLE geoip_blocks CHARACTER SET utf8 FIELDS TERMINATED BY ',' 
       OPTIONALLY ENCLOSED BY '"' IGNORE 2 LINES (gbl_block_start, gbl_block_end, gbl_glc_id);
Query OK, 1875384 rows affected (1 min 17.67 sec)
Records: 1875384  Deleted: 0  Skipped: 0  Warnings: 0

mysql> LOAD DATA INFILE '/home/maciek/GeoLiteCity_20120504/GeoLiteCity-Location-UTF8.csv' 
       INTO TABLE geoip_locations CHARACTER SET utf8 FIELDS TERMINATED BY ',' 
       OPTIONALLY ENCLOSED BY '"' IGNORE 2 LINES (glc_id, glc_country, glc_region, glc_city, 
       glc_zip, glc_latitude, glc_longitude, @ignored, @ignored);
Query OK, 353223 rows affected (3.01 sec)
Records: 353223  Deleted: 0  Skipped: 0  Warnings: 0

Be sure to verify that neither query reported any warnings during execution; otherwise the data was likely imported incorrectly.

Designing the lookup query

The next step is figuring out what query could efficiently map a particular IP address to the corresponding location. I already mentioned that our database uses numbers for the network address representation. Any application that uses the text format internally will have to perform a conversion first. MySQL has a built-in function called INET_ATON() that maps the a.b.c.d form into a number, but virtually every programming language offers something similar, so the conversion can be done anywhere.
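For example, the conversion can be done in application code with a few lines of standard-library Python. This is a minimal sketch that mirrors what INET_ATON() computes (the function name inet_aton_int is my own, just for illustration):

```python
import socket
import struct

def inet_aton_int(address: str) -> int:
    """Convert a dotted-quad IPv4 address to the 32-bit unsigned
    integer that MySQL's INET_ATON() would return."""
    # inet_aton gives the 4 packed bytes; "!I" unpacks them as a
    # big-endian unsigned 32-bit integer: a*2^24 + b*2^16 + c*2^8 + d
    return struct.unpack("!I", socket.inet_aton(address))[0]

print(inet_aton_int("149.156.1.4"))  # 2510029060
```

The resulting number can then be bound directly into the lookup query instead of calling INET_ATON() on the server.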

Bad query

I have seen many approaches to the problem. The most common one, implemented in so many places, is a query like this:

mysql> SELECT glc.*
       FROM   geoip_blocks gbl
              JOIN geoip_locations glc
              ON     glc.glc_id = gbl.gbl_glc_id
       WHERE  INET_ATON('149.156.1.4') BETWEEN gbl.gbl_block_start AND gbl.gbl_block_end\G
*************************** 1. row ***************************
         glc_id: 57106
    glc_country: PL
     glc_region: 77
       glc_city: Cracow
        glc_zip: 
   glc_latitude: 50.0833
  glc_longitude: 19.9167
1 row in set (0.43 sec)

It seems natural to simply look for the range that contains the IP address, but it is not very efficient:

mysql> SHOW STATUS LIKE 'Handler_read_%';
+----------------------------+---------+
| Variable_name              | Value   |
+----------------------------+---------+
| Handler_read_first         | 1       |
| Handler_read_key           | 2       |
| Handler_read_last          | 0       |
| Handler_read_next          | 1378462 |
| Handler_read_prev          | 0       |
| Handler_read_rnd           | 0       |
| Handler_read_rnd_next      | 0       |
+----------------------------+---------+
7 rows in set (0.00 sec)

mysql> EXPLAIN SELECT glc.*
       FROM   geoip_blocks gbl
              JOIN geoip_locations glc
              ON     glc.glc_id = gbl.gbl_glc_id
       WHERE  INET_ATON('149.156.1.4') BETWEEN gbl.gbl_block_start AND gbl.gbl_block_end\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: gbl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 937963
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: glc
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: geoip.gbl.gbl_glc_id
         rows: 1
        Extra: 
2 rows in set (0.00 sec)

It uses the primary key for filtering in the geoip_blocks table, but Handler_read_next in the SHOW STATUS output indicates that it reads a lot of rows along the way. MySQL cannot fully optimise a condition that involves multiple columns, even if all of them are part of the same index. What it does here is use the primary key to match only the rows where the address value is larger than gbl_block_start. This much it can do. However, it still means reading roughly 1.4M rows and rejecting all but one. The actual number can vary depending on the IP address value, but overall it is far from an optimal solution.
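To see why, here is a toy Python sketch (not MySQL internals, just an illustration with a handful of made-up ranges) of what a range scan bounded only on the start column has to do:

```python
# With an index on (block_start, block_end), the condition
#   addr BETWEEN block_start AND block_end
# only bounds block_start, so the scan visits every entry whose
# block_start <= addr and tests block_end row by row.
blocks = [(1, 10), (11, 20), (21, 30), (31, 40), (41, 50)]  # sorted by start
addr = 45

rows_read = 0
match = None
for start, end in blocks:   # index scan in block_start order
    if start > addr:        # past the only usable bound: stop
        break
    rows_read += 1          # analogous to Handler_read_next
    if end >= addr:         # WHERE clause rejects everything else
        match = (start, end)

print(rows_read, match)  # → 5 (41, 50)
```

All five rows are read to find the single match, which is exactly the pattern the Handler_read_next counter exposed above, just at a much smaller scale.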

Good query

What we know about these IP address ranges is that they should be mutually exclusive. I.e. if the values from 2510028800 to 2510045951 make one block, none of them can appear in any other block. So to discover which range an address falls into, all we need to find is either the preceding gbl_block_start value or the following gbl_block_end value. This can be done very efficiently:

mysql> SELECT   glc.*
       FROM     geoip_blocks gbl
                JOIN geoip_locations glc
                ON       glc.glc_id = gbl.gbl_glc_id
       WHERE    gbl_block_start <= INET_ATON('149.156.1.4')
       ORDER BY gbl_block_start DESC
       LIMIT    1\G  
*************************** 1. row ***************************
         glc_id: 57106
    glc_country: PL
     glc_region: 77
       glc_city: Cracow
        glc_zip: 
   glc_latitude: 50.0833
  glc_longitude: 19.9167
1 row in set (0.00 sec)

The execution time dropped from 0.43 sec to 0.00 sec, but if we compare the execution plans, they do not seem any different:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: gbl
         type: range
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: NULL
         rows: 937963
        Extra: Using where
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: glc
         type: eq_ref
possible_keys: PRIMARY
          key: PRIMARY
      key_len: 4
          ref: geoip.gbl.gbl_glc_id
         rows: 1
        Extra: 

They are indeed the same. But the way MySQL accessed the rows in this new query was completely different:

mysql> SHOW STATUS LIKE 'Handler_read_%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_read_first         | 0     |
| Handler_read_key           | 2     |
| Handler_read_last          | 0     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
+----------------------------+-------+

In this case MySQL simply starts reading rows backwards, in descending order, from the largest value that matches gbl_block_start <= INET_ATON('149.156.1.4'). Applying LIMIT 1 allows it to stop execution immediately after finding the first matching row. The goal of our query was to find a single row with the largest possible value, but no larger than INET_ATON('a.b.c.d'), and this is precisely where the query began its execution and where it ended: perfect efficiency, zero overhead.

About Maciej Dobrzanski

A MySQL consultant with a primary focus on the performance and scalability of systems, databases, and application stacks. An expert on open source technologies such as Linux, BSD, Apache, nginx, MySQL, and many more. @linkedin

Comments

  1. Unfortunately, the efficiency is limited to single lookups. If I have a series of IPs stored in a table which I want to plot on a map, your solution will not help. I’ve tried some tricks such as using your solution as a subquery within the SELECT section and it still ran slowly.

    For thousands of IP lookups into a single query, it was sufficient to add first_octet and second_octet columns into the IP lookup table, and start off the index with those. Then (here’s the important-to-remember-bit) you have to break out into two or more rows those IP ranges where they cross the second_octet (or the first), and make sure you have a single first_octet and second_octet for every range in your DB. On a decent machine, this should allow 100K+ IP lookups in a single query in a matter of seconds.

  2. I just wondered how well this approach handles any gaps in the address ranges. For example if you had ranges like this:

    1-10 11-20 22-30 …

    and you searched for the deliberately missing “address” 21, using “select low, high from iplookup where low <= 21 order by low desc limit 1” it would return the 11-20 block despite the address not being present in this range.

    So is it reasonable to assume that there must be no gaps in the data? I don't know if this is likely to be inherently the case
    with ip address allocations, but I could imagine so.

    Thanks,
    Brendan.

    • Brendan,

      Off the top of my head: you can pull the matched IP block data as well (i.e. the row from geoip_blocks) and then verify whether the looked-up IP address belongs to the block or not.

      • Max Bube says:

        You can add IF(endipnum >= inet_aton(?), 1, 0) as valid_result to validate the range.

  3. When I try the two SELECT statements on the “IP to country” table at my work, I don’t see any huge differences in the handler_read status variables. Our table has essentially the same structure as yours.

    Do you know if this handling differs between MySQL versions? We’re just running 5.0. What version are you running in your examples?

    • I do not think the version should matter here. My guess is that you did not create proper index or your query is somehow different.

      • Now I’ve tried it locally on MySQL 5.5. I copied the table definition and data from work. And now I see a difference; using BETWEEN handler_read_next becomes 32542, using <= and ORDER BY handler_read_next becomes 0.

        But wtf…? Now I also get these results on the servers at work. I didn't see that before. Hm… strange…

  4. Great article and thank you so much for sharing this! I’ve been having server performance issues since implementing user location detection with maxmind’s database. My query looked just like the example you posted and it usually took 0.2 seconds… it was actually faster without indexes on the tables. I tried your query and it was initially twice as slow, until I noticed the primary key you added to the block table. It’s lightning fast at 0.0005 seconds now! Thanks again!

  5. This works… My old statement used the BETWEEN and took something like 0.05!? Changed it to something like this and it took 0.5, so added an index to the begin_ip_range field and it dropped to something like 0.0004… Pretty Decent! Thanks!

