Finding a User’s Locality

Free of cost, scalable, and with no API call limit.

DataOps Bonus: How it is deployed in AWS!!

What is it?

Users who share their GPS location can be mapped to their locality. A locality record consists of pin code, geohash, state, city, town, and the user’s distance in km from the locality. We can think of a geohash as a square area with sides of some X meters. There can be multiple localities within a geohash.

Challenges

  • To map users to their localities, we started by benchmarking 12+ APIs, both free and paid, on performance and accuracy. We narrowed the list down to a few, but the paid ones were too expensive, and processing time was a blocker for both free and paid options.
  • We tried both multiprocessing and asyncio. With multiprocessing, we got decent results.
  • asyncio was much faster, but we couldn’t measure its run time because our IP was banned or we hit the quota limit.

Geohashing

Geohashing is a geocoding method used to encode geographic coordinates (latitude and longitude) into a short string of digits and letters delineating an area on a map, which is called a cell, with varying resolutions. The more characters in the string, the more precise the location.

Pygeohash is a Python module that provides functions for decoding and encoding geohashes to and from latitude and longitude coordinates, and doing basic calculations and approximations with them.
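To make the encoding concrete, here is a minimal pure-Python sketch of how a geohash is built: the latitude and longitude ranges are halved repeatedly, the resulting bits are interleaved (longitude first), and every five bits become one base-32 character. The function name geohash_encode is ours, chosen for illustration. In practice Pygeohash’s encode(latitude, longitude, precision=...) should give the same result, so this is only meant to show what happens under the hood.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a latitude/longitude pair into a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1  # upper half of the range -> bit 1
            rng[0] = mid
        else:
            bits <<= 1              # lower half of the range -> bit 0
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:          # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0

    return "".join(chars)


# A longer hash refines the same cell, so shorter hashes are prefixes of longer ones.
print(geohash_encode(42.6, -5.6, precision=5))  # ezs42
print(geohash_encode(42.6, -5.6, precision=7))
```

Because a shorter geohash is always a prefix of the longer one for the same point, truncating the string is enough to move to a coarser cell.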

The Final Solution

  • We came across some open-source data for mapping localities to geo-coordinates. All the data sources were pre-processed and standardized, and the accuracy of the geo-coordinates was manually quality-checked for 100+ localities.
  • One of the major databases available is: https://www.geonames.org/
  • We also maintain a separate dataset that contains the geo-coordinates of localities provided by the analyst. The user can update it at any time, and it takes precedence over the open-source data. This keeps improving the accuracy and completeness of the data.
  • Currently, we have 19.5K+ localities, each with pin code, state, city, town, and locality.
  • Geohashing is applied to the postal codes’ geo-coordinates with ±0.61 km precision to remove duplicate localities within such a small region.
  • The geohash-to-locality mapping is saved for future reference, i.e. what each geohash means geographically.
  • Geohashing is applied to the users’ and localities’ geo-coordinates with ±20 km precision to limit the join and the comparisons.
  • Joining users and localities on geohash avoids unnecessary comparisons and distance calculations between a user and all other localities. A user is only matched against the localities within the joined geohash.
  • Users whose geo-coordinates fall outside India are removed.
  • The distance between the pin code’s and the user’s geo-coordinates is calculated using the haversine distance.
  • The closest locality is assigned to the user along with the distance in kilometers.
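The join and distance steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not our production job: the record layout (dicts with name, geohash, lat, lon), the function names, and the sample geohash strings are all hypothetical, and the geohash values are placeholders rather than hashes computed from the coordinates. A 4-character prefix is used because that is the level usually quoted as ±20 km in geohash precision tables.

```python
import math
from collections import defaultdict

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def build_index(localities, prefix_len=4):
    """Group localities by geohash prefix so each user is compared to one bucket only."""
    index = defaultdict(list)
    for loc in localities:
        index[loc["geohash"][:prefix_len]].append(loc)
    return index


def nearest_locality(user, index, prefix_len=4):
    """Return (locality, distance_km) for the closest locality in the user's cell."""
    candidates = index.get(user["geohash"][:prefix_len], [])
    if not candidates:
        return None, None  # no locality shares the user's geohash prefix
    scored = [
        (haversine_km(user["lat"], user["lon"], loc["lat"], loc["lon"]), loc)
        for loc in candidates
    ]
    distance, best = min(scored, key=lambda pair: pair[0])
    return best, distance


# Hypothetical sample data; geohash strings are illustrative placeholders.
localities = [
    {"name": "A", "geohash": "tdr1v9", "lat": 12.98, "lon": 77.60},
    {"name": "B", "geohash": "tdr1y2", "lat": 13.10, "lon": 77.70},
    {"name": "C", "geohash": "tdr4b0", "lat": 12.971, "lon": 77.591},  # other cell, never compared
]
user = {"geohash": "tdr1wx", "lat": 12.97, "lon": 77.59}

best, km = nearest_locality(user, build_index(localities))
print(best["name"], round(km, 2))
```

Note the trade-off this makes visible: a user close to a cell edge may have a nearer locality in a neighbouring cell, and a same-cell-only search will not see it.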

The Pipeline

As soon as users’ geo-coordinates land in S3, the pipeline uses geohashing to fetch their localities and writes the results back to S3.

To deploy the Lambda, we followed these steps:

  1. Create a Lambda function.
  2. Add an S3 trigger:
       • Event type: ObjectCreatedByPut
       • Bucket: XYZ
       • Prefix: user_location_data/
       • Suffix: _SUCCESS
  3. Configure the Lambda: add the request layer, set the environment variables, and extend the timeout.
  4. After this, the Lambda triggers an EMR job whenever a _SUCCESS file is created under s3://XYZ/user_location_data/. The job writes its output with the same path suffix, i.e. it maintains the date and hour folders.
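For readers who want to see what such a Lambda could look like, here is a hedged sketch of a handler. Several things in it are assumptions rather than details of our setup: that the job is submitted as a spark-submit step to an already-running cluster via boto3’s add_job_flow_steps, the environment variable names (EMR_CLUSTER_ID, JOB_SCRIPT, OUTPUT_PREFIX), the script location, the output prefix, and the date/hour folder layout used in the example key.

```python
import os
from urllib.parse import unquote_plus


def input_folder_from_event(event):
    """Return (bucket, folder) for the _SUCCESS marker that fired the S3 trigger."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
    folder = key.rsplit("/", 1)[0]  # drop the trailing "_SUCCESS" file name
    return bucket, folder


def build_step(bucket, folder):
    """Describe a spark-submit step that reads one folder and writes a mirrored one."""
    # Hypothetical names: script location and output prefix are illustrative only.
    script = os.environ.get("JOB_SCRIPT", f"s3://{bucket}/jobs/map_locality.py")
    out_prefix = os.environ.get("OUTPUT_PREFIX", "user_locality_data")
    suffix = folder.split("/", 1)[1] if "/" in folder else ""  # e.g. date/hour folders
    return {
        "Name": f"map-locality-{suffix}",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                script,
                f"s3://{bucket}/{folder}/",               # input: same folder as the marker
                f"s3://{bucket}/{out_prefix}/{suffix}/",  # output: same date/hour suffix
            ],
        },
    }


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS

    bucket, folder = input_folder_from_event(event)
    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(
        JobFlowId=os.environ["EMR_CLUSTER_ID"],  # assumed: an existing cluster id
        Steps=[build_step(bucket, folder)],
    )
    return {"step_ids": response["StepIds"]}
```

Keeping the event parsing and step construction in plain functions means they can be checked locally with a hand-written event before anything is deployed.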

A petrol-head who is a data scientist by profession and loves to solve any problem logically and travel illogically.