Clustering Lat-Lon Data in PySpark
How do we map geolocation?
For any device connected to the Internet, we try to map its geolocation. We get geolocation data from two sources: the client and the IP address. For users (clients) who share their GPS location, we convert the latitude and longitude into their respective location. For users who do not share their GPS location, we try to derive their location from the IP address.
What does geolocation mean to us?
This is one of the most important data points for us, with multiple use cases:
- Sourcing apps from developers through the user’s distribution across geographies.
- Targeting users with relevant apps.
- Recommending locally relevant apps and content to the user based on their demographics.
The inaccuracy involved
Through extensive research on GPS location, we are able to mark a user's location to within 10 meters of accuracy. But most users feel that "my location should remain confidential", so for the majority of users we do not get the GPS location at all.
When we map geolocation from IP addresses using third-party services, we have observed that localities get marked as cities, for example, Koramangala, Lajpat Nagar, Worli, etc. This leads to mistargeting, and some users don't get targeted at all, which further impacts campaign performance, personalization and VOCs.
- Scalability: We started off with K-means and DBSCAN clustering using Euclidean distance. Since we have 150M+ users, clustering them took an infeasible amount of time.
- Finding the perfect K (# of clusters): The elbow method does give clusters that are statistically heterogeneous, but in practice the right K varies with each city cluster under consideration. Too small a number of clusters doesn't make sense for the problem, while too large a number makes the results inaccurate and non-scalable.
- Distance measure: Euclidean distance is readily available in most frameworks, but it is an incorrect way of measuring geographic distance.
- Slow web scraping: We tried to map each city's center to the user's IP address location. In distributed systems, only the master node is used for scraping geo coordinates. As a result, it was taking almost 24 hours to scrape just 24K addresses.
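To make the distance-measure issue concrete, here is a small, self-contained illustration (the coordinates are illustrative, not from our pipeline). Euclidean distance on raw degrees treats one degree of longitude and one degree of latitude as equal, but the ground distance covered by a degree of longitude shrinks with latitude:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2 * asin(sqrt(a)) * 6372.8

# one degree of longitude vs one degree of latitude near Delhi (~28.6 N)
d_lon = haversine_km(28.6, 77.0, 28.6, 78.0)   # ~97.6 km on the ground
d_lat = haversine_km(28.6, 77.0, 29.6, 77.0)   # ~111.2 km on the ground
print(round(d_lon, 1), round(d_lat, 1))
```

Both pairs are exactly 1.0 apart in Euclidean degree space, yet they differ by roughly 14% on the ground, which is why a geographic distance measure is needed.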
After researching multiple ways to solve the inaccuracy involved with geolocation, we came up with an approach that merges localities into the closest city, a.k.a. a city cluster. This helps us improve user coverage under a city cluster, i.e. the recall metric, as well as the accuracy. The motive behind this exercise is to club together anomalies in the location (particularly the city).
To find the nearest city to a locality, we need to calculate the distance between the locality and the city center, for which, after extensive research, we settled on the Haversine distance. The Haversine is the great-circle distance between two points on a sphere given their latitudes and longitudes. The first coordinate of each point is the latitude, the second is the longitude, given in radians, so the data must be two-dimensional.
As the Earth is an irregularly shaped ellipsoid, the haversine formula provides a good approximation of the distance between two points on the Earth's surface, with less than 1% error on average. The distance is defined as a function in Python and converted to a UDF for use in Spark.
from math import radians, sin, cos, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# fn. to get aerial distance in kms
def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))
    # radius of earth in kilometers; use 3959.87433 for miles
    r = 6372.8
    return c * r

# convert the fn. to a UDF for Spark use
haversine_distance_udf = udf(haversine_distance, FloatType())
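Before wiring the UDF into a pipeline, the plain-Python function can be sanity-checked on a known pair of cities (the coordinates below are approximate and purely illustrative):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (inputs in decimal degrees)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2.0 * asin(sqrt(a)) * 6372.8

# New Delhi to Mumbai, roughly 1,150 km as the crow flies
d = haversine_distance(28.6139, 77.2090, 19.0760, 72.8777)
print(round(d), "km")
```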
Final approach — Nearest Neighbour
Since the number of clusters (K) was a big question, we started off with pre-defined centroids (cluster centers) and then assigned the cities to the nearest cluster centers.
- City and state for each user are derived from client and IP addresses. Client locations are given preference over the location derived from the IP addresses.
- Standardize the string to Camel Case. Also, remove special characters and numbers.
- Get the unique set of cities and states.
- Form address by concatenating city and state.
- Filter out addresses for which geo coordinates have not yet been scraped.
- Geo coordinates are scraped only for newly added locations. Multithreading (pool.map) is used to speed up the scraping.
- In case no geo coordinates are present (the first run), it scrapes all the addresses and takes the maximum time.
- Combine already present Geo coordinates with the new ones.
- Remove any null coordinates.
- Outer join the city geo coordinates with the city cluster coordinates to get all possible combinations.
- Calculate the Haversine distance (in km) between the city cluster and the city coordinates using the custom-built Python UDF.
- For each city, keep the nearest city cluster along with the distance in km.
- Customize the cities to be considered as city clusters.
- Decide the range in km from the city cluster within which to target or exclude users.
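Stripped of the Spark plumbing, the core of the nearest-neighbour assignment above can be sketched in plain Python. The centroids, localities and the 50 km cutoff below are illustrative values, not our production configuration; in production this runs as a cross join plus the haversine UDF:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (inputs in decimal degrees)."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    return 2.0 * asin(sqrt(a)) * 6372.8

# pre-defined city-cluster centroids (approximate coordinates)
clusters = {"Bengaluru": (12.9716, 77.5946), "Delhi": (28.7041, 77.1025)}
# localities wrongly marked as cities by the IP-based service
cities = {"Koramangala": (12.9352, 77.6245), "Lajpat Nagar": (28.5700, 77.2373)}

def nearest_cluster(lat, lon, max_km=50):
    """Assign (lat, lon) to the nearest centroid within max_km, else None."""
    name, dist = min(
        ((c, haversine_distance(lat, lon, clat, clon))
         for c, (clat, clon) in clusters.items()),
        key=lambda t: t[1])
    return (name, round(dist, 1)) if dist <= max_km else (None, round(dist, 1))

for city, (lat, lon) in cities.items():
    print(city, "->", nearest_cluster(lat, lon))
```

Each locality collapses into its parent city cluster along with the distance, which is exactly what the range-based targeting step then filters on.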
Using geopy, we get the coordinates of the geographic area to start up the map.
import geopy.geocoders

## get location
city = "Bhopal"
locator = geopy.geocoders.Nominatim(user_agent="MyCoder")
location = locator.geocode(city)
print(location)

## keep latitude and longitude only
location = [location.latitude, location.longitude]
print("[lat, long]:", location)
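As noted earlier, scraping geo coordinates one address at a time on the master node is slow, so the scraping is parallelised with pool.map. A minimal sketch, assuming a generic geocode_fn; the helper name and pool size are illustrative, not our production code:

```python
from multiprocessing.dummy import Pool  # thread pool: geocoding is I/O-bound

def parallel_geocode(addresses, geocode_fn, workers=8):
    """Apply geocode_fn to each address concurrently via pool.map."""
    with Pool(workers) as pool:
        return pool.map(geocode_fn, addresses)

# usage with geopy (network call, so shown only as a comment):
#   locator = geopy.geocoders.Nominatim(user_agent="MyCoder")
#   coords = parallel_geocode(new_addresses, locator.geocode)
```

Because geocoding is network-bound rather than CPU-bound, a thread pool (multiprocessing.dummy) is enough to overlap the request latency; mind the rate limits of whichever geocoding service is used.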
Now we will create the map with folium, a really convenient package that allows us to plot interactive maps without needing to load a shapefile. Each city is identified by a point coloured according to its city cluster. A small piece of HTML has also been added to the default map to display the legend.
import numpy as np
import folium

x, y = "lat", "lon"
color = "city_cluster"
popup = "city"
marker = "centroid"
data = dtf.copy()

## create color column
lst_elements = sorted(list(dtf[color].unique()))
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF)
              for i in range(len(lst_elements))]
data["color"] = data[color].apply(lambda x:
                lst_colors[lst_elements.index(x)])

## initialize the map with the starting location
## (zoom_start chosen for a city-level view)
map_ = folium.Map(location=location, tiles="cartodbpositron",
                  zoom_start=11)

## add points
data.apply(lambda row: folium.CircleMarker(
           location=[row[x], row[y]], popup=row[popup],
           color=row["color"], fill=True, radius=3).add_to(map_), axis=1)

## add html legend
legend_html = """<div style="position:fixed; bottom:10px; left:10px; border:2px solid black; z-index:9999; font-size:14px;"> <b>"""+color+""":</b><br>"""
for i in lst_elements:
    legend_html = legend_html+""" <i class="fa fa-circle" style="color:"""+lst_colors[lst_elements.index(i)]+""""></i> """+str(i)+"""<br>"""
legend_html = legend_html+"""</div>"""
map_.get_root().html.add_child(folium.Element(legend_html))

## add centroids marker
lst_elements = sorted(list(dtf[marker].unique()))
data[data[marker]==1].apply(lambda row: folium.Marker(
           location=[row[x], row[y]], popup=str(row[marker]),
           icon=folium.Icon(color="black")).add_to(map_), axis=1)

## plot the map
map_