Clustering Lat-Lon Data in PySpark

How do we map geolocation?

For any device connected to the Internet, we try to map its geolocation. We get geolocation data from two sources: the client and the IP address. For users (clients) who share their GPS location, we convert the latitude and longitude into the corresponding place name. For users who do not share their GPS location, we fall back to estimating their location from the IP address.
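As a minimal sketch of that lat/lon-to-place step (reverse geocoding): the function below is illustrative, with the `locator` injected so that any geopy-style geocoder (e.g. `geopy.geocoders.Nominatim`) can be plugged in, rate-limited and cached in practice.

```python
def latlon_to_address(lat, lon, locator):
    """Reverse-geocode a (lat, lon) pair into a human-readable address.

    `locator` is any object with a geopy-style reverse() method,
    e.g. geopy.geocoders.Nominatim(user_agent="...").
    Returns None when the point cannot be resolved.
    """
    location = locator.reverse((lat, lon), exactly_one=True)
    return location.address if location is not None else None
```

Keeping the geocoder behind a parameter also makes the function easy to test without network access.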

What does geolocation mean to us?

This is one of our most important data sets, with multiple use cases:

  • Sourcing apps from developers based on how users are distributed across geographies.
  • Targeting users with relevant apps.
  • Recommending locally relevant apps and content to the user based on their demographics.

The inaccuracy involved

Through extensive research on GPS location, we are able to place a user to within 10 meters. But the majority of users take the view that "my location should remain confidential", so for most users we do not get a GPS location at all and must fall back to IP-based estimates. Solving location at this scale ran into several problems:

  • Scalability: We started with K-means and DBSCAN clustering using Euclidean distance. With 150M+ users, clustering them all took an infeasible amount of time.
  • Finding the perfect K (number of clusters): The elbow method does give statistically well-separated clusters, but in practice the right K varies with each city cluster under consideration. Too few clusters do not fit the problem, while too many make the results inaccurate and hard to scale.
  • Distance measure: Euclidean distance is what most frameworks offer out of the box, but it is the wrong way to measure geographic distance.
  • Slow web scraping: We tried to map each city's centre to the user's IP-address location. In a distributed setup, only the master node is used for scraping geo-coordinates, so scraping just 24K addresses took almost 24 hours.
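The scraping bottleneck in the last point can be eased by fanning the geocoding calls out over a thread pool instead of issuing them one by one. A sketch, with `geocode_fn` standing in for whatever geocoder is actually used:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_coordinates(addresses, geocode_fn, max_workers=16):
    """Geocode many addresses concurrently.

    `geocode_fn` takes an address string and returns (lat, lon),
    or None when the address cannot be resolved.
    Returns a dict of address -> coordinates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(geocode_fn, addresses)
    return dict(zip(addresses, results))
```

Since geocoding is I/O-bound, threads (rather than processes) are enough to overlap the network waits.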

The research

After researching multiple ways to tackle the inaccuracy involved with geolocation, we came up with an approach that merges localities into the closest city, a.k.a. a city cluster. This improves user coverage within a city cluster (the recall metric) as well as accuracy. The motive behind this exercise is to club anomalies in the location (particularly the city) together.

Haversine distance

The Haversine formula gives the great-circle distance between two points on a sphere from their longitudes and latitudes. The first coordinate of each point is the latitude and the second the longitude, both in radians, so the data must be two-dimensional. The computation is detailed below, implemented as a Python function.
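Written out, with φ for latitude, λ for longitude and R the Earth's radius, the formula the function implements is:

```
a = sin²(Δφ / 2) + cos(φ₁) · cos(φ₂) · sin²(Δλ / 2)
c = 2 · arcsin(√a)
d = R · c
```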

from math import radians, sin, cos, asin, sqrt
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# fn. to get aerial distance in km
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))
    # radius of earth in kilometers; use 3959.87433 for miles
    r = 6372.8
    return c * r

# convert the fn. to a UDF for Spark use
haversine_distance_udf = udf(haversine_distance, FloatType())

Final approach — Nearest Neighbour

Since the number of clusters (K) was the big open question, we started with pre-defined centroids (cluster centres) and then assigned each city to its nearest cluster centre:

  • Standardize each string to camel case; also remove special characters and numbers.
  • Get the unique set of cities and states.
  • Form the address by concatenating city and state.
  • Filter out addresses for which geo-coordinates have not yet been scraped.
  • Geo-coordinates are scraped only for newly added locations. Multithreading is used to speed up the scraping.
  • If no geo-coordinates are present at all (the first run), it scrapes every address, which takes the maximum time.
  • Combine the already present geo-coordinates with the new ones.
  • Remove any null coordinates.
  • Outer join the city geo-coordinates with the city-cluster coordinates to get all possible combinations.
  • Calculate the Haversine distance (in km) between the city cluster and the city coordinates using the custom-built Python UDF.
  • For each city, keep the nearest city cluster along with the distance in km.
  • Customize which cities are to be considered city clusters.
  • Decide the range in km from the city cluster within which we want to target or exclude users.
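The core assignment step of the pipeline above can be sketched in plain Python (in production it is the Spark cross join plus the haversine UDF; `city_coords` and `cluster_centers` are illustrative names):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two decimal-degree points."""
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * asin(sqrt(a)) * 6372.8

def nearest_cluster(city_coords, cluster_centers):
    """Assign each city to its nearest pre-defined cluster centre.

    Both arguments map a name to a (lat, lon) tuple; returns
    city -> (cluster name, distance in km).
    """
    assigned = {}
    for city, (lat, lon) in city_coords.items():
        name, centre = min(
            cluster_centers.items(),
            key=lambda kv: haversine_distance(lat, lon, kv[1][0], kv[1][1]))
        assigned[city] = (name, haversine_distance(lat, lon, centre[0], centre[1]))
    return assigned
```

A cut-off on the returned distance then implements the last step: including or excluding cities within a chosen range of each cluster centre.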


Using geopy, we get the coordinates of the geographic area to centre the map on.

import geopy
import numpy as np
import folium

## get location
city = "Bhopal"
locator = geopy.geocoders.Nominatim(user_agent="MyCoder")
location = locator.geocode(city)
## keep latitude and longitude only
location = [location.latitude, location.longitude]
print("[lat, long]:", location)

## dtf: pandas DataFrame with lat, lon, city, city_cluster, centroid columns
x, y = "lat", "lon"
color = "city_cluster"
popup = "city"
marker = "centroid"
data = dtf.copy()
## create color column, one random color per cluster
lst_elements = sorted(list(dtf[color].unique()))
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF)
              for i in range(len(lst_elements))]
data["color"] = data[color].apply(lambda x:
                                  lst_colors[lst_elements.index(x)])
## initialize the map with the starting location
map_ = folium.Map(location=location, tiles="cartodbpositron",
                  zoom_start=11)
## add points
data.apply(lambda row: folium.CircleMarker(
    location=[row[x], row[y]], popup=row[popup],
    color=row["color"], fill=True).add_to(map_), axis=1)
## add html legend
legend_html = """<div style="position:fixed; bottom:10px; left:10px; border:2px solid black; z-index:9999; font-size:14px;">&nbsp;<b>"""+color+""":</b><br>"""
for i in lst_elements:
    legend_html = legend_html+"""&nbsp;<i class="fa fa-circle fa-1x" style="color:"""+lst_colors[lst_elements.index(i)]+""""></i>&nbsp;"""+str(i)+"""<br>"""
legend_html = legend_html+"""</div>"""
map_.get_root().html.add_child(folium.Element(legend_html))
## add centroids marker
data[data[marker]==1].apply(lambda row: folium.Marker(
    location=[row[x], row[y]],
    popup=row[marker], draggable=False,
    icon=folium.Icon(color="black")).add_to(map_), axis=1)
## plot the map
map_
