Clustering Lat-Lon Data in PySpark

What is it?

In geographical clustering, the goal is to cluster geo points by their coordinates (latitude, longitude). The motive behind this exercise is to group together anomalies in location (particularly at the city level).


  • Scalability: We started off with K-means and DBSCAN clustering using Euclidean distance. With 40K unique cities, running multiple iterations for a large number of clusters takes an infeasible amount of time.
  • Finding the perfect K (# of clusters): The elbow method does give clusters that are statistically heterogeneous, but in practice the "perfect" K varies with each city cluster under consideration. A small number of clusters doesn't make sense for the problem, while a large number makes the results inaccurate and non-scalable.
  • Limited distance measures: Only Euclidean distance is available in Spark, which is an incorrect way of measuring geographic distance.
  • Slow web scraping: Only the master node is used for scraping geo coordinates. As a result, it takes almost 4 hours to scrape 4K addresses.
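The single-node scraping bottleneck can be eased by fanning requests out over a thread pool. Below is a minimal sketch: `geocode_addresses` and the stub geocoder are hypothetical names, and in practice `geocode_fn` would wrap a real geocoder (such as geopy's Nominatim, used later in this article), subject to that service's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor

def geocode_addresses(addresses, geocode_fn, max_workers=8):
    """Fetch coordinates for many addresses in parallel.

    geocode_fn: address -> (lat, lon) or None. A hypothetical stand-in
    for a real geocoder call (e.g. a geopy Nominatim wrapper).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        coords = list(pool.map(geocode_fn, addresses))
    return dict(zip(addresses, coords))

# demo with a stub geocoder (a dict lookup) standing in for the web call
known = {"Bhopal, MP": (23.2599, 77.4126), "Indore, MP": (22.7196, 75.8577)}
result = geocode_addresses(list(known), known.get)
print(result)
```

Since geocoding is I/O-bound, threads (rather than processes) are enough to overlap the network waits that dominate the 4-hour runtime.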

Haversine Distance

The Haversine (or great-circle) distance is the angular distance between two points on the surface of a sphere. The first coordinate of each point is assumed to be the latitude and the second the longitude, given in radians, so the data must be two-dimensional. The computation is detailed below and implemented as a Python function.

from math import radians, sin, cos, asin, sqrt

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# fn. to get aerial distance in km
def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate the great-circle distance between two points
    on the earth (specified in decimal degrees)."""
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
    c = 2.0 * asin(sqrt(a))
    #c = 2.0 * atan2(sqrt(a), sqrt(1.0 - a))
    # Mean radius of earth in kilometers. Use 3959.87433 for miles.
    r = 6372.8
    return c * r

# convert the fn. to a UDF for Spark use
haversine_distance_udf = udf(haversine_distance, FloatType())
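As a quick sanity check (plain Python, outside Spark), the formula can be exercised on two well-known coordinates; the New Delhi to Mumbai great-circle distance should come out to roughly 1,150 km. The function is repeated here only so the snippet runs standalone.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # same formula as haversine_distance above
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6372.8 * asin(sqrt(a))

# New Delhi (28.6139, 77.2090) -> Mumbai (19.0760, 72.8777)
d = haversine_km(28.6139, 77.2090, 19.0760, 72.8777)
print(round(d), "km")  # roughly 1,150 km
```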

Final Approach — Nearest Neighbour Classifier

Since the number of clusters (K) was a big question, we started off with pre-defined centroids (cluster centres) and then assigned the cities to the nearest cluster centre.

  • Standardise the strings to camel case; also remove special characters and numbers.
  • Get the unique set of cities and states.
  • Form the address by concatenating city and state.
  • Filter out addresses for which geo coordinates have not yet been scraped.
  • Geo coordinates are scraped only for newly added locations. Multithreading is used to speed up the scraping.
  • If no geo coordinates are present yet (first run), it scrapes all the addresses, which takes the maximum time.
  • Combine the already present geo coordinates with the new ones.
  • Remove any null coordinates.
  • Outer-join the city geo coordinates with the city cluster coordinates to get all possible combinations.
  • Calculate the haversine distance (in km) between the city cluster and the city coordinates using the custom-built Python UDF.
  • Filter out the nearest city cluster for each city, along with the distance in km.
  • Customise which cities are to be considered as city clusters.
  • Decide the range in km from the city cluster within which to target or exclude.
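The core of the steps above — pair every city with every candidate centroid, compute the haversine distance, and keep the nearest — can be sketched in plain Python. In Spark this becomes a crossJoin followed by a groupBy/min using the UDF; the centroid and city coordinates here are illustrative values, not the article's data.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in km between two (lat, lon) points in degrees
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6372.8 * asin(sqrt(a))

# hypothetical pre-defined cluster centres and cities to assign
centroids = {"Bhopal": (23.2599, 77.4126), "Mumbai": (19.0760, 72.8777)}
cities = {"Indore": (22.7196, 75.8577), "Pune": (18.5204, 73.8567)}

# for every city, keep the nearest centroid and the distance to it
assignments = {
    city: min(
        ((name, haversine_km(*coord, *c_coord))
         for name, c_coord in centroids.items()),
        key=lambda t: t[1],
    )
    for city, coord in cities.items()
}
print(assignments)
```

Because the centroids are fixed up front, there is no iterative refitting: assignment is a single pass, which is what makes this approach scale where K-means did not.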


Using geopy, we get the coordinates of the geographic area on which to centre the map.

## get location
city = "Bhopal"
locator = geopy.geocoders.Nominatim(user_agent="MyCoder")
location = locator.geocode(city)
## keep latitude and longitude only
location = [location.latitude, location.longitude]
print("[lat, long]:", location)

## dtf: dataframe with one row per city, its cluster and centroid flag
x, y = "lat", "lon"
color = "city_cluster"
popup = "city"
marker = "centroid"
data = dtf.copy()
## create color column
lst_elements = sorted(list(dtf[color].unique()))
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF)
              for i in range(len(lst_elements))]
data["color"] = data[color].apply(lambda x:
                    lst_colors[lst_elements.index(x)])
## initialize the map with the starting location
map_ = folium.Map(location=location, tiles="cartodbpositron",
                  zoom_start=11)
## add points
data.apply(lambda row: folium.CircleMarker(
           location=[row[x], row[y]], popup=row[popup],
           color=row["color"], fill=True).add_to(map_), axis=1)
## add html legend
legend_html = """<div style="position:fixed; bottom:10px; left:10px; border:2px solid black; z-index:9999; font-size:14px;">&nbsp;<b>"""+color+""":</b><br>"""
for i in lst_elements:
    legend_html = legend_html+"""&nbsp;<i class="fa fa-circle fa-1x" style="color:"""+lst_colors[lst_elements.index(i)]+"""">&nbsp;</i>&nbsp;"""+str(i)+"""<br>"""
legend_html = legend_html+"""</div>"""
map_.get_root().html.add_child(folium.Element(legend_html))
## add centroids marker
data[data[marker]==1].apply(lambda row: folium.Marker(
           location=[row[x], row[y]],
           popup=row[marker], draggable=False,
           icon=folium.Icon(color="black")).add_to(map_), axis=1)
## plot the map
map_

