Clustering Lat-Lon Data in PySpark

What is it?

In geographical clustering, the goal is to cluster geo points using their coordinates (latitude, longitude). The motive behind this exercise is to club together anomalies in the location field (particularly the city).

It has been observed that prominent localities were marked as cities, for example Koramangala, Lajpat Nagar, Worli, Times Square, etc. This leads to mis-targeting at the city level, and some users don't get targeted at all.

To solve this, we have come up with an approach that combines these localities into the closest city, a.k.a. a city cluster. This helps us improve user coverage under a city cluster, i.e. …
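The excerpt above does not show the clustering code itself, so here is a minimal PySpark sketch of the general idea, assuming Spark ML's KMeans on raw latitude/longitude features. The column names, sample coordinates, and the value of k are illustrative assumptions for this sketch, not the post's actual pipeline.

```python
# Hedged sketch: cluster localities by (latitude, longitude) with Spark ML KMeans.
# user_id, the sample coordinates, and k are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("geo-clustering").getOrCreate()

# Geo points; in practice these would be read from a table of user locations.
points = spark.createDataFrame(
    [("u1", 12.9352, 77.6245),   # Koramangala, Bengaluru
     ("u2", 12.9719, 77.6412),   # Indiranagar, Bengaluru
     ("u3", 28.5672, 77.2430),   # Lajpat Nagar, Delhi
     ("u4", 19.0176, 72.8170)],  # Worli, Mumbai
    ["user_id", "latitude", "longitude"],
)

# Assemble the coordinates into a single feature vector per point.
assembler = VectorAssembler(inputCols=["latitude", "longitude"], outputCol="features")
features = assembler.transform(points)

# Fit KMeans and assign each locality to its nearest centroid (the "city cluster").
kmeans = KMeans(k=3, seed=42, featuresCol="features", predictionCol="city_cluster")
model = kmeans.fit(features)
model.transform(features).select("user_id", "latitude", "longitude", "city_cluster").show()
```

In this toy run the two Bengaluru localities fall into the same city cluster. For true geodesic distances, a haversine-based distance or a density-based method such as DBSCAN may be preferable to KMeans on raw coordinates, but the overall pipeline shape stays the same.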


Although creating extract-transform-load (ETL) pipelines with Apache Spark is second nature to many of us, computing ML-specific features at scale is still a challenging and interesting problem, as each business need/use case requires exploring a variety of algorithms and eventually scaling the chosen methodology.

In this post we describe the motivation for and means of computing one-to-all app similarity using cosine similarity, along with business heuristics, in Spark.

Since the data was around 450K records, processing it with standalone Python code would take ages. To be precise, it took ~12 hours even with multiprocessing and was not feasible to scale…
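As a hedged illustration of the cosine-similarity step only (the business heuristics are omitted), the sketch below scores one app against all others in a single distributed pass. The schema (app_id, features), the sample vectors, and the choice of target app are assumptions made for this example, not the post's actual data.

```python
# Hedged sketch: one-to-all cosine similarity over app feature vectors in PySpark.
# The schema (app_id, features), sample vectors, and target app are illustrative assumptions.
import numpy as np
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("app-similarity").getOrCreate()

apps = spark.createDataFrame(
    [("app_a", Vectors.dense([1.0, 0.0, 2.0])),
     ("app_b", Vectors.dense([0.5, 1.0, 1.5])),
     ("app_c", Vectors.dense([0.0, 3.0, 0.1]))],
    ["app_id", "features"],
)

# Broadcast the (normalised) target vector so every executor can score against it.
target = apps.filter(F.col("app_id") == "app_a").first()["features"].toArray()
target_bc = spark.sparkContext.broadcast(target / np.linalg.norm(target))

@F.udf(DoubleType())
def cosine_sim(v):
    # Cosine similarity between this row's vector and the broadcast target vector.
    arr = v.toArray()
    norm = np.linalg.norm(arr)
    return float(np.dot(arr / norm, target_bc.value)) if norm > 0 else 0.0

# Score every app against the target in one distributed pass and rank the results.
(apps.withColumn("similarity_to_app_a", cosine_sim("features"))
     .orderBy(F.desc("similarity_to_app_a"))
     .show())
```

At ~450K records this avoids any full cross join: only the single target vector is broadcast, and each partition computes its similarities independently, which is what makes the Spark version tractable where standalone multiprocessing was not.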

Vipin Chauhan

A petrol-head who is a data scientist by profession and loves to solve any problem logically and travel illogically.
