Clustering Lat/Lon data in PySpark.
For any device connected to the Internet, we try to map its geolocation. We get geolocation data from two sources: the client and the IP address. For users (clients) who share their GPS location, we use a data analytics algorithm to convert the latitude and longitude to the corresponding location. For users who do not share their GPS location, we try to infer the location from the IP address.
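As a rough illustration of turning a latitude/longitude pair into a named location, here is a minimal sketch that picks the nearest entry from a small reference table using the haversine (great-circle) distance. The excerpt does not say which algorithm is actually used, so the nearest-neighbour approach, the `KNOWN_LOCATIONS` table, and the function names are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against tiny floating-point overshoot before sqrt/asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


# Hypothetical reference table (name -> (lat, lon)); not taken from the post.
KNOWN_LOCATIONS = {
    "Mumbai": (19.0760, 72.8777),
    "Delhi": (28.6139, 77.2090),
    "Bengaluru": (12.9716, 77.5946),
}


def nearest_location(lat, lon, locations=KNOWN_LOCATIONS):
    """Return the name of the reference location closest to (lat, lon)."""
    return min(locations, key=lambda name: haversine_km(lat, lon, *locations[name]))


if __name__ == "__main__":
    print(nearest_location(19.2, 72.9))
```

A linear scan like this is only reasonable for a small reference table; with many reference points or many devices, a spatial index or a distributed job would likely be a better fit.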
This is some of the most important data we have, with multiple use cases:
Although building extract-transform-load (ETL) pipelines with Apache Spark is familiar to many of us, computing ML-specific features at scale is still a challenging and interesting problem, because each business need or use case requires exploring a variety of algorithms and eventually scaling the chosen methodology.
In this post, we describe the motivation for and the means of computing one-to-all app similarity with Spark, combining cosine similarity with business heuristics.
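To make "one-to-all" concrete, here is a minimal sketch in plain Python (not Spark) that scores one target app against every other app by cosine similarity. The feature vectors in `APP_VECTORS`, the app names, and the `one_to_all` helper are hypothetical. The business heuristics mentioned in the excerpt are not specified there, so they are left out.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


def one_to_all(target, vectors, top_k=None):
    """Score `target` against every other app, most similar first."""
    query = vectors[target]
    scores = [
        (app, cosine_similarity(query, vec))
        for app, vec in vectors.items()
        if app != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores if top_k is None else scores[:top_k]


# Hypothetical feature vectors, one per app (for illustration only).
APP_VECTORS = {
    "app_a": [1.0, 0.0, 2.0],
    "app_b": [2.0, 0.0, 4.0],
    "app_c": [0.0, 3.0, 0.0],
    "app_d": [1.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for app, score in one_to_all("app_a", APP_VECTORS):
        print(app, round(score, 3))
```

Because each score depends only on the target vector and one other row, the same per-row function could in principle be distributed, for example by sharing the target vector with every worker and scoring rows in parallel.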
Because the data was around 450K records, processing it with standalone Python code would take ages. To be precise, it took ~12 hours using multiprocessing and was not feasible to scale…
A petrol-head who is a data scientist by profession and loves to solve any problem logically and travel illogically.