Clustering Lat lon data in Pyspark.
What is it?
In geographical clustering the goal is to cluster Geo points having Geo coordinates latitude, longitude. The motive behind this exercise is to club anomalies in the location (particularly city).
It has been observed that significant localities were marked as city, example koramangala, Lajpat nagar, Worli, Times Square etc. This leads to miss targeting at city level and some users don’t get targeted.
To solve this, we have come with an approach which combines these localities into the closest city a.k.a city cluster. This helps us improve users coverage under a city cluster i.e. the recall metric and not accuracy.
- Scalability : We started off with clustering using K-means and DBSCAN using the euclidean distance. Since the unique cities were 40K, doing multiple iterations for large number of clusters is taking unfeasible time.
- Finding the perfect K (# of clusters): Elbow method does give best clusters which are statistically heterogeneous but practically knowing the perfect K varies with each city cluster into consideration. Smaller number of clusters doesn’t make sense for the problem and large number of clusters is making the results inaccurate and non scalable.
- Limited distance measure algorithms: Only euclidean distance is available in Spark which is in-correct way for finding the Geo distance.
- Slow web scrapping: Only master node is used for scraping Geo coordinates. As a result, it is taking almost 4 hours to scrape for 4K addresses.
The Haversine (or great circle) distance is the angular distance between two points on the surface of a sphere. The first coordinate of each point is assumed to be the latitude, the second is the longitude, given in radians. The dimension of the data must be 2. Computing the distance is detailed down below and also built the same as a function in python.
As the Earth is nearly spherical, the haversine formula provides a good approximation of the distance between two points of the Earth surface, with a less than 1% error on average. The Haversine distance is defined as a function in python and convert to UDF for use in Spark.
# fn. to get areial distance in kms
def haversine_distance(lat1, lon1, lat2, lon2): """
Calculate the great circle distance between two points
on the earth (specified in decimal degrees)
""" #deg2rad = 3.141592653589793/180.0
deg2rad = pi/180.0 # convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) # haversine formula
dlon = (lon2 - lon1)
dlat = (lat2 - lat1) a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
c = 2.0 * asin(sqrt(a)) #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))
# Radius of earth in kilometers. Use 3956 for miles
#No rounding R = 3959.87433 (miles), 6372.8(km)
r = 6372.8 return c * r# convert the fn. to UDF for spark use
haversine_distance_udf = udf(haversine_distance, FloatType())
Final Approach — Nearest Neighbour Classifier
Since the number of clusters (K) was a big questions, we started off with pre-defined centroids (cluster centres) and then cluster/ assign the cities to the nearest cluster centres.
- City and state for each user is derived client and IP address. Client locations is given preference over the location derived from IP address.
- Standardise the string to Camel Case. Also, remove special characters and numbers.
- Get the unique set of city and states.
- Form address by concatenating city and state.
- Filter out a address for which Geo coordinates is not scrapped.
- Geo coordinates are scrapped for only newly added locations. Multi threading (pool.map) is used to fasten the scraping time.
- In case, no Geo coordinates are present (first time), it scrapes for the addresses and takes maximum time.
- Combine already present Geo coordinates with the new ones.
- Remove any null coordinates.
- Outer join to city Geo coordinates with city cluster coordinates to get all possible combinations.
- Calculate the Haversine distance (in KMS) between city cluster and the city coordinates using the custom build python UDF function.
- Filter out the nearest city cluster corresponding to each city along the distance in kms.
- Customise the cities to be considered as city clusters.
- Decide the range in KMS from city cluster where we want to do targeting or exclude.
Using geopy we will get the coordinates of the geographic area to start up the map.
## get location
city = "Bhopal"
locator = geopy.geocoders.Nominatim(user_agent="MyCoder")
location = locator.geocode(city)
print(location)## keep latitude and longitude only
location = [location.latitude, location.longitude]
print("[lat, long]:", location)
Now we will create the map with folium, a really convenient package that allows us to plot interactive maps without needing to load a shapefile. Each city shall be identified by a point with colour based on its city clsuter. Also, a small piece of HTML code to the default map to display the legend has been added.
x, y = "lat", "lon"
color = "city_cluster"
popup = "city"
marker = "centroid"
data = dtf.copy()## create color column
lst_elements = sorted(list(dtf[color].unique()))
lst_colors = ['#%06X' % np.random.randint(0, 0xFFFFFF) for i in
data["color"] = data[color].apply(lambda x:
lst_colors[lst_elements.index(x)])## initialize the map with the starting location
map_ = folium.Map(location=location, tiles="cartodbpositron",
## add points
data.apply(lambda row: folium.CircleMarker(
color=row["color"], fill=True).add_to(map_), axis=1)## add html legend
legend_html = """<div style="position:fixed; bottom:10px; left:10px; border:2px solid black; z-index:9999; font-size:14px;"> <b>"""+color+""":</b><br>"""
for i in lst_elements:
legend_html = legend_html+""" <i class="fa fa-circle
legend_html = legend_html+"""</div>"""
map_.get_root().html.add_child(folium.Element(legend_html))## add centroids marker
lst_elements = sorted(list(dtf[marker].unique()))
icon=folium.Icon(color="black")).add_to(map_), axis=1)## plot the map