Calculate distance between coordinates in two different dataframes - python

I have two dataframes as follow:
df_customers.head()
| | customer_id | zip_code_prefix | coords |
|----+----------------------------------+-------------------+-------------------------------------------|
| 0 | 06b8999e2fba1a1fbc88172c00ba8bc7 | 14409 | (-20.509897499999997, -47.3978655) |
| 1 | 18955e83d337fd6b2def6b18a428ac77 | 9790 | (-23.72685273154166, -46.54574582941039) |
| 2 | 4e7b3e00288586ebd08712fdd0374a03 | 1151 | (-23.527788191788307, -46.66030962184773) |
| 3 | b2b6027bc5c5109e529d4dc6358b12c3 | 8775 | (-23.49693002789165, -46.185351975305366) |
| 4 | 4f2d8ab171c80ec8364f7c12e35b23ad | 13056 | (-22.98722237101393, -47.151072819246686) |
+----+----------------------------------+-------------------+-------------------------------------------+
df_sellers.head()
| | seller_id | zip_code_prefix | coords |
|----+----------------------------------+-------------------+--------------------------------------------|
| 0 | 3442f8959a84dea7ee197c632cb2df15 | 13023 | (-22.898536428530225, -47.063125168330544) |
| 1 | d1b65fc7debc3361ea86b5f14c68d2e2 | 13844 | (-22.382941116125448, -46.94664125419024) |
| 2 | ce3ad9de960102d0677a81f5d0bb7b2d | 20031 | (-22.91064096725142, -43.17650983181368) |
| 3 | c0f3eea2e14555b6faeea3dd58c1b1c3 | 4195 | (-23.657250175378767, -46.61075944811122) |
| 4 | 51a04a8a6bdcb23deccc82b0b80742cf | 12914 | (-22.971647510075705, -46.53361841170685) |
+----+----------------------------------+-------------------+--------------------------------------------+
I would like to calculate the distance between those coords columns with the haversine library, without having to merge those dataframes (there is a many-to-many relationship between them).
So what I am looking for is a way to join both dataframes on the fly on the zip_code_prefix column, while using the haversine library to calculate the distance between the coordinates in km.
Is that possible?
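A minimal sketch of one possible approach, assuming the haversine package is installed and that coords stores (lat, lon) tuples: the merge on zip_code_prefix is only a transient step used to form the customer/seller pairs before computing the distance.
import pandas as pd
from haversine import haversine

# Pair up customers and sellers that share a zip_code_prefix
# (the many-to-many relationship expands into all matching pairs).
pairs = df_customers.merge(df_sellers,
                           on='zip_code_prefix',
                           suffixes=('_customer', '_seller'))

# haversine() returns the great-circle distance in kilometres by default.
pairs['distance_km'] = pairs.apply(
    lambda row: haversine(row['coords_customer'], row['coords_seller']),
    axis=1)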

Related

Fuzzymatcher returns NaN for best_match_score

I'm observing odd behaviour while performing a fuzzy_left_join from the fuzzymatcher library. Trying to join two dataframes, the left one with 5217 records and the right one with 8734, only 71 records end up with a best_match_score, which seems really odd. To achieve better results I even removed all the numbers and kept only alphabetical characters in the joining columns. In the merged table the id column from the right table is NaN, which is also a strange result.
left table - column for join "amazon_s3_name". First item - limonig
+------+---------+-------+-----------+------------------------------------+
| id | product | price | category | amazon_s3_name |
+------+---------+-------+-----------+------------------------------------+
| 1 | A | 1.49 | fruits | limonig |
| 8964 | B | 1.39 | beverages | studencajfuzelimonilimonetatrevaml |
| 9659 | C | 2.79 | beverages | studencajfuzelimonilimtreval |
+------+---------+-------+-----------+------------------------------------+
right table - column for join "amazon_s3_name" - last item - limoni
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| id | picture | amazon_s3_name |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
| 191 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G.jpg | ahmadcajlimonidjindjifilxg |
| 192 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/AhmadCajLimonIDjindjifil20X2G40g.jpg | ahmadcajlimonidjindjifilxgg |
| 204 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Ahmadcajlimonidjindjifil20x2g40g00051265.jpg | ahmadcajlimonidjindjifilxgg |
| 1608 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Cajstudenfuzetealimonilimonovatreva15lpet.jpg | cajstudenfuzetealimonilimonovatrevalpet |
| 4689 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo.jpg | lesieursalatensosslimonimaslinovomaslo |
| 4690 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Lesieursalatensosslimonimaslinovomaslo05l500ml01301150.jpg | lesieursalatensosslimonimaslinovomaslolml |
| 4723 | https://s3.eu-central-1.amazonaws.com/groceries.pictures/images/Limoni.jpg | limoni |
+------+----------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+
merged table - as we can see, best_match_score is NaN
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| id | best_match_score | __id_left | __id_right | price | category | amazon_s3_name_left | image_left | amazon_s3_name_left | image_right | amazon_s3_name_right |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
| 0 | NaN | 0_left | None | 1.49 | Fruits | Limoni500g09700112 | NaN | limonig | NaN | NaN |
| 2 | NaN | 2_left | None | 1.69 | Bio | Morkovi1kgbr09700132 | NaN | morkovikgbr | NaN | NaN |
+----+------------------+-----------+------------+-------+----------+----------------------+------------+---------------------+-------------+----------------------+
You could give polyfuzz a try. Use the setup from its examples, for instance with TF-IDF or BERT, then run:
model = PolyFuzz(matchers).match(df1["amazon_s3_name"].tolist(), df2["amazon_s3_name"].to_list())
df1['To'] = model.get_matches()['To']
then merge:
df1.merge(df2, left_on='To', right_on='amazon_s3_name')
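A slightly fuller sketch of that suggestion, assuming a TF-IDF matcher and that df1/df2 are the left and right tables above:
from polyfuzz import PolyFuzz
from polyfuzz.models import TFIDF

# Character n-gram TF-IDF matcher, as in the polyfuzz examples.
matcher = TFIDF(n_gram_range=(3, 3), min_similarity=0)
model = PolyFuzz(matcher).match(df1['amazon_s3_name'].to_list(),
                                df2['amazon_s3_name'].to_list())

# get_matches() returns a dataframe with From, To and Similarity columns.
df1['To'] = model.get_matches()['To']
merged = df1.merge(df2, left_on='To', right_on='amazon_s3_name',
                   suffixes=('_left', '_right'))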

Plotting a CDF from a multiclass pandas dataframe

I understand that the package empiricaldist provides a CDF function, as per the documentation.
However, I find it tricky to plot my dataframe when the grouping column has multiple values.
df.head()
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| | trip_id | seconds_start | seconds_end | duration | distance | speed | acceleration | lat_start | lon_start | lat_end | lon_end | travelmode |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
| 0 | 318410 | 1461743310 | 1461745298 | 1988 | 5121.49 | 2.58 | 0.00130 | 41.162687 | -8.615425 | 41.177888 | -8.597549 | car |
| 1 | 318411 | 1461749359 | 1461750290 | 931 | 1520.71 | 1.63 | 0.00175 | 41.177949 | -8.597074 | 41.177839 | -8.597574 | bus |
| 2 | 318421 | 1461806871 | 1461806941 | 70 | 508.15 | 7.26 | 0.10370 | 37.091240 | -8.211239 | 37.092322 | -8.206681 | foot |
| 3 | 318422 | 1461837354 | 1461838024 | 670 | 1207.39 | 1.80 | 0.00269 | 37.092082 | -8.205060 | 37.091659 | -8.206462 | car |
| 4 | 318425 | 1461852790 | 1461853845 | 1055 | 1470.49 | 1.39 | 0.00132 | 37.091628 | -8.202143 | 37.092095 | -8.205070 | foot |
+------+---------+---------------+-------------+----------+----------+-------+--------------+-----------+-----------+-----------+-----------+------------+
I would like to plot a CDF for each travel mode in the travelmode column.
groups = df.groupby('travelmode')
However, I don't really understand how this could be done from the documentation.
You can plot them in a loop, one figure per travel mode and column:
import matplotlib.pyplot as plt
from empiricaldist import Cdf

def decorate_plot(title):
    '''Adds labels to the plot'''
    plt.xlabel('Outcome')
    plt.ylabel('CDF')
    plt.title(title)

for tm in df['travelmode'].unique():
    for col in df.columns:
        if col != 'travelmode':
            # Create a new figure for each plot
            fig, ax = plt.subplots()
            # Build the CDF from the rows belonging to this travel mode only
            d4 = Cdf.from_seq(df.loc[df['travelmode'] == tm, col])
            d4.plot()
            decorate_plot(f"{tm} - {col}")

Multi-Index Lookup Mapping

I'm trying to create a new column which has a value based on 2 indices of that row. I have 2 dataframes with equivalent multi-index on the levels I'm querying (but not of equal size). For each row in the 1st dataframe, I want the value of the 2nd df that matches the row's indices.
I originally thought perhaps I could use a .loc[] and filter off the index values, but I cannot seem to get this to change the output row-by-row. If I wasn't using a dataframe object, I'd loop over the whole thing to do it.
I have tried to use the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import pandas as pd
import numpy as np

np.random.seed(1)

df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)
                   })
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)

v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)
                  })
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This returns an error about indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the DC index level of df into the columns, use assign to create the new column, then set_index and reorder_levels to restore the multi-index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
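Another sketch, assuming a pandas version with DataFrame.droplevel (0.24 or later, i.e. newer than the 0.23.4 mentioned above): align v on the index levels it shares with df and broadcast across the DC groups.
# Align v's single DC level onto every DC group of df via the shared
# (Aircraft, Test, Record) levels; .to_numpy() drops the index so the
# assignment is purely positional.
df['v'] = (v.droplevel('DC')['Value']
            .reindex(df.index.droplevel('DC'))
            .to_numpy())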

calculate difference of column values for sets of row indices which are not successive in pandas

Say I have the following table:
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1 | 0.83592 | 0.0046566 | 0.0039465 | 0.04779 | 0.12795 | 0.016108 | 0.0052323 | 0.00027477 | 1.1756 | 1 |
| 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.0052423 | 0.0050016 | 0.02416 | 0.090476 | 0.0081195 | 0.002708 | 7.48E-05 | 0.69659 | 1 |
| 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1 | 0.80812 | 0.0074573 | 0.010121 | 0.011897 | 0.057445 | 0.0032891 | 0.00092068 | 3.79E-05 | 0.44348 | 1 |
| 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1 | 0.81697 | 0.0068768 | 0.0086068 | 0.01595 | 0.065491 | 0.0042707 | 0.0011544 | 6.63E-05 | 0.58785 | 1 |
| 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1 | 0.75493 | 0.007428 | 0.010042 | 0.0079379 | 0.045339 | 0.0020514 | 0.00055986 | 2.35E-05 | 0.34214 | 1 |
| 7 | 0.82063 | 1.7529 | 0.44458 | 0.97964 | 0.99649 | 0.7677 | 0.0059279 | 0.0063954 | 0.018375 | 0.080587 | 0.0064523 | 0.0022713 | 4.15E-05 | 0.53904 | 1 |
| 8 | 0.77982 | 1.6215 | 0.39222 | 0.98512 | 0.99825 | 0.80816 | 0.0050987 | 0.0047314 | 0.024875 | 0.089686 | 0.0079794 | 0.0024664 | 0.00014676 | 0.66975 | 1 |
| 9 | 0.83089 | 1.8199 | 0.45693 | 0.9824 | 1 | 0.77106 | 0.0060055 | 0.006564 | 0.0072447 | 0.040616 | 0.0016469 | 0.00038812 | 3.29E-05 | 0.33696 | 1 |
| 11 | 0.7459 | 1.4927 | 0.34116 | 0.98296 | 1 | 0.83088 | 0.0055665 | 0.0056395 | 0.0057679 | 0.036511 | 0.0013313 | 0.00030872 | 3.18E-05 | 0.25026 | 1 |
| 12 | 0.79606 | 1.6934 | 0.43387 | 0.98181 | 1 | 0.76985 | 0.0077992 | 0.011071 | 0.013677 | 0.057832 | 0.0033334 | 0.00081648 | 0.00013855 | 0.49751 | 1 |
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
I have two sets of row indices:
set1 = [1,3,5,8,9]
set2 = [2,4,7,10,10]
Note: here I have indicated the first row with index value 1. The length of both sets will always be the same.
What I am looking for is a fast and pythonic way to get the difference of the column values for the corresponding row indices, that is: the differences 1-2, 3-4, 5-7, 8-10, 9-10.
For this example, my resultant dataframe is the following:
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.01479 | 0.0515 | 0.0372 | 0.00383 | 0.00175 | 0.03725 | 0.0005857 | 0.0010551 | 0.02363 | 0.037474 | 0.0079885 | 0.0025243 | 0.00019997 | 0.47901 | 0 |
| 1 | 0.02925 | 0.1128 | 0.03622 | 0.00189 | 0 | 0.00885 | 0.0005805 | 0.0015142 | 0.004053 | 0.008046 | 0.0009816 | 0.00023372 | 0.0000284 | 0.14437 | 0 |
| 3 | 0.04319 | 0.1492 | 0.0524 | 0.00814 | 0.00175 | 0.05323 | 0.0023293 | 0.0053106 | 0.0169371 | 0.044347 | 0.005928 | 0.00190654 | 0.00012326 | 0.32761 | 0 |
| 3 | 0.03483 | 0.1265 | 0.02306 | 0.00059 | 0 | 0.00121 | 0.0017937 | 0.004507 | 0.0064323 | 0.017216 | 0.0016865 | 0.00042836 | 0.00010565 | 0.16055 | 0 |
| 1 | 0.05016 | 0.2007 | 0.09271 | 0.00115 | 0 | 0.06103 | 0.0022327 | 0.0054315 | 0.0079091 | 0.021321 | 0.0020021 | 0.00050776 | 0.00010675 | 0.24725 | 0 |
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
My resultant difference values are absolute here.
I can't apply diff(), since the row indices may not be consecutive.
I am currently achieving this by looping through the sets.
Is there a pandas trick to do this?
Use loc-based indexing:
df.loc[set1].values - df.loc[set2].values
Ensure that len(set1) is equal to len(set2). Also, keep in mind setX is a counter-intuitive name for list objects.
You can select by reindexing and then subtract:
df = df.reindex(set1) - df.reindex(set2).values
loc or iloc will raise a FutureWarning, since passing list-likes to .loc or [] with any missing labels will raise a KeyError in the future.
In short, try the following:
df.iloc[::2].values - df.iloc[1::2].values
PS:
Or alternatively, if (as in your question) the indices follow no simple rule:
df.iloc[set1].values - df.iloc[set2].values
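If you also want the absolute values back as a dataframe (as in the expected output above), a small sketch building on the .loc answer, assuming every label in set1 and set2 exists in df.index:
import numpy as np
import pandas as pd

# Element-wise absolute differences of the paired rows, keeping df's columns.
diff = pd.DataFrame(np.abs(df.loc[set1].values - df.loc[set2].values),
                    columns=df.columns)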

Python group multiple lines into one vector line by a specific column value

How can I group multiple lines into one vector line by a specific column value?
As an example, I have this data:
| Latitude | Longitude | group |
|-----------|------------|-------|
| 46.852397 | -72.02586 | A |
| 47.059016 | -70.907962 | A |
| 46.897785 | -71.140082 | A |
| 46.99328 | -70.986152 | A |
| 46.64613 | -71.934034 | A |
| 46.622638 | -71.994857 | A |
| 46.968093 | -71.284281 | B |
| 47.422739 | -70.32361 | B |
| 46.878963 | -71.717918 | B |
| 46.91002 | -71.108395 | C |
| 47.465175 | -70.337958 | C |
| 46.6936 | -71.862257 | C |
| 47.40885 | -70.390739 | C |
| 47.00737 | -71.232117 | C |
| 47.013901 | -70.965815 | C |
| 46.824111 | -71.554997 | C |
| 47.003765 | -71.193865 | C |
| 46.665319 | -72.15102 | C |
| 47.129865 | -70.842406 | C |
| 46.932361 | -71.994677 | C |
that I would like to convert to this:
| group | Latitude | Longitude |
|-------|-------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------|
| A | [46.852397,47.059016,46.897785,46.99328,46.64613,46.622638] | [-72.02586,-70.907962,-71.140082,-70.986152,-71.934034,-71.994857] |
| B | [46.968093,47.422739,46.878963] | [-71.284281,-70.32361,-71.717918] |
| C | [46.91002,47.465175,46.6936,47.40885,47.00737,47.013901,46.824111,47.003765,46.665319,,47.129865,46.932361] | [-71.108395,-70.337958,-71.862257,-70.390739,-71.232117,-70.965815,-71.554997,-71.193865,-72.15102,-70.842406,-71.994677] |
Let's say you have a dataframe that looks like this:
>>> df
v1 v2 v3
0 1 2 a
1 3 4 a
2 1 2 b
3 3 4 b
Then, you can have what you want with:
>>> df.groupby('v3').agg(lambda m: list(m)).reset_index()
v3 v1 v2
0 a [1, 3] [2, 4]
1 b [1, 3] [2, 4]
However, this is generally a bad idea because pandas doesn't handle lists as values very well; it wasn't designed for that. If it works for you, though, go ahead and use it.
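Applied to the question's data, the same idea would look roughly like this (a sketch, assuming the table is loaded in a dataframe df with columns Latitude, Longitude and group):
# Collect the coordinates of each group into lists, one row per group.
result = (df.groupby('group', as_index=False)
            .agg({'Latitude': list, 'Longitude': list}))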
