I've got two dataframes with data about popular stores and the districts where they are located. Each store is a chain and may have more than one district location id (for example, "Store1" has several locations in different districts).
The first df has info about the top-5 most popular stores, with district ids separated by semicolons, for example:
store_name district_id
Store1 | 1;2;3;4;5
Store2 | 1;2
Store3 | 3
Store4 | 4;7;10;15
Store5 | 12;15;
The second df has only two columns, covering ALL districts in the city; each row is a unique district id and its name.
district_id district_name
1 | District1
2 | District2
3 | District3
4 | District4
5 | District5
6 | District6
7 | District7
8 | District8
9 | District9
10 | District10
etc.
The goal is to create columns in df1 for every store in the top-5 and match every district id number to its district name.
So, first I split df1 into this form:
store_name | 0 | 1 | 2 | 3 | 4
Store1 | 1 | 2 | 3 | 4 | 5
Store2 | 1 | 2 | | |
Store3 | 3 | | | |
Store4 | 4 | 7 | 10| 15|
Store5 | 12 | 15|
But now I'm stuck and don't know how to match each value in df1 against df2 to get the district name for each id. The empty cells are None, because the number of columns was set by the store with the most districts.
I would like to get df like this:
store_name district_name district_name2 district_name3 district_name4 district_name5
Store1 | District1 | District2 | District3 | District4 | District5
Store2 | District1 | District2 | | |
Store3 | District3 | | | |
Store4 | District4 | District7 | District10 | District15 |
Store5 | District12 | District15 | | |
Thanks in advance!
You can stack the first dataframe, convert it to float, map the district_name column from the second dataframe, then unstack and finally add_prefix:
df1.stack().astype(float).map(df2['district_name']).unstack().add_prefix('district_name')
OUTPUT:
district_name0 district_name1 ... district_name3 district_name4
store_name ...
Store1 District1 District2 ... District4 District5
Store2 District1 District2 ... NaN NaN
Store3 District3 NaN ... NaN NaN
Store4 District4 District7 ... NaN NaN
Store5 NaN NaN ... NaN NaN
The dataframes used for above code:
>>> df1
0 1 2 3 4
store_name
Store1 1 2 3 4 5
Store2 1 2 NaN NaN NaN
Store3 3 NaN NaN NaN NaN
Store4 4 7 10 15 NaN
Store5 12 15 NaN NaN NaN
>>> df2
district_name
district_id
1 District1
2 District2
3 District3
4 District4
5 District5
6 District6
7 District7
8 District8
9 District9
10 District10
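If you are starting from the raw frames in the question, one possible way to build the df1 and df2 shown above is the sketch below (column names are taken from the question; the handling of the trailing semicolon is my assumption):
import numpy as np
import pandas as pd

# raw frames as described in the question
raw1 = pd.DataFrame({
    'store_name': ['Store1', 'Store2', 'Store3', 'Store4', 'Store5'],
    'district_id': ['1;2;3;4;5', '1;2', '3', '4;7;10;15', '12;15;'],
})
raw2 = pd.DataFrame({
    'district_id': range(1, 11),
    'district_name': [f'District{i}' for i in range(1, 11)],
})

# split the semicolon strings into columns; a trailing ';' leaves an empty
# string, which is turned into NaN so stack() drops it later
df1 = (raw1.set_index('store_name')['district_id']
           .str.split(';', expand=True)
           .replace('', np.nan))

# index df2 by district_id so map() can look up names by id
df2 = raw2.set_index('district_id')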
There are many ways to do this; this is just one. Assume you have your two dataframes stored as df1 and df2.
First, normalize your district_id column in df1 so that they are all the same length:
# make all strings the same size when split
def return_full_string(text):
    l = len(text.split(';'))
    for _ in range(5 - l):
        text = f"{text};"
    return text

df1['district_id'] = df1.district_id.apply(return_full_string)
Then split the text column into separate columns and delete the original:
# split the district ids into separate columns
district_columns = [f"district_name{n+1}" for n in range(5)]
df1[district_columns] = list(df1.district_id.str.split(';'))
df1.drop(columns='district_id', inplace=True)
Then acquire a map of the ids in df2 to their respective names, and use that to replace the values in your new columns:
id_to_name = {str(ii): nn for ii, nn in zip(df2['district_id'], df2['district_name'])}
for col in district_columns:
    df1[col] = df1[col].apply(id_to_name.get)
Like I said, I'm sure there are other ways to do this, but this should work.
df1 = pd.DataFrame(data={'store_name': ['store1', 'store2', 'store3', 'store4', 'store5'],
                         'district_id': [[1, 2, 3, 4, 5], [1, 2], [3], [4, 7, 10], [8, 10]]})
df2 = pd.DataFrame(data={'district_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'district_name': ['District1', 'District2', 'District3', 'District4', 'District5',
                                           'District6', 'District7', 'District8', 'District9', 'District10']})
Step 1: use explode() to split the list values into rows
df3 = df1.explode('district_id').reset_index(drop=True)
Step 2: use merge() with on='district_id'
df4 = pd.merge(df3, df2, on='district_id')
Step 3: use groupby() and agg() to collapse each store back into lists
df5 = df4.groupby('store_name').agg(list).reset_index()
store_name district_id district_name
0 store1 [1, 2, 3, 4, 5] [District1,District2,District3,District4,District5]
1 store2 [1, 2] [District1,District2]
2 store3 [3] [District3]
3 store4 [4, 7, 10] [District4,District7,District10]
4 store5 [10, 8] [District10,District8]
Then it can be split however required.
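One possible way to do that split (a sketch continuing from df5 above; the column prefix is my assumption) is to expand the lists into one column per district:
# expand the lists into separate columns, padded with None for shorter lists
names = pd.DataFrame(df5['district_name'].tolist(),
                     index=df5['store_name']).add_prefix('district_name')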
I'd suggest something like the below and then pivot etc. as required, since having a column with strings like 1;2;3;4;5 in it is going to be awkward (I feel).
import pandas as pd
df1 = pd.DataFrame({'store_name': {0: 'Store1',
                                   1: 'Store2',
                                   2: 'Store3',
                                   3: 'Store4',
                                   4: 'Store5'},
                    'district_id': {0: '1;2;3;4;5',
                                    1: '1;2',
                                    2: '3',
                                    3: '4;7;10;15',
                                    4: '12;15;'}})
df3 = pd.DataFrame({'district_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
                    'district_name': {0: 'District1',
                                      1: 'District2',
                                      2: 'District3',
                                      3: 'District4',
                                      4: 'District5',
                                      5: 'District6',
                                      6: 'District7',
                                      7: 'District8',
                                      8: 'District9',
                                      9: 'District10'}})
# 'explode' the 'district_id' column with strings like '1;2;3;4;5' in df1
df2 = pd.DataFrame(df1.district_id.str.split(';').tolist(), index=df1.store_name).stack()
df2 = df2.reset_index()[[0, 'store_name']]
df2.columns = ['district_id', 'store_name']
df2 = df2[~df2['district_id'].eq('')]
df2['district_id'] = df2['district_id'].astype(int)
'''df2 Shows:
district_id store_name
0 1 Store1
1 2 Store1
2 3 Store1
3 4 Store1
4 5 Store1
etc.
'''
df4 = pd.merge(df2, df3, on='district_id', how='left')
print(df4)
district_id store_name district_name
0 1 Store1 District1
1 2 Store1 District2
2 3 Store1 District3
3 4 Store1 District4
4 5 Store1 District5
5 1 Store2 District1
6 2 Store2 District2
7 3 Store3 District3
8 4 Store4 District4
9 7 Store4 District7
10 10 Store4 District10
11 15 Store4 NaN
12 12 Store5 NaN
13 15 Store5 NaN
# From here you can pivot df4 etc. and carry on as required.
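One possible pivot of df4 into the wide shape asked for in the question (a sketch, not part of the original answer; the column naming is my assumption):
# number the districts within each store, then pivot to one column per position
df4['n'] = df4.groupby('store_name').cumcount() + 1
wide = (df4.pivot(index='store_name', columns='n', values='district_name')
           .add_prefix('district_name'))
print(wide)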
In the df below I already have column "A". I'm trying to add another column "Desired", whose value is the number of rows below the current one needed for the cumulative sum of A to first reach >= 8.
For example, row 1 of column "Desired" is 3 because 5+2+3 >= 8; row 2 is 4 because 2+3+2+2 >= 8.
The ideal new df would therefore be the one below.
df:
 A   Desired
 8   3
 5   4
 2   4
 3   4
 2   3
 2   2
 1   1
11   1
 8   NA
 6   NA
Use cumsum() and a for loop:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [8, 5, 2, 3, 2, 2, 1, 11, 8, 6]})
cumsum_arr = df['A'].cumsum().values
desired = np.zeros(len(df))
for i in range(len(df)):
    desired[i] = np.argmax((cumsum_arr[i:] - cumsum_arr[i]) >= 8)

df['desired'] = desired
df['desired'] = df['desired'].replace(0, np.nan)
    A  desired
0 8 3.0
1 5 4.0
2 2 4.0
3 3 4.0
4 2 3.0
5 2 2.0
6 1 1.0
7 11 1.0
8 8 NaN
9 6 NaN
Using a rolling() window, this can be achieved without any looping.
import io
import numpy as np
import pandas as pd

df = pd.read_csv(io.StringIO("""|A|Desired|
|8 |3 |
|5 |4 |
|2 |4 |
|3 |4 |
|2 |3 |
|2 |2 |
|1 |1 |
|11 |1 |
|8 |NA |
|6 |NA |"""),sep="|")
df = df.drop(columns=[c for c in df.columns if "Unnamed" in c])
df["Desired"] = pd.to_numeric(df["Desired"], errors="coerce").astype("Int64")
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rolling.html see example
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=len(df))
df["DesiredCalc"] = (df["A"]
# looking at rows after current row
.shift(-1)
.rolling(indexer, min_periods=1)
# if any result of cumsum()>=8 then return zero based index + 1, else no result
.apply(lambda x: np.where(np.cumsum(x).ge(8).any(), np.argmax(np.cumsum(x).ge(8)) + 1, np.nan))
.astype("Int64")
)
Output:
A Desired DesiredCalc
8 3 3
5 4 4
2 4 4
3 4 4
2 3 3
2 2 2
1 1 1
11 1 1
8 <NA> <NA>
6 <NA> <NA>
I have a dataframe visit =
visit_occurrence_id visit_start_date person_id
1 2016-06-01 1
2 2019-05-01 2
3 2016-01-22 1
4 2017-02-14 2
5 2018-05-11 3
and another dataframe measurement =
measurement_date person_id visit_occurrence_id
2017-09-04 1 NaN
2018-04-24 2 NaN
2018-05-22 2 NaN
2019-02-02 1 NaN
2019-01-28 3 NaN
2019-05-07 1 NaN
2018-12-11 3 NaN
2017-04-28 3 NaN
I want to fill in the visit_occurrence_id column of the measurement table with the visit_occurrence_id from the visit table, matching on person_id and the nearest possible date.
I have written code for this, but it's taking a lot of time.
measurement has 7*10^5 rows.
Note: visit_start_date and measurement_date are object (string) types.
My code:
import datetime as dt

unique_person_list = measurement['person_id'].unique().tolist()

def nearest_date(row, date_list):
    date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list]
    row = min(date_list, key=lambda x: abs(x - row))
    return row

modified_measurement = pd.DataFrame(columns=measurement.columns)
for person in unique_person_list:
    near_visit_dates = visit[visit['person_id'] == person]['visit_start_date'].tolist()
    if near_visit_dates:
        near_visit_dates = list(filter(None, near_visit_dates))
        near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates]
        store_dates = measurement.loc[measurement['person_id'] == person]['measurement_date']
        store_dates = store_dates.apply(nearest_date, args=(near_visit_dates,))
        modified_measurement = modified_measurement.append(store_dates)
My code's execution time is quite high. Can you help me either reduce the time complexity or suggest another solution?
Edit: adding dataframe constructors.
import numpy as np
import pandas as pd

measurement = {'measurement_date': ['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02',
                                    '2019-01-28', '2019-05-07', '2018-12-11', '2017-04-28'],
               'person_id': [1, 2, 2, 1, 3, 1, 3, 3],
               'visit_occurrence_id': [np.nan] * 8}
visit = {'visit_occurrence_id': [1, 2, 3, 4, 5],
         'visit_start_date': ['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'],
         'person_id': [1, 2, 1, 2, 3]}

# Create DataFrames
measurement = pd.DataFrame(measurement)
visit = pd.DataFrame(visit)
You can do the following:
import datetime

df = pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")
df["dt_diff"] = df[["visit_start_date", "measurement_date"]].apply(
    lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date()
                  - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()),
    axis=1)
df = pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(),
              on=["person_id", "dt_diff", "measurement_date"], how="inner")
res = pd.merge(measurement, df, on=["measurement_date", "person_id"],
               suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]
Output:
measurement_date person_id visit_occurrence_id_2
0 2017-09-04 1 1
1 2018-04-24 2 2
2 2018-05-22 2 2
3 2019-02-02 1 1
4 2019-01-28 3 5
5 2019-05-07 1 1
6 2018-12-11 3 5
7 2017-04-28 3 5
Here's what I've come up with:
# The date columns are strings in the question, so convert them first
# (otherwise the date subtraction below fails)
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])

# Get all visit start dates for each measurement
df = measurement.drop('visit_occurrence_id', axis=1).merge(visit, on='person_id')
df['date_difference'] = abs(df.measurement_date - df.visit_start_date)

# Find the smallest date difference for each person_id - measurement_date pair
df['smallest_difference'] = df.groupby(['person_id', 'measurement_date'])['date_difference'].transform('min')
df = df[df.date_difference == df.smallest_difference]
df = df[['measurement_date', 'person_id', 'visit_occurrence_id']]
# Fill in visit_occurrence_id from original dataframe
measurement.drop("visit_occurrence_id", axis=1).merge(
df, on=["measurement_date", "person_id"]
)
This produces:
| | measurement_date | person_id | visit_occurrence_id |
|---:|:-------------------|------------:|----------------------:|
| 0 | 2017-09-04 | 1 | 1 |
| 1 | 2018-04-24 | 2 | 2 |
| 2 | 2018-05-22 | 2 | 2 |
| 3 | 2019-02-02 | 1 | 1 |
| 4 | 2019-01-28 | 3 | 5 |
| 5 | 2019-05-07 | 1 | 1 |
| 6 | 2018-12-11 | 3 | 5 |
| 7 | 2017-04-28 | 3 | 5 |
I believe there's probably a cleaner way of writing this using sklearn: https://scikit-learn.org/stable/modules/neighbors.html
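Another option that may be worth a try (my suggestion, not from the answers above) is pd.merge_asof with direction='nearest', which is built for exactly this kind of nearest-key join and avoids the full per-person cross join; it does require datetime columns and sorted frames:
import pandas as pd

m = measurement.drop(columns='visit_occurrence_id').copy()
m['measurement_date'] = pd.to_datetime(m['measurement_date'])
v = visit.copy()
v['visit_start_date'] = pd.to_datetime(v['visit_start_date'])

# merge_asof needs both frames sorted by the join keys
m = m.sort_values('measurement_date')
v = v.sort_values('visit_start_date')

res = pd.merge_asof(m, v,
                    left_on='measurement_date', right_on='visit_start_date',
                    by='person_id', direction='nearest')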
I have a df which looks like this:
a b
apple | 7 | 2 |
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
other | 8 | 9 |
I want to select a given row by name, say "apple", and move it to a new location, say -1 (the second-to-last row).
desired output
a b
google | 8 | 8 |
swatch | 6 | 6 |
merc | 7 | 8 |
apple | 7 | 2 |
other | 8 | 9 |
Is there a function available to achieve this?
Use Index.difference to remove the value and numpy.insert to add it back at the new position, then use DataFrame.reindex or DataFrame.loc to change the order of the rows:
a = 'apple'
idx = np.insert(df.index.difference([a], sort=False), -1, a)
print (idx)
Index(['google', 'swatch', 'merc', 'apple', 'other'], dtype='object')
df = df.reindex(idx)
#alternative
#df = df.loc[idx]
print (df)
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
This seems good; I am using pd.Index.insert() and pd.Index.drop_duplicates():
df.reindex(df.index.insert(-1,'apple').drop_duplicates(keep='last'))
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
I'm not aware of any built-in function, but one approach would be to manipulate the index only, then use the new index to re-order the DataFrame (assumes all index values are unique):
name = 'apple'
position = -1
new_index = [i for i in df.index if i != name]
new_index.insert(position, name)
df = df.loc[new_index]
Results:
a b
google 8 8
swatch 6 6
merc 7 8
apple 7 2
other 8 9
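If you need this more than once, the index-manipulation idea above is easy to wrap in a small helper (a sketch; move_row is my own name, not a pandas function):
def move_row(df, name, position):
    """Return a copy of df with the row labelled `name` inserted at `position`."""
    new_index = [i for i in df.index if i != name]
    new_index.insert(position, name)
    return df.loc[new_index]

df = move_row(df, 'apple', -1)  # 'apple' becomes the second-to-last row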
I have a problem in Python working with a pandas dataframe. I'm trying to build a machine learning model predicting the surface. I have the surface column in the train dataframe but I don't have it in the test dataframe. So I would like to create some features based on the surface in train, like:
train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean') - train.surface.mean())
Here I have set, for each row, the absolute difference between the group mean of surface (grouped by the "cat1" feature) and the overall mean of surface. Cool.
Now I must add it to the test set too, so I use this method to map the value of each group from train onto the test rows:
mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)
So far there is no problem. Now I would like to use two columns in the groupby:
train['error_cat1_cat2'] = abs(train.groupby(['cat1', 'cat2'])['surface'].transform('mean') - train.surface.mean())
but I don't know how to map it onto the test dataframe. Please can you help me handle this problem, or give me some other methods so I can do it.
Thanks
For example, my train is:
+------+------+-------+
| Cat1 | Cat2 | surface |
+------+------+-------+
| 1 | 3 | 10 |
+------+------+-------+
| 2 | 2 | 12 |
+------+------+-------+
| 3 | 1 | 12 |
+------+------+-------+
| 1 | 3 | 5 |
+------+------+-------+
| 2 | 2 | 10 |
+------+------+-------+
| 3 | 2 | 13 |
+------+------+-------+
my test is
+------+------+
| Cat1 | Cat2 |
+------+------+
| 1 | 2 |
+------+------+
| 2 | 1 |
+------+------+
| 3 | 1 |
+------+------+
| 1 | 3 |
+------+------+
| 2 | 3 |
+------+------+
| 3 | 1 |
+------+------+
Now I want the groupby mean of surface on cat1 and cat2; for example, the mean surface for (cat1, cat2) = (1, 3) is (10+5)/2 = 7.5.
Then I need to go to the test set and map this value onto the rows where (cat1, cat2) = (1, 3).
I hope that makes it clear.
You can use:
groupby().mean() to calculate the means
reset_index() to convert the indexes Cat1, Cat2 back into columns
merge(how='left') to join the two dataframes like tables in a database (LEFT JOIN in SQL).
headers = ['Cat1', 'Cat2', 'surface']

train_data = [
    [1, 3, 10],
    [2, 2, 12],
    [3, 1, 12],
    [1, 3, 5],
    [2, 2, 10],
    [3, 2, 13],
]

test_data = [
    [1, 2],
    [2, 1],
    [3, 1],
    [1, 3],
    [2, 3],
    [3, 1],
]

import pandas as pd

train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])
print('--- train ---')
print(train)
print('--- test ---')
print(test)
print('--- means ---')
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)
print('--- means (dataframe) ---')
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)
print('--- result ----')
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)
print('--- result (fillna)---')
result = result.fillna(0)
print(result)
Result:
--- train ---
Cat1 Cat2 surface
0 1 3 10
1 2 2 12
2 3 1 12
3 1 3 5
4 2 2 10
5 3 2 13
--- test ---
Cat1 Cat2
0 1 2
1 2 1
2 3 1
3 1 3
4 2 3
5 3 1
--- means ---
surface
Cat1 Cat2
1 3 7.5
2 2 11.0
3 1 12.0
2 13.0
--- means (dataframe) ---
Cat1 Cat2 surface
0 1 3 7.5
1 2 2 11.0
2 3 1 12.0
3 3 2 13.0
--- result ----
Cat1 Cat2 surface
0 1 2 NaN
1 2 1 NaN
2 3 1 12.0
3 1 3 7.5
4 2 3 NaN
5 3 1 12.0
--- result (fillna)---
Cat1 Cat2 surface
0 1 2 0.0
1 2 1 0.0
2 3 1 12.0
3 1 3 7.5
4 2 3 0.0
5 3 1 12.0
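Note that the question actually asks for the absolute deviation of the group mean from the overall train mean, not the raw group mean; that can be layered on top of the merged result (a sketch continuing from the code above):
# error_cat1_cat2 = |group-mean surface - overall train mean of surface|
# (compute this before fillna(0), otherwise unseen (Cat1, Cat2) pairs get a misleading value)
result['error_cat1_cat2'] = (result['surface'] - train['surface'].mean()).abs()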
I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
    df['lastperiod'] == [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaNs signify missing data (i.e. there wasn't an entry in the previous period).
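As an aside (my addition, not part of the original answer), current pandas lets you get the same column a bit more directly with GroupBy.shift:
# shift within each person group; NaN where there is no previous period
df['lastPeriod'] = df.groupby('person')['value'].shift()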