How to groupby and map by two columns in a pandas dataframe - python

I have a problem in Python working with a pandas dataframe. I'm trying to build a machine learning model that predicts the surface. I have the surface column in the train dataframe but not in the test dataframe, so I would like to create some features based on the surface in train, like:
train['error_cat1'] = abs(train.groupby(train['cat1'])['surface'].transform('mean') - train.surface.mean())
Here I have set, for each group of the "cat1" feature, a value based on the mean of surface. Cool.
Now I must add it to the test too, so I use this method to map the values from each train group onto the test rows:
mp = {k: g['error_cat1'].tolist()[0] for k,g in train.groupby('cat1')}
test['error_cat1'] = test['cat1'].map(mp)
So far there is no problem. Now I would like to use two columns in the groupby:
train['error_cat1_cat2'] = abs(train.groupby(['cat1','cat2'])['surface'].transform('mean') - train.surface.mean())
but I don't know how to map it onto the test dataframe. Can you please help me handle this problem, or suggest another method I can use?
Thanks
For example, my train is:
+------+------+-------+
| Cat1 | Cat2 | surface |
+------+------+-------+
| 1 | 3 | 10 |
+------+------+-------+
| 2 | 2 | 12 |
+------+------+-------+
| 3 | 1 | 12 |
+------+------+-------+
| 1 | 3 | 5 |
+------+------+-------+
| 2 | 2 | 10 |
+------+------+-------+
| 3 | 2 | 13 |
+------+------+-------+
My test is:
+------+------+
| Cat1 | Cat2 |
+------+------+
| 1 | 2 |
+------+------+
| 2 | 1 |
+------+------+
| 3 | 1 |
+------+------+
| 1 | 3 |
+------+------+
| 2 | 3 |
+------+------+
| 3 | 1 |
+------+------+
Now I would like to compute the groupby mean of surface on Cat1 and Cat2. For example, the mean surface for (Cat1, Cat2) = (1, 3) is (10 + 5) / 2 = 7.5.
Then I must go to the test and map this value onto the rows where (Cat1, Cat2) = (1, 3).
I hope you get what I mean.

You can use
groupby().mean() to calculate the means,
reset_index() to convert the indexes Cat1, Cat2 into columns again,
merge(how='left') to join the two dataframes like tables in a database (a LEFT JOIN in SQL).
headers = ['Cat1', 'Cat2', 'surface']
train_data = [
[1, 3, 10],
[2, 2, 12],
[3, 1, 12],
[1, 3, 5],
[2, 2, 10],
[3, 2, 13],
]
test_data = [
[1, 2],
[2, 1],
[3, 1],
[1, 3],
[2, 3],
[3, 1],
]
import pandas as pd
train = pd.DataFrame(train_data, columns=headers)
test = pd.DataFrame(test_data, columns=headers[:-1])
print('--- train ---')
print(train)
print('--- test ---')
print(test)
print('--- means ---')
means = train.groupby(['Cat1', 'Cat2']).mean()
print(means)
print('--- means (dataframe) ---')
means = means.reset_index(level=['Cat1', 'Cat2'])
print(means)
print('--- result ----')
result = pd.merge(test, means, on=['Cat1', 'Cat2'], how='left')
print(result)
print('--- result (fillna)---')
result = result.fillna(0)
print(result)
Result:
--- train ---
Cat1 Cat2 surface
0 1 3 10
1 2 2 12
2 3 1 12
3 1 3 5
4 2 2 10
5 3 2 13
--- test ---
Cat1 Cat2
0 1 2
1 2 1
2 3 1
3 1 3
4 2 3
5 3 1
--- means ---
surface
Cat1 Cat2
1 3 7.5
2 2 11.0
3 1 12.0
2 13.0
--- means (dataframe) ---
Cat1 Cat2 surface
0 1 3 7.5
1 2 2 11.0
2 3 1 12.0
3 3 2 13.0
--- result ----
Cat1 Cat2 surface
0 1 2 NaN
1 2 1 NaN
2 3 1 12.0
3 1 3 7.5
4 2 3 NaN
5 3 1 12.0
--- result (fillna)---
Cat1 Cat2 surface
0 1 2 0.0
1 2 1 0.0
2 3 1 12.0
3 1 3 7.5
4 2 3 0.0
5 3 1 12.0
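Since the question asked specifically about map, here is a minimal map-style sketch of the same idea as the merge above (assuming pandas >= 0.24 for MultiIndex.from_frame; the surface_mean column name is only illustrative):
# Build a lookup Series indexed by (Cat1, Cat2); unseen pairs become NaN and are filled with 0, as above
means = train.groupby(['Cat1', 'Cat2'])['surface'].mean()
keys = pd.MultiIndex.from_frame(test[['Cat1', 'Cat2']])
test['surface_mean'] = means.reindex(keys).fillna(0).to_numpy()
print(test)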

Related

Pandas, remove rows based on equivalence on different columns between them [duplicate]

I am looking for an efficient and elegant way in Pandas to remove "duplicate" rows in a DataFrame that have exactly the same value set but in different columns.
I am ideally looking for a vectorized way to do this, as I can already identify very inefficient ways using the pandas.DataFrame.iterrows() method.
Say my DataFrame is:
| source | target |
|--------|--------|
|   1    |   2    |
|   2    |   1    |
|   4    |   3    |
|   2    |   7    |
|   3    |   4    |
I want it to become:
| source | target |
|--------|--------|
|   1    |   2    |
|   4    |   3    |
|   2    |   7    |
df = df[~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()]
source target
0 1 2
2 4 3
3 2 7
Explanation:
np.sort(df.values, axis=1) sorts the values within each row (across the columns):
array([[1, 2],
[1, 2],
[3, 4],
[2, 7],
[3, 4]], dtype=int64)
Then we build a dataframe from it and mark the non-duplicated rows by negating duplicated() with ~:
~pd.DataFrame(np.sort(df.values,axis=1)).duplicated()
0 True
1 False
2 True
3 True
4 False
dtype: bool
Using this as a mask gives the final output:
source target
0 1 2
2 4 3
3 2 7
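For reference, a self-contained version of the snippet above (a sketch; the answer assumes the imports and df already exist):
import numpy as np
import pandas as pd

df = pd.DataFrame({'source': [1, 2, 4, 2, 3], 'target': [2, 1, 3, 7, 4]})

# Sort each row's values so that (1, 2) and (2, 1) look identical, then drop the later duplicates
mask = ~pd.DataFrame(np.sort(df.values, axis=1), index=df.index).duplicated()
print(df[mask])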

Compare and match values from two df and multiple columns

I've got two dataframes with data about popular stores and districts where they are located. Each store is kind of a chain and may have more than one district location id (for example "Store1" has several stores in different places).
First df has info about top-5 most popular stores and district ids separated by semicolon, for example:
store_name district_id
Store1 | 1;2;3;4;5
Store2 | 1;2
Store3 | 3
Store4 | 4;7;10;15
Store5 | 12;15;
The second df has only two columns with ALL districts in the city; each row is a unique district id and its name.
district_id district_name
1 | District1
2 | District2
3 | District3
4 | District4
5 | District5
6 | District6
7 | District7
8 | District8
9 | District9
10 | District10
etc.
The goal is to create columns in df1 for every store in top-5 and match every district id number to district name.
So, first I split df1 into a form like this:
store_name |  0 |  1 |  2 |  3 |  4
Store1     |  1 |  2 |  3 |  4 |  5
Store2     |  1 |  2 |    |    |
Store3     |  3 |    |    |    |
Store4     |  4 |  7 | 10 | 15 |
Store5     | 12 | 15 |    |    |
But now I'm stuck and don't know how to match each value from df1 to df2 to get the district name for each id. The empty cells are None, because the columns were created based on the maximum number of values for any store.
I would like to get df like this:
store_name district_name district_name2 district_name3 district_name4 district_name5
Store1 | District1 | District2 | District3 | District4 | District5
Store2 | District1 | District2 | | |
Store3 | District3 | | | |
Store4 | District4 | District7 | District10 | District15 |
Store5 | District12 | District15 | | |
Thanks in advance!
You can stack the first dataframe, convert it to float type, map it with the district_name column from the second dataframe, then unstack and finally add_prefix:
df1.stack().astype(float).map(df2['district_name']).unstack().add_prefix('district_name')
OUTPUT:
district_name0 district_name1 ... district_name3 district_name4
store_name ...
Store1 District1 District2 ... District4 District5
Store2 District1 District2 ... NaN NaN
Store3 District3 NaN ... NaN NaN
Store4 District4 District7 ... NaN NaN
Store5 NaN NaN ... NaN NaN
The dataframes used for above code:
>>> df1
0 1 2 3 4
store_name
Store1 1 2 3 4 5
Store2 1 2 NaN NaN NaN
Store3 3 NaN NaN NaN NaN
Store4 4 7 10 15 NaN
Store5 12 15 NaN NaN NaN
>>> df2
district_name
district_id
1 District1
2 District2
3 District3
4 District4
5 District5
6 District6
7 District7
8 District8
9 District9
10 District10
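For anyone who wants to reproduce this, one way (an assumption, not shown in the answer) to build the df1 and df2 above:
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[1, 2, 3, 4, 5],
                    [1, 2, np.nan, np.nan, np.nan],
                    [3, np.nan, np.nan, np.nan, np.nan],
                    [4, 7, 10, 15, np.nan],
                    [12, 15, np.nan, np.nan, np.nan]],
                   index=pd.Index(['Store1', 'Store2', 'Store3', 'Store4', 'Store5'], name='store_name'))

df2 = pd.DataFrame({'district_name': [f'District{i}' for i in range(1, 11)]},
                   index=pd.Index(range(1, 11), name='district_id'))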
There are many possible ways to do this; this is just one. Assume you have your two dataframes stored as df1 and df2.
First, normalize your district_id column in df1 so that they are all the same length:
# make all strings the same size when split
def return_full_string(text):
    l = len(text.split(';'))
    for _ in range(5 - l):
        text = f"{text};"
    return text

df1['district_id'] = df1.district_id.apply(return_full_string)
Then split the text column into separate columns and delete the original:
# split district id's into different columns
district_columns = [f"district_name{n+1}" for n in range(5)]
df1[district_columns] = list(df1.district_id.str.split(';'))
df1.drop('district_id', axis=1, inplace=True)
Then acquire a map of the ids in df2 to their respective names, and use that to replace the values in your new columns:
id_to_name = {str(ii): nn for ii, nn in zip(df2['district_id'], df2['district_name'])}
for col in district_columns:
    df1[col] = df1[col].apply(id_to_name.get)
Like I said, I'm sure there are other ways to do this, but this should work
df1 = pd.DataFrame(data={'store_name': ['store1', 'store2', 'store3', 'store4', 'store5'],
                         'district_id': [[1, 2, 3, 4, 5], [1, 2], 3, [4, 7, 10], [8, 10]]})
df2 = pd.DataFrame(data={'district_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                         'district_name': ['District1', 'District2', 'District3', 'District4', 'District5',
                                           'District6', 'District7', 'District8', 'District9', 'District10']})
Step 1: use explode() to split the list values into rows
df3 = df1.explode('district_id').reset_index(drop=True)
Step 2: use merge() with on='district_id'
df4 = pd.merge(df3, df2, on='district_id')
Step 3: use groupby() and agg() to collapse back to one row per store, with list columns
df5 = df4.groupby('store_name').agg(list).reset_index()
store_name district_id district_name
0 store1 [1, 2, 3, 4, 5] [District1,District2,District3,District4,District5]
1 store2 [1, 2] [District1,District2]
2 store3 [3] [District3]
3 store4 [4, 7, 10] [District4,District7,District10]
4 store5 [10, 8] [District10,District8]
Then it can be split however required.
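If separate district_name columns are needed, as in the desired output, the list column in df5 can then be expanded, for example (a sketch):
names = pd.DataFrame(df5['district_name'].tolist()).add_prefix('district_name')
result = pd.concat([df5[['store_name']], names], axis=1)
print(result)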
I'd suggest something like the below and then pivoting etc. as required, since having a column with strings like 1;2;3;4;5 in it is going to be awkward (I feel).
import pandas as pd
df1 = pd.DataFrame({'store_name': {0: 'Store1',
1: 'Store2',
2: 'Store3',
3: 'Store4',
4: 'Store5'},
'district_id': {0: '1;2;3;4;5',
1: '1;2',
2: '3',
3: '4;7;10;15',
4: '12;15;'}})
df3 = pd.DataFrame({'district_id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10},
'district_name': {0: 'District1',
1: ' District2',
2: ' District3',
3: ' District4',
4: ' District5',
5: ' District6',
6: ' District7',
7: ' District8',
8: ' District9',
9: ' District10'}})
# 'explode' the 'district_id' column with strings like '1;2;3;4;5' in df1
df2 = pd.DataFrame(df1.district_id.str.split(';').tolist(), index=df1.store_name).stack()
df2 = df2.reset_index()[[0, 'store_name']]
df2.columns = ['district_id', 'store_name']
df2 = df2[~df2['district_id'].eq('')]
df2['district_id'] = df2['district_id'].astype(int)
'''df2 Shows:
district_id store_name
0 1 Store1
1 2 Store1
2 3 Store1
3 4 Store1
4 5 Store1
etc.
'''
df4 = pd.merge(df2, df3, on='district_id', how='left')
print(df4)
district_id store_name district_name
0 1 Store1 District1
1 2 Store1 District2
2 3 Store1 District3
3 4 Store1 District4
4 5 Store1 District5
5 1 Store2 District1
6 2 Store2 District2
7 3 Store3 District3
8 4 Store4 District4
9 7 Store4 District7
10 10 Store4 District10
11 15 Store4 NaN
12 12 Store5 NaN
13 15 Store5 NaN
# From here you can pivot df4 etc. and carry on as required.
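One possible way to do that pivot (a sketch, not part of the answer itself):
# number each store's districts, then pivot to one row per store
df4['n'] = df4.groupby('store_name').cumcount() + 1
wide = df4.pivot(index='store_name', columns='n', values='district_name').add_prefix('district_name')
print(wide)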

how to fill date column in one dataframe with nearest dates from another dataframe

I have a dataframe visit =
visit_occurrence_id visit_start_date person_id
1 2016-06-01 1
2 2019-05-01 2
3 2016-01-22 1
4 2017-02-14 2
5 2018-05-11 3
and another dataframe measurement =
measurement_date person_id visit_occurrence_id
2017-09-04 1 NaN
2018-04-24 2 NaN
2018-05-22 2 NaN
2019-02-02 1 NaN
2019-01-28 3 NaN
2019-05-07 1 NaN
2018-12-11 3 NaN
2017-04-28 3 NaN
I want to fill visit_occurrence_id in the measurement table with the visit_occurrence_id from the visit table, on the basis of person_id and the nearest date possible.
I have written code for this, but it is taking a lot of time.
measurement has 7*10^5 rows.
Note: visit_start_date and measurement_date are object types
My code:
import datetime as dt
unique_person_list = measurement['person_id'].unique().tolist()
def nearest_date(row, date_list):
    date_list = [dt.datetime.strptime(date, '%Y-%m-%d').date() for date in date_list]
    row = min(date_list, key=lambda x: abs(x - row))
    return row

modified_measurement = pd.DataFrame(columns=measurement.columns)

for person in unique_person_list:
    near_visit_dates = visit[visit['person_id'] == person]['visit_start_date'].tolist()
    if near_visit_dates:
        near_visit_dates = list(filter(None, near_visit_dates))
        near_visit_dates = [i.strftime('%Y-%m-%d') for i in near_visit_dates]
        store_dates = measurement.loc[measurement['person_id'] == person]['measurement_date']
        store_dates = store_dates.apply(nearest_date, args=(near_visit_dates,))
        modified_measurement = modified_measurement.append(store_dates)
My code's execution time is quite high. Can you help me in either reducing the time complexity or with another solution.
edit - adding dataframe constructors.
import numpy as np
measurement = {'measurement_date':['2017-09-04', '2018-04-24', '2018-05-22', '2019-02-02',
'2019-01-28', '2019-05-07', '2018-12-11','2017-04-28'],
'person_id':[1, 2, 2, 1, 3, 1, 3, 3],'visit_occurrence_id':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
visit = {'visit_occurrence_id':[1, 2, 3, 4, 5],
'visit_start_date':['2016-06-01', '2019-05-01', '2016-01-22', '2017-02-14', '2018-05-11'],
'person_id':[1, 2, 1, 2, 3]}
# Create DataFrame
measurement = pd.DataFrame(measurement)
visit = pd.DataFrame(visit)
You can do the following:
import datetime

df = pd.merge(measurement[["person_id", "measurement_date"]], visit, on="person_id", how="inner")
df["dt_diff"] = df[["visit_start_date", "measurement_date"]].apply(lambda x: abs(datetime.datetime.strptime(x["visit_start_date"], '%Y-%m-%d').date() - datetime.datetime.strptime(x["measurement_date"], '%Y-%m-%d').date()), axis=1)
# keep only the row with the smallest date difference per (person_id, measurement_date)
df = pd.merge(df, df.groupby(["person_id", "measurement_date"])["dt_diff"].min(), on=["person_id", "dt_diff", "measurement_date"], how="inner")
res = pd.merge(measurement, df, on=["measurement_date", "person_id"], suffixes=["", "_2"])[["measurement_date", "person_id", "visit_occurrence_id_2"]]
Output:
measurement_date person_id visit_occurrence_id_2
0 2017-09-04 1 1
1 2018-04-24 2 2
2 2018-05-22 2 2
3 2019-02-02 1 1
4 2019-01-28 3 5
5 2019-05-07 1 1
6 2018-12-11 3 5
7 2017-04-28 3 5
Here's what I've come up with:
# Pair every measurement with all visits of the same person
# (measurement_date and visit_start_date must be datetime dtype; convert with pd.to_datetime first)
df = measurement.drop('visit_occurrence_id', axis=1).merge(visit, on='person_id')
df['date_difference'] = abs(df.measurement_date - df.visit_start_date)
# Keep the smallest date difference for each person_id - measurement_date pair
df['smallest_difference'] = df.groupby(['person_id', 'measurement_date'])['date_difference'].transform(min)
df = df[df.date_difference == df.smallest_difference]
df = df[['measurement_date', 'person_id', 'visit_occurrence_id']]
# Fill in visit_occurrence_id from original dataframe
measurement.drop("visit_occurrence_id", axis=1).merge(
df, on=["measurement_date", "person_id"]
)
This produces:
| | measurement_date | person_id | visit_occurrence_id |
|---:|:-------------------|------------:|----------------------:|
| 0 | 2017-09-04 | 1 | 1 |
| 1 | 2018-04-24 | 2 | 2 |
| 2 | 2018-05-22 | 2 | 2 |
| 3 | 2019-02-02 | 1 | 1 |
| 4 | 2019-01-28 | 3 | 5 |
| 5 | 2019-05-07 | 1 | 1 |
| 6 | 2018-12-11 | 3 | 5 |
| 7 | 2017-04-28 | 3 | 5 |
I believe there's probably a cleaner way of writing this using sklearn: https://scikit-learn.org/stable/modules/neighbors.html
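As a further option (a sketch, not from either answer): with ~700k rows it may be worth trying pd.merge_asof with direction='nearest', which matches each measurement to the nearest visit per person once the date columns are converted to datetime:
visit['visit_start_date'] = pd.to_datetime(visit['visit_start_date'])
measurement['measurement_date'] = pd.to_datetime(measurement['measurement_date'])

# both frames must be sorted on their date keys for merge_asof
res = pd.merge_asof(measurement.drop(columns='visit_occurrence_id').sort_values('measurement_date'),
                    visit.sort_values('visit_start_date'),
                    left_on='measurement_date', right_on='visit_start_date',
                    by='person_id', direction='nearest')
print(res)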

DataFrame transpose from List

As with most pandas problems, I am guessing this has been dealt with before, but I can't find a direct answer and I'm also worried about performance. My dataset is large, so I'm hoping to find the most efficient way of doing this.
The Problem
I have 2 dataframes - dfA contains a list of id's from dfB. I'd like to
transpose those IDs as columns
replace the IDs with a value looked up from dfB
collapse repeated columns and aggregate with sum
Here's an illustration:
dfA
dfA = pd.DataFrame({'a_id':['0000001','0000002','0000003','0000004'],
'list_of_b_id':[['2','3','7'],[],['1','2','3','4'],['6','7']]
})
+------+--------------+
| a_id | list_of_b_id |
+------+--------------+
| 1 | [2, 3, 7] |
+------+--------------+
| 2 | [] |
+------+--------------+
| 3 | [1, 2, 3, 4] |
+------+--------------+
| 4 | [6, 7] |
+------+--------------+
dfB
dfB = pd.DataFrame({'b_id':['1','2','3','4','5','6','7'],
'replacement': ['Red','Red','Blue','Red','Green','Blue','Red']
})
+------+-------------+
| b_id | replacement |
+------+-------------+
| 1 | Red |
+------+-------------+
| 2 | Red |
+------+-------------+
| 3 | Blue |
+------+-------------+
| 4 | Red |
+------+-------------+
| 5 | Green |
+------+-------------+
| 6 | Blue |
+------+-------------+
| 7 | Red |
+------+-------------+
Goal (Final Result)
Here is what I'm hoping to eventually get to, in the most efficient way possible.
In reality, I may have over 5M obs in both dfA and dfB, and ~50 unique values for replacement in dfB, which explains why I need to do this in dynamic fashion and not just hard-code it.
+------+-----+------+
| a_id | Red | Blue |
+------+-----+------+
| 1 | 2 | 1 |
+------+-----+------+
| 2 | 0 | 0 |
+------+-----+------+
| 3 | 3 | 1 |
+------+-----+------+
| 4 | 1 | 1 |
+------+-----+------+
First, all the lists are flattened with numpy.repeat and numpy.concatenate:
df = pd.DataFrame({'id':np.repeat(dfA['a_id'], dfA['list_of_b_id'].str.len()),
'b': np.concatenate(dfA['list_of_b_id'])})
print (df)
b id
0 2 0000001
0 3 0000001
0 7 0000001
2 1 0000003
2 2 0000003
2 3 0000003
2 4 0000003
3 6 0000004
3 7 0000004
Then map column b with a Series created from dfB, group by it to get the counts, reshape with unstack, and add the missing a_id values with reindex:
df = (df.groupby(['id',df['b'].map(dfB.set_index('b_id')['replacement'])])
.size()
.unstack(fill_value=0)
.reindex(dfA['a_id'].unique(), fill_value=0))
print (df)
b Blue Red
id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
print (df['b'].map(dfB.set_index('b_id')['replacement']))
0 Red
0 Blue
0 Red
2 Red
2 Red
2 Blue
2 Red
3 Blue
3 Red
Name: b, dtype: object
a = [['2', '3', '7'], [], ['1', '2', '3', '4'], ['6', '7']]
b = ['Red', 'Red', 'Blue', 'Red', 'Green', 'Blue', 'Red']
res = []
for line in a:
    tmp = {}
    for ele in line:
        tmp[b[int(ele) - 1]] = tmp.get(b[int(ele) - 1], 0) + 1
    res.append(tmp)
print(pd.DataFrame(res).fillna(0))
Blue Red
0 1.0 2.0
1 0.0 0.0
2 1.0 3.0
3 1.0 1.0
Use
In [5611]: dft = (dfA.set_index('a_id')['list_of_b_id']
.apply(pd.Series)
.stack()
.replace(dfB.set_index('b_id')['replacement'])
.reset_index())
In [5612]: (dft.groupby(['a_id', 0]).size().unstack()
.reindex(dfA['a_id'].unique(), fill_value=0))
Out[5612]:
0 Blue Red
a_id
0000001 1 2
0000002 0 0
0000003 1 3
0000004 1 1
Details
In [5613]: dft
Out[5613]:
a_id level_1 0
0 0000001 0 Red
1 0000001 1 Blue
2 0000001 2 Red
3 0000003 0 Red
4 0000003 1 Red
5 0000003 2 Blue
6 0000003 3 Red
7 0000004 0 Blue
8 0000004 1 Red
You can try the code below:
pd.concat([dfA, dfA.list_of_b_id.apply(lambda x: dfB[dfB.b_id.isin(x)].replacement.value_counts())], axis=1)
d = dfB.set_index('b_id').T.to_dict('records')[0]
dfA['list_of_b_id']=dfA['list_of_b_id'].apply(lambda x : [d.get(k,k) for k in x])
pd.concat([dfA,pd.get_dummies(dfA['list_of_b_id'].apply(pd.Series).stack()).sum(level=0)],axis=1)
Out[66]:
a_id list_of_b_id Blue Red
0 0000001 [Red, Blue, Red] 1.0 2.0
1 0000002 [] NaN NaN
2 0000003 [Red, Red, Blue, Red] 1.0 3.0
3 0000004 [Blue, Red] 1.0 1.0
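On newer pandas (>= 0.25, where Series.explode exists), a compact alternative sketch along the same lines as the answers above:
# translate each id to its colour, then count colours per a_id
s = dfA.set_index('a_id')['list_of_b_id'].explode().map(dfB.set_index('b_id')['replacement'])
out = (s.groupby(level=0).value_counts()
        .unstack(fill_value=0)
        .reindex(dfA['a_id'], fill_value=0))
print(out)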

Pandas data frame: adding columns based on previous time periods

I am trying to work through a problem in pandas, being more accustomed to R.
I have a data frame df with three columns: person, period, value
df.head() or the top few rows look like:
| person | period | value
0 | P22 | 1 | 0
1 | P23 | 1 | 0
2 | P24 | 1 | 1
3 | P25 | 1 | 0
4 | P26 | 1 | 1
5 | P22 | 2 | 1
Notice the last row records a value for period 2 for person P22.
I would now like to add a new column that provides the value from the previous period. So if for P22 the value in period 1 is 0, then this new column would look like:
| person | period | value | lastperiod
5 | P22 | 2 | 1 | 0
I believe I need to do something like the following command, having loaded pandas:
for p in df.period.unique():
df['lastperiod']== [???]
How should this be formulated?
You could groupby person and then apply a shift to the values:
In [11]: g = df.groupby('person')
In [12]: g['value'].apply(lambda s: s.shift())
Out[12]:
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 0
dtype: float64
Adding this as a column:
In [13]: df['lastPeriod'] = g['value'].apply(lambda s: s.shift())
In [14]: df
Out[14]:
person period value lastPeriod
1 P22 1 0 NaN
2 P23 1 0 NaN
3 P24 1 1 NaN
4 P25 1 0 NaN
5 P26 1 1 NaN
6 P22 2 1 0
Here the NaN signify missing data (i.e. there wasn't an entry in the previous period).
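Note that apply(lambda s: s.shift()) can also be written with the groupby's own shift, which is usually faster; a sketch, assuming the frame is sorted by person and period so that the previous row really is the previous period:
df = df.sort_values(['person', 'period'])
df['lastPeriod'] = df.groupby('person')['value'].shift()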
