I've tried merging two dataframes, but I can't seem to get it to work. Each time I merge, the rows where I expect values are all 0. Dataframe df1 already has some data in it, with some rows left blank. Dataframe df2 should populate those blank rows in df1 where the column names match, aligned on each value of "TempBin" and "Month" in df1.
EDIT:
Both dataframes are in a for loop. df1 acts as my "storage", and df2 changes for each location iteration. So if df2 contained the results for LocationZP, I would also want that data inserted into the matching df1 rows. If I use df1 = df1.append(df2) in the for loop, all of the rows from df2 keep getting appended to the very end of df1 on each iteration.
df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 0 0 0
13 1 0 0 0
13 2 0 0 0
13 3 0 0 0
13 4 0 0 0
13 5 0 0 0
df2:
Month TempBin LocationAA
13 0 11
13 1 22
13 2 33
13 3 44
13 4 55
13 5 66
desired output in df1:
Month TempBin LocationAA LocationXA LocationZP
1 0 7 1 2
1 1 98 0 89
1 2 12 23 38
1 3 3 14 17
1 4 7 9 14
1 5 1 8 99
13 0 11 0 0
13 1 22 0 0
13 2 33 0 0
13 3 44 0 0
13 4 55 0 0
13 5 66 0 0
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]}
                   )
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]}
                   )
df1 = pd.merge(df1, df2, on=["Month","TempBin","LocationAA"], how="left")
result:
Month TempBin LocationAA LocationXA LocationZP
1 0 7.0 1.0 2.0
1 1 98.0 0.0 89.0
1 2 12.0 23.0 38.0
1 3 3.0 14.0 17.0
1 4 7.0 9.0 14.0
1 5 1.0 8.0 99.0
13 0 NaN NaN NaN
13 1 NaN NaN NaN
13 2 NaN NaN NaN
13 3 NaN NaN NaN
13 4 NaN NaN NaN
13 5 NaN NaN NaN
Here's some code that worked for me:
# Merge the two dataframes on the columns "TempBin" and "Month", filling NaN values with 0.
import pandas as pd
df1 = pd.DataFrame({'Month': [1]*6 + [13]*6,
                    'TempBin': [0,1,2,3,4,5]*2,
                    'LocationAA': [7,98,12,3,7,1,0,0,0,0,0,0],
                    'LocationXA': [1,0,23,14,9,8,0,0,0,0,0,0],
                    'LocationZP': [2,89,38,17,14,99,0,0,0,0,0,0]}
                   )
df2 = pd.DataFrame({'Month': [13]*6,
                    'TempBin': [0,1,2,3,4,5],
                    'LocationAA': [11,22,33,44,55,66]})
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['TempBin', 'Month'],
                    right_on=['TempBin', 'Month'])
df_merge.fillna(0, inplace=True)
# add column LocationAA, filled with the non-null value from LocationAA_x and LocationAA_y
df_merge['LocationAA'] = df_merge.apply(lambda x: x['LocationAA_x'] if pd.isnull(x['LocationAA_y']) else x['LocationAA_y'], axis=1)
# remove columns LocationAA_x and LocationAA_y
df_merge.drop(['LocationAA_x', 'LocationAA_y'], axis=1, inplace=True)
print(df_merge)
Output:
Month TempBin LocationXA LocationZP LocationAA
0 1 0 1.0 2.0 0.0
1 1 1 0.0 89.0 0.0
2 1 2 23.0 38.0 0.0
3 1 3 14.0 17.0 0.0
4 1 4 9.0 14.0 0.0
5 1 5 8.0 99.0 0.0
6 13 0 0.0 0.0 11.0
7 13 1 0.0 0.0 22.0
8 13 2 0.0 0.0 33.0
9 13 3 0.0 0.0 44.0
10 13 4 0.0 0.0 55.0
11 13 5 0.0 0.0 66.0
Let me know if there's something you don't understand in the comments :)
PS: Sorry for the extra comments. But I left them there for some more explanations.
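A side note: the row-wise apply above can be replaced with a vectorized fillna between the two suffixed columns. A minimal sketch, assuming the same merge as above, and doing the combination before the global fillna(0) so the original month-1 values are not overwritten by zeros:
df_merge = pd.merge(df1, df2, how='left',
                    left_on=['TempBin', 'Month'],
                    right_on=['TempBin', 'Month'])
# prefer the value coming from df2 (suffix _y), fall back to df1's value (suffix _x)
df_merge['LocationAA'] = df_merge['LocationAA_y'].fillna(df_merge['LocationAA_x'])
df_merge = df_merge.drop(columns=['LocationAA_x', 'LocationAA_y']).fillna(0)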
You need to use append to get the desired output:
df1 = df1.append(df2)
and if you want to replace the nulls with zeros, add:
df1 = df1.fillna(0)
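Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions the same idea can be written with pd.concat (a minimal sketch):
df1 = pd.concat([df1, df2], ignore_index=True)
df1 = df1.fillna(0)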
Here is another way using combine_first()
i = ['Month','TempBin']
df2.set_index(i).combine_first(df1.set_index(i)).reset_index()
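A minimal sketch of using this as the per-iteration update described in the EDIT, assigning the result back to the "storage" frame df1 (here df2 stands for the current location's results):
i = ['Month', 'TempBin']
df1 = df2.set_index(i).combine_first(df1.set_index(i)).reset_index()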
This is my table (built in the code below):
I want to transform it so that, for each row, the Quantity of every other (ShopID, ProductID) combination is added as a new column named Quantity_ShopID_ProductID.
Following is my code:
from datetime import date
import pandas as pd
df=pd.DataFrame({"Date":[date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2),date(2019,10,1),date(2019,10,2)],
"ShopID":[1,1,1,1,2,2,2,2],
"ProductID":[1,1,2,2,1,1,2,2],
"Quantity":[3,3,4,4,5,5,6,6]})
for sid in df.ShopID.unique():
    for pid in df.ProductID.unique():
        col_name='Quantity{}_{}'.format(sid,pid)
        print(col_name)
        df1=df[(df.ShopID==sid) & (df.ProductID==pid)][['Date','Quantity']]
        df1.rename(columns={'Quantity':col_name}, inplace=True)
        display(df1)
        df=df.merge(df1, how="left",on="Date")
        df.loc[(df.ShopID==sid) & (df.ProductID==pid),col_name]=None
print(df)
The problem is that it runs very slowly, as I have over 108 different (ShopID, ProductID) combinations over a 3-year period. Is there any way to make it more efficient?
Method 1: using pivot_table with join (vectorized solution)
We can pivot your quantity values per (ShopID, ProductID) to columns, and then join them back to your original dataframe. This should be much faster than your for loops, since it is a vectorized approach:
piv = df.pivot_table(index=['ShopID', 'ProductID'], columns=['ShopID', 'ProductID'], values='Quantity')
piv2 = piv.ffill().bfill()
piv3 = piv2.mask(piv2.eq(piv))
final = df.set_index(['ShopID', 'ProductID']).join(piv3).reset_index()
Output
ShopID ProductID dt Quantity (1, 1) (1, 2) (2, 1) (2, 2)
0 1 1 2019-10-01 3 NaN 4.0 5.0 6.0
1 1 1 2019-10-02 3 NaN 4.0 5.0 6.0
2 1 2 2019-10-01 4 3.0 NaN 5.0 6.0
3 1 2 2019-10-02 4 3.0 NaN 5.0 6.0
4 2 1 2019-10-01 5 3.0 4.0 NaN 6.0
5 2 1 2019-10-02 5 3.0 4.0 NaN 6.0
6 2 2 2019-10-01 6 3.0 4.0 5.0 NaN
7 2 2 2019-10-02 6 3.0 4.0 5.0 NaN
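If the Quantity_ShopID_ProductID naming from the question is preferred over the tuple column names shown above, the joined columns can be renamed afterwards. A small sketch, assuming final from the snippet above:
final.columns = ['Quantity_{}_{}'.format(*c) if isinstance(c, tuple) else c
                 for c in final.columns]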
Method 2: using GroupBy with mask and where
We can speed up your code by using GroupBy and mask + where instead of two for-loops:
groups = df.groupby(['ShopID', 'ProductID'])
for grp, data in groups:
    m = df['ShopID'].eq(grp[0]) & df['ProductID'].eq(grp[1])
    values = df['Quantity'].where(m).ffill().bfill()
    df[f'Quantity_{grp[0]}_{grp[1]}'] = values.mask(m)
Output
dt ShopID ProductID Quantity Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
0 2019-10-01 1 1 3 NaN 4.0 5.0 6.0
1 2019-10-02 1 1 3 NaN 4.0 5.0 6.0
2 2019-10-01 1 2 4 3.0 NaN 5.0 6.0
3 2019-10-02 1 2 4 3.0 NaN 5.0 6.0
4 2019-10-01 2 1 5 3.0 4.0 NaN 6.0
5 2019-10-02 2 1 5 3.0 4.0 NaN 6.0
6 2019-10-01 2 2 6 3.0 4.0 5.0 NaN
7 2019-10-02 2 2 6 3.0 4.0 5.0 NaN
This is a pivot and merge problem with a little extra:
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])
# define the new column names
df['new_col'] = 'Quantity_'+df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
# new data to merge:
pivot = df.pivot_table(index='Date',
                       columns='new_col',
                       values='Quantity')
# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)
# boolean mask: for each row, mark the pivot column that matches the row's own (ShopID, ProductID)
mask = new_df['new_col'].values[:,None] == pivot.columns.values
# blank out (set to NaN) each row's own combination:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
Output:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
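The helper new_col column is only needed to build the mask; if it is not wanted in the final output, it can be dropped afterwards (a small sketch):
new_df = new_df.drop(columns='new_col')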
Test data with similar size to your actual data:
import numpy as np

# 3 years of dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')
# 12 shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1,13), range(1,10)),
                                 names=('Date','ShopID', 'ProductID'))
# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity':np.random.randint(0,10, len(idx))},
                  index=idx).reset_index()
The above code took about 10 seconds on an i5 laptop :-)
I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace the values with a new Series created from shift + expanding + mean. The first NaN of a group is not replaced, because there are no previous values in that group to average:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
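A caveat worth noting: the output above shows 1/6/2019 parsed as 2019-01-06 (January 6) while 24/5/2019 was read day-first, because pd.to_datetime guesses the format of ambiguous dates. If the dates are meant to be day-first, passing dayfirst=True (or an explicit format) keeps the parsing consistent; a small sketch:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)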
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
I have two DataFrames
df1 has following form
ID col1 col2
0 1 2 10
1 3 1 21
and df2 looks like this
ID field1 field2
0 1 4 1
1 1 3 3
2 3 5 4
3 3 9 5
4 1 2 0
I want to concatenate both DataFrames but so that I have only one line per each ID, so it'd look like this:
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4 1 3 3 2 0
1 3 1 21 5 4 9 5
I have tried merging and pivoting the data with df.pivot(index=df1.index, columns='ID'),
but because the length is variable, I get a ValueError:
ValueError: all arrays must be same length
Without worrying about the formatting yet, we want to merge and then add a MultiIndex level that counts the occurrences within each 'ID':
df = df1.merge(df2)
cc = df.groupby('ID').cumcount()
df.set_index(['ID', 'col1', 'col2', cc]).unstack()
field1 field2
0 1 2 0 1 2
ID col1 col2
1 2 10 4.0 3.0 2.0 1.0 3.0 0.0
3 1 21 5.0 9.0 NaN 4.0 5.0 NaN
We can nail down the formatting with:
df = df1.merge(df2)
cc = df.groupby('ID').cumcount() + 1
d1 = df.set_index(['ID', 'col1', 'col2', cc]).unstack().sort_index(axis=1, level=1)
d1.columns = d1.columns.to_series().map('{0[0]}_{0[1]}'.format)
d1.reset_index()
ID col1 col2 field1_1 field2_1 field1_2 field2_2 field1_3 field2_3
0 1 2 10 4.0 1.0 3.0 3.0 2.0 0.0
1 3 1 21 5.0 4.0 9.0 5.0 NaN NaN
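If the trailing NaN for ID 3 should appear blank, as in the desired output, one option (purely for display) is to fill them after resetting the index; a small sketch:
print(d1.reset_index().fillna(''))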
I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns used as indices in the Series above (and more), and I want to create a new column in df that contains the Series value matching the row's ID_1, ID_2, and Week - 2.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This, however, is very slow and memory-hungry, and I was wondering what better ways there are to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
To make the merged median column 0 when there is no match, use fillna to change the NaNs to 0 (the full example below renames Target to Median first):
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20,5)),
                  columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0