Duplicate Quantity as new columns - python

This is my table.
I want it to look like the following, i.e., for each (ShopID, ProductID) pair, duplicate the Quantity of every other (ShopID, ProductID) pair as new columns named Quantity_ShopID_ProductID.
Following is my code:
from datetime import date
import pandas as pd

df = pd.DataFrame({"Date": [date(2019,10,1), date(2019,10,2), date(2019,10,1), date(2019,10,2),
                            date(2019,10,1), date(2019,10,2), date(2019,10,1), date(2019,10,2)],
                   "ShopID": [1,1,1,1,2,2,2,2],
                   "ProductID": [1,1,2,2,1,1,2,2],
                   "Quantity": [3,3,4,4,5,5,6,6]})

for sid in df.ShopID.unique():
    for pid in df.ProductID.unique():
        col_name = 'Quantity{}_{}'.format(sid, pid)
        print(col_name)
        df1 = df[(df.ShopID == sid) & (df.ProductID == pid)][['Date', 'Quantity']]
        df1.rename(columns={'Quantity': col_name}, inplace=True)
        display(df1)
        df = df.merge(df1, how="left", on="Date")
        df.loc[(df.ShopID == sid) & (df.ProductID == pid), col_name] = None
print(df)
The problem is that it runs very slowly, as I have over 108 different (ShopID, ProductID) combinations over a 3-year period. Is there any way to make it more efficient?

Method 1: using pivot_table with join (vectorized solution)
We can pivot your Quantity values per (ShopID, ProductID) to columns, and then join them back to your original dataframe. This should be much faster than your for loops since it is a vectorized approach:
# pivot Quantity into one column per (ShopID, ProductID) combination
piv = df.pivot_table(index=['ShopID', 'ProductID'], columns=['ShopID', 'ProductID'], values='Quantity')
# fill every row of each column with that combination's quantity
piv2 = piv.ffill().bfill()
# blank out each combination's own column (the diagonal)
piv3 = piv2.mask(piv2.eq(piv))
# join the new columns back onto the original rows
final = df.set_index(['ShopID', 'ProductID']).join(piv3).reset_index()
Output
ShopID ProductID Date Quantity (1, 1) (1, 2) (2, 1) (2, 2)
0 1 1 2019-10-01 3 NaN 4.0 5.0 6.0
1 1 1 2019-10-02 3 NaN 4.0 5.0 6.0
2 1 2 2019-10-01 4 3.0 NaN 5.0 6.0
3 1 2 2019-10-02 4 3.0 NaN 5.0 6.0
4 2 1 2019-10-01 5 3.0 4.0 NaN 6.0
5 2 1 2019-10-02 5 3.0 4.0 NaN 6.0
6 2 2 2019-10-01 6 3.0 4.0 5.0 NaN
7 2 2 2019-10-02 6 3.0 4.0 5.0 NaN
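If you prefer the flat Quantity_1_1 style names that Method 2 produces, you could flatten Method 1's tuple columns afterwards. A minimal sketch, assuming the joined columns come through as plain (ShopID, ProductID) tuples as in the output above:
final.columns = ['Quantity_{}_{}'.format(*c) if isinstance(c, tuple) else c
                 for c in final.columns]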
Method 2: using GroupBy, mask and where
We can speed up your code by using GroupBy and mask + where instead of two for-loops:
groups = df.groupby(['ShopID', 'ProductID'])

for grp, data in groups:
    # rows belonging to this (ShopID, ProductID) combination
    m = df['ShopID'].eq(grp[0]) & df['ProductID'].eq(grp[1])
    # propagate this combination's quantity to every row, then blank out its own rows
    values = df['Quantity'].where(m).ffill().bfill()
    df[f'Quantity_{grp[0]}_{grp[1]}'] = values.mask(m)
Output
Date ShopID ProductID Quantity Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
0 2019-10-01 1 1 3 NaN 4.0 5.0 6.0
1 2019-10-02 1 1 3 NaN 4.0 5.0 6.0
2 2019-10-01 1 2 4 3.0 NaN 5.0 6.0
3 2019-10-02 1 2 4 3.0 NaN 5.0 6.0
4 2019-10-01 2 1 5 3.0 4.0 NaN 6.0
5 2019-10-02 2 1 5 3.0 4.0 NaN 6.0
6 2019-10-01 2 2 6 3.0 4.0 5.0 NaN
7 2019-10-02 2 2 6 3.0 4.0 5.0 NaN

This is a pivot and merge problem with a little extra:
# somehow merge only works with pandas datetime
df['Date'] = pd.to_datetime(df['Date'])

# define the new column names
df['new_col'] = 'Quantity_' + df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)

# new data to merge:
pivot = df.pivot_table(index='Date',
                       columns='new_col',
                       values='Quantity')

# merge
new_df = df.merge(pivot, left_on='Date', right_index=True)

# mask marking, for each row, the column that belongs to that row itself
mask = new_df['new_col'].values[:, None] == pivot.columns.values

# add the NaNs to each row's own column:
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
Output:
Date ShopID ProductID Quantity new_col Quantity_1_1 Quantity_1_2 Quantity_2_1 Quantity_2_2
-- ------------------- -------- ----------- ---------- ------------ -------------- -------------- -------------- --------------
0 2019-10-01 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
1 2019-10-02 00:00:00 1 1 3 Quantity_1_1 nan 4 5 6
2 2019-10-01 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
3 2019-10-02 00:00:00 1 2 4 Quantity_1_2 3 nan 5 6
4 2019-10-01 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
5 2019-10-02 00:00:00 2 1 5 Quantity_2_1 3 4 nan 6
6 2019-10-01 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
7 2019-10-02 00:00:00 2 2 6 Quantity_2_2 3 4 5 nan
Test data with similar size to your actual data:
import numpy as np

# 3 years of dates
dates = pd.date_range('2015-01-01', '2018-12-31', freq='D')

# 12 shops and 9 products
idx = pd.MultiIndex.from_product((dates, range(1, 13), range(1, 10)),
                                 names=('Date', 'ShopID', 'ProductID'))

# the test data
np.random.seed(1)
df = pd.DataFrame({'Quantity': np.random.randint(0, 10, len(idx))},
                  index=idx).reset_index()
The above code took about 10 seconds on an i5 laptop :-)
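If you want to reproduce that timing on your own machine, here is a rough sketch that simply re-runs the pivot/merge steps from this answer on the synthetic frame above, wrapped in a perf_counter timer (no new logic, just measurement):
import time

start = time.perf_counter()
df['Date'] = pd.to_datetime(df['Date'])
df['new_col'] = 'Quantity_' + df['ShopID'].astype(str) + '_' + df['ProductID'].astype(str)
pivot = df.pivot_table(index='Date', columns='new_col', values='Quantity')
new_df = df.merge(pivot, left_on='Date', right_index=True)
mask = new_df['new_col'].values[:, None] == pivot.columns.values
new_df[pivot.columns] = new_df[pivot.columns].mask(mask)
print(f'elapsed: {time.perf_counter() - start:.1f}s')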

Related

Replace columns with the same value between two dataframes

I have two dataframes
df1
Date RPM
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
and df2
Date RPM
0 0 0
1 2 2
2 4 4
3 6 6
I want to replace the RPM in df1 with the RPM in df2 where they have the same Date.
I tried with replace, but it didn't work out.
Use Series.map with a Series created from df2, and then replace missing values with the original column using Series.fillna:
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])
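For completeness, here is a self-contained version you can run; the construction of df1 and df2 below is my own reproduction of the tables in the question:
import pandas as pd

df1 = pd.DataFrame({'Date': range(8), 'RPM': [0] * 8})
df2 = pd.DataFrame({'Date': [0, 2, 4, 6], 'RPM': [0, 2, 4, 6]})

# look up df2's RPM by Date, fall back to df1's RPM where there is no match
df1['RPM'] = df1['Date'].map(df2.set_index('Date')['RPM']).fillna(df1['RPM'])
print(df1)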
You could merge() the two frames on the Date column to get the new RPM against the corresponding date row:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
Date RPM RPM new
0 0 0 0.0
1 1 0 NaN
2 2 0 2.0
3 3 0 NaN
4 4 0 4.0
5 5 0 NaN
6 6 0 6.0
7 7 0 NaN
You can then fill in the NaNs in RPM new using .fillna() to get the RPM column:
df['RPM'] = df['RPM new'].fillna(df['RPM'])
Date RPM RPM new
0 0 0.0 0.0
1 1 0.0 NaN
2 2 2.0 2.0
3 3 0.0 NaN
4 4 4.0 4.0
5 5 0.0 NaN
6 6 6.0 6.0
7 7 0.0 NaN
Then drop the RPM new column:
df = df.drop('RPM new', axis=1)
Date RPM
0 0 0.0
1 1 0.0
2 2 2.0
3 3 0.0
4 4 4.0
5 5 0.0
6 6 6.0
7 7 0.0
Full code:
df = df1.merge(df2, on='Date', how='left', suffixes=[None, ' new'])
df['RPM'] = df['RPM new'].fillna(df['RPM'])
df = df.drop('RPM new', axis=1)
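Yet another option, not used in the answers above, is DataFrame.update, which aligns the two frames on their index. A small sketch, assuming Date values are unique in both frames:
df1 = df1.set_index('Date')
df1.update(df2.set_index('Date'))  # overwrites RPM in place wherever the Dates match
df1 = df1.reset_index()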

Concatenate columns skipping pasted rows and columns

I hope to describe well what I need. I have a data frame with the same column names and another column that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to copy X as an index or one column (it doesn't matter) and append Y columns from each 'ID' in the following way:
You can try
out = pd.concat([group.rename(columns={'Y': f'Y{name}'}) for name, group in df.groupby('ID')])
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5]})
df = df_org.copy()
for i in set(df_org['ID']):
    df1 = df_org[df_org['ID']==i]
    col = 'Y'+str(i)
    df1.columns = ['ID', col]
    df = pd.concat([df, df1[[col]]], axis=1)
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 1.0
9 3 3 NaN NaN 3.0
10 3 4 NaN NaN 4.0
11 3 5 NaN NaN 5.0
Another solution could be as follows.
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the resulting array with s.
Use np.where to replace True with values from df.Y and False with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

s = df.ID.unique()

df[s] = np.where((np.transpose([df.ID]*len(s)) == s),
                 np.transpose([df.Y]*len(s)),
                 np.nan)

df.drop('Y', axis=1, inplace=True)
df.rename(columns={k: 'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this method should be faster than this answer, especially when the number of unique values for ID increases.
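If you want to check that claim on larger data, here is a rough timing sketch; the data sizes are arbitrary assumptions, and groupby_concat re-implements the groupby + concat answer above purely for comparison:
import timeit
import numpy as np
import pandas as pd

n_ids, per_id = 300, 10
big = pd.DataFrame({'ID': np.repeat(np.arange(1, n_ids + 1), per_id),
                    'X': np.tile(np.arange(per_id), n_ids),
                    'Y': np.random.rand(n_ids * per_id)})

def numpy_where(df):
    # the np.where approach from this answer
    s = df.ID.unique()
    out = df.copy()
    out[s] = np.where(np.transpose([df.ID] * len(s)) == s,
                      np.transpose([df.Y] * len(s)),
                      np.nan)
    return out

def groupby_concat(df):
    # the groupby + concat approach from the first answer
    return pd.concat([g.rename(columns={'Y': f'Y{name}'})
                      for name, g in df.groupby('ID')])

print(timeit.timeit(lambda: numpy_where(big), number=5))
print(timeit.timeit(lambda: groupby_concat(big), number=5))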

Pandas countif based on multiple conditions, result in new column

How can I add a field that returns 1/0 if the value in any specified column is not NaN?
Example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10],
                   'val1': [2,2,np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2],
                   'val2': [7,0.2,5,8,np.nan,1,0,np.nan,1,1],
                   })
display(df)

mycols = ['val1', 'val2']
# if entry in mycols != np.nan, then df[row, 'countif'] = 1; else 0
Desired output dataframe:
We do not need countif logic in pandas; try notna + any:
df['out'] = df[['val1','val2']].notna().any(axis=1).astype(int)
df
Out[381]:
id val1 val2 out
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
Using the iloc accessor, filter the last two columns. Check whether the count of non-NaNs in each row is more than zero. Convert the resulting Boolean to an integer.
df['countif'] = df.iloc[:, 1:].notna().sum(axis=1).gt(0).astype(int)
id val1 val2 countif
0 1 2.0 7.0 1
1 2 2.0 0.2 1
2 3 NaN 5.0 1
3 4 NaN 8.0 1
4 5 NaN NaN 0
5 6 1.0 1.0 1
6 7 NaN 0.0 1
7 8 NaN NaN 0
8 9 NaN 1.0 1
9 10 2.0 1.0 1
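If you ever need the actual count of non-NaN cells per row (a literal countif) rather than the 0/1 flag, the same notna building block works; a small sketch on the frame above:
df['count_notna'] = df[mycols].notna().sum(axis=1)  # 2, 2, 1, 1, 0, 2, 1, 0, 1, 2
print(df)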

Pandas - Replace NaNs in a column with the mean of specific group

I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace value with a new Series built from shift + expanding + mean; the first value of each group is not replaced, because no previous values exist:
df['Date'] = pd.to_datetime(df['Date'])
s = df.groupby('category')['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-01-06
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-01-09
8 1 1.5 2019-12-09
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019

Take the differences between groups of varying size in pandas groupby

I need to calculate the differences between consecutive time groups in data like the following
from io import StringIO
import pandas as pd
strio = StringIO("""\
date feat1 feat2 value
2016-10-15T00:00:00 1 1 0.0
2016-10-15T00:00:00 1 2 1.0
2016-10-15T00:00:00 2 1 2.0
2016-10-15T00:00:00 2 2 3.0
2016-10-15T00:01:00 1 1 8.0
2016-10-15T00:01:00 1 2 5.0
2016-10-15T00:02:00 1 1 8.0
2016-10-15T00:02:00 1 2 12.0
2016-10-15T00:02:00 2 1 10.0
2016-10-15T00:02:00 2 2 11.0
2016-10-15T00:03:00 1 1 12.0
2016-10-15T00:03:00 1 2 13.0
2016-10-15T00:03:00 2 1 14.0
2016-10-15T00:03:00 2 2 15.0""")
I can do this using the xarray library:
df = pd.read_table(strio, sep='\s+')
dims = df.columns.values[:3].tolist()
df.set_index(dims, inplace=True) # needed to convert to xarray dataset
dataset = df.to_xarray()
diff_time = dataset.diff(dim=dims[0]) # take the diff in time
print(diff_time.to_dataframe().reset_index())
prints
date feat1 feat2 value
0 2016-10-15T00:01:00 1 1 8.0
1 2016-10-15T00:01:00 1 2 4.0
2 2016-10-15T00:01:00 2 1 NaN
3 2016-10-15T00:01:00 2 2 NaN
4 2016-10-15T00:02:00 1 1 0.0
5 2016-10-15T00:02:00 1 2 7.0
6 2016-10-15T00:02:00 2 1 NaN
7 2016-10-15T00:02:00 2 2 NaN
8 2016-10-15T00:03:00 1 1 4.0
9 2016-10-15T00:03:00 1 2 1.0
10 2016-10-15T00:03:00 2 1 4.0
11 2016-10-15T00:03:00 2 2 4.0
So at time instant 2016-10-15T00:01:00, where feat1:2 is missing, the relevant diffs are NaN.
How can I do this in pure pandas in a vectorized way? Constructing the original dataframe with NaN fill-ins (so groups are equally sized) is an option, but I would rather avoid it.
A clumsy way to do it would be:
import itertools

dfs = []
for k, v in zip(itertools.islice(df.groupby(level=0).groups.values(), 1, None),
                df.groupby(level=0).groups.values()):
    # print(df.loc(axis=0)[k.values], df.loc(axis=0)[v.values])
    diff = df.loc(axis=0)[k.values].reset_index(level=0, drop=True) - \
           df.loc(axis=0)[v.values].reset_index(level=0, drop=True)
    diff = pd.concat([diff], keys=[k.values[0][0]], names=['date'])
    dfs.append(diff)
print(pd.concat(dfs).reset_index())
It does print the same output, but it is not vectorized.
Updated solution:
df.unstack(0)['value']\
  .diff(axis=1)\
  .dropna(how='all', axis=1)\
  .unstack([0, 1])\
  .rename('value')\
  .reset_index()
Output:
date feat1 feat2 value
0 2016-10-15T00:01:00 1 1 8.0
1 2016-10-15T00:01:00 1 2 4.0
2 2016-10-15T00:01:00 2 1 NaN
3 2016-10-15T00:01:00 2 2 NaN
4 2016-10-15T00:02:00 1 1 0.0
5 2016-10-15T00:02:00 1 2 7.0
6 2016-10-15T00:02:00 2 1 NaN
7 2016-10-15T00:02:00 2 2 NaN
8 2016-10-15T00:03:00 1 1 4.0
9 2016-10-15T00:03:00 1 2 1.0
10 2016-10-15T00:03:00 2 1 4.0
11 2016-10-15T00:03:00 2 2 4.0
Details:
After creating a three-level MultiIndex, first unstack level 0 (date), which moves the dates from rows to columns; then take diff along the columns; lastly, drop the first date with dropna (the whole column is NaN), and unstack feat1 and feat2 to recreate the MultiIndex and convert back to a dataframe.
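If it helps to see what each step produces, here is the same chain broken into named steps (a sketch assuming df still carries the three-level (date, feat1, feat2) index set earlier with set_index):
wide = df.unstack(0)['value']            # rows: (feat1, feat2), columns: the dates
diffs = wide.diff(axis=1)                # each date column minus the previous date column
diffs = diffs.dropna(how='all', axis=1)  # the first date has no predecessor, drop it
tidy = diffs.unstack([0, 1]).rename('value').reset_index()
print(tidy)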
