Column-wise concatenation of dataframes with different indexes in pandas - python

Suppose we have a list of dataframes A which contains three dataframes df_a, df_b, and df_c:
A = [df_a, df_b, df_c]
df_a =
morning noon night
date
2019-12-31 B 3.0 3.0 0.0
C 0.0 0.0 1.0
D 0.0 1.0 0.0
E 142.0 142.0 142.0
df_b =
morning noon night
date
2020-01-31 A 3.0 0.0 0.0
B 1.0 0.0 0.0
E 142.0 145.0 145.0
df_c =
morning noon night
date
2020-02-29 F 145.0 145.0 145.0
All dataframes have morning, noon, and night columns and share the same index structure: a date plus a label from [A, B, C, D, E, F]. I want to concatenate the three dataframes into one dataframe (say full_df) in which every date has the same set of rows/indexes.
But as you can see, each dataframe has a different number of rows: df_a, df_b, and df_c have [B, C, D, E], [A, B, E], and [F] respectively.
Is there a way to concat these dataframes so that each date gets the union of all the labels from the three combined, returning 0.0 wherever a label is not present in the original dataframe?
This is what I was thinking about full_df:
full_df =
morning noon night
date
2019-12-31 A 0.0 0.0 0.0
B 3.0 3.0 0.0
C 0.0 0.0 1.0
D 0.0 1.0 0.0
E 142.0 142.0 142.0
F 0.0 0.0 0.0
2020-01-31 A 3.0 0.0 0.0
B 1.0 0.0 0.0
C 0.0 0.0 0.0
D 0.0 0.0 0.0
E 142.0 145.0 145.0
F 0.0 0.0 0.0
2020-02-29 A 0.0 0.0 0.0
B 0.0 0.0 0.0
C 0.0 0.0 0.0
D 0.0 0.0 0.0
E 0.0 0.0 0.0
F 145.0 145.0 145.0

You can try:
pd.concat(A).unstack(level=-1, fill_value=0).stack()
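A minimal, self-contained sketch of this approach, rebuilding the three toy frames from the question (the make_df helper is illustrative scaffolding, not part of the original):
import pandas as pd

def make_df(date, labels, rows):
    # build a (date, label) MultiIndex frame like df_a/df_b/df_c
    idx = pd.MultiIndex.from_product([[pd.Timestamp(date)], labels],
                                     names=['date', None])
    return pd.DataFrame(rows, index=idx,
                        columns=['morning', 'noon', 'night'])

df_a = make_df('2019-12-31', list('BCDE'),
               [[3.0, 3.0, 0.0], [0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0], [142.0, 142.0, 142.0]])
df_b = make_df('2020-01-31', list('ABE'),
               [[3.0, 0.0, 0.0], [1.0, 0.0, 0.0], [142.0, 145.0, 145.0]])
df_c = make_df('2020-02-29', ['F'], [[145.0, 145.0, 145.0]])
A = [df_a, df_b, df_c]

# unstack() pivots the label level into columns, filling the gaps with 0;
# stack() then restores a (date, label) index that covers A-F on every date
full_df = pd.concat(A).unstack(level=-1, fill_value=0).stack()
print(full_df)
An equivalent, more explicit route is to reindex pd.concat(A) against a pd.MultiIndex.from_product built from the union of dates and labels, with fill_value=0.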

Related

Concat dataframes/series with axis=1 in a loop

I have a dataframe of email senders as follows.
I am trying to get as output a dataframe with the number of emails sent by each person per month.
I want the index to be the month end and the columns to be the persons.
I am able to build this but with two issues:
First, I am using multiple pd.concat statements (all the df_temps), which is ugly and does not scale.
Is there a way to put this in a for loop, or some other way to loop over, say, the first n persons?
Second, while it puts all the data together correctly, there is a discontinuity in the index.
The second last row is 1999-01-31 and the last one is 2000-01-31.
Is there an option or a way to get NaN for the in-between months?
Code below:
import pandas as pd
df_in = pd.DataFrame({
'sender':['Able Boy','Able Boy','Able Boy','Mark L. Taylor','Mark L. Taylor',
'Mark L. Taylor','scott kirk','scott kirk','scott kirk','scott kirk',
'Able Boy','Able Boy','james h. madison','james h. madison','james h. madison',
'james joyce','scott kirk','james joyce','james joyce','james joyce',
'james h. madison','Able Boy'],
'receiver':['Toni Z. Zapata','Mark Angel','scott kirk','paul a boyd','michelle fam',
'debbie bradford','Mark Angel','Johnny C. Cash','Able Boy','Mark L. Taylor',
'jenny chang','julie s. smith', 'scott kirk', 'tiffany r.','Able Boy',
'Mark Angel','Able Boy','julie s. smith','jenny chang','debbie bradford',
'Able Boy','Toni Z. Zapata'],
'time':[911929000000,911929000000,910228000000,911497000000,911497000000,
911932000000,914261000000,914267000000,914269000000,914276000000,
914932000000,915901000000,916001000000,916001000000,916001000000,
947943000000,947943000000,947943000000,947943000000,947943000000,
916001000000,911929100000],
'email_ID':['<A34E5R>','<A34E5R>','<B34E5R>','<C34E5R>','<C34E5R>',
'<C36E5R>','<C36E5A>','<C36E5B>','<C36E5C>','<C36E5D>',
'<D000A0>','<D000A1>','<D000A2>','<D000A2>','<D000A2>',
'<D000A3>','<D000A3>','<D000A3>','<D000A3>','<D000A3>',
'<D000A4>','<A34E5S>']
})
df_in['time'] = pd.to_datetime(df_in['time'],unit='ms')
df_1 = df_in.copy()
df_1['number'] = 1
df_2 = df_1.drop_duplicates(subset="email_ID",keep="first",inplace=False)\
.reset_index()
df_3 = df_2.drop(columns=['index','receiver','email_ID'],inplace=False)
df_6 = df_3.groupby(['sender',pd.Grouper(key='time',freq='M')]).sum()
df_6_squeezed = df_6.squeeze()
df_grp_1 = df_3.groupby(['sender']).count()
df_grp_1.sort_values(by=['number'],ascending=False,inplace=True)
toppers = list(df_grp_1.index.array)
df_temp_1 = df_6_squeezed[toppers[0]]
df_temp_2 = df_6_squeezed[toppers[1]]
df_temp_3 = df_6_squeezed[toppers[2]]
df_temp_4 = df_6_squeezed[toppers[3]]
df_temp_5 = df_6_squeezed[toppers[4]]
df_temp_1.rename(toppers[0],inplace=True)
df_temp_2.rename(toppers[1],inplace=True)
df_temp_3.rename(toppers[2],inplace=True)
df_temp_4.rename(toppers[3],inplace=True)
df_temp_5.rename(toppers[4],inplace=True)
df_concat_1 = pd.concat([df_temp_1,df_temp_2],axis=1,sort=False)
df_concat_2 = pd.concat([df_concat_1,df_temp_3],axis=1,sort=False)
df_concat_3 = pd.concat([df_concat_2,df_temp_4],axis=1,sort=False)
df_concat_4 = pd.concat([df_concat_3,df_temp_5],axis=1,sort=False)
print("\nCONCAT (df_concat_4):")
print(df_concat_4)
print(type(df_concat_4))
Consider pivot_table after calculating month_end (see @Root's answer). Also, use reindex to fill in the missing months. In Pandas, grouping aggregations such as counting senders per month usually do not require looping or temporary helper dataframes.
from pandas.tseries.offsets import MonthEnd

df_in['month_end'] = (df_in['time'] + MonthEnd(0)).dt.normalize()
agg_df = (df_in.pivot_table(index='month_end', columns='sender',
                            values='time', aggfunc='count')
               .reindex(pd.date_range('1998-01-01', '2000-01-31', freq='M'),
                        axis='index')
               .fillna(0))
Output
print(agg_df)
# sender Able Boy Mark L. Taylor james h. madison james joyce scott kirk
# month_end
# 1998-01-31 0.0 0.0 0.0 0.0 0.0
# 1998-02-28 0.0 0.0 0.0 0.0 0.0
# 1998-03-31 0.0 0.0 0.0 0.0 0.0
# 1998-04-30 0.0 0.0 0.0 0.0 0.0
# 1998-05-31 0.0 0.0 0.0 0.0 0.0
# 1998-06-30 0.0 0.0 0.0 0.0 0.0
# 1998-07-31 0.0 0.0 0.0 0.0 0.0
# 1998-08-31 0.0 0.0 0.0 0.0 0.0
# 1998-09-30 0.0 0.0 0.0 0.0 0.0
# 1998-10-31 0.0 0.0 0.0 0.0 0.0
# 1998-11-30 4.0 3.0 0.0 0.0 0.0
# 1998-12-31 1.0 0.0 0.0 0.0 4.0
# 1999-01-31 1.0 0.0 4.0 0.0 0.0
# 1999-02-28 0.0 0.0 0.0 0.0 0.0
# 1999-03-31 0.0 0.0 0.0 0.0 0.0
# 1999-04-30 0.0 0.0 0.0 0.0 0.0
# 1999-05-31 0.0 0.0 0.0 0.0 0.0
# 1999-06-30 0.0 0.0 0.0 0.0 0.0
# 1999-07-31 0.0 0.0 0.0 0.0 0.0
# 1999-08-31 0.0 0.0 0.0 0.0 0.0
# 1999-09-30 0.0 0.0 0.0 0.0 0.0
# 1999-10-31 0.0 0.0 0.0 0.0 0.0
# 1999-11-30 0.0 0.0 0.0 0.0 0.0
# 1999-12-31 0.0 0.0 0.0 0.0 0.0
# 2000-01-31 0.0 0.0 0.0 4.0 1.0

Can't Re-Order Columns Data

I have a dataframe whose columns are out of sequence. len(df.columns) shows the data has 3586 columns. How can I re-order the columns into numeric sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it still doesn't work: the sort is lexicographic, so for example V1000 comes before V2.
Thank you
First get all columns that do not match the pattern V + number by filtering with str.contains, then sort all the remaining values obtained via Index.difference numerically, join the two lists, and pass the result to DataFrame.reindex. This puts all the non-matching, non-numeric columns in the first positions, followed by the sorted V + number columns:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

How to drop columns of Pandas DataFrame with zero values in the last row

I have a Pandas Dataframe which tells me monthly sales of items in shops
df.head():
ID month sold
0 150983 0 1.0
1 56520 0 13.0
2 56520 1 7.0
3 56520 2 13.0
4 56520 3 8.0
I want to remove all IDs where there were no sales last month. I.e. month == 33 & sold == 0. Doing the following
unwanted_df = df[((df['month'] == 33) & (df['sold'] == 0.0))]
I just get 46 rows, which is far too few. But never mind, I would like to have the data in a different format anyway. The pivoted version of the table above is just what I want:
pivoted_df = df.pivot(index='month', columns = 'ID', values = 'sold').fillna(0)
pivoted_df.head()
ID 0 2 3 5 6 7 8 10 11 12 ... 214182 214185 214187 214190 214191 214192 214193 214195 214197 214199
month
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Question: how do I remove the columns of pivoted_df whose value in the last row is 0?
You can do this with one line:
pivoted_df = pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1, :] == 0], axis=1)
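Equivalently, and arguably more readable, you can keep the non-zero columns with a boolean mask rather than collecting the zero ones to drop; a one-line sketch assuming pivoted_df from above:
# keep only the columns whose last-row value is non-zero
pivoted_df = pivoted_df.loc[:, pivoted_df.iloc[-1] != 0]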
I want to remove all IDs where there were no sales last month
You can first calculate the IDs satisfying your condition:
id_selected = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']
Then filter these from your dataframe via a Boolean mask:
df = df[~df['ID'].isin(id_selected)]
Finally, use pd.pivot_table with your filtered dataframe.
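Putting those steps together, a minimal sketch of the pipeline (aggfunc='sum' is an assumption here; the question's df.pivot implies at most one row per ID/month pair, in which case summing reproduces it):
import pandas as pd

# IDs with zero sales in the last month (month 33)
dead_ids = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']

# drop those IDs, then pivot months into rows and IDs into columns
filtered = df[~df['ID'].isin(dead_ids)]
pivoted_df = pd.pivot_table(filtered, index='month', columns='ID',
                            values='sold', aggfunc='sum', fill_value=0)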

Losing Values Joining DataFrames

I can't work out why this code is dropping values
solddf[['Name', 'Barcode', 'SalesRank', 'SoldPrices', 'SoldDates', 'SoldIds']].head()
Out[3]:
Name Barcode \
62693 Near Dark [DVD] [1988] [Region 1] [US Import] ... 1.313124e+10
94823 Battlefield 2 Modern Combat / Game 1.463315e+10
24965 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24964 Star Wars: The Force Unleashed (PS3) 2.327201e+10
24963 Star Wars: The Force Unleashed (PS3) 2.327201e+10
SalesRank SoldPrices SoldDates SoldIds
62693 14.04 2017-08-05 07:28:56 162558627930
94823 1.49 2017-09-06 04:48:42 132301267483
24965 4.29 2017-08-23 18:44:42 302424166550
24964 5.27 2017-09-08 19:55:02 132317908530
24963 5.56 2017-09-15 08:23:24 132322978130
Here's my dataframe. It stores each sale I pull from an eBay API as a new row.
My aim is to look for correlation between weekly sales and Amazon's Sales Rank.
solddf['Week'] = solddf['SoldDates'].apply(lambda x: x.week)
weeklysales = solddf.groupby(['Barcode', 'Week']).size().unstack()
weeklysales = weeklysales.fillna(0)
weeklysales['Mean'] = weeklysales.mean(axis=1)
weeklysales.head()
Out[5]:
Week 29 30 31 32 33 34 35 36 37 38 39 40 41 \
Barcode
1.313124e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1.463315e+10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
2.327201e+10 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 2.0 2.0 0.0 2.0 1.0
2.327201e+10 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2.327201e+10 0.0 0.0 3.0 2.0 2.0 2.0 1.0 1.0 5.0 0.0 2.0 2.0 1.0
Week 42 Mean
Barcode
1.313124e+10 0.0 0.071429
1.463315e+10 0.0 0.071429
2.327201e+10 0.0 0.642857
2.327201e+10 0.0 0.142857
2.327201e+10 0.0 1.500000
So, I've worked out the mean weekly sales for each item (or barcode)
I then want to take the mean values and insert them back into my solddf dataframe that I started with.
s1 = pd.Series(weeklysales.Mean, index=solddf.Barcode).reset_index()
s1 = s1.sort_values('Barcode')
s1.head()
Out[17]:
Barcode Mean
0 1.313124e+10 0.071429
1 1.463315e+10 0.071429
2 2.327201e+10 0.642857
3 2.327201e+10 0.642857
4 2.327201e+10 0.642857
This looks fine: it has the right number of rows and should fit.
solddf = solddf.sort_values('Barcode')
solddf['WeeklySales'] = s1.Mean
This method seems to work, but I'm having an issue: some np.nan values have now appeared that weren't in s1 before
s1.Mean.isnull().sum()
Out[13]: 0
len(s1) == len(solddf)
Out[14]: True
But loads of the values that came across are now np.nan
solddf.WeeklySales.isnull().sum()
Out[16]: 27214
Can anyone tell me why?
While writing this I had an idea for a work-around
s1list = s1.Mean.tolist()
solddf['WeeklySales'] = s1list
solddf.WeeklySales.isnull().sum()
Out[20]: 0
Still curious what the problem with the previous method is though!
Column assignment aligns on index labels, not position: after sort_values, solddf keeps its original integer labels while s1 carries the fresh RangeIndex it got from reset_index, so almost none of the labels match and the unmatched rows become NaN. Instead of trying to align the two indexes by hand, you should just use pd.merge.
output = pd.merge(solddf, s1, on='Barcode')
This way you can also choose the type of join you want via the how kwarg.
I would also advise reading Merge, join, and concatenate as it covers a lot of helpful methods for combining dataframes.
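To see the alignment problem in isolation, here is a tiny sketch (with made-up values) of what the column assignment was doing:
import pandas as pd

left = pd.DataFrame({'Barcode': [30, 10, 20]}, index=[62693, 94823, 24965])
s = left.sort_values('Barcode')['Barcode'].reset_index(drop=True)

# the assignment aligns s's RangeIndex (0, 1, 2) against left's labels
# (62693, 94823, 24965); nothing matches, so every value becomes NaN
left['aligned'] = s
print(left['aligned'].isnull().sum())  # 3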

Set value based on day in month in pandas timeseries

I have a timeseries
date
2009-12-23 0.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 0.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
I would like to set the value to 1 based on the following rule:
If the constant is set to 9, this means the 9th of each month. Because
2010-01-09 doesn't exist in the series, I would like to set the next date
that does exist, which is 2010-01-11 above, to 1.
I have tried to create two series: one (series1) with day < 9 set to 1 and one (series2) with day > 9 set to 1, and then computing series1.shift(1) * series2.
It works in the middle of the month, but not when day is set to 1, because the last date in the previous month is set to 0 in series1.
Assume your timeseries is s with a DatetimeIndex.
I want to create a groupby object of all index values whose days are greater than or equal to 9.
g = s.index.to_series().dt.day.ge(9).groupby(pd.Grouper(freq='M'))
Then I'll check that each month has at least one day >= 9 and grab the first such day within it. Those positions are assigned the value 1.
s.loc[g.idxmax()[g.any()]] = 1
s
date
2009-12-23 1.0
2009-12-28 0.0
2009-12-29 0.0
2009-12-30 0.0
2009-12-31 0.0
2010-01-04 0.0
2010-01-05 0.0
2010-01-06 0.0
2010-01-07 0.0
2010-01-08 0.0
2010-01-11 1.0
2010-01-12 0.0
2010-01-13 0.0
2010-01-14 0.0
2010-01-15 0.0
2010-01-18 0.0
2010-01-19 0.0
2010-01-20 0.0
2010-01-21 0.0
2010-01-22 0.0
2010-01-25 0.0
2010-01-26 0.0
2010-01-27 0.0
2010-01-28 0.0
2010-01-29 0.0
2010-02-01 0.0
2010-02-02 0.0
Name: val, dtype: float64
Note that 2009-12-23 was also assigned a 1, since it satisfies the requirement as well: it is the first date in the series on or after the 9th of December.
