Compute Average/Mean across Dataframes in Python Pandas - python

I have a list of dataframes, each containing numerical data, and all shaped identically with 21 rows and 5 columns. The first column is an index (index 0 to index 20). I want to compute the average (mean) values across the dataframes into a single dataframe. Then I want to export that dataframe to Excel.
Here's a simplified version of my existing code:
import pandas as pd

# look to concatenate the dataframes together all at once
# dataFrameList is the given list of dataframes
concatenatedDataframes = pd.concat(dataFrameList, axis=1)
# grouping by the index, which is the same across all of the dataframes
groupedByIndex = concatenatedDataframes.groupby(level=0)
# take the mean
meanDataFrame = groupedByIndex.mean()
# create a Pandas Excel writer using openpyxl as the engine
writer = pd.ExcelWriter(filepath, engine='openpyxl')
meanDataFrame.to_excel(writer)
However, when I open the Excel file, it looks like EVERY dataframe has been copied into the sheet and the average/mean values are not shown. A simplified example is shown below (cutting most of the rows and dataframes):
Dataframe 1 Dataframe 2 Dataframe 3
Index Col2 Col3 Col4 Col5 Col2 Col3 Col4 Col5 Col2 Col3 Col4 Col5
0 Data Data Data Data Data Data Data Data Data Data Data Data
1 Data Data Data Data Data Data Data Data Data Data Data Data
2 Data Data Data Data Data Data Data Data Data Data Data Data
....
I'm looking for something more like:
Averaged DF
Index Col2 Col3 Col4
0 Mean Index0,Col2 across DFs Mean Index0,Col3 across DFs Mean Index0,Col4 across DFs
1 Mean Index1,Col2 across DFs Mean Index1,Col3 across DFs Mean Index1,Col4 across DFs
2 Mean Index2,Col2 across DFs Mean Index2,Col3 across DFs Mean Index2,Col4 across DFs
...
I have also already seen this answer:
Get the mean across multiple Pandas DataFrames
If possible, I'm looking for a clean solution, not one which would simply involve looping through each dataFrame value by value. Any suggestions?

Perhaps I misunderstood what you asked, but the solution is simple: you just need to concat along the correct axis.
dummy data
rows, columns = 3, 2
df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[10 + i * j for j in range(columns)] for i in range(rows)])
df2 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[i + j for j in range(columns)] for i in range(rows)])
P.S. Providing dummy data like this should really be your job as the OP.
pd.concat
df_concat0 = pd.concat((df1, df2), axis=1)
puts all the dataframes next to each other:
0 1 0 1
0 10 10 0 1
1 10 11 1 2
2 10 12 2 3
If we want to do a groupby now, we first need to stack, groupby and then unstack again:
df_concat0.stack().groupby(level=[0,1]).mean().unstack()
0 1
0 5.0 5.5
1 5.5 6.5
2 6.0 7.5
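As a small aside (a sketch of my own, not strictly needed): because the concatenated columns share the same labels, an equivalent route is to transpose, group by the column label, and transpose back, which produces the same table as above:
df_concat0.T.groupby(level=0).mean().T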
If we do
df_concat = pd.concat((df1, df2))
This puts all the dataframes on top of each other:
0 1
0 10 10
1 10 11
2 10 12
0 0 1
1 1 2
2 2 3
now we just need to group by the index, like you did:
df_concat.groupby(level=0).mean()
0 1
0 5.0 5.5
1 5.5 6.5
2 6.0 7.5
Assign that result (e.g. result = df_concat.groupby(level=0).mean()) and then use ExcelWriter as a context manager
with pd.ExcelWriter(filepath, engine='openpyxl') as writer:
    result.to_excel(writer)
or just plain
result.to_excel(filepath, engine='openpyxl')
if you can overwrite whatever is at filepath.
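Put together for the original question, a minimal sketch (assuming dataFrameList and filepath are defined as in the question):
import pandas as pd

# stack the identically shaped dataframes on top of each other and average per index
meanDataFrame = pd.concat(dataFrameList).groupby(level=0).mean()
meanDataFrame.to_excel(filepath, engine='openpyxl')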

I suppose you need, for each column, the mean of the corresponding rows across the data frames.
Concatenating a list of data frames that share an index along the columns (as your code does with axis=1) adds the columns from the other data frames to the right of the first one, as below:
col1 col2 col3 col1 col2 col3
0 1 2 3 2 3 4
1 2 3 4 3 4 5
2 3 4 5 4 5 6
3 4 5 6 5 6 7
Try appending the data frames on top of each other instead, then group by the index and take the mean to get the desired result.
import pandas as pd

## creating data frames
df1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': [2, 3, 4, 5],
                    'col3': [3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': [2, 3, 4, 5],
                    'col2': [3, 4, 5, 6],
                    'col3': [4, 5, 6, 7]})
## list of data frames
dflist = [df1, df2]
## empty data frame to use for appending
df = pd.DataFrame()
# looping through each item in the list and appending it to the empty data frame
# (note: DataFrame.append was removed in pandas 2.0; prefer the concat version further below)
for i in dflist:
    df = df.append(i)
# group by and calculate the mean on the index
data_mean = df.groupby(level=0).mean()
Write to file as you are already doing.
Alternatively:
Instead of appending in a for loop, you can also specify the axis along which to concatenate the data frames. In your case you want to concatenate along the index (axis=0) to put the data frames on top of each other, as below:
col1 col2 col3
0 1 2 3
1 2 3 4
2 3 4 5
3 4 5 6
0 2 3 4
1 3 4 5
2 4 5 6
3 5 6 7
## creating data frames
df1 = pd.DataFrame({'col1': [1, 2, 3, 4],
                    'col2': [2, 3, 4, 5],
                    'col3': [3, 4, 5, 6]})
df2 = pd.DataFrame({'col1': [2, 3, 4, 5],
                    'col2': [3, 4, 5, 6],
                    'col3': [4, 5, 6, 7]})
## list of data frames
dflist = [df1, df2]
# concat the dflist along axis 0 to put the data frames on top of each other
df_concat = pd.concat(dflist, axis=0)
# group by and calculate the mean on the index
data_mean = df_concat.groupby(level=0).mean()
Write to file as you are already doing.

Related

Merging many to many Dask

Say I have the following dataframes (suppose they are Dask dataframes):
df A =
1
1
2
2
2
2
3
4
5
5
5
5
5
5
df B =
1
2
2
3
3
3
4
5
5
5
and I would like to merge the two so that the resulting DataFrame has the most information among the two (so for instance in the case of observation 1 I would like to preserve the info of df A, in the case of observation number 3 I would like to preserve the info of df B, and so on...).
In other words the resulting DataFrame should be like this:
df C=
1
1
2
2
2
2
3
3
3
4
5
5
5
5
5
5
Is there a way to do that in Dask?
Thank you
Notes:
There are various ways to merge Dask dataframes. Dask provides various built-in methods and functions, such as dask.dataframe.DataFrame.join, dask.dataframe.multi.concat, dask.dataframe.DataFrame.merge, dask.dataframe.multi.merge, and dask.dataframe.multi.merge_asof. Depending on one's requirements one might want to use a specific one.
This thread has really valuable information on merges. Even though its focus is on Pandas, it will allow one to understand left, right, outer, and inner merges.
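For illustration, a minimal sketch of concatenating two Dask dataframes (my own example, assuming dask is installed and the frames hold a sample_id column as in the question):
import pandas as pd
import dask.dataframe as dd

df_a = dd.from_pandas(pd.DataFrame({'sample_id': [1, 1, 2, 2]}), npartitions=1)
df_b = dd.from_pandas(pd.DataFrame({'sample_id': [1, 2, 3, 3]}), npartitions=1)
combined = dd.concat([df_a, df_b])   # dask.dataframe concat
print(combined.compute())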
If one wants to do it with Pandas dataframes, there are various ways to achieve that.
One approach would be to create a dataframe that stores which source dataframe has the highest number of rows per sample_id, and then apply a custom-made function. Let's invest a bit more time in that approach.
We will first create a dataframe to store the number of rows that each dataframe has per sample_id as follows
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()})
df_count['df_a'] = df_count['sample_id'].map(df_a.groupby('sample_id').size())
df_count['df_b'] = df_count['sample_id'].map(df_b.groupby('sample_id').size())
As it will be helpful, let us create a column df_max that will store the dataframe that has more rows per sample_id
df_count['df_max'] = df_count[['df_a', 'df_b']].idxmax(axis=1)
[Out]:
sample_id df_a df_b df_max
0 1 2 1 df_a
1 2 4 2 df_a
2 3 1 3 df_b
3 4 1 1 df_a
4 5 6 3 df_a
A one-liner to create the desired df_count would look like the following
df_count = pd.DataFrame({'sample_id': df_a['sample_id'].unique()}).assign(df_a=lambda x: x['sample_id'].map(df_a.groupby('sample_id').size()), df_b=lambda x: x['sample_id'].map(df_b.groupby('sample_id').size()), df_max=lambda x: x[['df_a', 'df_b']].idxmax(axis=1))
Now, given df_a, df_b, and df_count, one will want a function to merge the dataframes based on a specific condition:
If df_max is df_a, then take the rows from df_a.
If df_max is df_b, then take the rows from df_b.
One can create a function merge_df that takes df_a, df_b, and df_count and returns the merged dataframe
def merge_df(df_a, df_b, df_count):
    # Create a list to store the dataframes
    df_list = []
    # Iterate over the rows in df_count
    for index, row in df_count.iterrows():
        # If df_max is df_a, then take the rows from df_a
        if row['df_max'] == 'df_a':
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
        # If df_max is df_b, then take the rows from df_b
        elif row['df_max'] == 'df_b':
            df_list.append(df_b[df_b['sample_id'] == row['sample_id']])
        # If df_max is neither df_a nor df_b, then use the first dataframe
        else:
            df_list.append(df_a[df_a['sample_id'] == row['sample_id']])
    # Concatenate the dataframes in df_list and return the result. Also, reset the index.
    return pd.concat(df_list).reset_index(drop=True)
Then one can apply the function
df_merged = merge_df(df_a, df_b, df_count)
[Out]:
sample_id
0 1
1 1
2 2
3 2
4 2
5 2
6 3
7 3
8 3
9 4
10 5
11 5
12 5
13 5
14 5
15 5
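As a side note, a more vectorized sketch (my own variant, assuming the same df_a, df_b and df_count as above) tags each source, concatenates, and keeps only the rows whose source matches df_max:
import pandas as pd

keep = df_count.set_index('sample_id')['df_max']
combined = pd.concat([df_a.assign(_src='df_a'), df_b.assign(_src='df_b')])
df_merged = (combined[combined['_src'] == combined['sample_id'].map(keep)]
             .drop(columns='_src')
             .sort_values('sample_id')
             .reset_index(drop=True))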

Filter pandas dataframe to get desired output

I have a dataframe with data as below.
index col1 col2
0 1 a
1 1 b
2 1 c
3 2 a
4 2 b
5 2 c
6 3 a
7 3 b
8 3 c
I need data as below. I tried getting this output using a for loop, but it's taking a lot of time as there are almost 1 million records. Please suggest some good logic using pandas functions.
col1 col2
1 a
2 b
3 c
col1ids = []
col2ids = []
for index, row in df.iterrows():
    if row['col1'] in col1ids or row['col2'] in col2ids:
        df.drop(index, inplace=True)
    else:
        col1ids.append(row['col1'])
        col2ids.append(row['col2'])
Yes, just 2 columns, and the pattern is not the same for all rows. There are many rows with just 1:1:
Col1 Col2
0700225821 461605030
0700225821 461605029
0700225822 461605030
0700225822 461605029
0700225826 461605031
0700225826 461605033
0700225824 461605031
0700225824 461605033
0700281780 437328994
0700199984 440669675
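One possible approach (a sketch, not from the original thread) keeps the same row-by-row logic as the loop above but builds a boolean mask in a single pass with sets, instead of dropping inside iterrows; this avoids repeated reindexing and is much faster on ~1 million rows:
seen_col1, seen_col2 = set(), set()
mask = []
for c1, c2 in zip(df['col1'], df['col2']):
    if c1 in seen_col1 or c2 in seen_col2:
        mask.append(False)          # either value already used by a kept row
    else:
        seen_col1.add(c1)
        seen_col2.add(c2)
        mask.append(True)
result = df[mask]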

Converting rows of the same group to one single row with Dask Dataframes

I have a dask dataframe that look like this:
group index col1 col2 col3
1 1 5 3 4
1 2 4 3 7
1 3 1 2 9
-----------------------
2 2 4 3 7
2 3 1 2 9
2 4 7 4 3
-----------------------
3 3 1 2 9
3 4 7 4 3
3 5 6 3 2
It's basically a rolling window where each group has its row plus x more rows from the dataset. I need to change it to something like this:
group col1_1 col2_1 col3_1 col1_2 col2_2 col3_2 col1_3 col2_3 col3_3
1 5 3 4 4 3 7 1 2 9
2 4 3 7 1 2 9 7 4 3
3 1 2 9 7 4 3 6 3 2
so for each group I have a row that contains all the values in that group. The number of rows per group is constant but can change, meaning it could be 10, but then it would be 10 for the whole dataset. In pandas I found a way to do it using the code below, which I found on this page: link.
indexCol = dff.index.name
dff.reset_index(inplace=True)
colNames = dff.columns
df = pd.pivot_table(dff, index=[indexCol], columns=dff.groupby(indexCol).cumcount().add(1), values=colNames,
aggfunc='sum')
df.columns = df.columns.map('{0[0]}{0[1]}'.format)
The problem is that the Dask pivot table does not work like the pandas one, and from what I have read it does not admit a multiindex, so this code does not work with Dask dataframes. I can't call compute() on the Dask dataframe either, because the dataset is too big for my memory, so I should keep it in Dask.
Thank you very much for your help
Well, I figured it out in the end, so I post it here:
def series(x):
    di = {}
    for y in x.columns:
        di.update({y + str(i + 1): t for i, t in enumerate(x[y])})
    return pd.Series(di)

dictMeta = {}
for y in colNames:
    dictMeta.update({y + str(i + 1): df[y].dtype for i in range(0, int(window))})
lista = [(k, dictMeta[k]) for k in dictMeta.keys()]

# We create the 2d dataset for the model
df = dff.groupby(indexCol).apply(lambda x: series(x[colNames]), meta=dictMeta)
where colNames are the columns of the original dataset (col1, col2 and col3 in the question) and indexCol is the name of the groupby column (group in the question). Basically we create a dictionary for each group and append it to the dataframe as a row. dictMeta provides the meta argument, since errors sometimes happen without it.
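For illustration, a minimal plain-pandas sketch (hypothetical small data, not from the answer, assuming the series() helper above is in scope) of what this produces per group:
import pandas as pd

dff = pd.DataFrame({'group': [1, 1, 1, 2, 2, 2],
                    'col1': [5, 4, 1, 4, 1, 7],
                    'col2': [3, 3, 2, 3, 2, 4],
                    'col3': [4, 7, 9, 7, 9, 3]})
colNames = ['col1', 'col2', 'col3']
wide = dff.groupby('group').apply(lambda x: series(x[colNames]))
# wide now has one row per group with columns col11, col12, col13, col21, ...
print(wide)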

Compare and remove duplicates from both dataframes

I have 2 dataframes that need to be compared and any duplicates removed:
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5,6], 'col2':[6,3,5,6]})
Out[4]:
col1 col2
0 1 2
1 2 3
2 3 4
col1 col2
0 4 6
1 2 3
2 5 5
3 6 6
What I am trying to achieve is to remove duplicates, if there are any, from both DFs and get the count of the remaining entries from the daily DF.
Expected output:
col1 col2
0 1 2
2 3 4
col1 col2
0 4 6
2 5 5
3 6 6
Count = 2
How can I do it?
Both or either DF can be empty, and daily can have more entries than monthly and vice versa.
Why not just concat both into one df and drop the duplicates completely?
s = (pd.concat([Daily.assign(source="Daily"),
                Accumulated.assign(source="Accumulated")])
     .drop_duplicates(["col1","col2"], keep=False))
print(s[s["source"].eq("Daily")])
   col1  col2 source
0     1     2  Daily
2     3     4  Daily
print(s[s["source"].eq("Accumulated")])
   col1  col2       source
0     4     6  Accumulated
2     5     5  Accumulated
3     6     6  Accumulated
You can try the code below (collecting the matching indices first and dropping once at the end, so positions don't shift mid-loop):
## For the 1st dataframe
to_drop = []
for i in range(len(df1)):
    for j in range(len(df2)):
        if df1.iloc[i].to_list() == df2.iloc[j].to_list():
            to_drop.append(df1.index[i])
            break
df1 = df1.drop(index=to_drop)
Similarly you can do it for the second dataframe.
I would do it the following way:
import pandas as pd
daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
daily['isdaily'] = True
accumulated['isdaily'] = False
together = pd.concat([daily, accumulated])
without_dupes = together.drop_duplicates(['col1','col2'],keep=False)
daily_count = sum(without_dupes['isdaily'])
I added the isdaily column to the dataframes as True/False values so they could easily be summed at the end.
If I understood correctly, you need to keep both tables separate.
You can concatenate them, keeping track of the table they come from, and then recreate them:
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Daily["Table"] = "Daily"
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
Accumulated["Table"] = "Accum"
df = pd.concat([Daily, Accumulated]).reset_index()
not_dup = df[["col1", "col2"]].drop_duplicates(keep=False)
not_dup = df.loc[not_dup.index,:]
Daily = not_dup[not_dup["Table"] == "Daily"][["col1","col2"]]
Accumulated = not_dup[not_dup["Table"] == "Accum"][["col1","col2"]]
print(Daily)
print(Accumulated)
Following these steps:
Concatenate the 2 data-frames
Drop all duplication
For each data-frame find the intersection with the concat data-frame
Find count with len
Daily = pd.DataFrame({'col1':[1,2,3], 'col2':[2,3,4]})
Accumulated = pd.DataFrame({'col1':[4,2,5], 'col2':[6,3,5]})
df = pd.concat([Daily, Accumulated]) # step 1
df = df.drop_duplicates(keep=False) # step 2
Daily = pd.merge(df, Daily, how='inner', on=['col1','col2']) #step 3
Accumulated = pd.merge(df, Accumulated, how='inner', on=['col1','col2']) #step 3
count = len(Daily) #step 4

Compare rows of 2 pandas data frames by column and keep bigger and sum

I have two data frames with the same IDs and an identical structure:
X, Y, Value, ID
The only difference between the two should be the values in column Value. It may be necessary to sort by ID first so both have the same order of rows.
I want to compare these two data frames row by row based on column Value and keep the row from the first or the second depending on where the Value is bigger. I would also like to see an example of how to add an additional column SUM holding the sum of the Value columns from both data frames.
I will be glad for any example, including using numpy if you feel it is better suited for this than Pandas.
Edit: After testing the example from the first answer, I realized that my data frames are completely missing the rows with IDs where Value was null. That makes the two data frames have a different number of rows. So could you please also include how to make them the same size before comparison, adding the rows with missing IDs from each other, with the IDs and zeros?
import numpy as np
import pandas as pd

# create a new dataframe where Value is the max value per row
val1 = df1['Value']
val2 = df2['Value'][val1.index]  # align to val1
df = df1.copy()
df['Value'] = np.maximum(val1, val2)

# add a SUM column:
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()

# or, per ID, take the max Value and the sum of Values across both frames
# (the column is named 'Value', not 'value'):
df = (pd.concat([df1, df2])
      .groupby(['ID', 'X', 'Y'])
      .agg(Value=('Value', 'max'), SUM=('Value', 'sum'))
      .reset_index())
I use reindex_like to align the dataframes, and then where and loc to fill column Value of the new dataframe df:
print(df1)
X Y Value ID
0 1 4 10 0
1 2 5 55 1
2 3 6 21 2
print(df2)
X Y Value ID
0 2 5 7 1
1 3 6 34 2
#align dataframes
df1 = df1.set_index('ID')
df2 = df2.set_index('ID')
df2 = df2.reindex_like(df1)
print(df2)
X Y Value
ID
0 NaN NaN NaN
1 2 5 7
2 3 6 34
#create new df
df = df1.copy()
df['Value'] = df1['Value'].where(df1['Value'] > df2['Value'], df2['Value'])
# if the value in df2 is NaN, take the value from df1
df.loc[df2['Value'].isnull(), 'Value'] = df1['Value']
print(df)
X Y Value
ID
0 1 4 10
1 2 5 55
2 3 6 34
# sum column Value into a SUM column
df1['SUM'] = df1['Value'].sum()
df2['SUM'] = df2['Value'].sum()
print(df1)
X Y Value SUM
ID
0 1 4 10 86
1 2 5 55 86
2 3 6 21 86
# remove rows with NaN
print(df2.dropna())
X Y Value SUM
ID
1 2 5 7 41
2 3 6 34 41
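For the edit about frames of different sizes, a minimal sketch (my own addition, assuming ID is already set as the index as above): reindex both frames to the union of IDs and fill the missing rows with zeros before comparing.
all_ids = df1.index.union(df2.index)
df1 = df1.reindex(all_ids, fill_value=0)
df2 = df2.reindex(all_ids, fill_value=0)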
