Merging multiple dataframes with overlapping rows and different columns - python

I have multiple pandas data frames with some common columns and some overlapping rows. I would like to combine them in such a way that I have one final data frame with all of the columns and all of the unique rows (overlapping/duplicate rows dropped). The remaining gaps should be NaN.
I have come up with the function below. In essence it goes through all columns one by one, appending all of the values from each data frame, dropping the duplicates (overlap), and building a new output data frame column by column.
import numpy as np
import pandas as pd

def combine_dfs(dataframes: list):
    ## Identifying all unique columns in all data frames
    columns = []
    for df in dataframes:
        columns.extend(df.columns)
    columns = np.unique(columns)
    ## Appending values from each data frame per column
    output_df = pd.DataFrame()
    for col in columns:
        column = pd.Series(dtype="object", name=col)
        for df in dataframes:
            if col in df.columns:
                # Series.append is gone in pandas 2.x; pd.concat does the same job
                column = pd.concat([column, df[col]])
        ## Removing overlapping data (assuming consistent values)
        column = column[~column.index.duplicated()]
        ## Adding column to output data frame
        column = pd.DataFrame(column)
        output_df = pd.concat([output_df, column], axis=1)
    output_df.sort_index(inplace=True)
    return output_df
df_1 = pd.DataFrame([[10,20,30],[11,21,31],[12,22,32],[13,23,33]], columns=["A","B","C"])
df_2 = pd.DataFrame([[33,43,54],[34,44,54],[35,45,55],[36,46,56]], columns=["C","D","E"], index=[3,4,5,6])
df_3 = pd.DataFrame([[50,60],[51,61],[52,62],[53,63],[54,64]], columns=["E","F"])
print(combine_dfs([df_1,df_2,df_3]))
The output looks like this, as intended:
A B C D E F
0 10.0 20.0 30 NaN 50 60.0
1 11.0 21.0 31 NaN 51 61.0
2 12.0 22.0 32 NaN 52 62.0
3 13.0 23.0 33 43.0 54 63.0
4 NaN NaN 34 44.0 54 64.0
5 NaN NaN 35 45.0 55 NaN
6 NaN NaN 36 46.0 56 NaN
This method works well on small data sets. Is there a way to optimize this?

IIUC you can chain combine_first:
print (df_1.combine_first(df_2).combine_first(df_3))
A B C D E F
0 10.0 20.0 30 NaN 50.0 60.0
1 11.0 21.0 31 NaN 51.0 61.0
2 12.0 22.0 32 NaN 52.0 62.0
3 13.0 23.0 33 43.0 54.0 63.0
4 NaN NaN 34 44.0 54.0 64.0
5 NaN NaN 35 45.0 55.0 NaN
6 NaN NaN 36 46.0 56.0 NaN
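If the number of frames is not fixed, the same chaining can be written with functools.reduce; a small sketch using the three sample frames above:
from functools import reduce

# Chain combine_first over an arbitrary list of dataframes
dataframes = [df_1, df_2, df_3]
combined = reduce(lambda left, right: left.combine_first(right), dataframes)
print(combined)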

Related

Pandas column concatenation

I have a dataframe (example DF1) with 300 columns of experimental data, where some of the experiments are repeated several times. I am able to use the set default method to get the column names (index), and I was wondering if there was a way to vertically append columns with similar names to a new data frame (example DF2)? I appreciate any help :)
You can melt, then use groupby + cumcount to determine the row label, and then pivot.
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,25).reshape(8,3).T,
                  columns=['E1', 'E1', 'E2', 'E3', 'E4', 'E4', 'E4', 'E5'])
Code
df2 = df.melt()
df2['idx'] = df2.groupby('variable').cumcount()
df2 = (df2.pivot(index='idx', columns='variable', values='value')
          .rename_axis(index=None, columns=None))
E1 E2 E3 E4 E5
0 1.0 7.0 10.0 13.0 22.0
1 2.0 8.0 11.0 14.0 23.0
2 3.0 9.0 12.0 15.0 24.0
3 4.0 NaN NaN 16.0 NaN
4 5.0 NaN NaN 17.0 NaN
5 6.0 NaN NaN 18.0 NaN
6 NaN NaN NaN 19.0 NaN
7 NaN NaN NaN 20.0 NaN
8 NaN NaN NaN 21.0 NaN
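An alternative route, assuming the same duplicate-labelled sample df, is to stack each group of identically named columns directly and let pd.concat pad the shorter ones with NaN:
# Stack the values of every group of like-named columns top to bottom;
# concatenating the resulting Series pads shorter columns with NaN
df2_alt = pd.concat(
    {name: pd.Series(group.to_numpy().ravel()) for name, group in df.T.groupby(level=0)},
    axis=1,
)
print(df2_alt)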

How to apply a function/impute on an interval in Pandas

I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the NaNs. However, it has to be applied within 6-month blocks (non-rolling). For example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01, where we would interpolate both forward and backward, so that if a block begins or ends in NaNs the interpolation descends towards a value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then remove the first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
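To see how the padding behaves, a quick check on the first 6-month block (values taken from the sample above):
import numpy as np
import pandas as pd

# First block (Jan-Jun 1991): pad with 0 on both ends, interpolate, drop the pads
block = [np.nan, np.nan, 24, np.nan, np.nan, np.nan]
padded = pd.Series([0] + block + [0]).interpolate()
print(padded.iloc[1:-1].tolist())  # [8.0, 16.0, 24.0, 18.0, 12.0, 6.0]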

Create pandas dataframe where each cell basis on slope calculation with time series rows from another df

I have a dataframe with about 40 columns and about 100000 rows:
ID MONTH_NUM_FROM_EVENT F1 F2 F3 F4 etc…
2 1 4.0 133.0 28.0 NaN
2 2 NaN 132.0 29.0 24.0
2 3 NaN 131.0 NaN 29.0
2 4 4.0 130.0 31.0 7.0
2 5 8.0 129.0 26.0 2.0
2 6 8.0 128.0 25.0 3.0
4 1 5.0 139.0 29.0 7.0
4 2 5.0 138.0 NaN 22.0
4 3 5.0 137.0 30.0 28.0
4 4 5.0 136.0 29.0 25.0
4 5 5.0 135.0 NaN 27.0
4 6 5.0 134.0 27.0 29.0
etc…
Each F column is a 6-month time series (with NaNs) per client ID.
I want to output a new dataframe without the months, like so:
ID F1 F2 F3 F4 etc…
2
4
etc…
where each cell of the new data frame is the slope of the 6-month time series for each F column, calculated with the following code example:
from scipy.stats import linregress

x = [6, 5, 4, 3, 2, 1]  # constant for every calculation; months in reverse order because 1 is the last month before the predicted event
y = df.F1[df['ID'] == 2]
xm = np.ma.masked_array(x, mask=np.isnan(y)).compressed()  # ignore NaNs
ym = np.ma.masked_array(y, mask=np.isnan(y)).compressed()  # ignore NaNs
linregress(xm, ym).slope
What is an efficient way to loop this calculation and create the new df?
Thanks in advance...
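A minimal sketch of one way to run this per ID with groupby and apply, assuming every ID has exactly six months ordered by MONTH_NUM_FROM_EVENT and that all feature columns start with 'F' (names taken from the sample above):
import numpy as np
import pandas as pd
from scipy.stats import linregress

def slopes_for_id(group):
    # x is the constant 6..1 month vector described above (month 1 is the
    # last month before the event, hence the reverse order)
    x = np.arange(6, 0, -1)
    out = {}
    for col in group.columns:
        y = group[col].to_numpy(dtype=float)
        mask = ~np.isnan(y)
        # linregress needs at least two points; otherwise report NaN
        out[col] = linregress(x[mask], y[mask]).slope if mask.sum() > 1 else np.nan
    return pd.Series(out)

f_cols = [c for c in df.columns if c.startswith('F')]
result = (df.sort_values('MONTH_NUM_FROM_EVENT')
            .groupby('ID')[f_cols]
            .apply(slopes_for_id))
print(result)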

How to add conditions to columns at grouped by pivot table Pandas

I've used group by and pivot table from pandas package in order to create the following table:
Input:
q4 = q1[['category','Month']].groupby(['category','Month']).Month.agg({'Count':'count'}).reset_index()
q4 = pd.DataFrame(q4.pivot(index='category',columns='Month').reset_index())
Then the output:
category Count
Month 6 7 8
0 adult-classes 29.0 109.0 162.0
1 air-pollution 27.0 43.0 13.0
2 babies-and-toddlers 4.0 51.0 2.0
3 bicycle 210.0 96.0 23.0
4 building NaN 17.0 NaN
5 buildings-maintenance 23.0 12.0 NaN
6 catering 1351.0 4881.0 1040.0
7 childcare 9.0 NaN NaN
8 city-planning 105.0 81.0 23.0
9 city-services 2461.0 2130.0 1204.0
10 city-taxes 1.0 4.0 42.0
I'm trying to add a condition on the months. The problem I'm having is that after pivoting I can't access the columns.
How can I show only the rows where the month-6 count < month-7 count < month-8 count?
To flatten your multi-index, you can use renaming of your columns (check out this answer).
q4.columns = [''.join([str(c) for c in col]).strip() for col in q4.columns.values]
To remove NaNs:
q4.fillna(0, inplace=True)
To select according to your constraint:
result = q4[(q4['Count6'] < q4['Count7']) & (q4['Count7'] < q4['Count8'])]

How can i filter consecutive data rows btw NaN rows in a pandas dataframe?

I have a dataframe that looks like the following. There are >=1 consecutive rows where y_l is populated and y_h is NaN and vice versa.
When we have more than 1 consecutive populated lines between the NaNs we only want to keep the one with the lowest y_l or the highest y_h.
e.g. on the df below from the last 3 rows we would only keep the 2nd and discard the other two.
What would be a smart way to implement that?
df = pd.DataFrame({'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['y_l', 'y_h'])
>>> df
y_l y_h
0 NaN 90.0
1 97.0 NaN
2 95.0 NaN
3 98.0 NaN
4 NaN 95
Desired result:
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95
You need to create a new column or Series to distinguish each consecutive block, then use groupby and aggregate with agg; last, to restore the order of the columns, use reindex:
a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()
df = (df.groupby(b, as_index=False)
        .agg({'y_l':'min', 'y_h':'max'})
        .reindex(columns=['y_l','y_h']))
print (df)
y_l y_h
0 NaN 90.0
1 95.0 NaN
2 NaN 95.0
Detail:
print (b)
0 1
1 2
2 2
3 2
4 3
Name: y_l, dtype: int32
What if you had more columns?
For example:
df = pd.DataFrame({'A': [np.nan, 15, 20, 25, np.nan], 'y_l': [np.nan, 97, 95, 98, np.nan], 'y_h': [90, np.nan, np.nan, np.nan, 95]}, columns=['A', 'y_l', 'y_h'])
>>> df
A y_l y_h
0 NaN NaN 90.0
1 15.0 97.0 NaN
2 20.0 95.0 NaN
3 25.0 98.0 NaN
4 NaN NaN 95.0
How could you keep the values in column A after filtering out the irrelevant rows as below?
A y_l y_h
0 NaN NaN 90.0
1 20.0 95.0 NaN
2 NaN NaN 95.0
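One possible sketch (an assumption, not taken from the accepted answer): instead of aggregating column by column, keep the whole row; within each consecutive block, pick the row with the lowest y_l when y_l is populated, otherwise the row with the highest y_h:
def pick_row(g):
    # Keep the row with the lowest y_l if the block has y_l values,
    # otherwise the row with the highest y_h
    if g['y_l'].notna().any():
        return g.loc[g['y_l'].idxmin()]
    return g.loc[g['y_h'].idxmax()]

a = df['y_l'].isnull()
b = a.ne(a.shift()).cumsum()
result = df.groupby(b).apply(pick_row).reset_index(drop=True)
print(result)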
