How to merge 3 dataframes' columns with two criteria in Python

I am trying to merge 3 columns from 3 dataframes based on 2 conditions. For example, I have the 3 dataframes below, called df_a, df_b and df_c (code to reproduce all three is at the end of the question).
I want to merge the column Results_b from df_b into df_a where the rows are for the same company and the same period. I would also like to drop the columns factor a and factor b.
I tried df_merged = pd.merge(df_a, df_b, on=['Company name', 'Period'], how='left') for merging df_a and df_b, and it works, but I am not sure how to keep only the Results_a and Results_b columns instead of merging all columns.
Lastly, I would also like to merge the column Results_c from df_c for matching company and period. However, df_c is quarterly (every 3 months) while df_a and df_b are monthly, so for the months that are not in df_c I would like to carry forward the previous available value. I am not sure how to deal with that.
This is the outcome that I would like to see:
I would really appreciate it if someone could help me! Thanks a lot.
For reproducing the dataframes:
import pandas as pd

df_a = pd.DataFrame({
    'Company name': ['A','B','C','A','B','C','A','B','C','A','B','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-03-31','2019-04-30','2019-04-30','2019-04-30'],
    'factor a': [37,41,64,52,97,10,55,47,52,61,59,70],
    'Results_a': [1,4,2,3,4,1,2,3,3,1,2,4]
})
# b
df_b = pd.DataFrame({
    'Company name': ['A','B','C','A','B','A','D','B','C'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-02-28','2019-02-28','2019-03-31','2019-03-31','2019-04-30','2019-04-30'],
    'factor b': [55,34,28,17,95,98,61,14,87],
    'Results_b': [2,3,1,4,2,1,4,1,4]
})
# c
df_c = pd.DataFrame({
    'Company name': ['A','B','C','A','D'],
    'Period': ['2019-01-31','2019-01-31','2019-01-31','2019-04-30','2019-04-30'],
    'factor c': [27,63,18,23,89],
    'Results_c': [2,1,3,4,1],
})

You can also use merge, and then update the "Results_c" values in a loop if required:
from functools import reduce

# Merge the three data sets on company and period
data_frames = [df_a, df_b, df_c]
df_result = reduce(lambda left, right: pd.merge(left, right, on=['Company name', 'Period'],
                                                how='left'), data_frames)
# Drop the factor columns
df_result = df_result[[col for col in df_result.columns if "factor" not in col]]
# Updating Results_c values if NaN for the whole month
lst_period = sorted(df_result["Period"].unique())
for i in range(len(lst_period)):
    df_temp = df_result[df_result["Period"] == lst_period[i]]
    if df_temp["Results_c"].isna().sum() == 3:  # edit this number to match your company count; 3 here because of A, B, C
        lst_val = df_result[df_result["Period"] == lst_period[i - 1]]["Results_c"]
        df_result.loc[df_result["Period"] == lst_period[i], "Results_c"] = list(lst_val)
Hope this helps. Output:
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1 2.0 2.0
1 B 2019-01-31 4 3.0 1.0
2 C 2019-01-31 2 1.0 3.0
3 A 2019-02-28 3 4.0 2.0
4 B 2019-02-28 4 2.0 1.0
5 C 2019-02-28 1 NaN 3.0
6 A 2019-03-31 2 1.0 2.0
7 B 2019-03-31 3 NaN 1.0
8 C 2019-03-31 3 NaN 3.0
9 A 2019-04-30 1 NaN 4.0
10 B 2019-04-30 2 1.0 NaN
11 D 2019-04-30 4 NaN 1.0
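As a hedged aside (not part of the original answer): you can also select just the key and Results_ columns before merging, instead of dropping the factor columns afterwards. A minimal sketch, assuming the dataframes from the question:
from functools import reduce

# Sketch: keep only the merge keys and Results_* columns up front
keys = ['Company name', 'Period']
frames = [df[keys + [c for c in df.columns if c.startswith('Results_')]]
          for df in (df_a, df_b, df_c)]
df_result = reduce(lambda left, right: pd.merge(left, right, on=keys, how='left'), frames)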

Use concat:
dfs = [df_a, df_b, df_c]
out = pd.concat([df.set_index(['Company name', 'Period']).filter(like='Results_')
                 for df in dfs],
                axis=1).reset_index()
NB: to mimic the left merge, you can add .dropna(subset='Results_a').
Output:
Company name Period Results_a Results_b Results_c
0 A 2019-01-31 1.0 2.0 2.0
1 B 2019-01-31 4.0 3.0 1.0
2 C 2019-01-31 2.0 1.0 3.0
3 A 2019-02-28 3.0 4.0 NaN
4 B 2019-02-28 4.0 2.0 NaN
5 C 2019-02-28 1.0 NaN NaN
6 A 2019-03-31 2.0 1.0 NaN
7 B 2019-03-31 3.0 NaN NaN
8 C 2019-03-31 3.0 NaN NaN
9 A 2019-04-30 1.0 NaN 4.0
10 B 2019-04-30 2.0 1.0 NaN
11 D 2019-04-30 4.0 NaN 1.0
12 D 2019-03-31 NaN 4.0 NaN
13 C 2019-04-30 NaN 4.0 NaN
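This does not yet carry the quarterly Results_c forward into the missing months. As a hedged sketch (assuming the YYYY-MM-DD period strings sort chronologically, which they do here), a per-company forward fill handles that part:
# Sketch: forward-fill quarterly Results_c within each company
out = out.sort_values(['Company name', 'Period'])
out['Results_c'] = out.groupby('Company name')['Results_c'].ffill()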

Related

Pandas - How can some column values be moved to a new column?

I have the data frame below:
d = {
    "name": ["RRR","RRR","RRR","RRR","RRR","ZZZ","ZZZ","ZZZ","ZZZ","ZZZ"],
    "id": [1,1,2,2,3,2,3,3,4,4],
    "value": [12,13,1,44,22,21,23,53,64,9]
}
df = pd.DataFrame(d)
I want the output as below:
First pivot with DataFrame.set_index using a counter from GroupBy.cumcount plus DataFrame.unstack, with a helper column ind copied from id; then sort the second level of the MultiIndex and flatten the column values:
df = (df.assign(ind=df['id'])
        .set_index(['name', 'id', df.groupby(['name', 'id']).cumcount()])[['value', 'ind']]
        .unstack(1)
        .sort_index(axis=1, kind='mergesort', level=1))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.droplevel(1).reset_index()
print(df)
name ind_1 value_1 ind_2 value_2 ind_3 value_3 ind_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0
Try this:
def func(sub: pd.DataFrame) -> pd.DataFrame:
    dfs = [g.reset_index(drop=True).rename(columns=lambda x: f'{x}_{n}')
           for n, g in sub.drop(columns='name').groupby('id')]
    return pd.concat(dfs, axis=1)

res = df.groupby('name').apply(func).droplevel(1).reset_index()
print(res)
>>>
name id_1 value_1 id_2 value_2 id_3 value_3 id_4 value_4
0 RRR 1.0 12.0 2.0 1.0 3.0 22.0 NaN NaN
1 RRR 1.0 13.0 2.0 44.0 NaN NaN NaN NaN
2 ZZZ NaN NaN 2.0 21.0 3.0 23.0 4.0 64.0
3 ZZZ NaN NaN NaN NaN 3.0 53.0 4.0 9.0

How to combine two dataframes of different lengths with a datetime index

I have two dataframes like
a = pd.DataFrame({
    'Date': ['01-01-1990', '01-01-1991', '01-01-1993'],
    'A': [1, 2, 3]
})
a = a.set_index('Date')
------------------------------------
A
Date
01-01-1990 1
01-01-1991 2
01-01-1993 3
and another one
b = pd.DataFrame({
    'Date': ['01-01-1990', '01-01-1992', '01-01-1993', '01-01-1994'],
    'B': [4, 6, 7, 8]
})
b = b.set_index('Date')
-------------------------------
B
Date
01-01-1990 4
01-01-1992 6
01-01-1993 7
01-01-1994 8
Notice that the two dataframes have different lengths (a=3, b=4) and that b has an extra Date entry, '01-01-1992'.
The issue is that when I concat these dataframes I get the result below:
pd.concat([a,b], sort=True)
------------------------------
A B
Date
01-01-1990 1.0 NaN
01-01-1991 2.0 NaN
01-01-1993 3.0 NaN
01-01-1990 NaN 4.0
01-01-1992 NaN 6.0
01-01-1993 NaN 7.0
01-01-1994 NaN 8.0
Here the dates repeat (01-01-1990, etc.) and there are NaN entries. How can I get unique dates with the values aligned, like this:
A B
Date
01-01-1990 1.0 4.0
01-01-1991 2.0 NaN
01-01-1992 NaN 6.0
01-01-1993 3.0 7.0
01-01-1994 NaN 8.0
concat by default concatenates along rows (axis=0). You can specify axis=1 so it concatenates along columns (joining on the index):
pd.concat([a, b], axis=1)
A B
01-01-1990 1.0 4.0
01-01-1991 2.0 NaN
01-01-1993 3.0 7.0
01-01-1992 NaN 6.0
01-01-1994 NaN 8.0
Or join:
a.join(b, how='outer')
Or merge:
a.merge(b, right_index=True, left_index=True, how='outer')
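A hedged aside: all three options align on the index but keep concat's encounter order, so the rows may not come out sorted. Appending sort_index() restores chronological order for this sample (the DD-MM-YYYY strings happen to sort correctly here because only the year varies; converting to a real DatetimeIndex with pd.to_datetime would be more robust):
# Sketch: outer-align a and b on Date, then sort the index
out = pd.concat([a, b], axis=1).sort_index()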

How to merge dataframes correctly?

I have dataframes to merge:
first = pd.DataFrame({
    'id': [1, 2],
    'time': [1, 2]
})
second = pd.DataFrame({
    'id': [2, 3],
    'time': [3, 4]
})
third = pd.DataFrame({
    'id': [3, 4],
    'time': [5, 6]
})
first.merge(second, on='id', how='outer', suffixes=('', '2')) \
     .merge(third, on='id', how='outer', suffixes=('', '3'))
What I have:
id time time2 time3
0 1 1.0 NaN NaN
1 2 2.0 3.0 NaN
2 3 NaN 4.0 5.0
3 4 NaN NaN 6.0
How can I get this instead?
id time time2 time3
0 1 1.0 NaN NaN
1 2 2.0 3.0 NaN
2 3 4.0 5.0 NaN
3 4 6.0 NaN NaN
I need the values to move left into the first empty column, so that in every row all the NaNs are on the right.
Fix your output with transform and sorted, which pushes each row's non-NaN values to the left:
df = df.transform(lambda x: sorted(x, key=pd.isnull), axis=1)
Out[255]:
id time time2 time3
0 1.0 1.0 NaN NaN
1 2.0 2.0 3.0 NaN
2 3.0 4.0 5.0 NaN
3 4.0 6.0 NaN NaN
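Putting the two steps together, a minimal end-to-end sketch (first, second and third as defined in the question):
merged = (first.merge(second, on='id', how='outer', suffixes=('', '2'))
               .merge(third, on='id', how='outer', suffixes=('', '3')))
# Shift each row's non-NaN values left so all NaNs end up on the right
out = merged.transform(lambda x: sorted(x, key=pd.isnull), axis=1)
print(out)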

pandas append duplicates as columns

I have a df that looks like this
ID data1 data2
index
1 1 3 4
2 1 2 5
3 2 9 3
4 3 7 2
5 3 4 7
6 1 10 12
What I'm trying to do is append as columns all the lines that have the same ID, so that I'd get something like this
ID data2 data3 data4 data5 data6 data7
index
1 1 3 4 2 5 10 12
3 2 9 3
4 3 7 2 4 7
The problem is that I don't know how many columns I will have to append.
Note that ID is NOT an index but a normal column; it is the column used to find the duplicates.
I have already tried with pd.concat(), but had no luck.
You can use cumcount to number the duplicates, then set_index + unstack for reshaping. Then convert the MultiIndex columns to flat names with map, and finally reset_index to move ID back from the index to a column.
df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.set_index(['ID','g']).unstack().sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
ID data1_0 data2_0 data1_1 data2_1 data1_2 data2_2
0 1 3.0 4.0 2.0 5.0 10.0 12.0
1 2 9.0 3.0 NaN NaN NaN NaN
2 3 7.0 2.0 4.0 7.0 NaN NaN
Solution with pivot:
df['g'] = df.groupby('ID').cumcount().astype(str)
df = df.pivot(index='ID',columns='g').sort_index(axis=1, level=1)
df.columns = df.columns.map('_'.join)
df = df.reset_index()
print (df)
ID data1_0 data2_0 data1_1 data2_1 data1_2 data2_2
0 1 3.0 4.0 2.0 5.0 10.0 12.0
1 2 9.0 3.0 NaN NaN NaN NaN
2 3 7.0 2.0 4.0 7.0 NaN NaN
Another solution with apply and the DataFrame constructor:
df = (df.groupby('ID')[['data1', 'data2']]
        .apply(lambda x: pd.DataFrame(x.values, columns=['a', 'b']))
        .unstack()
        .sort_index(axis=1, level=1))
df.columns = df.columns.map('{0[0]}_{0[1]}'.format)
df = df.reset_index()
print (df)
ID a_0 b_0 a_1 b_1 a_2 b_2
0 1 3.0 4.0 2.0 5.0 10.0 12.0
1 2 9.0 3.0 NaN NaN NaN NaN
2 3 7.0 2.0 4.0 7.0 NaN NaN

Delete rows in dataframe based on column values

I need to rid myself of all rows with a null value in column C. Here is the code:
infile="C:\****"
df=pd.read_csv(infile)
A B C D
1 1 NaN 3
2 3 7 NaN
4 5 NaN 8
5 NaN 4 9
NaN 1 2 NaN
There are two basic methods I have attempted.
method 1:
source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN
df.dropna()
The result is an empty dataframe, which makes sense because there is an NaN value in every row.
df.dropna(subset=[3])
For this method I tried to play around with the subset value using both column index number and column name. The dataframe is still empty.
method 2:
source: Deleting DataFrame row in Pandas based on column value
df = df[df.C.notnull()]
Still results in an empty dataframe!
What am I doing wrong?
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 1, np.nan, 3], [2, 3, 7, np.nan], [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9], [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
Here is proof that your method 2 works properly (at least with pandas 0.18.0):
In [100]: df
Out[100]:
A B C D
0 1.0 1.0 NaN 3.0
1 2.0 3.0 7.0 NaN
2 4.0 5.0 NaN 8.0
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [101]: df.dropna(subset=['C'])
Out[101]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [102]: df[df.C.notnull()]
Out[102]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
In [103]: df = df[df.C.notnull()]
In [104]: df
Out[104]:
A B C D
1 2.0 3.0 7.0 NaN
3 5.0 NaN 4.0 9.0
4 NaN 1.0 2.0 NaN
