Concatenating multiple pandas dataframes when columns are not aligned - python

I have 3 dataframes:
df1
A B C
1 1 1
2 2 2
df2
A B C
3 3 3
4 4 4
df3
A B
5 5
So I want to concat all dataframes to become the following one:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 NaN
I tried pd.concat([df1,df2,df3]) with both axis=0 and axis=1, but neither works as expected.

df = pd.concat([df1,df2,df3], ignore_index=True)
df.fillna("NA", inplace=True)

If the dataframes share the same column names, this works nicely - common columns are aligned properly:
print (df1.columns.tolist())
['A', 'B', 'C']
print (df2.columns.tolist())
['A', 'B', 'C']
print (df3.columns.tolist())
['A', 'B']
If there are possibly trailing whitespaces in the column names, you can use str.strip:
print (df1.columns.tolist())
['A', 'B ', 'C']
df1.columns = df1.columns.str.strip()
print (df1.columns.tolist())
['A', 'B', 'C']
The parameter ignore_index=True creates a default RangeIndex after concat, avoiding duplicated index values, and the parameter sort avoids a FutureWarning:
df = pd.concat([df1,df2,df3], ignore_index=True, sort=True)
print (df)
A B C
0 1 1 1.0
1 2 2 2.0
2 3 3 3.0
3 4 4 4.0
4 5 5 NaN

I think you need to tell concat to ignore the index:
result = pd.concat([df1,df2,df3], ignore_index=True)
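Putting it together, a minimal runnable sketch reconstructing the three frames from the question:

```python
import pandas as pd

# The three frames from the question; df3 is missing column 'C'.
df1 = pd.DataFrame({'A': [1, 2], 'B': [1, 2], 'C': [1, 2]})
df2 = pd.DataFrame({'A': [3, 4], 'B': [3, 4], 'C': [3, 4]})
df3 = pd.DataFrame({'A': [5], 'B': [5]})

# concat aligns on column names; the missing 'C' in df3 becomes NaN.
result = pd.concat([df1, df2, df3], ignore_index=True, sort=True)
print(result)
```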

Related

Reshape pandas dataframe: Create multiple columns from one column

I would like to reshape the following dataframe
into
Could somebody help me with that?
Have you tried df.pivot() or pd.pivot()? The values in column C will become column headers. After that, flatten the multi-index columns, and rename them.
import pandas as pd
#df = df.pivot(['A', 'B'], columns='C').reset_index() #this also works
df = pd.pivot(data=df, index=['A', 'B'], columns='C').reset_index()
df.columns = ['A', 'B', 'X', 'Y']
print(df)
Output
A B X Y
0 a aa 1 5
1 b bb 6 2
2 c cc 3 7
3 d dd 8 4
Sometimes there might be repeated records with the same index; then you'd have to use pd.pivot_table() instead. The param aggfunc=np.mean takes the mean of these repeated records, so the values become floats, as you can see from the output.
import pandas as pd
import numpy as np
df = pd.pivot_table(data=df, index=['A', 'B'], columns='C', aggfunc=np.mean).reset_index()
df.columns = ['A', 'B', 'X', 'Y']
print(df)
Output
A B X Y
0 a aa 1.0 5.0
1 b bb 6.0 2.0
2 c cc 3.0 7.0
3 d dd 8.0 4.0
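As a self-contained sketch with a hypothetical input containing a repeated (A, B, C) record (using the string 'mean' for aggfunc, which newer pandas prefers over np.mean):

```python
import pandas as pd

# Hypothetical input: the (a, aa, X) record appears twice.
df = pd.DataFrame({'A': ['a', 'a', 'a'],
                   'B': ['aa', 'aa', 'aa'],
                   'C': ['X', 'X', 'Y'],
                   'D': [1, 3, 5]})

# pivot_table averages the duplicated (A, B, C) cells.
out = pd.pivot_table(data=df, index=['A', 'B'], columns='C',
                     values='D', aggfunc='mean').reset_index()
print(out)
```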
You can try
out = df.pivot(index=['A', 'B'], columns='C', values='D').reset_index()
print(out)
C A B X Y
0 a aa 1 5
1 b bb 6 2
2 c cc 3 7
3 d dd 8 4
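For reference, here is a hypothetical long-format input that reproduces the output above (passing a list as index to pivot requires pandas 1.1+):

```python
import pandas as pd

# Hypothetical long-format input: one row per (A, B, C) combination.
df = pd.DataFrame({'A': ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                   'B': ['aa', 'aa', 'bb', 'bb', 'cc', 'cc', 'dd', 'dd'],
                   'C': ['X', 'Y'] * 4,
                   'D': [1, 5, 6, 2, 3, 7, 8, 4]})

# Values in column C ('X', 'Y') become the new column headers.
out = df.pivot(index=['A', 'B'], columns='C', values='D').reset_index()
print(out)
```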

replace empty strings in a dataframe with values from another dataframe with a different index

two sample dataframes with different index values, but identical column names and order:
df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2,4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7,9])
df1
A B C
2 1 3
4 2
df2
A B C
7 4
9 5 6
I know how to concat the two dataframes, but that gives this:
A B C
2 1 3
4 2
omitting the non-matching indexes from the other df.
The result I am trying to achieve is:
A B C
0 1 4 3
1 5 2 6
I want to combine the rows with the same index values from each df so that missing values in one df are replaced by the corresponding value in the other.
Concat and Merge are not up to the job I have found.
I assume I have to have identical indexes in each df which correspond to the values I want to merge into one row. But, so far, no luck getting it to come out correctly. Any pandas transformational wisdom is appreciated.
This merge attempt did not do the trick:
df1.merge(df2, on='A', how='outer')
The solutions below were all offered before I edited the question. My fault there, I neglected to point out that my actual data has different indexes in the two dataframes.
Let us try mask
out = df1.mask(df1=='',df2)
Out[428]:
A B C
0 1 4 3
1 5 2 6
for i in range(df1.shape[0]):
    for j in range(df1.shape[1]):
        if df1.iloc[i,j] == "":
            df1.iloc[i,j] = df2.iloc[i,j]
print(df1)
A B C
0 1 4 3
1 5 2 6
Since the indexes of your two dataframes are different, it's easier to first give them the same index.
index = list(range(len(df1)))
df1.index = index
df2.index = index
ddf = df1.replace('', np.nan).fillna(df2)  # needs: import numpy as np
Even if df1 and df2 have different sizes, this still works.
df1 = pd.DataFrame([[1, '', 3], ['', 2, ''],[7,8,9],[10,11,12]], columns=['A', 'B', 'C'],index=[7,8,9,10])
index1 = [i for i in range(len(df1))]
index2 = [i for i in range(len(df2))]
df1.index = index1
df2.index = index2
df1.replace('',np.nan).fillna(df2)
You can get
Out[17]:
A B C
0 1.0 5 3.0
1 4 2.0 6
2 7.0 8.0 9.0
3 10.0 11.0 12.0
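Putting the index-alignment idea together as one runnable sketch (df2 here is reconstructed from the displayed output):

```python
import pandas as pd

df1 = pd.DataFrame([[1, '', 3], ['', 2, '']], columns=['A', 'B', 'C'], index=[2, 4])
df2 = pd.DataFrame([['', 4, ''], [5, '', 6]], columns=['A', 'B', 'C'], index=[7, 9])

# Align the two frames positionally before filling.
df1 = df1.reset_index(drop=True)
df2 = df2.reset_index(drop=True)

# Wherever df1 holds an empty string, take the value from df2.
out = df1.mask(df1 == '', df2)
print(out)
```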

Pandas GroupBy on column names

I have a dataframe, which we can proxy by
df = pd.DataFrame({'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]})
and a category series
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])
I'd like to get a sum of df's columns grouped into the categories 'A', 'B'. Maybe something like:
result = df.groupby(??, axis=1).sum()
returning
result = pd.DataFrame({'A':[3,3,4], 'B':[1,1,0]})
Use groupby + sum on the columns (the axis=1 is important here):
df.groupby(df.columns.map(category.get), axis=1).sum()
A B
0 3 1
1 3 1
2 4 0
After reindex you can assign the category as the columns of df:
df=df.reindex(columns=category.index)
df.columns=category
df.groupby(df.columns.values,axis=1).sum()
Out[1255]:
A B
0 3 1
1 3 1
2 4 0
Or pd.Series.get
df.groupby(category.get(df.columns),axis=1).sum()
Out[1262]:
A B
0 3 1
1 3 1
2 4 0
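Note that the axis=1 keyword of groupby is deprecated in recent pandas versions; an equivalent that avoids it is to transpose, group the rows, and transpose back:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]})
category = pd.Series(['A', 'B', 'B', 'A'], index=['a', 'b', 'c', 'd'])

# Transpose so columns become rows, group those rows by category,
# sum within each category, then transpose back.
result = df.T.groupby(category).sum().T
print(result)
```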
Here is what I did to group a dataframe with duplicate column names:
data_df:
1 1 2 1
q r f t
Code:
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
df_grouped:
1 2
q r t f

python pandas: merge two data frames without duplicating repeated rows

I have two dataframes: df1 and df2.
df1 is the following:
name exist
a 1
b 1
c 1
d 1
e 1
df2 (which has just one column: name) is the following:
name
e
f
g
a
h
I want to merge these two dataframes without duplicating repeated names. I mean: if a name in df2 already exists in df1, it should appear only once; if a name in df2 does not exist in df1, its exist value should be set to 0 or NaN. For example, a and e appear in both df1 and df2, so each should be shown only once. I want the result to be the following df:
a 1
b 1
c 1
d 1
e 1
f 0
g 0
h 0
I used the concat function to do it; my code is the following:
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'exist': ['1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'name': ['e', 'f', 'g', 'h', 'a']})
df = pd.concat([df1, df2])
print(df)
but the result is wrong (names a and e are shown twice):
exist name
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
0 NaN e
1 NaN f
2 NaN g
3 NaN h
4 NaN a
Please lend a hand, thanks in advance!
As indicated by your title, you can use merge instead of concat, specifying the how parameter as outer, since you want to keep all records from both df1 and df2 - which is exactly an outer join:
import pandas as pd
pd.merge(df1, df2, on = 'name', how = 'outer').fillna(0)
# exist name
# 0 1 a
# 1 1 b
# 2 1 c
# 3 1 d
# 4 1 e
# 5 0 f
# 6 0 g
# 7 0 h
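A runnable version of the merge approach; note that if exist holds strings (as in the question's setup), fillna(0) would mix integers with strings, so here exist is numeric and is cast back to int after filling:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'exist': [1, 1, 1, 1, 1]})
df2 = pd.DataFrame({'name': ['e', 'f', 'g', 'h', 'a']})

# Outer join keeps every name from both frames; names only in df2
# get NaN in 'exist', which we fill with 0.
merged = pd.merge(df1, df2, on='name', how='outer')
merged['exist'] = merged['exist'].fillna(0).astype(int)
print(merged)
```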

in Pandas, how to create a variable that is n for the nth observation within a group?

consider this
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column that is one, and using cumsum but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': ['a', 'a', 'b'],
                   'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
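To compare the two approaches side by side (cumcount gives unique integer positions, rank gives float ranks that can tie):

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# cumcount: position within each group after sorting (integer, always unique).
df = df.sort_values(['B', 'C'])
df['order_cumcount'] = df.groupby('B', sort=False).cumcount() + 1

# rank: ordering of C within each group (float; ties share a rank).
df['order_rank'] = df.groupby('B')['C'].rank()
print(df)
```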
