Prevent pandas concat'ting my dataframes both vertically and horizontally - python

I am trying to concat two dataframes, horizontally. df2 contains 2 result variables for every observation in df1.
df1.shape
(242583, 172)
df2.shape
(242583, 2)
My code is:
Fin = pd.concat([df1, df2], axis= 1)
But somehow the result is stacked in 2 dimensions:
Fin.shape
(485166, 174)
What am I missing here?

The two DataFrames have different index values, so pd.concat aligns on the index, finds no matching labels, and fills the gaps with NaN:
df1 = pd.DataFrame({
'A': ['a','a','a'],
'B': range(3)
})
print (df1)
A B
0 a 0
1 a 1
2 a 2
df2 = pd.DataFrame({
'C': ['b','b','b'],
'D': range(4,7)
}, index=[5,7,8])
print (df2)
C D
5 b 4
7 b 5
8 b 6
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
0 a 0.0 NaN NaN
1 a 1.0 NaN NaN
2 a 2.0 NaN NaN
5 NaN NaN b 4.0
7 NaN NaN b 5.0
8 NaN NaN b 6.0
One possible solution is to reset both DataFrames to the default index:
Fin = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis= 1)
print (Fin)
A B C D
0 a 0 b 4
1 a 1 b 5
2 a 2 b 6
Or assign the index of one DataFrame to the other:
df2.index = df1.index
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
0 a 0 b 4
1 a 1 b 5
2 a 2 b 6
df1.index = df2.index
Fin = pd.concat([df1, df2], axis= 1)
print (Fin)
A B C D
5 a 0 b 4
7 a 1 b 5
8 a 2 b 6
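Before concatenating, it can help to confirm that the indexes really differ. A minimal sketch, reusing df1 and df2 from the example above; the names aligned and fin are illustrative only, and pairing rows by position assumes both frames are already in the same row order:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a', 'a', 'a'], 'B': range(3)})
df2 = pd.DataFrame({'C': ['b', 'b', 'b'], 'D': range(4, 7)}, index=[5, 7, 8])

# False here: the labels 0, 1, 2 and 5, 7, 8 share nothing, so concat
# along axis=1 would return the union of both indexes (6 rows).
aligned = df1.index.equals(df2.index)

if not aligned and len(df1) == len(df2):
    # Assumption: the rows are in matching order, so pairing by position is wanted.
    df2 = df2.set_axis(df1.index, axis=0)

fin = pd.concat([df1, df2], axis=1)
print(fin.shape)
```

If the lengths differ, or the row order is not guaranteed, overwriting the index would silently pair the wrong rows, so a merge on a real key column is the safer route in that case.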

If you are looking for the one-liner, there is the set_index method:
import pandas as pd
x = pd.DataFrame({'A': ["a"] * 3, 'B': range(3)})
y = pd.DataFrame({'C': ["b"] * 3, 'D': range(4,7)})
pd.concat([x, y.set_index(x.index)], axis = 1)
Note that pd.concat([x, y], axis = 1) will instead create new rows and produce NaN values whenever the indexes of x and y do not match, as shown by @jezrael.
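With x and y exactly as defined above, both frames share the default 0..2 index, so the difference only shows once the labels diverge. A small sketch under that assumption; the index=[5, 7, 8] on y and the names misaligned and paired are made up for illustration:

```python
import pandas as pd

x = pd.DataFrame({'A': ['a'] * 3, 'B': range(3)})
# Hypothetical variation: y carries labels that do not occur in x.
y = pd.DataFrame({'C': ['b'] * 3, 'D': range(4, 7)}, index=[5, 7, 8])

# Plain concat aligns on labels: union of both indexes, padded with NaN.
misaligned = pd.concat([x, y], axis=1)

# Giving y the index of x pairs the rows by position instead.
paired = pd.concat([x, y.set_index(x.index)], axis=1)

print(misaligned.shape)
print(paired.shape)
```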

Related

Melting multiple columns into one column

I have been trying to melt these columns:
d = {'key': [1,2,3,4,5], 'a': ['None','a', 'None','None','None'], 'b': ['None','None','b','None','None'],'c':['None','None','None','c','c']}
df = pd.DataFrame(d)
I need it to look like this:
key letter
1 None
2 a
3 b
4 c
5 c
I tried:
df = pd.melt(df,id_vars=['key'], var_name = 'letters')
but I got:
key letters value
1 a None
2 a a
3 a None
4 a None
5 a None
1 b None
2 b None
3 b b
4 b None
5 b None
1 c None
2 c None
3 c None
4 c c
5 c c
If you need the first non-None value per row after the key column, use DataFrame.set_index, replace the 'None' strings with missing values, back-fill along each row, select the first column by position, and finally call Series.reset_index:
df = (df.set_index('key')
.replace('None', np.nan)
.bfill(axis=1)
.iloc[:, 0]
.reset_index(name='letter'))
print (df)
key letter
0 1 NaN
1 2 a
2 3 b
3 4 c
4 5 c
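The same chain can be written without np.nan by masking the placeholder strings directly. A self-contained sketch using the data from the question; the names wide and out are illustrative, and DataFrame.mask is swapped in here for the replace call used above:

```python
import pandas as pd

d = {'key': [1, 2, 3, 4, 5],
     'a': ['None', 'a', 'None', 'None', 'None'],
     'b': ['None', 'None', 'b', 'None', 'None'],
     'c': ['None', 'None', 'None', 'c', 'c']}
df = pd.DataFrame(d)

wide = df.set_index('key')
out = (wide.mask(wide.eq('None'))    # turn the 'None' strings into NaN
           .bfill(axis=1)            # pull the first real value leftwards
           .iloc[:, 0]               # keep only the leftmost column
           .reset_index(name='letter'))
print(out)
```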
If a row can contain several non-None values, use:
d = {'key': [1,2,3,4,5],
'a': ['None','a', 'None','None','None'],
'b': ['None','b','b','None','None'],
'c':['None','None','None','c','c']}
df = pd.DataFrame(d)
df = (df[['key']].join(df.set_index('key')
.replace('None', np.nan)
.stack()
.groupby(level=0)
.agg(','.join)
.rename('letter'), on='key'))
print (df)
key letter
0 1 NaN
1 2 a,b
2 3 b
3 4 c
4 5 c
Or:
df = (df.set_index('key')
.replace('None', np.nan)
.apply(lambda x: ','.join(x.dropna()), axis=1)
.replace('', np.nan)
.reset_index(name='letter'))
print (df)
key letter
0 1 NaN
1 2 a,b
2 3 b
3 4 c
4 5 c
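Since the question started from pd.melt, here is a sketch of how the melted frame could be reduced to one row per key. It assumes the 'None' placeholders are strings, as in the question; the names long, joined, and out and the 'source' column are illustrative only:

```python
import pandas as pd

d = {'key': [1, 2, 3, 4, 5],
     'a': ['None', 'a', 'None', 'None', 'None'],
     'b': ['None', 'b', 'b', 'None', 'None'],
     'c': ['None', 'None', 'None', 'c', 'c']}
df = pd.DataFrame(d)

# Long format: one row per (key, source column) pair.
long = df.melt(id_vars='key', var_name='source', value_name='letter')
long = long[long['letter'] != 'None']           # drop the placeholder rows

# Join whatever letters remain for each key.
joined = long.groupby('key')['letter'].agg(','.join)

out = (joined.reindex(df['key'].to_numpy())     # bring back keys with no letters
             .rename_axis('key')
             .reset_index())
print(out)
```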

How can I return the value of a column in a new column based on conditions with python

I have a dataframe with three columns
a b c
[1,0,2]
[0,3,2]
[0,0,2]
and need to create a fourth column based on a hierarchy as follows:
If column a has value then column d = column a
if column a has no value but b has then column d = column b
if column a and b have no value but c has then column d = column c
a b c d
[1,0,2,1]
[0,3,2,3]
[0,0,2,2]
I'm quite the beginner at python and have no clue where to start.
Edit: I have tried the following, but none of them returns a value in column d if column a is empty or None:
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d'] = df['a']
df.loc[df['a'] == None, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
df['d']=np.where(df.a!=0, df.a,\
np.where(df.b!=0,\
df.b, df.c))
A simple one-liner would be:
df['d'] = df.replace(0, np.nan).bfill(axis=1)['a'].astype(int)
Step by step visualization
Convert the zeros (no value) to NaN
a b c
0 1.0 NaN 2
1 NaN 3.0 2
2 NaN NaN 2
Now backward fill the values along rows
a b c
0 1.0 2.0 2.0
1 3.0 3.0 2.0
2 2.0 2.0 2.0
Now select the required column, i.e. 'a', and create a new column 'd'
Output
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
import numpy as np
import pandas as pd
df = pd.DataFrame([[1,0,2], [0,3,2], [0,0,2]], columns = ('a','b','c'))
print(df)
df['d'] = df['a']
df.loc[df['a'] == 0, 'd'] = df['b']
df.loc[~df['a'].astype('bool') & ~df['b'].astype('bool'), 'd'] = df['c']
print(df)
Try this (df is your dataframe):
df['d'] = np.where((df.a != 0) & df.a.notna(), df.a, np.where((df.b != 0) & df.b.notna(), df.b, df.c))
>>> print(df)
a b c d
0 1 0 2 1
1 0 3 2 3
2 0 0 2 2
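The question mentions that column a may be empty or None rather than 0. A sketch with np.select that treats both 0 and missing values as "no value"; the sample frame with a None in column a is made up for illustration, and has_a and has_b are hypothetical helper names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has a missing value in column a.
df = pd.DataFrame({'a': [1, 0, None], 'b': [0, 3, 0], 'c': [2, 2, 2]})

# A cell "has a value" when it is neither missing nor zero.
has_a = df['a'].notna() & df['a'].ne(0)
has_b = df['b'].notna() & df['b'].ne(0)

# Conditions are checked in order, so a wins over b, and c is the fallback.
df['d'] = np.select([has_a, has_b], [df['a'], df['b']], default=df['c'])
print(df)
```

Because column a holds a missing value, it is stored as float, so d comes out as float as well; cast with astype(int) only when no NaN can remain.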

Change of column names after concat()

I have 2 dfs:
df1
a b
0 1 2
1 3 4
df2
c d
0 5 4
1 2 3
After concat, I get weird column names:
[In:]
df3=pd.concat([df1, df2], axis=1)
[Out:]
a b (c,) (d,)
0 1 2 5 4
1 3 4 2 3
df2 has had tuples in its columns before, maybe that's the reason.
If I try to get the dtypes, I get int64 for all columns.
If I just had to rename the columns, it would not be any problem, but it seems like operating with these columns brings up a problem with the dimension of these columns.
Does anyone understand the issue?
You can flatten the column index list using a list comprehension:
df3.columns = [x for t in df3.columns.to_list() for x in t]
Example:
>>> df1 = pd.DataFrame({'a':[1, 3], 'b':[2, 4]})
>>> df2 = pd.DataFrame([[5, 4],[2, 3]], columns = pd.MultiIndex(levels=[[ 'c', 'd']], codes=[[0, 1]]))
>>> df3 = pd.concat([df1, df2], axis=1)
>>> df3
a b (c,) (d,)
0 1 2 5 4
1 3 4 2 3
>>> df3.columns = [x for t in df3.columns.to_list() for x in t]
>>> df3
a b c d
0 1 2 5 4
1 3 4 2 3
Flatten your column headers:
df1 = pd.DataFrame({'a':[1, 3], 'b':[2, 4]})
df2 = pd.DataFrame([[5, 4],[2, 3]], columns = pd.MultiIndex(levels=[[ 'c', 'd']], codes=[[0, 1]]))
df2.columns = df2.columns.map(''.join)
df3 = pd.concat([df1, df2], axis=1)
df3
Output:
a b c d
0 1 2 5 4
1 3 4 2 3
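Iterating over every label works in these examples because the plain column names are single characters; a longer string name would be split into its letters. A sketch that only unwraps one-element tuples, assuming df1 has the multi-character names 'alpha' and 'beta' (made up for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'alpha': [1, 3], 'beta': [2, 4]})
df2 = pd.DataFrame([[5, 4], [2, 3]],
                   columns=pd.MultiIndex(levels=[['c', 'd']], codes=[[0, 1]]))
df3 = pd.concat([df1, df2], axis=1)

# Unwrap 1-tuples only; plain string labels are left as they are.
df3.columns = [col[0] if isinstance(col, tuple) and len(col) == 1 else col
               for col in df3.columns]
print(list(df3.columns))
```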

combining columns in pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan],
'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user and have the resulting dataframe:
What's the best way to do this for many users?
Forward-fill the missing values along each row, then select the last column with iloc:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan,np.nan],
'user_b':['A','B',np.nan,'D',np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
Use the .apply method:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
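The two approaches differ in two respects: the list-indexing lambda raises IndexError on a row where every user column is NaN, and forward fill keeps the last non-missing value while the lambda keeps the first. A sketch comparing first and last on made-up data where row 0 disagrees ('A' versus 'X'); user_cols, user_first, and user_last are illustrative names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['X', 'B', np.nan, 'D', np.nan],
})

user_cols = ['user_a', 'user_b']

# First non-missing value per row: back-fill, then take the leftmost column.
df['user_first'] = df[user_cols].bfill(axis=1).iloc[:, 0]

# Last non-missing value per row: forward-fill, then take the rightmost column.
df['user_last'] = df[user_cols].ffill(axis=1).iloc[:, -1]

print(df)
```

Both variants leave NaN in a row that has no user at all, and both scale to any number of columns listed in user_cols.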

How to simply add a column level to a pandas dataframe

Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw an SO answer like this one, python/pandas: how to combine two dataframes into one with hierarchical column index?, but it concatenates different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
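Neither option modifies df in place; each returns a new frame. A short sketch that runs both on the frame from the question and prints the resulting column tuples; opt1 and opt2 are illustrative names, and axis is passed by keyword to swaplevel here for readability:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})

# Option 1: transpose, append 'C' as an extra index level, transpose back.
opt1 = df.T.set_index(np.repeat('C', df.shape[1]), append=True).T

# Option 2: concat puts 'C' on top, swaplevel moves it below the old labels.
opt2 = pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, axis=1)

print(list(opt1.columns))
print(list(opt2.columns))
```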
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
>>>
Another way, for columns that are already a two-level MultiIndex (inserting 'E' as a new middle level):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
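The lambda above indexes x[0] and x[1], so it expects columns that already have two levels. A self-contained sketch under that assumption; the starting frame with the lower labels 'C' and 'D' is reconstructed from the printed output and is not given in the original answer:

```python
import pandas as pd

# Assumed starting point: two column levels, ('A', 'C') and ('B', 'D').
df = pd.DataFrame(
    [[0, 0], [1, 1]],
    index=list('ab'),
    columns=pd.MultiIndex.from_tuples([('A', 'C'), ('B', 'D')]),
)

# Rebuild every column tuple with 'E' inserted between the two levels.
df.columns = pd.MultiIndex.from_tuples(
    [(top, 'E', bottom) for top, bottom in df.columns]
)
print(df)
```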
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
