Concatenate column values in Pandas DataFrame with "NaN" values - python

I'm trying to concatenate Pandas DataFrame columns with NaN values.
In [96]: df = pd.DataFrame({'col1' : ["1","1","2","2","3","3"],
    ...:                    'col2' : ["p1","p2","p1",np.nan,"p2",np.nan],
    ...:                    'col3' : ["A","B","C","D","E","F"]})
In [97]: df
Out[97]:
col1 col2 col3
0 1 p1 A
1 1 p2 B
2 2 p1 C
3 2 NaN D
4 3 p2 E
5 3 NaN F
In [98]: df['concatenated'] = df['col2'] +','+ df['col3']
In [99]: df
Out[99]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D NaN
4 3 p2 E p2,E
5 3 NaN F NaN
Instead of the NaN values in the "concatenated" column, I want to get "D" and "F" respectively for this example. How can I do that?

I don't think your problem is trivial. However, here is a workaround using numpy vectorization:
In [49]: def concat(*args):
    ...:     strs = [str(arg) for arg in args if not pd.isnull(arg)]
    ...:     return ','.join(strs) if strs else np.nan
    ...: np_concat = np.vectorize(concat)
    ...:
In [50]: np_concat(df['col2'], df['col3'])
Out[50]:
array(['p1,A', 'p2,B', 'p1,C', 'D', 'p2,E', 'F'],
dtype='|S64')
In [51]: df['concatenated'] = np_concat(df['col2'], df['col3'])
In [52]: df
Out[52]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
[6 rows x 4 columns]
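The same skip-the-nulls-then-join idea can also be written row-wise in plain pandas; a minimal sketch assuming the df from the question (slower than the vectorized version on large frames, but arguably easier to read):
# dropna() removes the missing cells from each row before joining,
# so rows 3 and 5 come out as plain "D" and "F".
df['concatenated'] = df[['col2', 'col3']].apply(
    lambda row: ','.join(row.dropna().astype(str)), axis=1)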

You could first replace NaNs with empty strings, for the whole dataframe or the column(s) you desire.
In [6]: df = df.fillna('')
In [7]: df['concatenated'] = df['col2'] +','+ df['col3']
In [8]: df
Out[8]:
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 D ,D
4 3 p2 E p2,E
5 3 F ,F
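Note that this leaves a stray leading separator where col2 was empty (",D" and ",F"). If that matters, one way to tidy it up afterwards (a small follow-up sketch) is to strip the separator from both ends:
# remove leading/trailing separators left behind by the empty strings
df['concatenated'] = df['concatenated'].str.strip(',')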

We can use stack, which will drop the NaN values, then use groupby.agg with ','.join to join the strings:
df['concatenated'] = df[['col2', 'col3']].stack().groupby(level=0).agg(','.join)
col1 col2 col3 concatenated
0 1 p1 A p1,A
1 1 p2 B p2,B
2 2 p1 C p1,C
3 2 NaN D D
4 3 p2 E p2,E
5 3 NaN F F
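To see why this works, it helps to look at the intermediate stacked Series; a sketch using the df from the question:
stacked = df[['col2', 'col3']].stack()
# 0  col2    p1
#    col3     A
# 1  col2    p2
# ...
# stack() drops the NaNs, so row 3 only keeps its col3 value ("D").
# Grouping on level=0 (the original row label) and joining then yields one
# string per row, which aligns back to df's index on assignment.
stacked.groupby(level=0).agg(','.join)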

Related

Sum column in one dataframe based on row value of another dataframe

Say, I have one data frame df:
a b c d e
0 1 2 dd 5 Col1
1 2 3 ee 9 Col2
2 3 4 ff 1 Col4
There's another dataframe df2:
Col1 Col2 Col3
0 1 2 4
1 2 3 5
2 3 4 6
I need to add a column Sum to the first dataframe, which sums the values of the df2 column named by column e of df1.
Expected output
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col4 0
The Sum value in the last row is 0 because Col4 doesn't exist in df2.
What I tried: writing some lambdas and apply functions, but I wasn't able to get it working.
I'd greatly appreciate the help. Thank you.
Try
df['Sum'] = df.e.map(df2.sum()).fillna(0)
df
Out[89]:
a b c d e Sum
0 1 2 dd 5 Col1 6.0
1 2 3 ee 9 Col2 9.0
2 3 4 ff 1 Col4 0.0
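For clarity, the one-liner relies on df2.sum() being a Series indexed by column name; a sketch of the same steps spelled out:
# df2.sum() collapses df2 into a Series indexed by its column names:
# Col1     6
# Col2     9
# Col3    15
col_totals = df2.sum()
# map() looks each value of df.e up in that Series; labels missing from df2
# (such as Col4) come back as NaN, which fillna(0) then replaces.
df['Sum'] = df.e.map(col_totals).fillna(0)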
Try this. The following solution uses apply to sum all values of the df2 column named in e, if present, and returns 0 if no such column exists in df2.
df1.loc[:,"sum"]=df1.loc[:,"e"].apply(lambda x: df2.loc[:,x].sum() if(x in df2.columns) else 0)
Use .iterrows() to iterate through a dataframe, pulling out the values for each row as well as the index.
A nested for-loop style of iteration can be used to grab the needed values from the second dataframe and apply them to the first:
import pandas as pd

df1 = pd.DataFrame(data={'a': [1, 2, 3], 'b': [2, 3, 4], 'c': ['dd', 'ee', 'ff'],
                         'd': [5, 9, 1], 'e': ['Col1', 'Col2', 'Col3']})
df2 = pd.DataFrame(data={'Col1': [1, 2, 3], 'Col2': [2, 3, 4], 'Col3': [4, 5, 6]})
df1['Sum'] = None  # placeholder column
for index, value in df1.iterrows():
    total = 0
    for index2, value2 in df2.iterrows():
        total += value2[value['e']]   # df2 value from the column named in df1.e
    df1.loc[index, 'Sum'] = total     # .loc avoids chained-assignment warnings
Output:
a b c d e Sum
0 1 2 dd 5 Col1 6
1 2 3 ee 9 Col2 9
2 3 4 ff 1 Col3 15

pandas drop last group element

I have a DataFrame df = pd.DataFrame({'col1': ["a","b","c","d","e", "f","g","h"], 'col2': [1,1,1,2,2,3,3,3]}) that looks like
Input:
col1 col2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 3
6 g 3
7 h 3
I want to drop the last row of each group, based on grouping by "col2", which would look like...
Expected Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
I wrote df.groupby('col2').tail(1), which gets me what I want to delete, but when I try to write df.drop(df.groupby('col2').tail(1)) I get an axis error. What would be a solution to this?
Looks like duplicated would work:
df[df.duplicated('col2', keep='last') |
   (~df.duplicated('col2', keep=False))  # this is to keep all single-row groups
  ]
Or with your approach, you should drop the index:
# this would also drop all single-row groups
df.drop(df.groupby('col2').tail(1).index)
Output:
col1 col2
0 a 1
1 b 1
3 d 2
5 f 3
6 g 3
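The reason the index-based version works is that drop expects index labels, not a DataFrame; a quick sketch using the sample df:
last_rows = df.groupby('col2').tail(1)   # the last row of each group (a DataFrame)
last_rows.index                          # e.g. [2, 4, 7] for the sample data
df.drop(last_rows.index)                 # dropping by label works; passing the frame itself caused the axis error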
try this:
df.groupby('col2', as_index=False).apply(lambda x: x.iloc[:-1,:]).reset_index(drop=True)

Fill in same amount of characters where other column is NaN

I have the following dummy dataframe:
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
'Col2':['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm']})
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h NaN
2 i,j,k,l,m ii~jj~kk~ll~mm
The real dataset has shape 500000, 90.
I need to unnest these values to rows and I'm using the new explode method for this, which works fine.
The problem is the NaN values: these will cause unequal lengths after the explode, so I need to fill in the same number of delimiters as there are filled values. In this case ~~~, since row 1 has three commas.
expected output
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
Attempt 1:
df['Col2'].fillna(df['Col1'].str.count(',')*'~')
Attempt 2:
np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
This works, but I feel like there's an easier method for this:
characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~')
df['Col2'] = df['Col2'].fillna(characters)
print(df)
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]
final = pd.concat([d1,d2], axis=1)
print(final)
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
1 e
1 f
1 g
1 h
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
Question: is there an easier and more generalized method for this? Or is my method fine as is?
pd.concat
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()
Col1 Col2
0 0 a aa
1 b bb
2 c cc
3 d dd
1 0 e NaN
1 f NaN
2 g NaN
3 h NaN
2 0 i ii
1 j jj
2 k kk
3 l ll
4 m mm
This loops on columns in df. It may be wiser to loop on keys in the delims dictionary.
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()
Same thing, different look
delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
One way is using str.repeat and fillna(); not sure how efficient this is, though:
df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))
0 aa~bb~cc~dd
1 ~~~
2 ii~jj~kk~ll~mm
Name: Col2, dtype: object
Just split the dataframe into two
df1=df.dropna()
df2=df.drop(df1.index)
d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()
final = pd.concat([d1, d2], axis=1).append(d3.to_frame(),sort=False)
Out[77]:
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
1 e NaN
1 f NaN
1 g NaN
1 h NaN
zip_longest can be useful here, given you don't need the original Index. It will work regardless of which column has more splits:
from itertools import zip_longest, chain
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
'Col2':['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
# Col1 Col2
#0 a,b,c,d aa~bb~cc~dd
#1 e,f,g,h NaN
#2 i,j,k,l,m ii~jj~kk~ll~mm
#3 x,y xx~yy~zz
l = [zip_longest(*x, fillvalue='')
     for x in zip(df.Col1.str.split(',').fillna(''),
                  df.Col2.str.split('~').fillna(''))]
pd.DataFrame(chain.from_iterable(l))
0 1
0 a aa
1 b bb
2 c cc
3 d dd
4 e
5 f
6 g
7 h
8 i ii
9 j jj
10 k kk
11 l ll
12 m mm
13 x xx
14 y yy
15 zz
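The resulting frame gets default integer column labels; to keep the original names, pd.DataFrame accepts a columns argument (a small follow-up sketch; note the zip_longest objects are single-use iterators, so l has to be rebuilt before consuming it again):
l = [zip_longest(*x, fillvalue='')
     for x in zip(df.Col1.str.split(',').fillna(''),
                  df.Col2.str.split('~').fillna(''))]
pd.DataFrame(chain.from_iterable(l), columns=df.columns)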

Python - Swap values in multiple dataframes

I have a DataFrame like this
id val1 val2
0 A B
1 B B
2 A A
3 A A
And I would like to swap the values, like this:
id val1 val2
0 B A
1 A A
2 B B
3 B B
I need to consider that the df could have other columns that I would like to keep unchanged.
You can use pd.DataFrame.applymap with a dictionary:
d = {'B': 'A', 'A': 'B'}
df = df.applymap(d.get).fillna(df)
print(df)
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
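How the fillna(df) part works may not be obvious; a spelled-out sketch, assuming the same df and d:
d = {'B': 'A', 'A': 'B'}
swapped = df.applymap(d.get)   # cells not in the mapping (e.g. the id column) become None
df = swapped.fillna(df)        # fillna(df) then restores those cells from the original frame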
For performance, in particular memory usage, you may wish to use categorical data:
for col in df.columns[1:]:
    df[col] = df[col].astype('category')
    df[col] = df[col].cat.rename_categories(d)
Try stacking, mapping, and then unstacking:
df[['val1', 'val2']] = (
    df[['val1', 'val2']].stack().map({'B': 'A', 'A': 'B'}).unstack())
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
For a (much) faster solution, use a nested list comprehension.
mapping = {'B': 'A', 'A': 'B'}
df[['val1', 'val2']] = [
    [mapping.get(x, x) for x in row] for row in df[['val1', 'val2']].values]
df
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
You can swap two values efficiently using numpy.where. However, if there are more than two values, this method stops working.
a = df[['val1', 'val2']].values
df[['val1', 'val2']] = np.where(a=='A', 'B', 'A')
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
To adapt this to keep other values the same, you can use np.select:
c1 = a=='A'
c2 = a=='B'
np.select([c1, c2], ['B', 'A'], a)
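np.select returns a plain array, so assign it back to apply the swap while leaving other values untouched (a usage sketch, reusing a, c1 and c2 from above):
df[['val1', 'val2']] = np.select([c1, c2], ['B', 'A'], a)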
Use factorize and roll the corresponding values:
def swaparoo(col):
    i, r = col.factorize()  # integer codes i and the unique values r
    return pd.Series(r[(i + 1) % len(r)], col.index)  # shift each code by one to "roll" the values
df[['id']].join(df[['val1', 'val2']].apply(swaparoo))
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
Alternative gymnastics using the same function. This incorporates the whole dataframe into the factorization.
df.set_index('id').stack().pipe(swaparoo).unstack().reset_index()
Examples
df = pd.DataFrame(dict(id=range(4), val1=[*'ABAA'], val2=[*'BBAA']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2
0 0 A B
1 1 B B
2 2 A A
3 3 A A
id val1 val2
0 0 B A
1 1 A A
2 2 B B
3 3 B B
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2
0 0 A B
1 1 A B
2 2 A B
3 3 A B
id val1 val2
0 0 B A
1 1 B A
2 2 B A
3 3 B A
df = pd.DataFrame(dict(id=range(4), val1=[*'AAAA'], val2=[*'BBBB'], val3=[*'CCCC']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 A B C
2 2 A B C
3 3 A B C
id val1 val2 val3
0 0 B C A
1 1 B C A
2 2 B C A
3 3 B C A
df = pd.DataFrame(dict(id=range(4), val1=[*'ABCD'], val2=[*'BCDA'], val3=[*'CDAB']))
print(
    df,
    df.set_index('id').stack().pipe(swaparoo).unstack().reset_index(),
    sep='\n\n'
)
id val1 val2 val3
0 0 A B C
1 1 B C D
2 2 C D A
3 3 D A B
id val1 val2 val3
0 0 B C D
1 1 C D A
2 2 D A B
3 3 A B C
Using replace (for why we need an intermediate 'C' here, check this):
df[['val1','val2']].replace({'A':'C','B':'A','C':'B'})
Out[263]:
val1 val2
0 B A
1 A A
2 B B
3 B B

Selecting a dataframe column without dropping the label

How do I select a dataframe column, df['col'], without dropping the name of the column?
df
index colname col1 col2 col3
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
Desired output:
df['col1']
index colname col1
1 0
2 3
3 6
4 9
Edit: as correctly answered, df[['col1']] does the job... Now a bit more tricky. What if the columns are multiindexed?
df grpname A B ... Z
index colname cA1 ... cAN cB1 ... cBN ... cZ1 ... cZN
1 a11 ... a1N b11 ... b1N ... z11 ... z1N
2 a21 ... a2N b21 ... b2N ... z21 ... z2N
3 a31 ... a3N b31 ... b3N ... z31 ... z3N
4 a41 ... a4N b41 ... b4N ... z41 ... z4N
I want to get
df grpname A
index colname cA1 cA2
1 a11 a12
2 a21 a22
3 a31 a32
4 a41 a42
Looks like .xs() only allows me to retrieve a certain column, namely df.xs(('A', 'cAi'), level=('grpname', 'colname'), axis=1, drop_level=False), and df[['A']]['cA1':'cAi'] doesn't work either?
For a single-column selection, df['col'] will return a Series. If you want to keep the column name, you need to double subscript, which returns a DataFrame:
In [2]:
import pandas as pd
pd.set_option('display.notebook_repr_html', False)
import io
temp = """index col1 col2 col3
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11"""
df = pd.read_csv(io.StringIO(temp), sep='\s+',index_col=[0])
df
Out[2]:
col1 col2 col3
index
1 0 1 2
2 3 4 5
3 6 7 8
4 9 10 11
In [4]:
df[['col1']]
Out[4]:
col1
index
1 0
2 3
3 6
4 9
contrast this with:
In [5]:
df['col1']
Out[5]:
index
1 0
2 3
3 6
4 9
Name: col1, dtype: int64
EDIT
As @joris has pointed out, you can see that the name is displayed at the bottom of the output; the name isn't lost as such, it's just a different output format.
There is a way to do it with np.loadtxt if you are sure of the space taken by each column.
Here is an example (note that the text column needs a real string dtype such as 'U16'; np.string is not a valid NumPy type):
np.loadtxt("df.txt",
           dtype={'names': ('index', 'colname', 'col1', 'col2', 'col3'),
                  'formats': (np.float64, 'U16', np.float64, np.float64, np.float64)},
           delimiter=' ', skiprows=1)
