Suppose I have a dataframe looking something like this:
  col1 col2 col3 col4
0    A    B    F    O
1    A         G    Q
2    A    C    G    P
3    A         H
4    A    D    I
5    A    D    I
6    A         J    U
7    A    E         J
How can I shift the values to the left whenever a cell is empty, so that I end up with this?
  col1 col2 col3 col4
0    A    B    F    O
1    A    G    Q
2    A    C    G    P
3    A    H
4    A    D    I
5    A    D    I
6    A    J    U
7    A    E    J
I thought I could check the current column and, if it's empty, take the next column's value and then empty that one out:
for col in df.columns:
    df[col] = np.where((df[col] == ''), df[f'col{int(col[-1])+1}'], df[col])
    df[f'col{int(col[-1])+1}'] = np.where((df[col] == ''), '', df[col])
But I am failing somewhere. Sample df below.
df = pd.DataFrame(
    {
        'col1': ['A','A','A','A','A','A','A','A'],
        'col2': ['B','','C','','D','D','','E'],
        'col3': ['F','G','G','H','I','I','J',''],
        'col4': ['O','Q','P','','','','U','J']
    }
)
One way is to use np.argsort: a stable argsort of the boolean mask s == '' keeps the non-empty cells at the front of each row (in their original order) and pushes the empty cells to the end:
s = df.to_numpy()
orders = np.argsort(s == '', axis=1, kind='mergesort')
df[:] = s[np.arange(len(s))[:, None], orders]
Output:
  col1 col2 col3 col4
0    A    B    F    O
1    A    G    Q
2    A    C    G    P
3    A    H
4    A    D    I
5    A    D    I
6    A    J    U
7    A    E    J
Note:
A very similar approach can be found in this question.
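Not part of the original answer, but to see why the stable sort does the shifting, here is a minimal sketch of the intermediate values for the sample df above:
import numpy as np

s = df.to_numpy()
mask = (s == '')                                      # True where a cell is empty
orders = np.argsort(mask, axis=1, kind='mergesort')   # stable sort: empties go last
print(mask[1])    # [False  True False False] for row 1 ('A', '', 'G', 'Q')
print(orders[1])  # [0 2 3 1] -> take col1, col3, col4 first, the empty col2 last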
Replace empty strings with NaN:
df = df.replace('', np.nan)
Apply dropna row-wise
odf = df.apply(lambda x: pd.Series(x.dropna().values), axis=1)
To retain the column names:
odf.columns = df.columns
NOTE: It is always good to represent missing data with NaN
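For example, once the gaps are NaN, the standard missing-data helpers apply directly (my own illustration, not part of the original answer):
df.isna().sum()   # count missing values per column
df.dropna()       # drop rows that contain any missing value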
Output
  col1 col2 col3 col4
0    A    B    F    O
1    A    G    Q  NaN
2    A    C    G    P
3    A    H  NaN  NaN
4    A    D    I  NaN
5    A    D    I  NaN
6    A    J    U  NaN
7    A    E    J  NaN
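If you prefer the original empty-string representation over NaN, you could convert back afterwards (a small follow-up, not part of the original answer):
odf = odf.fillna('')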
You can count the number of empty values in each column, then sort the columns by that count, and finally reorder the dataframe:
counts = {}
for col in df.columns.to_list():
    counts[col] = (df[col] == '').sum()  # Based on the example you have provided.
# Then sort the dictionary based on counts.
counts = dict(sorted(counts.items(), key=lambda item: item[1]))
# Assign back to the dataframe.
df = df[[*counts.keys()]]
df
  col1 col3 col2 col4
0    A    F    B    O
1    A    G         Q
2    A    G    C    P
3    A    H
4    A    I    D
5    A    I    D
6    A    J         U
7    A         E    J
Related
I have a dataframe like below:
Original data
index string
0 a,b,c,d,e,f
1 a,b,c,d,e,f
2 a,(I,j,k),c,d,e,f
I want it to be:
Desired data
index col1 col2 col3 col4 col5 col6
0 a b c d e f
1 a b c d e f
2 a (I,j,k) c d e f
You can split on commas that are not inside brackets. Then convert the result to a DataFrame and assign to df columns:
df[['col {}'.format(i) for i in range(1,7)]] = df['string'].str.split(r",\s*(?![^()]*\))").apply(pd.Series)
Output:
index string col 1 col 2 col 3 col 4 col 5 col 6
0 0 a,b,c,d,e,f a b c d e f
1 1 a,b,c,d,e,f a b c d e f
2 2 a,(I,j,k),c,d,e,f a (I,j,k) c d e f
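The trick is the negative lookahead: a comma only counts as a separator when it is not followed by some non-parenthesis characters and then a closing parenthesis, i.e. when it is not sitting inside a bracketed group. A quick standalone check with re.split (my own illustration, not from the original answer):
import re

re.split(r",\s*(?![^()]*\))", "a,(I,j,k),c,d,e,f")
# ['a', '(I,j,k)', 'c', 'd', 'e', 'f']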
Try this:
df = df['string'].str.split(r",\s*(?![^()]*\))", expand=True)
df.columns = ['col1','col2','col3','col4','col5','col6']
My pandas df is like the following, and I want to apply groupby and then calculate the mean of some columns and take the first value of others.
index col1 col2 col3 col4 col5 col6
0 a c 1 2 f 5
1 a c 1 2 f 7
2 a d 1 2 g 9
3 b d 6 2 g 4
4 b e 1 2 g 8
5 b e 1 2 g 2
I tried something like this:
df.groupby(['col1','col5']).agg({['col6','col3']:'mean',['col4','col2']:'first'})
Expected output:
col1 col5 col6 col3 col4 col2
a f 6 1 2 c
a g 9 1 2 d
b g 4 3 2 e
But it seems a list is not an option here. In my real dataset I have hundreds of columns of different natures, so I can't pass them individually. Any thoughts on passing them as lists?
If you have a list of columns for each aggregation, you can do:
l_mean = ['col6','col3']
l_first = ['col4','col2']
df.groupby(['col1','col5']).agg({**{col:'mean' for col in l_mean},
                                 **{col:'first' for col in l_first}})
The **{} notation unpacks a dictionary: {**d1, **d2} builds a single dictionary out of two (or more), like a union of dictionaries. And {col: 'mean' for col in l_mean} is a dictionary comprehension that creates a dictionary with each column of the list as a key and 'mean' as its value.
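Concretely, with the two lists above, the merged dictionary passed to agg ends up as follows (shown only for illustration):
{**{col: 'mean' for col in l_mean}, **{col: 'first' for col in l_first}}
# -> {'col6': 'mean', 'col3': 'mean', 'col4': 'first', 'col2': 'first'}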
Or using concat:
gr = df.groupby(['col1','col5'])
pd.concat([gr[l_mean].mean(),
           gr[l_first].first()],
          axis=1)
Then call .reset_index() on the result to get the expected output.
(
    df.groupby(['col1','col5'])
      .agg(col6=('col6', 'mean'),
           col3=('col3', 'mean'),
           col4=('col4', 'first'),
           col2=('col2', 'first'))
)
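Since the question mentions hundreds of columns, the same named aggregations can also be built from the two lists instead of being typed out; a sketch assuming the l_mean/l_first lists from the earlier answer (not part of this original answer):
named = {c: (c, 'mean') for c in l_mean}         # e.g. {'col6': ('col6', 'mean'), ...}
named.update({c: (c, 'first') for c in l_first})
df.groupby(['col1','col5']).agg(**named)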
This is an extension of Ben.T's solution, just wrapping it in a function and passing it via the pipe method:
# set the lists of columns to aggregate
def fil(grp, list1, list2):
    A = grp.mean().filter(list1)
    B = grp.first().filter(list2)
    C = A.join(B)
    return C
grp1 = ['col6','col3']
grp2 = ['col4','col2']
m = df.groupby(['col1','col5']).pipe(fil,grp1,grp2)
m
I have the following dummy dataframe:
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m'],
                   'Col2':['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm']})
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h NaN
2 i,j,k,l,m ii~jj~kk~ll~mm
The real dataset has shape (500000, 90).
I need to unnest these values to rows, and I'm using the new explode method for this, which works fine.
The problem is the NaN values: these cause unequal lengths after the explode, so I need to fill in the same number of delimiters as there are values. In this case ~~~, since row 1 has three commas.
Expected output:
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
Attempt 1:
df['Col2'].fillna(df['Col1'].str.count(',')*'~')
Attempt 2:
np.where(df['Col2'].isna(), df['Col1'].str.count(',')*'~', df['Col2'])
This works, but I feel like there's an easier method for this:
characters = df['Col1'].str.replace(r'\w', '', regex=True).str.replace(',', '~', regex=False)
df['Col2'] = df['Col2'].fillna(characters)
print(df)
Col1 Col2
0 a,b,c,d aa~bb~cc~dd
1 e,f,g,h ~~~
2 i,j,k,l,m ii~jj~kk~ll~mm
d1 = df.assign(Col1=df['Col1'].str.split(',')).explode('Col1')[['Col1']]
d2 = df.assign(Col2=df['Col2'].str.split('~')).explode('Col2')[['Col2']]
final = pd.concat([d1,d2], axis=1)
print(final)
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
1 e
1 f
1 g
1 h
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
Question: is there an easier and more generalized method for this? Or is my method fine as is?
pd.concat
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in df}, axis=1
).stack()
    Col1 Col2
0 0    a   aa
  1    b   bb
  2    c   cc
  3    d   dd
1 0    e  NaN
  1    f  NaN
  2    g  NaN
  3    h  NaN
2 0    i   ii
  1    j   jj
  2    k   kk
  3    l   ll
  4    m   mm
This loops on columns in df. It may be wiser to loop on keys in the delims dictionary.
delims = {'Col1': ',', 'Col2': '~'}
pd.concat({
    k: df[k].str.split(delims[k], expand=True)
    for k in delims}, axis=1
).stack()
Same thing, different look:
delims = {'Col1': ',', 'Col2': '~'}
def f(c): return df[c].str.split(delims[c], expand=True)
pd.concat(map(f, delims), keys=delims, axis=1).stack()
One way is using str.repeat and fillna(); not sure how efficient this is, though:
df.Col2.fillna(pd.Series(['~']*len(df)).str.repeat(df.Col1.str.count(',')))
0 aa~bb~cc~dd
1 ~~~
2 ii~jj~kk~ll~mm
Name: Col2, dtype: object
Just split the dataframe into two:
df1 = df.dropna()
df2 = df.drop(df1.index)
d1 = df1['Col1'].str.split(',').explode()
d2 = df1['Col2'].str.split('~').explode()
d3 = df2['Col1'].str.split(',').explode()
final = pd.concat([d1, d2], axis=1).append(d3.to_frame(),sort=False)
Out[77]:
Col1 Col2
0 a aa
0 b bb
0 c cc
0 d dd
2 i ii
2 j jj
2 k kk
2 l ll
2 m mm
1 e NaN
1 f NaN
1 g NaN
1 h NaN
zip_longest can be useful here, given you don't need the original Index. It will work regardless of which column has more splits:
from itertools import zip_longest, chain
df = pd.DataFrame({'Col1':['a,b,c,d', 'e,f,g,h', 'i,j,k,l,m', 'x,y'],
                   'Col2':['aa~bb~cc~dd', np.NaN, 'ii~jj~kk~ll~mm', 'xx~yy~zz']})
# Col1 Col2
#0 a,b,c,d aa~bb~cc~dd
#1 e,f,g,h NaN
#2 i,j,k,l,m ii~jj~kk~ll~mm
#3 x,y xx~yy~zz
l = [zip_longest(*x, fillvalue='')
     for x in zip(df.Col1.str.split(',').fillna(''),
                  df.Col2.str.split('~').fillna(''))]
pd.DataFrame(chain.from_iterable(l))
    0   1
0   a  aa
1   b  bb
2   c  cc
3   d  dd
4   e
5   f
6   g
7   h
8   i  ii
9   j  jj
10  k  kk
11  l  ll
12  m  mm
13  x  xx
14  y  yy
15     zz
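If you want the original column names on the result, you could pass them when building the frame (a small follow-up, not part of the original answer):
pd.DataFrame(chain.from_iterable(l), columns=['Col1', 'Col2'])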
Say I have a table like:
col1 col2 col3 col4
a b c [d
e [f g h
i j k l
m n o [p
I want to load only the columns that contain a value that starts with left bracket [ .
So i want the following to be returned as a dataframe
col2 col4
b [d
[f h
j l
n [p
I want to load only the columns that contain a value that starts with left bracket [
For this you need Series.str.startswith():
df.loc[:,df.apply(lambda x: x.str.startswith('[')).any()]
col2 col4
0 b [d
1 [f h
2 j l
3 n [p
Note that there is a difference between startswith and contains. The docs are explanatory.
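A quick illustration of the difference on a small Series (my own example, not from the original answer):
s = pd.Series(['[d', 'a[d', 'ad'])
s.str.startswith('[')              # [True, False, False] -- '[' must be the first character
s.str.contains('[', regex=False)   # [True, True, False]  -- '[' anywhere in the string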
Can you try the following:
>>> df = pd.DataFrame([[1, 2, 4], [4, 5, 6], [7, '[8', 9]])
>>> df = df.astype('str')
>>> df
0 1 2
0 1 2 4
1 4 5 6
2 7 [8 9
>>> df[df.columns[[df[i].str.contains('[', regex=False).any() for i in df.columns]]]
1
0 2
1 5
2 [8
>>>
Use this:
s=df.applymap(lambda x: '[' in x).any()
print(df[s[s].index])
Output:
col2 col4
0 b [d
1 [f h
2 j l
3 n [p
Please try this; hope this works for you:
df = pd.DataFrame([['a', 'b', 'c', '[d'], ['e', '[f', 'g', 'h'], ['i', 'j', 'k', 'l'], ['m', 'n', 'o', '[p']],
                  columns=['col1', 'col2', 'col3', 'col4'])
cols = []
for col in df.columns:
    if df[col].str.contains('[', regex=False).any():
        cols.append(col)
df[cols]
Output
col2 col4
0 b [d
1 [f h
2 j l
3 n [p
I have 2 dataframes to sort that are similar in structure to what I have shown below, but the rows (looking at only the first 3 columns) are in a different order in each. How do I sort the dataframes so that the row indices match?
Also, it could happen that a row has no match, in which case I want to create a blank entry in the other dataframe at that index. How would I go about doing this?
Dataframe1:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
Dataframe2:
Col1 Col2 Col3 Col4
0 f e g 6
1 a b c 5
2 b c d 3
Is this what you want?
import pandas as pd
df=pd.DataFrame({'a':[1,3,2],'b':[4,6,5]})
print(df.sort_values(df.columns.tolist()))
Output:
a b
0 1 4
2 2 5
1 3 6
How do I sort the dataframes such that the row indices match
You can sort both data frames by the columns that should determine the order and reset the index.
cols = ['Col1', 'Col2', 'Col3']
df1.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
df2.sort_values(cols).reset_index(drop=True)
#outputs:
Col1 Col2 Col3 Col4
0 a b c 5
1 b c d 3
2 f e g 6
...there may not be matching rows in which case I want to create a blank entry in the other dataframe at that index
Let's add one more row to df1:
df1 = pd.DataFrame({
    'Col1': list('abfh'),
    'Col2': list('bceg'),
    'Col3': list('cdgi'),
    'Col4': [1,4,5,7]
})
df1
# outputs:
Col1 Col2 Col3 Col4
0 a b c 1
1 b c d 4
2 f e g 5
3 h g i 7
We can use a left join to add a blank entry for df2, where each of its columns is NaN at index 3.
If you have sorted both dataframes already, you can merge using the indexes:
df3 = df1.merge(df2, 'left', left_index=True, right_index=True, suffixes=('_x', ''))
Otherwise, merge on the columns that *should* determine the sort order; this creates a new dataframe with the joined values, sorted the same way df1 is sorted:
df3 = df1.merge(df2, 'left', on=cols, suffixes=('_x', ''))
Then filter out the columns that came from the left data frame:
df3.iloc[:, ~df3.columns.str.endswith('_x')]
#outputs:
Col1 Col2 Col3 Col4
0 f e g 6.0
1 a b c 5.0
2 b c d 3.0
3 NaN NaN NaN NaN