How to Join columns with common text - python

I have a dataframe with multiple similar columns to merge:
ID col0 col1 col2 col3 col4 col5
1 jack in A A jf w/n y/h 56
2 sam z/n b/w A A 93
3 john e/e jg b/d A 33
4 Adam jj b/b b/d NaN 15
What I want now is to merge the columns containing A into a new column, like this:
ID col0 col1 col2 col3 col4 A col5
1 jack in A A jf w/n y/h in A - A jf 56
2 sam z/n b/w A n A A n - A 93
3 john e/e jg b/d A A 33
4 Adam jj b/b b/d NaN NaN 15
I tried the first solution from "Is there a python way to merge multiple cells with condition",
yet the result ended up missing info:
ID col0 col1 col2 col3 col4 A col5
1 jack in A A jf w/n y/h in A - A jf 56
2 sam z/n b/w A n A NaN 93
3 john e/e jg b/d A A 33
4 Adam jj b/b b/d NaN NaN 15
Can anyone figure out what is not working with these lines?
s = df.filter(regex=r'col[1-4]').stack()
s = s[s.str.contains('A')].groupby(level=0).agg(' - '.join)
df['A'] = s

Let's try this,
(
df.filter(regex=r'col[1-4]').fillna("").
apply(lambda x: " - ".join([v for v in x if "A" in v]), axis=1)
)
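For reference, here is a minimal self-contained sketch of that approach. The sample frame is my reading of the flattened table in the question (the cell boundaries are an assumption), and the final where() step is an addition of mine so that rows without any match end up as NaN rather than an empty string.

```python
import numpy as np
import pandas as pd

# Sample data; the split of cells into columns is assumed from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "col0": ["jack", "sam", "john", "Adam"],
    "col1": ["in A", "z/n", "e/e", "jj"],
    "col2": ["A jf", "b/w", "jg", "b/b"],
    "col3": ["w/n", "A n", "b/d", "b/d"],
    "col4": ["y/h", "A", "A", np.nan],
    "col5": [56, 93, 33, 15],
})

# Row-wise: keep only the cells that contain "A" and join them with " - ".
joined = (
    df.filter(regex=r"col[1-4]")
      .fillna("")
      .apply(lambda row: " - ".join(v for v in row if "A" in v), axis=1)
)

# Rows with no matching cell produce "", so turn those into NaN.
df["A"] = joined.where(joined.ne(""))
print(df[["ID", "A"]])
```

Note that the check is a plain substring test, so any cell containing a capital "A" anywhere will be picked up.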

Related

Dataframe append with multiindex

I have a dataframe d1 with multiindex of col1 and col2:
col3 col4 col5
col1 col2
1 2 3 4 5
2 3 4 5 6
And another dataframe d2 with exact same structure:
col3 col4 col5
col1 col2
20 30 40 50 60
2 3 44 55 66
How can I do something like d1.append(d2) so that rows of d2 override rows of d1 that have the same keys, giving:
col3 col4 col5
col1 col2
1 2 3 4 5
20 30 40 50 60
2 3 44 55 66
Try combine_first:
out = d2.combine_first(d1)
You could use pandas.concat and keep the last row for each key:
pd.concat([d1, d2]).groupby(level=[0, 1]).last()
@BENY's answer is more user friendly and readable.
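Both suggestions can be tried on the frames from the question with the sketch below. The construction of d1 and d2 is my own reconstruction of the printed tables, so treat it as an assumption.

```python
import pandas as pd

cols = ["col3", "col4", "col5"]
names = ["col1", "col2"]

# Frames rebuilt from the question's printout (assumed layout).
d1 = pd.DataFrame(
    [[3, 4, 5], [4, 5, 6]],
    index=pd.MultiIndex.from_tuples([(1, 2), (2, 3)], names=names),
    columns=cols,
)
d2 = pd.DataFrame(
    [[40, 50, 60], [44, 55, 66]],
    index=pd.MultiIndex.from_tuples([(20, 30), (2, 3)], names=names),
    columns=cols,
)

# d2 takes priority; keys found only in d1 are filled in from d1.
out = d2.combine_first(d1)

# Equivalent idea: stack both frames, then keep the last row per key.
alt = pd.concat([d1, d2]).groupby(level=[0, 1]).last()

print(out)
print(alt)
```

One difference worth knowing: combine_first only falls back to d1 where d2 is missing a value, so a NaN inside d2 would be filled from d1 rather than overriding it.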

Subset rows in df depending on conditions

Hello, I have a df and I wondered how I can subset the rows where:
COL1 contains the string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
It should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3
Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
You can use boolean indexing and chain all the conditions with &.
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
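A self-contained version of the same idea, with the sample frame rebuilt from the question, might look like the sketch below. Passing regex=False is my own choice here, on the assumption that "ok" is meant as a literal substring.

```python
import pandas as pd

df = pd.DataFrame({
    "COL1": ["AB_ok_7", "AB_ok_4", "AB_uy_2", "AB_ok_2", "U_ok_7"],
    "COL2": [5, 2, 5, 2, 12],
    "COL3": [2, 5, 2, 2, 3],
})

# Each condition is a boolean Series; & combines them element-wise.
# Parentheses are required around comparisons when using & directly.
mask = (
    df["COL1"].str.contains("ok", regex=False)
    & (df["COL2"] > 4)
    & (df["COL3"] < 4)
)
subset = df[mask]
print(subset)
```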

position or move pandas column to a specific column index

I have a DF mydataframe and it has multiple columns (over 75 columns) with default numeric index:
Col1 Col2 Col3 ... Coln
I need to rearrange the columns as follows:
Col1 Col3 Col2 ... Coln
I can get the index of Col2 using:
mydataframe.columns.get_loc("Col2")
but I can't figure out how to swap them without manually listing all the columns and then rearranging them in a list.
Try (assuming Col1, Col2 and Col3 are the first three columns):
new_cols = ['Col1', 'Col3', 'Col2'] + df.columns[3:].tolist()
df = df[new_cols]
How to proceed:
store the names of columns in a list;
swap the names in that list;
apply the new order on the dataframe.
code:
l = list(df)
i1, i2 = l.index('Col2'), l.index('Col3')
l[i2], l[i1] = l[i1], l[i2]
df = df[l]
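The three steps above can be wrapped in a small helper so the swap works wherever the two columns sit. The function name swap_columns is hypothetical, and the demo frame is just a 5x5 range used for illustration.

```python
import numpy as np
import pandas as pd

def swap_columns(frame, a, b):
    """Return a copy of frame with columns a and b exchanged in position."""
    cols = list(frame.columns)                 # 1. store the names in a list
    i, j = cols.index(a), cols.index(b)
    cols[i], cols[j] = cols[j], cols[i]        # 2. swap the names
    return frame[cols]                         # 3. apply the new order

# Demo frame with columns Col0..Col4.
df = pd.DataFrame(np.arange(25).reshape(5, 5)).add_prefix("Col")
swapped = swap_columns(df, "Col2", "Col3")
print(swapped)
```

This assumes the column names are unique; with duplicate names, list.index only finds the first occurrence.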
I'm imagining you want what @sentence is assuming: you want to swap the positions of 2 columns regardless of where they are.
This is a creative approach:
Create a dictionary that defines which columns get switched with what.
Define a function that takes a column name and returns an ordering.
Use that function as a key for sorting.
d = {'Col3': 'Col2', 'Col2': 'Col3'}
k = lambda x: df.columns.get_loc(d.get(x, x))
df[sorted(df, key=k)]
Col0 Col1 Col3 Col2 Col4
0 0 1 3 2 4
1 5 6 8 7 9
2 10 11 13 12 14
3 15 16 18 17 19
4 20 21 23 22 24
Setup
df = pd.DataFrame(
np.arange(25).reshape(5, 5)
).add_prefix('Col')
Use np.r_ to build an array of column positions.
Given a sample as follows:
df:
col1 col2 col3 col4 col5 col6 col7 col8 col9 col10
0 0 1 2 3 4 5 6 7 8 9
1 10 11 12 13 14 15 16 17 18 19
i, j = df.columns.slice_locs('col2', 'col10')
df[df.columns[np.r_[:i, i+1, i, i+2:j]]]
Out[142]:
col1 col3 col2 col4 col5 col6 col7 col8 col9 col10
0 0 2 1 3 4 5 6 7 8 9
1 10 12 11 13 14 15 16 17 18 19
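To see what np.r_ builds here, the sketch below prints the position array before applying it. I use get_loc and iloc instead of slice_locs purely to keep the illustration simple; the sample frame mirrors the one above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(20).reshape(2, 10),
    columns=[f"col{n}" for n in range(1, 11)],
)

i = df.columns.get_loc("col2")   # position of the column to move right
n = df.shape[1]

# np.r_ concatenates slices and scalars into one integer array:
# everything before i, then i+1, then i, then the rest.
order = np.r_[:i, i + 1, i, i + 2:n]
print(order)

reordered = df.iloc[:, order]
print(reordered)
```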

Join two dataframes by matching two columns of one df against a single column of another, based on some conditions?

I have a dataframe like this:
df1
col1 col2 col3 col4
1 2 A S
3 4 A P
5 6 B R
7 8 B B
I have another data frame:
df2
col5 col6 col3
9 10 A
11 12 R
I want to join these two dataframes: a row of df1 should be joined if its col3 or col4 value matches a col3 value of df2.
the final data frame will look like:
df3
col1 col2 col3 col5 col6
1 2 A 9 10
3 4 A 9 10
5 6 R 11 12
If the col3 value is present in df2, the row joins on col3; otherwise it joins on col4, provided the col4 value is present in col3 of df2.
How can I do this in the most efficient way using pandas/python?
Use two merges with the default inner join; for the second one, filter out the df2 rows already matched via col3; finally concat the results together:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2[~df2['col3'].isin(df1['col3'])], on='col3'))
df = pd.concat([df3, df4],ignore_index=True)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9 10
1 3 4 A 9 10
2 5 6 R 11 12
EDIT: Use a left join and then combine_first:
df3 = df1.drop('col4', axis=1).merge(df2, on='col3', how='left')
df4 = (df1.drop('col3', axis=1).rename(columns={'col4':'col3'})
.merge(df2, on='col3', how='left'))
df = df3.combine_first(df4)
print (df)
col1 col2 col3 col5 col6
0 1 2 A 9.0 10.0
1 3 4 A 9.0 10.0
2 5 6 B 11.0 12.0
3 7 8 B NaN NaN
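For anyone who wants to run the first (inner join) variant end to end, here is a self-contained sketch with the question's data. The intermediate names by_col3, unmatched and by_col4 are mine, chosen only to label the steps.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": [1, 3, 5, 7],
    "col2": [2, 4, 6, 8],
    "col3": ["A", "A", "B", "B"],
    "col4": ["S", "P", "R", "B"],
})
df2 = pd.DataFrame({"col5": [9, 11], "col6": [10, 12], "col3": ["A", "R"]})

# Step 1: rows of df1 whose col3 matches df2.col3.
by_col3 = df1.drop(columns="col4").merge(df2, on="col3")

# Step 2: df2 keys not already matched through df1.col3 ...
unmatched = df2[~df2["col3"].isin(df1["col3"])]

# ... are matched against df1.col4 (renamed so the merge key lines up).
by_col4 = (
    df1.drop(columns="col3")
       .rename(columns={"col4": "col3"})
       .merge(unmatched, on="col3")
)

result = pd.concat([by_col3, by_col4], ignore_index=True)
print(result)
```

The inner joins drop df1 rows with no match at all, whereas the left join variant in the EDIT keeps them with NaN in col5 and col6.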

how to lag columns in batch in dataframe

I have a data frame with more than 100 columns. I need to lag 60 of them, and I know the names of the columns I need to lag. Is there a way to lag them in batch, or in just a few lines?
Say I have a dataframe like below:
col1 col2 col3 col4 col5 col6 ... col100
1 2 3 4 5 6 8
3 9 15 19 21 23 31
The only way I know is to do it one by one, i.e. run df['col1_lag']=df['col1'].shift(1) for each column.
That seems like too much for so many columns. Is there a better way to do this? Thanks in advance.
Use shift with add_suffix to build a new DataFrame and join it to the original:
df1 = df.join(df.shift().add_suffix('_lag'))
#alternative
#df1 = pd.concat([df, df.shift().add_suffix('_lag')], axis=1)
print (df1)
col1 col2 col3 col4 col5 col6 col100 col1_lag col2_lag col3_lag \
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 2.0 3.0
col4_lag col5_lag col6_lag col100_lag
0 NaN NaN NaN NaN
1 4.0 5.0 6.0 8.0
If you want to lag only some columns, you can select them with a list:
cols = ['col1','col3','col5']
df2 = df.join(df[cols].shift().add_suffix('_lag'))
print (df2)
col1 col2 col3 col4 col5 col6 col100 col1_lag col3_lag col5_lag
0 1 2 3 4 5 6 8 NaN NaN NaN
1 3 9 15 19 21 23 31 1.0 3.0 5.0
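A compact runnable sketch of the list-based variant is below. The four-column frame and the name cols_to_lag are stand-ins of mine for the real 100-column data and the list of 60 names.

```python
import pandas as pd

# Small stand-in for the wide frame in the question.
df = pd.DataFrame({
    "col1": [1, 3],
    "col2": [2, 9],
    "col3": [3, 15],
    "col4": [4, 19],
})

# Hypothetical list of the columns that need a lag.
cols_to_lag = ["col1", "col3"]

# Shift only the selected columns by one row and tag them with a suffix.
lagged = df[cols_to_lag].shift(1).add_suffix("_lag")

# Join on the index; the original columns are left untouched.
out = df.join(lagged)
print(out)
```

The first row of each lagged column is NaN, which is why the lagged columns show up as floats in the output.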
