Join 3 rows on a dataframe [duplicate] - python

I have the following DF:
  col1 | col2 | col3 | col4  | col5 | col6
0 -    | 15.0 | -    | -     | -    | -
1 -    | -    | -    | -     | -    | US
2 -    | -    | -    | Large | -    | -
3 ABC1 | -    | -    | -     | -    | -
4 -    | -    | 24RA | -     | -    | -
5 -    | -    | -    | -     | 345  | -
I want to collapse the rows into one, as follows:
output DF:
  col1 | col2 | col3 | col4  | col5 | col6
0 ABC1 | 15.0 | 24RA | Large | 345  | US
I do not want to iterate over columns but want to use pandas to achieve this.
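The snippets below assume the '-' placeholders are actually missing values. A minimal setup for the sample frame could look like this (the NaN encoding is an assumption, not part of the original question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.nan, np.nan, np.nan, 'ABC1', np.nan, np.nan],
    'col2': [15.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    'col3': [np.nan, np.nan, np.nan, np.nan, '24RA', np.nan],
    'col4': [np.nan, np.nan, 'Large', np.nan, np.nan, np.nan],
    'col5': [np.nan, np.nan, np.nan, np.nan, np.nan, 345],
    'col6': [np.nan, 'US', np.nan, np.nan, np.nan, np.nan],
})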

Option 0
Super Simple
pd.concat([pd.Series(df[c].dropna().values, name=c) for c in df], axis=1)
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 24RA Large 345.0 US
Can we handle more than one value per column?
Sure we can!
df.loc[2, 'col3'] = 'Test'
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345.0 US
1 NaN NaN 24RA NaN NaN NaN
Option 1
Generalized solution using np.where like a surgeon: grab the non-null values together with their column labels, number the repeats within each column with cumcount, and unstack so the repeat number becomes the new row index.
v = df.values
i, j = np.where(pd.notna(v))  # positions of the non-null values; np.isnan would choke on object arrays
s = pd.Series(v[i, j], df.columns[j])
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-') # <-- don't fill to get NaN
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 24RA Large 345 US
df.loc[2, 'col3'] = 'Test'
v = df.values
i, j = np.where(pd.notna(v))
s = pd.Series(v[i, j], df.columns[j])
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-') # <-- don't fill to get NaN
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345 US
1 - - 24RA - - -
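To see the mechanics, it helps to print the intermediate pieces (this inspection snippet assumes the setup above):
v = df.values
i, j = np.where(pd.notna(v))
s = pd.Series(v[i, j], df.columns[j])
print(s)                              # non-null values labelled by column name, in row-major order
print(s.groupby(level=0).cumcount())  # 0 for a column's first value, 1 for its second, ...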
Option 2
Mask the '-' placeholders to turn them into nulls, then stack to get rid of them. Or we could have:
# This works even if the '-' placeholders are already NaN;
# in that case you can simply skip the .mask(df == '-')
s = df.mask(df == '-').stack().reset_index(0, drop=True)
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-')
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345 US
1 - - 24RA - - -

You can use max, but you need to convert the null values in the string-valued columns first (which is a bit ugly, unfortunately):
>>> df = pd.DataFrame({'col1':[np.nan, "ABC1"], 'col2':[15.0, np.nan]})
>>> df.apply(lambda c: c.fillna('') if c.dtype is np.dtype('O') else c).max()
col1 ABC1
col2 15
dtype: object
You could also use a combination of backfill and forward fill to fill in the gaps; this can be useful if you only want to apply it to some of your columns:
>>> df.bfill().ffill()  # fillna(method='bfill'/'ffill') does the same but is deprecated in recent pandas
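For example, to restrict the fill to a subset of columns (the column choice here is purely illustrative):
>>> cols = ['col1']
>>> df[cols] = df[cols].bfill().ffill()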

Related

How to Join columns with common text

I have a dataframe with multiple similar columns to merge:
ID col0  col1  col2  col3  col4  col5
1  jack  in A  A jf  w/n   y/h   56
2  sam   z/n   b/w   A n   A     93
3  john  e/e   jg    b/d   A     33
4  Adam  jj    b/b   b/d   NaN   15
What I want now is to merge the cells containing A into a new column, like this:
ID col0  col1  col2  col3  col4  A            col5
1  jack  in A  A jf  w/n   y/h   in A - A jf  56
2  sam   z/n   b/w   A n   A     A n - A      93
3  john  e/e   jg    b/d   A     A            33
4  Adam  jj    b/b   b/d   NaN   NaN          15
I tried the first solution in here: Is there a python way to merge multiple cells with condition, yet the result ended up missing info:
ID col0  col1  col2  col3  col4  A            col5
1  jack  in A  A jf  w/n   y/h   in A - A jf  56
2  sam   z/n   b/w   A n   A     NaN          93
3  john  e/e   jg    b/d   A     A            33
4  Adam  jj    b/b   b/d   NaN   NaN          15
Can anyone figure out what is not working with these lines?
s = df.filter(regex=r'col[1-4]').stack()
s = s[s.str.contains('A')].groupby(level=0).agg(' - '.join)
df['A'] = s
Let's try this,
(
    df.filter(regex=r'col[1-4]')
      .fillna("")
      .apply(lambda x: " - ".join([v for v in x if "A" in v]), axis=1)
)
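To attach the result as the A column, assign it back; the trailing replace turns the empty string produced for no-match rows back into NaN, matching the desired output above (assuming numpy is imported as np):
df['A'] = (
    df.filter(regex=r'col[1-4]')
      .fillna("")
      .apply(lambda x: " - ".join(v for v in x if "A" in v), axis=1)
      .replace("", np.nan)  # rows with no 'A' cells come back empty; make them NaN
)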

Subset rows in df depending on conditions

Hello, I have a df and I wondered how I can subset the rows where:
COL1 contains the string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
I should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3
Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
You can use boolean indexing and chain all the conditions.
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
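If you find it more readable, the numeric conditions can also go through DataFrame.query (an equivalent formulation, not what the answers above used):
m = df['COL1'].str.contains('ok')
df[m].query('COL2 > 4 and COL3 < 4')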

Operation on pandas data frames

I don't know how to describe my problem in words, so I'll just model it.
Problem modeling:
Let say we have two dataframes df1, df2 with the same columns
df1
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | -100 | 2 | -100
df2
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 12 | 23 | 34 | 45
Given these two df-s we get
df_result
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | 23 | 2 | 45
I.e. we get df1 where every -100 is substituted with the corresponding value from df2.
Question: How can I do it without a for-loop? In particular, is there an operation in pandas or on two lists of the same size that could do what we need?
PS: I can do it with a for loop but it will be much slower.
You can use this:
df1[df1==-100] = df2
This is how it works step-by-step:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1, -100, 2, -100], [-100, 3, -100, -100]]),
                   columns=['col1', 'col2', 'col3', 'col4'])
df1
   col1  col2  col3  col4
0     1  -100     2  -100
1  -100     3  -100  -100
df2 = pd.DataFrame(np.array([[12, 23, 34, 45], [1, 2, 3, 4]]),
                   columns=['col1', 'col2', 'col3', 'col4'])
df2
   col1  col2  col3  col4
0    12    23    34    45
1     1     2     3     4
Using boolean indexing, you get a mask of the positions to replace:
df1 == -100
    col1   col2   col3   col4
0  False   True  False   True
1   True  False   True   True
Wherever the mask is True, the corresponding value of df2 is assigned:
df1[df1 == -100] = df2
df1
   col1  col2  col3  col4
0     1    23     2    45
1     1     3     3     4
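If you would rather not modify df1 in place, the same substitution can be written non-mutatingly with where (a sketch of an equivalent, not the answer's own code):
df_result = df1.where(df1 != -100, df2)  # keep df1 where it isn't -100, else take df2's value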

Convert table having string column, array column to all string columns

I am trying to convert a table containing string columns and array columns to a table with string columns only.
Here is how the current table looks:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |[2,3] | [4,5] |
| 2 |[6,7,8] | [8,9,10] |
+-----+--------------------+--------------------+
How can I get expected result like that:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 | 2 | 4 |
| 1 | 3 | 5 |
| 2 | 6 | 8 |
| 2 | 7 | 9 |
| 2 | 8 | 10 |
+-----+--------------------+--------------------+
The confusion comes from mixing scalar columns and list columns.
Under the assumption that, for every row, col2 and col3 are of the same length, we can first turn all scalar columns into list columns and then concatenate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})
# First, we turn all columns into list columns:
# each scalar becomes a one-element list repeated to the row's list length
df['col1'] = df['col1'].apply(lambda x: [x]) * df['col2'].apply(len)
# Then we concatenate the lists column-wise
df.apply(np.concatenate)
Output:
col1 col2 col3
0 1 2 4
1 1 3 5
2 2 6 8
3 2 7 9
4 2 8 10
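As an aside, on pandas >= 1.3 the same result is built in, since DataFrame.explode accepts a list of equal-length list columns:
df.explode(['col2', 'col3'], ignore_index=True)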
Convert the columns to lists, then to numpy.array, and finally back to a DataFrame:
# Note: this assumes every list in a column has the same length;
# ragged rows (as in the sample above) would need padding first.
vals1 = np.array(df.col2.values.tolist())
vals2 = np.array(df.col3.values.tolist())
col1 = np.repeat(df.col1, vals1.shape[1])
df = pd.DataFrame(np.column_stack((col1, vals1.ravel(), vals2.ravel())), columns=df.columns)
print(df)
  col1 col2 col3
0    1    2    4
1    1    3    5
2    2    6    8
3    2    7    9

