Join 3 rows on a dataframe [duplicate] - python

I have the following DF:
  col1 | col2 | col3 | col4  | col5 | col6
0 -    | 15.0 | -    | -     | -    | -
1 -    | -    | -    | -     | -    | US
2 -    | -    | -    | Large | -    | -
3 ABC1 | -    | -    | -     | -    | -
4 -    | -    | 24RA | -     | -    | -
5 -    | -    | -    | -     | 345  | -
I want to collapse the rows into one, as follows:
output DF:
  col1 | col2 | col3 | col4  | col5 | col6
0 ABC1 | 15.0 | 24RA | Large | 345  | US
I do not want to iterate over columns but want to use pandas to achieve this.
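The snippets below assume the '-' placeholders are actually missing values. A minimal setup for the sample frame could look like this (the NaN encoding is an assumption, not part of the original question):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.nan, np.nan, np.nan, 'ABC1', np.nan, np.nan],
    'col2': [15.0, np.nan, np.nan, np.nan, np.nan, np.nan],
    'col3': [np.nan, np.nan, np.nan, np.nan, '24RA', np.nan],
    'col4': [np.nan, np.nan, 'Large', np.nan, np.nan, np.nan],
    'col5': [np.nan, np.nan, np.nan, np.nan, np.nan, 345],
    'col6': [np.nan, 'US', np.nan, np.nan, np.nan, np.nan],
})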

Option 0
Super Simple
pd.concat([pd.Series(df[c].dropna().values, name=c) for c in df], axis=1)
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 24RA Large 345.0 US
Can we handle more than one value per column?
Sure we can!
df.loc[2, 'col3'] = 'Test'
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345.0 US
1 NaN NaN 24RA NaN NaN NaN
Option 1
Generalized solution using np.where like a surgeon: grab the non-null values together with their column labels, number the repeats within each column with cumcount, and unstack so the repeat number becomes the new row index.
v = df.values
i, j = np.where(pd.notna(v))  # positions of the non-null values; np.isnan would choke on object arrays
s = pd.Series(v[i, j], df.columns[j])
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-') # <-- don't fill to get NaN
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 24RA Large 345 US
df.loc[2, 'col3'] = 'Test'
v = df.values
i, j = np.where(pd.notna(v))
s = pd.Series(v[i, j], df.columns[j])
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-') # <-- don't fill to get NaN
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345 US
1 - - 24RA - - -
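To see the mechanics, it helps to print the intermediate pieces (this inspection snippet assumes the setup above):
v = df.values
i, j = np.where(pd.notna(v))
s = pd.Series(v[i, j], df.columns[j])
print(s)                              # non-null values labelled by column name, in row-major order
print(s.groupby(level=0).cumcount())  # 0 for a column's first value, 1 for its second, ...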
Option 2
Mask the '-' placeholders to turn them into nulls, then stack to get rid of them. Or we could have:
# This works even if the '-' placeholders are already NaN;
# in that case you can simply skip the .mask(df == '-')
s = df.mask(df == '-').stack().reset_index(0, drop=True)
c = s.groupby(level=0).cumcount()
s.index = [c, s.index]
s.unstack(fill_value='-')
col1 col2 col3 col4 col5 col6
0 ABC1 15.0 Test Large 345 US
1 - - 24RA - - -

You can use max, but you need to convert the null values in the string-valued columns first (which is a bit ugly, unfortunately):
>>> df = pd.DataFrame({'col1':[np.nan, "ABC1"], 'col2':[15.0, np.nan]})
>>> df.apply(lambda c: c.fillna('') if c.dtype is np.dtype('O') else c).max()
col1 ABC1
col2 15
dtype: object
You could also use a combination of backfill and forward fill to fill in the gaps; this can be useful if you only want to apply it to some of your columns:
>>> df.bfill().ffill()  # fillna(method='bfill'/'ffill') does the same but is deprecated in recent pandas
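For example, to restrict the fill to a subset of columns (the column choice here is purely illustrative):
>>> cols = ['col1']
>>> df[cols] = df[cols].bfill().ffill()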

Related

How to Join columns with common text

I have a dataframe with multiple similar columns to merge:
ID col0  col1  col2  col3  col4  col5
1  jack  in A  A jf  w/n   y/h   56
2  sam   z/n   b/w   A n   A     93
3  john  e/e   jg    b/d   A     33
4  Adam  jj    b/b   b/d   NaN   15
What I want now is to merge the cells containing A into a new column, like this:
ID col0  col1  col2  col3  col4  A            col5
1  jack  in A  A jf  w/n   y/h   in A - A jf  56
2  sam   z/n   b/w   A n   A     A n - A      93
3  john  e/e   jg    b/d   A     A            33
4  Adam  jj    b/b   b/d   NaN   NaN          15
I tried the first solution in here: Is there a python way to merge multiple cells with condition, yet the result ended up missing info:
ID col0  col1  col2  col3  col4  A            col5
1  jack  in A  A jf  w/n   y/h   in A - A jf  56
2  sam   z/n   b/w   A n   A     NaN          93
3  john  e/e   jg    b/d   A     A            33
4  Adam  jj    b/b   b/d   NaN   NaN          15
Can anyone figure out what is not working with these lines?
s = df.filter(regex=r'col[1-4]').stack()
s = s[s.str.contains('A')].groupby(level=0).agg(' - '.join)
df['A'] = s
Let's try this,
(
    df.filter(regex=r'col[1-4]')
      .fillna("")
      .apply(lambda x: " - ".join([v for v in x if "A" in v]), axis=1)
)
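To attach the result as the A column, assign it back; the trailing replace turns the empty string produced for no-match rows back into NaN, matching the desired output above (assuming numpy is imported as np):
df['A'] = (
    df.filter(regex=r'col[1-4]')
      .fillna("")
      .apply(lambda x: " - ".join(v for v in x if "A" in v), axis=1)
      .replace("", np.nan)  # rows with no 'A' cells come back empty; make them NaN
)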

Subset rows in df depending on conditions

Hello, I have a df and I wondered how I can subset the rows where:
COL1 contains the string "ok"
COL2 > 4
COL3 < 4
Here is an example:
COL1 COL2 COL3
AB_ok_7 5 2
AB_ok_4 2 5
AB_uy_2 5 2
AB_ok_2 2 2
U_ok_7 12 3
I should display only:
COL1 COL2 COL3
AB_ok_7 5 2
U_ok_7 12 3
Like this:
In [2288]: df[df['COL1'].str.contains('ok') & df['COL2'].gt(4) & df['COL3'].lt(4)]
Out[2288]:
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
You can use boolean indexing and chain all the conditions.
m = df['COL1'].str.contains('ok')
m1 = df['COL2'].gt(4)
m2 = df['COL3'].lt(4)
df[m & m1 & m2]
COL1 COL2 COL3
0 AB_ok_7 5 2
4 U_ok_7 12 3
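If you find it more readable, the numeric conditions can also go through DataFrame.query (an equivalent formulation, not what the answers above used):
m = df['COL1'].str.contains('ok')
df[m].query('COL2 > 4 and COL3 < 4')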

Operation on pandas data frames

I don't know how to describe my problem in words, so I'll just model it.
Problem modeling:
Let say we have two dataframes df1, df2 with the same columns
df1
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | -100 | 2 | -100
df2
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 12 | 23 | 34 | 45
Given these two df-s we get
df_result
idx | col1 | col2 | col3 | col4
---------------------------------
0 | 1 | 23 | 2 | 45
I.e. we get df1 where every -100 is substituted with the corresponding value from df2.
Question: How can I do it without a for-loop? In particular, is there an operation in pandas or on two lists of the same size that could do what we need?
PS: I can do it with a for loop but it will be much slower.
You can use this:
df1[df1==-100] = df2
This is how it works step-by-step:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.array([[1, -100, 2, -100], [-100, 3, -100, -100]]),
                   columns=['col1', 'col2', 'col3', 'col4'])
df1
   col1  col2  col3  col4
0     1  -100     2  -100
1  -100     3  -100  -100
df2 = pd.DataFrame(np.array([[12, 23, 34, 45], [1, 2, 3, 4]]),
                   columns=['col1', 'col2', 'col3', 'col4'])
df2
   col1  col2  col3  col4
0    12    23    34    45
1     1     2     3     4
Using boolean indexing, you get a mask of the positions to replace:
df1 == -100
    col1   col2   col3   col4
0  False   True  False   True
1   True  False   True   True
Wherever the mask is True, the corresponding value of df2 is assigned:
df1[df1 == -100] = df2
df1
   col1  col2  col3  col4
0     1    23     2    45
1     1     3     3     4
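If you would rather not modify df1 in place, the same substitution can be written non-mutatingly with where (a sketch of an equivalent, not the answer's own code):
df_result = df1.where(df1 != -100, df2)  # keep df1 where it isn't -100, else take df2's value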

Convert table having string column, array column to all string columns

I am trying to convert a table containing string columns and array columns to a table with string columns only.
Here is how the current table looks:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 |[2,3] | [4,5] |
| 2 |[6,7,8] | [8,9,10] |
+-----+--------------------+--------------------+
How can I get expected result like that:
+-----+--------------------+--------------------+
|col1 | col2 | col3 |
+-----+--------------------+--------------------+
| 1 | 2 | 4 |
| 1 | 3 | 5 |
| 2 | 6 | 8 |
| 2 | 7 | 9 |
| 2 | 8 | 10 |
+-----+--------------------+--------------------+
The confusion comes from mixing scalar columns and list columns.
Under the assumption that, for every row, col2 and col3 are of the same length, we can first turn all scalar columns into list columns and then concatenate:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2],
                   'col2': [[2, 3], [6, 7, 8]],
                   'col3': [[4, 5], [8, 9, 10]]})
# First, we turn all columns into list columns:
# each scalar becomes a one-element list repeated to the row's list length
df['col1'] = df['col1'].apply(lambda x: [x]) * df['col2'].apply(len)
# Then we concatenate the lists column-wise
df.apply(np.concatenate)
Output:
col1 col2 col3
0 1 2 4
1 1 3 5
2 2 6 8
3 2 7 9
4 2 8 10
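As an aside, on pandas >= 1.3 the same result is built in, since DataFrame.explode accepts a list of equal-length list columns:
df.explode(['col2', 'col3'], ignore_index=True)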
Convert the columns to lists, then to numpy.array, and finally back to a DataFrame:
# Note: this assumes every list in a column has the same length;
# ragged rows (as in the sample above) would need padding first.
vals1 = np.array(df.col2.values.tolist())
vals2 = np.array(df.col3.values.tolist())
col1 = np.repeat(df.col1, vals1.shape[1])
df = pd.DataFrame(np.column_stack((col1, vals1.ravel(), vals2.ravel())), columns=df.columns)
print(df)
  col1 col2 col3
0    1    2    4
1    1    3    5
2    2    6    8
3    2    7    9

