Pandas merging rows with same values based on multiple columns - python

I have a sample dataset like this:
Col1 Col2 Col3
A 1,2,3 A123
A 4,5 A456
A 1,2,3 A456
A 4,5 A123
I want to merge Col2 and Col3 into a single row for each unique value of Col1.
Expected Result:
Col1 Col2 Col3
A 1,2,3,4,5 A123,A456
I referred to some solutions and tried the following, but it only aggregates a single column.
df.groupby(df.columns.difference(['Col3']).tolist())\
  .Col3.apply(pd.Series.unique).reset_index()

Your attempt groups by df.columns.difference(['Col3']), i.e. by Col1 and Col2 together, so only Col3 gets aggregated. Instead:
Drop duplicates with the subset Col1 and Col3,
groupby Col1,
then aggregate, using the string concatenation method str.cat:
(df.drop_duplicates(['Col1', 'Col3'])
   .groupby('Col1')
   .agg(Col2=('Col2', lambda x: x.str.cat(sep=',')),
        Col3=('Col3', lambda x: x.str.cat(sep=',')))
   .reset_index()
)
  Col1       Col2       Col3
0    A  1,2,3,4,5  A123,A456
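For reference, a minimal sketch of the sample frame, reconstructed from the question, so the snippet above runs end to end:

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'A'],
                   'Col2': ['1,2,3', '4,5', '1,2,3', '4,5'],
                   'Col3': ['A123', 'A456', 'A456', 'A123']})

Here drop_duplicates(['Col1', 'Col3']) keeps only the first ('1,2,3', 'A123') and ('4,5', 'A456') rows, which is why each fragment appears exactly once in the concatenated result.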

Related

Applying function to pandas dataframe column after str.findall

I have a dataframe df with two columns:
Col1 Col2
'abc-def-ghi' 1
'abc-opq-rst' 2
I created a new column Col3 like this:
import re
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)
And got such a dataframe afterwards:
Col1 Col2 Col3
'abc-def-ghi' 1 [abc]
'abc-opq-rst' 2 [abc]
What I want to do now is create a new column Col4 that holds 1 if Col3 contains 'abc' and 0 otherwise.
I tried to do this with a function:
def f(row):
    if row['Col3'] == '[abc]':
        val = 1
    else:
        val = 0
    return val
And applied this to my pandas dataframe:
df['Col4'] = df.apply(f, axis=1)
But I only get 0, even in rows that contain 'abc'. I think there is something wrong with my if-statement.
How can I solve this?
Just do
df['Col4'] = df.Col3.astype(bool).astype(int)
This works because str.findall returns a list for each row: an empty list is falsy and a non-empty one is truthy, so casting to bool and then to int gives the 0/1 flag directly. Your comparison row['Col3'] == '[abc]' always fails because the cell holds the list ['abc'], not the string '[abc]'.
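For context, a minimal runnable sketch of the whole flow with the data from the question; the final str.contains line is an alternative I'm adding, not part of the original answer:

import re
import pandas as pd

df = pd.DataFrame({'Col1': ['abc-def-ghi', 'abc-opq-rst'],
                   'Col2': [1, 2]})
df['Col3'] = df['Col1'].str.findall('abc', flags=re.IGNORECASE)

# Non-empty match lists are truthy, empty ones falsy.
df['Col4'] = df['Col3'].astype(bool).astype(int)

# Alternatively, skip Col3 entirely and test Col1 directly:
df['Col4'] = df['Col1'].str.contains('abc', case=False).astype(int)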

Pandas: How to swap a row's cell values so that they are in alphabetical order

I have the following dataframe:
COL1 | COL2 | COL3
'Mary'| 'John' | 'Adam'
How can I reorder this row so that 'Mary', 'John', and 'Adam' are ordered alphabetically in COL1, COL2, and COL3, like so:
COL1 | COL2 | COL3
'Adam'| 'John' | 'Mary'
Using sort
df.values.sort()
df
Out[256]:
COL1 COL2 COL3
0 'Adam' 'John' 'Mary'
You can assign values via np.sort:
df.iloc[:] = pd.DataFrame(np.sort(df.values, axis=1))
# also works, performance not yet tested
# df[:] = pd.DataFrame(np.sort(df.values, axis=1))
print(df)
COL1 COL2 COL3
0 Adam John Mary
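Note that the df.values.sort() trick relies on .values returning a writable view of the frame's data, which pandas does not guarantee; rebuilding the frame explicitly is safer. A minimal sketch, assuming string data and that the original labels should be kept:

import numpy as np
import pandas as pd

df = pd.DataFrame([['Mary', 'John', 'Adam']],
                  columns=['COL1', 'COL2', 'COL3'])

# Sort each row independently, then restore the original labels.
df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns)
print(df)
#    COL1  COL2  COL3
# 0  Adam  John  Mary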

Transforming a CSV from wide to long format

I have a CSV like this:
col1,col2,col2_val,col3,col3_val
A,1,3,5,6
B,2,3,4,5
and I want to transform this CSV like this:
col1,col6,col7,col8
A,col2,1,3
A,col3,5,6
There are columns col3 and col3_val, so I want to keep the name col3 in col6, the value of col3 in col7, and the value of col3_val in col8, all in the same row.
I think what you're looking for is df.melt and df.groupby:
In [63]: df.rename(columns=lambda x: x.strip('_val')).melt('col1')\
           .groupby(['col1', 'variable'], as_index=False)['value']\
           .apply(lambda x: pd.Series(x.values))\
           .add_prefix('value')\
           .reset_index()
Out[63]:
col1 variable value0 value1
0 A col2 1 3
1 A col3 5 6
2 B col2 2 3
3 B col3 4 5
Credit to John Galt for help with the second part.
If you wish to rename columns, assign the whole expression above to df_out and then do:
df_out.columns = ['col1', 'col6', 'col7', 'col8']
Saving this should be straightforward with df.to_csv.
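One caveat: strip('_val') removes any of the characters _, v, a, l from both ends of a name, which happens to be safe for these columns but is fragile in general. As an alternative, a concat-based sketch under the same assumption that every base column colN has a matching colN_val (output column names taken from the question):

import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': [1, 2], 'col2_val': [3, 3],
                   'col3': [5, 4], 'col3_val': [6, 5]})

# One block per (colN, colN_val) pair, stacked vertically.
pairs = [c for c in df.columns if c + '_val' in df.columns]
out = pd.concat(
    [df[['col1', c, c + '_val']]
       .rename(columns={c: 'col7', c + '_val': 'col8'})
       .assign(col6=c)
     for c in pairs],
    ignore_index=True)[['col1', 'col6', 'col7', 'col8']]
print(out.sort_values(['col1', 'col6']))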

Find unique values for each column

I am looking to find, for each column, the values that are unique across the whole dataframe (i.e. values that appear exactly once anywhere in it).
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none, and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, drop the first index level with reset_index, and finally reindex by the original columns:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely if there is only one unique value per column.
Here is a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information: you can find the unique values of a whole DataFrame using NumPy's np.unique, like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
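The same idea as the stack-based answers can also be expressed with value_counts; a minimal sketch, using the second sample frame from above:

import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'X', 'F']}, index=[1, 2, 3])

# Values that occur exactly once anywhere in the frame.
counts = df.stack().value_counts()
singletons = set(counts[counts == 1].index)      # {'C', 'X', 'F'}
result = {col: sorted(set(df[col]) & singletons) for col in df.columns}
print(result)   # {'Col1': ['C'], 'Col2': [], 'Col3': ['F', 'X']}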

Change dataframe to index value pair

I have a pandas dataframe 'df' of shape 2000x50 which appears as:
Col1 Col2 Col3
row1 0.046878 0.298156 0.743520
row2 0.442526 0.881977 0.885514
row3 0.075382 0.622636 0.706607
Rows and columns don't have consistent naming in my real scenario.
I want to create a data frame with multi index as:
(row1, col1), 0.046878
(row3, col2), 0.622636, etc
Is there a more concise way to do this than extracting the column names and indexes, forming their Cartesian product to create indexes like (row1, col1), and flattening the values stored in df?
Use stack to get a Series and then to_frame to get a DataFrame:
df = df.stack().to_frame('col')
print (df)
col
row1 Col1 0.046878
Col2 0.298156
Col3 0.743520
row2 Col1 0.442526
Col2 0.881977
Col3 0.885514
row3 Col1 0.075382
Col2 0.622636
Col3 0.706607
Then, to take a random subset of the pairs, chain sample:
df = df.stack().to_frame('col').sample(n=3)
print (df)
col
row1 Col2 0.298156
row3 Col1 0.075382
Col2 0.622636
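For reference, a minimal sketch with a small random frame standing in for the 2000x50 one (the labels and data here are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((3, 3)),
                  index=['row1', 'row2', 'row3'],
                  columns=['Col1', 'Col2', 'Col3'])

pairs = df.stack().to_frame('col')   # MultiIndex (row, column) -> value
print(pairs.head())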
