Split dataframe multi-value columns - Python

I have a pipe-delimited file with data like this:
col1|col2|col3|col4|col5
row1|r1|src1#src2|val1#val2|val4#val5
row2|r2|src2#src1|val11#val12|val14#val15
row3|r3|src1#src2|val44|val23#val33
col3, col4 and col5 are multi-value fields with # as the separator. Based on the position of src2 in col3, I need to take the values in col4 and col5 at the same position. For example, if src2 is in the second position, then take only the values from col4 and col5 that are in the second position, like this:
Output:
col1|col2|col3|col4|col5
row1|r1|src2|val2|val5
row2|r2|src2|val11|val14
row3|r3|src2||val33
I have written this code, but I am not getting the desired result:
df2 = (df["col3"].str.split("#", expand=True))
df2.loc[df2[0] == 'src2', 'Position'] = '0'
df2.loc[df2[1] == 'src2', 'Position'] = '1'
df.loc[df2['Position']=='0' ,'col4'] = df["col4"].str.split("#", expand=True)[0]
print(df['col4'])

Because you have uneven "sub-list" lengths, this is not directly achievable with explode.
Here is an approach using itertools.zip_longest and apply:
from itertools import zip_longest

cols = ['col3', 'col4', 'col5']
df2 = df.copy()
df2[cols] = (df2[cols]
             .apply(lambda c: c.str.split('#'))
             .apply(lambda r: next(filter(lambda x: x[0] == 'src2',
                                          zip_longest(*r)),
                                   float('nan')),
                    axis=1, result_type='expand')
             )
Output:
   col1 col2  col3   col4   col5
0  row1   r1  src2   val2   val5
1  row2   r2  src2  val11  val14
2  row3   r3  src2   None  val33
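For reference, a minimal sketch that builds the input frame the snippet above expects (the inline string stands in for the real pipe-delimited file, which is an assumption):
import io
import pandas as pd

raw = """col1|col2|col3|col4|col5
row1|r1|src1#src2|val1#val2|val4#val5
row2|r2|src2#src1|val11#val12|val14#val15
row3|r3|src1#src2|val44|val23#val33"""

# in practice this would be pd.read_csv('yourfile.txt', sep='|')
df = pd.read_csv(io.StringIO(raw), sep='|')
Note that zip_longest pads the shorter lists with None, which is why row3's col4 comes out as None: its col3 splits into two entries but its col4 has only one.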

Related

Pandas merging rows with same values based on multiple columns

I have a sample dataset like this:
Col1  Col2   Col3
A     1,2,3  A123
A     4,5    A456
A     1,2,3  A456
A     4,5    A123
I just want to merge Col2 and Col3 into a single row based on the unique Col1.
Expected Result:
Col1  Col2       Col3
A     1,2,3,4,5  A123,A456
I referred to some solutions and tried the following, but it only aggregates a single column.
df.groupby(df.columns.difference(['Col3']).tolist())\
  .Col3.apply(pd.Series.unique).reset_index()
Drop duplicates with subset Col1 and Col3, groupby Col1, then aggregate using the string concatenation method str.cat:
(df.drop_duplicates(['Col1', 'Col3'])
   .groupby('Col1')
   .agg(Col2=('Col2', lambda x: x.str.cat(sep=',')),
        Col3=('Col3', lambda x: x.str.cat(sep=',')))
   .reset_index()
)
  Col1       Col2       Col3
0    A  1,2,3,4,5  A123,A456
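An alternative sketch that skips the explicit drop_duplicates by de-duplicating inside the aggregation instead (note the subtle difference: this de-duplicates each column independently rather than by (Col1, Col3) pairs, which happens to give the same result here):
out = (df.groupby('Col1')
         .agg(lambda x: ','.join(x.unique()))  # drop repeats, then concatenate
         .reset_index())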

Pandas: How to swap a row's cell values so that they are in alphabetical order

I have the following dataframe:
COL1 | COL2 | COL3
'Mary' | 'John' | 'Adam'
How can I reorder this row so that 'Mary', 'John', and 'Adam' are ordered alphabetically in COL1, COL2, and COL3, like so:
COL1 | COL2 | COL3
'Adam' | 'John' | 'Mary'
Using sort
df.values.sort()
df
Out[256]:
     COL1    COL2    COL3
0  'Adam'  'John'  'Mary'
You can assign values via np.sort:
df.iloc[:] = pd.DataFrame(np.sort(df.values, axis=1))
# also works, performance not yet tested
# df[:] = pd.DataFrame(np.sort(df.values, axis=1))
print(df)
   COL1  COL2  COL3
0  Adam  John  Mary
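A self-contained sketch of the assignment route. One caveat worth knowing: the in-place df.values.sort() above relies on .values returning a writable view of the underlying array, which holds for a single-dtype frame like this one but is not guaranteed in general, so the explicit assignment is the safer pattern:
import numpy as np
import pandas as pd

df = pd.DataFrame([['Mary', 'John', 'Adam']], columns=['COL1', 'COL2', 'COL3'])
df.iloc[:] = np.sort(df.values, axis=1)  # sort each row's values left to right
print(df)
#    COL1  COL2  COL3
# 0  Adam  John  Mary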

Creating a pandas dataframe from an unknown number of lists of columns

I have limited my requirements to 5 columns and 3 rows for easy explanation. My column header arrives as a string and each of my rows arrives as a string. I want all the rows to be added to a dataframe. Here is what I have tried:
import pandas as pd
Column_Header = "Col1,Col2,Col3,Col4,Col5" # We have up to 500 columns
df = pd.DataFrame(columns=Column_Header.split(","))
# we will get up to 100000 rows from a server response
Row1 = "Val11,Val12,Val13,Val14,Val15"
Row2 = "Val21,Val22,Val23,Val124,Val25"
Row3 = "Val31,Val32,Val33,Val34,Val35"
df_temp = pd.DataFrame(data = Row1.split(",") , columns = Column_Header.split(","))
pd.concat(df,df_temp)
print(pd)
The best and fastest approach is to create a list of all the data with a list comprehension and call the DataFrame constructor only once:
Column_Header = "Col1,Col2,Col3,Col4,Col5"
Row1 = "Val11,Val12,Val13,Val14,Val15"
Row2 = "Val21,Val22,Val23,Val124,Val25"
Row3 = "Val31,Val32,Val33,Val34,Val35"
rows = [Row1,Row2,Row3]
L = [x.split(',') for x in rows]
print (L)
[['Val11', 'Val12', 'Val13', 'Val14', 'Val15'],
 ['Val21', 'Val22', 'Val23', 'Val124', 'Val25'],
 ['Val31', 'Val32', 'Val33', 'Val34', 'Val35']]
df = pd.DataFrame(data = L , columns = Column_Header.split(","))
print (df)
    Col1   Col2   Col3    Col4   Col5
0  Val11  Val12  Val13   Val14  Val15
1  Val21  Val22  Val23  Val124  Val25
2  Val31  Val32  Val33   Val34  Val35
If this is a viable option, it would be simpler to leave all the data munging to pd.read_csv. Convert all your strings to a single multiline string, and pass it through a StringIO buffer to read_csv.
import io
data = '\n'.join([Column_Header, Row1, Row2, Row3])
df = pd.read_csv(io.StringIO(data))
df
    Col1   Col2   Col3    Col4   Col5
0  Val11  Val12  Val13   Val14  Val15
1  Val21  Val22  Val23  Val124  Val25
2  Val31  Val32  Val33   Val34  Val35
If you're on Python 2.x, use the cStringIO module in place of io by importing it as:
import cStringIO as io
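Since the real payload is up to 500 columns and 100000 rows, here is a sketch of the same build-once idea applied to an arbitrary iterable of row strings (get_rows is a hypothetical stand-in for your server response):
import pandas as pd

def get_rows():
    # hypothetical: yields row strings as they arrive from the server
    yield "Val11,Val12,Val13,Val14,Val15"
    yield "Val21,Val22,Val23,Val124,Val25"

Column_Header = "Col1,Col2,Col3,Col4,Col5"
df = pd.DataFrame([r.split(',') for r in get_rows()],
                  columns=Column_Header.split(','))
The key point in both answers is the same: accumulate first, construct once. Growing a DataFrame row by row with concat inside a loop copies all the accumulated data on every iteration.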

Transforming a CSV from wide to long format

I have a csv like this:
col1,col2,col2_val,col3,col3_val
A,1,3,5,6
B,2,3,4,5
and I want to transform this csv like this:
col1,col6,col7,col8
A,col2,1,3
A,col3,5,6
There are col3 and col3_val, so I want to keep the name col3 in col6, col3's value in col7, and col3_val's value in col8, in the same row where col3's value is stored.
I think what you're looking for is df.melt and df.groupby:
In [63]: df.rename(columns=lambda x: x.strip('_val')).melt('col1')\
           .groupby(['col1', 'variable'], as_index=False)['value']\
           .apply(lambda x: pd.Series(x.values))\
           .add_prefix('value')\
           .reset_index()
Out[63]:
  col1 variable  value0  value1
0    A     col2       1       3
1    A     col3       5       6
2    B     col2       2       3
3    B     col3       4       5
Credit to John Galt for help with the second part.
If you wish to rename the columns, assign the whole expression above to df_out and then do:
df_out.columns = ['col1', 'col6', 'col7', 'col8']
Saving this should be straightforward with df.to_csv.
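Putting those last two steps together, a minimal sketch (the output filename is an assumption). One caveat on the snippet above: str.strip('_val') strips any of the characters _, v, a, l from both ends of the name, which happens to work for col2_val and col3_val but is not a suffix removal in general; something like x.replace('_val', '') is safer for arbitrary names.
df_out.columns = ['col1', 'col6', 'col7', 'col8']
df_out.to_csv('long_format.csv', index=False)  # filename is hypothetical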

Change dataframe to index value pair

I have a pandas dataframe 'df' of shape 2000x50 which appears as:
Col1 Col2 Col3
row1 0.046878 0.298156 0.743520
row2 0.442526 0.881977 0.885514
row3 0.075382 0.622636 0.706607
Rows and columns don't have consistent naming in my real scenario.
I want to create a data frame with multi index as:
(row1, col1), 0.046878
(row3, col2), 0.622636, etc.
Is there a more concise way to do this than extracting the column names and indexes, forming the Cartesian product to create indexes like (row1, col1), and flattening the values stored in 'df'?
Use stack to get a Series and then to_frame for a DataFrame:
df = df.stack().to_frame('col')
print (df)
                col
row1 Col1  0.046878
     Col2  0.298156
     Col3  0.743520
row2 Col1  0.442526
     Col2  0.881977
     Col3  0.885514
row3 Col1  0.075382
     Col2  0.622636
     Col3  0.706607
And then, if you need a random subset, chain sample:
df = df.stack().to_frame('col').sample(n=3)
print (df)
                col
row1 Col2  0.298156
row3 Col1  0.075382
     Col2  0.622636
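If you literally need (row, col) tuple labels as in the question, a small sketch on top of the stacked result (assuming df is still the original wide frame):
pairs = df.stack().to_frame('col')
pairs.index = pairs.index.to_flat_index()  # MultiIndex -> tuples like ('row1', 'Col1')
print(pairs.head(2))
#                    col
# (row1, Col1)  0.046878
# (row1, Col2)  0.298156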
