Transforming a CSV from wide to long format - python

I have a csv like this:
col1,col2,col2_val,col3,col3_val
A,1,3,5,6
B,2,3,4,5
and I want to transform this CSV into this:
col1,col6,col7,col8
A,Col2,1,3
A,col3,5,6
There are paired columns like col3 and col3_val, so I want to put the column name (e.g. col3) in col6, the value of col3 in col7, and the value of col3_val in col8, all in the same row.

I think what you're looking for is df.melt and df.groupby:
In [63]: df.rename(columns=lambda x: x.strip('_val')).melt('col1')\
           .groupby(['col1', 'variable'], as_index=False)['value'].apply(lambda x: pd.Series(x.values))\
           .add_prefix('value')\
           .reset_index()
Out[63]:
col1 variable value0 value1
0 A col2 1 3
1 A col3 5 6
2 B col2 2 3
3 B col3 4 5
Credit to John Galt for help with the second part.
If you wish to rename columns, assign the whole expression above to df_out and then do:
df_out.columns = ['col1', 'col6', 'col7', 'col8']
Saving this should be straightforward with df.to_csv.
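For example, a minimal sketch (the output filename is just a placeholder):
df_out.to_csv('long_format.csv', index=False)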

Related

Group a pandas df on variable columns

I have a dataframe which I want to aggregate as follows: I want to group by col1 and col6 and combine col2 and col3 into a new column.
col1 col2 col3 col6
a it1 3 f
a it2 5 f
b it6 7 g
b it7 8 g
I would like the result to look like this:
col1 col6 new_col
a f pd.DataFrame({"col2": ["it1", "it2"],"col3":[3,5]})
b g pd.DataFrame({"col2": ["it6", "it7"],"col3":[7,8]})
I tried the following:
def aggregate(gr):
    return pd.DataFrame({"col2": gr["col2"], "col3": gr["col3"]})
df.groupby("col1").agg(aggregate)
but aggregate seems not to be the right solution for this.
What is the right way to do this?
It is not entirely clear what you are trying to achieve, so here are two ideas. First, if you are going to convert to JSON anyway, you can convert each group to JSON:
df.groupby(['col1','col6']).apply(lambda d: d.to_json())
produces
col1 col6
a f {"col1":{"0":"a","1":"a"},"col2":{"0":"it1","1...
b g {"col1":{"2":"b","3":"b"},"col2":{"2":"it6","3...
Second, you can have DataFrames inside a DataFrame; here is how you can do that:
dd = {}
for idx, gr in df.groupby(['col1', 'col6']):
    dd[idx] = aggregate(gr)

dfout = pd.DataFrame(columns=['newcol'], index=dd.keys())
for idx in dfout.index:
    dfout.at[idx, 'newcol'] = dd[idx]
Here is how it is printed nicely with the help of the tabulate package:
from tabulate import tabulate
print(tabulate(dfout, headers = 'keys'))
newcol
---------- ------------
('a', 'f') col2 col3
0 it1 3
1 it2 5
('b', 'g') col2 col3
2 it6 7
3 it7 8
So dfout has the right DataFrames inside. When converted to JSON it looks like this:
dfout.to_json()
'{"newcol":{"(\'a\', \'f\')":{"col2":{"0":"it1","1":"it2"},"col3":{"0":3,"1":5}},"(\'b\', \'g\')":{"col2":{"2":"it6","3":"it7"},"col3":{"2":7,"3":8}}}}'

How to join column in pandas ignoring the value of Zero with mixed datatypes

My data looks like this: (I have 28 columns)
col1 col2 col3 col4 col5
AA 0 0 B 0
0 CC 0 D 0
0 0 E F G
I am trying to merge these columns to get an output like this:
col1 col2 col3 col4 col5 col6
AA 0 0 B 0 AA;B
0 C 0 DD 0 C;DD
0 0 E F G E;F;G
I want to merge only the non-numeric values into the new column.
I tried this:
cols=['col1','col2', 'col3', 'col4', 'col5']
df2["col6"] = df2[cols].apply(lambda x: ';'.join(x.dropna()), axis=1)
But it doesn't take out the zeros. I am aware it is a small change but couldn't figure it out.
Thanks
Try it via the where() method and the apply() method:
df2["col6"]=df2.where((df2!='0')&(df2!=0)).apply(lambda x: ';'.join(x.dropna()), axis=1)
If there are numbers other than 0 (in addition to 0), then use:
df2["col6"]=(df2.where(df2.apply(lambda x:x.str.isalpha(),1))
.apply(lambda x: ';'.join(x.dropna()), axis=1))
With your shown samples, please try the following. This tries to fix the OP's attempt; the main change is to use the condition x[x != 0] as a boolean mask inside the join.
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x!=0]), axis=1)
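For reference, here is a self-contained sketch of that one-liner on the sample data. It assumes the zeros are stored as the integer 0; if they are the string '0', change the mask to x != '0':
import pandas as pd

df2 = pd.DataFrame({'col1': ['AA', 0, 0],
                    'col2': [0, 'CC', 0],
                    'col3': [0, 0, 'E'],
                    'col4': ['B', 'D', 'F'],
                    'col5': [0, 0, 'G']})
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# keep only the non-zero entries of each row, then join them with ';'
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x != 0]), axis=1)
print(df2['col6'].tolist())   # ['AA;B', 'CC;D', 'E;F;G']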

Apply lambda to populate column with mean from columns left of given column

Given:
d = {'col1': [1,2], 'col2': [2,2], 'col3': [3,2], 'col4': [np.nan,np.nan], 'col5': [1,2], 'col6': [2,2], 'col7': [3,2], 'col8': [np.nan,np.nan]}
df = pd.DataFrame(data=d)
df
col1 col2 col3 col4 col5 col6 col7 col8
0 1 2 3 NaN 1 2 3 NaN
1 2 2 2 NaN 2 2 2 NaN
What lambda could be applied to populate col4 with the mean of col1, col2, and col3, and col8 with the mean of col5, col6, and col7, in one statement?
If you really want to use a lambda you can do:
df['mean1'] = df.apply(lambda row: np.mean([row['col1'], row['col2'], row['col3']]), axis=1)
df['mean2'] = df.apply(lambda row: np.mean([row['col5'], row['col6'], row['col7']]), axis=1)
Alternatively, you can do it in one line as below using pandas .mean, though I think it's clearer on two lines:
df['mean1'], df['mean2'] = df[['col1','col2','col3']].mean(axis=1), df[['col5','col6','col7']].mean(axis=1)
df['col4'] = df[['col1', 'col2','col3']].mean(axis=1)
df['col8'] = df[['col5', 'col6','col7']].mean(axis=1)
Chained
df['col4'], df['col8'] = df[['col1', 'col2', 'col3']].mean(axis=1), df[['col5', 'col6', 'col7']].mean(axis=1)
df
Or slice and apply mean
df.iloc[:,:3].mean(axis=1)
df.iloc[:,-4:-1].mean(axis=1)
Tied together
df['col4'], df['col8'] = df.iloc[:, :3].mean(axis=1), df.iloc[:, -4:-1].mean(axis=1)
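If the real frame has more of these blocks, one possible generalization (not from the answers above, just a sketch assuming every target column is entirely NaN and should receive the mean of the columns since the previous NaN column) is a single pass over the columns:
import numpy as np
import pandas as pd

d = {'col1': [1, 2], 'col2': [2, 2], 'col3': [3, 2], 'col4': [np.nan, np.nan],
     'col5': [1, 2], 'col6': [2, 2], 'col7': [3, 2], 'col8': [np.nan, np.nan]}
df = pd.DataFrame(data=d)

start = 0
for i, col in enumerate(df.columns):
    if df[col].isna().all():
        # fill the NaN column with the row-wise mean of the columns to its left
        df[col] = df.iloc[:, start:i].mean(axis=1)
        start = i + 1
print(df)   # col4 and col8 are now 2.0 in both rows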

Pandas merging rows with same values based on multiple columns

I have a sample dataset like this
Col1 Col2 Col3
A 1,2,3 A123
A 4,5 A456
A 1,2,3 A456
A 4,5 A123
I just want to merge the Col2 and Col3 into single row based on the unique Col1.
Expected Result:
Col1 Col2 Col3
A 1,2,3,4,5 A123,A456
I referred to some solutions and tried the following, but it only aggregates a single column.
df.groupby(df.columns.difference(['Col3']).tolist())\
.Col3.apply(pd.Series.unique).reset_index()
Drop duplicates with subset Col1 and Col3,
group by Col1,
then aggregate using the string concatenate method:
(df.drop_duplicates(['Col1', 'Col3'])
   .groupby('Col1')
   .agg(Col2=('Col2', lambda x: x.str.cat(sep=',')),
        Col3=('Col3', lambda x: x.str.cat(sep=',')))
   .reset_index()
)
Col1 Col2 Col3
0 A 1,2,3,4,5 A123,A456
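If the same value can also repeat within a group, a variant that de-duplicates inside each concatenation as well (pd.unique keeps first-seen order) might look like:
(df.groupby('Col1')
   .agg(Col2=('Col2', lambda x: ','.join(pd.unique(x))),
        Col3=('Col3', lambda x: ','.join(pd.unique(x))))
   .reset_index())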

Find unique values for each column

I am looking to find the unique values for each column in my dataframe (values that are unique across the whole dataframe).
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, remove the first index level with reset_index, and finally reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely if there is only one unique value per column.
Here is an attempt at a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information - you can find unique values for a DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
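For completeness, here is a small sketch that ties this back to the original question using value_counts (an alternative to the stack/drop_duplicates approach above): keep, per column, only the values that occur exactly once in the whole frame.
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'B', 'F']})
# count every value across the whole frame, then filter each column by those counts
counts = df.stack().value_counts()
unique_per_col = {col: [v for v in df[col] if counts[v] == 1] for col in df.columns}
print(unique_per_col)   # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}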
