I have a dataframe that I want to aggregate as follows: group by col1 and col6, and collect col2 and col3 into a new column.
col1 col2 col3 col6
a it1 3 f
a it2 5 f
b it6 7 g
b it7 8 g
I would like the result to look like this:
col1 col6 new_col
a f pd.DataFrame({"col2": ["it1", "it2"],"col3":[3,5]})
b g pd.DataFrame({"col2": ["it6", "it7"],"col3":[7,8]})
I tried the following:
def aggregate(gr):
    return pd.DataFrame({"col2": gr["col2"], "col3": gr["col3"]})
df.groupby("col1").agg(aggregate)
but agg does not seem to be the right tool for this.
What is the right way to do this?
It is not entirely clear what you are trying to achieve, so here are two ideas. First, if you are going to convert to JSON anyway, you can convert each group to JSON directly:
df.groupby(['col1','col6']).apply(lambda d: d.to_json())
produces
col1 col6
a f {"col1":{"0":"a","1":"a"},"col2":{"0":"it1","1...
b g {"col1":{"2":"b","3":"b"},"col2":{"2":"it6","3...
Second, you can have DataFrames inside a DataFrame; here is how you can do that:
dd = {}
for idx, gr in df.groupby(['col1', 'col6']):
    dd[idx] = aggregate(gr)
dfout = pd.DataFrame(columns=['newcol'], index=dd.keys())
for idx in dfout.index:
    dfout.at[idx, 'newcol'] = dd[idx]
Here is how it is printed nicely with the help of the tabulate package:
from tabulate import tabulate
print(tabulate(dfout, headers = 'keys'))
newcol
---------- ------------
('a', 'f') col2 col3
0 it1 3
1 it2 5
('b', 'g') col2 col3
2 it6 7
3 it7 8
So dfout has the right DataFrames inside. When converted to JSON it looks like this:
dfout.to_json()
'{"newcol":{"(\'a\', \'f\')":{"col2":{"0":"it1","1":"it2"},"col3":{"0":3,"1":5}},"(\'b\', \'g\')":{"col2":{"2":"it6","3":"it7"},"col3":{"2":7,"3":8}}}}'
My data looks like this: (I have 28 columns)
col1 col2 col3 col4 col5
AA 0 0 B 0
0 CC 0 D 0
0 0 E F G
I am trying to merge these columns to get an output like this:
col1 col2 col3 col4 col5 col6
AA 0 0 B 0 AA;B
0 CC 0 D 0 CC;D
0 0 E F G E;F;G
I want to merge only the non-numeric characters into the new column.
I tried like this:
cols=['col1','col2', 'col3', 'col4', 'col5']
df2["col6"] = df2[cols].apply(lambda x: ';'.join(x.dropna()), axis=1)
But it doesn't take out the zeros. I am aware it is a small change, but I couldn't figure it out.
Thanks
Try it via the where() and apply() methods:
df2["col6"]=df2.where((df2!='0')&(df2!=0)).apply(lambda x: ';'.join(x.dropna()), axis=1)
If the data contains other numbers besides 0 (which should be excluded along with 0), then use:
df2["col6"]=(df2.where(df2.apply(lambda x:x.str.isalpha(),1))
.apply(lambda x: ';'.join(x.dropna()), axis=1))
With your shown samples, please try the following. This fixes the OP's attempt; the main change is the condition x[x!=0], which builds a boolean mask inside the join of the OP's attempted code:
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x!=0]), axis=1)
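For reference, here is a self-contained run of that fix on the sample data (a sketch: it assumes the zeros are integers; if they are the string '0', use x[x != '0'] instead):
import pandas as pd

df2 = pd.DataFrame({
    'col1': ['AA', 0, 0],
    'col2': [0, 'CC', 0],
    'col3': [0, 0, 'E'],
    'col4': ['B', 'D', 'F'],
    'col5': [0, 0, 'G'],
})
cols = ['col1', 'col2', 'col3', 'col4', 'col5']
# keep only the non-zero (string) cells in each row, then join them with ';'
df2['col6'] = df2[cols].apply(lambda x: ';'.join(x[x != 0]), axis=1)
print(df2['col6'].tolist())  # ['AA;B', 'CC;D', 'E;F;G']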
The code below creates multiple empty dataframes named from the report2 list. They are then populated with a filtered existing dataframe called dfsource.
With a nested for loop, I'd like to filter each of these dataframes using a list of values, but the inner loop does not work as shown.
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
You can reference a variable in a query by prefixing it with @:
df_dict[i]=dfsource.query('COL1==@x')
So the full code looks like this:
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    df_dict[i]=pd.DataFrame()
    for x in report:
        df_dict[i]=dfsource.query('COL1==@x')
        #df_dict[i]=dfsource.query('COL1=="A"') #Example, this works filtering for value A but not what I need.
print(df_dict['A_US'])
print(df_dict['B_US'])
print(df_dict['C_US'])
which outputs
COL1 COL2
0 A D
1 B E
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
COL1 COL2
2 C F
Note, however, that the inner loop overwrites df_dict[i] on every pass, so each key ends up holding only the result of the last filter (C). I think you actually want a dictionary keyed on each combination of i and x; move the creation of the dataframe into the inner loop and create a new key for each iteration:
import pandas as pd
report=['A','B','C']
suffix='_US'
report2=[s + suffix for s in report]
print (report2) #result: ['A_US', 'B_US', 'C_US']
source = {'COL1': ['A','B','C'], 'COL2': ['D','E','F']}
dfsource=pd.DataFrame(source)
print(dfsource)
df_dict = {}
for i in report2:
    for x in report:
        new_key = x + i
        df_dict[new_key]=pd.DataFrame()
        df_dict[new_key]=dfsource.query('COL1==@x')
for item in df_dict.items():
    print(item)
This outputs 9 dataframes, each filtered on whatever x value was passed:
('AA_US', COL1 COL2
0 A D)
('BA_US', COL1 COL2
1 B E)
('CA_US', COL1 COL2
2 C F)
('AB_US', COL1 COL2
0 A D)
('BB_US', COL1 COL2
1 B E)
('CB_US', COL1 COL2
2 C F)
('AC_US', COL1 COL2
0 A D)
('BC_US', COL1 COL2
1 B E)
('CC_US', COL1 COL2
2 C F)
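As an aside, if the intent was for each *_US key to hold only the rows for its own letter (A_US filtered on A, B_US on B, and so on), a plain zip over the two lists avoids the nested loop entirely; a sketch:
# pair each generated name with its source value and filter once per pair
for name, x in zip(report2, report):
    df_dict[name] = dfsource.query('COL1 == @x')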
I have a sample dataset like this
Col1 Col2 Col3
A 1,2,3 A123
A 4,5 A456
A 1,2,3 A456
A 4,5 A123
I just want to merge Col2 and Col3 into a single row per unique value of Col1.
Expected Result:
Col1 Col2 Col3
A 1,2,3,4,5 A123,A456
I referred to some solutions and tried the following, but it only aggregates a single column.
df.groupby(df.columns.difference(['Col3']).tolist())\
.Col3.apply(pd.Series.unique).reset_index()
Drop duplicates with subset Col1 and Col3
groupby Col1
Then aggregate, using the string concatenation method:
(df.drop_duplicates(['Col1','Col3'])
.groupby('Col1')
.agg(Col2 = ('Col2',lambda x: x.str.cat(sep=',')),
Col3 = ('Col3', lambda x: x.str.cat(sep=','))
)
.reset_index()
)
Col1 Col2 Col3
0 A 1,2,3,4,5 A123,A456
I have a csv like this:
col1,col2,col2_val,col3,col3_val
A,1,3,5,6
B,2,3,4,5
and I want to transform this csv like this:
col1,col6,col7,col8
A,col2,1,3
A,col3,5,6
There are col3 and col3_val, so I want to keep the column name col3 in col6, the value of col3 in col7, and the value of col3_val in col8, all in the same row.
I think what you're looking for is df.melt and df.groupby:
In [63]: df.rename(columns=lambda x: x.strip('_val')).melt('col1')\
.groupby(['col1', 'variable'], as_index=False)['value'].apply(lambda x: pd.Series(x.values))\
.add_prefix('value')\
.reset_index()
Out[63]:
col1 variable value0 value1
0 A col2 1 3
1 A col3 5 6
2 B col2 2 3
3 B col3 4 5
Credit to John Galt for help with the second part.
If you wish to rename columns, assign the whole expression above to df_out and then do:
df_out.columns = ['col1', 'col6', 'col7', 'col8']
Saving this should be straightforward with df.to_csv.
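A minimal sketch of those last two steps (the output file name is just a placeholder):
df_out.columns = ['col1', 'col6', 'col7', 'col8']
df_out.to_csv('output.csv', index=False)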
I am looking to find the values that are unique to each column in my dataframe (i.e., values that appear nowhere else in the whole dataframe).
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas? Thank you!
You can use stack to get a Series, then drop_duplicates with keep=False to remove all duplicated values, remove the first index level with reset_index, and finally reindex:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely if there is only one unique value per column.
Here is an attempt at a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object
I don't believe this is exactly what you want, but as useful information: you can find the unique values of a whole DataFrame using numpy's .unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']