Pandas GroupBy on column names - python

I have a dataframe, which we can approximate with
df = pd.DataFrame({'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]})
and a category series
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])
I'd like to get a sum of df's columns grouped into the categories 'A', 'B'. Maybe something like:
result = df.groupby(??, axis=1).sum()
returning
result = pd.DataFrame({'A':[3,3,4], 'B':[1,1,0]})

Use groupby + sum on the columns (the axis=1 is important here):
df.groupby(df.columns.map(category.get), axis=1).sum()
A B
0 3 1
1 3 1
2 4 0
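
Note: groupby(..., axis=1) is deprecated in recent pandas versions (2.1+). An equivalent that avoids it transposes, groups on the row axis, and transposes back - a minimal sketch:
df.T.groupby(df.columns.map(category.get)).sum().T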

After reindexing, you can assign the category values as df's columns:
df=df.reindex(columns=category.index)
df.columns=category
df.groupby(df.columns.values,axis=1).sum()
Out[1255]:
A B
0 3 1
1 3 1
2 4 0
Or use pd.Series.get:
df.groupby(category.get(df.columns),axis=1).sum()
Out[1262]:
A B
0 3 1
1 3 1
2 4 0

Here is what I did to group a dataframe with duplicate column names.
data_df:
   1  1  2  1
0  q  r  f  t
Code:
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
df_grouped:
       1  2
0  q r t  f
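
For reference, a minimal runnable version; the construction of data_df is an assumption, reconstructed from the layout shown (one row of strings under duplicate integer column labels):
import pandas as pd

# hypothetical reconstruction: duplicate column labels 1, 1, 2, 1
data_df = pd.DataFrame([['q', 'r', 'f', 't']], columns=[1, 1, 2, 1])

# for each row, join the values of same-named columns into one string
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
print(df_grouped)
#        1  2
# 0  q r t  f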

Related

Subtract values in a column in blocks

Suppose there is the following dataframe:
import pandas as pd
df = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'], 'Value': [1, 2, 3, 4, 5, 6]})
I would like to subtract the values of group A from those of groups B and C and make a new column with the difference. That is, I would like to do something like this:
df[df['Group'] == 'B']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
df[df['Group'] == 'C']['Value'].reset_index() - df[df['Group'] == 'A']['Value'].reset_index()
and place the result in a new column. Is there a way of doing it without a for loop?
Assuming you want to subtract the first A from the first B/C, the second A from the second B/C, etc., the easiest might be to reshape:
df2 = (df
 .assign(cnt=df.groupby('Group').cumcount())
 .pivot(index='cnt', columns='Group', values='Value')
)
# Group A B C
# cnt
# 0 1 3 5
# 1 2 4 6
df['new_col'] = df2.sub(df2['A'], axis=0).melt()['value']
A variant using groupby.apply:
df['new_col'] = (df
.assign(cnt=df.groupby('Group').cumcount())
.groupby('cnt', group_keys=False)
.apply(lambda d: d['Value'].sub(d.loc[d['Group'].eq('A'), 'Value'].iloc[0]))
)
output:
Group Value new_col
0 A 1 0
1 A 2 0
2 B 3 2
3 B 4 2
4 C 5 4
5 C 6 4
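
If, as here, every group has the same number of rows and the A row comes first at each position, a transform avoids the apply entirely - a sketch under that assumption:
# the first value at each position is A's value, broadcast back to every row
df['new_col'] = df['Value'] - (df
 .groupby(df.groupby('Group').cumcount())['Value']
 .transform('first')
)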

count word frequency with groupby

I have a csv file with only one tag column:
tag
A
B
B
C
C
C
C
When I run groupby to count the word frequency, the output does not have the frequency number.
#!/usr/bin/env python3
import pandas as pd
def count(fname):
    df = pd.read_csv(fname)
    print(df)
    dfg = df.groupby('tag').count().reset_index()
    print(dfg)
    return
count("save.txt")
Output, with no frequency column:
tag
0 A
1 B
2 B
3 C
4 C
5 C
6 C
tag
0 A
1 B
2 C
Expected output:
tag freq
0 A 1
1 B 2
2 C 4
Looks close to me, per my comment:
df = pd.DataFrame({'tag': ['A', 'B', 'B', 'C', 'C', 'C', 'C']})
df.groupby(['tag'], as_index=False).agg(freq=('tag', 'count'))
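Equivalently, since tag is the only column, count() has nothing left to count once tag becomes the group key; groupby(...).size() sidesteps that - a minimal sketch:
dfg = df.groupby('tag').size().reset_index(name='freq')
print(dfg)
#   tag  freq
# 0   A     1
# 1   B     2
# 2   C     4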
You could count the values directly with value_counts (note the result is sorted by frequency, descending):
Input:
df = df['tag'].value_counts().rename_axis('tag').reset_index(name='freq')
Output:
tag freq
0 C 4
1 B 2
2 A 1
You should use value_counts() and not count():
df.groupby("tag").value_counts().reset_index(name="freq")
outputs:
tag freq
0 A 1
1 B 2
2 C 4
To sort in descending order:
df.groupby("tag").value_counts().reset_index(name="freq").sort_values(
    by="freq", ascending=False
)

Concatenating multiple pandas dataframes when columns are not aligned

I have 3 dataframes:
df1
A B C
1 1 1
2 2 2
df2
A B C
3 3 3
4 4 4
df3
A B
5 5
I want to concatenate all the dataframes into the following one:
A B C
1 1 1
2 2 2
3 3 3
4 4 4
5 5 NaN
I tried pd.concat([df1,df2,df3]) with both axis=0 and axis=1, but neither works as expected.
df = pd.concat([df1, df2, df3], ignore_index=True)
If you want a literal "NA" placeholder instead of NaN, fill it afterwards:
df.fillna("NA", inplace=True)
If the shared column names are identical, this works nicely - the common columns are aligned properly:
print (df1.columns.tolist())
['A', 'B', 'C']
print (df2.columns.tolist())
['A', 'B', 'C']
print (df3.columns.tolist())
['A', 'B']
If there may be trailing whitespace in the column names, you can use str.strip:
print (df1.columns.tolist())
['A', 'B ', 'C']
df1.columns = df1.columns.str.strip()
print (df1.columns.tolist())
['A', 'B', 'C']
The parameter ignore_index=True gives a default RangeIndex after concat, avoiding duplicated index values; add the sort parameter to avoid a FutureWarning:
df = pd.concat([df1,df2,df3], ignore_index=True, sort=True)
print (df)
A B C
0 1 1 1.0
1 2 2 2.0
2 3 3 3.0
3 4 4 4.0
4 5 5 NaN
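Note that the NaN forces column C to float (1.0, 2.0, ...). If you want to keep integers, the nullable integer dtype can help - a minimal sketch:
df = pd.concat([df1, df2, df3], ignore_index=True, sort=True).convert_dtypes()
# C becomes a nullable Int64 column, with <NA> instead of NaN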
I think you need to tell concat to ignore the index:
result = pd.concat([df1,df2,df3], ignore_index=True)

Most efficient way to return Column name in a pandas df

I have a pandas df that contains 4 different columns. In every row there's one value of importance, and I want to return the name of the column where that value appears. So for the df below, I want to return the column name wherever the value 2 is found.
d = ({
    'A' : [2,0,0,2],
    'B' : [0,0,2,0],
    'C' : [0,2,0,0],
    'D' : [0,0,0,0],
})
df = pd.DataFrame(data=d)
Output:
A B C D
0 2 0 0 0
1 0 0 2 0
2 0 2 0 0
3 2 0 0 0
So it would be A,C,B,A
I'm doing this via
m = (df == 2).idxmax(axis=1)[0]
And then changing the row. But this isn't very efficient.
I'm also hoping to produce the output as a pandas Series.
Use DataFrame.dot:
df.astype(bool).dot(df.columns).str.cat(sep=',')
Or,
','.join(df.astype(bool).dot(df.columns))
'A,C,B,A'
Or, as a list:
df.astype(bool).dot(df.columns).tolist()
['A', 'C', 'B', 'A']
...or a Series:
df.astype(bool).dot(df.columns)
0 A
1 C
2 B
3 A
dtype: object
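
Note that the dot trick assumes at most one nonzero entry per row; with several, the matching column names get concatenated. If you just want the first match per row, the idxmax from the question already vectorizes - a minimal sketch:
(df == 2).idxmax(axis=1)
# same Series as above: A, C, B, A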

python pandas: merge two dataframes without duplicating the repeated rows

I have two dataframes, df1 and df2.
df1 is the following:
name exist
a 1
b 1
c 1
d 1
e 1
df2 (which has just one column, name) is the following:
name
e
f
g
a
h
I want to merge these two dataframes without repeating names: if a name from df2 exists in df1, it should appear only once; if a name from df2 does not exist in df1, its exist value should be set to 0 or NaN. For example, a and e appear in both df1 and df2, so each should be shown only once. I want the following df:
a 1
b 1
c 1
d 1
e 1
f 0
g 0
h 0
I used the concat function; my code is the following:
import pandas as pd
df1 = pd.DataFrame({'name': ['a', 'b', 'c', 'd', 'e'],
                    'exist': ['1', '1', '1', '1', '1']})
df2 = pd.DataFrame({'name': ['e', 'f', 'g', 'h', 'a']})
df = pd.concat([df1, df2])
print(df)
but the result is wrong (names a and e are shown twice):
exist name
0 1 a
1 1 b
2 1 c
3 1 d
4 1 e
0 NaN e
1 NaN f
2 NaN g
3 NaN h
4 NaN a
Please help - thanks in advance!
As indicated by your title, you can use merge instead of concat and pass how='outer', since you want to keep all records from both df1 and df2 - which is exactly what an outer join does:
import pandas as pd
pd.merge(df1, df2, on='name', how='outer').fillna(0)
# exist name
# 0 1 a
# 1 1 b
# 2 1 c
# 3 1 d
# 4 1 e
# 5 0 f
# 6 0 g
# 7 0 h
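
One caveat: exist holds strings ('1') in df1, so the filled 0 leaves the column with mixed types. If you want a clean integer column, convert it - a minimal sketch:
merged = pd.merge(df1, df2, on='name', how='outer')
# parse the string '1's as numbers, then fill the missing names with 0
merged['exist'] = pd.to_numeric(merged['exist']).fillna(0).astype(int)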
