I have a CSV file with only one tag column:
tag
A
B
B
C
C
C
C
When I run groupby to count the word frequency, the output does not have the frequency column:
#!/usr/bin/env python3
import pandas as pd
def count(fname):
    df = pd.read_csv(fname)
    print(df)
    dfg = df.groupby('tag').count().reset_index()
    print(dfg)
    return
count("save.txt")
Output with no frequency column:
tag
0 A
1 B
2 B
3 C
4 C
5 C
6 C
tag
0 A
1 B
2 C
Expected output:
tag freq
0 A 1
1 B 2
2 C 4
Looks close to me, per my comment:
df = pd.DataFrame({'tag': ['A', 'B', 'B', 'C', 'C', 'C', 'C']})
df.groupby(['tag'], as_index=False).agg(freq=('tag', 'count'))
You could create the additional column and then count the values:
Input:
df['freq'] = 1
df = df.groupby('tag', as_index=False)['freq'].count().sort_values('freq', ascending=False, ignore_index=True)
Output:
tag freq
0 C 4
1 B 2
2 A 1
You should use value_counts() and not count():
df.groupby("tag").value_counts().reset_index(name="freq")
outputs:
tag freq
0 A 1
1 B 2
2 C 4
To sort in descending order:
df.groupby("tag").value_counts().reset_index(name="freq").sort_values(
    by="freq", ascending=False
)
I am looking to create a new column in a Pandas data frame with the value of a list filtered by the df row value.
df = pd.DataFrame({'Index': [0,1,3,2], 'OtherColumn': ['a', 'b', 'c', 'd']})
Index OtherColumn
0 a
1 b
3 c
2 d
l = [1000, 1001, 1002, 1003]
Desired output:
Index OtherColumn Value
0 a -
1 b -
3 c 1003
2 d -
My code:
df.loc[df.OtherColumn == 'c', 'Value'] = l[df.Index]
This returns an error since df.Index is not recognised as an int but as a whole column (it is not filtered by OtherColumn == 'c').
For R users, I'm looking for:
df[OtherColumn == 'c', Value := l[Index]]
Thanks.
Convert the list to a numpy array for indexing, then filter by the mask on both sides:
import numpy as np
m = df.OtherColumn == 'c'
df.loc[m, 'Value'] = np.array(l)[df.Index][m]
print (df)
Index OtherColumn Value
0 0 a NaN
1 1 b NaN
2 3 c 1003.0
3 2 d NaN
Or use numpy.where:
m = df.OtherColumn == 'c'
df['Value'] = np.where(m, np.array(l)[df.Index], '-')
print (df)
Index OtherColumn Value
0 0 a -
1 1 b -
2 3 c 1003
3 2 d -
Or:
df['value'] = np.where(m, df['Index'].map(dict(enumerate(l))), '-')
Use Series.where + Series.map:
df['value'] = df['Index'].map(dict(enumerate(l))).where(df['OtherColumn'] == 'c', '-')
print(df)
Index OtherColumn value
0 0 a -
1 1 b -
2 3 c 1003
3 2 d -
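The same dict(enumerate(l)) mapping also works without numpy, staying entirely in pandas; a minimal sketch of that variant:

```python
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 3, 2], 'OtherColumn': ['a', 'b', 'c', 'd']})
l = [1000, 1001, 1002, 1003]

m = df['OtherColumn'] == 'c'
df['Value'] = '-'                                  # default placeholder
# map only the masked rows; .loc aligns the result back by index
df.loc[m, 'Value'] = df.loc[m, 'Index'].map(dict(enumerate(l)))
print(df)
#    Index OtherColumn Value
# 0      0           a     -
# 1      1           b     -
# 2      3           c  1003
# 3      2           d     -
```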
What I want to achieve is the following in Pandas:
a = [1,2,3,4]
b = ['a', 'b']
Can I create a DataFrame like:
column1 column2
'a' 1
'a' 2
'a' 3
'a' 4
'b' 1
'b' 2
'b' 3
'b' 4
Use itertools.product with DataFrame constructor:
a = [1, 2, 3, 4]
b = ['a', 'b']
from itertools import product
# pandas 0.24.0+
df = pd.DataFrame(product(b, a), columns=['column1', 'column2'])
# pandas below
# df = pd.DataFrame(list(product(b, a)), columns=['column1', 'column2'])
print (df)
column1 column2
0 a 1
1 a 2
2 a 3
3 a 4
4 b 1
5 b 2
6 b 3
7 b 4
I will put another method here, just in case someone prefers it. Full mockup below:
import pandas as pd
a = [1,2,3,4]
b = ['a', 'b']
df=pd.DataFrame([(y, x) for x in a for y in b], columns=['column1','column2'])
df
result below:
column1 column2
0 a 1
1 b 1
2 a 2
3 b 2
4 a 3
5 b 3
6 a 4
7 b 4
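If you prefer staying inside pandas, the same cross product can be built with pd.MultiIndex.from_product; a sketch (the order of the levels controls the sort):

```python
import pandas as pd

a = [1, 2, 3, 4]
b = ['a', 'b']

# from_product builds the Cartesian product of the two lists;
# to_frame(index=False) turns the index levels into ordinary columns
df = pd.MultiIndex.from_product([b, a], names=['column1', 'column2']).to_frame(index=False)
print(df)
#   column1  column2
# 0       a        1
# 1       a        2
# ...
# 7       b        4
```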
I have a dataframe, which we can proxy by
df = pd.DataFrame({'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]})
and a category series
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])
I'd like to get a sum of df's columns grouped into the categories 'A', 'B'. Maybe something like:
result = df.groupby(??, axis=1).sum()
returning
result = pd.DataFrame({'A':[3,3,4], 'B':[1,1,0]})
Use groupby + sum on the columns (the axis=1 is important here):
df.groupby(df.columns.map(category.get), axis=1).sum()
A B
0 3 1
1 3 1
2 4 0
After reindex you can assign the categories to the columns of df:
df=df.reindex(columns=category.index)
df.columns=category
df.groupby(df.columns.values,axis=1).sum()
Out[1255]:
A B
0 3 1
1 3 1
2 4 0
Or use pd.Series.get:
df.groupby(category.get(df.columns),axis=1).sum()
Out[1262]:
A B
0 3 1
1 3 1
2 4 0
Here is what I did to group a dataframe with similar column names:
data_df:
1 1 2 1
q r f t
Code:
df_grouped = data_df.groupby(data_df.columns, axis=1).agg(lambda x: ' '.join(x.values))
df_grouped:
1 2
q r t f
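Note that axis=1 in groupby is deprecated in recent pandas (2.1+); the same column-wise sum can be sketched with a double transpose instead:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 0], 'b': [0, 1, 0], 'c': [1, 0, 0], 'd': [2, 3, 4]})
category = pd.Series(['A', 'B', 'B', 'A'], ['a', 'b', 'c', 'd'])

# transpose so columns become rows, group those rows by category
# (the grouper Series aligns on the transposed index), sum, transpose back
result = df.T.groupby(category).sum().T
print(result)
#    A  B
# 0  3  1
# 1  3  1
# 2  4  0
```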
Let's say I have a dataframe that looks like this:
df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df
Out[92]:
A B
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Assuming that this dataframe already exists, how can I simply add a level 'C' to the column index so I get this:
df
Out[92]:
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I saw SO answers like python/pandas: how to combine two dataframes into one with hierarchical column index?, but these concat different dataframes instead of adding a column level to an already existing dataframe.
As suggested by @StevenG himself, a better answer:
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])
print(df)
# A B
# C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
option 1
set_index and T
df.T.set_index(np.repeat('C', df.shape[1]), append=True).T
option 2
pd.concat, keys, and swaplevel
pd.concat([df], axis=1, keys=['C']).swaplevel(0, 1, 1)
A solution which adds a name to the new level and is easier on the eyes than other answers already presented:
df['newlevel'] = 'C'
df = df.set_index('newlevel', append=True).unstack('newlevel')
print(df)
# A B
# newlevel C C
# a 0 0
# b 1 1
# c 2 2
# d 3 3
# e 4 4
You could just assign the columns like:
>>> df.columns = [df.columns, ['C', 'C']]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Or for unknown length of columns:
>>> df.columns = [df.columns.get_level_values(0), np.repeat('C', df.shape[1])]
>>> df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
Another way for MultiIndex (appending 'E'):
df.columns = pd.MultiIndex.from_tuples(map(lambda x: (x[0], 'E', x[1]), df.columns))
A B
E E
C D
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
I like it explicit (using MultiIndex) and chain-friendly (.set_axis):
df.set_axis(pd.MultiIndex.from_product([df.columns, ['C']]), axis=1)
This is particularly convenient when merging DataFrames with different column level numbers, where Pandas (1.4.2) raises a FutureWarning (FutureWarning: merging between different levels is deprecated and will be removed ... ):
import pandas as pd
df1 = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df2 = pd.DataFrame(index=list('abcde'), data=range(10, 15), columns=pd.MultiIndex.from_tuples([("C", "x")]))
# df1:
A B
a 0 0
b 1 1
# df2:
C
x
a 10
b 11
# merge while giving df1 another column level:
pd.merge(df1.set_axis(pd.MultiIndex.from_product([df1.columns, ['']]), axis=1),
df2,
left_index=True, right_index=True)
# result:
A B C
x
a 0 0 10
b 1 1 11
Another method, but using a list comprehension of tuples as the arg to pandas.MultiIndex.from_tuples():
df.columns = pd.MultiIndex.from_tuples([(col, 'C') for col in df.columns])
df
A B
C C
a 0 0
b 1 1
c 2 2
d 3 3
e 4 4
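Whichever variant you pick, the new level behaves like any other MultiIndex level afterwards; for example you can select through it with xs (a quick sketch, rebuilding df with the added 'C' level):

```python
import pandas as pd

df = pd.DataFrame(index=list('abcde'), data={'A': range(5), 'B': range(5)})
df.columns = pd.MultiIndex.from_product([df.columns, ['C']])

# select all columns whose second level equals 'C', dropping that level
sub = df.xs('C', axis=1, level=1)
print(sub)
#    A  B
# a  0  0
# b  1  1
# c  2  2
# d  3  3
# e  4  4
```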
Consider this:
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df
Out[128]:
B C
0 a 1
1 a 2
2 b 6
3 b 2
I want to create a variable that simply corresponds to the ordering of observations after sorting by 'C' within each groupby('B') group.
df.sort_values(['B','C'])
Out[129]:
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
How can I do that? I am thinking about creating a column of ones and using cumsum, but that seems too clunky...
I think you can use range with len(df):
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3],
'B': ['a', 'a', 'b'],
'C': [5, 3, 2]})
print(df)
A B C
0 1 a 5
1 2 a 3
2 3 b 2
df.sort_values(by='C', inplace=True)
#or without inplace
#df = df.sort_values(by='C')
print(df)
A B C
2 3 b 2
1 2 a 3
0 1 a 5
df['order'] = range(1,len(df)+1)
print(df)
A B C order
2 3 b 2 1
1 2 a 3 2
0 1 a 5 3
EDIT by comment:
I think you can use groupby with cumcount:
import pandas as pd
df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6,2]})
df.sort_values(['B','C'], inplace=True)
#or without inplace
#df = df.sort_values(['B','C'])
print(df)
B C
0 a 1
1 a 2
3 b 2
2 b 6
df['order'] = df.groupby('B', sort=False).cumcount() + 1
print(df)
B C order
0 a 1 1
1 a 2 2
3 b 2 1
2 b 6 2
Nothing wrong with Jezrael's answer but there's a simpler (though less general) method in this particular example. Just add groupby to JohnGalt's suggestion of using rank.
>>> df['order'] = df.groupby('B')['C'].rank()
B C order
0 a 1 1.0
1 a 2 2.0
2 b 6 2.0
3 b 2 1.0
In this case, you don't really need the ['C'] but it makes the ranking a little more explicit and if you had other unrelated columns in the dataframe then you would need it.
But if you are ranking by more than 1 column, you should use Jezrael's method.
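One caveat with rank: it returns floats, and tied values share a rank by default. If you want the same distinct integer ordering that cumcount gives, method='first' breaks ties by position; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'B': ['a', 'a', 'b', 'b'], 'C': [1, 2, 6, 2]})

# method='first' assigns distinct ranks even for tied C values,
# in order of appearance within each group
df['order'] = df.groupby('B')['C'].rank(method='first').astype(int)
print(df)
#    B  C  order
# 0  a  1      1
# 1  a  2      2
# 2  b  6      2
# 3  b  2      1
```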