Pandas df.mode with multiple values per cell in column - python

I have a dataframe with a Keywords column. Each cell in that column has 5-10 individual values (comma seperated) which consist of 1 - 3 words. How can I count the most occurring keywords in the column?
I have tried df.Keywords.mode but it returns all values for each cell because they obviously don't occur multiple times within each cell.
Here an image to clarify:
All input is appreciated,
Thanks!

First use Series.str.split with expand=True for DataFrame and reshape by DataFrame.stack, then count by Series.value_counts and get top values by Series.head:
df = pd.DataFrame({'Keywords':['aa,bb,vv,vv','aa,aa,cc,bb','zz,bb,aa,ss']})
N = 5
df1 = (df.Keywords.str.split(',', expand=True)
.stack()
.value_counts()
.head(N)
.rename_axis('val')
.reset_index(name='count'))
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
Another solution if no missing values is flatten splitted lists and count by Counter:
from collections import Counter
N = 5
df1 = pd.DataFrame(Counter([y for x in df.Keywords for y in x.split(',')]).most_common(N),
columns=['val','count'])
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1

Related

Count the occurrences of entire row in pandas DataFrame

I need to count the occurrence of entire row in pandas DataFrame
For example if I have the Data Frame:
A = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
The expected result should be:
'a','b','c' : 2
'b','a','c' : 1
value_counts only counts the occurrence of one element in a series (one column)
I can create new column that will have all the element of that row and count values in that column, but I hope for a better solution.
You can do this:
A = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
A.groupby(A.columns.tolist(),as_index=False).size()
which returns:
0 1 2 size
0 a b c 2
1 b a c 1
Below code would provide the result you are expecting.
df = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
df.groupby(list(df.columns))[0].count().to_frame('count').reset_index()
above code results in following dataframe.
0 1 2 count
0 a b c 2
1 b a c 1

Compare columns in Pandas between two unequal size Dataframes for condition check

I have two pandas DF. Of unequal sizes. For example :
Df1
id value
a 2
b 3
c 22
d 5
Df2
id value
c 22
a 2
No I want to extract from DF1 those rows which has the same id as in DF2. Now my first approach is to run 2 for loops, with something like :
x=[]
for i in range(len(DF2)):
for j in range(len(DF1)):
if DF2['id'][i] == DF1['id'][j]:
x.append(DF1.iloc[j])
Now this is okay, but for 2 files of 400,000 lines in one and 5,000 in another, I need an efficient Pythonic+Pnadas way
import pandas as pd
data1={'id':['a','b','c','d'],
'value':[2,3,22,5]}
data2={'id':['c','a'],
'value':[22,2]}
df1=pd.DataFrame(data1)
df2=pd.DataFrame(data2)
finaldf=pd.concat([df1,df2],ignore_index=True)
Output after concat
id value
0 a 2
1 b 3
2 c 22
3 d 5
4 c 22
5 a 2
Final Ouput
finaldf.drop_duplicates()
id value
0 a 2
1 b 3
2 c 22
3 d 5
You can concat the dataframes , then check if all the elements are duplicated or not , then drop_duplicates and keep just the first occurrence:
m = pd.concat((df1,df2))
m[m.duplicated('id',keep=False)].drop_duplicates()
id value
0 a 2
2 c 22
You can try this:
df = df1[df1.set_index(['id']).index.isin(df2.set_index(['id']).index)]

Counting mode occurrences for all columns in a dataframe

I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
now I want to create a dataframe that gives me the count of modes for each column, for column AA the count is 3 for mode 10, for columns CC the count is 4 for mode 0, but for BB there are two modes 1 and 2, so for BB I want the sum of counts for the modes. so for BB the count is 2+2=4, for mode 1 and 2.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Another slightly more scalable solution using list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64
You can compare columns with modes and count matches by sum:
df = pd.DataFrame({'Columns': df.columns,
'Val':[df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
First we get the modes of the columns with DataFrame.mode
Then we compare each column to it's modes and use Series.isin to check the amount of modes and sum these.
modes = df.iloc[:, 1:].mode()
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
Used pyjanitor module to bring in the transform function and return a dataframe:
(df.melt(id_vars='In')
.groupby('variable')
.agg(numbers=('value','value_counts'))
.groupby_agg(by='variable',
#here, it subtracts the max of numbers(for each group) from each
number in the group
agg = lambda x : x - x.max(),
agg_column_name='numbers',
new_column_name = 'test'
)
.query('test==0')
.groupby('variable')
.agg(count=('numbers','sum'))
)
count
variable
AA 3
BB 4
CC 4

How to create pandas dummies based on column values

I would like to create dummies based on column values...
This is what the df looks like
I want to create this
This is so far my approach
import pandas as pd
df =pd.read_csv('test.csv')
v =df.Values
v_set=set()
for line in v:
line=line.split(',')
for x in line:
if x!="":
v_set.add(x)
else:
continue
for val in v_set:
df[val]=''
By the above code I am able to create columns in my df like this
How do I go about updating the row values to create dummies?
This is where I am having problems.
Thanks in advance.
You could use pandas.Series.str.get_dummies. This will alllow you to split the column directly with a delimiter.
df = pd.concat([df.ID, df.Values.str.get_dummies(sep=",")], axis=1)
ID 1 2 3 4
0 1 1 1 0 0
1 2 0 0 1 1
df.Values.str.get_dummies(sep=",") will generate
1 2 3 4
0 1 1 0 0
1 0 0 1 1
Then, we do a pd.concat to glue to df together.

Group by value of sum of columns with Pandas

I got lost in Pandas doc and features trying to figure out a way to groupby a DataFrame by the values of the sum of the columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have columns labels equals to the sum of the columns it summed. Like this :
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the good direction ? Thanks in advance !
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and, c) and 9 (column d) . You want to group the columns (axis=1), and take the sum of each group.
Because pandas is designed with database concepts in mind, it's really expected information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print df.groupby('totals').sum().transpose()
#totals 1 9
#0 2 2
#1 1 3
#2 0 4

Categories

Resources