Count the occurrences of entire rows in a pandas DataFrame - python

I need to count the occurrences of entire rows in a pandas DataFrame.
For example, if I have the DataFrame:
A = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
The expected result should be:
'a','b','c' : 2
'b','a','c' : 1
value_counts only counts the occurrences of one element in a Series (one column).
I could create a new column that concatenates all the elements of each row and count values in that column, but I hope for a better solution.

You can do this:
A = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
A.groupby(A.columns.tolist(), as_index=False).size()
which returns:
0 1 2 size
0 a b c 2
1 b a c 1
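If your pandas is 1.1 or newer, DataFrame.value_counts counts unique rows directly; a minimal sketch with the same data:

import pandas as pd

A = pd.DataFrame([['a', 'b', 'c'], ['b', 'a', 'c'], ['a', 'b', 'c']])

# Counts unique rows; the result is a Series keyed by a MultiIndex of row values.
counts = A.value_counts()
print(counts)                            # (a, b, c) -> 2, (b, a, c) -> 1
print(counts.reset_index(name='count'))  # flatten back to a DataFrame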

The code below produces the result you are expecting.
df = pd.DataFrame([['a','b','c'],['b','a','c'],['a','b','c']])
df.groupby(list(df.columns))[0].count().to_frame('count').reset_index()
The above code results in the following DataFrame:
0 1 2 count
0 a b c 2
1 b a c 1

Related

How to set value of first row of pandas dataframe meeting condition?

I want to update the first row of a dataframe that meets a certain condition, like in the question "Get first row of dataframe in Python Pandas based on criteria", but for setting instead of just selecting.
df[df['Qty'] > 0].iloc[0] = 5
The above line does not seem to do anything.
Given df below:
a b
0 1 2
1 2 1
2 3 1
You can change the values in the first row where the value in column b equals 1 with:
df.loc[df[df['b'] == 1].index[0]] = 1000
Output:
a b
0 1 2
1 1000 1000
2 3 1
If you want to change the value in specific column(s), you can do that too:
df.loc[df[df['b'] == 1].index[0],'a'] = 1000
a b
0 1 2
1 1000 1
2 3 1
I believe what you're looking for is:
idx = df[df['Qty'] > 0].index[0]
df.loc[[idx], ['Qty']] = 5
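As for why the original attempt does nothing: df[df['Qty'] > 0] builds a filtered copy, so .iloc[0] = 5 writes into that copy and never reaches df (pandas typically emits a SettingWithCopyWarning here). A minimal sketch, assuming a toy Qty column:

import pandas as pd

df = pd.DataFrame({'Qty': [0, 3, 7]})

# Chained indexing: the boolean filter returns a copy, so this assignment
# modifies a temporary object and df is left untouched.
df[df['Qty'] > 0].iloc[0] = 5

# Resolve the target label first, then assign through .loc on df itself.
idx = df[df['Qty'] > 0].index[0]
df.loc[idx, 'Qty'] = 5
print(df)  # the row where Qty was 3 now holds 5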

Get the count of values in a column of the dataframe and set the remaining rows to 1

I have a DataFrame with the occ column empty:
ID occ
a
a
b
a
b
c
Now I want the occ column to hold the count of occurrences of each ID, showing the count only in the first row while the rest remain "1".
Expected result:
ID occ
a 3
a 1
b 2
a 1
b 1
c 1
Here, 'a' occurs 3 times, 'b' 2 times, and 'c' once.
All other rows for a and b should show 1.
I got the counts with:
df['ID'].value_counts()
but it throws an error when I try to put it into the DataFrame using:
df['occ'] = df['ID'].value_counts()
TypeError: unhashable type: 'list'
When creating the occ column, assign 1 as the initial value, then use pd.DataFrame.duplicated with keep='first' to build a mask for the first occurrence of each value, and assign the counts:
df['occ'] = 1
df.loc[~df.duplicated(keep='first'), 'occ'] = df['ID'].value_counts().values
Output:
ID occ
0 a 3
1 a 1
2 b 2
3 a 1
4 b 1
5 c 1
PS: This may fail when the first occurrences of the values in the ID column are not in sorted order; in that scenario you may want to sort the ID column first using df.sort_values(by=['ID'], inplace=True, ignore_index=True), or assign the counts selectively by comparing the values in the ID column, as sketched below.
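An order-independent variant of the same idea is to map each first occurrence to its own count instead of relying on the ordering of value_counts; a sketch, assuming the same df:

import pandas as pd

df = pd.DataFrame({'ID': ['a', 'a', 'b', 'a', 'b', 'c']})

counts = df['ID'].value_counts()  # a: 3, b: 2, c: 1
first = ~df['ID'].duplicated()    # True only at each ID's first row

df['occ'] = 1
# map() looks each first-occurrence ID up in counts, so row order never matters.
df.loc[first, 'occ'] = df.loc[first, 'ID'].map(counts)
print(df)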

Pandas df.mode with multiple values per cell in column

I have a dataframe with a Keywords column. Each cell in that column has 5-10 individual comma-separated values consisting of 1-3 words each. How can I count the most frequently occurring keywords in the column?
I have tried df.Keywords.mode but it returns all values for each cell because they obviously don't occur multiple times within each cell.
All input is appreciated, thanks!
First use Series.str.split with expand=True to get a DataFrame and reshape it with DataFrame.stack, then count with Series.value_counts and take the top values with Series.head:
df = pd.DataFrame({'Keywords':['aa,bb,vv,vv','aa,aa,cc,bb','zz,bb,aa,ss']})
N = 5
df1 = (df.Keywords.str.split(',', expand=True)
         .stack()
         .value_counts()
         .head(N)
         .rename_axis('val')
         .reset_index(name='count'))
print(df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
Another solution, if there are no missing values, is to flatten the split lists and count with Counter:
from collections import Counter
N = 5
df1 = pd.DataFrame(Counter([y for x in df.Keywords for y in x.split(',')]).most_common(N),
                   columns=['val', 'count'])
print(df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
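If your pandas is 0.25 or newer, Series.explode avoids the expand/stack detour entirely; a sketch with the same sample data:

import pandas as pd

df = pd.DataFrame({'Keywords': ['aa,bb,vv,vv', 'aa,aa,cc,bb', 'zz,bb,aa,ss']})
N = 5

# split -> one list per cell, explode -> one keyword per row, then count.
df1 = (df.Keywords.str.split(',')
         .explode()
         .value_counts()
         .head(N)
         .rename_axis('val')
         .reset_index(name='count'))
print(df1)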

How to count number of records in a group and save them in a csv file?

I have a dataset as below:
import pandas as pd
dict = {"A":[1,1,1,1,5],"B":[1,1,2,4,1]}
dt = pd.DataFrame(data=dict)
so, it is as below:
A B
1 1
1 1
1 2
1 4
5 1
I need to apply a groupby based on A and B and count how many records each group has.
I have applied the solution below:
dtSize = dt.groupby(by=["A","B"], as_index=False).size()
dtSize.to_csv("./datasets/Final DT/dtSize.csv", sep=',', encoding='utf-8', index=False)
I have 2 problems:
When I open the saved file, it only contains the last column (the number of elements in each group), but it does not include the group columns.
When I print the final dtSize, it looks like this:
A  B
1  1    2
   2    1
   4    1
5  1    1
dtype: int64
so some similar records in A seem to be missing.
My desired output in a .csv file is as below:
A B Number of elements in group
1 1 2
1 2 1
1 4 1
5 1 1
Actually, data from A isn't missing. GroupBy.size returns a Series, so A and B are used as a MultiIndex. Due to this, repeated values for A in the first three rows aren't printed.
You're close. You need to reset the index and, optionally, name the result:
dt.groupby(['A', 'B']).size().reset_index(name='Size')
The result is:
A B Size
0 1 1 2
1 1 2 1
2 1 4 1
3 5 1 1
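Putting the fix together with the save step from the question (a sketch; the 'Number of elements in group' column name is taken from the desired output, and the path is up to you):

import pandas as pd

dt = pd.DataFrame({"A": [1, 1, 1, 1, 5], "B": [1, 1, 2, 4, 1]})

dtSize = dt.groupby(['A', 'B']).size().reset_index(name='Number of elements in group')
# index=False keeps the CSV to exactly the three data columns.
dtSize.to_csv('dtSize.csv', sep=',', encoding='utf-8', index=False)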

Repeating particular row of pandas dataframe

I want to repeat specific rows of a pandas DataFrame a given number of times.
For example, this is my DataFrame:
df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '2', '3'],
    'val': ['2015_11', '2016_2', '2011_9', '2011_11', '2012_2', '2018_2'],
    'data': ['a', 'a', 'b', 'b', 'b', 'c']
})
print(df)
Here, "Val" column contains date in string format. It has a specific pattern 'Year_month'. For the same "id", I want the rows repeated the number of times that is equivalent to the difference between the given "val" column values. All other columns except the val column should have the duplicated value of previous row.
The output should be:
Using resample:
df.val = pd.to_datetime(df.val, format='%Y_%m')
# 'M' = month-end frequency ('ME' in pandas 2.2+); ffill repeats each row's data forward.
out = df.set_index('val').groupby('id').data.resample('M').ffill().reset_index()
out = out.assign(val=out.val.dt.strftime('%Y_%m'))
print(out)
id val data
0 1 2015_11 a
1 1 2015_12 a
2 1 2016_01 a
3 1 2016_02 a
4 2 2011_09 b
5 2 2011_10 b
6 2 2011_11 b
7 2 2011_12 b
8 2 2012_01 b
9 2 2012_02 b
10 3 2018_02 c
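If you prefer to literally repeat rows rather than resample, a sketch using Index.repeat with monthly Periods (same df as above; like the resample answer, it assumes each id's rows are sorted by val):

import pandas as pd

df = pd.DataFrame({
    'id': ['1', '1', '2', '2', '2', '3'],
    'val': ['2015_11', '2016_2', '2011_9', '2011_11', '2012_2', '2018_2'],
    'data': ['a', 'a', 'b', 'b', 'b', 'c']
})

# Integer month numbers make the month differences exact.
dates = pd.to_datetime(df['val'], format='%Y_%m')
months = dates.dt.year * 12 + dates.dt.month

# Months until the next row of the same id; the last row of each id is kept once.
reps = (-months.groupby(df['id']).diff(-1)).fillna(1).astype(int)

out = df.loc[df.index.repeat(reps.to_numpy())].copy()
step = out.groupby(level=0).cumcount()  # 0, 1, 2, ... inside each repeated block

# Shift each repeated row's month forward by its position in the block.
periods = pd.to_datetime(out['val'], format='%Y_%m').dt.to_period('M')
out['val'] = (periods + step.to_numpy()).dt.strftime('%Y_%m')
print(out.reset_index(drop=True))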
