return highest frequency using pandas - python

I have a dataframe
name country gender
Ada US 1
Aby UK 0
Alan US 0
Eli US 1
Eddy US 1
Bing NW 0
Bing US 1
Eli UK 0
Eli US 0
Alan US 1
Ada UK 0
Some names are assigned different genders and countries. E.g. Eli appears with US and 1, and also with UK and 0.
I have used
groupby('name')['gender']
groupby('name')['country']
After the groupby, I am hoping to return the "gender" and "country" with the highest frequency. For example, if Eli has two US and one UK, then the country should be US. Same rule applies to gender.
For gender I used a >= 0.5 rule:
df = df_inv.groupby('name')['gender'].mean()
df = df.reset_index()
df['gender'] = (df['gender'] >= 0.5).astype(int)
Is there an easier way to write this code? Also, is there a solution for a categorical variable like country?

You should group by two properties (name and country/gender), build a table, and choose the column with the maximum value in each row:
df.groupby(['name','country']).size().unstack().idxmax(1)
#name
#Aby UK
#Ada UK
#Alan US
#Bing NW
#Eddy US
#Eli US
df.groupby(['name','gender']).size().unstack().idxmax(1)
#name
#Aby 0
#Ada 0
#Alan 0
#Bing 0
#Eddy 1
#Eli 0
You can later join the results if you want.
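For example, a minimal sketch of combining the two results into a single frame (assuming the original frame is called df, as above; the names country, gender and result are only illustrative):
import pandas as pd
# Compute both idxmax results, then line them up on the shared 'name' index.
country = df.groupby(['name', 'country']).size().unstack().idxmax(1)
gender = df.groupby(['name', 'gender']).size().unstack().idxmax(1)
result = pd.concat({'country': country, 'gender': gender}, axis=1).reset_index()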

We can do a groupby and aggregate with the mode function via agg:
df = df.groupby('name').agg({'country':lambda x : x.mode()[0],'gender':lambda x : int(x.mean()>0.5)})
Out[154]:
country gender
name
Aby UK 0
Ada UK 0
Alan US 0
Bing NW 0
Eddy US 1
Eli US 0
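If you prefer named aggregation, a sketch of the same idea (the keyword names and the out variable are just illustrative):
out = df.groupby('name').agg(
    country=('country', lambda s: s.mode()[0]),        # most frequent country per name
    gender=('gender', lambda s: int(s.mean() > 0.5)),  # majority vote on the 0/1 gender flag
)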

Looks like this will do the work... please check and confirm:
a = df.groupby('name')['gender'].max().to_frame().reset_index()
b = df.groupby('name')['country'].max().to_frame().reset_index()
df = b
df['gender'] = a['gender']
del a, b
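The same idea can also be written with an explicit merge instead of positional column assignment; a sketch, assuming both groupbys share the same 'name' keys:
a = df.groupby('name', as_index=False)['gender'].max()
b = df.groupby('name', as_index=False)['country'].max()
# Merging on 'name' avoids relying on the two frames having identical row order.
result = b.merge(a, on='name')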

Related

How can I count # of occurrences of more than one column (e.g. city & country)?

Given the following data ...
city country
0 London UK
1 Paris FR
2 Paris US
3 London UK
... I'd like a count of each city-country pair
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
The following works but feels like a hack:
df = pd.DataFrame([('London', 'UK'), ('Paris', 'FR'), ('Paris', 'US'), ('London', 'UK')], columns=['city', 'country'])
df.assign(**{'n': 1}).groupby(['city', 'country']).count().reset_index()
I'm assigning an additional column n of all 1s, grouping on city & country, and then count()ing occurrences of this new 'all 1s' column. It works, but adding a column just to count it feels wrong.
Is there a cleaner solution?
There is a better way: use value_counts.
df.value_counts().reset_index(name='n')
city country n
0 London UK 2
1 Paris FR 1
2 Paris US 1
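If DataFrame.value_counts isn't available in your pandas version, a groupby-based equivalent (a sketch using the same df; the out name is illustrative):
# Count each city/country pair and turn the counts back into a column.
out = df.groupby(['city', 'country']).size().reset_index(name='n')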

Counting number of values in a column using groupby on a specific condition in pandas

I have a dataframe which looks something like this:
dfA
name field country action
Sam elec USA POS
Sam elec USA POS
Sam elec USA NEG
Tommy mech Canada NEG
Tommy mech Canada NEG
Brian IT Spain NEG
Brian IT Spain NEG
Brian IT Spain POS
I want to group the dataframe based on the first 3 columns, adding a new column "No_of_data". This is what I do:
dfB = dfA.groupby(["name", "field", "country"], dropna=False).size().reset_index(name = "No_of_data")
This gives me a new dataframe which looks something like this:
dfB
name field country No_of_data
Sam elec USA 3
Tommy mech Canada 2
Brian IT Spain 3
But now I also want to add a new column to this dataframe which tells me the count of "POS" for every combination of "name", "field" and "country". It should look something like this:
dfB
name field country No_of_data No_of_POS
Sam elec USA 3 2
Tommy mech Canada 2 0
Brian IT Spain 3 1
How do I add the new column (No_of_POS) to dfB when it doesn't contain the "POS"/"NEG" information, which needs to be taken from dfA?
You can pass named aggregations to the agg method:
dfA.groupby(["name", "field", "country"], as_index=False)['action']\
   .agg(No_of_data='size', No_of_POS=lambda x: x.eq('POS').sum())
You can precompute the boolean before aggregating; performance should be better as the data size increases:
(dfA.assign(action=dfA.action.eq('POS'))
    .groupby(['name', 'field', 'country'],
             sort=False,
             as_index=False)
    .agg(no_of_data=('action', 'size'),
         no_of_pos=('action', 'sum'))
)
name field country no_of_data no_of_pos
0 Sam elec USA 3 2
1 Tommy mech Canada 2 0
2 Brian IT Spain 3 1
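Another way to build the same table, shown here only as a sketch, is to pivot the action counts with pd.crosstab and derive the totals from the POS/NEG columns (the column handling below is just one possible arrangement):
import pandas as pd
# Count POS/NEG per (name, field, country) combination.
counts = pd.crosstab([dfA['name'], dfA['field'], dfA['country']], dfA['action'])
dfB = (counts.assign(No_of_data=counts.sum(axis=1))
             .rename(columns={'POS': 'No_of_POS'})
             .reset_index()[['name', 'field', 'country', 'No_of_data', 'No_of_POS']])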
You can apply an aggregation function when you group your data. Check the agg() function; maybe this will help.

How to groupby a dataframe when a row value is not present?

I have a dataframe of country-wise open and solved complaints:
country complaints status
india network issue solved
usa internet speed issue open
uk network issue open
india internet speed issue solved
usa network issue open
uk voice issue solved
I wanted to group by country where status is open.
I tried
df = df[df.status=="open"]
then
df.groupby("country", as_index=True).count()
The output I got is
country complaints
usa 2
uk 1
but the output I want is
country complaints
usa 2
uk 1
india 0
Since india has no open complaints, it does not appear after the groupby. How can I take the data in a way such that the groupby also returns india with a value of 0?
You can do:
df['status'].eq('open').groupby(df['country']).sum()
Output:
country
india 0
uk 1
usa 2
Name: status, dtype: int64
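Another possibility (not from the answer above, just a sketch) is to keep the filter-then-count style and bring the missing countries back with reindex, starting from the unfiltered dataframe:
# Count open complaints per country, then reindex over every country in the
# original data so countries with no open complaints appear with a count of 0.
out = (df[df.status == 'open']
       .groupby('country').size()
       .reindex(df['country'].unique(), fill_value=0))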
If you only want to count the country values, then instead of using groupby you can use the value_counts() method, which counts the values of a categorical column. For the zero-count countries to show up, the column needs to be a categorical dtype, so your code will look like this:
df['country'] = df['country'].astype('category')
df = df[df.status=="open"]
df.country.value_counts()
Then your output will look like this:
usa      2
uk       1
india    0

Creating new variable by aggregation in python

I'm pretty new to python and pandas, and know only the basics. Nowadays I'm conducting a research and I need your kind help.
Let’s say I have data on births, containing 2 variables: Date and Country.
Date Country
1.1.20 USA
1.1.20 USA
1.1.20 Italy
1.1.20 England
2.1.20 Italy
2.1.20 Italy
3.1.20 USA
3.1.20 USA
Now I want to create a third variable, let’s call it ‘Births’, which contains the number of births in a country on a given date. In other words, I want to keep just one row for each date+country combination, with a count of how many rows it had, so I end up with something like this:
Date Country Births
1.1.20 USA 2
1.1.20 Italy 1
1.1.20 England 1
2.1.20 Italy 2
3.1.20 USA 2
I’ve tried many things, but nothing seemed to work. Any help will be much appreciated.
Thanks,
Eran
I guess you can use the groupby method of your DataFrame, then use the size method to count the number of individuals in each group:
df.groupby(by=['Date', 'Country']).size().reset_index(name='Births')
Output:
Date Country Births
0 1.1.20 England 1
1 1.1.20 Italy 1
2 1.1.20 USA 2
3 2.1.20 Italy 2
4 3.1.20 USA 2
Also, the pandas documentation has several examples related to group-by operations: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html
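On newer pandas versions, a value_counts-based sketch gives the same counts (sorted by count by default; the births name is illustrative):
# Count rows per (Date, Country) pair and turn the result back into columns.
births = df.value_counts(['Date', 'Country']).reset_index(name='Births')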

Pandas: Delete rows of a DataFrame if total count of a particular column occurs only 1 time

I'm looking to delete rows of a DataFrame if the value in a particular column occurs only 1 time.
Example of raw table (values are arbitrary for illustrative purposes):
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
9 Bolivia #McDonalds 3456
10 Kenya #Schools 3455
11 Ukraine #Cars 3456
12 US #Tshirts 3456789
Intended outcome:
print df
Country Series Value
0 Bolivia Population 123
1 Kenya Population 1234
2 Ukraine Population 12345
3 US Population 123456
5 Bolivia GDP 23456
6 Kenya GDP 234567
7 Ukraine GDP 2345678
8 US GDP 23456789
I know that df.Series.value_counts()>1 will identify which df.Series values occur more than 1 time, and that the returned output will look something like the following:
Population     True
GDP            True
#McDonalds    False
#Schools      False
#Cars         False
#Tshirts      False
I want to write something like the following so that my new DataFrame drops the rows whose df.Series value occurs only 1 time, but this doesn't work:
df.drop(df.Series.value_counts()==1,axis=1,inplace=True)
You can do this by creating a boolean list/array, using either a list comprehension or pandas' string manipulation methods.
The list comprehension approach is:
vc = df['Series'].value_counts()
u = [i not in set(vc[vc==1].index) for i in df['Series']]
df = df[u]
The other approach is to use the str.contains method to check whether the values of the Series column contain a given string or match a given regular expression (used in this case as we are using multiple strings):
vc = df['Series'].value_counts()
pat = r'|'.join(vc[vc==1].index) #Regular expression
df = df[~df['Series'].str.contains(pat)] #Tilde is to negate boolean
Using this regular expressions approach is a bit more hackish and may require some extra processing (character escaping, etc) on pat in case you have regex metacharacters in the strings you want to filter out (which requires some basic regex knowledge). However, it's worth noting this approach is about 4x faster than using the list comprehension approach (tested on the data provided in the question).
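For the escaping caveat mentioned above, a small sketch using re.escape on the same vc (illustrative only):
import re
# Escape any regex metacharacters in the values before joining them into a pattern.
pat = r'|'.join(map(re.escape, vc[vc == 1].index))
df = df[~df['Series'].str.contains(pat)]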
As a side note, I recommend avoiding using the word Series as a column name as that's the name of a pandas object.
This is an old question, but the current answer doesn't work for any moderately large dataframe. A much faster and more "dataframe" way is to add a value-count column and filter on the count.
Create the dataset:
df = pd.DataFrame({'Country': 'Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US Bolivia Kenya Ukraine US'.split(),
                   'Series': 'Pop Pop Pop Pop GDP GDP GDP GDP McDonalds Schools Cars Tshirts'.split()})
Drop rows that have a count of 1 for the column ('Series' in this case):
# Group values for Series and add 'cnt' column with count
df['cnt'] = df.groupby(['Series'])['Country'].transform('count')
# Drop indexes for count value == 1, and dropping 'cnt' column
df.drop(df[df.cnt==1].index)[['Country','Series']]
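A small variation on the same idea (just a sketch): use the transform result directly as a boolean mask, which avoids adding and then dropping the helper column:
# Keep only rows whose 'Series' value appears more than once.
df = df[df.groupby('Series')['Series'].transform('size') > 1]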
