Pandas group by gender shows more than two groups - python

I have a dataframe that shows each user's rating for a bunch of movies. I wanted to list the movies with the most ratings for each gender.
Here's what I did:
most_rated_gen=lens.groupby(['sex','title']).size().sort_values(ascending=False).to_frame()
I was expecting to see a dataframe that looks something like this:
sex  title
M    A
     B
     C
     D
F    B
     C
     D
     A
Instead, the sex level of the output comes out interleaved: M F M F M. I don't know why. Any ideas how I could fix this?

You can use nlargest() if your aggregated column has a name. Assuming the column is named ratings_count, you can use this code:
most_rated_gen.groupby(['sex'])['ratings_count'].nlargest()
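If the count column is still unnamed (size() produces a column labeled 0), you can name it when building the frame. A minimal sketch reusing the lens frame from the question; the name ratings_count is an assumption:

most_rated_gen = (
    lens.groupby(['sex', 'title'])
        .size()                     # Series of counts per (sex, title)
        .to_frame('ratings_count')  # give the aggregated column a name
)

# top 5 titles per sex (nlargest() defaults to n=5)
most_rated_gen.groupby('sex')['ratings_count'].nlargest()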

As you group by sex, the output will contain the sex column.
You have a shortcut for your operation with value_counts:
df.value_counts(['sex', 'title']).sort_index(kind='mergesort')
If you want the data sorted by index while preserving the count order of the values, use sort_index with kind='mergesort', which is a stable sort.
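A minimal runnable sketch, assuming pandas >= 1.1 (where DataFrame.value_counts accepts a subset of columns):

import pandas as pd

df = pd.DataFrame({
    'sex':   ['M', 'M', 'M', 'F', 'F', 'M'],
    'title': ['A', 'A', 'B', 'B', 'A', 'A'],
})

# counts per (sex, title), then a stable sort on the index
print(df.value_counts(['sex', 'title']).sort_index(kind='mergesort'))
# sex  title
# F    A        1
#      B        1
# M    A        3
#      B        1
# dtype: int64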

Related

establish counts of elements of pandas dataframe

Currently working to implement some fuzzy matching logic to group together emails with similar patterns and I need to improve the efficiency of part of the code but not sure what the best path forward is. I use a package to output a pandas dataframe that looks like this:
I redacted the data, but it's just four columns: an ID, the email associated with that ID, a group ID number that identifies the cluster a given email falls into, and the group rep, which is the most mathematically central email of a given cluster.
What I want to do is count the number of occurrences of each distinct element in the group rep column and create a new dataframe that's just two columns with one column having the group rep email and then the second column having the corresponding count of that group rep in the original dataframe. It should look something like this:
As of now, I'm converting my group reps to a list and then using a for-loop to create a list of tuples (I think?), with each tuple containing a centroid email group identifier and the number of times that identifier occurs in the original df (aka the number of emails in the original data that belong to that centroid email's group). The code looks like this:
groups = list(df['group rep'].unique())
# preparing list of tuples with group count
req_groups = []
for g in groups:
    count = (g, df['group rep'].value_counts()[g])
    # print(count)
    req_groups.append(count)
print(req_groups)
Unfortunately, this operation takes far too long. I'm sure there's a better solution, but could definitely use some help finding a path forward. Thanks in advance for your help!
You can use df.groupby('group rep').count().
Let's consider the following dataframe:
email
0 zucchini#yahoo.fr
1 apple#gmail.com
2 citrus#protonmail.com
3 banana#gmail.com
4 pear#gmail.com
5 apple#gmail.com
6 citrus#protonmail.com
Proposed script
import pandas as pd
import operator

m = {'email': ['zucchini#yahoo.fr', 'apple#gmail.com', 'citrus#protonmail.com', 'banana#gmail.com',
               'pear#gmail.com', 'apple#gmail.com', 'citrus#protonmail.com']}
df = pd.DataFrame(m)

# one-column-per-email frame: each unique email mapped to its number of occurrences
counter = pd.DataFrame.from_dict({c: [operator.countOf(df['email'], c)] for c in df['email'].unique()})
cnt_df = counter.T.rename(columns={0: 'count'})  # transpose to one row per email
print(cnt_df)
Result
count
zucchini#yahoo.fr 1
apple#gmail.com 2
citrus#protonmail.com 2
banana#gmail.com 1
pear#gmail.com 1
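For comparison, value_counts builds the same two-column table directly; a minimal sketch continuing the email example above:

cnt_df = (
    df['email']
      .value_counts()               # Series: email -> count, sorted descending
      .rename_axis('email')
      .reset_index(name='count')    # back to a two-column dataframe
)
print(cnt_df)

Applied to the question's frame, replacing 'email' with 'group rep' should give the requested group-rep/count table without the per-element loop.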

Converting one row in a pandas df to multiple by splitting a specific column

I have a pandas dataframe whose rows are of the format:
Index Name Categories
0 bob a,b,c,d
1 sally b,d,f
etc.
And I want it in this format:
name   categories
bob    a
       b
       c
       d
sally  b
       d
       f
Does anyone know a smart way to split each row into multiple rows on that element's commas?
Cheers
I have considered breaking the database down and rebuilding a new one, and I'm capable of that, but this will ultimately be run on a 12,000,000-row database, so I am looking for a more straightforward approach.
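A minimal sketch of one way to do this without rebuilding the frame row by row, assuming pandas >= 0.25 (which added DataFrame.explode), using the sample rows above:

import pandas as pd

df = pd.DataFrame({'Name': ['bob', 'sally'],
                   'Categories': ['a,b,c,d', 'b,d,f']})

# split each comma-separated string into a list, then emit one row per element
out = (
    df.assign(Categories=df['Categories'].str.split(','))
      .explode('Categories')
      .reset_index(drop=True)
)
print(out)

Both str.split and explode run inside pandas rather than in a Python-level loop, so this should scale far better on 12,000,000 rows.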

pandas select row if value in another column changes

Let's say I have a large data set that follows a similar structure:
where the id repeats multiple times. I would like to select any id where the value in column b changed, with the desired output as such:
How might I be able to achieve that via pandas?
It is not entirely clear what you are asking for. You say
I would like to select any id where the value in column b changed
but 'changed' from what?
Perhaps the following can be helpful -- it will show you all unique colb values for each id.
Using a sample df:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2], 'colb':['a','a','b','c','d','c','c']})
we use groupby and unique:
df.groupby('id')['colb'].unique().explode().to_frame()
output:
colb
id
1 a
1 b
2 c
2 d
So for id=1 we have a and b as unique values, and for id=2 we have c and d.
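If 'changed' means colb is not constant within an id, a hedged variant keeps every row of the ids whose colb takes more than one distinct value (with this sample df, both ids qualify):

# boolean mask: True for rows whose id has more than one distinct colb
changed = df.groupby('id')['colb'].transform('nunique') > 1
print(df[changed])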

Grouped by counts in Python

I am a beginner in Python and trying to learn it. We have a df called allyears that has name, sex, number, and year columns in it.
Something like this:
name  sex  number  year
John  M    1       2010
Jane  F    2       2011
I want to get the top 10 names for a given year with their respective counts. I tried this code, but it is not returning what I am looking for.
males = allyears[(allyears.year>2009)&(allyears.sex=='M')]
maleNameCounts = pd.DataFrame(males.groupby(['year', 'name']).count())
maleNameCounts.sort_values('number', ascending=True)
How should I be approaching this problem?
Hope this helps:
Add a column with counts:
df["name_count"] = df["name"].map(df.name.value_counts())
Optionally remove duplicates:
df = df.drop_duplicates(["name"])
Sort by counts, descending, so the most common names come first:
df = df.sort_values("name_count", ascending=False)
Note that this can all be tweaked where necessary.
You can try the following:
males = allyears[(allyears.year > 2009) & (allyears.sex == 'M')]
maleNameCounts = males.groupby(['year', 'name']).size().nlargest(10).reset_index().rename(columns={0: 'count'})
maleNameCounts
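If the goal is the top 10 within each year rather than 10 rows overall, one possible variant (an assumption about the intent) sorts the counts and takes the first 10 rows per year:

counts = males.groupby(['year', 'name']).size()
top10_per_year = (
    counts.sort_values(ascending=False)  # largest counts first
          .groupby(level='year')
          .head(10)                      # first 10 rows of each year group
)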

How to access index of string value in a cell of pandas data frame?

I'm working with the Bureau of Labor Statistics data which looks like this:
series_id year period value
CES0000000001 2006 M01 135446.0
Characters 3 and 4 of series_id indicate the supersector. For example, CES10xxxxxx01 would be Mining & Logging. There are 15 supersectors that I'm concerned with, and hence I want to create 15 separate data frames, one for each supersector, to perform time series analysis. So I'm trying to access each value as a list to achieve something like:
# *pseudocode*:
mining_and_logging = df[df.series_id[3]==1 and df.series_id[4]==0]
Can I avoid writing a for loop where I convert each value to a list then access by index and add the row to the new dataframe?
How can I achieve this?
One way to do what you want and iteratively store the dataframes through a for loop could be:
First, create an auxiliary column to make your life easier:
df['id'] = df['series_id'].str[3:5]  # extract characters 3 and 4 of every string (counting from zero)
Then, you create an empty dictionary and populate it:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
Now you'll have a dictionary with 15 dataframes inside. For example, if you want to call the dataframe associated with id = 01, you just do:
dict_df['01']
Hope it helps!
Solved it by combining answers from Juan C and G. Anderson.
Select the 3rd and 4th character:
df['id'] = df.series_id.str.slice(start=3, stop=5)
And then the following to create dataframes:
dict_df = {}
for unique_id in df.id.unique():
    dict_df[unique_id] = df[df.id == unique_id]
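With the question's example codes, the Mining & Logging frame would then be retrieved as (assuming '10' is the two-character supersector code from the question):

mining_and_logging = dict_df['10']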
