Extracting top-N occurrences in a grouped dataframe using pandas - python

I've been trying to find out the top-3 highest frequency restaurant names under each type of restaurant
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1=df.groupby(['rest_type','name']).agg('count')
datas=df_1.groupby(['rest_type'], as_index=False).apply(lambda x : x.sort_values(by="url",ascending=False).head(3))
['url'].reset_index().rename(columns={'url':'count'})
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to groupby using rest_type again for datas variable after grouping it earlier. Should it not give the missing column error? The second groupby operation is a bit confusing to me.
What does the first formulated column level_0 signify? I tried the code with as_index=True and it created an index and column pertaining to rest_type so I couldn't reset the index. Output below:
Thank you

You can use groupby a second time as it is present in the index which is recognized by groupby.
level_0 comes from the reset_index command because you index is unnamed.
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
'name': random.choices('abcdef', k=20),
'url': range(20), # looks like this is a unique identifier
})
def tops(s, n=3):
return s.value_counts().sort_values(ascending=False).head(n)
df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
.apply(lambda x: x['name'].value_counts().nlargest(3))
.reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)

I have a similar case where the above query looks working partially. In my case the cooccurrence value is coming as 1 always.
Here in my input data frame.
And my query is below
top_five_family_cooccurence_df = (common_top25_cooccurance1_df.groupby('family') .apply(lambda x: x['related_family'].value_counts().nlargest(5)) .reset_index().rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'}) )
I am getting result as
Where as The cooccurrence is always giving me 1.

Related

Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
for i in range(0, len(df)):
if(df_stations['code'][j] == df['start_station_code'][i]):
df['start_station'][i] = df_stations['name'][j]
if(df_stations['code'][j] == df['end_station_code'][i]):
df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is equivalent to LEFT JOIN:
cols = ["code", "name"]
result = (
second_df
.merge(first_df[cols], left_on="start_station_code", right_on="code")
.merge(first_df[cols], left_on="end_station_code", right_on="code")
.rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by #Code-Different is very nearly correct. However the columns to be renamed are the name columns not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes df and df_station the code needed to produce df_required is:
cols = ["code", "name"]
required_df = (
df
.merge(df_stations[cols], left_on="start_station_code", right_on="code")
.merge(df_stations[cols], left_on="end_station_code", right_on="code")
.rename(columns={"name_x": "start_station", "name_y": "end_station"})
.drop(columns = ['code_x', 'code_y'])
)
As you may notice the merge means that the dataframe acquires duplicate 'code' columns which get suffixed automatically, this is a built in default of the merge command. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.

Combining dummies and count for pandas dataframe

I have a pandas dataframe like this:
as a plain text:
{'id;sub_id;value;total_stuff related to id and sub_id':
['aaa;1;cat;10', 'aaa;1;cat;10', 'aaa;1;dog;10', 'aaa;2;cat;7',
'aaa;2;dog;7', 'aaa;3;cat;5', 'bbb;1;panda;20', 'bbb;1;cat;20',
'bbb;2;panda;12']}
The desired output I want is this.
Note that there are many different "values" possible, so I would need to automate the creation of dummies variables (nb_animals).
But these dummies variables must contain the number of occurences by id and sub_id.
The total_stuff is always the same value for a given id/sub_id combination.
I've tried using get_dummies(df, columns = ['value']), which gave me this table.
using get_dummies
as a plain text:
{'id;sub_id;value_cat;value_dog;value_panda;total_stuff related to id
and sub_id': ['aaa;1;2;1;0;10', 'aaa;1;2;1;0;10', 'aaa;1;2;1;0;10',
'aaa;2;1;1;0;7', 'aaa;2;1;1;0;7', 'aaa;3;1;0;0;5', 'bbb;1;1;0;1;20',
'bbb;1;1;0;1;20', 'bbb;2;0;0;1;12']}
I'd love to use some kind of df.groupby(['id','sub_id']).agg({'value_cat':'sum', 'value_dog':'sum', ... , 'total_stuff':'mean'}), but writing all of the possible animal values would be too tedious.
So how to get a proper aggregated count/sum for values, and average for total_stuff (since total_stuff is unique per id/sub_id combination)
Thanks
EDIT : Thanks chikich for the neat answer. The agg_dict is what I needed
Use pd.get_dummies to transform categorical data
df = pd.get_dummies(df, prefix='nb', columns='value')
Then group by id and subid
agg_dict = {key: 'sum' for key in df.columns if key[:3] == 'nb_'}
agg_dict['total_stuff'] = 'mean'
df = df.groupby(['id', 'subid']).agg(agg_dict).reset_index()

calculate sum of rows in pandas dataframe grouped by date

I have a csv that I loaded into a Pandas Dataframe.
I then select only the rows with duplicate dates in the DF:
df_dups = df[df.duplicated(['Date'])].copy()
I'm trying to get the sum of all the rows with the exact same date for 4 columns (all float values), like this:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].sum()
However, this does not give the desired result. When I examine df_sum.groups, I've noticed that it did not include the first date in the indices. So for two items with the same date, there would only be one index in the groups object.
pprint(df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].groups)
I have no idea how to get the sum of all duplicates.
I've also tried:
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())
This gives the same result, which makes sense I guess, as the indices in the groupby object are not complete. What am I missing here?
Check the documentation for the method duplicated. By default duplicates are marked with True except for the first occurence, which is why the first date is not included in your sums.
You only need to pass in keep=False in duplicated for your desired behaviour.
df_dups = df[df.duplicated(['Date'], keep=False)].copy()
After that the sum can be calculated properly with the expression you wrote
df_sum = df_dups.groupby('Date')["Received Quantity","Sent Quantity","Fee Amount","Market Value"].apply(lambda x : x.sum())

Python Pandas DataFrame - How to sum values in 1 column based on partial match in another column (date type)?

I have encountered some issues while processing my dataset using Pandas DataFrame.
Here is my dataset:
My data types are displayed below:
My dataset is derived from:
MY_DATASET = pd.read_excel(EXCEL_FILE_PATH, index_col = None, na_values = ['NA'], usecols = "A, D")
I would like to sum all values in the "NUMBER OF PEOPLE" column for each month in the "DATE" column. For example, all values in "NUMBER OF PEOPLE" column would be added as long as the value in the "DATE" column was "2020-01", "2020-02" ...
However, I am stuck since I am unsure how to use the .groupby on partial match.
After 1) is completed, I am also trying to convert the values in the "DATE" column from YYYY-MM-DD to YYYY-MMM, like 2020-Jan.
However, I am unsure if there is such a format.
Does anyone know how to resolve these issues?
Many thanks!
Check
s = df['NUMBER OF PEOPLE'].groupby(pd.to_datetime(df['DATE'])).dt.strftime('%Y-%b')).sum()
You can get an abbeviated month name using strftime('%b') but the month name will be all in lowercase:
df['group_time'] = df.date.apply(lambda x: x.strftime('%Y-%B'))
If you need the first letter of the month in uppercase, you could do something like this:
df.group_date = df.group_date.apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}'
# or in one step:
df['group_date']= df.date.apply(lambda x: x.strftime('%Y-%B')).apply(lambda x: f'{x[0:5]}
...: {x[5].upper()}{x[6:]}')
Now you just need to .groupby and .sum():
result = df['NUMBER OF PEOPLE'].groupby(df.group_date).sum()
I did some tinkering around and found that this worked for me as well:
Cheers all

Sorting by columns after grouping generating error

Could anyone please tell me why sorting is generating an error here? I suspect it is related to indexing but reset_index didnt solve the issue
df['s'] = df.groupby(['ID','Date'],as_index=False)['Text_Data']\
.transform(lambda x : ' '.join(x))\
.sort_values(['ID','Date']) .
KeyError: ('ID', 'Date')
What I was trying to do is to sort the dataframe regardless grouping. In R you would do ungroup() first not sure anything simliar is necessary in Pyhton? Thanks
df.groupby(['ID','Date'],as_index=False)['Text_Data'].transform(lambda x : ' '.join(x))
This above code will give you a Pandas Series which consists of only one column Text_Data. But when you apply sort_values(['ID','Date']), this generates an error because there are no ID and Date Columns present here.
You can separately sort your dataframe and transformed your column into Series. Then, delete that column from dataframe and append the transformed column to it like this,
df = df.sort_values(['ID','Date'])
df['s'] = df.groupby(['ID','Date'],as_index=False)['Text_Data'].transform(lambda x : ' '.join(x))
del df['Text_Data']
df['Text_Data] = df['s'].values

Categories

Resources