Groupby giving KeyError - python

I have a dataframe, df, defined as:
Empty DataFrame
Columns: []
Index: [timestamp, device_type, os]
I am trying to group by timestamp and device_type and perform .agg on it, such as:
df.groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
This is giving me a KeyError:
KeyError: 'timestamp'
I have read over pandas documentation but I am unsure where I am going wrong. How can I successfully use groupby?

The error occurs because timestamp, device_type, and os are levels of the df's index, not actual columns.
So you can either reset the index first:
df.reset_index(inplace=True)
df.groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
Or group directly on the index levels:
df.groupby(level=['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})
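For instance, a minimal runnable sketch of both options, assuming a toy frame with a sessions_sum column (the data here is made up for illustration):

import pandas as pd

df = pd.DataFrame(
    {'sessions_sum': [3, 5, 2, 7]},
    index=pd.MultiIndex.from_tuples(
        [('2021-01-01', 'mobile', 'ios'),
         ('2021-01-01', 'desktop', 'win'),
         ('2021-01-02', 'mobile', 'android'),
         ('2021-01-02', 'mobile', 'ios')],
        names=['timestamp', 'device_type', 'os'],
    ),
)

# Option 1: promote the index levels to columns, then group by name.
out1 = df.reset_index().groupby(['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})

# Option 2: group directly on the index levels, no reset needed.
out2 = df.groupby(level=['timestamp', 'device_type']).agg({'sessions_sum': 'sum'})

assert out1.equals(out2)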

Related

Extracting top-N occurrences in a grouped dataframe using pandas

I've been trying to find the top-3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type', 'name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by='url', ascending=False).head(3))
             ['url'].reset_index().rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after grouping by it earlier? Shouldn't that give a missing-column error? The second groupby operation is a bit confusing to me.
What does the first generated column, level_0, signify? I tried the code with as_index=True and it created both an index and a column pertaining to rest_type, so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is still present in the index, which groupby recognizes.
level_0 comes from the reset_index call, because your index is unnamed.
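For instance, a quick illustration of where level_0 comes from (toy data):

import pandas as pd

# Two unnamed index levels; reset_index has to invent names for them:
df = pd.DataFrame({'v': [1, 2]},
                  index=pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')]))
print(df.reset_index().columns.tolist())  # ['level_0', 'level_1', 'v']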
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })
def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)

df.groupby('rest_type')['name'].apply(tops, n=3)
edit: here is an alternative to format the result as a dataframe with informative column names
(df.groupby('rest_type')
.apply(lambda x: x['name'].value_counts().nlargest(3))
.reset_index().rename(columns={'name': 'counts', 'level_1': 'name'})
)
I have a similar case where the above query only works partially: in my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
I am getting the result below, where the cooccurrence is always 1.

Operate on columns in pandas groupby

Assume I have a dataframe df with 4 columns, col = ["id", "date", "basket", "gender"], and a function:
def is_valid_date(df):
    idx = some_scalar_function(df["basket"])  # returns an index
    date = df["date"].values[idx]
    return (date > some_date)
I have always understood groupby as a "creation of a new dataframe" when splitting in the "split-apply-combine" sense (loosely speaking), so if I want to apply is_valid_date to each group of id, I would assume I could do
df.groupby("id").agg(is_valid_date)
but it throws KeyError: 'basket' at the idx = some_scalar_function(df["basket"]) line.
If you use GroupBy.agg, the function is applied to each column separately, so you cannot select other columns like df["basket"] or df["date"] inside it.
The solution is to use GroupBy.apply with your custom function:
df.groupby("id").apply(is_valid_date)

pandas: drop multiple columns whose names are in a list and assign to a new dataframe

I have a dataframe with several columns:
df
pymnt_plan ... settlement_term days
Now I know which columns I want to delete/drop, based on the following list:
mylist = ['pymnt_plan',
'recoveries',
'collection_recovery_fee',
'policy_code',
'num_tl_120dpd_2m',
'hardship_flag',
'debt_settlement_flag_date',
'settlement_status',
'settlement_date',
'settlement_amount',
'settlement_percentage',
'settlement_term']
How do I drop multiple columns whose names are in a list and assign the result to a new dataframe? In this case:
df2
days
You can do:
new_df = df[mylist]
df2 = df.drop(columns=mylist)
In Pandas 0.20.3, using df.drop(columns=mylist) I get:
TypeError: drop() got an unexpected keyword argument 'columns'
So you can use this instead:
df2 = df.drop(axis=1, labels=mylist)
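A minimal sketch of both variants, assuming a toy frame with a few of the listed columns (shortened for illustration):

import pandas as pd

df = pd.DataFrame(columns=['pymnt_plan', 'recoveries', 'settlement_term', 'days'])
mylist = ['pymnt_plan', 'recoveries', 'settlement_term']

df2 = df.drop(columns=mylist)          # modern pandas
df2 = df.drop(axis=1, labels=mylist)   # equivalent, also works on 0.20.x

# errors='ignore' skips names that are not actually in the frame:
df2 = df.drop(columns=mylist + ['not_a_column'], errors='ignore')
print(df2.columns.tolist())  # ['days']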

Pandas .describe() only returning 4 statistics on int dataframe (count, unique, top, freq)... no min, max, etc

Why could this be? My data seems pretty simple and straightforward, it's a 1 column dataframe of ints, but .describe only returns count, unique, top, freq... not max, min, and other expected outputs.
(Note .describe() functionality is as expected in other projects/datasets)
It seems pandas doesn't recognize your data as int.
Try to do this explicitly:
print(df.astype(int).describe())
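As a minimal sketch of the behaviour (toy data): describe() on an object column reports only count/unique/top/freq, while the numeric version reports the full set of statistics.

import pandas as pd

df = pd.DataFrame({'n': ['1', '2', '2', '3']})  # ints stored as strings -> dtype object

print(df['n'].dtype)              # object
print(df.describe())              # count, unique, top, freq only
print(df.astype(int).describe())  # count, mean, std, min, 25%, 50%, 75%, max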
Try:
df.agg(['count', 'nunique', 'min', 'max'])
You can add or remove the different aggregation functions to that list.
And when I have quite a few columns I personally like to transpose it:
df.agg(['count', 'nunique', 'min', 'max']).transpose()
To restrict the aggregation to a subset of columns, there are different ways to do it.
By columns containing a word, for example 'ID':
df.filter(like='ID').agg(['count', 'nunique'])
By type of data:
df.select_dtypes(include=['int']).agg(['count', 'nunique'])
df.select_dtypes(exclude=['float64']).agg(['count', 'nunique'])
Try changing your features into numerical values to return all the statistics you need:
df1['age'] = pd.to_numeric(df1['age'], errors='coerce')
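For example, a small sketch of how errors='coerce' behaves on a column with stray non-numeric entries (data made up for illustration):

import pandas as pd

df1 = pd.DataFrame({'age': ['25', '32', 'unknown', '41']})

# Non-numeric entries become NaN instead of raising an error:
df1['age'] = pd.to_numeric(df1['age'], errors='coerce')
print(df1['age'].describe())  # now includes mean, std, min, max, ...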

How to group a Series by values in pandas?

I currently have a pandas Series with dtype Timestamp, and I want to group it by date (and have many rows with different times in each group).
The seemingly obvious way of doing this would be something similar to
grouped = s.groupby(lambda x: x.date())
However, pandas' groupby groups Series by its index. How can I make it group by value instead?
grouped = s.groupby(s)
Or:
grouped = s.groupby(lambda x: s[x])
Three methods (see the sketch after this list):
DataFrame: df.groupby(['column']).size()
Series: sel.groupby(sel).size()
Series to DataFrame:
pd.DataFrame({'column': sel}).groupby(['column']).size()
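A quick sketch running the Series and DataFrame methods on a toy Series of dates (names assumed):

import pandas as pd

sel = pd.Series(pd.to_datetime(['2021-01-01 09:00',
                                '2021-01-01 17:30',
                                '2021-01-02 08:15'])).dt.date

# Series grouped by its own values:
print(sel.groupby(sel).size())

# Same result after converting to a DataFrame:
print(pd.DataFrame({'column': sel}).groupby(['column']).size())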
For anyone else who wants to do this inline without throwing a lambda in (which tends to kill performance):
s.to_frame(0).groupby(0)[0]
You should convert it to a DataFrame, then add a column that is the date(). You can do groupby on the DataFrame with the date column.
df = pandas.DataFrame({"datetime": s})
df["date"] = df["datetime"].apply(lambda x: x.date())
df.groupby("date")
Then "date" becomes your index. You have to do it this way because the final grouped object needs an index so you can do things like select a group.
To add another suggestion, I often use the following as it uses simple logic:
pd.Series(index=s.values).groupby(level=0)
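Putting the suggestions together, a minimal runnable sketch grouping a datetime Series by its date part, where s.dt.date plays the role of the lambda from the question (data made up):

import pandas as pd

s = pd.Series(pd.to_datetime(['2021-01-01 09:00',
                              '2021-01-01 17:30',
                              '2021-01-02 08:15']))

# Group the values by their date, not by the index:
grouped = s.groupby(s.dt.date)
print(grouped.size())
# 2021-01-01    2
# 2021-01-02    1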
