Could anyone please tell me why sorting is generating an error here? I suspect it is related to indexing, but reset_index didn't solve the issue.
df['s'] = df.groupby(['ID','Date'],as_index=False)['Text_Data']\
.transform(lambda x : ' '.join(x))\
.sort_values(['ID','Date'])
KeyError: ('ID', 'Date')
What I was trying to do is to sort the dataframe regardless of the grouping. In R you would do ungroup() first; I'm not sure whether anything similar is necessary in Python? Thanks
df.groupby(['ID','Date'],as_index=False)['Text_Data'].transform(lambda x : ' '.join(x))
The code above gives you a Pandas Series that contains only the transformed Text_Data. When you then apply sort_values(['ID','Date']), it raises a KeyError because no ID or Date columns exist on that Series.
You can sort your dataframe first and build the transformed column as a Series separately. Then drop the original column from the dataframe and append the transformed one to it, like this:
df = df.sort_values(['ID','Date'])
df['s'] = df.groupby(['ID','Date'],as_index=False)['Text_Data'].transform(lambda x : ' '.join(x))
del df['Text_Data']
df['Text_Data'] = df['s'].values
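For reference, here is a minimal, self-contained sketch of the same fix on made-up data (the ID, Date, and Text_Data values below are placeholders):
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2],
    'Date': ['2020-01-01', '2020-01-01', '2020-01-02'],
    'Text_Data': ['hello', 'world', 'foo'],
})

# Sort first, then build the concatenated text per (ID, Date) group
df = df.sort_values(['ID', 'Date'])
df['s'] = df.groupby(['ID', 'Date'])['Text_Data'].transform(' '.join)
del df['Text_Data']
df['Text_Data'] = df['s'].values
print(df)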
I have a problem: I want to drop from my DataFrame all rows whose value in a column ends with "99".
I tried to create a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the matching values, but how do I apply it to my DataFrame to drop those rows?
I tried a few things, but nothing worked.
Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
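A minimal sketch on made-up data (the column name XX and the values are placeholders), assuming the column holds strings:
import pandas as pd

df = pd.DataFrame({'XX': ['ab12', 'cd99', 'ef34', 'gh99']})

# Keep only the rows whose XX value does not end with "99"
df = df[~df['XX'].str.endswith('99')]
print(df)  # leaves the 'ab12' and 'ef34' rows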
I want to find the top 1% in my dataframe and append all of those values to a list. Then I can check the first value inside and use it as a filter on the dataframe. Any idea how to do it, or do you have a simpler way to do it?
You can find the dataframe I use here:
https://raw.githubusercontent.com/srptwice/forstack/main/resultat_projet.csv
What I tried was to inspect my dataframe with a heatmap (from Seaborn) and use a filter like this:
df4 = df2[df2 > 50700]
You can use df.<column name>.quantile(<percentile>) to get the top percentage of a dataframe. For example, the code below gets you the rows (df2) where the bfly column is in the top 10% (above the 90th percentile):
import pandas as pd
df = pd.read_csv('./resultat_projet.csv')
df.columns = df.columns.str.replace(' ', '') # remove blank spaces in columns
df2 = df[df.bfly > df.bfly.quantile(0.9)]
print(df2)
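If you want the literal top 1% asked about in the question, the same quantile pattern applies; a minimal sketch, assuming bfly is the column of interest:
# 99th percentile = lower cutoff for the top 1% of bfly values
cutoff = df.bfly.quantile(0.99)

top_rows = df[df.bfly > cutoff]
# or collect the values themselves in a list, largest first
top_values = sorted(df.bfly[df.bfly > cutoff].tolist(), reverse=True)
print(top_values)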
I've been trying to find the top 3 highest-frequency restaurant names under each type of restaurant.
The columns are:
rest_type - Column for the type of restaurant
name - Column for the name of the restaurant
url - Column used for counting occurrences
This was the code that ended up working for me after some searching:
df_1 = df.groupby(['rest_type','name']).agg('count')
datas = (df_1.groupby(['rest_type'], as_index=False)
             .apply(lambda x: x.sort_values(by="url", ascending=False).head(3))['url']
             .reset_index()
             .rename(columns={'url': 'count'}))
The final output was as follows:
I had a few questions pertaining to the above code:
How are we able to group by rest_type again for the datas variable after already grouping on it earlier? Shouldn't that raise a missing-column error? The second groupby operation is a bit confusing to me.
What does the newly created column level_0 signify? I tried the code with as_index=True and it created both an index and a column for rest_type, so I couldn't reset the index. Output below:
Thank you
You can use groupby a second time because rest_type is present in the index, and groupby recognizes index levels as well as columns.
level_0 comes from the reset_index call because your index is unnamed.
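As a small illustration of that first point (made-up values), groupby will happily pick up rest_type from a MultiIndex even though it is no longer a column:
import pandas as pd

df_1 = pd.DataFrame(
    {'url': [3, 1, 2]},
    index=pd.MultiIndex.from_tuples(
        [('Cafe', 'A'), ('Cafe', 'B'), ('Bar', 'C')],
        names=['rest_type', 'name']),
)

# 'rest_type' is an index level here, not a column, yet groupby still finds it
print(df_1.groupby('rest_type').sum())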
That said, and provided I understand your dataset, I feel that you could achieve your goal more easily:
import random
import pandas as pd

df = pd.DataFrame({'rest_type': random.choices('ABCDEF', k=20),
                   'name': random.choices('abcdef', k=20),
                   'url': range(20),  # looks like this is a unique identifier
                   })
def tops(s, n=3):
    return s.value_counts().sort_values(ascending=False).head(n)
df.groupby('rest_type')['name'].apply(tops, n=3)
Edit: here is an alternative that formats the result as a dataframe with informative column names:
(df.groupby('rest_type')
   .apply(lambda x: x['name'].value_counts().nlargest(3))
   .reset_index()
   .rename(columns={'name': 'counts', 'level_1': 'name'})
)
I have a similar case where the above query only partially works: in my case the cooccurrence value always comes out as 1.
Here is my input data frame.
And my query is below:
top_five_family_cooccurence_df = (
    common_top25_cooccurance1_df.groupby('family')
    .apply(lambda x: x['related_family'].value_counts().nlargest(5))
    .reset_index()
    .rename(columns={'related_family': 'cooccurence', 'level_1': 'related_family'})
)
I am getting this result:
The cooccurrence is always coming out as 1.
I have encountered some issues while processing my dataset using Pandas DataFrame.
Here is my dataset:
My data types are displayed below:
My dataset is derived from:
MY_DATASET = pd.read_excel(EXCEL_FILE_PATH, index_col = None, na_values = ['NA'], usecols = "A, D")
I would like to sum all values in the "NUMBER OF PEOPLE" column for each month in the "DATE" column. For example, all values in "NUMBER OF PEOPLE" column would be added as long as the value in the "DATE" column was "2020-01", "2020-02" ...
However, I am stuck since I am unsure how to use .groupby on a partial match.
After 1) is completed, I am also trying to convert the values in the "DATE" column from YYYY-MM-DD to YYYY-MMM, like 2020-Jan.
However, I am unsure if there is such a format.
Does anyone know how to resolve these issues?
Many thanks!
Check:
s = df['NUMBER OF PEOPLE'].groupby(pd.to_datetime(df['DATE']).dt.strftime('%Y-%b')).sum()
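For instance, on a small made-up frame with the same two columns (the dates and counts below are placeholders), this produces one sum per month label:
import pandas as pd

df = pd.DataFrame({
    'DATE': ['2020-01-05', '2020-01-20', '2020-02-11'],
    'NUMBER OF PEOPLE': [10, 5, 7],
})

# Group the counts by a 'YYYY-Mon' label derived from the DATE column
s = df['NUMBER OF PEOPLE'].groupby(pd.to_datetime(df['DATE']).dt.strftime('%Y-%b')).sum()
print(s)
# 2020-Feb     7
# 2020-Jan    15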
You can get an abbreviated month name using strftime('%b') (or the full name with '%B'), but the month name may come out all in lowercase:
df['group_date'] = df.date.apply(lambda x: x.strftime('%Y-%B'))
If you need the first letter of the month in uppercase, you could do something like this:
df.group_date = df.group_date.apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}')
# or in one step:
df['group_date'] = df.date.apply(lambda x: x.strftime('%Y-%B')).apply(lambda x: f'{x[0:5]}{x[5].upper()}{x[6:]}')
Now you just need to .groupby and .sum():
result = df['NUMBER OF PEOPLE'].groupby(df.group_date).sum()
I did some tinkering around and found that this worked for me as well:
Cheers all
I am looking for a cleaner way to add subtotals to Pandas groupby.
Here is my DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B'], 50),
    'Sub-Category': np.random.choice(['X', 'Y'], 50),
    'Product': np.random.choice(['Product 1', 'Product 2'], 50),
    'Units_Sold': np.random.randint(1, 100, size=50),
    'Dollars_Sold': np.random.randint(100, 1000, size=50),
    'Date': np.random.choice(pd.date_range('1/1/2011', '03/31/2011', freq='D'), 50, replace=False)})
From there, I create a new grouped DataFrame like so:
df1 = df.groupby(['Category','Sub-Category','Product',pd.TimeGrouper(key='Date',freq='M')]).agg({'Units_Sold':'sum','Dollars_Sold':'sum'}).unstack().fillna(0)
I would like to provide sub-totals for both Category & Sub-Category. I can do this using this code:
df2 = df1.groupby(level=[0,1]).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.get_level_values(0),
df2.index.get_level_values(1) + ' Total',
len(df2) * ['']])
df3 = df1.groupby(level=[0]).sum()
df3.index = pd.MultiIndex.from_arrays([df3.index.get_level_values(0) + ' Total',
len(df3) * [''],
len(df3) * ['']])
pd.concat([df1,df2,df3]).sort_index()
This gives me the DataFrame I want:
Final DataFrame
My question - is there a more pythonic way to do this than to have to create a new DataFrame for each level then concat together? I have researched this, but can not find a better way. I have to do this for many different MultiIndex dataframes & am seeking a better solution.
Thanks in advance for your help!
EDIT WITH ADDITIONAL INFORMATION:
Thank you to both @Wen and @DaFanat for their replies. I attempted to apply the approach from the link @Wen provided (Python (Pandas) Add subtotal on each lvl of multiindex dataframe) to my data:
pd.concat([df.assign(\
**{x: 'Total' for x in "CategorySub-CategoryProduct"[i:]}\
).groupby(list('abc')).sum() for i in range(1,4)])\
.sort_index()
This sums the totals; however, it ignores the dates that make up the second level of the columns, leaving me with this outcome (resulting image).
I've tried to add in a TimeGrouper with the groupby, but that returns an error. Any help would be greatly appreciated. Thanks!
I can get you a lot closer by aligning your attempt above with the example from @piRSquared.
The list must match the MultiIndex. Try this instead:
iList = ['Category', 'Sub-Category', 'Product']
pd.concat([
    df1.assign(
        **{x: '' for x in iList[i:]}
    ).groupby(iList).sum() for i in range(1, 4)
]).sort_index()
It doesn't present the word "Total" in the right place, nor are the totals at the bottom of each group, but at least it's more-or-less functionally correct. My totals won't match because the values in the DataFrame are random.
It took me a while to work through the original answer provided in Python (Pandas) Add subtotal on each lvl of multiindex dataframe. But the same logic applies here.
The assign() replaces the values in the columns with what is in the dict that is returned by the dict comprehension executed over the elements of the list of MultiIndex columns.
Then groupby() only finds unique values for those non-blanked-out columns and sums them accordingly.
These groupbys are enclosed in a list comprehension, so pd.concat() then just combines these sets of rows.
And sort_index() puts the index labels in ascending order.
(Yes, you still get a warning about "both a column name and an index level," but it still works.)
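To see the mechanics of the assign-then-groupby trick in isolation, here is a stripped-down sketch on a made-up two-level frame (the column names and values are placeholders):
import pandas as pd

df = pd.DataFrame({
    'Category': ['Group A', 'Group A', 'Group B'],
    'Sub-Category': ['X', 'Y', 'Y'],
    'Dollars_Sold': [100, 200, 300],
})
cols = ['Category', 'Sub-Category']

# i = 2 leaves both levels intact (the detail rows);
# i = 1 blanks out 'Sub-Category', which collapses each Category into a subtotal row.
result = pd.concat([
    df.assign(**{c: '' for c in cols[i:]}).groupby(cols).sum()
    for i in range(1, 3)
]).sort_index()
print(result)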