I want to produce an aggregation along a certain criterion, but also need a row with the same aggregation applied to the non-aggregated dataframe.
When using customers.groupby('year').size(), is there a way to keep the total among the groups, in order to output something like the following?
year customers
2011 3
2012 5
total 8
The only thing I could come up with so far is the following:
n_customers_per_year.loc['total'] = len(customers)
(n_customers_per_year is the aggregation by year. While this approach is fairly straightforward for a single index, it gets messy when the aggregation is multi-indexed.)
I believe the pivot_table method has a boolean margins argument that adds totals. Have a look:
margins : boolean, default False
    Add all row / columns (e.g. for subtotal / grand totals)
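For example, a minimal sketch with margins=True on a toy frame mirroring the numbers in the question (the id column is just a placeholder to count):
import pandas as pd

customers = pd.DataFrame({'year': [2011] * 3 + [2012] * 5,
                          'id': range(8)})

# margins=True appends a grand-total row; margins_name controls its label
totals = customers.pivot_table(index='year', values='id', aggfunc='count',
                               margins=True, margins_name='total')
#        id
# 2011    3
# 2012    5
# total   8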
I agree that this would be a desirable feature, but I don't believe it is currently implemented. Ideally, one would like to display an aggregation (e.g. sum) along one or more axes and/or levels.
A workaround is to create a series that is the sum and then concatenate it to your DataFrame when delivering the data.
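A minimal sketch of that workaround on toy data matching the question's numbers (the 'total' label is arbitrary):
import pandas as pd

customers = pd.DataFrame({'year': [2011] * 3 + [2012] * 5})
n_customers_per_year = customers.groupby('year').size()

# One-element Series holding the grand total, concatenated onto the groupby result
total = pd.Series([len(customers)], index=['total'])
out = pd.concat([n_customers_per_year, total])
# 2011     3
# 2012     5
# total    8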
Related
I have a dataset filled with Medicare beneficiaries. The question is: 'What proportion of patients have at least one of the chronic conditions described in the independent variables alzheimers, arthritis, cancer, copd, depression, diabetes, heart.failure, ihd, kidney, osteoporosis, and stroke?'
I tried creating a subset and using isnull() and any(), but I can't get a proper solution. I also tried df.loc, but it only lets me name one column.
I am attaching the dataset for better understanding!
https://drive.google.com/file/d/1R--YEsBCDHMXjqNzAumT2zzUAYvM1bWA/view?usp=sharing
Thanks!!
My attempts:
claimss.loc[:, ["alzheimers","diabetes","arthritis"] == 1]
(I wanted to try it with 3 columns first, but it doesn't work in the first place.)
Try with Subset:
filtered_df = df.loc[raw_df] == 1]
(I created a subset where only the index and independent variables (diseases) appear and tried to look for null rows.)
If you need to filter only some columns, select them by name, compare against 1 with DataFrame.eq, and then test whether at least one value per row is True with DataFrame.any:
claimss[claimss[["alzheimers","diabetes","arthritis"]].eq(1).any(axis=1)]
If you need the proportion, take the mean of the boolean mask:
out = claimss[["alzheimers","diabetes","arthritis"]].eq(1).any(axis=1).mean()
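Extending the same idea to every condition listed in the question (a sketch; it assumes those exact column names exist in the file and that 1 marks the presence of the condition):
conditions = ["alzheimers", "arthritis", "cancer", "copd", "depression",
              "diabetes", "heart.failure", "ihd", "kidney",
              "osteoporosis", "stroke"]

# Proportion of patients with at least one chronic condition
proportion = claimss[conditions].eq(1).any(axis=1).mean()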
I have the following code, which displays the graph below.
The problem is that I want to compare each attribute (mean age, age amount, etc.) next to each other. By default, pandas uses the index as the x-axis. How do I change it to use the column names instead (so there will be 3 comparisons, one for each attribute)?
IIUC use transpose with DataFrame.plot.bar:
df.T.plot.bar()
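A minimal, self-contained sketch of the idea (the attribute and group names here are made up):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'mean_age': [34, 29, 41],
                   'age_amount': [120, 95, 150]},
                  index=['group_a', 'group_b', 'group_c'])

# Transposing puts the attributes on the x-axis, with one bar per original row
df.T.plot.bar()
plt.show()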
I have a dataframe with multiple scores and multiple dates. My goal is to bin each day into equal sized buckets (let's say 5 buckets) based on whatever score I choose. The problem is that some scores have an abundance of ties and therefore I need to first compute rank to introduce a tie-breaker criteria and then the qcut can be applied.
The simple solution is to create a field for the rank and then do groupby('date')['rank'].transform(pd.qcut). However, since efficiency is key, this implies doing two expensive groupbys and I was wondering if it is possible to "chain" the two operations into one sweep.
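For reference, here is a minimal sketch of that two-pass baseline (using the date and score column names from my data; labels=False just returns bucket numbers):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df_main = pd.DataFrame({
    'date': pd.to_datetime(['2021-01-01'] * 10 + ['2021-01-02'] * 10),
    'score': rng.integers(0, 3, 20),   # lots of ties on purpose
})

# Pass 1: per-day rank as the tie-breaker
df_main['rank'] = df_main.groupby('date')['score'].rank(method='first')

# Pass 2: per-day quintiles on the (now unique) ranks
df_main['bucket'] = df_main.groupby('date')['rank'].transform(
    lambda s: pd.qcut(s, 5, labels=False))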
This is the closest I got; my goal is to create 5 buckets, but my qcut call seems to be wrong since it is asking me to provide hundreds of labels:
df_main.groupby('date')['score'].apply(
    lambda x: pd.qcut(x.rank(method='first'),
                      5,
                      duplicates='drop',
                      labels=lbls))
Thanks
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be changed to the number of weeks between the new most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
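To make the week arithmetic concrete, dividing the date difference by a one-week Timedelta gives the number of weeks (illustrated with the two dates from the example):
import pandas as pd

created = pd.Timestamp('2020-10-15')         # most recent rma_created_date
processed_prev = pd.Timestamp('2020-06-28')  # rma_processed_date of the row above

weeks = (created - processed_prev) / pd.Timedelta(weeks=1)  # about 15.6 weeks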
So far I did this
def clean_df(df):
    '''
    Fix the time_in_weeks column so it holds the correct number of weeks
    when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
Now I need to perform what I described above but I am a little confused on how to do this. Especially considering we could have multiple failures and not just 2.
I should get something like this returned as output
As you can see, the 34 got changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct if I interpreted it wrongly.
Try np.where(condition, choice if condition is True, choice if condition is False):
import numpy as np
import pandas as pd
# Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])
# Solution: recompute only the rows whose uniqueid is a repeat
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_created_date.sub(df.rma_processed_date),
    df.time_in_weeks)
I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:
Imgur link - my reputation not high enough to post inline images
I am trying to group this data by quarters (Q42018, for example), instead of every single day (the native format of the data).
I found this post that uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years to be quite elegant and make the most sense.
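For context, here is a minimal sketch of that single-level version on toy monthly data (this quarterly grouping is the part I want to reuse):
import pandas as pd

s = pd.Series(range(6), index=pd.date_range('2018-07-01', periods=6, freq='MS'))

# Group the date index into quarters and aggregate
quarterly = s.groupby(pd.PeriodIndex(s.index, freq='Q')).sum()
# 2018Q3     3
# 2018Q4    12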
The problem is that this solution is for a dataframe with single-level columns. My columns are a MultiIndex, and I can't figure out how to get it to work. Here is my attempt thus far:
bt = cleaned2018_df.pivot_table(index='Broker',
                                values=['Interaction Id', 'Net Points'],
                                columns='Date',
                                aggfunc={'Interaction Id': pd.Series.nunique,
                                         'Net Points': np.sum},
                                fill_value=0)

pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()
As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.
This returns a ValueError of Grouper and axis must be same length. I know the reason: the pidx value I'm passing to the GroupBy has length x, whereas the column axis of the dataframe has length 2x (since the 1st level of the MultiIndex has 2 values).
I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.
If my explanation is not clear, I'm happy to clarify further. Thank you in advance.
I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.
The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns the appropriate # of values based on how many items are in higher levels.
For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.
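A small illustration of that length difference with a toy MultiIndex:
import pandas as pd

cols = pd.MultiIndex.from_product([['a', 'b', 'c'], ['x', 'y', 'z']])

len(cols.levels[1])            # 3 unique values -- too short to group 9 columns
len(cols.get_level_values(1))  # 9 -- one entry per column, which is what groupby needs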
The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.
Corrected code:
# use get_level_values instead of accessing levels directly
pidx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')

# to maintain the original grouping, pass in a list of your original top level
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()
This works:
imgur link to the resulting dataframe, as my rep is too low to post images inline