I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:
(Imgur link: my reputation is not high enough to post inline images.)
I am trying to group this data by quarter (Q4 2018, for example) instead of by every single day (the native format of the data).
I found this post, which uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years, to be quite elegant; it made the most sense to me.
The problem is that this solution is for a DataFrame with single-index columns. I'm running into trouble because my columns are a MultiIndex, and I can't figure out how to adapt it. Here is my attempt so far:
bt = cleaned2018_df.pivot_table(index='Broker',
                                values=['Interaction Id', 'Net Points'],
                                columns='Date',
                                aggfunc={'Interaction Id': pd.Series.nunique,
                                         'Net Points': np.sum},
                                fill_value=0)
pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()
As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.
This raises a ValueError: Grouper and axis must be same length. I know the reason: the pidx value I'm passing to the GroupBy has length x, whereas the column axis of the DataFrame has length 2x (since the 1st level of the MultiIndex has 2 values).
I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.
If my explanation is not clear, I'm happy to clarify further. Thank you in advance.
I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.
The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns the appropriate # of values based on how many items are in higher levels.
For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.
The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.
Corrected code:
# use get_level_values instead of accessing levels directly
pidx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')
# to maintain the original grouping, pass in a list of your original top level
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()
This works
(Imgur link to the resulting DataFrame; my rep is too low to post images.)
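As a self-contained sketch of the fix (the broker names, dates, and values below are made up to mirror the question's layout; also note that recent pandas deprecates groupby(axis=1), so the equivalent transpose form is shown, and to_period('Q') is equivalent to the pd.PeriodIndex call above):

```python
import pandas as pd

# made-up data mirroring the question's layout
df = pd.DataFrame({
    'Broker': ['A', 'A', 'B', 'B'],
    'Date': pd.to_datetime(['2018-10-01', '2019-01-15',
                            '2018-11-20', '2019-02-02']),
    'Interaction Id': [1, 2, 3, 4],
    'Net Points': [10.0, 20.0, 30.0, 40.0],
})

bt = df.pivot_table(index='Broker',
                    values=['Interaction Id', 'Net Points'],
                    columns='Date',
                    aggfunc={'Interaction Id': 'nunique',
                             'Net Points': 'sum'},
                    fill_value=0)

# get_level_values returns one date per column, so lengths line up
pidx = bt.columns.get_level_values(1).to_period('Q')

# transpose, group the (name, quarter) pairs, transpose back;
# equivalent to groupby(..., axis=1), which newer pandas deprecates
broker_qtr_totals = (bt.T
                     .groupby([bt.columns.get_level_values(0), pidx])
                     .sum()
                     .T)
```

The result keeps the top level ('Interaction Id', 'Net Points') intact while collapsing the daily second level into quarters.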
Related
I am having trouble going from a long format to a wide one, in pandas. There are plenty of examples going from wide to long, but I did not find one from long to wide.
I am trying to reformat my dataframe and pivot, groupby, unstack are a bit confusing for my use case.
This is how I want it to be. The numbers are actually the intensity column from the second image.
And this is how it is now
I tried to build a MultiIndex based on Peptide, Charge and Protein. Then I tried to pivot based on that multi index, and keep all the samples and their intensity as values:
df.set_index(['Peptide', 'Charge', 'Protein'], append=False)
df.pivot(index=df.index, columns='Sample', values='Intensity')
Of course, this does not work since my index is now a combination of the 3 and not an actual column in the dataframe.
It tells me
KeyError: None of [RangeIndex(start=0, stop=3397898, step=1)] are in the [columns]
I tried also to group by, but I am not sure how to move from the long format back to wide. I am quite new to the dataframe way of thinking and I want to learn how to do this right.
It was very tempting for me to do an old-school "Java"-like approach with 4 for loops, building it as a matrix. Thank you in advance!
I think, based on your attempt, that this might work (note that pivot takes the index as a keyword argument on pandas 2.0+, and accepts a list of columns from pandas 1.1 on):
df2 = df.pivot(index=['Peptide', 'Charge', 'Protein'], columns='Sample', values='Intensity').reset_index()
After that, if you want to remove the name from the column axis:
df2 = df2.rename_axis(None, axis=1)
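Putting the two steps together on made-up data (the peptide, protein, and sample names below are invented for illustration):

```python
import pandas as pd

# made-up long-format rows using the question's column names
long_df = pd.DataFrame({
    'Peptide': ['PEP1', 'PEP1', 'PEP2', 'PEP2'],
    'Charge': [2, 2, 3, 3],
    'Protein': ['ProtA', 'ProtA', 'ProtB', 'ProtB'],
    'Sample': ['S1', 'S2', 'S1', 'S2'],
    'Intensity': [100.0, 110.0, 200.0, 210.0],
})

# one row per (Peptide, Charge, Protein), one column per Sample
wide = (long_df
        .pivot(index=['Peptide', 'Charge', 'Protein'],
               columns='Sample', values='Intensity')
        .reset_index()
        .rename_axis(None, axis=1))
```

Each distinct Sample value becomes its own column, with Intensity as the cell values.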
Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State', 'Lat', 'Long'], axis=1, inplace=True)
plot_data = raw_data.groupby('Country/Region').sum()
The plot_data is a simple DataFrame:
What I would like to do now is subtract the prior day's column from each column, i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns with NaN values. I.e., pandas tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, since it doesn't work in the single-column example, it doesn't work with multiple columns either (pandas aligns on the column headers and returns a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day and now I can subtract it from the actual numbers on each day.
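As a tiny made-up sketch (country names and numbers invented), both the shift form above and pandas' built-in diff give the daily deltas:

```python
import pandas as pd

# made-up cumulative-case table shaped like plot_data
plot_data = pd.DataFrame(
    {'3/28/20': [100, 50], '3/29/20': [120, 55], '3/30/20': [150, 70]},
    index=['CountryA', 'CountryB'])

# shift the history one column to the right and subtract
daily_new = plot_data - plot_data.shift(1, axis=1)

# diff does the same subtraction against the previous column in one call
same = plot_data.diff(axis=1)
```

The first column comes out as NaN in both cases, since there is no prior day to subtract.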
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min of all values, not just those from the current row back to the first.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem does boil down to the following. You need to select the range of data you want to work with (so select rows for the date range and columns for the user/speed).
That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']] (a slice over a sorted date index).
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.
I haven't played around for a bit with Dataframes, but you're looking for how to slice Dataframes.
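A toy version of that slicing approach (the users, speeds, and dates below are made up, and a sorted DatetimeIndex is assumed):

```python
import pandas as pd

# made-up frame with a sorted DatetimeIndex, as the answer assumes
df = pd.DataFrame(
    {'user': ['a', 'a', 'b', 'a'],
     'speed': [9.0, 7.5, 6.0, 8.0]},
    index=pd.to_datetime(['2018-04-02', '2018-06-01',
                          '2018-07-01', '2019-04-01']))

# slice the date range and the columns of interest
x = df.loc['2018-04-02':'2019-04-02', ['user', 'speed']]

# lowest speed per user within that window
per_user_min = x.groupby('user')['speed'].min()
```

Repeating this for successively longer windows gives the "lowest up until that date" the question asks for.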
I have used groupby in pandas, however the label for the groups is simply an arbitrary value, whereas I would like this label to be the index of the original dataframe (which is datetime) so that I can create a new dataframe which I can plot in terms of datetime.
grouped_data = df.groupby(
    ['X', df.X.ne(df.X.shift()).cumsum().rename('grp')])
grouped_data2 = grouped_data['Y'].agg(np.trapz).loc[2.0:4.0]
The column X cycles through the values 1-4, and the second line of code is intended to integrate the column Y in the groups where X is either 2 or 3. These are repeating units, so I don't want all the 2s and all the 3s integrated together: I want the period of time where it goes 2222233333 as one group, and then np.trapz applied again to the next 2222233333 group. That way I should get a new dataframe with an index corresponding to the start of these time periods and values that are the integrals of those periods.
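The run-labelling trick in the snippet above (comparing X to its shift and cumulative-summing the changes) can be sketched on toy data; the X and Y values below are invented, and a small shim is used because NumPy 2.x renamed trapz to trapezoid:

```python
import numpy as np
import pandas as pd

# toy data: X repeats in units, Y is the curve to integrate
df = pd.DataFrame({
    'X': [1, 2, 2, 3, 3, 4, 1, 2, 2, 3, 3, 4],
    'Y': [0.0, 1.0, 2.0, 2.0, 1.0, 0.0,
          0.0, 2.0, 4.0, 4.0, 2.0, 0.0],
})

# label each consecutive run of identical X values
grp = df.X.ne(df.X.shift()).cumsum().rename('grp')

# NumPy 2.x renamed trapz to trapezoid
trapz = getattr(np, 'trapezoid', None)
if trapz is None:  # NumPy 1.x
    trapz = np.trapz

# integrate Y (trapezoid rule) within each (X value, run) pair
runs = df.groupby(['X', grp])['Y'].agg(trapz)

# pick out just the X == 2 and X == 3 runs
two_three = runs.loc[2:3]
```

Each repeat of a 2-run or 3-run gets its own group label, so the two passes through 2222233333 are integrated separately.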
If I understand correctly, you've already set your index to DateTime values? If yes, try the grouper function:
df.groupby(pd.Grouper(key={index name}, freq={appropriate offset alias}))
Without a sample data-set, I can't really provide a complete solution, but this should solve your indexing issue:)
Grouper Function tutorial
Offset aliases
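As a small made-up illustration of the Grouper approach (the dates, values, and the 3-day frequency are all invented; when grouping on the index itself, the key argument can be omitted):

```python
import pandas as pd

# made-up series with a DatetimeIndex, as the question describes
idx = pd.date_range('2020-01-01', periods=6, freq='D')
df = pd.DataFrame({'Y': [1.0, 2.0, 3.0, 2.0, 1.0, 0.0]}, index=idx)

# bin the index into 3-day groups; the group labels are datetimes
binned = df.groupby(pd.Grouper(freq='3D'))['Y'].sum()
```

The resulting index holds the datetime at the start of each bin, which is exactly the plot-friendly labelling the question asks for.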
I want to produce an aggregation along a certain criterion, but also need a row with the same aggregation applied to the non-aggregated dataframe.
When using customers.groupby('year').size(), is there a way to keep the total among the groups, in order to output something like the following?
year customers
2011 3
2012 5
total 8
The only thing I could come up with so far is the following:
n_customers_per_year.loc['total'] = len(customers)
(n_customers_per_year is the dataframe aggregated by year. While this method is fairly straightforward for a single index, it seems to get messy when it has to be done on a multi-indexed aggregation.)
I believe the pivot_table method has a margins boolean argument for totals. Have a look:
margins : boolean, default False
    Add all row / columns (e.g. for subtotal / grand totals)
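A quick sketch of margins in action (the years and ids below are made up to match the question's desired output):

```python
import pandas as pd

# made-up data: 3 customers in 2011, 5 in 2012
customers = pd.DataFrame({'year': [2011] * 3 + [2012] * 5,
                          'id': list(range(1, 9))})

# margins=True appends a grand-total row labelled by margins_name
totals = customers.pivot_table(index='year', values='id',
                               aggfunc='count',
                               margins=True, margins_name='total')
```

This yields the per-year counts plus a 'total' row, without a manual second step.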
I agree that this would be a desirable feature, but I don't believe it is currently implemented. Ideally, one would like to display an aggregation (e.g. sum) along one or more axis and or levels.
A workaround is to create a series that is the sum and then concatenate it to your DataFrame when delivering the data.
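That workaround might look like the following (the customer data is made up to match the desired output in the question):

```python
import pandas as pd

# made-up data: 3 customers in 2011, 5 in 2012
customers = pd.DataFrame({'year': [2011] * 3 + [2012] * 5})

n_customers_per_year = customers.groupby('year').size()

# build the grand total separately and concatenate it on delivery
with_total = pd.concat([n_customers_per_year,
                        pd.Series({'total': len(customers)})])
```

Keeping the total in a separate series until display time avoids polluting the aggregated frame itself.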