I have the following code, which displays the following graph.
The problem is that I want to compare each attribute (mean age, age amount, etc.) next to
each other. By default, pandas takes x as the index. How do I change it to take the column names instead? (Then there will be 3 comparisons, one for each attribute.)
IIUC use transpose with DataFrame.plot.bar:
df.T.plot.bar()
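A minimal sketch of what the transpose does, assuming a small frame whose columns are the attributes. The values, the row labels and the third attribute name (max age) are made up for illustration:

```python
import pandas as pd

# Hypothetical data: one row per group, one column per attribute.
df = pd.DataFrame(
    {"mean age": [30.0, 35.0], "age amount": [10.0, 12.0], "max age": [60.0, 65.0]},
    index=["group_a", "group_b"],
)

# After transposing, the attribute names form the index, so a bar plot
# puts one cluster per attribute on the x axis.
transposed = df.T
print(transposed)
```

Calling transposed.plot.bar() on this frame would then draw three clusters, one per attribute, with one bar per original row.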
I have a pivot_table generated DataFrame with a single index for its rows, and a MultiIndex for its columns. The top level of the MultiIndex is the name of the data I am running calculations on, and the second level is the DATE of that data. The values are the result of those calculations. It looks like this:
Imgur link - my reputation not high enough to post inline images
I am trying to group this data by quarters (Q42018, for example), instead of every single day (the native format of the data).
I found this post, which uses PeriodIndex and GroupBy to convert an index of dates into an index of quarters/years. It is quite elegant and makes the most sense to me.
The problem is that this solution is for a dataframe with only single index columns. I'm running into a problem trying to do this because my columns are a multi-index, and I can't figure out how to get it to work. Here is my attempt thus far:
bt = cleaned2018_df.pivot_table(index='Broker',
                                values=['Interaction Id', 'Net Points'],
                                columns='Date',
                                aggfunc={'Interaction Id': pd.Series.nunique,
                                         'Net Points': np.sum},
                                fill_value=0)
pidx = pd.PeriodIndex(bt.columns.levels[1], freq='Q')
broker_qtr_totals = bt.groupby(pidx, axis=1, level=1).sum()
As you can see, I'm grabbing the second level of the MultiIndex that contains all the dates, and running it through the PeriodIndex function to get back an index of quarters. I then pass that PeriodIndex into groupby, and tell it to operate on columns and the second level where the dates are.
This raises ValueError: Grouper and axis must be same length. I know the reason: the pidx value I'm passing to groupby has length x, whereas the column axis of the dataframe has length 2x (since the first level of the MultiIndex has 2 values).
I'm just getting hung up on how to properly apply this to the entire index. I can't seem to figure it out syntactically, so I wanted to rely on the community's expertise to see if someone could help me out.
If my explanation is not clear, I'm happy to clarify further. Thank you in advance.
I figured this out, and am going to post the answer in case anyone else with a similar problem lands here. I was thinking about the problem correctly, but had a few errors in my first attempt.
The length error was due to me passing an explicit reference to the 2nd level of the MultiIndex into the PeriodIndex function, and then passing that into groupby. The better solution is to use the .get_level_values function, as this takes into account the multi-level nature of the index and returns one value per column, repeated for each item in the higher levels.
For instance - if you have a DataFrame with MultiIndex columns with 2 levels - and those 2 levels each contain 3 values, your table will have 9 columns, as the lower level is broken out for each value in the top level. My initial solution was just grabbing those 3 values from the second level directly, instead of all 9. get_level_values corrects for this.
The second issue was that I was passing just this PeriodIndex object by itself into the groupby. That will work, but then it basically just disregards the top level of the MultiIndex. So you need to make sure to pass in a list that contains the original top level, and your new second level that you want to group by.
Corrected code:
# use get_level_values instead of accessing levels directly
pidx = pd.PeriodIndex(bt.columns.get_level_values(1), freq='Q')
# to maintain original grouping, pass in a list of your original top level,
# and the new second level
broker_qtr_totals = bt.groupby(by=[bt.columns.get_level_values(0), pidx],
                               axis=1).sum()
This works
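Here is a self-contained sketch of the same idea on made-up data. The frame below is a hypothetical stand-in for bt, and the broker names, level names and values are invented. Because newer pandas releases deprecate groupby(axis=1), this variant transposes, groups the rows, and transposes back. That is a swapped-in way of reaching the same result, not the code from the answer above:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bt: two metrics on the top column level,
# daily dates on the second level.
dates = pd.to_datetime(["2018-01-15", "2018-02-20", "2018-04-10", "2018-11-05"])
columns = pd.MultiIndex.from_product(
    [["Interaction Id", "Net Points"], dates], names=["metric", "Date"]
)
bt = pd.DataFrame(
    np.arange(16).reshape(2, 8), index=["broker_a", "broker_b"], columns=columns
)

# One entry per column (8 here), not one per unique date (4).
metric = bt.columns.get_level_values(0)
quarter = pd.PeriodIndex(bt.columns.get_level_values(1), freq="Q")

# Group the transposed rows by (metric, quarter), then transpose back.
broker_qtr_totals = bt.T.groupby([metric, quarter]).sum().T
print(broker_qtr_totals)
```

Grouping by both arrays keeps the two metrics separate while collapsing the dates into quarters.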
imgur link to dataframe image as my rep is too low
I'm new to Pandas.
I've got a dataframe where I want to group by user and then find their lowest score up until that date in their speed column.
So I can't just use df.groupby(['user'])['speed'].transform('min'), as this would give the min of all values, not just from the first row to the current one.
What can I use to get what I need?
Without seeing your dataset it's hard to help you directly. The problem does boil down to the following. You need to select the range of data you want to work with (so select rows for the date range and columns for the user/speed).
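That would look something like x = df.loc["2-4-2018":"2-4-2019", ['user', 'speed']] (a slice between the two dates, assuming the dates are the index; a list of two dates would select only those two labels).
From there you could do a simple x['speed'].min() for the value or x['speed'].idxmin() for the index of the value.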
I haven't played around for a bit with Dataframes, but you're looking for how to slice Dataframes.
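For the running minimum the question describes, another option (different from the slicing approach above) is a grouped cumulative minimum. This is only a sketch: the user and speed column names come from the question, while the date column and all values are made up.

```python
import pandas as pd

# Hypothetical data: two users with a few dated speed readings each.
df = pd.DataFrame(
    {
        "user": ["a", "a", "a", "b", "b"],
        "date": pd.to_datetime(
            ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02"]
        ),
        "speed": [5, 3, 4, 7, 6],
    }
)

# Sort so that "up until that date" matches row order within each user.
df = df.sort_values(["user", "date"])

# cummin gives, for every row, the lowest speed seen so far for that user.
df["lowest_so_far"] = df.groupby("user")["speed"].cummin()
print(df)
```

Unlike transform('min'), the result changes row by row, and each user's series restarts independently.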
I have used groupby in pandas, however the label for the groups is simply an arbitrary value, whereas I would like this label to be the index of the original dataframe (which is datetime) so that I can create a new dataframe which I can plot in terms of datetime.
grouped_data = df.groupby(
['X',df.X.ne(df.X.shift()).cumsum().rename('grp')])
grouped_data2 = grouped_data['Y'].agg(np.trapz).loc[2.0:4.0]
The column X has changing values from 1-4 and the second line of code is intended to integrate the column Y in the groups where X is either 2 or 3. These are repeating units, so I don't want all the 2s and all the 3s integrated together. I want the period of time where it goes 22222333333 as one group, and then np.trapz applied again to the next group where it goes 2222233333. That way I should have a new dataframe with an index corresponding to the start of these time periods and values which are the integrals over these periods.
If I understand correctly, you've already set your index to datetime values? If so, try pd.Grouper. Since the dates are in the index rather than in a column, pass level (key refers to a column):
df.groupby(pd.Grouper(level={index name}, freq={appropriate offset alias}))
Without a sample dataset, I can't really provide a complete solution, but this should solve your indexing issue :)
Grouper Function tutorial
Offset aliases
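If fixed calendar bins from Grouper do not match the repeating 2/3 stretches, one alternative is to label each consecutive stretch and keep its first timestamp as the new index. This is a sketch on made-up data, assuming a DatetimeIndex and the X and Y columns from the question. The intermediate time and run column names are hypothetical, and the integral uses unit spacing between samples rather than real time differences:

```python
import numpy as np
import pandas as pd

# np.trapz was renamed np.trapezoid in NumPy 2.0; use whichever exists.
trapezoid = getattr(np, "trapezoid", None) or np.trapz

# Hypothetical data: X cycles through 1-4, with two stretches of 2s and 3s.
idx = pd.date_range("2021-01-01", periods=10, freq="min")
df = pd.DataFrame(
    {"X": [1, 2, 2, 3, 3, 4, 2, 2, 3, 1], "Y": [0, 1, 1, 1, 1, 0, 2, 2, 2, 0]},
    index=idx,
)

# True while X is 2 or 3; a new run id starts whenever that flag flips.
in_window = df["X"].isin([2, 3])
run_id = in_window.ne(in_window.shift()).cumsum()

# Keep only the 2/3 stretches and carry the timestamp along as a column.
runs = df[in_window].rename_axis("time").reset_index()
runs["run"] = run_id[in_window].to_numpy()

# One row per stretch: its first timestamp and the integral of Y.
result = (
    runs.groupby("run")
    .agg(start=("time", "first"), integral=("Y", trapezoid))
    .set_index("start")
)
print(result)
```

The resulting frame is indexed by the start time of each stretch, so it can be plotted against datetime directly.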
So I have a function replaceMonth(string), which is just a series of if statements that returns a string derived from a column in a pandas dataframe. Then I need to replace the original string with the derived one.
The dataframe is defined like this:
Index  ID      Year  DSFS         DrugCount
0      111111  Y1    3- 4 months  1
There are around 80K rows in the dataframe. What I need to do is to replace what is in column DSFS with the result from the replaceMonth(string) function.
So if, for example, the value in the first row of DSFS was '3- 4 months' and I ran that string through replaceMonth(), it would give me '_3_4' as the return value. Then I need to change the value in the dataframe from '3- 4 months' to '_3_4'.
I've been trying to use apply on the dataframe but I'm either getting the syntax wrong or not understanding what it's doing correctly, like this:
dataframe['DSFS'].apply(replaceMonth(dataframe['DSFS']))
That doesn't ring right to me but I'm not sure where I'm messing up on it. I'm fairly new to Python so it's probably the syntax. :)
Any help is greatly appreciated!
When you use apply, you pass the function itself (not the result of calling it), and pandas calls it on each element.
Try
dataframe['DSFS'].apply(replaceMonth)
apply returns a new Series, so assign the result back to the dataframe to keep the changes:
dataframe['DSFS'] = dataframe['DSFS'].apply(replaceMonth)
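As a runnable illustration, here is a sketch with a stand-in replaceMonth. The real function from the question is a series of if statements that is not shown, so the regex version below is purely hypothetical and only mimics the '3- 4 months' to '_3_4' example. The second row is made up.

```python
import re

import pandas as pd


def replaceMonth(string):
    """Hypothetical stand-in: join the numbers in the text with underscores."""
    numbers = re.findall(r"\d+", string)
    return "_" + "_".join(numbers)


dataframe = pd.DataFrame(
    {
        "ID": [111111, 222222],
        "Year": ["Y1", "Y1"],
        "DSFS": ["3- 4 months", "1- 2 months"],
        "DrugCount": [1, 2],
    }
)

# Pass the function object; pandas calls it once per value in the column.
dataframe["DSFS"] = dataframe["DSFS"].apply(replaceMonth)
print(dataframe)
```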
I want to produce an aggregation along a certain criterion, but also need a row with the same aggregation applied to the non-aggregated dataframe.
When using customers.groupby('year').size(), is there a way to keep the total among the groups, in order to output something like the following?
year   customers
2011   3
2012   5
total  8
The only thing I could come up with so far is the following:
n_customers_per_year.loc['total'] = len(customers)
(n_customers_per_year is the dataframe aggregated by year. While this method is fairly straightforward for a single index, it seems to get messy when it has to be done on a multi-indexed aggregation.)
I believe the pivot_table method has a boolean argument for totals, called margins. Have a look:
margins : boolean, default False Add all row / columns (e.g. for subtotal / grand totals)
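A small sketch of the margins argument on made-up data. The customers frame and its name column are hypothetical; margins_name only relabels the extra row, which pandas calls All by default:

```python
import pandas as pd

# Hypothetical customers table: three customers in 2011, five in 2012.
customers = pd.DataFrame(
    {"year": [2011] * 3 + [2012] * 5, "name": list("abcdefgh")}
)

# Count customers per year and append a grand-total row.
per_year = customers.pivot_table(
    index="year", values="name", aggfunc="count",
    margins=True, margins_name="total",
)
print(per_year)
```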
I agree that this would be a desirable feature, but I don't believe it is currently implemented for groupby. Ideally, one would like to display an aggregation (e.g. sum) along one or more axes and/or levels.
A workaround is to create a Series that holds the sum and then concatenate it to your DataFrame when delivering the data.
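The workaround could look roughly like this. It is a sketch on made-up data: the customers frame is hypothetical, and the customers label on the result is chosen to match the desired output in the question:

```python
import pandas as pd

# Hypothetical customers table: three customers in 2011, five in 2012.
customers = pd.DataFrame(
    {"year": [2011] * 3 + [2012] * 5, "name": list("abcdefgh")}
)

# Aggregate per year, then build a one-element Series holding the total.
n_customers_per_year = customers.groupby("year").size().rename("customers")
total = pd.Series({"total": n_customers_per_year.sum()}, name="customers")

# Concatenate so the total appears as the last row.
with_total = pd.concat([n_customers_per_year, total])
print(with_total)
```

Keeping the total in a separate Series leaves the aggregated data itself untouched until the moment it is displayed.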