I am trying to get the proportion of each category in the data set by day, to be able to plot it eventually.
Sample (daily_usage):
type date count
0 A 2016-03-01 70
1 A 2016-03-02 64
2 A 2016-03-03 38
3 A 2016-03-04 82
4 A 2016-03-05 37
...
412 G 2016-03-27 149
413 G 2016-03-28 382
414 G 2016-03-29 232
415 G 2016-03-30 312
416 G 2016-03-31 412
I plotted the mean and median by type just fine with the following code:
daily_usage.groupby('type')['count'].agg(['median','mean']).plot(kind='bar')
But I want a similar plot showing the proportion of the daily counts instead. For the eventual plot I don't need to show the date; it would just show the average/median daily proportion for each type.
The proportion I mean is, for example, for the first row: type A happened 70 times on March 1; across all types on March 1 there is a total of 948 events, so the proportion of type A on March 1 is 70/948. This would be computed for every row. The final plot should show each type on the x-axis and the average daily proportion on the y-axis.
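For that first row the daily proportion would therefore be 70 / 948 ≈ 0.074, and the bar for type A would be the mean (or median) of those per-day proportions over all the days in the data.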
I tried getting the proportion in two ways.
First one:
daily_usage['ratio'] = (daily_usage / daily_usage.groupby('date').transform(sum))['count']
The denominator in this first try gives me this sample output, so it looks like it should be very easy to divide the original count column by this new daily count column:
count
0 ... 948
1 ... 910
2 ... 588
3 ... 786
4 ... 530
5 ... 1043
Error:
TypeError: unsupported operand type(s) for /: 'str' and 'str'
Second one:
daily_usage.div(day_total,axis='count')
where day_total = daily_usage.groupby('date').agg({'count':'sum'}).reset_index()
Error:
TypeError: ufunc true_divide cannot use operands with types dtype('<M8[ns]') and dtype('<M8[ns]')
What's a better way to do this?
If you just want to add your new column to your dataframe, you can do the following:
df['ratio'] = (df.groupby(['type', 'date'])['count'].transform('sum') / df.groupby('date')['count'].transform('sum'))
However, I have spent a while trying to figure out what exactly you are trying to plot, and I still don't really get your intention, so please leave a detailed comment if you need help with the plotting and specify what you want to plot and how (one plot for the daily usage of each day, or some other form).
PS: in my code, df refers to your daily_usage dataframe.
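To then turn the ratio column into the bar plot described in the question (average/median daily proportion per type), a minimal sketch along these lines should work; the tiny daily_usage frame below is just an illustrative stand-in for the real one:
import pandas as pd
import matplotlib.pyplot as plt

# tiny illustrative stand-in for the real daily_usage dataframe
daily_usage = pd.DataFrame({
    'type':  ['A', 'G', 'A', 'G'],
    'date':  ['2016-03-01', '2016-03-01', '2016-03-02', '2016-03-02'],
    'count': [70, 878, 64, 846],
})

# share of each row's count within its date
daily_usage['ratio'] = daily_usage['count'] / daily_usage.groupby('date')['count'].transform('sum')

# average and median daily proportion per type, shown as bars
daily_usage.groupby('type')['ratio'].agg(['mean', 'median']).plot(kind='bar')
plt.show()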
Hope this was helpful.
It feels so straightforward, but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?
OK, I could do this the loopy way, but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:
I have a column of times in nanoseconds in my DataFrame. I want to group these into little clusters based on the proximity of their values. Most clusters will have two rows, maybe up to five or six. I do not know the number of clusters; there will be a massive number of very small ones.
I thought I could e.g. introduce a second index, or just an additional column with 1 for all rows of the first cluster, 2 for the second, and so forth, so that the groupby becomes straightforward thereafter.
something like:
        t (ns)            cluster
71      1524957248.4375         1
72      1524957265.625          1
699     14624846476.5625        2
700     14624846653.125         2
701     14624846661.287         2
1161    25172864926.5625        3
1160    25172864935.9375        3
Thanks for your help!
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
using the "t (ns)":
thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
You can 'round' the t (ns) column by floor-dividing it by a threshold value and looking at the differences:
df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
            .diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of clusters into which you try to organize your data:
bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(df['t (ns)'], bins=bins)
           .cat.rename_categories(range(1, bins + 1))
)
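Whichever variant you use, once the cluster labels exist the grouping the question was after becomes a one-liner; a quick sketch of that follow-up step:
# assuming df now carries the 'cluster' column from one of the snippets above
df.groupby('cluster')['t (ns)'].agg(['mean', 'count', 'min', 'max'])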
I have this dataset from US Census Bureau with weighted data:
Weight Income ......
2 136 72000
5 18 18000
10 21 65000
11 12 57000
23 43 25700
The first person represents 136 people, the second 18, and so on. There are a lot of other columns, and I need to do several charts and calculations. It will be too much work to apply the weight every time I need to do a chart, pivot table, etc.
Ideally, I would like to use this:
df2 = df.iloc[np.repeat(df.index.values, df.PERWT)]
To create an unweighted or flat dataframe.
This produces a new large (1.4GB) dataframe:
Weight Wage
0 136 72000
0 136 72000
0 136 72000
0 136 72000
0 136 72000
.....
The thing is that using all the columns of the dataset, my computer runs out of memory.
Any idea on how to use the weights to create a new weighted dataframe?
I've tried this:
df2 = df.sample(frac=1, weights=df['Weight'])
But it seems to produce the same data. Changing frac to 0.5 could be a solution, but I'll lose 50% of the information.
Thanks!
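One memory-friendly direction, sketched here under the assumption of a Weight/Income layout like the sample above, is to keep the frame un-expanded and compute weighted statistics directly instead of replicating rows:
import numpy as np
import pandas as pd

# small stand-in mirroring the sample rows above
df = pd.DataFrame({'Weight': [136, 18, 21, 12, 43],
                   'Income': [72000, 18000, 65000, 57000, 25700]})

# weighted mean income, no row replication needed
weighted_mean = np.average(df['Income'], weights=df['Weight'])
print(weighted_mean)

# the same idea extends to groups, e.g. with a hypothetical 'State' column:
# df.groupby('State').apply(lambda g: np.average(g['Income'], weights=g['Weight']))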
I have a pandas dataframe containing time series data for imaged neurons, with columns for the area and the mean/min/max values of 4 imaged cells, over 2400 rows of observations. There are two additional columns in the dataframe: Time (seconds) at which each observation occurred, and a column specifying the experimental condition (basal, + some drug, etc.):
Area1 Mean1 Min1 Max1 Area2 Mean2 ... epoch Time(s)
0 28 253.536 31 854 22 109.045 ... basal 0
1 28 181.643 16 677 22 73.591 ... basal 0.2
2 28 163.036 16 589 22 66.727 ... basal 0.4
and so on for 2400 rows
When trying to use the seaborn lineplot function to plot a line graph of the mean time series averaged over all cells (i.e. the average across the 'Mean' columns for each row), plotted against the time column, I get an error:
ValueError: could not broadcast input array from shape (2400) into shape (4)
My code is:
ax = sns.lineplot(x='Time(s)', y=neurons_df.filter(like='Mean'), hue="condition",data=neurons_df)
I have implemented nearly identical code successfully before; the only difference is that I was previously plotting one Y time series at a time, whereas now I am trying to plot several simultaneously.
Any ideas where I'm going wrong?
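For what it's worth, sns.lineplot expects y to be a single vector or column name rather than a DataFrame of several columns, which is consistent with the broadcast error above. A common workaround (a sketch, assuming neurons_df as described and that 'epoch' holds the condition) is to melt the Mean columns to long form first:
import pandas as pd
import seaborn as sns

# reshape the four Mean columns into long form: one (time, cell, value) row each
mean_cols = neurons_df.filter(like='Mean').columns
long_df = neurons_df.melt(
    id_vars=['Time(s)', 'epoch'],   # 'epoch' assumed to hold the experimental condition
    value_vars=mean_cols,
    var_name='cell',
    value_name='mean_intensity',
)

# seaborn averages over cells at each time point and draws one line per condition
ax = sns.lineplot(data=long_df, x='Time(s)', y='mean_intensity', hue='epoch')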
I am using the df.groupby() method:
g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])
It produces exactly what I want!
agd hgd
mean count std mean count std
md
-4 1.398350 2 0.456494 -0.418442 2 0.774611
-3 -0.281814 10 1.314223 -0.317675 10 1.161368
-2 -0.341940 38 0.882749 0.136395 38 1.240308
-1 -0.137268 125 1.162081 -0.103710 125 1.208362
0 -0.018731 603 1.108109 -0.059108 603 1.252989
1 -0.034113 178 1.128363 -0.042781 178 1.197477
2 0.118068 43 1.107974 0.383795 43 1.225388
3 0.452802 18 0.805491 -0.335087 18 1.120520
4 0.304824 1 NaN -1.052011 1 NaN
However, I now want to access the groupby object columns like a "normal" dataframe.
I will then be able to:
1) calculate the errors on the agd and hgd means
2) make scatter plots on md (x axis) vs agd mean (hgd mean) with appropriate error bars added.
Is this possible? Perhaps by playing with the indexing?
1) You can rename the columns and proceed as normal (will get rid of the multi-indexing)
g1.columns = ['agd_mean', 'agd_count', 'agd_std', 'hgd_mean', 'hgd_count', 'hgd_std']
2) You can keep multi-indexing and use both levels in turn (docs)
g1['agd'][['mean', 'count']]
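For the two follow-up goals (errors on the means, then a scatter of md vs the mean with error bars), a minimal sketch that keeps the multi-index from option 2 and takes the standard error as std / sqrt(count):
import numpy as np
import matplotlib.pyplot as plt

agd = g1['agd']                            # mean / count / std for agd, indexed by md
sem = agd['std'] / np.sqrt(agd['count'])   # standard error of each group mean
plt.errorbar(agd.index, agd['mean'], yerr=sem, fmt='o')
plt.xlabel('md')
plt.ylabel('agd mean')
plt.show()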
It is possible to do what you are searching for, and it is called transform. You will find an example that does exactly this in the pandas documentation here.
I have an Ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on the prob (probability) variable. The observations should fall into the categories 0-20%, 20-40%, etc. The code I think does this is:
test = pd.qcut(ebola.prob,5).value_counts()
This returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is: how do I sort this to return the correct number of observations for 0-20%, 20-40%, 40-60%, 60-80%, 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104, 89, 92, 103, 111 for each quintile? I am confused because, if I look at the probability output from my first piece of code, it looks like it should be 111, 89, 103, 104, 92.
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this question here; when you use pd.qcut(), you should have the same number of records in each bin (assuming your total number of records is evenly divisible by the number of bins), which you do not. Maybe check and make sure you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want:
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut():
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20
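If the counts should literally be reported against the 0-20%, 20-40%, ... labels the question mentions, one small extra step (a sketch, with made-up label strings, assuming the ebola frame from the question) is to pass labels to pd.qcut:
labels = ['0-20%', '20-40%', '40-60%', '60-80%', '80-100%']
pd.qcut(ebola.prob, 5, labels=labels).value_counts(sort=False)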