Sorting the quintile output from qcut in pandas (Python)

I have an Ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on the prob (probability) variable. The number of observations should fall into categories 0-20%, 20-40%, etc. The code I think does this is:
test = pd.qcut(ebola.prob,5).value_counts()
This returns:
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is: how do I sort this to return the correct number of observations for 0-20%, 20-40%, 40-60%, 60-80%, and 80-100%?
I have tried:
test.value_counts(sort=False)
This returns:
104 1
89 1
92 1
103 1
111 1
Is 104, 89, 92, 103, 111 the correct order for each quintile?
I am confused, because if I look at the probability intervals from my first piece of code it looks like the order should be 111, 89, 103, 104, 92.

What you're doing is essentially correct, but you might have two issues:
(1) I think you are using pd.cut() instead of pd.qcut().
(2) You are applying value_counts() one too many times.
For (1), you can reference this question here; when you use pd.qcut(), you should have the same number of records in each bin (assuming that your total number of records is evenly divisible by the number of bins), which you do not. Check and make sure you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want:
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
Or with pd.qcut():
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20
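If you already have the unsorted test result from the question, sorting by the interval index should give the same ordering, since the bins are ordered categories. A minimal sketch, reusing the question's ebola and test names:
# Depending on your pandas version, sorting the existing result by
# its interval index puts the bins in 0-20%, 20-40%, ... order.
test = pd.qcut(ebola.prob, 5).value_counts()
test_sorted = test.sort_index()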

How to sample data from Pandas Dataframe based on value count from another column

I have a dataframe of about 400,000 observations. I want to sample 50,000 observations based on the proportion of each state in a 'state' column. So if 5% of all observations are from TX, then 2,500 of the samples should be from TX, and so on.
I tried the following:
import pandas as pd
df.sample(n=50000, weights = 'state', random_state = 101)
That gave me this error:
TypeError: '<' not supported between instances of 'str' and 'int'
Is there a different way to do this?
Weights modify the probability of any one row being selected, but they can't provide strict guarantees on the counts of given values, which is what you want. For that you would need .groupby('state'):
>>> rate = df['state'].value_counts(normalize=True)
>>> rate
TX 0.5
NY 0.3
CA 0.2
>>> df.groupby('state').apply(lambda s: s.sample(int(10 * rate[s.name]))).droplevel('state')
state val
69 CA 33
19 CA 99
37 NY 89
36 NY 63
75 NY 3
42 TX 42
53 TX 52
50 TX 68
72 TX 70
2 TX 18
Replace 10 with the number of samples you want, i.e. 50_000. This gives slightly more flexibility than the more efficient answer by @Psidom.
You can use groupby.sample:
df.groupby('state').sample(frac=0.125, random_state=101)
The weights parameter is different from grouping: it expects a list of numbers to use as sampling probabilities, for when you want unequal probability weighting for different rows.
For instance, the following sample will always return a data frame built from the first two rows, since the last two rows have weights of 0 and will never get selected:
df = pd.DataFrame({'a': [1,2,3,4]})
df.sample(n=2, weights=[0.5,0.5,0,0])
a
0 1
1 2
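If you need the total to be exactly (or very nearly) 50,000 with per-state counts proportional to the observed shares, here is a minimal sketch along the same lines, assuming df has a 'state' column as in the question (rounding can shift the total by a few rows):
import pandas as pd

n_total = 50_000
# Per-state sample sizes proportional to each state's share of rows.
counts = (df['state'].value_counts(normalize=True) * n_total).round().astype(int)
sample = (
    df.groupby('state', group_keys=False)
      .apply(lambda g: g.sample(counts[g.name], random_state=101))
)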

Applying `pd.qcut` on multiple columns

I have a DataFrame containing 2 columns, x and y, that represent coordinates in a Cartesian system. I want to obtain groups with an even (or almost even) number of points. I was thinking about using pd.qcut(), but as far as I can tell it can be applied only to 1 column.
For example, I would like to divide the whole set of points into 4 intervals in x and 4 intervals in y (the counts might not be exactly equal) so that I would have a roughly even number of points in each group. I expect to see 16 intervals in total (4x4).
I tried a very direct approach which obviously didn't produce the right result (look at 51 and 99 for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin y_bin
(7.976999999999999, 7.984] (-219.17600000000002, -219.17] 51 51
(-219.17, -219.167] 60 60
(-219.167, -219.16] 64 64
(-219.16, -219.154] 99 99
(7.984, 7.986] (-219.17600000000002, -219.17] 76 76
(-219.17, -219.167] 81 81
(-219.167, -219.16] 63 63
(-219.16, -219.154] 53 53
(7.986, 7.989] (-219.17600000000002, -219.17] 78 78
(-219.17, -219.167] 77 77
(-219.167, -219.16] 68 68
(-219.16, -219.154] 51 51
(7.989, 7.993] (-219.17600000000002, -219.17] 70 70
(-219.17, -219.167] 55 55
(-219.167, -219.16] 77 77
(-219.16, -219.154] 71 71
Am I making a mistake in thinking this is possible with pandas alone, or am I missing something else?
The problem is that the distribution of the rows might not be the same along x as it is along y.
You are empirically mimicking a correlation analysis and finding out that there is a slight negative relation: the y values are higher at the lower end of the x scale and rather flat at the higher end of x.
So, if you want an even number of data points in each bin, I would suggest splitting the df into x bins and then applying qcut on y within each x bin (so the y bins have different cut points but an even sample size).
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))]  # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y, 4)).x.size())
            for xbin, xdf in split_df]  # now each xdf is evenly cut on y
Now you need to work on each xdf separately. Attempting to concatenate all the xdfs will result in an error: the index of each xdf is a CategoricalIndex, and the first xdf would need to have all the categories for concat to work (i.e. split_df[0][1].index must include the bins of all the other xdfs). Alternatively, you could change the index to the center of each interval, as a float64, on both the x bins and the y bins.
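Putting that together, a minimal sketch of the nested qcut (the column names x and y follow the question; newer pandas versions may want observed=True on the final groupby, as used here):
import pandas as pd

df['x_bin'] = pd.qcut(df['x'], 4)
# Cut y separately inside each x bin, so every (x_bin, y_bin) cell
# ends up with roughly len(df) / 16 points.
df['y_bin'] = (
    df.groupby('x_bin', group_keys=False)['y']
      .apply(lambda s: pd.qcut(s, 4))
)
counts = df.groupby(['x_bin', 'y_bin'], observed=True).size()
print(counts)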

Accessing columns with MultiIndex after using pandas groupby and aggregate

I am using the df.groupby() method:
g1 = df[['md', 'agd', 'hgd']].groupby(['md']).agg(['mean', 'count', 'std'])
It produces exactly what I want!
agd hgd
mean count std mean count std
md
-4 1.398350 2 0.456494 -0.418442 2 0.774611
-3 -0.281814 10 1.314223 -0.317675 10 1.161368
-2 -0.341940 38 0.882749 0.136395 38 1.240308
-1 -0.137268 125 1.162081 -0.103710 125 1.208362
0 -0.018731 603 1.108109 -0.059108 603 1.252989
1 -0.034113 178 1.128363 -0.042781 178 1.197477
2 0.118068 43 1.107974 0.383795 43 1.225388
3 0.452802 18 0.805491 -0.335087 18 1.120520
4 0.304824 1 NaN -1.052011 1 NaN
However, I now want to access the columns of the aggregated result like those of a "normal" dataframe.
I will then be able to:
1) calculate the errors on the agd and hgd means, and
2) make scatter plots of md (x axis) vs the agd mean (and hgd mean), with appropriate error bars added.
Is this possible? Perhaps by playing with the indexing?
1) You can rename the columns and proceed as normal (this will get rid of the multi-indexing):
g1.columns = ['agd_mean', 'agd_count', 'agd_std', 'hgd_mean', 'hgd_count', 'hgd_std']
2) You can keep the multi-indexing and use both levels in turn (docs):
g1['agd'][['mean', 'count']]
It is possible to do what you are looking for, and it is called transform. You will find an example that does exactly this in the pandas documentation here.
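For the plotting part of the question, here is a minimal sketch of the standard error and an error-bar plot, keeping g1's MultiIndex and assuming matplotlib is available (the column names follow the question):
import numpy as np
import matplotlib.pyplot as plt

agd = g1['agd']
# Standard error of the mean for each md group.
sem = agd['std'] / np.sqrt(agd['count'])

plt.errorbar(agd.index, agd['mean'], yerr=sem, fmt='o')
plt.xlabel('md')
plt.ylabel('agd mean')
plt.show()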

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
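From there, the comparison the question asks for is a plain column subtraction; a minimal sketch, reusing the column names above ('deviation' is just an illustrative name):
# Difference between each actual concentration and its bin's mean.
df['deviation'] = df['Concentration'] - df['meanconcentration']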

Efficient Repeated DataFrame Row Selections Based on Multiple Columns' Values

This is a snippet of my DataFrame in pandas:
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
and so on up to n, which is very large.
We have multiple SubJobs, which are sorted; inside each we have multiple CategoryIDs, which are sorted; and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list:
[[CategoryId, DefectId, Image-Link], [CategoryId, DefectId, Image-Link], ... m times]
where m is large.
Here CategoryId and DefectId are integer values and the image link is a string.
Now I repeatedly pick a CategoryId and DefectId from the list, find the row in the dataframe corresponding to that CategoryId and DefectId, and add the image link to that row.
My current code is:
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.ix[(frame_main["CategoryID"].values == image_info_list[0])
                  & (frame_main["DefectID"].values == image_info_list[1]),
                  "Image_Link"] = image_info_list[2]
This is working perfectly, but my issue is that since n and m are very large, it takes a lot of time to compute. Is there any other, more appropriate approach?
Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(m n). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
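A pandas-flavoured sketch of Idea 1, under the assumption that every (CategoryID, DefectID) pair in the list exists in the frame (variable names follow the question): pay the sorting cost once by setting a sorted index, then assign each image link by label lookup instead of a full boolean scan per item.
# Sort once; each assignment is then an indexed lookup on the
# (CategoryID, DefectID) key rather than a scan of all n rows.
frame_sorted = frame_main.set_index(['CategoryID', 'DefectID']).sort_index()

for category_id, defect_id, image_link in final_image_info_list:
    # A missing pair would raise a KeyError here.
    frame_sorted.loc[(category_id, defect_id), 'Image_Link'] = image_link

frame_main = frame_sorted.reset_index()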
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
{k: i for (i, k) in enumerate(df['Image-link'].values)}
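The same dictionary idea can also be applied to the (CategoryID, DefectID) pair that the question actually searches on; a minimal sketch, assuming each pair identifies a single row (variable names follow the question):
# Map each (CategoryID, DefectID) pair to its positional row index;
# if a pair occurs on several rows, only the last one is kept here.
key_to_pos = {
    key: pos
    for pos, key in enumerate(
        zip(frame_main['CategoryID'].values, frame_main['DefectID'].values)
    )
}

# Create the target column up front if it does not exist yet.
if 'Image_Link' not in frame_main.columns:
    frame_main['Image_Link'] = ''
image_col = frame_main.columns.get_loc('Image_Link')

for category_id, defect_id, image_link in final_image_info_list:
    pos = key_to_pos.get((category_id, defect_id))
    if pos is not None:
        frame_main.iat[pos, image_col] = image_link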
