Applying `pd.qcut` on multiple columns - python

I have a DataFrame containing 2 columns x and y that represent coordinates in a Cartesian system. I want to obtain groups with an even (or almost even) number of points. I was thinking about using pd.qcut(), but as far as I can tell it can only be applied to one column.
For example, I would like to divide the whole set of points into 4 intervals in x and 4 intervals in y (the two numbers might differ) so that each group holds a roughly even number of points. I expect to see 16 intervals in total (4x4).
I tried a very direct approach, which obviously didn't produce the right result (compare the counts of 51 and 99, for example). Here is the code:
df['x_bin']=pd.qcut(df.x,4)
df['y_bin']=pd.qcut(df.y,4)
grouped=df.groupby([df.x_bin,df.y_bin]).count()
print(grouped)
The output:
x_bin                       y_bin
(7.976999999999999, 7.984]  (-219.17600000000002, -219.17]  51  51
                            (-219.17, -219.167]             60  60
                            (-219.167, -219.16]             64  64
                            (-219.16, -219.154]             99  99
(7.984, 7.986]              (-219.17600000000002, -219.17]  76  76
                            (-219.17, -219.167]             81  81
                            (-219.167, -219.16]             63  63
                            (-219.16, -219.154]             53  53
(7.986, 7.989]              (-219.17600000000002, -219.17]  78  78
                            (-219.17, -219.167]             77  77
                            (-219.167, -219.16]             68  68
                            (-219.16, -219.154]             51  51
(7.989, 7.993]              (-219.17600000000002, -219.17]  70  70
                            (-219.17, -219.167]             55  55
                            (-219.167, -219.16]             77  77
                            (-219.16, -219.154]             71  71
Am I making a mistake in thinking it is possible to do with pandas only or am I missing something else?

The problem is that the distribution of the rows along x is not necessarily the same as the distribution along y.
You are empirically mimicking a correlation analysis and finding that there is a slight negative relation: the y values are higher at the lower end of the x scale and rather flat at the higher end.
So, if you want an even number of data points in each bin, I would suggest splitting the df into x bins and then applying qcut on y within each x bin (so the y bins have different cut points but even sample sizes).
Edit
Something like:
split_df = [(xbin, xdf) for xbin, xdf in df.groupby(pd.qcut(df.x, 4))]  # no aggregation so far, just splitting the df evenly on x
split_df = [(xbin, xdf.groupby(pd.qcut(xdf.y, 4)).x.size())
            for xbin, xdf in split_df]  # now each xdf is evenly cut on y
Now you need to work on each xdf separately. Attempting to concatenate all xdfs will result in an error: the index of each xdf is a CategoricalIndex, and the first xdf would need to contain all categories for concat to work (i.e. split_df[0][1].index would have to include the bins of all other xdfs). Alternatively, you could replace the index with the center of each interval as a float64, on both the x bins and the y bins.
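Something like the following is a minimal sketch of that last route, on random stand-in data; the observed=True flag and the interval .mid midpoints are my additions to make the pieces concatenate cleanly:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=1000), 'y': rng.normal(size=1000)})  # stand-in data

pieces = []
for xbin, xdf in df.groupby(pd.qcut(df.x, 4), observed=True):
    counts = xdf.groupby(pd.qcut(xdf.y, 4), observed=True).x.size()          # even y bins within this x bin
    pieces.append(pd.DataFrame({'x_bin_mid': xbin.mid,                       # interval centers as float64
                                'y_bin_mid': [iv.mid for iv in counts.index],
                                'n_points': counts.values}))

result = pd.concat(pieces, ignore_index=True)
print(result)                                                                # 16 rows, roughly 1000/16 points each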

Related

How to sample data from Pandas Dataframe based on value count from another column

I have a dataframe of about 400,000 observations. I want to sample 50,000 observations based on the amount of each state that's in a 'state' column. So if there is 5% of all observations from TX, then 2,500 of the samples should be from TX, and so on.
I tried the following:
import pandas as pd
df.sample(n=50000, weights = 'state', random_state = 101)
That gave me this error.
TypeError: '<' not supported between instances of 'str' and 'int'
Is there a different way to do this?
Weights modify the probability of any one row being selected, but they cannot guarantee strict counts per value, which is what you want. For that you would need .groupby('state'):
>>> rate = df['state'].value_counts(normalize=True)
>>> rate
TX 0.5
NY 0.3
CA 0.2
>>> df.groupby('state').apply(lambda s: s.sample(int(10 * rate[s.name]))).droplevel('state')
state val
69 CA 33
19 CA 99
37 NY 89
36 NY 63
75 NY 3
42 TX 42
53 TX 52
50 TX 68
72 TX 70
2 TX 18
Replace 10 with the number of samples you want, i.e. 50_000. This gives slightly more flexibility than the more efficient answer by @Psidom.
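Scaled toward the asker's target, the same idea might look like this (a sketch on toy data; group_keys=False stands in for droplevel, and per-state rounding can leave the total a few rows off exactly 50,000):
import numpy as np
import pandas as pd

rng = np.random.default_rng(101)
df = pd.DataFrame({'state': rng.choice(['TX', 'NY', 'CA'], size=4000, p=[0.5, 0.3, 0.2]),
                   'val': rng.integers(0, 100, size=4000)})   # stand-in for the 400,000-row frame

n_total = 500                                                 # use 50_000 on the real data
rate = df['state'].value_counts(normalize=True)               # observed share of each state
sampled = (df.groupby('state', group_keys=False)
             .apply(lambda s: s.sample(int(round(n_total * rate[s.name])), random_state=101)))
print(sampled['state'].value_counts())                        # counts track each state's share of n_total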
You can use groupby.sample:
df.groupby('state').sample(frac=0.125, random_state=101)
The weights parameter is different from groups: it expects a list of numbers to use as sampling probabilities, for cases where you want unequal probability weighting across rows.
For instance the following sample will always return a data frame from the first two rows since the last two rows have weights of 0 and will never get selected:
df = pd.DataFrame({'a': [1,2,3,4]})
df.sample(n=2, weights=[0.5,0.5,0,0])
a
0 1
1 2

Adding confidence intervals for population rates in a dataframe

I have a dataframe where I have created a new column which sums the first three columns (the date columns). Then I have created a rate for each row based on the population column.
I would like to create lower and upper 95% confidence limits for the "sum_of_days_rate" for each row in this dataset.
I can compute a mean of the first three columns, but I am not sure how to derive lower and upper values for the rate based on their sum.
Sample of the dataset below:
import pandas as pd

data = {'09/01/2021': [74, 84, 38],
        '10/11/2021': [43, 35, 35],
        '12/01/2021': [35, 37, 16],
        'population': [23000, 69000, 48000]}
df = pd.DataFrame(data, columns=['09/01/2021', '10/11/2021', '12/01/2021', 'population'])
df['sum_of_days'] = df.loc[:, df.columns[0:3]].sum(1)
df['sum_of_days_rate'] = df['sum_of_days'] / df['population'] * 100000
To estimate a confidence interval you need to make certain assumptions about the data: how it is distributed, or what the associated error would be. I am not sure what your data points mean or why you are summing them up.
A commonly used distribution for rates would be a Poisson distribution, and you can construct the confidence interval given a mean:
import scipy.stats

lb, ub = scipy.stats.poisson.interval(0.95, df.sum_of_days_rate)
df['lb'] = lb
df['ub'] = ub
The arrays lb and ub are the lower and upper bounds of the 95% confidence interval. The final data frame looks like this:
09/01/2021 10/11/2021 12/01/2021 population sum_of_days sum_of_days_rate lb ub
0 74 43 35 23000 152 660.869565 611.0 712.0
1 84 35 37 69000 156 226.086957 197.0 256.0
2 38 35 16 48000 89 185.416667 159.0 213.0
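Putting the question's setup and this answer together, a self-contained version (assuming scipy is installed) looks like:
import pandas as pd
import scipy.stats

data = {'09/01/2021': [74, 84, 38],
        '10/11/2021': [43, 35, 35],
        '12/01/2021': [35, 37, 16],
        'population': [23000, 69000, 48000]}
df = pd.DataFrame(data)

df['sum_of_days'] = df.iloc[:, 0:3].sum(axis=1)
df['sum_of_days_rate'] = df['sum_of_days'] / df['population'] * 100000

# 95% Poisson interval around each rate
df['lb'], df['ub'] = scipy.stats.poisson.interval(0.95, df['sum_of_days_rate'])
print(df)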

How can I find the groups with size more than a value in python?

I read the data into a DataFrame and called it data. I have the following query in python:
data[data["gender"]=="male"].groupby('age').city.nunique().sort_values(ascending=False)
age
29 86
24 85
21 81
25 81
20 81
28 78
27 78
Now I want to find those groups whose size is more than 80. How can I do that in Python?
The result of your aggregation and sorting call is a pandas Series whose index holds the groups you are looking for. So to find the groups above a certain cutOffValue:
cutOffValue = 80
counts = data[data["gender"]=="male"].groupby('age').city.nunique().sort_values(ascending=False)
groups = counts[counts > cutOffValue].index
And of course, if you want it as a list or set, you could easily cast the final value
groups = list(groups)
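A tiny self-contained illustration of the same pattern, with made-up data and a cutoff of 2 instead of 80:
import pandas as pd

data = pd.DataFrame({'gender': ['male'] * 6,
                     'age': [29, 29, 29, 24, 24, 21],
                     'city': ['A', 'B', 'C', 'A', 'B', 'A']})
counts = data[data['gender'] == 'male'].groupby('age').city.nunique()
print(list(counts[counts > 2].index))  # ages with more than 2 distinct cities -> [29]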

Efficient Repeated DataFrame Row Selections Based on Multiple Columns' Values

This is a snippet of my Data Frame in pandas
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
till n, which is very large.
We have multiple SubJobs, which are sorted; inside each SubJob we have multiple CategoryIDs, which are sorted, and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list:
[[CategoryId, DefectId, Image-Link] [CategoryId, DefectId, Image-Link] ... m times]
m is large.
Here CategoryId and DefectId are integer values and the image link is a string.
Now I repeatedly pick a CategoryId, DefectId pair from the list, find the row in the dataframe corresponding to that CategoryId and DefectId, and add the image link to that row.
My current code is:
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.ix[(frame_main["CategoryID"].values == image_info_list[0])
                  &
                  (frame_main["DefectID"].values == image_info_list[1]),
                  "Image_Link"] = image_info_list[2]
This is working correctly, but since n and m are very large it takes a lot of time to compute. Is there any other, more appropriate approach?
Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(mn). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
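A rough sketch of Idea 1 on toy stand-ins for frame_main and final_image_info_list (collapsing the per-column bisects into a single bisect on (CategoryID, DefectID) tuples):
import bisect
import pandas as pd

frame_main = pd.DataFrame({'CategoryID': [79, 79, 92, 99, 99],       # toy stand-in
                           'DefectID': [5, 6, 4, 123, 124],
                           'Image_Link': [''] * 5})
final_image_info_list = [[99, 123, 'img123.png'], [92, 4, 'img4.png']]

# sort once, then binary-search each query: O(n log n + m log n)
frame_main = frame_main.sort_values(['CategoryID', 'DefectID']).reset_index(drop=True)
keys = list(zip(frame_main['CategoryID'], frame_main['DefectID']))

for cat_id, defect_id, link in final_image_info_list:
    pos = bisect.bisect_left(keys, (cat_id, defect_id))
    if pos < len(keys) and keys[pos] == (cat_id, defect_id):
        frame_main.at[pos, 'Image_Link'] = link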
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
dict([(k, i) for (i, k) in enumerate(df['Image-link'].values)])
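The same preprocessing idea can be applied to the (CategoryID, DefectID) pairs the question actually looks up; a minimal sketch, assuming each pair occurs in at most one row:
import pandas as pd

frame_main = pd.DataFrame({'CategoryID': [79, 79, 92, 99, 99],       # toy stand-in
                           'DefectID': [5, 6, 4, 123, 124],
                           'Image_Link': [''] * 5})
final_image_info_list = [[99, 123, 'img123.png'], [92, 4, 'img4.png']]

# one pass to build the lookup table: (CategoryID, DefectID) -> row label
row_for_pair = {pair: idx for idx, pair in
                zip(frame_main.index, zip(frame_main['CategoryID'], frame_main['DefectID']))}

for cat_id, defect_id, link in final_image_info_list:                # each query is now O(1)
    idx = row_for_pair.get((cat_id, defect_id))
    if idx is not None:
        frame_main.at[idx, 'Image_Link'] = link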

sorting the quintile output from qcut in pandas python

I have an ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on prob (the probability variable). The observations should fall into categories 0-20%, 20-40%, etc. The code I think does this is:
test = pd.qcut(ebola.prob,5).value_counts()
this returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is: how do I sort this to return the correct number of observations for 0-20%, 20-40%, 40-60%, 60-80%, 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104, 89, 92, 103, 111 for each quintile?
I am confused, because if I look at the probability outputs from my first piece of code it looks like it should be 111, 89, 103, 104, 92.
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this question here; when you use pd.qcut(), you should have the same number of records in each bin (assuming your total number of records is evenly divisible by the number of bins), which you do not. Check that you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut()
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20
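If you also want the bins reported as percentile ranges rather than cut points, pd.qcut() accepts a labels argument (same random data as above):
>>> pd.qcut(df.prob, 5, labels=['0-20%', '20-40%', '40-60%', '60-80%', '80-100%']).value_counts(sort=False)
0-20%      20
20-40%     20
40-60%     20
60-80%     20
80-100%    20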
