This is a snippet of my Data Frame in pandas
SubJob DetectorName CategoryID DefectID Image
0 0 NECK:1 79 5
1 0 NECK:2 79 6
2 0 NECK:3 92 4
3 0 NECK:4 99 123
4 0 NECK:5 99 124
5 1 NECK:6 79 47
6 1 NECK:7 91 631
7 1 NECK:8 98 646
8 1 NECK:9 99 7
9 2 NECK:10 79 15
10 2 NECK:11 89 1023
11 2 NECK:12 79 1040
12 2 NECK:13 79 2458
13 3 NECK:14 73 2459
14 3 NECK:15 87 2517
15 3 NECK:15 79 3117
16 3 NECK:16 79 3118
and so on up to row n, where n is very large.
We have multiple SubJobs, which are sorted; inside each SubJob we have multiple CategoryIDs, which are sorted; and inside each CategoryID we have multiple DefectIDs, which are also sorted.
I have a separate nested list
[[CategoryId, DefectId, Image-Link], [CategoryId, DefectId, Image-Link], ... m times]
m is large.
Here CategoryId and DefectId are integer values and the image link is a string.
Now I repeatedly pick a CategoryId and DefectId from the list, find the row in the dataframe corresponding to that CategoryId and DefectId, and add the image link to that row.
my current code is
for image_info_list in final_image_info_list:
    # add path of image in Image_Link
    frame_main.ix[(frame_main["CategoryID"].values == image_info_list[0])
                  &
                  (frame_main["DefectID"].values == image_info_list[1]),
                  "Image_Link"] = image_info_list[2]
This is working correctly, but since n and m are very large it takes a lot of time to compute. Is there another, more appropriate approach?
Can I apply binary search here? If yes, then how?
For a fixed n, if m is large enough, you can perform queries more efficiently by some preprocessing.
(I would start with Idea 2 below, because Idea 1 is much more work to implement.)
Idea 1
First, sort the dataframe by [CategoryId, DefectId, Image-Link]. Following that, you can find any triplet by a triple application of a bisect algorithm, one per column, on the column's values.
The cost of what you're doing now is O(m n). The cost of my suggestion is O(n log(n) + m log(n)).
This will work better for some values of m and n, and worse for others. E.g., if m = Θ(n), then your current algorithm is Θ(n²) = ω(n log(n)). YMMV.
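A minimal sketch of Idea 1, assuming the names from the question (frame_main, final_image_info_list, Image_Link) and that each (CategoryID, DefectID) pair matches a single row:
import bisect

# Sort once by the two lookup columns, then answer each query by binary search.
frame_sorted = frame_main.sort_values(['CategoryID', 'DefectID']).reset_index(drop=True)
keys = list(zip(frame_sorted['CategoryID'], frame_sorted['DefectID']))  # sorted key tuples

for category_id, defect_id, link in final_image_info_list:
    pos = bisect.bisect_left(keys, (category_id, defect_id))
    if pos < len(keys) and keys[pos] == (category_id, defect_id):
        frame_sorted.loc[pos, 'Image_Link'] = link
If a pair can match several rows, you would walk forward from pos while the key repeats.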
Idea 2
Since Image-link is a string sequence, I'm guessing pandas has a harder time searching for specific values within it. You can preprocess by making a dictionary mapping each value to a list of indices within the Dataframe. In the extreme case, where each Image-link value has O(1) rows, this can reduce the time from Θ(mn) to Θ(n + m).
Edit
In the extreme case the OP mentions in the comment, all Image-link values are unique. In this case, it is possible to build a dictionary mapping their values to indices like so:
dict([(k, i) for (i, k) in enumerate(df['Image-link'].values)])
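For the lookup in the question, which is keyed on (CategoryID, DefectID) rather than Image-link, the same dictionary preprocessing might look like this sketch (assuming each pair occurs in a single row and the Image_Link column already exists):
row_of = {key: i for i, key in
          enumerate(zip(frame_main['CategoryID'].values,
                        frame_main['DefectID'].values))}

col = frame_main.columns.get_loc('Image_Link')
for category_id, defect_id, link in final_image_info_list:
    i = row_of.get((category_id, defect_id))   # O(1) lookup instead of a full scan
    if i is not None:
        frame_main.iloc[i, col] = link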
Related
It feels so straightforward, but I haven't found the answer to my question yet. How does one group by proximity, or closeness, of two floats in pandas?
OK, I could do this the loopy way, but my data is big and I hope I can expand my pandas skills with your help and do this elegantly:
I have a column of times in nanoseconds in my DataFrame. I want to group these based on the proximity of their values to little clusters. Most of them will be two rows per cluster maybe up to five or six. I do not know the number of clusters. It will be a massive amount of very small clusters.
I thought I could e.g. introduce a second index or just an additional column with 1 for all rows of the first cluster, 2 for the second and so forth so that groupby gets straight forward thereafter.
something like:
                t (ns)  cluster
71     1524957248.4375        1
72      1524957265.625        1
699   14624846476.5625        2
700    14624846653.125        2
701    14624846661.287        2
1161  25172864926.5625        3
1160  25172864935.9375        3
Thanks for your help!
Assuming you want to create the "cluster" column from the index based on the proximity of the successive values, you could use:
thresh = 1
df['cluster'] = df.index.to_series().diff().gt(thresh).cumsum().add(1)
using the "t (ns)":
thresh = 1
df['cluster'] = df['t (ns)'].diff().gt(thresh).cumsum().add(1)
output:
t (ns) cluster
71 1.524957e+09 1
72 1.524957e+09 1
699 1.462485e+10 2
700 1.462485e+10 2
701 1.462485e+10 2
1161 2.517286e+10 3
1160 2.517286e+10 3
You can 'round' the t (ns) column by floor-dividing it by a threshold value and looking at the differences:
df[['t (ns)']].assign(
    cluster=(df['t (ns)'] // 10E7)
            .diff().gt(0).cumsum().add(1)
)
Or you can experiment with the number of clusters you want to organize your data into:
bins = 3
df[['t (ns)']].assign(
    bins=pd.cut(df['t (ns)'], bins=bins)
           .cat.rename_categories(range(1, bins + 1))
)
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It is a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 315.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series which is tm: max(abdomCirc).
Then we unstack(), which moves tm to the column names.
You may want to rename these columns later, but I did not bother.
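As a quick sanity check of the (w + 13 - 1) // 13 formula against the trimester boundaries in the question:
for w in (13, 14, 26, 27, 39, 40):
    print(w, (w + 13 - 1) // 13)
# 13 -> 1, 14 -> 2, 26 -> 2, 27 -> 3, 39 -> 3, but 40 -> 4
So an exact week-40 measurement would spill into a fourth column; clipping tm with .clip(upper=3) keeps it in the third trimester. The sample data stops at week 30, so the output above is not affected.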
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
similar idea, same output.
There is a handy DataFrame method called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and to avoid manually changing the values of your IDs, MotherID and PregnancyID, for every different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
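If you do go the groupby route, a minimal sketch using pd.cut for the week ranges in the question might look like this (the column names are taken from the question):
bins = [0, 13, 26, 40]
labels = ['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd']
trimester = pd.cut(df['gestationalAgeInWeeks'], bins=bins, labels=labels)

summary = (df.assign(trimester=trimester)
             .groupby(['MotherID', 'PregnancyID', 'trimester'])['abdomCirc']
             .max()
             .unstack())
Each bin is right-inclusive, so the three labels cover weeks 1-13, 14-26 and 27-40.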
I am creating a new dataframe which should contain only the middle value (not the median!) for every n rows; however, my code doesn't work!
I've tried several approaches through pandas or simple Python but I always fail.
value date index
14 40 1983-07-15 14
15 86 1983-07-16 15
16 12 1983-07-17 16
17 78 1983-07-18 17
18 69 1983-07-19 18
19 78 1983-07-20 19
20 45 1983-07-21 20
21 47 1983-07-22 21
22 48 1983-07-23 22
23 ..... ......... ..
RSDF5 = RSDF4.groupby(pd.Grouper(freq='15D', key='DATE')).[int(len(RSDF5)//2)].reset_index()
I know that the code is wrong and I am completely out of ideas!
SyntaxError: invalid syntax
A solution based on indexes.
df is your original dataframe, N is the number of rows you want to group (assumed to be an odd number, so there is a unique middle row).
df2 = df.groupby(np.arange(len(df))//N).apply(lambda x : x.iloc[len(x)//2])
Be aware that if the total number of rows is not divisible by N, the last group is shorter (you still get its middle value, though).
If N is an even number, you get the central row closer to the end of the group: for example, if N=6, you get the 4th row of each group of 6 rows.
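For example, with some made-up data and N=3:
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [40, 86, 12, 78, 69, 78, 45]})
N = 3  # rows per group (odd, so each full group has a unique middle row)

df2 = df.groupby(np.arange(len(df)) // N).apply(lambda x: x.iloc[len(x) // 2])
print(df2)
#    value
# 0     86
# 1     69
# 2     45
The groups are rows 0-2, 3-5 and the shorter tail row 6, whose middle values are 86, 69 and 45.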
A beginner to pandas and Python, I'm trying to select the 10 rows in a dataframe such that the following requirements are fulfilled:
Only 1 of each category in a categorical column
Maximize sum of a column
While keeping sum of another column below a specified threshold
The concept I struggle with is how to do all of this at the same time. In this case, the goal is to select 10 rows resulting in a subset where sum of OPW is maximized, while the sum of salary remains below an integer threshold, and all strings in POS are unique. If it helps understanding the problem, I'm basically trying to come up with the baseball dream team on a budget, with OPW being the metric for how well the player performs and POS being the position I would assign them to. The current dataframe looks like this:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
439 heltoto01 41.002660 1B 10600000
918 thomafr04 38.107000 1B 7000000
920 thomeji01 37.385272 1B 6337500
68 berkmla01 36.210367 1B 10250000
785 ramirma02 35.785630 OF 13050000
616 martied01 32.906884 3B 3500000
775 pujolal01 32.727629 1B 13870949
966 walkela01 30.644305 OF 6050000
354 giambja01 30.440007 1B 3103333
859 sheffga01 29.090699 OF 9916667
511 jonesch06 28.383418 3B 10833333
357 gilesbr02 28.160054 OF 7666666
31 bagweje01 27.133545 1B 6875000
282 edmonji01 23.486406 CF 4500000
0 abreubo01 23.056375 RF 9000000
392 griffke02 22.965706 OF 8019599
... ... ... ...
If my team was only 3 people, with an OF, 1B, and 3B, and I had a salary-sum threshold of $19,100,000, I would get the following team:
playerID OPW POS salary
87 bondsba01 62.061290 OF 8541667
918 thomafr04 38.107000 1B 7000000
616 martied01 32.906884 3B 3500000
The output would ideally be another dataframe with just the 10 rows that fulfill the requirements. The only solution I can think of is to bootstrap a bunch of teams (10 rows) with each row having a unique POS, remove teams above the 'salary' sum threshold, and then sort_value() the teams by df.OPW.sum(). Not sure how to implement that though. Perhaps there is a more elegant way to do this?
Edit: Changed dataframe to provide more information, added more context.
This is a linear programming problem. For each POS, you're trying to maximize individual OPW while total salary across the entire team is subject to a constraint. You can't solve this with simple pandas operations, but PuLP could be used to formulate and solve it (see the Case Studies there for some examples).
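For illustration, a minimal PuLP sketch of that formulation might look like the following; the five-player roster is made up, and the column names follow the question's dataframe:
import pandas as pd
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

# Made-up mini roster; column names follow the question's dataframe
df = pd.DataFrame({
    'playerID': ['bondsba01', 'thomafr04', 'martied01', 'heltoto01', 'walkela01'],
    'OPW':      [62.06, 38.11, 32.91, 41.00, 30.64],
    'POS':      ['OF', '1B', '3B', '1B', 'OF'],
    'salary':   [8541667, 7000000, 3500000, 10600000, 6050000],
})
budget = 19100000

prob = LpProblem('dream_team', LpMaximize)
pick = {i: LpVariable(f'pick_{i}', cat=LpBinary) for i in df.index}  # 1 if the player is selected

prob += lpSum(pick[i] * df.loc[i, 'OPW'] for i in df.index)               # maximize total OPW
prob += lpSum(pick[i] * df.loc[i, 'salary'] for i in df.index) <= budget  # stay under budget
for pos, rows in df.groupby('POS').groups.items():                        # at most one player per POS
    prob += lpSum(pick[i] for i in rows) <= 1

prob.solve()
print(df[[pick[i].value() == 1 for i in df.index]])
For a full team you would also require lpSum(pick.values()) == 10, with one position constraint per required position.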
However, you could get closer to a manual solution by using pandas to group by (or sort by) POS and then either (1) sort by OPW descending and salary ascending, or (2) add some kind of "return on investment" column (OPW divided by salary, perhaps) and sort on that descending to find the players that give you the biggest bang for the buck in each position.
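And the "bang for the buck" idea might be sketched like this (roi is a made-up column name; df is the question's dataframe):
df['roi'] = df['OPW'] / df['salary']                  # OPW per salary dollar
best_per_pos = (df.sort_values('roi', ascending=False)
                  .groupby('POS')
                  .head(1))                           # top-roi player in each position
You would still need to trim this to 10 positions and check it against the salary cap, which is why the exact answer needs the linear-programming route.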
IIUC you can use groupby with aggregating sum:
df1 = df.groupby('category', as_index=False).sum()
print (df1)
category value cost
0 A 70 2450
1 B 67 1200
2 C 82 1300
3 D 37 4500
Then filter by boolean indexing with a threshold:
thresh = 3000
df1 = df1[df1.cost < thresh]
And last, get the top 10 values by nlargest:
# the sample uses the top 3; with the real data, set this to 10
print (df1.nlargest(3, columns=['value']))
category value cost
2 C 82 1300
0 A 70 2450
1 B 67 1200
I have an Ebola dataset with 499 records. I am trying to find the number of observations in each quintile based on prob (the probability variable). The number of observations should fall into categories 0-20%, 20-40%, etc. The code I think should do this is
test = pd.qcut(ebola.prob,5).value_counts()
this returns
[0.044, 0.094] 111
(0.122, 0.146] 104
(0.106, 0.122] 103
(0.146, 0.212] 92
(0.094, 0.106] 89
My question is how do I sort this to return the correct number of observations for 0-20%, 20-40%, 40-60%, 60-80%, 80-100%?
I have tried
test.value_counts(sort=False)
This returns
104 1
89 1
92 1
103 1
111 1
Is this the order 104,89,92,103,111? for each quintile?
I am confused because if I look at the probability outputs from my first piece of code it looks like it should be 111,89,103,104,92?
What you're doing is essentially correct but you might have two issues:
I think you are using pd.cut() instead of pd.qcut().
You are applying value_counts() one too many times.
(1) You can reference this related question; when you use pd.qcut(), you should have the same number of records in each bin (assuming that your total number of records is evenly divisible by the number of bins), which you do not. Maybe check and make sure you are using the one you intended to use.
Here is some random data to illustrate (2):
>>> np.random.seed(1234)
>>> arr = np.random.randn(100).reshape(100,1)
>>> df = pd.DataFrame(arr, columns=['prob'])
>>> pd.cut(df.prob, 5).value_counts()
(0.00917, 1.2] 47
(-1.182, 0.00917] 34
(1.2, 2.391] 9
(-2.373, -1.182] 8
(-3.569, -2.373] 2
Adding the sort flag will get you what you want
>>> pd.cut(df.prob, 5).value_counts(sort=False)
(-3.569, -2.373] 2
(-2.373, -1.182] 8
(-1.182, 0.00917] 34
(0.00917, 1.2] 47
(1.2, 2.391] 9
or with pd.qcut()
>>> pd.qcut(df.prob, 5).value_counts(sort=False)
[-3.564, -0.64] 20
(-0.64, -0.0895] 20
(-0.0895, 0.297] 20
(0.297, 0.845] 20
(0.845, 2.391] 20