Finding the first row which stratifies two conditions in python data frame

Finding the first row which stratifies two conditions in python data frame - python

I have a data frame which looks like this:
data frame
I want to write a code which locates points that have distance less than 250 from the next point. When it finds the point searches for the first point that is more than 250 away with speed greater than 5.
For example in the sample data set, first find row 7 and then locate row 10 which is more than 250 away and has speed of 10.8 and return the index of row 10
I have write this code so far:
for i in (number+1 for number in range(data_gpd.index[-1]-1)):
if (data_gpd['distance'][i+1]< 250):
I'm not sure what should I do after this condition. I had in mind to use "Next" statement with conditions but I was only able to find it for list comprehension with one condition.
I really appreciate your help as I'm new to python and not sure which syntax would work better

You can use the pandas function loc and associated conditions to return a pandas DataFrame.
First Condition:
df['distance'] < 250
Second Condition:
df['speed'] > 5
Combined Condition:
(df['distance'] < 250) & (df['speed'] > 5)
Using loc and combined condition:
df.loc[(df['distance'] < 250) & (df['speed'] > 5)]
Input:
time location distance speed
0 300 9071 9071 108.00
1 300 18376 9304 11.00
2 300 28006 9630 115.00
3 200 30506 2500 45.00
4 400 31606 1100 9.90
5 500 31706 100 0.72
6 150 31756 50 1.20
7 20 31766 10 1.80
8 50 31916 150 10.80
Output:
time location distance speed
8 50 31916 150 10.8

Related

How do meet a specific criteria for column in panda data frame as well as checking whether the value is more than equal to 10,000

Hello am doing my assignment and I have encountered a question that I can't answer. The question is to create another DataFrame df_urban consisting of all columns of the original dataset but comprising of only applicants with Urban status in their Property_Area attribute (exclude Rural and Semiurban) with ApplicantIncome of at least S$10,000. Reset the row index and display the last 10 rows of this DataFrame.
Picture of the question
My code however will not meet the criteria of Applicant Income of at least 10,000 as well as only urban status in the area.
df_urban = df
df_urban.iloc[-10:[11]]
I Was wondering what is the solution to the question.
Data picture

you can use the '&' operator to limit the data by multiple column conditions:
df_urban = df[(df[col]==<condition>) & (df[col] >= <condition>)]

Following is a simple code snippet performing a proof of principle in extracting a subset of the primary data frame to produce a subset data frame of only "Urban" locations.
import pandas as pd
df=pd.read_csv('Applicants.csv',delimiter='\t')
print(df)
df_urban = df[(df['Property_Area'] == 'Urban')]
print(df_urban)
Using a simply built CSV file, here is a sample of the output.
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
0 4583 1508 128000 360 1 Rural
1 1222 0 55000 360 1 Rural
2 8285 0 64000 360 1 Urban
3 3988 1144 75000 360 1 Rural
4 2588 0 84700 360 1 Urban
5 5248 0 48550 360 1 Rural
6 7488 0 111000 360 1 SemiUrban
7 3252 1112 14550 360 1 Rural
8 1668 0 67500 360 1 Urban
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
2 8285 0 64000 360 1 Urban
4 2588 0 84700 360 1 Urban
8 1668 0 67500 360 1 Urban
Hope that helps.
Regards.

See below. I leave it to you to work out how to reset index. You might want to look at .tail() to display last rows.
df_urban = df[(df['ApplicantIncome'] > 10000) & (df['Property_Area'] == 'Urban')]

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature; abdomCirc1st, abdomCir2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.

You can try this. a bit of a complicated query but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
.apply(lambda d: d.assign(tm = (d['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.groupby('tm')['abdomCirc']
.apply(max))
.unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MontherId, PregnancyID. Then we apply a function to each grouped dataframe (d)
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby by 'tm' and apply max. For each sub-dataframe d then we obtain a Series which is tm:max(abdomCirc).
Then we unstack() that moves tm to the column names
You may want to rename this columns later, but I did not bother
Solution 2
Come to think of it you can simplify the above a bit:
(df.assign(tm = (df['gestationalAgeInWeeks']+ 13 - 1 )// 13))
.drop(columns = 'gestationalAgeInWeeks')
.groupby(['MotherID', 'PregnancyID','tm'])
.agg('max')
.unstack()
)
similar idea, same output.

There is a magic command called query. This should do your work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html

Create subcolumns in pandas dataframe python

I have a dataframe with multiple columns
df = pd.DataFrame({"cylinders":[2,2,1,1],
"horsepower":[120,100,89,70],
"weight":[5400,6200,7200,1200]})
cylinders horsepower weight
0 2 120 5400
1 2 100 6200
2 1 80 7200
3 1 70 1200
i would like to create a new dataframe and make two subcolumns of weight with the median and mean while gouping it by cylinders.
example:
weight
cylinders horsepower median mean
0 1 100 5299 5000
1 1 120 5100 5200
2 2 70 7200 6500
3 2 80 1200 1000
For my example tables i have used random values. I cant manage to achieve that.
I know how to get median and mean its described here in this stackoverflow question.
:
df.weight.median()
df.weight.mean()
df.groupby('cylinders') #groupby cylinders
But how to create this subcolumn?

The following code fragment adds the two requested columns. It groups the rows by cylinders, calculates the mean and median of weight, and combines the original dataframe and the result:
result = df.join(df.groupby('cylinders')['weight']\
.agg(['mean', 'median']))\
.sort_values(['cylinders', 'mean']).ffill()
# cylinders horsepower weight mean median
#2 1 80 7200 5800.0 5800.0
#3 1 70 1200 5800.0 5800.0
#1 2 100 6200 4200.0 4200.0
#0 2 120 5400 4200.0 4200.0
You cannot have "subcolumns" for select columns in pandas. If a column has "subcolumns," all other columns must have "subcolumns," too. It is called multiindexing.

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!

Use transform to add a column from a groupby aggregation, this will create a Series with it's index aligned to the original df so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0

Filtering records in Pandas python - syntax error

I have a pandas data frame that looks like this:
duration distance speed hincome fi_cost type
0 359 1601 4 3 40.00 cycling
1 625 3440 6 3 86.00 cycling
2 827 4096 5 3 102.00 cycling
3 1144 5704 5 2 143.00 cycling
If I use the following I export a new csv that pulls only those records with a distance less than 5000.
distance_1 = all_results[all_results.distance < 5000]
distance_1.to_csv('./distance_1.csv',",")
Now, I wish to export a csv with values from 5001 to 10000. I can't seem to get the syntax right...
distance_2 = all_results[10000 > all_results.distance < 5001]
distance_2.to_csv('./distance_2.csv',",")

Unfortunately because of how Python chained comparisons work, we can't use the 50 < x < 100 syntax when x is some vectorlike quantity. You have several options.
You could create two boolean Series and use & to combine them:
>>> all_results[(all_results.distance > 3000) & (all_results.distance < 5000)]
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Use between to create a boolean Series and then use that to index (note that it's inclusive by default, though):
>>> all_results[all_results.distance.between(3000, 5000)] # inclusive by default
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling
Or finally you could use .query:
>>> all_results.query("3000 < distance < 5000")
duration distance speed hincome fi_cost type
1 625 3440 6 3 86 cycling
2 827 4096 5 3 102 cycling

5001 < all_results.distance < 10000

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Finding the first row which stratifies two conditions in python data frame - python

Related

How do meet a specific criteria for column in panda data frame as well as checking whether the value is more than equal to 10,000

Summarising features with multiple values in Python for Machine Learning model

Create subcolumns in pandas dataframe python

Pandas timeseries bins and indexing

Filtering records in Pandas python - syntax error

Categories

Resources