Add column to dataframe based on intervals from another dataframe - python

I have a gpx file that I am manipulating. I would like to add a column to it that describes the terrain based on another dataframe that lists the terrain by distance. Here are the dataframes:
GPS_df
lat lon alt time dist total_dist
0 44.565335 -123.312517 85.314 2020-09-07 14:00:01 0.000000 0.000000
1 44.565336 -123.312528 85.311 2020-09-07 14:00:02 0.000547 0.000547
2 44.565335 -123.312551 85.302 2020-09-07 14:00:03 0.001137 0.001685
3 44.565332 -123.312591 85.287 2020-09-07 14:00:04 0.001985 0.003670
4 44.565331 -123.312637 85.270 2020-09-07 14:00:05 0.002272 0.005942
... ... ... ... ... ... ...
12481 44.565576 -123.316116 85.517 2020-09-07 17:28:14 0.002318 26.091324
12482 44.565559 -123.316072 85.587 2020-09-07 17:28:15 0.002469 26.093793
12483 44.565554 -123.316003 85.637 2020-09-07 17:28:16 0.003423 26.097217
12484 44.565535 -123.315966 85.697 2020-09-07 17:28:17 0.002249 26.099465
12485 44.565521 -123.315929 85.700 2020-09-07 17:28:18 0.002066 26.101532
terrain_df:
dist terrain
0 0.0 Start
1 3.0 Road
2 5.0 Gravel
3 8.0 Trail-hard
4 12.0 Gravel
5 16.0 Trail-med
6 18.0 Road
7 22.0 Gravel
8 23.0 Trail-easy
9 26.2 Road
I have come up with the following code, which works, but I would like to make it more efficient by eliminating the looping:
GPS_df['terrain'] = ""
i = 0
for j in range(0, len(GPS_df)):
    if GPS_df.total_dist[j] <= terrain_df.dist[i]:
        GPS_df.terrain[j] = terrain_df.terrain[i]
    else:
        i = i + 1
        GPS_df.terrain[j] = terrain_df.terrain[i]
I have tried half a dozen different ways, but none seem to work correctly. I am sure there is an easy way to do it, but I just don't have the skills and experience to figure it out so far, so I am looking for some help. I tried using cut and adding the labels, but cut requires unique labels. I could use cut and then replace the generated intervals with labels another way, but that doesn't seem like the best approach either. I also tried this approach that I found in another question, but it filled the column with the first label only (I am also having trouble understanding how it works, which makes it tough to troubleshoot):
bins = terrain_df['dist']
names = terrain_df['terrain']
d = dict(enumerate(names, 1))
GPS_df['terrain2'] = np.vectorize(d.get)(np.digitize(GPS_df['dist'], bins))
Appreciate any guidance that you can give me.

I believe pandas.merge_asof should do the trick. Try:
result = pd.merge_asof(left=GPS_df, right=terrain_df, left_on='total_dist', right_on='dist', direction='backward')
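One hedged note on the direction parameter: direction='backward' labels each point with the last terrain boundary at or below its distance, while the loop in the question assigns each point the terrain of the next boundary at or beyond it. If the loop's output is the reference, direction='forward' may be the closer match. A small sketch to compare both (it assumes GPS_df does not already contain the loop-generated 'terrain' column, and both frames must be sorted on their join keys):
GPS_df = GPS_df.sort_values('total_dist')
terrain_df = terrain_df.sort_values('dist')

# Last boundary at or below each point's distance.
backward = pd.merge_asof(GPS_df, terrain_df, left_on='total_dist',
                         right_on='dist', direction='backward')

# First boundary at or beyond each point's distance (matches the loop's logic).
forward = pd.merge_asof(GPS_df, terrain_df, left_on='total_dist',
                        right_on='dist', direction='forward')
Compare the 'terrain' column of either result against the loop output to confirm which direction you actually want.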

Related

Store a dataframe and block new updates

I'm struggling to solve this problem. I'm creating a data frame from data that can vary from one day to another, but I need to save the first version and block new updates.
This is the code:
# Create data frame for the ideal burndown line
df_ideal_burndown = pd.DataFrame(columns=['dates', 'ideal_trend'])
df_ideal_burndown['dates'] = range_sprint
#### Dates preparation
df_ideal_burndown['dates'] = pd.to_datetime(df_ideal_burndown['dates'], dayfirst=True)
df_ideal_burndown['dates'] = df_ideal_burndown['dates'].dt.strftime('%Y-%m-%d')
# Define the sprint length
days_sprint = int(len(range_sprint)) - int(cont_nonworking)
# Get how many items are in the current sprint
commited = len(df_current_sprint)
# Define the ideal number of items should be delivered by day
ideal_burn = round(commited/days_sprint,1)
# Create a list of remaining items to be delivered by day
burndown = [commited - ideal_burn]
# Day of the sprint -> starts with 2, since the first day is already in the list above
sprint_day = 2
# Iterate to create the ideal trend line in numbers
for i in range(1, len(df_ideal_burndown), 1):
    burndown.append(round(commited - (ideal_burn * sprint_day), 1))
    sprint_day += 1
# Add the ideal burndown to the column
df_ideal_burndown['ideal_trend'] = burndown
df_ideal_burndown
This is the output:
dates ideal_trend
0 2022-03-14 18.7
1 2022-03-15 17.4
2 2022-03-16 16.1
3 2022-03-17 14.8
4 2022-03-18 13.5
5 2022-03-21 12.2
6 2022-03-22 10.9
7 2022-03-23 9.6
8 2022-03-24 8.3
9 2022-03-25 7.0
10 2022-03-28 5.7
11 2022-03-29 4.4
12 2022-03-30 3.1
13 2022-03-31 1.8
14 2022-04-01 0.5
My main problem is related to commited = len(df_current_sprint), since df_current_sprint is (and needs to be) used by other parts of my code.
Basically, even if the API later returns new data that gets stored in df_current_sprint, I should keep using the version I had just created.
I am pretty new to Python and I do not know if there is a way to store and, let's say, cache this information until I need to use fresh new data.
I appreciate your support, clues, and guidance.
Marcelo
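One simple approach, sketched under the assumption that df_current_sprint is an ordinary pandas DataFrame: take an explicit copy at the moment the burndown is built, and optionally persist it to disk, so later refreshes of df_current_sprint do not change the numbers already computed. The file name below is just an example.
import pandas as pd

# Freeze an independent snapshot; later updates to df_current_sprint
# will not affect this copy.
df_sprint_snapshot = df_current_sprint.copy(deep=True)
commited = len(df_sprint_snapshot)

# Optionally persist the snapshot so it survives re-runs of the script.
df_sprint_snapshot.to_pickle('sprint_snapshot.pkl')

# On a later run, reuse the stored version instead of fresh API data:
# df_sprint_snapshot = pd.read_pickle('sprint_snapshot.pkl')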

Summarising features with multiple values in Python for Machine Learning model

I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements, however this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters based on gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:
(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .apply(max))
   .unstack()
)
produces
tm 1 2 3
MotherID PregnancyID
0 0 NaN 200.0 NaN
1 1 NaN 294.0 350.0
2 2 180.0 NaN NaN
Let's unpick this a bit. First we groupby on MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column 'tm' via assign (I assume I got the math right here, but correct it if it is wrong!), then we groupby 'tm' and apply max. For each sub-dataframe d we then obtain a Series which is tm:max(abdomCirc).
Then we unstack(), which moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:
(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)
Similar idea, same output.
There is a handy DataFrame method called query. This should do the work for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (and not manually changing the values of your ID's: MotherID and PregnancyID, every time for each different group of rows), you have to combine it with groupby (as you did on your own)
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
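For a more automatic version, one hedged sketch (assuming the trimester boundaries from the question: weeks 0-13, 14-26 and 27-40) is to bin the weeks with pd.cut and then group, rather than hard-coding the IDs and ranges in each query:
import pandas as pd

# Bin gestational age into the three trimesters described in the question.
df['trimester'] = pd.cut(df['gestationalAgeInWeeks'],
                         bins=[0, 13, 26, 40],
                         labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'],
                         include_lowest=True)

# Maximum abdominal circumference per pregnancy and trimester,
# spread into one column per trimester.
wide = (df.groupby(['MotherID', 'PregnancyID', 'trimester'])['abdomCirc']
          .max()
          .unstack('trimester'))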

Pandas Way of Weighted Average in a Large DataFrame

I have a large dataset (around 8 million rows x 25 columns) in Pandas and I am struggling to find a way to compute a weighted average on this dataframe, which in turn creates another data frame.
Here is how my dataset looks like (very simplified version of it):
prec temp
location_id hours
135 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
136 1 12.0 4.0
2 14.0 4.1
3 14.3 3.5
4 15.0 4.5
5 15.0 4.2
6 15.0 4.7
7 15.5 5.1
I have a multi-index on [location_id, hours]. I have around 60k locations and 140 hours for each location (making up the 8 million rows).
The rest of the data is numeric (float) or categorical. I have only included 2 columns here, normally there are around 20 columns.
What I want to do is create a new data frame that is basically a weighted average of this data frame. The requirements indicate that 12 of these location_ids should be averaged with specified weights to form the combined_location_id values.
For example, location_ids 1,3,5,7,9,11,13,15,17,19,21,23 with their appropriate weights (separate data coming in from another data frame) should be weighted-averaged to form the combined_location_id CL_1's data.
That is a lot of data to handle and I wasn't able to find a completely Pandas way of solving it. Therefore, I went with a for loop approach. It is extremely slow and I am sure this is not the right way to do it:
def __weighted(self, ds, weights):
    return np.average(ds, weights=weights)

f = {'hours': 'first', 'location_id': 'first',
     'temp': lambda x: self.__weighted(x, weights),
     'prec': lambda x: self.__weighted(x, weights)}

data_frames = []
for combined_location in all_combined_locations:
    mapped_location_ids = combined_location.location_ids
    weights = combined_location.weights_of_location_ids
    data_for_this_combined_location = pd.concat(
        df_data.loc[df_data.index.get_level_values(0) == location_id]
        for location_id in mapped_location_ids)
    data_grouped_by_distance = data_for_this_combined_location.groupby("hours", as_index=False)
    data_grouped_by_distance = data_grouped_by_distance.agg(f)
    data_frames.append(data_grouped_by_distance)

df_combined_location_data = pd.concat(data_frames)
df_combined_location_data.set_index(['location_id', 'hours'], inplace=True)
This works functionally; however, the performance and memory consumption are horrible. It is taking over 2 hours on my dataset, and that is currently not acceptable. The existence of the for loop is an indicator that this could be handled better.
Is there a better/faster way to implement this?
From what I saw, you can remove one for loop by selecting all of the mapped_location_ids at once:
data_for_this_combined_location = df_data.loc[df_data.index.get_level_values(0).isin(mapped_location_ids)]
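Beyond that, the whole outer loop can in principle be replaced by one merge plus one groupby: build a long mapping table of (combined_location_id, location_id, weight), merge it onto the data, and compute sum(weight * value) / sum(weight) per group. A rough sketch, under the assumption that each combined_location object exposes an identifier (called cl.name here, which is hypothetical) alongside its location_ids and weights_of_location_ids:
import pandas as pd

# Long-format mapping: one row per (combined location, member location) pair.
mapping = pd.DataFrame(
    [(cl.name, loc_id, w)   # cl.name is a hypothetical id attribute
     for cl in all_combined_locations
     for loc_id, w in zip(cl.location_ids, cl.weights_of_location_ids)],
    columns=['combined_location_id', 'location_id', 'weight'])

value_cols = ['prec', 'temp']   # extend with the other numeric columns

merged = df_data.reset_index().merge(mapping, on='location_id')

# Weighted average = sum(weight * value) / sum(weight) per group.
for col in value_cols:
    merged[col] = merged[col] * merged['weight']

grouped = merged.groupby(['combined_location_id', 'hours'])[value_cols + ['weight']].sum()
df_combined_location_data = grouped[value_cols].div(grouped['weight'], axis=0)
This does all combined locations in one pass, so it avoids both the Python-level loop and the repeated concat/agg calls.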

calculating moving average in pandas

So, this is a fairly new topic for me and I don't quite understand it yet. I wanted to make a new column in a dataset that contains the moving average of the Volume column. The window size is 5, and the moving average of row x is calculated from rows x-2, x-1, x, x+1, and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively.
I did this.
df['Volume_moving'] = df.iloc[:,5].rolling(window=5).mean()
df
Date Open High Low Close Volume Adj Close Volume_moving
0 2012-10-15 632.35 635.13 623.85 634.76 15446500 631.87 NaN
1 2012-10-16 635.37 650.30 631.00 649.79 19634700 646.84 NaN
2 2012-10-17 648.87 652.79 644.00 644.61 13894200 641.68 NaN
3 2012-10-18 639.59 642.06 630.00 632.64 17022300 629.76 NaN
4 2012-10-19 631.05 631.77 609.62 609.84 26574500 607.07 18514440.0
... ... ... ... ... ... ... ... ...
85 2013-01-08 529.21 531.89 521.25 525.31 16382400 525.31 17504860.0
86 2013-01-09 522.50 525.01 515.99 517.10 14557300 517.10 16412620.0
87 2013-01-10 528.55 528.72 515.52 523.51 21469500 523.51 18185340.0
88 2013-01-11 521.00 525.32 519.02 520.30 12518100 520.30 16443720.0
91 2013-01-14 502.68 507.50 498.51 501.75 26179000 501.75 18221260.0
However, I think that the result is not accurate, as I tried it with a different dataframe and got the exact same result.
Can anyone please help me with this?
Try this:
df['Volume_moving'] = df['Volume'].rolling(window=5).mean()
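Note that rolling(window=5).mean() is a trailing window (rows x-4 through x), which is also why the first four rows come out as NaN. If the goal is the centered window described in the question (rows x-2 through x+2, with shorter windows at the edges), a sketch along these lines should be closer:
# Centered 5-row window; min_periods=1 lets the edge rows use
# however many of the five rows actually exist.
df['Volume_moving'] = (df['Volume']
                       .rolling(window=5, center=True, min_periods=1)
                       .mean())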

Append/Merge Dataframe with LDA output

I'm working on an LDA model using Gensim and spacy.
Generically:
ldamodel = Lda(doc_term_matrix, num_topics=4, random_state = 100, update_every=3, chunksize = 50, id2word = dictionary, passes=100, alpha='auto')
ldamodel.print_topics(num_topics=4, num_words=6)
I'm at the point where I have some output and I'd like to append to my original Dataframe (from which the text came) the topics and a percent contribution for each document.
The original df looks like this
id group text
234 1 here is some text
837 7 here is some text
494 2 here is some text
223 1 here is some text
I do some standard preprocessing including lemmatization, removing stop words, etc. and then compute percent contributions for each document.
my output looks like this
Document_No Dominant_Topic ... Keywords Text
0 0 1.0 ... RT, new, work, amp, year, today, people, look,... 0
1 1 0.0 ... like, time, good, know, day, find, research, a... 1
2 2 1.0 ... RT, new, work, amp, year, today, people, look,... 2
3 3 3.0 ... study, t, change, use, want, Trump, love, stud... 3
4 4 3.0 ... study, t, change, use, want, Trump, love, stud... 4
I thought I could just concat the 2 dfs back together like so:
results = pd.concat([df, results])
but when I do that the indices don't match and I'm left with a sort of Frankenstein df that looks like this
id group text Document_No Dominant_Topic ...
NaN NaN NaN 0 1.0 ...
NaN NaN NaN 1 0.0 ...
494 2 here is some text NaN NaN ...
223 1 here is some text NaN NaN ...
Happy to post fuller code if that would be helpful, but I'm hoping someone just knows a better way to do this from the same point at which I print the topics.
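Since there is one topic row per document and the rows appear to be in the same order, one hedged sketch is to concatenate column-wise after resetting both indices, rather than stacking the frames vertically (which is what pd.concat does by default and what produced the NaN-filled result above):
import pandas as pd

# Align row-by-row: reset both indices to 0..n-1, then glue the columns side by side.
results = pd.concat([df.reset_index(drop=True),
                     results.reset_index(drop=True)], axis=1)
If Document_No is the positional index of each document in df, an equivalent option is df.join(results.set_index('Document_No')).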
