Running this code produces the error message:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I have 6 years' worth of competitors' results from a half marathon in one CSV file.
The function year_runners aims to create a new column for each year holding the difference in finishing time between consecutive runners.
Is there a more efficient way of producing the same result?
Thanks in advance.
Pos Gun_Time Chip_Time Name Number Category
1 1900-01-01 01:19:15 1900-01-01 01:19:14 Steve Hodges 324 Senior Male
2 1900-01-01 01:19:35 1900-01-01 01:19:35 Theo Bately 92 Supervet Male
# calculating the time difference between each finisher in a year and adding the result to a new column called time_diff
def year_runners(year, x, y):
    print('Event held in', year)
    # x is the first number (position) for the runner of that year,
    # y is the last number (position) for that year e.g. 2016 event spans df[246:534]
    time_diff = 0

    for index, row in df.iterrows():
        # using Gun_Time as the start time for all runners,
        # and Chip_Time as the finishing time for each runner,
        # work out the time difference between the x-placed runner and the runner behind (x + 1)
        time_diff = df2015.loc[x + 1, 'Gun_Time'] - df2015.loc[x, 'Chip_Time']

        # set the time_diff column to the value of time_diff for each row x in the dataframe
        df2015.loc[x, 'time_diff'] = time_diff

        print("Runner", (x + 1), "time, minus runner", x, "=", time_diff)
        x += 1
        if x > y:
            break
Hi everyone, this was solved using the shift technique.
youtube.com/watch?v=nZzBj6n_abQ
df2015['shifted_Chip_Time'] = df2015['Chip_Time'].shift(1)
df2015['time_diff'] = df2015['Gun_Time'] - df2015['shifted_Chip_Time']
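If all six years live in one DataFrame, a minimal sketch of extending the same shift idea to every year at once, assuming a 'Year' column exists or can be added (that column name is my assumption, not in the sample data above):
df['shifted_Chip_Time'] = df.groupby('Year')['Chip_Time'].shift(1)
df['time_diff'] = df['Gun_Time'] - df['shifted_Chip_Time']
The first runner of each year then gets a NaN time_diff, since there is nobody ahead to compare against.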
I'm trying to predict delays based on weather 2 hours before scheduled travel. I have one dataset of travel data (call it df1) and one dataset of weather (call it df2). In order to predict the delay, I am trying to join df1 and df2 with an offset of 2 hours. That is, I want to look at the weather data 2 hours before the scheduled travel time. A pared-down view of the data would look something like this:
example df1 (travel data):
travel_data  location  departure_time                delayed
blah         KPHX      2015-04-23T15:02:00.000+0000  1
bleh         KRDU      2015-04-27T15:19:00.000+0000  0
example df2 (weather data):
location  report_time          weather_data
KPHX      2015-01-01 01:53:00  blih
KRDU      2015-01-01 09:53:00  bloh
I would like to join the data first on location and then on the timestamp data with a minimum 2 hour offset. If there are multiple weather reports more than 2 hours earlier than the departure time, I would like to join the travel data with the report closest to that 2 hour offset.
So far I have used
joinedDF = airlines_6m_recode.join(weather_filtered, (col("location") == col("location")) & (col("departure_time") == (col("report_date") + f.expr('INTERVAL 2 HOURS'))), "inner")
This works only for the times when the departure time and (report date - 2hrs) match exactly, so I'm losing a large percentage of my data. Is there a way to join to the next closest report date outside the 2hr buffer?
I have looked into window functions but they don't describe how to do joins.
Change the join condition from an exact match to an inequality (the report timestamp must be at least 2 hours before departure), then keep the largest report timestamp after partitioning by location.
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# 1.Join as per conditions
# 2. Partition by location, order by report_ts desc, add row_number
# 3. Filter row_number == 1
joinedDF = airlines_6m_recode.join(
    weather_filtered,
    (airlines_6m_recode["location"] == weather_filtered["location"])
    & (weather_filtered["report_time_ts"] <= airlines_6m_recode["departure_time_ts"] - F.expr("INTERVAL 2 HOURS")),
    "inner"
).withColumn(
    "row_number",
    F.row_number().over(
        Window.partitionBy(airlines_6m_recode["location"])
              .orderBy(weather_filtered["report_time_ts"].desc())
    )
)
# Just to Print Intermediate result.
joinedDF.show()
joinedDF.filter('row_number == 1').show()
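One caveat: as written, the window keeps only one weather report per location across all flights. A sketch of a variant, assuming a travel row is identified by its location plus departure timestamp (adjust the partition keys to whatever uniquely identifies a travel record in your data), that keeps the closest qualifying report for every travel record:
# Partition by the travel row itself, not just the location,
# so each flight keeps its own latest report that is >= 2 hours old.
w = Window.partitionBy(
    airlines_6m_recode["location"], airlines_6m_recode["departure_time_ts"]
).orderBy(weather_filtered["report_time_ts"].desc())

closestDF = joinedDF.withColumn("rn", F.row_number().over(w)).filter("rn == 1")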
I'm trying to get real prices for my data in pandas. Right now I am just playing with one year's worth of data (3,962,050 rows), and it took 443 seconds to inflate the values using the code below. Is there a quicker way to find the real value? Is it possible to use pooling? I have many more years, and it would take too long to wait every time.
Portion of df:
year quarter fare
0 1994 1 213.98
1 1994 1 214.00
2 1994 1 214.00
3 1994 1 214.50
4 1994 1 214.50
import time

import cpi
import pandas as pd


def inflate_column(data, column):
    """
    Adjust the series of values in `column` of the dataframe `data`
    for inflation, using the cpi library.
    """
    print('Beginning to inflate ' + column)
    start_time = time.time()
    df = data.apply(lambda x: cpi.inflate(x[column], x.year), axis=1)
    print("Inflating process took", time.time() - start_time, "seconds to run")
    return df
df['real_fare'] = inflate_column(df, 'fare')
You have multiple rows for each year, so you can call cpi.inflate once per year, store the results in a dict, and then look up the value instead of calling cpi.inflate for every row.
all_years = df["year"].unique()
dict_years = {}
for year in all_years:
dict_years[year] = cpi.inflate(1.0, year)
df['real_fare'] = # apply here: dict_years[row['year']]*row['fare']
You can fill in the last line using apply, or do it some other way, e.g. df['real_fare'] = df['fare'] * ...
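For completeness, a minimal sketch of one way to finish that last line without apply, reusing the dict_years built above (Series.map looks up each row's year in the dict):
# per-year inflation factor times the nominal fare
df['real_fare'] = df['fare'] * df['year'].map(dict_years)
This relies on cpi.inflate scaling linearly with the amount, which is the same assumption the dict approach above already makes.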
I fully understand there are a few versions of this question out there, but none seem to get at the core of my problem. I have a pandas DataFrame with roughly 72,000 rows from 2015 to now. I am using a calculation that finds the most impactful words for a given set of text (tf_idf). This calculation does not account for time, so I need to break my main DataFrame down into time-based segments, ideally every 15 or 30 days (really every n days, not by week or month), then run the calculation on each time-segmented DataFrame in order to see and plot which words come up more and less over time.
I have been able to build part of this out semi-manually with the following:
from datetime import datetime
import pandas as pd


def dateRange():
    start = input("Enter a start date (MM-DD-YYYY) or '30' for last 30 days: ")
    if start != '30':
        datetime.strptime(start, '%m-%d-%Y')
        end = input("Enter an end date (MM-DD-YYYY): ")
        datetime.strptime(end, '%m-%d-%Y')
        dataTime = data[(data['STATUSDATE'] > start) & (data['STATUSDATE'] <= end)]
    else:
        dataTime = data[data.STATUSDATE > datetime.now() - pd.to_timedelta('30day')]
    return dataTime


dataTime = dateRange()
dataTime2 = dateRange()


def calcForDateRange(dateRangeFrame):
    ##### LONG FUNCTION ####
    return word and number


calcForDateRange(dataTime)
calcForDateRange(dataTime2)
This works - however, I have to manually create the two date ranges, which is expected as I created this as a test. How can I split the DataFrame into increments and run the calculation for each one?
dicts are allegedly the way to do this. I tried:
dict_of_dfs = {}
for n, g in data.groupby(data['STATUSDATE']):
    dict_of_dfs[n] = g

for frame in dict_of_dfs:
    calcForDateRange(frame)
The result was a dict keyed by individual dates like 2015-01-02, and iterating over it gave me the keys rather than the frames. How can I break this down into 100 or so DataFrames to run my function on?
Also, I do not fully understand how to break ['STATUSDATE'] down by a specific number of days.
I would like to avoid iterating as much as possible, but I know I will probably have to somewhere.
Thank you.
Let us assume you have a data frame like this:
import numpy as np
import pandas as pd

date = pd.date_range(start='1/1/2018', end='31/12/2018', normalize=True)
x = np.random.randint(0, 1000, size=365)
df = pd.DataFrame(x, columns=["X"])
df['Date'] = date
df.head()
Output:
X Date
0 328 2018-01-01
1 188 2018-01-02
2 709 2018-01-03
3 259 2018-01-04
4 131 2018-01-05
So this data frame has 365 rows, one for each day of the year.
Now if you want to group this data into intervals of 20 days and assign each group to a dict, you can do the following
df_dict = {}
for k, v in df.groupby(pd.Grouper(key="Date", freq='20D')):
    df_dict[k.strftime("%Y-%m-%d")] = pd.DataFrame(v)

print(df_dict)
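From there, a hypothetical usage sketch reusing the calcForDateRange function from the question: iterate over the dict's items() so you get each DataFrame rather than just its key.
# run the question's calculation on every 20-day frame
for period_start, frame in df_dict.items():
    calcForDateRange(frame)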
How about something like this? It creates a dictionary of non-empty dataframes keyed on the starting date of each period.
import datetime as dt
import pandas as pd

start = '12-31-2017'
interval_days = 30

start_date = pd.Timestamp(start)
end_date = pd.Timestamp(dt.date.today() + dt.timedelta(days=1))
dates = pd.date_range(start=start_date, end=end_date, freq=f'{interval_days}d')

sub_dfs = {d1.strftime('%Y%m%d'): df.loc[df.dates.ge(d1) & df.dates.lt(d2)]
           for d1, d2 in zip(dates, dates[1:])}

# Remove empty dataframes.
sub_dfs = {k: v for k, v in sub_dfs.items() if not v.empty}
I have time series entries that I need to resample. On the more extreme end of things I've imagined that someone might generate data for 15 months -- this tends to be about 1300 records (about 5 location entries to every 2 metric entries). But after resampling to 15 minute intervals, the full set is about 41000 rows.
My data is less than a couple of dozen columns right now, so 20 columns * 40k rows ≈ 800k values that need to be calculated... it seems like I should be able to get my time down to below 10 seconds. I've done an initial profile and it looks like the bottleneck is mostly in one pair of pandas methods for resampling that I am calling -- and they are amazingly slow! It's to the point where I am wondering if there is something wrong... why would pandas be so slow to resample?
This produces a timeout in Google Cloud Functions, which is what I need to avoid.
There's two sets of data: location and metric. Sample location data might look like this:
location bar girlfriends grocers home lunch park relatives work
date user
2018-01-01 00:00:01 0ce65715-4ec7-4ca2-aab0-323c57603277 0 0 0 1 0 0 0 0
sample metric data might look like this:
user date app app_id metric
0 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 2018-01-01 01:30:43 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.02
1 6ca1a9ce-8501-49f5-b7d9-70ac66331fdc 2018-01-01 04:14:59 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.10
I need to union those two subsets into a single ledger, with columns for each location name and each app. The values in apps are samples of constants, so I need to "connect the dots". The values in locations are location change events, so I need to keep repeating the same value until the next change event. In all, it looks like this:
app_1 app_2 user bar grocers home lunch park relatives work
date
2018-01-31 00:00:00 0.146250 0.256523 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 1 0 0 0 0
2018-01-31 00:15:00 0.146290 0.256562 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 0 0 0 0 1
This code does that, but needs to be optimized. What are the weakest links here? I've added basic sectional profiling:
import time

import numpy as np
import pandas as pd

start = time.time()
locDf = locationDf.copy()
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
    locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)
print("section1:", time.time() - start)

userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):
    start = time.time()
    # per user, convert app data to 15m interval
    userDf = a2_df.resample('15T').agg('max')
    print("section2.1:", time.time() - start)

    start = time.time()
    # assign metric for each app to an app column for each app, per user
    userDf.reset_index(inplace=True)
    userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'],
                         values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')
    userDf['user'] = user
    userDf.reset_index(inplace=True)
    userDf.set_index('date', inplace=True)
    print("section2.2:", time.time() - start)

    start = time.time()
    # reapply 15m intervals now that we have new data per app
    userLocAppDfs[user] = userDf.resample('15T').agg('max')
    print("section2.3:", time.time() - start)

    start = time.time()
    # assign location data to location columns per location; creates a "1" at the 15m interval
    # of the location change event in the location column created
    loDf = locDfs[user]
    loDf.reset_index(inplace=True)
    loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
    loDf.reset_index(inplace=True)
    loDf.set_index('date', inplace=True)
    loDf.drop('user', axis=1, inplace=True)
    print("section2.4:", time.time() - start)

    start = time.time()
    # join the location crosstab columns with the app crosstab columns per user
    userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
    # convert from just "1" at each location change event followed by zeros,
    # to "1" continuing until the next location change
    userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
    userLocAppDfs[user]['user'].fillna(user, inplace=True)
    print("section2.5:", time.time() - start)

    start = time.time()
    # fill location NaNs
    for loc in locationDf[locationDf['user'] == user].location.unique():
        userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)
    print("section3:", time.time() - start)

    start = time.time()
    # fill app NaNs
    for app in a2_df['app'].unique():
        userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)
        userLocAppDfs[user][app].fillna(value=0, inplace=True)
    print("section4:", time.time() - start)
results:
section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984
Both "big" sections have calls to resample and agg('max').
notes:
I found this report from 12 months ago: "Pandas groupby + resample first is really slow - since version 0.22" -- it seems like resample() inside groupby may currently have a performance regression.
I am trying to calculate how often a state is entered and how long it lasts. For example, I have the three possible states 1, 2 and 3; which state is active is logged in a pandas DataFrame:
test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14))
For example the state 1 is entered two times (at index 3 and 12), the first time it lasts three hours, the second time two hours (so on average 2.5). State 2 is entered 3 times, on average for 2.66 hours.
I know that I can mask data I'm not interested in, for example to analyze state 1:
state1 = test.mask(test!=1)
but from there I can't find a way to go on.
I hope the comments give enough explanation - the key point is you can use a custom rolling window function and then cumsum to group the rows into "clumps" of the same state.
import pandas as pd

# set things up
freq = "1h"
df = pd.DataFrame(
    [2, 2, 2, 1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1],
    index=pd.date_range('00:00', freq=freq, periods=14)
)

# add a column saying if a row belongs to the same state as the one before it
# (pd.rolling_apply has been removed from recent pandas; .rolling(...).apply is the current spelling)
df["is_first"] = df[0].rolling(2).apply(lambda x: x[0] != x[1], raw=True).fillna(1)

# the cumulative sum - each "clump" gets its own integer id
df["value_group"] = df["is_first"].cumsum()

# get the rows corresponding to states beginning
start = df.groupby("value_group", as_index=False).nth(0)

# get the rows corresponding to states ending
end = df.groupby("value_group", as_index=False).nth(-1)

# put the timestamp indexes of the "first" and "last" state measurements into
# their own data frame
start_end = pd.DataFrame(
    {
        "start": start.index,
        # add freq to get when the state ended
        "end": end.index + pd.Timedelta(freq),
        "value": start[0]
    }
)

# convert timedeltas to seconds (float)
start_end["duration"] = (start_end["end"] - start_end["start"]).dt.total_seconds()

# get average state length and counts
agg = start_end.groupby("value").agg(["mean", "count"])["duration"]
agg["mean"] = agg["mean"] / (60 * 60)
And the output:
mean count
value
1 2.500000 2
2 2.666667 3
3 1.000000 1
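As an aside, a sketch of a shorter way to build the is_first flag for the same DataFrame, using shift instead of a rolling window (column 0 is the state column from the answer above):
# a row starts a new clump when its state differs from the previous row's state
df["is_first"] = (df[0] != df[0].shift()).astype(int)
df["value_group"] = df["is_first"].cumsum()
The rest of the aggregation stays the same.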