A value that needs to be extended in time - python

Hi people.
I need to sum values of a data frame across different columns:
OUT holds the amount invested,
IN holds the amount received,
DRAW holds the amount taken out.
So OUT is the total invested. If you have a DRAW, it means that value was taken out of the investment.
As an example, -100 (line 1) + 100 (line 2, DRAW) means that you took back part of the investment. In this case, the value received in column IN on line 2 means we have an income of 110 - 100 (both on line 2: one in column IN, the other in DRAW), giving a total income of 10 units (10% of the investment: (IN - DRAW)/OUT = (110 - 100)/100).
We could also have a DRAW without a return, as on line 12. In this example, from that line on, the income will be calculated on 2 × -200 (-400) + 20 = -380.
After line 5, we have an investment of 2 × -200, -400 in total, and no DRAWs or OUTs until line 12.
My question is: what is the best way to calculate the % for each month based on the OUTs, INs and DRAWs across the whole table?
LINE  DATE        OUT   IN   DRAW
1     2020-01-20  -100  -    -
2     2020-02-10  -     110  100
3     2020-02-11  -200  -    -
4     2020-02-21  -     20   -
5     2020-02-25  -200  -    -
6     2020-02-26  -200  -    -
7     2020-02-26  -     20   -
8     2020-03-09  -     40   -
9     2020-04-01  -     10   -
10    2020-04-07  -     20   -
11    2020-04-10  -     10   -
12    2020-05-10  -     -    20

I came up with a solution; I still don't know whether it is the best one, but it worked fine.
I made a new column joining OUT and DRAW together (OUTDRAW).
This column was created from the OUT data, and then the NaN entries were filled with the DRAW values (this works only because the two values are never on the same line):
df['OUTDRAW'] = df['OUT'].fillna(df['DRAW'])
After that, fill the remaining NaN with 0 and take a cumulative sum:
df['OUTDRAW'] = df['OUTDRAW'].fillna(0).cumsum()
This gave me the following column:
LINE  DATE        OUT   DRAW  OUTDRAW
1     2020-01-20  -100  -     -100
2     2020-02-10  -     100   0
3     2020-02-11  -200  -     -200
4     2020-02-21  -     20    -200
5     2020-02-25  -200  -     -400
6     2020-02-26  -200  -     -600
7     2020-02-26  -     20    -600
8     2020-03-09  -     40    -600
9     2020-04-01  -     10    -600
10    2020-04-07  -     20    -600
11    2020-04-10  -     10    -600
12    2020-05-10  -     20    -580
So now we have a running total in time that can be used to calculate the % over time.
Note: If you want to do it by month, first make a new column with the months (be careful with the years), group by it and then do the cumsum, or your values will be calculated wrongly; see the sketch below.
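A minimal sketch of that monthly grouping, assuming the DATE column has already been parsed to datetime (the column names MONTH and OUTDRAW_CUM are my own):
df['MONTH'] = df['DATE'].dt.to_period('M')  # a Period keeps the year, so Jan 2020 != Jan 2021
df['OUTDRAW_CUM'] = df.groupby('MONTH')['OUTDRAW'].cumsum()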

Related

Overwrite a value in a pandas dataframe column based on a calculation function applied to it

From the following DataFrame:
import pandas as pd

worktime = 1440
person = [11, 22, 33, 44, 55]
begin_date = '2019-10-01'
shift = [1, 2, 3, 1, 2]
pause = [90, 0, 85, 70, 0]
occu = [60, 0, 40, 20, 0]
time_u = [50, 40, 80, 20, 0]
time_a = [84.5, 0.0, 10.5, 47.7, 0.0]
time_p = 0
time_q = [35.9, 69.1, 0.0, 0.0, 84.4]
df = pd.DataFrame({'date': pd.date_range(begin_date, periods=len(person)),
                   'person': person, 'shift': shift, 'worktime': worktime,
                   'pause': pause, 'occu': occu, 'time_u': time_u,
                   'time_a': time_a, 'time_p ': time_p, 'time_q': time_q})
Output:
date person shift worktime pause occu time_u time_a time_p time_q
0 2019-10-01 11 1 1440 90 60 50 84.5 0 35.9
1 2019-10-02 22 2 1440 0 0 40 0.0 0 69.1
2 2019-10-03 33 3 1440 85 40 80 10.5 0 0.0
3 2019-10-04 44 1 1440 70 20 20 47.7 0 0.0
4 2019-10-05 55 2 1440 0 0 0 0.0 0 84.4
I am looking for a suitable function that takes the value already contained in a column, uses it in a calculation, and then overwrites it with the result of the calculation.
It concerns the columns time_u, time_a, time_p and time_q, and it should be applied according to the following principle:
time_u = worktime - pause - occu - (existing value of time_u)
time_a = (new value of time_u) - time_a
time_p = (new value of time_a) - time_p
time_q = (new value of time_p)- time_q
Is there a possible function that could be used here?
Using this formula manually, the output would look like this:
date person shift worktime pause occu time_u time_a time_p time_q
0 2019-10-01 11 1 1440 90 60 1240 1155.5 1155.5 1119.6
1 2019-10-02 22 2 1440 0 0 1400 1400 1400 1330.9
2 2019-10-03 33 3 1440 85 40 1235 1224.5 1224.5 1224.5
3 2019-10-04 44 1 1440 70 20 1330 1282.3 1282.3 1282.3
4 2019-10-05 55 2 1440 0 0 1440 1440 1440 1355.6
Unfortunately, this task is way beyond my skill level, so any help in setting up the appropriate function would be greatly appreciated.
Many thanks in advance.
You can simply apply the relationships you supplied, sequentially; each line uses the value computed on the line before. Or are you looking for something else? By the way, you put an extra space at the end of 'time_p':
df['time_u'] = df['worktime'] - df['pause'] - df['occu'] - df['time_u']
df['time_a'] = df['time_u'] - df['time_a']
df['time_p'] = df['time_a'] - df['time_p']
df['time_q'] = df['time_p'] - df['time_q']
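Since the question asked for a function, here is a minimal sketch wrapping the same sequential updates (the name recalculate is my own; it assumes the stray 'time_p ' column has first been renamed to 'time_p'):
def recalculate(df):
    # order matters: each column uses the value computed just before it
    df = df.copy()
    df['time_u'] = df['worktime'] - df['pause'] - df['occu'] - df['time_u']
    df['time_a'] = df['time_u'] - df['time_a']
    df['time_p'] = df['time_a'] - df['time_p']
    df['time_q'] = df['time_p'] - df['time_q']
    return df

df = df.rename(columns={'time_p ': 'time_p'})
df = recalculate(df)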

Facebook Prophet: Providing different data sets to build a better model

My data frame looks like this; my goal is to predict event_id 3 based on the data of event_id 1 and event_id 2:
ds tickets_sold y event_id
3/12/19 90 90 1
3/13/19 40 130 1
3/14/19 13 143 1
3/15/19 8 151 1
3/16/19 13 164 1
3/17/19 14 178 1
3/20/19 10 188 1
3/20/19 15 203 1
3/20/19 13 216 1
3/21/19 6 222 1
3/22/19 11 233 1
3/23/19 12 245 1
3/12/19 30 30 2
3/13/19 23 53 2
3/14/19 43 96 2
3/15/19 24 120 2
3/16/19 3 123 2
3/17/19 5 128 2
3/20/19 3 131 2
3/20/19 25 156 2
3/20/19 64 220 2
3/21/19 6 226 2
3/22/19 4 230 2
3/23/19 63 293 2
I want to predict sales for the next 10 days of that data:
ds tickets_sold y event_id
3/24/19 20 20 3
3/25/19 30 50 3
3/26/19 20 70 3
3/27/19 12 82 3
3/28/19 12 94 3
3/29/19 12 106 3
3/30/19 12 118 3
So far my model is the one below. However, I am not telling the model that these are two separate events. It would be useful to consider the data from all events together, as they belong to the same organizer and therefore provide more information than a single event. Is that kind of fitting possible with Prophet?
# Load data
import pandas as pd
from prophet import Prophet  # 'fbprophet' in versions before 1.0

df = pd.read_csv('event_data_prophet.csv')
df.drop(columns=['tickets_sold'], inplace=True)
df.head()
# The important things to note are that cap must be specified for every row in the dataframe,
# and that it does not have to be constant. If the market size is growing, then cap can be an increasing sequence.
df['cap'] = 500
# growth: String 'linear' or 'logistic' to specify a linear or logistic trend.
m = Prophet(growth='linear')
m.fit(df)
# periods is the number of days to look into the future
future = m.make_future_dataframe(periods=20)
future['cap'] = 500
future.tail()
forecast = m.predict(future)
forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail()
fig1 = m.plot(forecast)
Start dates of events seem to cause peaks. You can use holidays for this by setting the starting date of each event as a holiday; this informs Prophet about the events (and their peaks). I noticed events 1 and 2 overlap. You have multiple options to deal with this, and you need to ask yourself what the predictive value of each event is relative to event 3. You don't have much data, and that will be the main issue. If the events have equal value, you could change the date of one of them, for example to 11 days earlier. In the unequal-value scenario you could drop one event.
events = pd.DataFrame({
    'holiday': 'events',
    'ds': pd.to_datetime(['2019-03-24', '2019-03-12', '2019-03-01']),
    'lower_window': 0,
    'upper_window': 1,
})
m = Prophet(growth='linear', holidays=events)
m.fit(df)
Also, I noticed you forecast on the cumsum. I think your events are stationary, so Prophet would probably benefit from forecasting the daily ticket sales rather than the cumulative sum.
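A minimal sketch of that last suggestion, continuing the snippets above and re-reading the CSV so that tickets_sold is still available (the names df_raw and daily are my own):
# fit on daily ticket sales instead of the running total
df_raw = pd.read_csv('event_data_prophet.csv')
daily = df_raw.drop(columns=['y']).rename(columns={'tickets_sold': 'y'})
m = Prophet(growth='linear', holidays=events)
m.fit(daily[['ds', 'y']])
future = m.make_future_dataframe(periods=10)
forecast = m.predict(future)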

Filtering pandas dataframe for a steady speed condition

Below is a sample dataframe which is similar to mine, except the one I am working on has 200,000 data points.
import pandas as pd
import numpy as np
df = pd.DataFrame([
    [10.07, 5], [10.24, 5], [12.85, 5], [11.85, 5],
    [11.10, 5], [14.56, 5], [14.43, 5], [14.85, 5],
    [14.95, 5], [10.41, 5], [15.20, 5], [15.47, 5],
    [15.40, 5], [15.31, 5], [15.43, 5], [15.65, 5]
], columns=['speed', 'delta_t'])
df
speed delta_t
0 10.07 5
1 10.24 5
2 12.85 5
3 11.85 5
4 11.10 5
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
9 10.41 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
std_dev = df.iloc[0:3,0].std() # this will give 1.55
print(std_dev)
I have 2 columns, 'speed' and 'delta_t'. delta_t is the difference in time between subsequent rows in my actual data (which has date and time). The operating speed keeps varying, and what I want to achieve is to filter out all data points where the speed is nearly steady, say by filtering for a standard deviation < 0.5 and delta_t >= 15 min. For example, if we start with the first speed, the code should be able to keep jumping to the next speeds, keep calculating the standard deviation, and if it is less than 0.5 and the delta_t values sum up to 30 min or more, I should copy that data into a new dataframe.
So for this dataframe I will be left with indices 5 to 8 and 10 to 15.
Is this possible? Could you please give me some suggestions on how to do it? Sorry, I am stuck; it seems too complicated to me.
Thank you.
Best regards, Arun
Let's use rolling, shift and std:
Calculate the rolling std over a window of 3, then find the stds less than 0.5 and use shift(-2) to also pick up the rows at the start of each window whose std was less than 0.5. Using boolean indexing with | (or), we get the entire steady-state range. (With delta_t fixed at 5 min, a window of 3 rows corresponds to 15 minutes.)
df_std = df['speed'].rolling(3).std()
df_ss = df[(df_std < 0.5) | (df_std < 0.5).shift(-2, fill_value=False)]
df_ss
Output:
speed delta_t
5 14.56 5
6 14.43 5
7 14.85 5
8 14.95 5
10 15.20 5
11 15.47 5
12 15.40 5
13 15.31 5
14 15.43 5
15 15.65 5
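If the real data has an actual timestamp column, a time-based rolling window avoids hard-coding the row count. A minimal sketch, assuming a datetime column named 'timestamp' (my own name) sorted in ascending order:
df_t = df.set_index('timestamp')
# an offset window takes however many rows fall inside each 15-minute span
df_std_t = df_t['speed'].rolling('15min').std()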

Nested if loop with DataFrame is very, very slow

I have 10 million rows to go through, and it will take many hours to process; I must be doing something wrong.
I converted the names of my df variables for ease of typing:
Close=df['Close']
eqId=df['eqId']
date=df['date']
IntDate=df['IntDate']
expiry=df['expiry']
delta=df['delta']
ivMid=df['ivMid']
conf=df['conf']
The code below works fine, it is just ungodly slow. Any suggestions?
import datetime

print(datetime.datetime.now().time())
for i in range(2, 1000):
    if delta[i] == 90:
        if delta[i-1] == 50:
            if delta[i-2] == 10:
                if expiry[i] == expiry[i-2]:
                    df.Skew[i] = ivMid[i] - ivMid[i-2]
print(datetime.datetime.now().time())
14:02:11.014396
14:02:13.834275
df.head(100)
Close eqId date IntDate expiry delta ivMid conf Skew
0 37.380005 7 2008-01-02 39447 1 50 0.3850 0.8663
1 37.380005 7 2008-01-02 39447 1 90 0.5053 0.7876
2 36.960007 7 2008-01-03 39448 1 50 0.3915 0.8597
3 36.960007 7 2008-01-03 39448 1 90 0.5119 0.7438
4 35.179993 7 2008-01-04 39449 1 50 0.4055 0.8454
5 35.179993 7 2008-01-04 39449 1 90 0.5183 0.7736
6 33.899994 7 2008-01-07 39452 1 50 0.4464 0.8400
7 33.899994 7 2008-01-07 39452 1 90 0.5230 0.7514
8 31.250000 7 2008-01-08 39453 1 10 0.4453 0.7086
9 31.250000 7 2008-01-08 39453 1 50 0.4826 0.8246
10 31.250000 7 2008-01-08 39453 1 90 0.5668 0.6474 0.1215
11 30.830002 7 2008-01-09 39454 1 10 0.4716 0.7186
12 30.830002 7 2008-01-09 39454 1 50 0.4963 0.8479
13 30.830002 7 2008-01-09 39454 1 90 0.5735 0.6704 0.1019
14 31.460007 7 2008-01-10 39455 1 10 0.4254 0.6737
15 31.460007 7 2008-01-10 39455 1 50 0.4929 0.8218
16 31.460007 7 2008-01-10 39455 1 90 0.5902 0.6411 0.1648
17 30.699997 7 2008-01-11 39456 1 10 0.4868 0.7183
18 30.699997 7 2008-01-11 39456 1 50 0.4965 0.8411
19 30.639999 7 2008-01-14 39459 1 10 0.5117 0.7620
20 30.639999 7 2008-01-14 39459 1 50 0.4989 0.8804
21 30.639999 7 2008-01-14 39459 1 90 0.5887 0.6845 0.077
22 29.309998 7 2008-01-15 39460 1 10 0.4956 0.7363
23 29.309998 7 2008-01-15 39460 1 50 0.5054 0.8643
24 30.080002 7 2008-01-16 39461 1 10 0.4983 0.6646
At this rate it will take 7.77 hrs to process
Basically, the whole point of numpy & pandas is to avoid loops like the plague and do things in a vectorized way. As you noticed, without that, the speed is gone.
Let's break your problem into steps.
The Conditions
Here, your first condition can be written like this:
df.delta == 90
(Note how this compares the entire column at once. This is much much faster than your loop!).
and the second one can be written like this (using shift):
df.delta.shift(1) == 50
The rest of your conditions are similar.
Note that to combine conditions, you need to use parentheses. So, the first two conditions, together, should be written as:
(df.delta == 90) & (df.delta.shift(1) == 50)
You should be able to now write an expression combining all your conditions. Let's call it cond, i.e.,
cond = (df.delta == 90) & (df.delta.shift(1) == 50) & ...
The assignment
To assign things to a new column, use
df['skew'] = ...
We just need to figure out what to put on the right-hand side.
The Right Hand Side
Since we have cond, we can write the right-hand-side as
np.where(cond, df.ivMid - df.ivMid.shift(2), 0)
What this says is: when the condition is true, take the second term; when it's not, take the third term (in this case I used 0, but use whatever you like).
By combining all of this, you should be able to write a very efficient version of your code.
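Putting the pieces together, a sketch of the full vectorized replacement (same column names as in the question):
import numpy as np

cond = ((df.delta == 90)
        & (df.delta.shift(1) == 50)
        & (df.delta.shift(2) == 10)
        & (df.expiry == df.expiry.shift(2)))
# rows that fail the condition get 0, as in the np.where call above
df['Skew'] = np.where(cond, df.ivMid - df.ivMid.shift(2), 0)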

pandas subtracting two grouped dataframes of different size

I have two dataframes:
my stock solutions (df1):
pH salt_conc
5.5 0 23596.0
200 19167.0
400 17052.5
6.0 0 37008.5
200 27652.0
400 30385.5
6.5 0 43752.5
200 41146.0
400 39965.0
and my measurements after I did something (df2):
pH salt_conc id
5.5 0 8 20953.0
11 24858.0
200 3 20022.5
400 13 17691.0
20 18774.0
6.0 0 14 38639.0
200 1 37223.5
2 36597.0
7 37039.0
10 37088.5
15 35968.5
16 36344.5
17 34894.0
18 36388.5
400 9 33386.0
6.5 0 4 41401.5
12 44933.5
200 5 43074.5
400 6 42210.5
19 41332.5
I would like to normalize each measurement in the second dataframe (df2) by the corresponding stock solution from which I took the sample.
Any suggestions?
Figured it out with the help of this post:
SO: Binary operation broadcasting across multiindex
I had to reset the index of both grouped dataframes and set it again:
df_initial = df_initial.reset_index().set_index(['pH','salt_conc'])
df_second = df_second.reset_index().set_index(['pH','salt_conc'])
Now I can do any calculation I want to do.
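For instance, a minimal sketch of the normalization itself, assuming the measured values sit in a column named 'value' in both frames (the column names are my own):
# with both frames sharing the ('pH', 'salt_conc') index, the division
# aligns on that index and broadcasts each stock value across its
# matching measurements
df_second['normalized'] = df_second['value'] / df_initial['value']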
