Pandas, get pct change period mean - python

I have a Data Frame which contains a column like this:
pct_change
0 NaN
1 -0.029767
2 0.039884 # period of one
3 -0.026398
4 0.044498 # period of two
5 0.061383 # period of two
6 -0.006618
7 0.028240 # period of one
8 -0.009859
9 -0.012233
10 0.035714 # period of three
11 0.042547 # period of three
12 0.027874 # period of three
13 -0.008823
14 -0.000131
15 0.044907 # period of one
I want to get all the periods where the pct change was positive into a list, so with the example column it will be:
raise_periods = [1,2,1,3,1]

Assuming that the column of your dataframe is a Series called y containing the pct_change values, the following code provides a vectorized solution without loops.
y = df['pct_change']
raise_periods = (y < 0).cumsum()[y > 0]
raise_periods.groupby(raise_periods).count()
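With the sample column this produces the run lengths 1, 2, 1, 3, 1 as a Series; to get the exact list form from the question, appending .tolist() should do it:
raise_periods.groupby(raise_periods).count().tolist()  # [1, 2, 1, 3, 1]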

Eventually, the answer provided by @gioxc88 didn't get me where I wanted, but it did point me in the right direction.
What I ended up doing is this:
def get_rise_avg_period(cls, df):
    df[COMPOUND_DIFF] = df[NEWS_COMPOUND].diff()
    df[CONSECUTIVE_COMPOUND] = df[COMPOUND_DIFF].apply(lambda x: 1 if x > 0 else 0)
    # group together the periods of rising and falling changes
    unfiltered_periods = [list(group) for key, group in itertools.groupby(df[CONSECUTIVE_COMPOUND].values.tolist())]
    # filter out only the rise periods
    positive_periods = [li for li in unfiltered_periods if 0 not in li]
I wanted to get the average length of these positive periods, so I added this at the end:
positive_periods_lens = [len(li) for li in positive_periods]
period = round(np.mean(positive_periods_lens))
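For what it's worth, the same average should also fall out of the vectorized answer above, without the itertools pass; a sketch reusing its raise_periods:
period = round(raise_periods.groupby(raise_periods).count().mean())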

Related

How do you perform conditional operations on different elements in a Pandas DataFrame?

Let's say I have a Pandas Dataframe of the price and stock history of a product at 10 different points in time:
df = pd.DataFrame(index=np.arange(10))
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df
   price  stock
0     10     30
1     10     20
2     11     13
3     15      8
4     20      4
5     10     30
6     10     20
7     11     13
8     15      8
9     20      4
How do I perform operations between specific rows that meet certain criteria?
In my example row 0 and row 5 meet the criteria "stock over 25" and row 4 and row 9 meet the criteria "stock under 5".
I would like to calculate:
df['price'][4] - df['price'][0] and
df['price'][9] - df['price'][5]
but not
df['price'][9] - df['price'][0] or
df['price'][4] - df['price'][5].
In other words, I would like to calculate the price change between each event where stock went under 5 and the most recent preceding event where stock went over 25, over the whole series.
Of course, I would like to do this over larger datasets where picking them manually is not good.
First, set up the data frame and add some calculations:
import pandas as pd
import numpy as np
df = pd.DataFrame(index=np.arange(10))
df['price'] = 10,10,11,15,20,10,10,11,15,20
df['stock'] = 30,20,13,8,4,30,20,13,8,4
df['stock_under_5'] = df['stock'] < 5
df['stock_over_25'] = df['stock'] > 25
df['cum_stock_under_5'] = df['stock_under_5'].cumsum()
df['change_stock_under_5'] = df['cum_stock_under_5'].diff()
df.loc[df.index[0], 'change_stock_under_5'] = df['stock_under_5'].iloc[0] * 1
df['next_row_change_stock_under_5'] = df['change_stock_under_5'].shift(-1)
df['cum_stock_over_25'] = df['stock_over_25'].cumsum()
df['change_stock_over_25'] = df['cum_stock_over_25'].diff()
df.loc[df.index[0], 'change_stock_over_25'] = df['stock_over_25'].iloc[0] * 1
df['next_row_change_stock_over_25'] = df['change_stock_over_25'].shift(-1)
df['row'] = np.arange(df.shape[0])
df['next_row'] = df['row'].shift(-1)
df['next_row_price'] = df['price'].shift(-1)
Next we find all windows where either the stock went over 25 or below 5 by grouping over the cumulative marker of those events.
changes = (
    df.groupby(['cum_stock_under_5', 'cum_stock_over_25'])
      .agg({'row': 'first', 'next_row': 'last', 'change_stock_under_5': 'max', 'change_stock_over_25': 'max',
            'next_row_change_stock_under_5': 'max', 'next_row_change_stock_over_25': 'max',
            'price': 'first', 'next_row_price': 'last'})
      .assign(price_change=lambda x: x['next_row_price'] - x['price'])
      .reset_index(drop=True)
)
For each window we find what happened at the beginning of the window: if change_stock_under_5 = 1 it means the window started with the stock going under 5, if change_stock_over_25 = 1 it started with the stock going over 25.
Same for the end of the window using the columns next_row_change_stock_under_5 and next_row_change_stock_over_25
Now, we can readily extract the stock price change in rows where the stock went from being over 25 to being under 5:
from_over_to_below = changes[(changes['change_stock_over_25']==1) & (changes['next_row_change_stock_under_5']==1)]
and the other way around:
from_below_to_over = changes[(changes['change_stock_under_5']==1) & (changes['next_row_change_stock_over_25']==1)]
You can for example calculate the average price change when the stock went from over 25 to below 5:
from_over_to_below.price_change.mean()
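With the sample data, both qualifying windows (rows 0 through 4 and rows 5 through 9) move from a price of 10 to a price of 20, so this should return 10.0.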
To give a better explanation, I will separate the approach into two different functions:
The first one will be the event detection; let's call it detect_event.
The second one will calculate the price change between the current event and the previous one in the list generated by the first function. We will call it calculate_price_change.
Starting with the first function, it is key to understand very well the goals we want to reach, or the constraints/conditions we want to satisfy.
I will leave two of the many potential options, given the various interpretations of the question:
A. The first follows my initial understanding of the question.
B. The second follows one of the interpretations one could take from @Iyar Lin's comment (I can see more interpretations, but won't consider them in this answer, as the approach would be similar).
Within option A, we will create a function to detect where the stock is under 5 or over 25:
def detect_event(df):
    # Create a list of the indexes of the events where stock was under 5 or over 25
    events = []
    # Loop through the dataframe
    for i in range(len(df)):
        # If stock is under 5, add the index to the list
        if df['stock'][i] < 5:
            events.append(i)
        # If stock is over 25, add the index to the list
        elif df['stock'][i] > 25:
            events.append(i)
    # Return the list of indexes of the events where stock was under 5 or over 25
    return events
The comments make it self-explanatory, but, basically, this will return a list of indexes of the rows where stock is under 5 or over 25.
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 4, 5, 9]
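As an aside, the same list can likely be built without an explicit loop; a minimal sketch using a boolean mask over the same df:
events = df.index[(df['stock'] < 5) | (df['stock'] > 25)].tolist()  # [0, 4, 5, 9]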
Within option B, assuming one wants to know the events where the stock went from under 5 to over 25, or vice-versa, consecutively (there are more ways to interpret this), one can use the following function:
def detect_event(df):
    # Create a list of the indexes of the events that satisfy the conditions
    events = []
    for i, stock in enumerate(df['stock']):
        # If the index is 0, add the index of the first event to the list of events
        if i == 0:
            events.append(i)
        # If the index is not 0, check if the stock went from over 25 to under 5 or from under 5 to over 25
        else:
            # If the stock went from over 25 to under 5, add the index of the event to the list of events
            if stock < 5 and df['stock'][i-1] > 25:
                events.append(i)
            # If the stock went from under 5 to over 25, add the index of the event to the list of events
            elif stock > 25 and df['stock'][i-1] < 5:
                events.append(i)
    # Return the list of events
    return events
With OP's df this will return
events = detect_event(df)
[Out]:
[0, 5]
Note that 0 is the index of the first row, which we append by default.
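Here too a loop-free version seems possible; a sketch using shift, assuming the same convention of always keeping the first row:
s = df['stock']
prev = s.shift()
mask = ((s < 5) & (prev > 25)) | ((s > 25) & (prev < 5))
mask.iloc[0] = True  # first row kept by convention, as in the loop above
events = df.index[mask].tolist()  # [0, 5]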
As for the second function: once the conditions are well defined, meaning we know clearly what we want and have adapted the first function, detect_event, accordingly, we can detect the changes in the prices.
In order to detect the price change between the events that satisfy the conditions we defined previously, one will use a different function: calculate_price_change.
This function takes both the dataframe df and the list events generated by the previous function, and returns a list with the price differences.
def calculate_price_change(df, events):
    # Create a list to store the price change between each event and the one before it
    price_change = []
    # Loop through the list of indexes of the events
    for i, event in enumerate(events):
        # If the index is 0, the price change is 0
        if i == 0:
            price_change.append(0)
        # If the index is not 0, calculate the price change between the current and previous events
        else:
            price_change.append(df['price'][event] - df['price'][events[i-1]])
    return price_change
Now, if one calls this last function with the df and the list created by the first function, detect_event, one gets the following:
price_change = calculate_price_change(df, events)
[Out]:
[0, 10, -10, 10]
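A compact equivalent, assuming numpy is imported as np and events comes from option A above:
price_change = [0] + np.diff(df['price'].loc[events].to_numpy()).tolist()  # [0, 10, -10, 10]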
Notes:
As it is, the question leaves room for multiple interpretations; that's why I initially flagged it as "Needs details or clarity". For the future, one might want to review How do I ask a good question? and its hyperlinks.
I understand that sometimes we won't be able to specify everything we want (as we might not even know it ourselves, for various reasons), so communication is key. I therefore appreciate Iyar Lin's time and contributions, as they helped improve this answer.

suggestion on how to solve an infinite loop problem (python-pandas)

I have a data frame with 384 rows (and an additional dummy one at the beginning).
Each row has 4 variables I wrote manually, 3 calculated fields based on those 4 variables, and 3 that compare each calculated variable to the row before. Each field can have one of two values (basically True/False).
Final goal: I want to arrange the data frame so that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs exactly 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot) and, if one of the counts differs from 6, breaks and randomizes the order again.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"

# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find an optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [random.randint(low=1, high=100000) for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row, the first row is
    # a dummy and needs to stay first when sorted
    df.loc[0, "rand"] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
The data frame looks like this:
left  right  size_left  size_right  distance  cong  CR  distance_n1  cong_n1  side_n1
1     6      22         44          5         T     F   dummy        dummy    dummy
5     4      44         22          1         T     T   F            T        F
2     3      44         22          1         F     F   T            F        F
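As an aside on the final check: the per-group loop can likely be replaced by a single vectorized test; a rough sketch, reusing Sorted and the grouping columns from the script above:
counts = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', 'Side_n1']).size()
# a valid ordering has all 64 combinations present, each exactly 6 times
again = 0 if len(counts) == 64 and (counts == 6).all() else 1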

Building complex subsets in Pandas DataFrame

I'm making my way around GroupBy, but I still need some help. Let's say that I have a DataFrame with a column Group, giving each object's group number, some parameter R, and spherical coordinates RA and Dec. Here is a mock DataFrame:
df = pd.DataFrame({
    'R'    : (-21.0,-21.5,-22.1,-23.7,-23.8,-20.4,-21.8,-19.3,-22.5,-24.7,-19.9),
    'RA'   : (154.362789,154.409301,154.419191,154.474165,154.424842,162.568516,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Dec'  : (-0.495605,-0.453085,-0.481657,-0.614827,-0.584243,8.214719,8.355454,8.346812,8.728223,8.759622,8.799796),
    'Group': (1,1,1,1,1,2,2,2,2,2,2)
})
I want to build a selection containing, for each group, the "brightest" object, i.e. the one with the smallest R (or the greatest absolute value, since R is negative), and the 3 closest objects of the group (so I keep 4 objects in each group - we can assume that there is no group smaller than 4 objects if needed).
We assume here that we have defined the following functions:
# deg to rad
def d2r(x):
    return x * np.pi / 180.0

# rad to deg
def r2d(x):
    return x * 180.0 / np.pi

# computes separation on a sphere
def calc_sep(phi1, theta1, phi2, theta2):
    return np.arccos(np.sin(theta1)*np.sin(theta2) +
                     np.cos(theta1)*np.cos(theta2)*np.cos(phi2 - phi1))
and that separation between two objects is given by r2d(calc_sep(RA1,Dec1,RA2,Dec2)), with RA1 as RA for the first object, and so on.
I can't figure out how to use GroupBy to achieve this...
What you can do here is build a more specific helper function that gets applied to each "sub-frame" (each group).
GroupBy is really just a facility that creates something like an iterator of (group id, DataFrame) pairs, and a function is applied to each of these when you call .groupby().apply. (That glosses over a lot of details; see here for some details on internals if you're interested.)
So after defining your three NumPy-based functions, also define:
def sep_df(df, keep=3):
    min_r = df.loc[df.R.idxmin()]  # idxmin (not argmin): we need the index label, not the position
    RA1, Dec1 = min_r.RA, min_r.Dec
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    idx = sep.nsmallest(keep+1).index
    return df.loc[idx]
Then just apply and you get a MultiIndex DataFrame where the first index level is the group.
print(df.groupby('Group').apply(sep_df))
              Dec  Group     R         RA
Group
1     3  -0.61483      1 -23.7  154.47416
      2  -0.48166      1 -22.1  154.41919
      0  -0.49561      1 -21.0  154.36279
      4  -0.58424      1 -23.8  154.42484
2     8   8.72822      2 -22.5    8.72822
      10  8.79980      2 -19.9    8.79980
      6   8.35545      2 -21.8    8.35545
      9   8.75962      2 -24.7    8.75962
With some comments interspersed:
def sep_df(df, keep=3):
    # Applied to each sub-DataFrame (this is what GroupBy does under the hood)
    # Get RA and Dec values at minimum R
    min_r = df.loc[df.R.idxmin()]    # Series - row at which R is minimum
    RA1, Dec1 = min_r.RA, min_r.Dec  # Relevant 2 scalars within this row
    # Calculate separation for each pair including the minimum R row
    # The result is a series of separations, same length as `df`
    sep = r2d(calc_sep(RA1, Dec1, df['RA'], df['Dec']))
    # Get index values of the `keep` (default 3) smallest results
    # Retain `keep+1` values because one will be the minimum R
    # row where separation=0
    idx = sep.nsmallest(keep+1).index
    # Restrict the result to those 3 index labels + your minimum R row
    return df.loc[idx]
For speed, consider passing sort=False to GroupBy if the result still works for you.
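i.e., something along the lines of:
df.groupby('Group', sort=False).apply(sep_df)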
I want to build a selection containing for each group the "brightest" object...and the 3 closest objects of the group
step 1:
create a dataframe for the brightest object in each group
maxR = df.sort_values('R').groupby('Group')[['Group', 'Dec', 'RA']].head(1)
step 2:
merge the two frames on Group & calculate the separation
merged = df.merge(maxR, on='Group', suffixes=['', '_max'])
merged['sep'] = merged.apply(
    lambda x: r2d(calc_sep(x.RA, x.Dec, x.RA_max, x.Dec_max)),
    axis=1
)
step 3:
order the data frame, group by 'Group', (optional) discard intermediate fields & take the first 4 rows from each group
finaldf = merged.sort_values(['Group', 'sep'], ascending=[1, 1]
                             ).groupby('Group')[df.columns].head(4)
Produces the following data frame with your sample data:
          Dec  Group     R          RA
4   -0.584243      1 -23.8  154.424842
3   -0.614827      1 -23.7  154.474165
2   -0.481657      1 -22.1  154.419191
0   -0.495605      1 -21.0  154.362789
9    8.759622      2 -24.7    8.759622
8    8.728223      2 -22.5    8.728223
10   8.799796      2 -19.9    8.799796
6    8.355454      2 -21.8    8.355454

Find first time a value occurs in the dataframe

I have a dataframe with year-quarter (e.g. 2015-Q4), customer_ID, amount booked, and many other columns that are irrelevant for now. I want to create a column that has the first time each customer made a booking. I tried this:
alldata.sort_values(by=['Total_Apps_Reseller_Bookings_USD', 'Year_Quarter'],
                    ascending=[1, 1],
                    inplace=True)
first_q = alldata[['Customer_ID', 'Year_Quarter']].groupby(by='Customer_ID').first()
but I am not sure it worked.
Also, I then want to have another column that tells me how many quarters after the first booking each booking was made. Having failed with replace and a dictionary, I used a merge: I create a numeric id for each quarter of booking, and for the first quarter from above, and then subtract the two:
q_booking_num = pd.DataFrame({'Year_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_booking_num, on='Year_Quarter', how='outer')
q_first_num = pd.DataFrame({'First_Quarter': x, 'First_Quarter_id': np.arange(28)})
alldata = pd.merge(alldata, q_first_num, on='First_Quarter', how='outer')
This doesn't seem to have worked at all, as I see "first quarters" that come after bookings that were already made.
You need to specify which column to use for taking the first value:
first_q = (alldata[['Customer_ID', 'Year_Quarter']]
           .groupby(by='Customer_ID')
           .Year_Quarter
           .first()
           )
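To then attach the first quarter back onto the original frame as a column, a map should work; a sketch, given that first_q is indexed by Customer_ID as above:
alldata['First_Quarter'] = alldata['Customer_ID'].map(first_q)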
Here is some sample data for three customers:
df = pd.DataFrame({'customer_ID': [1,
                                   2, 2,
                                   3, 3, 3],
                   'Year_Quarter': ['2010-Q1',
                                    '2010-Q1', '2011-Q1',
                                    '2010-Q1', '2011-Q1', '2012-Q1'],
                   'Total_Apps_Reseller_Bookings_USD': [1,
                                                        2, 3,
                                                        4, 5, 6]})
Below, I convert text quarters (e.g. '2010-Q1') to a numeric equivalent by taking the int value of the first four characters (df.Year_Quarter.str[:4].astype(int)). I then multiply that by four and add the value of the quarter. This value is only used for differencing to determine the total number of quarters since the first order.
Next, I use transform on the groupby to take the min value of the quarters we just calculated. Using transform keeps the result in the same shape as the original dataframe.
I then calculate quarters_since_first_order as the difference between each quarter and the first quarter.
df['quarters'] = df.Year_Quarter.str[:4].astype(int) * 4 + df.Year_Quarter.str[-1].astype(int)
first_order_quarter_no = df.groupby('customer_ID').quarters.transform(min)
df['quarters_since_first_order'] = df['quarters'] - first_order_quarter_no
del df['quarters']  # Clean-up.
>>> df
   Total_Apps_Reseller_Bookings_USD Year_Quarter  customer_ID  quarters_since_first_order
0                                 1      2010-Q1            1                           0
1                                 2      2010-Q1            2                           0
2                                 3      2011-Q1            2                           4
3                                 4      2010-Q1            3                           0
4                                 5      2011-Q1            3                           4
5                                 6      2012-Q1            3                           8
For part 1:
I think you need to sort a little differently to get your desired outcome:
alldata.sort_values(by=['Customer_ID', 'Year_Quarter',
                        'Total_Apps_Reseller_Bookings_USD'],
                    ascending=[1, 1, 1], inplace=True)
first_q = alldata[['Customer_ID','Year_Quarter']].groupby(by='Customer_ID').head(1)
For part 2:
Continuing off of part 1, you can merge the values back on to the original dataframe. At that point, you can write a custom function to subtract your date strings and then apply it to each row.
Something like:
def qt_sub(val, first):
    # parse the year and quarter digits as ints before subtracting
    year_dif = int(val[0:4]) - int(first[0:4])
    qt_dif = int(val[6]) - int(first[6])
    return 4 * year_dif + qt_dif

alldata['diff_from_first'] = alldata.apply(lambda x: qt_sub(x['Year_Quarter'],
                                                            x['First_Sale']),
                                           axis=1)
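A quick sanity check of the helper on hypothetical values:
qt_sub('2011-Q3', '2010-Q1')  # 4 * 1 + 2 = 6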

Calculate the duration of a state with a pandas Dataframe

I'm trying to calculate how often a state is entered and how long it lasts. For example, I have the three possible states 1, 2 and 3; which state is active is logged in a pandas DataFrame:
test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14))
For example, state 1 is entered two times (at index 3 and 12); the first time it lasts three hours, the second time two hours (so 2.5 on average). State 2 is entered 3 times, for 2.66 hours on average.
I know that I can mask the data I'm not interested in, for example to analyze state 1:
state1 = test.mask(test!=1)
but from there on I can't find a way to go on.
I hope the comments give enough explanation - the key point is you can use a custom rolling window function and then cumsum to group the rows into "clumps" of the same state.
import pandas as pd

# set things up
freq = "1h"
df = pd.DataFrame(
    [2,2,2,1,1,1,2,2,2,3,2,2,1,1],
    index=pd.date_range('00:00', freq=freq, periods=14)
)
# add a column saying if a row belongs to the same state as the one before it
# (pd.rolling_apply has since been removed; .rolling().apply is the current spelling)
df["is_first"] = df[0].rolling(2).apply(lambda x: x[0] != x[1], raw=True).fillna(1)
# the cumulative sum - each "clump" gets its own integer id
df["value_group"] = df["is_first"].cumsum()
# get the rows corresponding to states beginning
start = df.groupby("value_group", as_index=False).nth(0)
# get the rows corresponding to states ending
end = df.groupby("value_group", as_index=False).nth(-1)
# put the timestamp indexes of the "first" and "last" state measurements into
# their own data frame
start_end = pd.DataFrame(
    {
        "start": start.index,
        # add freq to get when the state ended
        "end": end.index + pd.Timedelta(freq),
        "value": start[0].values
    }
)
# convert timedeltas to seconds (float)
start_end["duration"] = (start_end["end"] - start_end["start"]).dt.total_seconds()
# get average state length and counts
agg = start_end.groupby("value").agg(["mean", "count"])["duration"]
agg["mean"] = agg["mean"] / (60 * 60)  # seconds -> hours
And the output:
           mean  count
value
1      2.500000      2
2      2.666667      3
3      1.000000      1
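As an aside, the is_first flag can likely also be computed with a plain shift comparison instead of the rolling window; a one-line sketch on the same df:
df["is_first"] = (df[0] != df[0].shift()).astype(int)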
