Pandas, subtract values based on value of another column - python

In Pandas, I'm trying to figure out how to generate a column that is the difference between the time of the current row and time of the last row in which the value of another column is True:
So given the dataframe:
import pandas as pd

df = pd.DataFrame({'Time': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
                   'Event_Occured': [True, False, False, True, True,
                                     False, False, True, False, False]})
print(df)
Event_Occured Time
0 True 5
1 False 10
2 False 15
3 True 20
4 True 25
5 False 30
6 False 35
7 True 40
8 False 45
9 False 50
I'm trying to generate a column that would look like this:
Event_Occured Time Time_since_last
0 True 5 0
1 False 10 5
2 False 15 10
3 True 20 0
4 True 25 0
5 False 30 5
6 False 35 10
7 True 40 0
8 False 45 5
9 False 50 10
Thanks very much!

Using df.Event_Occured.cumsum() gives you distinct groups to groupby. Then applying a function per group that subtracts the first member's value from every member gets you what you want.
df['Time_since_last'] = \
df.groupby(df.Event_Occured.cumsum()).Time.apply(lambda x: x - x.iloc[0])
df
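For reference, here is a self-contained sketch of that approach using the sample frame from the question; the new column should come out as 0, 5, 10, 0, 0, 5, 10, 0, 5, 10:
import pandas as pd

df = pd.DataFrame({'Time': [5, 10, 15, 20, 25, 30, 35, 40, 45, 50],
                   'Event_Occured': [True, False, False, True, True,
                                     False, False, True, False, False]})
# each True starts a new group (cumsum increments on every True row);
# within each group, subtract the group's first Time value
df['Time_since_last'] = (df.groupby(df.Event_Occured.cumsum())
                           .Time.apply(lambda x: x - x.iloc[0]))
print(df['Time_since_last'].tolist())   # [0, 5, 10, 0, 0, 5, 10, 0, 5, 10]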

Here's an alternative that fills the values corresponding to Falses with the last valid observation:
df['Time'] - df.loc[df['Event_Occured'], 'Time'].reindex(df.index).ffill()
Out:
0 0.0
1 5.0
2 10.0
3 0.0
4 0.0
5 5.0
6 10.0
7 0.0
8 5.0
9 10.0
Name: Time, dtype: float64
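If you want to store the result as an integer column, here is a small follow-up sketch (the int cast assumes the first row is an event, so no NaN remains after the ffill):
df['Time_since_last'] = (df['Time']
                         - df.loc[df['Event_Occured'], 'Time']
                             .reindex(df.index)
                             .ffill()).astype(int)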

Related

Pandas: How to compute a conditional rolling/accumulative maximum within a group

I would like to achieve the following results in the column condrolmax (based on column close) (conditional rolling/accumulative max) without using a stupidly slow for loop.
Index close bool condrolmax
0 1 True 1
1 3 True 3
2 2 True 3
3 5 True 5
4 3 False 5
5 3 True 3 --> rolling/accumulative maximum reset (False cond above)
6 4 True 4
7 5 False 4
8 7 False 4
9 5 True 5 --> rolling/accumulative maximum reset (False cond above)
10 7 False 5
11 8 False 5
12 6 True 6 --> rolling/accumulative maximum reset (False cond above)
13 8 True 8
14 5 False 8
15 5 True 5 --> rolling/accumulative maximum reset (False cond above)
16 7 True 7
17 15 True 15
18 16 True 16
The code to create this dataframe:
import pandas as pd

# initialise data of lists
data = {'close': [1, 3, 2, 5, 3, 3, 4, 5, 7, 5, 7, 8, 6, 8, 5, 5, 7, 15, 16],
        'bool': [True, True, True, True, False, True, True, False, False, True,
                 False, False, True, True, False, True, True, True, True],
        'condrolmax': [1, 3, 3, 5, 5, 3, 4, 4, 4, 5, 5, 5, 6, 8, 8, 5, 7, 15, 16]}

# Create DataFrame
df = pd.DataFrame(data)
I am sure it is possible to vectorize this (one-liner). Any suggestions?
Thanks again!
You can define groups and then use cummax() within each group, as follows:
# Set group: New group if current row `bool` is True and last row `bool` is False
g = (df['bool'] & (~df['bool']).shift()).cumsum()
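# (for the sample data above, g should come out as
#  [0,0,0,0,0,1,1,1,1,2,2,2,3,3,3,4,4,4,4] -- one label per reset)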
# Get cumulative max of column `close` within the group
df['condrolmax'] = df.groupby(g)['close'].cummax()
Result:
print(df)
close bool condrolmax
0 1 True 1
1 3 True 3
2 2 True 3
3 5 True 5
4 3 False 5
5 3 True 3
6 4 True 4
7 5 False 5
8 7 False 7
9 5 True 5
10 7 False 7
11 8 False 8
12 6 True 6
13 8 True 8
14 5 False 8
15 5 True 5
16 7 True 7
17 15 True 15
18 16 True 16
First make groups using your condition (bool changing from False to True) and cumsum, then apply your rolling after a groupby:
group = (df['bool']&(~df['bool']).shift()).cumsum()
df.groupby(group)['close'].rolling(2, min_periods=1).max()
output:
0 0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
1 5 3.0
6 4.0
7 5.0
8 7.0
2 9 5.0
10 7.0
11 8.0
3 12 6.0
13 8.0
14 8.0
4 15 5.0
16 7.0
17 15.0
18 16.0
Name: close, dtype: float64
To insert back as a column:
df['condrolmax'] = df.groupby(group)['close'].rolling(2, min_periods=1).max().droplevel(0)
output:
close bool condrolmax
0 1 True 1.0
1 3 True 3.0
2 2 True 3.0
3 5 True 5.0
4 3 False 5.0
5 3 True 3.0
6 4 True 4.0
7 5 False 5.0
8 7 False 7.0
9 5 True 5.0
10 7 False 7.0
11 8 False 8.0
12 6 True 6.0
13 8 True 8.0
14 5 False 8.0
15 5 True 5.0
16 7 True 7.0
17 15 True 15.0
18 16 True 16.0
NB: min_periods=1 is what lets the first row of each group (the boundary row) still get a value; with the default min_periods it would be NaN.
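If you prefer integer values in condrolmax, an optional follow-up sketch (the cast is safe here because min_periods=1 leaves no NaN):
df['condrolmax'] = (df.groupby(group)['close']
                      .rolling(2, min_periods=1).max()
                      .droplevel(0).astype(int))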
I'm not sure how to use linear algebra and vectorization to make this faster, but using list comprehensions we can write a faster algorithm. First, define the function as:
def faster_condrolmax(df):
    df['cond_index'] = [df.index[i] if df['bool'][i] == False else 0
                        for i in df.index]
    df['cond_comp_index'] = [np.max(df.cond_index[0:i]) for i in df.index]
    df['cond_comp_index'] = df['cond_comp_index'].fillna(0).astype(int)
    df['condrolmax'] = np.zeros(len(df.close))
    df['condrolmax'] = [np.max(df.close[df.cond_comp_index[i]:i])
                        if df.cond_comp_index[i] < i else df.close[i]
                        for i in range(len(df.close))]
    return df
Then, you can use:
!pip install line_profiler
%load_ext line_profiler
to add and load the line profiler and see how long each line of the code takes with this:
%lprun -f faster_condrolmax faster_condrolmax(df)
which will give:
[screenshot: line-by-line profiling results]
Or, to just see how long the whole function takes:
%timeit faster_condrolmax(df)
which will give:
[screenshot: total timing for faster_condrolmax]
If you use SeaBean's function you get better results: it runs in about half the time of my proposed function. However, the timing estimate for SeaBean's function doesn't look robust; to evaluate it properly you should run it on a larger dataset and then decide. That's because %timeit reports it like this:
[screenshot: timing result for SeaBean's function]

Group by a boolean variable and create a new column with the result for each group pandas

This may be a little confusing, but I have the following dataframe:
exporter assets liabilities
False 5 1
True 10 8
False 3 1
False 24 20
False 40 2
True 12 11
I want to calculate a ratio with this formula: (df['liabilities'].sum() / df['assets'].sum()) * 100
And I expect to create a new column where the values are the ratio but calculated for each boolean value, like this:
exporter assets liabilities ratio
False 5 1 33.3
True 10 8 86.3
False 3 1 33.3
False 24 20 33.3
False 40 2 33.3
True 12 11 86.3
Use DataFrame.groupby on column exporter and transform the dataframe using sum, then use Series.div to divide liabilities by assets and Series.mul to multiply by 100:
d = df.groupby('exporter').transform('sum')
df['ratio'] = d['liabilities'].div(d['assets']).mul(100).round(2)
Result:
print(df)
exporter assets liabilities ratio
0 False 5 1 33.33
1 True 10 8 86.36
2 False 3 1 33.33
3 False 24 20 33.33
4 False 40 2 33.33
5 True 12 11 86.36
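A hedged alternative sketch (not from the answer above): compute the ratio once per group, then map it back onto every row via the exporter flag:
ratio = df.groupby('exporter').apply(lambda g: g['liabilities'].sum() / g['assets'].sum() * 100)
df['ratio'] = df['exporter'].map(ratio).round(2)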

Pandas: Select first occurance of DataFrame rows between range

I have a dataframe from which I want to select data between a range, only the first occurrence of this range.
The dataframe:
data = {'x':[1,2,3,4,5,6,7,6.5,5.5,4.5,3.5,2.5,1], 'y':[1,4,3,3,52,3,74,64,15,41,31,12,11]}
df = pd.DataFrame(data)
e.g. select x from 2 to 6, first occurrence:
x y
0 1.0 1 #out of range
1 2.0 4 #out of range
2 3.0 3 #this first occurrence
3 4.0 3 #this first occurrence
4 5.0 52 #this first occurrence
5 6.0 3 #out of range
6 7.0 74 #out of range
7 6.5 64 #out of range
8 5.5 15 #not this since repeating RANGE
9 4.5 41 #not this since repeating RANGE
10 3.5 31 #not this since repeating RANGE
11 2.5 12 #not this since repeating RANGE
12 1.0 11 #out of range
Output
x y
2 3.0 3 #this first occurrence
3 4.0 3 #this first occurrence
4 5.0 52 #this first occurrence
I am trying to modify this example: Select DataFrame rows between two dates to select data between 2 values for their first occurrence:
xlim=[2,6]
mask = (df['x'] > xlim[0]) & (df['x'] <= xlim[1])
df = df.loc[mask]  # need to make it the first occurrence here
Here's one approach:
# mask with True whenever a value is within the range
m = df.x.between(2,6, inclusive=False)
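# note: on newer pandas (>= 1.3) pass inclusive='neither' instead of inclusive=False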
# logical XOR with the previous row, then cumsum;
# keeping only the rows where the cumulative count equals 1 gives the first in-range run
df.loc[(m ^ m.shift()).cumsum().eq(1)]
x y
2 3.0 3
3 4.0 3
4 5.0 52
Details -
df.assign(in_range=m, is_next_different=(m ^ m.shift()).cumsum())
x y in_range is_next_different
0 1.0 1 False 0
1 2.0 4 False 0
2 3.0 3 True 1
3 4.0 3 True 1
4 5.0 52 True 1
5 6.0 3 False 2
6 7.0 74 False 2
7 6.5 64 False 2
8 5.5 15 True 3
9 4.5 41 True 3
10 3.5 31 True 3
11 2.5 12 True 3
12 1.0 11 False 4
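The same idea can be wrapped in a small reusable helper (a sketch; the name first_run_between is hypothetical, not from the answer):
def first_run_between(frame, col, lo, hi):
    # True for rows strictly inside (lo, hi)
    m = frame[col].between(lo, hi, inclusive='neither')
    # each in/out transition bumps the counter, so the first in-range run
    # is exactly where the cumulative count equals 1
    return frame.loc[(m ^ m.shift()).cumsum().eq(1)]

first_run_between(df, 'x', 2, 6)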

Update: How to compare values in 3 consecutive rows in pandas dataframe?

I am looking for a solution which would compare values in 3 consecutive rows of data and update column if condition is true.
import pandas as pd
aapl = pd.read_csv(....)
aapl['3lows'] = False
aapl.head(10)
and the output is a table where each row has the columns
Row number/ Date / Open / High / Low / Close / Adj Close / Volume / 3lows
0 / 2006-01-03 / 10.340000 / 10.678572 / 10.321428 / 10.678572 / 9.572629 / 201808600 / False
Now I want to run some "script" that sets 3lows to True when the value in column Low of the row being updated (e.g. row 100) is lower than in row 99, row 99 is lower than row 98, and row 98 is lower than row 97.
IIUC:
Let's use something like this:
# here s stands in for aapl['Low']; let's make up some data
s = pd.Series([100,99,98,97,99,100,99,95,94,93,92,100,95])
s.diff().rolling(3).max().lt(0)
Returns:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 True
10 True
11 False
12 False
dtype: bool
Details:
s
Output:
0 100
1 99
2 98
3 97
4 99
5 100
6 99
7 95
8 94
9 93
10 92
11 100
12 95
dtype: int64
Compare each value to previous using diff:
s.diff()
Output:
0 NaN
1 -1.0
2 -1.0
3 -1.0
4 2.0
5 1.0
6 -1.0
7 -4.0
8 -1.0
9 -1.0
10 -1.0
11 8.0
12 -5.0
dtype: float64
Now, let's look at a rolling window of 3 values: if the max is less than zero, then you have three consecutive declines:
s.diff().rolling(3).max().lt(0)
Output:
0 False
1 False
2 False
3 True
4 False
5 False
6 False
7 False
8 True
9 True
10 True
11 False
12 False
dtype: bool
Now, let's compare our result to the original data:
print(pd.concat([s,s.diff().rolling(3).max().lt(0)], axis=1))
0 1
0 100 False
1 99 False
2 98 False
3 97 True
4 99 False
5 100 False
6 99 False
7 95 False
8 94 True
9 93 True
10 92 True
11 100 False
12 95 False
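To write this back into the frame from the question (a minimal sketch, assuming aapl has a Low column as shown above):
aapl['3lows'] = aapl['Low'].diff().rolling(3).max().lt(0)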

How to assign DataFrame observations to groups according to a particular distribution?

I have a pandas DataFrame where each observation (row) represents a person.
I want to assign every person who satisfies a particular condition to different groups. I need this because my final aim is to create a network and link the persons in the same groups with some probabilities depending on the group.
So, for instance, I want to assign all children aged between 6 and 10 to schools. Then in the end I will create links between the children in the same school with a particular probability p.
I know the size distribution of the schools in the area I want to simulate.
So I want to draw school sizes from this distribution and then "fill up" the schools with all the children aged from 6 to 10.
I am new to pandas: the way I was thinking to do this was to create a new column, fill it up with NaN and then just assign a school ID to the different students.
Let's say my DataFrame df is this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID': range(11), 'AGE': [15, 6, 54, 8, 10, 39, 2, 7, 9, 10, 6]})
df
Out[1]:
AGE ID
0 15 0
1 6 1
2 54 2
3 8 3
4 10 4
5 39 5
6 2 6
7 7 7
8 9 8
9 10 9
10 6 10
(Incidentally, I don't know how to put the ID column first, but anyway in real life I'm reading the dataframe from a CSV file so that's not a problem).
Now, what I'd like to do is create another column, ELEM_SCHOOL_ID, initialize it to NaN and just assign values to those who are the right age.
What I have succeeded in doing so far is to create a subset of the DataFrame with the persons who satisfy the age condition.
df['IN_ELEM_SCH'] = np.where((df['AGE']>5) & (df['AGE']<11), 'True', 'False')
df
Out[2]:
AGE ID IN_ELEM_SCH
0 15 0 False
1 6 1 True
2 54 2 False
3 8 3 True
4 10 4 True
5 39 5 False
6 2 6 False
7 7 7 True
8 9 8 True
9 10 9 True
10 6 10 True
Then, I would need to add another column, ELEM_SCHOOL_ID that contains the ID of the particular elementary school every student is attending.
I can initialize the new column with:
df["ELEM_SCHOOL_ID"] = np.nan
df
Out[84]:
AGE ID IN_ELEM_SCH SCHOOL_ID
0 15 0 False NaN
1 6 1 True NaN
2 54 2 False NaN
3 8 3 True NaN
4 10 4 True NaN
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True NaN
8 9 8 True NaN
9 10 9 True NaN
10 6 10 True NaN
What I want to do now is:
Draw a number from the school size distribution: n0
For n0 random persons satisfying the age condition (so those who have IN_ELEM_SCHOOL == True), assign 0 to SCHOOL_ID
Draw another number from the school size distribution: n1
For n1 random persons still not assigned to a school, assign 1 to SCHOOL_ID
Repeat until all the persons with IN_ELEM_SCH == True have been assigned a school ID.
So, for example, let's say that the first school size drawn from the distribution is n0=2, the second n1=3 and the third n2=4.
I want to end up with something like this:
AGE ID IN_ELEM_SCH SCHOOL_ID
0 15 0 False NaN
1 6 1 True 0
2 54 2 False NaN
3 8 3 True 1
4 10 4 True 2
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True 1
8 9 8 True 1
9 10 9 True 2
10 6 10 True 0
In real life, the school size is distributed as a lognormal distribution. Say, with parameters mu = 4 and sigma = 1
I can then draw from this distribution:
s = np.random.lognormal(mu, sigma, 100)
But I still wasn't able to figure out how to assign the schools.
I apologize for the length of this question, but I wanted to be clear.
Thank you very much for any hint or help you could give me.
Pandas will automatically match on the index when assigning new data. Check out the pandas docs on indexing.
Note: You wouldn't normally create the extra IN_ELEM_SCHOOL column (i.e. third line in the code below is unnecessary).
mu, sigma = 1, 0.5
m = (5 < df['AGE']) & (df['AGE'] < 11)
df['IN_ELEM_SCHOOL'] = m
# shuffle the eligible rows, then hand out school IDs in consecutive blocks;
# cast to float so we can store numeric IDs (and NaN for everyone else later)
s = m[m].sample(frac=1).astype(float)
n, i = 0, 0
while n < len(s):
    # draw a school size; force at least 1 so the loop always advances
    num_students = max(1, int(np.random.lognormal(mu, sigma)))
    s.iloc[n: n + num_students] = i
    i += 1
    n += num_students
df['SCHOOL_ID'] = s  # aligns on the index; rows outside the age range get NaN
df
returns
AGE ID IN_ELEM_SCHOOL SCHOOL_ID
0 15 0 False NaN
1 6 1 True 0.0
2 54 2 False NaN
3 8 3 True 1.0
4 10 4 True 2.0
5 39 5 False NaN
6 2 6 False NaN
7 7 7 True 1.0
8 9 8 True 0.0
9 10 9 True 0.0
10 6 10 True 1.0
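As a quick sanity check (an optional follow-up), you can count how many students ended up in each school:
print(df.groupby('SCHOOL_ID').size())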
