I have a table like this:
import pandas as pd
values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]
ex = pd.DataFrame({'col_to_roll' : values, 'max_duration': durations})
col_to_roll max_duration
0 0 0
1 0 0
2 0 0
3 2000 12
4 0 0
5 0 0
6 700 8
7 0 0
8 0 0
9 530 2
10 1000 5
11 820 15
12 0 0
13 0 0
14 200 3
For each row position i, I want to do a rolling sum of col_to_roll between indexes i-7 and i-4 (both included). The caveat is that I want the values "further in the past" to count more, depending on the column max_duration (which tells for how many timesteps into the future that value can still have an effect).
There's an upper bound, which is the number of remaining timesteps to be counted (min 1, max 4). So if I'm on row number 7 doing the roll-up sum: the value on row number 0 is counted min(max_duration[0], 4) times, the value on row number 1 is counted min(max_duration[1], 3) times, and so on down to row number 3, which is counted min(max_duration[3], 1) times.
I could do it the brute-force way:
import numpy as np

new_col = []
for i in range(7, len(ex)):
    rolled_val = sum([ex.iloc[j].col_to_roll * min(ex.iloc[j].max_duration, i - j + 1 - 4)
                      for j in range(i - 7, i - 3)])
    new_col.append(rolled_val)
ex['rolled_col'] = [np.nan] * 7 + new_col
Which yields the following result for the example above:
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0
That being said, I'd appreciate a more elegant (and more importantly, more efficient) way to get this result with some pandas magic.
Just to share my ideas: this can be solved with NumPy, avoiding the row-by-row loop over the DataFrame.
import numpy as np
ex_len = ex.shape[0]
inds = np.vstack([range(i-7,i-3) for i in range(7,ex_len)])
# part one: gather the window of col_to_roll values for each output row
col_to_roll = np.take(ex.col_to_roll.values,inds)
# part two: build the per-value weights, capped element-wise by max_duration
max_duration = np.take(ex.max_duration.values,inds)
duration_to_compare = np.array([[i-j+1-4 for j in range(i-7,i-3)]for i in range(7,ex_len)])
min_mask = max_duration > duration_to_compare
max_duration[min_mask] = duration_to_compare[min_mask]
new_col = np.sum(col_to_roll*max_duration,axis=1)
ex['rolled_col'] = np.concatenate(([np.nan]*7,new_col))
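As a further simplification (a sketch, assuming NumPy >= 1.20 for sliding_window_view), the index matrix and the mask trick can be replaced by sliding windows and np.minimum:
from numpy.lib.stride_tricks import sliding_window_view

vals = ex['col_to_roll'].to_numpy()
durs = ex['max_duration'].to_numpy()

# windows covering rows i-7 .. i-4 for each output row i = 7 .. len(ex)-1
win_vals = sliding_window_view(vals, 4)[:len(ex) - 7]
win_durs = sliding_window_view(durs, 4)[:len(ex) - 7]

# the oldest row in a window gets weight 4, the newest weight 1, capped by max_duration
weights = np.minimum(win_durs, np.array([4, 3, 2, 1]))

ex['rolled_col'] = np.concatenate(([np.nan] * 7, (win_vals * weights).sum(axis=1)))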
Here is my humble idea for an elegant and efficient method for this task. To avoid reinventing the wheel, let's install pandarallel with pip install pandarallel. I am a fan of multiprocessing, and it should help with larger data.
import pandas as pd
import numpy as np
from operator import mul
from pandarallel import pandarallel

def rocknroll(index):
    if index >= 7:
        a = ex['col_to_roll'].iloc[index-7:index-3]
        b = map(min, ex['max_duration'].iloc[index-7:index-3], [4, 3, 2, 1])
        return sum(map(mul, a, b))
    else:
        return np.nan

pandarallel.initialize()
values = [0,0,0,2000,0,0,700,0,0,530,1000,820,0,0,200]
durations = [0,0,0,12,0,0,8,0,0,2,5,15,0,0,3]
ex = pd.DataFrame({'col_to_roll': values, 'max_duration': durations})
ex['index_copy'] = list(range(len(ex)))
ex['rolled_col'] = ex['index_copy'].parallel_apply(rocknroll)
ex.drop(columns=['index_copy'], inplace=True)
print(ex)
Output:
INFO: Pandarallel will run on 8 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0
Further information about proper element-wise operations can be found in Element-wise addition of 2 lists?
You can use rolling() to create rolling windows, in combination with apply, to calculate the rolled_col sum for the specified rolling windows.
First, calculate the window size using the lower & upper bound (and add 1 to include both indices). This lets you play around with different time intervals.
lower_bound = -7
upper_bound = -4
window_size = upper_bound - lower_bound + 1
Second, define the function to apply to each rolling window. In your case, take the product of col_to_roll and the minimum of max_duration and the descending weights 4 down to 1, and sum all values in the sliding window.
def calculate_rolled_count(series, ex):
    index = series.index
    min_values = np.minimum(ex.loc[index, 'max_duration'].values, list(range(4, 0, -1)))
    return np.sum(ex.loc[index, 'col_to_roll'] * min_values)
Finally, assign a new column rolled_col to your original dataframe and apply the defined function over all rolling windows. We have to shift the result to make the value correspond to the desired row (as the rolling window by default assigns the value to the right bound of the window).
ex.assign(rolled_col=lambda x: x.rolling(window_size)
                                .apply(lambda x: calculate_rolled_count(x, ex))
                                .shift(-upper_bound)['max_duration'])
Result
col_to_roll max_duration rolled_col
0 0 0 NaN
1 0 0 NaN
2 0 0 NaN
3 2000 12 NaN
4 0 0 NaN
5 0 0 NaN
6 700 8 NaN
7 0 0 2000.0
8 0 0 4000.0
9 530 2 6000.0
10 1000 5 8700.0
11 820 15 1400.0
12 0 0 2100.0
13 0 0 3330.0
14 200 3 2060.0
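A possible simplification (a sketch, not part of the original answer): because rolling().apply() above runs the function once per column, you could roll over col_to_roll alone and look up max_duration through the window's index, reusing window_size and upper_bound from above:
weights = np.arange(4, 0, -1)  # the oldest row in the window weighs 4, the newest 1

def weighted_window(s):
    # s is one 4-row window of col_to_roll, carrying its original index
    caps = np.minimum(ex.loc[s.index, 'max_duration'].to_numpy(), weights)
    return (s.to_numpy() * caps).sum()

ex['rolled_col'] = (ex['col_to_roll']
                    .rolling(window_size)
                    .apply(weighted_window, raw=False)
                    .shift(-upper_bound))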
I have the following df: a visitor can make multiple visits, and the number of page views is recorded for each visit.
df = pd.DataFrame({'visitor_id':[1,1,2,1],'visit_id':[1,2,1,3], 'page_views':[10,20,30,40]})
page_views visit_id visitor_id
0 10 1 1
1 20 2 1
2 30 1 2
3 40 3 1
What I need is to create an additional column called weight, which will diminish with a certain parameter. For example, if this parameter is 1/2, the newest visit has a weight of 1, 2nd newest visit a weight of 1/2, 3rd is 1/4 and so on.
E.g. I want my dataframe to look like:
page_views visit_id visitor_id weight
0 10 1(oldest) 1 0.25
1 20 2 1 0.5
2 30 1(newest) 2 1
3 40 3(newest) 1 1
Then I will be able to group using the weight, e.g.
df.groupby(['visitor_id']).weight.sum() to get the weighted page views per visitor.
This doesn't work as expected:
df = pd.DataFrame({'visitor_id':[1,1,2,2,1,1],'visit_id':[5,6,1,2,7,8], 'page_views':[10,20,30,30,40,50]})
df['New']=df.groupby('visitor_id').visit_id.transform('max') - df.visit_id
df['weight'] = pd.Series([1/2]*len(df)).pow(df.New.values)
df
page_views visit_id visitor_id New weight
0 10 5 1 3 0
1 20 6 1 2 0
2 30 1 2 1 0
3 30 2 2 0 1
4 40 7 1 1 0
5 50 8 1 0 1
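A likely explanation for those zeros (an assumption; the posts don't state the Python version): under Python 2, 1/2 is integer division and evaluates to 0, so the expression becomes 0 ** New, which is 1 only where New == 0. Writing the base as an explicit float avoids this:
# use a float base so nothing gets truncated to 0
df['New'] = df.groupby('visitor_id').visit_id.transform('max') - df.visit_id
df['weight'] = 0.5 ** df['New']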
Is this what you need?
df.groupby('visitor_id').visit_id.apply(lambda x : 1*1/2**(max(x)-x))
Out[1349]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: visit_id, dtype: float64
Maybe try this
df['New']=df.groupby('visitor_id').visit_id.transform('max')-df.visit_id
pd.Series([1/2]*len(df)).pow(df.New.values)
Out[45]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: New, dtype: float64
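Putting the pieces together (a minimal sketch, assuming the largest visit_id per visitor is the newest visit, a decay parameter of 1/2, and the hypothetical name weighted_views for the result):
import pandas as pd

df = pd.DataFrame({'visitor_id': [1, 1, 2, 1],
                   'visit_id': [1, 2, 1, 3],
                   'page_views': [10, 20, 30, 40]})

# weight = (1/2) ** (newest visit_id within the visitor's group - this visit_id)
df['weight'] = 0.5 ** (df.groupby('visitor_id')['visit_id'].transform('max') - df['visit_id'])

# weighted page views per visitor
weighted_views = (df['page_views'] * df['weight']).groupby(df['visitor_id']).sum()
print(weighted_views)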
I would like to filter and replace. For the values that are lower or higher than zero and are not NaN, I would like to set them to one, and set all the others to zero.
mask = (ts[x] > 0) | (ts[x] < 0)
ts[mask] = 1
ts[ts[x] == 1]
I did this and it is working, but I still have to deal with the values that do not meet this condition by replacing them with zero.
Any recommendations? I am quite confused; also, would it be better to use the where function in this case?
Thanks all!
Sample Data
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
Expected result
asset.relativeSetpoint.350
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
You can do this by applying a logical AND on the two conditions and converting the resultant mask to integer.
df
asset.relativeSetpoint.350
0 -60.0
1 0.0
2 NaN
3 100.0
4 0.0
5 NaN
6 -120.0
7 -245.0
8 0.0
9 123.0
10 0.0
11 -876.0
(df['asset.relativeSetpoint.350'].ne(0)
& df['asset.relativeSetpoint.350'].notnull()).astype(int)
0 1
1 0
2 0
3 1
4 0
5 0
6 1
7 1
8 0
9 1
10 0
11 1
Name: asset.relativeSetpoint.350, dtype: int64
The first condition df['asset.relativeSetpoint.350'].ne(0) gets a boolean mask of all elements that are not equal to 0 (this would include <0, >0, and NaN).
The second condition df['asset.relativeSetpoint.350'].notnull() will get a boolean mask of elements that are not NaNs.
The two masks are ANDed, and converted to integer.
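Since the question also asks about where, here is a minimal sketch of the same logic with numpy.where, overwriting the column in place (assuming you want the result written back rather than returned):
import numpy as np

col = 'asset.relativeSetpoint.350'
df[col] = np.where(df[col].ne(0) & df[col].notnull(), 1, 0)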
How about using apply? (The pd.notna check is needed so that NaNs map to 0 rather than 1, since NaN != 0 evaluates to True.)
df[COLUMN_NAME] = df[COLUMN_NAME].apply(lambda x: 1 if pd.notna(x) and x != 0 else 0)
I currently have a table similar to this one:
CRED | ACBA
1 | 2
0 | 3
1 | 4
1 | 2
0 | 1
etc...
I was able to get information on the frequency of occurrence of a category (1,2,3,4) in column ACBA depending on the value in CRED (1,0) using:
pd.crosstab(df.CRED, df.ACBA)
ACBA 1 2 3 4
CRED
0 9 11 1 7
1 18 22 4 28
Now I would like to sum the counts for a specific value of CRED and then divide each single count by that sum, creating a new table with the result. Ex:
For CRED = 0 --> 9+11+1+7=28 then --> 9/28 11/28 1/28 7/28 to reach the final table:
1 2 3 4
CRED0 0.321 0.393 0.036 0.250
Does anyone have an idea of how to do this? I am new to Python and completely stuck on this. The idea is that I would repeat this technique across 22 other columns. Thanks!
a = {'CRED': [1,0,1,1,0], 'ACBA': [2,3,4,2,1]}
df = pd.DataFrame(a)
Output:
ACBA CRED
0 2 1
1 3 0
2 4 1
3 2 1
4 1 0
Then crosstab it, as you did:
df1 = pd.crosstab(df.CRED, df.ACBA)
ACBA 1 2 3 4
CRED
0 1 0 1 0
1 0 2 0 1
Then get the percentages:
df1.apply(lambda a: a / a.sum() * 100, axis=1)
ACBA 1 2 3 4
CRED
0 50.0 0.000000 50.0 0.000000
1 0.0 66.666667 0.0 33.333333
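A possible shortcut (a sketch, relying on the normalize argument that pd.crosstab accepts in reasonably recent pandas versions): you can ask crosstab for row-wise proportions directly, which also scales nicely to the other columns you mention:
# fractions per CRED row (multiply by 100 for percentages)
pd.crosstab(df.CRED, df.ACBA, normalize='index')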
I have a dataframe like this:
T data
0 0 10
1 1 20
2 2 30
3 3 40
4 4 50
5 0 5
6 1 13
7 2 21
8 0 3
9 1 7
10 2 11
11 3 15
12 4 19
The values in T form sequences that all start at 0 and run up to a certain value, where the maximum can differ between the sequences.
Normally, the values in data are NOT equally spaced; that is just for demonstration purposes here.
What I want to achieve is to add a third column called dataDiv where each value in data of a certain sequence is divided by the value at T = 0 that belongs to the respective sequence. In my case, I have 3 sequences and for the first sequence I want to divide each value by 10, in the second sequence each value should be divided by 5 and in the third by 3.
So the expected outcome would look like this:
T data dataDiv
0 0 10 1.000000
1 1 20 2.000000
2 2 30 3.000000
3 3 40 4.000000
4 4 50 5.000000
5 0 5 1.000000
6 1 13 2.600000
7 2 21 4.200000
8 0 3 1.000000
9 1 7 2.333333
10 2 11 3.666667
11 3 15 5.000000
12 4 19 6.333333
The way I currently implement it is as follows:
I first determine the indices at which T = 0. Then I loop through these indices and divide the data in data by the value at T = 0 of the respective sequence, which gives me the desired output (shown above). The code looks as follows:
import pandas as pd

df = pd.DataFrame({'T': list(range(5)) + list(range(3)) + list(range(5)),
                   'data': list(range(10, 60, 10)) + list(range(5, 25, 8)) + list(range(3, 21, 4))})

# get indices where T = 0
idZE = df[df['T'] == 0].index.tolist()
# last index of dataframe
idZE.append(max(df.index) + 1)
# add the column with normalized values (as float, so the divisions are not truncated)
df['dataDiv'] = df['data'].astype(float)
# loop through indices where T = 0 and normalize values
for ix, indi in enumerate(idZE[:-1]):
    df['dataDiv'].iloc[indi:idZE[ix + 1]] = df['data'].iloc[indi:idZE[ix + 1]] / df['data'].iloc[indi]
My question is: Is there any smarter solution than this which avoids the loop?
The following approach avoids loops in favour of vectorized computations and should perform faster. The basic idea is to label runs of integers in column 'T', find the first value in each of these groups and then divide the values in 'data' by the appropriate first value.
df['grp'] = (df['T'] == 0).cumsum() # label consecutive runs of integers
x = df.groupby('grp')['data'].first() # first value in each group
df['dataDiv'] = df['data'] / df['grp'].map(x) # divide
This gives the DataFrame with the desired column:
T data grp dataDiv
0 0 10 1 1.000000
1 1 20 1 2.000000
2 2 30 1 3.000000
3 3 40 1 4.000000
4 4 50 1 5.000000
5 0 5 2 1.000000
6 1 13 2 2.600000
7 2 21 2 4.200000
8 0 3 3 1.000000
9 1 7 3 2.333333
10 2 11 3 3.666667
11 3 15 3 5.000000
12 4 19 3 6.333333
(You can then drop the 'grp' column if you wish: df.drop('grp', axis=1).)
As #DSM points out below, the three lines of code can be collapsed into one with the use of groupby.transform:
df['dataDiv'] = df['data'] / df.groupby((df['T'] == 0).cumsum())['data'].transform('first')