I have the following df: a visitor can make multiple visits, and the number of page views is recorded for each visit.
df = pd.DataFrame({'visitor_id':[1,1,2,1],'visit_id':[1,2,1,3], 'page_views':[10,20,30,40]})
page_views visit_id visitor_id
0 10 1 1
1 20 2 1
2 30 1 2
3 40 3 1
What I need is to create an additional column called weight, which diminishes by a certain factor with each older visit. For example, if this factor is 1/2, the newest visit has a weight of 1, the 2nd newest a weight of 1/2, the 3rd 1/4, and so on.
E.g. I want my dataframe to look like:
page_views visit_id visitor_id weight
0 10 1(oldest) 1 0.25
1 20 2 1 0.5
2 30 1(newest) 2 1
3 40 3(newest) 1 1
Then I will be able to group using the weight, e.g.
df.groupby(['visitor_id']).weight.sum(), to get weighted page views per visitor.
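For the weighted page views themselves, presumably the weights would be multiplied in before summing; a minimal sketch of that step, assuming the weight column asked for above already exists (weighted_views is just a hypothetical helper column):
df['weighted_views'] = df['page_views'] * df['weight']   # weight each visit's page views
df.groupby('visitor_id')['weighted_views'].sum()         # weighted page views per visitor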
This doesn't work as expected:
df = pd.DataFrame({'visitor_id':[1,1,2,2,1,1],'visit_id':[5,6,1,2,7,8], 'page_views':[10,20,30,30,40,50]})
df['New']=df.groupby('visitor_id').visit_id.transform('max') - df.visit_id
df['weight'] = pd.Series([1/2]*len(df)).pow(df.New.values)
df
page_views visit_id visitor_id New weight
0 10 5 1 3 0
1 20 6 1 2 0
2 30 1 2 1 0
3 30 2 2 0 1
4 40 7 1 1 0
5 50 8 1 0 1
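The all-or-nothing weights above look like integer division: with an integer base (e.g. Python 2, where 1/2 evaluates to 0), 0**n is 0 for n > 0 and 1 for n == 0. A minimal sketch of the same idea with an explicit float base (this is an assumption about the cause, not a confirmed diagnosis):
df['weight'] = pd.Series([0.5]*len(df)).pow(df.New.values)
# or simply broadcast the scalar base
df['weight'] = 0.5 ** df['New']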
Is this what you need?
df.groupby('visitor_id').visit_id.apply(lambda x : 1*1/2**(max(x)-x))
Out[1349]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: visit_id, dtype: float64
Maybe try this
df['New']=df.groupby('visitor_id').visit_id.transform('max')-df.visit_id
pd.Series([1/2]*len(df)).pow(df.New.values)
Out[45]:
0 0.25
1 0.50
2 1.00
3 1.00
Name: New, dtype: float64
This is my first question on this forum.
I am conducting experiments in which I measure the current-voltage curve of a device applying different experimental conditions.
The different experimental conditions are encoded into a parameter K.
I am performing measurements of the current I using back & forth voltage sweeps, with V varying from 0 to 2 V, then from 2 V to -2 V, and then back to 0 V.
Measurements are conducted several times for each value of K to get an average of the current at each voltage point (backward and forward values). Each measurement is ascribed to a parameter named iter (varying from 0 to 3, for instance).
I have collected all the data into a pandas dataframe df, and below is code able to produce a typical df (the real one is way too large):
import numpy as np
import pandas as pd
K_col=[]
iter_col=[]
V_col=[]
I_col=[]
niter = 3
V_val = [0,1,2,1,0,-1,-2,-1,0]
K_val = [1,2]
for K in K_val:
    for it in range(niter):
        for V in V_val:
            K_col.append(K)
            iter_col.append(it+1)
            V_col.append(V)
            I_col.append((2*K+np.random.random())*V)
d={'K':K_col,'iter':iter_col,'V':V_col,'I':I_col}
df=pd.DataFrame(d)
I would like to compute the average value of I at each voltage and compare the impact of the experimental condition K.
For example let's look at 2 measurements conducted for K=1:
df[(df.K==1)&(df.iter.isin([1,2]))]
output:
K iter V I
0 1 1 0 0.000000
1 1 1 1 2.513330
2 1 1 2 4.778719
3 1 1 1 2.430393
4 1 1 0 0.000000
5 1 1 -1 -2.705487
6 1 1 -2 -4.235055
7 1 1 -1 -2.278295
8 1 1 0 0.000000
9 1 2 0 0.000000
10 1 2 1 2.535058
11 1 2 2 4.529292
12 1 2 1 2.426209
13 1 2 0 0.000000
14 1 2 -1 -2.878359
15 1 2 -2 -4.061515
16 1 2 -1 -2.294630
17 1 2 0 0.000000
We can see that for experiment 1 (iter=1), V passes through 0 several times (indexes 0, 4 and 8). I do not want to lose these different data points.
The first data point for I_avg should be (I[0]+I[9])/2, which corresponds to the first measurement at 0V. The second data point should be (I[1]+I[10])/2, which corresponds to the average I measured at 1V with increasing V, etc., up to (I[8]+I[17])/2, which would be my last data point at 0V.
My first thought was to use the groupby() method with K and V as keys, but this wouldn't work because V varies back & forth, so each measurement contains duplicate values of V, and groupby would collapse them onto the unique values of V.
The final dataframe I would like to have should look like this:
K V avg_I
0 1 0 0.000000
1 1 1 2.513330
2 1 2 4.778719
3 1 1 2.430393
4 1 0 0.000000
5 1 -1 -2.705487
6 1 -2 -4.235055
7 1 -1 -2.278295
8 1 0 0.000000
9 2 0 0.000000
10 2 1 2.513330
11 2 2 4.778719
12 2 1 2.430393
13 2 0 0.000000
14 2 -1 -2.705487
15 2 -2 -4.235055
16 2 -1 -2.278295
17 2 0 0.000000
Would anyone have an idea of how to do this?
In order to compute the mean while also taking into consideration the position of each observation within the sweep, you could add an extra column containing this information, like this:
len_iter = 9                          # number of voltage points per sweep
num_iter = len(df['iter'].unique())
num_K = len(df['K'].unique())
df['index'] = np.tile(np.arange(len_iter), num_iter*num_K)  # position of each row within its sweep
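A more data-driven way to build the same position column, as a sketch assuming the rows of each (K, iter) pair are stored in sweep order, is cumcount:
# position (0..len_iter-1) of each row within its (K, iter) sweep
df['index'] = df.groupby(['K', 'iter']).cumcount()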
And then compute the group by and mean to get the desired result:
df.groupby(['K', 'V', 'index'])['I'].mean().reset_index().drop(['index'], axis=1)
K V I
0 1 -2 -5.070126
1 1 -1 -2.598104
2 1 -1 -2.576927
3 1 0 0.000000
4 1 0 0.000000
5 1 0 0.000000
6 1 1 2.232128
7 1 1 2.359398
8 1 2 4.824657
9 2 -2 -9.031487
10 2 -1 -4.125880
11 2 -1 -4.350776
12 2 0 0.000000
13 2 0 0.000000
14 2 0 0.000000
15 2 1 4.535478
16 2 1 4.492122
17 2 2 8.569701
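If the original sweep order matters (the groupby above sorts by V), one sketch that keeps it, assuming pandas >= 0.25 for named aggregation, is to group on K and the position only and carry V along:
out = (df.groupby(['K', 'index'], sort=False)
         .agg(V=('V', 'first'), avg_I=('I', 'mean'))   # V is constant within each group
         .reset_index()
         .drop(columns='index'))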
If I understand this correctly, you want a new column that represents the average I for each V value. We can achieve this by computing the average value of I for each V and then mapping it onto the full dataframe.
avg_I = df.groupby(['V'], as_index=False).mean()[['V', 'I']]
df['avg_I'] = df.apply(lambda x: float(avg_I['I'][avg_I['V'] == x['V']]), axis=1)
df.head()
output:
K iter V I avg_I
0 1 1 0 0.00 0.00
1 1 1 1 2.34 3.55
2 1 1 2 4.54 6.89
3 1 1 1 2.02 3.55
4 1 1 0 0.00 0.00
df.plot()
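The same per-V average can also be attached without apply, for example with transform (a sketch of the equivalent lookup):
df['avg_I'] = df.groupby('V')['I'].transform('mean')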
Assume the following DataFrame:
id A
1 0
2 10
3 200
4 3000
I would like to perform a calculation between every row and every other row.
For example, if the calculation were lambda r1, r2: abs(r1-r2), then the output would be (in some order):
id col_name
1 10
2 200
3 3000
4 190
5 2990
6 2800
Questions:
How do I get only the above output?
How do I associate a result with the rows that produced it, in the most "pandas like" way?
I would like to keep everything in a single table as much as possible, in a way that still supports reasonable lookup.
The size of my data is not large, and never will be.
EDIT1:
One way that would answer my question 2 would be
id col_name origin1 origin2
1 10 1 2
2 200 1 3
3 3000 1 4
4 190 2 3
5 2990 2 4
6 2800 3 4
And I would like to know whether this is standard and has a built-in way of doing it, or whether there is another/better way.
IIUC, you can use itertools:
import itertools
s=list(itertools.combinations(df.index, 2))
pd.Series([df.A.loc[x[1]]-df.A.loc[x[0]] for x in s ])
Out[495]:
0 10
1 200
2 3000
3 190
4 2990
5 2800
dtype: int64
Update
s=list(itertools.combinations(df.index, 2))
pd.DataFrame([x+(df.A.loc[x[1]]-df.A.loc[x[0]],) for x in s ])
Out[518]:
0 1 2
0 0 1 10
1 0 2 200
2 0 3 3000
3 1 2 190
4 1 3 2990
5 2 3 2800
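To also carry the pair that produced each value, using the column names from the EDIT1 example (a sketch; abs matches the example lambda, and origin1/origin2 here are the row indexes rather than the id values):
pairs = list(itertools.combinations(df.index, 2))
pd.DataFrame([(i, j, abs(df.A.loc[j] - df.A.loc[i])) for i, j in pairs],
             columns=['origin1', 'origin2', 'col_name'])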
Use broadcasted subtraction, then np.tril_indices to extract the lower triangle (the positive differences).
# <= 0.23
# u = df['A'].values
# 0.24+
u = df['A'].to_numpy()
u2 = (u[:,None] - u)
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
0 10
1 200
2 190
3 3000
4 2990
5 2800
dtype: int64
Or, use subtract.outer to avoid the conversion to array beforehand.
u2 = np.subtract.outer(*[df.A]*2)
pd.Series(u2[np.tril_indices_from(u2, k=-1)])
If you need the index as well, use
idx = np.tril_indices_from(u2, k=-1)
pd.DataFrame({
'val':u2[np.tril_indices_from(u2, k=-1)],
'row': idx[0],
'col': idx[1]
})
val row col
0 10 1 0
1 200 2 0
2 190 2 1
3 3000 3 0
4 2990 3 1
5 2800 3 2
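If A were not sorted, the signed lower-triangle differences would no longer match the question's abs(r1-r2); a sketch along the same lines that applies np.abs first:
import numpy as np
u = df['A'].to_numpy()
diff = np.abs(u[:, None] - u)                 # |r1 - r2| for every pair
rows, cols = np.tril_indices_from(diff, k=-1)
pd.DataFrame({'val': diff[rows, cols], 'row': rows, 'col': cols})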
I have a pandas dataframe containing retail sales data which shows the total number of a product sold each week and the stock left at the end of the week. Unfortunately, the dataset only shows a row when a product has been sold and the stock left changes.
I would like to bulk out the dataset so that for each week there is a line for each product being sold. I've shown an example of this below - how can this be done?
As-Is:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 3 3 7
To-Be:
Week Product Sold Stock
1 1 1 10
1 2 1 10
1 3 1 10
2 1 2 8
2 2 0 10
2 3 3 7
Create a dataframe using product from itertools with all the combinations of the two columns 'Week' and 'Product', and merge it with your original data. Let's say your dataframe is called dfp:
from itertools import product
new_dfp = (pd.DataFrame(list(product(dfp.Week.unique(), dfp.Product.unique())),columns=['Week','Product'])
.merge(dfp,how='left'))
You get the missing row in new_dfp:
Week Product Sold Stock
0 1 1 1.0 10.0
1 1 2 1.0 10.0
2 1 3 1.0 10.0
3 2 1 2.0 8.0
4 2 2 NaN NaN
5 2 3 3.0 7.0
Now you fillna on both columns, with different values:
new_dfp['Sold'] = new_dfp['Sold'].fillna(0).astype(int)  # no sales in the missing rows
new_dfp['Stock'] = new_dfp.groupby('Product')['Stock'].fillna(method='ffill').astype(int)
To fill 'Stock', you need to group by product and use the 'ffill' method to carry forward the value from the previous week. At the end, you get:
Week Product Sold Stock
0 1 1 1 10
1 1 2 1 10
2 1 3 1 10
3 2 1 2 8
4 2 2 0 10
5 2 3 3 7
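An alternative sketch with the same effect, assuming every product should appear in every week and that a product is present in the first week it is sold, is to reindex against the full (Week, Product) index:
idx = pd.MultiIndex.from_product([dfp['Week'].unique(), dfp['Product'].unique()],
                                 names=['Week', 'Product'])
out = dfp.set_index(['Week', 'Product']).reindex(idx).reset_index()
out['Sold'] = out['Sold'].fillna(0).astype(int)               # no sales in missing rows
out['Stock'] = out.groupby('Product')['Stock'].ffill().astype(int)  # carry last stock forward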
I have a pd.DataFrame I'd like to transform:
id values days time value_per_day
0 1 15 15 1 1
1 1 20 5 2 4
2 1 12 12 3 1
I'd like to aggregate these into equal buckets of 10 days. Since days at time 1 is larger than 10, it should spill into the next bucket, making the value/day of the 2nd bucket an average of the 1st and 2nd rows.
Here is the resulting output, where (values, 0) = 15*(10/15) = 10 and (values, 1) = 5 + 20 = 25:
id values days value_per_day
0 1 10 10 1.0
1 1 25 10 2.5
2 1 10 10 1.0
3 1 2 2 1.0
I've tried pd.Grouper:
df.set_index('days').groupby([pd.Grouper(freq='10D', label='right'), 'id']).agg({'values': 'mean'})
Out[146]:
values
days id
5 days 1 16
15 days 1 10
But I'm clearly using it incorrectly.
csv for convenience:
id,values,days,time
1,15,15,1
1,20,5,2
1,12,12,3
Note: this is a time-costly solution
newdf=df.reindex(df.index.repeat(df.days))
v=np.arange(sum(df.days))//10
dd=pd.DataFrame({'value_per_day': newdf.groupby(v).value_per_day.mean(),'days':np.bincount(v)})
dd
Out[102]:
days value_per_day
0 10 1.0
1 10 2.5
2 10 1.0
3 2 1.0
dd.assign(value=dd.days*dd.value_per_day)
Out[103]:
days value_per_day value
0 10 1.0 10.0
1 10 2.5 25.0
2 10 1.0 10.0
3 2 1.0 2.0
I did not include a groupby on id here; if you need that for your real data, you can loop over df.groupby('id') and apply the steps above within the loop, as in the sketch below.
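A sketch of that loop, assuming the id and value_per_day columns from the question are present:
import numpy as np
import pandas as pd
pieces = []
for key, g in df.groupby('id'):
    rep = g.reindex(g.index.repeat(g.days))    # one row per day
    buckets = np.arange(g.days.sum()) // 10    # 10-day bucket label for each day
    part = pd.DataFrame({'value_per_day': rep.groupby(buckets).value_per_day.mean(),
                         'days': np.bincount(buckets)})
    part['values'] = part.days * part.value_per_day
    part['id'] = key
    pieces.append(part)
result = pd.concat(pieces, ignore_index=True)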
Suppose I want to create a new column that counts the number of days since the state was 1. As an example, the current columns would be the first three below; the fourth column is what I'm trying to get.
Index State Days Since_Days
1 1 0 0
2 0 20 20
3 0 40 40
4 1 55 55
5 1 60 5
6 1 70 10
Without resorting to a for-loop, what is a pandas way to approach this?
You can also try the following: first group by State and, for rows where State == 1, fill with the within-group difference of Days. Rows where State == 0 will then be NaN, and those can be filled with the corresponding Days column value.
df.loc[df.State == 1, 'Since_Days'] = df.groupby('State')['Days'].diff().fillna(0)
df['Since_Days'].fillna(df['Days'],inplace=True)
print(df)
Result:
Index State Days Since_Days
0 1 1 0 0.0
1 2 0 20 20.0
2 3 0 40 40.0
3 4 1 55 55.0
4 5 1 60 5.0
5 6 1 70 10.0
The values to be subtracted can be formed with:
ser = df['Days'].where(df['State']==1, np.nan).ffill().shift()
If you subtract this from the original Days column, you'll have:
df['Days'].sub(ser, fill_value=0).astype('int')
Out:
0 0
1 20
2 40
3 55
4 5
5 10
Name: Days, dtype: int64
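To store the result as the new column (a small follow-up sketch):
df['Since_Days'] = df['Days'].sub(ser, fill_value=0).astype('int')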