I am trying to rank the values in one column over a rolling window of N days instead of ranking over the entire set. I have seen several methods here using rolling_apply, but I have read that this function is no longer available in pandas. For example, given the following table:
              A
01-01-2013  100
02-01-2013   85
03-01-2013  110
04-01-2013   60
05-01-2013   20
06-01-2013   40
For column A above, how can I get the ranking below for N = 3?
              A  Ranked_A
01-01-2013  100       NaN
02-01-2013   85       NaN
03-01-2013  110         1
04-01-2013   60         3
05-01-2013   20         3
06-01-2013   40         2
Yes, there is a workaround: still use rolling, but combine it with apply:
# rank of each window's most recent value (1 = highest of the 3)
df.A.rolling(3).apply(lambda x: pd.Series(x).rank(ascending=False).iloc[-1])
01-01-2013 NaN
02-01-2013 NaN
03-01-2013 1.0
04-01-2013 3.0
05-01-2013 3.0
06-01-2013 2.0
Name: A, dtype: float64
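For reference, a minimal self-contained sketch of the same idea; the frame below is a reconstruction of the example above (dates assumed day-first), and raw=False is passed so each window arrives as a Series:

import pandas as pd

# reconstruction of the example frame above (dates assumed day-first)
df = pd.DataFrame(
    {"A": [100, 85, 110, 60, 20, 40]},
    index=pd.to_datetime(
        ["01-01-2013", "02-01-2013", "03-01-2013",
         "04-01-2013", "05-01-2013", "06-01-2013"],
        dayfirst=True),
)

# rank of each window's most recent value, 1 = highest of the 3 values
df["Ranked_A"] = df["A"].rolling(3).apply(
    lambda x: x.rank(ascending=False).iloc[-1], raw=False)
print(df)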
I have several csv files containing voltage-over-time data; each file is approximately 7000 rows and the data looks like this:
Time(us) Voltage (V)
0 32.96554106
0.5 32.9149649
1 32.90484966
1.5 32.86438874
2 32.8542735
2.5 32.76323642
3 32.74300595
3.5 32.65196886
4 32.58116224
4.5 32.51035562
5 32.42943376
5.5 32.38897283
6 32.31816621
6.5 32.28782051
7 32.26759005
7.5 32.21701389
8 32.19678342
8.5 32.16643773
9 32.14620726
9.5 32.08551587
10 32.04505495
10.5 31.97424832
11 31.92367216
11.5 31.86298077
12 31.80228938
12.5 31.78205891
13 31.73148275
13.5 31.69102183
14 31.68090659
14.5 31.67079136
15 31.64044567
15.5 31.59998474
16 31.53929335
16.5 31.51906288
I read the csv file into a pandas DataFrame and, after plotting the data from one csv file with matplotlib, the figure looks like below.
I would like to split every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit end up in their own row, like this:
I don't have any idea how to do that. I guess I have to write a function with a threshold value: if the voltage values go down for maybe 20 time steps, capture all of those values, or if the voltage level goes up for 20 time steps, capture all of those voltage values. Could someone help?
If you take the gradient of your Voltage (here using diff, as the time is regularly spaced), you get the following:
You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:
# get the gradient of the voltage and threshold it
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by a value below threshold
group = (m & ~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
       .assign(id=group,
               # time relative to the start of each group; the 'Time (us)' name
               # assumes your CSV header, adjust it if yours differs (e.g. 'Time(us)')
               t=lambda d: d['Time (us)'].groupby(group).transform(lambda s: s - s.iloc[0])
               )
       .pivot(index='id', columns='t', values='Voltage (V)')
       )
output:
t 0.0 0.5 1.0 1.5 2.0 2.5 \
id
1 32.965541 32.914965 32.904850 32.864389 32.854273 32.763236
2 25.045314 27.543777 29.182444 30.588462 31.114454 31.984364
3 25.166697 27.746081 29.415095 30.719960 31.326873 32.125977
4 25.277965 27.877579 29.536477 30.912149 31.367334 32.206899
5 25.379117 27.978732 29.667975 30.780651 31.670791 32.338397
6 25.631998 27.634814 28.959909 30.173737 30.659268 31.053762
7 23.528030 26.137759 27.948386 29.253251 30.244544 30.649153
8 23.639297 26.380525 28.464263 29.971432 30.902034 31.458371
9 23.740449 26.542369 28.707028 30.295120 30.881803 31.862981
10 23.871948 26.673867 28.889103 30.305235 31.185260 31.873096
11 24.387824 26.694097 28.342880 29.678091 30.315350 31.134684
...
t 748.5 749.0
id
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 21.059913 21.161065
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
[11 rows x 1499 columns]
plot:
df2.T.plot()
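Since there are several CSV files, a sketch along these lines could wrap the same steps in a function and loop over the files; the glob pattern and the 'Time (us)' / 'Voltage (V)' column names are assumptions to adjust to your data:

import glob
import pandas as pd

def split_bits(df, time_col='Time (us)', volt_col='Voltage (V)', threshold=2):
    # mark samples where the voltage jumps up sharply
    m = df[volt_col].diff().gt(threshold)
    # a new group starts where a jump follows a non-jump
    group = (m & ~m.shift(fill_value=False)).cumsum().add(1)
    # time relative to the start of each group
    t = df[time_col].groupby(group).transform(lambda s: s - s.iloc[0])
    return (df.assign(id=group, t=t)
              .pivot(index='id', columns='t', values=volt_col))

# hypothetical file pattern; one pivoted frame per CSV
frames = {path: split_bits(pd.read_csv(path)) for path in glob.glob('data/*.csv')}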
I have a data set named customer_base, containing over 800K rows, like below:
ID   AGE  GENDER  OCCUPATION
 1    64     101      "occ1"
 2    64     100      "occ2"
 2    66     100         NaN
 2   NaN     100      "occ2"
 3   NaN     101      "occ3"
 3   NaN     NaN         NaN
 3    32     NaN         NaN
...  ...     ...         ...
and after a grouping operation, the desired version of it should look like below:
ID   AGE  GENDER  OCCUPATION
 1    64     101      "occ1"
 2    66     100      "occ2"
 3    32     101      "occ3"
...  ...     ...         ...
Previously I tried the code sample below to get a table as clean as possible, but it took too much time. Now I need a faster way to get any of the available (non-NaN) values of the OCCUPATION column.
customer_base.groupby("ID",
as_index=False).agg({"GENDER":"max",
"AGE":"max",
"OCCUPATION":lambda x: np.nan if len(x[x.notna()])==0 else x[x.notna()].values[0]})
Thanks in advance for your optimization ideas, and sorry for the possible question duplication.
Use GroupBy.first to get the first non-NaN value per group:
df = customer_base.groupby("ID", as_index=False).agg({"AGE": "max",
                                                      "GENDER": "max",
                                                      "OCCUPATION": "first"})
print(df)
ID AGE GENDER OCCUPATION
0 1 64.0 101.0 "occ1"
1 2 66.0 100.0 "occ2"
2 3 32.0 101.0 "occ3"
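As a small self-contained check of why 'first' is enough here (the sample values below mirror the rows shown in the question), GroupBy.first skips NaN within each group:

import numpy as np
import pandas as pd

# small stand-in for customer_base, mirroring the rows shown above
customer_base = pd.DataFrame({
    "ID":         [1, 2, 2, 2, 3, 3, 3],
    "AGE":        [64, 64, 66, np.nan, np.nan, np.nan, 32],
    "GENDER":     [101, 100, 100, 100, 101, np.nan, np.nan],
    "OCCUPATION": ['"occ1"', '"occ2"', np.nan, '"occ2"', '"occ3"', np.nan, np.nan],
})

df = customer_base.groupby("ID", as_index=False).agg(
    {"AGE": "max", "GENDER": "max", "OCCUPATION": "first"})
print(df)  # 'first' returns the first non-NaN OCCUPATION per ID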
I'm building a report in Python to automate a lot of the manual transformation we currently do in Excel. I'm able to extract the data and pivot it to get something like this:
Date      Category 1  Category 2  Category 3  Misc
01/01/21          40          30          30    10
02/01/21          30          20          50    20
Is it possible to divide the Misc total for each date into the other categories by ratio, so I would end up with the below?
Date      Category 1  Category 2  Category 3
01/01/21          44          33          33
02/01/21          36          24          60
The only way I can think of is to split the Misc values off into their own table, work out the ratios of the other categories, and then add Misc * ratio to each category value. I just wondered if there is a function I could use to condense the working on this?
Thanks
I think your solution hits the nail on the head; however, it can already be written quite compactly:
>>> cat = df.filter(regex='Category')
>>> df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
>>> df.drop(columns=['Misc'])
Date Category 1 Category 2 Category 3
0 01/01/21 44.0 33.0 33.0
1 02/01/21 36.0 24.0 60.0
cat.mul(df['Misc'] / cat.sum(axis=1), axis=0) gives you the reallocated Misc values per row, since you multiply each value by Misc and divide it by the row total. .mul() lets you do the multiplication while specifying along which axis; the rest is about having the right columns.
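For completeness, a minimal self-contained sketch of the same steps, assuming the pivoted frame looks like the table in the question:

import pandas as pd

# hypothetical reconstruction of the pivoted table above
df = pd.DataFrame({
    "Date": ["01/01/21", "02/01/21"],
    "Category 1": [40, 30],
    "Category 2": [30, 20],
    "Category 3": [30, 50],
    "Misc": [10, 20],
})

cat = df.filter(regex='Category')
# spread Misc across the categories in proportion to each row's category shares
df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
print(df.drop(columns=['Misc']))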
I want to have an extra column with the maximum relative difference [-] between the row values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX[ABS(highest energy use - mean energy use), ABS(lowest energy use - mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})
print(df)
a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]
# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple; I just don't know how to get the maximum values from the tuples and then divide them by the mean in the proper Pythonic way.
I feel like the whole thing can be:
s = df.filter(like='y')
s.sub(s.mean(axis=1), axis=0).abs().max(axis=1) / s.mean(axis=1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
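To attach the result as the new column, a short sketch under the same column-naming assumption (the year columns all contain 'y'):

s = df.filter(like='y')
# maximum absolute deviation from the row mean, relative to the row mean
df['max_rel_dif'] = s.sub(s.mean(axis=1), axis=0).abs().max(axis=1) / s.mean(axis=1)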
I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this will create a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
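From there, the actual-versus-average comparison described in the question is just a column subtraction (the column name 'diff' is an arbitrary choice):

# difference between each measured concentration and the bin average
df['diff'] = df['Concentration'] - df['meanconcentration']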