How to get min values of a column by rolling over another column? - python

GROUP_NAV_DATE GROUP_REH_VALUE target
0 2018/11/29 1 1.06
1 2018/11/30 1.0029 1.063074
2 2018/12/3 1.03 1.0918
3 2018/12/4 1.032 1.09392
4 2018/12/5 1.0313 1.093178
5 2020/12/6 1.034 1.09604
6 2020/12/8 1.062 1.12572
7 2020/12/9 1.07 1.1342
8 2020/12/10 1 1.06
9 2020/12/11 0.99 1.0494
10 2020/12/12 0.96 1.0176
11 2020/12/13 1.062 1.12572
goal
Create a first_date column whose values come from GROUP_NAV_DATE. For each row, first_date is the first later date on which GROUP_REH_VALUE is greater than that row's target.
For example, index 0 has GROUP_REH_VALUE=1 (target 1.06), so the first match is 2020/12/8. For index 9, the first match is 2020/12/13, not 2020/12/8, because the match must come after the row's own date.
Note: for each row, target is 1.06*GROUP_REH_VALUE.
Expect
GROUP_NAV_DATE GROUP_REH_VALUE target first_date
0 2018/11/29 1 1.06 2020/12/8
1 2018/11/30 1.0029 1.063074 2020/12/9
2 2018/12/3 1.03 1.0918 NA
3 2018/12/4 1.032 1.09392 NA
4 2018/12/5 1.0313 1.093178 NA
5 2020/12/6 1.034 1.09604 NA
6 2020/12/8 1.062 1.12572 NA
7 2020/12/9 1.07 1.1342 NA
8 2020/12/10 1 1.06 2020/12/13
9 2020/12/11 0.99 1.0494 2020/12/13
10 2020/12/12 0.96 1.0176 2020/12/13
11 2020/12/13 1.062 1.12572 NA
Try
I tried rolling and idxmin, but because the result depends on another column, I could not get an answer.

You can use expanding, but this code works only because:
There is a direct relation between the GROUP_REH_VALUE and target columns (target = 1.06*GROUP_REH_VALUE), so the target column is not needed.
You have a numeric index: expanding().apply requires a numeric return value, so if GROUP_NAV_DATE were the index you would get TypeError: must be real number, not str.
import numpy as np

def f(sr):
    # sr is the window seen so far in reversed order; its last element is the current row
    m = sr.iloc[-1]*1.06 < sr
    return sr[m].last_valid_index() if sum(m) else np.nan
# Need to reverse dataframe because you are looking forward.
idx = df.loc[::-1, 'GROUP_REH_VALUE'].expanding().apply(f).dropna()
# Set dates
df.loc[idx.index, 'first_time'] = df.loc[idx, 'GROUP_NAV_DATE'].tolist()
Output:
>>> df
GROUP_NAV_DATE GROUP_REH_VALUE target first_time
0 2018/11/29 1.0000 1.060000 2020/12/8
1 2018/11/30 1.0029 1.063074 2020/12/9
2 2018/12/3 1.0300 1.091800 NaN
3 2018/12/4 1.0320 1.093920 NaN
4 2018/12/5 1.0313 1.093178 NaN
5 2020/12/6 1.0340 1.096040 NaN
6 2020/12/8 1.0620 1.125720 NaN
7 2020/12/9 1.0700 1.134200 NaN
8 2020/12/10 1.0000 1.060000 2020/12/13
9 2020/12/11 0.9900 1.049400 2020/12/13
10 2020/12/12 0.9600 1.017600 2020/12/13
11 2020/12/13 1.0620 1.125720 NaN
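For comparison, here is a minimal sketch of the same "first later row above the target" logic without expanding, using a NumPy comparison matrix. It assumes the frame is already sorted by date and is small enough for an n-by-n boolean array; the names hit, first and has_hit are illustrative, and the column is called first_date as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "GROUP_NAV_DATE": ["2018/11/29", "2018/11/30", "2018/12/3", "2018/12/4",
                       "2018/12/5", "2020/12/6", "2020/12/8", "2020/12/9",
                       "2020/12/10", "2020/12/11", "2020/12/12", "2020/12/13"],
    "GROUP_REH_VALUE": [1, 1.0029, 1.03, 1.032, 1.0313, 1.034,
                        1.062, 1.07, 1, 0.99, 0.96, 1.062],
})
df["target"] = df["GROUP_REH_VALUE"] * 1.06

values = df["GROUP_REH_VALUE"].to_numpy()
targets = df["target"].to_numpy()

# hit[i, j] is True when row j comes after row i (k=1 drops the diagonal
# and everything before it) and row j's value exceeds row i's target.
hit = np.triu(values[None, :] > targets[:, None], k=1)

first = hit.argmax(axis=1)   # position of the first True in each row (0 if none)
has_hit = hit.any(axis=1)    # rows that have at least one later match

dates = df["GROUP_NAV_DATE"].to_numpy()
# Rows without a match become NaN.
df["first_date"] = pd.Series(dates[first], index=df.index).where(has_hit)
print(df)
```

Because this works on positions rather than index labels, it does not depend on the index being numeric, at the cost of quadratic memory.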

Related

How to multiply different columns in different dataframes using Pandas

I have 2 dataframes that I want to multiply. I want to multiply multiple columns from dataframe 1 with one column in dataframe 2
raw_material_LCI = dataframe1[["climate change","ozone depletion",
"ionising radiation, hh","photochemical ozone formation, hh",
"particulate matter","human toxicity, non-cancer",
"human toxicity, cancer","acidification",
"eutrophication, freshwater","eutrophication, marine",
"eutrophication, terrestrial","ecotoxicity, freshwater",
"land use", "resource use, fossils","resource use, minerals and metals",
"water scarcity"]] * dataframe2["mass_frac"]
The above code returns a dataframe where all the values are NaN. The names of the columns all are fields with numeric values in them.
I decided to try multiply dataframe1 with just a single value to see if it worked e.g. example below
raw_material_LCI = dataframe1[["climate change","ozone depletion",
"ionising radiation, hh","photochemical ozone formation, hh",
"particulate matter","human toxicity, non-cancer",
"human toxicity, cancer","acidification",
"eutrophication, freshwater","eutrophication, marine",
"eutrophication, terrestrial","ecotoxicity, freshwater",
"land use", "resource use, fossils","resource use, minerals and metals",
"water scarcity"]] * 0.7
The example with the single value returns a dataframe with numbers, so it works. Does anyone know why the multiplication in the first instance does not work? I have looked at multiple articles on multiplying columns in different dataframes in Python and cannot find a solution.
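pandas aligns on labels: multiplying two DataFrames aligns both the row and column indexes, and multiplying a DataFrame by a Series with * aligns the Series index with the DataFrame's columns, not its rows. To multiply row by row, you have to align on the row index instead: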
>>> df
A B C D E
0 0.787081 0.350508 0.058542 0.492340 0.489379
1 0.512436 0.501375 0.108115 0.960808 0.841969
2 0.055247 0.305830 0.976043 0.016188 0.006424
3 0.303570 0.914876 0.157100 0.767454 0.340381
4 0.446077 0.595001 0.307799 0.115410 0.568281
5 0.226516 0.636902 0.086790 0.079260 0.402414
6 0.451920 0.526025 0.012470 0.931610 0.267155
7 0.472778 0.137005 0.227569 0.941355 0.584782
8 0.944396 0.769115 0.497214 0.531419 0.570797
9 0.788023 0.310288 0.336480 0.585466 0.432246
>>> sr
0 0.920878
1 0.445332
2 0.894407
3 0.613317
4 0.242270
5 0.299121
6 0.843052
7 0.279014
8 0.526778
9 0.249538
dtype: float64
So, this produces NaN values, because df * sr matches the Series labels (0 to 9) against the DataFrame column names (A to E) and none of them coincide:
>>> df * sr
(every cell is NaN; the columns of the result are the union of A..E and 0..9)
but using mul along index axis works as expected:
>>> df.mul(sr, axis=0) # but not df.mul(sr) (same as df*sr)
A B C D E
0 0.724805 0.322775 0.053910 0.453385 0.450658
1 0.228204 0.223279 0.048147 0.427878 0.374956
2 0.049413 0.273536 0.872980 0.014479 0.005745
3 0.186185 0.561109 0.096352 0.470693 0.208762
4 0.108071 0.144151 0.074571 0.027961 0.137678
5 0.067756 0.190511 0.025961 0.023708 0.120371
6 0.380992 0.443466 0.010513 0.785396 0.225226
7 0.131912 0.038226 0.063495 0.262651 0.163162
8 0.497487 0.405153 0.261921 0.279940 0.300683
9 0.196642 0.077429 0.083965 0.146096 0.107862
If your series and dataframe do not have the same length, you get a partial result:
>>> df.mul(sr.iloc[:5], axis=0)
A B C D E
0 0.724805 0.322775 0.053910 0.453385 0.450658
1 0.228204 0.223279 0.048147 0.427878 0.374956
2 0.049413 0.273536 0.872980 0.014479 0.005745
3 0.186185 0.561109 0.096352 0.470693 0.208762
4 0.108071 0.144151 0.074571 0.027961 0.137678
5 NaN NaN NaN NaN NaN
6 NaN NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN
9 NaN NaN NaN NaN NaN
>>> df.mul(sr.iloc[5:], axis=0)
A B C D E
0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN
5 0.067756 0.190511 0.025961 0.023708 0.120371
6 0.380992 0.443466 0.010513 0.785396 0.225226
7 0.131912 0.038226 0.063495 0.262651 0.163162
8 0.497487 0.405153 0.261921 0.279940 0.300683
9 0.196642 0.077429 0.083965 0.146096 0.107862
Take care to have the same index between instances.
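A small self-contained sketch of the alignment behaviour described above, on made-up numbers. The frame, the series and the shifted index (10, 11, 12) are illustrative assumptions, not the asker's data.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]})
sr = pd.Series([10.0, 20.0, 30.0])

# Default: the Series index (0, 1, 2) is matched against the columns (A, B),
# so no label matches and every cell is NaN.
by_columns = df * sr

# axis=0 matches the Series index against the row index instead.
by_rows = df.mul(sr, axis=0)

# If the row labels differ but the order is known to be the same,
# dropping the labels with to_numpy() multiplies by position.
sr_other_index = pd.Series([10.0, 20.0, 30.0], index=[10, 11, 12])
by_position = df.mul(sr_other_index.to_numpy(), axis=0)

print(by_columns)
print(by_rows)
print(by_position)
```

Multiplying by position is only safe when both objects are known to be in the same row order; otherwise it is better to fix the index so the labels match.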

How to split a dataframe containing voltage over time value, so that it can store values of each waveform/bit separately

I have several csv files containing voltage over time. Each file has approximately 7000 rows, and the data looks like this:
Time(us) Voltage (V)
0 32.96554106
0.5 32.9149649
1 32.90484966
1.5 32.86438874
2 32.8542735
2.5 32.76323642
3 32.74300595
3.5 32.65196886
4 32.58116224
4.5 32.51035562
5 32.42943376
5.5 32.38897283
6 32.31816621
6.5 32.28782051
7 32.26759005
7.5 32.21701389
8 32.19678342
8.5 32.16643773
9 32.14620726
9.5 32.08551587
10 32.04505495
10.5 31.97424832
11 31.92367216
11.5 31.86298077
12 31.80228938
12.5 31.78205891
13 31.73148275
13.5 31.69102183
14 31.68090659
14.5 31.67079136
15 31.64044567
15.5 31.59998474
16 31.53929335
16.5 31.51906288
I read the csv file into a pandas dataframe. After plotting the data from one csv file with matplotlib, the figure looks like below.
I would like to split out every single square waveform/bit and store the corresponding voltage values for each bit separately, so that the voltage values of each bit are stored in one row, like this:
I don't have any idea how to do that. I guess I have to write a function with a threshold: if the voltage keeps going down for maybe 20 time steps, then capture all those values, and likewise if the voltage keeps going up for 20 time steps. Could someone help?
If you get the gradient of your Voltage (here using diff as the time is regularly spaced), this gives you the following:
You can thus easily use a threshold (I tested with 2) to identify the peak starts. Then pivot your data:
# get threshold of gradient
m = df['Voltage (V)'].diff().gt(2)
# group start = value above threshold preceded by value below threshold
group = (m&~m.shift(fill_value=False)).cumsum().add(1)
df2 = (df
   .assign(id=group,
           # time relative to the start of each group; transform keeps the original row index
           t=lambda d: d['Time (us)'].groupby(group).transform(lambda s: s-s.iloc[0])
          )
   .pivot(index='id', columns='t', values='Voltage (V)')
)
output:
t 0.0 0.5 1.0 1.5 2.0 2.5 \
id
1 32.965541 32.914965 32.904850 32.864389 32.854273 32.763236
2 25.045314 27.543777 29.182444 30.588462 31.114454 31.984364
3 25.166697 27.746081 29.415095 30.719960 31.326873 32.125977
4 25.277965 27.877579 29.536477 30.912149 31.367334 32.206899
5 25.379117 27.978732 29.667975 30.780651 31.670791 32.338397
6 25.631998 27.634814 28.959909 30.173737 30.659268 31.053762
7 23.528030 26.137759 27.948386 29.253251 30.244544 30.649153
8 23.639297 26.380525 28.464263 29.971432 30.902034 31.458371
9 23.740449 26.542369 28.707028 30.295120 30.881803 31.862981
10 23.871948 26.673867 28.889103 30.305235 31.185260 31.873096
11 24.387824 26.694097 28.342880 29.678091 30.315350 31.134684
...
t 748.5 749.0
id
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 21.059913 21.161065
7 NaN NaN
8 NaN NaN
9 NaN NaN
10 NaN NaN
11 NaN NaN
[11 rows x 1499 columns]
plot:
df2.T.plot()
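To make the grouping step easier to follow, here is the same idea on a tiny synthetic signal. The ten voltage values and the threshold of 2 are assumptions chosen so that two rising edges are visible; real data would need its own threshold.

```python
import pandas as pd

df = pd.DataFrame({
    "Time (us)": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5],
    "Voltage (V)": [21.0, 21.1, 25.0, 27.5, 29.0, 21.0, 21.2, 25.2, 27.8, 29.1],
})

# True wherever the voltage jumps by more than the threshold in one step.
m = df["Voltage (V)"].diff().gt(2)

# A waveform starts where a jump is not preceded by another jump;
# the running count of starts labels each waveform.
group = (m & ~m.shift(fill_value=False)).cumsum().add(1)

# Time relative to the first sample of each waveform.
rel_t = df["Time (us)"].groupby(group).transform(lambda s: s - s.iloc[0])

wide = (df.assign(id=group, t=rel_t)
          .pivot(index="id", columns="t", values="Voltage (V)"))
print(group.tolist())
print(wide)
```

Each row of wide holds one waveform, padded with NaN where a waveform is shorter than the longest one.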

How to Group by the mean of specific columns in Python

In the dataframe below:
import pandas as pd
import numpy as np
df= {
'Gen':['M','M','M','M','F','F','F','F','M','M','M','M','F','F','F','F'],
'Site':['FRX','FX','FRX','FRX','FRX','FX','FRX','FX','FX','FX','FX','FRX','FRX','FRX','FRX','FRX'],
'Type':['L','L','L','L','L','L','L','L','R','R','R','R','R','R','R','R'],
'AIC':['<1','<1','<1','<1',1,1,1,1,2,2,2,2,'>2','>2','>2','>2'],
'AIC_TRX':[1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4],
'diff':[-1,-1,-1,-1,0,0,0,0,1,1,1,1,3,3,3,3],
'series':[1,2,4,8,1,2,4,8,1,2,4,8,1,2,4,8],
'Grwth_Time1':[150.78,162.34,188.53,197.69,208.07,217.76,229.48,139.51,146.87,182.54,189.57,199.97,229.28,244.73,269.91,249.19],
'Grwth_Time2':[250.78,262.34,288.53,297.69,308.07,317.7,329.81,339.15,346.87,382.54,369.59,399.97,329.28,347.73,369.91,349.12],
'Grwth_Time3':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time4':[270.84,282.14,298.53,306.69,318.73,327.47,369.63,389.59,398.75,432.18,449.78,473.55,494.85,509.39,515.52,539.23],
'Grwth_Time5':[25.78,22.34,28.53,27.69,30.07,17.7,29.81,33.15,34.87,32.54,36.59,39.97,29.28,34.73,36.91,34.12],
'Grwth_Time6':[240.18,232.14,258.53,276.69,338.07,307.74,359.16,339.25,365.87,392.48,399.97,410.75,429.08,448.39,465.15,469.33],
'Grwth_Time7':[27.84,28.14,29.53,30.69,18.73,27.47,36.63,38.59,38.75,24.18,24.78,21.55,13.85,9.39,15.52,39.23],
}
df = pd.DataFrame(df,columns = ['Gen','Site','Type','AIC','AIC_TRX','diff','series','Grwth_Time1','Grwth_Time2','Grwth_Time3','Grwth_Time4','Grwth_Time5','Grwth_Time6','Grwth_Time7'])
df.info()
I want to do the following:
Find the average of each unique series per AIC_TRX for each Grwth_Time (Grwth_Time1, Grwth_Time2,....,Grwth_Time7)
Export all the outputs as one xlsx file (refer to the figure below)
The desired outputs look like the figure below (note: the numbers in this output are not the actual average values, they were randomly generated)
My attempt:
# Select the columns -> AIC_TRX, series, Grwth_Time1,Grwth_Time2,....,Grwth_Time7
df1 = df[['AIC_TRX', 'diff', 'series',
'Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3', 'Grwth_Time4',
'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']]
#Below is where I need help, I want to groupby the 'series' and 'AIC_TRX' for all the 'Grwth_Time1_to_7'
df1.groupby('series').Grwth_Time1.agg(['mean'])
Thanks in advance
You have to groupby two columns: ['series', 'AIC_TRX'] and find mean of each Grwth_Time.
df.groupby(['series', 'AIC_TRX'])[['Grwth_Time1', 'Grwth_Time2', 'Grwth_Time3',
'Grwth_Time4', 'Grwth_Time5', 'Grwth_Time6', 'Grwth_Time7']].mean().unstack().to_excel("output.xlsx")
Output (one block per column, Grwth_Time1 through Grwth_Time7, in that order):
AIC_TRX 1 2 3 4
series
1 150.78 208.07 146.87 229.28
2 162.34 217.76 182.54 244.73
4 188.53 229.48 189.57 269.91
8 197.69 139.51 199.97 249.19
AIC_TRX 1 2 3 4
series
1 250.78 308.07 346.87 329.28
2 262.34 317.70 382.54 347.73
4 288.53 329.81 369.59 369.91
8 297.69 339.15 399.97 349.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 270.84 318.73 398.75 494.85
2 282.14 327.47 432.18 509.39
4 298.53 369.63 449.78 515.52
8 306.69 389.59 473.55 539.23
AIC_TRX 1 2 3 4
series
1 25.78 30.07 34.87 29.28
2 22.34 17.70 32.54 34.73
4 28.53 29.81 36.59 36.91
8 27.69 33.15 39.97 34.12
AIC_TRX 1 2 3 4
series
1 240.18 338.07 365.87 429.08
2 232.14 307.74 392.48 448.39
4 258.53 359.16 399.97 465.15
8 276.69 339.25 410.75 469.33
AIC_TRX 1 2 3 4
series
1 27.84 18.73 38.75 13.85
2 28.14 27.47 24.18 9.39
4 29.53 36.63 24.78 15.52
8 30.69 38.59 21.55 39.23
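A reduced sketch of the groupby, mean and unstack chain above, on a made-up frame with repeated (series, AIC_TRX) pairs so that the mean actually averages something. The six rows and two Grwth_Time columns are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "series":      [1, 1, 2, 2, 1, 2],
    "AIC_TRX":     [1, 1, 1, 1, 2, 2],
    "Grwth_Time1": [10.0, 20.0, 30.0, 50.0, 7.0, 9.0],
    "Grwth_Time2": [1.0, 3.0, 5.0, 7.0, 2.0, 4.0],
})

# One mean per (series, AIC_TRX) pair and per Grwth_Time column.
means = df.groupby(["series", "AIC_TRX"])[["Grwth_Time1", "Grwth_Time2"]].mean()

# Move AIC_TRX into the columns: rows are series, columns are
# (Grwth_Time column, AIC_TRX value) pairs.
wide = means.unstack("AIC_TRX")

print(wide)
print(wide["Grwth_Time1"])  # the series-by-AIC_TRX table for a single column
```

Selecting wide["Grwth_Time1"] gives the same kind of table as one block of the output shown above.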
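Alternatively, groupby(...).apply(np.mean, axis=1) averages across the columns of each row within each series and AIC_TRX group. Note that this row-wise mean also includes the AIC_TRX, diff and series columns of df1, so it gives one value per row rather than one mean per Grwth_Time column.
result = df1.groupby(['series', 'AIC_TRX']).apply(np.mean, axis=1)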
Result:
series AIC_TRX
1 1 0 120.738
2 4 156.281
3 8 170.285
4 12 196.270
2 1 1 122.358
2 5 152.758
3 9 184.494
4 13 205.175
4 1 2 135.471
2 6 171.968
3 10 187.825
4 14 214.907
8 1 3 142.183
2 7 162.849
3 11 196.851
4 15 216.455
dtype: float64

Pandas: how to select rows in data frame based on condition of a specific value on a specific column [duplicate]

I have a given data frame as below example:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
And I wrote a function that should split the dataset into 2 data frames, based on comparison of a value in a specific column and a specific value.
For example, if I have col_idx = 2 and value=18.3 the result should be:
df1 - below the value:
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.1425 0.2839 0.2414
3 843786 M 12.45 15.7 82.57 477.1 0.1278 0.17 0.1578
4 844359 M 18.25 19.98 119.6 1040 0.09463 0.109 0.1127
df2 - above the value:
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974
The function should look like:
def split_dataset(data_set, col_idx, value):
    below_df = ?
    above_df = ?
    return below_df, above_df
Can anybody complete my script please?
below_df = data_set[data_set[col_idx] < value]
above_df = data_set[data_set[col_idx] > value] # you have to deal with data_set[col_idx] == value though
You can use loc:
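def split_dataset(data_set, col_idx, value):
    # use the data_set argument rather than a global df
    below_df = data_set.loc[data_set[col_idx] <= value]
    # strict > so that rows equal to value land only in below_df
    above_df = data_set.loc[data_set[col_idx] > value]
    return below_df, above_df
df1,df2=split_dataset(df,'2',18.3)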
Output:
df1
0 1 2 3 4 5 6 7 8
2 84348301 M 11.42 20.38 77.58 386.1 0.14250 0.2839 0.2414
3 843786 M 12.45 15.70 82.57 477.1 0.12780 0.1700 0.1578
4 844359 M 18.25 19.98 119.60 1040.0 0.09463 0.1090 0.1127
df2
0 1 2 3 4 5 6 7 8
0 842517 M 20.57 17.77 132.9 1326.0 0.08474 0.07864 0.0869
1 84300903 M 19.69 21.25 130.0 1203.0 0.10960 0.15990 0.1974
Note:
In this function call the column names are numbers. Before calling the function, check whether the column labels are strings or integers, and pass col_idx in the matching type ('2' or 2).
You should also define what happens when value itself occurs in the column.
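One way to make the boundary case explicit is to build a single boolean mask and use its complement, so every row lands in exactly one of the two frames. This is a sketch that assumes integer column labels (as produced by reading a file without a header) and assigns rows equal to value to the "below" frame.

```python
import pandas as pd

def split_dataset(data_set, col_idx, value):
    # Rows with data_set[col_idx] <= value go to below_df, all others to above_df.
    # Rows where the column is NaN compare as False and therefore end up in above_df.
    mask = data_set[col_idx] <= value
    return data_set[mask], data_set[~mask]

df = pd.DataFrame([
    [842517, "M", 20.57],
    [84300903, "M", 19.69],
    [84348301, "M", 11.42],
    [843786, "M", 12.45],
    [844359, "M", 18.25],
])

below_df, above_df = split_dataset(df, 2, 18.3)
print(below_df)
print(above_df)
```

If the column labels are strings, pass col_idx as a string ('2') instead.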

Populating new DataFrame by multi-criteria selection from old one with different structure

I'm using Pandas for data analysis. I have an input file like this snippet:
VEH SEC POS ACCELL SPEED
2 8.4 36.51 -0.2929 27.39
3 8.4 23.57 -0.7381 33.09
4 8.4 6.18 0.6164 38.8
1 8.5 47.76 0 25.57
I need to reorganize the data so that the rows are the unique (ordered) values from SEC as the 1st column, and then the other columns would be VEH1_POS, VEH1_SPEED, VEH1_ACCELL, VEH2_POS, VEH2_SPEED, VEH2_ACCELL, etc.:
TIME VEH1_POS VEH1_SPEED VEH1_ACCEL VEH2_POS, VEH2_SPEED, etc.
0.1 6.2 3.7 0.0 7.5 2.1
0.2 6.8 3.2 -0.5 8.3 2.1
etc.
So, for example, the value for VEH1_POS for each row in the new dataframe would be filled in by selecting values from the POS column in the original dataframe using the row where the SEC value matches the TIME value for the row in the new dataframe and the VEH value == 1.
To set up the rows in the new data frame I'm doing this:
start = inputdf['SIMSEC'].min()
end = inputdf['SIMSEC'].max()
time_steps = frange(start, end, 0.1)
outputdf['TIME'] = time_steps
But I'm lost at how to select the right values from the input dataframe and create the rest of the new dataframe for further analysis. Note also that the input file will NOT have data for every VEH for every SEC (time stamp). So the solution needs to handle that as well. My best guess was:
outputdf['veh1_pos'] = np.where((inputdf['VEH NO'] == 1) & (inputdf['SIMSEC'] == row['Time Step']))
but that doesn't work.
import pandas as pd
# your data
# ==========================
print(df)
Out[272]:
VEH SEC POS ACCELL SPEED
0 2 8.4 36.51 -0.2929 27.39
1 3 8.4 23.57 -0.7381 33.09
2 4 8.4 6.18 0.6164 38.80
3 1 8.5 47.76 0.0000 25.57
# reshaping
# ==========================
result = df.set_index(['SEC','VEH']).unstack()
Out[278]:
POS ACCELL SPEED
VEH 1 2 3 4 1 2 3 4 1 2 3 4
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
So here, the columns have a multi-level index where the 1st level is POS, ACCELL, SPEED and the 2nd level is VEH=1,2,3,4.
# if you want to rename the column
temp_z = result.columns.get_level_values(0)
temp_y = result.columns.get_level_values(1)
temp_x = ['VEH'] * len(temp_y)
result.columns = ['{}{}_{}'.format(x,y,z) for x,y,z in zip(temp_x, temp_y, temp_z)]
Out[298]:
VEH1_POS VEH2_POS VEH3_POS VEH4_POS VEH1_ACCELL VEH2_ACCELL VEH3_ACCELL VEH4_ACCELL VEH1_SPEED VEH2_SPEED VEH3_SPEED VEH4_SPEED
SEC
8.4 NaN 36.51 23.57 6.18 NaN -0.2929 -0.7381 0.6164 NaN 27.39 33.09 38.8
8.5 47.76 NaN NaN NaN 0 NaN NaN NaN 25.57 NaN NaN NaN
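The whole reshaping can be run end to end on the four sample rows from the question. This sketch rebuilds that snippet by hand; the final sort, which groups the columns per vehicle, is an optional extra step that is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "VEH":    [2, 3, 4, 1],
    "SEC":    [8.4, 8.4, 8.4, 8.5],
    "POS":    [36.51, 23.57, 6.18, 47.76],
    "ACCELL": [-0.2929, -0.7381, 0.6164, 0.0],
    "SPEED":  [27.39, 33.09, 38.8, 25.57],
})

# One row per SEC, one column per (measurement, vehicle) pair.
# Vehicles with no data at a given SEC are filled with NaN.
result = df.set_index(["SEC", "VEH"]).unstack("VEH")

# Flatten the two column levels into names such as VEH1_POS.
result.columns = [f"VEH{veh}_{name}" for name, veh in result.columns]

# Optional: sort the names so that each vehicle's columns sit together.
result = result[sorted(result.columns)]
print(result)
```

Note that set_index followed by unstack requires each (SEC, VEH) pair to appear at most once in the input; duplicated pairs raise an error.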
