I'd like to see if there's a way to calculate something like the following in Python. Is it possible?
ID  Rotation  Starting_degree  Current_degree
 1        40              360             320
 1        55              320             265
 2        70              360             290
 1        15              265             250
 2        20              290             270
 3        30              360             330
 3        60              330             270
 1        25              250             225
In general my code is df['current_degree'] = df.apply(lambda row: row.starting_degree - row.rotation, axis = 1), but I'd like the starting degree figure to change based on ID and any previous calculations.
With each new ID the starting degree resets to 360.
IIUC, you want to calculate the current degree given the rotation:
# assume that all IDs start with 360
df['Start'] = 360
# grouping by ID
groups = df.groupby("ID")
# compute the total rotation by cumsum
df['Rot_so_far'] = groups.Rotation.cumsum()
# current degree
df['Current_degree'] = df['Start'] - df['Rot_so_far']
You may want to take the result modulo 360 to keep the current degree non-negative.
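For example, a minimal sketch of that step, assuming the columns computed above:
# wrap into the range [0, 360) so the result never goes negative
df['Current_degree'] = df['Current_degree'] % 360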
I have the following dataset initially:
value;date
100;2021-01-01
160;2021-02-01
250;2021-02-15
10;2021-03-01
90;2021-04-01
150;2021-04-15
350;2021-06-01
20;2021-07-01
100;2021-08-01
10;2021-08-10
Whenever the value in column "value" drops (e.g. from 250 to 10 on 2021-03-01), I want to save the old value as an offset.
When the value drops again (e.g. from 350 to 20 on 2021-07-01), I want to add the new offset to the old one (350 + 250).
Afterwards I want to add the offsets to the values, so that I get the following dataset at the end:
value;date;offset;corrected_value
100;2021-01-01;0;100
160;2021-02-01;0;160
250;2021-02-15;0;250
10;2021-03-01;250;260
90;2021-04-01;250;340
150;2021-04-15;250;400
350;2021-06-01;250;600
20;2021-07-01;600;620
100;2021-08-01;600;700
10;2021-08-10;700;710
My current (terrible) approach, which is not working:
df['date'] = pd.to_datetime(df['date'])
df.index = df['date']
del df['date']
df.drop_duplicates(keep='first')
df['previous'] = df['value'].shift(1)

def pn(current, previous, offset):
    if not pd.isna(previous):
        if current < previous:
            return previous + offset
    return offset

df['offset'] = 0
df['offset'] = df.apply(lambda row: pn(row['value'], row['previous'], row['offset']), axis=1)
Your help is so much appreciated, thank you!
Cheers
Find the desired positions in column 'value' with pd.Series.diff and pd.Series.shift, fill with 0 and compute the cumsum, then add the 'offset' column to 'value':
# take the value of each row that is immediately followed by a drop (next diff is negative)
df['offset'] = df.value[df.value.diff().lt(0).shift(-1, fill_value=False)]
# shift so the saved value takes effect from the drop row, fill gaps with 0 and accumulate
df['offset'] = df.offset.shift(1).fillna(0).cumsum().astype('int')
# corrected value = accumulated offset + original value
df['correct_value'] = df.offset + df.value
df
Output
value date offset correct_value
0 100 2021-01-01 0 100
1 160 2021-02-01 0 160
2 250 2021-02-15 0 250
3 10 2021-03-01 250 260
4 90 2021-04-01 250 340
5 150 2021-04-15 250 400
6 350 2021-06-01 250 600
7 20 2021-07-01 600 620
8 100 2021-08-01 600 700
9 10 2021-08-10 700 710
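For reference, a minimal sketch of how the semicolon-separated input above could be loaded before running these steps (io.StringIO stands in for your actual file):
import io
import pandas as pd

raw = """value;date
100;2021-01-01
160;2021-02-01
250;2021-02-15
10;2021-03-01
90;2021-04-01
150;2021-04-15
350;2021-06-01
20;2021-07-01
100;2021-08-01
10;2021-08-10"""
df = pd.read_csv(io.StringIO(raw), sep=';', parse_dates=['date'])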
I have an algorithmic problem which I am trying to solve in Python. I have a pandas dataframe of two columns, as shown below (I have kept it sorted in descending order here to make the problem easier to explain):
df:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
LA7 185
LA8 180
LA9 150
LA10 100
I have a threshold value for BCOL, say 215. What I want is the maximal subset (from the top) of the above dataframe whose BCOL average is greater than or equal to 215.
So in this case, if I keep the BCOL values down to 200 then the mean of (234, 230, ..., 200) is 218.67, whereas if I keep down to 185 (234, 230, ..., 200, 185), the mean is 213.86. So my maximal subset with a BCOL mean greater than 215 should be (234, ..., 200), and I will drop the rest of the rows. My final output dataframe should be:
dfnew:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
I was trying to put BCOL into a list and use a for/while loop, but that is not pythonic and also a bit time-consuming for a very large table. Is there a way to achieve this in a more pandas-idiomatic way?
Will appreciate any help. Thanks.
IIUC, you could do:
import numpy as np

# guarantee that the DF is sorted by BCOL in descending order
df = df.sort_values(by=['BCOL'], ascending=False)
# cumulative mean, then find where it is greater than 215
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) > 215.0
print(df[mask])
Output
ACOL BCOL
0 LA1 234
1 LA2 230
2 LA3 220
3 LA4 218
4 LA5 210
5 LA6 200
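As a side note, the cumulative mean can also be written with expanding, which avoids building the np.arange denominator by hand (a sketch under the same sorted-descending assumption):
# cumulative (expanding) mean of BCOL, row by row
mask = df['BCOL'].expanding().mean() > 215.0
print(df[mask])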
I am quite new to Python and I have been facing some trouble doing the following:
I have a dataframe that I had to group based on different variables in order to analyze the data.
Package Package category Moisture Length Height Packing weight
0 YYS X NON DRY 2000 200 200
1 XXS Y NON DRY 190 20 200
2 GGT Z DRY 350 32 680
3 YYS X DRY 1000 209 280
4 YYS X DRY 3500 209 280
5 GGT Z DRY 350 37 680
6 XXS Y NON DRY 345 29 600
7 GGT Z DRY 350 37 680
8 GGT Z DRY 350 37 680
9 YYS X DRY 2000 209 285
10 YYS X NON DRY 3400 200 200
11 YYS X DRY 2000 209 280
12 XXS Y NON DRY 190 23 200
13 XXS Y NON DRY 190 23 200
14 GGT Z NON DRY 190 23 200
15 XXS Y NON DRY 190 23 200
16 GGT Z NON DRY 190 23 200
17 XXS Y NON DRY 336 20 600
18 XXS Y NON DRY 190 23 200
For this analysis, I search for a specific group, using the following:
data1.loc[(data1['Package category'] == 'X') & (data1['Package'] == 'YYS') & (data1['Moisture'] == 'DRY')
& (data1['Length'] == 2000) & (data1['Height'] == 209.0),:]
In that specific group I found that the values in the 'Packing weight' column vary, and I would like to have just one value. Therefore I need to replace 280 with 285 in all rows of that group that have 280 as the Packing weight value. So I am using this:
data1.loc[(data1['Package category'] == 'X') & (data1['Package'] == 'YYS') & (data1['Moisture'] == 'DRY')
& (data1['Length'] == 2000) & (data1['Height'] == 209.0),:].replace({280.0:285})
The problem is that I would like this replacement to be reflected in my original dataframe "data1".
But the code above only displays the result as if the replacement had been done; when I go through the original dataframe data1, the change has not actually been made.
I have to do this analysis for different groups, and at the end I would like to have these changes applied to my one original dataframe "data1".
Is there a way I can do this?
Edit: after reading this: Pandas how can 'replace' work after 'loc'?
I suggest the following edit:
Let's call the whole filtering condition con (just to keep it shorter here; you should substitute your full set of conditions):
data1.loc[con, :] = data1.loc[con, :].replace({280.0: 285})
replace returns a new dataframe, so you have to assign the result back to the selected rows.
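Alternatively, since only the 'Packing weight' column needs to change, you could assign to that column directly for the filtered rows instead of replacing over the whole row slice (a sketch, assuming con is the same combined boolean condition):
data1.loc[con & (data1['Packing weight'] == 280.0), 'Packing weight'] = 285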
Let's say I have an Excel document with the following format. I'm reading said Excel doc with pandas and plotting data using matplotlib and numpy. Everything is great!
Buttttt..... I want more constraints. Now I want to constrain my data so that I can select only specific zenith angles and azimuth angles. More specifically: I only want zenith when it is between 30 and 90, and I only want azimuth when it is between 30 and 330.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
0 81 10
70 35 7
110 90 17
270 45 23
330 45 13
345 47 6
175 82 7
220 7 8
This is an example of the sort of constraint I'm looking for.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
70 35 7
110 90 17
270 45 23
330 45 13
175 82 7
The following is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime

P_file = file1
out_file = file2
out_file2 = file3

data = pd.read_csv(file1, header=None, sep=' ')
df = pd.DataFrame(data=data)
df.to_csv(file2, sep=',', header=[...])  ## 19 headers; the ones that matter for this question are 'DateTime', 'Zenith', 'Azimuth', and 'Ozone Amount'.
df = pd.read_csv(file2, header='infer')
mask = df[df['DateTime'].str.contains('20141201')]  ## In this line I'm sorting for anything containing the locator for the given day.
mask.to_csv(file2)  ## I'm now updating file 2 so that it only has the data I want sorted for.
data2 = pd.read_csv(file2, header='infer')
df2 = pd.DataFrame(data=data2)

def tojuliandate(date):
    return ...  ## a function that changes a normal date of format %Y%m%dT%H%M%SZ to julian date format %y%j

def timeofday(date):
    ...  ## changes %Y%m%dT%H%M%SZ to %H%M%S for more narrow views of the data

df2['Time of Day'] = df2['DateTime'].apply(timeofday)
df2.to_csv(file2)  ## adds a column for "timeofday" to the file
So basically, at this point this is all the code that goes into making the CSV I want to filter. How would I go about filtering
'Zenith' and 'Azimuth'
so that they meet the criteria I specified above?
I know that I will need if statements to do this.
I tried something like this but it didn't work, and I was looking for a bit of help:
df[(df["Zenith"]>30) & (df["Zenith"]<90) & (df["Azimuth"]>30) & (df["Azimuth"]<330)]
Basically a duplicate of Efficient way to apply multiple filters to pandas DataFrame or Series
You can use Series.between:
df[(df['Zenith'].between(30, 90)) & (df['Azimuth'].between(30, 330))]
Yields:
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
Note that by default, these upper and lower bounds are inclusive (inclusive=True).
You can write only those entries of the dataframe to your file which meet your boundary conditions:
# replace the line df.to_csv(...) in your example with
df[((df['Zenith'] >= 30) & (df['Zenith'] <= 90)) &
   ((df['Azimuth'] >= 30) & (df['Azimuth'] <= 330))].to_csv('my_csv.csv')
Using pd.DataFrame.query:
df_new = df.query('30 <= Zenith <= 90 and 30 <= Azimuth <= 330')
print(df_new)
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
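If the bounds may change, query can also pick up local variables with the @ prefix (a small sketch with hypothetical variable names):
z_lo, z_hi, a_lo, a_hi = 30, 90, 30, 330
df_new = df.query('@z_lo <= Zenith <= @z_hi and @a_lo <= Azimuth <= @a_hi')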
I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df) as shown below.
There are several cell-number-based files with distance values, shown as matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df will have the data for all rows of the current cell # that we are iterating over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row
then use a join on column B between inp_df and matrix_df, where matrix_df is somehow translated into a tuple of column name, distance and average distance.
But I am looking for a pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df may vary.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique, and the values of column A may or may not be unique.
Also, the matrix_df's first column was empty and I had renamed it with the following code to make it easier to understand, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
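If the row-wise apply becomes a bottleneck at millions of rows, one possible alternative is to melt each matrix into long form and merge on both B and A instead of indexing column names per row. A rough sketch, under the assumption that the matrix columns are the A values stored as strings and that matrix_df1/matrix_df2 are the per-cell frames from above:
import pandas as pd

long_parts = []
for cell, mdf in [(1, matrix_df1), (2, matrix_df2)]:
    # wide -> long: one row per (B, A) pair with its distance
    long = mdf.melt(id_vars=['B', 'avg_distance'], var_name='A', value_name='distance')
    long['A'] = long['A'].astype(int)
    long['cell'] = cell
    long_parts.append(long)

out_df = inp_df.merge(pd.concat(long_parts), on=['A', 'B', 'cell'], how='left')
out_df = out_df[['A', 'B', 'cell', 'distance', 'avg_distance']]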