There's a sensor dataset, and the values in the value column need to be corrected against one specific reference sensor R in the data. The values are directions in degrees (a full 360-degree circle). The correction method is as in the formula below: for each individual sensor i, calculate the sums of the sine and cosine of its differences with respect to the reference sensor, take the arctangent to get the correction in degrees, and subtract it from the original values. Vi(t) is the value of sensor i at time t, and VR(t) is the value of the reference sensor R at time t.
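Written out, one possible reading of that description (just my interpretation of the wording; the sums run over the timestamps the two sensors share) is, in LaTeX:

V_i^{\mathrm{corrected}}(t) = V_i(t) - \arctan\!\left(\frac{\sum_t \sin\bigl(V_i(t) - V_R(t)\bigr)}{\sum_t \cos\bigl(V_i(t) - V_R(t)\bigr)}\right)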
    date        sensor  value  tag
0   2000-01-01  1       200    a
1   2000-01-02  1       200    a
...
7   2000-01-08  1       300    b
8   2000-01-02  2       202    c
9   2000-01-03  2       204    c
10  2000-01-04  2       206    c
I have tried a few things but am a little confused about how to complete this in a for loop.
The timestamps of the sensors do not match: an individual sensor may have more or fewer timestamps than the reference sensor.
I want to add an additional column to store the corrected values.
Below is the sample dataset I made. If I choose sensor 2 as the reference sensor to correct the other sensors' values, how can I complete this in a Python loop? Thanks in advance!
import pandas as pd

sensor1 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=8), "sensor": [1]*8,
                        "value": [200, 200, 200, 200, 200, 300, 300, 300], "tag": pd.Series(['a', 'b']).repeat(4)})
sensor2 = pd.DataFrame({"date": pd.date_range('1/2/2000', periods=10), "sensor": [2]*10,
                        "value": [202, 204, 206, 208, 220, 250, 300, 320, 280, 260], "tag": pd.Series(['c', 'd']).repeat(5)})
sensor3 = pd.DataFrame({"date": pd.date_range('1/3/2000', periods=10), "sensor": [3]*10,
                        "value": [265, 222, 232, 220, 260, 300, 250, 200, 190, 223], "tag": pd.Series(['e', 'f']).repeat(5)})
sensor4 = pd.DataFrame({"date": pd.date_range('1/1/2000', periods=11), "sensor": [4]*11,
                        "value": [206, 203, 210, 253, 237, 282, 320, 232, 255, 225, 262], "tag": pd.Series(['c']).repeat(11)})

sensordata = pd.concat([sensor1, sensor2, sensor3, sensor4]).reset_index(drop=True)
Here is an inelegant solution, using for loops and multiple merges. As an example, I use sensor4 to correct the remaining sensors. The correction formula was not 100% clear to me, so I interpreted it as summing the sine and the cosine of the differences.
import numpy as np

def data_correction(vi, vr):
    # I assume the sine and cosine of the differences are summed?
    return vi - np.arctan(np.sum(np.sin(vi - vr) + np.cos(vi - vr), axis=0))

sensors = [sensor1, sensor2, sensor3]  # assuming you want to correct with sensor 4
sensorR = sensor4.copy()

for i in range(len(sensors)):
    # create a temp dataframe, merged on date, so that the measurements line up
    temp = pd.merge(sensors[i], sensorR, how='inner', on='date')
    # do the correction and assign it to a new column
    temp['value_corrected'] = data_correction(temp['value_x'], temp['value_y'])
    # add this column back to the original sensor data
    sensors[i] = sensors[i].merge(temp[['date', 'value_corrected']], how='inner', on='date')
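If the formula is meant in the circular-statistics sense (the values are angles in degrees), the correction function could instead look roughly like the sketch below. Whether this matches the intended formula is an assumption on my part, since the original description is ambiguous.

import numpy as np

def data_correction_circular(vi, vr):
    # Treat the values as angles: convert the differences to radians, take the
    # circular-mean offset via arctan2 of the summed sine and cosine, convert
    # back to degrees, and subtract it from the original values.
    diff = np.deg2rad(vi - vr)
    offset = np.degrees(np.arctan2(np.sum(np.sin(diff)), np.sum(np.cos(diff))))
    return vi - offset

It can be dropped into the loop above in place of data_correction.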
I'm trying to predict delays based on the weather 2 hours before scheduled travel. I have one dataset of travel data (call it df1) and one dataset of weather (call it df2). In order to predict the delay, I am trying to join df1 and df2 with an offset of 2 hours, i.e. I want to look at the weather data from 2 hours before the scheduled travel time. A pared-down view of the data would look something like this:
example df1 (travel data):

travel_data  location  departure_time                delayed
blah         KPHX      2015-04-23T15:02:00.000+0000  1
bleh         KRDU      2015-04-27T15:19:00.000+0000  0

example df2 (weather data):

location  report_time          weather_data
KPHX      2015-01-01 01:53:00  blih
KRDU      2015-01-01 09:53:00  bloh
I would like to join the data first on location and then on the timestamps, with a minimum 2-hour offset. If there are multiple weather reports more than 2 hours earlier than the departure time, I would like to join the travel data with the report closest to that 2-hour offset.
So far I have used
joinedDF = airlines_6m_recode.join(weather_filtered, (col("location") == col("location")) & (col("departure_time") == (col("report_date") + f.expr('INTERVAL 2 HOURS'))), "inner")
This works only for the times when the departure time and (report date - 2hrs) match exactly, so I'm losing a large percentage of my data. Is there a way to join to the next closest report date outside the 2hr buffer?
I have looked into window functions but they don't describe how to do joins.
Change the join condition to >= (report at least 2 hours before departure), then keep the latest qualifying report per travel record by partitioning a row_number window by location and departure time.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# 1. Join on location, keeping reports at least 2 hours before departure
# 2. Partition by travel record (location, departure time), order by report_ts desc, add row_number
# 3. Filter row_number == 1 to keep the report closest to the 2-hour offset
joinedDF = airlines_6m_recode.join(
        weather_filtered,
        (airlines_6m_recode["location"] == weather_filtered["location"]) &
        (weather_filtered["report_time_ts"] <=
         airlines_6m_recode["departure_time_ts"] - F.expr("INTERVAL 2 HOURS")),
        "inner") \
    .withColumn("row_number",
                F.row_number().over(
                    Window.partitionBy(airlines_6m_recode["location"],
                                       airlines_6m_recode["departure_time_ts"])
                          .orderBy(weather_filtered["report_time_ts"].desc())))

# Just to print the intermediate result.
joinedDF.show()

joinedDF.filter('row_number == 1').show()
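The join assumes that departure_time_ts and report_time_ts already exist as timestamp columns. If they don't, they could be derived roughly as below; the column names and format strings are guesses based on the sample rows in the question, so adjust them to your actual data.

from pyspark.sql import functions as F

# Sketch: build the *_ts columns used in the join above
airlines_6m_recode = airlines_6m_recode.withColumn(
    "departure_time_ts", F.to_timestamp("departure_time", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
weather_filtered = weather_filtered.withColumn(
    "report_time_ts", F.to_timestamp("report_time", "yyyy-MM-dd HH:mm:ss"))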
I have a large data frame across different timestamps. Here is my attempt:
all_data = []
for ws in wb.worksheets():
    rows = ws.get_all_values()
    df_all_data = pd.DataFrame.from_records(rows[1:], columns=rows[0])
    all_data.append(df_all_data)
data = pd.concat(all_data)

# Change data types
data['Year'] = pd.DatetimeIndex(data['Week']).year
data['Month'] = pd.DatetimeIndex(data['Week']).month
data['Week'] = pd.to_datetime(data['Week']).dt.date
data['Application'] = data['Application'].astype('str')
data['Function'] = data['Function'].astype('str')
data['Service'] = data['Service'].astype('str')
data['Channel'] = data['Channel'].astype('str')
data['Times of alarms'] = data['Times of alarms'].astype('int')

# Compare Channel values over weeks
subchannel_df = data.pivot_table('Times of alarms', index='Week', columns='Channel', aggfunc='sum').fillna(0)
subchannel_df = subchannel_df.sort_index(axis=1)
The data frame I am working on
What I hope to achieve:
add a percentage row (the last row vs. the second-to-last row) at the end of the data frame, excluding cases such as division by zero and negative percentages
show the channels that increased by more than 10% compared with last week.
I have been trying different methods to achieve this for days but have not managed to do it. Thank you in advance.
You could use the shift function, the equivalent of the LAG window function in SQL, to return last week's value and then perform the calculation at row level. To avoid dividing by zero you can use numpy's where function, which is equivalent to CASE WHEN in SQL. Let's say the column you perform the calculation on is named "X":
subchannel_df["XLag"] = subchannel_df["X"].shift(periods=1).fillna(0).astype('int')
subchannel_df["ChangePercentage"] = np.where(subchannel_df["XLag"] == 0, 0, (subchannel_df["X"]-subchannel_df["XLag"])/subchannel_df["XLag"])
subchannel_df["ChangePercentage"] = (subchannel_df["ChangePercentage"]*100).round().astype("int")
subchannel_df[subchannel_df["ChangePercentage"]>10]
Output:
Channel X XLag ChangePercentage
Week
2020-06-12 12 5 140
2020-11-15 15 10 50
2020-11-22 20 15 33
2020-12-13 27 16 69
2020-12-20 100 27 270
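Applied to the pivoted frame from the question, the same lag/where idea can be used column-wise to compare the last two weeks across every channel. The sketch below assumes subchannel_df as built in the question (weeks as the index, channels as the columns); the label 'change_pct' is just an example name.

import numpy as np
import pandas as pd

# Compare the last week against the one before it for every channel column
last, prev = subchannel_df.iloc[-1], subchannel_df.iloc[-2]
pct = np.where(prev == 0, 0, (last - prev) / prev * 100)   # guard against divide-by-zero
pct_row = pd.Series(pct, index=subchannel_df.columns).round()

subchannel_df.loc['change_pct'] = pct_row   # percentage row appended at the end
increased = pct_row[pct_row > 10]           # channels up more than 10% vs last week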
Summary
Suppose that you apply a function to a groupby object, so that g.apply for every group g in df.groupby(...) gives you a series/dataframe. How do I combine these results into a single dataframe, with the group names as columns?
Details
I have a dataframe event_df that looks like this:
index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0
...
I want to create a sampling of the events for every note, with the sampling done at the times given by t_df:
index t
0 0
1 0.5
2 1.0
...
So that I'd get something like this.
t C D
0 off off
0.5 on off
1.0 off on
...
What I've done so far:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    print(group_with_t)
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')

gb = event_df.groupby('note')
gb.apply(get_t_for_gb, **kwargs)
So what I get is a dataframe for each note, all of the same size (same as t_df):
t event
0 on
0.5 off
...
t event
0 off
0.5 on
...
How do I go from here to my desired dataframe, with each group corresponding to a column in a new dataframe, and the index being t?
EDIT: Sorry, I didn't take into account that you rescale your time column, and I can't present a whole solution now because I have to leave, but I think you could do the rescaling by using pandas.merge_asof on your two dataframes to get the nearest "rescaled" time, and from the merged dataframe you could try the code below. I hope this is what you wanted.
import pandas as pd
import io

sio = io.StringIO("""index event note time
0 on C 0.5
1 on D 0.75
2 off C 1.0""")
df = pd.read_csv(sio, sep=r'\s+', index_col=0)

df.groupby(['time', 'note']).agg({'event': 'first'}).unstack(-1).fillna('off')
Take the first row of each time-note group with agg({'event': 'first'}), then unstack the note index level so the note values become columns. At the end, fill all cells for which no data point was found with 'off' via fillna.
This outputs:
Out[28]:
event
note C D
time
0.50 on off
0.75 off on
1.00 off off
You might also want to try min or max in case on/off is ambiguous for a time/note combination (i.e. there are multiple rows for the same time/note where some say on and some say off) and you prefer one of those values (say, if there is one on, then no matter how many offs there are, you want an on, etc.). If you want something like a majority vote, I would suggest adding a majority-vote column to the aggregated dataframe (before the unstack()).
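The merge_asof step mentioned above could look roughly like the sketch below, assuming event_df and t_df as defined in the question. Each event time is snapped forward to the next sampling time in t_df, and the snapped frame is then pivoted exactly as above.

import pandas as pd

# merge_asof requires both keys to be sorted
events_sorted = event_df.sort_values('time')
samples_sorted = t_df.sort_values('t')

# direction='forward' picks the first sampling time >= the event time,
# mirroring np.argwhere(t_arr >= time) in the question
snapped = pd.merge_asof(events_sorted, samples_sorted,
                        left_on='time', right_on='t', direction='forward')

sampled = (snapped.groupby(['t', 'note'])
                  .agg({'event': 'first'})
                  .unstack(-1)
                  .fillna('off'))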
Oh so I found it! All I had to do was to unstack the groupby results. Going back to generating the groupby result:
def get_t_note_series(notedata_row, t_arr):
    """Return the time index in the sampling that corresponds to the event."""
    t_idx = np.argwhere(t_arr >= notedata_row['time']).flatten()[0]
    return t_idx

def get_t_for_gb(group, **kwargs):
    t_idxs = group.apply(get_t_note_series, args=(t_arr,), axis=1)
    t_idxs.rename('t_arr_idx', inplace=True)
    group_with_t = pd.concat([group, t_idxs], axis=1).set_index('t_arr_idx')
    ## print(group_with_t)  ## unnecessary!
    return group_with_t

t_arr = np.arange(0, 10, 0.5)
t_df = pd.DataFrame({'t': t_arr}).rename_axis('t_arr_idx')

gb = event_df.groupby('note')
result = gb.apply(get_t_for_gb, **kwargs)
At this point, result is a dataframe with note as an index:
>> print(result)
event
note t
C 0 off
0.5 on
1.0 off
....
D 0 off
0.5 off
1.0 on
....
Doing result = result.unstack('note') does the trick:
>> result = result.unstack('note')
>> print(result)
event
note C D
t
0 off off
0.5 on off
1.0 off on
....
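If you would rather have plain C and D column names than the ('event', note) MultiIndex that unstack leaves behind, the extra column level can be dropped afterwards (an optional cleanup, applied to the result from above):

# Flatten the ('event', note) column MultiIndex down to the note names
result.columns = result.columns.droplevel(0)
result = result.rename_axis(columns=None)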
I am trying to compare the test file with the model file and then verify it against the result.
Here is what I have tried till now:
import pandas as pd
data = pd.read_csv("data.csv",encoding = "utf-16", header = 0,sep="\t")
data.head(20)
createmodel = data.drop(labels=['param1','param3','param5','param7','param9','param13','param15','colorsame'], axis=1)
createmodel.drop_duplicates().to_csv("model.csv",index=False,header =True,sep="\t",encoding="utf-16")
createmodel.head(10)
createmodel.drop_duplicates().to_csv("test.csv",index=False,header =True,sep="\t",encoding="utf-16")
createmodel.head(10)
verifyresult = pd.read_csv("verify.csv",encoding = "utf-16", header = 0,sep="\t")
verifyresult.head(20)
result = pd.merge(testmodel,createmodel, on = ["param2","param4","param6","param8","param10","param11","param12","param14","param16"])
result = result.drop_duplicates()
Here are the files: model, test, and verify.
I have achieved the comparison using the merge statement and got the output in the result variable.
The only part that is troubling me is that I need to find the time one minute later than the value in result.Time, look it up in verify.csv, merge the matching values into result as another column, and save it as a CSV.
The final result must be like the following:
If the following is the dataframe in the result variable:
2018.5.1 0:5 0-1 0-1 0-1 0-1 0--1 0 1 -43--42 78-79 Red
And verify.csv has:
2018.5.1 0:6 Green
which is the values associated with the one minute later time frame of the value of result variable.
Then the new frame should be:
Time param2 param4 param6 param8 param10 param11 param12 param14 param16 color Actual
2018.5.1 0:5 0-1 0-1 0-1 0-1 0--1 0 1 -43--42 78-79 Red Green
which is the final result.
Kindly suggest a way to achieve what I want.
You can convert your 'Time' columns to datetime, which allows you to easily subtract one minute from the verify.csv DataFrame, and then you can just merge (or map, or however you want to join them):
import pandas as pd

result['Time'] = pd.to_datetime(result['Time'], format='%Y.%m.%d %H:%M')
verifyresult['Time'] = pd.to_datetime(verifyresult['Time'], format='%Y.%m.%d %H:%M')

# Shift the verify times back one minute so they line up with result's times
verifyresult['Time'] = verifyresult['Time'] - pd.Timedelta(minutes=1)

result = result.merge(verifyresult, on='Time')
# or
# result['Actual'] = result['Time'].map(verifyresult.set_index('Time').Actual)
Outputs:
Time param2 param4 param6 param8 param10 param11 param12 param14 param16 color Actual
0 2018-05-01 00:05:00 0-1 0-1 0-1 0-1 0--1 0 1 -43--42 78-79 Red Green
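If some rows in result have no reading one minute later in verify.csv, a left merge keeps them anyway, and the final frame can then be written to disk. This is just a sketch; 'final_result.csv' is a placeholder file name.

# Keep all result rows even without a matching verify time, then save as CSV
final = result.merge(verifyresult, on='Time', how='left')
final.to_csv('final_result.csv', index=False)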
I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns), and then the next five days are recorded below them. To make things more complicated, the day of the week, date, and billing day are shown above the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal is to create a simple Python script that turns the data into a data frame with the columns DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is that my script returns NaN for the KW, KVAR, and KVA data after the first five days (which coincides with a new iteration of the outer for loop). What is weird to me is that when I print out the same ranges I get the data I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    # starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50, 0]
    val_start = 3
    val_end = 51
    date_val = [0, 2]
    day_type = [1, 2]
    # There are 7 row movements that need to take place.
    for row_move in range(1, 8):
        day = [1, 2, 3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I cycle through the five days, grab their data in a
        # temporary dataframe, and then append that dataframe onto the output dataframe.
        for col_move in range(1, 6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            # These are the 3 values that stop working after the first column change.
            # I get the values that I expect for the first 5 days.
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set.
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # troubleshooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase the column positions for each iteration of the column loop
            # (seems to work perfectly when I print the data)
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase the row positions for each iteration of the row loop
        # (seems to work perfectly when I print the data)
        date_val[0] = date_val[0] + 55
        day_type[0] = day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output

test = make_df(df1)
Below is some sample output. It shows where the data start to break down after the fifth day (i.e. the first column shift in the for loop). What am I doing wrong?
It could be the index alignment pandas performs when you assign a Series into a DataFrame (and in pd.append): numerical values only land where the row indices match.
import pandas as pd
import numpy as np

output = pd.DataFrame(np.random.rand(5, 2), columns=['a', 'b'])  # fake data
output['c'] = list('abcde')  # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a', 'b', 'c'])
tmp['a'] = output.iloc[0:2, 2]
tmp['b'] = output.iloc[3:5, 2]  # generates NaN: index 3-4 does not line up with tmp's index 0-1
tmp['c'] = output.iloc[0:2, 2]
output.append(tmp)
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object and possibly make its way into temp_df. iloc does not report invalid row indices, though similar column indices would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than wrestling with this indexing in a Pandas DataFrame, as the original format is rather complex. Slice the data by date and later use pd.melt or pd.groupby to shape it into the format you like, or alternatively try a MultiIndex if you stick with Pandas I/O.
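If index alignment is indeed the cause, one small change to the loop in the question should avoid the NaNs: assign the raw values rather than the index-aligned Series. This is a sketch that keeps the question's variable names.

# Drop the source index before assigning into temp_df, so pandas does not
# try to align the row labels (which is what produces the NaN values)
temp_df['KW'] = df.iloc[val_start:val_end, day[0]].to_numpy()
temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]].to_numpy()
temp_df['KVA'] = df.iloc[val_start:val_end, day[2]].to_numpy()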