Efficiently calculating remaining useful lifetime with pandas - python

I have a pandas dataframe that contains multiple rows with a datetime and a sensor value. My goal is to add a column that calculates the days until the sensor value will exceed the threshold the next time.
For instance, for the data <2019-01-05 11:00:00, 200>, <2019-01-06 12:00:00, 250>, <2019-01-07 13:00:00, 300> I would want the additional column to look like [1 day, 0 days, 0 days] for thresholds between 200 and 250 and [2 days, 1 day, 0 days] for thresholds between 250 and 300.
I tried subsampling the dataframe with df_sub = df[df['sensor_value'] >= threshold], iterating over both dataframes and calculating, for each timestamp in df, the next timestamp in df_sub. However, this solution seems very inefficient, and I think pandas might have an optimized way of calculating what I need.
In the following example code, I tried what I described above.
import pandas as pd

data = [{'time': '2019-01-05 11:00:00', 'sensor_value': 200},
        {'time': '2019-01-05 14:37:52', 'sensor_value': 220},
        {'time': '2019-01-05 17:55:12', 'sensor_value': 235},
        {'time': '2019-01-06 12:00:00', 'sensor_value': 250},
        {'time': '2019-01-07 13:00:00', 'sensor_value': 300},
        {'time': '2019-01-08 14:00:00', 'sensor_value': 250},
        {'time': '2019-01-09 15:00:00', 'sensor_value': 320}]
df = pd.DataFrame(data)
df['time'] = pd.to_datetime(df['time'])
def calc_rul(df, threshold):
    # calculate all datetimes where the threshold is exceeded
    df_sub = sorted(df[df['sensor_value'] >= threshold]['time'].tolist())
    # variable to store all days
    remaining_days = []
    for v1 in df['time'].tolist():
        for v2 in df_sub:
            # if the exceeding date is the first one in the future, calculate the day difference
            if v2 > v1:
                remaining_days.append((v2 - v1).days)
                break
            elif v2 == v1:
                remaining_days.append(0)
                break
    df['RUL'] = pd.Series(remaining_days)
calc_rul(df, 300)
Expected output (output of the above sample):

Here's what I would do for one threshold (using the question's column names):
import numpy as np

def calc_rul(df, thresh):
    # mark all the rows whose value reaches the threshold
    markers = df['sensor_value'].ge(thresh)
    # copy the dates of the marked rows
    df['last_day'] = np.nan
    df.loc[markers, 'last_day'] = df['time']
    # back-fill those dates so every row sees the next exceedance
    df['last_day'] = df['last_day'].bfill().astype('datetime64[ns]')
    df['RUL'] = (df['last_day'] - df['time']).dt.days
    # drop the helper column if necessary;
    # remove this line to better see how the code works
    df.drop('last_day', axis=1, inplace=True)

calc_rul(df, 300)
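As a quick sanity check with the question's sample data (note that .dt.days truncates partial days):
calc_rul(df, 300)
print(df['RUL'].tolist())
# roughly [2, 1, 1, 1, 0, 1, 0] -- whole days until the next reading >= 300;
# rows after the last exceedance, if any existed, would come out as NaN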

Instead of splitting the dataframe, you can use .loc, which lets you filter and iterate over your thresholds in the same way (the bracketed strings and <your_rule> are placeholders for your own values and condition):
df['RUL'] = '[2 days, 1 day, 0 days]'
for threshold in threshold_list:
    df.loc[df['sensor_value'] > <your_rule>, 'RUL'] = '[1 day, 0 days, 0 days]'
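A minimal sketch of how that loop could look for several thresholds, reusing the back-fill idea from the previous answer (column names taken from the question; the threshold list is an assumption):
def add_rul_columns(df, thresholds):
    # one RUL column per threshold, without splitting the dataframe
    for threshold in thresholds:
        markers = df['sensor_value'].ge(threshold)
        next_hit = df['time'].where(markers).bfill()  # next time the threshold is reached
        df[f'RUL_{threshold}'] = (next_hit - df['time']).dt.days
    return df

add_rul_columns(df, thresholds=[250, 300])  # hypothetical threshold list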


Different questions about pandas pivot tables

Here's my df:
df = pd.DataFrame(
    {
        'Color': ['red', 'blue', 'red', 'red', 'green', 'red', 'yellow'],
        'Type': ['Oil', 'Aluminium', 'Oil', 'Oil', 'Cement Paint', 'Synthetic Rubber', 'Emulsion'],
        'Finish': ['Satin', 'Matte', 'Matte', 'Satin', 'Semi-gloss', 'Satin', 'Satin'],
        'Use': ['Interior', 'Exterior', 'Interior', 'Interior', 'Exterior', 'Exterior', 'Exterior'],
        'Price': [55, 75, 60, 60, 55, 75, 50]
    }
)
I want to create a pivot table that outputs 'Color', the color count, the percentage or weight of each color count, and finally a total row showing the total color count next to 100%. Additionally, I'd like to add a header with today's date in the following format (02 - Nov).
Here is my current pivot with my approximate inputs:
import datetime

today = datetime.date.today()
today_format = today.strftime('%d-%b')
pivot_table = pd.pivot_table(
    data=df,
    index='Color',
    aggfunc={'Color': 'count'}
)
df['Color'].value_counts(normalize=True).mul(100).round(1).astype(str) + '%'
Is there a way to add more information to the pivot as a header, a total and an extra column? Or should I just convert the pivot back to a DataFrame and edit it from there?
The main difficulty I'm finding is that since I'm handling string data, aggfunc='sum' actually concatenates the strings. And if I try to add margins=True, margins_name='Total count', I get the following error:
if isinstance(aggfunc[k], str):
KeyError: 'Type'
The desired table output would look something like this:
Updated Answer
Thanks to a great suggestion by Rabinzel, we can also have today's date as a column header:
import numpy as np
from datetime import datetime

df = (df['Color'].value_counts().reset_index()
        .pivot_table(index=['index'], aggfunc=np.sum, margins=True, margins_name='Total')
        .assign(perc=lambda x: x['Color'] / x.iloc[:-1]['Color'].sum() * 100)
        .rename(columns={'Color': 'Color Count',
                         'perc': '%'}))
new_cols = pd.MultiIndex.from_product([[datetime.today().strftime('%#d-%b')], df.columns])
df.columns = new_cols
df
2-Nov
Color Count %
index
blue 1 14.285714
green 1 14.285714
red 4 57.142857
yellow 1 14.285714
Total 7 100.000000
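One portability note on the header: the %#d directive (day of month without a leading zero) is Windows-specific; the Linux/macOS equivalent is %-d:
from datetime import datetime

datetime.today().strftime('%#d-%b')  # Windows
datetime.today().strftime('%-d-%b')  # Linux / macOS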

How to access nested data in a pandas dataframe?

Here's an example of the data I'm working with:
   values                                               variable.variableName  timeZone
0  [{'value': [], 'qualifier': [], 'qualityControlLevel': [],
    'method': [{'methodDescription': '[TS087: YSI 6136]', 'methodID': 15009}],
    'source': [], 'offset': [], 'sample': [], 'censorCode': []},
   {'value': [{'value': '17.2', 'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:30:00.000-08:00'},
              {'value': '17.5', 'qualifiers': ['P'],
               'dateTime': '2022-01-05T14:00:00.000-08:00'}]}]    turbidity    PST
1  [{'value': [{'value': '9.3', 'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:30:00.000-08:00'},
              {'value': '9.4', 'qualifiers': ['P'],
               'dateTime': '2022-01-05T12:45:00.000-08:00'}]}]    degC         PST
I'm trying to break each of the variables in the data out into its own dataframe. What I have so far works; however, if there are multiple sets of values (like in turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd
url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all or check to see if the data frame is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
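A minimal sketch of how those steps could fit together, assuming every non-empty 'value' entry is a list of reading dicts (the variable names are dropped here for brevity; they can be carried along by joining on the exploded index):
# one row per entry in the outer 'values' list
exploded = df.explode('values').reset_index(drop=True)
# dict fields of each entry become columns ('value', 'qualifier', 'method', ...)
entries = pd.json_normalize(exploded['values'].tolist())
# keep only entries whose inner 'value' list is non-empty, then explode the readings
readings = entries[entries['value'].str.len() > 0].explode('value')
# expand each reading dict ({'value', 'qualifiers', 'dateTime'}) into columns
result = pd.json_normalize(readings['value'].tolist())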
This JSON is deeply nested, so I think it takes a few steps to transform into what you want.
# First, use json_normalize on the top level to extract values and variableName.
df = pd.json_normalize(result, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode 'value' to flatten the array and filter out any empty entries.
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value extracts value, qualifiers and dateTime,
# concatenated with variableName; explode('qualifiers') unwraps the wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
                pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this:
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes, extract them by filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]

Group By Distinct in Pandas

I have a script like this in pandas:
dfmi['Time'] = pd.to_datetime(dfmi['Time'], format='%H:%M:%S')
dfmi['hours'] = dfmi['Time'].dt.hour
sum_dh = dfmi.groupby(['Date','hours']).agg({'Amount': 'sum', 'Price':'sum'})
dfdhsum = pd.DataFrame(sum_dh)
dfdhsum.columns = ['Amount', 'Gas Sales']
dfdhsum
And the output:
I want a distinct group-by sum, and the final result should look like this:
What is the pandas code for this?
I don't understand exactly what you want, but this instruction will sum hours, amount and gas sales for each date:
dfmi.groupby('Date').agg({'hours': 'sum', 'Amount': 'sum', 'Gas Sales': 'sum'})
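If "distinct" means duplicated rows should only be counted once before summing, one hedged guess at the intent (the drop_duplicates subset here is an assumption):
distinct = dfmi.drop_duplicates(subset=['Date', 'hours', 'Amount', 'Price'])
sum_distinct = distinct.groupby('Date').agg({'Amount': 'sum', 'Price': 'sum'})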

12-month moving average in Python

df_close['MA'] = df_close.rolling(window=12).mean()
I keep getting this error; can anyone help?
ValueError: Wrong number of items passed 20, placement implies 1
My assignment:
Pull 20 years of monthly stock price and trading volume data for any 20 stocks of your pick from Yahoo Finance.
Calculate the monthly returns and 12-month moving average of each stock.
Other parts of the code:
import datetime as dt
import pandas_datareader.data as web  # imports assumed by the snippets below

start = dt.datetime(2000,1,1)
end = dt.datetime(2020,2,1)
df = web.DataReader(['AAPL', 'MSFT', 'ROKU', 'GS', 'GOOG', 'KO', 'ULTA', 'JNJ', 'ZM', 'AMZN', 'NFLX', 'TSLA', 'CMG', 'ATVI', 'LOW', 'BA', 'SYY', 'SNAP', 'BYND', 'OSTK'], 'yahoo', start, end)
df['Adj Close']
df['Volume']
data1 = df[['Adj Close', 'Volume']].copy()
data1['date1'] = data1.index
print(data1)
data2 = data1.merge(data1, left_on='date1', right_on='date1')
data2
df.sort_index()
df
df_monthly_returns = df['Adj Close'].ffill().pct_change()
df_monthly_returns.sort_index()
print(df_monthly_returns.tail())
df_close['MA'] = df_close.rolling(window=12).mean()
df_close
ValueError: Wrong number of items passed 20, placement implies 1 means that you are attempting to put too many elements into too little space.
df_close['MA'] = df_close.rolling(window=12).mean()
You are pushing 20 columns into a container that holds only one: df_close.rolling(window=12).mean() returns a DataFrame with one moving-average column per stock, which cannot be assigned to the single column df_close['MA'].
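A minimal sketch of the fix, assuming df_close is the 'Adj Close' frame with one column per ticker; keep the rolling result as its own DataFrame, or join it back with a suffix:
ma = df_close.rolling(window=12).mean()       # 12-month MA, one column per ticker
df_all = df_close.join(ma.add_suffix('_MA'))  # optional: side by side with the prices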
And please update your question with more code so that I can give you an exact solution.

Convert a dictionary of a list of dictionaries to pandas DataFrame

I pulled a list of historical option prices for AAPL with the Robinhood function robin_stocks.get_option_historicals(). The data was returned as a dictionary of a list of dictionaries, as shown below.
I am having difficulties to convert the below object (named historicalData) into a DataFrame. Can someone please help?
historicalData = {'data_points': [{'begins_at': '2020-10-05T13:30:00Z',
                                   'open_price': '1.430000',
                                   'close_price': '1.430000',
                                   'high_price': '1.430000',
                                   'low_price': '1.430000',
                                   'volume': 0,
                                   'session': 'reg',
                                   'interpolated': False},
                                  {'begins_at': '2020-10-05T13:40:00Z',
                                   'open_price': '1.430000',
                                   'close_price': '1.340000',
                                   'high_price': '1.440000',
                                   'low_price': '1.320000',
                                   'volume': 0,
                                   'session': 'reg',
                                   'interpolated': False}],
                  'open_time': '0001-01-01T00:00:00Z',
                  'open_price': '0.000000',
                  'previous_close_time': '0001-01-01T00:00:00Z',
                  'previous_close_price': '0.000000',
                  'interval': '10minute',
                  'span': 'week',
                  'bounds': 'regular',
                  'id': '22b49380-8c50-4c76-8fb1-a4d06058f91e',
                  'instrument': 'https://api.robinhood.com/options/instruments/22b49380-8c50-4c76-8fb1-a4d06058f91e/'}
I tried the below code, but that didn't help:
import pandas as pd
df = pd.DataFrame(historicalData)
df
You didn't write that you want only data_points (as in the other answer), so I assume that you want your whole dictionary converted to a DataFrame.
To do it, start with your code:
df = pd.DataFrame(historicalData)
It creates a DataFrame with data_points "exploded" into consecutive rows, but they are still dictionaries.
Then rename the open_price column to open_price_all:
df.rename(columns={'open_price': 'open_price_all'}, inplace=True)
The reason is to avoid duplicated column names after the join to be performed soon (data_points also contains an open_price attribute, and I want the corresponding column from data_points to "inherit" this name).
The next step is to create a temporary DataFrame, splitting the dictionaries in data_points into individual columns:
wrk = df.data_points.apply(pd.Series)
Print wrk to see the result.
And the last step is to join df with wrk and drop the data_points column (not needed any more, since it was split into separate columns):
result = df.join(wrk).drop(columns=['data_points'])
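For comparison, a hedged one-step alternative using pd.json_normalize with record_path and meta (the meta list below names only a few of the top-level keys; adjust as needed):
flat = pd.json_normalize(
    historicalData,
    record_path='data_points',                  # one row per candle
    meta=['interval', 'span', 'bounds', 'id'],  # top-level fields to carry along
    meta_prefix='meta_'                         # avoid clashes with record columns
)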
This is pretty easy to solve with the below. I have converted each data-point dictionary into its own dataframe via a list comprehension:
import pandas as pd
df_list = [pd.DataFrame(dic.items(), columns=['Parameters', 'Value']) for dic in historicalData['data_points']]
You could then do:
df_list[0]
which will yield
Parameters Value
0 begins_at 2020-10-05T13:30:00Z
1 open_price 1.430000
2 close_price 1.430000
3 high_price 1.430000
4 low_price 1.430000
5 volume 0
6 session reg
7 interpolated False
