I have a pandas DataFrame with an index using UTC time and a column with data (in the example, the column "value_1").
My question is: how could I create a new column in which each value is the value of the first column, but 20 seconds later? Using the example below, the first value of this second column would be the value at "2011-01-01 00:00:20".
import pandas as pd
import numpy as np
data_1 = pd.DataFrame(index=pd.date_range('1/1/2011', periods=1000, freq='S'))
data_1['value_1'] = 100 + np.random.randint(0, 1000, size=(1000, 1))
data_1['value_2'] = ??¿¿
I don't know if it would be possible if I changed the index to a different format.
I have seen that pandas has some useful functionality for working with time series, but I have not yet found the piece that solves this problem.
Thank you in advance.
You can either use shift with the number of rows corresponding to the seconds you want (here 20):
data_1['value_2'] = data_1['value_1'].shift(-20)
or reindex with the index + 20s and take the values with to_numpy:
data_1['value_2'] = data_1['value_1'].reindex(data_1['value_1'].index + pd.Timedelta(seconds=20)).to_numpy()
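Note that shift(-20) moves by 20 rows, which equals 20 seconds only because the index here has a fixed 1-second frequency. If the index can have gaps, a time-based shift keeps the 20-second meaning; a minimal sketch on the same data_1:
# shift the index back by 20 seconds; the assignment then realigns on the index,
# so the value originally stamped at t + 20s lands on row t (missing rows give NaN)
data_1['value_2'] = data_1['value_1'].shift(-20, freq='s')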
I'm not sure if I'm going about this the right way. However, I'm trying to create a multi-index from three fields, which will eventually become three separate joins based on input data.
In the example below I'm creating a second data frame, which I want to use as a date lookup table that can be joined to the expiration date.
When I try to create the multi-index I get the following error:
ValueError: Length of names must match number of levels in MultiIndex.
I've tried turning the dataframe data_index into a series, but I'm still getting the error.
import pandas as pd
import numpy as np
import requests
from datetime import date
raw_data = requests.get("https://cdn.cboe.com/api/global/delayed_quotes/options/_SPX.json")
dict_data = pd.DataFrame.from_dict(raw_data.json())
#create dataframe from options key
data = pd.DataFrame(dict_data.loc["options", "data"])
# regex to strip the contract type, strike, and expire date
data["contract_type"] = data.option.str.extract(r"\d([A-Z])\d")
data["strike_price"] = data.option.str.extract(r"\d[A-Z](\d+)\d\d\d").astype(int)
data["expiration_date"] = str(20) + data.option.str.extract(r"[A-Z](\d+)").astype(str)
# Convert expiration to datetime format
data["expiration_date"] = pd.to_datetime(data["expiration_date"], format="%Y-%m-%d")
data_index = pd.MultiIndex.from_frame(data, names=[["expiration_date", "strike_price", "contract_type"]]) ## this is where the error occurs
print(data_index)
date_df = pd.DataFrame({"expiration_date": data["expiration_date"].unique()})
date_df.set_index('expiration_date').join(data_index.set_index('expiration_date'))
today_date = date.today().strftime("%Y-%m-%d")
days_till_expire = np.busday_count(today_date, date_df["expiration_date"].dt.strftime("%Y-%m-%d").to_list()) / 252
date_df["days_till_expire"] = days_till_expire
date_df.loc[date_df["days_till_expire"].eq(0), "days_till_expire"] = 1 / 252
print(date_df)
How can I get the multi-index to work with a join on a single index, joining date_df into the multi-index dataframe? I believe the logic above should work once the multi-index is set up correctly.
It's probably too long to put into a comment, so here it is in the form of an answer:
I suppose that what you wanted to achieve is this (building the index not from the entire frame, but from the three columns of it):
data_index = pd.MultiIndex.from_arrays(
    [data["expiration_date"], data["strike_price"], data["contract_type"]],
    names=["expiration_date", "strike_price", "contract_type"])
For how to join on a single index check out:
How to join a multi-index series to a single index dataframe with Pandas?
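For instance, a minimal sketch (assuming date_df already holds one row per expiration_date): once data carries the three-level index, a frame indexed by expiration_date alone joins on the matching level name automatically:
data = data.set_index(["expiration_date", "strike_price", "contract_type"])
joined = data.join(date_df.set_index("expiration_date"))  # matches on the shared level name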
Does that help?
I have two Excel files. The first one holds the dependent variable: data by date and station ID, with the date as the index of the dataframe and the station IDs as headers, as shown below.
The second one (the independent variable) holds the data I use to simulate the dependent variable (the first file above). It also carries the date, formatted as one column for the year and two more for the month and day respectively, as shown in the image below.
What I want is to:
1. skip the NaN values in the first file;
2. add the values from the first table to the second file, matched on the same date and the same water monitoring station ID.
This is the code I have written so far; I am new to Python and have been struggling for days.
import numpy as np
import pandas as pd
# first, exclude the rows that don't have a value
# read the csv file
csvB4reflectance = pd.read_csv('GEEdownload.csv')
b4 = pd.read_csv('GEEdownload.csv', sep=',', parse_dates=['system:time_start'])
b4.set_index('system:time_start', inplace=True)  # set the index and convert its type, so rows can be dropped
print(csvB4reflectance)
path = 'F:/72hourtimewindow/project/waterqualitydate/29UMT/'
excelorder = pd.read_excel(path + 'Stationwithorder.xls', header=0, index_col=0)
print(excelorder)
b41 = b4.dropna(axis=0, how='all')
print(b41)
# process this table, start to calculate when data in the form is not NaN
b41num = b41.to_numpy()
print(b41num)
# import the excel order
for i in b41num:
    for j in i:
        if np.isnan(j):  # note: `j == NaN` never works, since NaN compares unequal to everything
            break
        else:
            # ... (incomplete: this is where I am stuck)
            print(j)
I have found the solution to this problem: the second table needs to be melted first, and then everything becomes easy.
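For reference, a minimal sketch of that melt-then-merge idea (every file and column name here is an assumption, not the real one): the wide table with station IDs as headers is melted into long format, NaNs are dropped, and the result is merged on date and station ID.
import pandas as pd

# assumed layout: the station file has a 'Date' column plus one column per station
wide = pd.read_excel('stations.xls')
long = wide.melt(id_vars='Date', var_name='station', value_name='measured')
long = long.dropna(subset=['measured'])  # 1. skip the NaN values

# assumed layout: the other file has Year/Month/Day columns plus a 'station' column
b4 = pd.read_csv('GEEdownload.csv')
b4['Date'] = pd.to_datetime(dict(year=b4['Year'], month=b4['Month'], day=b4['Day']))

# 2. attach the measured values by matching date and station ID
merged = b4.merge(long, on=['Date', 'station'])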
I want to find all the local minima and maxima in a column and save the whole row in a new dataframe.
See the example code below. I know pandas has groupby and the like.
How do I do this properly and create the cycle counter, which should increase by 1 each cycle? Lastly, take only the time of each minimum and save it.
import pandas as pd
import numpy as np
l = list(np.linspace(0,10,12))
data = [('time',l),
('A',[0,5,0.6,-4.8,-0.3,4.9,0.2,-4.7,0.5,5,0.1,-4.6]),
('B',[ 0,300,20,-280,-25,290,30,-270,40,300,-10,-260]),
]
df = pd.DataFrame.from_dict(dict(data))
print(df)
data_1 = [('cycle',[1,2,3]),
('delta_time',[2.727273,6.363636 ,10.000000]),
('A_max',[5,4.9,5]),
('A_min',[-4.8,-4.7,-4.6]),
('B_min',[-280,-270,-260]),
('B_max',[300,290,300]),
]
df_1 = pd.DataFrame.from_dict(dict(data_1))
print(df_1)
Any help is much appreciated.
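One possible approach, as a sketch continuing from the df built above: mark each local maximum of A with scipy.signal.argrelextrema, start a new cycle there, and aggregate per cycle. This assumes scipy is available, that each cycle begins at a local maximum of A, and that delta_time is the time stamp of A's minimum within the cycle.
from scipy.signal import argrelextrema

a = df['A'].to_numpy()
max_idx = argrelextrema(a, np.greater)[0]  # interior local maxima of A: here rows 1, 5, 9

# a new cycle starts at each local maximum of A
df['cycle'] = 0
df.loc[max_idx, 'cycle'] = 1
df['cycle'] = df['cycle'].cumsum()
cycles = df[df['cycle'] > 0]  # drop the lead-in rows before the first cycle

df_1 = cycles.groupby('cycle').agg(
    A_max=('A', 'max'), A_min=('A', 'min'),
    B_min=('B', 'min'), B_max=('B', 'max'))
# delta_time: the time at which A reaches its minimum within each cycle
df_1.insert(0, 'delta_time',
            cycles.loc[cycles.groupby('cycle')['A'].idxmin(), 'time'].to_numpy())
print(df_1.reset_index())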
My issue is very simple, but I just can't wrap my head around it:
I have two dataframes:
A time series dataframe with two columns: Timestamp and DataValue
A time interval dataframe with start, end timestamps and a label
What I want to do:
Add a third column to the timeseries that yields the labels according to the time interval dataframe.
Every timepoint needs to have an assigned label designated by the time interval dataframe.
This code works:
TimeSeries_labelled = TimeSeries.copy(deep=True)
TimeSeries_labelled["State"] = 0
for index in Timeintervals_States.index:
    for entry in TimeSeries_labelled.index:
        if Timeintervals_States.loc[index, "start"] <= TimeSeries_labelled.loc[entry, "Timestamp"] <= Timeintervals_States.loc[index, "end"]:
            TimeSeries_labelled.loc[entry, "State"] = Timeintervals_States.loc[index, "state"]
But it is really slow. I tried to make it shorter and faster with Python's built-in filtering functions, but failed miserably.
Please help!
I don't really know about TimeSeries, but with a dataframe containing timestamps as datetime objects you could use something like the following:
import pandas as pd

# create the third column in the target dataframe
df_timeseries['label'] = pd.Series('', index=df_timeseries.index)

# loop over the dataframe containing the start and end timestamps
for index, row in df_start_end.iterrows():
    # create a boolean mask to filter the data
    mask = (df_timeseries['timestamp'] > row['start']) & (df_timeseries['timestamp'] < row['end'])
    df_timeseries.loc[mask, 'label'] = row['label']
For each row of the dataframe containing the start and end timestamps, this assigns that row's label to the rows of your time series dataframe that match the mask.
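If the loop is still too slow, a vectorised alternative, as a sketch: build a pandas IntervalIndex from the start/end columns and look all timestamps up at once. This assumes the intervals don't overlap and that every timestamp falls inside one of them (otherwise get_indexer returns -1, which would need handling).
intervals = pd.IntervalIndex.from_arrays(
    df_start_end['start'], df_start_end['end'], closed='both')
# for each timestamp, the positional index of the interval that contains it
pos = intervals.get_indexer(df_timeseries['timestamp'])
df_timeseries['label'] = df_start_end['label'].to_numpy()[pos]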
I have a large dataframe (around 35k entries). The index of this dataframe is composed of dates (like 2014-02-12), and these dates are not unique. What I need to do is find the max value for each date and create a new dataframe with it. I created a solution that works (below), but it takes a lot of time to process. Does anyone know a faster way to do this? Thank you.
# create an empty dataframe
dataset0514maxrec = pd.DataFrame(columns=dataset0514max.columns.values)
dataset0514maxrec.index.name = 'Date'

# get the unique values, find the groups, recover the max value and append it
for i in dataset0514max.index.unique():
    tempDF1 = dataset0514max.loc[dataset0514max.index.isin([i])]
    tempDF2 = tempDF1[tempDF1['Data_Value'] == tempDF1['Data_Value'].max()]
    dataset0514maxrec = dataset0514maxrec.append(tempDF2.head(1))
print(dataset0514maxrec)
groupby with levels
df.groupby(level=0).Data_Value.max().reset_index()
The next two options require the index to be a datetime index. If it isn't, convert it:
df.index = pd.to_datetime(df.index)
resample
df.resample('D').max()
sort_values + duplicated
df = df.sort_values('Data_Value')
m = ~df.index.duplicated(keep='last')  # keep the last (largest) row per date
df = df[m]
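A tiny runnable demo of that last option (a sketch with made-up data), showing that the whole row of each date's maximum survives:
import pandas as pd

df = pd.DataFrame(
    {'Data_Value': [3, 7, 5, 2], 'Station': ['a', 'b', 'c', 'd']},
    index=pd.to_datetime(['2014-02-12', '2014-02-12', '2014-02-13', '2014-02-13']))
df = df.sort_values('Data_Value')
print(df[~df.index.duplicated(keep='last')])
# keeps row (7, 'b') for 2014-02-12 and (5, 'c') for 2014-02-13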