How to generate one value each minute out of irregular data? - python

I have values that are measured per event, so there is not the same amount of data every minute. To handle this data more easily, I want to keep only the first row of values for each minute.
The time of the data I import from a csv looks like this:
time
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:11
11.11.2011 11:12
11.11.2011 11:12
11.11.2011 11:13
The other values are Temperatures.
One main problem is to import the time in the right format.
I tried to solve this with the help of this community like this:
import datetime as dt

with open('my_file.csv', 'r') as file:
    for line in file:
        try:
            time = line.split(';')[0]  # splits the line at the semicolon and takes the first part
            time = dt.datetime.strptime(time, '%d.%m.%Y %H:%M')
            print(time)
        except ValueError:
            pass
Then I imported the columns of the temperatures and joined them like this:
df = pd.read_csv("my_file.csv", sep=';', encoding='latin-1')
df=df[["time", "T1", "T2", "DT1", "DT2"]]
When I printed the dtypes of my data, the time was datetime64[ns] and the others were objects.
I tried different options of groupby and resample. Like the following:
df=df.groupby([pd.Grouper(key = 'time', freq='1min')])
df.resample('M')
One main problem stated in the error messages was that the datatype of the time column was not appropriate for grouping, because it is not a DatetimeIndex.
So I tried to convert the dates to a DatetimeIndex like this:
df.index = pd.to_datetime(df["time"].index, format='%Y-%m-%d %H:%M:%S')
but then I received an enumeration of the index starting with 1970-01-01, so I am not quite sure if this conversion is possible with irregular data.
Without this conversion I also get the message <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000026938A74850>
When I then try to call my dataframe, that message shows, and when saving it to csv like this:
df.to_csv('04_01_DTempminuten.csv', index=False, encoding='utf-8', sep=';', date_format='%Y-%m-%d %H:%M:%S')
I receive either the same message or only one line with a decimal number instead of the time.
Does anyone have an idea how to deal with this irregular data to get one line of values each minute?
Thank you for reading my question. I am really thankful for any ideas.

Without sample data I can only show how I handle irregular time series, which I think is your case. I work with price data that comes at irregular time intervals. If you need to sample by taking the first value in each minute, you can use resample with a specific interval and the ohlc aggregation function, which gives you four columns for each sample interval:
open: first value in the interval
high: highest value in the interval
low: lowest value in the interval
close: last value in the interval
In your case the sampling interval would be 1 minute ('T').
In the following example I'm using one second ('S') as the resampling frequency, to resample the ask column (your temperature column):
import pandas as pd
df = pd.read_csv('my_tick_data.csv')
df['date_time'] = pd.to_datetime(df['date_time'])
df.set_index('date_time', inplace=True)
df.head(6)
df['ask'].resample('S').ohlc()
This does not solve your date issue, which is a prerequisite for this part, because the data set needs to be indexed by date. If you can provide sample data, maybe I can help you with that part too.
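Since the question only asks for the first value per minute, resample with first() instead of ohlc() also works. A minimal, self-contained sketch with made-up numbers matching the question's CSV layout (column names T1 etc. are assumptions):

```python
import io
import pandas as pd

# Hypothetical semicolon-separated CSV in the question's layout.
csv = io.StringIO(
    "time;T1\n"
    "11.11.2011 11:11;20.1\n"
    "11.11.2011 11:11;20.3\n"
    "11.11.2011 11:12;20.5\n"
    "11.11.2011 11:12;20.6\n"
    "11.11.2011 11:13;20.9\n"
)
df = pd.read_csv(csv, sep=';')
df['time'] = pd.to_datetime(df['time'], format='%d.%m.%Y %H:%M')
df = df.set_index('time')

# One row per minute: the first observation in each 1-minute bin.
per_minute = df.resample('min').first()
print(per_minute)
```

The key point is that the time column must be parsed with to_datetime and set as the index before resample will accept the frame.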

Related

How do I convert unusual time string into date time

I measured the seeing index and I need to plot it as a function of time, but the time I received from the measurement is a string in the format 02-09-2022_time_11-53-51,045. How can I convert it into something Python can read so I can use it in my plot?
Using pandas I extracted the time and seeing_index columns from the txt file produced by the measurement. Python correctly plotted the seeing index values on the Y axis, but instead of plotting the time values on the X axis, it just numbered each row and plotted the index against the row number. What can I do so that it is plotted against time?
You may try this:
df.time = pd.to_datetime(df.time, format='%d-%m-%Y_time_%H-%M-%S,%f')
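A quick sketch of that format string applied to the sample timestamp from the question (the Series here stands in for the real time column):

```python
import pandas as pd

# One timestamp in the question's format: 02-09-2022_time_11-53-51,045
s = pd.Series(['02-09-2022_time_11-53-51,045'])
parsed = pd.to_datetime(s, format='%d-%m-%Y_time_%H-%M-%S,%f')
print(parsed.iloc[0])
```

The %f directive absorbs the comma-separated milliseconds; once the column is datetime64, matplotlib will place the points on a proper time axis.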

Changing Period to Datetime

My goal is to convert period to datetime.
If Life Was Easy:
master_df = master_df['Month'].to_datetime()
Back Story:
I built a new dataFrame that originally summed the monthly totals and made a 'Month' column by converting a timestamp to period. Now I want to convert that time period back to a timestamp so that I can create plots using matplotlib.
I have tried following:
Reading the docs for Period.to_timestamp.
Converting to a string and then back to datetime. Still keeps the period issue and won't convert.
Following a couple similar questions in Stackoverflow but could not seem to get it to work.
A simple goal would be to plot the following:
plot.bar(m_totals['Month'], m_totals['Showroom Visits']);
This is the error I get if I try to use a period dtype in my charts
ValueError: view limit minimum 0.0 is less than 1 and is an invalid Matplotlib date value.
This often happens if you pass a non-datetime value to an axis that has datetime units.
Additional Material:
Code I used to create the Month column (where period issue was created):
master_df['Month'] = master_df['Entry Date'].dt.to_period('M')
Codes I used to group to monthly totals:
m_sums = master_df.groupby(['DealerName','Month']).sum().drop(columns={'Avg. Response Time','Closing Percent'})
m_means = master_df.groupby(['DealerName','Month']).mean()
m_means = m_means[['Avg. Response Time','Closing Percent']]
m_totals = m_sums.join(m_means)
m_totals.reset_index(inplace=True)
m_totals
I was able to cast the period type to string and then to datetime; I just could not go straight from period to datetime.
m_totals['Month'] = m_totals['Month'].astype(str)
m_totals['Month'] = pd.to_datetime(m_totals['Month'])
m_totals.dtypes
I wish I did not get downvoted for not providing the entire dataFrame.
First change it to str, then to date:
index=pd.period_range(start='1949-01',periods=144 ,freq='M')
type(index)
#changing period to date
index=index.astype(str)
index=pd.to_datetime(index)
df.set_index(index,inplace=True)
type(df.index)
df.info()
Another potential solution is to use to_timestamp. For example: m_totals['Month'] = m_totals['Month'].dt.to_timestamp()
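Both routes can be seen side by side in a minimal sketch (the 2019 months are made-up data):

```python
import pandas as pd

# A period column like the question's 'Month' column.
df = pd.DataFrame({'Month': pd.period_range('2019-01', periods=3, freq='M')})

# Route 1: cast to string, then parse back to datetime.
via_str = pd.to_datetime(df['Month'].astype(str))

# Route 2: convert directly with .dt.to_timestamp().
via_ts = df['Month'].dt.to_timestamp()

print(via_ts)
```

Both produce identical datetime64 values (the first day of each month), so to_timestamp saves the string round-trip.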

Python - select certain time range pandas

Python newbie here, but I have some intra-day financial data going back to 2012, so it has the same hours each day (the same trading session) but different dates. I want to be able to select certain times out of the data, check the corresponding OHLC data for that period, and then do some analysis on it.
So at the moment it's a CSV file, and I'm doing:
import pandas as pd
data = pd.read_csv('data.csv')
date = data['date']
op = data['open']
high = data['high']
low = data['low']
close = data['close']
volume = data['volume']
The thing is that the date column is in the format "dd/mm/yyyy 00:00:00" as one string, so is it possible to still select between certain times, like between "09:00:00" and "10:00:00"? Or do I have to separate the time from the date and make it its own column? If so, how?
I believe pandas has a between_time() function, but that seems to need a DataFrame, so how can I convert the data to a DataFrame so I can use between_time to select between the times I want? Also, because there are obviously thousands of days, each with their own "xx:xx:xx" to "xx:xx:xx", I want to pull that same time period from every day, not just the first "xx:xx:xx" to "xx:xx:xx" as it makes its way down the data, if that makes sense. Thanks!!
Consider the dataframe df
from pandas_datareader import data
df = data.get_data_yahoo('AAPL', start='2016-08-01', end='2016-08-03')
df = df.asfreq('H').ffill()
option 1
convert index to series then dt.hour.isin
slc = df.index.to_series().dt.hour.isin([9, 10])
df.loc[slc]
option 2
numpy broadcasting
slc = (df.index.hour[:, None] == [9, 10]).any(1)
df.loc[slc]
response to comment
To then get a range within that time slot per day, use resample + agg + np.ptp (peak to peak):
import numpy as np
df.loc[slc].resample('D').agg(np.ptp)
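The between_time method the asker mentions is also a direct fit here: it keeps a clock-time window from every day in a DatetimeIndex. A sketch on synthetic hourly data (the dates and values are made up):

```python
import numpy as np
import pandas as pd

# Synthetic intra-day data: 30 hourly rows spanning two days.
idx = pd.date_range('2016-08-01 08:00', periods=30, freq='h')
df = pd.DataFrame({'close': np.arange(30.0)}, index=idx)

# between_time keeps the 09:00-10:00 slot from every day in the index.
morning = df.between_time('09:00', '10:00')
print(morning)
```

No conversion to a separate time column is needed; the method works on any DataFrame or Series with a DatetimeIndex.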

Removing nth row in Pandas

I have a Pandas df with a time series that increments by 34 milliseconds, and I only need 5-second resolution. I created a time stamp and tried both setting the time stamp as an index with resample and using .iloc.
# Defining file path
file = "C:/file/path/data.csv"
# Read in data and parse date/time to DateTime format
data = pd.read_csv(file,header=10,parse_dates=[[0,1]],dayfirst=False)
# time stamp in preferred format
data['date_stamp'] = pd.to_datetime(data['Date_ Time'],dayfirst=False)
#trying to get every 5 seconds, not 34 milliseconds
data.iloc[::15,:]
# saving new file to csv
data.to_csv("C:/file/path/data.csv", date_format='%Y%m%d %H:%M:%S')
Would it be best to use a time index and resample? This code always returns the same data in the df. What's the best way to condense this data into 5-second intervals?
I think you can use resample with first:
data.set_index('date_stamp', inplace=True)
print (data.resample('5S').first())
See docs
If you use an older pandas such as 0.18.0:
print (data.resample('5S', how='first'))
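End to end, on synthetic 34 ms ticks (the values are made up), the resample approach looks like this:

```python
import pandas as pd

# Synthetic ticks every 34 milliseconds, covering about ten seconds.
idx = pd.date_range('2021-01-01 00:00:00', periods=300, freq='34ms')
data = pd.DataFrame({'value': range(300)}, index=idx)

# First observation of every 5-second bin.
five_sec = data.resample('5s').first()
print(five_sec)
```

Unlike .iloc[::15], which assumes perfectly regular spacing, resample bins by actual clock time, so gaps or jitter in the ticks do not shift the result.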

Selecting data for one hour in a timeseries dataframe

I'm having trouble selecting data in a dataframe dependent on an hour.
I have a month's worth of data that increases in 10-minute intervals.
I would like to be able to select the data (creating another dataframe) for each hour of a specific day. However, I am having trouble creating an expression.
This is how I did it to select the day:
x = all_data.resample('D').index
for day in range(20):
    c = x.day[day]
    d = x.month[day]
    print data['%(a)s-%(b)s-2009' % {'a': c, 'b': d}]
but if I do it for hour, it will not work.
x = data['04-09-2009'].resample('H').index
for hour in range(8):
    daydata = data['4-9-2009 %(a)s' % {'a': x.hour[hour]}]
I get the error:
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named 4-9-2009 0'
which is true, as it is in the format dd/mm/yyyy hh:mm:ss.
I'm sure this should be easy and something to do with resample. The trouble is I don't want to do anything with the data, just select the data frame (to correlate it afterwards).
Cheers
You don't need to resample your data unless you want to aggregate into a daily value (e.g., sum, max, median)
If you just want a specific day's worth of data, you can use the following example of the .loc attribute to get started:
import numpy
import pandas
N = 3700
data = numpy.random.normal(size=N)
time = pandas.date_range(start='2013-02-15 14:30', periods=N, freq='10T')
ts = pandas.Series(data=data, index=time)
ts.loc['2013-02-16']
The great thing about using .loc on a time series is that you can be a general or specific as you want with the dates. So for a particular hour, you'd say:
ts.loc['2013-02-16 13'] # notice that i didn't put any minutes in there
Similarly, you can pull out a whole month with:
ts.loc['2013-02']
The issue you're having with the string formatting is that you're manually padding the string with a 0. So if you have a 2-digit hour (i.e., in the afternoon), you end up with a 3-digit representation of the hour (and that's not valid). So if I wanted to loop through a specific set of hours, I would do:
hours = [2, 7, 12, 22]
for hr in hours:
    print(ts.loc['2013-02-16 {0:02d}'.format(hr)])
The 02d format string tells Python to construct a string from an integer that is at least two characters wide and to pad it with a 0 on the left side if necessary. Also, you probably need to format your date as YYYY-mm-dd instead of the other way around.
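Putting the pieces of the answer above together (random values, so only the row counts are meaningful):

```python
import numpy as np
import pandas as pd

# 10-minute data as in the example above.
time = pd.date_range(start='2013-02-15 14:30', periods=3700, freq='10min')
ts = pd.Series(np.random.normal(size=3700), index=time)

# Partial-string indexing: one full day, then a single zero-padded hour.
day = ts.loc['2013-02-16']      # all 144 ten-minute rows of that day
hour = ts.loc['2013-02-16 13']  # the six rows between 13:00 and 13:50
print(len(day), len(hour))
```

The selection string can be as coarse or fine as needed: a year, a month, a day, or a day plus hour all work the same way.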
