I'm currently working in a Jupyter Notebook on a dataset that has a duration column that looks like this:
I still feel like a newbie at programming, so I'm not sure how to convert this data so it can be visualized in graphs in Jupyter. Right now it's all just strings in the column.
Does anyone know how to do this right?
Thank you!
Assuming each time in your data is a string, and assuming the formats are all as shown, you could use a parser after a little massaging of the data:
from dateutil import parser

s = "1 hour 35 mins"
print(s)

# dateutil recognizes 'minute'/'minutes' but not 'min'/'mins'
s = s.replace('min', 'minute')  # '35 mins' -> '35 minutes'
time = parser.parse(s).time()
print(time)  # -> 01:35:00
This is somewhat less flexible than the answer from #Jimpsoni, which captures the two numbers, but it will work on your data and on variations such as "1h 35m". If your data is in a list you can loop through it; if it is in a Pandas series you could write a function and use .apply to convert the values in the series, as sketched below.
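For instance, a minimal sketch of the .apply approach, assuming a DataFrame with a column named 'duration' (the column name is an assumption; the question does not give one):

import pandas as pd
from dateutil import parser

df = pd.DataFrame({'duration': ['1 hour 35 mins', '45 mins', '2 hours 5 mins']})

def to_time(s):
    # normalize 'min(s)' to 'minute(s)' so dateutil recognizes the token
    return parser.parse(s.replace('min', 'minute')).time()

df['duration'] = df['duration'].apply(to_time)
print(df)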
You could loop through your dataset, extract the numbers from the strings, and then turn those numbers into timedelta objects. Here is an example using one value from your dataset.
from datetime import timedelta
import re

string = "1 hour 35 mins"  # Example from dataset

# Extract the numbers with a regex
numbers = list(map(int, re.findall(r'\d+', string)))

# Create a timedelta object from those numbers
if len(numbers) < 2:
    # a single number, e.g. "35 mins", is treated as minutes
    time = timedelta(minutes=numbers[0])
else:
    time = timedelta(hours=numbers[0], minutes=numbers[1])

print(time)  # -> prints 1:35:00
More about the timedelta object here.
The optimal way to loop through your dataset depends on what form your data is in, but the above shows how to handle one instance of the data; a sketch of applying it across a whole list follows below.
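For example, if the durations are in a plain Python list, a minimal sketch (the sample strings are hypothetical stand-ins for the dataset):

from datetime import timedelta
import re

def to_timedelta(string):
    numbers = list(map(int, re.findall(r'\d+', string)))
    if len(numbers) < 2:
        return timedelta(minutes=numbers[0])
    return timedelta(hours=numbers[0], minutes=numbers[1])

durations = ["1 hour 35 mins", "45 mins"]  # hypothetical sample data
deltas = [to_timedelta(s) for s in durations]
print(deltas)  # -> [datetime.timedelta(seconds=5700), datetime.timedelta(seconds=2700)]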
This is kind of a mixture between these two questions:
Pandas is a Timestamp within a Period (because it adds a time period in pandas)
Generate a random date between two other dates (but I need multiple dates; at least 1 million, which I specify with a variable LIMIT)
How can I generate random dates WITH random times within a given date period, for a specific given amount?
Performance is rather important for me, hence I chose to go with pandas; any performance boost is appreciated, even if that means using another library.
My approach so far would be the following:
import pandas as pd

tstamp = pd.to_datetime(['01/01/2010', '2020-12-31'])
# ???
But I don't know how to randomize between the dates. I was thinking of using randint to draw a random Unix epoch time and then converting that, but it would slow things down A LOT.
You can try this, it is very fast:
import numpy as np

start = np.datetime64('2017-01-01')
end = np.datetime64('2018-01-01')
limit = 1000000

# one entry per day in [start, end)
delta = np.arange(start, end)
# draw `limit` random day indices (with replacement) and index into the range
indices = np.random.choice(len(delta), limit)
delta[indices]
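This gives random dates only, at day resolution. If you also need random times of day, as the question asks, one possible extension is to draw random second offsets over the whole span instead; a minimal sketch, assuming second resolution is enough:

import numpy as np

start = np.datetime64('2017-01-01')
end = np.datetime64('2018-01-01')
limit = 1000000

# total number of seconds in [start, end)
span = (end - start).astype('timedelta64[s]').astype(np.int64)
offsets = np.random.randint(0, span, limit)
random_datetimes = start.astype('datetime64[s]') + offsets.astype('timedelta64[s]')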
All I had to do was add str(fake.date_time_between(start_date='-10y', end_date='now')) to my Pandas DataFrame append logic. I'm not even sure the str() there is necessary.
P.S. you initialize it like this:
from faker import Faker
# initialize Faker
fake = Faker()
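For example, a minimal sketch that builds a list of random datetimes (LIMIT as in the question); note that Faker is generally much slower than the numpy approach above for millions of values:

LIMIT = 1000000
dates = [fake.date_time_between(start_date='-10y', end_date='now') for _ in range(LIMIT)]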
I have data in a pandas dataframe that is marked by timestamps as datetime objects. I would like to make a graph that treats the time as something fluid. My idea was to subtract the first timestamp from the others (shown here for the second entry):
xhertz_df.loc[1]['Dates']-xhertz_df.loc[0]['Dates']
to get the time passed since the first measurement. This gives 350 days 08:27:51 as a timedelta object. So far so good.
This might be a duplicate, but I have not found the solution here so far. Is there a way to quickly transform this object into a number of, e.g., minutes, seconds or hours? I know I could extract the individual days, hours and minutes and make a tedious calculation to get there. But is there an integrated way to just turn this object into what I want?
Something like
timedelta.tominutes
that gives it back as a float of minutes, would be great.
If all you want is a float representation, maybe as simple as:
float_index = pd.Index(xhertz_df['Dates'].values.astype(float))  # ['Dates'] selects the column; .loc['Dates'] would look up a row label
In Pandas, Timestamp and Timedelta columns are internally handled as numpy datetime64[ns], that is, an integer number of nanoseconds.
So it is trivial to convert a Timedelta to a number of minutes:

# the difference of two Timestamps is a Timedelta; .value is its nanosecond count
(xhertz_df.loc[1]['Dates'] - xhertz_df.loc[0]['Dates']).value / 60000000000.
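The same idea extends to a whole column at once; a minimal sketch, assuming a 'Dates' column of Timestamps as in the question:

# timedeltas relative to the first timestamp
deltas = xhertz_df['Dates'] - xhertz_df['Dates'].iloc[0]

# nanoseconds -> minutes
minutes = deltas.astype('int64') / 60000000000.

# equivalently, and more readable:
minutes = deltas.dt.total_seconds() / 60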
Here is a way to do so with timestamps: two examples of converting, and one of taking the difference.
import datetime as dt
import time
# current date and time
now = dt.datetime.now()
timestamp1 = dt.datetime.timestamp(now)
print("timestamp1 =", timestamp1)
time.sleep(4)
now = dt.datetime.now()
timestamp2 = dt.datetime.timestamp(now)
print("timestamp2 =", timestamp2)
# the difference between the two timestamps, in seconds
print(timestamp2 - timestamp1)
I have two lists of dates and times in the following format:
YYYY:DDD:HH:MM:SS:mmm
(mmm is milliseconds).
I am getting these times from another list using regular expressions.
for line in start_list:
    start_time = re.search(r'((\d\d\d\d):(\d*):(\d\d):(\d\d):(\d\d):(\d\d\d))', line)
for line in end_list:
    end_time = re.search(r'((\d\d\d\d):(\d*):(\d\d):(\d\d):(\d\d):(\d\d\d))', line)
I could do a for loop cycling through start_time, with a counter to keep track of the current line for end_time, but I'm not really sure of the best way to execute it. I can't seem to figure out how to cycle through each line in each list to calculate the time difference between them. Any help would be very greatly appreciated.
You can convert to datetime objects and then just subtract the first series from the second, e.g.:
import pandas as pd

start_list_pts = [date.replace(":", ".") for date in start_list]
end_list_pts = [date.replace(":", ".") for date in end_list]

start_datetimes = pd.to_datetime(start_list_pts, format='%Y.%j.%H.%M.%S.%f')
end_datetimes = pd.to_datetime(end_list_pts, format='%Y.%j.%H.%M.%S.%f')

dts = start_datetimes - end_datetimes
Your milliseconds were not written as a fraction of your seconds, which is why I changed the separators to points. You can of course also change only the last one!
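A minimal sketch of turning the resulting differences into plain numbers (the sample strings are hypothetical, in the question's YYYY:DDD:HH:MM:SS:mmm format):

import pandas as pd

start_list = ["2019:032:10:15:30:250"]
end_list = ["2019:032:10:16:05:750"]

start_pts = [d.replace(":", ".") for d in start_list]
end_pts = [d.replace(":", ".") for d in end_list]

start_dt = pd.to_datetime(start_pts, format='%Y.%j.%H.%M.%S.%f')
end_dt = pd.to_datetime(end_pts, format='%Y.%j.%H.%M.%S.%f')

# seconds elapsed between each pair
print((end_dt - start_dt).total_seconds())  # -> 35.5 for the sample pair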
I'm having trouble selecting data in a dataframe dependent on an hour.
I have a month's worth of data which increases in 10-minute intervals.
I would like to be able to select the data (creating another dataframe) for each hour of a specific day. However, I am having trouble creating an expression.
This is how I did it to select the day:
x=all_data.resample('D').index
for day in range(20):
c=x.day[day]
d=x.month[day]
print data['%(a)s-%(b)s-2009' %{'a':c, 'b':d} ]
but if I do it for the hour, it will not work:
x=data['04-09-2009'].resample('H').index
for hour in range(8):
daydata=data['4-9-2009 %(a)s' %{'a':x.hour[hour]}]
I get the error:
raise KeyError('no item named %s' % com.pprint_thing(item))
KeyError: u'no item named 4-9-2009 0'
which is true, as it is in the format dd/mm/yyyy hh:mm:ss.
I'm sure this should be easy and something to do with resample. The trouble is I don't want to do anything with the data, just select the dataframe (to correlate it afterwards).
Cheers
You don't need to resample your data unless you want to aggregate it into a daily value (e.g., sum, max, median).
If you just want a specific day's worth of data, you can use the following example of the .loc attribute to get started:
import numpy
import pandas

N = 3700
data = numpy.random.normal(size=N)
# a 10-minute-frequency index (pandas.date_range is the current idiom for this)
time = pandas.date_range(start='2013-02-15 14:30', periods=N, freq='10T')
ts = pandas.Series(data=data, index=time)
ts.loc['2013-02-16']
The great thing about using .loc on a time series is that you can be as general or as specific as you want with the dates. So for a particular hour, you'd say:
ts.loc['2013-02-16 13'] # notice that i didn't put any minutes in there
Similarly, you can pull out a whole month with:
ts.loc['2013-02']
The issue you're having with the string formatting is that you're manually padding the string with a 0. So if you have a 2-digit hour (i.e. in the afternoon) you end up with a 3-digit representation of the hour (and that's not valid). So if I wanted to loop through a specific set of hours, I would do:
hours = [2, 7, 12, 22]
for hr in hours:
    print(ts.loc['2013-02-16 {0:02d}'.format(hr)])
The 02d format string tells Python to construct a string from an integer that is at least two characters wide, padding the string with a 0 on the left side if necessary. Also, you probably need to format your date as YYYY-mm-dd instead of the other way around.
Is there some function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur in sparse data such as in finance and research. I am basically just trying to organize a huge amount of different-sized vectors indexed by days, i.e. time series. I am not sure how I should handle the days: mark the first day with 1 and the last day with N, or use Unix time, or how should that be done? I am also not sure whether the time series should be saved into a matrix so I could model them more easily, to calculate correlation matrices and such things. Is there any ready-made way to do such things?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import cycle, compress

seq = range(100000)
criteria = cycle([True]*10 + [False]*801)
list(compress(seq, criteria))
Now I have to change them into days, and then change the $\mathbb R$ into a $(\mathbb R, \mathbb R)$ tuple, so a map $V : \mathbb R \mapsto \mathbb R^{2}$ is still missing; investigating.
[Update]
Let's play! The code below solves the subproblem of creating some test data. Now we need to create arbitrary days and valuations there, to try it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem... it must, though, take holidays and weekends into account, so it may not be easy (not sure).
import itertools as i
import numpy

def createRandomData():
    samples = []
    for x in range(5):
        seq = range(5)
        criteria = i.cycle([True]*x + [False]*3)
        samples += [list(i.compress(seq, criteria))]
    return samples

def createNNtriangularMatrix(data):
    N = len(data)
    # pad each row with zeros so every row has length N
    return [aa + [0]*(N - len(aa)) for aa in data]

A = createNNtriangularMatrix(createRandomData())
print(numpy.array(A))
print(numpy.corrcoef(A))
I think you should figure out some way to mark the days you want to INCLUDE, and create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could make the days you do not want receive a null value (zero or None).
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
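A minimal sketch of that idea, assuming a business-day frequency is what you want (weekends excluded; holidays would need a custom calendar):

import pandas as pd

# a business-day index; 'B' skips Saturdays and Sundays
bdays = pd.date_range('2012-01-02', periods=10, freq='B')
df = pd.DataFrame({'value': range(10)}, index=bdays)
print(df)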
I think it depends on the scope of your problem; for a personal calendar, 'day' is good enough for indexing.
One's life is at most about 200 years, roughly 73,000 days, so you can simply calculate and record them all, maybe in a dict, e.g.
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to rewrite the __getitem__ method like this: day['09-05-2012']
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes, they will be handled by the model. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
If it is trading days you want, then you can use the pandas-datareader package to download the S&P 500 historical prices for the U.S. and use the index of dates as a mask for your data.
Answered on mobile, I'll add links and code later.
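A hedged sketch of the masking idea (the 'yahoo' data source and the series name are assumptions; that source may need workarounds in current pandas-datareader versions):

import pandas_datareader.data as web

# index of S&P 500 trading days over the period of interest
sp500 = web.DataReader('^GSPC', 'yahoo', '2010-01-01', '2012-12-31')
trading_days = sp500.index

# my_series is a placeholder for your day-indexed data
my_data_on_trading_days = my_series.reindex(trading_days)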