This is kind of a mixture between these two questions:
Pandas: is a Timestamp within a Period? (because it adds a time period in pandas)
Generate a random date between two other dates (but I need multiple dates, at least 1 million, which I specify with a variable LIMIT)
How can I generate a given number of random dates, each with a random time, within a given date period?
Performance is rather important for me, hence I chose to go with pandas; any performance boost is appreciated, even if that means using another library.
My approach so far would be the following:
import pandas as pd

tstamp = pd.to_datetime(['01/01/2010', '2020-12-31'])
# ???
But I don't know how to randomize between the dates. I was thinking of using randint for random unix epoch times and then converting those, but that would slow things down A LOT.
You can try this, it is very fast:
import numpy as np

start = np.datetime64('2017-01-01')
end = np.datetime64('2018-01-01')
limit = 1000000

# Every day in [start, end), then sample `limit` of them with replacement.
days = np.arange(start, end)
random_dates = days[np.random.choice(len(days), limit)]
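That gives day-resolution dates only. If you also need random times, one possible vectorized extension (a sketch, assuming second precision is enough) is to draw integer second offsets:

import numpy as np

start = np.datetime64('2017-01-01')
end = np.datetime64('2018-01-01')
limit = 1000000

# Window width in seconds, then uniform random second offsets from start.
span = (end - start).astype('timedelta64[s]').astype(np.int64)
offsets = np.random.randint(0, span, limit)
timestamps = start + offsets.astype('timedelta64[s]')

If you need a pandas object, pd.to_datetime(timestamps) turns the array into a DatetimeIndex.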
All I had to do was add str(fake.date_time_between(start_date='-10y', end_date='now')) to my Pandas DataFrame append logic. I'm not even sure the str() there is necessary.
P.S. you initialize it like this:
from faker import Faker
# initialize Faker
fake = Faker()
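For completeness, a minimal sketch of the whole thing (LIMIT is the question's variable, the column name 'ts' is made up, and building the list first avoids the slow per-row append):

import pandas as pd
from faker import Faker

fake = Faker()
LIMIT = 1000000

# Generate LIMIT random datetimes from the last ten years.
df = pd.DataFrame({'ts': [fake.date_time_between(start_date='-10y', end_date='now')
                          for _ in range(LIMIT)]})

Note that Faker pays Python-level overhead per call, so for a million rows it will be much slower than the NumPy approach above.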
I'm currently sitting in a Jupyter Notebook with a dataset that has a duration column that looks like this:
I still feel like a newbie at programming, so I'm not sure how to convert this data so it can be visualized in graphs in Jupyter. Right now it's all just strings in the column.
Does anyone know how I do this right?
Thank you!
Assuming each time in your data is a string, and assuming the formats are all as shown, you could use a parser after a little massaging of the data:
from dateutil import parser

s = "1 hour 35 mins"
print(s)

# dateutil does not understand "mins", so expand it to "minutes" first.
s = s.replace('min', 'minute')
time = parser.parse(s).time()
print(time)
This is somewhat less flexible than the answer from @Jimpsoni, which captures the two numbers, but it will work on your data and on variations such as "1h 35m". If your data is in a list, you can loop through it; if it is in a Pandas series, you could wrap this in a function and use .apply to convert the values, as sketched below.
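For the Pandas case, a minimal sketch (the sample strings here are made up):

import pandas as pd
from dateutil import parser

def to_time(s):
    # Normalise the unit name, then let dateutil parse the string.
    return parser.parse(s.replace('min', 'minute')).time()

durations = pd.Series(["1 hour 35 mins", "2 hour 5 mins"])
print(durations.apply(to_time))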
You could loop through your dataset, extract the numbers from the strings, and then turn those numbers into timedelta objects. Here is one example from your dataset.
from datetime import timedelta
import re
string = "1 hour 35 mins" # Example from dataset
# Extract numbers with regex
numbers = list(map(int, re.findall(r'\d+', string)))
# Create timedelta object from those numbers
if len(numbers) < 2:
    time = timedelta(minutes=numbers[0])
else:
    time = timedelta(hours=numbers[0], minutes=numbers[1])
print(time) # -> prints 1:35:00
More about the timedelta object here.
What the optimal way to loop through your dataset is really depends on what form your data is in, but the above is an example of how you would convert one instance of the data; a sketch for a whole list follows below.
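If your data is a plain list of strings, that loop (with a made-up sample list) could look like this:

from datetime import timedelta
import re

def to_timedelta(s):
    # Extract the numbers, then map them onto hours/minutes as above.
    numbers = list(map(int, re.findall(r'\d+', s)))
    if len(numbers) < 2:
        return timedelta(minutes=numbers[0])
    return timedelta(hours=numbers[0], minutes=numbers[1])

durations = ["1 hour 35 mins", "50 mins"]
print([to_timedelta(s) for s in durations])  # -> [datetime.timedelta(seconds=5700), datetime.timedelta(seconds=3000)]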
Many thanks in advance for helping a Python newbie like me!
I have a DataFrame containing daily or hourly prices for a particular crypto.
I was just wondering: is there an easy way to check whether any day or hour (depending on the chosen granularity) is missing, i.e. whether anything breaks the perfectly constant timedelta between consecutive dates in the index?
Here is an example of another "due diligence" check I am doing, just making sure that the temporal order is respected:
# Check timestamp order:
for i in range(len(df.TS) - 1):
    if df.TS[i] > df.TS[i + 1]:
        print('Timestamp does not respect time direction, please check df.')
        break
There is surely a better way to do this, but I didn't find any built-in function for either of these checks.
Many thanks again and best regards,
Pierre
If df.TS is where you store your datetime data, then you can do this (example for daily data, change freq accordingly):
pd.date_range(start=df.TS.min(), end=df.TS.max(), freq='D').difference(df.TS)
This will return the dates that are present in a complete range but missing from your datetime series.
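As an aside, the ordering check from the question also has a built-in; a short sketch with made-up timestamps:

import pandas as pd

df = pd.DataFrame({'TS': pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-04'])})

# True if the series never decreases; this replaces the manual loop.
print(df.TS.is_monotonic_increasing)  # -> True

# The gap check from above, for daily data: 2021-01-03 is reported missing.
print(pd.date_range(df.TS.min(), df.TS.max(), freq='D').difference(df.TS))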
I have two lists of dates and times in the following format:
YYYY:DDD:HH:MM:SS:mmm
(mmm is milliseconds).
I am getting these times from another list using regular expressions.
import re

for line in start_list:
    start_time = re.search(r'((\d\d\d\d):(\d*):(\d\d):(\d\d):(\d\d):(\d\d\d))', line)

for line in end_list:
    end_time = re.search(r'((\d\d\d\d):(\d*):(\d\d):(\d\d):(\d\d):(\d\d\d))', line)
I could do a for loop cycling through start_time and keep a counter tracking the current line of end_time, but I'm not really sure of the best way to execute it. I can't seem to figure out how to cycle through each line in each list to calculate the time difference between them. Any help would be very greatly appreciated.
You can convert to datetime objects and then just subtract one series from the other, e.g.:
import pandas as pd
start_list_pts = [date.replace(":", ".") for date in start_list]
end_list_pts = [date.replace(":", ".") for date in end_list]

start_datetimes = pd.to_datetime(start_list_pts, format='%Y.%j.%H.%M.%S.%f')
end_datetimes = pd.to_datetime(end_list_pts, format='%Y.%j.%H.%M.%S.%f')

dts = start_datetimes - end_datetimes
Your milliseconds were not written as a fraction of the seconds, which is why I changed the separators to points (so %f can parse them). You can of course also change only the last one!
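A quick self-contained check with made-up values (note that start minus end, as written above, gives negative deltas; flip the order for elapsed time):

import pandas as pd

start_list = ["2023:001:12:00:00:000"]
end_list = ["2023:001:12:00:01:500"]

start_pts = [d.replace(":", ".") for d in start_list]
end_pts = [d.replace(":", ".") for d in end_list]
fmt = '%Y.%j.%H.%M.%S.%f'

dts = pd.to_datetime(end_pts, format=fmt) - pd.to_datetime(start_pts, format=fmt)
print(dts.total_seconds())  # -> [1.5]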
I'm dealing with time series data using python's pandas DataFrame.
Given that this time series has values in the range of -10 to 10, we want to find out how many times it crosses 3.
In the simplest case, you can compare the previous and the current value against 3 and see whether the "above 3" state changes.
Is there a function in pandas to help with this?
If you just want to find how many values come out per row,
use DataFrame.count(axis='columns'):
import pandas as pd

path = './test.csv'
dataframe = pd.read_csv(path, encoding='utf8')
print(dataframe.count(axis='columns'))
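That said, the crossing count the question actually asks about can be computed with a vectorized previous-vs-current comparison; a minimal sketch with a made-up series and the threshold 3:

import pandas as pd

s = pd.Series([-5, 1, 4, 2, 6, 7, 2])

# A crossing happens whenever the "above 3" state flips between neighbours.
above = s > 3
crossings = above.ne(above.shift(fill_value=above.iloc[0])).sum()
print(crossings)  # -> 4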
Is there some function in Python to handle this? Google Docs has a WEEKDAY operation, so perhaps there is something like that in Python. I am pretty sure someone must have solved this; similar problems occur with sparse data in fields such as finance and research. I am basically just trying to organize a huge number of different-sized vectors indexed by days (time series). I am not sure how I should handle the days: mark the first day with 1 and the last day with N, use unix time, or something else? I am also not sure whether the time series should be saved into a matrix so I could model them more easily, e.g. to calculate correlation matrices; is there anything ready-made for such things?
Let's try to solve this problem without the "practical" extra clutter:
from itertools import cycle, compress

seq = range(100000)
criteria = cycle([True] * 10 + [False] * 801)
print(list(compress(seq, criteria)))
Now I have to change them into days, and then change the $\mathbb{R}$ values into $(\mathbb{R}, \mathbb{R})$ tuples. So the valuation function $V : \mathbb{R} \mapsto \mathbb{R}^{2}$ is still missing; investigating.
[Update]
Let's play! The code below solves the subproblem of creating some test data; now we need to create arbitrary days and valuations to try it on arbitrary time series. If we can create some function $V$, we are very close to solving this problem... it must consider holidays and weekends, though, so it may not be easy (not sure).
import itertools as it
import numpy

def createRandomData():
    samples = []
    for x in range(5):
        seq = range(5)
        criteria = it.cycle([True] * x + [False] * 3)
        samples += [list(it.compress(seq, criteria))]
    return samples

def createNNtriangularMatrix(data):
    N = len(data)
    return [aa + [0] * (N - len(aa)) for aa in data]

A = createNNtriangularMatrix(createRandomData())
print(numpy.array(A))
print(numpy.corrcoef(A))
I think you should figure out some way to mark the days you want to INCLUDE, and create a (probably looping) subroutine that uses slicing operations on your big list.
For discontinuous slices, you can take a look at this question:
Discontinuous slice in python list
Or perhaps you could give the days you do not want a null value (zero or None).
Try using pandas. You can create a DateOffset for business days and include your data in a DataFrame (see: http://pandas.pydata.org/pandas-docs/stable/timeseries.html) to analyze it.
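A minimal sketch of that suggestion (the dates and values here are made up):

import numpy as np
import pandas as pd

# Business days only (Mon-Fri); holidays would need a custom calendar.
idx = pd.bdate_range(start='2012-01-02', end='2012-12-31')
ts = pd.Series(np.random.randn(len(idx)), index=idx)
print(ts.head())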
I think it depends on the scope of your problem; for a personal calendar, 'day' is good enough for indexing.
One's life is at most about 200 years, roughly 73,000 days; simply calculate and record them all, maybe using a dict, e.g.
day = {}
# day[0] = [event_a, event_b, ...]
# or you may want to override the __getitem__ method so you can write day['09-05-2012']
Why would you want to remove the holidays and weekends? Is it because they are outliers or zeroes? If they are zeroes, they will be handled by the model. You would want to leave the data in the time series and use dummy variables to model the seasonal effects (i.e. monthly dummies), day-of-the-week dummies, and holiday dummies. Clearly, I am dumbfounded. I have seen people who are unable to deal with time series analysis even break the weekdays into one time series and the weekends into another, which completely ignores the lead and lag impacts around holidays.
If it is trading days you want, you can use the pandas-datareader package to download the S&P 500 historical prices for the U.S. and use its index of dates as a mask for your data.
Answered on mobile, I'll add links and code later.
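In the meantime, a hedged sketch of the idea, assuming the pandas-datareader package is installed and that 'stooq' is an available source (data sources come and go, so treat this as illustrative):

import pandas_datareader.data as web

# Hypothetical: fetch S&P 500 daily data; its index is the trading days.
spx = web.DataReader('^SPX', 'stooq')
trading_days = spx.index

# Keep only the rows of your own frame that fall on trading days:
# df = df[df.index.isin(trading_days)]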