Python dateutil.rrule is incredibly slow

I'm using the python dateutil module for a calendaring application which supports repeating events. I really like the ability to parse ical rrules using the rrulestr() function. Also, using rrule.between() to get dates within a given interval is very fast.
However, as soon as I try any other operation (e.g. list slices, before(), after(), ...), everything slows to a crawl. It seems like dateutil tries to calculate every occurrence even if all I want is the last date, via rrule.before(datetime.max).
Is there any way of avoiding these unnecessary calculations?

My guess is probably not. Finding the last date before datetime.max means calculating all the recurrences up until datetime.max, and that will reasonably be a LOT of recurrences. It might be possible to add shortcuts for some of the simpler recurrences: if the rule is simply every year on the same date, you don't really need to compute the recurrences in between. But for rules like "every third" something, or rules with a maximum number of recurrences, you do. As far as I know, dateutil doesn't have these shortcuts, and it would probably be quite complex to implement them reliably.
May I ask why you need to find the last recurrence before datetime.max? It is, after all, almost eight thousand years into the future... :-)
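As a hedged illustration of the kind of shortcut described above (the last_yearly_before helper is hypothetical, not part of dateutil):

import datetime
from dateutil.rrule import rrule, YEARLY

# A finite rule (one with count= or until=) can simply be indexed;
# dateutil still iterates, but a bounded rule is cheap to expand.
rule = rrule(YEARLY, dtstart=datetime.datetime(2013, 1, 1), count=50)
print(rule[-1])  # the 50th and last occurrence

# For an unbounded yearly rule, the shortcut would be plain arithmetic
# instead of iteration (hypothetical sketch, no leap-day handling):
def last_yearly_before(dtstart, limit):
    """Last anniversary of dtstart strictly before limit."""
    candidate = dtstart.replace(year=limit.year)
    if candidate >= limit:
        candidate = dtstart.replace(year=limit.year - 1)
    return candidate

print(last_yearly_before(datetime.datetime(2013, 1, 1), datetime.datetime.max))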

Why is the compute() method slow for Dask dataframes but the head() method is fast?

So I'm a newbie when it comes to working with big data.
I'm dealing with a 60GB CSV file so I decided to give Dask a try since it produces pandas dataframes. This may be a silly question but bear with me, I just need a little push in the right direction...
So I understand why the following query using the "compute" method would be slow (lazy computation):
df.loc[1:5, 'enaging_user_following_count'].compute()
By the way, it took 10 minutes to compute.
But what I don't understand is why the following code using the "head" method prints the output in less than two seconds (and in this case I'm asking for 250 rows, while the previous snippet asked for just 5):
df.loc[50:300, 'enaging_user_following_count'].head(250)
Why doesn't the "head" method take a long time? I feel like I'm missing something here, because I'm able to pull a huge number of rows in far less time than with the "compute" method.
Or is the compute method used in other situations?
Note: I tried to read the documentation but there was no explanation of why head() is fast.
I played around with this a bit half a year ago. .head() does not check all your partitions; it simply checks the first one. There is no synchronization overhead etc., so it is quite fast, but it does not take the whole dataset into account.
You could try
df.loc[-251:, 'enaging_user_following_count'].head(250)
IIRC you should get the last 250 entries of the first partition instead of the actual last indices.
Also if you try something like
df.loc[conditionThatIsOnlyFulfilledOnPartition3, 'enaging_user_following_count'].head(250)
you get an error that head could not find 250 samples.
If you actually just want the first few entries, however, it is quite fast :)
This processes the whole dataset
df.loc[1:5, 'enaging_user_following_count'].compute()
The reason is that loc is a label-based selector, and there is no telling which labels exist in which partition (there's no reason they should be monotonically increasing). In the case that the index is well-formed, you may have useful values of df.divisions, and then Dask should be able to pick only the partitions of your data that it needs.
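To make the partition mechanics concrete, here is a small sketch (the file name is a placeholder; the column name is taken from the question):

import dask.dataframe as dd

df = dd.read_csv('big.csv')     # hypothetical 60GB file, split into partitions

print(df.npartitions)           # how many partitions were created
print(df.divisions)             # index boundaries per partition, or (None, ...)

# head() only has to materialize the first partition, so it returns fast:
df['enaging_user_following_count'].head(250)

# compute() materializes every partition, i.e. it scans the whole file:
df['enaging_user_following_count'].compute()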

Divide one "column" by another in Tab Delimited file

I have many files with three million lines each in an identical tab-delimited format. All I need to do is divide the number in the 14th "column" by the number in the 12th "column", then set the number in the 14th column to the result.
Although this is a very simple operation, I'm really struggling to work out how to achieve it. I've spent a good few hours searching this website, but unfortunately the answers I've seen have gone completely over my head, as I'm a novice coder!
The tools I have are Notepad++ and UltraEdit (which can run JavaScript, although I'm not familiar with it), and Python 3.6 (I have very basic Python knowledge). Other answers have suggested something called "awk", but when I looked it up it seemed to need Unix, and I only have Windows. What's the best tool for getting this done? I'm more than willing to learn something new.
In Python there are a few ways to handle CSV-style data. For your particular use case
I think pandas is what you are looking for.
You can load your file with df = pandas.read_csv() (passing sep='\t' for tab-delimited data), and then performing the division and replacement is as easy as df[13] /= df[11] (columns are zero-indexed, so the 14th column has label 13).
Finally you can write your data back out in the same format with df.to_csv().
I leave it to you to fill in the missing details of the pandas functions, but I promise it is very easy and you'll probably benefit from learning it for a long time.
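Putting that together, a minimal sketch (the file names are placeholders, and I'm assuming the files have no header row):

import pandas as pd

# sep='\t' because the files are tab-delimited; header=None because no
# header row is assumed, so columns are labeled 0..N-1
df = pd.read_csv('input.txt', sep='\t', header=None)

df[13] = df[13] / df[11]   # 14th column divided by 12th (zero-indexed)

df.to_csv('output.txt', sep='\t', header=False, index=False)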
Hope this helps

Business time difference between datetime objects in Python

How do I find the business time difference between two datetime objects in Python?
Business time can be 9-17
I also want to consider holidays. I have a function is_holiday(date) that takes a date object.
def business_time_between(datetime1, datetime2):
    return (business_time_after(datetime1)
            + business_time_per_day * days_between(datetime1, datetime2)
            + business_time_before(datetime2))
I'm sure you can figure out how to implement business_time_after, business_time_before, business_time_per_day, and days_between.
I did not really find a simple pythonic answer to this problem.
The package businesstime suggested by J.F. Sebastian does what I wanted, but I think that the algorithms can be optimized and simplified.
For example, if we change the model of the day, we can use the standard Python datetime operations.
Let me try to be clearer: if our work day is 9-17, we can say that our day is made of 8 hours rather than 24. So 9:00 maps to 0:00 and 17:00 maps to 8:00. Now what we have to do is create a function that converts a real time into our new model in this way:
if the time is earlier than our start time (9:00), we don't care about it and convert it to 0:00
if it is within business hours, we convert it by subtracting the start time; for example, 13:37 becomes 4:37
if the time is later than our stop time (17:00), we don't care about it and convert it to 8:00
After this conversion, all the calculations can be made easily with the Python datetime package, just as with a normal datetime.
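A minimal sketch of that mapping (the names and interface are my own, not from django-business-time; holidays and lunch breaks are ignored here):

from datetime import datetime, time

BUSINESS_START = time(9, 0)
BUSINESS_END = time(17, 0)
from datetime import timedelta

def to_business_time(dt):
    """Map a real (naive) datetime into the 8-hour business-day model."""
    if dt.time() <= BUSINESS_START:
        return timedelta()                 # before 9:00 -> 0:00
    if dt.time() >= BUSINESS_END:
        return timedelta(hours=8)          # after 17:00 -> 8:00
    return dt - datetime.combine(dt.date(), BUSINESS_START)

print(to_business_time(datetime(2013, 1, 29, 13, 37)))  # 4:37:00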
Thanks also to user2085282's answer, you can find this concept implemented in the repo django-business-time.
With this model conversion, I found it extremely easy to add another feature: the lunch break.
In fact, if you have a lunch break from 12 to 13, you can map every time inside that interval to 12:00 in your new day model. But be careful to also shift times after 13:00 to account for the hour of lunch break.
I am sorry that what I developed is just a Django application and not a Python package, but for the moment it works for me. And sorry also for my code, I am a newbie :)
PS: the full documentation is coming soon!

Design: How to handle timestamps in storage and in computations (Python)

I'm trying to determine (as my application deals with lots of data from different sources, in different time zones, formats, etc.) how best to store my data AND work with it.
For example, should I store everything as UTC? That means when I fetch data I need to determine what timezone it is currently in and, if it's NOT UTC, do the necessary conversion. (Note, I'm in EST.)
Then, when performing computations on the data, should I convert it (say it's UTC) into MY time zone (EST), so it makes sense when I'm looking at it? Or should I keep it in UTC and do all my calculations there?
A lot of this data is time series and will be graphed, and the graph will be in EST.
This is a Python project, so let's say I have a data structure like:
"id1": {
    "interval": 60,                                # seconds, subDict['interval']
    "last": "2013-01-29 02:11:11.151996+00:00"     # UTC, subDict['last']
},
And I need to operate on this by determining whether the current time (now()) is greater than last + interval (have the 60 seconds elapsed?). So in code:
import datetime
import dateutil.parser
from dateutil import tz

lastTime = dateutil.parser.parse(subDict['last'])
utcNow = datetime.datetime.utcnow().replace(tzinfo=tz.tzutc())
if lastTime + datetime.timedelta(seconds=subDict['interval']) < utcNow:
    print "Time elapsed, do something!"
Does that make sense? I'm working with UTC everywhere, both stored and computationally...
Also, if anyone has links to good write-ups on how to work with timestamps in software, I'd love to read them. Possibly a Joel On Software-style piece on timestamp usage in applications?
It seems to me as though you're already doing things 'the right way'. Users will probably expect to interact in their local time zone (input and output), but it's normal to store normalized dates in UTC format so that they are unambiguous and to simplify calculation. So, normalize to UTC as soon as possible, and localize as late as possible.
Some small amount of information about Python and timezone processing can be found here:
Django timezone implementation
pytz Documentation
My current preference is to store dates as unix timestamp tv_sec values in backend storage, and convert to Python datetime.datetime objects during processing. Processing is usually done with a datetime object in the UTC timezone, converted to the local user's timezone just before output. I find that having a rich object such as datetime.datetime helps with debugging.
Timezones are a nuisance to deal with, and you probably need to determine on a case-by-case basis whether it's worth the effort to support them correctly.
For example, let's say you're calculating daily counts for bandwidth used. Some questions that may arise are:
What happens on a daylight saving boundary? Should you just assume that a day is always 24 hours for ease of calculation, or do you need to check, for every daily calculation, whether the day has fewer or more hours because it falls on a daylight saving boundary?
When presenting a localized time, does it matter if a time is repeated? E.g., if you display an hourly report in local time without a time zone attached, will it confuse the user to have a missing hour of data, or a repeated hour of data, around daylight saving changes?
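A quick sketch of the first pitfall, using pytz (2013-03-10 was the US spring-forward date):

from datetime import datetime
import pytz

eastern = pytz.timezone('US/Eastern')
start = eastern.localize(datetime(2013, 3, 10))   # midnight, EST
end = eastern.localize(datetime(2013, 3, 11))     # midnight next day, EDT

# the local "day" on the spring-forward date is only 23 hours long
print(end.astimezone(pytz.utc) - start.astimezone(pytz.utc))  # 23:00:00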
Since, as far as I can see, you do not seem to be having any implementation problems, I would focus on design aspects rather than on code and timestamp format. I have experience participating in the design of network support for a navigation system implemented as a distributed system on a local network. The nature of that system is such that there is a lot of data (often conflicting) coming from different sources, so resolving conflicts and keeping data integrity is rather tricky. Just some thoughts based on that experience.
Timestamping data, even in a distributed system of many computers, is usually not a problem, provided you do not need a higher resolution than the one provided by system time functions, or higher time-synchronization accuracy than your OS components provide.
In the simplest case, using UTC is quite reasonable, and for most tasks it's enough. However, it's important to understand the purpose of the timestamps in your system from the very beginning of the design. Time values (whether Unix time or formatted UTC strings) can sometimes be equal. If you have to resolve data conflicts based on timestamps (that is, always select the newer, or the older, value among several received from different sources), you need to decide whether an incorrectly resolved conflict (one that can be resolved in more than one way, because the timestamps are equal) is a fatal problem for your design or not. The probable options are:
If 99.99% of conflicts are resolved the same way on all nodes, you don't care about the remaining 0.01%, and they do not break data integrity. In that case you may safely continue using something like UTC.
If strict resolution of all conflicts is a must, you have to design your own timestamping scheme. Timestamps may include a time (perhaps not system time, but some higher-resolution timer), a sequence number (so that unique timestamps can be produced even when the time resolution is insufficient), and a node identifier (so that different nodes of your system generate completely unique timestamps).
Finally, what you need may not be timestamps based on time at all. Do you really need to calculate the time difference between a pair of timestamps? Isn't it enough to be able to order timestamps, without connecting them to real moments in time? If you only need comparisons, not time calculations, then timestamps based on sequential counters rather than real time are a good choice (see Lamport time for more details).
If you need strict conflict resolving, or if you need very high time resolution, you will probably have to write your own timestamp service.
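For the counter-based option, a minimal sketch of a Lamport-style logical clock (a toy illustration, not a production timestamp service):

class LamportClock(object):
    """Counter-based timestamps: orderable, unique per node, but not
    tied to wall-clock time."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Local event: advance the counter and emit a timestamp."""
        self.counter += 1
        return (self.counter, self.node_id)

    def receive(self, remote_counter):
        """On receiving a message, jump past the sender's counter."""
        self.counter = max(self.counter, remote_counter) + 1
        return (self.counter, self.node_id)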
Many ideas and clues can be borrowed from A. Tanenbaum's book "Distributed Systems: Principles and Paradigms". When I faced such problems it helped me a lot, and it has a separate chapter dedicated to timestamp generation.
I think the best approach is to store all timestamp data as UTC. When you read it in, immediately convert to UTC; right before display, convert from UTC to your local time zone.
You might even want your code to print all timestamps twice, once in local time and once in UTC... it depends on how much data you need to fit on the screen at once.
I am a big fan of the RFC 3339 timestamp format. It is unambiguous to both humans and machines. What is best about it is that almost nothing is optional, so it always looks the same:
2013-01-29T19:46:00.00-08:00
I prefer to convert timestamps to single float values for storage and computations, and then convert back to the datetime format for display. I wouldn't keep money in floats, but timestamp values are well within the precision of float values!
Working with time floats makes a lot of code very easy:
if time_now() >= last_time + interval:
    print("interval has elapsed")
It looks like you are already doing it pretty much this way, so I can't suggest any dramatic improvements.
I wrote some library functions to parse timestamps into Python time float values, and convert time float values back to timestamp strings. Maybe something in here will be useful to you:
http://home.blarg.net/~steveha/pyfeed.html
I suggest you look at feed.date.rfc3339. BSD license, so you can just use the code if you like.
EDIT: Question: How does this help with timezones?
Answer: If every timestamp you store is stored in UTC time as a Python time float value (number of seconds since the epoch, with optional fractional part), you can directly compare them; subtract one from another to find out the interval between them; etc. If you use RFC 3339 timestamps, then every timestamp string has the timezone right there in the timestamp string, and it can be correctly converted to UTC time by your code. If you convert from a float value to a timestamp string value right before displaying, the timezone will be correct for local time.
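For illustration, a sketch of that round trip using only dateutil and the standard library (not the author's pyfeed code):

import calendar
import dateutil.parser

def rfc3339_to_float(s):
    """Parse an RFC 3339 string to a UTC float timestamp."""
    dt = dateutil.parser.parse(s)   # the offset is part of the string
    return calendar.timegm(dt.utctimetuple()) + dt.microsecond / 1e6

t1 = rfc3339_to_float('2013-01-29T19:46:00.00-08:00')
t2 = rfc3339_to_float('2013-01-30T03:46:00.00+00:00')
print(t2 - t1)  # 0.0 -- the same instant expressed in two zones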
Also, as I said, it looks like he is already pretty much doing this, so I don't think I can give any amazing advice.
Personally, I use the Unix-time standard. It's very convenient for storage thanks to its simple representation: it's merely a number. Since it internally represents UTC time, you have to make sure to generate it properly (converting from other timestamps) before storing, and format it according to whatever time zone you want on the way out.
Once you have a common, tz-aware timestamp format in the backend data, plotting the data is very easy; it's just a matter of setting the destination TZ.
As an example:
import time
import datetime
import pytz

# pre-encoded date stored as a unix epoch value
example = {"id1": {
    "interval": 60,
    "last": 1359521160.62
}}

# format the timestamp in your system's local timezone
print time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(example['id1']['last']))

# localize the timestamp via an ISO country code: attach UTC to the
# naive datetime first, then convert to the target timezone
countrytz = pytz.country_timezones['BR'][0]
it = pytz.timezone(countrytz)
utc_dt = pytz.utc.localize(datetime.datetime.utcfromtimestamp(example['id1']['last']))
print utc_dt.astimezone(it)

Algorithm in Python to store and search daily occurrence for thousands of numbered events?

I'm investigating solutions for storing and querying a historical record of event occurrences for a large number of items.
This is the simplified scenario: I get a daily log for 200,000 streetlamps (labeled sl1 to sl200000) which shows whether each lamp was operational on that day or not. It does not matter for how long the lamp was in service, only whether it was on a given calendar day.
Other bits of information are stored for each lamp as well and the beginning of the Python class looks something like this:
class Streetlamp(object):
    """Class for streetlamp record"""
    def __init__(self, **args):
        self.location = args['location']
        self.power = args['power']
        self.inservice = ???
My py-foo is not too great and I would like to avoid a solution which is too greedy on disk/memory storage. A dict of (year, month, day) tuples could be one solution, but I'm hoping for pointers to something more efficient.
A record could be stored as a bit stream with each bit representing a day of a year starting with Jan 1. Hence, if a lamp was operational the first three days of 2010, then the record could be:
sl1000_up = {'2010': '11100000000000...', '2011': '11111100100...'}
Searching across year boundaries would need a merge, leap years are a special case, and I'd need to code/decode a fair bit with this home-grown solution. It doesn't seem quite right. speed-up-bitstring-bit-operations, how-do-i-find-missing-dates-in-a-list and finding-data-gaps-with-bit-masking were interesting postings I came across. I also investigated python-bitstring and did some googling, but nothing seems to really fit.
Additionally, I'd like searching for 'gaps' to be possible, e.g. 'three or more days out of action', and it is essential that a flagged day can be converted into a real calendar date.
I would appreciate ideas or pointers to possible solutions. To add further detail, it might be of interest that the back-end DB used is ZODB, and pure Python objects which can be pickled are preferred.
Create a 2D array in NumPy:
import numpy as np

nbLamps = 200000
nbDays = 365
arr = np.zeros((nbLamps, nbDays), dtype=bool)
It will be very memory-efficient and you can easily aggregate over days and lamps.
In order to manipulate the days even better, have a look at scikits.timeseries. It will allow you to access the dates via datetime objects.
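For instance, aggregation over that array might look like this (the lamp index is a hypothetical example):

# mark lamp sl1000 (row 999) as up for the first three days of the year
arr[999, 0:3] = True

days_up_per_lamp = arr.sum(axis=1)   # days in service, per lamp
lamps_up_per_day = arr.sum(axis=0)   # lamps in service, per day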
I'd probably keep the lamps in a dictionary and have each entry contain a list of state changes, where the first element is the time of the change and the second is the value that is valid from that time on.
This way, when you get the next sample you do nothing unless the state changed compared to the last item.
Searching is quick and efficient as you can use binary search approaches on the times.
Persisting it is also easy, and you can append data to an existing, running system without any problems, as well as share identical state lists between lamps to further reduce resource usage.
If you want to search for a gap, you just go over all the items and compare the next and previous times. If you decided to share the state lists, you'll be able to do this once per distinct list rather than per lamp, and then collect all the lamps that had the same "offline" states in a single iteration, which may sometimes help.
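A minimal sketch of that state-change-list idea (the names and data are hypothetical):

import bisect
from datetime import date

# each lamp maps to a list of (change_date, state) pairs, sorted by date
lamps = {
    'sl1000': [(date(2010, 1, 1), True), (date(2010, 1, 4), False)],
}

def state_on(lamp_id, day):
    """Binary-search the change list for the state valid on `day`."""
    changes = lamps[lamp_id]
    i = bisect.bisect_right([c[0] for c in changes], day) - 1
    return changes[i][1] if i >= 0 else False

print(state_on('sl1000', date(2010, 1, 2)))   # True -- lamp was up
print(state_on('sl1000', date(2010, 1, 5)))   # False -- gap began Jan 4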
