I am measuring run times of a program in seconds. Depending on the amount of data I input, that can take milliseconds to days. Is there a Python module that I can use to convert the number of seconds to the most useful unit and display that? Approximations are fine.
For example, 50 should become 50 seconds, 590 should become 10 minutes, 100000 should become 1 day, or something like that. I could write the basic thing myself, but I am sure people have thought about this more than I have and have considered many of the edge cases I wouldn't think of in a thousand years :)
Edit: I noticed tqdm must have some logic associated with that, as it selects the length of the ETA string accordingly. Compare
import time, tqdm
for _ in tqdm.tqdm(range(10)): time.sleep(1)
with
for _ in tqdm.tqdm(range(100000)): time.sleep(1)
Edit: I have also found this Gist, but I would prefer code that gets at least some maintenance :)
https://gist.github.com/alexwlchan/73933442112f5ae431cc
Close the question if you want; humanize.naturaldelta is the answer:
This modest package contains various common humanization utilities, like turning a number into a fuzzy human readable duration ('3 minutes ago') or into a human readable size or throughput. It works with python 2.7 and 3.3 and is localized to Russian, French, Korean and Slovak.
https://github.com/jmoiron/humanize
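For instance, a quick check with humanize (pip install humanize) might look like the following; the exact wording of the output can vary between versions:

import datetime
import humanize

# naturaldelta() turns a timedelta (or a number of seconds) into a fuzzy phrase
print(humanize.naturaldelta(datetime.timedelta(seconds=50)))      # e.g. "50 seconds"
print(humanize.naturaldelta(datetime.timedelta(seconds=590)))     # e.g. "9 minutes"
print(humanize.naturaldelta(datetime.timedelta(seconds=100000)))  # e.g. "a day"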
I just found arrow:
Arrow is a Python library that offers a sensible and human-friendly approach to creating, manipulating, formatting and converting dates, times and timestamps.
It has humanize(), too, and seems to be much better maintained:
https://arrow.readthedocs.io/en/latest/#humanize
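For comparison, a rough sketch with arrow; its humanize() speaks in relative terms, so one way to use it for a duration is to shift a timestamp backwards by the measured number of seconds (the exact output wording below is approximate):

import arrow

now = arrow.utcnow()
# fuzzy relative phrases, roughly "a minute ago", "9 minutes ago", "a day ago"
print(now.shift(seconds=-50).humanize(now))
print(now.shift(seconds=-590).humanize(now))
print(now.shift(seconds=-100000).humanize(now))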
Related
I'd like to keep the last 10 minutes of a time series in memory with some sort of deque system in Python.
Right now I'm using a deque, but I may receive 100 data points in a few seconds and then nothing for a few seconds.
Any ideas?
I read something about FastRBTree in a post, but it dates back to 2014. Is there a better solution now?
I am mostly interested in computing the standard deviation over a fixed period of time, so the less data I receive within that window, the lower the standard deviation will be.
If you are concerned about container size, the "simplest" thing might be to use the deque with a maxlen argument; as it overflows, the oldest entries are simply dropped. That obviously does not guarantee 10 minutes' worth of data, but it is an efficient data structure for this.
If you want to "trim by time in the deque", then you probably need to create a custom class that holds each value together with a timestamp of some kind, and periodically poll the old end of the deque, popping items until the oldest one is within 10 minutes of the current time; a minimal sketch follows.
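For illustration, a minimal sketch of that idea; the class and method names are made up, and the 10-minute window and use of statistics.pstdev are assumptions:

import time
from collections import deque
from statistics import pstdev

class TimedWindow:
    """Deque of (timestamp, value) pairs trimmed to a fixed time window."""

    def __init__(self, window_seconds=600):            # 10 minutes
        self.window = window_seconds
        self.items = deque()

    def append(self, value, timestamp=None):
        self.items.append((time.time() if timestamp is None else timestamp, value))
        self._trim()

    def _trim(self):
        cutoff = time.time() - self.window
        while self.items and self.items[0][0] < cutoff:
            self.items.popleft()                        # drop anything older than the window

    def std(self):
        self._trim()
        values = [v for _, v in self.items]
        return pstdev(values) if len(values) > 1 else 0.0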
If things are happening more dynamically, you might use some kind of database structure to do this (not my area of expertise, but it seems a plausible path to pursue) and possibly re-ask a similar question, with some more details, as a database or SQLite question.
I'm building a social network and I'm going to generate my own URLs at random. The question that has been on my mind for a long time is how to create, in Django, URLs like those of Instagram posts or YouTube videos, for example:
https://www.instagram.com/p/CeqcZdeoNaP/
or
https://www.youtube.com/watch?v=MhRaaU9-Jg4
My problem is that, on the one hand, these URLs have to be unique, but on the other hand, at a large scale, once users have uploaded more than 100,000 posts, setting unique=True does not seem sensible because database performance decreases.
Another option is UUIDs, which largely solve the uniqueness problem, but the strings produced by uuid are very long, and if I shorten them by dropping characters, there is a possibility of a collision, i.e. several identical strings being produced.
I wanted to know if there is a solution where the generated URLs are both short and unique while maintaining database performance.
Thank you for your time 💙
You might choose to design around ULIDs: https://github.com/ulid/spec
It's still 128 bits. The engineering tradeoff they made was 48 bits of predictable, low-entropy clock catenated with an 80-bit nonce. Starting with a timestamp makes it play very nicely with postgres B-trees.
They serialize 5 bits per character instead of the 4 bits offered by hex. You could choose to go for 6 if you want, for the sake of brevity. Similarly, you could also choose to adjust the clock tick granularity and reduce its range.
Keeping the Birthday Paradox in mind, you might choose to use a smaller nonce as well. The current design offers good collision resistance up to around 2^40 identifiers per clock tick, which might be overkill for your needs.
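As a toy, ULID-style illustration of that layout (for production, an existing ULID package is probably the better choice): a millisecond clock prefix concatenated with a random nonce, serialized 5 bits per character.

import os
import time

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"           # 32 symbols, 5 bits each

def new_id(nonce_bits=80):
    timestamp = int(time.time() * 1000)                   # 48 bits of millisecond clock
    nonce = int.from_bytes(os.urandom(nonce_bits // 8), "big")
    value = (timestamp << nonce_bits) | nonce             # clock first, so IDs sort by creation time
    total_bits = 48 + nonce_bits
    n_chars = -(-total_bits // 5)                          # ceil(total_bits / 5)
    return "".join(
        CROCKFORD[(value >> (5 * (n_chars - 1 - i))) & 0x1F]
        for i in range(n_chars)
    )

print(new_id())        # 26 characters, the standard ULID size
print(new_id(40))      # shorter, but a smaller nonce means more collision risk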
I’m trying to think through a sort of extra-credit project: optimizing our schedule.
Givens:
“demand” numbers that go down to the 1/2 hour. These tell us the ideal number of people we’d have on at any given time;
An 8-hour shift, plus an hour lunch break more than 2 hours from both the start and the end of the shift (9 hours from start to finish);
Breaks: 2x 30 minute breaks in the middle of the shift;
For simplicity, can assume an employee would have the same schedule every day.
Desired result:
Dictionary or data frame with the best-case distribution of start times, breaks, lunches across an input number of employees such that the difference between staffed and demanded labor is minimized.
I have pretty basic Python, so my first guess was to just come up with all of the possible shift permutations (the points at which one could take breaks or lunch), have Python select x of them at random (x = number of employees available) a lot of times, and then tell me which selection best allocates the labor. That seems a bit cumbersome and silly, but my limitations are such that I can’t see beyond such a solution.
I have tried to look for libraries or tools that help with this, but the question here (how to distribute start times and breaks within a shift) doesn’t seem to be widely discussed. I’m open to hearing that this is several years off for me, but...
Appreciate anyone’s guidance!
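To make the library question concrete, here is a minimal, hypothetical sketch of framing this as an integer program with PuLP (pip install pulp). The demand numbers, headcount, and the fixed break/lunch offsets are placeholders; a real model would enumerate every break placement the rules allow as its own template.

import pulp

SLOTS = 48                       # half-hour slots in one day
demand = [2] * SLOTS             # placeholder demand per slot
n_employees = 10                 # placeholder headcount

def coverage(start):
    # Slots actually worked for a shift starting at `start`: a 9-hour span
    # minus a 1-hour lunch and two 30-minute breaks at fixed offsets.
    slots = set(range(start, start + 18))
    slots -= {start + 8, start + 9}      # lunch, roughly mid-shift
    slots -= {start + 4, start + 13}     # two half-hour breaks
    return slots

templates = [coverage(s) for s in range(SLOTS - 18 + 1)]

prob = pulp.LpProblem("staffing", pulp.LpMinimize)
x = [pulp.LpVariable(f"x_{t}", lowBound=0, cat="Integer")    # employees per template
     for t in range(len(templates))]
over = [pulp.LpVariable(f"over_{s}", lowBound=0) for s in range(SLOTS)]
under = [pulp.LpVariable(f"under_{s}", lowBound=0) for s in range(SLOTS)]

prob += pulp.lpSum(over) + pulp.lpSum(under)                  # total staffing mismatch
prob += pulp.lpSum(x) == n_employees                          # every employee gets a shift
for s in range(SLOTS):
    staffed = pulp.lpSum(x[t] for t, cov in enumerate(templates) if s in cov)
    prob += staffed - demand[s] == over[s] - under[s]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for t, var in enumerate(x):
    if var.value():
        print(f"shift starting at slot {min(templates[t])}: {int(var.value())} employee(s)")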
I am trying to solve a problem. I would appreciate your valuable input on this.
Problem statement:
I am trying to read a lot of files (on the order of 10**6) in the same base directory. Each file has a name matching the pattern YYYY-mm-dd-hh, and the contents of each file look like this:
mm1, vv1
mm2, vv2
mm3, vv3
.
.
.
where mm is the minute of the day and vv is some numeric value for that minute. I need to find, given a start time (e.g. 2010-09-22-00) and an end time (e.g. 2017-09-21-23), the average of all the vv values.
So basically the user will provide me with a start_date and an end_date, and I will have to compute the average over all the files within the given date range. So my function would be something like this:
def get_average(start_time, end_time, file_root_directory): ...
Now, what I want to understand is how I can use multiprocessing to average the smaller chunks and then build on that to get the final value.
NOTE: I am not looking for a linear solution. Please advise me on how to break the problem into smaller chunks and then combine them to find the average.
I did try using the multiprocessing module in Python by creating a pool of 4 processes, but I am not able to figure out how to retain the values in memory and add the results together across all the chunks.
Your process is going to be I/O-bound. Multiprocessing may not be very useful, and could even be counterproductive.
Moreover, your storage layout, based on an enormous number of small files, is not the best. You should look at a time-series database such as InfluxDB.
Given that the actual processing is trivial (a sum and a count per file), using multiple processes or threads is not going to gain much, because 90+% of the effort is opening each file and transferring its contents into memory.
However, the most obvious partitioning would be based on some per-data-file scheme. So if the search range is (your example) 2010-09-22-00 through 2017-09-21-23, then there are seven years with (maybe?) one file per hour, for a total of 61,368 files (including two leap days).
61 thousand processes do not run very effectively on one system, at least not yet. (It will probably be a reasonable capability some years from now.) For a real (non-supercomputing) system, partition the problem into a few segments, perhaps two or three times the number of CPUs available to do the work. This desktop computer has four cores, so I would first try 12 processes, where each one independently computes the sum and count (the number of samples present, if variable) of 1/12 of the files.
Interprocess communication can be eliminated by using threads. Or, for a process-oriented approach, setting up a pipe to each process to receive the results is a straightforward affair. A sketch of a pool-based version follows.
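For what it's worth, a minimal sketch of that partition-and-combine idea with multiprocessing.Pool, assuming the YYYY-mm-dd-hh filenames and "minute, value" lines from the question; the directory path in the usage line is a placeholder.

import os
from multiprocessing import Pool

def sum_and_count(paths):
    # Partial result for one chunk of files: (sum of vv values, how many there were).
    total, count = 0.0, 0
    for path in paths:
        with open(path) as fh:
            for line in fh:
                if not line.strip():
                    continue
                _minute, value = line.split(",")
                total += float(value)
                count += 1
    return total, count

def get_average(start_time, end_time, file_root_directory, workers=4):
    # YYYY-mm-dd-hh sorts lexicographically, so plain string comparison
    # selects the files inside the requested range.
    names = sorted(n for n in os.listdir(file_root_directory)
                   if start_time <= n <= end_time)
    paths = [os.path.join(file_root_directory, n) for n in names]

    # One chunk per worker; each worker returns its own (sum, count) pair,
    # and the parent combines them into the final average.
    chunks = [paths[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        partials = pool.map(sum_and_count, chunks)

    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else float("nan")

if __name__ == "__main__":
    print(get_average("2010-09-22-00", "2017-09-21-23", "/path/to/data"))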
I'm using the python dateutil module for a calendaring application which supports repeating events. I really like the ability to parse ical rrules using the rrulestr() function. Also, using rrule.between() to get dates within a given interval is very fast.
However, as soon as I try doing any other operations (i.e. list slices, before(), after(), ...), everything begins to crawl. It seems like dateutil tries to calculate every date, even if all I want is to get the last date with rrule.before(datetime.max).
Is there any way of avoiding these unnecessary calculations?
My guess is probably not. Finding the last date before datetime.max means you have to calculate all the recurrences up until datetime.max, and that will reasonably be a LOT of recurrences. It might be possible to add shortcuts for some of the simpler recurrences: if an event falls on the same date every year, for example, you don't really need to compute the recurrences in between. But for something like "every third X", or a rule with a maximum number of recurrences, you do have to compute them. I guess dateutil doesn't have these shortcuts; they would probably be quite complex to implement reliably.
May I ask why you need to find the last recurrence before datetime.max? It is, after all, almost eight thousand years into the future... :-)
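As a small illustration of why the two behave so differently, using dateutil's documented rrulestr(), between(), and before(); the rules and dates below are arbitrary:

from datetime import datetime
from dateutil.rrule import rrulestr

rule = rrulestr("FREQ=DAILY;INTERVAL=1", dtstart=datetime(2023, 1, 1))

# between() only generates occurrences inside the window, so it stays fast
print(rule.between(datetime(2023, 6, 1), datetime(2023, 6, 8)))

# rule.before(datetime.max) would force dateutil to walk every occurrence
# up to the year 9999. Bounding the rule (COUNT or UNTIL) keeps that walk
# finite, so the "last occurrence" comes back immediately:
bounded = rrulestr("FREQ=DAILY;COUNT=100", dtstart=datetime(2023, 1, 1))
print(bounded.before(datetime.max))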