I am trying to solve a problem. I would appreciate your valuable input on this.
Problem statement:
I am trying to read a lot of files (of the order of 10**6) in the same base directory. Each file has the name that matches the pattern (YYYY-mm-dd-hh), and the content of the files are as follows
mm1, vv1
mm2, vv2
mm3, vv3
.
.
.
where mm is the minute of the day and vv” is some numeric value with respect to that minute. I need to find, given a start-time (ex. 2010-09-22-00) and an end-time (ex. 2017-09-21-23), the average of all vv’s.
So basically user will provide me with a start_date and end_date, and I will have to get the average of all the files in between the given date range. So my function would be something like this:
get_average(start_time, end_time, file_root_directory):
Now, what I want to understand is how can I use multiprocessing to average out the smaller chunks, and then build upon that to get the final values.
NOTE: I am not looking for linear solution. Please advise me on how do I break the problem in smaller chunks and then sum it up to find the average.
I did tried using multiprocessing module in python by creating a pool of 4 processes, but I am not able to figure out how do I retain the values in memory and add the result together for all the chunks.
You process is going to be I/O bound.
Multiprocessing may not be very useful, if not counterproductive.
Moreover your storage system, base on enormous number of small files, is not the best. You should look at a time serie database such as influxdb.
Given that the actual processing is trivial—a sum and count of each file—using multiple processes or threads is not going to gain much. This is because 90+% of the effort is opening each file and transferring into memory its content.
However, the most obvious partitioning would be based on some per-data-file scheme. So if the search range is (your example) 2010-09-22-00 through 2017-09-21-23, then there are seven years with (maybe?) one file per hour for a total of 61,368 files (including two leap days).
61 thousand processes do not run very effectively on one system—at least so far. (Probably it will be a reasonable capability some years from now.) But for a real (non-supercomputing) system, partitioning the problem into a few segments, perhaps twice or thrice the number of CPUs available to do the work. This desktop computer has four cores, so I would first try 12 processes where each independently computes the sum and count (number of samples present, if variable) of 1/12 of the files.
Interprocess communication can be eliminated by using threads. Or for process oriented approach, setting up a pipe to each process to receive the results is a straightforward affair.
Related
I have a file (File_A) with greater than 40 million rows which I need to import into Python and compare certain aspects to another very large data file (File_B). One example: find rows in File_A whose 'ID' is not contained in the 'ID' Column of File_B.
These are by far the biggest data files I have ever worked with, and I am wondering if there is a mechanism to avoid having to read_csv every day when I log on (however they have my system set up, the kernel is usually timed out every day or so). I'm using chunksize to bring the file into a df in batches, but it still takes around half an hour.
Is there a way to avoid this rather cumbersome process whenever my kernel gets reset / the python file gets closed?
Additionally, any general recommendations on packages/concepts for importing, manipulating and comparing these super large datasets like these is much appreciated.
I am working on an application which generates a couple of hundred datasets every ten minutes. These datasets consist of a timestamp, and some corresponding values from an ongoing measurement.
(Almost) Naturally, I use pandas dataframes to manage the data in memory.
Now I need to do some work with history data (eg. averaging or summation over days/weeks/months etc. but not limited to that), and I need to update those accumulated values rather frequently (ideally also every ten minutes), so I am wondering which would be the most access-efficient way to store the data on disk?
So far I have been storing the data for every ten minute interval in a separate csv-file and then read the relevant files into a new dataframe as needed. But I feel that there must be a more efficient way, especially when it comes to working with a larger amount of datasets. Although computation cost and memory are not the central issue, as I am running the code on a comparatively powerful machine, but I still don't want to (and most likely, can't afford to) read all the data into memory every time.
It seems to me that the answer should lie within the built-in serialization functions of pandas, but from the docs and my google findings I honestly can't really tell which would fit my needs best.
Any ideas how I could manage my data better?
I have this code which splits the dataframe in 10000 rows and writes to file.
I tried instance with z1d with 24cpu and 192GB but even that didn't do much speed and for 1 million rows it took 9 mins.
This is code
total = df2.count()
offset = 10000
counter = int(total/offset) + 1
idxDf = df.withColumn("idx", monotonically_increasing_id())
for i in range(0, counter):
lower = i * offset
upper = lower + offset
filter = f"idx > {lower} and idx < {upper}"
ddf = idxDf.filter(filter)
ddf2 = ddf.drop("idx")
ddf2.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
Is there any way i can make in it. Currently i am using single master node only. I have 100 million rows and want to know how fast i can do that with emr.
Look like my normal python code is also able to do the same stuff in same minutes
A few problems with what you’re trying to do here:
Stop trying to write pyspark code as if it’s normal python code. It isn’t. Read up on exactly how spark works first and foremost. You’ll have more success if you change the way you program when you use spark, not try to get spark to do what you want in the way you want.
Avoid for loops with Spark wherever possible. for loops only work within native python, so you’re not utilising spark when you start one. Which means one CPU on one Spark node will run the code.
Python is, by default, single threaded. Adding more CPUs will do literally nothing to performance for native python code (ie your for loop) unless you rewrite your code for either (a) multi-threaded processing (b) distributed processing (ie spark).
You only have one master node (and I assume zero slaves nodes). That’s going to take aaaaaaggggggggeeeessss to process a 192GB file. The point of Spark is to distribute the workload onto many other slave nodes. There’s some really technical ways to determine the optimal number of slave nodes for your problem. Try something like >50 or >100 or slaves. Should help you see a decent performance uplift (each node able to process at least between 1gb-4gb of data). Still too slow? Either add more slave nodes, or choose more powerful machines for the slaves. I remember running a 100GB file through some heavy lifting took a whole day on 16 nodes. Upping the machine spec and number of slaves brought it down to an hour.
For writing files, don’t try and reinvent the wheel if you don’t need to.
Spark will automatically write your files in a distributed manner according to the level of partitioning on the dataframe. On disk, it should create a directory called outputpath which contains the n distributed files:
df.repartition(n_files)
df.write.option("header", "false").option("delimiter", " ").option("compression","gzip").csv(outputpath)
You should get a directory structured something like this:
path/to/outputpath:
- part-737hdeu-74dhdhe-uru24.csv.gz
- part-24hejje—hrhehei-47dhe.csv.gz
- ...
Hope this helps. Also, partitioning is super important. If your initial file is not distributed (one big csv), it’s a good idea to do df.repartition(x) on the resulting dataframe after you load it, where x = number of slave nodes.
I have a huge database (~100 variables with a few million rows) consisting of stock data. I managed to connect python with the database via sqlalchemy (postgreql+psycopg2). I am running it all on the cloud.
In principle I want to do a few things:
1) Regression of all possible combinations: I am running a simple regression of each stock, i.e. ABC on XYZ AND also XYZ on ABC, this across the n=100 stocks, resulting in n(n+1) / 2 combinations.
-> I think of a function that calls in the pairs of stocks, does the two regressions and compares the results and picks one based on some criteria.
My question: Is there an efficient way to call in the "factorial"?
2) Rolling Windows: To avoid an overload of data, I thought to only call the dataframe of investigation, i.e. 30days, and then roll over each day, meaning my periods are:
1: 1D-30D
2: 2D-31D and so on
Meaning I always drop the first day and add another row at the end of my dataframe. So meaning I have two steps, drop the first day and read in the next row from my database.
My question: Is this a meaningful way or does Python has something better in its sleeve? How would you do it?
3) Expanding windows: Instead of dropping the first row and add another one, I keep the 30 days and add another 30days and then run my regression. Problem here, at some point I would embrace all the data which will probably be too big for the memory?
My question: What would be a workaround here?
4) As I am running my analysis on the cloud (with a few more cores than my own pc) in fact I could use multithreading, sending "batch" jobs and let Python do things in parallel. I thought of splitting my dataset in 4x 25 stocks and let it run in parallel (so vertical split), or should I better split horizontally?
Additionally I am using Jupyter; I am wondering how to best approach here, usually I have a shell script calling my_program.py. Is this the same here?
Let me try to give answers categorically and also note my observations.
From your description, I suppose you have taken each stock scrip as one variable and you are trying to perform pairwaise linear regression amongst them. Good news about this - it's highly parallizable. All you need to do is generate unique combinations of all possible pairings and perform your regressions and then only to keep those models which fit your criteria.
Now as stocks are your variables, I am assuming rows are their prices or something similar values but definitely some time series data. If my assumption is correct then there is a problem in rolling window approach. In creating these rolling windows what you are implicitly doing is using a data sampling method called 'bootstrapping' which uses random but repeatitive sampling. But due to just rolling your data you are not using random sampling which might create problems for your regression results. At best the model may simply be overtrained, at worst, I cannot imagine. Hence, drop this appraoch. Plus if it's a time series data then the entire concept of windowing would be questionable anyway.
Expanding windows are no good for the same reasons stated above.
About memory and processibility - I think this is an excellent scenario where one can use Spark. It is exactly built for this purpose and has excellent support for python. Millions of data points are no big deal for Spark. Plus, you would be able to massively parallelize your operations. Being on cloud infrastructure also gives you advantage about configurability and expandability without headache. I don't know why people like to use Jupyter even for batch tasks like these but if you are hell-bent on using it, then PySpark kernel is also supported by Jupyter. Vertical split would be right approach here probably.
Hope these answer your questions.
I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset using one of the 59th variable which is calculated using the other 58 variables. The variable happens to be a floating point number with four places after decimal.
There are two possible approaches:
The normal merge sort
While calculating the 59th variables, i start sending variables in particular ranges to to particular nodes. Sort the ranges in those nodes and then combine them in the reducer once i have perfectly sorted data and now I also know where to merge what set of data; It basically becomes appending.
Which is a better approach and why?
I'll assume that you are looking for a total sort order without a secondary sort for all your rows. I should also mention that 'better' is never a good question since there is typically a trade-off between time and space and in Hadoop we tend to think in terms of space rather than time unless you use products that are optimized for time (TeraData has the capability of putting Databases in memory for Hadoop use)
Out of the two possible approaches you mention, I think only one would work within the Hadoop infrastructure. Num 2, Since Hadoop leverages many nodes to do one job, sorting becomes a little trickier to implement and we typically want the 'shuffle and sort' phase of MR to take care of the sorting since distributed sorting is at the heart of the programming model.
At the point when the 59th variable is generated, you would want to sample the distribution of that variable so that you can send it through the framework then merge like you mentioned. Consider the case when the variable distribution of x contain 80% of your values. What this might do is send 80% of your data to one reducer who would do most of the work. This assumes of course that some keys will be grouped in the sort and shuffle phase which would be the case unless you programmed them unique. It's up to the programmer to set up partitioners to evenly distribute the load by sampling the key distribution.
If on the other hand we were to sort in memory then we could accomplish the same thing during reduce but there are inherent scalability issues since the sort is only as good as the amount of memory available in the node currently running the sort and dies off quickly when it starts to use HDFS to look for the rest of the data that did not fit into memory. And if you ignored the sampling issue you will likely run out of memory unless all your key values pairs are evenly distributed and you understand the memory capacity within your data.
Check out the Hadoop Comparator Class Part of HadoopStreaming Wiki Page
You can move the datasets to HDFS, use Python to write a mapper and do a hadoop streaming mapper only job. The Hadoop Streaming will automatically help you sort them.
Then you can use hdfs dfs -getmerge and -copyToLocal to move the sorted records back to local if you want.