I have a PySpark dataframe that contains records for 6 million people, each with an individual userid. Each userid has 2000 entries. I want to save each userid's data into a separate csv file with the userid as the name.
I have some code that does this, taken from the solution to this question. However, as I understand it, the code will try to create a partition for each of the 6 million ids. I don't actually care about this, as I'm going to write each of these files to another, non-HDFS server.
I should note that the code works for a small number of userids (up to 3000), but it fails on the full 6 million.
Code:
output_file = '/path/to/some/hdfs/location'
myDF.write.partitionBy('userid').mode('overwrite').format("csv").save(output_file)
When I run the above it takes WEEKS to run, with most of that time spent on the writing step. I assume this is because of the number of partitions. Even if I manually set the number of partitions to something small, it still takes ages to execute.
Question: Is there a way to save each userid's data into a single, well-named file (file name = userid) without partitioning?
Given the requirements, there is really not much hope for improvement. HDFS is not designed for handling very small files, and pretty much any file system will be challenged if you try to open 6 million file descriptors at the same time.
You can improve this a little, if you haven't already, by calling repartition before the write:
(myDF
.repartition('userid')
.write.partitionBy('userid').mode('overwrite').format("csv").save(output_file))
If you can accept multiple ids per file, you can use a persistent table and bucketing:
(myDFA
.write
.bucketBy(1024, 'userid') # Adjust numBuckets if needed
.sortBy('userid')
.mode('overwrite').format("csv")
.saveAsTable(output_table))
and process each file separately, taking consecutive chunks of data.
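For example, since each bucket is sorted by userid, all rows for one userid are consecutive, so a downloaded bucket file can be split into per-userid files in a single pass. A rough sketch, assuming the bucket has been pulled down locally as a headerless CSV whose first column is the userid (the file names are made up):
import csv
from itertools import groupby

with open('bucket_00000.csv', newline='') as bucket:
    reader = csv.reader(bucket)
    # rows for the same userid are adjacent because the bucket is sorted by userid
    for userid, rows in groupby(reader, key=lambda row: row[0]):
        with open('{}.csv'.format(userid), 'w', newline='') as out:
            csv.writer(out).writerows(rows)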
Finally, if plain text output is not a hard requirement you can use any sharded database and partition data by userid.
Related
Every few days I have to load around 50 rows of data into a Postgres 14 database through a Python script. As part of this I want to record a run number as a column in the db. This number should be the same for all rows I am inserting at that time and larger than any number currently in that column in the database, but other than that its actual value doesn't matter (i.e., I need to make the number myself; I'm not just pulling it in from somewhere else).
The obvious way to do this would be with two calls from Python to the database: one to get the current max run number, and one to save the data with the run number set to one more than the number retrieved in the first query. Is this best practice? Is there a better way to do this with only one call? Would people recommend making a function in Postgres that does this instead and then calling that function? I feel like this is a common situation with an accepted best practice, but I don't know what it is.
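For concreteness, the two-call version I have in mind looks roughly like this (just a sketch; psycopg2 and the table/column names runs, run_number, payload are placeholders):
import psycopg2

new_values = ['a', 'b', 'c']  # stand-in for the ~50 rows loaded by the script

conn = psycopg2.connect('dbname=mydb')  # placeholder connection string
with conn, conn.cursor() as cur:
    # call 1: find the current maximum run number (0 if the table is empty)
    cur.execute('SELECT COALESCE(MAX(run_number), 0) FROM runs')
    next_run = cur.fetchone()[0] + 1
    # call 2: insert the new rows, all tagged with the same run number
    cur.executemany(
        'INSERT INTO runs (run_number, payload) VALUES (%s, %s)',
        [(next_run, value) for value in new_values],
    )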
So, I'm extracting lots of data from an Oracle Rdb database on OpenVMS with a Python script.
Based on some "hand-made" profiling, 3/4 of the time is spent writing the data out to the TXT file.
print >> outputFile, T00_EXTPLANTRA_STRUCTS.setStruct(parFormat).build(hyperContainer)
This is the specific line that prints out the data, which takes 3/4 of the execution time.
T00_EXTPLANTRA_STRUCTS.py is an external file containing the data structures (which .setStruct() defines), and hyperContainer is a Container from the "Construct" library that holds the data.
It takes almost five minutes to extract the whole file. I'd really like to learn if there is a way to write TXT data faster than this.
I already optimized the rest of the code especially DB transactions, it's just the writing operation that's taking a long time to execute.
The data to write looks like this, with 167,000 lines of this kind (I hid the actual data with "X"):
XX;XXXXX;X;>;XXXXX;XXXX;XXXXXXXXXXXXXX ;XXX; ;XXX; ;
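For what it's worth, the quoted line both builds the struct and writes it, so one sanity check (just a sketch, reusing the names from the line above) is to time the two steps separately before blaming the file write:
import time

t0 = time.time()
data = T00_EXTPLANTRA_STRUCTS.setStruct(parFormat).build(hyperContainer)  # build step
t1 = time.time()
outputFile.write(data + "\n")  # write step, replacing the print >> redirection
t2 = time.time()
print "build: %.2f s, write: %.2f s" % (t1 - t0, t2 - t1)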
Many thanks to anyone who spends time on this.
I have a program for single-computer execution (in Python), and I have also implemented the same thing for Spark. The program basically just reads a .json file, takes one field from it, and saves it back. Using Spark, my program runs approximately 100 times slower on 1 master and 1 slave than the standard single-node Python program (of course, there I'm reading from a file and saving to a file). So I would like to ask where the problem might be?
My Spark program looks like:
sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2])
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
JsonExtractor only selects the data from the field that is given by sys.argv[1].
My data is basically many small one-line files, where the line is always JSON.
I have tried both reading and writing the data from/to Amazon S3 and the local disk on all the machines.
I would like to ask if there is something I might be missing, or if Spark is supposed to be this slow in comparison with a local, non-parallel, single-node program.
As I was advised on the Spark mailing list, the problem was the large number of very small JSON files.
Performance can be much improved either by merging the small files into one bigger file, or by:
sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2]).coalesce(10)  # coalesce into 10 partitions
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load each one individually.
Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The reason for the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it is going to be a single file), then add indexes as necessary.
To access your data, use the Python sqlite3 library. You can follow this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
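A rough sketch of that workflow using only the standard library (the file names, table name, and two-column layout with a userid key are made-up examples):
import csv
import sqlite3

conn = sqlite3.connect('data.db')  # the whole database is this single file
conn.execute('CREATE TABLE IF NOT EXISTS entries (userid TEXT, value TEXT)')

with open('data.csv', newline='') as f:
    conn.executemany('INSERT INTO entries VALUES (?, ?)', csv.reader(f))

conn.execute('CREATE INDEX IF NOT EXISTS idx_userid ON entries (userid)')
conn.commit()

# later: read just the subset you need instead of the whole data set
rows = conn.execute('SELECT * FROM entries WHERE userid = ?', ('42',)).fetchall()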
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure, when writing out the CSV file, that every line in the file is exactly 80 bytes long. Then reading the (n)th line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)          # jump to the byte offset of line n (open the file in binary mode so offsets are exact)
theLine = myFileObject.read(80)  # read exactly one fixed-width record
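For completeness, a sketch of producing such a fixed-width file in the first place (the 80-byte record length and file name are arbitrary, and every line must be short enough to fit):
RECORD_LEN = 80
csv_lines = ['1,foo,42', '2,bar,7']  # stand-in for the real CSV row strings

with open('data_fixed.csv', 'wb') as out:
    for line in csv_lines:
        raw = line.encode('ascii')
        # pad with spaces so each record, newline included, is exactly 80 bytes
        out.write(raw.ljust(RECORD_LEN - 1) + b'\n')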
I have a ridiculously simple python script that uses the arcpy module. I turned it into a script tool in arcmap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second comma-separated value in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values, in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. It seems you should be able to read your *.dbf file more directly with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "with"/"as" keywords (context managers, PEP 343; Raymond Hettinger has a good video on YouTube, too) or iterators in general (see PEPs 234 and 255), which only load a record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competing i/o requests. IPython is an add-on for Python that has a pretty easy-to-use, high-level parallel package ("parallel"), if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a 3-hour one where the parallel stuff starts at around 2:13:00.
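As a concrete illustration of the record-at-a-time idea (just a sketch; it assumes the table has been exported to a plain CSV whose first column is back_pres_dist, formatted like the sample above):
import csv

with open('input.csv', newline='') as src, open('output.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    # process one record at a time so memory use stays flat for ~490 million rows
    for row in reader:
        parts = row[0].split(',')  # back_pres_dist, e.g. "1,1,1,2345.6"
        writer.writerow(row + [parts[1], parts[3]])  # append back and dist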