I have a program for single-computer execution (in Python), and I have also implemented the same thing for Spark. The program basically just reads a .json file, takes one field from it, and saves it back. Using Spark with 1 master and 1 slave, my program runs approximately 100 times slower than the single-node standard Python program (which, of course, also reads from a file and saves to a file). So I would like to ask where the problem might be.
My Spark program looks like:
import sys
from pyspark import SparkContext

sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2])
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
JsonExtractor only selects the data from the field given by sys.argv[1].
My data basically consist of many small one-line files, where the line is always JSON.
I have tried both reading and writing the data from/to Amazon S3 and the local disk on all the machines.
I would like to ask if there is something I might be missing, or if Spark is supposed to be this slow in comparison with a local, non-parallel, single-node program.
As I was advised on the Spark mailing list, the problem was the large number of very small JSON files.
Performance can be much improved either by merging the small files into one bigger file beforehand, or by coalescing the partitions:
sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2]).coalesce(10)  # coalesce the input into 10 partitions
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
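If merging the files beforehand is not convenient, another workaround often suggested for the many-small-files problem (not part of the mailing-list advice, just a sketch) is sc.wholeTextFiles, which reads each file as a single (path, content) record and packs many files into each partition:

import sys
from pyspark import SparkContext

sc = SparkContext(appName="Json data preprocessor")
json_extractor = JsonExtractor(sys.argv[1])  # the same JsonExtractor as above

# wholeTextFiles yields (path, content) pairs and groups many small files
# into each partition, cutting the per-file task overhead. Since every file
# holds a single JSON line, the content is the same string that textFile
# would have fed to extract_json.
pairs = sc.wholeTextFiles(sys.argv[2])
cleanedData = pairs.flatMap(lambda pc: json_extractor.extract_json(pc[1]))
cleanedData.saveAsTextFile(sys.argv[3])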
So, I'm extracting lots of data from an OpenVMS RDB Oracle database with a Python script.
From hand-made profiling, 3/4 of the time is spent writing the data down to the TXT file.
print >> outputFile, T00_EXTPLANTRA_STRUCTS.setStruct(parFormat).build(hyperContainer)
This is the specific line that prints out the data, which takes 3/4 of the execution time.
T00_EXTPLANTRA_STRUCTS.py is an external file containing the data structures (which .setStruct() defines), and hyperContainer is a Container from the "Construct" library that holds the data.
It takes almost five minutes to extract the whole file. I'd really like to learn if there is a way to write TXT data faster than this.
I already optimized the rest of the code especially DB transactions, it's just the writing operation that's taking a long time to execute.
The data to write looks like this, with 167,000 lines of this kind (I hid the actual data with "X"):
XX;XXXXX;X;>;XXXXX;XXXX;XXXXXXXXXXXXXX ;XXX; ;XXX; ;
Lots of thanks for anyone spending any time on this.
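A common remedy for this kind of write bottleneck, offered here only as a hedged sketch rather than a tested fix, is to build the Construct struct once, format all rows, and write them with a single call instead of 167,000 separate prints. The names mirror the question above; hyperContainers is a hypothetical stand-in for however the containers are actually produced:

# Hypothetical sketch: construct the format once, not once per row,
# and replace per-row prints with a single buffered write.
struct = T00_EXTPLANTRA_STRUCTS.setStruct(parFormat)

lines = (struct.build(c) for c in hyperContainers)  # lazy, one row at a time

with open("output.txt", "w") as outputFile:
    outputFile.write("\n".join(lines))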
I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:
fileNameList = ['file1.pdf','file2.pdf',...,'file100000.pdf']
for pdf in fileNameList:
    text = myModule.extractText(pdf)  # Distribute this function
    # Do stuff with text
We used Spark once before (a coworker, not me) to distribute the indexing of a few million files from an SQL DB into Solr across a few servers. However, when researching this, it seems that Spark is more for parallelizing over large data sets, not so much for distributing a single task. For that, it looks like Python's built-in process pools (the multiprocessing module) would be better, and I could just run that on a single server with, say, 4 CPU cores.
I know SO is more for specific problems, but I just wanted some advice before I go down the entirely wrong road. For my use case, should I stick to a single server with process pools, or split the work across multiple servers with Spark?
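For what it's worth, the process-pool route mentioned above would look roughly like this (a minimal sketch using the standard multiprocessing module; myModule.extractText and the file names come from the question, and the worker count of 4 is an assumption):

from multiprocessing import Pool

import myModule  # the user's extraction module

fileNameList = ['file1.pdf', 'file2.pdf']  # ... the full 100,000-file list

def process(pdf):
    text = myModule.extractText(pdf)
    # Do stuff with text; return whatever should be collected.
    return pdf, len(text)

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # one worker per CPU core
        for pdf, size in pool.imap_unordered(process, fileNameList):
            print(pdf, size)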
This is a perfectly reasonable use of Spark, since you can distribute the task of text extraction across multiple executors by placing the files on distributed storage. That lets you scale out your compute to process the files, and write the results back out, efficiently and easily with PySpark. You could even use your existing Python text extraction code:
rdd = sc.binaryFiles("/path/to/files")  # yields (filename, content) pairs
processed = rdd.map(lambda pair: (pair[0], myModule.extract(pair[1])))
As your data volume increases or you wish to increase your throughput you can simply add additional nodes.
I am very new to Python.
In our company we use Base SAS for data analysis (ETL, EDA, basic model building). We want to check whether replacing it with Python is possible for big chunks of data. With respect to that, I have the following questions:
How does Python handle large files? My PC has 8 GB of RAM and I have a 30 GB flat file (say, a CSV file). I would generally do operations like left joins, deletions, and group-bys on such a file. This is easily doable in SAS, i.e. I don't have to worry about low RAM. Are the same operations doable in Python? I would appreciate it if somebody could suggest libraries and code for the same.
How can I perform SAS operations like PROC SQL in Python to create a dataset on my local PC while fetching the data from the server?
I.e., in SAS I would download 10 million rows (7.5 GB of data) from SQL Server with the following:
libname aa ODBC dsn =sql user = pppp pwd = XXXX;
libname bb '<<local PC path>>';
proc sql outobs = 10000000;
create table bb.foo as
select * from aa.bar
;quit;
What is the method to perform the same in Python? Again, just to remind you: my PC has only 8 GB of RAM.
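For the PROC SQL part specifically, one common Python pattern, given here only as a hedged sketch, is to stream the rows in chunks with pandas and pyodbc so that only one chunk is ever in RAM. The DSN and credentials mirror the SAS libname above; the chunk size (and capping rows in the query itself, as outobs does) are assumptions:

import pandas as pd
import pyodbc

# Connect the way the SAS libname does (DSN/user/password from above).
conn = pyodbc.connect("DSN=sql;UID=pppp;PWD=XXXX")

first = True
for chunk in pd.read_sql("SELECT * FROM bar", conn, chunksize=100000):
    # Each chunk holds 100,000 rows; only one chunk is in memory at a time.
    chunk.to_csv("foo.csv", mode="w" if first else "a", header=first, index=False)
    first = False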
Python, and especially Python 3.x, provides a lot of tools for handling large files. One of them is using iterators.
Python returns the result of open() as an iterator (whether the file holds plain text, CSV, or anything else), so you won't have the problem of loading the whole file into memory; with this trick you can read your file line by line and handle each line as needed.
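A minimal sketch of that pattern (the file name, delimiter, and per-row condition are assumptions):

# Stream a big CSV one line at a time; memory use stays flat
# no matter how large the file is.
matches = 0
with open("big_file.csv") as f:
    header = next(f)                   # skip the header row
    for line in f:
        fields = line.rstrip("\n").split(",")
        if fields[0] == "some_value":  # whatever per-row condition you need
            matches += 1
print(matches)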
For example, if you want to chunk your file into blocks, you can use a deque object to hold the lines that belong to one block (based on your condition).
Alongside collections.deque, you can use some itertools functions to handle your lines and apply your conditions to them. For example, if you want to access the next line on each iteration you can use itertools.zip_longest, and for creating multiple independent iterators from your file object you can use itertools.tee.
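A small sketch of those two tools working together (the file name and conditions are assumptions):

import itertools

with open("big_file.log") as f:
    cur, nxt = itertools.tee(f)  # two independent iterators over the file
    next(nxt, None)              # advance one of them by a single line
    # Each iteration now sees a line together with the line that follows it.
    for line, following in itertools.zip_longest(cur, nxt, fillvalue=""):
        if "ERROR" in line and following.startswith("Traceback"):
            print(line.rstrip())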
Recently I wrote some code for filtering some huge log files (30 GB and larger) which performs very well. I have put the code on GitHub; you can check it out and use it:
https://github.com/Kasramvd/log-filter
I have a large XML file, ~30 MB.
Every now and then I need to update some of the values. I am using the ElementTree module to modify the XML. I am currently fetching the entire file, updating it, and then placing it back, so there is ~60 MB of data transfer every time. Is there a way I can update the file remotely?
I am using the following code to update the file.
import xml.etree.ElementTree as ET

tree = ET.parse("feed.xml")
root = tree.getroot()
skus = ["RUSSE20924", "PSJAI22443"]
qtys = [2, 3]
for child in root:
    sku = child.find("Product_Code").text.encode("utf-8")
    if sku in skus:
        print "found"
        i = skus.index(sku)
        child.find("Quantity").text = str(qtys[i])
        child.set('updated', 'yes')
tree.write("feed.xml")
Modifying a file directly via FTP without uploading the entire thing is not possible except when appending to a file.
The reason is that there are only three commands in FTP that actually modify a file (Source):
APPE: Appends to a file
STOR: Uploads a file
STOU: Creates a new file on the server with a unique name
What you could do
Track changes
Cache the remote file locally and track changes to the file using the MDTM command (see the sketch after this list).
Pros:
Will halve the required data transfer in many cases.
Hardly requires any change to existing code.
Almost zero overhead.
Cons:
Other clients will still have to download the entire thing every time something changes (no change from the current situation)
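A minimal sketch of the MDTM-based caching with Python's ftplib (the host, credentials, and file names are purely hypothetical):

import ftplib
import os

CACHE = "feed.xml"        # local copy of the remote file
STAMP = "feed.xml.mtime"  # where the last-seen timestamp is remembered

ftp = ftplib.FTP("ftp.example.com")  # hypothetical host
ftp.login("user", "password")        # hypothetical credentials

# MDTM returns e.g. "213 20240131120000" (the file's modification time).
remote_mtime = ftp.voidcmd("MDTM feed.xml").split()[1]

last_seen = open(STAMP).read() if os.path.exists(STAMP) else ""
if remote_mtime != last_seen:
    # The remote file changed: download it once and remember its timestamp.
    with open(CACHE, "wb") as f:
        ftp.retrbinary("RETR feed.xml", f.write)
    with open(STAMP, "w") as f:
        f.write(remote_mtime)

ftp.quit()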
Split up into several files
Split up your XML into several files. (One per product code?)
This way you only have to download the data that you actually need.
Pros:
Less data to transfer
Allows all scripts that access the data to only download what they need
Combinable with suggestion #1
Cons:
All existing code has to be adapted
Additional overhead when downloading or updating all the data
Switch to a delta-sync protocol
If the storage server supports it, switching to a delta synchronization protocol like rsync would help a lot, because such protocols only transmit the changes (with little overhead).
Pros:
Less data transfer
Requires little change to existing code
Cons:
Might not be available
Do it remotely
You already pointed out that you can't, but it would still be the best solution.
What won't help
Switch to a network filesystem
As somebody in the comments already pointed out, switching to a network file system (like NFS or CIFS/SMB) would not really help, because you cannot actually change parts of a file unless the new data has exactly the same length.
What to do
Unless you can do delta synchronization, I'd suggest implementing some caching on the client side first and, if that doesn't help enough, then splitting up your files.
I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before running the script. "back" represents the second number after the comma in the "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with Python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. It seems you should be able to read your *.dbf file more directly with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the with/as keywords (PEP 343; Raymond Hettinger has a good video on YouTube, too) or iterators in general (see PEPs 234 and 255), which only load one record at a time.
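As a concrete illustration of that record-at-a-time style, here is a sketch using an arcpy.da.UpdateCursor, which streams one row at a time and fills both columns in a single pass instead of two CalculateField passes (field names are taken from the sample table above):

import arcpy

inputTable = arcpy.GetParameterAsText(0)
fields = ["back_pres_dist", "back", "dist"]

# The cursor reads and writes one record at a time; the with/as block
# guarantees it is released even if an error occurs mid-table.
with arcpy.da.UpdateCursor(inputTable, fields) as cursor:
    for row in cursor:
        parts = row[0].split(",")
        row[1] = parts[1]  # second value goes to "back"
        row[2] = parts[3]  # fourth value goes to "dist"
        cursor.updateRow(row)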
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid competition between I/O requests. IPython is an add-on for Python with a pretty easy-to-use, high-level parallel package if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a three-hour one where the parallel material starts at about 2:13:00.