Distributing Python module - Spark vs Process Pools - python

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:
fileNameList = ['file1.pdf', 'file2.pdf', ..., 'file100000.pdf']
for pdf in fileNameList:
    text = myModule.extractText(pdf)  # Distribute this function
    # Do stuff with text
We used Spark once before (a coworker did, not me) to distribute the indexing of a few million files from an SQL database into Solr across a few servers. However, when researching this it seems that Spark is more for parallelizing over large data sets, not so much for distributing a single task. For that, it looks like Python's built-in process pools (the multiprocessing module) would be better, and I could just run that on a single server with something like 4 CPU cores.
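Something like this is what I have in mind for the process-pool route (just a sketch; it assumes extractText takes a file path and can safely be called from worker processes):
from multiprocessing import Pool

import myModule  # my extraction module

fileNameList = ['file1.pdf', 'file2.pdf', ..., 'file100000.pdf']

def extract(pdf):
    # Runs in a worker process; returns the filename with its extracted text.
    return pdf, myModule.extractText(pdf)

if __name__ == '__main__':
    with Pool(processes=4) as pool:  # one worker per CPU core
        for pdf, text in pool.imap_unordered(extract, fileNameList):
            pass  # Do stuff with text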
I know SO is more for specific problems, but I just wanted some advice before I go down entirely the wrong road. For my use case, should I stick to a single server with process pools, or split the work across multiple servers with Spark?

This is a perfectly reasonable thing to use Spark for, since you can distribute the task of text extraction across multiple executors by placing the files on distributed storage. That would let you scale out your compute to process the files and write the results back out efficiently and easily with PySpark. You could even reuse your existing Python text extraction code:
pdf_files = sc.binaryFiles("/path/to/files")  # RDD of (filename, file contents) pairs
processed = pdf_files.map(lambda fc: (fc[0], myModule.extract(fc[1])))  # (filename, extracted text)
As your data volume increases, or as you want to increase your throughput, you can simply add additional nodes.
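If you then want to persist the results, something along these lines would write one output file per partition (the output path here is a placeholder):
processed.saveAsTextFile("/path/to/output")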

Related

Multiple External Processes Reading From the Same Data Source

I have a situation where multiple sources will need to read from the same (small) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function; it must be loaded inside the function.
Is there a data source that can handle this effectively? A pandas DataFrame was an acceptable format for the information that needs to be read, so I tried storing that DataFrame in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would work well? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. I did run an experiment using a multiprocessing pool to simultaneously load the same very large (i.e. one-minute load) Excel file and it did not crash, but of course that is not exactly a conclusive experiment.
Thanks!
You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file into memory.
Also take a look at processing large xlsx file in python
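A minimal sketch of that read-only mode (the file name and sheet name here are placeholders):
from openpyxl import load_workbook

# read_only=True streams rows lazily instead of loading the whole workbook into memory
wb = load_workbook("shared_data.xlsx", read_only=True)
ws = wb["Sheet1"]  # hypothetical sheet name

for row in ws.iter_rows():
    values = [cell.value for cell in row]  # handle one row at a time
    # ... use values ...

wb.close()  # read-only workbooks keep the file handle open until closed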

Python Multithreading/processing gains for inserts to different tables in MySQL?

I've been poring over everything I can find for an answer to this, but can't seem to find anything:
I've got a batch update to a MySQL database that happens every few minutes, with Python handling the ETL work (I'm pulling data from web APIs into the MySQL system).
I'm trying to get a sense of what kinds of potential impact (be it positive or negative) I'd see by using either multithreading or multiprocessing to do multiple connections & inserts of the data simultaneously. Each worker (be it thread or process) would be updating a different table from any other worker.
At the moment I'm only updating a half-dozen tables with a few thousand records each, but this needs to be scalable to dozens of tables and hundreds of thousands of records each.
Every other resource I can find out there addresses doing multithreading/processing to the same table, not a distinct table per worker. I get the impression I would definitely want to use multithreading/processing, but it seems everyone's addressing the one-table use case.
Thoughts?
I think your question is too broad to answer concisely. It seems you're asking about two separate subjects: will writing to separate MySQL tables speed things up, and is Python multithreading the way to go. For the Python part, since you're probably doing mostly I/O, you should look at gevent and ultramysql. As for the MySQL part, you'll have to wait for more answers.
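As a rough illustration of the one-table-per-worker, I/O-bound idea (this sketch uses the standard library's ThreadPoolExecutor rather than gevent, and assumes the mysql-connector-python package; the credentials, table layout, and the batches dict are all placeholders):
from concurrent.futures import ThreadPoolExecutor

import mysql.connector  # pip install mysql-connector-python

def load_table(table, rows):
    # Each worker opens its own connection; MySQL connections should not be shared across threads.
    conn = mysql.connector.connect(host="localhost", user="etl",
                                   password="secret", database="warehouse")
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO {} (col_a, col_b) VALUES (%s, %s)".format(table), rows)
    conn.commit()
    conn.close()
    return table, len(rows)

# batches maps table name -> list of (col_a, col_b) tuples, built by the ETL step
batches = {"table_one": [("a", "b")], "table_two": [("c", "d")]}

with ThreadPoolExecutor(max_workers=6) as pool:
    futures = [pool.submit(load_table, tbl, rows) for tbl, rows in batches.items()]
    for f in futures:
        table, n = f.result()
        print("loaded %d rows into %s" % (n, table))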
For an ETL process I wrote in C#, I decided the best work partitioning was each "source" having a thread for extraction, one thread for each transform "type", and one to load the transformed data to each target.
In my case, I found multiple threads per source just ended up saturating the source server too much; it became less responsive overall (to even non-ETL queries) and the extractions didn't really finish any faster since they ended up competing with each other on the source. Since retrieving the remote extract was more time consuming than the local (in memory) transform, I was able to pipeline the extract results from all sources through one transformer thread/queue (per transform "type"). Similarly, I only had a single target to load the data to, so having multiple threads there would have just monopolized the target.
(Some details omitted/simplified for brevity, and due to poor memory.)
...but I'd think we'd need more details about what your ETL process does.
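For what it's worth, a minimal Python sketch of that extract -> transform -> load pipelining could look like this (the sources, transform(), and write_to_target() below are all placeholders):
import queue
import threading

SENTINEL = object()  # tells a downstream stage that one upstream worker is done

def transform(item):        # placeholder transform "type"
    return item * 2

def write_to_target(item):  # placeholder loader; a single target, as described above
    print(item)

def extractor(source, out_q):
    for record in source:   # e.g. rows pulled from one source server
        out_q.put(record)
    out_q.put(SENTINEL)

def transformer(in_q, out_q, n_sources):
    done = 0
    while done < n_sources:
        item = in_q.get()
        if item is SENTINEL:
            done += 1
            continue
        out_q.put(transform(item))
    out_q.put(SENTINEL)

def loader(in_q):
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        write_to_target(item)

raw_q, clean_q = queue.Queue(maxsize=1000), queue.Queue(maxsize=1000)
sources = [range(5), range(5)]  # placeholder "source servers"
threads = ([threading.Thread(target=extractor, args=(s, raw_q)) for s in sources]
           + [threading.Thread(target=transformer, args=(raw_q, clean_q, len(sources))),
              threading.Thread(target=loader, args=(clean_q,))])
for t in threads:
    t.start()
for t in threads:
    t.join()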

Handling large data files in python with low RAM low resource, creating large datafiles at local PC from SQLserver database using Python / ODBC

I am very new to Python.
In our company we use Base SAS for data analysis (ETL, EDA, basic model building). We want to check whether replacing it with Python is feasible for large chunks of data. With respect to that I have the following questions:
How does Python handle large files? My PC has 8 GB of RAM and I have a flat file of 30 GB (say a CSV file). I would generally do operations like left joins, deletes, and group-bys on such a file. This is easily doable in SAS, i.e. I don't have to worry about low RAM. Are the same operations doable in Python? I would appreciate it if somebody could provide a list of libraries and code for the same.
How can I perform SAS operations like "PROC SQL" in Python to create a dataset on my local PC while fetching the data from the server?
That is, in SAS I would download 10 million rows (7.5 GB of data) from SQL Server by running the following:
libname aa ODBC dsn =sql user = pppp pwd = XXXX;
libname bb '<<local PC path>>';
proc sql outobs = 10000000;
create table bb.foo as
select * from aa.bar
;quit;
What is the method to perform the same in Python? Again, just to remind you, my PC has only 8 GB of RAM.
Python, and especially Python 3.x, provides a lot of tools for handling large files. One of them is using iterators.
Python gives you the result of opening a file (text, CSV, or ...) as an iterator, so you won't have the problem of loading the whole file into memory; with this trick you can read your file line by line and handle each line based on your needs.
For example, if you want to chunk your file into blocks, you can use a deque object to hold the lines that belong to one block (based on your condition).
Alongside collections.deque, you can use some itertools functions to handle and apply your conditions to the lines. For example, if you want access to the next line on each iteration you can use itertools.zip_longest, and for creating multiple independent iterators over your file object you can use itertools.tee.
Recently I wrote some code for filtering some huge log files (30 GB and larger) that performs very well. I have put the code on GitHub, which you can check out and use:
https://github.com/Kasramvd/log-filter
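A minimal sketch of that line-by-line, block-at-a-time idea (the file names, the block-start test, and the keep() predicate are all placeholders):
from collections import deque

def filtered_blocks(path, is_block_start, keep):
    # Stream the file one line at a time; only one block is ever held in memory.
    block = deque()
    with open(path) as fh:               # the file object is itself an iterator
        for line in fh:
            if is_block_start(line) and block:
                if keep(block):
                    yield list(block)
                block.clear()
            block.append(line)
        if block and keep(block):
            yield list(block)

# Placeholder usage: keep blocks that mention ERROR, writing them to a smaller file.
with open("filtered.log", "w") as out:
    for blk in filtered_blocks("huge.log",
                               is_block_start=lambda l: not l.startswith(" "),
                               keep=lambda b: any("ERROR" in l for l in b)):
        out.writelines(blk)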

Spark speed performance

I have a program written for single-computer execution (in Python), and I have also implemented the same thing for Spark. The program basically just reads JSON, takes one field from it, and saves it back out. Using Spark, my program runs approximately 100 times slower on 1 master and 1 slave than the standard single-node Python version (which, of course, also reads from and saves to files). So I would like to ask where the problem might be.
My Spark program looks like:
sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2])
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
JsonExtractor only selects the data from field that is given by sys.argv[1].
My data are basically many small one-line files, where the line is always JSON.
I have tried both reading and writing the data from/to Amazon S3 and from/to local disk on all the machines.
I would like to ask if there is something that I might be missing or if Spark is supposed to be so slow in comparison with the local non paralleled single node program.
As was suggested to me on the Spark mailing list, the problem was the large number of very small JSON files.
Performance can be much improved either by merging the small files into bigger ones, or by:
sc = SparkContext(appName="Json data preprocessor")
distData = sc.textFile(sys.argv[2]).coalesce(10)  # coalesce into 10 partitions
json_extractor = JsonExtractor(sys.argv[1])
cleanedData = distData.flatMap(json_extractor.extract_json)
cleanedData.saveAsTextFile(sys.argv[3])
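A rough sketch of the other option, merging the one-line JSON files into larger files before handing them to Spark (the directory names and batch size are placeholders):
import glob
import os

BATCH = 10000  # how many small files to pack into each merged file
small_files = sorted(glob.glob("json_in/*.json"))  # placeholder input directory
os.makedirs("json_merged", exist_ok=True)

for i in range(0, len(small_files), BATCH):
    with open("json_merged/part-%05d.json" % (i // BATCH), "w") as out:
        for path in small_files[i:i + BATCH]:
            with open(path) as f:
                out.write(f.read().rstrip("\n") + "\n")  # one JSON document per line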

Speeding up Arcpy python Script, Big Data

I have a ridiculously simple Python script that uses the arcpy module. I turned it into a script tool in ArcMap and am running it that way. It works just fine; I've tested it multiple times on small datasets. The problem is that I have a very large amount of data. I need to run the script/tool on a .dbf table with 4 columns and 490,481,440 rows, and so far it has taken days. Does anyone have any suggestions on how to speed it up? To save time I've already created the columns that will be populated in the table before I run the script. "back" represents the second number in the comma-separated "back_pres_dist" column and "dist" represents the fourth. All I want is for them to be in their own separate columns. The table and script look something like this:
back_pres_dist back dist
1,1,1,2345.6
1,1,2,3533.8
1,1,3,4440.5
1,1,4,3892.6
1,1,5,1292.0
import arcpy
from arcpy import env
inputTable = arcpy.GetParameterAsText(0)
back1 = arcpy.GetParameterAsText(1) #the empty back column to be populated
dist3 = arcpy.GetParameterAsText(2) #the empty dist column to be populated
arcpy.CalculateField_management(inputTable, back1, '!back_pres_dist!.split(",")[1]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("back column updated.")
arcpy.CalculateField_management(inputTable, dist3, '!back_pres_dist!.split(",")[3]', "PYTHON_9.3")
updateMess = arcpy.AddMessage("dist column updated.")
updateMess = arcpy.AddMessage("All columns updated.")
Any suggestions would be greatly appreciated. I know that reading some parts of the data into memory might speed things up, but I'm not sure how to do that with python (when using R it took forever to read into memory and was a nightmare trying to write to a .csv).
This is a ton of data. I'm guessing that your main bottleneck is read/write operations on the disk and not CPU or memory.
Your process appears to modify each row independently according to constant input values in what's essentially a tabular operation that doesn't really require GIS functionality. As a result, I would definitely look at doing this outside of the arcpy environment to avoid that overhead. While you could dump this stuff to numpy with the new arcpy.da functionality, I think that even this might be a bottleneck. Seems you should be able to more directly read your *.dbf file with a different library.
In fact, this operation is not really tabular; it's really about iteration. You'll probably want to exploit things like the "with"/"as" keywords (PEP 343; Raymond Hettinger has a good video on YouTube about it, too) or iterators in general (see PEPs 234 and 255), which only load a record at a time.
Beyond those general programming approaches, I'm thinking that your best bet would be to break this data into chunks, parallelize the processing, and then reassemble the results. Part of engineering the parallelization could be to spread your data across different disk platters to avoid the I/O requests competing with each other. IPython is an add-on for Python that has a pretty easy-to-use, high-level package, "parallel", if you want an easy place to start. There are lots of pretty good videos on YouTube from PyCon 2012; there's a three-hour one where the parallel material starts at about 2:13:00.
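A rough sketch of reading the .dbf directly and splitting the column outside of arcpy (it assumes the third-party dbfread package and writes a plain CSV; the input and output file names are placeholders, while the field name comes from the question above):
import csv

from dbfread import DBF  # pip install dbfread; streams records one at a time

with open("back_dist.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["back_pres_dist", "back", "dist"])
    # DBF(...) iterates lazily, so the 490-million-row table never sits in memory.
    for record in DBF("input_table.dbf"):
        parts = record["back_pres_dist"].split(",")
        writer.writerow([record["back_pres_dist"], parts[1], parts[3]])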
