Multiple External Processes Reading From the Same Data Source - python

I have a situation where multiple sources need to read from the same (small) data source, possibly at the same time. For example, multiple different computers calling a function that needs to read from an external data source (e.g. an Excel file). Since multiple different sources are involved, I cannot simply read from the data source once and pass it into the function---it must be loaded in the function.
Is there a data source that can handle this effectively? A pandas dataframe was an acceptable format for the information that needs to be read, so I tried storing that dataframe in an sqlite3 database since, according to the sqlite3 website, sqlite3 databases can handle concurrent reads. Unfortunately, it fails too often. I tried multiple different iterations and simply could not get it to work.
Is there another data format/source that would be effective? I tried scouring the internet for whether something as simple as an Excel file plus the pandas read_excel function could handle this type of concurrency, but I could not find any information. As an experiment, I used a multiprocessing pool to simultaneously load the same very large (~1 minute load time) Excel file, and it did not crash. But of course, that is not exactly a conclusive experiment.
Thanks!

You can try using openpyxl's read-only mode. It uses a generator instead of loading the whole file into memory.
Also take a look at processing large xlsx file in python
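A minimal sketch of what read-only mode looks like (the file path and the use of the active sheet are placeholders; adapt them to your workbook):

```python
from openpyxl import load_workbook

def iter_rows_lazily(path):
    # read_only=True makes openpyxl stream rows through a generator
    # instead of building the whole worksheet in memory.
    wb = load_workbook(path, read_only=True)
    try:
        for row in wb.active.iter_rows(values_only=True):
            yield row
    finally:
        wb.close()  # read-only workbooks keep the file handle open
```

Because each caller only ever holds one row at a time, this also plays more nicely with many processes opening the same file concurrently.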

Related

Execute Python script inside SSIS, using SSIS-generated data

I have an SSIS package that will import an Excel file. I want to use a Python script to run through all the column headings and replace any whitespace with a '_'.
Previously when doing this for a pandas dataframe, I'd use:
df.columns = [w.replace(' ','_') for w in list(df.columns)]
However, I don't know how to reference the column headers from Python. I understand I would use an 'Execute Process Task' and how to implement that in SSIS, but how can I refer to a dataset contained within the SSIS package from Python?
Your dataset won't be in SSIS. The only data that is "in" SSIS are row buffers in a Data Flow Task. There you define a source, destination and any transformation that takes place per row.
If you're going to execute a python script, the end result is that you've expressed the original Excel file in some other format. Maybe you rewrote it as a CSV, maybe you wrote it to a table, perhaps it's just written back as a new Excel file but with no whitespace in the column names.
There is no native Data Flow source that will allow you to use Python directly. There is a Script Component, which allows you to run arbitrary .NET code, and there is IronPython, which would allow you to run IronPython in SSIS, but that's not going to work for a Data Flow Task. A Data Flow Task is metadata-dependent at run time: before the package runs, the engine interrogates the source and destination elements to ensure they exist and that the data type of each column is the same as, or wider than, the data type described in the contract that was built at design time.
In simple terms, you can't dynamically change out the shape of the data in a Data Flow Task. If you need a generic dynamic data importer, then you're writing all the logic yourself. You can still use SSIS as the execution framework as it has nice logging, management, etc but your SSIS package is going to be a mostly .NET project.
So, with all of that said, I think the next challenge you'll run into if you try to use IronPython with Pandas is that they don't work together. At least, not well enough that the expressed desire "a column rename" is worth the effort and maintenance headache you'd have.
There is also the option to execute sp_execute_external_script with a Python script in a Data Flow and use it as a source. Alternatively, you can save the output to a CSV or Excel file and read that in SSIS.

How to open huge parquet file using Pandas without enough RAM

I am trying to read a decently large Parquet file (~2 GB with about ~30 million rows) into my Jupyter Notebook (in Python 3) using the Pandas read_parquet function. I have also installed the pyarrow and fastparquet libraries which the read_parquet function uses as the engine for parquet files. Unfortunately, it seems that while reading, my computer freezes and eventually I get an error saying it ran out of memory (I don't want to repeat running the code since this will cause another freeze - I don't know the verbatim error message).
Is there a good way to read only some part of the parquet file into memory without this occurring? I know that parquet files are columnar and it may not be possible to load only a subset of the records into memory, but I'd like to split the work up if there is a workaround, or to see whether I am doing anything wrong while trying to read this in.
I do have a relatively weak computer in terms of specs, with only 6 GB memory and i3. The CPU is 2.2 GHz with Turbo Boost available.
Do you need all the columns? You might be able to save memory by just loading the ones you actually use.
A second possibility is to use an online machine (like google colab) to load the parquet file and then save it as hdf. Once you have it, you can use it in chunks.
You can use Dask instead of pandas. It is built on pandas, so it has a similar API that you will likely be familiar with, and it is meant for larger data.
https://examples.dask.org/dataframes/01-data-access.html
It's possible to read parquet data in:
- batches
- certain row groups (or by iterating over row groups)
- certain columns only
This way you can reduce the memory footprint. Both fastparquet and pyarrow should allow you to do this.
In case of pyarrow, iter_batches can be used to read streaming batches from a Parquet file.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for batch in parquet_file.iter_batches(batch_size=1000):
    print("RecordBatch")
    print(batch.to_pandas())
The above example simply reads 1000 records at a time. You can further limit this to certain row groups, or even certain columns, like below.
for batch in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(batch.to_pandas())

Removing JSON objects that aren't correctly formatted Python

I'm building a chatbot database at the moment, using data from pushshift.io. In order to deal with the big data file (I understand that json.load reads everything into RAM, so if you only have 16GB of RAM and are working with 30GB of data, that is a no-no), I wrote a bash script that splits the big file into smaller 3GB chunks so that I can run each through json.loads (or pd.read_json). The problem is that whenever I run my code it returns
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Thus I take a look into the temp json file that I just created and I see this happens in my JSON file:
ink_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
The sample correction of the data looks like this
{"score_hidden":false,"name":"t1_cnas8zv","link_id":"t3_2qyr1a","body":"Most of us have some family members like this. *Most* of my family is like this. ","downs":0,"created_utc":"1420070400","score":14,"author":"YoungModern","distinguished":null,"id":"cnas8zv","archived":false,"parent_id":"t3_2qyr1a","subreddit":"exmormon","author_flair_css_class":null,"author_flair_text":null,"gilded":0,"retrieved_on":1425124282,"ups":14,"controversiality":0,"subreddit_id":"t5_2r0gj","edited":false}
I noticed that my bash script split the file without paying attention to JSON object boundaries. So my question is: is there a way to write a function in Python that can detect JSON objects that are not correctly formatted and delete them?
There isn't a lot of information to go on, but I would challenge the frame a little.
There are several incremental json parsers available in Python. A quick search shows ijson should allow you to traverse your very large data structure without exploding.
You should also consider another data format (or a real database); otherwise you will easily find yourself spending time reimplementing much slower versions of features that already exist in the right tools.
If you are using the json standard library, then calling json.loads on badly formatted data will raise a JSONDecodeError. You can put your code in a try/except block and catch this exception to make sure you only process correctly formatted data.
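A sketch of that filter, assuming the split produced one candidate object per line:

```python
import json

def keep_valid_objects(lines):
    """Parse each line, silently dropping fragments that aren't valid JSON."""
    valid = []
    for line in lines:
        try:
            valid.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # a fragment created by the byte-based split
    return valid
```

Note this discards the broken fragments entirely; if you need every record, a better fix is to split the original file on object boundaries (e.g. with `split -l` on a newline-delimited JSON file) rather than on byte counts.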

Merge or Concatenate Hundreds of Excel files

I have 638 Excel files in a directory that are about 3000 KB large, each. I want to concatenate all of them together, hopefully only using Python or command line (no other programming software or languages).
Essentially, this is part of a larger process that involves some simple data manipulation, and I want it all to be doable by just running a single python file (or double clicking batch file).
I've tried variations of the code below with pandas, openpyxl, and xlrd, and they all seem to have about the same speed. Converting to CSV seems to require VBA, which I do not want to get into.
import os
import pandas as pd

temp_list = []
for filename in os.listdir(filepath):
    temp = pd.read_excel(os.path.join(filepath, filename),
                         sheet_name=X, usecols=fields)
    temp_list.append(temp)
Are there simpler command line solutions to convert these into csv files or merge into one excel document? Or is this pretty much it, just using the basic libraries to read individual files?
.xls(x) is a very (over)complicated format with lots of features and quirks accumulated over the years and is thus rather hard to parse. And it was never designed for speed or for large amounts of data but rather for ease of use for business people.
So with your number of files, your best bet is to convert them to .csv or another easy-to-parse format (or use such a format for data exchange in the first place) -- and preferably do this before you need to process them, e.g. upon each file's arrival.
E.g. this is how you can save the first sheet of a .xls(x) to .csv with pywin32 using Excel's COM interface:
import win32com.client

# Need the typelib metadata to have Excel-specific constants
x = win32com.client.gencache.EnsureDispatch("Excel.Application")
# Need to pass full paths, see https://stackoverflow.com/questions/16394842/excel-can-only-open-file-if-using-absolute-path-why
w = x.Workbooks.Open("<full path to file>")
s = w.Worksheets(1)
s.SaveAs("<full path to file without extension>", win32com.client.constants.xlCSV)
w.Close(False)  # don't save changes back to the workbook
Running this in parallel would normally have no effect because the same server process would be reused. You can force creating a different process for each batch as per How can I force python(using win32com) to create a new instance of excel?.
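Once the files are converted, the merge step from the question reduces to a single concat; the glob pattern and output path below are hypothetical:

```python
import glob
import pandas as pd

def merge_csvs(pattern, out_path):
    # Read every converted CSV and stack them into one frame.
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    merged.to_csv(out_path, index=False)
    return merged
```

Reading hundreds of CSVs this way is typically orders of magnitude faster than parsing the same data out of .xlsx files.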

Distributing Python module - Spark vs Process Pools

I've made a Python module that extracts handwritten text from PDFs. The extraction can sometimes be quite slow (20-30 seconds per file). I have around 100,000 PDFs (some with lots of pages) and I want to run the text extraction on all of them. Essentially something like this:
fileNameList = ['file1.pdf','file2.pdf',...,'file100000.pdf']
for pdf in fileNameList:
    text = myModule.extractText(pdf)  # Distribute this function
    # Do stuff with text
We used Spark once before (a coworker, not me) to distribute the indexing of a few million files from a SQL DB into Solr across a few servers. However, from my research it seems that Spark is more for parallelizing over large data sets, not so much for distributing a single task. For that, it looks like Python's built-in multiprocessing Pool would be better, and I can just run that on a single server with, say, 4 CPU cores.
I know SO is more for specific problems, but was just wanting some advice before I go down the entirely wrong road. For my use case should I stick to a single server with Process Pools, or split it across multiple servers with Spark?
This is perfectly reasonable to use Spark for, since you can distribute the task of text extraction across multiple executors by placing the files on distributed storage. This would let you scale out your compute to process the files and write the results back out efficiently with PySpark. You could even reuse your existing Python text extraction code:
input = sc.binaryFiles("/path/to/files")
# Tuple unpacking in lambdas was removed in Python 3, so index the pair.
processed = input.map(lambda pair: (pair[0], myModule.extract(pair[1])))
As your data volume increases or you wish to increase your throughput you can simply add additional nodes.
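For comparison, the single-server route from the question is only a few lines with multiprocessing; extract_text below is a stand-in for the real myModule.extractText:

```python
from multiprocessing import Pool

def extract_text(pdf_path):
    # Placeholder for myModule.extractText(pdf_path), the slow call.
    return f"text from {pdf_path}"

def extract_all(file_names, workers=4):
    with Pool(processes=workers) as pool:
        # imap_unordered yields results as each worker finishes,
        # so one slow file doesn't block the others.
        return list(pool.imap_unordered(extract_text, file_names))
```

At 20-30 seconds per file, 100,000 PDFs on 4 cores is still roughly a week of wall-clock time, which is the strongest argument for scaling out to multiple machines instead.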
