I have an existing Python script that loops through a directory of XML files, parses each file using etree, and inserts data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to roughly 50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to a CSV in the correct format and then bulk load the final CSV tables into Postgres using the copy_from command?
Otherwise I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size? I am just having trouble arriving at the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I've found to do bulk inserts. You might be able to get away with streaming the data through a generator, avoiding temporary files entirely while keeping memory usage low.
A generator function could assemble rows out of the XML data, and then you consume that generator with copy_from. You may even want multiple levels of generators, so that one yields records from a single file and another composes those from all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 separate INSERTs.
I wrote an answer here with links to example and benchmark code for setting something similar up.
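As a rough illustration of that approach, here is a sketch (not taken from the answer linked above) that assumes each XML file holds <record> elements with <name> and <value> children and a target table my_table(name, value); rows stream through generators and are flushed to Postgres with copy_from in batches, so no temporary CSV is written:

import io
import psycopg2
from lxml import etree

def rows_from_file(path):
    """Yield one tab-separated line per <record> element in a single XML file."""
    tree = etree.parse(path)
    for el in tree.iter("record"):                      # assumed element name
        yield "%s\t%s\n" % (el.findtext("name"), el.findtext("value"))

def rows_from_files(paths):
    """Chain the per-file generators into one stream covering every file."""
    for path in paths:
        yield from rows_from_file(path)

def copy_in_batches(conn, paths, batch_size=100000):
    """Buffer rows in memory and COPY each batch in, instead of row-by-row INSERTs."""
    with conn.cursor() as cur:
        buf, count = io.StringIO(), 0
        for line in rows_from_files(paths):
            buf.write(line)
            count += 1
            if count >= batch_size:
                buf.seek(0)
                cur.copy_from(buf, "my_table", columns=("name", "value"))
                buf, count = io.StringIO(), 0
        if count:
            buf.seek(0)
            cur.copy_from(buf, "my_table", columns=("name", "value"))
    conn.commit()

Raising batch_size trades memory for fewer COPY round trips; wrapping the combined generator in a file-like object instead would get you down to a single COPY over everything.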
Related
I am new to Apache Beam. I have a requirement to read a text file in the format given below:
a=1
b=3
c=2

a=2
b=6
c=5
Here all rows up to an empty line are part of one record and need to be processed together (e.g. inserted into the table as columns). The above example corresponds to a file with just 2 records.
I am using ReadFromText to read the file and process it. It reads each line as an element. I am then trying to loop and process until I hit an empty line.
ReadFromText returns a PCollection, and I have read that a PCollection is an abstraction of a potentially distributed dataset. My question is: while reading, will I get the lines in the same order as in the file, or will I just get a collection of rows where the order is not preserved? What can I use to solve this problem?
I am using python language. I have to read the file from the GCP bucket and use Google Dataflow for execution.
No, your records are not guaranteed to be in the same order. PCollections are inherently unordered, and elements in a PCollection are expected to be parallelizable, that is, distinct and not reliant on other elements in the PCollection.
In your example you're using TextIO which treats each line of a text file as a separate element, but what you need is to gather each set of data for a record as one element. There are many potential ways around this.
If you can modify the text file, you could put all your data on a single line per record, and then parse that line in a transform you write. This is the usual approach taken, for example with CSV files.
If you can't modify the files, a simple solution for adding your own logic for reading files is to retrieve the files with FileIO and then write a custom ParDo with your own logic for reading the files. This is not as simple as using an existing IO out of the box, but is still easier than creating a fully featured Source.
If the files are more complex and you need a more robust solution, you can implement your own Source that reads the file and outputs records in your required format. This would most likely involve using Splittable DoFns and would require a fair amount of knowledge in how a FileBasedSource works.
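As a sketch of the second option, the pipeline below (the bucket path, DoFn name, and per-record dict format are all assumptions) matches the files with FileIO, hands each whole file to a ParDo, and splits it on blank lines so that every record becomes one element:

import apache_beam as beam
from apache_beam.io import fileio

class SplitIntoRecords(beam.DoFn):
    """Read a whole file and emit one dict per blank-line-separated record."""
    def process(self, readable_file):
        text = readable_file.read_utf8()
        for block in text.split("\n\n"):                # a blank line ends a record
            lines = [ln for ln in block.splitlines() if ln.strip()]
            if lines:
                # e.g. {'a': '1', 'b': '3', 'c': '2'}
                yield dict(ln.split("=", 1) for ln in lines)

with beam.Pipeline() as p:
    records = (
        p
        | fileio.MatchFiles("gs://my-bucket/input/*.txt")   # assumed GCS path
        | fileio.ReadMatches()
        | beam.ParDo(SplitIntoRecords())
    )
    # downstream transforms (e.g. writing each record to a table) go here

Because each file is read inside a single DoFn call, one worker handles a whole file; that is usually fine for files of this shape, but it gives up splitting within a file.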
I'm trying to find a better way to push data to a SQL database using Python. I have tried
dataframe.to_sql() and cursor.fast_executemany(),
but they don't seem to increase the speed with the data I'm working with right now (the data is in CSV files). Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the CSV files are at least 1 GB in size and it takes around 10-17 minutes to push one file.]
I'm fairly new to many of Python's concepts, so please suggest a method or at least a reference to an article with more info. Thanks in advance.
If you are trying to insert the csv as is into the database (i.e. without doing any processing in pandas), you could use sqlalchemy in python to execute a "BULK INSERT [params, file, etc.]". Alternatively, I've found that reading the csvs, processing, writing to csv, and then bulk inserting can be an option.
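For the BULK INSERT route, here is a minimal sketch, assuming a SQL Server target reached through pyodbc; the connection URL, table name, and file path are placeholders, and the CSV path must be visible to the database server itself, since BULK INSERT runs server-side:

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")   # hypothetical DSN-based URL

bulk_insert = text("""
    BULK INSERT my_table
    FROM 'C:\\data\\big_file.csv'            -- path as seen by the SQL Server host
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")

with engine.begin() as conn:                 # begin() commits on success
    conn.execute(bulk_insert)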
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting to the db, etc.
So, I have a large amount of data that I wish to upload to a table in MySQL. I can use MySQL's built-in data import wizard to upload each .csv file (around 90 files, ~150 MB each) into the table, but each file takes too long and it would take months to upload all this data.
So instead, I want to use the LOAD DATA INFILE command in MySQL (which is apparently faster, according to the internet), but this is generally done for individual files. I was wondering whether I can 'loop' this SQL command using Python (Python has a module that connects to MySQL named 'mysqlclient') and run it over the directory where my data files live, so that all the data gets uploaded to my table file by file automatically. Unfortunately, I have not been able to come up with a syntactically accurate way to do this in Python 3.6. Maybe you can help?
Other methods/commands to perform this task are also welcome.
Versions: Python 3.6; MySQL 5.7; Win 8.1;
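A minimal sketch of the loop described above, assuming the mysqlclient package (imported as MySQLdb), LOCAL INFILE enabled on both client and server, and placeholder credentials, paths, and table name:

import glob
import MySQLdb                                 # provided by the mysqlclient package

conn = MySQLdb.connect(host="localhost", user="root", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()

for path in glob.glob(r"C:\data\*.csv"):       # directory holding the ~90 CSV files
    cur.execute(
        """
        LOAD DATA LOCAL INFILE %s
        INTO TABLE my_table
        FIELDS TERMINATED BY ','
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
        """,
        (path.replace("\\", "/"),)             # forward slashes avoid MySQL escape issues
    )
    conn.commit()

cur.close()
conn.close()

Committing after each file keeps a failed load from rolling back the files that already went through.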
I have to build a database from 1,260,000 XML files. Each of these XML files is processed with Python, parsed, and then inserted in a certain way into the database.
This is done with the psycopg2 library.
For example, I read a name, check whether that name is already in the database, and then insert it or not as the case may be.
This is all done in Python.
Each file takes about 10 minutes to process, so the whole job would take years to complete.
I wonder if there is an alternative to what I am trying to do. (Sorry for the noob question.)
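One commonly suggested alternative to the per-name "look up, then insert" round trip is to batch the parsed rows and let Postgres skip duplicates itself; here is a sketch using psycopg2's execute_values, with a hypothetical people table that has a unique constraint on name:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb")          # hypothetical connection string
with conn, conn.cursor() as cur:
    names = [("alice",), ("bob",), ("alice",)]  # values parsed from one XML file
    execute_values(
        cur,
        "INSERT INTO people (name) VALUES %s ON CONFLICT (name) DO NOTHING",
        names,
    )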
Problem
I was trying to implement a web API (based on Flask) that queries the database given some specific conditions, reconstructs the data, and finally exports the result to a .csv file.
Since the amount of data is really, really huge, I cannot construct the whole dataset and generate the .csv file all at once (e.g. create a DataFrame using pandas and finally call df.to_csv()), because that would cause a slow query and the HTTP connection might time out.
So I created a generator which queries the database 500 records at a time and yields the results one by one, like:
def __generator(q):
    [...]  # some setup code here (e.g. offset = 0, limit = 500)
    while True:
        records = q[offset:offset + limit]  # q is a SQLAlchemy query object
        if not records:                     # an empty slice means nothing is left
            break
        [...]  # omit some reconstruct code
        for record in records:
            yield record
        offset += limit                     # advance to the next page
and finally construct a Response object and send the .csv to the client side:
return Response(__generator(q), mimetype='text/csv')  # Flask
The generator works well and all the data is encoded as 'utf-8', but when I try to open the .csv file in Microsoft Excel, the text appears garbled.
Measures Already Tried
adding a BOM header to the export file: doesn't work;
using some other encoding like 'gb18030' or 'cp936': most of the garbled text disappears, but some still remains, and parts of the table structure become weird.
My Question Is
How can I make my output compatible with Microsoft Excel? That means at least two conditions should be satisfied:
no garbled text, everything displayed correctly;
a well-structured table.
I would really appreciate your answers!
How are you importing the CSV file into Excel? Have you tried importing the CSV as a text file?
By reading each column as text, Excel won't modify columns that it would otherwise parse as other types, like dates. Your code may be correct, and Excel may just be altering the data when it parses it as a CSV; by importing it in text format, nothing gets modified.
I would recommend you look into xlutils. It's been around for quite some time, and our company has used it both for reading configuration files to run automated tests and for generating reports of test results.