I'm trying to find a better way to push data to a SQL database using Python. I have tried
the dataframe.to_sql() method and cursor.fast_executemany
but they don't seem to increase the speed with the data I'm working with right now (the data is in CSV files). Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the CSV files are at least 1 GB in size and it takes around 10-17 minutes to push one file]
I'm fairly new to many Python concepts, so please suggest a method, or at least point me to an article with more information. Thanks in advance.
If you are trying to insert the CSV as-is into the database (i.e. without doing any processing in pandas), you could use SQLAlchemy in Python to execute a "BULK INSERT [params, file, etc.]". Alternatively, I've found that reading the CSVs, processing them, writing them back to CSV, and then bulk inserting can be an option.
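A minimal sketch of that first approach, assuming a SQL Server target reached through SQLAlchemy and pyodbc; the connection string, table name, and file path are placeholders, and BULK INSERT needs the file to be readable from the database server itself:

```python
# Sketch only: connection string, table, and file path are hypothetical.
import sqlalchemy as sa

engine = sa.create_engine("mssql+pyodbc://user:password@my_dsn")

bulk_insert = sa.text("""
    BULK INSERT my_table
    FROM 'C:\\data\\my_file.csv'
    WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')
""")

with engine.begin() as conn:   # begin() commits on successful exit
    conn.execute(bulk_insert)
```

If you do need to process the data first, the same statement works on the cleaned CSV you write back out.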
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting to the db, etc.
I am looking to replace the database (SQL, around 50,000 rows × 50 columns) for my app with Excel. I need to update a single cell in Excel without loading the whole workbook and then saving it again (I am using openpyxl), as that is computationally very expensive. I need an alternative that will help me save execution time.
I have tried Excel APIs like xlwings but need an alternative to APIs.
I cannot comment yet, so I will "answer". Why would you replace a database with Excel? Sounds crazy to me. There are plenty of other persistent storage formats out there to use: pickle, HDF5, pyarrow stuff, CSV, etc. I used the feather format for a while; it is super fast and pandas can use it natively.
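If it helps, the feather round trip looks roughly like this (the file name and DataFrame contents are just placeholders, and to_feather needs pyarrow installed):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

df.to_feather("data.feather")            # fast columnar write
df_back = pd.read_feather("data.feather")
print(df_back.equals(df))                # True
```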
There are several OLTP Postgres databases which in total accept 100 million rows daily.
There is also a Greenplum DWH. How can I load these 100 million rows of data, with only a little transformation, into Greenplum daily?
I am going to use Python for that.
I am sure that doing it the traditional way (psycopg2 + cursor.execute("INSERT ...")), even with batches, is going to take a lot of time and will create a bottleneck in the whole system.
Do you have any suggestions on how to optimize the data-loading process? Any links or books that may help are also welcome.
You should try exporting the data to a flat file (CSV, TXT, etc.).
Then you can use one of the Greenplum utilities to import the data.
Look here.
You can do the transformation on the data with Python before creating the flat file. Use Python to automate the whole process: export the data to a file and import the data into the table.
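A minimal sketch of that automation, assuming psycopg2 for the export and the Greenplum gpload utility for the import; the query, table names, file names, and YAML control file are hypothetical:

```python
import subprocess
import psycopg2

# 1. Export today's rows from the OLTP Postgres database to a flat CSV file.
with psycopg2.connect("dbname=oltp_db") as src, open("daily_events.csv", "w") as f:
    with src.cursor() as cur:
        cur.copy_expert(
            "COPY (SELECT * FROM events WHERE created_at::date = CURRENT_DATE) "
            "TO STDOUT WITH CSV",
            f,
        )

# 2. (Optional) transform the CSV here with plain Python before loading.

# 3. Load the file into Greenplum with gpload, driven by a YAML control file.
subprocess.run(["gpload", "-f", "daily_events.yml"], check=True)
```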
I have to build a database from 1,260,000 XML files. Each of these XML files is processed with Python, parsed, and then inserted in a certain way into the database.
This is done with the psycopg2 library.
For example, I read a name, check whether the name is already in the database, and then do the insertion or not as the case may be.
All of this is done with Python.
Each file takes about 10 minutes to process, so the whole job would take years to complete.
I wonder if there is an alternative for what I am trying to do. (Sorry for the noob question)
I have an existing Python script that loops through a directory of XML files, parsing each file using etree and inserting data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to about ~50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to a CSV in the correct format and then bulk load the final CSV tables into Postgres using the copy_from command?
Otherwise, I was thinking about populating some sort of temporary in-memory data structure that I could insert into the DB once it reaches a certain size, but I am having trouble arriving at the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I found to do bulk inserts. You might be able to get away with streaming the data through a generator to stay away from writing temporary files while keeping memory usage low.
A generator function could assemble rows out of the XML data, then consume that generator with copy_from. You may even want multiple levels of generators, such that you have one which yields records from a single file and another which composes those from all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 individual inserts.
I wrote an answer here with links to example and benchmark code for setting something similar up.
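A minimal sketch of that generator-plus-copy_from idea; the XML tag names, table, and columns are hypothetical, and the small file-like wrapper exists only because copy_from expects an object with read()/readline() rather than a bare generator:

```python
import os
import psycopg2
from xml.etree import ElementTree as ET

class IteratorFile:
    """Wrap an iterator of text lines in a minimal file-like object
    so psycopg2's copy_from can stream rows without a temp file."""
    def __init__(self, lines):
        self._iter = iter(lines)
        self._buffer = ""

    def read(self, size=-1):
        # Accumulate lines until we can hand back `size` characters.
        while size < 0 or len(self._buffer) < size:
            try:
                self._buffer += next(self._iter)
            except StopIteration:
                break
        if size < 0:
            data, self._buffer = self._buffer, ""
        else:
            data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data

    def readline(self):
        if not self._buffer:
            try:
                self._buffer = next(self._iter)
            except StopIteration:
                return ""
        line, self._buffer = self._buffer, ""
        return line

def rows_from_file(path):
    """Yield one tab-separated line per record in a single XML file."""
    for record in ET.parse(path).iter("record"):      # hypothetical tag
        yield "%s\t%s\n" % (record.findtext("name"), record.findtext("value"))

def rows_from_dir(xml_dir):
    """Compose the per-file generators across every XML file."""
    for fname in sorted(os.listdir(xml_dir)):
        if fname.endswith(".xml"):
            yield from rows_from_file(os.path.join(xml_dir, fname))

conn = psycopg2.connect("dbname=mydb")                # hypothetical DSN
with conn, conn.cursor() as cur:
    cur.copy_from(IteratorFile(rows_from_dir("xml_files")),
                  "my_table", columns=("name", "value"))
```

Note that values containing tabs, newlines, or backslashes would need escaping before being fed to copy_from; the sketch above assumes clean text fields.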
I am working on a data warehouse and looking for an ETL solution that uses Python.
I have played with SnapLogic as an ETL, but I was wondering if there were any other solutions out there.
This data warehouse is just getting started. I have not brought any data over yet. It will easily be over 100 gigs with the initial subset of data I want to load into it.
Yes. Just write Python using a DB-API interface to your database.
Most ETL programs provide fancy "high-level languages" or drag-and-drop GUIs that don't help much.
Python is just as expressive and just as easy to work with.
Eschew obfuscation. Just use plain-old Python.
We do it every day and we're very, very pleased with the results. It's simple, clear and effective.
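For what it's worth, the whole ETL in plain Python against a DB-API connection is only a few functions. A minimal sketch using the standard library's sqlite3 as a stand-in target; the CSV, table, and column names are hypothetical:

```python
import csv
import sqlite3

def extract(path):
    """Read raw rows from the source CSV."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Apply whatever per-row cleanup the warehouse needs."""
    for row in rows:
        yield (row["id"], row["name"].strip().lower(), float(row["amount"]))

def load(rows, db_path="warehouse.db"):
    """Insert the transformed rows through the DB-API interface."""
    conn = sqlite3.connect(db_path)
    with conn:   # commits on success, rolls back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sales (id TEXT, name TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.close()

load(transform(extract("sales.csv")))
```

Swap sqlite3 for whatever DB-API driver your warehouse uses; the structure stays the same.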
You can use pyodbc, a Python library, to extract data from various database sources. Then use pandas DataFrames to manipulate and clean the data as per the organization's needs, and then use pyodbc again to load it into your data warehouse.
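A minimal sketch of that flow, with hypothetical DSNs, query, and table/column names:

```python
import pandas as pd
import pyodbc

# Extract from the source database through pyodbc.
# (pandas may warn that it prefers a SQLAlchemy connection here.)
src = pyodbc.connect("DSN=source_dsn;UID=user;PWD=password")
df = pd.read_sql("SELECT id, name, amount FROM source_table", src)
src.close()

# Transform / clean with pandas as the organization requires.
df = df.dropna().rename(columns=str.lower)

# Load into the warehouse, again through pyodbc.
dwh = pyodbc.connect("DSN=warehouse_dsn;UID=user;PWD=password")
cur = dwh.cursor()
cur.fast_executemany = True   # batch the parameter sets for speed
cur.executemany(
    "INSERT INTO warehouse_table (id, name, amount) VALUES (?, ?, ?)",
    list(df.itertuples(index=False, name=None)),   # plain tuples, one per row
)
dwh.commit()
dwh.close()
```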