There are several OLTP Postgres databases which in total accept 100 million rows daily.
There is also a Greenplum DWH. How can I load these 100 million rows of data, with only a little transformation, into Greenplum daily?
I am going to use Python for that.
I am sure that doing it the traditional way (psycopg2 + cursor.execute("INSERT ...")), even with batches, is going to take a lot of time and will create a bottleneck in the whole system.
Do you have any suggestions on how to optimize the data loading process? Any links or books that may help are also welcome.
You should try exporting the data into a flat file (csv, txt, etc.).
Then you can use one of the Greenplum utilities to import the data.
Look here.
You can do the transformation on the data with Python before creating the flat file. Use Python to automate the whole process: export the data into a file and import the data into the table.
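For example, something along these lines (a rough sketch only; the connection string, query, transformation, and the gpload control file daily_load.yml are all placeholders you would replace with your own):

```python
import csv
import subprocess

import psycopg2

# Sketch: export from one OLTP Postgres database, apply a light transformation,
# then hand the flat file to a Greenplum loading utility (here, gpload).
SRC_DSN = "host=oltp-db dbname=app user=etl"          # placeholder DSN

def export_and_transform(out_path):
    # COPY ... TO STDOUT is far cheaper than fetching rows one by one.
    with psycopg2.connect(SRC_DSN) as conn, conn.cursor() as cur, \
            open("raw_rows.csv", "w", newline="") as raw:
        cur.copy_expert(
            "COPY (SELECT * FROM events WHERE created_at::date = current_date) "
            "TO STDOUT WITH CSV", raw)

    # Light transformation while rewriting the file.
    with open("raw_rows.csv", newline="") as raw, \
            open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for row in csv.reader(raw):
            row[1] = row[1].strip().upper()   # example transformation only
            writer.writerow(row)

export_and_transform("daily_events.csv")

# gpload reads a YAML control file that points at daily_events.csv,
# the target Greenplum table, and the gpfdist host/port.
subprocess.run(["gpload", "-f", "daily_load.yml"], check=True)
```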
# Background
I am currently playing with a web scraping project as I am learning Python.
The project scrapes products, with information about price etc., using Selenium.
Then I add every record to a pandas DataFrame, do some additional data manipulation, then store the data in a csv and upload it to Google Drive. This runs every night.
# Question itself
I would like to watch price changes, new products, etc. Could you recommend how to store the data with a date key, so that there is an option to flag new products and so on?
My idea is to store every load in one csv and add a "date_of_load" column... but this seems noob-like... Maybe store the data in PostgreSQL? I would like to start learning SQL, so I would try making my own DB.
Thanks for your ideas
For this task, I think it is better to use NoSQL (MongoDB). You can create JSON documents (the price data) keyed by date.
This can help you:
https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
https://www.mongodb.com/python
https://realpython.com/introduction-to-mongodb-and-python/
https://www.google.com/search?&q=python+mongo
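A minimal sketch of what that could look like with pymongo (the database/collection names and the example record fields are just placeholders):

```python
from datetime import datetime, timezone

from pymongo import MongoClient

# One document per product per nightly load, with the load date stored on each record.
client = MongoClient("mongodb://localhost:27017")
prices = client["scraper"]["prices"]

prices.insert_one({
    "product_id": "SKU-123",
    "name": "Example product",
    "price": 19.99,
    "date_of_load": datetime.now(timezone.utc),
})

# Price history for one product, newest load first.
for doc in prices.find({"product_id": "SKU-123"}).sort("date_of_load", -1):
    print(doc["date_of_load"], doc["price"])
```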
That is cool! I would suggest sqlite3 (https://docs.python.org/3/library/sqlite3.html) just to get a feel for SQL. As you can see, it says "It’s also possible to prototype an application using SQLite and then port the code to a larger database such as PostgreSQL or Oracle", which is sort of what you suggested(?), so it could be a nice place to start.
However, CSV might do just fine. As long as there is not so much data that it takes forever to load (and process) everything you need, it doesn't matter much how you store it, as long as you can use it the way you want.
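For example, a rough sqlite3 sketch of the "one table plus a date_of_load column" idea (the table layout and sample row are made up for illustration):

```python
import sqlite3
from datetime import date

conn = sqlite3.connect("prices.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS prices (
        product_id   TEXT,
        name         TEXT,
        price        REAL,
        date_of_load TEXT
    )
""")

today = date.today().isoformat()
rows = [("SKU-123", "Example product", 19.99, today)]   # one nightly load
conn.executemany("INSERT INTO prices VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Flag products that appear in today's load but in no earlier load.
new_products = conn.execute("""
    SELECT product_id FROM prices WHERE date_of_load = ?
    EXCEPT
    SELECT product_id FROM prices WHERE date_of_load < ?
""", (today, today)).fetchall()
print(new_products)
```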
I'm trying to find a better way to push data to a SQL database using Python. I have tried
the dataframe.to_sql() method and cursor.fast_executemany(),
but they don't seem to increase the speed with the data I'm working with right now (the data is in csv files). Someone suggested that I could use named tuples and generators to load data much faster than pandas can.
[Generally the csv files are at least 1 GB in size, and it takes around 10-17 minutes to push one file.]
I'm fairly new to many of Python's concepts, so please suggest a method, or at least point me to an article with more information. Thanks in advance.
If you are trying to insert the csv as is into the database (i.e. without doing any processing in pandas), you could use sqlalchemy in Python to execute a "BULK INSERT [params, file, etc.]" statement. Alternatively, I've found that reading the csvs, processing them, writing them back out to csv, and then bulk inserting can be an option.
Otherwise, feel free to specify a bit more what you want to accomplish, how you need to process the data before inserting to the db, etc.
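For example, something roughly like this, assuming the target is SQL Server (FORMAT = 'CSV' needs SQL Server 2017+) and the files are visible to the database server; the connection string, directory, and table name are placeholders:

```python
from pathlib import Path

from sqlalchemy import create_engine, text

engine = create_engine("mssql+pyodbc://user:password@my_dsn")

with engine.begin() as conn:                       # one transaction for all files
    for csv_path in Path(r"C:\data\csv").glob("*.csv"):
        # BULK INSERT runs on the server and skips the header row of each file.
        conn.execute(text(f"""
            BULK INSERT dbo.my_table
            FROM '{csv_path}'
            WITH (FORMAT = 'CSV', FIRSTROW = 2, TABLOCK)
        """))
```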
I'm going to use data from a .csv to train a model to predict user activity on Google Ads (impressions, clicks) in relation to the weather for a given day. I have a .csv that contains 6000+ records of this info and want to parse it into a database using Python.
I tried making a df in pandas, but for some reason the whole table isn't shown. The middle columns (there are about 7 columns, I think) and rows (numbered over 6000, as I mentioned) are replaced with '...' when I print the table, so I'm not sure whether the entirety of the information is being stored and whether this will be usable.
My next attempt will possibly be SQLite, but since it is stored locally, will this interfere with someone else making requests to my API endpoint if I don't have the db actively open at all times?
Thanks in advance.
If you used pd.read_csv(), I can assure you all of the info is there; pandas is just not displaying it.
You can check by doing something like print(df['Column_name_you_are_interested_in'].tolist()) just to make sure, though. You can also use the various count-type methods in pandas to make sure all of your rows are there.
Pandas is pretty versatile, so it shouldn't have trouble with 6000 lines.
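For example (the file name is a placeholder):

```python
import pandas as pd

df = pd.read_csv("ads_weather.csv")      # placeholder file name

print(df.shape)              # (rows, columns) actually held in memory
print(df.columns.tolist())   # every column name, even the ones hidden by "..."
print(df.count())            # non-null values per column

# If you want print(df) to show every column instead of eliding the middle ones:
pd.set_option("display.max_columns", None)
```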
So, I have a large amount of data that I wish to upload to a table in MySQL. I can use MySQL's built-in data import wizard to upload each .csv file (around 90 files, ~150 MB each) into the table, but each file takes too long and it would take months to upload all this data.
So instead, I want to use MySQL's 'LOAD DATA INFILE' command (which is apparently faster, according to the internet), but this is generally done for individual files. I was wondering if I could 'loop' this SQL command using Python (Python has a module that connects to MySQL, named 'mysqlclient') and run it over the directory containing my data files, so that all the data gets uploaded to my table one file at a time automatically. Unfortunately, I have not been able to come up with a syntactically correct way to do this in Python 3.6. Maybe you can help?
Other methods/commands to perform this task are also welcome.
Versions: Python 3.6; MySQL 5.7; Win 8.1;
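Something along these lines is what I have in mind, but I am not sure it is correct (host, credentials, directory, and table name are placeholders, and LOCAL INFILE must also be enabled on the server):

```python
from pathlib import Path

import MySQLdb  # installed as the 'mysqlclient' package

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()

# Run LOAD DATA once per csv file in the directory.
for csv_file in sorted(Path(r"C:\data\csv").glob("*.csv")):
    cur.execute(f"""
        LOAD DATA LOCAL INFILE '{csv_file.as_posix()}'
        INTO TABLE my_table
        FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
        LINES TERMINATED BY '\\n'
        IGNORE 1 LINES
    """)
    conn.commit()
    print("loaded", csv_file.name)

conn.close()
```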
I have an existing Python script that loops through a directory of XML files, parsing each file with etree and inserting data at different points into a Postgres database schema using the psycopg2 module. This hacked-together script worked just fine, but now the amount of data (the number and size of XML files) is growing rapidly, and the number of INSERT statements is just not scaling. The largest table in my final database has grown to about ~50 million records from about 200,000 XML files. So my question is, what is the most efficient way to:
Parse data out of XMLs
Assemble row(s)
Insert row(s) to Postgres
Would it be faster to write all the data to CSVs in the correct format and then bulk load the final CSV files into Postgres using the copy_from command?
Otherwise, I was thinking about populating some sort of temporary data structure in memory that I could insert into the DB once it reaches a certain size. I am just having trouble working out the specifics of how this would work.
Thanks for any insight on this topic, and please let me know if more information is needed to answer my question.
copy_from is the fastest way I have found to do bulk inserts. You might be able to get away with streaming the data through a generator, to avoid writing temporary files while keeping memory usage low.
A generator function could assemble rows out of the XML data, and you then consume that generator with copy_from. You may even want multiple levels of generators, so that one yields records from a single file and another composes those from all 200,000 files. You'd end up with a single query, which will be much faster than 50,000,000 separate ones.
I wrote an answer here with links to example and benchmark code for setting something similar up.
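As a rough sketch of the idea: one generator yields rows per file, another chains them across all files, and the consumer buffers the stream in memory-bounded batches that it hands to copy_from (the XML element names, the "products" table, and the connection string are placeholders for whatever your schema actually looks like):

```python
import io
import xml.etree.ElementTree as ET
from pathlib import Path

import psycopg2

def rows_from_file(path):
    """Yield one tab-separated line per <record> element in a single XML file."""
    for record in ET.parse(path).iterfind(".//record"):
        yield "\t".join((
            record.findtext("id", default="\\N"),     # \N is copy_from's default NULL marker
            record.findtext("name", default="\\N"),
            record.findtext("price", default="\\N"),
        )) + "\n"

def rows_from_files(paths):
    """Compose the per-file generators into one stream covering every file."""
    for path in paths:
        yield from rows_from_file(path)

def copy_rows(conn, rows, batch_size=100_000):
    """Feed the row stream to copy_from in memory-bounded batches."""
    with conn.cursor() as cur:
        buf, n = io.StringIO(), 0
        for line in rows:
            buf.write(line)
            n += 1
            if n % batch_size == 0:
                buf.seek(0)
                cur.copy_from(buf, "products", columns=("id", "name", "price"))
                buf = io.StringIO()
        if buf.tell():
            buf.seek(0)
            cur.copy_from(buf, "products", columns=("id", "name", "price"))
    conn.commit()

conn = psycopg2.connect("dbname=mydb")   # placeholder DSN
copy_rows(conn, rows_from_files(Path("xml_dir").glob("*.xml")))
```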