I am using Python pandas to load data from a MySQL database, modify it, and then update another table. There are 100,000+ rows, so the UPDATE queries take some time.
Is there a more efficient way to update the data in the database than to use df.iterrows() and run an UPDATE query for each row?
The problem here is not pandas, it is the UPDATE operations. Each row will fire its own UPDATE query, meaning lots of overhead for the database connector to handle.
You are better off using the df.to_csv('filename.csv') method to dump your dataframe to CSV, then reading that CSV file into your MySQL database with the LOAD DATA INFILE statement.
Load it into a new table, then DROP the old one and RENAME the new one to the old one's name.
Furthermore, I suggest you do the same when loading data into pandas: use the SELECT ... INTO OUTFILE MySQL statement and then load that file into pandas using the pd.read_csv() method.
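For illustration, here is a minimal sketch of that round trip, assuming a MySQL connection conn with LOCAL INFILE enabled and a placeholder table name my_table:

import pandas as pd

# dump the modified dataframe to CSV (no index, no header)
df.to_csv('/tmp/df_dump.csv', index=False, header=False)

cur = conn.cursor()
# stage the data in a new table with the same structure as the target
cur.execute("CREATE TABLE my_table_new LIKE my_table")
cur.execute("""
    LOAD DATA LOCAL INFILE '/tmp/df_dump.csv'
    INTO TABLE my_table_new
    FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
""")
# swap the new table in and drop the old one
cur.execute("RENAME TABLE my_table TO my_table_old, my_table_new TO my_table")
cur.execute("DROP TABLE my_table_old")
conn.commit()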
I have a dataframe with some columns, and a Snowflake table with some columns. Some columns are the same and some are different between them. At the moment I extract the Snowflake table into my Python code, concatenate both, and replace the table, but the table holds a huge amount of data, so this is very cumbersome. Is it possible to append the dataframe directly to the Snowflake table when some columns are different and some are the same? If yes, please tell me how I can do this. No solution is working for me. How can I do it effectively, in less time?
Yes, it's possible to append the data to an existing table in Snowflake.
Set up your connection.
You can use SQLAlchemy to create an engine, and then push the df to Snowflake using:
from snowflake.connector.pandas_tools import pd_writer
df.to_sql('<snowflaketablename>', engine, index=False, method=pd_writer, if_exists='append')
Remember to pass if_exists='append' so the dataframe is appended to the existing table.
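For context, a minimal end-to-end sketch, assuming the snowflake-sqlalchemy package is installed and that the account, credentials, and warehouse below are placeholders:

from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import pd_writer

# placeholder Snowflake connection URL
engine = create_engine(
    'snowflake://<user>:<password>@<account>/<database>/<schema>?warehouse=<warehouse>'
)

# append the dataframe's rows to the existing table
df.to_sql('<snowflaketablename>', engine, index=False, method=pd_writer, if_exists='append')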
I am trying to figure out how to iterate through the rows in a .csv file and enter that data into a table in SQLite, but only if the data in that row meets certain criteria.
I am trying to build a database of my personal spending. I have used Python to categorise my spending data, and I now want to enter that data into a database with each category as a different table. This means I need to sort the data and insert it into different tables based on the spend category.
I looked for quite a long time. Can anyone help?
You need to read the CSV file using pandas and store it in a pandas DataFrame. Then (if you have not already created a database) use the SQLAlchemy library (here is the documentation) to create an engine: engine = sqlalchemy.create_engine('sqlite:///file.db').
Afterwards, write the DataFrame to the SQL database using the pandas to_sql function (documentation): df.to_sql('file_name', engine, index=False). I used index=False to avoid creating a column for the DataFrame's index.
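Since each spend category should end up in its own table, here is a small sketch of that step, assuming a placeholder column name 'category' and a placeholder file spending.csv:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('sqlite:///file.db')
df = pd.read_csv('spending.csv')

# write each category's rows to a table named after the category
for category, rows in df.groupby('category'):
    rows.to_sql(category, engine, index=False, if_exists='append')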
I have large text files that need parsing and cleaning. I am using a Jupyter notebook to do it. I have a SQL Server database that I want to insert the data into after it is prepared. I used pyodbc to insert the final dataframe into SQL. df is my dataframe and I put my SQL INSERT query in the variable sqlInsertQuery:
df_records = df.values.tolist()
cursor.executemany(sqlInsertQuery,df_records)
cursor.commit()
For a few rows it works fine, but when I want to insert the whole dataframe at once with the code above, executemany() runs for hours and keeps running until I stop the kernel.
I exported one file/dataframe to an Excel file and it is about 83 MB, as my dataframe contains very large strings and lists.
Someone recommended using fast_executemany instead, but it seems to be faulty.
Others recommended packages other than pyodbc.
Some said it is better not to use Jupyter and to use PyCharm or IPython instead.
I could not work out the best/fastest way to insert my data into my database in this case. I am not a developer and I would really appreciate any help with this.
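For reference, the fast_executemany route mentioned above would look roughly like this (a sketch, assuming an existing pyodbc connection conn and that sqlInsertQuery uses ? parameter placeholders; note that fast_executemany can be memory-hungry with very long string values):

cursor = conn.cursor()
cursor.fast_executemany = True  # send the parameters in bulk instead of row by row
cursor.executemany(sqlInsertQuery, df.values.tolist())
conn.commit()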
So I have this huge DB schema from vehicle board cards. The data is actually stored in multiple Excel files. My job was to create a database schema to dump all this data into MySQL, and now I need to create the process to insert the data into the DB.
This is an example of how the Excel tables are organized.
The thing is that all these Excel files are not well tagged.
My question is: what do I need to do to create a script that dumps all this data from Excel into the DB?
I'm also using IDs, foreign keys, primary keys, joins, etc.
I've thought about this so far:
1. Normalize the structure of the tables in Excel so that the data can be inserted with SQL.
2. Create a script in Python to insert the data for each table.
Can you help me with where I should start and how? What topics should I google?
With pandas you can easily read the files (both CSV and XLSX) and dump the data into any database:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@host/dbname')  # placeholder connection string
df = pd.read_excel('file.xlsx')
df.to_sql('table_name', engine, index=False, if_exists='append')
If you have performance issues dumping to MySQL, you can find another way of doing the dump here
python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?
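If the insert itself is the slow part on MySQL, a common first step (sketched here, reusing the placeholder engine and table name from above) is to batch the rows into multi-row INSERT statements:

# send multi-row INSERTs in batches of 1000 rows instead of one INSERT per row
df.to_sql('table_name', engine, index=False, if_exists='append',
          method='multi', chunksize=1000)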
I have a 22 million row .csv file (~850 MB) that I am trying to load into a Postgres database on Amazon RDS. It fails every time (I get a timeout error), even when I split the file into smaller parts (100,000 rows each) and even when I use chunksize.
All I am doing at the moment is loading the .csv as a dataframe and then writing it to the database using df.to_sql(table_name, engine, index=False, if_exists='append', chunksize=1000)
I am using create_engine from sqlalchemy to create the connection: engine = create_engine('postgresql:database_info')
I have tested writing smaller amounts of data with psycopg2 without a problem, but it takes around 50 seconds to write 1000 rows. Obviously for 22m rows that won't work.
Is there anything else I can try?
The pandas DataFrame.to_sql() method is not especially designed for large inserts, since it does not utilize the PostgreSQL COPY command.
Regular SQL queries can time out; it's not the fault of pandas. The timeout is controlled by the database server, but it can be modified per connection; see this page and search for 'statement_timeout'.
What I would recommend is to consider using Redshift, which is optimized for data warehousing and can read huge data dumps directly from S3 buckets using the Redshift COPY command.
If you are in no position to use Redshift, I would still recommend finding a way to do this operation using the PostgreSQL COPY command, since it was invented to circumvent exactly the problem you are experiencing.
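For the timeout itself, here is a small sketch of raising statement_timeout per connection with SQLAlchemy and psycopg2 (the connection string and the 10-minute value are placeholders):

from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://user:password@host/dbname',
    connect_args={'options': '-c statement_timeout=600000'},  # milliseconds
)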
You can write the dataframe to an in-memory string buffer and then write this to the database using the copy_from method in psycopg2, which I believe implements the PostgreSQL COPY command that @firelynx mentions.
import io

# convert the dataframe rows to a list of dicts
records = output.T.to_dict().values()

# build an in-memory, tab-separated buffer that COPY can read
dboutput = io.StringIO()
dboutput.write('\n'.join(
    '\t'.join([row['1_str'], row['2_str'], str(row['3_float'])])
    for row in records
))
dboutput.seek(0)

# stream the buffer into the table with the COPY command
cursor.copy_from(dboutput, 'TABLE_NAME')
connection.commit()
where output is the pandas dataframe with columns ['1_str', '2_str', '3_float'] that you want to write to the database.