So I have this huge DB schema from vehicle board cards. The data is actually stored in multiple Excel files, and my job was to create a database schema to dump all of it into MySQL. Now I need to create the process that inserts the data into the DB.
This is an example of how the Excel tables are organized:
The thing is that none of these Excel files are well tagged.
My question is: what do I need to do to create a script that dumps all this data from the Excel files into the DB?
I'm also using IDs, foreign keys, primary keys, joins, etc.
Here is what I've thought of so far:

1. Normalize the structure of the tables in Excel so that the data can be inserted with SQL.
2. Create a Python script to insert the data from each table.
Can you help me figure out where to start and how? What topics should I google?
With pandas you can easily read Excel files (and CSVs, via read_csv) and dump the data into any database supported by SQLAlchemy:
```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine('mysql+pymysql://user:password@localhost/dbname')
df = pd.read_excel('file.xlsx')
df.to_sql('table_name', engine, if_exists='append', index=False)
```
If you run into performance issues dumping to MySQL, you can find a faster approach here:
python pandas to_sql with sqlalchemy : how to speed up exporting to MS SQL?
Related
I have an Excel spreadsheet that contains several columns (e.g. table, column name, join conditions) that will be used as inputs to populate DDL scripts for creating other large tables.
I need to parse this spreadsheet to generate the SQL queries, and I'm trying to work out the best method to do so. I was thinking of writing a parser in Python using pandas, but I was wondering whether anybody has ideas on the best approach?
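For what it's worth, the pandas idea could be sketched roughly like this. The column names (`table`, `column_name`, `data_type`) and the sample metadata are hypothetical stand-ins for whatever the real spreadsheet contains:

```python
import pandas as pd

def build_ddl(meta: pd.DataFrame) -> list[str]:
    """Group the metadata rows by table and emit one CREATE TABLE per group."""
    statements = []
    for table, cols in meta.groupby("table"):
        col_defs = ", ".join(
            f"{row.column_name} {row.data_type}" for row in cols.itertuples()
        )
        statements.append(f"CREATE TABLE {table} ({col_defs});")
    return statements

# Stand-in for pd.read_excel("metadata.xlsx"); adjust to the real columns.
meta = pd.DataFrame({
    "table": ["users", "users", "orders"],
    "column_name": ["id", "name", "id"],
    "data_type": ["INT", "VARCHAR(50)", "INT"],
})
print(build_ddl(meta))
```

From there you would extend the per-row formatting to handle the join conditions and whatever other columns the spreadsheet carries.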
I am trying to figure out how to iterate through the rows of a .CSV file and insert that data into a SQLite table, but only if the data in the row meets certain criteria.
I am trying to build a database of my personal spending. I have used Python to categorise my spending data, and I now want to enter that data into a database with each category as a different table. This means I need to sort the data and insert it into different tables based on the spend category.
I've looked for quite a long time. Can anyone help?
You need to read the CSV file with pandas and store it in a DataFrame. Then (if you have not already created a database) use the SQLAlchemy library (here is the documentation) to create an engine: engine = sqlalchemy.create_engine('sqlite:///file.db').
Afterwards, convert the DataFrame to a SQL table with pandas' to_sql function (documentation): df.to_sql('table_name', engine, index=False). I used index=False to avoid creating a column for the DataFrame's index.
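Putting those steps together with the category-per-table requirement from the question, a minimal sketch might look like this. The `category`/`amount` columns, the `spending.db` file name, and the `amount > 0` filter are all assumptions for illustration:

```python
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("sqlite:///spending.db")

# Stand-in for pd.read_csv("spending.csv"); column names are assumptions.
df = pd.DataFrame({
    "category": ["food", "rent", "food"],
    "amount": [12.5, 800.0, 3.2],
})

# Keep only the rows that meet your criteria (here: positive amounts).
df = df[df["amount"] > 0]

# One table per category, as the question asks.
for category, rows in df.groupby("category"):
    rows.to_sql(category, engine, if_exists="replace", index=False)
```

Each distinct category value becomes its own table, and the filtering happens once on the DataFrame instead of row by row.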
I would like to store CSV files in SQL Server. I've created a table with a column "myDoc" of type varbinary(max). I generate the CSVs on a server using Python/Django. I would like to insert the actual CSV (not its path) as a BLOB so that I can later retrieve the actual CSV file.
How do I do this? I haven't been able to make much headway with this documentation, as it mostly refers to .jpg files:
https://msdn.microsoft.com/en-us/library/a1904w6t(VS.80).aspx
Edit:
I wanted to add that I'm trying to avoid FILESTREAM. The CSVs are too small (~5 KB), and I don't need full-text search over them.
I'm not sure why you want varbinary over varchar, but it will work either way:
```sql
INSERT INTO YourTable (myDoc)
SELECT doc = BulkColumn
FROM OPENROWSET(BULK 'C:\Working\SomeXMLFile.csv', SINGLE_BLOB) x
```
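If you would rather do the insert from the Python/Django side, the key is to pass the file's raw bytes through a parameterized query instead of OPENROWSET. The sketch below uses sqlite3 only so that it is runnable as-is; against SQL Server you would issue the same parameterized INSERT through a driver such as pyodbc, targeting the varbinary(max) column (the table and column names here are made up):

```python
import sqlite3

csv_bytes = b"date,amount\n2020-01-01,42.50\n"  # stand-in for the generated CSV

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, myDoc BLOB)")
# Parameterized insert: the driver handles the binary payload safely.
conn.execute("INSERT INTO docs (myDoc) VALUES (?)", (csv_bytes,))

# Retrieve the stored file unchanged.
stored = conn.execute("SELECT myDoc FROM docs").fetchone()[0]
assert stored == csv_bytes
```

The point is that the CSV round-trips byte-for-byte, which is exactly what you need to serve the file back later.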
I am creating a new application that uses ZODB, and I need to import legacy data, mainly from a Postgres database but also from some CSV files. A limited amount of manipulation is needed (SQL joins to merge linked tables and create properties, renaming some properties, dealing with empty columns, etc.).
With a subset of the Postgres data, I did a dump to CSV files of all the relevant tables, read these into pandas DataFrames, and did the manipulation. This works, but there are errors, partly caused by transferring the data through CSV first.
I now want to load all of the data (and get rid of the errors). I am wondering whether it makes sense to connect directly to the database and use read_sql, or to carry on using the CSV files.
The largest table (CSV file) is only 8 MB, so I shouldn't have memory issues, I hope. Most of the errors are to do with encoding and/or the choice of separator (the data contains |, ;, : and ').
Any advice? I have also read about something called Blaze and wonder if I should actually be using that.
If your CSV files aren't very large (as you say), then I'd try loading everything into Postgres with odo, then using Blaze to perform the operations, then finally dumping to a format that ZODB can understand. I wouldn't worry about the performance of operations like joins inside the database versus in memory at the scale you're talking about.
Here's some example code:
```python
from blaze import odo, Data, join

for csv, tablename in zip(csvs, tablenames):
    odo(csv, 'postgresql://localhost/db::%s' % tablename)

db = Data('postgresql://localhost/db')

# see the link above for more operations
expr = join(db.table1, db.table2, 'column_to_join_on')

# execute `expr` and dump the result to a CSV file for loading into ZODB
odo(expr, 'joined.csv')
```
I am using Python pandas to load data from a MySQL database, change it, and then update another table. There are 100,000+ rows, so the UPDATE queries take some time.
Is there a more efficient way to update the data in the database than using df.iterrows() and running an UPDATE query for each row?
The problem here is not pandas; it is the UPDATE operations. Each row fires its own UPDATE query, which means a lot of overhead for the database connector to handle.
You are better off using the df.to_csv('filename.csv') method to dump your DataFrame to a CSV file, then reading that file into MySQL with LOAD DATA INFILE.
Load it into a new table, then DROP the old one and RENAME the new one to the old one's name.
Furthermore, I suggest you do the same when loading data into pandas: use the MySQL SELECT ... INTO OUTFILE command, then load the resulting file into pandas with pd.read_csv().
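A rough sketch of the bulk-load pattern (table and file names are made up; the LOAD DATA statement is built but not executed here, since it needs a live MySQL connection):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Dump the whole DataFrame to CSV in one go...
df.to_csv("dump.csv", index=False, header=False)

# ...then hand the file to MySQL in a single statement instead of running
# one UPDATE per row. `new_table` is a placeholder; execute this through
# your own connection/cursor.
load_sql = (
    "LOAD DATA INFILE 'dump.csv' "
    "INTO TABLE new_table "
    "FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'"
)
```

One CSV write plus one LOAD DATA call replaces 100,000+ round trips to the server.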