Is there any way to turn my Excel workbook into a MySQL database? For example, if my Excel workbook is named copybook.xls, then the MySQL database would be named copybook. I have been unable to do it. Your help would be really appreciated.
Here I give an outline and explanation of the process, including links to the relevant documentation. Because the original question lacks some details, the approach needs to be tailored to your particular needs.
The solution
There are two steps in the process:
1) Import the Excel workbook as Pandas data frame
Here we use pandas.read_excel, the standard way to get data out of an Excel file. If there is a specific sheet we want, it can be selected using sheet_name. If the first column of the sheet holds row labels, the index_col parameter tells pandas to use that column as the index of the data frame.
import pandas as pd
# Let's select the Characters sheet and use its first column as row labels
df = pd.read_excel("copybook.xls", sheet_name = "Characters", index_col = 0)
df now contains the following imaginary data frame, which represents the data in the original Excel file:
   first   last
0   John   Snow
1  Sansa  Stark
2   Bran  Stark
2) Write records stored in a DataFrame to a SQL database
Pandas has a neat method, pandas.DataFrame.to_sql, for interacting with SQL databases through the SQLAlchemy library. The original question mentioned MySQL, so here we assume we already have a running instance of MySQL. To connect to the database, we use create_engine. Lastly, we write the records stored in the data frame to a SQL table called characters.
from sqlalchemy import create_engine
engine = create_engine('mysql://USERNAME:PASSWORD@localhost/copybook')
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
We can check that the data has been stored (engine.execute is available in SQLAlchemy versions before 2.0):
engine.execute("SELECT * FROM characters").fetchall()
Out:
[(0, 'John', 'Snow'), (1, 'Sansa', 'Stark'), (2, 'Bran', 'Stark')]
or better, use pandas.read_sql_table to read the data back directly as a data frame
pd.read_sql_table("characters", engine)
Out:
   index  first   last
0      0   John   Snow
1      1  Sansa  Stark
2      2   Bran  Stark
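The extra index column in the output above appears because to_sql writes the data frame's index by default. The following is a minimal sketch of how index=False changes that. It assumes an in-memory SQLite connection as a stand-in for the MySQL engine, and the table names with_index and without_index are made up for illustration.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL engine (an assumption for easy testing)
con = sqlite3.connect(":memory:")
df = pd.DataFrame({"first": ["John", "Sansa", "Bran"],
                   "last": ["Snow", "Stark", "Stark"]})

# Default behaviour: the data frame index is written as an extra "index" column
df.to_sql("with_index", con=con)

# index=False: only the data columns are written
df.to_sql("without_index", con=con, index=False)

# Read both tables back and compare their column names
cols_with = list(pd.read_sql_query("SELECT * FROM with_index", con).columns)
cols_without = list(pd.read_sql_query("SELECT * FROM without_index", con).columns)
print(cols_with)
print(cols_without)
```

Whether to keep the index depends on whether the row labels carry meaning; for a plain 0, 1, 2 range they usually do not.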
Learn more
No MySQL instance available?
You can test the approach by using an in-memory SQLite database. Just copy-paste the following code to play around:
import pandas as pd
from sqlalchemy import create_engine
# Create a new SQLite instance in memory
engine = create_engine("sqlite://")
# Create a dummy data frame for testing or read it from Excel file using pandas.read_excel
df = pd.DataFrame({'first' : ['John', 'Sansa', 'Bran'], 'last' : ['Snow', 'Stark', 'Stark']})
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
# Read SQL database table into a DataFrame
pd.read_sql_table('characters', engine)
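Since the question is about a whole workbook rather than one sheet, the same two steps could be repeated for every sheet. The sketch below is one possible way to do that, not the only one. It assumes each sheet becomes one table named after the sheet; write_sheets is a hypothetical helper, the Houses sheet and its contents are invented test data, and an in-memory SQLite connection stands in for MySQL. With a real file, the dictionary would come from pd.read_excel("copybook.xls", sheet_name=None), which returns one data frame per sheet.

```python
import sqlite3
import pandas as pd

def write_sheets(sheets, con):
    """Write each data frame in `sheets` to a table named after its sheet (hypothetical helper)."""
    for sheet_name, frame in sheets.items():
        # Lower-case the sheet name to get the table name; replace the table if it exists
        frame.to_sql(sheet_name.lower(), con=con, if_exists="replace", index=False)

# Stand-in for: sheets = pd.read_excel("copybook.xls", sheet_name=None)
sheets = {
    "Characters": pd.DataFrame({"first": ["John", "Sansa", "Bran"],
                                "last": ["Snow", "Stark", "Stark"]}),
    "Houses": pd.DataFrame({"name": ["Stark"], "seat": ["Winterfell"]}),  # made-up sheet
}

con = sqlite3.connect(":memory:")  # stand-in for the MySQL engine
write_sheets(sheets, con)

# List the tables that were created
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
character_count = con.execute("SELECT COUNT(*) FROM characters").fetchone()[0]
print(tables, character_count)
```

The if_exists="replace" choice makes the script safe to rerun; use "append" instead if the tables should accumulate rows.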
Related
Context: I have a dataframe that I queried using SQL. From this query, I saved the result to a dataframe using the pandas-on-Spark API. Now, after some transformations, I'd like to save this new dataframe to a new table in a given database.
Example:
spark = SparkSession.builder.appName('transformation').getOrCreate()
df_final = spark.sql("SELECT * FROM table")
df_final = ps.DataFrame(df_final)
## Write Frame out as Table
spark_df_final = spark.createDataFrame(df_final)
spark_df_final.write.mode("overwrite").saveAsTable("new_database.new_table")
but this doesn't work. How can I save a pandas-on-Spark dataframe directly to a new table in a database (this database doesn't exist yet)?
Thanks
You can use the following procedure. I have the following demo table.
You can convert it to a pandas-on-Spark dataframe using the following code:
df_final = spark.sql("SELECT * FROM demo")
pdf = df_final.to_pandas_on_spark()
#print(type(pdf))
#<class 'pyspark.pandas.frame.DataFrame'>
Now, after performing your required operations on this pandas-on-Spark dataframe, you can convert it back to a Spark dataframe using the following code:
spark_df = pdf.to_spark()
print(type(spark_df))
display(spark_df)
Now, to write this dataframe to a table in a new database, you have to create the database first and then write the dataframe to the table.
spark.sql("create database newdb")
spark_df.write.mode("overwrite").saveAsTable("newdb.new_table")
You can see that the table is written to the new database. The following is a reference image of the same:
Hope everyone is well and staying safe.
I'm uploading an Excel file to SQL using Python. I have three fields: CarNumber nvarchar(50), Status nvarchar(50), and DisassemblyStart date.
I'm having an issue importing the DisassemblyStart field dates. The connection and transaction using Python are successful.
However, I get zeroes all over, even though the Excel file is populated with dates. I've tried switching the column to nvarchar(50), date, and datetime to see if I can at least get a string, and nothing. I saved the Excel file as CSV and TXT and tried uploading it, and still got zeroes. I added 0.001 to every date in Excel (to add an artificial time) in case that would make it click, but nothing happened. Still zeroes. I'm sure there's a major oversight from being too much in the weeds. I need help.
The Excel file has the three field columns.
This is the Python code:
import pandas as pd
import pyodbc
# Import CSV
data = pd.read_csv (r'PATH_TO_CSV\\\XXX.csv')
df = pd.DataFrame(data,columns= ['CarNumber','Status','DisassemblyStart'])
df = df.fillna(value=0)
# Connect to SQL Server
conn = pyodbc.connect("Driver={SQL Server};Server=SERVERNAME,PORT ;Database=DATABASENAME;Uid=USER;Pwd=PW;")
cursor = conn.cursor()
# Create Table
cursor.execute ('DROP TABLE OPS.dbo.TABLE')
cursor.execute ('CREATE TABLE OPS.dbo.TABLE (CarNumber nvarchar(50),Status nvarchar(50), DisassemblyStart date)')
# Insert READ_CSV INTO TABLE
for row in df.itertuples():
cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber,Status,DisassemblyStart) VALUES (?,?,Convert(datetime,?,23))',row.CarNumber,row.Status,row.DisassemblyStart)
conn.commit()
conn.close()
Help will be much appreciated.
Thank you and be safe,
David
The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as do the destination tables. It'd be nice to not have to create the format file bcp.exe requires (xml for +-400 fields for each of 16 separate tables).
Following the Pythonic approach, try to insert the data and ask SQL Server to throw an exception if there is a type mismatch, or other.
If you don't want to use the bcp command to import the CSV file, you can use the Python pandas library.
Here's an example in which I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
Csv file:
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc
# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv (r'C:\Users\leony\Desktop\test9.csv',header=None,names = ['id', 'name', 'age'])
# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9',engine,if_exists='append',index=False)
# add the 'dtype' parameter to specify datatypes if needed, e.g. dtype={'column1': VARCHAR(255), 'column2': DateTime}
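The dtype parameter mentioned in the last comment can be sketched as follows. This is a minimal illustration, assuming an in-memory SQLite connection as a stand-in for Azure SQL and an inline CSV string instead of the file on disk; the sample rows are invented. With a plain sqlite3 connection pandas expects the types as strings; with a SQLAlchemy engine, as in the code above, the values would be SQLAlchemy types such as sqlalchemy.types.VARCHAR(255).

```python
import io
import sqlite3
import pandas as pd

# Inline stand-in for the headerless test9.csv (rows are made up)
csv_text = "1,Ann,34\n2,Bob,41\n"
df = pd.read_csv(io.StringIO(csv_text), header=None, names=["id", "name", "age"])

con = sqlite3.connect(":memory:")  # stand-in for the Azure SQL engine

# Declare the column types explicitly instead of letting pandas infer them
df.to_sql("test9", con, if_exists="append", index=False,
          dtype={"id": "INTEGER", "name": "VARCHAR(255)", "age": "INTEGER"})

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk) per column
declared = {row[1]: row[2] for row in con.execute("PRAGMA table_info(test9)")}
row_count = con.execute("SELECT COUNT(*) FROM test9").fetchone()[0]
print(declared, row_count)
```

Declaring types up front matters mostly when the table is created by to_sql; if the table already exists, its own column definitions apply.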
Notice:
Get the connection string from the Azure Portal.
The UID format is [username]@[servername].
Run this script and it works:
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.
I have 3000 CSV files stored on my hard drive, each containing thousands of rows and 10 columns. Rows correspond to dates, and the number of rows as well as the exact dates is different across spreadsheets. The columns for all the spreadsheets are the same in number (10) and label. For each date from the earliest date across all spreadsheets to the latest date across all spreadsheets, I need to (i) access the columns in each spreadsheet for which data for that date exists, (ii) run some calculations, and (iii) store the results (a set of 3 or 4 scalar values) for that date. To clarify, results should be a variable in my workspace that stores the results for each date for all CSVs.
Is there a way to load this data using Python that is both time and memory efficient? I tried creating a Pandas data frame for each CSV, but loading all the data into RAM takes almost ten minutes and almost completely fills up my RAM. Is it possible to check if the date exists in a given CSV, and if so, load the columns corresponding to that CSV into a single data frame? This way, I could load just the rows that I need from each CSV to do my calculations.
Simple Solution.
Go and download DB Browser for SQLite.
Open it, and create New Database.
After That, go to File and Import Table from CSV. ( Do this for All of your CSV Tables ) Alternatively, you can use Python script and sqlite3 library to be fast and automated for creating table and inserting values from your CSV sheets.
When you are done with importing all the tables, play around with this function based on your details.
import sqlite3
import pandas as pd

connection_str = 'database.db'  # Path of the SQLite database file
data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def create_database():  # Create the database and the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS my_CSV_data (id INTEGER PRIMARY KEY, name text, address text, mobile text, phone text, balance float, max_balance INTEGER)")
    con.commit()
    con.close()

def insert_into_company():  # Insert the CSV rows into the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    # Iterate over rows; looping over the data frame itself would yield column names.
    # Assumes the CSV columns match the six non-id table columns, in order.
    for i in data.itertuples(index=False):
        cur.execute("INSERT INTO my_CSV_data VALUES(Null,?,?,?,?,?,?)", (i[0], i[1], i[2], i[3], i[4], i[5]))
    con.commit()
    con.close()

def select_company():  # Fetch all rows from the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")
    rows = cur.fetchall()
    con.close()
    return rows

create_database()
insert_into_company()
for j in select_company():
    print(j)
Do this once, and you can use it again and again. It will let you access the data in less than a second. Ask me if you need any other help. I'll be happy to guide you through.
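For the original problem of many CSV files keyed by date, the scripted route mentioned above might look like the following sketch. It is an illustration under stated assumptions rather than a tuned solution: the table name readings, the two-column layout (date, value), the file names, and the sample rows are all hypothetical, and inline strings stand in for the files on disk. The idea is to load every file into one table tagged with its source and to index the date column, so that a single date can be fetched without reading every file.

```python
import csv
import io
import sqlite3

def load_csv(con, source, handle):
    """Insert the rows of one CSV (columns: date, value), tagged with its source name."""
    reader = csv.DictReader(handle)
    rows = [(source, r["date"], float(r["value"])) for r in reader]
    con.executemany(
        "INSERT INTO readings (source, date, value) VALUES (?, ?, ?)", rows)

def rows_for_date(con, date):
    """Return (source, value) pairs for every file that has data on `date`."""
    return con.execute(
        "SELECT source, value FROM readings WHERE date = ? ORDER BY source",
        (date,)).fetchall()

con = sqlite3.connect(":memory:")  # use a file path to keep the database on disk
con.execute("CREATE TABLE readings (source TEXT, date TEXT, value REAL)")
# The index lets SQLite find all rows for a date without scanning the table
con.execute("CREATE INDEX idx_readings_date ON readings (date)")

# Inline stand-ins for CSV files on disk (made-up data)
files = {
    "a.csv": "date,value\n2020-01-01,1.0\n2020-01-02,2.0\n",
    "b.csv": "date,value\n2020-01-02,3.0\n2020-01-03,4.0\n",
}
for name, text in files.items():
    load_csv(con, name, io.StringIO(text))
con.commit()

rows_on_second = rows_for_date(con, "2020-01-02")
rows_on_first = rows_for_date(con, "2020-01-01")
print(rows_on_second)
```

Storing dates as ISO 8601 text (YYYY-MM-DD) keeps them sortable, so the earliest and latest dates can be found with MIN(date) and MAX(date).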
I need to fetch data from a MySQL database into a Pandas dataframe using the odo library in Python. Odo's documentation only provides information on passing a table name to fetch the data, but how do I pass a SQL query string that fetches the required data from the database?
The following code works:
from odo import odo
import pandas as pd
data = odo('mysql+pymysql://username:{0}@localhost/dbname::{1}'.format('password', 'table_name'), pd.DataFrame)
But how do I pass a SQL string instead of a table name? I need to join multiple other tables to pull the required data.
Passing a string directly to odo is not supported by the module. There are three methods to move the data using the tools listed.
First, create a SQL query as a string and read it using:
data = pandas.read_sql_query(sql, con, index_col=None,
                             coerce_float=True, params=None,
                             parse_dates=None, chunksize=None)
ref http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_sql_query.html#pandas.read_sql_query
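As a concrete illustration of the first method, here is a minimal sketch that passes a JOIN query to pandas.read_sql_query. It assumes an in-memory SQLite connection as a stand-in for MySQL; the customers and orders tables and their rows are invented for the example. With MySQL, con would be a SQLAlchemy engine or connection, and the parameter placeholder style depends on the driver.

```python
import sqlite3
import pandas as pd

# In-memory SQLite stands in for the MySQL database (tables and rows are made up)
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 2, 2.0);
""")

# Any SELECT works here, including joins and aggregates; "?" is sqlite3's placeholder
sql = """
SELECT c.name, SUM(o.amount) AS total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
WHERE o.amount >= ?
GROUP BY c.name
ORDER BY c.name
"""
data = pd.read_sql_query(sql, con, params=(5.0,))
print(data)
```

Passing values through params rather than formatting them into the string lets the driver handle quoting.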
Second, the odo method requires running the query into a dictionary and then using that dictionary in the odo(source, destination) structure.
cursor.execute(sql)
results = db.engine.execute(sql)
data = odo(results, pd.DataFrame)
ref pg 30 of https://media.readthedocs.org/pdf/odo/latest/odo.pdf
ref How to execute raw SQL in SQLAlchemy-flask app
ref cursor.fetchall() vs list(cursor) in Python
Last, fetch the rows one at a time and build the data frame from the collected rows. Collecting the rows in a list and constructing the frame once is faster than appending to a data frame on every iteration.
result_proxy = db.engine.execute(sql)  # run the query once
rows = []
result = result_proxy.fetchone()
while result is not None:
    rows.append(tuple(result))
    result = result_proxy.fetchone()  # advance to the next row of the same result
data = pd.DataFrame(rows, columns=list('AB'))  # 'A' and 'B' are placeholder column names
ref https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
ref Creating an empty Pandas DataFrame, then filling it?