I need to fetch data from a MySQL database into a Pandas DataFrame using the odo library in Python. Odo's documentation only describes passing a table name to fetch data, but how do I pass a SQL query string that fetches the required data from the database?
The following code works:
from odo import odo
import pandas as pd

data = odo('mysql+pymysql://username:{0}@localhost/dbname::{1}'.format('password', 'table_name'), pd.DataFrame)
But how do I pass a SQL string instead of a table name? I need to join multiple other tables to pull the required data.
Passing a raw SQL string directly to odo is not supported by the module. There are three ways to move the data with the tools listed below.
First, build the SQL query as a string and read it with pandas.read_sql_query:
data = pandas.read_sql_query(sql, con, index_col=None,
                             coerce_float=True, params=None,
                             parse_dates=None, chunksize=None)
ref http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.read_sql_query.html#pandas.read_sql_query
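For example, applied to the JOIN described in the question, this could look like the following sketch (the connection details mirror the question; the table and column names are placeholders):
import pandas as pd
from sqlalchemy import create_engine

# connection details mirror the question; table and column names below are placeholders
engine = create_engine('mysql+pymysql://username:password@localhost/dbname')

sql = """
    SELECT a.id, a.name, b.value
    FROM table_a AS a
    JOIN table_b AS b ON b.a_id = a.id
"""
data = pd.read_sql_query(sql, engine)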
Second, to use odo itself, run the query through a SQLAlchemy engine to get a result set, then pass that result set into the odo(source, destination) call (a fuller sketch follows the references below):
results = db.engine.execute(sql)   # or: cursor.execute(sql); results = cursor.fetchall()
data = odo(results, pd.DataFrame)
ref pg 30 of https://media.readthedocs.org/pdf/odo/latest/odo.pdf
ref How to execute raw SQL in SQLAlchemy-flask app
ref cursor.fetchall() vs list(cursor) in Python
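Putting the pieces of this second approach together, a sketch might look like the following (the connection URL mirrors the question; passing the raw result set to odo follows the odo docs referenced above, so treat that part as unverified):
import pandas as pd
from odo import odo
from sqlalchemy import create_engine

# engine URL mirrors the question; SQLAlchemy 1.x-style execute, as in the answer above
engine = create_engine('mysql+pymysql://username:password@localhost/dbname')
sql = "SELECT a.id, b.value FROM table_a AS a JOIN table_b AS b ON b.a_id = a.id"  # placeholder query
results = engine.execute(sql)
data = odo(results, pd.DataFrame)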
Last, you can also build the data frame incrementally, appending a row for each result fetched:
results = db.engine.execute(sql)
data = pd.DataFrame(columns=list('AB'))   # empty frame; assumes the query returns two columns, A and B
result = results.fetchone()
while result is not None:
    data = data.append(pd.DataFrame([result], columns=list('AB')), ignore_index=True)
    result = results.fetchone()
data = data.fillna(0)   # with 0s rather than NaNs
ref https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
ref Creating an empty Pandas DataFrame, then filling it?
Context: I have a dataframe that I queried using SQL. From this query, I saved to a dataframe using the pandas-on-Spark API. Now, after some transformations, I'd like to save this new dataframe to a new table in a given database.
Example:
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('transformation').getOrCreate()
df_final = spark.sql("SELECT * FROM table")
df_final = ps.DataFrame(df_final)
## Write Frame out as Table
spark_df_final = spark.createDataFrame(df_final)
spark_df_final.write.mode("overwrite").saveAsTable("new_database.new_table")
but this doesn't work. How can I save a pandas-on-Spark API dataframe directly to a new table in a database (a database that doesn't exist yet)?
Thanks
You can use the following procedure. Suppose you have a demo table called demo in your database.
You can convert it to a pandas-on-Spark API dataframe using the following code:
df_final = spark.sql("SELECT * FROM demo")
pdf = df_final.to_pandas_on_spark()
#print(type(pdf))
#<class 'pyspark.pandas.frame.DataFrame'>
Now, after performing your required operations on this pandas-on-Spark dataframe, you can convert it back to a Spark dataframe using the following code:
spark_df = pdf.to_spark()
print(type(spark_df))
display(spark_df)
Now, to write this dataframe to a table in a new database, you first have to create the database and then write the dataframe to the table.
spark.sql("create database newdb")
spark_df.write.mode("overwrite").saveAsTable("newdb.new_table")
You can see that the table is written to the new database.
I am writing a Python Cloud Function to load CSV files into BigQuery after adding a new column, creation_date. So far I have had no success. Is there any way to achieve this using a Cloud Function or pandas?
Any help will be appreciated.
I have already gone through other links where a CSV file is generated and kept in GCS after adding the date column. My requirement is not to create any extra file. Do you think pandas would be a good option? Please suggest.
Thanks
Ritu
Yes, it's possible to accomplish that with a Cloud Function.
What you could do is download the CSV file to the Cloud Function instance (the /tmp directory), load it into a pandas dataframe and, from there, manipulate the data according to your needs (add/remove columns/rows, etc.).
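A minimal sketch of those steps, assuming the function is triggered by a GCS finalize event (the bucket and file names come from the event payload, so nothing is hard-coded here):
import pandas as pd
from google.cloud import storage

def handle_csv(event, context):
    # download the triggering CSV to the function's writable /tmp directory
    client = storage.Client()
    blob = client.bucket(event['bucket']).blob(event['name'])
    local_path = '/tmp/input.csv'
    blob.download_to_filename(local_path)

    # load into pandas and add the new column before loading to BigQuery
    df = pd.read_csv(local_path)
    df['creation_date'] = pd.Timestamp.now()
    return df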
Once the data is ready to be loaded into BQ, you can use the method:
load_job = client.load_table_from_dataframe(
dataframe, table_id, job_config=job_config
)
Update:
I see pandas now supports gs:// paths, so you can load directly from GCS.
df = pd.read_csv('gs://bucket/your_path.csv')
Ref: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
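Combining that direct gs:// read with the load_table_from_dataframe call above, the whole flow could look roughly like this (a sketch; the table id is a placeholder and reading gs:// paths requires the gcsfs package to be installed):
import pandas as pd
from google.cloud import bigquery

def csv_to_bq(event, context):
    # read the triggering CSV straight from GCS and add the new column
    df = pd.read_csv(f"gs://{event['bucket']}/{event['name']}")
    df['creation_date'] = pd.Timestamp.now()

    # load the dataframe into BigQuery (table id is a placeholder)
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(write_disposition="WRITE_APPEND")
    load_job = client.load_table_from_dataframe(df, "project.dataset.table", job_config=job_config)
    load_job.result()  # wait for the load job to complete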
You must create the dataset and table in BigQuery, along with the types of each column. Then, define a "time created" value for your dataframe as a creation_date variable:
import pandas as pd
creation_date = pd.Timestamp.now() # for each entry in the table
Then, save your dataframe (here called your_pandas_dataframe) into BigQuery, using the same column names as in the pandas dataframe:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("DATE", bigquery.enums.SqlTypeNames.DATE),  # create each column in BigQuery along with its type
        bigquery.SchemaField("NAME_COLUMN_2", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("NAME_COLUMN_3", bigquery.enums.SqlTypeNames.STRING),
        bigquery.SchemaField("NAME_COLUMN_4", bigquery.enums.SqlTypeNames.INTEGER),
    ],
    write_disposition="WRITE_APPEND",
)

job = client.load_table_from_dataframe(
    your_pandas_dataframe, 'project.dataset.table', job_config=job_config
)
Yes, you can definitely use pandas for this. This is my tested example, which works:
import pandas as pd
from datetime import datetime

# df = pd.read_csv('ex.csv')
df = pd.DataFrame({'test': ['one', 'two']})
data = []
for i in range(0, df.shape[0]):
    if i == 0:
        data.append(str(datetime.today()).split(".")[0])
    else:
        data.append("")
df['creation_date'] = data
print(df)
# df.to_csv('temp/save.csv')
Is there any way to turn my Excel workbook into a MySQL database? Say, for example, my Excel workbook's name is copybook.xls; then the MySQL database name would be copybook. I am unable to do it. Your help would be really appreciated.
Here I give an outline and explanation of the process, including links to the relevant documentation. As some of the finer details were missing from the original question, the approach will need to be tailored to your particular needs.
The solution
There are two steps in the process:
1) Import the Excel workbook as a Pandas data frame
Here we use the standard pandas.read_excel method to get the data out of the Excel file. If there is a specific sheet we want, it can be selected with sheet_name. If the file contains a column of row labels, we can use it as the index with the index_col parameter.
import pandas as pd
# Select the Characters sheet and use the first column as the row index
df = pd.read_excel("copybook.xls", sheet_name = "Characters", index_col = 0)
df now contains the following imaginary data frame, which represents the data in the original Excel file:
   first   last
0   John   Snow
1  Sansa  Stark
2   Bran  Stark
2) Write records stored in a DataFrame to a SQL database
Pandas has a neat method, pandas.DataFrame.to_sql, for interacting with SQL databases through the SQLAlchemy library. The original question mentioned MySQL, so here we assume we already have a running instance of MySQL. To connect to the database, we use create_engine. Lastly, we write the records stored in the data frame to the SQL table called characters.
from sqlalchemy import create_engine
engine = create_engine('mysql://USERNAME:PASSWORD@localhost/copybook')
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
We can check that the data has been stored:
engine.execute("SELECT * FROM characters").fetchall()
Out:
[(0, 'John', 'Snow'), (1, 'Sansa', 'Stark'), (2, 'Bran', 'Stark')]
or better, use pandas.read_sql_table to read back the data directly as a data frame:
pd.read_sql_table("characters", engine)
Out:
   index  first   last
0      0   John   Snow
1      1  Sansa  Stark
2      2   Bran  Stark
Learn more
No MySQL instance available?
You can test the approach by using an in-memory version of SQLite database. Just copy-paste the following code to play around:
import pandas as pd
from sqlalchemy import create_engine
# Create a new SQLite instance in memory
engine = create_engine("sqlite://")
# Create a dummy data frame for testing or read it from Excel file using pandas.read_excel
df = pd.DataFrame({'first' : ['John', 'Sansa', 'Bran'], 'last' : ['Snow', 'Stark', 'Stark']})
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
# Read SQL database table into a DataFrame
pd.read_sql_table('characters', engine)
The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as the destination tables. It'd be nice not to have to create the format file bcp.exe requires (XML for roughly 400 fields for each of 16 separate tables).
Following the Pythonic approach, try to insert the data and let SQL Server throw an exception if there is a type mismatch or other problem.
If you don't want to use the bcp command to import the csv file, you can use the Python pandas library.
Here's an example in which I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
The CSV file has no header row and three columns (id, name, age).
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc
# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv (r'C:\Users\leony\Desktop\test9.csv',header=None,names = ['id', 'name', 'age'])
# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9',engine,if_exists='append',index=False)
# add the 'dtype' parameter to specify datatypes if needed, e.g. dtype={'column1': VARCHAR(255), 'column2': DateTime}
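For reference, the dtype option mentioned in the last comment might be used like this (a sketch; the SQL types are only examples for the id/name/age columns above):
from sqlalchemy.types import VARCHAR, Integer

# pass explicit SQL column types instead of letting pandas infer them
df.to_sql('test9', engine, if_exists='append', index=False,
          dtype={'id': Integer(), 'name': VARCHAR(255), 'age': Integer()})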
Notice:
Get the connection string from the Azure Portal.
The UID format is [username]@[servername].
Run the script and it works.
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.
I have the following Python Pandas Excel read statement, which uses a 'converter' to change an 'ID' column from a number type to a string type. I set it up this way in order to make merging dataframes easier later on in the code. I have now gotten access to the DB to pull the data directly. Is anyone familiar with adding a converter into the cnxn line with pyodbc?
Excel
df = pd.read_excel('c:/Users/username/Desktop/filename.xlsx', sheet_name="sheet1", converters={'ID':str})
PYODBC
cnxn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=server.name.com,xxxxx;UID=user;PWD=password; Trusted_Connection=yes')
cursor = cnxn.cursor()
cursor.execute(script)
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df = pd.read_sql_query(script, cnxn)
As of right now, using Excel works exactly how I want it to, and I am fairly confident I can convert the series type later on in the code, but I am wondering if it can be done when it is called/imported directly from SQL.
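(For illustration, the post-query conversion mentioned above could be as simple as the following sketch, assuming the column comes back from SQL named 'ID'.)
# convert the ID column to string after reading from SQL (assumes the column is named 'ID')
df = pd.read_sql_query(script, cnxn)
df['ID'] = df['ID'].astype(str)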
Thanks for your help!