As mentioned in the question, my task is to join two data frames from different local databases (MySQL and PostgreSQL) and write the output (the joined dataset) to a CSV file. Here is what I have done so far:
Created a MySQL connection (conn) using mysql.connector.connect, and a separate PostgreSQL connection (con) using psycopg2.
Before doing this I created a table in both MySQL and Postgres. I then exported the MySQL table with df1 = pd.read_sql(sql='SELECT * FROM mydatabase.Empl_det', con=conn).to_csv(r'C:\Users\Aaru\Documents\Empl_det1.csv', index=False), which gave me the output as a CSV file.
Similarly for PostgreSQL, separately: df2 = pd.read_sql(sql='SELECT * FROM postgres.public.salary_details', con=conn).to_csv(r'C:\Users\Aaru\Documents\Emp_sal.csv', index=False).
Now I need to join these two data frames on their common column and get the output in a CSV file.
Sometimes I get the error "You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax", and I have gone through the documentation too, but I could not find a proper reference for my question.
I am using MySQL 8.0.21 and PostgreSQL 10.
Can anyone help me with the join? I hope the details above are enough to work with.
Thanks a lot in advance!!
One more question:
Is it possible to do this and then merge?
import pandas as pd
import mysql.connector
import psycopg2
conn = mysql.connector.connect(host="localhost",
                               user="user",
                               password="mypassword",
                               database="mydatabase")
mycursor_sql = conn.cursor()

con = psycopg2.connect(database="postgres", user="postgres",
                       password="mypassword", host="localhost", port=5432)
cursor = con.cursor()

df1 = pd.read_sql(sql='SELECT * FROM mydatabase.Empl_det', con=conn).to_csv(
    r'C:\Users\Aaru\Documents\Empl_det1.csv', index=False)
df2 = pd.read_sql(sql='SELECT * FROM postgres.public.salary_details', con=conn).to_csv(
    r'C:\Users\Aaru\Documents\Emp_sal1.csv', index=False)

merged_df = df1.merge(df2, on='id')
merged_df.to_csv('join.csv')
print("Success")

conn.commit()
conn.close()
Is there any other way to do it? I need a response as quickly as possible.
Thanks
Here you go:
merged_df = df1.merge(df2, on='id')
merged_df.to_csv('filename.csv')
For joining, use the merge method; for writing, use the to_csv method. But if you have a lot of data, this can lead to a lack of memory on your machine.
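Putting it all together, a rough sketch under the question's setup might look like this (table names, credentials and paths are taken from the question and assumed to be correct; the common column is assumed to be id). Note that read_sql and to_csv are kept as separate steps here, because to_csv returns None, so chaining it directly onto read_sql would leave df1 and df2 empty:

import pandas as pd
import mysql.connector
import psycopg2

# MySQL connection (employee details)
mysql_conn = mysql.connector.connect(
    host="localhost", user="user", password="mypassword", database="mydatabase"
)

# PostgreSQL connection (salary details)
pg_conn = psycopg2.connect(
    database="postgres", user="postgres", password="mypassword",
    host="localhost", port=5432
)

# Read each table into its own DataFrame (no .to_csv() chained here,
# so df1 and df2 really are DataFrames and not None)
df1 = pd.read_sql('SELECT * FROM mydatabase.Empl_det', con=mysql_conn)
df2 = pd.read_sql('SELECT * FROM public.salary_details', con=pg_conn)

# Join on the common column (assumed to be 'id') and write the result to CSV
merged_df = df1.merge(df2, on='id', how='inner')
merged_df.to_csv(r'C:\Users\Aaru\Documents\join.csv', index=False)

mysql_conn.close()
pg_conn.close()

If the tables are too large to fit in memory, read_sql also accepts a chunksize argument, so one side can be read in chunks and the merged pieces appended to the CSV incrementally instead of loading everything at once.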
Still learning Python, so bear with me. I use the following script to import a CSV file into a local SQL database. My problem is that the CSV file usually has a bunch of empty rows at the end, and I get primary key errors on import. What's the best way to handle this? If I manually edit the CSV in a text editor and delete all the rows of ,,,,,,,,,,,,,,,,,,,,,,,,,,, it works perfectly.
Bonus question: is there an easy way to iterate through all the .csv files in a directory, and then delete or move them after they've been processed?
import pandas as pd
import pyodbc

data = pd.read_csv(r'C:\Bookings.csv')
df = pd.DataFrame(data, columns=['BookingKey','BusinessUnit','BusinessUnitKey','DateTime','Number','Reference','ExternalId','AmountTax','AmountTotal','AmountPaid','AmountOpen','AmountTotalExcludingTax','BookingFee','MerchantFee','ProcessorFee','NumberOfPersons','Status','StatusDateTime','StartTime','EndTime','PlannedCheckinTime','ActualCheckinTime','Attendance','AttendanceDatetime','OnlineBookingCheckedDatetime','Origin','CustomerKey'])
df = df.fillna(value=0)
print(df)

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=D3VBUP\SQLEXPRESS;'
                      'Database=BRIQBI;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
for row in df.itertuples():
    cursor.execute('''
        INSERT INTO BRIQBI.dbo.Bookings (BookingKey,BusinessUnit,BusinessUnitKey,DateTime,Number,Reference,ExternalId,AmountTax,AmountTotal,AmountPaid,AmountOpen,AmountTotalExcludingTax,BookingFee,MerchantFee,ProcessorFee,NumberOfPersons,Status,StatusDateTime,StartTime,EndTime,PlannedCheckinTime,ActualCheckinTime,Attendance,AttendanceDatetime,OnlineBookingCheckedDatetime,Origin,CustomerKey)
        VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?)
        ''',
        row.BookingKey,
        row.BusinessUnit,
        row.BusinessUnitKey,
        row.DateTime,
        row.Number,
        row.Reference,
        row.ExternalId,
        row.AmountTax,
        row.AmountTotal,
        row.AmountPaid,
        row.AmountOpen,
        row.AmountTotalExcludingTax,
        row.BookingFee,
        row.MerchantFee,
        row.ProcessorFee,
        row.NumberOfPersons,
        row.Status,
        row.StatusDateTime,
        row.StartTime,
        row.EndTime,
        row.PlannedCheckinTime,
        row.ActualCheckinTime,
        row.Attendance,
        row.AttendanceDatetime,
        row.OnlineBookingCheckedDatetime,
        row.Origin,
        row.CustomerKey
    )
conn.commit()
Ended up being really easy. I added the dropna function so that all rows containing no data at all get dropped:
df = df.dropna(how = 'all')
Now off to find out how to iterate through multiple files in a directory and move them to another location.
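For that part, a minimal sketch of one way to do it with pathlib and shutil (the folder paths here are made-up placeholders, and the database insert from the script above is left as a comment):

import shutil
from pathlib import Path

import pandas as pd

source_dir = Path(r'C:\Bookings\Incoming')   # hypothetical folder with incoming CSVs
processed_dir = Path(r'C:\Bookings\Done')    # hypothetical folder for processed files
processed_dir.mkdir(parents=True, exist_ok=True)

for csv_path in source_dir.glob('*.csv'):
    df = pd.read_csv(csv_path)
    df = df.dropna(how='all')                # drop the completely empty rows, as above
    # ... run the INSERT loop from the script above here ...
    shutil.move(str(csv_path), str(processed_dir / csv_path.name))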
I use the pandas library to write data that I retrieve from a SQL table. It works, but I don't see any column names, and each row is appended to the Excel sheet like
('aa','aa','01/10/2019','zzz')
('bb','cc','03/10/2019','yy')
..
I want my Excel sheet to have column names, and no quote characters (') or tuple parentheses: a proper Excel sheet.
eg:
Name Address Date product
aa aa 01/10/2019 zzz
My code is as follows:
cursor.execute(sql, values)
records = cursor.fetchall()
data = []
for row in records:
    data.append(row)
df = pd.DataFrame(data)
cursor.close()
cnxn.close()
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, index=False, sheet_name='Invoice')
writer.save()
How can I retrieve the data together with its column names and write it to Excel using pandas?
I use pyodbc to connect to the SQL Server database.
I'd recommend using pandas' built-in read_sql function to read from your SQL database.
This should read your header in properly, and the datatypes too!
You'll need to define a SQLAlchemy connection to pass to pd.read_sql, but that is simple for MSSQL:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mssql://timmy:tiger@localhost:5432/mydatabase')
query = "SELECT * FROM mytable"
df = pd.read_sql(query, engine)
EDIT:
As per Ratha's comment, it should be noted that the SQLAlchemy connection isn't essential for read_sql to work; indeed, from the docs, con is defined as:
con : SQLAlchemy connectable (engine/connection) or database string URI
or DBAPI2 connection (fallback mode)
Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.
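So in practice you may be able to pass an existing pyodbc connection straight to read_sql in that fallback mode (it often works, even though only sqlite3 is officially documented there), but the SQLAlchemy engine above is the safer route. A minimal sketch, with placeholder server, database and table names:

import pandas as pd
import pyodbc

# existing pyodbc connection (server/database names are placeholders)
cnxn = pyodbc.connect('Driver={SQL Server};'
                      'Server=myserver;'
                      'Database=mydatabase;'
                      'Trusted_Connection=yes;')

# read_sql picks the column names up from the cursor description,
# so the header comes through automatically
df = pd.read_sql("SELECT * FROM mytable", cnxn)

# write a proper Excel sheet: headers included, no index column, no raw tuples
df.to_excel('output.xlsx', index=False, sheet_name='Invoice')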
I have connected Python, via a Jupyter Notebook, to my local PostgreSQL database. I am able to run a SELECT query successfully and extract the data from my table.
However, I want to show the rows of data in my PostgreSQL table as a dataframe instead of what I currently have.
Below is my code:
conn = psycopg2.connect("dbname=juke user=postgres")
cur = conn.cursor()
cur.execute('SELECT * FROM albums')
for i in cur:
    print(i)
This is the output from the code: each row just prints as a plain tuple.
How do I get the output to show as rows in a dataframe instead?
I looked at and tried a bunch of possible solutions from the recommended duplicate posts that people shared. Unfortunately, none of them worked.
Create the dataframe using pandas.read_sql_query - https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html:
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=juke user=postgres")
df = pd.read_sql_query("SELECT * FROM albums", conn)
df.head()
Using the impyla module, I've downloaded the results of an Impala query into a pandas dataframe, done my analysis, and would now like to write the results back to a table on Impala, or at least to an HDFS file.
However, I cannot find any information on how to do this, or even on how to ssh into the impala shell and write the table from there.
What I'd like to do:
from impala.dbapi import connect
from impala.util import as_pandas
# connect to my host and port
conn = connect(host='myhost', port=111)
# create query to save table as pandas df
create_query = """
SELECT * FROM {}
""".format(my_table_name)
# run query on impala
cur = conn.cursor()
cur.execute(create_query)
# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()
Once I've done whatever I need to do with pandas_df, save those results back to impala as a table.
# create query to save new_df back to impala
save_query = """
CREATE TABLE new_table AS
SELECT *
FROM pandas_df
"""
# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()
The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from Python, or even just save the table to HDFS. I'm writing this as a script for other users, so it's essential to have it all done within the script. Thanks so much!
You're going to love Ibis! It has the HDFS functions (put, namely) and wraps the Impala DML and DDL you'll need to make this easy.
The general approach I've used for something similar is to save your pandas table to a CSV, HDFS.put that on to the cluster, and then create a new table using that CSV as the data source.
You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).
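If you want to see roughly what those steps look like without Ibis, here is a sketch using just impyla plus the hdfs command-line client, reusing the pandas_df from the question; the local path, HDFS directory, table name and column schema are all made up for illustration, so adapt them to your cluster:

import subprocess

from impala.dbapi import connect

# 1. Dump the pandas result to a local CSV (no header row; the table schema supplies the names)
pandas_df.to_csv('/tmp/new_table.csv', index=False, header=False)

# 2. Push the file to a directory on HDFS (paths are hypothetical)
subprocess.run(['hdfs', 'dfs', '-mkdir', '-p', '/user/me/new_table'], check=True)
subprocess.run(['hdfs', 'dfs', '-put', '-f', '/tmp/new_table.csv', '/user/me/new_table/'],
               check=True)

# 3. Create an external Impala table over that directory
conn = connect(host='myhost', port=111)
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS new_table (
        col_a STRING,
        col_b BIGINT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/me/new_table'
""")
cur.close()

Ibis wraps essentially these same steps (the HDFS put plus the Impala DDL), so either route ends up with the same table.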
I am trying to do the same thing, and I figured out a way to do it with an example provided with impyla:
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name='test_df', con=conn, flavor='mysql')
This works fine, and the table in Impala (MySQL backend) works fine.
However, I got stuck on getting text values in, because Impala tries to do analysis on the columns and I get cast errors. (It would be really nice if impyla could implicitly cast from string to [var]char(N).)