I have a MySQL dump file in .sql format. Its size is around 100 GB and it contains just two tables. I have to extract the data from this file using Python or Bash. The issue is that each INSERT statement contains all of a table's data on a single, extremely long line, so the usual line-by-line approach runs out of memory because that one line (i.e., all the data) is loaded in each loop iteration.
Is there an efficient way or tool to get the data out as CSV?
Just a little explanation: the following line contains the actual data and is very large.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import the dump into MySQL itself due to resource constraints.
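For reference, the kind of streaming approach I have in mind looks roughly like the sketch below: read the dump in fixed-size chunks and split records on the '),(' separators so the whole INSERT line never has to sit in memory. It is only a rough idea (the file names are placeholders, and it assumes the sequences '),(' and ');' and escaped quotes never occur inside a value), so I'm hoping there is a more robust way or an existing tool:

import csv
import re

CHUNK = 64 * 1024 * 1024  # read 64 MB per iteration

RECORD_END = re.compile(r"\),\(|\);")

def dump_to_csv(dump_path, csv_path, table):
    """Stream the rows of one table's INSERT statement(s) out to a CSV file.

    Rough sketch only: assumes the sequences "),(" and ");" and escaped
    quotes never occur inside a value, which real data may violate.
    NULL values come out as the literal string "NULL".
    """
    start = re.compile(r"INSERT INTO `%s` VALUES \(" % re.escape(table))
    with open(dump_path, encoding="utf-8", errors="replace") as src, \
         open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        buf, inside = "", False
        while True:
            chunk = src.read(CHUNK)
            buf += chunk
            pos = 0
            while True:
                if not inside:
                    m = start.search(buf, pos)
                    if not m:
                        pos = max(pos, len(buf) - 200)  # keep a tail in case a prefix is split across chunks
                        break
                    pos, inside = m.end(), True
                m = RECORD_END.search(buf, pos)
                if not m:
                    break  # incomplete record, wait for the next chunk
                writer.writerow(next(csv.reader([buf[pos:m.start()]], quotechar="'")))
                inside = m.group() == "),("   # ");" ends the statement
                pos = m.end()
            buf = buf[pos:]
            if not chunk:
                break

dump_to_csv("dump.sql", "tblEmployee.csv", "tblEmployee")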
I'm not sure if this is what you want, but pandas can read a SQL table into a DataFrame and then write it out as a CSV. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")
# load the table into a DataFrame (replace your_table with the table name)
dataframe = pd.read_sql('SELECT * FROM your_table', connect)
# write the DataFrame to a CSV file
dataframe.to_csv("filename.csv", index=False)
connect.close()
If you want to change the delimiter, you can pass it via sep, e.g. dataframe.to_csv("filename.csv", index=False, sep=';'), replacing ';' with your delimiter of choice.
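Given the 100 GB mentioned in the question, it is also worth noting (as a hedged variation on the snippet above, with the same placeholder table and file names) that read_sql accepts a chunksize argument and then returns an iterator of smaller DataFrames, so the CSV can be written piece by piece without holding the whole table in memory:

import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")
first = True
# read_sql with chunksize yields DataFrames of at most 100,000 rows each
for chunk in pd.read_sql('SELECT * FROM your_table', connect, chunksize=100_000):
    chunk.to_csv("filename.csv", mode="w" if first else "a", header=first, index=False)
    first = False
connect.close()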
Related
I have a .CSV file that has data of the form:
2006111024_006919J.20J.919J-25.HLPP.FMGN519.XSVhV7u5SHK3H4gsep.log,2006111024,K0069192,MGN519,DN2BS460SEB0
This is how it appears in a text file; in Excel the commas become columns.
The .csv file can have hundreds of these rows. To make both writing and reading the code easier, I am using pandas together with SQLAlchemy. I am new to Python and all of these modules.
My initial method gets all the info but does one insert at a time for each row of a CSV file. My mentor says this is not the best way and that I should read all rows of the CSV and then insert them in one "bulk" insert. My method so far uses pandas df.to_sql; I hear it has a "multi" mode for inserts, but with my limited knowledge I have no idea how to use it or how it would fit the method I have so far:
def odfsfromcsv_to_db(csvfilename_list, db_instance):
    odfsdict = db_instance['odfs_tester_history']
    for csv in csvfilename_list:  # is there a faster way to compare the list of files in archive and history?
        if csv not in archivefiles_set:
            odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WF_SCRIBE'])
            # print(odfscsv_df['ODFS_LOG_FILENAME'])
            for index, row in odfscsv_df.iterrows():
                table_row = {
                    "ODFS_LOG_FILENAME": row['ODFS_LOG_FILENAME'],
                    "ODFS_FILE_CREATE_DATETIME": row['ODFS_FILE_CREATE_DATETIME'],
                    "LOT": row['LOT'],
                    "TESTER": row['TESTER'],
                    "WF_SCRIBE": row['WF_SCRIBE'],
                    "CSV_FILENAME": csv.name
                }
                print(table_row)
                df1 = pd.DataFrame.from_dict([table_row])
                result = df1.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
        else:
            print(csv.name + " is in archive folder already")
How do I modify this so it inserts multiple records at once? I felt limited to creating a new dictionary for each row and then inserting that dictionary row by row. Is there a way to collect the rows into one big structure and push them all into my db at once using pandas?
You already have the pd.read_csv call; you would just need to use the following code:
odfscsv_df.to_sql('DATA', conn, if_exists='replace', index = False)
Here DATA is your table, conn is your connection, etc. The two links below should help with the specifics of your code. I have also attached a snippet of some old code that might make it clearer, but the two links are the better resource.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=sql#io-sql
import sqlite3
import pandas as pd

conn = None
try:
    conn = sqlite3.connect(':memory:')   # run the database in RAM, with no requirement to create a file
    # conn = sqlite3.connect('data.db')  # or create a new database file by changing the name within the quotes
except sqlite3.Error as e:
    print(e)

c = conn.cursor()  # the database will be saved in the location of your .py file IF you did not choose the :memory: option

# Create table - DATA from input.csv - this must match the values and headers of the incoming CSV file.
c.execute('''CREATE TABLE IF NOT EXISTS DATA
             ([generated_id] INTEGER PRIMARY KEY,
              [What] text,
              [Ever] text,
              [Headers] text,
              [And] text,
              [Data] text,
              [You] text,
              [Have] text)''')
conn.commit()

# When reading the csv:
# - Place 'r' before the path string to read any special characters, such as '\'
# - Don't forget to put the file name at the end of the path + '.csv'
# - Before running the code, make sure that the column names in the CSV file match the column names in the table created and in the query below
# - If needed, make sure that all the columns are in a TEXT format
read_data = pd.read_csv(r'input.csv', engine='python')
read_data.to_sql('DATA', conn, if_exists='replace', index=False)  # insert the values from the csv file into the table 'DATA'

# DO STUFF with your data

c.execute('''DROP TABLE IF EXISTS DATA''')
conn.close()
I found my own answer by just feeding the dataframe straight to SQL. It is a slight modification of @user13802268's answer:
odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
result = odfscsv_df.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
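For completeness, to_sql also accepts method='multi' (which batches many rows into each INSERT statement) and a chunksize, which is presumably the "multi" mode mentioned in the question. A hedged variant of the same call (whether it is actually faster depends on the database and driver):

result = odfscsv_df.to_sql(
    'odfs_tester_history',
    con=odfsdict['engine'],
    if_exists='append',
    index=False,
    method='multi',   # one multi-row INSERT per batch
    chunksize=1000,   # 1,000 rows per batch
)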
The goal is to load a csv file into an Azure SQL database from Python directly, that is, not by calling bcp.exe. The csv files will have the same number of fields as the destination tables. It'd be nice not to have to create the format file that bcp.exe requires (XML for around 400 fields for each of 16 separate tables).
Following the Pythonic approach, I would rather just try to insert the data and let SQL Server throw an exception if there is a type mismatch or some other problem.
If you don't want to use the bcp command to import the csv file, you can use the Python pandas library.
Here's an example where I import a headerless 'test9.csv' file from my computer into an Azure SQL database.
Csv file: no header, three columns (id, name, age).
Python code example:
import pandas as pd
import sqlalchemy
import urllib
import pyodbc
# set up connection to database (with username/pw if needed)
params = urllib.parse.quote_plus("Driver={ODBC Driver 17 for SQL Server};Server=tcp:***.database.windows.net,1433;Database=Mydatabase;Uid=***@***;Pwd=***;Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;")
engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
# read csv data to dataframe with pandas
# datatypes will be assumed
# pandas is smart but you can specify datatypes with the `dtype` parameter
df = pd.read_csv (r'C:\Users\leony\Desktop\test9.csv',header=None,names = ['id', 'name', 'age'])
# write to sql table... pandas will use default column names and dtypes
df.to_sql('test9',engine,if_exists='append',index=False)
# add 'dtype' parameter to specify datatypes if needed; dtype={'column1':VARCHAR(255), 'column2':DateTime})
Notice:
Get the connection string from the Azure Portal.
The UID format is [username]@[servername].
Run the script and it works.
Please reference these documents:
HOW TO IMPORT DATA IN PYTHON
pandas.DataFrame.to_sql
Hope this helps.
I'm currently using PyHive (Python 3.6) to read data from Hive onto a server that sits outside the Hive cluster, and then using Python to perform analysis.
After performing analysis I would like to write data back to the Hive server.
In searching for a solution, most posts deal with using PySpark. In the long term we will set up our system to use PySpark. However, in the short term is there a way to easily write data directly to a Hive table using Python from a server outside of the cluster?
Thanks for your help!
You could use the subprocess module.
The following function will work for data you've already saved locally. For example, if you save a dataframe to csv, you can pass the name of the csv into save_to_hdfs, and it will put it in hdfs. I'm sure there's a way to push the dataframe up directly, but this should get you started.
Here's an example function for saving a local object, output, to user/<your_name>/<output_name> in hdfs.
import os
from subprocess import PIPE, Popen

import pandas as pd


def save_to_hdfs(output):
    """
    Save a file in local scope to hdfs.
    Note, this performs a forced put - any file with the same name will be
    overwritten.
    """
    hdfs_path = os.path.join(os.sep, 'user', '<your_name>', output)
    put = Popen(["hadoop", "fs", "-put", "-f", output, hdfs_path], stdin=PIPE, bufsize=-1)
    put.communicate()


# example
df = pd.DataFrame(...)
output_file = 'yourdata.csv'
df.to_csv(output_file)
save_to_hdfs(output_file)
# remove locally created file (so it doesn't pollute nodes)
os.remove(output_file)
In which format do you want to write the data to Hive? Parquet/Avro/binary, or a simple csv/text format?
Depending on the serde you chose while creating the Hive table, different Python libraries can be used to first convert your dataframe to that serde, store the file locally, and then use something like save_to_hdfs (as answered by @Jared Wilber below) to move that file into the HDFS path backing the Hive table.
When a Hive table is created (managed by default, or external), it reads/stores its data at a specific HDFS location (the default or a provided one), and that HDFS location can be accessed directly to modify data. Some things to keep in mind if manually updating data in Hive tables: SERDE, PARTITIONS, ROW FORMAT DELIMITED, etc.
Some helpful serde libraries in Python:
Parquet: https://fastparquet.readthedocs.io/en/latest/
Avro: https://pypi.org/project/fastavro/
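As a small illustration of the Parquet case, here is a rough sketch (the dataframe, column names, and file name are made up) that writes a local Parquet file through pandas with the fastparquet engine; the file could then be moved under the table's HDFS location with something like the save_to_hdfs helper from the other answer:

import pandas as pd

# hypothetical dataframe whose columns match the Hive table's schema
df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# write a local Parquet file using the fastparquet engine
df.to_parquet("yourdata.parquet", engine="fastparquet", index=False)

# then push it under the table's HDFS location, e.g.:
# save_to_hdfs("yourdata.parquet")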
It took some digging, but I was able to find a method using SQLAlchemy to create a Hive table directly from a pandas dataframe.

from sqlalchemy import create_engine

# Input information
host = 'username@local-host'
port = 10000
schema = 'hive_schema'
table = 'new_table'

# Execution
engine = create_engine(f'hive://{host}:{port}/{schema}')
engine.execute('CREATE TABLE ' + table + ' (col1 col1-type, col2 col2-type)')
Data.to_sql(name=table, con=engine, if_exists='append')  # Data is your pandas DataFrame
You can write back.
Convert the data of the df into the shape of a multi-row insert, i.e. as if you were inserting multiple rows into the table at once, e.g. insert into table values (first row of dataframe, comma separated), (second row), (third row), ... and so on;
then you can insert it all in one statement.
bundle = df.assign(col='(' + df[df.columns[0]].astype(str) + ',' + df[df.columns[1]].astype(str) + ... + df[df.columns[n]].astype(str) + ')' + ',').col.str.cat(sep=' ')[:-1]
con.cursor().execute('insert into table table_name values ' + bundle)
and you are done.
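A slightly more explicit version of the same idea, as a rough sketch (sql_literal is a made-up helper; it does no escaping, so it is not safe for values that contain quotes):

def sql_literal(value):
    # naive rendering: quote strings, leave everything else as-is (no escaping)
    return "'" + str(value) + "'" if isinstance(value, str) else str(value)

# build "(v1,v2,...),(v1,v2,...),..." from the dataframe rows
rows = ','.join(
    '(' + ','.join(sql_literal(v) for v in row) + ')'
    for row in df.itertuples(index=False)
)
con.cursor().execute('insert into table table_name values ' + rows)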
I have a Pandas DataFrame with around 200,000 indexes/rows and 30 columns.
I need to have this exported directly into an .mdb file; converting it to a csv and manually importing it will not work.
I understand there are tools like pyodbc that help a lot with importing/reading Access, but there is little documentation on how to export.
I'd love any help anyone can give, and would strongly appreciate any examples.
First convert the dataframe into a .csv file using the command below:
name_of_your_dataframe.to_csv("filename.csv", sep='\t', encoding='utf-8')
Then load the .csv into the .mdb using pyodbc.
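A minimal, hedged sketch of that second step, assuming the target table (TableName here, a placeholder, as are the paths and file names) already exists in the .mdb with columns matching the CSV, and that the Microsoft Access ODBC driver is installed:

import csv
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\DB.mdb;"
)
cur = conn.cursor()

with open("filename.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t")   # matches the sep='\t' used above
    header = next(reader)                    # skip the header row
    placeholders = ",".join("?" * len(header))
    # note: to_csv above also writes the index column unless index=False is passed,
    # so make sure the CSV columns really line up with the table columns
    cur.executemany(
        "INSERT INTO TableName VALUES ({})".format(placeholders),
        list(reader),
    )

conn.commit()
conn.close()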
MS Access can directly query CSV files and run a Make-Table Query (https://support.office.com/en-us/article/Create-a-make-table-query-96424f9e-82fd-411e-aca4-e21ad0a94f1b) to produce a resulting table. However, some cleaning is needed to remove the rubbish rows. The code below opens two files, one for reading and one for writing; assuming the rubbish is in the first column of the csv, the if logic writes any line that has some data in its second column (adjust as needed):
import os
import csv
import pyodbc

# TEXT FILE CLEAN
with open(r'C:\Path\To\Raw.csv', 'r') as reader, open(r'C:\Path\To\Clean.csv', 'w') as writer:
    read_csv = csv.reader(reader)
    write_csv = csv.writer(writer, lineterminator='\n')
    for line in read_csv:
        if len(line[1]) > 0:
            write_csv.writerow(line)

# DATABASE CONNECTION
access_path = r"C:\Path\To\Access\DB.mdb"
con = pyodbc.connect("DRIVER={{Microsoft Access Driver (*.mdb, *.accdb)}};DBQ={};"
                     .format(access_path))

# RUN QUERY
strSQL = "SELECT * INTO [TableName] FROM [text;HDR=Yes;FMT=Delimited(,);" + \
         r"Database=C:\Path\To\Folder].Clean.csv;"
cur = con.cursor()
cur.execute(strSQL)
con.commit()
con.close()                            # CLOSE CONNECTION

os.remove(r'C:\Path\To\Clean.csv')     # DELETE CLEAN TEMP
2020 Update
There is now a supported external SQLAlchemy dialect for Microsoft Access ...
https://github.com/gordthompson/sqlalchemy-access
... which enables you to use pandas' to_sql method directly via pyodbc and the Microsoft Access ODBC driver (on Windows).
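For example, a rough sketch of that (untested here; it assumes sqlalchemy-access is installed and registers the access+pyodbc dialect, and all paths and names are placeholders):

from urllib.parse import quote_plus

import pandas as pd
import sqlalchemy as sa

# ODBC connection string for the Access database (path is a placeholder)
conn_str = (
    r"DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\DB.mdb;"
    r"ExtendedAnsiSQL=1;"
)
engine = sa.create_engine("access+pyodbc:///?odbc_connect=" + quote_plus(conn_str))

# to_sql creates the table if it does not exist, otherwise it appends
df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})
df.to_sql("TableName", engine, if_exists="append", index=False)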
I would recommend exporting the pandas dataframe to csv as usual, like this:
dataframe_name.to_csv("df_filename.csv", sep=',', encoding='utf-8')
Then you can convert it to an .mdb file as this Stack Overflow answer shows.
Problem Statement:
I have multiple csv files. I am cleaning them using Python and inserting them into SQL Server using bcp. Now I want to insert the data into Greenplum instead of SQL Server. Please suggest a way to bulk insert directly from a Python dataframe into a Greenplum table.
Solution (what I can think of):
The way I can think of is CSV -> Dataframe -> Cleaning -> Dataframe -> CSV -> then use gpload for the bulk load, and integrate it into a shell script for automation.
Does anyone have a better solution for it?
Issue with loading data directly from a dataframe into a Greenplum table:
gpload asks for a file path. Can I pass a variable or a dataframe to it? Is there any way to bulk load into Greenplum? I don't want to create a csv or txt file from the dataframe and then load it into Greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
# Copy the data from the buffer to the table.
gp_cursor.copy_from(csv_io, 'db.table')
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()
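One caveat: newer psycopg2 releases quote the table name passed to copy_from, so a schema-qualified name like 'db.table' may be rejected. If you hit that, copy_expert with an explicit COPY statement does the same job:

# Equivalent to the copy_from call above (default text format, tab-delimited),
# but with the schema-qualified table name written verbatim in the SQL.
csv_io.seek(0)
gp_cursor.copy_expert("COPY db.table FROM STDIN", csv_io)
greenplum.commit()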