How can I write data from a pickle file to sqlite3 database? - python

I have webscraped data into a pickle file and want to write that data into a sqlite3 database. Can anybody help me out with what needs to be done?

You need to create a column of type BLOB (which is a supported datatype in Sqlite3).
Then, you can just INSERT INTO data (id, content) VALUES (?, ?) with the binary dump of your pickle object.
Here's a walk through on Sqlite3 inserts.
pickle.dumps will convert the object into a byte-string you can store in the database. pickle.loads will turn it back after being SELECT'd from the database.
You can also consider using dill for more complex objects.
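A minimal sketch of that flow (the table and column names here are just examples, not something taken from your project):
import pickle
import sqlite3

# Example object; in practice this would be whatever you loaded from your pickle file.
scraped = {"url": "https://example.com", "title": "Example"}

conn = sqlite3.connect("scraped.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, content BLOB)")

# pickle.dumps returns bytes; sqlite3.Binary marks them explicitly as a BLOB value.
blob = sqlite3.Binary(pickle.dumps(scraped))
conn.execute("INSERT INTO data (id, content) VALUES (?, ?)", (1, blob))
conn.commit()

# Reading it back: SELECT the blob and pickle.loads it.
row = conn.execute("SELECT content FROM data WHERE id = ?", (1,)).fetchone()
restored = pickle.loads(row[0])
conn.close()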

Related

How to list all JSON keys in a file to identify database column names?

I plan to use Python to load JSON into an SQLite database. Is there a way to list all of the keys from a JSON file (not from a string) using Python?
I want to determine what columns I will need/use in my SQLite table(s), without having to manually read the file and make a list. Something along the lines of INFORMATION_SCHEMA.COLUMNS in SQL Server, or the FINDALL in Python for XML.
I'm not looking to use other technologies, I'm sticking to Python, JSON, and SQLite on purpose.
The following syntax produced the desired results:
import json

with open('my-file.json') as file:
    data = json.load(file)

for i in data:
    print(i.keys())
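If the goal is a single list of candidate column names rather than the keys of each record, a small extension of the same idea (assuming the file holds a list of JSON objects) is to take the union of all keys:
import json

with open('my-file.json') as file:
    data = json.load(file)

# Collect the union of keys across all records, in case some objects are missing fields.
columns = set()
for record in data:
    columns.update(record.keys())

print(sorted(columns))  # candidate column names for the SQLite table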

store pandas df as blob in oracle database

I want to detect the different dataframes in an Excel file, give each detected dataframe an id, and store each dataframe as an object/blob in an Oracle database.
So in DB table, it would look like:
DF_ID    DF_BLOB
1        /blob string for df 1/
2        /blob string for df 2/
I know how to store the entire Excel file as a blob in Oracle (basically store excelfile.read() directly),
but I cannot directly read() or open() a pandas df. So how can I store this df object as a blob?
The go-to library for storing Python objects in a binary format is pickle.
To get a byte-string instead of writing to a file, use pickle.dumps():
pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)
Return the pickled representation of the object obj as a bytes object, instead of writing it to a file.
Arguments protocol, fix_imports and buffer_callback have the same meaning as in the Pickler constructor.
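A minimal sketch of that approach against the table above, assuming the python-oracledb driver (cx_Oracle works the same way) and a hypothetical table name df_store holding the DF_ID and DF_BLOB columns:
import pickle
import pandas as pd
import oracledb  # assumed driver; connection details below are placeholders

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in for one detected dataframe
blob_bytes = pickle.dumps(df)                  # binary representation of the df

conn = oracledb.connect(user="user", password="password", dsn="host/service")
cur = conn.cursor()

# Hint the bind type so a large payload is bound as a BLOB rather than a RAW.
cur.setinputsizes(None, oracledb.DB_TYPE_BLOB)
cur.execute("INSERT INTO df_store (df_id, df_blob) VALUES (:1, :2)", (1, blob_bytes))
conn.commit()

# Reading it back: the BLOB comes back as a LOB object, and .read() returns the bytes.
cur.execute("SELECT df_blob FROM df_store WHERE df_id = :1", (1,))
restored_df = pickle.loads(cur.fetchone()[0].read())
cur.close()
conn.close()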

Get data as csv from a very large MySQL dump file

I have a MySQL dump file in .sql format. Its size is around 100GB. There are just two tables in it. I have to extract the data from this file using Python or Bash. The issue is that a single INSERT statement contains all the data, and that line is extremely long. Hence, the normal approach causes memory issues, since that entire line (i.e., all the data) gets loaded even when reading the file in a loop.
Is there any efficient way or tool to get data as CSV?
Just a little explanation: the following line contains the actual data, and it is of very large size.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import it into MySQL due to resource issues.
I'm not sure if this is what you want, but pandas can read a SQL table into a DataFrame and write it out as a CSV. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# Load the SQL table into a DataFrame ('table' is a placeholder table name).
dataframe = pd.read_sql('SELECT * FROM table', connect)

# Write the DataFrame to a CSV file.
dataframe.to_csv("filename.csv", index=False)

connect.close()
If you want to change the delimiter, you can do dataframe.to_csv("filename.csv", index=False, sep=';') and just change the ';' to your delimiter of choice.
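If the table is too big to read into memory in one go, a variant of the same idea is to stream it: pd.read_sql accepts a chunksize argument and then yields DataFrames, so each chunk can be appended to the CSV (table and file names below are placeholders):
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# Stream the table 100,000 rows at a time instead of loading everything at once.
first = True
for chunk in pd.read_sql('SELECT * FROM table_name', connect, chunksize=100_000):
    # Write the header only with the first chunk, then append without it.
    chunk.to_csv("filename.csv", mode="w" if first else "a", header=first, index=False)
    first = False

connect.close()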

Copying csv data obtained using .read() to postgresql database using Python 3.5

I want to copy CSV data without saving the actual CSV file to my folder. Currently, I can get the CSV data via the following code:
f = request.files['data_file'].read()
a = f.decode('utf-8')
If I print a, I can see the data from the CSV. My problem is: how do I copy this data to my PostgreSQL database? I tried using the COPY command in PostgreSQL, but it needs a path to a file, and I don't want to store the actual CSV; I just want it copied directly into my Postgres database. I'm using Python 3.
Using one of psycopg2's copy_* cursor methods is the way to go. Which one will depend on:
Does the csv have a header?
Does the csv structure match the table exactly (number and type of columns)?
Note #1 - The form of the copy command that takes a path expects that path to exist on the database server. In Heroku, that will never be the case. Instead, you need the form of the command like this: copy table_name from stdin.... The copy_from method is a convenience method for that form.
1) Simplest case - comma-delimited file with no headers exactly matching the table structure:
stmt.copy_from(request.files['data_file'], 'your_table', sep=',')
(stmt is a cursor, preferably used inside a with conn.cursor() as stmt: clause)
2) No header, but csv only has a subset of the columns:
stmt.copy_from(request.files['data_file'], 'some_table', sep=',', columns=('col1', 'col2', 'col3'))
3) If you have a header, you will need copy_expert -
sql = """
copy some_table (col1, col2, col3)
from stdin with csv header delimiter ','
"""
stmt.copy_expert(sql, request.files['data_file'])
Note #2 - the data will be implicitly converted to the correct type. It will also need to satisfy data constraints. A failure of either operation for a single record kills the entire transaction. As a result, you may need to get fancy and load all the data into a simple temp table, clean it, then do an INSERT ... SELECT into the final table.
Note #3 - I kinda guessed that you can use request.files directly, but did not test it. If that fails, stream the data to a temp file and use that as the argument to the copy method.
See:
http://initd.org/psycopg/docs/cursor.html#cursor.copy_from
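Putting the pieces together for the header case, a rough sketch assuming a Flask upload as in the question, with placeholder connection details and the same placeholder table/column names as above:
import io
import psycopg2
from flask import request  # assumes the upload arrives via a Flask request, as in the question

# Decode the uploaded bytes and wrap them in a file-like object for COPY.
raw = request.files['data_file'].read()
csv_buffer = io.StringIO(raw.decode('utf-8'))

conn = psycopg2.connect("dbname=mydb user=me password=secret host=localhost")  # placeholder DSN
with conn, conn.cursor() as stmt:
    stmt.copy_expert(
        """
        copy some_table (col1, col2, col3)
        from stdin with csv header delimiter ','
        """,
        csv_buffer,
    )
conn.close()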

How can I insert data from a dataframe (in Python) into a Greenplum table?

Problem Statement:
I have multiple CSV files. I am cleaning them using Python and inserting them into SQL Server using bcp. Now I want to insert them into Greenplum instead of SQL Server. Please suggest a way to bulk insert directly from a Python dataframe into a Greenplum table.
Solution (what I can think of):
The way I can think of is CSV -> DataFrame -> Cleaning -> DataFrame -> CSV -> then use gpload for bulk load, and integrate it into a shell script for automation.
Does anyone have a better solution for it?
Issue with loading data directly from a dataframe to a Greenplum table:
gpload asks for a file path. Can I pass a variable or a dataframe to it? Is there any way to bulk load into Greenplum? I don't want to create a CSV or txt file from the dataframe and then load it into Greenplum.
I would use psycopg2 and the io libraries to do this. io is built-in and you can install psycopg2 using pip (or conda).
Basically, you write your dataframe to a string buffer ("memory file") in the csv format. Then you use psycopg2's copy_from function to bulk load/copy it to your table.
This should get you started:
import io
import pandas
import psycopg2
# Write your dataframe to memory as csv
csv_io = io.StringIO()
dataframe.to_csv(csv_io, sep='\t', header=False, index=False)
csv_io.seek(0)
# Connect to the GreenPlum database.
greenplum = psycopg2.connect(host='host', database='database', user='user', password='password')
gp_cursor = greenplum.cursor()
# Copy the data from the buffer to the table.
gp_cursor.copy_from(csv_io, 'db.table')
greenplum.commit()
# Close the GreenPlum cursor and connection.
gp_cursor.close()
greenplum.close()
