Inserting pandas df into sqlite3 inserts running text, not columns - python

I have a pandas dataframe that I want to insert into a sqlite3 database (writing in python 3.6+).
However, when I insert the data into the table, it shows up as running text rather than as columns. I tried copy/pasting the text from the database file here on Stack Overflow, but Stack Overflow reports an error when I try to submit. As an aside, when I open Sublime Text and copy/paste the text from the database, the only text that gets pasted is:
SQLite format 3
So, I am pasting below a screenshot of the text from the database. It looks like Stack Overflow doesn't like those unusual question-mark characters.
My code to pass the pandas dataframe into sqlite3 is as follows. Note that I am only creating the table here (not inserting data yet) and I still see this behavior.
import sqlite3
import pandas as pd

cols = ["name", "website", "phone", "address", "city"]
df = pd.DataFrame(columns=cols)

conn = sqlite3.connect('TestDB2.db')
c = conn.cursor()
c.execute("""
CREATE TABLE STUDENTS(
    id integer PRIMARY KEY,
    name text,
    age integer,
    city text,
    country text
)""")
conn.commit()

dfObj.to_sql('STUDENTS', conn, if_exists='replace', index=False)
The pandas dataframe looks nice in python and when writing to an excel file. I am new to the sqlite3 world but am familiar with postgres. Any help would be much appreciated!!

As @DinoCoderSaurus mentions:
a sqlite3 database is not a text file. The result here is what one gets when opening a sqlite database with a text editor. Use the command-line sqlite3 tool or some other tool (my fave is DB Browser for SQLite) to inspect db contents. It is unclear from the posted code what dfObj is. The to_sql is meant to insert data.
I did not realize I was looking at the file in a text editor. I installed a plugin for Visual Studio and expected it to provide database visualization. It does not. I installed DB browser and my data is there.
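For anyone who wants to double-check the contents from Python instead of a GUI, a minimal sketch (reusing TestDB2.db and the STUDENTS table from the code above) is to read the table back with pandas:
import sqlite3
import pandas as pd

# Reconnect to the same database file and read the table back into a dataframe.
conn = sqlite3.connect('TestDB2.db')
check_df = pd.read_sql_query("SELECT * FROM STUDENTS", conn)
print(check_df.head())
conn.close()
If the data shows up here (and in DB Browser), the insert worked; the "running text" was just the raw file bytes as rendered by a text editor.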

Related

How to fix 'Python sqlite3.OperationalError: no such table' issue

I received a .db file from my colleague (it includes text and number data) which I need to load into a pandas dataframe for further processing. I have never worked with SQLite before. But, with a few Google searches, I wrote the following lines of code:
import pandas as pd
import numpy as np
import sqlite3
conn = sqlite3.connect('data.db') # This creates data.db if it does not already exist
sql="""
SELECT * FROM data;
"""
df=pd.read_sql_query(sql,conn)
df.head()
This gives me the following error:
Execution failed on sql 'SELECT * FROM data;': no such table: data
What table is this code referring to? I only have data.db.
I do not quite understand where I am going wrong with this. Any advice on how to get my data into the dataframe df?
I'm also new to SQL but based on what you've provided, "data" is referring to a table in your database "data.db".
The query that you typed is instructing the program to select all items from the table called "data". This website helped me with creating tables: https://www.tutorialspoint.com/sqlite/sqlite_create_table.htm
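If you are not sure which tables the .db file actually contains, a quick sketch (the names printed will be whatever happens to be in your file) is to ask SQLite's built-in catalog:
import sqlite3
import pandas as pd

conn = sqlite3.connect('data.db')
# sqlite_master is SQLite's internal catalog; this lists every table in the file.
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", conn)
print(tables)
conn.close()
You can then substitute one of those names into the SELECT * FROM ... query.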

Inserting multiple records at once into SQL Server using Pandas

I have a .CSV file that has data of the form:
2006111024_006919J.20J.919J-25.HLPP.FMGN519.XSVhV7u5SHK3H4gsep.log,2006111024,K0069192,MGN519,DN2BS460SEB0
This is how it appears in a text file. In Excel the commas are columns.
The .csv file can have hundreds of these rows. To make both writing and reading the code easier, I am using pandas mixed with SQLAlchemy. I am new to Python and all these modules.
My initial method gets all the info but does one insert at a time for each row of a csv file. My mentor says this is not the best way and that I should do a "bulk" insert: read all rows of the csv, then insert them all at once. My method so far uses pandas df.to_sql. I hear this method has a "multi" mode for inserts. The problem is, I have no idea how to use it with my limited knowledge, or how it would work with the method I have so far:
def odfsfromcsv_to_db(csvfilename_list, db_instance):
    odfsdict = db_instance['odfs_tester_history']
    for csv in csvfilename_list:  # is there a faster way to compare the list of files in archive and history?
        if csv not in archivefiles_set:
            odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WF_SCRIBE'])
            # print(odfscsv_df['ODFS_LOG_FILENAME'])
            for index, row in odfscsv_df.iterrows():
                table_row = {
                    "ODFS_LOG_FILENAME": row['ODFS_LOG_FILENAME'],
                    "ODFS_FILE_CREATE_DATETIME": row['ODFS_FILE_CREATE_DATETIME'],
                    "LOT": row['LOT'],
                    "TESTER": row['TESTER'],
                    "WF_SCRIBE": row['WF_SCRIBE'],
                    "CSV_FILENAME": csv.name
                }
                print(table_row)
                df1 = pd.DataFrame.from_dict([table_row])
                result = df1.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
        else:
            print(csv.name + " is in archive folder already")
How do I modify this so that it inserts multiple records at once? I felt limited to creating a new dictionary for each row of the table and then inserting that dictionary for each row. Is there a way to collate the rows into one big structure and push them all at once into my db using pandas?
You already have the pd.read_csv; you would just need to use the following code:
odfscsv_df.to_sql('DATA', conn, if_exists='replace', index = False)
Here DATA is your table, conn is your connection, etc. The two links below should help you with any specifics of your code. I have also attached a snippet of some old code that might make it clearer; however, the two links below are the better resource.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=sql#io-sql
import sqlite3
import pandas as pd
from pandas import DataFrame

conn = None
try:
    conn = sqlite3.connect(':memory:')  # This allows the database to run in RAM, with no requirement to create a file.
    # conn = sqlite3.connect('data.db')  # You can create a new database by changing the name within the quotes.
except sqlite3.Error as e:
    print(e)
c = conn.cursor()  # The database will be saved in the location where your 'py' file is saved IF you did not choose the :memory: option
# Create table - DATA from input.csv - this must match the values and headers of the incoming CSV file.
c.execute('''CREATE TABLE IF NOT EXISTS DATA
             ([generated_id] INTEGER PRIMARY KEY,
              [What] text,
              [Ever] text,
              [Headers] text,
              [And] text,
              [Data] text,
              [You] text,
              [Have] text)''')
conn.commit()
# When reading the csv:
# - Place 'r' before the path string to read any special characters, such as '\'
# - Don't forget to put the file name at the end of the path + '.csv'
# - Before running the code, make sure that the column names in the CSV file match the column names in the table created and in the query below
# - If needed, make sure that all the columns are in a TEXT format
read_data = pd.read_csv(r'input.csv', engine='python')
read_data.to_sql('DATA', conn, if_exists='replace', index=False)  # Insert the values from the csv file into the table 'DATA'
# DO STUFF with your data
c.execute('''
DROP TABLE IF EXISTS DATA
''')
conn.close()
I found my own answer by just feeding the dataframe straight to SQL. It is a slight modification of @user13802268's answer:
odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
result = odfscsv_df.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
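For reference, the "multi" mode from the question is exposed through to_sql's method and chunksize arguments. A hedged sketch building on the same names used above (whether method='multi' is actually faster depends on the database driver, so treat it as something to benchmark rather than a guaranteed win):
odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
odfscsv_df['CSV_FILENAME'] = csv.name  # add the filename column to every row at once
# method='multi' packs many rows into each INSERT statement; chunksize caps the rows per batch.
result = odfscsv_df.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False, method='multi', chunksize=500)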

Importing Dates from Excel to Microsoft SQL Server Management Studio

Hope everyone is well and staying safe.
I'm uploading an Excel file to SQL using Python. I have three fields: CarNumber nvarchar(50), Status nvarchar(50), and DisassemblyStart date.
I'm having an issue importing the DisassemblyStart field dates. The connection and transaction using Python are successful.
However, I get zeroes all over, even though the Excel file is populated with dates. I've tried switching to nvarchar(50), date, and datetime to see if I can at least get a string, but nothing. I saved the Excel file as CSV and TXT and tried uploading it and still got zeroes. I added 0.001 to every date in Excel (so as to add an artificial time) in case that would make it click, but nothing happened. Still zeroes. I'm sure there's a major oversight from being too much in the weeds. I need help.
The Excel file has the three field columns.
This is the Python code:
import pandas as pd
import pyodbc
# Import CSV
data = pd.read_csv (r'PATH_TO_CSV\\\XXX.csv')
df = pd.DataFrame(data,columns= ['CarNumber','Status','DisassemblyStart'])
df = df.fillna(value=0)
# Connect to SQL Server
conn = pyodbc.connect("Driver={SQL Server};Server=SERVERNAME,PORT ;Database=DATABASENAME;Uid=USER;Pwd=PW;")
cursor = conn.cursor()
# Create Table
cursor.execute ('DROP TABLE OPS.dbo.TABLE')
cursor.execute ('CREATE TABLE OPS.dbo.TABLE (CarNumber nvarchar(50),Status nvarchar(50), DisassemblyStart date)')
# Insert READ_CSV INTO TABLE
for row in df.itertuples():
    cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber,Status,DisassemblyStart) VALUES (?,?,Convert(datetime,?,23))', row.CarNumber, row.Status, row.DisassemblyStart)
conn.commit()
conn.close()
Help will be much appreciated.
Thank you and be safe,
David
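No accepted fix appears in this thread, but here is a hedged sketch of one thing worth checking (an assumption, not a confirmed diagnosis): df.fillna(value=0) also replaces empty dates with 0, and unparsed date strings may not survive the CONVERT. Parsing the column in pandas first and sending NULL for missing values, building on the df, cursor, and conn from the question's code, is one way to test whether the dates arrive intact:
# Parse the date column in pandas; invalid or empty cells become NaT instead of 0.
df['DisassemblyStart'] = pd.to_datetime(df['DisassemblyStart'], errors='coerce')
for row in df.itertuples():
    # Convert NaT to None so pyodbc inserts NULL rather than a zero value.
    start = None if pd.isna(row.DisassemblyStart) else row.DisassemblyStart.date()
    cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber,Status,DisassemblyStart) VALUES (?,?,?)', row.CarNumber, row.Status, start)
conn.commit()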

SQL statement for CSV files on IPython notebook

I have a tabledata.csv file and I have been using pandas.read_csv to read or choose specific columns with specific conditions.
For instance I use the following code to select all "name" where session_id =1, which is working fine on IPython Notebook on datascientistworkbench.
df = pandas.read_csv('/resources/data/findhelp/tabledata.csv')
df['name'][df['session_id']==1]
I just wonder, after I have read the csv file, whether it is possible to somehow "switch/read" it as a SQL database (I am pretty sure that I did not explain it well using the correct terms, sorry about that!). What I want is to use SQL statements in the IPython notebook to choose specific rows with specific conditions. For example, I could use something like:
Select `name`, count(distinct `session_id`) from tabledata where `session_id` like "100.1%" group by `session_id` order by `session_id`
But I guess I do need to figure out a way to change the csv file into another form so that I can use SQL statements. Many thx!
Here is a quick primer on pandas and sql, using the builtin sqlite3 package. Generally speaking you can do all SQL operations in pandas in one way or another. But databases are of course useful. The first thing you need to do is store the original df in a sql database so that you can query it. Steps listed below.
import pandas as pd
import sqlite3
#read the CSV
df = pd.read_csv('/resources/data/findhelp/tabledata.csv')
#connect to a database
conn = sqlite3.connect("Any_Database_Name.db") #if the db does not exist, this creates a Any_Database_Name.db file in the current directory
#store your table in the database:
df.to_sql('Some_Table_Name', conn)
#read a SQL Query out of your database and into a pandas dataframe
sql_string = 'SELECT * FROM Some_Table_Name'
df = pd.read_sql(sql_string, conn)
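Building on that primer, the kind of filtered, aggregated query from the question can then be run directly against the stored table. A small sketch (Some_Table_Name and the column names here are simply the ones used above and in the question):
# Example: count distinct sessions per name, restricted to session_id = 1.
query = """
SELECT name, COUNT(DISTINCT session_id) AS n_sessions
FROM Some_Table_Name
WHERE session_id = 1
GROUP BY name
ORDER BY name
"""
result = pd.read_sql(query, conn)
print(result)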
Another answer suggested using SQLite. However, DuckDB is a much faster alternative than loading your data into SQLite.
First, loading your data will take time; second, SQLite is not optimized for analytical queries (e.g., aggregations).
Here's a full example you can run in a Jupyter notebook:
Installation
pip install jupysql duckdb duckdb-engine
Note: if you want to run this in a notebook, use %pip install jupysql duckdb duckdb-engine
Example
Load extension (%sql magic) and create in-memory database:
%load_ext sql
%sql duckdb://
Download some sample CSV data:
from urllib.request import urlretrieve
urlretrieve("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv", "penguins.csv")
Query:
%%sql
SELECT species, COUNT(*) AS count
FROM penguins.csv
GROUP BY species
ORDER BY count DESC
JupySQL documentation available here

Write pandas table to impala

Using the impyla module, I've downloaded the results of an impala query into a pandas dataframe, done analysis, and would now like to write the results back to a table on impala, or at least to an hdfs file.
However, I cannot find any information on how to do this, or even how to ssh into the impala shell and write the table from there.
What I'd like to do:
from impala.dbapi import connect
from impala.util import as_pandas
# connect to my host and port
conn=connect(host='myhost', port=111)
# create query to save table as pandas df
create_query = """
SELECT * FROM {}
""".format(my_table_name)
# run query on impala
cur = conn.cursor()
cur.execute(create_query)
# store results as pandas data frame
pandas_df = as_pandas(cur)
cur.close()
Once I've done whatever I need to do with pandas_df, I'd like to save those results back to Impala as a table.
# create query to save new_df back to impala
save_query = """
CREATE TABLE new_table AS
SELECT *
FROM pandas_df
"""
# run query on impala
cur = conn.cursor()
cur.execute(save_query)
cur.close()
The above scenario would be ideal, but I'd be happy if I could figure out how to ssh into impala-shell and do this from python, or even just save the table to hdfs. I'm writing this as a script for other users, so it's essential to have this all done within the script. Thanks so much!
You're going to love Ibis! It has the HDFS functions (put, namely) and wraps the Impala DML and DDL you'll need to make this easy.
The general approach I've used for something similar is to save your pandas table to a CSV, HDFS.put that on to the cluster, and then create a new table using that CSV as the data source.
You don't need Ibis for this, but it should make it a little bit easier and may be a nice tool for you if you're already familiar with pandas (Ibis was also created by Wes, who wrote pandas).
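A rough sketch of that CSV-to-HDFS route without Ibis (the HDFS path, column names, and types below are placeholders, and the upload uses the hdfs command-line client via subprocess):
import subprocess

# 1. Dump the dataframe to a local CSV (no header, comma-delimited).
pandas_df.to_csv('new_table.csv', index=False, header=False)

# 2. Push the file to a directory on HDFS (placeholder path).
subprocess.run(['hdfs', 'dfs', '-put', '-f', 'new_table.csv', '/tmp/new_table/'], check=True)

# 3. Create an external Impala table over that directory, reusing the impyla connection.
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS new_table (col1 STRING, col2 DOUBLE)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/tmp/new_table/'
""")
cur.close()
The column list in the CREATE statement has to match whatever pandas_df actually contains.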
I am trying to do the same thing and I figured out a way to do this with an example provided with impyla:
df = pd.DataFrame(np.reshape(range(16), (4, 4)), columns=['a', 'b', 'c', 'd'])
df.to_sql(name="test_df", con=conn, flavor="mysql")
This works fine and the table in Impala (mysql backend) works fine.
However, I got stuck on getting text values in, as Impala tries to do analysis on the columns and I get cast errors. (It would be really nice if it were possible to implicitly cast from string to [var]char(N) in impyla.)
