Replacing Bad Data in Pandas Data Frame - python

In Python 2.7, I'm connecting to an external data source using the following:
import pypyodbc
import pandas as pd
import datetime
import csv
import boto3
import os
# Connect to the DataSource
conn = pypyodbc.connect("DSN = FAKE DATA SOURCE; UID=FAKEID; PWD=FAKEPASSWORD")
# Specify the query we're going to run on it
script = ("SELECT * FROM table")
# Create a dataframe from the above query
df = pd.read_sql_query(script, conn)
I get the following error:
C:\Python27\python.exe "C:/Thing.py"
Traceback (most recent call last):
File "C:/Thing.py", line 30, in <module>
df = pd.read_sql_query(script,conn)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 431, in read_sql_query
parse_dates=parse_dates, chunksize=chunksize)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1608, in read_query
data = self._fetchall_as_list(cursor)
File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1617, in _fetchall_as_list
result = cur.fetchall()
File "build\bdist.win32\egg\pypyodbc.py", line 1819, in fetchall
File "build\bdist.win32\egg\pypyodbc.py", line 1871, in fetchone
ValueError: could not convert string to float: ?
It seems to me that one of the float columns contains a '?' symbol for some reason. I've reached out to the owner of the data source, but they cannot change the underlying table.
Is there a way to replace incorrect data like this using pandas? I've tried using replace after the read_sql_query statement, but I get the same error.

Hard to know for certain without seeing your data, obviously, but you could try setting coerce_float to False, i.e. replacing your last line with
df = pd.read_sql_query(script, conn, coerce_float=False)
See the documentation of read_sql_query.
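If coerce_float=False gets the query to return, the '?' placeholders will come back as strings, and you can coerce them afterwards. A minimal sketch with a stand-in frame, since I don't have your data (the column name "value" is hypothetical):

```python
import pandas as pd

# Stand-in for the SQL result read with coerce_float=False, so the '?'
# placeholder arrives as a string instead of crashing the read.
df = pd.DataFrame({"value": ["1.5", "2.0", "?", "3.25"]})

# to_numeric with errors="coerce" turns anything unparseable ('?') into NaN,
# after which the column behaves as a normal float column.
df["value"] = pd.to_numeric(df["value"], errors="coerce")
```

From there you can drop or fill the NaN rows with dropna()/fillna() as appropriate.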

Related

How do I debug my python code to identify if my issues are being caused by my code or my installation of pandas

I'm trying to use Pandas for the first time. My script opens a sqlite database, reads the contents of the test_t table, then converts the Unix-millisecond timestamps in the 'timestamp' column to dd/mm/yyyy hh:mm:ss before outputting the results to an Excel file. I can't tell whether the traceback issues are caused by my code, my installation of pandas, or the format of the values in the 'timestamp' column, an example of which is 1670258569783. I have created a test script which correctly reports the pandas version, so pandas seems to be installed. I've included my code and the debug output below.
Any guidance on resolving this issue would be greatly appreciated.
My Code
import sqlite3
import pandas as pd
from datetime import datetime
# Connect to the database
conn = sqlite3.connect('test.db')
# Query the database
query = '''SELECT * FROM test_t'''
df = pd.read_sql_query(query, conn)
# Convert the Unix millisecond timestamps to human-readable dates and times
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
# Export to an Excel file
df.to_excel('message_view.xlsx', index=False)
# Close the connection
conn.close()
Traceback Errors
C:\Users\xxxx\AppData\Local\Microsoft\WindowsApps\python3.9.exe C:\temp\MyPython\testscript.py
Traceback (most recent call last):
File "C:\temp\MyPython\testscript.py", line 13, in <module>
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\series.py", line 4771, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\apply.py", line 1123, in apply
return self.apply_standard()
File "C:\Users\xxxx\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\pandas\core\apply.py", line 1174, in apply_standard
mapped = lib.map_infer(
File "pandas\_libs\lib.pyx", line 2924, in pandas._libs.lib.map_infer
File "C:\temp\MyPython\testscript.py", line 13, in <lambda>
df['human_readable_time'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000).strftime('%d/%m/%Y %H:%M:%S'))
ValueError: Invalid value NaN (not a number)
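The ValueError at the bottom of the traceback says the 'timestamp' column contains NaN, and datetime.fromtimestamp cannot accept NaN. One way around it (a sketch with stand-in data rather than your database) is to let pandas do the conversion, since pd.to_datetime maps missing values to NaT instead of raising:

```python
import pandas as pd

# Stand-in frame: one valid Unix-millisecond timestamp and one missing value,
# mirroring the NaN that triggers the ValueError in the traceback.
df = pd.DataFrame({"timestamp": [1670258569783, None]})

# to_datetime with unit="ms" converts valid rows and maps NaN to NaT
# instead of raising; strftime via the .dt accessor then leaves NaT rows empty.
df["human_readable_time"] = pd.to_datetime(df["timestamp"], unit="ms").dt.strftime("%d/%m/%Y %H:%M:%S")
```

Note pd.to_datetime interprets the values as UTC, whereas datetime.fromtimestamp uses the local timezone, so the rendered times may shift if your machine is not on UTC.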

Writing data into Snowflake table using Python

I am trying to read data from Excel into a pandas dataframe and then write the dataframe to a Snowflake table. Code is below.
The connection is established and the Excel read works fine, but the write to the Snowflake table does not. I'm getting the error below; requesting help to resolve it.
snowflake.connector.errors.MissingDependencyError: Missing optional dependency: pandas Process finished with exit code 1
import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from snowflake.connector.pandas_tools import pd_writer
url = URL(
account = '',
user = '',
schema = 'TMP',
database = 'TMP',
warehouse= 'DATABRICKS',
role = '',
authenticator='externalbrowser',
)
engine = create_engine(url)
con = engine.connect()
df = pd.read_excel("C:\\Final.xlsx")
df.columns = df.columns.astype(str)
table_name = 'test_connect'
if_exists = 'replace'
df.to_sql(name=table_name.lower(), con=con,index= False, if_exists=if_exists, method=pd_writer)
Detailed Error info below
Traceback (most recent call last):
File "C:\Users\XYZ\AppData\Roaming\JetBrains\DataSpell2022.2\scratches\scratch.py", line 32, in <module>
df.to_sql(name=table_name.lower(), con=con,index= False, if_exists=if_exists, method=pd_writer)
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\pandas\core\generic.py", line 2963, in to_sql
return sql.to_sql(
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\pandas\io\sql.py", line 697, in to_sql
return pandas_sql.to_sql(
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\pandas\io\sql.py", line 1739, in to_sql
total_inserted = sql_engine.insert_records(
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\pandas\io\sql.py", line 1322, in insert_records
return table.insert(chunksize=chunksize, method=method)
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\pandas\io\sql.py", line 950, in insert
num_inserted = exec_insert(conn, keys, chunk_iter)
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\snowflake\connector\pandas_tools.py", line 320, in pd_writer
df = pandas.DataFrame(data_iter, columns=keys)
File "C:\Users\XYZ\AppData\Roaming\Python\Python310\site-packages\snowflake\connector\options.py", line 36, in __getattr__
raise MissingDependencyError(self._dep_name)
snowflake.connector.errors.MissingDependencyError: Missing optional dependency: pandas
Process finished with exit code 1
I believe the following dependency install step has not been completed: https://docs.snowflake.com/en/user-guide/python-connector-pandas.html#installation
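Per that installation page, the fix is presumably to install the connector's pandas extra, e.g.:

```shell
# Install the Snowflake connector together with its pandas extra; the quotes
# keep some shells (e.g. zsh) from interpreting the square brackets.
pip install "snowflake-connector-python[pandas]"
```

After installing, restart the interpreter so the connector re-detects pandas.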

Working with json file to convert to a sqlite table format in python

I have data formatted in a .json file. The end goal is to reformat the data into a SQLite table and store it in a database for further analysis.
Here is a sample of the data:
{"_id":{"$oid":"60551"},"barcode":"511111019862","category":"Baking","categoryCode":"BAKING","cpg":{"$id":{"$oid":"601ac114be37ce2ead437550"},"$ref":"Cogs"},"name":"test brand #1612366101024","topBrand":false}
{"_id":{"$oid":"601c5460be37ce2ead43755f"},"barcode":"511111519928","brandCode":"STARBUCKS","category":"Beverages","categoryCode":"BEVERAGES","cpg":{"$id":{"$oid":"5332f5fbe4b03c9a25efd0ba"},"$ref":"Cogs"},"name":"Starbucks","topBrand":false}
{"_id":{"$oid":"601ac142be37ce2ead43755d"},"barcode":"511111819905","brandCode":"TEST BRANDCODE #1612366146176","category":"Baking","categoryCode":"BAKING","cpg":{"$id":{"$oid":"601ac142be37ce2ead437559"},"$ref":"Cogs"},"name":"test brand #1612366146176","topBrand":false}
{"_id":{"$oid":"601ac142be37ce2ead43755a"},"barcode":"511111519874","brandCode":"TEST BRANDCODE #1612366146051","category":"Baking","categoryCode":"BAKING","cpg":{"$id":{"$oid":"601ac142be37ce2ead437559"},"$ref":"Cogs"},"name":"test brand #1612366146051","topBrand":false}
Followed by the code:
import pandas as pd
import json
import sqlite3
# Open json file and convert to a list
with open("users.json") as f:
dat = [json.loads(line.strip()) for line in f]
# create a dataframe from the json file
df = pd.DataFrame(dat)
#open database connection
con = sqlite3.connect("fetch_rewards.db")
c = con.cursor()
df.to_sql("users", con)
c.close()
The error I am getting:
Traceback (most recent call last):
File "C:\Users\mohammed.alabbas\Desktop\sqlite\import_csv.py", line 16, in <module>
df.to_sql("users", con)
File "C:\Users\name\AppData\Roaming\Python\Python39\site-packages\pandas\core\generic.py", line 2605, in to_sql
sql.to_sql(
File "C:\Users\name\AppData\Roaming\Python\Python39\site-packages\pandas\io\sql.py", line 589, in to_sql
pandas_sql.to_sql(
File "C:\Users\name\AppData\Roaming\Python\Python39\site-packages\pandas\io\sql.py", line 1828, in to_sql
table.insert(chunksize, method)
File "C:\Users\mname\AppData\Roaming\Python\Python39\site-packages\pandas\io\sql.py", line 830, in insert
exec_insert(conn, keys, chunk_iter)
File "C:\Users\mname\AppData\Roaming\Python\Python39\site-packages\pandas\io\sql.py", line 1555, in _execute_insert
conn.executemany(self.insert_statement(num_rows=1), data_list)
sqlite3.InterfaceError: Error binding parameter 1 - probably unsupported type.
Thanks in advance
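The sqlite3 InterfaceError is most likely because the _id and cpg columns still hold nested dicts after json.loads, and sqlite3 cannot bind a dict as a parameter. A sketch of one way to flatten them first with pd.json_normalize, using a single trimmed-down record from the sample and an in-memory database:

```python
import json
import sqlite3
import pandas as pd

# One record in the same shape as the sample data (trimmed for brevity).
line = '{"_id":{"$oid":"60551"},"barcode":"511111019862","category":"Baking","cpg":{"$id":{"$oid":"601ac114be37ce2ead437550"},"$ref":"Cogs"},"name":"test brand","topBrand":false}'

# json_normalize flattens the nested _id / cpg dicts into scalar columns
# (e.g. "_id.$oid"), which sqlite3 can bind; raw dict values cannot be bound.
df = pd.json_normalize([json.loads(line)])

# With only scalar columns, to_sql succeeds.
con = sqlite3.connect(":memory:")
df.to_sql("users", con, index=False)
con.close()
```

For the real file, build the list with one json.loads per line as in your code, then pass that whole list to pd.json_normalize.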

Traceback when using Pandas to convert to Json Format

I've recently been trying to use the NBA API to pull shot chart data. I'll link the documentation for the specific function I'm using here.
I keep getting a traceback as follows:
Traceback (most recent call last):
File "nbastatsrecieve2.py", line 27, in <module>
df.to_excel(filename, index=False)
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\generic.py", line 2023, in to_excel
formatter.write(
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\io\formats\excel.py", line 730, in write
writer = ExcelWriter(stringify_path(writer), engine=engine)
File "C:\Users\*\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\io\excel\_base.py", line 637, in __new__
raise ValueError(f"No engine for filetype: '{ext}'") from err
ValueError: No engine for filetype: ''
This is all of the code as I currently have it:
from nba_api.stats.endpoints import shotchartdetail
import pandas as pd
import json
print('Player ID?')
playerid = input()
print('File Name?')
filename = input()
response = shotchartdetail.ShotChartDetail(
team_id= 0,
player_id= playerid
)
content = json.loads(response.get_json())
# transform contents into dataframe
results = content['resultSets'][0]
headers = results['headers']
rows = results['rowSet']
df = pd.DataFrame(rows)
df.columns = headers
# write to excel file
df.to_excel(filename, index=False)
Hoping someone can help because I'm very new to the JSON format.
You are getting this because the filename has no extension. Pandas uses the extension of the filename (like xlsx or xls), when you don't give it an ExcelWriter, to determine the right library to use for that format. Just try something like df.to_excel('filename.xlsx', index=False) and see how it goes.
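If you want to keep taking the filename from input() while guarding against a missing extension, a small hypothetical helper could default it to .xlsx (pandas picks its Excel engine from that suffix):

```python
import os

def ensure_xlsx(filename):
    """Append a default .xlsx extension when the user typed a bare name.
    Hypothetical helper: pandas chooses its Excel engine from this suffix."""
    root, ext = os.path.splitext(filename)
    return filename if ext else root + ".xlsx"
```

Then the final line becomes df.to_excel(ensure_xlsx(filename), index=False).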

Insert pandas dataframe to mysql using sqlalchemy

I'm simply trying to write a pandas dataframe to a local MySQL database on Ubuntu.
from sqlalchemy import create_engine
import tushare as ts
df = ts.get_tick_data('600848', date='2014-12-22')
engine = create_engine('mysql://user:passwd@127.0.0.1/db_name?charset=utf8')
df.to_sql('tick_data',engine, flavor = 'mysql', if_exists= 'append')
and it pop the error
biggreyhairboy#ubuntu:~/git/python/fjb$ python tushareDB.py
Error on sql SHOW TABLES LIKE 'tick_data'
Traceback (most recent call last):
File "tushareDB.py", line 13, in <module>
df.to_sql('tick_data', con = engine,flavor ='mysql', if_exists= 'append')
File "/usr/lib/python2.7/dist-packages/pandas/core/frame.py", line 1261, in to_sql
self, name, con, flavor=flavor, if_exists=if_exists, **kwargs)
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 207, in write_frame
exists = table_exists(name, con, flavor)
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 275, in table_exists
return len(tquery(query, con)) > 0
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 90, in tquery
cur = execute(sql, con, cur=cur)
File "/usr/lib/python2.7/dist-packages/pandas/io/sql.py", line 53, in execute
con.rollback()
AttributeError: 'Engine' object has no attribute 'rollback'
The dataframe is not empty and the database is ready without tables; I have tried another method of creating the table in Python with mysqldb and it works fine.
a related question:
Writing to MySQL database with pandas using SQLAlchemy, to_sql
but no actual reason was explained
You appear to be using an older version of pandas. I did a quick git bisect to find the version of pandas where line 53 contains con.rollback(), and found pandas at v0.12, which is before SQLAlchemy support was added to the execute function.
If you're stuck on this version of pandas, you'll need to use a raw DBAPI connection:
df.to_sql('tick_data', engine.raw_connection(), flavor='mysql', if_exists='append')
Otherwise, update pandas and use the engine as you intend to. Note that you don't need to use the flavor parameter when using SQLAlchemy:
df.to_sql('tick_data', engine, if_exists='append')
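For reference, the modern call path can be sketched like this (against an in-memory SQLite engine rather than MySQL, so it runs anywhere): pass the SQLAlchemy engine directly and omit the removed flavor parameter.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite engine standing in for the MySQL one.
engine = create_engine("sqlite://")

# Stand-in tick data; the real frame comes from tushare in the question.
df = pd.DataFrame({"price": [10.5, 10.6], "volume": [100, 250]})

# Modern pandas accepts the engine directly; no flavor parameter needed.
df.to_sql("tick_data", engine, if_exists="append", index=False)

# Reading back confirms the rows landed.
out = pd.read_sql("SELECT * FROM tick_data", engine)
```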
