Background
I have three separate Python scripts that share the same structure and effectively do the same thing: call an API, retrieve XML data, convert the XML to an ElementTree object, then to a pandas DataFrame, and finally use .to_sql() to import that DataFrame into an Oracle database. This was successful for two of the three scripts I have written, but the third is not writing to the DB: no errors are returned, the table is created empty, and the script hangs.
Code from successful files:
oracle_db = sa.create_engine('oracle://sName:sName@123.456.78/testDB')
connection = oracle_db.connect()
df.to_sql('TABLE_NAME', connection, if_exists='append', index=False)
I would post the code for the unsuccessful file but it is quite literally the same besides the table and variable name.
What I have Tried
I have tried to use cx_Oracle to drive the connection to the DB, with no success:
conn = cx_Oracle.connect("sName", "sName","123.456.789.1/Test", encoding = "UTF-8")
I have verified the dataframe is valid.
I have verified the connection to the DB.
SOLVED - there was a column that was strictly integers, so I had to specify the data type in the .to_sql() call.
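For anyone who hits the same thing, here is a minimal sketch of the kind of call that worked for me; INT_COLUMN stands in for the all-integer column and is not my real column name:
import sqlalchemy as sa
import sqlalchemy.types as sa_types

# df is the DataFrame built from the XML response, as in the scripts above.
oracle_db = sa.create_engine('oracle://sName:sName@123.456.78/testDB')
connection = oracle_db.connect()
df.to_sql('TABLE_NAME', connection, if_exists='append', index=False,
          dtype={'INT_COLUMN': sa_types.Integer()})  # force the SQL type of the integer-only column
connection.close()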
Related
I am VERY new to Azure and Azure functions, so be gentle. :-)
I am trying to write an Azure timer function (using Python) that will take the results returned from an API call and insert the results into a table in Azure SQL.
I am virtually clueless. If someone would be willing to handhold me through the process, it would be MOST appreciated.
I have the API call already written, so that part is done. What I totally don't get is how to get the results from what is returned into Azure SQL.
The result set I am returning is in the form of a Pandas dataframe.
Again, any and all assistance would be AMAZING!
Thanks!!!!
Here is an example that writes a pandas DataFrame to a SQL table:
import pyodbc
import pandas as pd
# insert data from csv file into dataframe.
# working directory for csv file: type "pwd" in Azure Data Studio or Linux
# working directory in Windows c:\users\username
df = pd.read_csv("c:\\users\\username\\department.csv")
# Some other example server values are
# server = 'localhost\sqlexpress' # for a named instance
# server = 'myserver,port' # to specify an alternate port
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)", row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
To make it work for your case you need to:
Replace the read from the CSV file with your API function call (a sketch follows the link below).
Change the INSERT statement to match the structure of your SQL table.
For more details see: https://learn.microsoft.com/en-us/sql/machine-learning/data-exploration/python-dataframe-sql-server?view=sql-server-ver15
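For instance, here is a rough sketch of the same pattern when the DataFrame comes from an API call instead of a CSV file. your_api_call(), the table name dbo.MyApiResults and the column names are placeholders for your own function and schema; fast_executemany just batches the row inserts:
import pyodbc
import pandas as pd

df = your_api_call()  # placeholder for your existing function that returns a pandas DataFrame

server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+password)
cursor = cnxn.cursor()
cursor.fast_executemany = True  # speeds up inserts of many rows

# One parameter tuple per DataFrame row, inserted in a single batch.
rows = list(df.itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO dbo.MyApiResults (ColumnA, ColumnB, ColumnC) VALUES (?, ?, ?)",
    rows)
cnxn.commit()
cursor.close()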
I have created a column family in my local Cassandra as below with cqlsh.
CREATE TABLE sample.stackoverflow_question12 (
id1 int,
class1 int,
name1 text,
PRIMARY KEY (id1)
)
I have a sample csv file with name "data.csv" and the data in the file is as below.
id1 | name1 | class1
1   | hello | 10
2   | world | 20
I used the Python code below to connect to the DB and load data from the CSV, using Anaconda (after installing the Cassandra driver with pip in Anaconda).
# Connecting to the local Cassandra server
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(["127.0.0.1"], auth_provider=auth_provider, protocol_version=4)
session = cluster.connect()
session.set_keyspace('sample')

# File loading
prepared = session.prepare('INSERT INTO stackoverflow_question12 (id1, class1, name1) VALUES (?, ?, ?)')
with open('D:/Cassandra/NoSQL/data.csv', 'r') as fares:
    for fare in fares:
        columns = fare.split(",")
        id1 = columns[0]
        class1 = columns[1]
        name1 = columns[2]
        session.execute(prepared, [id1, class1, name1])
# the with statement closes the file automatically
When I executed the above code, I got the error below.
Received an argument of invalid type for column "id1". Expected: <class 'cassandra.cqltypes.Int32Type'>, Got: <class 'str'>; (required argument is not an integer)
When I changed the data types to text and ran the above code, it loaded the data, but with the header fields included too.
Can anyone help me change my code to load the data without the header content? Or, if you have working code of your own, that would also be fine.
The reason for naming the columns id1 and class1 is that id and class are keywords and throw errors in the code when used inside the "fares" loop.
But in the real world the column names would simply be class and id. How do I run the code when columns like these come into the picture?
The other question I have in mind is that Cassandra stores the primary key first and then the remaining columns in ascending order. Can we load CSV columns whose order differs from the order in which Cassandra stores its columns?
Based on this, I need to build another solution.
You need to use types that match your schema - for integer columns you need to use int(columns[...]), because split generates strings. If you want to skip the header, you can do something like this:
with open('D:/Cassandra/NoSQL/data.csv', 'r') as fares:
    cnt = 0
    for fare in fares:
        cnt += 1
        if cnt == 1:   # skip the header line
            continue
        ...
Although it's better to use Python's built-in csv reader, which can be set up to skip the header automatically.
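For example, a minimal sketch reusing the file path and prepared statement from the question (assuming the file is really comma-separated and the column order matches the sample: id1, name1, class1):
import csv

# 'session' is the Cassandra session created as in the question.
prepared = session.prepare(
    'INSERT INTO stackoverflow_question12 (id1, class1, name1) VALUES (?, ?, ?)')

with open('D:/Cassandra/NoSQL/data.csv', newline='') as fares:
    reader = csv.reader(fares)
    next(reader)                   # skip the header row
    for columns in reader:
        id1 = int(columns[0])      # int columns must be converted from str
        name1 = columns[1]         # column order as in the sample file
        class1 = int(columns[2])
        session.execute(prepared, [id1, class1, name1])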
P.S. If you just want to load data from CSV, I recommend using external tools like DSBulk, which are flexible and heavily optimized for that task. See the following blog posts for examples:
https://www.datastax.com/blog/2019/03/datastax-bulk-loader-introduction-and-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-more-loading
https://www.datastax.com/blog/2019/04/datastax-bulk-loader-common-settings
https://www.datastax.com/blog/2019/06/datastax-bulk-loader-unloading
https://www.datastax.com/blog/2019/07/datastax-bulk-loader-counting
https://www.datastax.com/blog/2019/12/datastax-bulk-loader-examples-loading-other-locations
I am trying to use the BigQuery package to interact with Pandas DataFrames. In my scenario, I query a base table in BigQuery, use .to_dataframe(), then pass that to load_table_from_dataframe() to load it into a new table in BigQuery.
My original problem was that str(uuid.uuid4()) (for random IDs) was automatically being converted to bytes instead of a string, so I am forcing a schema instead of allowing it to auto-detect what to make.
Now, though, I passed a job_config dict that contained the schema, and I get this error:
File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/client.py", line 903, in load_table_from_dataframe
job_config.source_format = job.SourceFormat.PARQUET
AttributeError: 'dict' object has no attribute 'source_format'
I already had PyArrow installed and also tried installing FastParquet, but it didn't help, and this didn't happen before I tried to force a schema.
Any ideas?
https://google-cloud-python.readthedocs.io/en/latest/bigquery/usage.html#using-bigquery-with-pandas
https://google-cloud-python.readthedocs.io/en/latest/_modules/google/cloud/bigquery/client.html#Client.load_table_from_dataframe
Looking into the actual package, it seems that it forces Parquet format, but like I said, I had no issue before, only now that I'm trying to give a table schema.
EDIT: This only happens when I try to write to BigQuery.
Figured it out. After weeding through Google's documentation, I realized I had forgotten to put:
load_config = bigquery.LoadJobConfig()
load_config.schema = SCHEMA
Oops. I never created the config object from the BigQuery package and was passing a plain dict instead.
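For completeness, here is roughly what the working call looks like; the dataset, table names and SCHEMA below are placeholders rather than my real ones:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder schema - the real one matches the base table.
SCHEMA = [
    bigquery.SchemaField('id', 'STRING'),
    bigquery.SchemaField('value', 'FLOAT'),
]

# Query the base table into a DataFrame, as in my scenario.
df = client.query('SELECT * FROM my_dataset.base_table').to_dataframe()

load_config = bigquery.LoadJobConfig()   # a LoadJobConfig object, not a plain dict
load_config.schema = SCHEMA

table_ref = client.dataset('my_dataset').table('my_new_table')
job = client.load_table_from_dataframe(df, table_ref, job_config=load_config)
job.result()                             # wait for the load job to finish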
I am very new to Python and am trying to move data from a MongoDB collection into a CSV document (needs to be done with Python, not mongoexport, if possible).
I am using the Pymongo and CSV packages to bring the data from the database, into the CSV.
This is the structure of the data I am querying from MongoDB:
Primary identifier - Computer Name (parent): R2D2
Details - Computer Details (parent): Operating System (Child), Owner (Child)
I need Operating System and Owner to have their own columns in the CSV file, but they keep falling under a single column called Computer Name.
Is there a way around this, so that the child objects can have their own columns instead of being grouped under their parent object?
I've done this kind of thing many times. You will have a loop iterating your mongo query with pymongo. At the bottom end of the loop you will use the csv module to write each line of the csv file. Your question relates to what goes on in the middle of the loop.
MongoDB syntax is closest to JavaScript, which also fits with the 'associative array' objects in Mongo being JSON. Pymongo brings these things closer to Python types.
So the parent objects come straight from the pymongo iterable and, if they have children, will be Python dictionaries:
import csv
from pymongo import MongoClient
client = MongoClient(ipaddress, port)
db = client.MyDatabase
collection = db.mycollection
f = open('output_file.csv', 'w', newline='')  # text mode for the csv module in Python 3
csvwriter = csv.writer(f)
for doc in collection.find():
    details = doc.get('Computer Details', {})
    operating_system = details.get('Operating System')
    owner = details.get('Owner')
    csvwriter.writerow([operating_system, owner])
f.close()
Of course there's probably a bunch of other columns in there without children that also come direct from the doc.
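If you also want a header row and columns that mix flat fields with children, a variant using csv.DictWriter could look like the sketch below; I'm assuming the children live under 'Computer Details' as described in the question, so adjust the field names to your actual documents:
import csv
from pymongo import MongoClient

client = MongoClient(ipaddress, port)  # same placeholder connection values as above
collection = client.MyDatabase.mycollection

with open('output_file.csv', 'w', newline='') as f:
    writer = csv.DictWriter(
        f, fieldnames=['Computer Name', 'Operating System', 'Owner'])
    writer.writeheader()
    for doc in collection.find():
        details = doc.get('Computer Details', {}) or {}
        writer.writerow({
            'Computer Name': doc.get('Computer Name'),       # flat field straight from the doc
            'Operating System': details.get('Operating System'),
            'Owner': details.get('Owner'),
        })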
I have a script built using procedural programming that uses a sqlite database file. The script processes a CSV file, then uses a standard cursor to pass its particulars to a single SQLite DB.
After this, the script extracts from the DB to produce a number of spreadsheets in Excel via xlwt.
The problem with this is that the script only handles one input file at a time, whereas I will need to be iterating through about 70-90 of these files on any given day.
I've been trying to rewrite the script as object-oriented, but I'm having trouble with sharing the cursor.
The original input file comes in a zip archive that I would extract via the Linux or Mac OS X command line. Previously this was done manually; now I've managed to write classes and loop through an arbitrary number of input files via the multiple-selection version of tkFileDialog.
Furthermore, the original input file (i.e. one of the 70-90) is a text file in CSV format with a DRF extension (obviously not really important) that gets picked by a simple tkFileDialog box:
FILENAMER = tkFileDialog.askopenfilename(
    title="Open file",
    filetypes=[("txt file", ".DRF"), ("txt file", ".txt"), ("All files", ".*")])
The DRF file itself is issued daily by location and date, i.e. 'BHP0123.DRF' is for the BHP location, issued for 23 January. To keep everything as straightforward as possible, the procedural script further decomposes the DRF name to just the BHP0123 part or prefix, then uses it to build a SQLite DB.
FBASENAME = os.path.basename(FILENAMER)
FBROOT = os.path.splitext(FBASENAME)[0]
OUTPUTDATABASE = 'sqlite_' + FBROOT + '.db'
Basically with the program as a procedural script I just had to create one DB, one connection and one cursor, which could be shared by all the functions in the script:
conn = sqlite3.connect(OUTPUTDATABASE) # <-- originally :: Is this the core problem?
curs = conn.cursor()
conn.text_factory = sqlite3.OptimizedUnicode
In the procedural version these variables above are global.
Procedurally I have
1) one function to handle formatting, and
2) another to handle the calculations needed. The DRF is indexed with about 2500 fields per row; I discard the majority and only use about 400-500 of these per row.
The formatting function parses out the CSV via a for-loop (discards junk characters, incomplete data, etc), then passes the formatted data for the calculator to process and chew on. The core problem seems to be that on the one hand I need the DB connection to be constant for each input DRF file, but on the other that connection can only be shared by the formatter and calculator, and 'regenerated' for each DRF.
Crucially, I've tried to rewrite as little of the formatter and calculator as possible.
I've tried to create a separate dbcxn class, then create an instance to share, but I'm confused as to how to handle the output DB situation with the cursor (and pass it intact to both formatter and calculator):
class DBcxn(object):

    def __init__(self, OUTPUTDATABASE):
        OUTPUTDATABASE = ?????
        self.OUTPUTDATABASE = OUTPUTDATABASE

    def init_db_cxn(self, OUTPUTDATABASE):
        conn = sqlite3.connect(OUTPUTDATABASE)  # < ????
        self.conn = conn
        curs = conn.cursor()
        self.curs = curs
        conn.text_factory = sqlite3.OptimizedUnicode

dbtest = DBcxn( ???? )
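For what it's worth, here is my best guess at how the blanks might be filled in; the names and the usage at the bottom are only assumptions on my part, and I'd still like to know whether this is a sensible structure:
import sqlite3

class DBcxn(object):
    """Guessed wrapper: one instance per DRF file, shared by formatter and calculator."""

    def __init__(self, output_database):
        self.output_database = output_database
        self.conn = sqlite3.connect(output_database)
        self.conn.text_factory = sqlite3.OptimizedUnicode
        self.curs = self.conn.cursor()

    def close(self):
        self.conn.commit()
        self.conn.close()

# Assumed usage, once per input DRF file:
# dbcxn = DBcxn('sqlite_' + FBROOT + '.db')
# formatter(dbcxn.curs, ...)   # formatter and calculator share the same cursor
# calculator(dbcxn.curs, ...)
# dbcxn.close()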
If anyone might suggest a way of untangling this I'd be very grateful. Please let me know if you need more information.
Cheers
Massimo Savino