How do I enable local infile server side? - python

Firstly, I am new to Stack Overflow so appreciate any suggestions on how to improve my question asking in addition to any potential solutions.
I am using MySQL and Python (with the PyCharm IDE) and I need to load data from a CSV file into a table in a database I already created using MySQL Workbench (I am not using the Workbench import tool because it's too slow, and I will need to do this from PyCharm in the future anyway).
I can connect to my database successfully from PyCharm and view the tables etc. I am then using the following to try and load the data from my CSV file -
mycursor = db.cursor()
query = "LOAD DATA LOCAL INFILE 'file path' INTO TABLE scores FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (ID, Date, Score)"
mycursor.execute(query)
I received an error when I did this and saw from another question on Stack Overflow that I needed to enable the 'local_infile' setting, so I did that (I can see it's turned on).
However, I am still getting the error below when I try to import my data. From reading on Stack Overflow, I believe this is because I also need to enable 'local_infile' on the server side, but I am unsure how to do this.
The error I am now getting when I try to load my data is as follows -
mysql.connector.errors.ProgrammingError:
3948 (42000): Loading local data is disabled; this must be enabled on both the client and server sides
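From what I've read, I think the fix is something along the lines of the sketch below (using mysql.connector; the host, credentials and database are placeholders, and I'm assuming the account has the privilege to change global variables) - but I'm not sure the server-side part is right:
import mysql.connector

db = mysql.connector.connect(
    host="localhost", user="root", password="password", database="mydb",
    allow_local_infile=True  # client side
)
mycursor = db.cursor()

# Server side: needs a privileged account; this setting is lost on restart
# unless local_infile=1 is also set in the server's configuration file.
mycursor.execute("SET GLOBAL local_infile = 1")

query = "LOAD DATA LOCAL INFILE 'file path' INTO TABLE scores FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' (ID, Date, Score)"
mycursor.execute(query)
db.commit()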

Related

Pyspark: Incremental load, How to overwrite/update Hive table where data is being read

I'm currently writing a script for a daily incremental ETL. I used an initial load script to load base data to a Hive table. Thereafter, I created a daily incremental script that reads from the same table and uses that same data.
Initially, I tried to "APPEND" the new data with the daily incremental script, however that seemed to create duplicate rows. So, now I'm attempting to "OVERWRITE" the Hive table instead, which produces the exception below.
I noticed others with a similar issue, who want to read and overwrite the same table, have tried to "refreshTable" before overwriting... I tried this solution as well, but I'm still receiving the same error.
Maybe I should refresh the table path as well?
-Thanks
The Error:
Caused by: java.io.FileNotFoundException: File does not exist: hdfs://essnp/sync/dev_42124_edw_b/EDW/SALES_MKTG360/RZ/FS_FLEET_ACCOUNT_LRF/Data/part-00000-4db6432b-f59c-4112-83c2-672140348454-c000.snappy.parquet
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
End of my code:
###### Loading the LRF table ######
spark.catalog.refreshTable(TABLE_NAME)
hive.write_sdf_as_parquet(spark,final_df_converted,TABLE_NAME,TABLE_PATH,mode='overwrite')
print("LOAD COMPLETED " + str(datetime.now()))
####### Ending SparkSession #######
spark.sparkContext.stop()
spark.stop()
It's not a good practice to read from and write to the same path. Because of Spark's DAG lineage and lazy evaluation, the read and the write can sometimes happen at the same time, so this error is expected.
It is better to read from one location and write to another location, as in the sketch below.
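Something along these lines (a rough, untested sketch; the table name and paths are placeholders and the daily incremental logic is assumed):
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

TABLE_NAME = "edw.fs_fleet_account_lrf"                 # placeholder
TABLE_PATH = "hdfs:///data/fs_fleet_account_lrf"        # placeholder
STAGING_PATH = "hdfs:///data/fs_fleet_account_lrf_stg"  # a different location

# Build the daily result (your incremental transformation goes here).
final_df = spark.table(TABLE_NAME)

# 1. Materialize the result to a different location first, so Spark is not
#    deleting the parquet files it is still reading from.
final_df.write.mode("overwrite").parquet(STAGING_PATH)

# 2. Overwrite the target from the staged copy and refresh the metadata.
spark.read.parquet(STAGING_PATH).write.mode("overwrite").parquet(TABLE_PATH)
spark.catalog.refreshTable(TABLE_NAME)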

Can read but not write to as/400 database with python

I have a problem setting which involves data manipulation on an IBM i (AS/400) database. I'm trying to solve the problem with the help of Python and pandas.
For the last few days I have been trying to set up a proper connection to the AS/400 DB with every combination of package, driver and dialect I could find on SO or Google.
None of the solutions works fully for me. Some work better than others, while some do not work at all.
Here's the current situation:
I'm able to read and write data through pyodbc. The connection string I'm using is the following:
cstring = urllib.parse.quote("DRIVER={IBM i Access ODBC Driver};SYSTEM=IP;UID=XXX;PWD=YYY;PORT=21;CommitMode=0;SIGNON=4;CCSID=1208;TRANSLATE=1;")
Then I establish the connection like so:
connection = pypyodbc.connect(cstring)
With connection I can read and write data from/to the as400 db through raw SQL statements:
connection.execute("""CREATE TABLE WWNMOD5.temp(
store_id INT GENERATED BY DEFAULT AS IDENTITY NOT NULL,
store_name VARCHAR(150),
PRIMARY KEY (store_id)
)""")
This is, of course, a meaningless example. My goal would be to write a pandas DataFrame to the as400 by using
df.to_sql()
But when trying to do something like this:
df.to_sql('temp', connection, schema='WWNMOD5', chunksize=1000, if_exists='append', index=False)
I get this error:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ('42000', '[42000] [IBM][System i Access ODBC-Treiber][DB2 für i5/OS]SQL0104 - Token ; ungültig. Gültige Token: <ENDE DER ANWEISUNG>.')
Meaning that an invalid token was used, which in this case I believe is the ';' at the end of the SQL statement.
I believe that pandas isn't compatible with the pyodbc package. Therefore, I also tried to work with the DB via SQLAlchemy.
With sqlalchemy, I establish the connection like so:
engine= sa.create_engine("iaccess+pyodbc:///?odbc_connect=%s"%cstring)
I also tried to use ibm_db_sa instead of iaccess but the result is always the same.
If I do the same as above with sqlalchemy, that is:
df.to_sql('temp', engine, schema='WWNMOD5', chunksize=1000, if_exists='append', index=False)
I don't get any error message but the table is not created either and I don't know why.
Is there a way to get this working? All the SO threads only suggest solutions for establishing a connection and reading data from AS/400 databases, but don't cover writing data back to the AS/400 DB via Python.
It looks like you are using the wrong driver. Pandas claims support for any DB supported by SQLAlchemy. But in order for SQLAlchemy to use DB2, it needs a third party extension.
This is the one recommended by SQLAlchemy: https://pypi.org/project/ibm-db-sa/
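For example, something like this should work (a sketch, not tested here; host, port, credentials and the target schema/table are placeholders, and it assumes ibm-db-sa and its ibm_db driver are installed):
import pandas as pd
import sqlalchemy as sa

# ibm-db-sa URL form: ibm_db_sa://user:password@host:port/database
engine = sa.create_engine("ibm_db_sa://XXX:YYY@IP:446/DBNAME")

df = pd.DataFrame({"store_id": [1], "store_name": ["Example store"]})
df.to_sql("temp", engine, schema="WWNMOD5",
          chunksize=1000, if_exists="append", index=False)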

How to write a pandas DataFrame directly into a Netezza Database?

I have a pandas DataFrame in Python and want this DataFrame to be written directly into a Netezza database.
I would like to use the pandas.to_sql() method that is described here, but it seems that this method requires one to use SQLAlchemy to connect to the database.
The Problem: SQLAlchemy does not support Netezza.
What I am using at the moment to connect to the database is pyodbc. But this, on the other hand, is not understood by pandas.to_sql(), or am I wrong about this?
My workaround to this is to write the DataFrame into a csv file via pandas.to_csv() and send this to the Netezza Database via pyodbc.
Since I have big data, writing the csv first is a performance issue. I actually do not care if I have to use SQLAlchemy or pyodbc or something different but I cannot change the fact that I have a Netezza Database.
I am aware of the deontologician project, but as the author himself states, it "is far from complete, has a lot of bugs".
I got the package to work (see my solution below). But if someone knows a better solution, please let me know!
I figured it out. For my solution see accepted answer.
Solution
I found a solution that I want to share for everyone with the same problem.
I tried the netezza dialect from deontologician but it does not work with Python 3, so I made a fork and corrected some encoding issues. I uploaded it to GitHub and it is available here. Be aware that I just made some small changes, that it is mostly the work of deontologician, and that nobody is maintaining it.
Having the netezza dialect, I got pandas.to_sql() to work directly with the Netezza database:
import netezza_dialect
from sqlalchemy import create_engine

engine = create_engine("netezza://ODBCDataSourceName")

df.to_sql("YourDatabase",
          engine,
          if_exists='append',
          index=False,
          dtype=your_dtypes,
          chunksize=1600,
          method='multi')
A little explanation of the to_sql() parameters:
It is essential that you use the method='multi' parameter if you do not want pandas to take forever to write to the database, because without it pandas sends one INSERT query per row. You can use 'multi' or you can define your own insertion method. Be aware that you need at least pandas v0.24.0 to use it. See the docs for more info.
When using method='multi' it can happen (it happened at least to me) that you exceed the parameter limit. In my case it was 1600, so I had to add chunksize=1600 to avoid this.
Note
If you get a warning or error like the following:
C:\Users\USER\anaconda3\envs\myenv\lib\site-packages\sqlalchemy\connectors\pyodbc.py:79: SAWarning: No driver name specified; this is expected by PyODBC when using DSN-less connections
"No driver name specified; "
pyodbc.InterfaceError: ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
Then you probably tried to connect to the database via
engine = create_engine("netezza://usr:pass@address:port/database_name")
You have to set up the database in the ODBC Data Source Administrator tool from Windows and then use the name you defined there.
engine = create_engine("netezza://ODBCDataSourceName")
Then it should have no problems finding the driver.
I know you already answered the question yourself (thanks for sharing the solution).
One general comment about large data-writes to Netezza:
I’d always choose to write data to a file and then use the external table/ODBC interface to insert the data. Instead of inserting 1600 rows at a time, you can probably insert millions of rows in the same timeframe.
We use UTF-8 data in the flat file, as CSV, unless you want to load binary data, which will probably require fixed-width files.
I'm not Python savvy, but I hope you can follow me...
If you need a documentation link, you can start here: https://www.ibm.com/support/knowledgecenter/en/SSULQD_7.2.1/com.ibm.nz.load.doc/c_load_create_external_tbl_syntax.html
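As a rough sketch of that approach in Python (the table names, paths and the exact external-table options are assumptions - check the syntax link above for what your version supports):
import pandas as pd
import pyodbc

df = pd.DataFrame({"store_id": [1], "store_name": ["Example store"]})

# 1. Dump the DataFrame to a UTF-8, comma-delimited flat file without a header.
df.to_csv("C:/temp/data.csv", index=False, header=False, encoding="utf-8")

# 2. Load it in one statement through a transient external table.
conn = pyodbc.connect("DSN=ODBCDataSourceName", autocommit=True)
conn.execute("""
    INSERT INTO YOUR_SCHEMA.YOUR_TABLE
    SELECT * FROM EXTERNAL 'C:/temp/data.csv'
    SAMEAS YOUR_SCHEMA.YOUR_TABLE
    USING (DELIMITER ',' REMOTESOURCE 'ODBC' ENCODING 'internal')
""")
conn.close()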

Clob reading in python using cx_oracle - not working

I am trying to fetch CLOB data from an Oracle server, and the connection is made through an SSH tunnel.
When I tried to run the following code:
(id,clob) = cursor.fetchone()
print('one fetched')
clob_data = clob.read()
print(clob_data)
the execution freezes.
Can someone help me with what's wrong here? I have referred to the cx_Oracle docs and my code is just the same as the example.
It is possible that there is a round trip taking place that is not being handled properly by the cx_Oracle driver. Please create an issue here (https://github.com/oracle/python-cx_Oracle/issues) with a few more details such as platform, Python version, Oracle database/client version, etc.
You can probably work around the issue, however, by simply returning the CLOBs as strings as can be seen in this sample: https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py.
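For example, adapted from that sample (connection details and the query are placeholders):
import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Fetch CLOB columns as plain strings instead of LOB locators, which
    # avoids the extra round trips made by clob.read().
    if default_type == cx_Oracle.CLOB:
        return cursor.var(cx_Oracle.LONG_STRING, arraysize=cursor.arraysize)

connection = cx_Oracle.connect("user", "password", "localhost:1521/service")
connection.outputtypehandler = output_type_handler

cursor = connection.cursor()
cursor.execute("SELECT id, clob_col FROM my_table")  # placeholder query
row_id, clob_data = cursor.fetchone()
print(clob_data)  # already a str, no clob.read() needed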

Data loaded to MySQL via python disappears

I've looked around to see whether anyone has had this problem, but it looks like no one has! Basically, my problem is as follows:
I try loading data into a MySQL DB using the MySQLdb library for Python.
I seem to succeed, since I'm able to retrieve the items I loaded within the same Python instance.
Once the Python code is run and closed, when I try to retrieve the data either by running a query in MySQL Workbench or by running Python code in the command prompt, I cannot retrieve the data.
So in summary, I do load the data in, but the moment I close the Python instance, the data seems to disappear.
To try to isolate the problem, I placed a time.sleep(60) line into my code, so that once the Python code loads the data, I can go and try retrieving the data from MySQL Workbench using queries, but I still can't.
I thought perhaps I'm saving the data into different instances, but I checked things like the port etc. and they are identical!
I've spent 4-5 hours trying to figure this out, but I'm starting to lose hope. Help much appreciated. Please find my code below:
import time
import MySQLdb

db = MySQLdb.connect("localhost", "root", "password", "mydb")
cursor = db.cursor()
cursor.execute("SELECT VERSION()")
data = cursor.fetchone()
print data
cursor.execute("LOAD DATA LOCAL INFILE "+ "filepath/file.txt" +" INTO TABLE addata FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'")
data = cursor.fetchall()
print data ###At this point data displays warnings etc
cursor.execute("select * from addata")
data = cursor.fetchmany(10)
print data ###Here I can see that the data is loaded
time.sleep(60) ##Here while the code is sleeping I go to mysql workbench and try the query "select * from addata".. It returns nothing:(
You almost certainly need to commit the data after you have loaded it.
If your program exits without committing the data, the DB will roll back your transaction, on the assumption that something has gone wrong.
You may be able to set autocommit as part of your connection request; otherwise you should call commit() on your connection object.
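A minimal sketch of both options with MySQLdb (credentials and the file path are placeholders):
import MySQLdb

db = MySQLdb.connect("localhost", "root", "password", "mydb")
# db.autocommit(True)  # option 1: commit every statement automatically

cursor = db.cursor()
cursor.execute("LOAD DATA LOCAL INFILE 'filepath/file.txt' INTO TABLE addata FIELDS TERMINATED BY ';' LINES TERMINATED BY '\r\n'")
db.commit()  # option 2: commit explicitly before the program exits
db.close()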
