Python script hangs when executing long running query, even after query completes - python

I've got a Python script that loops through folders and, within each folder, executes the SQL files against our Redshift cluster (using psycopg2). Here is the code that does the loop (note: this works just fine for queries that take only a few minutes to execute):
for folder in dir_list:
    # Each query is stored in a folder by group, so we have to go through each folder and then each file in that folder
    file_list = os.listdir(source_dir_wkly + "\\" + str(folder))
    for f in file_list:
        src_filename = source_dir_wkly + "\\" + str(folder) + "\\" + str(f)
        dest_filename = dest_dir_wkly + "\\" + os.path.splitext(os.path.basename(src_filename))[0] + ".csv"
        result = dal.execute_query(src_filename)
        result.to_csv(path_or_buf=dest_filename, index=False)
execute_query is a method stored in another file:
def execute_query(self, source_path):
    conn_rs = psycopg2.connect(self.conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    sql_file = self.read_sql_file(source_path)
    cursor.execute(sql_file)
    records = cursor.fetchall()
    conn_rs.commit()
    return pd.DataFrame(data=records)

def read_sql_file(self, path):
    sql_path = path
    f = open(sql_path, 'r')
    return f.read()
I have a couple of queries that take around 15 minutes to execute (not unusual given the size of the data in our Redshift cluster), and they run just fine in SQL Workbench. I can see in the AWS Console that the query has completed, but the script just hangs: it doesn't dump the results to a CSV file, nor does it proceed to the next file in the folder.
I don't have any timeouts specified. Is there anything else I'm missing?

The line records = cursor.fetchall() is likely the culprit: it loads every row returned by the query into memory at once. Given how large your queries are, that data probably cannot all fit in memory.
You should iterate over the results from the cursor and write them to your CSV one by one. In general, trying to read all the data from a database query at once is not a good idea.
You will need to refactor your code to do so:
for record in cursor:
    csv_fh.write(record)
Where csv_fh is a file handle for your CSV file. Your use of pd.DataFrame will also need rewriting, as it expects all of the data to be passed to it at once.
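As a rough sketch of that refactor (not the poster's actual code), you can stream rows straight into a CSV with the csv module. The write_query_to_csv name is made up, and the named (server-side) cursor is an optional assumption; drop name=... to keep an ordinary client-side cursor.

import csv

import psycopg2
import psycopg2.extras

def write_query_to_csv(conn_string, sql_text, dest_filename):
    # Sketch only: write rows to disk as they arrive instead of building a DataFrame.
    conn_rs = psycopg2.connect(conn_string)
    # A named cursor keeps the result set on the server and fetches it in batches.
    cursor = conn_rs.cursor(name='stream_cursor',
                            cursor_factory=psycopg2.extras.RealDictCursor)
    cursor.execute(sql_text)
    writer = None
    with open(dest_filename, 'w', newline='') as csv_fh:
        for record in cursor:
            if writer is None:
                writer = csv.DictWriter(csv_fh, fieldnames=list(record.keys()))
                writer.writeheader()
            writer.writerow(record)
    cursor.close()
    conn_rs.close()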

Related

Psycopg2 copy_expert returns different results than direct query

I am trying to debug this issue that I have been having for a couple of weeks now. I am trying to copy the result of a query in a PostgreSQL db into a csv file using psycopg2 and copy_expert; however, when my script finishes running, I sometimes end up with fewer rows than if I ran the query directly against the db using pgAdmin. This is the code that runs the query and saves it into a csv:
cursor = pqlconn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
fd = open("query.sql", 'r')
sql_query = fd.read()
fd.close()
csv_path = 'test.csv'
query = "copy (" + sql_query + \
    ") TO STDOUT WITH (FORMAT csv, DELIMITER ',', HEADER)"
with open(csv_path, 'w', encoding='utf-8') as f_output:
    cursor.copy_expert(query, f_output)
print("Saved information to csv: ", csv_path)
When it runs, I will sometimes end up with fewer rows than if I ran the query directly on the db, and running the script again still returns fewer rows than what I can already see in the db directly. Would appreciate any guidance on this, thanks!
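One way to narrow down where the rows go missing (a hedged diagnostic sketch, not part of the original post) is to count the rows the query returns in the same session as the COPY and compare that with the number of data lines actually written to the file:

import csv

# Diagnostic sketch: COUNT(*) over the same query, in the same session as the COPY.
cursor.execute("SELECT count(*) AS n FROM (" + sql_query.rstrip("; \n") + ") AS q")
expected = cursor.fetchone()["n"]

with open(csv_path, 'w', encoding='utf-8') as f_output:
    cursor.copy_expert(query, f_output)

with open(csv_path, encoding='utf-8') as f:
    written = sum(1 for _ in csv.reader(f)) - 1   # subtract the HEADER row

print("Rows reported by the query:", expected, "rows written to csv:", written)

If the two numbers match, the export itself is consistent and the difference is about when the rows become visible to your session rather than about copy_expert.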

Is there a way I can use multi-threading or multi-processing in python to connect to 200 different servers and download data from them

I am writing a script in Python to download data from around 200 different servers using multi-threading. My objective is to fetch data from a table in each server's database and save it into a CSV file. All the servers have the same database and table.
The code I have written is:
import concurrent.futures
import sqlalchemy as db
import urllib
import pandas as pd

def write_to_database(line):
    try:
        server = line[0]
        filename = line[1]
        file_format = ".csv"
        file = filename + file_format
        print(file)
        params = urllib.parse.quote_plus(
            "DRIVER={SQL Server};SERVER=" + server + ";DATABASE=Database_name;UID=xxxxxxx;PWD=xxxxxxx")
        engine = db.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        sql_DF = pd.read_sql("SELECT * FROM table_name",
                             con=engine, chunksize=50000)
        sql_DF.to_csv()
    except Exception as e:
        print(e)

def read_server_names():
    print("Reading Server Names")
    f = open("servers_data.txt", "r")
    contents = f.readlines()
    for line in contents:
        list.append(line.split(','))

def main():
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for line in zip(list, executor.map(write_to_database, list)):
            print()

if __name__ == '__main__':
    list = []
    read_server_names()
    main()
The problem with this code is that the process takes a lot of system memory. Can I get some guidance on a better way to do this task, using either multi-threading or multi-processing, that gives good performance while using fewer CPU resources?
I'd suggest using multiprocessing. I've also slightly refactored your reading code to avoid using a global variable.
The write function now prints three status messages: one when it begins reading from a given server, another when it finishes reading (into memory!), and another when it has finished writing to a file.
Concurrency is limited to 10 tasks, and each worker process is recycled after 100 tasks. You may want to change those parameters.
imap_unordered is used for slightly faster performance, since the order of tasks doesn't matter here.
If this is still too resource intensive, you will need to do something other than naively using Pandas; for instance, use SQLAlchemy to run the same query and write to the CSV file one row at a time (a rough sketch of that follows the code below).
import multiprocessing
import sqlalchemy as db
import urllib
import pandas as pd

file_format = ".csv"

def write_to_database(line):
    try:
        server, filename = line
        file = filename + file_format
        params = urllib.parse.quote_plus(
            "DRIVER={SQL Server};"
            "SERVER=" + server + ";"
            "DATABASE=Database_name;"
            "UID=xxxxxxx;"
            "PWD=xxxxxxx"
        )
        print(server, "start")
        engine = db.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
        sql_DF = pd.read_sql("SELECT * FROM table_name", con=engine)
        print(server, "read", len(sql_DF))
        sql_DF.to_csv(file, index=False)
        print(server, "write", file)
    except Exception as e:
        print(e)

def read_server_names():
    with open("servers_data.txt", "r") as f:
        for line in f:
            # Will break (on purpose) if there are more than 2 fields
            server, filename = line.strip().split(",")
            yield (server, filename)

def main():
    server_names = list(read_server_names())
    # 10 requests (subprocesses) at a time, recycle each worker after 100 tasks
    with multiprocessing.Pool(processes=10, maxtasksperchild=100) as p:
        for i, result in enumerate(p.imap_unordered(write_to_database, server_names), 1):
            # the function handles its own output; just report progress
            print("Progress:", i, "/", len(server_names))

if __name__ == "__main__":
    main()
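If you end up needing that leaner SQLAlchemy-only route, a rough sketch might look like the following; the dump_table_to_csv name and the stream_results option are assumptions (whether rows truly stream depends on the driver), and it would replace the Pandas calls inside write_to_database.

import csv

import sqlalchemy as db

def dump_table_to_csv(engine, file):
    # Hedged sketch: write rows one at a time instead of building a whole DataFrame.
    with engine.connect() as conn:
        result = conn.execution_options(stream_results=True).execute(
            db.text("SELECT * FROM table_name"))
        with open(file, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(result.keys())   # header row
            for row in result:
                writer.writerow(row)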

Efficient importing CSVs into Oracle Table (Python)

I am using Python 3.6 to iterate through a folder structure and return the file paths of all these CSVs I want to import into two already created Oracle tables.
con = cx_Oracle.connect('BLAH/BLAH#XXX:666/BLAH')

# Targets the exact filepaths of the CSVs we want to import into the Oracle database
if os.access(base_cust_path, os.W_OK):
    for path, dirs, files in os.walk(base_cust_path):
        if "Daily" not in path and "Daily" not in dirs and "Jul" not in path and "2017-07" not in path:
            for f in files:
                if "OUTPUT" in f and "MERGE" not in f and "DD" not in f:
                    print("Import to OUTPUT table: " + path + "/" + f)
                    # Run function to import to SQL Table 1
                if "MERGE" in f and "OUTPUT" not in f and "DD" not in f:
                    print("Import to MERGE table: " + path + "/" + f)
                    # Run function to import to SQL Table 2
A while ago I was able to use PHP to produce a function that used the BULK INSERT SQL command for SQL Server:
function bulkInserttoDB($csvPath){
    $tablename = "[DATABASE].[dbo].[TABLE]";
    $insert = "BULK
               INSERT ".$tablename."
               FROM '".$csvPath."'
               WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n')";
    print_r($insert);
    print_r("<br>");
    $result = odbc_prepare($GLOBALS['connection'], $insert);
    odbc_execute($result) or die(odbc_error($connection));
}
I was looking to replicate this for Python, but a few Google searches led me to believe there is no 'BULK INSERT' command for Oracle. The BULK INSERT command had awesome performance.
Since these CSVs I am loading are huge (2GB x 365), performance is crucial. What is the most efficient way of doing this?
A bulk insert can be done with the cx_Oracle library and the following commands:
import cx_Oracle

con = cx_Oracle.connect(CONNECTION_STRING)
cur = con.cursor()

# prepare your statement once
cur.prepare("""INSERT INTO MyTable VALUES (
    to_date(:1, 'YYYY/MM/DD HH24:MI:SS'),
    :2,
    :3,
    to_date(:4, 'YYYY/MM/DD HH24:MI:SS'),
    :5,
    :6,
    to_date(:7, 'YYYY/MM/DD HH24:MI:SS'),
    :8,
    to_date(:9, 'YYYY/MM/DD HH24:MI:SS'))""")

# prepare your data: one tuple per line of the CSV file
rows = []
with open(csv_path) as f:
    for line in f:
        sline = line.rstrip("\n").split(",")
        rows.append((sline[0], sline[1], sline[2], sline[3], sline[4],
                     sline[5], sline[6], sline[7], sline[8]))

cur.executemany(None, rows)  # insert: None reuses the prepared statement
con.commit()
You prepare an insert statement once, build the list of rows from your file, and finally call executemany, which sends the whole batch to the database in a single round trip.
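Given the file sizes in the question (2 GB each), building one giant list may itself exhaust memory; a hedged variation is to reuse the prepared statement and send the rows in fixed-size batches (the insert_csv_in_batches helper and the batch size are made up for illustration):

BATCH_SIZE = 50000

def insert_csv_in_batches(cur, con, csv_path):
    # Hypothetical helper: assumes cur.prepare(...) has already been called as above,
    # so executemany(None, ...) reuses the prepared INSERT.
    batch = []
    with open(csv_path) as f:
        for line in f:
            sline = line.rstrip("\n").split(",")
            batch.append(tuple(sline[0:9]))
            if len(batch) >= BATCH_SIZE:
                cur.executemany(None, batch)
                con.commit()
                batch = []
    if batch:  # flush the final partial batch
        cur.executemany(None, batch)
        con.commit()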

How to get data from s3 and do some work on it? python and boto

I have a project task to use some output data I have already produced on S3 in an EMR task. Previously I ran an EMR job that produced some output in one of my S3 buckets, in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
1. Download the file onto your local system and parse it (kinda simple, quick and easy).
2. Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, filenames are stored as keys: if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = '/tmp/' + fileName  # local path to download to; adjust as needed

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')

source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if name.name in fileName:
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may come across memory errors; to avoid them you will have to write a lazy function which reads the data in chunks.
For example, you can use a Range header to fetch part of the file, as in data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
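Putting that together, here is a rough sketch of such a lazy reader built on the Range header shown above (the iter_key_chunks helper and the chunk size are made up for illustration; key.size is the object size reported by the listing):

CHUNK_SIZE = 100 * 1024 * 1024   # 100 MB per request; tune as needed

def iter_key_chunks(key, chunk_size=CHUNK_SIZE):
    # Hypothetical helper: yield the object piece by piece instead of all at once.
    start = 0
    while start < key.size:
        end = min(start + chunk_size, key.size) - 1
        yield key.get_contents_as_string(
            headers={'Range': 'bytes=%s-%s' % (start, end)})
        start = end + 1

# usage: stream a listed key ("name" in the loop above) into a local file
with open(tempPath, 'wb') as out:
    for chunk in iter_key_chunks(name):
        out.write(chunk)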
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any query you have.

Does GAE GCS write have asynchronous version like the NDB functions

Does the GAE GCS write have an asynchronous version like the NDB functions (e.g. put_async)?
I found out (from Appstats) that since I am uploading multiple files sequentially in my code, it is consuming a lot of time (all of the writes being sequential). I am trying to reduce this time, since with six files it is already ~7 seconds and I expect to have more files.
I have put the following code snippet in a for loop which iterates over all the files selected by the user on the webpage:
gcs_file = gcs.open (filename, 'w', content_type = 'image/jpeg')
gcs_file.write (photo)
gcs_file.close ()
Please let me know if you need any more data
UPDATE
Original code:
photo_blobkey_list = []
video_blobkey_list = []

i = 0
for photo in photo_list:
    filename = bucket + "/user_pic_"+str (user_index) + "_" + str (i)
    # Store the file in GCS
    gcs_file = gcs.open (filename, 'w', content_type = 'image/jpeg')
    gcs_file.write (photo)
    gcs_file.close ()
    # Store the GCS filename in Blobstore
    blobstore_filename = '/gs' + filename
    photo_blobkey = blobstore.create_gs_key (blobstore_filename)
    photo_blobkey_list.append (photo_blobkey)
    i = i + 1

i = 0
for video in video_list:
    filename = bucket + "/user_video_"+str (user_index) + "_" + str (i)
    filename = filename.replace (" ", "_")
    # Store the file in GCS
    gcs_file = gcs.open (filename, 'w', content_type = 'video/avi')
    gcs_file.write (video)
    gcs_file.close ()
    # Store the GCS filename in Blobstore
    blobstore_filename = '/gs' + filename
    video_blobkey = blobstore.create_gs_key (blobstore_filename)
    video_blobkey_list.append (video_blobkey)
    i = i + 1

::

user_record.put ()
My doubt:
Based on the suggestions, I plan to put the GCS write part in a tasklet subroutine which takes the filename and the photo/video as arguments. How do I use "yield" here so as to make the above code run in parallel as much as possible (all the photo and video write operations)? As per the docs, I need to provide all the arguments to yield so as to make the tasklets run in parallel, but in my case the number of photos and videos uploaded is variable.
Relevant official docs text:
If that were two separate yield statements, they would happen in
series. But yielding a tuple of tasklets is a parallel yield: the
tasklets can run in parallel and the yield waits for all of them to
finish and returns the results.
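On the variable-count doubt: a parallel yield does not need a fixed number of arguments; inside a tasklet you can yield a tuple of futures built from a comprehension, and it waits for all of them, however many there are. A minimal sketch with datastore futures (a GCS write would still need an async wrapper of its own, which the answer below is about):

from google.appengine.ext import ndb

@ndb.tasklet
def save_all_async(entities):
    # Build however many futures you need, then yield them all at once;
    # execution resumes only after every future has completed.
    futures = [entity.put_async() for entity in entities]
    results = yield tuple(futures)
    raise ndb.Return(results)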
You can do the write operation in a task.
There are async versions of the URLFetch API.
You could use these to write asynchronously and directly to Cloud Storage using the Cloud Storage XML or JSON APIs.
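A rough sketch of that direction, assuming the GCS JSON API media-upload endpoint and the app's default service account token (the start_gcs_upload helper, scope, and deadline are illustrative assumptions, not a tested implementation):

from google.appengine.api import app_identity, urlfetch

def start_gcs_upload(bucket_name, object_name, data, content_type):
    # Hypothetical helper: kick off one asynchronous upload and return the RPC.
    token, _ = app_identity.get_access_token(
        'https://www.googleapis.com/auth/devstorage.read_write')
    url = ('https://www.googleapis.com/upload/storage/v1/b/%s/o'
           '?uploadType=media&name=%s' % (bucket_name, object_name))
    rpc = urlfetch.create_rpc(deadline=60)
    urlfetch.make_fetch_call(rpc, url, payload=data, method=urlfetch.POST,
                             headers={'Authorization': 'Bearer ' + token,
                                      'Content-Type': content_type})
    return rpc

# start all uploads first, then wait for every one of them
rpcs = [start_gcs_upload(my_bucket, 'user_pic_%d' % i, photo, 'image/jpeg')
        for i, photo in enumerate(photo_list)]
results = [rpc.get_result() for rpc in rpcs]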
