Is it possible to concurrently export data? - python

I have a Python program which executes the steps below:

1. Look for the .sql files present in a particular folder
2. Create a list of all the .sql file names
3. Create a database connection
4. Loop over each file name in the list created in step 2:
   - Read the .sql file
   - Execute the query in the .sql file against the database
   - Export the data to a file
5. Repeat step 4 for all 15 files

This works fine and as expected. However, each file is exported serially (one after another). Is there any way I can start exporting all 15 files at the same time?

Yes, you can actually export all 15 files in parallel. Here is an example in which I call a request function 4 times with different parameters.
from concurrent.futures import ThreadPoolExecutor
import random, time
from bs4 import BeautifulSoup as bs
import requests

URL = 'http://quotesondesign.com/wp-json/posts'

def quote_stream():
    '''
    Quote streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')

def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]

if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
The point is to create a function that runs one file from start to finish (quote_stream). Then call that function with different files in different threads (multi_qouter). For a function that takes parameters like yours, you just submit them as [executor.submit(quote_stream, file) for file in files] and set max_workers=len(files), where files is the list of your SQL files to be passed to that function.
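Applied to your case, a minimal sketch might look like this (sqlite3, DB_PATH, and the .csv naming are placeholders of mine; the question doesn't say which database or export format you use):

import csv
import sqlite3  # stand-in database; swap in your own driver/connection
from concurrent.futures import ThreadPoolExecutor

DB_PATH = "example.db"  # placeholder

def export_one(sql_path):
    # Each thread opens its own connection: DB-API connections generally
    # should not be shared across threads.
    conn = sqlite3.connect(DB_PATH)
    try:
        with open(sql_path) as f:
            query = f.read()
        rows = conn.execute(query).fetchall()
        out_path = sql_path.replace(".sql", ".csv")
        with open(out_path, "w", newline="") as out:
            csv.writer(out).writerows(rows)
    finally:
        conn.close()

def export_all(sql_files):
    with ThreadPoolExecutor(max_workers=len(sql_files)) as executor:
        futures = [executor.submit(export_one, f) for f in sql_files]
        for future in futures:
            future.result()  # surface any exception raised in a worker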

Related

How to monitor a CSV file for changes?

I'm trying to monitor a CSV file that is being written to by a separate program. Around every 10 seconds, the CSV file is updated with a couple more lines. Each time the file is updated, I want to be able to detect that the file has been changed (it will always be the same file), take the new lines, and write them to the console (just for a test).
I have looked around the website and have found numerous ways of watching a file to see if it's updated (like this: http://thepythoncorner.com/dev/how-to-create-a-watchdog-in-python-to-look-for-filesystem-changes/), but I can't seem to find anything that will let me get at the changes made in the file and print them to the console.
Current code:
import time
from watchdog.observers import Observer
from watchdog.events import PatternMatchingEventHandler

def on_created(event):
    print(f"hey, {event.src_path} has been created!")

def on_deleted(event):
    print(f"Someone deleted {event.src_path}!")

def on_modified(event):
    print(f"{event.src_path} has been modified")

def on_moved(event):
    print(f"ok ok ok, someone moved {event.src_path} to {event.dest_path}")

if __name__ == "__main__":
    patterns = "*"
    ignore_patterns = ""
    ignore_directories = False
    case_sensitive = True
    my_event_handler = PatternMatchingEventHandler(patterns, ignore_patterns, ignore_directories, case_sensitive)

    my_event_handler.on_created = on_created
    my_event_handler.on_deleted = on_deleted
    my_event_handler.on_modified = on_modified
    my_event_handler.on_moved = on_moved

    path = "."
    go_recursively = True
    my_observer = Observer()
    my_observer.schedule(my_event_handler, path, recursive=go_recursively)

    my_observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        my_observer.stop()
        my_observer.join()
This runs, but looks for changes in files all over the place. How do I make it listen for changes from one single file?
If you're more or less happy with the script apart from it tracking a bunch of files, then you could change the patterns = "*" part, which is a wildcard matching string that tells the PatternMatchingEventHandler to look for any file. You could change that to patterns = 'my_file.csv' and also change the path variable to the directory the file is in, to save some time recursively scanning all the directories in '.'. Then you don't need recursive set to True for a single file either.
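For instance (a sketch only; my_file.csv and the directory path are placeholders), the setup block from the question could become:

patterns = ["*my_file.csv"]      # match only the one CSV file (placeholder name)
ignore_patterns = []
ignore_directories = True
case_sensitive = True
my_event_handler = PatternMatchingEventHandler(patterns, ignore_patterns,
                                               ignore_directories, case_sensitive)

path = "/path/to/csv_directory"  # placeholder: the folder the CSV lives in
go_recursively = False
my_observer = Observer()
my_observer.schedule(my_event_handler, path, recursive=go_recursively)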
Printing the new lines to the console (one option):
import pandas as pd
...
def on_modified(event):
    print(f"{event.src_path} has been modified")
    # You said "a couple more lines"; I'm going to take that as two:
    df = pd.read_csv(event.src_path)
    print("Newest 2 lines:")
    print(df[-2:])
If it's not two lines, you'll want to track the length of the file and pass that to the function which opens the CSV, so it knows how many lines are new.
I believe that since this is a CSV file, reading the file with pandas and checking the file size can help. You can use df.tail(2) to print the last two rows after reading the CSV with pandas.
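Building on the length-tracking idea above, here is a minimal sketch (the rows_seen counter and print_new_lines helper are my own, and it assumes the CSV has a header row that pandas consumes):

import pandas as pd

rows_seen = 0  # number of data rows already printed

def print_new_lines(csv_path):
    # Re-read the CSV and print only the rows added since the last call.
    global rows_seen
    df = pd.read_csv(csv_path)
    new_rows = df.iloc[rows_seen:]
    if not new_rows.empty:
        print(new_rows.to_string(index=False))
    rows_seen = len(df)

def on_modified(event):
    print(f"{event.src_path} has been modified")
    print_new_lines(event.src_path)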

Python APscheduler and returning filepath of most recent csv in directory

I am trying to get the code below to write files to the specified folder, with no luck. I think the error is with the imported 'glob' package/function, because similar code works for other files, but I'm not sure. Note also that I'm not getting any errors in the in-between 'do stuff' code, so I don't think that's an issue.
# Import stuff
import pandas as pd
import os

# Import apscheduler and related packages
import time
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.triggers.interval import IntervalTrigger

def process_ZN_ES_comb_LL_15M_csv(path_to_csv):
    # Open ZN_ES_comb and customize
    filename2 = max(glob.iglob("C:\Users\cost9\OneDrive\Documents\PYTHON\Daily Tasks\ZN_ES\ZN_ES_15M\CSV\Beta\*.csv"))
    ZN_ES_comb_LL_15M = pd.read_csv(filename2)

    # Do stuff, no errors given

    # Send to csv automatically
    ZN_ES_comb_LL_15M.to_csv(path_to_csv.replace('.csv', '_modified_{timestamp}.csv').format(
        timestamp=time.strftime("%Y%m%d-%H%M%S")), index=False)

if __name__ == '__main__':
    path_to_csv = "C:\Users\cost9\OneDrive\Documents\PYTHON\Daily Tasks\ZN_ES\ZN_ES_15M\CSV\Lead_Lag\ZN_ES_comb_LL_15M.csv"

    scheduler = BackgroundScheduler()
    scheduler.start()
    scheduler.add_job(func=process_ZN_ES_comb_LL_15M_csv,
                      args=[path_to_csv],
                      trigger=IntervalTrigger(seconds=60))

    # Wait for 7 seconds so that scheduler can call process_csv 3 times
    time.sleep(7)
Essentially I'm having apscheduler automatically write the file to the folder shown below, but nothing is showing up. Further, I have to identify a file using the 'glob' package from another folder in order to build on that file in the #do stuff lines. That's why I think there's some issue with the filename2 line, but I'm not sure. Any help is appreciated!
Try using double quotes " for your filename2 line. The thing that jumps out at me is the whitespace in the file path ("Daily Tasks"), and using double quotes can solve this issue.
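For illustration, a minimal sketch of that line with the import it needs (the path comes from the question; the doubled backslashes are my own addition so Python doesn't read sequences like \U as escape codes):

import glob

pattern = "C:\\Users\\cost9\\OneDrive\\Documents\\PYTHON\\Daily Tasks\\ZN_ES\\ZN_ES_15M\\CSV\\Beta\\*.csv"
filename2 = max(glob.iglob(pattern))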

Picking file path and file name with async file download

I am currently using this code (python 3.5.2):
from multiprocessing.dummy import Pool
from urllib.request import urlretrieve
urls = ["link"]
result = Pool(4).map(urlretrieve, urls)
print(result[0][0])
It works, but the file gets saved to a temp location with some weird name. Is there a way to pick a file path and possibly a file name, as well as add a file extension? It currently gets saved without one.
Thanks!
You simply need to supply a location to urlretrieve. However, pool.map doesn't appear to support multiple arguments for the function (Python multiprocessing pool.map for multiple arguments). So, you can refactor as described there, or use a different multiprocessing primitive, e.g. Process:
from multiprocessing import Process
from urllib.request import urlretrieve

urls = ["link", "otherlink"]
filenames = ["{}.html".format(i) for i in urls]
args = zip(urls, filenames)

processes = []
for arg in args:
    p = Process(target=urlretrieve, args=arg)
    p.start()
    processes.append(p)

for p in processes:
    p.join()
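Alternatively, if you want to keep the pool from the question, one refactor (a sketch of mine, with placeholder URLs and numbered output names) is to use starmap, which unpacks each (url, filename) pair into a call to urlretrieve:

from multiprocessing.dummy import Pool
from urllib.request import urlretrieve

urls = ["link", "otherlink"]                                 # placeholder URLs
filenames = ["{}.html".format(i) for i in range(len(urls))]  # simple numbered names

# starmap calls urlretrieve(url, filename) for each pair
with Pool(4) as pool:
    results = pool.starmap(urlretrieve, zip(urls, filenames))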
In the comments you say you only need to download 1 url. In that case it is very easy:
from urllib.request import urlretrieve
urlretrieve("https://yahoo.com", "where_to_save.html")
Then the file will be saved in where_to_save.html. You can of course provide a full path there, e.g. /where/exactly/to/save.html.

How to get data from s3 and do some work on it? python and boto

I have a project task to use some output data I have already produced on S3 in an EMR task. Previously I ran an EMR job that produced some output in one of my S3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read the contents of those files, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
1. Download the file to your local system and parse it (kinda simple, quick and easy).
2. Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, file names are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the below code to download the file into a temp folder.
# boto (v2) imports assumed for connect_to_region / Location
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'
tempPath = 'Demo_download.tmp'  # local path to save the download to (choose your own)

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False, host='s3-us-west-2.amazonaws.com')
source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:  # the key contains the file name
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may run into memory errors; to avoid them, you will have to write a lazy function which reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
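A lazy reader along those lines might look like this (a sketch only; iter_key_chunks and the chunk size are my own, and it assumes key.size is populated, as it is for keys returned by bucket.list()):

def iter_key_chunks(key, chunk_size=8 * 1024 * 1024):
    # Yield the key's contents lazily, one byte-range request at a time.
    start = 0
    while start < key.size:
        end = min(start + chunk_size, key.size) - 1
        yield key.get_contents_as_string(
            headers={'Range': 'bytes=%s-%s' % (start, end)})
        start = end + 1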
I am not sure if I answered your question properly; I can write custom code for your requirement once I get some time. Meanwhile, please feel free to post any query you have.

Python script hangs when executing long running query, even after query completes

I've got a Python script that loops through folders and within each folder, executes the sql file against our Redshift cluster (using psycopg2). Here is the code that does the loop (note: this works just fine for queries that take only a few minutes to execute):
for folder in dir_list:
    # Each query is stored in a folder by group, so we have to go through
    # each folder and then each file in that folder
    file_list = os.listdir(source_dir_wkly + "\\" + str(folder))
    for f in file_list:
        src_filename = source_dir_wkly + "\\" + str(folder) + "\\" + str(f)
        dest_filename = dest_dir_wkly + "\\" + os.path.splitext(os.path.basename(src_filename))[0] + ".csv"
        result = dal.execute_query(src_filename)
        result.to_csv(path_or_buf=dest_filename, index=False)
execute_query is a method stored in another file:
def execute_query(self, source_path):
    conn_rs = psycopg2.connect(self.conn_string)
    cursor = conn_rs.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    sql_file = self.read_sql_file(source_path)
    cursor.execute(sql_file)
    records = cursor.fetchall()
    conn_rs.commit()
    return pd.DataFrame(data=records)

def read_sql_file(self, path):
    sql_path = path
    f = open(sql_path, 'r')
    return f.read()
I have a couple of queries that take around 15 minutes to execute (not unusual given the size of the data in our Redshift cluster), and they execute just fine in SQL Workbench. I can see in the AWS Console that the query has completed, but the script just hangs and doesn't dump the results to a csv file, nor does it proceed to the next file in the folder.
I don't have any timeouts specified. Is there anything else I'm missing?
The line records = cursor.fetchall() is likely the culprit: it reads all the data and hence loads all the results from the query into memory. Given how large your queries are, that data probably cannot all be loaded into memory at once.
You should iterate over the results from the cursor and write them to your csv one by one. In general, trying to read all the data from a database query at once is not a good idea.
You will need to refactor your code to do so:
for record in cursor:
    csv_fh.write(record)
Where csv_fh is a file handle to your csv file. Your use of pd.DataFrame will need rewriting as it looks like it expects all data to be passed to it.
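For example, a sketch of a reworked export (my own code, not the original execute_query): it uses a psycopg2 named (server-side) cursor so rows are fetched in batches rather than all at once, and csv.DictWriter to format each RealDictCursor row before writing.

import csv
import psycopg2
import psycopg2.extras

def export_query_to_csv(conn_string, sql_text, dest_filename):
    # Hypothetical helper: stream query results straight into a csv file.
    conn_rs = psycopg2.connect(conn_string)
    try:
        # A named cursor is server-side, so rows come back in batches.
        cursor = conn_rs.cursor(name='export_cursor',
                                cursor_factory=psycopg2.extras.RealDictCursor)
        cursor.execute(sql_text)
        writer = None
        with open(dest_filename, 'w', newline='') as csv_fh:
            for record in cursor:
                if writer is None:
                    writer = csv.DictWriter(csv_fh, fieldnames=list(record.keys()))
                    writer.writeheader()
                writer.writerow(record)
        conn_rs.commit()
    finally:
        conn_rs.close()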

Categories

Resources