AIOFiles Take Longer Than Normal File Operation - python

I have a question I'm new to the python async world and I write some code to test the power of asyncio, I create 10 files with random content, named file1.txt, file2.txt, ..., file10.txt
here is my code:
import asyncio
import aiofiles
import time
async def reader(pack, address):
async with aiofiles.open(address) as file:
pack.append(await file.read())
async def main():
content = []
await asyncio.gather(*(reader(content, f'./file{_+1}.txt') for _ in range(10)))
return content
def core():
content = []
for number in range(10):
with open(f'./file{number+1}.txt') as file:
content.append(file.read())
return content
if __name__ == '__main__':
# Asynchronous
s = time.perf_counter()
content = asyncio.run(main())
e = time.perf_counter()
print(f'Take {e - s: .3f}')
# Synchronous
s = time.perf_counter()
content = core()
e = time.perf_counter()
print(f'Take {e - s: .3f}')
and got this result:
Asynchronous: Take 0.011
Synchronous: Take 0.001
why Asynchronous code takes longer than Synchronous code ?
where I do it wrong ?

I post an issue #110 on aiofiles's GitHub and the author of aiofiles answer that:
You're not doing anything wrong. What aiofiles does is delegate the file reading operations to a thread pool. This approach is going to be slower than just reading the file directly. The benefit is that while the file is being read in a different thread, your application can do something else in the main thread.A true, cross-platform way of reading files asynchronously is not available yet, I'm afraid :)
I hope it be helpful to anybody that has the same problem

Related

request.urlretrieve in multiprocessing Python gets stuck

I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp
def getImages(val):
#Dowload images
try:
url= # preprocess the url from the input val
local= #Filename Generation From Global Varables And Rand Stuffs...
urllib.request.urlretrieve(url,local)
print("DONE - " + url)
return 1
except Exception as e:
print("CAN'T DOWNLOAD - " + url )
return 0
if __name__ == '__main__':
files = "urls.txt"
lst = list(open(files))
lst = [l.replace("\n", "") for l in lst]
pool = mp.Pool(processes=4)
res = pool.map(getImages, lst)
print ("tempw")
It often gets stuck halfway through the list (it prints DONE, or CAN't DOWNLOAD to half of the list it has processed but I don't know what is happening on the rest of them). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
Ok, I have found an answer.
A possible culprit was the script was stuck in connecting/downloading from the URL. So what I added was a socket timeout to limit the time to connect and download the image.
And now, the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket
# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)
def getImages(val):
#Dowload images
try:
url= # preprocess the url from the input val
local= #Filename Generation From Global Varables And Rand Stuffs...
urllib.request.urlretrieve(url,local)
print("DONE - " + url)
return 1
except Exception as e:
print("CAN'T DOWNLOAD - " + url )
return 0
if __name__ == '__main__':
files = "urls.txt"
lst = list(open(files))
lst = [l.replace("\n", "") for l in lst]
pool = mp.Pool(processes=4)
res = pool.map(getImages, lst)
print ("tempw")
Hope this solution helps others who are facing the same issue
It looks like you're facing a GIL issue : The python Global Interpreter Lock basically forbid python to do more than one task at the same time.
The Multiprocessing module is really launching separate instances of python to get the work done in parallel.
But in your case, urllib is called in all these instances : each of them is trying to lock the IO process : the one who succeed (e.g. come first) get you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additionnal ressources :
You can find another way to parallelize requests here : Multiprocessing useless with urllib2?
And more info about the GIL here : What is a global interpreter lock (GIL)?

Multiprocessing Pipe() not working

I am trying to get familiar with the multiprocessing module. I am currently having some issues with Pipe(). I devised a small example to illustrate my problem.
I wrote two functions:
One that creates files in a specific folder (spawner)
And another that detects these files and copies them to another folder (cleaner)
They both work fine. I also managed to create a Process for both so that the creation and copying of the files happens simultaneously.
For the next step, I want the spawner to communicate to the cleaner that it has finished creating files so that the latter can terminate.
Here is the code:
import os
from time import sleep
import multiprocessing as mp
from shutil import copy2
def spawner(f_folder, pipeEnd):
template = 'my_file{}.txt'
for i in range(10):
new_file = os.path.join(f_folder, template.format(str(i)))
with open(new_file, 'w'):
pass
sleep(1)
pipeEnd.send(True)
return
def cleaner(f_folder, t_folder, pipeEnd):
state = set()
while not pipeEnd.recv():
new_files = set(os.listdir(f_folder)).difference(state)
state = set(os.listdir(f_folder))
for file in new_files:
copy2(os.path.join(f_folder, file), t_folder)
sleep(3)
return
if __name__ == '__main__':
receiver, sender = mp.Pipe()
from_folder = r'C:\Users\evkouni\Desktop\TEMP\PythonTests\subProcess\from'
to_folder = r'C:\Users\evkouni\Desktop\TEMP\PythonTests\subProcess\to'
p = mp.Process(target=spawner, args=(from_folder, sender))
q = mp.Process(target=cleaner, args=(from_folder, to_folder, receiver))
p.start()
q.start()
I just cannot seem to be able to get it to work.. Any help would be appreciated.
A Pipe is the wrong solution to your problem. You could use a pipe if you wanted to pass the file names from the spawner to the cleaner, but what you are trying to do is raise a flag. For that purpose, I would recommend the use of an Event: https://docs.python.org/2/library/multiprocessing.html#multiprocessing.Event
This can be considered a thread-safe (and multiprocess-safe) boolean. You would use it like
finished = mp.Event()
...
finished.set() # pipeEnd.send(True)
...
while not finished.is_set(): # while not receiver.recv():

Python multiprocessing pool hangs on map call

I have a function that parses a file and inserts the data into MySQL using SQLAlchemy. I've been running the function sequentially on the result of os.listdir() and everything works perfectly.
Because most of the time is spent reading the file and writing to the DB, I wanted to use multiprocessing to speed things up. Here is my pseduocode as the actual code is too long:
def parse_file(filename):
f = open(filename, 'rb')
data = f.read()
f.close()
soup = BeautifulSoup(data,features="lxml", from_encoding='utf-8')
# parse file here
db_record = MyDBRecord(parsed_data)
session.add(db_record)
session.commit()
pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
The problem I'm seeing is that the script hangs and never finishes. I usually get 48 of 63 records into the database. Sometimes it's more, sometimes it's less.
I've tried using pool.close() and in combination with pool.join() and neither seems to help.
How do I get this script to complete? What am I doing wrong? I'm using Python 2.7.8 on a Linux box.
You need to put all code which uses multiprocessing, inside its own function. This stops it recursively launching new pools when multiprocessing re-imports your module in separate processes:
def parse_file(filename):
...
def main():
pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
if __name__ == '__main__':
main()
See the documentation about making sure your module is importable, also the advice for running on Windows(tm)
The problem was a combination of 2 things:
my pool code being called multiple times (thanks #Peter Wood)
my DB code opening too many sessions (and/or) sharing sessions
I made the following changes and everything works now:
Original File
def parse_file(filename):
f = open(filename, 'rb')
data = f.read()
f.close()
soup = BeautifulSoup(data,features="lxml", from_encoding='utf-8')
# parse file here
db_record = MyDBRecord(parsed_data)
session = get_session() # see below
session.add(db_record)
session.commit()
pool = mp.Pool(processes=8)
pool.map(parse_file, ['my_dir/' + filename for filename in os.listdir("my_dir")])
DB File
def get_session():
engine = create_engine('mysql://root:root#localhost/my_db')
Base.metadata.create_all(engine)
Base.metadata.bind = engine
db_session = sessionmaker(bind=engine)
return db_session()

Python Multithreading Not Functioning

Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread
def getName(input):
l = input.split(".com/")
m = l[1]
return m
def parseMessages():
theFile = open('messages.html', 'r')
theLines = theFile.readlines()
theFile.close()
theNewFile = open('new321.txt','w')
for z in theLines:
if "321cheese" in z:
theNewFile.write(z)
theNewFile.close()
def downloadImage(inputURL):
urllib.urlretrieve (inputURL, "./grabNew/" + d)
parseMessages()
f = open('new321.txt', 'r')
lines = f.readlines()
f.close()
g = open('output.txt', 'w')
for x in lines:
a = x.split("<a href=\"")
b = a[1].split("\"")
c = b[0]
if ".png" in c:
d = getName(c)
g.write(c+"\n")
thread.start_new_thread( downloadImage, (c,) )
##downloadImage(c)
g.close()
There are multiple issues in your code.
The main issue is d global name usage in multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit
download_dir = "grabNew"
def url2filename(url):
return basename(unquote(urlsplit(url).path).decode('utf-8'))
def download_image(url):
filename = None
try:
filename = os.path.join(download_dir, url2filename(url))
return urllib.urlretrieve(url, filename), None
except Exception as e:
return (filename, None), e
def main():
pool = Pool(processes=10)
for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
pass # do something with the downloaded file or handle an error
if __name__ == "__main__":
main()
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in python are FAKE! Use the multiprocessing module if you want real parallelism, but since the images are probably all from the same server, if you open one hundred connections at the same time with the same server, probably its firewall will start dropping your connections.

md5 multithreaded bruteforce

I use python 2.7, and I have a simple multitheaded md5 dict brute:
# -*- coding: utf-8 -*-
import md5
import Queue
import threading
import traceback
md5_queue = Queue.Queue()
def Worker(queue):
while True:
try:
item = md5_queue.get_nowait()
except Queue.Empty:
break
try:
work(item)
except Exception:
traceback.print_exc()
queue.task_done()
def work(param):
with open('pwds', 'r') as f:
pwds = [x.strip() for x in f.readlines()]
for pwd in pwds:
if md5.new(pwd).hexdigest() == param:
print '%s:%s' % (pwd, md5.new(pwd).hexdigest())
def main():
global md5_queue
md5_lst = []
threads = 5
with open('md5', "r") as f:
md5_lst = [x.strip() for x in f.readlines()]
for m in md5_lst:
md5_queue.put(m) # add md5 hash to queue
for i in xrange(threads):
t = threading.Thread(target=Worker, args=(md5_queue,))
t.start()
md5_queue.join()
if __name__ == '__main__':
main()
Work in 5 threads. Each thread reads one hash from queue and checks it with list of passwords. Pretty simple: 1 thread 1 check in 'for' loop.
I want to have a little bit more: 1 thread and few threads to check passwords. So work() should read hash from queue and start a new number of threads to check passwords (1 thread hash, 10 thread there check for passwords). For example: 20 threads with hash and 20 threads to brute the hash in each thread. Something like that.
How can I do this?
P.S. Sorry for my explanation, ask if you did not understood what I want.
P.P.S. It's not about bruting md5, it's about multi-threading.
Thanks.
The default implementation of Python (called CPython) uses a Global Interpreter Lock (GIL) that effectively only allows one thread to be running at once. For I/O bound multithreaded applications, this is not usually a problem, but for CPU-bound applications like yours, it means you're not going to see much of a multicore speedup.
I'd suggest using a different Python implementation that doesn't have a GIL such as Jython, or rewriting your code to use a different language that doesn't have a GIL. Writing it in natively compiled code is a good idea, but most scripting languages that have an MD5 function usually implement that in native code anyways, so I honestly wouldn't expect much of a speedup between a natively compiled language and a scripting language.
I believe that the following code would be a considerably more efficient program than your example code:
from __future__ import with_statement
try:
import md5
digest = lambda text: md5.new(text).hexdigest()
except ImportError:
import hashlib
digest = lambda text: hashlib.md5(text.encode()).hexdigest()
def main():
passwords = load_passwords('pwds')
check_hashes('md5', passwords)
def load_passwords(filename):
passwords = {}
with open(filename) as file:
for word in (line.strip() for line in file):
passwords.setdefault(digest(word), []).append(word)
return passwords
def check_hashes(filename, passwords):
with open(filename) as file:
for code in (line.strip() for line in file):
for word in passwords.get(code, ()):
print (word + ':' + code)
if __name__ == '__main__':
main()
It has been written with both Python 2.x and 3.x and should be able to run on either of those languages.

Categories

Resources