Downloading thousands of files using Python

I'm trying to download about 500k small CSV files (5 KB-1 MB) from a list of URLs, but it has been taking far too long. With the code below, I am lucky if I get 10k files a day.
I have tried using the multiprocessing package and a pool to download multiple files simultaneously. This seems to be effective for the first few thousand downloads, but eventually the overall speed goes down. I am no expert, but I assume the decreasing speed indicates that the server I am downloading from cannot keep up with this number of requests. Does that make sense?
To be honest, I am quite lost here and would appreciate any advice on how to speed this up.
import urllib2
import pandas as pd
import csv
from multiprocessing import Pool

# import url file
df = pd.read_csv("url_list.csv")

# select only part of the total list to download
url_list = pd.Series(df[0:10000])

# define a job and set the file name as the id string within the url
def job(url):
    file_name = "test/" + url[46:61] + ".csv"
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

# run the jobs
pool = Pool()
urls = ["http://" + str(file_path) for file_path in url_list]
pool.map(job, urls)

You are reinventing the wheel!
How about this:
parallel -a urls.file axel
Of course you'll have to install parallel and axel for your distribution.
axel is a multi-connection counterpart to wget, and parallel lets you run several tasks at the same time.

Related

Python bs4 lxml parsing slow

While parsing with bs4/lxml and looping through my files with a ThreadPoolExecutor, I am getting really slow results. I have searched the whole internet for faster alternatives on this one. Parsing about 2000 cached files (1.2 MB each) takes about 15 minutes (max_workers=500) with ThreadPoolExecutor. I even tried parsing on Amazon AWS with 64 vCPUs, but the speed remains about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up with more workers? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? OK, maybe 3 seconds would be fine. But it takes longer and longer the more files there are and the more workers I assign. It gets to the point of about ~25 seconds per file, instead of the 2 seconds it takes when running a single file/thread. Why?
What can I do to get the desired 2-3 seconds per file while parsing in parallel?
If that's not possible, are there any faster solutions?
My approach to the parsing is the following:
with open('cache/'+filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
Any faster way to scrape my cached files?
The code that runs the thread pool:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

future_list = []
with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
        else:
            pass
    for f in as_completed(future_list):
        pass
Parsing with BeautifulSoup is CPU-bound work, so threads are limited by the GIL; a pool of processes can use multiple cores. Try:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)

Beautiful soup, XML to Pandas dataframe

I am a beginner at machine learning and am exploring datasets for my NLP project. I got the data from http://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html. I am trying to create a pandas DataFrame by parsing the XML data, and I also want to add a label (1) to the positive reviews. Can someone please help me with the code? A sample output is given below.
from bs4 import BeautifulSoup
positive_reviews = BeautifulSoup(open('/content/drive/MyDrive/sorted_data_acl/electronics/positive.review', encoding='utf-8').read())
positive_reviews = positive_reviews.findAll('review_text')
positive_reviews[0]
<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad. It will run my cable modem, router, PC, and LCD monitor for 5 minutes. This is more than enough time to save work and shut down. Equally important, I know that my electronics are receiving clean power.
I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.
As always, Amazon had it to me in <2 business days
</review_text>
The main thing to note is that this is pseudo-XML. The approach:
download the tar.gz file and unzip/untar it
build a dictionary of all files
work around the pseudo-XML by inserting a root element into the string representation of each document
then it is a simple case of using list/dict comprehensions to generate the pandas constructor format
dfs is a dictionary of data frames ready to be used
import requests
from pathlib import Path
from tarfile import TarFile
from bs4 import BeautifulSoup
import pandas as pd

# download tar with pseudo-XML...
url = "http://www.cs.jhu.edu/%7Emdredze/datasets/sentiment/domain_sentiment_data.tar.gz"
fn = Path.cwd().joinpath(url.split("/")[-1])
if not fn.exists():
    r = requests.get(url, stream=True)
    with open(fn, 'wb') as f:
        for chunk in r.raw.stream(1024, decode_content=False):
            if chunk:
                f.write(chunk)

# untar downloaded file and generate a dictionary of all files
TarFile.open(fn, "r:gz").extractall()
files = {f"{p.parent.name}/{p.name}": p for p in Path.cwd().joinpath("sorted_data_acl").glob("**/*") if p.is_file()}

# convert all files into dataframes in a dict
dfs = {}
for file in files.keys():
    with open(files[file]) as f:
        text = f.read()
    # pseudo-XML: there is no root element, which stops it from being well formed
    # force one in...
    soup = BeautifulSoup(f"<root>{text}</root>", "xml")
    # simple case: each review is a row and each child element is a column
    dfs[file] = pd.DataFrame([{c.name: c.text.strip("\n") for c in r.children if c.name} for r in soup.find_all("review")])
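The question also asks for a label of 1 on the positive reviews. A minimal follow-up sketch, assuming the keys of dfs keep the "positive.review"/"negative.review" file names produced by the glob above:
# add a sentiment label: 1 for positive reviews, 0 for negative ones (assumed file naming)
for name, frame in dfs.items():
    if name.endswith("positive.review"):
        frame["label"] = 1
    elif name.endswith("negative.review"):
        frame["label"] = 0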

Async for loop via Asyncio in Python

I have some short code built around a for loop that iterates through a list of image URLs and downloads/saves the files. This code works fine, but the list of URLs is too large and it would take forever to complete one request at a time.
My goal is to make this an asynchronous for loop in the hope that it will speed things up greatly, but I just started writing Python to build this and don't know enough to use the asyncio library: I can't build out the iterations through aiter. How can I get this running?
To summarize: I have a for loop and need to make it asynchronous so it can handle multiple iterations simultaneously (setting a limit on the number of concurrent iterations would be awesome too).
import pandas as pd
import requests
import asyncio

df = pd.read_excel(r'filelocation', sheet_name='Sheet2')

for index, row in df.iterrows():
    url = row[0]
    filename = url.split('/')[-1]
    r = requests.get(url, allow_redirects=False)
    open('filelocation' + filename, 'wb').write(r.content)
    r.close()
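For reference, one possible shape for the asynchronous version. This is a minimal sketch, not a drop-in solution: it assumes the aiohttp library for the HTTP requests, uses an asyncio.Semaphore to cap the number of simultaneous downloads, and keeps the placeholder 'filelocation' paths from the code above.
import asyncio
import aiohttp
import pandas as pd

df = pd.read_excel(r'filelocation', sheet_name='Sheet2')
urls = [row[0] for _, row in df.iterrows()]

async def download(session, semaphore, url):
    filename = url.split('/')[-1]
    async with semaphore:  # only 'limit' downloads run at once
        async with session.get(url, allow_redirects=False) as resp:
            content = await resp.read()
    with open('filelocation' + filename, 'wb') as f:
        f.write(content)

async def main(limit=10):  # 'limit' caps concurrent downloads; tune as needed
    semaphore = asyncio.Semaphore(limit)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, semaphore, u) for u in urls))

asyncio.run(main())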

How can I optimize a web-scraping code snippet to run faster?

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {}
a = int(p['page'])

df = pd.DataFrame()
while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        False

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head)
There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous and as you are seeing – it's going to block until each page fully processes. There are a number of libraries that the requests project officially recommends. If you go this route you'll need to more explicitly define a terminating condition rather than a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}', hooks={'response': response_hook})
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The parsing of the data into a dataframe is an obvious bottleneck. This is going to keep slowing down as the dataframe becomes larger and larger. I don't know the size of these JSON responses, but if you're fetching 16k responses I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations. Save all of your scraped data into its own, independent JSON file (as in the example above). If you save each response separately, then once the scraping completes you can loop over all of the saved contents, parse them, and output to Excel and CSV. Note that depending on the size of the JSON files you may still run into memory issues, but at least you won't block the scraping process and can deal with the output processing separately.
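A rough sketch of that second, offline stage. It assumes the responses were saved to tmp/*.json by the hook above and share the explore_vintage structure from the question:
import glob
import json
import pandas as pd

frames = []
for path in glob.glob('tmp/*.json'):
    with open(path, 'r', encoding='utf-8') as fp:
        complete_json = json.load(fp)
    # flatten this page's matches into a DataFrame
    frames.append(pd.json_normalize(complete_json['explore_vintage']['matches']))

df = pd.concat(frames, ignore_index=True)
df.to_excel('output.xlsx')
df.to_csv('output.csv')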

wget vs urlretrieve in Python

I have a task to download GBs of data from a website. The data is in the form of .gz files, each about 45 MB in size.
The easy way to get the files is to use "wget -r -np -A files url". This downloads the data recursively and mirrors the website. The download rate is very high: 4 MB/sec.
But, just to play around, I was also using Python to build my own URL parser.
Downloading via Python's urlretrieve is terribly slow, possibly 4 times as slow as wget. The download rate is 500 KB/sec. I use HTMLParser for parsing the href tags.
I am not sure why this is happening. Are there any settings for this?
Thanks
Probably a unit math error on your part.
Just note that 500 KB/s (kilobytes per second) is equal to 4 Mb/s (megabits per second).
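To spell out the arithmetic: 500 kilobytes/s × 8 bits per byte = 4,000 kilobits/s = 4 megabits/s, so both tools are likely transferring at roughly the same rate.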
urllib works for me as fast as wget. Try this code; it shows the progress as a percentage, just like wget does.
import sys, urllib

def reporthook(a, b, c):
    # ',' at the end of the print line is important!
    print "% 3.1f%% of %d bytes\r" % (min(100, float(a * b) / c * 100), c),
    # you can also use sys.stdout.write
    # sys.stdout.write("\r% 3.1f%% of %d bytes"
    #                  % (min(100, float(a * b) / c * 100), c))
    sys.stdout.flush()

for url in sys.argv[1:]:
    i = url.rfind('/')
    file = url[i+1:]
    print url, "->", file
    urllib.urlretrieve(url, file, reporthook)
import subprocess
myurl = 'http://some_server/data/'
subprocess.call(["wget", "-r", "-np", "-A", "files", myurl])
As for the html parsing, the fastest/easiest you will probably get is using lxml
As for the HTTP requests themselves: httplib2 is very easy to use, and could possibly speed up downloads because it supports HTTP/1.1 keep-alive connections and gzip compression. There is also pycURL, which claims to be very fast (but is more difficult to use) and is built on libcurl, but I've never used it.
You could also try to download different files concurrently, but also keep in mind that trying to optimize your download times too aggressively may not be very polite towards the website in question.
Sorry for the lack of hyperlinks, but SO tells me "sorry, new users can only post a maximum of one hyperlink"
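For illustration, a minimal httplib2 sketch of the keep-alive point above (assuming the library is installed; the URLs here are hypothetical placeholders). A single Http object reuses connections via HTTP/1.1 keep-alive and handles gzip transparently:
import httplib2

h = httplib2.Http(".cache")  # optional on-disk cache directory

# hypothetical file URLs; substitute the real ones
for url in ["http://example.com/data/file1.gz", "http://example.com/data/file2.gz"]:
    response, content = h.request(url, "GET")
    with open(url.rsplit("/", 1)[-1], "wb") as f:
        f.write(content)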
Transfer speeds can easily be misleading. Could you try the following script, which simply downloads the same URL with both wget and urllib.urlretrieve? Run it a few times in case you're behind a proxy which caches the URL on the second attempt.
For small files, wget will take slightly longer due to the external process's startup time, but for larger files that should become irrelevant.
from time import time
import urllib
import subprocess
target = "http://example.com" # change this to a more useful URL
wget_start = time()
proc = subprocess.Popen(["wget", target])
proc.communicate()
wget_end = time()
url_start = time()
urllib.urlretrieve(target)
url_end = time()
print "wget -> %s" % (wget_end - wget_start)
print "urllib.urlretrieve -> %s" % (url_end - url_start)
Maybe you can wget and then inspect the data in Python?
Since Python suggests using urllib2 instead of urllib, I ran a test comparing urllib2.urlopen and wget.
The result is that it takes nearly the same time for both of them to download the same file. Sometimes urllib2 even performs better.
The advantage of wget lies in its dynamic progress bar, which shows the percentage finished and the current download speed while transferring.
The file size in my test was 5 MB. I haven't used any cache module in Python, and I am not aware of how wget behaves when downloading large files.
There shouldn't be a difference really. All urlretrieve does is make a simple HTTP GET request. Have you taken out your data processing code and done a straight throughput comparison of wget vs. pure python?
Please show us some code. I'm pretty sure that it has to be with the code and not on urlretrieve.
I've worked with it in the past and never had any speed related issues.
You can use wget -k to convert the links in all of the downloaded pages into relative links.
