I have a set of image URLs with indices, and I want to feed them to a downloader that can fetch multiple files at a time to speed up the process.
I tried putting the file names and URLs into dicts (name and d2, respectively) and then using requests and threading:
import requests
import threading

def Thread(start, stop):
    for i in range(start, stop):
        url = d2[i]
        r = requests.get(url)
        with open('assayImage/{}'.format(name[i]), 'wb') as f:
            f.write(r.content)

for n in range(0, len(d2), 1500):
    stop = n + 1500 if n + 1500 <= len(d2) else len(d2)
    threading.Thread(target=Thread, args=(n, stop)).start()
However, sometimes the connection times out and that file never gets downloaded, and after a while the download speed drops dramatically. For example, in the first hour I can download 10,000 files, but 3 hours later I can only download 8,000. Each file is small, around 500 KB.
So I want to ask: is there a stable way to download a large number of files? I really appreciate your answers.
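For reference, here is a minimal sketch of one more robust pattern: a bounded worker pool with a per-request timeout and a few retries per file. It reuses the d2 and name dicts from above; the helper name download_one and the numbers (30-second timeout, 20 workers, 3 attempts) are illustrative, not from the original code.
import requests
from concurrent.futures import ThreadPoolExecutor

def download_one(i):  # hypothetical helper, not part of the original code
    for attempt in range(3):  # retry a couple of times on network errors
        try:
            r = requests.get(d2[i], timeout=30)
            r.raise_for_status()
            with open('assayImage/{}'.format(name[i]), 'wb') as f:
                f.write(r.content)
            return
        except requests.exceptions.RequestException:
            pass  # give up after the last attempt; log the index if needed

with ThreadPoolExecutor(max_workers=20) as executor:  # cap concurrency instead of one thread per 1500 files
    executor.map(download_one, range(len(d2)))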
While parsing with bs4/lxml and looping through my files with a ThreadPoolExecutor, I am seeing really slow results. I have searched the whole internet for faster alternatives. Parsing about 2,000 cached files (1.2 MB each) takes about 15 minutes with ThreadPoolExecutor (max_workers=500). I even tried parsing on Amazon AWS with 64 vCPUs, but the speed stays about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up efficiently with multiprocessing? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? OK, maybe 3 seconds would be fine. But the more files there are and the more workers I assign, the longer it takes. It gets to the point of about ~25 seconds per file instead of the 2 seconds it takes when running a single file/thread. Why?
What can I do to get the desired 2-3 seconds per file while multiprocessing?
If that's not possible, are there any faster solutions?
My approach to the parsing is the following:
with open('cache/' + filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
Any faster way to scrape my cached files?
# the multiprocessor:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

future_list = []
with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
        else:
            pass
    for f in as_completed(future_list):
        pass
Try multiprocessing instead of threads: parsing with BeautifulSoup is CPU-bound, so the GIL lets only one thread parse at a time, while separate processes can parse in parallel:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)
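For example, to build the filenames list from the cache/ directory used in the question, glob could do it (a small sketch; adjust the pattern to your layout):
import glob

# every cached HTML page, with its path relative to the working directory
filenames = glob.glob("cache/*.html")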
I recently got interested in Bitcoin and the whole blockchain thing. Since every transaction is public by design, I thought it would be interesting to investigate the number of wallets, the size of transactions, and so on. But the current block height of Bitcoin is 732,324, which is a lot of blocks to walk through one after another. So I want to obtain the hash of each block, so that I can multi-thread grabbing the transactions.
The blockchain links each block to the next, so if I start at the first block (the genesis block), find the next block in the chain, and so forth until the end, I should have what I need. I am quite new to Python, but below is my code for obtaining the hashes and saving them to a file. However, at the current rate it would take 30-40 hours to complete on my machine. Is there a more efficient way to solve the problem?
# imports
from urllib.request import urlopen
from datetime import datetime
import json

# Setting start parameters
genesisBlock = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f"
baseurl = "https://blockchain.info/rawblock/"
i = 0  # counter for tracking progress

# Set HASH
blockHASH = genesisBlock

# Open file to save results
filePath = "./blocklist.tsv"
fileObject = open(filePath, 'a')

# Write header, if first line
if i == 0:
    fileObject.write("blockHASH\theight\ttime\tn_tx\n")

# Start walking through each block
while blockHASH != "":
    # Print progress
    if i % 250 == 0:
        print(str(i) + "|" + datetime.now().strftime("%H:%M:%S"))
    # store the response of URL
    url = baseurl + blockHASH
    response = urlopen(url)
    # storing the JSON response in data
    data_json = json.loads(response.read().decode())
    # Write result to file
    fileObject.write(blockHASH + "\t" +
                     str(data_json["height"]) + "\t" +
                     str(data_json["time"]) + "\t" +
                     str(data_json["n_tx"]) + "\t" +
                     "\n")
    # increment counter
    i = i + 1
    # Set new hash (empty string at the chain tip so the loop can end)
    blockHASH = data_json["next_block"][0] if data_json["next_block"] else ""
    if i > 1000: break  # or just let it run until completion

# Close the file
fileObject.close()
While this doesn't speak directly to the efficiency of your approach, using orjson or rapidjson will definitely speed up your results, since both are quite a bit faster than the standard json library.
rapidjson can be swapped in as easily as doing import rapidjson as json, whereas orjson requires a couple of changes, as described on its GitHub page, but nothing too hard.
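In the script above, the drop-in swap would look like this (assuming the python-rapidjson package is installed):
import rapidjson as json  # drop-in replacement for the standard json module

# the rest of the script stays the same, e.g.:
data_json = json.loads(response.read().decode())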
I'm trying to download a medium-sized APK file (10-300 MB) and save it locally. My connection speed should be about 90 Mbps, yet the process rarely surpasses 1 Mbps, and my network seems nowhere near its cap.
I've verified with cProfile that the part getting stuck is indeed the SSL download, and I've tried various advice from Stack Overflow, like reducing or increasing the chunk size, to no avail. I'd love a way to either test whether this could be a server-side issue, or advice on what I am doing wrong on the client side.
Relevant code:
import requests

session = requests.Session()  # I've heard a session is better due to the persistent HTTP connection
session.trust_env = False

r = session.get(<url>, headers=REQUEST_HEADERS, stream=True, timeout=TIMEOUT)  # timeout=60
r.raise_for_status()

filename = 'myFileName'
i = 0
with open(filename, 'wb') as result:
    for chunk in r.iter_content(chunk_size=1024 * 1024):
        if chunk:
            i += 1
            if i % 5 == 0:
                print(f'At chunk {i} with timeout {TIMEOUT}')
            result.write(chunk)
I was trying to download many different files; upon printing the URLs and testing them in Chrome, I saw that some of them were significantly slower to download than others.
It seems like a server-side issue, which I ended up working around by picking a good timeout for the requests.
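A minimal sketch of that workaround, assuming the same requests session as above (the helper name fetch, the retry count, and the 60-second value are only examples):
import requests

TIMEOUT = 60  # seconds; tune to separate slow-but-alive servers from dead ones

def fetch(session, url, retries=2):  # hypothetical helper, not from the original code
    for attempt in range(retries + 1):
        try:
            r = session.get(url, stream=True, timeout=TIMEOUT)
            r.raise_for_status()
            return r
        except requests.exceptions.Timeout:
            print(f'{url} timed out (attempt {attempt + 1})')
    return None  # caller decides whether to skip the URL or log it for later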
I am stuck on my project (a web application using Python/Django) at the point of splitting a large file (say 1 GB) into small parts using Python. I can split the large file into smaller parts, but the problem is that only part 1 plays and the rest of the files won't open.
I understand I need to put the video information before the video data, but I don't know how.
Below is my code; can someone help me split the large file into smaller ones?
[N.B.] I need to split the video from the Django views once the upload has completed.
from pathlib import Path
from django.conf import settings

def video_segments(video):
    loc = settings.MEDIA_ROOT + '/' + format(video.video_file)
    filetype = format(video.video_file).split(".")
    i = 0
    start_index = 0
    end_index = 1024000
    size = Path(loc).stat().st_size
    file = open(loc, "rb")
    while end_index < size:
        i = i + 1
        file.seek(start_index)
        bytes = file.read(end_index - start_index)
        newfile = open(settings.MEDIA_ROOT + "/" + filetype[0] + format(i) + "." + filetype[1], "wb")
        newfile.write(bytes)
        newfile.close()
        start_index = end_index  # not end_index + 1, which would drop one byte per part
        end_index = end_index + 1024000
    file.close()
I assume you are serving something like H.264 in an MP4 container to a modern web browser.
If you chop video files into parts like this, the second part won't have any header and therefore will not play in any browser.
The question is why you are doing this at all.
Normally the entire file is served to the browser, and the browser fetches the parts it needs with HTTP partial file retrieval (range requests). Modern browsers are smart enough to request only the parts they need, assuming the video file is encoded correctly for this purpose.
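To illustrate the partial retrieval the browser does on its own, here is a small sketch that asks a server for just the first megabyte of a video (the URL is hypothetical; the server must support the Range header):
import requests

url = "https://example.com/media/video.mp4"  # hypothetical URL
resp = requests.get(url, headers={"Range": "bytes=0-1048575"})

# 206 Partial Content means the server honoured the range; 200 means it sent the whole file
print(resp.status_code)
print(resp.headers.get("Content-Range"))
print(len(resp.content))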
I'm trying to download about 500k small CSV files (5 KB-1 MB) from a list of URLs, but it has been taking too long. With the code below, I'm lucky if I get 10k files a day.
I have tried using the multiprocessing package and a pool to download multiple files simultaneously. This seems effective for the first few thousand downloads, but eventually the overall speed goes down. I am no expert, but I assume the decreasing speed indicates that the server I am downloading from cannot keep up with this number of requests. Does that make sense?
To be honest, I am quite lost here and was wondering if there is any advice on how to speed this up.
import urllib2
import pandas as pd
import csv
from multiprocessing import Pool

# import url file
df = pd.read_csv("url_list.csv")

# select only part of the total list to download
list = pd.Series(df[0:10000])

# define a job and set the file name as the id string inside the url
def job(url):
    file_name = str("test/" + url[46:61]) + ".csv"
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

# run the jobs
pool = Pool()
url = ["http://" + str(file_path) for file_path in list]
pool.map(job, url)
You are reinventing the wheel!
How about this:
parallel -a urls.file axel
Of course, you'll have to install parallel and axel for your distribution.
axel is a multi-connection counterpart to wget.
parallel lets you run many such downloads at once.
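To feed it, the URL list from the question could be written out one URL per line first (a minimal sketch, assuming the URLs sit in the first column of url_list.csv):
import pandas as pd

df = pd.read_csv("url_list.csv")

# one full URL per line, ready for: parallel -a urls.file axel
with open("urls.file", "w") as out:
    for file_path in df.iloc[:, 0]:
        out.write("http://" + str(file_path) + "\n")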