Use concurrent.futures with dictionary.items() to iterate through keys and values - Python

I want to scrape data from a webpage in a more efficient way. I read about concurrent.futures but I have no idea how to use it in my script.
My function to take data from each link takes four arguments:
def scrape_data_for_offer(b, m, url, loc):
then it saves the scraped data to a pandas DataFrame.
It's called in a loop:
for link, location in cars_link_dict.items():
    scrape_data_for_offer(brand, model, link, location)
and I want it to speed up this scraping process.
I tried to solve it like this:
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    executor.map(scrape_data_for_offer, brand, model, cars_link_dict.items())
But it doesn't work, do you have any ideas of how to solve this problem?

In your executor.map call, you're only passing three argument iterables, and each item of cars_link_dict.items() is a two-element (key, value) tuple. So, change your function to:
def scrape_data_for_offer(b, m, info):
    url, loc = info
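With that signature, one way to wire up the executor.map call is to repeat the single brand and model values alongside the dictionary items; a minimal sketch, assuming brand and model are single values that apply to every link:
import concurrent.futures
import itertools

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(scrape_data_for_offer,
                 itertools.repeat(brand),
                 itertools.repeat(model),
                 cars_link_dict.items())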
By the way, the words are "scrape", "scraped" and "scraping". Many, many, many people are using "scrap", "scrapped" and "scrapping", but those words all refer to throwing things away.
As another by the way, the concurrent stuff is not really going to help you. I assume you are using BeautifulSoup for the scraping. BeautifulSoup is all Python code, and the global interpreter lock means that only one of the threads will be able to execute at any given time. You'll get a tiny bit of overlap while waiting for the web site responses to be delivered.
Also, running 50 workers is pointless unless you have 50 processors. They'll all fight for resources. If you have 8 processors, use about 12 workers. In most cases, you should just leave off that parameter; ThreadPoolExecutor will default to a value based on the number of processors in your machine (min(32, os.cpu_count() + 4) in Python 3.8+).

Related

Scraping multiple webpages at once with Selenium

I am using Selenium and Python for a big project. I have to go through 320,000 webpages (320K) one by one, scrape details, then sleep for a second and move on.
Like below:
links = ["https://www.thissite.com/page=1", "https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]

for link in links:
    browser.get(link)
    scrapedinfo = browser.find_elements_by_xpath("*//div/productprice").text
    open("file.csv", "a+").write(scrapedinfo)
    time.sleep(1)
The greatest problem: it's too slow! With this script it will take days or maybe weeks.
Is there a way to increase the speed, such as by visiting multiple links at the same time and scraping them all at once?
I have spent hours looking for answers on Google and Stack Overflow and only found material about multiprocessing.
But I am unable to apply it in my script.
Threading approach
You should start with threading.Thread; it will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures
from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    # <actual work that needs to be done by selenium>
    driver.quit()

# the default number of threads is based on the number of CPU cores,
# but you can set it with `max_workers`, e.g. `futures.ThreadPoolExecutor(max_workers=...)`
with futures.ThreadPoolExecutor() as executor:
    # store the url for each future in a dict, so we know which url fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # can pass `timeout` to wait at most that many seconds per task
        except Exception as exc:  # a task may raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))
Consider also storing the chrome-driver instance initialized in each thread using threading.local(). From here they reported a reasonable performance improvement.
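A minimal sketch of that pattern, keeping one driver per thread and reusing it for every URL that thread handles:
import threading
from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # each thread lazily creates its own driver, then keeps reusing it
    if not hasattr(thread_local, "driver"):
        chromeOptions = webdriver.ChromeOptions()
        chromeOptions.add_argument("--headless")
        thread_local.driver = webdriver.Chrome(options=chromeOptions)
    return thread_local.driver

def selenium_work(url):
    driver = get_driver()
    driver.get(url)
    # <actual scraping with the reused driver>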
Consider whether using BeautifulSoup directly on the page source from Selenium can give some further speed-up. It's a very fast and well-established package.
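For example, a sketch that hands the page source Selenium already loaded over to BeautifulSoup (assuming lxml is installed):
from bs4 import BeautifulSoup

driver.get(url)
soup = BeautifulSoup(driver.page_source, "lxml")
result = soup.find('a')  # replace with whatever elements you actually need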
Other approaches
Although I personally did not see much benefit from using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. Also, on Windows you have many limitations for the Python Process.
Consider whether your use case can be satisfied by arsenic, an asynchronous webdriver client built on asyncio. It sounds really promising, though it has many limitations.
Consider whether Requests-HTML solves your problem with JavaScript loading, since it claims full JavaScript support. In that case you could use it with BeautifulSoup in a standard data-scraping methodology.
You can use parallel execution. Divide the list of sites, e.g. into ten test cases that run the same code, just with different method names (method1, method2, method3, ...). You will increase the speed. The number of browsers depends on your hardware performance.
See more on https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html
The main thing is to use TestNG and edit the .xml file to set how many threads you want to use. Like this:
<suite name="TestSuite" thread-count="10" parallel="methods" >
If the websites you are scraping are not heavily protected against bots, it is better to use Requests; it will reduce your time from days to a couple of hours. Implement multi-threading combined with multi-processing. The steps are too long to go over in full, so here is just the idea:
import concurrent.futures
from multiprocessing import Process

def threader_run(data):
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            futures.append(executor.submit(scrapper, i))  # `scrapper` is your scraping function
        for future in concurrent.futures.as_completed(futures):
            print(future.result())

data = {}
data['process1'] = []  # fill each list with a share of the links
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    jobs = []
    for x in data:
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
Basically, what this does is first compile all the links, then split them into 3 arrays for running 3 processes simultaneously (you could run more processes depending on your CPU cores and how data-intensive these jobs are). You could split those arrays further, into more than 10 or even 100 chunks, depending on your project size. Each process runs a thread pool with a maximum of 8 workers, which then calls your final scraping function.
Here, with 3 processes and 8 workers each, you are looking at roughly a 24x speed boost. However, use of the Requests library is necessary; if you use Selenium for this, a normal computer/laptop will freeze, because it would mean 24 browsers running simultaneously.

Python: asynchronously process tasks that are results from other asynchronous tasks

I am trying to fetch data of all transactions for several addresses from an API. Each address can have several pages of transactions, which I find out only when I ask for first page.
I have methods api.get_address_data(address, page) and api.get_transaction_data(tx).
Synchronous code for what I want to do would look like this:
def all_transaction_data(addresses):
    for address in addresses:
        data = api.get_address_data(address, page=0)
        transactions = data.transactions
        for n in range(1, data.total_pages):
            next_page = api.get_address_data(address, page=n)
            transactions += next_page.transactions
        for tx in transactions:
            yield api.get_transaction_data(tx)
I don't care about the order of transactions received (I will have to reorder them when I have all of them anyway). I can fit all the data in memory, but it's a lot of very small requests, so I'd like to do as much in parallel as possible.
What is the best way to accomplish this? I was playing around with asyncio (the API calls are under my control so I can convert them to async), but I have trouble with interleaving the layers: my best solution can fetch all the addresses first, list all the pages second and finally get all transactions in one big batch. I would like each processing step to be scheduled immediately when the appropriate input data is ready, and the results collected into one big list (or yielded from a single generator).
It seems that I need some sort of open-ended task queue, where task "get-address" fetches the data and enqueues a bunch of "get-pages" tasks, which in turn enqueue "get-transaction" tasks, and only these are then collected into a list of results?
Can this be done with asyncio? Would something like gevent be more suitable, or maybe just a plain ThreadPoolExecutor? Is there a better approach than what I outlined so far?
Note that I'd like to avoid inversion of control flow, or at least hide it as an implementation detail. I.e., the caller of this code should be able to just call for tx in all_transaction_data(), or at worst async for.
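One possible shape for this, as a rough sketch: assume api.get_address_data and api.get_transaction_data have been converted to coroutines, then spawn one task per address that schedules page and transaction fetches as soon as their inputs are available:
import asyncio

async def all_transaction_data(addresses, max_concurrency=20):
    sem = asyncio.Semaphore(max_concurrency)  # cap simultaneous requests
    results = []

    async def fetch_tx(tx):
        async with sem:
            results.append(await api.get_transaction_data(tx))

    async def fetch_page(address, page):
        async with sem:
            data = await api.get_address_data(address, page)
        # transactions from this page are scheduled as soon as it arrives
        await asyncio.gather(*(fetch_tx(tx) for tx in data.transactions))

    async def fetch_address(address):
        async with sem:
            first = await api.get_address_data(address, page=0)
        tasks = [fetch_tx(tx) for tx in first.transactions]
        tasks += [fetch_page(address, n) for n in range(1, first.total_pages)]
        await asyncio.gather(*tasks)

    await asyncio.gather(*(fetch_address(a) for a in addresses))
    return results

# usage: transactions = asyncio.run(all_transaction_data(addresses))
This returns a flat list rather than a generator; exposing it as async for would need an asyncio.Queue between producers and the consumer, but the scheduling idea is the same.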

Parallel data processing in Python

I have an architecture which is basically a queue of URL addresses and some classes to process the content of those URLs. At the moment the code works well, but it is slow to sequentially pull a URL out of the queue, send it to the corresponding class, download the URL content and finally process it.
It would be faster and make proper use of resources if, for example, it could read n URLs out of the queue and then spawn n processes or threads to handle the downloading and processing.
I would appreciate if you could help me with these:
What packages could be used to solve this problem ?
What other approach can you think of ?
You might want to look into the Python multiprocessing library. With multiprocessing.Pool, you can give it a function and an array, and it will call the function with each value of the array in parallel, using as many or as few processes as you specify.
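A minimal sketch of that approach (the body of process_url is just a placeholder for your download-and-process step):
from multiprocessing import Pool
from urllib.request import urlopen

def process_url(url):
    # placeholder: download the page, then hand it to your processing class
    body = urlopen(url).read()
    return url, len(body)

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical list
    with Pool(processes=4) as pool:  # 4 worker processes
        for url, size in pool.map(process_url, urls):
            print(url, size)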
If C calls are slow, like downloading, database requests, or other I/O, you can just use threading.Thread.
If Python code is slow, like frameworks, your own logic, or non-accelerated parsers, you need to use a multiprocessing Pool or Process. It also speeds up Python code, but it is less thread-safe and needs a deep understanding of how it works in complex code (locks, semaphores).
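A compact illustration of that split, assuming a download step (I/O-bound, threads) and a parse step (CPU-bound, processes); both bodies are placeholders:
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
from urllib.request import urlopen

def download(url):
    return urlopen(url).read()  # I/O-bound: threads are enough

def parse(html):
    return len(html)  # placeholder for CPU-heavy parsing/processing

if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]  # hypothetical list
    with ThreadPoolExecutor(max_workers=8) as threads:
        pages = list(threads.map(download, urls))
    with ProcessPoolExecutor() as processes:
        results = list(processes.map(parse, pages))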

When doing network programming, is there a rule of thumb for determining how many threads to use?

Say I have a list of 1000 unique URLs, and I need to open each one and assert that something on the page is there. Doing this sequentially is obviously a poor choice, as most of the time the program will be sitting idle just waiting for a response. So I added a thread pool where each worker reads from a main Queue and opens a URL to do a check. My question is, how big do I make the pool? Is it based on my network bandwidth, or some other metric? Are there any rules of thumb for this, or is it simply trial and error to find an effective size?
This is more of a theoretical question, but here's the basic outline of the code I'm using.
if __name__ == '__main__':
    # get the stuff I've already checked
    ID = 0
    already_checked = [i[ID] for i in load_csv('already_checked.csv')]

    # make sure I don't duplicate the effort
    to_check = load_csv('urls_to_check.csv')
    links = [url[:3] for url in to_check if url[ID] not in already_checked]

    in_queue = Queue.Queue()
    out_queue = Queue.Queue()

    threads = []
    for i in range(5):
        t = SubProcessor(in_queue, out_queue)
        t.setDaemon(True)
        t.start()
        threads.append(t)

    writer = Writer(out_queue)
    writer.setDaemon(True)
    writer.start()

    for link in links:
        in_queue.put(link)
Your best bet is probably to write some code that runs some tests using the number of threads you specify, and see how many threads produce the best result. There are too many variables (speed of processor, speed of the buses, thread overhead, number of cores, and the nature of the code itself) for us to hazard a guess.
My experience (using .NET, but it should apply to any language) is that DNS resolution ends up being the limiting factor. I found that a maximum of 15 to 20 concurrent requests is all that I could sustain. DNS resolution is typically very fast, but sometimes can take hundreds of milliseconds. Without some custom DNS caching or other way to quickly do the resolution, I found that it averages about 50 ms.
If you can do multi-threaded DNS resolution, 100 or more concurrent requests is certainly possible on modern hardware (a quad-core machine). How your OS handles that many individual threads is another question entirely. But, as you say, those threads are mostly doing nothing but waiting for responses. The other consideration is how much work those threads are doing. If it's just downloading a page and looking for something specific, 100 threads is probably well within the bounds of reason. Provided that "looking" doesn't involve much more than just parsing an HTML page.
Other considerations involve the total number of unique domains you're accessing. If those 1,000 unique URLs are all from different domains (i.e. 1,000 unique domains), then you have a worst-case scenario: every request will require a DNS resolution (a cache miss).
If those 1,000 URLs represent only 100 domains, then you'll only have 100 cache misses. Provided that your machine's DNS cache is reasonable. However, you have another problem: hitting the same server with multiple concurrent requests. Some servers will be very unhappy if you make many (sometimes "many" is defined as "two or more") concurrent requests. Or too many requests over a short period of time. So you might have to write code to prevent multiple or more-than-X concurrent requests to the same server. It can get complicated.
One simple way to prevent the multiple-requests problem is to sort the URLs by domain and then ensure that all the URLs from the same domain are handled by the same thread. This is less than ideal from a performance perspective, because you'll often find that one or two domains have many more URLs than the others, and you'll end up with most of the threads finished while those few are plugging away at their very busy domains. You can alleviate these problems by examining your data and assigning the threads' work items accordingly.
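A rough sketch of that grouping (the per-domain check is a placeholder):
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def check_domain_urls(urls):
    for url in urls:
        pass  # placeholder: open the url and assert the expected content is there

urls_by_domain = defaultdict(list)
for url in links:  # `links` as built in the question's code
    urls_by_domain[urlparse(url).netloc].append(url)

with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(check_domain_urls, urls_by_domain.values())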

Multiprocessing a dictionary in python

I have two dictionaries of data, and I created a function that acts as a rules engine to analyze entries in each dictionary and do things based on specific metrics I set (if it helps, each entry in the dictionary is a node in a graph, and if rules match I create edges between them).
Here's the code I use (it's a for loop that passes parts of the dictionary to a rules function; I refactored my code following a tutorial I read):
import multiprocessing

jobs = []
def loadGraph(dayCurrent, day2Previous):
    for dayCurrentCount in graph[dayCurrent]:
        dayCurrentValue = graph[dayCurrent][dayCurrentCount]
        for day1Count in graph[day2Previous]:
            day1Value = graph[day2Previous][day1Count]
            #rules(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous)
            p = multiprocessing.Process(target=rules, args=(day1Count, day1Value, dayCurrentCount, dayCurrentValue, dayCurrent, day2Previous))
            jobs.append(p)
            p.start()
        print ' in rules engine for day', dayCurrentCount, ' and we are about ', ((len(graph[dayCurrent])-dayCurrentCount)/float(len(graph[dayCurrent])))
The data I'm studying could be rather large (could, because it's randomly generated). I think for each day there are about 50,000 entries. Because most of the time is spent on this stage, I was wondering if I could use the 8 cores I have available to help process this faster.
Because each dictionary entry is being compared to a dictionary entry from the day before, I thought the processes could be split up by that, but my code above is slower than running it normally. I think this is because it's creating a new process for every single entry.
Is there a way to speed this up and use all my CPUs? My problem is, I don't want to pass the entire dictionary because then one core will get stuck processing it; I would rather have the work split across CPUs so that all free CPUs are used.
I'm totally new to multiprocessing so I'm sure there's something easy I'm missing. Any advice/suggestions or reading material would be great!
What I've done in the past is to create a "worker class" that processes data entries. Then I'll spin up X number of threads that each run a copy of the worker class. Each item in the dataset gets pushed into a queue that the worker threads are watching. When there are no more items in the queue, the threads spin down.
Using this method, I was able to process 10,000+ data items using 5 threads in about 3 seconds. When the app was only single-threaded, this would take significantly longer.
Check out: http://docs.python.org/library/queue.html
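A minimal sketch of that pattern (process_entry and the dataset are placeholders for your rules engine and dictionary entries):
import queue
import threading

NUM_WORKERS = 5
work_queue = queue.Queue()

def process_entry(item):
    pass  # placeholder: run your rules engine on one entry

def worker():
    while True:
        item = work_queue.get()
        if item is None:  # sentinel: no more work for this thread
            break
        process_entry(item)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for item in range(10000):  # placeholder dataset of entries
    work_queue.put(item)
for _ in threads:
    work_queue.put(None)  # one sentinel per worker so each thread can exit
for t in threads:
    t.join()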
I would recommend looking into MapReduce implementations in Python. Here's one: http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=mapreduce+python. Also, take a look at a Python package called Celery: http://celeryproject.org/. With Celery you can distribute your computation not only among cores on a single machine, but also to a server farm (cluster). You do pay for that flexibility with a more involved setup/maintenance.
