We have a web scraping script that essentially:
forks up to 5 processes for a list of websites;
each process retrieves one website's data.
(In the code we use the os.spawnlp function to fork the 5 processes and Python's requests library for the scraping.)
Normally it takes less than a minute to process one website in a single process, but when it is multi-processed as above it sometimes takes far too long (more than 473 minutes) to get the data. A screenshot of the top command output for this issue is attached below.
Does anyone know why it might be hanging every once in a while, and how to debug and solve this issue?
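For reference, a minimal sketch of the setup as described, with os.spawnlp launching one worker per site; the URL list is a placeholder, and the explicit requests timeout is my addition, since requests.get without a timeout can block indefinitely:

import os
import sys
import requests

URLS = ["https://example.com/a", "https://example.com/b"]  # placeholder site list

def scrape(url):
    # A timeout keeps one stuck request from hanging its whole process.
    return requests.get(url, timeout=60).text

if __name__ == "__main__":
    if len(sys.argv) > 1:
        scrape(sys.argv[1])                    # child mode: scrape a single URL
    else:
        pids = [os.spawnlp(os.P_NOWAIT, sys.executable, sys.executable,
                           __file__, url) for url in URLS[:5]]
        for pid in pids:
            os.waitpid(pid, 0)                 # parent waits for each child to exit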
Related
I have a NextJS index page that resizes an image. To do this, I have an api/run.js that executes a Python script and returns the result.
However, the Python script is resource-intensive and takes about 3 minutes to return the result, so I want it to run consecutively, not concurrently.
My goal is to be able to access the same webpage from multiple devices and have every request placed in the same queue.
How can I achieve this?
I tried using WebSockets and a MySQL database, but I realized that someone has to be active on the website for it to work correctly; if not, the queue does not work.
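One pattern that serializes the work regardless of whether anyone is on the page is a single long-running worker that drains a persistent queue one job at a time. A minimal Python sketch of that idea, where the jobs/ and done/ directories and the resize() body are assumptions standing in for whatever api/run.js actually produces:

import time
from pathlib import Path

JOBS_DIR = Path("jobs")   # hypothetical: api/run.js writes one file per request here
DONE_DIR = Path("done")   # results land here for the API route to pick up

def resize(job_file):
    # placeholder for the real, resource-intensive resize script
    return job_file.read_bytes()

def worker_loop():
    while True:
        pending = sorted(JOBS_DIR.glob("*.job"))   # oldest job first
        if not pending:
            time.sleep(1)                          # nothing queued; poll again
            continue
        job = pending[0]
        result = resize(job)                       # one job at a time, in order
        (DONE_DIR / job.name).write_bytes(result)
        job.unlink()                               # dequeue the finished job

if __name__ == "__main__":
    JOBS_DIR.mkdir(exist_ok=True)
    DONE_DIR.mkdir(exist_ok=True)
    worker_loop()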
I'm running a Python Selenium script using ChromeDriver that goes to one (and only one) URL, enters data into a form, then scrapes the results. I'd like to do this in parallel by entering different data into that same URL's form and getting the different results.
In researching multiprocessing and multithreading, I have found that multithreading is best for I/O-bound tasks and multiprocessing is best for CPU-bound tasks.
The overall amount of data I'm scraping is small, select text only, so I don't believe it is I/O bound? Does this sound correct? From what I've gathered, web scrapers are generally I/O intensive; maybe my scenario is just an exception?
Running my current (sequential, non-parallel) script, Resource Monitor shows the Chrome instance's CPU usage ramp up AND spread across all (4) cores. So is Chrome using multiprocessing by default, and is the advantage of multiprocessing within Python really just being able to apply the script's function to each Chrome instance? Maybe I've got this all wrong...
Also, is a script that wants to open multiple URLs at once and interact with them inherently CPU-bound due to the fact that it runs a lot of Chrome instances? Assuming the data scraped is small, and ignoring headless for now.
Image of CPU usage attached; the spike in the middle (across all 4 CPUs) is when Chrome is launched.
Any comments or advice appreciated, including any pseudocode on how you might implement something like this. I didn't share the base code since the question is more about the structure of all this.
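Since the question invites pseudocode, here is a minimal sketch of one common structure: a thread pool where each worker owns its own Chrome instance. The URL, input values, and the form/result selectors are placeholders I've made up:

from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.common.by import By

URL = "https://example.com/form"                 # placeholder target page
INPUTS = ["alpha", "bravo", "charlie", "delta"]  # different data for the same form

def run_one(value):
    driver = webdriver.Chrome()                  # each worker gets its own browser
    try:
        driver.get(URL)
        driver.find_element(By.NAME, "query").send_keys(value)   # hypothetical field
        driver.find_element(By.NAME, "submit").click()           # hypothetical button
        return driver.find_element(By.ID, "result").text         # hypothetical result node
    finally:
        driver.quit()

if __name__ == "__main__":
    # Threads suffice here: the Python side mostly waits on the browser,
    # while the real CPU work happens in the separate Chrome processes.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(run_one, INPUTS))
    print(results)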
I have many Python scripts for crawling web content. They have different start and stop times and different wait times until the next crawl. Right now I have about 50 crawlers, and there will be more in the future. How can I manage them so they don't take up too much of the computer's resources?
At the moment I create one Windows service per crawler so it can run independently, and I use time.sleep to wait until the next crawl.
I end up with about 30 MB of RAM for each crawler. That's fine with a small number of crawlers, but I don't think it's scalable.
And it's not just crawlers but scheduled scripts as well.
Please share your thoughts on this.
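One way to avoid paying roughly 30 MB per crawler is to run a single scheduler process that keeps every crawler's next run time in a priority queue and sleeps only until the earliest one is due. A minimal sketch, where crawl_site() and the intervals are made-up stand-ins for the real crawlers:

import heapq
import time

def crawl_site(name):
    print(f"crawling {name}")        # placeholder for a real crawler's work

# (next_run_timestamp, interval_seconds, crawler_name) for each crawler
jobs = [(time.time(), 300, "site-a"),
        (time.time(), 600, "site-b")]
heapq.heapify(jobs)

while True:
    next_run, interval, name = heapq.heappop(jobs)
    time.sleep(max(0, next_run - time.time()))   # sleep until the earliest job is due
    crawl_site(name)
    heapq.heappush(jobs, (time.time() + interval, interval, name))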
I have 10 web crawlers and a preprocessing script (call them WS and PS, respectively).
What I want to do is the following:
1) There is a main script that controls everything, called 'Mother.py'.
2) 'Mother.py' runs 2 main tasks: [running the 10 web crawlers at the same time] / [parsing the crawled data].
The problem is that I want my script to wait until the 10 web crawlers have completed their jobs, and only then execute the parsing script, and so on.
So the pseudo structure would be something like:
def Mother():
    run Crawler()
    # wait until Crawler is finished
    run Parser()
I googled several methods such as 'subprocess' and 'threading',
but I found them difficult to understand and apply to my code.
Can this problem be solved with 'subprocess', or is there another, simpler way?
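A minimal sketch of one simpler route, multiprocessing.Pool, assuming each crawler can be wrapped in a plain Python function; the crawler and parser bodies below are placeholders:

from multiprocessing import Pool

def crawler(site_id):
    print(f"crawling site {site_id}")    # placeholder for one of the 10 crawlers
    return f"data-{site_id}"

def parser(results):
    print(f"parsing {len(results)} crawl results")   # placeholder for the parsing script

def mother():
    with Pool(processes=10) as pool:
        results = pool.map(crawler, range(10))   # runs the 10 crawlers concurrently
    # pool.map returns only after every crawler has finished,
    # so the parser is guaranteed to see all of the crawled data.
    parser(results)

if __name__ == "__main__":
    mother()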
I wrote a Python web scraper yesterday and ran it in my terminal overnight; it only got through 50k pages. So now I just have a bunch of terminals open, concurrently running the script with different start and end points. This works fine because the main lag is obviously opening web pages, not actual CPU load. Is there a more elegant way to do this, especially one that can be done locally?
You have an I/O-bound process, so to speed it up you will need to send requests concurrently. This doesn't necessarily require multiple processors; you just need to avoid waiting until one request is done before sending the next.
There are a number of solutions for this problem. Take a look at this blog post, or check out gevent, asyncio (backports to pre-3.4 versions of Python should be available), or another async I/O library.
However, when scraping other sites, you must remember: you can send requests very fast with concurrent programming, but depending on what site you are scraping, this may be very rude. You could easily bring a small site serving dynamic content down entirely, forcing the administrators to block you. Respect robots.txt, try to spread your efforts between multiple servers at once rather than focusing your entire bandwidth on a single server, and carefully throttle your requests to single servers unless you're sure you don't need to.
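To illustrate that advice, here is a minimal asyncio sketch using the third-party aiohttp library; the URL list and the concurrency limit are placeholders, and the semaphore caps how many requests are in flight at once, in the spirit of the throttling advice above:

import asyncio
import aiohttp   # third-party: pip install aiohttp

URLS = [f"https://example.com/page/{i}" for i in range(100)]   # placeholder pages

async def fetch(session, url, limit):
    async with limit:                          # cap how many requests are in flight
        async with session.get(url) as resp:
            return await resp.text()

async def main():
    limit = asyncio.Semaphore(10)              # throttle: at most 10 concurrent requests
    async with aiohttp.ClientSession() as session:
        pages = await asyncio.gather(*(fetch(session, u, limit) for u in URLS))
    print(f"fetched {len(pages)} pages")

if __name__ == "__main__":
    asyncio.run(main())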