I want to get data from the web in real time and have used Scrapy to extract the information for a Python utility I am building. The problem is that the scraped data is static, while the information on the site will change over time.
I wanted to know whether it is viable to call my Scrapy spider when the utility is invoked, so that the first time the utility is called, the data at that moment is stored as JSON on the user's machine, and it is refreshed the next time the user calls it.
Please let me know if there is an alternative to it.
Thanks in advance.
Edit-1: To make it clear, the data that I have extracted will change over time. Here is a link to my previous question about building the spider: How to scrape contents from multiple tables in a webpage. The problem is that as the league progresses, the fixtures' status will change (completed or not yet completed). I want the users to get real-time scraped data.
Edit-2: What I previously did was call my spider separately and use the generated JSON for the purposes of my utility. For the users to have real-time data when they use it on the terminal, should I push the Scrapy code into the main repository that will be uploaded to PyPI and call the spider in the main function of the .py file? Is this possible? What are its alternatives, if any?
You could start your Scrapy spider from code whenever you (or your user) need it:
from scrapy import cmdline

SCRAPY_SPIDER_NAME = 'spider_name'  # name of the spider to start scraping

# cmdline.execute expects an argv-style list, so split the command string
cmdline.execute("scrapy crawl {}".format(SCRAPY_SPIDER_NAME).split())
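If you would rather keep control of the process (cmdline.execute hands the command line over to Scrapy and exits when the command finishes), a minimal sketch using CrawlerProcess looks like this; "spider_name" and run_spider are placeholder names, not part of the original code:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

def run_spider():
    # Build a crawler process with the project's own settings and run
    # the spider by name; start() blocks until the crawl is finished.
    process = CrawlerProcess(get_project_settings())
    process.crawl("spider_name")
    process.start()

if __name__ == "__main__":
    run_spider()

Note that the underlying Twisted reactor can only be started once per Python process, so this pattern suits a utility that runs a crawl and then exits.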
I am working on a web scraping project. In this project, I have written the necessary code to scrape the required information from a website using python and selenium. All of this code resides in multiple methods in a class. This code is saved as scraper.py.
When I execute this code, the program takes somewhere between 6 and 10 seconds to extract all the necessary information from the website.
I wanted to create a UI for this project. I used django to create the UI. In the webapp, there is a form which when submitted opens a new browser window and starts the scraping process.
I access the scraper.py file in django views, where depending on the form inputs, the scraping occurs. While this works fine, the execution is very slow and takes almost 2 minutes to finish running.
How do I make the execution of the code faster when using Django? Can you point me to a tutorial on how to convert the scraper.py code into an API that Django can access? Will this help in making the code faster?
Thanks in advance
A few tiny tips:
How is your scraper.py working in the first place? Does it simply print the site links/details, or store it in a text file, or return them? What exactly happens in it?
If you wish to use your scraper.py as an "API", write your scraper.py code within a function that returns the details of the scraped site as a dictionary. Django's views.py can easily handle such dictionaries and send them over to your frontend HTML to replace the parts written in Jinja2 (see the sketch after these tips).
Further speed can be achieved (in case your scraper does larger jobs) by using multi-threading and/or multi-processing. Do explore both :)
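A minimal sketch of that idea, assuming a hypothetical scrape_product function in scraper.py and a form that posts a URL (the function, view, and template names are placeholders, not from the original project):

# scraper.py -- wrap the existing selenium logic in a function that
# returns the scraped fields as a plain dictionary
def scrape_product(url):
    # ... your existing selenium scraping code runs here ...
    return {"title": "...", "price": "...", "url": url}

# views.py -- pass the dictionary straight into the template context
from django.shortcuts import render
from .scraper import scrape_product

def product_view(request):
    if request.method == "POST":
        url = request.POST.get("url", "")
        data = scrape_product(url)   # dictionary returned by the scraper
        return render(request, "result.html", {"data": data})
    return render(request, "form.html")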
Example: if the spider throws an exception on page 15, it should be able to restart at page 15.
As I went through the Scrapy documentation, under the "Jobs: pausing and resuming crawls" topic, I ran the spider with the command mentioned in the document, i.e. scrapy crawl spidername -s JOBDIR=directory-path
When I go into that specific directory path, I can see that three entries have been created, namely requests.queue, requests.seen and spider.state (as in the image: https://i.stack.imgur.com/gE7zU.png). Only spider.state is 1 KB in size and the other two are 0 KB; while the spider is running, a file named p0 is created under the requests.queue folder, but once the spider is stopped and run again, the p0 file under requests.queue is deleted.
Taking a look at the documentation again, it states: "Requests must be serializable by the pickle module, in order for persistence to work, so you should make sure that your requests are serializable." After making the setting SCHEDULER_DEBUG = True in settings.py, I can see in the console:
[scrapy.core.scheduler] WARNING: Unable to serialize request:
Is this the reason why I cannot resume the spider from where it stopped, because the requests are not serialized? If so, how can I make the requests serializable so that the spider resumes from where it left off? Or is there any other approach to achieve this? Answers with sample code would be helpful.
Also, can anyone explain what those three files are for, as there is no explanation of them in the Scrapy documentation?
I guess that in order to stop and resume spiders effectively, we should make use of a DB to store the state of the spider. There may be other ways too, but I felt this is the most effective way.
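Regarding the warning itself: a common cause of unserializable requests is using a lambda as a callback or putting non-picklable objects into request.meta. A minimal sketch of a resumable spider, with a hypothetical name and URL used only for illustration:

import scrapy

class FixturesSpider(scrapy.Spider):
    name = "fixtures"                                # hypothetical spider
    start_urls = ["https://example.com/fixtures"]    # placeholder URL

    def parse(self, response):
        for href in response.css("a.next::attr(href)").getall():
            # A named method (not a lambda) as callback, plus only plain
            # picklable values in meta, keeps the request serializable.
            yield response.follow(href, callback=self.parse_page,
                                  meta={"source": response.url})

    def parse_page(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Run it with scrapy crawl fixtures -s JOBDIR=crawls/fixtures-1 and resume by issuing the same command again. As for the three files: requests.queue holds the serialized pending requests, requests.seen holds the duplicate filter's request fingerprints, and spider.state is the pickled spider state dictionary.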
I have a Python script that scrapes a page and uses the Jinja2 templating engine to output the appropriate HTML, which I finally got working thanks to you kind folks and the people of The Coding Den Discord.
I'm looking to automate the .get request I'm making at the top of the file.
I have thousands of URLs I want this script to run on. What's a good way to go about this? I've tried passing an array of URLs, but requests says no to that: it complains that the URL must be a string. So it seems I need to iterate over the compiledUrls variable each time. Any advice on the subject would be much appreciated.
Build a text file with the urls.
urls.txt
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters1.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters2.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters3.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters4.html
https://www.perfectimprints.com/custom-promos/20267/Pizza-Cutters5.html
Read the URLs and process them:
with open("urls.txt") as file:
for single_url in file:
url = requests.get(single_url.strip())
..... # your code continue here
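With thousands of URLs it may also be worth reusing a single requests.Session so the HTTP connections are kept alive between calls; a small sketch of that variation (the timeout value is just an example):

import requests

session = requests.Session()   # reuses connections across all requests

with open("urls.txt") as file:
    for single_url in file:
        single_url = single_url.strip()
        if not single_url:
            continue                     # skip blank lines
        response = session.get(single_url, timeout=10)
        response.raise_for_status()
        # ... feed response.text into the existing parsing/Jinja2 code ...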
I have a working crawler in Scrapy+Splash. It launches a spider on many pages. Each page contains a list of links. For each page, the spider downloads the page and then some of the pages linked from it (not recursively). All the pages are saved on the file system. The system works flawlessly. At the moment I'm refactoring it to add some DB interaction.
I'm not using items, nor Item Pipelines.
What are the benefits of using them?
Adding some info:
The purpose of my crawler is to download entire pages (as HTML, PNG, or converted to TXT using a library). As soon as the spider has a response to save, it passes it to a library that encapsulates all the IO ops (file system and DB). Done this way, it is simpler than using items (with the boilerplate for conversion) and pipelines.
So where is my doubt?
I don't know the way Scrapy works internally well enough. The way the crawler is implemented, the IO ops are executed in the spider's thread, so each spider takes longer to execute. If instead I move the IO ops into a pipeline, maybe(?) Scrapy can schedule its jobs better, executing them separately from the crawling job. Will there be any real performance difference?
In my opinion, using pipelines is just following the separation of concerns principle. Your spider can do many things, but its core function is to extract information from web pages. The rest can (and possibly should) be refactored into a pipeline or an extension.
It might not be such an issue if you have one spider for one web site. But imagine you have a Scrapy project with hundreds of spiders for semantically similar web sites and you want to apply the same logic for each item -- take a page snapshot, check for duplicates, store in database etc. And now imagine the maintenance hell if you had all the logic in each of the spiders and had to change that logic.
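A minimal sketch of how the IO could move into a pipeline; PageStoragePipeline and the storage module are hypothetical names standing in for the asker's own IO library:

# pipelines.py
from myproject import storage          # hypothetical IO wrapper module

class PageStoragePipeline:
    def open_spider(self, spider):
        storage.connect()               # e.g. open the DB connection once

    def process_item(self, item, spider):
        storage.save_page(item)         # file-system and DB writes live here
        return item

    def close_spider(self, spider):
        storage.disconnect()

# settings.py
ITEM_PIPELINES = {"myproject.pipelines.PageStoragePipeline": 300}

Note that process_item is still called synchronously unless it returns a Deferred, so the main win here is maintainability and reuse across spiders rather than raw speed.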
The official tutorial specifies how to call Scrapy from within a Python script.
By changing the following setting attributes:
settings.overrides['FEED_URI'] = output_path
settings.overrides['FEED_FORMAT'] = 'json'
I am able to store the data scraped in a json file.
However, I'm trying to process and return the scraped data immediately within the function I defined, so that other functions can call this wrapper function in order to scrape some websites.
I figure there must be some setting such as FEED_URI I can play with, but I am not sure. Any advice will be deeply appreciated!
Feed exports are meant to serialize the data you've scraped (see feed export documentation). What you are trying to do doesn't involve serialization.
What you want to do instead is create a pipeline. Scrapy will pass scraped Items to the pipeline. They are dictionaries, and you can do whatever you want with them.
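One way to build such a wrapper (sketched here with Scrapy's item_scraped signal rather than a full pipeline class; scrape_to_list and spider_cls are placeholder names) is to collect the items in a plain list while the crawl runs and return it afterwards:

from scrapy import signals
from scrapy.crawler import CrawlerProcess

def scrape_to_list(spider_cls, **spider_kwargs):
    """Run a spider in-process and return the scraped items as a list."""
    items = []

    def collect_item(item, response, spider):
        items.append(item)              # called once per scraped item

    process = CrawlerProcess()
    crawler = process.create_crawler(spider_cls)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler, **spider_kwargs)
    process.start()                     # blocks until the crawl finishes
    return items

Keep in mind that the Twisted reactor cannot be restarted, so this wrapper can only be called once per Python process.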