Recursive use of Scrapy to scrape webpages from a website - python

I have recently started to work with Scrapy. I am trying to gather some info from a large list which is divided into several pages (about 50). I can easily extract what I want from the first page by including it in the start_urls list. However, I don't want to add the links to all 50 pages to this list; I need a more dynamic way. Does anyone know how I can iteratively scrape web pages? Does anyone have any examples of this?
Thanks!

Use urllib2 to download a page. Then use either re (regular expressions) or BeautifulSoup (an HTML parser) to find the link to the next page you need. Download that with urllib2. Rinse and repeat.
Scrapy is great, but you don't need it to do what you're trying to do.
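A minimal sketch of that loop, assuming Python 3 (where urllib2 became urllib.request) and BeautifulSoup; the start URL and the text of the "next" link are placeholders to adapt to the actual site:

from urllib.request import urlopen
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = 'http://example.com/list?page=1'  # hypothetical first page
while url:
    html = urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # ... extract whatever you need from this page here ...

    # find the link to the next page (the link text is an assumption)
    next_link = soup.find('a', string='Next')
    url = urljoin(url, next_link['href']) if next_link else None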

Why don't you want to add all the links to 50 pages? Are the URLs of the pages consecutive like www.site.com/page=1, www.site.com/page=2 or are they all distinct? Can you show me the code that you have now?

Related

Want to extract links and titles from a certain website with lxml and python but can't

I am Yasa James, 14, and I am new to web scraping.
I am trying to extract titles and links from this website.
As a so-called "Utako" and a want-to-be programmer, I want to create a program that extracts links and titles at the same time. I am currently using lxml because I can't download Selenium (limited, very slow internet, because I'm from a province in the Philippines) and I think it's faster than the other modules I've used.
here's my code:
from lxml import html
import requests

url = 'https://animixplay.to/dr.%20stone'
page = requests.get(url)
doc = html.fromstring(page.content)
# note: the XPath attribute axis is @id, not #id
anime = doc.xpath('//*[@id="result1"]/ul/li[1]/p[1]/a/text()')
print(anime)
One thing I've noticed is that whenever I try to grab the value of an element from any of the divs, I get an empty list as output.
I hope you can help me with this my Seniors. Thank You!
Update:
I used requests-html to fix my problem and now it's working. Thank you!
The reason this does not work is that the site you're trying to fetch uses JavaScript to generate the results, which means Selenium is your only option if you want to scrape the rendered HTML. Static fetching and parsing libraries like lxml and BeautifulSoup simply cannot see the result of JavaScript calls.
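As an illustration of the Selenium route (the asker ultimately used requests-html instead, as the update says), here is a minimal sketch assuming Chrome and a matching chromedriver are available; the URL and XPath are adapted from the question, with the li[1] index dropped to collect every entry, and the fixed sleep is a crude stand-in for an explicit wait:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('https://animixplay.to/dr.%20stone')
time.sleep(5)  # crude wait for the JavaScript-generated results to render

# XPath adapted from the question (@id, all list items instead of li[1])
anime = [a.text for a in driver.find_elements(By.XPATH, '//*[@id="result1"]/ul/li/p[1]/a')]
print(anime)
driver.quit()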

How to write directly in the SEARCH boxes of websites with R

I am looking for a way to do web scraping on a web page after typing in its search box. Let me explain better with an example: I am looking for an R function that types the word "notebook" into the search box on the Amazon home page so that I can subsequently scrape the results page that is generated.
Any help?
Any suggestions?
Maybe I could do it in Python?
Thanks everyone for the help.
In Python you have several modules designed for web scraping; here is a list of the most common ones:
Requests
Beautiful Soup 4
lxml
Selenium
Scrapy
Just scrape the webpage at
https://www.amazon.com/s?k=whatever you want to search
Almost any website will give you a URL with a query string when you search; just scrape from that URL.
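For example, a minimal Python sketch of that approach; the query is the one from the question, the User-Agent header is an assumption, and Amazon is known to block automated requests, so treat this as illustrative only:

import requests
from urllib.parse import quote_plus

query = 'notebook'
url = 'https://www.amazon.com/s?k=' + quote_plus(query)  # URL-encode the search term
resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})  # header is an assumption
print(resp.status_code, len(resp.text))  # scrape resp.text from here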

Sitemap creation with Scrapy

Is it possible to use Scrapy to generate a sitemap of a website including the URL of each page and its level/depth (the number of links I need to follow from the home page to get there)? The format of the sitemap doesn't have to be XML, it's just about the information. Furthermore I'd like to save the complete HTML source of the crawled pages for further analysis instead of scraping only certain elements from it.
Could somebody experienced in using Scrapy tell me whether this is a possible/reasonable scenario for Scrapy and give me some hints on how to find instructions? So far I could only find far more complex scenarios but no approach for this seemingly simple problem.
Add-on question for experienced web crawlers: given that it is possible, do you think Scrapy is even the right tool for this? Or would it be easier to write my own crawler with a library like requests etc.?
Yes, it's possible to do what you're trying with Scrapy's LinkExtractor. This will help you document the URLs for all of the pages on your site.
Once this is done, you can iterate through the URLs and the source (HTML) for each page using the urllib Python library.
Then you can use RegEx to find whatever patterns you're looking for within the HTML for each page in order to perform your analysis.
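As a rough illustration of that LinkExtractor approach, here is a minimal CrawlSpider sketch with placeholder names and URLs; the depth comes from Scrapy's built-in DepthMiddleware, and the page source is stored directly, though you could equally re-fetch each URL with urllib afterwards as described above.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SitemapSpider(CrawlSpider):
    name = 'sitemap'                       # hypothetical spider name
    allowed_domains = ['example.com']      # placeholder domain
    start_urls = ['https://example.com/']  # placeholder start URL

    # follow every internal link and hand each response to parse_item
    rules = (Rule(LinkExtractor(), callback='parse_item', follow=True),)

    def parse_item(self, response):
        yield {
            'url': response.url,
            'depth': response.meta.get('depth', 0),  # set by DepthMiddleware
            'html': response.text,                   # full source for later analysis
        }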

How can I scrape a site with multiple pages using beautifulsoup and python?

I am trying to scrape a website. This is a continuation of this question:
soup.findAll is not working for table
I was able to obtain the needed data, but the site has multiple pages which vary by the day. Some days it can be 20 pages and 33 pages on another. I was trying to implement this solution by obtaining the last page element (How to scrape the next pages in python using Beautifulsoup),
but when I got to the pager div on the site I want to scrape, I found this format:
<a class="ctl00_cph1_mnuPager_1" href="javascript:__doPostBack('ctl00$cph1$mnuPager','32')">32</a>
<a class="ctl00_cph1_mnuPager_1">33</a>
How can I scrape all the pages on the site given that the number of pages changes daily?
By the way, the page URL does not change when the page changes.
BS4 will not solve this issue, because it can't run JS.
First, you can try to use Scrapy and this answer.
You can also use Selenium for it.
I would learn how to use Selenium -- it's simple and effective in handling situations where BS4 won't do the job.
You can use it to log into sites, enter keys into search boxes, and click buttons on the screen. Not to mention, you can watch what it's doing with a browser.
I use it even when I'm doing something in BS4 to better monitor the progress of a scraping project.
Like some people have mentioned you might want to look at selenium. I wrote a blogpost for doing something like this a while back: http://danielfrg.com/blog/2015/09/28/crawling-python-selenium-docker/
Now things are much better with headless Chrome and Firefox.
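To make the Selenium route concrete for a __doPostBack-style pager like the one above, here is a rough headless-Chrome sketch; the URL is a placeholder, the pager class name is taken from the question, and it assumes the link for the next page number is always visible in the pager.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # recent Chrome headless mode
driver = webdriver.Chrome(options=options)
driver.get('https://example.com/report')  # placeholder URL

page = 1
while True:
    # ... scrape the current page's table from driver.page_source here ...
    page += 1
    # pager entries are <a> elements whose text is the page number
    links = [a for a in driver.find_elements(By.CSS_SELECTOR, 'a.ctl00_cph1_mnuPager_1')
             if a.text == str(page)]
    if not links:
        break  # no link for the next page number, so we are done
    links[0].click()
    time.sleep(2)  # crude wait for the postback to reload the page

driver.quit()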
Okay, so if I'm understanding correctly, there's an undetermined number of pages that you want to scrape? I had a similar issue, if that's the case. Inspect the pages past the content and see if there is an element that doesn't exist there but does exist on the pages with content.
In my for loop I used something like this:

import requests
from bs4 import BeautifulSoup

base_url = 'url here'
# 5000 is just a large number that what I was searching for wouldn't reach
for n in range(1, 5000):
    url = base_url + str(n)  # n is the page number at the end of my url
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    # "figure" is the element that didn't exist once the pages with content finished
    figure = soup.find_all('figure')
    if not figure:
        # break out of the page iteration; there isn't any content left on later pages
        break
I hope this helps some, or helps cover what you needed.

How to crawl a file hosting website with Scrapy in Python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted, but just to index all the available files with Scrapy.
I have read the tutorial and docs on Scrapy's spider classes. If I only give the website's main page as the beginning URL, it won't crawl the whole site, because the crawling depends on links and the beginning page doesn't seem to point to any file pages. That's the problem I am thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas you know that Scrapy is having difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable.
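For what it's worth, here is a minimal CrawlSpider sketch along those lines; the spider name, the extra start URLs, and the /file/ URL pattern are assumptions about filefactory.com's layout rather than verified selectors.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FileIndexSpider(CrawlSpider):
    name = 'fileindex'
    allowed_domains = ['filefactory.com']
    # multiple entry points, as suggested above (add listing pages you know about)
    start_urls = ['https://www.filefactory.com/']

    rules = (
        # index pages whose URLs look like file pages (the pattern is an assumption)
        Rule(LinkExtractor(allow=r'/file/'), callback='parse_file', follow=True),
        # keep following everything else to discover more links
        Rule(LinkExtractor(), follow=True),
    )

    def parse_file(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}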
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go onto Google, and type site:www.filefactory.com , you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, and collect all of these indexed urls first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program Scrapy to submit forms or sign in. Doing this might allow you even further access.
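As a hedged sketch of that idea, Scrapy's FormRequest.from_response can submit a login form before crawling begins; the login URL, the form field names, and the post-login URL below are placeholders you would need to check against the real site.

import scrapy

class LoggedInSpider(scrapy.Spider):
    name = 'loggedin'
    # placeholder login URL -- find the real one on the site
    start_urls = ['https://www.filefactory.com/login']

    def parse(self, response):
        # fill in the site's login form (field names are assumptions)
        return scrapy.FormRequest.from_response(
            response,
            formdata={'email': 'you@example.com', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # from here, crawl the members-only listing pages as usual (placeholder URL)
        yield scrapy.Request('https://www.filefactory.com/account/', callback=self.parse_account)

    def parse_account(self, response):
        yield {'url': response.url}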
