How do I use Scrapy with no specific start URL? - python

I'm attempting to use Scrapy to collect data from http://www.guidestar.org search results, but because of the way the website is set up, when I make a specific search for an organization, the URL for the results is just
http://www.guidestar.org/SearchResults.aspx
which I can't plug in as the start URL since it doesn't link to any actual search results. Any ideas on how to get around this?
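A common workaround (a minimal sketch, not verified against guidestar.org) is to skip start_urls entirely and override start_requests so the spider POSTs the search form itself with scrapy.FormRequest. The "SearchText" field name below is a hypothetical placeholder; inspect the request your browser sends when you run a search to find the real field names.

import scrapy

class GuidestarSpider(scrapy.Spider):
    name = "guidestar"

    def start_requests(self):
        # POST the search form directly instead of starting from a results URL.
        # "SearchText" is a hypothetical field name; copy the real ones from
        # your browser's network tab.
        yield scrapy.FormRequest(
            "http://www.guidestar.org/SearchResults.aspx",
            formdata={"SearchText": "red cross"},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info("Fetched %d bytes of search results", len(response.body))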

Related

How can I find a dynamic url on a webpage?

I want to know how you can find a dynamic URL on a website; primarily, I am looking for the search URL that the search term gets appended to. For example, how would I find the link https://www.abbeywhisky.com/pages/search-results-page?q= starting from the front page of https://www.abbeywhisky.com?
I am unsure if there is a way to do this by just using the landing page of a site. I would have tried to scrape the first page of the site, but filtering for "?=" does not show any results.
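One approach worth trying (a sketch, assuming the search box lives in a plain HTML form) is to fetch the landing page and read the action attribute and input names of each form; that usually exposes the search-results URL pattern without any string filtering:

import requests
from parsel import Selector  # parsel ships with Scrapy

html = requests.get("https://www.abbeywhisky.com").text
sel = Selector(text=html)
for form in sel.css("form"):
    # The form's action plus its input names reveal the search URL pattern,
    # e.g. an action of /pages/search-results-page with an input named "q"
    print(form.attrib.get("action"), form.css("input::attr(name)").getall())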

How to find the correct URL when you made some choices on the web page?

I'm very new to web scraping. Using XPath selectors, I am trying to extract data from this webpage: https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml
The point is, whenever you change the date or the power plant name, the URL does not change, so when you fetch the response you always get the same, wrong answer. Is there a way to find the correct URL, or anything else related to the HTML markup, etc.?
For a scraping operation like this, you'll need to do a bit more than just load the document and then grab the content. The document in question relies on JavaScript to load new information from some other resource after the user has defined a particular set of parameters and submitted the form.
After loading the document, you'll need to define your search parameters. You can do this via JavaScript injection or via your browser's console. For example, if you were trying to define the value for the first date field, you could use
document.querySelectorAll('#j_idt199 input')[1].value = "Some/New/Date";
Repeat this process for the other fields you wish to define in your search, and then run the following code to programmatically execute your search:
document.querySelector('#j_idt199 button').click();
After that, you can either grab the information you want using plain JS query selectors, or you can implement a scraping library like artoo.js to help you interpret the data and export it.
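If you would rather drive this from Python than from the browser console, here is a minimal Selenium sketch (assuming Chrome in headless mode; the selectors are the same #j_idt199 ones used above, and the date value is a placeholder in whatever format the site expects):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
driver.get("https://seffaflik.epias.com.tr/transparency/uretim/planlama/kgup.xhtml")

# Inject the date value, exactly as in the console snippet above
driver.execute_script(
    "document.querySelectorAll('#j_idt199 input')[1].value = arguments[0];",
    "01/01/2021",  # placeholder date; match the site's expected format
)
# Trigger the search programmatically
driver.execute_script("document.querySelector('#j_idt199 button').click();")

# Once the page re-renders, grab the updated HTML for parsing
# (in practice you may need a WebDriverWait for the table to refresh)
html = driver.page_source
driver.quit()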

How to scrape information about a specific product using search bar

I'm making a system, mostly in Python with Scrapy, in which I can, basically, find information about a specific product. The thing is that the request URL is huge. I got a clue that I should replace some parts of it with variables to reach the specific product I would like to search for, but the URL has so many fields that I don't know, for sure, how to build it.
e.g: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" it's the book title, but I can see a lot of information on URL that I simply can't supply and of course, it changes from title to title. One solution I thought could be possible was to POST on search bar the title in which I was looking for and find it on result page but I don't know if it's the best approach since in fact, this is the first time I'll be working with web scraping.
Someone has some tip for how can I do that. All I could find was how to scrape all products for price comparison, scrape specific information about all these products and things like that but nothing about search for specific products.
Thanks for any contribs, this is very important for me and sorry about anything, I'm not a very present user and I'm not an English native speaker.
Feel free to make me any advice about user behavior, be better is always something I aim to.
You should use the Rule class available in the Scrapy framework. It will help you define how to navigate the site and its sub-pages. Additionally, you can configure tags other than anchor tags, such as span or div, to look for the URL of a link. That way, the additional query parameters in the link will be populated by the Scrapy session as it emulates clicks on the hyperlinks. If you skip the additional query parameters in the URL, there is a high chance that you will be blocked.
How does scrapy use rules?
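As an illustration of that advice, here is a minimal CrawlSpider sketch (the spider name, the /dp/ allow pattern, and the item fields are placeholders for whatever site you are targeting) that widens the link extractor beyond anchor tags:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SearchSpider(CrawlSpider):
    name = "search"  # placeholder name
    allowed_domains = ["amazon.com.br"]
    start_urls = ["https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano"]

    rules = (
        # Follow product links; tags/attrs widen extraction beyond <a href>
        Rule(
            LinkExtractor(allow=r"/dp/", tags=("a", "span", "div"), attrs=("href",)),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}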
You don't need to follow that long link at all; often the extra parameters are associated with your current session or settings/filters, and you can keep only what you need.
Here is what I meant:
You can generate the same result using these 2 URLs:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links generate the same results, then that's it. Otherwise, you will definitely have to play with the different parameters; you can't predict a website's behavior without actually doing the test. If having a lot of parameters is an issue, then try something like:
from urllib.parse import quote_plus

title = "demi lovato 365 dias do ano"  # the search term
base_url = "https://www.amazon.com.br"
# Note the "/s?k=" search path; keep only the parameters you actually need
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (quote_plus(title), '86887777368', '392971063429')

Scraping Google news search

I am trying to get the number of results from a Google News search for a specific day. In a browser this is easy: do a Google search, click the "News" tab, click "Tools", then change the time period to the date you want, then click "Tools" again and you can see a count of how many stories it found.
The start and end dates can be seen in the URL. For example here is a search for "stack overflow" over the past week - https://www.google.com/search?q=stack+overflow&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F3%2F2018%2Ccd_max%3A1%2F10%2F2018&tbm=nws
The problem is when I try to request one of these URLs it gives me the current results for it and ignores the date range I specify. I can change these parameters around in my browser and the results change as expected, it just doesn't work programmatically.
I have tried several ways in both Python and C#, always with the same results.
For example -
import requests
response = requests.get('https://www.google.com/search?q=stack+overflow&source=lnt&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F10%2F2018&tbm=nws')
print(response.content)
I finally found a working method using a headless web browser and Selenium. I suppose it has something to do with a simple request not being able to get at the content generated by JavaScript. I would still be interested in hearing an explanation or other ways to do this, though.
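For reference, a minimal headless Selenium sketch (assuming Chrome; the #result-stats selector is an assumption about Google's current markup and may break without notice):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

opts = Options()
opts.add_argument("--headless")
driver = webdriver.Chrome(options=opts)
driver.get(
    "https://www.google.com/search?q=stack+overflow&source=lnt"
    "&tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A1%2F10%2F2018&tbm=nws"
)
# "#result-stats" is an assumption about Google's markup for the result count
stats = driver.find_element(By.CSS_SELECTOR, "#result-stats").text
print(stats)  # e.g. "About 1,230 results"
driver.quit()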

How to crawl a file hosting website with Scrapy in Python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted, just to index all the available files with Scrapy.
I have read the tutorial and docs on the spider classes for Scrapy. If I only give the website's main page as the beginning URL, it won't crawl the whole site, because crawling depends on links, and the beginning page doesn't seem to point to any file pages. That's the problem I'm thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to ensure that you are using Scrapy correctly, and the second pertains to the best way to collect a larger sample of the URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links that each page has.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy is encountering, then there is no way that Scrapy can know about them.
One way to counter this would be to use multiple entry points on the website, in the areas where you know Scrapy is having difficulty finding links. You can do this by putting multiple initial URLs in the start_urls variable.
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go to Google and type site:www.filefactory.com, you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com, because there are some canonicalization issues. Now, when I did this, I saw that there were around 600,000 pages indexed. What you should do is crawl Google, collect all of these indexed URLs first, and store them in a database. Then, use all of these to seed further searches on the FileFactory.com website.
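For example, once the indexed URLs are stored, a sketch of seeding the spider from them (reading from a hypothetical seed_urls.txt file rather than a database, for brevity) could be:

import scrapy

class SeededSpider(scrapy.Spider):
    name = "filefactory_seeded"  # placeholder name

    def start_requests(self):
        # "seed_urls.txt" is a hypothetical file of URLs collected from
        # Google's index, one URL per line
        with open("seed_urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}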
Also
If you have a membership to Filefactory.com, you can also program Scrapy to submit forms or sign in. Doing this might allow you even further access.
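A hedged sketch of the sign-in part, using scrapy.FormRequest.from_response (the login URL and the email/password field names are assumptions; inspect the real sign-in form to get them right):

import scrapy

class MemberSpider(scrapy.Spider):
    name = "filefactory_member"  # placeholder name
    start_urls = ["http://www.filefactory.com/member/signin.php"]  # assumed login URL

    def parse(self, response):
        # Field names are assumptions; copy them from the actual <form>
        return scrapy.FormRequest.from_response(
            response,
            formdata={"email": "you@example.com", "password": "secret"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # A rough logged-in check; adjust to whatever the real page shows
        if b"Sign Out" in response.body:
            self.logger.info("Logged in; crawl member-only pages from here")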
