Scrapy: Get data on page and following link - python

I have been using scrapy for a personal project. My problem is very similar to the question asked on the following page:
Scrapy: Follow link to get additional Item data?
The page I am scraping is the following:
http://www.tennisinsight.com/player_activity.php?player_id=51
This page has a list of matches in this form, for example:
Round of 16 Def. Ivan Dodig(37,2.41) (CRO) 6-3 6-3 Recap Match Stats $1.043
I have written Scrapy code that opens every "Match Stats" link on the page and scrapes the data on the linked page into an individual record.
In addition to this, I want to scrape the "Odds" column (which is the $1.043 above) and add this data to the record.
I have searched for an answer and it seems that I have to use the Request meta field and pass this data along to the parse method. However, I have a problem because I am struggling to incorporate it into my code. The answer from the stackoverflow link I linked above is "To scrape additional fields which are on other pages, in a parse method extract URL of the page with additional info, create and return from that parse method a Request object with that URL and pass already extracted data via its meta parameter."
This makes perfect sense. However, the URLs that I scrape are defined in the rules, so I don't know how to extract the required data.
Here is part of my code so far which will hopefully better explain my problem.
rules = (
    Rule(SgmlLinkExtractor(allow=r"match_stats_popup.php\?matchID=\d+",
                           restrict_xpaths='//td[@class="matchStyle"]',
                           tags='a', attrs='href', process_value=getPopupLink),
         callback='parse_match', follow=True),
)
The parse_match function parses the match stats into one item.
So each of these match stats links is opened, and there is no way for me to access the main page's Odds column from parse_match.
Any help will be much appreciated.
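One way to act on the quoted advice is to stop relying on the Rule for the "Match Stats" links and instead walk the listing page row by row, so that each link is read together with the odds from the same row. The sketch below shows only that pairing step, using the standard library's html.parser on a made-up row layout. The class name "odds", the plain href, and the sample markup are assumptions, not the real page structure. In a Scrapy spider, each (url, odds) pair would then go into a Request with callback=self.parse_match and meta={'odds': odds}, and parse_match would read the value back from response.meta['odds'].

```python
from html.parser import HTMLParser


class MatchRowParser(HTMLParser):
    """Collect (match_stats_url, odds) pairs, one per table row.

    Assumed layout (hypothetical): the link sits in <td class="matchStyle">
    and the odds text in <td class="odds"> of the same <tr>.
    """

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._url = None
        self._odds = None
        self._cell = None  # class of the <td> currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "tr":
            # A new row starts: forget whatever the previous row held.
            self._url, self._odds = None, None
        elif tag == "td":
            self._cell = attrs.get("class")
        elif tag == "a" and self._cell == "matchStyle":
            href = attrs.get("href", "")
            if "match_stats_popup.php" in href:
                self._url = href

    def handle_data(self, data):
        if self._cell == "odds" and data.strip():
            self._odds = data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._cell = None
        elif tag == "tr" and self._url and self._odds:
            # Keep only rows that have both a stats link and an odds value.
            self.pairs.append((self._url, self._odds))


def pair_links_with_odds(html):
    """Return the list of (url, odds) pairs found in html."""
    parser = MatchRowParser()
    parser.feed(html)
    return parser.pairs


# Invented sample rows, for illustration only.
SAMPLE = """
<table>
<tr><td class="matchStyle"><a href="match_stats_popup.php?matchID=123">Match Stats</a></td><td class="odds">$1.043</td></tr>
<tr><td class="matchStyle">no stats link</td><td class="odds">$2.10</td></tr>
</table>
"""

print(pair_links_with_odds(SAMPLE))
```

Rows without a stats link are skipped, so every request that is created carries the odds that belong to it.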

Related

Scraping data from a webpage based on VIEWSTATES

I'm attempting to scrape the details of all documents on this link.
The problem I'm facing is that the site is built with ASP.NET, and the ViewState mechanism doesn't allow me to access the data directly. I tried a mixture of BeautifulSoup, Scrapy and Selenium, but to no avail. The data consists of 12782 documents; for each one I need the PDF download link, which is on the page that each entry in the search results redirects to.
The site also has an API here, but the catch is that it returns at most 2000 data points per request, so fetching all ~12k data points in one call is out of the question.
Can someone help me with ANY ONE of the following:
Create a scraper to get the pdf links
Generate a query to get all the data from the API
Any recurrence relation that helps me generate links to get the queries for the API
Using the requests section in the API to get all the records at the same time delivered to your email
Ideally, a solution in python would be great, but if you can help me get a csv file of all the links, that would also work. Thanks in advance!
I ended up solving the problem by using the request functionality which was located here.
It took in a particular query and my email address and sent me the entire data dump I needed. From that data dump, I could use all the pdf links.

How to scrape information about a specific product using search bar

I'm building a system, mostly in Python with Scrapy, that finds information about a specific product. The problem is that the request URL is huge. I understand that I should replace parts of it with variables to reach the specific product I want to search for, but the URL has so many fields that I don't know how to do it.
e.g.: "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt"
"demi+lovato+365+dias+do+ano" is the book title, but the URL contains a lot of other information that I can't supply and that changes from title to title. One solution I considered was to POST the title to the search bar and find the product on the results page, but I don't know whether that's the best approach, since this is the first time I'm working with web scraping.
Does anyone have a tip on how I can do that? All I could find was how to scrape all products for price comparison, how to scrape specific information about all of those products, and similar things, but nothing about searching for a specific product.
Thanks for any contributions. This is very important to me. Apologies for any mistakes: I'm not a frequent user here and I'm not a native English speaker.
Feel free to give me advice about how to behave on the site; I always aim to improve.
You should use the rules available in the Scrapy framework. They let you define how to navigate the site and its sub-pages. You can also configure tags other than anchors, such as span or div, to look for link URLs. This way, the additional query parameters in the link are populated by the Scrapy session, as it emulates a click on the hyperlinks. If you skip the additional query parameters in the URL, there is a high chance that you will be blocked.
How does scrapy use rules?
You don't need to follow that long link at all, often the different parameters are associated with your current session or settings/filters and you can keep only what you need.
Here is what I meant:
You can generate same result using these 2 urls:
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano
https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano&adgrpid=86887777368&hvadid=392971063429&hvdev=c&hvlocphy=9047761&hvnetw=g&hvpos=1t1&hvqmt=e&hvrand=11390662277799676774&hvtargid=kwd-597187395757&hydadcr=5658_10696978&tag=hydrbrgk-20&ref=pd_sl_21pelgocuh_e%2Frobot.txt
If both links return the same results, you're done. Otherwise you will have to experiment with the individual parameters, since you can't predict a website's behavior without actually testing it. If you do need to keep several parameters, try something like:
from urllib.parse import quote_plus
title = "demi lovato 365 dias do ano"  # the product title to search for
base_url = "https://www.amazon.com.br"
link = base_url + "/s?k=%s&adgrpid=%s&hvadid=%s" % (quote_plus(title), '86887777368', '392971063429')
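If the short URL turns out to be enough, the same trimming can be done programmatically rather than by hand. The sketch below keeps only the query parameters you name (here just k) and drops the rest. clean_search_url and its keep argument are illustrative names, and whether the site returns the same results without the other parameters is something you would still have to verify yourself.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def clean_search_url(url, keep=("k",)):
    """Return url with every query parameter dropped except those in keep."""
    parts = urlsplit(url)
    kept = [(name, value) for name, value in parse_qsl(parts.query) if name in keep]
    # urlencode re-applies quote_plus, so spaces become "+" again.
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Shortened version of the URL from the question.
long_url = (
    "https://www.amazon.com.br/s?k=demi+lovato+365+dias+do+ano"
    "&adgrpid=86887777368&hvadid=392971063429&hvdev=c"
)

print(clean_search_url(long_url))
```

Passing a longer keep tuple, for example keep=("k", "hvdev"), lets you add parameters back one at a time while testing which ones the site actually needs.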

Get method from requests library seems to return homepage rather than specific URL

I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data, as well as their info... and also the page's advertisements, the contact info, the social media buttons and links, the adblock detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page, it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
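To make "selecting the table and its child elements" concrete, here is a minimal sketch that pulls the rows out of the table with id contracts using only the standard library. The sample HTML is invented for illustration and is much simpler than the real page. With BeautifulSoup, the equivalent starting point would be soup.find('table', id='contracts').

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every row inside <table id=table_id>."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.rows = []
        self._in_table = False
        self._in_cell = False
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self._in_table = True
        elif self._in_table and tag == "tr":
            self._row = []
        elif self._in_table and tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_table = False
        elif tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._in_table and self._row is not None:
            self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def extract_table_rows(html, table_id):
    """Return a list of rows, each a list of cell strings."""
    parser = TableRows(table_id)
    parser.feed(html)
    return parser.rows


# Invented sample page: navigation noise plus the table of interest.
SAMPLE = """
<div id="nav">Home | Teams | Players</div>
<table id="contracts">
  <tr><th>Player</th><th>2019-20</th></tr>
  <tr><td>Player A</td><td>$1,000,000</td></tr>
</table>
"""

print(extract_table_rows(SAMPLE, "contracts"))
```

Everything outside the chosen table (navigation, ads, scripts) is ignored, which is the filtering step the raw print(page.text) output was missing.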
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the get request on the terminal won't be very helpful, as the HTML page content returned is long - your terminal will truncate the printed response. I'm assuming in your case maybe the website has parts of the homepage reused in other pages as well, so it might get confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
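A sketch of that check, assuming nothing beyond the standard library: save_html is a hypothetical helper that writes the HTML to disk and returns the path, so the saved copy can be opened in a browser.

```python
from pathlib import Path


def save_html(html, path="page.html"):
    """Write html to path as UTF-8 and return the absolute file path."""
    target = Path(path).resolve()
    target.write_text(html, encoding="utf-8")
    return target


# Usage with the question's code (not run here):
#   import webbrowser
#   saved = save_html(page.text)
#   webbrowser.open(saved.as_uri())
```

Opening the saved file shows the full page as the server returned it, which makes it easy to confirm that the contracts table is present.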

Scraping a website for specific data where URLs are inconsistent

I want to scrape http://www.narrpr.com/ for data, but I'm running into an issue. Most of the time, formatting URLs to access the specific pages you want to scrape is easy. However, in this instance, the URLs are formatted in the following fashion (for example):
http://www.narrpr.com/homes/mo/independence/64055/2412-s-ellison-way/38664800-summary.aspx
Where the number 38664800 appears to be some kind of unique ID.
When navigating the site manually, I enter 2412 S Ellison Way into the form, which then redirects me to the URL above.
How can I programmatically reach the correct page without needing to know that ID? Alternatively, how can I get that ID?
Thanks.
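Since the site itself performs the redirect, one option is to let an HTTP client submit the search form, follow the redirect, and then read the ID out of the final URL (with the requests library, that is response.url). Only that last step is sketched here. The pattern assumes every listing URL ends in digits followed by -summary.aspx, as the single example above does, and whether the form can be submitted without a browser is not something the example can confirm.

```python
import re

# Assumption: listing URLs end with "/<numeric id>-summary.aspx".
LISTING_ID = re.compile(r"/(\d+)-summary\.aspx$")


def extract_listing_id(url):
    """Return the numeric ID from a listing URL, or None if it does not match."""
    match = LISTING_ID.search(url)
    return match.group(1) if match else None


final_url = (
    "http://www.narrpr.com/homes/mo/independence/64055/"
    "2412-s-ellison-way/38664800-summary.aspx"
)

print(extract_listing_id(final_url))  # 38664800
```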

Parsing ajax responses to retrieve final url content in Scrapy?

I have the following problem:
My scraper starts at a "base" URL. This page contains a dropdown that creates another dropdown via ajax calls, and this cascades 2-3 times until it has all the information needed to get to the "final" page where the actual content I want to scrape is.
Rather than clicking things (and having to use Selenium or similar), I use the page's exposed JSON API to mimic this behavior. Instead of clicking dropdowns, I simply send a request and read the JSON response, which contains the array of information used to generate the next dropdown's contents, and I repeat this until I have the final URL for one item. This URL takes me to the final item page that I actually want to parse.
I am confused about how to use Scrapy to get the "final" url for every combination of dropdown boxes. I wrote a crawler using urllib that used a ton of loops to just iterate through every combination of url, but Scrapy seems to be a bit different. I moved away from urllib and lxml because Scrapy seemed like a more maintainable solution, which is easier to integrate with Django projects.
Essentially, I am trying to force Scrapy to take a certain path that I generate along the way as I read the contents of the json responses, and only really parse the last page in the chain to get real content. It needs to do this for every possible page, and I would love to parallelize it so things are efficient (and use Tor, but these are later issues).
I hope I have explained this well, let me know if you have any questions. Thank you so much for your help!
Edit: Added an example
[base url]/?location=120&section=240
returns:
<departments>
<department id="62" abrev="SIG" name="name 1"/>
<department id="63" abrev="ENH" name="name 2"/>
<department id="64" abrev="GGTH" name="name 3"/>
...[more]
</departments>
Then I grab the department id, add it to the url like so:
[base url]/?location=120&section=240&department_id=62
returns:
<courses>
<course id="1" name="name 1"/>
<course id="2" name="name 2"/>
</courses>
This continues until I end up with the actual link to the listing.
This is essentially what this looks like on the page (though in my case, there is a final "submit" button on the form that sends me to the actual listing that I want to parse):
http://roshanbh.com.np/dropdown/
So, I need some way of scraping every combination of the dropdowns so that I get all the possible listing pages. The intermediate step of walking the ajax xml responses to generate final listing URLs is messing me up.
You can use a chain of callback functions, starting from the main callback. Say you're implementing a spider that extends BaseSpider; write your parse function like this:
...
def parse(self, response):
    # other code
    yield Request(url=self.base_url, callback=self.first_dropdown)

def first_dropdown(self, response):
    # parse_first_response: your code for parsing the first dropdown's content (returns ID strings)
    ids = self.parse_first_response(response)
    for i in ids:
        req_url = response.url + "/?location=" + i
        yield Request(url=req_url, callback=self.second_dropdown)

def second_dropdown(self, response):
    # parse_second_response: your code for parsing the second dropdown's content (returns ID strings)
    ids = self.parse_second_response(response)
    for i in ids:
        req_url = response.url + "&section=" + i
        yield Request(url=req_url, callback=self.third_dropdown)
...
The last callback function will have the code needed to extract your data.
Be careful: you're asking to try every possible combination of inputs, and this can lead to a very high number of requests very quickly.
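Each intermediate callback has the same two jobs: read the IDs out of the XML response and build the next level's URLs. The sketch below isolates that step with the standard library so it can be tried without running a crawl. next_level_urls is a hypothetical helper, example.com stands in for the elided base URL, and the sample XML and parameter name follow the example in the question. Inside a Scrapy callback you would pass it the response body and yield one Request per returned URL.

```python
import xml.etree.ElementTree as ET


def next_level_urls(current_url, xml_text, param):
    """Return one URL per child element, appending param=<child id> to current_url."""
    root = ET.fromstring(xml_text.strip())
    return ["%s&%s=%s" % (current_url, param, child.get("id")) for child in root]


# Sample response, shaped like the <departments> example in the question.
DEPARTMENTS = """
<departments>
  <department id="62" abrev="SIG" name="name 1"/>
  <department id="63" abrev="ENH" name="name 2"/>
</departments>
"""

# Placeholder base URL; the real one is not given in the question.
current = "http://example.com/?location=120&section=240"

for url in next_level_urls(current, DEPARTMENTS, "department_id"):
    print(url)
```

The number of URLs multiplies at every level (locations times sections times departments and so on), which is why the request count grows so fast.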
