I am trying to learn how to use Scrapy with Python; I am not familiar with CSS.
The website I am trying to scrape: https://fantasydata.com/nfl-stats/point-spreads-and-odds?season=2018&seasontype=1&week=17
So when I copy the selector for the date, this is the result:
#stats_grid > div.k-grid-content.k-auto-scrollable > table > tbody > tr:nth-child(1) > td:nth-child(1) > span
When I bring up the Scrapy shell by running: scrapy shell "url"
and type in response.css('selector here')
I get no results!
How do I retrieve the date information?
Thanks for reading this message!
So the issue here is that the data you're trying to scrape is not available when Scrapy receives the page response.
If you have your browser's developer console open when the page loads, check the Network tab for the XHR request to this URL:
https://fantasydata.com/NFLTeamStats/Odds_Read
If you check its response, you'll see that it contains exactly the data you are trying to scrape. In other words, the data is loaded by the site's app via an HTTP fetch AFTER the initial page has loaded.
So when you use a web scraper (like Scrapy), you're unable to see that kind of data. You only get the initial page template; anything loaded by JavaScript afterwards is unavailable.
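If the endpoint accepts a plain request, you can often replicate that XHR directly with requests and skip HTML parsing entirely. The sketch below is only a guess at the shape of the call; the HTTP method and the payload keys (season, seasontype, week) are assumptions you should verify against the real request shown in the Network tab:

```python
import requests

ODDS_URL = "https://fantasydata.com/NFLTeamStats/Odds_Read"

def build_payload(season, week):
    # Assumed parameter names -- copy the real ones from the actual XHR
    # payload in your browser's Network tab before relying on this.
    return {"season": season, "seasontype": 1, "week": week}

def fetch_odds(season=2018, week=17):
    # Replicate the XHR; the response is structured data, so no CSS
    # selectors are needed at all.
    resp = requests.post(ODDS_URL, data=build_payload(season, week))
    resp.raise_for_status()
    return resp.json()
```
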
If you're looking for general NFL and fantasy related stats, there's an app called FFDB that allows you to easily create databases using its engine:
FFDB Github Repository
Disclaimer: I am the author of the app.
As a last note, be aware that the css tag is not relevant to this issue; a scraping or web-scraping tag would be more appropriate.
Best of luck!
Related
I'm new to Python & object-oriented programming in general. I'm trying to build a simple web scraper to create data frames from NBA contract data on basketball-reference.com. I had planned to use the requests library together with BeautifulSoup. However, the get method seems to be returning the site's homepage rather than the page affiliated with the URL I give.
I give a URL to a team's contracts page (https://www.basketball-reference.com/contracts/IND.html), but when I print the html it looks like it belongs to the homepage.
I haven't been able to find any documentation on the web about anyone else having this problem...
I'm using the Spyder IDE.
# Import library
import requests
# Assign the URL for contract scraping
url = 'https://www.basketball-reference.com/contracts/IND.html'
# Pull contracts page
page = requests.get(url)
# Check that correct page is being pulled
print(page.text)
This seems like it should be very straightforward, so I'm not understanding why the console is displaying html that clearly doesn't pertain to the page I'm trying to point to. I'm not getting any errors, just html from the homepage.
After checking the code on repl.it and visiting the webpage myself, I can confirm you are pulling in the correct page's HTML. The page variable contains the tables of data and their info, but also the page's advertisements, the contact info, the social media buttons and links, the adblock-detection scripts, and everything else on the webpage. Your issue isn't that you're getting the wrong page; it's that you're getting the entire page, not just the data.
You'll want to pick out the exact bits you're interested in - maybe by selecting the table and its child elements? The table's HTML id is contracts - that should be a good place to start.
(Try visiting the page in your browser, right-clicking anywhere on the page, and clicking "view page source" - that's what your program is pulling in. There's a LOT more to a webpage than most people realize!)
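To illustrate, here's a minimal sketch of picking out that table with BeautifulSoup. The HTML below is a trimmed stand-in for page.text (the player name and figure are made up), but the real page's table also carries id="contracts":

```python
from bs4 import BeautifulSoup

# Trimmed stand-in for page.text; the real page's table also has
# id="contracts", but its rows and values will differ.
html = """
<html><body>
<table id="contracts">
  <tr><th>Player</th><th>2018-19</th></tr>
  <tr><td>Victor Oladipo</td><td>$21,000,000</td></tr>
</table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="contracts")

# Collect the text of every cell, row by row
rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
print(rows)  # [['Player', '2018-19'], ['Victor Oladipo', '$21,000,000']]
```
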
As a word of warning, though, Sports Reference has a data use policy that precludes web crawlers / spiders on their site. I would recommend checking (and using) one of the free sites they link instead; you risk being IP banned otherwise.
Simply printing the result of the GET request in the terminal isn't very helpful: the returned HTML content is long, and your terminal will truncate it. The site likely reuses parts of the homepage (header, navigation, and so on) on other pages as well, so the output can look confusing.
I recommend writing the response into a file and then opening the file in the browser. You will see that your code is pulling the right page.
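A minimal sketch of that approach; html_text here is a made-up stand-in for page.text from the question's requests.get(...) call:

```python
from pathlib import Path

# Stand-in for page.text; in the question's code this would be the
# body returned by requests.get(url).
html_text = "<html><head><title>Indiana Pacers Contracts</title></head></html>"

# Write the response body to disk, then open saved_page.html in a
# browser to see which page was actually returned.
Path("saved_page.html").write_text(html_text, encoding="utf-8")
```
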
So I'm using Scrapy to scrape data from Amazon's books section. But I've learned that some of the data is dynamic. I want to know how dynamic data can be extracted from the website. Here's what I've tried so far:
import scrapy
from ..items import AmazonsItem

class AmazonSpiderSpider(scrapy.Spider):
    name = 'amazon_spider'
    start_urls = ['https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6']

    def parse(self, response):
        items = AmazonsItem()
        products_name = response.css('.s-access-title::attr("data-attribute")').extract()
        for product_name in products_name:
            print(product_name)
        next_page = response.css('li.a-last a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
I was using SelectorGadget to pick the class I have to scrape, but for a dynamic website it doesn't work.
So how do I scrape a website which has dynamic content?
What exactly is the difference between dynamic and static content?
How do I extract other information like price and image from the website? And how do I get particular classes, for example the price?
How would I know that data is dynamically created?
So how do I scrape a website which has dynamic content?
There are a few options:
Use Selenium, which lets you simulate opening a browser, let the page render, and then pull the HTML source code.
Sometimes you can look at the XHR requests and see if you can fetch the data directly (e.g. from an API).
Sometimes the data is inside the <script> tags of the HTML source. You can search through those and use json.loads() once you've manipulated the text into valid JSON.
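As a toy illustration of the third option, suppose the static HTML embeds its data in a script tag. The variable name __INITIAL_STATE__ and the fields below are made up for the example; real sites use their own names, which you find by searching the page source:

```python
import json
import re

# Toy page: the data lives in a <script> tag of the static HTML.
html = '''
<script>
  window.__INITIAL_STATE__ = {"products": [{"title": "Poirot", "price": 199}]};
</script>
'''

# Pull out the JSON object assigned in the script and parse it.
match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*\});", html)
data = json.loads(match.group(1))
print(data["products"][0]["title"])  # Poirot
```
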
What exactly is the difference between dynamic and static content?
Dynamic means the data is generated by a request made after the initial page request; static means all the data is there on the original call to the site.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
Refer to your first question
How would I know that data is dynamically created?
You'll know it's dynamically created if you see it in the dev tools' Elements panel but not in the raw HTML you first request (view-source). You can also check whether the data arrives via additional requests in the dev tools, under Network -> XHR.
Lastly
Amazon does offer an API to access the data. Try looking into that as well
If you want to load dynamic content, you will need to simulate a web browser. When you make an HTTP request, you only get the text returned by that request and nothing more. To simulate a web browser and interact with the data in it, use the selenium package for Python:
https://selenium-python.readthedocs.io/
So how do I scrape a website which has dynamic content?
Websites that have dynamic content have their own APIs from which they pull data. That data is not even fixed; it may be different if you check again after some time. But that does not mean you can't scrape a dynamic website: you can use automated browser frameworks like Selenium or Puppeteer.
What exactly is the difference between dynamic and static content?
As I explained in your first question: static data is fixed and stays the same, while dynamic data is updated periodically or changes asynchronously.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
For that, you can use libraries like BeautifulSoup in Python or Cheerio in Node.js. Their docs are quite easy to understand, and I highly recommend reading them thoroughly.
You can also follow this tutorial
how would I know that data is dynamically created?
While the page reloads, open the Network tab in Chrome dev tools. You will see many API requests working behind the scenes to provide the data for the page you are accessing. If requests like those exist, the website is dynamic.
So how do I scrape a website which has dynamic content?
To scrape dynamic content from websites, we need to let the web page load completely so that the data can be injected into it.
What exactly is the difference between dynamic and static content?
Content on static websites is fixed; it is not processed on the server and is returned directly from prebuilt source files.
Dynamic websites build their content on the server side at runtime. These sites can serve different data every time you load the page, or whenever the data is updated.
How would I know that data is dynamically created?
You can open the Dev Tools and open the Networks tab. Over there once you refresh the page, you can look out for the XHR requests or requests to the APIs. If some requests like those exist, then the site is dynamic, else it is static.
How do I extract other information like price and image from the website? and how to get particular classes for example like a price?
To extract dynamic content from websites we can use Selenium (in Python, one of the best options):
Selenium - an automated browser simulation framework
You can load the page, and use the CSS selector to match the data on the page. Following is an example of how you can use it.
import time

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.amazon.in/s?k=agatha+christie+books&crid=3MWRDVZPSKVG0&sprefix=agatha%2Caps%2C269&ref=nb_sb_ss_i_1_6")

# Give the dynamic content time to load
time.sleep(4)

# Note: find_elements_by_css_selector was removed in Selenium 4;
# use find_elements with By.CSS_SELECTOR instead
titles = driver.find_elements(By.CSS_SELECTOR,
                              ".a-size-medium.a-color-base.a-text-normal")
print(titles[0].text)
In case you don't want to use Python, there are other open-source options like Puppeteer and Playwright, as well as complete scraping platforms such as Bright Data that have built-in capabilities to extract dynamic content automatically.
Say I want to scrape the following url:
https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week
I have the following python code:
import requests
from lxml import html
urlToSearch = 'https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week'
page = requests.get(urlToSearch)
tree = html.fromstring(page.content)
print(tree.xpath('//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div/text()'))
The trouble is when I print the text at the following xpath:
//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div
nothing appears but [] despite me confirming that "Found 500+ tracks" should be there. What am I doing wrong?
The problem is that requests does not render dynamic content; it only fetches the static HTML.
Right click on the page and view the page source, you'll see that the static content does not include any of the content that you see after the dynamic content has loaded.
However, (using Chrome) open dev tools, click on network and XHR. It looks like you can get the data through an API which is better than scraping anyway!
The problem is that with modern websites, almost all pages change quite a lot after they are loaded, via JavaScript, CSS, etc. You fetch the basic HTML before any DOM updates have been made, so it looks different from what you see when visiting the page in a browser.
Use the Selenium WebDriver framework (mostly used for test automation); it will emulate loading the page, executing JavaScript, etc.
Selenium Documentation for Python
I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is the sortable.jsp link above
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the datagrid div, but it does not. It results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, the datagrid div is actually empty, and the stats are inserted dynamically as JSON from this URL. Maybe you can use that instead. To figure this out, I looked at the page source to see that the div had no children, then used the Network tab in the Chrome developer tools to find the request that pulled the data:
Open the web page
Open the chrome developer tools, Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so it processes the network requests then wait for the page to load
(optional) Type xml in the dev tools filter bar to narrow the results to requests that are likely to carry data
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and got yours on the first try since it has stats in the name.
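Once you've found the JSON request this way, you can load its body with the json module and index straight into it. The structure below mimics what MLB's sortable-stats endpoint returns; the field names (queryResults, row, etc.) are from memory and may differ, so check the response preview in the dev tools:

```python
import json

# Toy stand-in for the JSON body returned by the stats XHR; the real
# response has one entry per player and many more fields.
body = """{"stats_sortable_player": {"queryResults":
  {"row": [{"name_display_first_last": "Jose Altuve", "ab": "640"}]}}}"""

rows = json.loads(body)["stats_sortable_player"]["queryResults"]["row"]
print(rows[0]["name_display_first_last"], rows[0]["ab"])  # Jose Altuve 640
```
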
I'm trying to scrape https://www.grailed.com/ using Scrapy. I have been able to get the elements I want in each listing (price, item title, size). I am currently trying to get the hrefs for each listing on the home page.
When I try response.xpath('.//div[starts-with(@id, "product")]').extract
it returns
<bound method SelectorList.extract of [<Selector xpath='.//div[starts-with(@id, "product")]'
data=u'<div id="products">\n<div id="loading">\n<'>]>
Based on the inspect element, shouldn't it be returning <div class="feed-wrapper">?
I'm just trying to get those links so scrapy knows to go into each listing. Thank you for any help.
When you scrape, always check the source of the page (view-source, not the inspector); that is the real data you operate with.
That <div class="feed-wrapper"> is added dynamically after the page loads; JS does that job.
When you send a request to the server and receive plain HTML, the JS is not executed, so what you got back is the real server response, and that is what you must work with.