I can't figure out why BS4 is not seeing the text inside of the span in the following scenario:
Page: https://pypi.org/project/requests/
Text I'm looking for - number of stars on the left hand side (around 43,000 at the time of writing)
My code:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).text
also tried:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).get_text()
Both return an empty string ''. The element itself seems to be located correctly (I can browse through parents / siblings in PyCharm debugger without a problem. Fetching text in other parts of the website also works perfectly fine. It's just the github-related stats that fail to fetch.
Any ideas?
Because this page use Javascript to load the page dynamically.So you couldn't get it directly by response.text
The source code of the page:
You could crawl the API directly:
import requests
r = requests.get('https://api.github.com/repos/psf/requests')
print(r.json()["stargazers_count"])
Result:
43010
Using bs4, we can't scrape stars rate.
After inspecting the site, please check response html.
There, there is class information named "github-repo-info__item", but there is no text information.
in this case, use selenium.
Related
I am new to Python and I've been working on a program that alerts you when a new item is uploaded to jp.mercari.com (a shopping site). I have the alert part of the program working, but it operates based on the number of items that come up on the search results. When I scrape the website I am unable to find what I am looking for despite being able to locate it when I inspect element on the page. The scraping program looks like this:
from bs4 import BeautifulSoup
import requests
url = "https://jp.mercari.com/search?keyword=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("mer-text")
print(tag)
For more context, this is the website and some of the HTML. I've circled the parts I am trying to find in red:
Does anyone know why I am unable to find what I'm looking for?
Here is another example of the same problem but from a website that is in English:
import requests
url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("span")
print(tag)
Again, I can see the part of HTML I want to find when I inspect element but I can't find it when I scrape the website:
Here's what's happening with me: the element you seek (<mer-text>) is being found. However, the output is in Japanese, and Python doesn't know what to do with that. In my browser, it's being translated to English automatically by Google, so that's easier to deal with.
So to preface the website I've been trying to scrape seems to have/use (I'm unsure about the jargon with things relating to web development and the like) javascript code and I've been having varying success trying to scrape different tables on different pages.
For instance on this page: http://www.tennisabstract.com/cgi-bin/player.cgi?p=NovakDjokovic I was easily able to 'inspect element' then go to Network find the correct 'Name' of the script and then find the Request URL I needed to get the table that I wanted. The code I used for this was:
url = 'http://www.minorleaguesplits.com/tennisabstract/cgi-bin/frags/NovakDjokovic.js'
content = requests.get(url)
soup = BeautifulSoup(content.text, 'html.parser')
table = soup.find('table', id='tour-years', attrs= {'class':'tablesorter'})
dfs = pd.read_html(str(table))
df = pd.concat(dfs)
However, now when I'm looking at a different page on the same site, say this one http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html, I'm unable to find the Request URL that will allow me to eventually get the table that I want. I repeat the same process as I did above, but there's no .js script under the Network tab that has the table. I do see the table when I'm looking at the html elements, but of course I can't get it without the correct url.
So my question would be, how can I get the table from this page http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html ?
TIA!
On looking at the source code of the html page, you can see that all the data is already loaded in the script tag. Only thing you want is extract the variable value and load it to beautifulsoup.
The following code gives all the variables and the values from script tag
import requests, re
from bs4 import BeautifulSoup
res = requests.get("http://www.tennisabstract.com/charting/20190714-M-Wimbledon-F-Roger_Federer-Novak_Djokovic.html")
soup = BeautifulSoup(res.text, "lxml")
script = soup.find("script", attrs={"language":"JavaScript"}).text
var_only = script[:script.index("$(document)")].strip()
Next you can use regex to get the variable values - https://regex101.com/r/7cE85A/1
I am downloading the verb conjugations to aid my learning. However one thing I can't seem to get from this web page is the english translation near the top of the page.
The code I have is below. When I print results_eng it prints the section I want but there is no english translation, what am I missing?
import requests
from bs4 import BeautifulSoup
URL = 'https://conjugator.reverso.net/conjugation-portuguese-verb-ser.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results_eng = soup.find(id='list-translations')
eng = results_eng.find_all('p', class_='context_term')
In a normal website, you should be able to find the text in a paragraph witht the function get_text(), but in this case this is a search, wich means it's probably pulling the data from a database and its not in the paragraph itself. At least that's what I can come up with, since I tried to use that function and I got an empty string in return. Can't you try another website and see what happens?
p.d: I'm a beginner, sorry if I'm guessing wrong
Hope you're all well. I wrote a basic webscrape of an HTML site earlier today, along similar lines. I was following a tutorial, as you'll be able to see by my code I'm a bit of a green-horn to coding in Python. Hoping for a bit of guidance regarding scraping this site.
As you can see by the commented out code,
#print(results.prettify())
I am able to successfully able to print out the entire contents of the webpage. What I'd like to do however, is whittle down the contents of what I am printing out, so that I am just printing out the relevant content. There is a lot of content on the page that I don't want, and I'd like massage it out. Does anyone have any thoughts on why the for loop at the bottom of the code is not sequentially grabbing up the paragraphs in the xlmins unit of HTML and printing it out? Please see the below code for more.
import requests
from bs4 import BeautifulSoup
URL = "http://www.gutenberg.org/files/7142/7142-h/7142-h.htm"
page = requests.get(URL)
#we're going to create an object in Beautiful soup that will scrape it.
soup = BeautifulSoup(page.content, 'html.parser')
#this line of code takes
results = soup.find(xmlns='http://www.w3.org/1999/xhtml')
#print(results.prettify())
job_elems = results.find_all('p', xlmins="http://www.w3.org/1999/xhtml")
for job in job_elems:
paragraph = job.find("p", xlmins='http://www.w3.org/1999/xhtml')
print(paragraph.text.strip)
No <p> tag contains the attribute xlmins='http://www.w3.org/1999/xhtml', only the top HTML tag does. Remove that part and you'll get all the paragraphs.
job_elems = results.find_all('p')
for job in job_elems:
print(job.text.strip())
I'm trying to use BeautifulSoup to parse some HTML in Python. Specifically, I'm trying to create two arrays of soup objects: one for the dates of postings on a website, and one for the postings themselves. However, when I use findAll on the div class that matches the postings, only the initial tag is returned, not the text inside the tag. On the other hand, my code works just fine for the dates. What is going on??
# store all texts of posts
texts = soup.findAll("div", {"class":"quote"})
# store all dates of posts
dates = soup.findAll("div", {"class":"datetab"})
The first line above returns only
<div class="quote">
which is not what I want. The second line returns
<div class="datetab">Feb<span>2</span></div>
which IS what I want (pre-refining).
I have no idea what I'm doing wrong. Here is the website I'm trying to parse. This is for homework, and I'm really really desperate.
Which version of BeautifulSoup are you using? Version 3.1.0 performs significantly worse with real-world HTML (read: invalid HTML) than 3.0.8. This code works with 3.0.8:
import urllib2
from BeautifulSoup import BeautifulSoup
page = urllib2.urlopen("http://harvardfml.com/")
soup = BeautifulSoup(page)
for incident in soup.findAll('span', { "class" : "quote" }):
print incident.contents
That site is powered by Tumblr. Tumblr has an API.
There's a python port of Tumblr that you can use to read documents.
from tumblr import Api
api = Api('harvardfml.com')
freq = {}
posts = api.read()
for post in posts:
#do something here
for your bogus findAll, without the actual source code of your program it is hard to see what is wrong.