I am trying to parse the title of links using BeautifulSoup. I have tried various things but just can't get it to work.
The HTML is behind a login, so here's a screenshot of the relevant markup:
And here's my latest attempt which I was sure would work but just returns "None".
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')
links = soup.find_all('ul', class_='nav list-group')
print(links)

for link in links:
    title = link.get('title')
    print(title)
Can anyone see what I am doing wrong?
This line of code:
links = soup.find_all('ul', class_='nav list-group')
is not extracting the links; it is extracting the <ul> tags themselves. Instead, you could try extracting the links with something like:
links = soup.find_all('a', class_='odds')
Then you will be able to loop over them and extract your titles:
for link in links:
    print(link['title'])
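If some of those anchors might not carry a title attribute, link.get('title') returns None instead of raising a KeyError:

for link in links:
    print(link.get('title'))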
What happens?
You are selecting the <ul>, not its <a> elements, so you won't get any href or title values.
How to fix?
Select more specifically, e.g. with this CSS selector, which finds every <a> that has a title attribute inside your <ul>:
links = soup.select('ul.nav.list-group a[title]')
Example
Note: Your question could be improved by providing the relevant part of driver.page_source as text rather than as an image. The example below just reuses your code, so treat it as a hint.
from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'html.parser')

for link in soup.select('ul.nav.list-group a[title]'):
    title = link.get('title')
    print(title)
Related
I am trying to scrape the NBA.com play-by-play table, so I want to get the text of each box shown in the example picture.
For example: https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play.
Checking the HTML, I figured that each line is in an <article> tag containing a <div> that contains two <p> tags with the information I want. However, when I ran the following code, I got back 0 <article> tags and only 9 <p> tags (there should be many more), and even the tags I do get contain text that isn't from the boxes. Since I only get 9 tags, I am clearly doing something terribly wrong, and I am not sure what it is.
This is the code I use to get the tags:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def contains_word(t):
    return t and 'keyword' in t

url = "https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")

div_tags = soup.find_all('div', text=contains_word("playByPlayContainer"))
articles = soup.find_all('article')
p_tag = soup.find_all('p', text=contains_word("md:bg"))
thank you!
The page is rendered with JavaScript, so use Selenium to load it and then pass the rendered source to BeautifulSoup. You will also need to pip install selenium and download chromedriver.exe.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.nba.com/game/bkn-vs-cha-0022000032/play-by-play")
soup = BeautifulSoup(driver.page_source, "html.parser")
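From there, a rough sketch of how you might wait for the table to render and pull the rows; it assumes the play-by-play entries sit in <article> tags as described in the question (NBA.com's markup changes often, so treat these selectors as placeholders):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until at least one play-by-play row is present before reading the source
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "article"))
)
soup = BeautifulSoup(driver.page_source, "html.parser")

for article in soup.find_all("article"):
    # each row's clock, score and description sit in <p> tags (exact markup may differ)
    print(" | ".join(p.get_text(strip=True) for p in article.find_all("p")))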
This is the part of the HTML page I want to scrape:
I am trying to get the title and the value of cryptocurrencies using BeautifulSoup.
I have tried many solutions using find and find_all to get the content inside the div, but I don't see what is wrong. Here is an example of what I tried:
titles = soup.find_all("div", {"class": "tabTitle-qQlkPW5Y"})
Can you please help me with this?
My solution is to use Selenium to make sure the page is fully rendered, then navigate through its elements with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome(pathToChromeWebDriver)
url = "https://fr.tradingview.com/markets/cryptocurrencies/global-charts/"
driver.get(url)
html = driver.page_source

soup = BeautifulSoup(html, 'html.parser')
for title in soup.find_all("div", {"class": "tabTitle-qQlkPW5Y"}):
    print(title.string)
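If the chart widget loads asynchronously, it can also help to wait for the titles to appear before reading page_source, and to close the driver when done. A small sketch, assuming the tabTitle-... class from the question is still present in some form:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until at least one tab title element is rendered
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div[class*='tabTitle']"))
)
html = driver.page_source
driver.quit()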
This is my code to scrape all links in a webpage:
from bs4 import BeautifulSoup
import requests
import re

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))

links.close()
But it lists only the links present in the drop-downs. Why is that? Why did it not "see" the links to the news articles on the page? I actually want to scrape all the news articles. I tried the following to identify a tag and scrape the news-article links within it:
from bs4 import BeautifulSoup
import requests
import re

links = open("Life_and_health_links.txt", "a")

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.text, "html.parser")

li_box = soup.select('div.col-sm-5 > ul > li > h5 > a')
for link in li_box:
    print(link['href'])
But this, of course, displays only the links in that particular tag. To list links in other tags, I would have to run this code multiple times, each time specifying the tag whose links I want. How do I list all of the news-article links across all the tags, while skipping the links that are not news articles?
You need to do some research to find the common pattern for news links.
Try this, hope it works.
li_box = soup.select("div ul li h5 a")
for a in li_box:
    print(a['href'])
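Putting that together with the code from the question, a sketch that collects the article links and appends them to the file the question opens (the h5-based selector is an assumption about the site's markup, so verify it against the page source):

from bs4 import BeautifulSoup
import requests

page = requests.get("http://www3.asiainsurancereview.com/News")
soup = BeautifulSoup(page.content, "html.parser")

# article titles appear to sit inside <h5> headings; other <a> tags are navigation
hrefs = [a['href'] for a in soup.select("div ul li h5 a") if a.get('href')]

with open("Life_and_health_links.txt", "a") as links:
    for href in hrefs:
        print(href)
        links.write(href + "\n")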
What I need to retrieve is the href containing /questions/20702626/javac1-8-class-not-found, but the output I get from the code below is //stackoverflow.com:
from bs4 import BeautifulSoup
import urllib2

url = "http://stackoverflow.com/search?q=incorrect+operator"
content = urllib2.urlopen(url).read()
soup = BeautifulSoup(content)

for tag in soup.find_all('div'):
    if tag.get("class") == ['summary']:
        for tag in soup.find_all('div'):
            if tag.get("class") == ['result-link']:
                for link in soup.find_all('a'):
                    print link.get('href')
                    break
Instead of making nested loops, write a CSS selector:
for link in soup.select('div.summary div.result-link a'):
    print link.get('href')
Which is not only more readable, but also solves your problem. It prints:
/questions/11977228/incorrect-answer-in-operator-overloading
/questions/8347592/sizeof-operator-returns-incorrect-size
/questions/23984762/c-incorrect-signature-for-assignment-operator
...
/questions/24896659/incorrect-count-when-using-comparison-operator
/questions/7035598/patter-checking-check-of-incorrect-number-of-operators-and-brackets
Additional note: you might want to look into using the Stack Exchange API instead of the current web-scraping/HTML-parsing approach.
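For example, a minimal sketch of the same search against the public Stack Exchange API's /2.3/search/advanced endpoint (check its quotas and parameters before relying on it):

import requests

resp = requests.get(
    "https://api.stackexchange.com/2.3/search/advanced",
    params={"q": "incorrect operator", "site": "stackoverflow", "pagesize": 10},
)
for item in resp.json().get("items", []):
    print(item["link"])  # full URL of each matching question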
I am new to Python and learning it for scraping purposes. I am using BeautifulSoup to collect links (i.e. the href of <a> tags). I am trying to collect the links under the "UPCOMING EVENTS" tab of the site http://allevents.in/lahore/. I am using Firebug to inspect the element and get its CSS path, but this code returns nothing. I am looking for a fix, and also for suggestions on how to choose proper CSS selectors to retrieve the desired links from any site. I wrote this piece of code:
from bs4 import BeautifulSoup
import requests

url = "http://allevents.in/lahore/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data)

for link in soup.select('html body div.non-overlay.gray-trans-back div.container div.row div.span8 div#eh-1748056798.events-horizontal div.eh-container.row ul.eh-slider li.h-item div.h-meta div.title a[href]'):
    print link.get('href')
The page is not the most friendly in the use of classes and markup, but even so your CSS selector is too specific to be useful here.
If you want the Upcoming Events, you want just the first <div class="events-horizontal">, then grab only the <div class="title"><a href="..."></div> tags, i.e. the links on the titles:
upcoming_events_div = soup.select_one('div.events-horizontal')
for link in upcoming_events_div.select('div.title a[href]'):
    print(link['href'])
Note that you should not use r.text here; pass r.content instead and leave the decoding to Unicode to BeautifulSoup. See Encoding issue of a character in utf-8.
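In other words, something like:

soup = BeautifulSoup(r.content)  # let BeautifulSoup work out the encoding from the raw bytes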
import bs4, requests

res = requests.get("http://allevents.in/lahore/")
soup = bs4.BeautifulSoup(res.text)

for link in soup.select('a[property="schema:url"]'):
    print link.get('href')
This code will work fine!!