Beautiful Soup and requests not getting full page [duplicate]

This question already has an answer here:
Beautiful Soup 4 find_all don't find links that Beautiful Soup 3 finds
(1 answer)
Closed 8 years ago.
My code looks like this.
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data)
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print tab.findAll('td')[1].text
What I am getting is the property names up to IDYLLIC SUITES; after that I get the error "list index out of range". What is the problem?

I am not sure what exactly is bothering you, because when I tried your code (as it is) it worked for me.
Still, try changing the parser, maybe to html5lib.
So do,
pip install html5lib
And then change your code to,
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
data = r.text
soup = BeautifulSoup(data,'html5lib') # Change of Parser
n = soup.findAll('table')[7].findAll('table')
for tab in n:
    print tab.findAll('td')[1].text
Let me know if it helps
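If switching parsers alone doesn't help, the IndexError itself means some of the nested tables have fewer than two <td> cells. A minimal defensive sketch (assuming, as in your code, that table index 7 holds the listings):
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.data.com.sg/iCurrentLaunch.jsp")
soup = BeautifulSoup(r.text, 'html5lib')

for tab in soup.findAll('table')[7].findAll('table'):
    cells = tab.findAll('td')
    if len(cells) > 1:  # skip tables without a second cell
        print cells[1].text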

Related

Can't get for loop to work while parsing HTML using Beautiful Soup 4

I'm using the Beautiful Soup documentation to help me understand how to implement it. I'm not too familiar with Python as a whole, so maybe I'm making a syntax error, but I don't believe so. The code below should print out any links from the main Etsy page, but it's not doing that. The documentation states something similar to this, but maybe I'm missing something. Here's my code:
#!/usr/bin/python3
# import library
from bs4 import BeautifulSoup
import requests
import os.path
from os import path
# Request to website and download HTML contents
url='https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
req=requests.get(url)
content=req.text
soup=BeautifulSoup(content, 'html.parser')
for x in soup.head.find_all('a'):
    print(x.get('href'))
The HTML prints if I set it up that way, but I can't get the for loop to work.
If you're trying to get all the <a> (link) tags from the specified URL, then:
url = 'https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy_e&utm_campaign=Search_US_Brand_GGL_ENG_General-Brand_Core_All_Exact&utm_ag=A1&utm_custom1=_k_Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB_k_&utm_content=go_227553629_16342445429_536666953103_kwd-1818581752_c_&utm_custom2=227553629&gclid=Cj0KCQiAi8KfBhCuARIsADp-A54MzODz8nRIxO2LnGcB8Ezc3_q40IQk9HygcSzz9fPmPWnrITz8InQaAt5oEALw_wcB'
with requests.get(url) as r:
    r.raise_for_status()
    soup = BeautifulSoup(r.text, 'lxml')
    if (body := soup.body):
        for a in body.find_all('a', href=True):
            print(a['href'])
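Note that your original loop searched soup.head, while anchor tags live in the document body, so searching the body (or the whole soup) is what actually yields links. Also be aware that Etsy may serve different or JavaScript-rendered markup to non-browser clients, so the list can still come back empty.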

Extract text with Beautiful Soup [duplicate]

This question already has answers here:
BS4 Beautiful Soup extract text from find_all
(2 answers)
Closed 2 years ago.
I'm trying to learn how to use Beautiful Soup, using this website as a very simple example:
https://www.espncricinfo.com/ci/content/ground/56490.html#Profile
Let's say I want to extract the capacity of the ground. I have so far written the following code, which gives me the field names, but I can't seem to understand how to get the actual value of 18,000.
Can anyone help?
url="https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text)
soup.findAll('label')
Perhaps something like
from bs4 import BeautifulSoup
import requests
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    print(f"{e.text}: {e.next_sibling}")

scraping youtube website using beautiful soup [duplicate]

This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
    print(each.get('href'))
But it is returning nothing. What is wrong with this code?
BeautifulSoup is not the right tool for YouTube scraping - YouTube generates a lot of its content using JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(note that the video links you see in the browser are not present in this list)
You need to use another solution for that - Selenium might be a good choice. Please have a look at this thread for details: Fetch all href link using selenium in python
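A minimal Selenium sketch (assuming Selenium 4+, a Chrome driver available on PATH, and YouTube's a#video-title selector, which can change at any time):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://www.youtube.com/results?search_query=python")
    driver.implicitly_wait(10)  # give the JavaScript-rendered results time to appear
    for a in driver.find_elements(By.CSS_SELECTOR, "a#video-title"):
        print(a.get_attribute("href"))
finally:
    driver.quit()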

Python 3 - Extract content between <td></td> [duplicate]

This question already has an answer here:
How to get inner text value of an HTML tag with BeautifulSoup bs4?
(1 answer)
Closed 5 years ago.
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')
soup = BeautifulSoup(data, 'html.parser')
emails = soup.find_all('td', text = re.compile('#'))
for line in emails:
    print(line)
I have the script above that works perfectly in Python 2.7 with BeautifulSoup for extracting content between several <td></td> tags in an HTML file. When I run the same script in Python 3.6.4, however, I get the following results:
<td>xxx#xxx.com</td>
<td>xxx#xxx.com</td>
I want the content between the tags, without the <td> stuff...
Why is this happening in Python 3?
I found the answer...
from bs4 import BeautifulSoup
import re
data = open(r'C:\folder')
soup = BeautifulSoup(data, 'html.parser')  # Added html.parser
emails = soup.find_all('td', text=re.compile('#'))
for td in emails:
    print(td.get_text())
Look closely at the last two lines :)
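The fix is td.get_text(), which returns only the text inside the tag instead of the tag itself. If each matched <td> holds a single text node, td.string also works (a minimal sketch under that assumption):
for td in emails:
    print(td.string)  # the tag's single child string, or None if it has nested tags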

BeautifulSoup and Large html

I was trying to scrape a number of large Wikipedia pages like this one.
Unfortunately, BeautifulSoup is not able to handle such large content, and it truncates the page.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
However, if you prefer, you can also use lxml as follows:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
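To actually extract data with lxml, XPath works directly on the parsed tree (a sketch, assuming the same catlinks div targeted in the BeautifulSoup answer below):
import lxml.html

doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
# grab the category links at the bottom of the page
for a in doc.xpath("//div[@id='catlinks']//a"):
    print(a.text_content())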
I suggest you get the html content and then pass it to BS:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    soup = BeautifulSoup(r.content, 'html5lib')  # html5lib avoids the truncation described above
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print a.text
else:
    print r.status_code
