python beautifulsoup findall within find

I am trying to get the text in the td with class 'column-1', but I am having trouble: the error says it has no attribute text, yet it clearly does, so I must be doing something wrong. Here is the code:
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://vermontamerican.com/products/standard-drill-bit-extensions/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
for part in soup.find_all('td'),{"class":"column-1"}:
    part1 = part.text
    print(part1)
If I take the part1 = part.text line out and just print part, I get a result, but it returns every td, not just column-1.
I have also tried this but I am new so I am sure this is wrong in more ways than one.
import urllib
import urllib.request
from bs4 import BeautifulSoup
theurl="http://vermontamerican.com/products/standard-drill-bit-extensions/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
for part in soup.find('tbody'),{"class":"row-hover"}:
    for part1 in part.find_all('a'):
        print(part1)

You are not passing the attribute selection dictionary into the find_all() function. Replace:
for part in soup.find_all('td'),{"class":"column-1"}:
with:
for part in soup.find_all('td', {"class":"column-1"}):
Now your code would produce:
17103
17104
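For reference, a minimal end-to-end version of the corrected loop might look like this; class_ is BeautifulSoup's keyword-argument shorthand for the {"class": ...} dictionary, so either form should work:
import urllib.request
from bs4 import BeautifulSoup

theurl = "http://vermontamerican.com/products/standard-drill-bit-extensions/"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

# equivalent to find_all('td', {"class": "column-1"})
for part in soup.find_all('td', class_="column-1"):
    print(part.text)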

Related

How to use find_all method in bs4 on an object without class

import requests
from bs4 import BeautifulSoup

result = requests.get('http://textfiles.com/stories/').text
soup = BeautifulSoup(result, 'lxml')
stories = soup.find_all('tr')
print(stories)
The find method works, but find_all doesn't. I'm not sure why; maybe it is because the element doesn't have a class?
The correct code is:
import requests
from bs4 import BeautifulSoup

result = requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
stories = soup.find_all('tr')
You can access each 'tr' by indexing:
stories[0]
The 0 can be replaced with any valid index into the list.
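If the goal is the text or links inside each row, a minimal sketch iterating over the result might look like this (pulling the first anchor out of each tr is an assumption about what you want from the page):
import requests
from bs4 import BeautifulSoup

result = requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')

for row in soup.find_all('tr'):
    link = row.find('a')   # first link in the row, if any
    if link is not None:
        print(link.text, link.get('href'))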
You can also use pandas, e.g.:
import pandas
import requests
from bs4 import BeautifulSoup

result = requests.get('http://textfiles.com/stories/')
soup = BeautifulSoup(result.content, 'html5lib')
df = pandas.read_html(soup.prettify())
print(df)
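Note that pandas.read_html returns a list of DataFrames, one per table it finds, so an individual table is accessed by index, e.g. (continuing the snippet above):
print(len(df))       # number of tables found on the page
print(df[0].head())  # first table as a DataFrame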

Encoding issues with request results

I'm learning Python, and I'm trying to retrieve data from Wikipedia, but I'm getting encoding issues on special characters in the links, text, etc.:
My code:
import sys
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://pt.wikipedia.org/wiki/Jair_Bolsonaro")
bsObj = BeautifulSoup(html)
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
result:
/wiki/Hamilton_Mour%C3%A3o
/wiki/Michel_Temer
/wiki/C%C3%A2mara_dos_Deputados_do_Brasil
...
Should be:
/wiki/Hamilton_Mourão
/wiki/Michel_Temer
/wiki/Câmara_dos_Deputados_do_Brasil
...
Solution: add
import urllib.parse
and change the print line to:
print(urllib.parse.unquote(link.attrs['href']))
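For reference, a minimal version of the fixed script might look like this (the explicit "html.parser" argument is an addition here, just to avoid BeautifulSoup's "no parser specified" warning):
from urllib.request import urlopen
import urllib.parse
from bs4 import BeautifulSoup

html = urlopen("https://pt.wikipedia.org/wiki/Jair_Bolsonaro")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.findAll("a"):
    if 'href' in link.attrs:
        # unquote() decodes percent-escapes such as %C3%A3 back to ã
        print(urllib.parse.unquote(link.attrs['href']))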

I have BeautifulSoup4, from Anaconda, but I can't seem to utilize it, to save URL to TXT

Anaconda shows BeautifulSoup4 as installed. However, I can't seem to use it in my script.
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
#import BeautifulSoup as bs

alphabets = string.ascii_lowercase
for i in alphabets:
    #print(i)
    html = urlopen("http://www.airlineupdate.com/content_public/codes/airportcodes/airports-by-iata/iata-" + i + ".htm")
    print(html)
    for j in html:
        #soup = bs4(html, "html.parser")
        soup = bs(html, "html.parser")
        f = open('C:\\Users\\Excel\\Desktop\\URL.txt', 'w')
When I try to run the code above, I get the following error:
ModuleNotFoundError: No module named 'BeautifulSoup4'
Can someone enlighten me as to what's going on here?
From the documentation, the import is:
from bs4 import BeautifulSoup
and based on your code, it seems like you want to use it as bs(), so:
from bs4 import BeautifulSoup as bs
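As a rough sketch of how the loop and the file write could fit together once the import is fixed (collecting every href on each page is an assumption about what "save URL to TXT" means here; the structure of the airlineupdate.com pages is not verified):
import string
from urllib.request import urlopen
from bs4 import BeautifulSoup as bs

with open('URL.txt', 'w') as f:
    for i in string.ascii_lowercase:
        url = ("http://www.airlineupdate.com/content_public/codes/"
               "airportcodes/airports-by-iata/iata-" + i + ".htm")
        soup = bs(urlopen(url), "html.parser")
        # write every link found on the page; adjust the filter as needed
        for a in soup.find_all('a', href=True):
            f.write(a['href'] + '\n')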

list out of range error when using soup.select('placeholder')[0].get_text() in Beautiful soup

New to scraping, and I'm trying to use Beautiful Soup to get the Wheelbase value (and eventually other things) from a Wikipedia page (I'll deal with robots.txt later). This is the guide I've been using.
Two questions:
1.) How do I resolve the error below?
2.) How do I scrape the value in the cell that contains the wheelbase? Is it just "td#Wheelbase td"?
The error I get is
File "evscraper.py", line 25, in <module>
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3') [0].get_text()
IndexError: list index out of range
Thanks for any help!
__author__ = 'KirkLazarus'

import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests

response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)
wheelbase_data['Wheelbase'] = soup.select('div#Wheelbase h3')[0].get_text()
print wheelbase_data
Well your first problem is with your selector. There's no div with the ID of "Wheelbase" on that page, so it's returning an empty list.
What follows is by no means perfect, but will get you what you want, only because you know the structure of the page already:
import re
import json
import gspread
from oauth2client.client import SignedJwtAssertionCredentials
import bs4
from bs4 import BeautifulSoup
import requests

wheelbase_data = {}
response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text)
for link in soup.find_all('a'):
    if link.get('href') == "/wiki/Wheelbase":
        wheelbase = link
        break
wheelbase_data['Wheelbase'] = wheelbase.parent.parent.td.text
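An alternative sketch of the same idea that does not depend on the /wiki/Wheelbase anchor being present is to match the infobox row header text directly (this assumes the value is laid out as a th/td pair in the row, which is how the Wikipedia infobox renders it):
import bs4
import requests

wheelbase_data = {}
response = requests.get('https://en.wikipedia.org/wiki/Tesla_Model_S')
soup = bs4.BeautifulSoup(response.text, "html.parser")

for row in soup.find_all('tr'):
    th = row.find('th')
    if th is not None and th.get_text(strip=True) == 'Wheelbase' and row.td is not None:
        wheelbase_data['Wheelbase'] = row.td.get_text(strip=True)
        break

print(wheelbase_data)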
It looks like you're looking at the incorrect path. I've had to do something similar in the past; I'm not sure if this is the best approach, but it certainly worked for me.
import pandas as pd
from bs4 import BeautifulSoup
import urllib2

car_data = pd.DataFrame()
models = ['Tesla_Model_S', 'Tesla_Model_X']
for model in models:
    wiki = "https://en.wikipedia.org/wiki/{0}".format(model)
    header = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(wiki, headers=header)
    page = urllib2.urlopen(req)
    soup = BeautifulSoup(page)
    table = soup.find("table", {"class": "infobox hproduct"})
    for row in table.findAll("tr")[2:]:
        try:
            field = row.findAll("th")[0].text.strip()
            val = row.findAll("td")[0].text.strip()
            car_data.set_value(model, field, val)
        except:
            pass
car_data
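The answer above is Python 2 (urllib2, and DataFrame.set_value, which has since been removed from pandas). A roughly equivalent Python 3 sketch using requests might look like this; the "infobox hproduct" class is carried over from the answer and may have changed on Wikipedia since:
import pandas as pd
import requests
from bs4 import BeautifulSoup

car_data = pd.DataFrame()
models = ['Tesla_Model_S', 'Tesla_Model_X']
for model in models:
    wiki = "https://en.wikipedia.org/wiki/{0}".format(model)
    page = requests.get(wiki, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(page.text, "html.parser")
    table = soup.find("table", {"class": "infobox hproduct"})
    if table is None:
        continue  # class name may have changed; skip the model
    for row in table.find_all("tr")[2:]:
        try:
            field = row.find_all("th")[0].text.strip()
            val = row.find_all("td")[0].text.strip()
            car_data.loc[model, field] = val  # .loc replaces the removed set_value
        except IndexError:
            pass
print(car_data)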

BeautifulSoup(html) not working, saying can't call module?

import urllib2
import urllib
from BeautifulSoup import BeautifulSoup # html
from BeautifulSoup import BeautifulStoneSoup # xml
import BeautifulSoup # everything
import re
f = o.open( 'http://www.google.com', p)
html = f.read()
f.close()
soup = BeautifulSoup(html)
I'm getting an error on the line with soup = BeautifulSoup(html), saying 'module' object is not callable.
Your import BeautifulSoup makes BeautifulSoup refer to the module, not the class as it did after from BeautifulSoup import BeautifulSoup. If you're going to import the whole module, you might want to omit the from ... line or perhaps rename the class afterward:
from BeautifulSoup import BeautifulSoup
Soup = BeautifulSoup
...
import BeautifulSoup
....
soup = Soup(html)
@Blair's answer has the right slant, but I'd do some things slightly differently, i.e.:
import BeautifulSoup
Soup = BeautifulSoup.BeautifulSoup
(recommended), or
import BeautifulSoup
from BeautifulSoup import BeautifulSoup as Soup
(not bad either).
Install BeautifulSoup4
sudo easy_install BeautifulSoup4
Recommendation
from bs4 import BeautifulSoup
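For completeness, a minimal Python 3 version of the failing snippet using bs4 would be along these lines (the explicit parser argument is optional but silences the "no parser specified" warning):
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen('http://www.google.com').read()
soup = BeautifulSoup(html, "html.parser")
print(soup.title)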
