Parse HTML table in file to csv with BeautifulSoup - python

Hi, I'm a Python noob and an even bigger BeautifulSoup and HTML noob. I have a downloaded file with an HTML table in it. In all the examples of BeautifulSoup parsing I have seen, they use urllib to open the table's URL, then read the response and pass it to BeautifulSoup to parse. My question is: for a locally stored file, do I have to load the entire file into memory? So instead of doing, say:
contenturl = "http://www.bank.gov.ua/control/en/curmetal/detail/currency?period=daily"
soup = BeautifulSoup(urllib2.urlopen(contenturl).read())
Do I instead do:
soup = BeautifulSoup(open('/home/dir/filename').read())
That doesn't really seem to work; I get the following error:
Traceback (most recent call last):
File "<string>", line 1, in <fragment>
TypeError: 'module' object is not callable
My apologies if it's something really silly I'm doing, but help is appreciated.
Update: The issue is resolved; I needed to import the BeautifulSoup class from the module rather than the module itself. Thank you!
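For anyone who hits the same TypeError: 'module' object is not callable, the cause is importing the module instead of the class, which leaves the name BeautifulSoup bound to a module object. A minimal sketch of the working imports (the first form is for the old BeautifulSoup 3 package, the second for bs4):

# BeautifulSoup 3: import the class from the module, not the module itself
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('/home/dir/filename').read())

# BeautifulSoup 4 (bs4) equivalent:
from bs4 import BeautifulSoup
with open('/home/dir/filename') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')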

Related

Can't print only text using Beautiful soup

I am struggling with one of my first projects in Python 3. When I use the following code:
def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc", cookies=all_cookies)
    soup = BeautifulSoup(r.text, "html.parser")
    offers = soup.find_all("div", {'class': 'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class': 'marginright5 link linkWithHash detailsLink'})
        print(offer_name.text.strip())
I get the following error:
Traceback (most recent call last):
File "scrape_products.py", line 45, in <module>
scrape_offers()
File "scrape_products.py", line 40, in scrape_offers
print(offer_name.text.strip())
File "/usr/local/lib/python3.7/site-packages/bs4/element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I've read many similar cases on Stack Overflow, but I still can't figure it out. If someone has any ideas, please help :)
P.S.: If I run the code without .text, it shows the entire <a class=...> ... </a> element.
findChildren returns a list (a ResultSet). Sometimes you get an empty list, sometimes you get a list with one element.
You should add an if statement to check that the length of the returned list is greater than zero, then print the text of the first element.
import requests
from bs4 import BeautifulSoup

def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
    soup = BeautifulSoup(r.text, "html.parser")
    offers = soup.find_all("div", {'class': 'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class': 'marginright5 link linkWithHash detailsLink'})
        if len(offer_name) >= 1:
            print(offer_name[0].text.strip())

scrape_offers()
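As an alternative sketch: find (unlike findChildren) returns a single Tag or None, so there is no list to index into:

offer_name = offer.find("a", {'class': 'marginright5 link linkWithHash detailsLink'})
if offer_name is not None:  # find returns None when nothing matches
    print(offer_name.text.strip())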

Python 3.7: scraping a page using BeautifulSoup, issues with available code on Stack Exchange

I have been trying to scrape a forum (https://forums.bharat-rakshak.com/viewtopic.php?f=3&t=7630) with the Stack Overflow code given in
Scraping a page for URLs using Beautifulsoup
I updated urllib2 to urllib and the urlopen call. However, the code still gives an error in the metadata loop:
for html in metaData:
    text = BeautifulSoup(str(html).strip(), 'lxml').get_text().encode("utf-8").replace("\n", "")  # convert the html to text
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip())  # get Post by:
    times.append(text.split(" on ")[1].strip())  # get date
The error I get is
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
TypeError: a bytes-like object is required, not 'str'
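A likely cause, judging from the traceback: in Python 3, .encode("utf-8") converts the text to bytes, so the chained .replace("\n", "") is then called on a bytes object with str arguments. One sketch of a fix is to do all string operations on the str and encode only at the end, if bytes are actually needed:

for html in metaData:
    # keep the value as str; .replace and .split then work with str arguments
    text = BeautifulSoup(str(html).strip(), 'lxml').get_text().replace("\n", "")
    authors.append(text.split("Post by:")[1].split(" on ")[0].strip())  # get Post by:
    times.append(text.split(" on ")[1].strip())  # get date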

Unable to parse json response with python

So I ran into an issue while importing API data into my code. Any help is much appreciated.
from urllib2 import Request, urlopen, URLError
import json, requests
data = requests.get('https://masari.superpools.net/api/live_stats?update=1522693430318').json()
data_parsed = json.loads(open(data,"r").read())
print data_parsed
I'm still quite new to Python, and I ran into this error:
C:\Users\bot2>python C:\Users\bot2\Desktop\Python_Code\helloworld.py
Traceback (most recent call last):
File "C:\Users\bot2\Desktop\Python_Code\helloworld.py", line 5, in <module>
data_parsed = json.loads(open(data,"r").read())
TypeError: coercing to Unicode: need string or buffer, dict found
data is already received as a JSON object (which is a dict in this case). Just do the following:
data = requests.get('https://masari.superpools.net/api/live_stats?update=1522693430318').json()
print data
Use data['network'] for example to access nested dictionaries.
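For example, something like this (the 'network' key is taken from the line above; the exact fields depend on the API's response):

network = data.get('network', {})  # .get avoids a KeyError if the key is missing
print(network)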

Write data scraped to text file with python script

I am a newbie to data scraping. This is my first program I am writing in Python to scrape data and store it in a text file. I have written the following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2

text_file = open("scrape.txt", "w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span', attrs={"style":"display:inline-block; font- size:10pt;"}).findAll()
for found in type:
    text_file.write(found)
However, when I run this program from the command prompt, it shows me the following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
File "sample1.py", line 9, in <module>
text_file.write(found)
TypeError: expected a string or other character buffer object
What am I missing here, or is there anything I haven't added? Thanks.
You need to check whether type is None, i.e. whether soup.find actually found what you searched for.
Also, don't use the name type; it's a builtin.
find returns a single Tag object (or None), while find_all returns a list of them. If you call print on a Tag you see its string representation, but that conversion isn't invoked by file.write. You have to decide which representation of found you want to write.
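A sketch of the corrected loop, assuming the goal is to save each tag's markup (swap in get_text() if only the visible text is wanted):

result = soup.find('span', attrs={"style":"display:inline-block; font- size:10pt;"})
if result is not None:  # soup.find returns None when nothing matches
    for found in result.findAll():
        # write() needs a string, not a Tag object
        text_file.write(str(found))           # the tag's full markup
        # text_file.write(found.get_text())   # or only the visible text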

beautiful soup bug?

I have the following code:
for table in soup.findAll("table", "tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        print type(url)
I get the following output:
<class 'bs4.element.Tag'>
That means url is an object of class Tag, and I should be able to get attributes from this object.
But if I replace print type(url) with print url['href'], I get the following traceback:
Traceback (most recent call last):
File "baseCreator.py", line 57, in <module>
createStoresTable()
File "baseCreator.py", line 46, in createStoresTable
print url['href']
TypeError: 'NoneType' object has no attribute '__getitem__'
What is wrong? And how can I get the value of the href attribute?
I do like BeautifulSoup but I personally prefer lxml.html (for not too wacky HTML) because of the ability to utilise XPath.
import lxml.html

page = lxml.html.parse('http://somesite.tld')
print page.xpath('//tr/td/a/@href')  # XPath attribute syntax is @href
You might need to implement some form of XPath "axes", though, depending on the structure.
You can also use elementsoup as a parser - details at http://lxml.de/elementsoup.html
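On the original question itself: the traceback means data[0].a was None for at least one row, i.e. the first td in that row contains no <a> tag. A minimal guard, staying with the question's BeautifulSoup code:

for table in soup.findAll("table", "tableData"):
    for row in table.findAll("tr"):
        data = row.findAll("td")
        url = data[0].a
        if url is not None:  # some rows have no <a> in the first cell
            print url['href']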
