How to print a function's result into HTML in Python? - python

I am a self-taught beginner; I have searched a lot, but maybe my searching skills are lacking. I am scraping some values from two web sites and want to compare them in an HTML output. For each web page, I combine two classes into a list. But when generating the HTML output I don't want the whole list printed, so I wrote a function to select items by keyword. When I print that function's result, the HTML output shows 'None', even though the console shows what I wanted. How can I display that filtered list?
OS: Windows, Python 3.
from bs4 import BeautifulSoup
import requests
import datetime
import os
import webbrowser

carf_meySayf = requests.get('https://www.carrefoursa.com/tr/tr/meyve/c/1015?show=All').text
carf_soup = BeautifulSoup(carf_meySayf, 'lxml')
# spans
carf_name_span = carf_soup.find_all('span', {'class': 'item-name'})
carf_price_span = carf_soup.find_all('span', {'class': 'item-price'})
# spans to list
carf_name_list = [span.get_text() for span in carf_name_span]
carf_price_list = [span.get_text() for span in carf_price_span]
# combine lists
carf_mey_all = [name + ' = ' + price for name, price in zip(carf_name_list, carf_price_list)]
# Function to choose and print a specific product
def test(namelist, product):
    for i in namelist:
        if product in i:
            print(i)
a = test(carf_mey_all, 'Muz')
# Date
date = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
# HTML part
html_str = """
<html>
<title>Listeler</title>
<h2>Tarih: %s</h2>
<h3>Product & Shop List</h3>
<table style="width:100%%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
%s
</tr>
</html>
"""
whole = html_str % (date, a)
Html_file = open("Meyve.html", "w")
Html_file.write(whole)
Html_file.close()

The method test() must have a return value, for example:
def test(namelist, product):
    results = ''
    for i in namelist:
        if product in i:
            print(i)
            results += '<td>%s</td>\n' % i
    return results
Meyve.html results:
<html>
<title>Listeler</title>
<h2>Tarih: 2018-12-29 07:34:00</h2>
<h3>Product & Shop List</h3>
<table style="width:100%">
<tr>
<th>Carrefour</th>
</tr>
<tr>
<td>Muz = 6,99 TL</td>
<td>İthal Muz = 12,90 TL</td>
<td>Paket Yerli Muz = 9,99 TL</td>
</tr>
</html>
Note: to be valid HTML you need to add <body></body> (and close the <table> tag).

The problem is that your test() function isn't explicitly returning anything, so it implicitly returns None.
To fix this, test() should accumulate the text it wants to return (i.e., by building a list or string) and return a string containing the text you want to insert into html_str.
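As a standalone illustration of that accumulate-and-return pattern (the function and item names here are made up, not from the question):

```python
def build_cells(items, keyword):
    """Collect the matching items as HTML <td> cells and return one string."""
    cells = ['<td>%s</td>' % item for item in items if keyword in item]
    return '\n'.join(cells)

rows = build_cells(['Muz = 6,99 TL', 'Elma = 3,49 TL', 'Ithal Muz = 12,90 TL'], 'Muz')
print(rows)  # -> <td>Muz = 6,99 TL</td>\n<td>Ithal Muz = 12,90 TL</td>
```

Because the function returns the string instead of printing it, `html_str % (date, rows)` receives real content rather than None.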

Related

How to get numbers from html?

I want to get
the number after: page=
the number after: "new">
the number after: /a>-
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 4449-4450<br/> </td>
<td> </td>
<td> qwqwqwqwqw <br/> qwqwqwqwqw 5111-5550<br/> </td>
<td> </td>
...
My code
tables = soup.find_all('a', attrs={'target': 'new'})
gives me only a list (see below) without the third number
[4449,
5111,
...]
Here is how I would try to extract the 3 numbers from my list, once it has the third number in it.
list_of_number1 = []
list_of_number2 = []
list_of_number3 = []
regex = re.compile(r"page=(\d+)")
for table in tables:
    number1 = filter(regex.match, tables)
    number2 = table.next_sibling.strip()
    number3 =
    list_of_number1.append(number1)
    list_of_number2.append(number2)
    list_of_number3.append(number3)
Do I use BeautifulSoup for the third number, or is it feasible to regex through the whole HTML for any number following "/a>-"?
Here is how you can obtain your result using just the information you need: the numbers in the specific a element and in the text node immediately to its right:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
tables = soup.find_all('a', attrs={'target': 'new'})
print([(t.text, t["href"].split('=')[-1], t.next_sibling.replace('-', '')) for t in tables])
# => [('4449', '99', '4450'), ('5111', '77', '5550')]
You may certainly go the harder way with regexps:
import re
# ... the same initialization code as above
results = []
for t in tables:
    page_m = re.search(r"[#?]page=(\d+)", t["href"])
    if page_m:
        page = page_m.group(1)
    else:
        page = ""
    num = "".join([x for x in t.next_sibling if x.isdigit()])
    results.append((int(t.text), int(page), int(num)))
print(results)
# => [(4449, 99, 4450), (5111, 77, 5550)]
NOTE:
t.text - gets the element text
t["href"] - gets the href attribute value of the t element
t.next_sibling - gets the next node after current one that is on the same hierarchy level.
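These three pieces can be seen together in a tiny standalone snippet (the markup is a simplified stand-in for one row of the page in the question):

```python
from bs4 import BeautifulSoup

# simplified stand-in for one row of the page in the question
html = '<td>qwqwqwqwqw <a href="x.php?page=99" target="new">4449</a>-4450<br/></td>'
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a", attrs={"target": "new"})
print(a.text)          # the element's text: 4449
print(a["href"])       # the href attribute value: x.php?page=99
print(a.next_sibling)  # the text node right after the element: -4450
```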
You can also try:
for b in soup.select('a'):
    print(b.attrs['href'].split('=')[1], b.text, b.nextSibling)
Output:
99 4449 -4450
77 5111 -5550

Delete HTML element if it contains a certain amount of numeric characters

For transforming an HTML-formatted file into a plain-text file with Python, I need to delete all tables if the text within the table contains more than 40% numeric characters.
Specifically, I would like to:
identify each table element in an HTML file
calculate the number of numeric and alphabetic characters in the text and the corresponding ratio, not considering characters within any HTML tags (thus, delete all HTML tags)
delete the table if its text is composed of more than 40% numeric characters; keep the table if it contains less than 40% numeric characters
I defined a function that is called when the re.sub command is run. The rawtext variable contains the whole html-formatted text I want to parse. Within the function, I try to process the steps described above and return a html-stripped version of the table or a blank space, depending on the ratio of numeric characters. However, the first re.sub command within the function seems to delete not only tags, but everything, including the textual content.
def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
        print('ratio = ' + ratio)
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        emptystring = re.sub('.*?', ' ', table, flags=re.DOTALL)
        return emptystring
    else:
        return table

rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
If you have an idea on what might be wrong with this code, I would be very happy if you share it with me. Thank you!
As I suggested in the comments, I wouldn't use regex to parse and work with HTML in code. For example, you could use a Python library built for this purpose, like BeautifulSoup.
Here is an example of how to use it:
#!/usr/bin/python
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup
html = """<html>
<head>Heading</head>
<body attr1='val1'>
<div class='container'>
<div id='class'>Something here</div>
<div>Something else</div>
<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
</div>
</body>
</html>"""
parsed_html = BeautifulSoup(html, 'html.parser')
print parsed_html.body.find('table').text
So you could end up with code like this (just to give you an idea):
#!/usr/bin/python
import re
try:
    from BeautifulSoup import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

def tablereplace(table):
    table = re.sub('<[^>]*>', ' ', str(table))
    numeric = sum(c.isdigit() for c in table)
    print('numeric: ' + str(numeric))
    alphabetic = sum(c.isalpha() for c in table)
    print('alpha: ' + str(alphabetic))
    try:
        ratio = numeric / float(numeric + alphabetic)
        print('ratio: ' + str(ratio))
    except ZeroDivisionError as err:
        ratio = 1
    if ratio > 0.4:
        return True
    else:
        return False
table = """<table style="width:100%">
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>3241424134213424214321342424214321412</td>
<td>213423423234242142134214124214214124124</td>
<td>213424214234242</td>
</tr>
<tr>
<td>124234412342142414</td>
<td>1423424214324214</td>
<td>2134242141242341241</td>
</tr>
</table>
"""
if tablereplace(table):
    print 'replace table'
parsed_html = BeautifulSoup(table, 'html.parser')
rawdata = parsed_html.find('table').text
print rawdata
UPDATE:
Anyway, this line of your code already strips away all HTML tags, as you know, since you are using it for the character/digit counting:
table = re.sub('<[^>]*>', ' ', str(table))
But it's not safe, because you could also have < or > inside the text of your tags, or the HTML could be shattered or malformed.
I left it where it is because it works for the example, but consider using BeautifulSoup for all HTML management.
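A minimal sketch of that suggestion, counting only the visible text via get_text() (the function name here is mine, not from the answer):

```python
from bs4 import BeautifulSoup

def digit_ratio(fragment):
    # get_text() drops the tags entirely, so stray < or > inside text
    # cannot confuse the count the way the tag-stripping regex can
    text = BeautifulSoup(fragment, "html.parser").get_text(" ")
    numeric = sum(c.isdigit() for c in text)
    alphabetic = sum(c.isalpha() for c in text)
    total = numeric + alphabetic
    return numeric / total if total else 1.0

print(digit_ratio("<table><tr><td>12345</td><td>abc</td></tr></table>"))  # -> 0.625
```

A table with this ratio above 0.4 would then be dropped, exactly as in the author's 40% rule.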
Thank you for your replies so far!
After intensive research, I found the solution to the mysterious deletion of the whole match. It seemed that the function only considered the first 150 or so characters of the match: re.sub passes a match object to the function, and str(table) was converting that object (whose repr truncates the matched text) rather than the match itself. However, if you specify table = table.group(0), the whole match is processed; group(0) accounts for the big difference here.
Below you can find my updated script, which works properly (it also includes some other minor changes):
def tablereplace(table):
    table = table.group(0)
    table = re.sub('<[^>]*>', '\n', table)
    numeric = sum(c.isdigit() for c in table)
    alphabetic = sum(c.isalpha() for c in table)
    try:
        ratio = numeric / (numeric + alphabetic)
    except ArithmeticError:
        ratio = 1
    if ratio > 0.4:
        emptystring = ''
        return emptystring
    else:
        return table

rawtext = re.sub('<table.+?<\/table>', tablereplace, rawtext, flags=re.IGNORECASE|re.DOTALL)
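The fix can be reduced to a standalone demo of how re.sub treats a function replacement: the function receives a match object, and .group(0) is the full matched text (the token-filtering rule below is invented purely for illustration):

```python
import re

def replace_if_numeric(match):
    # re.sub calls this with a match object; group(0) is the full matched text
    block = match.group(0)
    digits = sum(c.isdigit() for c in block)
    letters = sum(c.isalpha() for c in block)
    # drop the token if it is mostly digits, otherwise keep it unchanged
    return "" if digits > letters else block

text = "keep abc123 drop 99999 end"
print(re.sub(r"\S+", replace_if_numeric, text))  # -> "keep abc123 drop  end"
```

Calling str() on the match object instead of .group(0) would format its repr, which is why the original function appeared to see only a truncated match.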

Extract links after th in beautifulsoup

I'm trying to extract links from this page:
http://www.tadpoletunes.com/tunes/celtic1/
view-source:http://www.tadpoletunes.com/tunes/celtic1/
but I only want the reels, which in the page are delineated by:
start:
<th align="left"><b><a name="reels">REELS</a></b></th>
end ( the lines above the following):
<th align="left"><b><a name="slides">SLIDES</a></b></th>
The question is how to do this. I have the following code which gets the links for everything with a .mid extension:
def import_midifiles():
    archive_url = "http://www.tadpoletunes.com/tunes/celtic1/"
    sauce = urllib.request.urlopen("http://www.tadpoletunes.com/tunes/celtic1/celtic.htm").read()
    soup = bs.BeautifulSoup(sauce, 'lxml')
    tables = soup.find_all('table')
    for table in tables:
        for link in table.find_all('a', href=True):
            if link['href'].endswith('.mid'):
                listofmidis.append(archive_url + link['href'])
        if listofmidis:
            listoflists.append(listofmidis)
    midi_list = [item for sublist in listoflists for item in sublist]
    return midi_list
I cannot figure this out from the beautifulsoup docs. I need the code because I will be repeating the activity on other sites in order to scrape data for training a model.
To get all the "REELS" links, you need to do the following:
Get the links in between "REELS" and "SLIDES" as you mentioned. To do that, first you'll need to find the <tr> tag containing <a name="reels">REELS</a>. This can be done using the .find_parent() method.
reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
Now, you can use the .find_next_siblings() method to get all the <tr> tags after "REELS". We can break the loop when we find the <tr> tag with <a name="slides">SLIDES</a> (or .find('a').text == 'SLIDES').
Complete code:
def import_midifiles():
    BASE_URL = 'http://www.tadpoletunes.com/tunes/celtic1/'
    r = requests.get(BASE_URL)
    soup = BeautifulSoup(r.text, 'lxml')
    midi_list = []
    reels_tr = soup.find('a', {'name': 'reels'}).find_parent('tr')
    for tr in reels_tr.find_next_siblings('tr'):
        if tr.find('a').text == 'SLIDES':
            break
        midi_list.append(BASE_URL + tr.find('a')['href'])
    return midi_list
print(import_midifiles())
Partial output:
['http://www.tadpoletunes.com/tunes/celtic1/ashplant.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bashful.mid',
'http://www.tadpoletunes.com/tunes/celtic1/bigpat.mid',
'http://www.tadpoletunes.com/tunes/celtic1/birdcage.mid',
'http://www.tadpoletunes.com/tunes/celtic1/boatstre.mid',
...
...
'http://www.tadpoletunes.com/tunes/celtic1/silspear.mid',
'http://www.tadpoletunes.com/tunes/celtic1/stafreel.mid',
'http://www.tadpoletunes.com/tunes/celtic1/kilkenny.mid',
'http://www.tadpoletunes.com/tunes/celtic1/swaltail.mid',
'http://www.tadpoletunes.com/tunes/celtic1/cuptea.mid']
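The find_parent()/find_next_siblings() pattern can be tried offline against a simplified stand-in for the page's markup (the rows below are samples, not the real tune list):

```python
from bs4 import BeautifulSoup

# simplified stand-in for the section structure of the real page
html = """
<table>
<tr><th><a name="reels">REELS</a></th></tr>
<tr><td><a href="ashplant.mid">The Ash Plant</a></td></tr>
<tr><td><a href="bashful.mid">The Bashful Bachelor</a></td></tr>
<tr><th><a name="slides">SLIDES</a></th></tr>
<tr><td><a href="someslide.mid">A Slide</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
links = []
# climb from the REELS anchor to its row, then walk the following rows
start_tr = soup.find("a", {"name": "reels"}).find_parent("tr")
for tr in start_tr.find_next_siblings("tr"):
    if tr.find("a").text == "SLIDES":  # stop at the next section header
        break
    links.append(tr.find("a")["href"])
print(links)  # -> ['ashplant.mid', 'bashful.mid']
```

The slide row after the SLIDES header is never reached, which is exactly the start/end behavior the question asked for.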

Scraping html page

I want to get the movie title, year, rating, genres, and run time of five movies from the HTML page given in the code. These are in the rows of the table called results.
from bs4 import BeautifulSoup
import urllib2
def read_from_url(url, num_m=5):
    html_string = urllib2.urlopen(url)
    soup = BeautifulSoup(html_string)
    movie_table = soup.find('table', 'results')  # table of movies
    list_movies = []
    count = 0
    for row in movie_table.find_all("tr"):
        dict_each_movie = {}
        title = title.encode("ascii", "ignore")  # getting title
        dict_each_movie["title"] = title
        year = year.encode("ascii", "ignore")  # getting year
        dict_each_movie["year"] = year
        rank = rank.encode("ascii", "ignore")  # getting rank
        dict_each_movie["rank"] = rank
        # genres = []  # getting genres of a movie
        runtime = runtime.encode("ascii", "ignore")  # getting runtime
        dict_each_movie["runtime"] = runtime
        list_movies.append(dict_each_movie)
        count += 1
        if count == num_of_m:
            break
    return list_movies
print read_from_url('http://www.imdb.com/search/title?at=0&sort=user_rating&start=1&title_type=feature&year=2005,2015',2)
Expected output:
[{'rating': '10.0', 'genres': ['Comedy', 'Family'], 'title': 'How to Beat a Bully', 'rank': '1', 'year': '2014', 'runtime': '90'},..........]
You're accessing a variable that hasn't been declared. When the interpreter sees title.encode("ascii", "ignore"), it looks for the variable title, which hasn't been declared previously. Python can't possibly know what title is, thus you can't call encode on it. The same goes for year and rank. Instead use:
title = 'How to Beat a Bully'.encode('ascii','ignore')
Why so???
Make your life easier with CSS Selectors.
<table>
<tr class="my_class">
<td id="id_here">
<a href = "link_here"/>First Link</a>
</td>
<td id="id_here">
<a href = "link_here"/>Second Link</a>
</td>
</tr>
</table>
for tr in movie_table.select("tr.my_class"):
    for td in tr.select("td#id_here"):
        print("Link " + td.select("a")[0]["href"])
        print("Text " + td.select("a")[0].text)
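For reference, here is a self-contained, runnable version of that selector approach, using cleaned-up sample markup of the same shape:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr class="my_class">
    <td id="id_here"><a href="link_here">First Link</a></td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# a combined CSS selector reaches the cells in one step
for td in soup.select("tr.my_class td#id_here"):
    a = td.select("a")[0]
    print("Link " + a["href"])  # -> Link link_here
    print("Text " + a.text)     # -> Text First Link
```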

Get an attribute value based on the name attribute with BeautifulSoup

I want to print an attribute value based on its name, take for example
<META NAME="City" content="Austin">
I want to do something like this
soup = BeautifulSoup(f) # f is some HTML containing the above meta tag
for meta_tag in soup("meta"):
    if meta_tag["name"] == "City":
        print(meta_tag["content"])
The above code gives a KeyError: 'name'. I believe this is because name is used by BeautifulSoup, so it can't be used as a keyword argument.
It's pretty simple, use the following:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<META NAME="City" content="Austin">')
>>> soup.find("meta", {"name":"City"})
<meta name="City" content="Austin" />
>>> soup.find("meta", {"name":"City"})['content']
'Austin'
theharshest answered the question but here is another way to do the same thing.
Also, in your example you have NAME in caps and in your code you have name in lowercase.
s = '<div class="question" id="get attrs" name="python" x="something">Hello World</div>'
soup = BeautifulSoup(s)
attributes_dictionary = soup.find('div').attrs
print attributes_dictionary
# prints: {'id': 'get attrs', 'x': 'something', 'class': ['question'], 'name': 'python'}
print attributes_dictionary['class'][0]
# prints: question
print soup.find('div').get_text()
# prints: Hello World
6 years late to the party, but I've been searching for how to extract an HTML element's tag attribute value, so for:
<span property="addressLocality">Ayr</span>
I want "addressLocality". I kept being directed back here, but the answers didn't really solve my problem.
How I managed to do it eventually:
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs('<span property="addressLocality">Ayr</span>', 'html.parser')
>>> my_attributes = soup.find().attrs
>>> my_attributes
{u'property': u'addressLocality'}
As it's a dict, you can then also use keys() and values():
>>> my_attributes.keys()
[u'property']
>>> my_attributes.values()
[u'addressLocality']
Hopefully it helps someone else!
theharshest's answer is the best solution, but FYI the problem you were encountering has to do with the fact that a Tag object in Beautiful Soup acts like a Python dictionary. If you access tag['name'] on a tag that doesn't have a 'name' attribute, you'll get a KeyError.
The following works:
from bs4 import BeautifulSoup
soup = BeautifulSoup('<META NAME="City" content="Austin">', 'html.parser')
metas = soup.find_all("meta")
for meta in metas:
    print meta.attrs['content'], meta.attrs['name']
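A related sketch: Tag.get() sidesteps the KeyError entirely, returning None when an attribute is absent (the second meta tag below is added just to show the safe miss):

```python
from bs4 import BeautifulSoup

html = '<META NAME="City" content="Austin"><meta charset="utf-8">'
soup = BeautifulSoup(html, "html.parser")
for meta in soup.find_all("meta"):
    # .get() returns None instead of raising KeyError when "name" is absent,
    # so the charset-only tag passes through harmlessly
    if meta.get("name") == "City":
        print(meta.get("content"))  # -> Austin
```

Note that html.parser lowercases attribute names, so NAME in the source matches "name" here.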
One can also try this solution. To find the value written in a span of a table:
htmlContent
<table>
<tr>
<th>
ID
</th>
<th>
Name
</th>
</tr>
<tr>
<td>
<span name="spanId" class="spanclass">ID123</span>
</td>
<td>
<span>Bonny</span>
</td>
</tr>
</table>
Python code
soup = BeautifulSoup(htmlContent, "lxml")
soup.prettify()
tables = soup.find_all("table")
for table in tables:
    storeValueRows = table.find_all("tr")
    thValue = storeValueRows[0].find_all("th")[0].string
    if thValue == "ID":  # with this condition I am verifying that this is the HTML I wanted
        value = storeValueRows[1].find_all("span")[0].string
        value = value.strip()
        # storeValueRows[1] represents the <tr> tag of the table located at the first index,
        # find_all("span")[0] gives the <span> tag, and .string gives its value
        # value.strip() removes whitespace from the start and end of the string
        # find using attribute:
        value = storeValueRows[1].find("span", {"name": "spanId"})['class']
        print value
        # this will print spanclass
If tdd = '<td class="abc"> 75</td>':
In BeautifulSoup (note that has_attr() is a method of a parsed Tag, not of a string, so parse it first):
tdd = BeautifulSoup('<td class="abc"> 75</td>', 'html.parser').td
if tdd.has_attr('class'):
    print(tdd.attrs['class'][0])
Result: abc
