I'm trying to extract names from a wiki page. Using BeautifulSoup I am able to get a very dirty list (including lots of extraneous items) that I want to clean up; however, my attempt to 'sanitise' the list leaves it unchanged.
#1).
#Retrieve the page
import requests
from bs4 import BeautifulSoup
weapons_url = 'https://escapefromtarkov.gamepedia.com/Weapons'
weapons_page = requests.get(weapons_url)
weapons_soup = BeautifulSoup(weapons_page.content, 'html.parser')
#2).
#Attain the data I need, plus a lot of unhelpful data
flithy_scraped_weapon_names = weapons_soup.find_all('td', href="", title="")
#3a).
#Identify keywords that recur in unhelpful/extraneous list items
dirt = ["mm", "predecessor", "File", "image"]
#3b). - Fails
#Remove extraneous data containing above-defined keywords
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in s for xs in dirt)]
#4).
#Check data
print(weapon_names_sanitised)
#Returns a list identical to flithy_scraped_weapon_names
The problem is in this section:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in s for xs in dirt)]
It should instead be:
weapon_names_sanitised = [s for s in flithy_scraped_weapon_names\
if not any(xs in str(s) for xs in dirt)]
The reason is that flithy_scraped_weapon_names contains Tag objects, which are converted to strings when printed, but need to be explicitly cast to strings for the substring test xs in str(s) to work as expected.
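As a self-contained illustration of that failure mode, here is a minimal sketch using a stand-in class (FakeTag is hypothetical, not part of bs4) whose str() form is its HTML, rather than real scraped Tag objects:

```python
class FakeTag:
    """Hypothetical stand-in for a bs4 Tag; str() yields its HTML."""
    def __init__(self, html):
        self.html = html
    def __str__(self):
        return self.html

cells = [FakeTag('<td>AK-74</td>'),
         FakeTag('<td>5.45x39 mm</td>'),
         FakeTag('<td>File:image.png</td>')]
dirt = ["mm", "predecessor", "File", "image"]

# Casting each element with str() makes the substring test meaningful:
clean = [s for s in cells if not any(xs in str(s) for xs in dirt)]
print([str(s) for s in clean])  # ['<td>AK-74</td>']
```

Without the str() cast, `xs in s` is not a substring test against the tag's HTML, so nothing gets filtered out.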
Related
I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want to get any of the crew), and this question is specifically about getting the person's internal ID. I already have peoples' names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded url to get the code right.
On examining the links, I found that the links for the actors look like this:
William Shatner
Leonard Nimoy
Nicholas Guest
while the ones for other contributors look like this:
Nicholas Meyer
Gene Roddenberry
This should allow me to differentiate actors/actresses from crew like the director or writer by checking whether the end of the href matches "t[0-9]+$" rather than the same pattern ending in "dr" or "wr".
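As a quick sanity check of that idea (the href strings below are hypothetical examples in the described format, not values scraped from the page):

```python
import re

# Hypothetical hrefs following the pattern described above:
actor_href = '/name/nm0000638/?ref_=ttfc_fc_cl_t1'   # cast link, ends in t + digits
writer_href = '/name/nm0000638/?ref_=ttfc_fc_wr1'    # crew (writer) link

cast_pattern = re.compile(r't[0-9]+$')
print(bool(cast_pattern.search(actor_href)))   # True
print(bool(cast_pattern.search(writer_href)))  # False
```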
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re
movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'
def clearLists(n):
    return [[] for _ in range(n)]

def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)

def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)
    #get all the tags with links in them
    link_tags = soupObject.find_all('a')
    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)
    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs

newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs, and if you uncomment the print statement you can see that it's because the value that gets put in the "link" variable ends up being what's below (with variations for the specific people)
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later.
I've tried to find other answers on Stack Overflow, but I haven't been able to find this particular issue addressed.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page?
(Also, feel free to offer suggestions to tighten up the code)
It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on other variables than the URL alone (like cookies or headers).
However, since you're only after links of people in the cast, an easier way would be to simply match the ids of people in the cast section. This is actually fairly straightforward, since they are all in a single element, <table class="cast_list">
So:
import urllib.request
from bs4 import BeautifulSoup
import re
# it's Python, so use Python conventions, no uppercase in function or variable names
movie_number = 'tt0084726'
# The f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'


# this is overly fancy for something as simple as initialising some variables
# how about:
# a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables, also 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary as it automatically gets rid of duplicates as well
    people = {}

    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')

    # the whole point of compiling the regex is to only have to do it once,
    # so outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is already a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name

    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit, allows you to import stuff without running the main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
And the code with a few more tweaks and without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')

    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name

    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()
I have been developing a Python web crawler to collect used-car stock data from this website. (http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page=20)
First of all, I would like to collect only "BMW" entries from the list, so I used the search function from the regular expression module, as in the code below. But it keeps returning None.
Is there anything wrong in my code?
Please give me some advice.
Thanks.
from bs4 import BeautifulSoup
import urllib.request
import re
CAR_PAGE_TEMPLATE = "http://www.bobaedream.co.kr/cyber/CyberCar.php?gubun=I&page="
def fetch_post_list():
    for i in range(20, 21):
        URL = CAR_PAGE_TEMPLATE + str(i)
        res = urllib.request.urlopen(URL)
        html = res.read()
        soup = BeautifulSoup(html, 'html.parser')
        table = soup.find('table', class_='cyber')
        print("Page#", i)
        # 50 lists per each page
        lists = table.find_all('tr', itemtype="http://schema.org/Article")
        count = 0
        r = re.compile("[BMW]")
        for lst in lists:
            if lst.find_all('td')[3].find('em').text:
                lst_price = lst.find_all('td')[3].find('em').text
                lst_title = lst.find_all('td')[1].find('a').text
                lst_link = lst.find_all('td')[1].find('a')['href']
                lst_photo_url = ''
                if lst.find_all('td')[0].find('img'):
                    lst_photo_url = lst.find_all('td')[0].find('img')['src']
                count += 1
            else:
                continue
            print('#', count, lst_title, r.search("lst_title"))
    return lst_link

fetch_post_list()
r.search("lst_title")
This is searching inside the string literal "lst_title", not the variable named lst_title, which is why it never matches.
r=re.compile("[BMW]")
The square brackets indicate that you're looking for any one of those characters, so, for example, any string containing M will match. You just want "BMW". In fact, you don't even need regular expressions; you can just test:
"BMW" in lst_title
I have written this line of code for creating a list through XPath:
classes=tree.xpath('//a[@class="pathm"]/../../../../../td[3]/font/text()')
It creates a list. There are also items containing empty text, but the list does not contain them; it contains only non-empty values. I want an empty string in the list wherever there is no text. Please help.
You can get only the //font elements and later use a loop to take each one's text, or an empty string if there is no text (or rather None):
import lxml.html
data = '''
<font>A</font>
<font></font>
<font>C</font>
'''
tree = lxml.html.fromstring(data)
fonts = tree.xpath('//font')
result = [x.text if x.text else '' for x in fonts]
print(result)
If you don't know how the list comprehension works - it does the same as this:
result = []

for x in fonts:
    if x.text:  # not None
        result.append(x.text)
    else:
        result.append('')

print(result)
from bs4 import BeautifulSoup #imports beautifulSoup package
import urllib2
url2 = 'http://www.waldenu.edu/doctoral/phd-in-management/faculty'
page2 = urllib2.urlopen(url2)
soup2 = BeautifulSoup(page2.read(), "lxml")
row2 = soup2.findAll('p')
row2 = row2[18:-4]
names2 = []
for x in row2:
    currentString2 = x.findAll('strong')
    if len(currentString2) > 0:
        currentString2 = currentString2[0]
        names2.append(currentString2.text)
This produces a list of names with first and last names. I'm trying to separate the first and last names and put all of the first names into one list and the last names into their own separate list (also removing the commas and spaces incidentally). What's the best way of doing so?
You'll want to use the .split() method of strings.
E.g.
>>> 'Lilburn P. Hoehn'.split()
['Lilburn', 'P.', 'Hoehn']
>>> 'Jean Gordon'.split()
['Jean', 'Gordon']
Then have some logic around whether the list is 2 or 3 elements long.
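A sketch of that logic, using the example names above and assuming the first token is the first name and the final token is the last name (middle initials are simply dropped):

```python
names = ['Lilburn P. Hoehn', 'Jean Gordon']  # example names from above

first_names = []
last_names = []
for name in names:
    parts = name.split()  # 2 or 3 elements; commas/spaces are already gone
    first_names.append(parts[0])
    last_names.append(parts[-1])

print(first_names)  # ['Lilburn', 'Jean']
print(last_names)   # ['Hoehn', 'Gordon']
```

Using parts[-1] for the last name sidesteps the 2-vs-3-element check, since the last token is the surname either way.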
I am attempting to use BeautifulSoup to parse an html table which I uploaded to http://pastie.org/8070879 in order to get the three columns (0 to 735, 0.50 to 1.0 and 0.5 to 0.0) as lists. To explain why, I will want the integers 0-735 to be keys and the decimal numbers to be values.
From reading many of the other posts on SO, I have come up with the following, which does not come close to creating the lists I want. All it does is display the text in the table, as seen here: http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks
HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (as in this case) that model gets in the way more than it helps. Pyparsing includes some HTML parsing features that are more robust than just using raw regexes, but otherwise work in similar fashion, letting you define snippets of HTML of interest and just ignoring the rest. Here is a parser that reads through your posted HTML source:
from pyparsing import makeHTMLTags,withAttribute,Suppress,Regex,Group
""" looking for this recurring pattern:
<td valign="top" bgcolor="#FFFFCC">00-03</td>
<td valign="top">.50</td>
<td valign="top">.50</td>
and want a dict with keys 0, 1, 2, and 3 all with values (.50,.50)
"""
td,tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td,tdend,keytd = map(Suppress,(td,tdend,keytd))
realnum = Regex(r'1?\.\d+').setParseAction(lambda t:float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t:int(t[0]))
DASH = Suppress('-')
# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2*(td + realnum + tdend))("vals"))
This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and also already converts from string to integers or floats at parse time).
Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:
# search the input HTML for matches to the entryExpr expression, and build up lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end+1):
        lookup[i] = tuple(entry.vals)
And now to try out some lookups:
# print out some test values
for test in (0, 20, 100, 700):
    print (test, lookup[test])
prints:
0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
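As a design note: expanding every integer in each range into its own dict key works fine at this scale, but an alternative (a stdlib-only sketch with made-up range data, not the parsed table values) is to keep the sorted lower bounds of the ranges and let bisect find the right bucket at lookup time:

```python
import bisect

# Hypothetical range starts and value pairs, sorted by start:
starts = [0, 4, 620]                               # lower bound of each range
pairs = [(0.50, 0.50), (0.53, 0.47), (0.99, 0.01)]

def lookup_pair(key):
    # find the rightmost range whose start is <= key
    i = bisect.bisect_right(starts, key) - 1
    return pairs[i]

print(lookup_pair(700))  # (0.99, 0.01)
print(lookup_pair(2))    # (0.5, 0.5)
```

This keeps memory proportional to the number of ranges rather than the number of integers covered.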
I think the above answer is better than what I would offer, but I have a BeautifulSoup answer that can get you started. This is a bit hackish, but I figured I would offer it nevertheless.
With BeautifulSoup, you can find all the tags with certain attributes in the following way (assuming you have a soup.object already set up):
soup.find_all('td', attrs={'bgcolor':'#FFFFCC'})
That will find all of your keys. The trick is to associate these with the values you want, which all show up immediately afterward and which are in pairs (if these things change, by the way, this solution won't work).
Thus, you can try the following to access what follows your key entries and put those into your_dictionary:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling
The problem is that the "next_sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling.next_sibling.string
And if you want the two following values, you have to double this:
for node in soup.find_all('td', attrs={'bgcolor':'#FFFFCC'}):
    your_dictionary[node.string] = [node.next_sibling.next_sibling.string,
                                    node.next_sibling.next_sibling.next_sibling.next_sibling.string]
Disclaimer: that last line is pretty ugly to me.
I've used BeautifulSoup 3, but it probably will work under 4.
# Import System libraries
import re
# Import Custom libraries
from BeautifulSoup import BeautifulSoup
# This may be different between BeautifulSoup 3 and BeautifulSoup 4
with open("fide.html") as file_h:
    # Read the file into the BeautifulSoup class
    soup = BeautifulSoup(file_h.read())

tr_location = lambda x: x.name == u"tr"  # Row location
key_location = lambda x: x.name == u"td" and bool(set([(u"bgcolor", u"#FFFFCC")]) & set(x.attrs))  # Integer key location
td_location = lambda x: x.name == u"td" and not dict(x.attrs).has_key(u"bgcolor")  # Float value location

str_key_dict = {}
num_key_dict = {}
for tr in soup.findAll(tr_location):  # Loop through all found rows
    for key in tr.findAll(key_location):  # Loop through all found Integer key tds
        key_list = []
        key_str = key.text.strip()
        for td in key.findNextSiblings(td_location)[:2]:  # Loop through the next 2 neighbouring Float values
            key_list.append(td.text)
        key_list = map(float, key_list)  # Convert the text values to floats

        # String based dictionary section
        str_key_dict[key_str] = key_list

        # Number based dictionary section
        num_range = map(int, re.split("\s*-\s*", key_str))  # Extract a value range to perform interpolation
        if(len(num_range) == 2):
            num_key_dict.update([(x, key_list) for x in range(num_range[0], num_range[1] + 1)])
        else:
            num_key_dict.update([(num_range[0], key_list)])

for x in num_key_dict.items():
    print x