Wikipedia API get text under headers - python

I can scrape a Wikipedia page using the wikipedia API:
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
regex_result = re.findall(r"==\s(.+?)\s==", text)
print(regex_result)
From every element in regex_result (the Wikipedia headers) I want to get the text below it and append it to another list. I dug through the internet and could not find a function in the Wikipedia API that does that.
A second option would be to get the text and extract the text between headers with some module; more here: find some text in a string between specific characters
I have tried this:
l = 0
for n in regex_result:
    try:
        regal = re.findall(f"==\s{regex_result[l]}\s==(.+?)\s=={regex_result[l+1]}\s==", text)
        l += 2
    except Exception:
        continue
But it is not working:
the output is only []

You don't want to call re twice, but rather iterate directly through the results provided by regex_result. Named groups in the form of (?P<name>...) make it even easier to extract the header name without the surrounding markup.
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
# using the number 2 for '=' means you can easily find sub-headers too by increasing the value
regex_result = re.findall(r"\n={2}\s(?P<header>.+?)\s={2}\n", text)
regex_result will then be a list of strings of all the top-level section headers.
Here's what I use to make a table of contents from a wiki page. (Note: f-strings require Python 3.6)
def get_wikiheader_regex(level):
    '''The top wikiheader level has two = signs, so add 1 to the level to get the correct number.'''
    assert isinstance(level, int) and level > -1
    header_regex = rf"^={{{level+1}}}\s(?P<section>.*?)\s={{{level+1}}}$"
    return header_regex


def get_toc(raw_page, level=1):
    '''For a single raw wiki page, return the level 1 section headers as a table of contents.'''
    toc = []
    header_regex = get_wikiheader_regex(level=level)
    for line in raw_page.splitlines():
        if line.startswith('=') and re.search(header_regex, line):
            toc.append(re.search(header_regex, line).group('section'))
    return toc
>>> get_toc(text)
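If you also want the text below each header, one option is to split the content on the same header pattern instead of matching header pairs. A minimal sketch building on the code above (`sections` is just an illustrative name):

header_pattern = re.compile(r"\n={2}\s(.+?)\s={2}\n")
# a capturing group makes re.split keep the headers in the result,
# so the list alternates: [intro, header1, body1, header2, body2, ...]
parts = header_pattern.split(text)
sections = dict(zip(parts[1::2], parts[2::2]))
# sections now maps each top-level header to the text directly below it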

Related

BeautifulSoup find_all('href') returns only part of the value

I'm attempting to scrape actor/actress IDs from an IMDB movie page. I only want actors and actresses (I don't want to get any of the crew), and this question is specifically about getting the person's internal ID. I already have peoples' names, so I don't need help getting those. I'm starting with this webpage (https://www.imdb.com/title/tt0084726/fullcredits?ref_=tt_cl_sm#cast) as a hard-coded url to get the code right.
On examining the links, I found that the links for the actors look like this:
William Shatner
Leonard Nimoy
Nicholas Guest
while the ones for other contributors look like this:
Nicholas Meyer
Gene Roddenberry
This should let me differentiate actors/actresses from crew such as the director or writer by checking whether the href ends with "t[0-9]+$" rather than the same pattern ending with "dr" or "wr".
Here's the code I'm running.
import urllib.request
from bs4 import BeautifulSoup
import re

movieNumber = 'tt0084726'
url = 'https://www.imdb.com/title/' + movieNumber + '/fullcredits?ref_=tt_cl_sm#cast'


def clearLists(n):
    return [[] for _ in range(n)]


def getSoupObject(urlInput):
    page = urllib.request.urlopen(urlInput).read()
    soup = BeautifulSoup(page, features="html.parser")
    return(soup)


def getPeopleForMovie(soupObject):
    listOfPeopleNames, listOfPeopleIDs, listOfMovieIDs = clearLists(3)
    #get all the tags with links in them
    link_tags = soupObject.find_all('a')
    #get the ids of people
    for linkTag in link_tags:
        link = str(linkTag.get('href'))
        #print(link)
        p = re.compile('t[0-9]+$')
        q = p.search(link)
        if link.startswith('/name/') and q != None:
            id = link[6:15]
            #print(id)
            listOfPeopleIDs.append(id)
    #return the names and IDs
    return listOfPeopleNames, listOfPeopleIDs


newSoupObject = getSoupObject(url)
pNames, pIds = getPeopleForMovie(newSoupObject)
The above code returns an empty list for the IDs. If you uncomment the print statement, you can see why: the value that ends up in the "link" variable looks like what's below (with variations for the specific people).
/name/nm0583292/
/name/nm0000638/
That won't do. I want the IDs only for the actors and actresses so that I can use those IDs later.
I've tried to find other answers on stackoverflow; I haven't been able to find this particular issue.
This question (Beautifulsoup: parsing html – get part of href) is close to what I want to do, but it gets the info from the text part between tags rather than from the href part in the tag attribute.
How can I make sure I get only the name IDs that I want (just the actor ones) from the page?
(Also, feel free to offer suggestions to tighten up the code)
It appears that the links you are trying to match have either been modified by JavaScript after loading, or perhaps get loaded differently based on other variables than the URL alone (like cookies or headers).
However, since you're only after links of people in the cast, an easier way would be to simply match the ids of people in the cast section. This is actually fairly straightforward, since they are all in a single element, <table class="cast_list">
So:
import urllib.request
from bs4 import BeautifulSoup
import re

# it's Python, so use Python conventions: no uppercase in function or variable names
movie_number = 'tt0084726'
# the f-string is often more readable than a + concatenation
url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'

# this is overly fancy for something as simple as initialising some variables
# how about: a, b, c = [], [], []
# def clearLists(n):
#     return [[] for _ in range(n)]


# in an object-oriented program, assuming something is an object is the norm
def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    soup = BeautifulSoup(page, features="html.parser")
    # removed needless parentheses - arguably, even `soup` is superfluous:
    # return BeautifulSoup(page, features="html.parser")
    return soup


# keep two empty lines between functions, it's standard and for good reason:
# it's easier to spot where a function starts and stops
# try using an editor or IDE that highlights your PEP8 mistakes, like PyCharm
# (that's just my opinion there, other IDEs than PyCharm will do as well)
def get_people_for_movie(soup_object):
    # removed unused variables; also, 'list_of_people_ids' is needlessly verbose
    # since they go together, why not return people as a list of tuples, or a dictionary?
    # I'd prefer a dictionary, as it automatically gets rid of duplicates as well
    people = {}
    # (put a space at the start of your comment blocks!)
    # get all the anchor tags inside the `cast_list` table
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    # the whole point of compiling the regex is to only have to do it once,
    # so do it outside the loop
    id_regex = re.compile(r'/name/nm(\d+)/')
    # get the ids and names of people
    for link_tag in link_tags:
        # the href attribute is a string, so casting with str() serves no purpose
        href = link_tag.get('href')
        # matching and extracting part of the match can all be done in one step:
        match = id_regex.search(href)
        if match:
            # don't shadow Python built-ins like `id` with variable names!
            identifier = match.group(1)
            name = link_tag.text.strip()
            # just ignore the ones with no text, they're the thumbs
            if name:
                people[identifier] = name
    # return the names and IDs
    return people


def main():
    # don't do stuff globally, it'll just cause problems when reusing names in functions
    soup = get_soup(url)
    people = get_people_for_movie(soup)
    print(people)


# not needed here, but a good habit: it allows you to import stuff without running the main
if __name__ == '__main__':
    main()
Result:
{'0000638': 'William Shatner', '0000559': 'Leonard Nimoy', '0001420': 'DeForest Kelley', etc.
And the code with a few more tweaks and without the commentary on your code:
import urllib.request
from bs4 import BeautifulSoup
import re


def get_soup(url_input):
    page = urllib.request.urlopen(url_input).read()
    return BeautifulSoup(page, features="html.parser")


def get_people_for_movie(soup_object):
    people = {}
    link_tags = soup_object.find('table', class_='cast_list').find_all('a')
    id_regex = re.compile(r'/name/nm(\d+)/')
    # get the ids and names of the cast
    for link_tag in link_tags:
        match = id_regex.search(link_tag.get('href'))
        if match:
            name = link_tag.text.strip()
            if name:
                people[match.group(1)] = name
    return people


def main():
    movie_number = 'tt0084726'
    url = f'https://www.imdb.com/title/{movie_number}/fullcredits?ref_=tt_cl_sm#cast'
    people = get_people_for_movie(get_soup(url))
    print(people)


if __name__ == '__main__':
    main()

Beautifulsoup scrape content of a cell beside another one

I am trying to scrape the content of the cell beside another cell whose name I know, e.g. "Staatsform", "Amtssprache", "Postleitzahl", etc. In the picture, the needed content is always in the right-hand cell.
The basic code is the following, but I am stuck:
import requests
from bs4 import BeautifulSoup

source_code = requests.get('https://de.wikipedia.org/wiki/Hamburg')
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
staatsform = soup.find(text="Staatsform:")...???
Many thanks in advance!
I wanted to exercise care in limiting the search to what is called the 'Infobox' in the English-language Wikipedia. Therefore, I searched first for the heading 'Basisdaten', requiring that it be a th element. Not exactly definitive, perhaps, but more likely to be. Having found that, I looked for tr elements under 'Basisdaten' until I found another tr containing a (presumably different) heading. In this case I search for 'Postleitzahlen:', but this approach makes it possible to find any/all of the items between 'Basisdaten' and the next heading.
PS: I should also mention the reason for if not current.name. I noticed some lines consisting of just newlines, which BeautifulSoup treats as strings. These don't have names, hence the need to treat them specially in the code.
import requests
import bs4

page = requests.get('https://de.wikipedia.org/wiki/Hamburg').text
soup = bs4.BeautifulSoup(page, 'lxml')


def getInfoBoxBasisDaten(s):
    return str(s) == 'Basisdaten' and s.parent.name == 'th'


basisdaten = soup.find_all(string=getInfoBoxBasisDaten)[0]

wanted = 'Postleitzahlen:'
current = basisdaten.parent.parent.nextSibling
while True:
    if not current.name:
        current = current.nextSibling
        continue
    if wanted in current.text:
        items = current.findAll('td')
        print(items[0])
        print(items[1])
    if '<th ' in str(current):
        break
    current = current.nextSibling
The result is two separate td elements, as requested:
<td>Postleitzahlen:</td>
<td>20095–21149,<br/>
22041–22769,<br/>
27499</td>
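If you only want the text of the right-hand cell rather than the td markup, get_text() with a separator works (an illustrative follow-up, not part of the original answer):

print(items[1].get_text(" ", strip=True))
# 20095–21149, 22041–22769, 27499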
This works most of the time:
def get_content_from_right_column_for_left_column_containing(text):
    """return the text contents of the cell adjoining a cell that contains `text`"""
    navigable_strings = soup.find_all(text=text)
    if len(navigable_strings) > 1:
        raise Exception('more than one element with that text!')
    if len(navigable_strings) == 0:
        # left-column contents that are links don't have a colon in their text content...
        if ":" in text:
            altered_text = text.replace(':', '')
        # ...but `td`s and `th`s do.
        else:
            altered_text = text + ":"
        navigable_strings = soup.find_all(text=altered_text)
    try:
        return navigable_strings[0].find_parent('td').find_next('td').text
    except IndexError:
        raise IndexError('there are no elements containing that text.')
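For example, with the Hamburg soup from above (an illustrative call; the label text must match the infobox cell exactly):

print(get_content_from_right_column_for_left_column_containing('Postleitzahlen:'))
# should print the postal-code ranges from the cell to the right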

Getting regular text from wikipedia page

I am trying to get the text, or the summary text, from a random Wikipedia page. In the end I need it to be a list of lists of words (a list of sentences).
I am using the following code
def get_random_pages_summary(pages=0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]


def text_to_list_of_words_without_new_line(text):
    t = text.replace("\n", " ").strip()
    t1 = t.split()
    t2 = ["".join(w) for w in t1]
    return t2


text = get_random_pages_summary(1)
for i, row in enumerate(text):
    text[i][1] = text_to_list_of_words_without_new_line(row[1])
print text[0][1]
I am getting weird tokens; I assume they are a relic of the markup of the Wikipedia page, e.g.
Russian:', u'\u0418\u0432\u0430\u043d
I found that it probably happens when there is a quote from another language inside the English page; it also happens when there is a range of years in the page, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove those that I cannot convert to regular words.
Thanks.
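Since no fix is shown here, one possible sketch (keep_regular_words and word_pattern are illustrative names, not from the thread): the stray tokens are non-ASCII escape sequences, so filtering each word against an ASCII-only pattern would drop them while keeping ordinary words and year ranges:

import re

# keep only tokens made of ASCII letters, digits and hyphens; this drops
# escaped foreign-language tokens such as u'\u0418\u0432\u0430\u043d'
word_pattern = re.compile(r'^[A-Za-z0-9-]+$')

def keep_regular_words(words):
    return [w for w in words if word_pattern.match(w)]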

Beautiful Soup .get_text() does not equal Python string when it should

I am using Beautiful Soup to grab text from an html element.
I am then using a loop and if statement to compare that text to a list of words. If they match I want to return a confirmation.
However, the code is not confirming any matches, even though print statements show there are in fact matches.
import csv
import requests
from bs4 import BeautifulSoup


def findText():
    text = ""
    url = 'www.site.com'
    #Get url and store
    page = requests.get(url)
    #Get page content
    soup = BeautifulSoup(page.content, "html.parser")
    els = soup.select(".className")
    lists = els[1].select(".className2")
    for l in lists:
        try:
            text = l.find("li").get_text()
        except(AttributeError):
            text = "null"
    return text


def isMatch(text):
    #Open csv file
    listFile = open('list.csv', 'rb')
    #prep file to be read
    newListFile = csv.reader(listFile)
    match = ""
    for r in newListFile:
        if r[0] == text.lower():
            match = True
        else:
            match = False
    return match
    congressCSVFile.close()
match is always False in the output
print(r[0]) returns (let's just say) "cat" in terminal
print(text) also returns "cat" in terminal
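One quick check worth adding (an illustrative debugging aid, not from the original post): repr() makes stray whitespace or newlines visible where print() hides them:

print(repr(r[0]), repr(text))
# e.g. 'cat' vs 'cat\n' would explain a failed comparison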
Your loop is the problem, or at least one of them. Once you find a record that matches, you keep going. match will only end up True if the last record matches. To fix this, simply return when you find a match:
for r in newListFile:
    if r[0] == text.lower():
        return True
return False
The match variable is not needed.
Better yet, use the any() function:
return any(r[0] == text.lower() for r in newListFile)
In your try block, use: text = l.find("li").get_text(strip=True)
Soup, and HTML in general, adds a significant amount of white space. If you don't strip it out with the strip parameter, you may never get a match unless the white space is included in your list file.
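To see the difference strip=True makes (a minimal illustration):

from bs4 import BeautifulSoup

li = BeautifulSoup("<li>\n  cat \n</li>", "html.parser").li
print(repr(li.get_text()))            # '\n  cat \n'
print(repr(li.get_text(strip=True)))  # 'cat'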

In Python, how do I search an html webpage for a set of strings in text file?

I'm trying to figure out a way to search an html webpage for a number of strings that I have written in a text file, each on its own line. The code I have so far is this:
import re
import requests
from bs4 import BeautifulSoup


def mangasearch():
    r = requests.get("https://www.mangaupdates.com/releases.html")
    soup = BeautifulSoup(r.text)
    if soup.find(text=re.compile(line)):
        print("New chapter found!")
        print("")
    else:
        print("No new chapters")
        print("")


def textsearch():
    global line
    with open("manga.txt") as file:
        for line in file:
            print(line)
            mangasearch()
It's supposed to read manga.txt and search the webpage for each string separately, but it always returns "no new chapters". If I replace if soup.find(text=re.compile(line)): with if soup.find(text=re.compile("An actual string")): it works correctly, but for some reason it doesn't want to use the line variable. Any help would be appreciated!
The problem is that you are trying to search with a string that contains special characters, like ' ' and '\n'.
Note that str.strip() removes ' ' and other whitespace characters as well (e.g. tabs and newlines), so update the following line:
if soup.find(text=re.compile(line.strip())):
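A fuller sketch of the same fix (a restructuring, not the poster's code): pass the stripped line as a parameter instead of using a global, and re.escape it so titles containing regex metacharacters don't break the search:

def mangasearch(title):
    r = requests.get("https://www.mangaupdates.com/releases.html")
    soup = BeautifulSoup(r.text, "html.parser")
    # re.escape guards against titles containing regex metacharacters
    if soup.find(text=re.compile(re.escape(title))):
        print("New chapter found!")
    else:
        print("No new chapters")


def textsearch():
    with open("manga.txt") as file:
        for line in file:
            title = line.strip()
            if title:  # skip blank lines
                mangasearch(title)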
Here is another approach, which parses the releases table directly:
import urllib2
from bs4 import BeautifulSoup


def textsearch():
    #manga_file = open("manga.txt").readlines()
    ##assuming your text file has 3 titles
    manga_file = ["Karakuri Circus", "Sun-ken Rock", "Shaman King Flowers"]
    manga_html = "https://www.mangaupdates.com/releases.html"
    manga_page = urllib2.urlopen(manga_html)
    found = 0
    soup = BeautifulSoup(manga_page)
    ##use BeautifulSoup to parse the section of interest out
    ##use 'inspect element' in Chrome or Firefox to find the interesting table
    ##in this case the data you are targeting is in a <div class="alt"> element
    chapters = soup.findAll("div", attrs={"class": "alt"})
    ##chapters now contains all of the sections under the dates on the page,
    ##so we iterate through each date to get all of the titles in each
    ##<tr> element of each table
    for dated_section in chapters:
        rows = dated_section.findAll("tr")
        for r in rows:
            title = r.td.text
            #print title
            #url = r.td.a.href
            if title in manga_file:
                found += 1
                print "New Chapter Found!"
                print r
    if found > 0:
        print "Found a total of %d titles" % found
    else:
        print "No chapters found"
The above is not meant to be optimized, but it does a good job of showing how BeautifulSoup can be used to parse out the specific elements that you are looking for. In this case I used Chrome, right-clicked on the table that contains the titles to 'inspect element', and looked for the element that contains them, to point BeautifulSoup's attention directly there. The rest is explained in the code. I don't know what your manga.txt file looks like, so I just created a list of 3 titles to search for as an example.
