Getting regular text from a Wikipedia page - Python

I am trying to get the text, or the summary text, of a random Wikipedia page. In the end I need it to be a list of lists of words (a list of sentences).
I am using the following code:
def get_random_pages_summary(pages=0):
    import wikipedia
    page_names = [wikipedia.random(1) for i in range(pages)]
    return [[p, wikipedia.page(p).summary] for p in page_names]

def text_to_list_of_words_without_new_line(text):
    t = text.replace("\n", " ").strip()
    t1 = t.split()
    t2 = ["".join(w) for w in t1]
    return t2

text = get_random_pages_summary(1)
for i, row in enumerate(text):
    text[i][1] = text_to_list_of_words_without_new_line(row[1])
print text[0][1]
I am getting weird tokens; I assume they are a relic of the wiki markup of the Wikipedia page, e.g.
Russian:', u'\u0418\u0432\u0430\u043d
I found that it probably happens when there is a quotation from another language inside the English page; it also happens when the page contains a range of years, e.g. 2015-2016.
I would like to convert all of these to regular words, and remove the ones that I cannot convert.
Thanks.
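One possible approach, sketched here as an illustration rather than a definitive answer (it assumes the goal is to keep only plain ASCII word tokens, and the helper name is made up for the example):

import re
import wikipedia

def summary_to_ascii_words(page_name):
    # Illustrative helper: return the page summary as a list of plain ASCII word tokens.
    summary = wikipedia.page(page_name).summary
    # [A-Za-z0-9]+ keeps ordinary words and numbers (so 2015-2016 becomes '2015', '2016');
    # foreign-script tokens such as u'\u0418\u0432\u0430\u043d' are dropped entirely.
    return re.findall(r"[A-Za-z0-9]+", summary.replace("\n", " "))

# words = summary_to_ascii_words(wikipedia.random(1))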

Related

Wikipedia API get text under headers

I can scrape a Wikipedia page using the wikipedia API:
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
regex_result = re.findall("==\s(.+?)\s==", text)
print(regex_result)
Now, from every element in regex_result (the Wikipedia headers), I want to get the text below it and append it to another list. I searched the internet and could not find how to do that with a function of the Wikipedia API.
A second option would be to take the full text and, with some module, extract the text between headers; more here: find some text in a string between specific characters.
I have tried this:
l = 0
for n in regex_result:
    try:
        regal = re.findall(f"==\s{regex_result[l]}\s==(.+?)\s=={regex_result[l+1]}\s==", text)
        l += 2
    except Exception:
        continue
But it is not working: the output is only [].
You don't want to call re twice, but rather iterate directly through the results provided by regex_result. Named groups in the form of (?P<name>...) make it even easier to extract the header name without the surrounding markup.
import wikipedia
import re
page = wikipedia.page("Albert Einstein")
text = page.content
# using the number 2 for '=' means you can easily find sub-headers too by increasing the value
regex_result = re.findall("\n={2}\s(?P<header>.+?)\s={2}\n", text)
regex_result will then be a list of strings of all the top-level section headers.
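To get the text under each header (what the question actually asks for), one rough approach, offered here only as a sketch on top of the regex above rather than a dedicated Wikipedia API call, is to split the content on the header lines and pair each header with the chunk that follows it:

# A sketch: split the content on level-2 header lines, then pair headers with their bodies.
parts = re.split(r"\n={2}\s(.+?)\s={2}\n", text)
# parts[0] is the intro; after that, headers and section bodies alternate.
sections = dict(zip(parts[1::2], parts[2::2]))
# sections["Early life"] would hold that section's text, if such a header exists.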
Here's what I use to make a table of contents from a wiki page. (Note: f-strings require Python 3.6)
def get_wikiheader_regex(level):
    '''The top wikiheader level has two = signs, so add 1 to the level to get the correct number.'''
    assert isinstance(level, int) and level > -1
    header_regex = f"^={{{level+1}}}\s(?P<section>.*?)\s={{{level+1}}}$"
    return header_regex

def get_toc(raw_page, level=1):
    '''For a single raw wiki page, return the level 1 section headers as a table of contents.'''
    toc = []
    header_regex = get_wikiheader_regex(level=level)
    for line in raw_page.splitlines():
        if line.startswith('=') and re.search(header_regex, line):
            toc.append(re.search(header_regex, line).group('section'))
    return toc
>>> get_toc(text)

re.sub() gives NameError when no match

So I'm trying to search and replace rows of text from a CSV file, and I keep getting errors if re.sub() can't find any matches.
Say the text in a row is
text = "a00123 一二三四五"
And my code is
import re

html = "www.abcdefg.com/"
text = "a00123 一二三四五"
namelist_raw = re.sub(r'([a-z])00(\d{3})', r'\1-\2', text)
p = re.findall(r'\w', namelist_raw)
if p:
    q = re.findall(r'([a-z]-\d{3})', namelist_raw)
    for namelist in q:
        print(namelist)
else:
    namelist = "failed"
link = html + namelist
print(link)
For this I should be getting a result of
www.abcdefg.com/a-123
so that's no problem.
But if the text is something like this:
text = "asdfdsdfd123 一二三四五"
I'll get a NameError saying name 'namelist' is not defined.
Why is that? I thought that with the if/else statement I had already written that in any other case namelist is "failed".
Your p = re.findall(r'\w',namelist_raw) is extracting every word char from a string, and later, you only extract the values from the string if there were matches. You do not need that check.
Next, namelist is only populated if there is a match for [a-z]-\d{3}, but if there is no match, you do not get it populated. You need to account for that scenario, too.
Use
import re

html = "www.abcdefg.com/"
text = "a00123 一二三四五"
p = re.findall(r'([a-z])00(\d{3})', text)  # Extract a list of (letter, number) tuples
namelist = []                              # Init the list
for letter, number in p:
    namelist.append(f"{letter}-{number}")  # Populate namelist with formatted tuple values
if len(namelist):                          # If there was a match
    namelist = "/".join(namelist)          # Create a string by joining namelist items with /
else:
    namelist = "failed"                    # Else, assign failed to namelist
link = html + namelist
print(link)
See the Python demo.
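With the original text = "a00123 一二三四五" this should print www.abcdefg.com/a-123, and with text = "asdfdsdfd123 一二三四五" (no match) it should print www.abcdefg.com/failed instead of raising a NameError.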

Extract text between two span() iterators in python

I am trying to extract text between two iterators.
I have tried using the span() function on them to find the start and the end spans.
How do I proceed further to extract the text between these spans?
start_matches = start_pattern.finditer(filter_lines)
end_matches = end_pattern.finditer(filter_lines)
for s_match in start_matches:
    s_cargo = s_match.span()
for e_match in end_matches:
    e_cargo = e_match.span()
Using the spans 1) s_cargo and 2) e_cargo, I want to find the corresponding text within the string filter_lines.
I am relatively new to python, any kind of help is much appreciated.
You can try collecting all the spans into lists first and then slicing the original string:
s_cargo = [m.span() for m in start_pattern.finditer(filter_lines)]  # all start spans
e_cargo = [m.span() for m in end_pattern.finditer(filter_lines)]    # all end spans

my_data = []
for s, e in zip(s_cargo, e_cargo):
    start, _ = s   # start offset of the start match
    _, end = e     # end offset of the end match
    my_data.append(your_text[start: end])
The variable your_text should be the text over which you are filtering with the regex (here, filter_lines).
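For illustration, a minimal self-contained sketch of the same idea (the patterns and the sample text are made up for the example):

import re

filter_lines = "junk START alpha END junk START beta END junk"
start_pattern = re.compile(r"START")
end_pattern = re.compile(r"END")

s_cargo = [m.span() for m in start_pattern.finditer(filter_lines)]
e_cargo = [m.span() for m in end_pattern.finditer(filter_lines)]

# slice from the start of each START match to the end of the matching END match
my_data = [filter_lines[start:end] for (start, _), (_, end) in zip(s_cargo, e_cargo)]
print(my_data)  # ['START alpha END', 'START beta END']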

Same code gives different output depending on whether it uses list comprehensions or generators

I am trying to clean this website and get every word. But using generators gives me more words than using lists, and the extra words are inconsistent: sometimes there is 1 more word, sometimes none, sometimes more than 30 more. I have read about generators in the Python documentation and looked up some questions about generators; as far as I understand, it shouldn't make a difference. I don't understand what's going on under the hood. I am using Python 3.6. I have also read Generator Comprehension different output from list comprehension? but I can't understand the situation.
This is the first function, using generators.
import re
import requests
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

def text_cleaner1(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem
    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object
    text = soup_obj.get_text()  # Get the text from this

    lines = (line.strip() for line in text.splitlines())  # break into lines
    print(type(lines))
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))  # break multi-headlines into a line each
    print(type(chunks))

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw
        return                                                          # an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart
    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]
    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)
    return text
This is the second function, using list comprehensions.
def text_cleaner2(website):
    '''
    This function just cleans up the raw html so that I can look at it.
    Inputs: a URL to investigate
    Outputs: Cleaned text only
    '''
    try:
        site = requests.get(website).text  # Connect to the job posting
    except:
        return  # Need this in case the website isn't there anymore or some other weird connection problem
    soup_obj = BeautifulSoup(site, "lxml")  # Get the html from the site
    for script in soup_obj(["script", "style"]):
        script.extract()  # Remove these two elements from the BS4 object
    text = soup_obj.get_text()  # Get the text from this

    lines = [line.strip() for line in text.splitlines()]  # break into lines
    chunks = [phrase.strip() for line in lines for phrase in line.split(" ")]  # break multi-headlines into a line each

    def chunk_space(chunk):
        chunk_out = chunk + ' '  # Need to fix spacing issue
        return chunk_out

    text = ''.join(chunk_space(chunk) for chunk in chunks if chunk).encode('utf-8')  # Get rid of all blank lines and ends of line

    # Now clean out all of the unicode junk (this line works great!!!)
    try:
        text = text.decode('unicode_escape').encode('ascii', 'ignore')  # Need this as some websites aren't formatted
    except:                                                             # in a way that this works, can occasionally throw
        return                                                          # an exception

    text = str(text)
    text = re.sub("[^a-zA-Z.+3]", " ", text)  # Now get rid of any terms that aren't words (include 3 for d3.js)
                                              # Also include + for C++
    text = text.lower().split()  # Go to lower case and split them apart
    stop_words = set(stopwords.words("english"))  # Filter out any stop words
    text = [w for w in text if not w in stop_words]
    text = set(text)  # Last, just get the set of these. Ignore counts (we are just looking at whether a term existed
                      # or not on the website)
    return text
And this code gives me different results randomly:
text_cleaner1("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae") - text_cleaner2("https://www.indeed.com/rc/clk?jk=02ecc871f377f959&fccid=c46d0116f6e69eae")
Generator is "lazy" - it doesn't execute code immediately but it executes it later when results will be needed. It means it doesn't get values from variables or functions immediately but it keeps references to variables and functions.
Example from the linked question:
all_configs = [
    {'a': 1, 'b': 3},
    {'a': 2, 'b': 2}
]
unique_keys = ['a', 'b']

for x in zip(*([c[k] for k in unique_keys] for c in all_configs)):
    print(x)

print('---')

for x in zip(*((c[k] for k in unique_keys) for c in all_configs)):
    print(list(x))
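If you run it, the list-comprehension version should print (1, 2) and (3, 2), while the generator version should print [2, 2] twice.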
In the generator version there is a for loop inside another for loop.
The internal generator gets a reference to c instead of the value of c, and it will read the value later.
Later (when the results are needed) execution starts with the external generator for c in all_configs. As the external generator loops, it creates two internal generators that use the reference to c, not the value of c, but the loop also keeps changing the value of c. So finally you have a "list" of two internal generators, and c holds {'a': 2, 'b': 2}.
After that the internal generators are executed and finally read the value of c, but at this moment c already holds {'a': 2, 'b': 2}.
BTW: there is a similar problem with lambda in a for loop when you use it with Buttons in tkinter; a minimal sketch of that effect is below.
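A minimal sketch of the same late-binding effect with plain lambdas (no tkinter needed; the variable names are made up for the example):

callbacks = [lambda: print(i) for i in range(3)]
for cb in callbacks:
    cb()  # prints 2 three times: each lambda reads i when it is called, not when it is defined

fixed = [lambda i=i: print(i) for i in range(3)]
for cb in fixed:
    cb()  # prints 0, 1, 2: the default argument captures the current value of i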

In Python, how do I search an html webpage for a set of strings in text file?

I'm trying to figure out a way to search an html webpage for a number of strings that I have written in a text file, each on its own line. The code I have so far is this:
import re
import requests
from bs4 import BeautifulSoup

def mangasearch():
    r = requests.get("https://www.mangaupdates.com/releases.html")
    soup = BeautifulSoup(r.text)
    if soup.find(text=re.compile(line)):
        print("New chapter found!")
        print("")
    else:
        print("No new chapters")
        print("")

def textsearch():
    global line
    with open("manga.txt") as file:
        for line in file:
            print(line)
            mangasearch()
It's supposed to read manga.txt and search the webpage for each string separately, but it always returns "no new chapters". If I replace if soup.find(text=re.compile(line)): with if soup.find(text=re.compile("An actual string")): it works correctly, but for some reason it doesn't want to use the line variable. Any help would be appreciated!
The problem is that you are trying to search with a string that contains special characters, such as a trailing '\n' (each line read from the file keeps its newline).
Note that str.strip() removes ' ' and other whitespace characters as well (e.g. tabs and newlines), so update the following line to:
if soup.find(text=re.compile(line.strip())):
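Applied to the question's code, the fix might look like this (just a sketch; it passes the stripped line into the search instead of relying on the global):

def mangasearch(title):
    r = requests.get("https://www.mangaupdates.com/releases.html")
    soup = BeautifulSoup(r.text)
    if soup.find(text=re.compile(title)):
        print("New chapter found!")
    else:
        print("No new chapters")

def textsearch():
    with open("manga.txt") as file:
        for line in file:
            title = line.strip()  # drop the trailing newline before searching
            if title:
                print(title)
                mangasearch(title)

With that change soup.find() is given clean titles and can actually match them.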
import urllib2
from bs4 import BeautifulSoup

def textsearch():
    #manga_file = open("manga.txt").readlines()
    ##assuming your text file has 3 titles
    manga_file = ["Karakuri Circus", "Sun-ken Rock", "Shaman King Flowers"]
    manga_html = "https://www.mangaupdates.com/releases.html"
    manga_page = urllib2.urlopen(manga_html)
    found = 0
    soup = BeautifulSoup(manga_page)
    ##use BeautifulSoup to parse the section of interest out
    ##use 'inspect element' in Chrome or Firefox to find the interesting table
    ##in this case the data you are targeting is in a <div class="alt"> element
    chapters = soup.findAll("div", attrs={"class": "alt"})
    ##chapters now contains all of the sections under the dates on the page
    ##so we iterate through each date to get all of the titles in each
    ##<tr> element of each table
    for dated_section in chapters:
        rows = dated_section.findAll("tr")
        for r in rows:
            title = r.td.text
            #print title
            #url = r.td.a.href
            if title in manga_file:
                found += 1
                print "New Chapter Found!"
                print r
    if found > 0:
        print "Found a total of %d title" % found
    else:
        print "No chapters found"
The above is not meant to be optimized, but it does a good job of showing how BeautifulSoup can be used to parse the specific elements that you are looking for. In this case I used Chrome, right-clicked on the table that contains the titles to "inspect element", and looked for the element that contains them so I could point BeautifulSoup directly there. The rest is explained in the code. I don't know what your manga.txt file looks like, so I just created a list of 3 titles to search for as an example.
