What is the best way to extract this data - python

Looking at the site, I would not expect an error, because each Yoruba entry has both a Meaning and a Translation, and there are 220 entries.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import re

res = requests.get('http://yoruba.unl.edu/yoruba.php-text=1a&view=0&uni=0&l=1.htm')
soup = BeautifulSoup(res.content, 'html.parser')
edu = {'Yoruba': [], 'Translation': [], 'Meaning': []}

# first loop
for br in soup.select('p > br:nth-of-type(1)'):
    text = br.previous_sibling.strip()
    edu['Yoruba'].append(text)

# second loop
for br in soup.select('p > br:nth-of-type(2)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Translation'].append(text.strip())

# third loop
for br in soup.select('p > br:nth-of-type(3)'):
    text = br.previous_sibling
    if isinstance(text, str):
        edu['Meaning'].append(re.sub(r'[\(\)]', '', str(text.strip())))

df7 = pd.DataFrame(edu)
Error
ValueError: arrays must all be same length

Since the three lists end up with different lengths, the simplest fix is to pad the shorter lists to the length of the longest one (220, in this case). To do that, add the following right before creating your dataframe:
length = max(len(edu['Meaning']), len(edu['Translation']), len(edu['Yoruba']))  # find the length of the longest list
for k in edu:
    for i in range(length - len(edu[k])):
        edu[k].append("NA")  # this is where the padding happens; you can replace "NA" with anything else, obviously
df7 = pd.DataFrame.from_dict(edu)  # since edu is a dictionary, I would use this method
df7
Let me know if that works.
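As an alternative (a sketch, not part of the original answer): pandas can do the padding itself if each list is wrapped in a `pd.Series`, since columns of unequal length are aligned by index and the gaps become `NaN`. The toy `edu` dict below is illustrative, not the scraped data.

```python
import pandas as pd

# toy stand-in for the scraped dict (the real one has up to 220 entries per key)
edu = {'Yoruba': ['a', 'b', 'c'], 'Translation': ['x'], 'Meaning': ['m', 'n']}

# wrapping each list in a Series lets pandas pad the short columns with NaN
df7 = pd.DataFrame({k: pd.Series(v) for k, v in edu.items()})
print(df7)
```

This avoids mutating the dict and keeps the missing values as real `NaN`, which is friendlier for later filtering than a `"NA"` string.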

Related

Python - Returning String with recursion

The project I am doing requires us to input a URL, follow the link at a particular position a number of times, and then return the last page visited. I have found a solution with a while loop, but now I am trying to do it with recursion.
Example: http://py4e-data.dr-chuck.net/known_by_Fikret.html
Find the link at position 3 (the first name is 1). Follow that link. Repeat this process 4 times. The answer is the last name that you retrieve.
Sequence of names: Fikret Montgomery Mhairade Butchi Anayah
Last name in sequence: Anayah
My code is this so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl
import re

cnt = input("Enter count:")
post = input("Enter position:")
url = "http://py4e-data.dr-chuck.net/known_by_Fikret.html"
count = int(cnt)
position = int(post)

def GetLinks(initialPage, position, count):
    html = urlopen(initialPage).read()
    soup = BeautifulSoup(html, "html.parser")
    temp = ""
    links = list()
    tags = soup('a')
    for tag in tags:
        x = tag.get('href', None)
        links.append(x)
    print(links[position - 1])
    if count > 1:
        GetLinks(links[position - 1], position, count - 1)
    return links[position - 1]

y = GetLinks(url, position, count)
print("****", y)
I see two problems with my code.
I am creating a list that is expanding with every recursion, which makes it very hard to locate the proper value.
Second, I am obviously returning the wrong item.
I don't know exactly how to fix this.
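For reference, a minimal sketch of the fix (an assumption about the intended behaviour, using a toy link graph in place of `urlopen` + BeautifulSoup): the recursive call's result must be returned, not discarded.

```python
def follow_links(get_hrefs, url, position, count):
    """Follow the link at `position` (1-based) `count` times; return the last URL."""
    if count == 0:
        return url
    links = get_hrefs(url)
    # the key fix: return the recursive call's result instead of throwing it away
    return follow_links(get_hrefs, links[position - 1], position, count - 1)

# toy link graph standing in for real pages (illustration only)
graph = {'a': ['b', 'c'], 'c': ['d', 'e'], 'e': ['f', 'g']}
last = follow_links(lambda u: graph[u], 'a', 2, 3)  # a -> c -> e -> g
```

Because each call returns before the next level's list is built, there is no single ever-growing list; every recursion level works with its own fresh `links`.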

Beautiful soup, list index out of range

I looked at the site's HTML source and found what I need for namePlayer: it is in the 4th column, inside an 'a' tag. So in answers.append I tried 'namePlayer': cols[3].a.text.
But when I run it, I get an IndexError. I then tried changing the index to 2, 3, 4, 5, but nothing worked.
Issue: why do I get IndexError: list index out of range, when everything looks OK (I think :D)?
source:
#!/usr/bin/env python3
import re
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def get_html(url):
    opener = AppURLopener()
    response = opener.open(url)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html)
    table = soup.find(id='answers')
    answers = []
    for row in table.find_all('div')[16:]:
        cols = row.find_all('div')
        answers.append({
            'namePlayer': cols[3].a.text
        })
    for answer in answers:
        print(answers)

def main():
    parse(get_html('http://jaze.ru/forum/topic?id=50&page=1'))

if __name__ == '__main__':
    main()
You are overwriting cols on every iteration of your loop. The last cols has length zero, hence your error.
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
    print(len(cols))
Run the above and you will see cols ends up at length 0.
This might also occur elsewhere in loop so you should test the length and also decide if your logic needs updating. Also, you need to account for whether there is a child a tag.
So, you might, for example, do the following (bs4 4.7.1+ required):
answers = []
for row in table.find_all('div')[16:]:
    cols = row.select('div:has(> a)')  # :has() is a CSS selector, so use select (bs4 4.7.1+)
    if len(cols) > 3:  # cols[3] needs at least 4 elements
        answers.append({
            'namePlayer': cols[3].a.text
        })
Note that answers has been properly indented so you are working with each cols value. This may not fit your exact use case as I am unsure what your desired result is. If you state the desired output I will update accordingly.
EDIT:
To get the player names:

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = {i.text.strip() for i in soup.select('[id^=answer_] .left-side a')}
You can preserve order while de-duplicating with OrderedDict (this is by #Michael; other solutions in that Q&A):
from bs4 import BeautifulSoup as bs
import requests
from collections import OrderedDict
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = OrderedDict.fromkeys(names).keys()
It does sound like you are asking for an index at which no list element exists. Remember that indexing starts at 0 (0, 1, 2, 3), so if a list has four elements and you ask for element 10, you get an IndexError.
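A tiny standalone sketch of that failure mode, with a guard (the names are illustrative, not from the question):

```python
cols = ['a', 'b', 'c']      # len(cols) == 3, so the valid indexes are 0, 1, 2
try:
    name = cols[3]          # index 3 is past the end, so this raises IndexError
except IndexError:
    name = None             # fall back instead of crashing on a short row
print(name)
```

In scraping code the usual pattern is to check `len(cols)` (or catch the exception as above) before indexing, because rows on real pages often have fewer cells than expected.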
Why use a for loop just to find all the div tags?
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
You can get all the tags you want in one step:
cols = table.find_all('div')[16:]
So just replace the loop with that line and you will get your answer.

getting specific part from a page source python

I am trying to extract a specific part from a page using regex but it isn't working.
This is the part I want to be extracted from the page:
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
So far I've tried this:
import requests
import re
r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
mystrx = re.search(r'^{"clickTrackingParams".*"voteStatus":"LIKE"}}]}}', html_source)
but it didn't work out for me.
Try this:
import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'
# Find the first occurrence of the end marker
end = html_source.find(snd)
# Get the closest start index before it
start = max(idx.start() for idx in re.finditer(fst, html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
Which Outputs:
{"clickTrackingParams":"CPcBEJhNIhMIwrDVo4qw3gIVTBnVCh28iAtzKPgd","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"performCommentActionEndpoint":{"action":"CAUQAhoaVWd4MEdWUGNadTdvclcwT09WdDRBYUFCQWcqC1pNZlAzaERwdjlBMAA4AEoVMTA1MTc3MTgyMDc5MDg5MzQ1ODM4UACKAVQSC1pNZlAzaERwdjlBMixlaHBWWjNnd1IxWlFZMXAxTjI5eVZ6QlBUMVowTkVGaFFVSkJadyUzRCUzRMABAMgBAOABAaICDSj___________8BQAA%3D","clientActions":[{"updateCommentVoteAction":{"voteCount":{"accessibility":{"accessibilityData":{"label":"80 likes"}},"simpleText":"80"},"voteStatus":"LIKE"}}]}}
If you want to get the second occurrence, you can try something along the lines of:
import requests
import re

r = requests.get('http://rophoto.es/ash.txt')
html_source = r.text
fst, snd = '{"clickTrackingParams":', '"voteStatus":"LIKE"}}]}}'

def find_nth(string, to_find, n):
    """
    Finds the nth match in string
    """
    # find all occurrences
    matches = [idx.start() for idx in re.finditer(to_find, string)]
    # return the nth match
    return matches[n]

# find the second match
end = find_nth(html_source, snd, 1)
# Get the closest start index before end
start = max(idx.start() for idx in re.finditer(fst, html_source) if idx.start() < end)
print(html_source[start:end+len(snd)])
Note: in the second example you can run into IndexErrors if you request an occurrence outside of the found matches. You will need to handle that behaviour yourself.
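For completeness, the original re.search attempt can also be made to work (a sketch, using a made-up snippet instead of the downloaded page): drop the ^ anchor, escape the braces and brackets, and make the quantifier non-greedy so the match stops at the first closing marker.

```python
import re

# made-up stand-in for the downloaded page text (illustration only)
text = 'junk {"clickTrackingParams":"abc","voteStatus":"LIKE"}}]}} trailing'

# non-greedy .*? stops at the first end marker; re.DOTALL lets it span newlines
m = re.search(r'\{"clickTrackingParams".*?"voteStatus":"LIKE"\}\}\]\}\}', text, re.DOTALL)
print(m.group(0))
```

The original pattern failed mainly because `^` anchors the match to the very start of the text, which a page-sized string will almost never satisfy.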

How to Crawl Multiple Websites to find common Words (BeautifulSoup,Requests,Python3)

I'm wondering how to crawl multiple different websites using beautiful soup/requests without having to repeat my code over and over.
Here is my code right now:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
What I am trying to do is ideally crawl 5 different websites, find all of the individual words on these websites, find the frequency of each word on each website, ADD all the frequencies together for each particular word, then combine all of this data into one dataframe that can be exported using Pandas.
Hopefully the output would look like this
Word Frequency
the 200
man 300
is 400
tired 300
My code can only do this for ONE website at a time right now and I'm trying to avoid repeating my code.
Now, I can do this manually by repeating my code over and over and crawling each individual website and then concatenating my results for each of these dataframes together but that seems very unpythonic. I was wondering if anyone had a faster way or any advice? Thank you!
Make a function:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()

def GetData(url):
    Website1 = requests.get(url)
    soup = BeautifulSoup(Website1.content)
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    cnt.update(a)  # update with the Counter itself so the counts are summed

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
Just loop and update a main Counter dict:
main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards", "http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content)
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # pass the Counter itself, not most_common(), so the counts add up

make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)
Unlike a normal dict.update, Counter.update adds to the existing counts; it does not replace them.
On a style note, use lowercase for variable names and separate words with underscores: make_a_frame.
Try:
comm = [[k, v] for k, v in main_c.items()]
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))
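The summing behaviour of Counter.update is easy to check in isolation (a standalone sketch with made-up counts):

```python
from collections import Counter

main_c = Counter()
# each update adds to the existing counts rather than replacing them
main_c.update(Counter({'the': 2, 'man': 1}))
main_c.update(Counter({'the': 3}))
print(main_c['the'], main_c['man'])
```

This is exactly why accumulating one Counter per site into a main Counter gives per-word totals across all the crawled pages.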

Parse HTML Table with Python BeautifulSoup

I am attempting to use BeautifulSoup to parse an html table which I uploaded to http://pastie.org/8070879 in order to get the three columns (0 to 735, 0.50 to 1.0 and 0.5 to 0.0) as lists. To explain why, I will want the integers 0-735 to be keys and the decimal numbers to be values.
From reading many of the other posts on SO, I have come up with the following which does not come close to creating the lists I want. All it does is display the text in the table as is seen here http://i1285.photobucket.com/albums/a592/TheNexulo/output_zps20c5afb8.png
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("fide.html"))
table = soup.find('table')
rows = table.findAll('tr')
for tr in rows:
    cols = tr.findAll('td')
    for td in cols:
        text = ''.join(td.find(text=True))
        print text + "|",
    print
I'm new to Python and BeautifulSoup, so please be gentle with me! Thanks
HTML parsers like BeautifulSoup presume that what you want is an object model that mirrors the input HTML structure. But sometimes (like in this case) that model gets in the way more than helps. Pyparsing includes some HTML parsing features that are more robust than just using raw regexes, but otherwise work in similar fashion, letting you define snippets of HTML of interest, and just ignoring the rest. Here is a parser that reads through your posted HTML source:
from pyparsing import makeHTMLTags, withAttribute, Suppress, Regex, Group

""" looking for this recurring pattern:
        <td valign="top" bgcolor="#FFFFCC">00-03</td>
        <td valign="top">.50</td>
        <td valign="top">.50</td>
    and want a dict with keys 0, 1, 2, and 3 all with values (.50, .50)
"""

td, tdend = makeHTMLTags("td")
keytd = td.copy().setParseAction(withAttribute(bgcolor="#FFFFCC"))
td, tdend, keytd = map(Suppress, (td, tdend, keytd))

realnum = Regex(r'1?\.\d+').setParseAction(lambda t: float(t[0]))
integer = Regex(r'\d{1,3}').setParseAction(lambda t: int(t[0]))
DASH = Suppress('-')

# build up an expression matching the HTML bits above
entryExpr = (keytd + integer("start") + DASH + integer("end") + tdend +
             Group(2*(td + realnum + tdend))("vals"))
This parser not only picks out the matching triples, it also extracts the start-end integers and the pairs of real numbers (and also already converts from string to integers or floats at parse time).
Looking at the table, I'm guessing you actually want a lookup that will take a key like 700, and return the pair of values (0.99, 0.01), since 700 falls in the range of 620-735. This bit of code searches the source HTML text, iterates over the matched entries and inserts key-value pairs into the dict lookup:
# search the input HTML for matches to the entryExpr expression, and build up the lookup dict
lookup = {}
for entry in entryExpr.searchString(sourcehtml):
    for i in range(entry.start, entry.end + 1):
        lookup[i] = tuple(entry.vals)
And now to try out some lookups:
# print out some test values
for test in (0, 20, 100, 700):
    print (test, lookup[test])
prints:
0 (0.5, 0.5)
20 (0.53, 0.47)
100 (0.64, 0.36)
700 (0.99, 0.01)
I think the above answer is better than what I would offer, but I have a BeautifulSoup answer that can get you started. This is a bit hackish, but I figured I would offer it nevertheless.
With BeautifulSoup, you can find all the tags with certain attributes in the following way (assuming you have a soup.object already set up):
soup.find_all('td', attrs={'bgcolor':'#FFFFCC'})
That will find all of your keys. The trick is to associate these with the values you want, which all show up immediately afterward and which are in pairs (if these things change, by the way, this solution won't work).
Thus, you can try the following to access what follows your key entries and put those into your_dictionary:
for node in soup.find_all('td', attrs={'bgcolor': '#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling
The problem is that the "next_sibling" is actually a '\n', so you have to do the following to capture the next value (the first value you want):
for node in soup.find_all('td', attrs={'bgcolor': '#FFFFCC'}):
    your_dictionary[node.string] = node.next_sibling.next_sibling.string
And if you want the two following values, you have to double this:
for node in soup.find_all('td', attrs={'bgcolor': '#FFFFCC'}):
    your_dictionary[node.string] = [node.next_sibling.next_sibling.string, node.next_sibling.next_sibling.next_sibling.next_sibling.string]
Disclaimer: that last line is pretty ugly to me.
I've used BeautifulSoup 3, but it probably will work under 4.
# Import System libraries
import re

# Import Custom libraries
from BeautifulSoup import BeautifulSoup

# This may be different between BeautifulSoup 3 and BeautifulSoup 4
with open("fide.html") as file_h:
    # Read the file into the BeautifulSoup class
    soup = BeautifulSoup(file_h.read())

tr_location = lambda x: x.name == u"tr"  # Row location
key_location = lambda x: x.name == u"td" and bool(set([(u"bgcolor", u"#FFFFCC")]) & set(x.attrs))  # Integer key location
td_location = lambda x: x.name == u"td" and not dict(x.attrs).has_key(u"bgcolor")  # Float value location

str_key_dict = {}
num_key_dict = {}
for tr in soup.findAll(tr_location):  # Loop through all found rows
    for key in tr.findAll(key_location):  # Loop through all found Integer key tds
        key_list = []
        key_str = key.text.strip()
        for td in key.findNextSiblings(td_location)[:2]:  # Loop through the next 2 neighbouring Float values
            key_list.append(td.text)
        key_list = map(float, key_list)  # Convert the text values to floats

        # String based dictionary section
        str_key_dict[key_str] = key_list

        # Number based dictionary section
        num_range = map(int, re.split("\s*-\s*", key_str))  # Extract a value range to perform interpolation
        if(len(num_range) == 2):
            num_key_dict.update([(x, key_list) for x in range(num_range[0], num_range[1] + 1)])
        else:
            num_key_dict.update([(num_range[0], key_list)])

for x in num_key_dict.items():
    print x
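For readers on current BeautifulSoup 4 (Python 3), the same range-keyed lookup can be sketched without pyparsing. The toy rows below stand in for the pastie table; the attribute names are taken from the snippets above.

```python
from bs4 import BeautifulSoup

# toy rows shaped like the pastie table (illustration only; the real file has many more)
html = """
<table>
<tr><td bgcolor="#FFFFCC">00-03</td><td>.50</td><td>.50</td></tr>
<tr><td bgcolor="#FFFFCC">04-05</td><td>.51</td><td>.49</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

lookup = {}
for key_td in soup.find_all('td', bgcolor='#FFFFCC'):
    lo, hi = (int(n) for n in key_td.string.split('-'))
    # the two sibling cells hold the value pair shared by the whole key range
    vals = tuple(float(td.string) for td in key_td.find_next_siblings('td')[:2])
    for k in range(lo, hi + 1):
        lookup[k] = vals

print(lookup[0], lookup[5])
```

As in the pyparsing answer, every integer in a range like 00-03 maps to the same value pair, so a later `lookup[key]` is a plain dict access.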
