BeautifulSoup, list index out of range - Python

I looked at the site's HTML source and found what I need for namePlayer: it is in the 4th column, inside an 'a' tag. So in answers.append I tried 'namePlayer': cols[3].a.text.
But when I run it, I get an IndexError. I tried changing the index to 2, 3, 4, 5, but nothing works.
Issue: why do I get IndexError: list index out of range, when everything looks OK (I think :D)?
source:
#!/usr/bin/env python3
import re
import urllib.request

from bs4 import BeautifulSoup


class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"


def get_html(url):
    opener = AppURLopener()
    response = opener.open(url)
    return response.read()


def parse(html):
    soup = BeautifulSoup(html)
    table = soup.find(id='answers')
    answers = []
    for row in table.find_all('div')[16:]:
        cols = row.find_all('div')
        answers.append({
            'namePlayer': cols[3].a.text
        })
    for answer in answers:
        print(answers)


def main():
    parse(get_html('http://jaze.ru/forum/topic?id=50&page=1'))


if __name__ == '__main__':
    main()

You are overwriting cols during your loop. The last time through the loop, cols has length zero, hence your error.
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
    print(len(cols))
Run the above and you will see that cols ends up with length 0.
This might also occur elsewhere in the loop, so you should test the length and decide whether your logic needs updating. You also need to account for whether there is a child a tag.
So, you might, for example, do the following (bs4 4.7.1+ required):
answers = []
for row in table.find_all('div')[16:]:
    cols = row.select('div:has(> a)')  # find_all does not accept CSS selectors; select does
    if len(cols) > 3:  # cols[3] requires at least four matches
        answers.append({
            'namePlayer': cols[3].a.text
        })
Note that the append is now properly indented inside the loop, so you are working with each row's cols value. This may not fit your exact use case, as I am unsure what your desired result is. If you state the desired output I will update accordingly.
EDIT:
To get the player names:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = {i.text.strip() for i in soup.select('[id^=answer_] .left-side a')}
You can preserve order and de-duplicate with OrderedDict (this is by @Michael - other solutions are in that Q&A):
from bs4 import BeautifulSoup as bs
import requests
from collections import OrderedDict
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = OrderedDict.fromkeys(names).keys()
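On Python 3.7+, where plain dicts preserve insertion order, a minimal alternative sketch is to skip OrderedDict entirely:
unique_names = list(dict.fromkeys(names))  # order-preserving de-duplication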

It does sound like you are asking for an index at which no list element exists. Remember that indexing starts at 0, so a four-element list has indices 0, 1, 2, 3. If I ask for element 10, I get an IndexError.
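A minimal illustration:
cols = ['a', 'b', 'c', 'd']  # valid indices are 0..3
print(cols[3])   # OK: prints 'd'
print(cols[10])  # IndexError: list index out of range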

Why use a for loop to find all the div tags?
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
You can get all the tags you want with a single call:
cols = table.find_all('div')[16:]
So just replace your loop with this line and you will get your answer.
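A sketch of how that might look inside parse (assuming, as in the question, that the interesting divs start at index 16 and that the player name sits in a child a tag; divs without one are skipped here):
def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    cols = soup.find(id='answers').find_all('div')[16:]  # one flat list, no nested loop
    answers = [{'namePlayer': col.a.text} for col in cols if col.a]
    for answer in answers:
        print(answer)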

Related

Removing quotes from re.findall output

I am trying to remove the quotes from my re.findall output using Python 3. I tried suggestions from various forums but it didn't work as expected, so I finally thought of asking here myself.
My code:
import requests
from bs4 import BeautifulSoup
import re
import time
price = []
while True:
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.prettify()
    for p in data:
        match = re.findall('\d*\.?\d+', data)
        print("ETH/USDT", match)
        price.append(match)
        break
Output of match gives: ['143.19000000']. I would like it to be [143.19000000], but I cannot figure out how to do this.
Another problem I am encountering is that price appends every match as a single-element list, so price ends up looking like [[a], [b], [c]]. I would like it to be [a, b, c]. I am having a bit of trouble solving these two problems.
Thanks :)
Parse the response from requests.get() as JSON, rather than using BeautifulSoup:
import requests
url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
response = requests.get(url)
response.raise_for_status()
data = response.json()
print(data["price"])
To get floats instead of strings:
float_match = [float(el) for el in match]
To get a list instead of a list of lists:
for el in float_match:
    price.append(el)
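Equivalently, list.extend flattens that one level in a single call:
price.extend(float_match)  # appends the elements, not the list itself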

Is there any convenient way to get the index of a subsection in a page?

It is convenient to use "index-x" to quickly locate a subsection in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
gives the 3rd subsection on that page.
When I want to share the location of a subsection with others, how do I get the index in a convenient way?
For instance, how do I get the index of the {m,n} subsection without counting from index-0?
With bs4 4.7.1+ you can use :has and :contains to target a specific text string and return the index. Note that select_one returns the first match; use a list comprehension with select if you want all matches (see the sketch after the code below).
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
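The all-matches variant mentioned above might look like this (a sketch assuming every matching dl carries an id; on newer soupsieve versions :contains is spelled :-soup-contains):
indices = [el['id'] for el in soup.select('dl:has(.pre:contains("{m,n}"))')]
print(indices)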
Any version: if you want a dictionary that maps special characters to indices. Thanks to @zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
You're looking for index-7.
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode())
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.

When I try to use list indexing in my if statement, why does it fail?

When I try to index in the statement, it says index out of range. I am trying to scrape stuff from the website.
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.set.or.th/set/factsheet.do?symbol=TRUE&ssoPageId=3&language=en&country=US')
soup = BeautifulSoup(page.text, 'html.parser')
list_stuff = list()
for x in soup.findAll('table', {'class': 'factsheet'}):
    for tr in x.findAll('tr'):
        stuff = [td for td in tr.stripped_strings]
        if stuff[0] == 'Beta':
            list_stuff.append(stuff[1])
The code returns an error saying list index out of range and points to the stuff[0] line in the for loop.
Add a step to check whether the list is empty or not:
if stuff:  # check stuff is not empty
    if stuff[0] == 'Beta':
        list_stuff.append(stuff[1])
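A slightly safer sketch, assuming you also want to guard the stuff[1] access in case a matching row has only one cell:
if len(stuff) >= 2 and stuff[0] == 'Beta':
    list_stuff.append(stuff[1])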

How to Crawl Multiple Websites to find common Words (BeautifulSoup,Requests,Python3)

I'm wondering how to crawl multiple different websites using beautiful soup/requests without having to repeat my code over and over.
Here is my code right now:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd
Website1 = requests.get("http://www.nerdwallet.com/the-best-credit-cards")
soup = BeautifulSoup(Website1.content)
texts = soup.findAll(text=True)
a = Counter([x.lower() for y in texts for x in y.split()])
b = (a.most_common())
makeaframe = pd.DataFrame(b)
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
What I am trying to do is ideally crawl 5 different websites, find all of the individual words on these websites, find the frequency of each word on each website, ADD all the frequencies together for each particular word, then combine all of this data into one dataframe that can be exported using Pandas.
Hopefully the output would look like this:
Word    Frequency
the     200
man     300
is      400
tired   300
My code can only do this for ONE website at a time right now and I'm trying to avoid repeating my code.
Now, I can do this manually by repeating my code over and over, crawling each individual website, and then concatenating the resulting dataframes together, but that seems very unpythonic. I was wondering if anyone had a faster way or any advice? Thank you!
Make a function:
import requests
from bs4 import BeautifulSoup
from collections import Counter
import pandas as pd

cnt = Counter()

def GetData(url):
    Website1 = requests.get(url)
    soup = BeautifulSoup(Website1.content, 'html.parser')
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    cnt.update(a)  # pass the Counter itself, not most_common(), so counts are summed

websites = ['http://www.nerdwallet.com/the-best-credit-cards', 'http://www.other.com']
for url in websites:
    GetData(url)

makeaframe = pd.DataFrame(cnt.most_common())
makeaframe.columns = ['Words', 'Frequency']
print(makeaframe)
Just loop and update a main Counter dict:
main_c = Counter()  # keep all results here
urls = ["http://www.nerdwallet.com/the-best-credit-cards", "http://stackoverflow.com/questions/tagged/python"]
for url in urls:
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    texts = soup.findAll(text=True)
    a = Counter([x.lower() for y in texts for x in y.split()])
    main_c.update(a)  # update with the Counter, not most_common(), so the counts add up

make_a_frame = pd.DataFrame(main_c.most_common())
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame)
Counter.update, unlike a normal dict.update, adds to the values; it does not replace them.
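A minimal illustration of that difference:
from collections import Counter
c = Counter({'the': 2})
c.update({'the': 3, 'man': 1})
print(c)  # Counter({'the': 5, 'man': 1}) - counts were summed, not replaced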
On a style note, use lowercase for variable names and use underscores, e.g. make_a_frame.
Try:
comm = [[k, v] for k, v in main_c.items()]  # iterating a Counter directly yields only keys
make_a_frame = pd.DataFrame(comm)
make_a_frame.columns = ['Words', 'Frequency']
print(make_a_frame.sort_values("Frequency", ascending=False))

Python beautifulsoup level 1 only text

I've looked at the other beautifulsoup same-level questions. Seems like mine is slightly different.
Here is the website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
I'm trying to get the table on the right. Notice how the first row of the table expands into a detailed breakdown of that data. I don't want that data; I only want the very top-level data. You can also see that the other rows can also be expanded, but not in this case, so just looping and skipping tr[2] might not work. I've tried this:
r = requests.get(page)
r.encoding = 'gb2312'
soup = BeautifulSoup(r.text,'html.parser')
table=soup.find('div', class_='right1').findAll('tr', {"class" : re.compile('list.*')})
but there are still more nested list* rows at other levels. How do I get only the first level?
Limit your search to direct children of the table element only by setting the recursive argument to False:
table = soup.find('div', class_='right1').table
rows = table.find_all('tr', {"class" : re.compile('list.*')}, recursive=False)
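A rough CSS alternative (a sketch, assuming bs4 4.7.1+ and the html.parser used above, which does not insert implicit tbody elements): the child combinator > plays the same role as recursive=False:
rows = soup.select('div.right1 table > tr[class^="list"]')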
@MartijnPieters' solution is already perfect, but don't forget that BeautifulSoup allows you to use multiple attributes as well when locating elements. See the following code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re

url = "http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31"
r = rq.get(url)
r.encoding = "gb2312"
soup = bsoup(r.content, "html.parser")
div = soup.find("div", class_="right1")
rows = div.find_all("tr", {"class": re.compile(r"list\d+"), "style": "cursor:pointer;"})
for row in rows:
    first_td = row.find_all("td")[0]
    print(first_td.get_text())
Notice how I also added "style":"cursor:pointer;". This is unique to the top-level rows and is not an attribute of the inner rows. This gives the same result as the accepted answer:
百度汇总
360搜索
新搜狗
谷歌
微软必应
雅虎
0
有道
其他
Hopefully this also helps.
