I've looked at the other BeautifulSoup "get same-level elements" questions; mine seems slightly different.
Here is the website http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31
I'm trying to get the table on the right. Notice how the first row of the table expands into a detailed breakdown of that data. I don't want that data; I only want the very top-level data. You can also see that the other rows can be expanded too, just not in this case, so simply looping and skipping tr[2] might not work. I've tried this:
import re
import requests
from bs4 import BeautifulSoup

r = requests.get(page)
r.encoding = 'gb2312'
soup = BeautifulSoup(r.text, 'html.parser')
table = soup.find('div', class_='right1').findAll('tr', {"class": re.compile('list.*')})
but there is still more nested list* at other levels. How to get only the first level?
Limit your search to direct children of the table element only by setting the recursive argument to False:
table = soup.find('div', class_='right1').table
rows = table.find_all('tr', {"class" : re.compile('list.*')}, recursive=False)
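A quick self-contained sketch of why this works; the HTML below is made up to mimic the page's structure, with a detail row nested inside a top-level row:

```python
from bs4 import BeautifulSoup
import re

# Made-up HTML mimicking the page: top-level rows have class "list1";
# an expanded row contains a nested table whose row also matches "list.*"
html = """
<div class="right1">
  <table>
    <tr class="list1"><td>top row 1</td></tr>
    <tr class="list1"><td>top row 2
      <table><tr class="list2"><td>nested detail row</td></tr></table>
    </td></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("div", class_="right1").table

# recursive=False only inspects the table's direct children,
# so the nested "list2" row is never considered
rows = table.find_all("tr", {"class": re.compile("list.*")}, recursive=False)
print(len(rows))  # 2
```

Without recursive=False, the same call would also return the nested list2 row.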
#MartijnPieters' solution is already perfect, but don't forget that BeautifulSoup allows you to use multiple attributes as well when locating elements. See the following code:
from bs4 import BeautifulSoup as bsoup
import requests as rq
import re
url = "http://engine.data.cnzz.com/main.php?s=engine&uv=&st=2014-03-01&et=2014-03-31"
r = rq.get(url)
r.encoding = "gb2312"
soup = bsoup(r.content, "html.parser")
div = soup.find("div", class_="right1")
rows = div.find_all("tr", {"class":re.compile(r"list\d+"), "style":"cursor:pointer;"})
for row in rows:
    first_td = row.find_all("td")[0]
    print(first_td.get_text())
Notice how I also added "style":"cursor:pointer;". This is unique to the top-level rows and is not an attribute of the inner rows. This gives the same result as the accepted answer:
百度汇总
360搜索
新搜狗
谷歌
微软必应
雅虎
0
有道
其他
Hopefully this also helps.
Related
I have a loop feeding URLs into my browser and scraping their content, generating this output:
2PRACE,0.0014
Hispanic,0.1556
API,0.0688
Black,0.0510
AIAN,0.0031
White,0.7200
The code looks like this:
f1 = open('urlz.txt','r',encoding="utf8")
ethnicity_urls = f1.readlines()
f1.close()
from urllib import request
from bs4 import BeautifulSoup
import time
import openpyxl
import pprint
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    print(soup1)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup1))
    resultFile.close()
My problem is quite simple, yet I cannot find any tool that helps me achieve it. I would like to change the output from a list with "\n" in it to this:
2PRACE,0.0014 Hispanic,0.1556 API,0.0688 Black,0.0510 AIAN,0.0031 White,0.7200
I did not succeed with replace, as it told me I was treating a number of elements like a single element.
My approach here was:
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = soup1.replace('\n', ' ')
    print(soup2)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
Can you help me find the correct approach to mutate the output before writing it to a csv?
The error message I get:
AttributeError: ResultSet object has no attribute 'replace'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
See the solution to the problem in my answer below. Thanks for all the responses!
soup1 is a ResultSet (an iterable of tags), so you cannot just call replace on it.
Instead, loop over the items in soup1, call replace on each one's text, and append the changed string to a soup2 list. Something like this:
soup2 = []
for e in soup1:
    soup2.append(e.get_text().replace('\n', ' '))
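A minimal runnable sketch of that idea, with made-up markup standing in for the scraped pages:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for one scraped page
html = "<p>2PRACE,0.0014\nHispanic,0.1556</p><p>API,0.0688</p>"
soup = BeautifulSoup(html, "html.parser")
soup1 = soup.select('p')

# replace() works on strings, not on the ResultSet itself,
# so pull each tag's text out first and clean it individually
soup2 = []
for e in soup1:
    soup2.append(e.get_text().replace('\n', ' '))

print(' '.join(soup2))  # 2PRACE,0.0014 Hispanic,0.1556 API,0.0688
```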
You need to iterate over the soup; a ResultSet is a list of elements.
The BS4 documentation is excellent and has many examples:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Use strip() to remove the \n
for x in soup1:
    for r in x.children:
        try:
            print(r.strip())
        except TypeError:
            pass
Thank you both for the ideas and resources. I think I could implement what you suggested. The current build is
for each in ethnicity_urls:
    time.sleep(1)
    scraped = request.urlopen(each)
    soup = BeautifulSoup(scraped)
    soup1 = soup.select('p')
    soup2 = str(soup1).replace('\n', '')
    print(soup2)
    resultFile = open('results.csv', 'a')
    resultFile.write(pprint.pformat(soup2))
    resultFile.close()
And it works just fine. I can do the final adjustments now in Excel.
I looked at the site's HTML source and found what I need for namePlayer: it was the 4th column, inside an 'a' tag. So I tried to grab it in answers.append with 'namePlayer': cols[3].a.text.
But when I run it, I get an IndexError. I then tried changing the index to 2, 3, 4, 5, but nothing worked.
Issue: why do I get IndexError: list index out of range, when everything looks OK (I think :D)?
source:
#!/usr/bin/env python3
import re
import urllib.request
from bs4 import BeautifulSoup

class AppURLopener(urllib.request.FancyURLopener):
    version = "Mozilla/5.0"

def get_html(url):
    opener = AppURLopener()
    response = opener.open(url)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html)
    table = soup.find(id='answers')
    answers = []
    for row in table.find_all('div')[16:]:
        cols = row.find_all('div')
        answers.append({
            'namePlayer': cols[3].a.text
        })
    for answer in answers:
        print(answer)

def main():
    parse(get_html('http://jaze.ru/forum/topic?id=50&page=1'))

if __name__ == '__main__':
    main()
You are overwriting cols during your loop. On the last row, cols has length zero; hence your error.
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
    print(len(cols))
Run the above and you will see cols ends up at length 0.
This might also occur elsewhere in the loop, so you should test the length and decide whether your logic needs updating. You also need to account for whether there is a child a tag.
So you might, for example, do the following (bs4 4.7.1+ required for the :has() selector; note that find_all does not accept CSS selectors, so use select, and the guard must allow indexing cols[3]):
answers = []
for row in table.find_all('div')[16:]:
    cols = row.select('div:has(> a)')
    if len(cols) > 3:
        answers.append({
            'namePlayer': cols[3].a.text
        })
Note that answers has been properly indented so you are working with each cols value. This may not fit your exact use case as I am unsure what your desired result is. If you state the desired output I will update accordingly.
EDIT: to extract the player names:
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = set(names)
You can preserve order and de-duplicate with OrderedDict (this one by #Michael; other solutions are in that Q&A):
from bs4 import BeautifulSoup as bs
import requests
from collections import OrderedDict
r = requests.get('https://jaze.ru/forum/topic?id=50&page=1')
soup = bs(r.content, 'lxml')
answer_blocks = soup.select('[id^=answer_]')
names = [i.text.strip() for i in soup.select('[id^=answer_] .left-side a')]
unique_names = OrderedDict.fromkeys(names).keys()
It does sound like you are providing an index for which a list element does not exist. Remember that indexing starts at 0, e.g. 0, 1, 2, 3, so if a list has four elements and I ask for element 10, I get an IndexError.
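A tiny illustration of that point:

```python
cols = ["a", "b", "c"]  # a 3-element list: valid indices are 0, 1 and 2

print(cols[2])  # "c" is the last element, at index len(cols) - 1

try:
    cols[3]  # one past the end
except IndexError as err:
    print(err)  # list index out of range
```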
Why use a for loop to find all the div tags?
for row in table.find_all('div')[16:]:
    cols = row.find_all('div')
You can get all the tags you want with just:
cols = table.find_all('div')[16:]
So just replace your code with this and you have your answer.
Is it possible with BeautifulSoup to get all elements that match a specific attribute value but may have any tag or attribute name? If so, does anyone know how to do it?
Here's an example of how I'm trying to do it
from bs4 import BeautifulSoup
import requests
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
possibles = bs.find_all(None, {None: text_to_match})
print(possibles)
This gives me an empty list [].
If I replace {None: text_to_match} with {'href': text_to_match} this example will give some results as expected. I'm trying to figure out how to do this without specifying the attribute's name, and only matching the value.
You can try find_all with no limitation and filter out the tags that don't correspond to your needs, like so:
text_to_match = 'https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg'
url = 'https://www.betts.com.au/item/37510-command.html?colour=chocolate'
r = requests.get(url)
bs = BeautifulSoup(r.text, features="html.parser")
tags = [tag for tag in bs.find_all() if text_to_match in str(tag)]
print(tags)
This sort of solution is a bit clumsy, as you might get some irrelevant tags. You can make your text a bit more tag-specific with:
text_to_match = r'="https://s3-ap-southeast-2.amazonaws.com/bettss3/images/003obzt0t_w1200_h1200.jpg"'
which is a bit closer to the string representation of a tag carrying that attribute.
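As an alternative sketch (my own variation, not from the answer above): find_all also accepts a plain function, which lets you test every attribute value regardless of the attribute's name. The markup here is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up snippet: the same URL appears under different attribute names
html = '''
<a href="https://example.com/img.jpg">link</a>
<img src="https://example.com/img.jpg">
<div data-url="https://example.com/other.jpg">other</div>
'''
text_to_match = "https://example.com/img.jpg"

soup = BeautifulSoup(html, "html.parser")

def any_attr_matches(tag):
    # True if any attribute value equals the target, whatever its name
    return any(v == text_to_match for v in tag.attrs.values())

matches = soup.find_all(any_attr_matches)
print([t.name for t in matches])  # ['a', 'img']
```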
I am trying to extract a value in a span; however, the span is embedded in another span. I was wondering how to get the value of only one span rather than both.
from bs4 import BeautifulSoup
some_price = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"})
some_price.span
# that code returns this:
'''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
# BUT I only want the $289 part, not the 99 associated with it
After making this adjustment:
some_price.span.text
the interpreter returns
$28999
Would it be possible to somehow remove the '99' at the end? Or to only extract the first part of the span?
Any help/suggestions would be appreciated!
You can access the desired value from the tag's .contents attribute:
from bs4 import BeautifulSoup as soup
html = '''
<span>$289<span class="rightEndPrice_6y_hS">99</span></span>
'''
result = soup(html, 'html.parser').find('span').contents[0]
Output:
'$289'
Thus, in the context of your original div lookup:
result = page_soup.find("div", {"class":"price_FHDfG large_3aP7Z"}).span.contents[0]
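A self-contained check of that approach on the snippet from the question, plus an equivalent option using .stripped_strings:

```python
from bs4 import BeautifulSoup

html = '<span>$289<span class="rightEndPrice_6y_hS">99</span></span>'
outer = BeautifulSoup(html, "html.parser").span

# .contents[0] is the first direct child: the string "$289",
# not the nested <span> holding "99"
print(outer.contents[0])  # $289

# equivalently, take the first string the tag yields
print(next(outer.stripped_strings))  # $289
```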
I'm fairly new to web scraping in Python, and after reading most of the tutorials on the topic online I decided to give it a shot. I finally got one site working, but the output is not formatted properly.
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time
page = requests.get("https://leeweebrothers.com/our-food/lunch-boxes/#")
soup = BeautifulSoup(page.text, "html.parser")
for div in soup.find_all('h2'):  # prints the name of the food
    print(div.text)
for a in soup.find_all('span', {'class': 'amount'}):  # prints the price of the food
    print(a.text)
Output
I want the name of each food printed side by side with its corresponding price, joined by a "-". Would appreciate any help given, thanks!
Edit: After #Reblochon Masque's comments below, I've run into another problem. As you can see, there is a $0.00 value that comes from the built-in shopping cart on the website. How would I exclude this outlier and continue down the loop, while ensuring that the other prices "move up" to correspond to the correct food?
Best practice is to use the zip function in the for loop, but we can also do it this way, just to show it can be done by indexing the two lists:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for index in range(len(names)):
    print('{} - {}'.format(names[index].text, rest[index].text))
You could maybe zip the two results:
names = soup.find_all('h2')
rest = soup.find_all('span', {'class' : 'amount'})
for div, a in zip(names, rest):
    print('{} - {}'.format(div.text, a.text))
    # print(f"{div.text} - {a.text}")  # for python >= 3.6
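To handle the $0.00 cart amount from the edit, one option is to filter the amounts before zipping. A sketch with stand-in data (the real texts would come from the h2 and span.amount lookups):

```python
# Stand-in data for the scraped h2 names and span.amount prices;
# the site's cart contributes the stray "$0.00" (assumed to be the only zero)
names = ["Lunch Box A", "Lunch Box B"]
amounts = ["$0.00", "$5.50", "$6.00"]

# Drop the cart's $0.00 before pairing, so every name
# lines up with its real price
prices = [a for a in amounts if a != "$0.00"]
for name, price in zip(names, prices):
    print(f"{name} - {price}")
```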