How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used:
import requests
from bs4 import BeautifulSoup
url = 'https://osu.ppy.sh/users/1521445'
# headers is defined elsewhere (not shown in the question)
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class": "bbcode bbcode--profile-page"})
but it always comes back as None in the terminal.
How can I go about this?
The link is https://osu.ppy.sh/users/1521445
(This is a repost because the old question was very old. I don't know whether I should have made another question or not.)

The data is loaded dynamically from a script tag, so, as the other answer says, you can grab it from that tag. Target the tag by its id, pull out the relevant JSON, take the HTML string from that JSON, and parse that HTML, which is what the page would have rendered dynamically. At that point you can use your original class selector.
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
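
To see why the direct find call returns None while the JSON route works, here is a small offline sketch. The HTML string is a made-up stand-in for the real page (the 'page' and 'html' keys mirror the answer above, the sample text is hypothetical), so it runs without a network request and is not a copy of the site's actual markup.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page: the profile HTML is not in
# the DOM, it is a string inside a JSON blob in <script id="json-user">.
SAMPLE_PAGE = """
<html><body>
<script id="json-user" type="application/json">
{"page": {"html": "<div class='bbcode bbcode--profile-page'>a lot of text</div>"}}
</script>
</body></html>
"""

def extract_profile_text(page_html):
    """Return the profile text embedded in the json-user script tag."""
    soup = BeautifulSoup(page_html, "html.parser")
    script = soup.find("script", id="json-user")
    data = json.loads(script.string)  # the script body is plain JSON
    inner = BeautifulSoup(data["page"]["html"], "html.parser")
    return inner.select_one(".bbcode--profile-page").get_text("\n")

# Searching the outer document finds nothing, because the parser treats the
# script body as text rather than as tags.
outer = BeautifulSoup(SAMPLE_PAGE, "html.parser")
direct_hit = outer.find("div", {"class": "bbcode bbcode--profile-page"})
profile_text = extract_profile_text(SAMPLE_PAGE)

print(direct_hit)
print(profile_text)
```

The second parse is the important step: only after the HTML string has been pulled out of the JSON does the div exist as a tag that a class selector can match.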

You could try this:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
# the profile data lives in the script tag whose id contains "json-user"
x = soup.find_all("script", {"id": re.compile(r"json-user")})
# capture everything between "raw": and the "previous_usernames" key
result = re.findall(r'raw":(.+)},"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is stored as a string inside the script tag with id='json-user', but that is why you can't get its value by searching for the div directly.
Hope this helps.

Related

Beautiful Soup parsing table from react html

I'm trying to parse a table of orders from an HTML page.
Here is the HTML:
[screenshot of the page HTML]
I need to get the data from those table rows.
Here is what I tried:
response = requests.get('https://partner.market.yandex.ru/supplier/23309133/fulfillment/orders', headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
q = soup.findAll('tr')
a = soup.find('tr')
print(q)
print(a)
But it gives me None. Any idea how to get at those table rows?
I also tried iterating over each div in the HTML, but once I get close to the div that contains the table, I get None as well.
I'd appreciate any help.

Alright, I found a solution: use Selenium instead of the requests library.
I have no idea why it doesn't work with requests, since I assumed it does the same thing as Selenium (just sending a GET request). But with Selenium it works.
Here is what I do:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(r"C:\Users\Booking\PycharmProjects\britishairways\chromedriver.exe"))
driver.get('https://www.britishairways.com/travel/managebooking/public/ru_ru')
time.sleep(15)  # time to log in manually
res = driver.page_source
print(res)
soup = BeautifulSoup(res, 'lxml')
b = soup.find_all('tr')
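
A likely explanation for the difference: requests only downloads the HTML the server sends and never runs JavaScript, while Selenium drives a real browser that executes the page's scripts before page_source is read. The sketch below illustrates this with two invented HTML strings (both are assumptions, not the real site's markup): one shaped like a typical React shell, one shaped like the DOM after rendering.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: what a React app typically sends before JavaScript runs.
RAW_RESPONSE = (
    '<html><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)

# Hypothetical markup: the same page after a browser has executed the script.
RENDERED_DOM = """
<html><body><div id="root">
<table><tr><td>order 1</td></tr><tr><td>order 2</td></tr></table>
</div></body></html>
"""

def count_rows(html):
    """Count the <tr> elements present in the given HTML."""
    return len(BeautifulSoup(html, "html.parser").find_all("tr"))

print(count_rows(RAW_RESPONSE))
print(count_rows(RENDERED_DOM))
```

If the raw response has no rows, no parser setting will find them; the options are a browser driver such as Selenium or the data endpoint the page itself calls.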

Soup works on one IMDb page but not on another. How to solve?

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
# using a class that is unique to each movie on the respective page
print(movie_div1)
# prints an empty list
print(movie_div)
# prints the expected list
Why is movie_div1 an empty list? I can't see any difference in the URL structures that would indicate the code should be different. All leads appreciated.

Unfortunately, the div you want is generated by JavaScript, so you can't get it by scraping the raw HTML response.
You can get the movies from the JSON that your browser receives instead. Then you don't need to parse the page with BeautifulSoup at all, which makes your script much faster.
The second option is to use Selenium.
Good luck.

As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded within the HTML itself and is later converted to HTML by the browser's JavaScript (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
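
The manual slicing above depends on the exact ");\n" terminator. As an alternative sketch, json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it. The marker and key names are taken from the answer; the sample string and the helper name extract_pushed_json are made up for illustration.

```python
import json

MARKER = "IMDbReactInitialState.push("

def extract_pushed_json(text, marker=MARKER):
    """Parse the JSON value that immediately follows `marker` in `text`."""
    start = text.find(marker)
    if start == -1:
        raise ValueError("marker not found")
    # raw_decode returns (object, end_index) and stops at the end of the
    # first complete JSON value, so the trailing ");" does not matter.
    obj, _end = json.JSONDecoder().raw_decode(text, start + len(marker))
    return obj

# Hypothetical sample; a real page carries far more data.
sample = (
    'IMDbReactInitialState.push('
    '{"titles": {"tt1": {"primary": {"title": "Example"}}}});\n'
)
titles = [
    item["primary"]["title"]
    for item in extract_pushed_json(sample)["titles"].values()
]
print(titles)
```

This assumes the JSON starts directly after the opening parenthesis; raw_decode does not skip leading whitespace at the start index.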

How to remove the starting and ending tags using Python Beautiful Soup

I'm having difficulty stripping the starting and ending tags from a JSON URL. I've used Beautiful Soup, and the only problem I'm facing is that I get <pre> tags in my response. Please advise how I can remove the starting and ending tags. The code I'm using is here:
page = Page( "link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
# the response I want from the URL is inside <pre> tags
json = soup.find("pre")
print(json)

Thanks to Demian Wolf, the solution is something like this:
page = Page( "link to json")
soup = bs.BeautifulSoup(page.html, "html.parser")
# the response I want from the URL is inside <pre> tags
json = soup.find("pre")
print(json.text)

You can use the .text attribute of the tag to drop the surrounding tags:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<pre>Hello, world!</pre>", "html.parser")
print(soup.find("pre").text)
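
If the goal is to work with the JSON rather than just print it, the text inside the <pre> tag can be passed to json.loads. A minimal sketch, assuming the page looks like a browser-style rendering of a JSON response (the sample markup and its keys are invented):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical page source: a JSON response wrapped in <pre> tags.
page_html = (
    '<html><body><pre>{"name": "demo", "items": [1, 2, 3]}</pre></body></html>'
)

soup = BeautifulSoup(page_html, "html.parser")
pre = soup.find("pre")  # not named `json`, which would shadow the module
payload = json.loads(pre.get_text())

print(payload["name"])
print(payload["items"])
```

Naming the tag variable pre instead of json keeps the json module usable afterwards.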

Extracting specific Information from a website using BeautifulSoup (Python)

I am accessing the following website to extract a list of stocks:
http://www.barchart.com/stocks/performance/12month.php
I am using the following code:
from bs4 import BeautifulSoup
import requests
url = "http://www.barchart.com/stocks/performance/12month.php"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "lxml")
for link in soup.find_all('a'):
    print(link.get('href'))
The problem is that I get a lot of other information that I don't need. What method would give me just the stock names and nothing else?

import requests
from bs4 import BeautifulSoup
r = requests.get("http://www.barchart.com/stocks/performance/12month.php")
html = r.text
soup = BeautifulSoup(html, 'html.parser')
# the stock names sit in <td class="ds_name"> cells
tds = soup.find_all("td", {"class": "ds_name"})
for td in tds:
    print(td.a.text)
If you look at the page source, you will find that everything you need is in a table. Specifically, the stock names are in <td> elements with class="ds_name".
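
The same lookup can be written as a CSS selector. The sketch below runs against an invented table excerpt: the ds_name class comes from the answer, while the ds_symbol class, the links, and the company names are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical excerpt shaped like the table described in the answer.
SAMPLE_TABLE = """
<table>
  <tr><td class="ds_symbol"><a href="/quotes/AAA">AAA</a></td>
      <td class="ds_name"><a href="/quotes/AAA">Alpha Corp</a></td></tr>
  <tr><td class="ds_symbol"><a href="/quotes/BBB">BBB</a></td>
      <td class="ds_name"><a href="/quotes/BBB">Beta Inc</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_TABLE, "html.parser")
# "td.ds_name a" matches only links inside cells with the ds_name class,
# so the symbol links in the neighbouring cells are skipped.
names = [a.get_text(strip=True) for a in soup.select("td.ds_name a")]
print(names)
```

Filtering on the cell class rather than on every <a> tag is what removes the unrelated links the question complained about.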

How to crawl the description for sfglobe using python

I am trying to use Python and BeautifulSoup to get this page from the sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
soup = BeautifulSoup(html)
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?

From the question title, I'm assuming that the only thing you want is the description of the article, which can be found in a <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2
import requests
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find(attrs={'name': 'description'})  # find the meta tag with the description
# meta tags normally keep their text in 'content'; fall back to 'value' as originally written
desc = tag.get('content') or tag.get('value')
print(desc)
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
import requests
from bs4 import BeautifulSoup, NavigableString
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')
body = soup.find('span', class_='articletext intro')
# remove script tags
for s in body('script'):
    s.extract()
text = ""
# iterate through the remaining (non-script) elements in the content body
for stuff in body.select('*'):
    # .contents returns a list of the tag's children
    content = stuff.contents
    # keep it only if the single child is a NavigableString, not a tag
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]
print(text)
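
A shorter variant of the same idea, offered as a sketch rather than a drop-in replacement: remove the script tags, then let get_text collect every remaining string. The fragment below is invented (only the class names come from the question). Unlike the single-child check above, get_text also picks up text in elements that mix text with child tags.

```python
from bs4 import BeautifulSoup

# Hypothetical article fragment; the class names come from the question.
SAMPLE_ARTICLE = """
<span class="articletext intro">
  <p>First paragraph.</p>
  <script>trackView();</script>
  <p>Second <b>paragraph</b></p>
</span>
"""

soup = BeautifulSoup(SAMPLE_ARTICLE, "html.parser")
body = soup.find("span", class_="articletext intro")

# Drop script tags so their source code cannot end up in the output.
for script in body.find_all("script"):
    script.decompose()

# Join the remaining strings with single spaces, trimming whitespace.
text = body.get_text(" ", strip=True)
print(text)
```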
