Not getting json when using .text in bs4 - python

In this code I think I made a mistake somewhere, because I'm not getting the JSON when I print it; in fact I get nothing. When I index the script tag the JSON is there, but with .text nothing appears. I want the JSON alone.
CODE :
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import requests
import selenium.webdriver as webdriver
base_url = 'https://www.instagram.com/{}'
search = input('Enter the instagram account: ')
final_url = base_url.format(quote_plus(search))
response = requests.get(final_url)
print(response.status_code)
if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html)
    scripts = bs_html.select('script[type="application/ld+json"]')
    print(scripts[0].text)

Change the line print(scripts[0].text) to print(scripts[0].string).
scripts[0] is a Beautiful Soup Tag object, and its string contents can be accessed through the .string property.
Source: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
If you want to then parse the string into JSON so that you can access the data, you can do something like this:
...
import json

if response.ok:
    html = response.text
    bs_html = BeautifulSoup(html, 'html.parser')
    scripts = bs_html.select('script[type="application/ld+json"]')
    json_output = json.loads(scripts[0].string)
Then, for example, if you run print(json_output['name']) you should be able to access the name on the account.
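As a self-contained sketch of the same idea, here is the .string + json.loads approach run against a minimal made-up page (not Instagram's real payload; the field names are invented for illustration):

```python
import json
from bs4 import BeautifulSoup

# A tiny hypothetical page with one ld+json script tag.
html = '''
<html><head>
<script type="application/ld+json">{"name": "example_account", "@type": "Person"}</script>
</head></html>
'''

soup = BeautifulSoup(html, 'html.parser')
script = soup.select('script[type="application/ld+json"]')[0]

# .string returns the tag's single NavigableString child;
# json.loads then turns it into a regular dict.
data = json.loads(script.string)
print(data['name'])  # example_account
```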

Related

BeautifulSoup - search text inside a tag

from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import requests

user = UserAgent()
headers = {
    'user-agent': user.random
}
url = 'https://www.wildberries.ru/?utm_source=domain&utm_campaign=wilberes.ru'

def main():
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, 'lxml')
    main = soup.find('div', class_='menu-burger__main')
    ul = main.find('ul', class_='menu-burger__main-list')
    all = ul.find_all_next('li', class_='menu-burger__main-list-item')
    f = open('link.txt', 'a')
    for lin in all:
        get_link = lin.find('a').get('href')
        f.write(get_link + '\n')
    f.close()

if __name__ == '__main__':
    main()
I'm trying to parse a link to a section and its name. I managed to get the link, but how can I get the name if it is not in the tag?
Using the string property seems to be the correct method based on BS4 documentation:
myLink = soup.find('a')
myLinkText = str(myLink.string)
The idea of using str() is to convert the text to a regular Python string, since .string returns a BS4 NavigableString object. You may find you don't really have to do this, but it is safer, especially combined with stripping whitespace so you don't get stray newlines or padding in the result: str(myLink.string).strip()
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring
In your code you are actually getting the href, so note that in my code above I am getting the anchor tag itself, not just its href attribute.
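Putting the two together on a small offline snippet (the markup below is hypothetical, modeled on the class names in the question):

```python
from bs4 import BeautifulSoup

# Hypothetical menu markup mirroring the structure in the question.
html = '''
<ul class="menu-burger__main-list">
  <li class="menu-burger__main-list-item"><a href="/catalog/shoes">
      Shoes
  </a></li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
link = soup.find('a')

# .get('href') reads the attribute; str(link.string).strip() converts
# the NavigableString to a plain str and drops surrounding whitespace.
href = link.get('href')
name = str(link.string).strip()
print(href, name)  # /catalog/shoes Shoes
```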

Parsing XML object in python 3.9

I'm trying to get some data using the NCBI API. I am using requests to make the connection to the API.
What I'm stuck on is how do I convert the XML object that requests returns into something that I can parse?
Here's my code for the function so far:
def getNCBIid(speciesName):
    import requests
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    url = base_url + "esearch.fcgi?db=assembly&term=(%s[All Fields])&usehistory=y&api_key=f1e800ad255b055a691c7cf57a576fe4da08" % speciesName
    # xml object
    api_request = requests.get(url)
You would use something like BeautifulSoup for this ('this' being converting and parsing the XML object).
What you are calling your xml object is still the response object, and you need to extract the content from that object first.
from bs4 import BeautifulSoup
import requests

def getNCBIid(speciesName):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    url = base_url + "esearch.fcgi?db=assembly&term=(%s[All Fields])&usehistory=y&api_key=f1e800ad255b055a691c7cf57a576fe4da08" % speciesName
    # <--- this is still just your response object
    api_request = requests.get(url)
    # grab the response content
    xml_content = api_request.content
    # parse with beautiful soup
    soup = BeautifulSoup(xml_content, 'xml')
    # from here you would access desired elements
    # here are docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
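To show the "access desired elements" step without a live request, here is the same parsing run on a trimmed, hypothetical esearch-style response (real NCBI responses contain many more fields). Note this sketch uses the stdlib 'html.parser', which lowercases tag names; with the 'xml' parser (lxml) the original case such as <Id> would be preserved:

```python
from bs4 import BeautifulSoup

# Made-up esearch-style XML payload for illustration only.
xml_content = '''<?xml version="1.0"?>
<eSearchResult>
  <Count>2</Count>
  <IdList><Id>123</Id><Id>456</Id></IdList>
</eSearchResult>'''

# html.parser lowercases tag names, so we search for 'id', not 'Id'.
soup = BeautifulSoup(xml_content, 'html.parser')
ids = [tag.text for tag in soup.find_all('id')]
print(ids)  # ['123', '456']
```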

BeautifulSoup findAll returns empty list when selecting class

findAll() returns an empty list when a class is specified; specifying tags alone works fine.
import urllib2
from bs4 import BeautifulSoup

url = "https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week"
hdr = {'User-Agent': 'tempro'}
req = urllib2.Request(url, headers=hdr)
htmlpage = urllib2.urlopen(req).read()
BeautifulSoupFormat = BeautifulSoup(htmlpage, 'lxml')
name_box = BeautifulSoupFormat.findAll("a", {'class': 'title'})
for data in name_box:
    print(data.text)
I'm trying to get only the text of the post. The current code prints out nothing. If I remove the {'class':'title'} it prints out the post text as well as username and comments of the post which I don't want.
I'm using python2 with the latest versions of BeautifulSoup and urllib2
To get all the posts you would need a method like selenium, which will allow you to scroll. Without that, just to get the initial results, you can grab the data from a script tag in the requests response:
import requests
from bs4 import BeautifulSoup as bs
import re
import json

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.reddit.com/r/Showerthoughts/top/?sort=top&t=week', headers=headers)
soup = bs(r.content, 'lxml')
script = soup.select_one('#data').text
p = re.compile(r'window.___r = (.*); window')
data = json.loads(p.findall(script)[0])
for item in data['posts']['models']:
    print(data['posts']['models'][item]['title'])
The selector you are trying to use does not work because those posts do not have class="title". Try this instead:
name_box = BeautifulSoupFormat.select('a[data-click-id="body"] > h2')
This finds all the <h2> tags that hold the post text, inside the <a data-click-id="body"> elements.
You can read more about BeautifulSoup selectors here:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
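Here is that selector run against a simplified, hypothetical snippet modeled on Reddit's post layout (the real page markup is far larger; the titles and hrefs below are invented):

```python
from bs4 import BeautifulSoup

# Made-up markup: post titles live in an <h2> inside
# an <a data-click-id="body">, comment links do not.
html = '''
<a data-click-id="body" href="/r/Showerthoughts/1"><h2>First post title</h2></a>
<a data-click-id="comments" href="/r/Showerthoughts/1">42 comments</a>
<a data-click-id="body" href="/r/Showerthoughts/2"><h2>Second post title</h2></a>
'''

soup = BeautifulSoup(html, 'html.parser')
# The attribute selector keeps only the body links; "> h2" descends
# to the heading that holds the title text.
titles = [h2.text for h2 in soup.select('a[data-click-id="body"] > h2')]
print(titles)  # ['First post title', 'Second post title']
```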

How to download tickers from webpage, BeautifulSoup didn't get all content

I want to get the ticker values from this webpage https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false
However, when using BeautifulSoup I don't seem to get all the content, and I don't quite understand how to change my code to achieve my goal.
import urllib3
from bs4 import BeautifulSoup

def oslobors():
    http = urllib3.PoolManager()
    url = 'https://www.oslobors.no/markedsaktivitet/#/list/shares/quotelist/ob/all/all/false'
    response = http.request('GET', url)
    soup = BeautifulSoup(response.data, "html.parser")
    print(soup)
    return

print(oslobors())
The content you want to parse is generated dynamically. You can either use a browser simulator like selenium, or you can try the URL below, which returns a JSON response. The latter is the easy way to go.
import requests

url = 'https://www.oslobors.no/ob/servlets/components?type=table&generators%5B0%5D%5Bsource%5D=feed.ob.quotes.EQUITIES%2BPCC&generators%5B1%5D%5Bsource%5D=feed.merk.quotes.EQUITIES%2BPCC&filter=&view=DELAYED&columns=PERIOD%2C+INSTRUMENT_TYPE%2C+TRADE_TIME%2C+ITEM_SECTOR%2C+ITEM%2C+LONG_NAME%2C+BID%2C+ASK%2C+LASTNZ_DIV%2C+CLOSE_LAST_TRADED%2C+CHANGE_PCT_SLACK%2C+TURNOVER_TOTAL%2C+TRADES_COUNT_TOTAL%2C+MARKET_CAP%2C+HAS_LIQUIDITY_PROVIDER%2C+PERIOD%2C+MIC%2C+GICS_CODE_LEVEL_1%2C+TIME%2C+VOLUME_TOTAL&channel=a66b1ba745886f611af56cec74115a51'
res = requests.get(url)
for ticker in res.json()['rows']:
    ticker_name = ticker['values']['ITEM']
    print(ticker_name)
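To illustrate the shape of the JSON being walked without hitting the network, here is the same rows/values/ITEM traversal over a made-up payload (the field names come from the answer above; the ticker data is invented):

```python
import json

# Offline stand-in for res.json(): a dict with a 'rows' list, where
# each row keeps its fields under a 'values' dict.
payload = json.loads('''
{"rows": [
  {"values": {"ITEM": "APP", "LONG_NAME": "Example A"}},
  {"values": {"ITEM": "HEX", "LONG_NAME": "Example B"}}
]}
''')

tickers = [row['values']['ITEM'] for row in payload['rows']]
print(tickers)  # ['APP', 'HEX']
```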
Results you may get like (partial):
APP
HEX
APCL
ODFB
SAS NOK
WWI
ASC

HTML content changes when using BeautifulSoup

I am trying to extract the value of the src attribute from a block of HTML. The HTML block is:
<img class="product-image first-image" src="https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg">
my code is :
import requests
import json
from bs4 import BeautifulSoup
import re

headers = {'User-agent': 'Mozilla/5.0'}
url = 'https://www.net-a-porter.com/us/en/product/1083507/maje/layered-plaid-twill-and-stretch-cotton-jersey-top'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
if url.find('net-a-porter') != -1:
    i = soup.find_all('img', class_="product-image first-image")[0]["src"]
    print i
The result I get:
//cache.net-a-porter.com/images/products/1083507/1083507_in_xs.jpg
but I want exactly what is in the original HTML, which should be:
https://cache.net-a-porter.com/images/products/1083507/1083507_in_pp.jpg
My result is different from the original src value: the https: prefix is gone, and 1083507_in_pp changes to 1083507_in_xs. I don't know why this happens; does anyone know how to solve it? Thanks!
You are close, however, you need to access the "src" key from the builtin attrs key:
if url.find('net-a-porter') != -1:
    i = soup.find_all('img', class_="product-image first-image")[0]
    print i['src']
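To see that indexing the Tag and reading .attrs give the same value, here is the lookup run (in Python 3 syntax) directly on the static snippet from the question, where no JavaScript has rewritten the attribute:

```python
from bs4 import BeautifulSoup

# The <img> block exactly as quoted in the question.
html = ('<img class="product-image first-image" '
        'src="https://cache.net-a-porter.com/images/products/'
        '1083507/1083507_in_pp.jpg">')

soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', class_='product-image')

# tag['src'] and tag.attrs['src'] read the same attribute dict.
print(img['src'])
print(img.attrs['src'])
```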
