I am trying to scrape the SIC description, but I have not been successful. I have been using requests and Beautiful Soup, but I am nowhere near close.
https://sec.report/CIK/1418076
To get the value of the 'SIC' row, you can use this example (a correct User-Agent header also needs to be specified):
import requests
from bs4 import BeautifulSoup
url = 'https://sec.report/CIK/1418076'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print( soup.find('td', text="SIC").find_next('td').text )
Prints:
7129: Other Business Financing Companies Investors, Not Elsewhere Classified 6799
EDIT: Change the parser to lxml to parse the HTML document correctly:
import requests
from bs4 import BeautifulSoup
url = 'https://sec.report/CIK/1002771'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
print( soup.find('td', text="SIC").find_next('td').text )
Prints:
1121: Distillery Products Industry Pharmaceutical Preparations 2834
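The parser matters because html.parser and lxml repair malformed markup differently, which is why the SIC cell can come out garbled with one and clean with the other. A minimal illustration with a made-up fragment (not the actual sec.report HTML):
from bs4 import BeautifulSoup
broken = '<table><tr><td>SIC<td>6799</table>'  # hypothetical fragment with unclosed tags
print(BeautifulSoup(broken, 'html.parser').prettify())
print(BeautifulSoup(broken, 'lxml').prettify())
Each parser closes the missing tags in its own way, so selectors that rely on document structure can behave differently between the two.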
Try this code:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 '}
r = requests.get('https://sec.report/CIK/1418076', headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
sic = soup.select_one('.table:nth-child(5) tr~ tr+ tr td:nth-child(2)')
print(sic.text)
Output:
7129: Other Business Financing Companies Investors, Not Elsewhere Classified 6799
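A positional selector like .table:nth-child(5) is brittle if the page layout ever shifts, so it is worth guarding against select_one returning None; a short sketch reusing the sic variable from above:
# select_one returns None when nothing matches, so guard before reading .text
if sic is not None:
    print(sic.get_text(strip=True))
else:
    print('SIC row not found; the page layout may have changed')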
BeautifulSoup doesn't find any tags on this page. Does anyone know what the problem might be?
I can find the elements with Selenium, but since I have a list of pages, I don't want to use Selenium.
import requests
from bs4 import BeautifulSoup
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
soup.find_all('h1')
You can get the info on that page by adding headers to your request, mimicking the main request to that URL as shown in the Dev tools Network tab. Here is one way to get all links from that page:
import requests
from bs4 import BeautifulSoup as bs
headers = {
    'Cookie': 'sso_checked=1',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.79 Safari/537.36'
}
url = 'https://dzen.ru/news/story/VMoskovskoj_oblasti_zapushhen_chat-bot_ochastichnoj_mobilizacii--b093f9a22a32ed6731e4a4ca50545831?lang=ru&from=reg_portal&fan=1&stid=fOB6O7PV5zeCUlGyzvOO&t=1664886434&persistent_id=233765704&story=90139eae-79df-5de1-9124-0d830e4d59a5&issue_tld=ru'
r = requests.get(url, headers=headers)
soup = bs(r.text, 'html.parser')
links = [a.get('href') for a in soup.select('a')]
print(links)
Result printed in terminal:
['/news', 'https://dzen.ru/news', 'https://dzen.ru/news/region/moscow', 'https://dzen.ru/news/rubric/mobilizatsiya', 'https://dzen.ru/news/rubric/personal_feed', 'https://dzen.ru/news/rubric/politics', 'https://dzen.ru/news/rubric/society', 'https://dzen.ru/news/rubric/business', 'https://dzen.ru/news/rubric/world', 'https://dzen.ru/news/rubric/sport', 'https://dzen.ru/news/rubric/incident', 'https://dzen.ru/news/rubric/culture', 'https://dzen.ru/news/rubric/computers', 'https://dzen.ru/news/rubric/science', 'https://dzen.ru/news/rubric/auto', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://www.mosobl.kp.ru/online/news/4948743/?utm_source=yxnews&utm_medium=desktop', 'https://mosregtoday.ru/soc/v-podmoskove-zapustili-chat-bot-po-voprosam-chastichnoj-mobilizacii/?utm_source=yxnews&utm_medium=desktop', ...]
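With the same headers in place, the find_all('h1') from the question should also return results now; a sketch reusing the soup object from above:
# the h1 lookup that previously came back empty should now find the headline
headlines = [h.get_text(strip=True) for h in soup.find_all('h1')]
print(headlines)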
Below is the code I have used to crawl an Amazon page, but the output comes back blank. Please help.
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import requests
import time
HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0','Accept-Language': 'en-US, en;q=0.5'})
data = pd.DataFrame([])
URL= "https://www.amazon.in/dp/B09NM3WWGY"
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
dom = etree.HTML(str(soup))
Price = dom.xpath("//div[@id='corePrice_desktop']/div/table/tbody/tr[2]/td[2]/span/span/text()")
#if Price != None:
#    Price = dom.xpath("//div[@id='corePrice_desktop']/div/table/tbody/tr[2]/td[2]/span/span/text()")
#else:
#    Price = "No Data"
print(Price)
The output is blank.
Remove tbody from the XPath and it works perfectly. Browsers insert <tbody> into the DOM when rendering, but it is usually absent from the raw HTML that requests receives, so an XPath that includes it matches nothing:
from bs4 import BeautifulSoup
import pandas as pd
from lxml import etree
import requests
import time
HEADERS = ({'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:94.0) Gecko/20100101 Firefox/94.0','Accept-Language': 'en-US, en;q=0.5'})
data = pd.DataFrame([])
URL= "https://www.amazon.in/dp/B09NM3WWGY"
webpage = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(webpage.content, "lxml")
dom = etree.HTML(str(soup))
Price = dom.xpath("//div[@id='corePrice_desktop']/div/table/tr[2]/td[2]/span/span/text()")
#if Price != None:
#    Price = dom.xpath("//div[@id='corePrice_desktop']/div/table/tr[2]/td[2]/span/span/text()")
#else:
#    Price = "No Data"
print(Price)
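If you would rather stay in BeautifulSoup than drop down to lxml XPath, roughly the same lookup can be written as a CSS selector (a sketch assuming the same corePrice_desktop structure; descendant selectors do not care whether tbody is present):
# equivalent lookup via CSS selectors on the soup object from above
price_el = soup.select_one("#corePrice_desktop table tr:nth-of-type(2) td:nth-of-type(2) span span")
print(price_el.get_text(strip=True) if price_el else "No Data")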
I am working on a web scraper using requests-html and Beautiful Soup (new to this). For one webpage (https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1) I am trying to scrape one part, which I will then replicate for other products. The HTML looks like:
<div class="plp-listing-load-status c-list-header__counter initialized" data-page-number="1" data-total-pages-count="57" data-products-count="60" data-total-products-count="3361" data-status-format="{available}/{total} results">60/3361 results</div>
I want to scrape the "57" from data-total-pages-count="57". I have tried using:
soup = BeautifulSoup(page.content, "html.parser")
nopagesstr = soup.find(class_="plp-listing-load-status c-list-header__counter initialized").get('data-total-pages-count')
and
nopagesstr = r.html.find('[data-total-pages-count]',first=True)
But both return None. I am not sure how to select the 57 specifically. Any help would be appreciated.
To get total pages count, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn=1"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
print(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])
Prints:
56
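Once you have the count, you can walk all the listing pages by changing the pn query parameter; a sketch (assuming every page uses the same markup):
total = int(soup.select_one("[data-total-pages-count]")["data-total-pages-count"])
for page in range(1, total + 1):
    page_url = "https://www.selfridges.com/GB/en/cat/beauty/make-up/?pn={}".format(page)
    # fetch and parse each listing page here, reusing the headers from above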
I tried using BeautifulSoup, but for some reason parsing the page returns the mobile version, where links to comments look different. With vk_api I would need to specify the group id. I will be happy to get any advice! Thanks! Here is the link "link". If you click on it, the comment "sexism" will appear. I want to implement this programmatically in Python. If you can help me, I would be very grateful.
If I understand you right, you want to extract the word "sexism" from the provided link that includes reply=<some number>.
You can do it like this, for example:
import requests
import urllib.parse
from bs4 import BeautifulSoup
url = 'https://vk.com/wall-12648877_5011889?reply=5011893'
reply_number = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)['reply'][0]
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
text = soup.select_one('div.ReplyItem:has(a[name="reply{}"]) .ReplyItem__body'.format(reply_number)).text.strip()
print(text)
Prints:
sexism
EDIT:
import requests
import urllib.parse
from bs4 import BeautifulSoup
url = 'https://vk.com/wall-12648877_5013166?reply=5013335&thread=5013176'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
reply_number = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)['reply'][0]
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
text = soup.select_one('div.ReplyItem:has(a[name="reply{reply_number}"]) .ReplyItem__body, .reply_text div[id$="_{reply_number}"]'.format(reply_number=reply_number)).text.strip()
print(text)
Prints:
ахаха, опять ты. и опять твою фотку запостили в комментах, просто ОР, в прошлый раз ты что-то про геев говорил кажется. ))
(roughly: "hahaha, you again. And again your photo got posted in the comments, just LOL; last time you were saying something about gays, I think.")
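If you need this for many links, the lookup can be wrapped in a small helper (a sketch; the two selectors assume the same desktop and mobile reply markup as above, and get_reply_text is a hypothetical name):
def get_reply_text(url):
    # pull the reply number out of the query string, then match either layout
    reply = urllib.parse.parse_qs(urllib.parse.urlparse(url).query)['reply'][0]
    s = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    el = s.select_one('div.ReplyItem:has(a[name="reply{r}"]) .ReplyItem__body, '
                      '.reply_text div[id$="_{r}"]'.format(r=reply))
    return el.text.strip() if el else None
print(get_reply_text('https://vk.com/wall-12648877_5011889?reply=5011893'))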
I've already tried the other solutions, but they did not work.
Here is my span tag:
<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
and here is my full code
import requests
from bs4 import BeautifulSoup
#<span class="DFlfde SwHCTb" data-precision="2" data-value="7.0498">7,05</span>
url = "https://www.google.com/search?q={}+kaç+tl".format(input())
r = requests.get(url)
source = BeautifulSoup(r.content,"html")
print(source.find_all("span",string="DFlfde SwHCTb"))
It returns an empty list, and I need the value "7.05". How can I reach it? Thanks.
There are two things to do to get your data:
specify a User-Agent header (Google needs this to return the correct data)
in .find_all, use the class_= parameter, not string
Code:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?q=100+kaç+tl"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:75.0) Gecko/20100101 Firefox/75.0'}
r = requests.get(url, headers=headers)
source = BeautifulSoup(r.content, "html.parser")
print(source.find_all("span", class_="DFlfde SwHCTb"))
# or
print(source.select_one('span.DFlfde.SwHCTb').text)
Prints:
[<span class="DFlfde SwHCTb" data-precision="2" data-value="13.0095">13,01</span>]
13,01
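Since the span also carries the unrounded rate in its data-value attribute, you can read that attribute directly instead of parsing the displayed text; a sketch reusing the source object from above:
span = source.select_one('span.DFlfde.SwHCTb')
print(span['data-value'])  # unrounded value, e.g. '13.0095'
print(span.text)           # displayed, locale-formatted text, e.g. '13,01'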