Scraping Table from Nasdaq (beginner) - python

url = https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts
I am trying to grab the table from the url above. However, when I try to find the table using beautifulsoup, I am unsuccessful. I simply get an empty list.
Please help.
Thanks
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts")
soup = BeautifulSoup(page.text, "lxml")
item = soup.find(class_="genTable")
print(item)

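The table is not in the initial HTML; the page loads it afterwards with a POST request to RPCHandler.axd, so you can call that endpoint directly. I think the code is quite clear, but if you have any questions feel free to ask.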
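import requests
from bs4 import BeautifulSoup
# Call the RPC endpoint directly, passing the trade halts page as the Referer.
headers = {"Referer": "https://www.nasdaqtrader.com/trader.aspx?id=TradeHalts"}
data = {"id":2,"method":"BL_TradeHalt.GetTradeHalts","params":"[]","version":"1.1"}
url = "https://www.nasdaqtrader.com/RPCHandler.axd"
req = requests.post(url, json=data, headers=headers)
# The 'result' field of the JSON reply holds the table as an HTML string.
result = req.json()['result']
soup = BeautifulSoup(result, 'html.parser')
table = soup.find('table')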
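As a follow-up, here is a minimal sketch of how a table like the one found above could be turned into plain Python rows. The sample HTML, its column names, and the helper name `table_to_rows` are made up for illustration; the real columns depend on what the endpoint returns.

```python
from bs4 import BeautifulSoup


def table_to_rows(html):
    """Return the first <table> in html as a list of rows (lists of cell text)."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>; keep both.
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample standing in for the HTML string in `result`.
sample = """
<table>
  <tr><th>Halt Date</th><th>Issue Symbol</th></tr>
  <tr><td>01/02/2020</td><td>ABC</td></tr>
</table>
"""
rows = table_to_rows(sample)
print(rows)
```

The first row holds the header cells, so `rows[0]` can serve as column names and `rows[1:]` as the data.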

Related

How to scrape headline news, link and image?

I'd like to scrape the headline, the link, and the picture of each news item.
I tried the web scraping code below,
but it only covers the headline, and even that does not work.
import requests
import pandas as pd
from bs4 import BeautifulSoup
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)
Please advise.
This is because the site blocks bots. If you print res.content, it shows a 403 response.
Add headers={'User-Agent':'Mozilla/5.0'} to the request.
Try the code below:
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
    print(headlines[i].text)
First things first: never post code as an image.
<h2> in your HTML has no text. What it does have, is an <a> element, so:
for hl in headlines:
    link = hl.findChild()
    text = link.text
    url = link.attrs['href']
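For all three fields asked about in the question, here is a rough sketch run against made-up HTML. Only the `post-title-news` class on the `<h2>` comes from the question; the surrounding `<article>` wrapper and the position of the `<img>` are assumptions, so the image lookup will likely need adjusting to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the h2 class is taken from the question.
sample = """
<article>
  <img src="https://example.com/forest.jpg">
  <h2 class="post-title-news"><a href="https://example.com/story-1">Forest story</a></h2>
</article>
<article>
  <h2 class="post-title-news"><a href="https://example.com/story-2">River story</a></h2>
</article>
"""

soup = BeautifulSoup(sample, "html.parser")
items = []
for hl in soup.find_all("h2", class_="post-title-news"):
    link = hl.find("a")
    if link is None:
        continue  # headline without a link; skip it
    # Assumption: the picture lives in the same <article> as the headline.
    article = hl.find_parent("article")
    img = article.find("img") if article is not None else None
    items.append({
        "title": link.get_text(strip=True),
        "url": link.get("href"),
        "image": img.get("src") if img is not None else None,
    })

for item in items:
    print(item)
```

Using `.get()` rather than `attrs[...]` means a missing `href` or `src` yields `None` instead of raising a `KeyError`.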

Soup works on one IMDb page but not on another. How to solve?

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')
#using unique tag for each movie in the respective link
print(movie_div1)
#empty list
print(movie_div)
#gives perfect list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.
Unfortunately, the div you want is generated by JavaScript, so you can't get it by scraping the raw HTML response.
You can get the movies you want from the JSON your browser receives; then you won't need to parse the page with BeautifulSoup, which makes your script much faster.
The second option is to use Selenium.
Good luck.
As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON is embedded within the HTML itself and is later converted to HTML by browser JS (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json
url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
details = str(soup.find('span', class_='ab_widget'))
json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])
imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])
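The string slicing above depends on the JSON being followed by exactly `);` and a newline. A slightly more tolerant sketch, using only the standard library, lets `json.JSONDecoder.raw_decode` find the end of the JSON value by itself. The sample text is invented to imitate the `IMDbReactInitialState.push(` wrapper, and the helper name `extract_json_after` is hypothetical.

```python
import json


def extract_json_after(text, marker):
    """Decode the first JSON value that follows marker in text."""
    start = text.find(marker)
    if start == -1:
        raise ValueError(f"marker {marker!r} not found")
    start += len(marker)
    # raw_decode does not skip leading whitespace, so do it here.
    while start < len(text) and text[start].isspace():
        start += 1
    # raw_decode parses one JSON value and ignores whatever comes after it.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Invented sample that imitates the embedded script content.
sample = (
    'window.x = 1; IMDbReactInitialState.push('
    '{"titles": {"tt001": {"primary": {"title": "Example Movie"}}}}); more();'
)
data = extract_json_after(sample, "IMDbReactInitialState.push(")
titles = [item["primary"]["title"] for item in data["titles"].values()]
print(titles)
```

Because the decoder stops at the end of the first complete value, the trailing `); more();` does not need to be located or stripped.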

How can I get the text from this specific div class?

I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it never fails to return with "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was super old. I don't know if I should've made another question or not but aa)
The data is loaded dynamically from a script tag so, as in the other answer, you can grab it from that tag. Target the tag by its id, pull out the relevant JSON, take the HTML from that JSON, and then parse that HTML, which is what would have been loaded dynamically on the page (at this point you can use your original class selector).
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
all_data = json.loads(soup.select_one('#json-user').text)
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
You could try this:
import re
import requests
from bs4 import BeautifulSoup
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script",{"id":re.compile(r"json-user")})
result = re.findall('raw\":(.+)},\"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is stored as a string inside the script tag with id='json-user', but that's why you can't find it by searching for the div with that class.
Hope this helps.

Scraping links from href on Sephora website

Hi, I am trying to scrape the links for all the products on a specific page on Sephora. My code only gives me the first 12 links, while there are 48 products on the website. I think this is because Sephora is a user-interactive website (please correct me if I am wrong), so it doesn't load the rest. But I do not know how to get the rest. Please send some help! Thank you!!!
Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
link_list = []
keyword = 'product'
for link in soup.findAll('a'):
    href = link.get('href')
    if keyword in href:
        link_list.append('https://www.sephora.com/' + href)
    else:
        continue
If you take a look at the source code, you will see their data stored as a json object. You can get the json object by this:
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
data = json.loads(soup.find('script', id='linkJSON').text)
products = data[3]['props']['products']
prefix = "https://www.sephora.com"
url_links = [prefix+p["targetUrl"] for p in products]
print(url_links)
By investigating the JSON data, you can find where the links are stored. To view the JSON data more clearly, I use this website: https://codebeautify.org/jsonviewer
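The hard-coded `data[3]` index may not hold on other pages, so one option is to search the decoded JSON for the key instead of relying on its position. This is only a sketch: the nested sample is invented and `find_values` is a hypothetical helper; only the `targetUrl` key and the URL prefix come from the answer above.

```python
from urllib.parse import urljoin


def find_values(node, key):
    """Yield every value stored under key anywhere in nested dicts and lists."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_values(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_values(item, key)


# Invented structure standing in for the decoded JSON.
sample = [
    {"class": "Header"},
    {"props": {"products": [
        {"targetUrl": "/product/cream-P1"},
        {"targetUrl": "/product/serum-P2"},
    ]}},
]

# urljoin avoids doubled or missing slashes when building absolute URLs.
links = [urljoin("https://www.sephora.com", path)
         for path in find_values(sample, "targetUrl")]
print(links)
```

The trade-off is that a recursive search also picks up the key in places you may not want, so it is worth checking the result against what the page shows.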

Web scraping with beautifulsoup not finding anything

I'm trying to scrape coinmarketcap.com just to get an update of a certain currency price, also just to learn how to web scrape. I'm still a beginner and can't figure out where I'm going wrong, because whenever I try to run it, it just tells me there are none. Although I know that line does exist. Any help is appreciated!
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('data-currency-price data-usd=')
print (price)
If you are going to be doing a lot of this, consider making a single call to the official API to get all the prices, then extracting what you want. The following is from the site, with an amendment by me to show the desired value for electroneum. The API guidance shows how to retrieve one at a time as well, though that requires a higher plan than the basic.
from requests import Request, Session
from requests.exceptions import ConnectionError, Timeout, TooManyRedirects
import json
url = 'https://pro-api.coinmarketcap.com/v1/cryptocurrency/listings/latest'
parameters = {
    'start': '1',
    'limit': '5000',
    'convert': 'USD',
}
headers = {
    'Accepts': 'application/json',
    'X-CMC_PRO_API_KEY': 'yourKey',
}
session = Session()
session.headers.update(headers)
try:
    response = session.get(url, params=parameters)
    # print(response.text)
    data = json.loads(response.text)
    print(data['data'][64]['quote']['USD']['price'])
except (ConnectionError, Timeout, TooManyRedirects) as e:
    print(e)
You can always deploy a loop and check against a desired list e.g.
interested = ['Electroneum','Ethereum']
for item in data['data']:
    if item['name'] in interested:
        print(item)
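Indexing with `data['data'][64]` relies on the coin staying at the same position in the list, so a lookup by name may be more dependable. Here is a small sketch on invented data shaped like the response used above. `name` and `quote` / `USD` / `price` are the fields the answer already reads; the numbers and the helper name `prices_by_name` are made up.

```python
def prices_by_name(listings, wanted):
    """Map each wanted coin name to its USD price, skipping names not listed."""
    wanted = set(wanted)
    return {
        item["name"]: item["quote"]["USD"]["price"]
        for item in listings
        if item["name"] in wanted
    }


# Made-up stand-in for data['data'] from the API response.
listings = [
    {"name": "Bitcoin", "quote": {"USD": {"price": 100.0}}},
    {"name": "Electroneum", "quote": {"USD": {"price": 0.5}}},
]
prices = prices_by_name(listings, ["Electroneum", "Ethereum"])
print(prices)
```

Names that are absent from the listing are simply left out of the result, so the caller can check which ones came back.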
For your current example:
You can use an attribute selector for data-currency-value
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
soup.select_one('[data-currency-value]').text
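To see why the original `soup.find('data-currency-price data-usd=')` finds nothing while an attribute selector works, here is a small sketch on made-up HTML. The markup is an assumption pieced together from the attribute names mentioned in this thread, not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Made-up markup using the attribute names mentioned in this thread.
sample = (
    '<span id="quote_price" data-currency-price data-usd="0.006778">'
    '<span data-currency-value>0.006778</span>'
    '</span>'
)

soup = BeautifulSoup(sample, "html.parser")

# find() treats a plain string as a tag name, so this looks for a tag
# literally called "data-currency-price data-usd=" and returns None.
by_name = soup.find("data-currency-price data-usd=")

# [attr] matches any element that carries the attribute, whatever its value.
by_attr = soup.select_one("[data-currency-value]")

# Attribute values are read with .get(), element text with .text.
usd = soup.select_one("[data-usd]").get("data-usd")

print(by_name, by_attr.text, usd)
```

The same distinction applies to `find`: pass the attribute through `attrs={...}` or a keyword such as `id=...` rather than as the first positional argument.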
You can use the class attribute to get the value.
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find('span' ,attrs={"class" : "h2 text-semi-bold details-panel-item--price__value"})
print (price.text)
Output :
0.006778
You can get the value like this:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
price = soup.find("span", id="quote_price").get('data-usd')
print (price)
You should try to be more specific about how you want to find the item.
You are currently calling soup.find('...'), and I am not sure what you meant to pass, since you wrote data-currency-price data-usd=.
Is that an ID or a class name?
Why not try finding the item by its ID?
soup.find(id="link3")
Or find it by tag:
soup.find("relevant tag name like div or a")
Or something like this:
find_this = soup.find("a", id="ID HERE")
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/electroneum/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'html.parser')
x = soup.find(id="quote_price").text
print (x)
It is better to look up the element by its ID, or to search with soup.find_all(text="data-currency-price data-usd")[1].text
