In general, I'm trying to get at least some tags from this site, but it always gives me nothing, and I have no idea how to fix it.
There is a "Tickets" button; after you press it, an additional panel appears from the side, and that panel is what I want to parse, but I can't figure out how. As I understand it, this tab is not loaded immediately, only after the click, and I don't know what to do from there. P.S. I've only just started learning this.
# coding: utf-8-sig
import urllib.request
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"}

def get_html(url):
    request = urllib.request.Request(url, None, headers)
    response = urllib.request.urlopen(request)
    return response.read()

def parse(html):
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find('body', class_='panel-open')
    print(table)

def main():
    parse(get_html('http://toto-info.co/'))

if __name__ == '__main__':
    main()
That would be because the body element of the web page http://toto-info.co/ does not contain the class attribute "panel-open".
You can see what the body element contains by changing the line in your code:
table = soup.find('body', class_='panel-open')
to
table = soup.find('body')
This will now print the body element and all the elements it contains.
As you will see, the body element contains very little except script tags. If you want what those scripts render, you will have to use other tools; as a starting point, do a Google search for something like "web scraping JavaScript pages with Python".
If you are interested, here is an example that does select something by class on that page:
table = soup.find('div', class_='standalone')
It matches this element:
<div class="standalone" data-app="" id="app"></div>
but that is about all the markup this page serves without JavaScript.
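If you do go down the browser-automation route, a minimal sketch with selenium could look like the following. Note that the selector for the Tickets button is an assumption, since I haven't inspected the rendered page; adjust it to what you see in the browser's developer tools.
# Minimal selenium sketch, assuming Chrome is available;
# the Tickets selector is a guess to adapt to the real markup.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://toto-info.co/')

# Wait until the page's scripts have built the DOM, then click the (assumed) Tickets button
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, 'Tickets'))).click()

# Parse the now-rendered HTML with BeautifulSoup as before
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('body', class_='panel-open'))
driver.quit()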
I was trying to web scrape all of the Form N-MFP2 filings and then open each link to scrape the information inside the form. However, I am stuck at retrieving the forms: I tried multiple scraping approaches, including BeautifulSoup and Selenium, but the result comes back empty, so I cannot go any further and get the row data. I'd appreciate any help, because I've been working on this problem for over 3 hours.
My code is as follows:
# Create a URL object
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/edgar/browse/?CIK=843781'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')  # it does not work even with "lxml"

# Obtain information from tag <table>
table = soup.find("table", id="filingsTable")
The webpage: https://www.sec.gov/edgar/browse/?CIK=843781
The table screenshot is here; the Form N-MFP2 rows are highlighted in red.
That table is rendered dynamically. You can get the data from its JSON source instead. Once you have that, it's simply a matter of pulling out the accessionNumber and then constructing the appropriate URL to reach each filing.
import requests
import pandas as pd
# Create an URL object
cik = 843781
cik_padded = f'{cik:010}'
url = f'https://data.sec.gov/submissions/CIK{cik_padded}.json'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
jsonData = requests.get(url, headers=headers).json()
df = pd.DataFrame(jsonData['filings']['recent'])
nmfp2 = df[df['form'] == 'N-MFP2']
I'm not sure what you want from the linked page, but once you have the relevant data to build the URL, you can start fetching the linked pages. Note that the link to the document itself is in there too, so if that's what you are after, you can use that instead. But as I said, I don't know what you wanted from here.
for idx, row in nmfp2.iterrows():
    accessionNumber = row['accessionNumber']
    accessionNumber_alt = ''.join(accessionNumber.split('-'))
    url = f'https://www.sec.gov/Archives/edgar/data/{cik_padded}/{accessionNumber_alt}/{accessionNumber}-index.htm'

    response = requests.get(url, headers=headers)
    dfs = pd.read_html(response.text)

    print('\n')
    for table in dfs:
        print(table)
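If it is the filed document itself you want rather than the index page, the same submissions JSON normally also carries a primaryDocument column (treat that as an assumption and check the feed), so you can build direct links without scraping the index at all:
# Hedged sketch: build direct document URLs, assuming 'primaryDocument'
# is present in the submissions JSON alongside 'accessionNumber'.
for idx, row in nmfp2.iterrows():
    accession_alt = ''.join(row['accessionNumber'].split('-'))
    doc_url = f"https://www.sec.gov/Archives/edgar/data/{cik_padded}/{accession_alt}/{row['primaryDocument']}"
    print(doc_url)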
I'm trying to get a specific portion of text from this web page, using code I found in a similar post:
# Import required modules
from lxml import html
import requests

# Request the page
page = requests.get('https://www.baseball-reference.com/players/k/kershcl01.shtml')

# Parsing the page
tree = html.fromstring(page.content)

# Get element using XPath
share = tree.xpath(
    '//div[@id="leaderboard_cyyoung"]/table/tbody/tr[11]/td/a')
print(share)
Output is just empty brackets []
You are getting empty results because the div element you are trying to query is commented out in the requested page's source. Note that when you use the requests.get method, you get the page's HTML source code, not the rendered HTML code generated by the browser from your interaction with the page and that you can inspect with the browser's developer tools.
So I would say: check again whether this is really the element you see rendered on the page, and see if you can identify what kind of interaction causes it to be rendered. Then you can use a tool to mimic that interaction so that you can get the rendered HTML code within your Python environment; I would suggest helium for doing so. If this is not the right element, you can simply update the XPath to point at an element that is present in the source code and scrape it successfully.
As stated, this is a rendered/dynamic part of the site. It is there in the comments, so you'll need to pull the comments out of the HTML, then parse them. The other issue is that inside the comments there is no <tbody> tag, so the XPath won't find anything; you need to remove that part. I'm not sure what you want to pull out, though (is it the link, is it the text?). I altered your code to show you how to use it with lxml, but honestly I'm not a fan; I'd prefer to just use BeautifulSoup. BeautifulSoup doesn't integrate with XPath, however, so I used a CSS selector instead.
Your code altered:
import requests
from lxml import html
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        htmlStr = str(each)

        # Parsing the page
        tree = html.fromstring(htmlStr)

        # Get element using XPath
        share = tree.xpath('//div[@id="leaderboard_cyyoung"]/table/tr[11]/td/a')
        print(share)
How I would do it:
import requests
from bs4 import BeautifulSoup, Comment

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'}
url = "https://www.baseball-reference.com/players/k/kershcl01.shtml"

response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')

comments = soup.find_all(string=lambda text: isinstance(text, Comment))
for each in comments:
    if 'leaderboard_cyyoung' in str(each):
        soup = BeautifulSoup(str(each), 'html.parser')
        share = soup.select('div#leaderboard_cyyoung > table > tr:nth-child(12) > td > a')
        print(share)
        break
Output:
[4.58 Career Shares]
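If what you actually want is the link target or the text rather than the tag itself, you can pull those off the matched element (a small follow-up to the snippet above, assuming the selector matched):
if share:
    link = share[0]              # the matched <a> tag
    print(link.get('href'))      # the URL the link points to
    print(link.get_text())       # e.g. "4.58 Career Shares"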
First off, I want to note that there are posts on how to do this the old way using meta tags, but for whatever reason meta tags won't work anymore, and I've seen that it can somehow be done with JSON, but I'm not very familiar with JSON. Like, at all. I kind of modified what I had to work for this, but still nothing. The goal is simply to get the number of followers of an account (user).
def follower_amt(self, user):
    time.sleep(6)
    # old method deprecated
    # now requires using json file
    html = requests.get(f'https://www.instagram.com/{user}/?__a=1')
    soup = BeautifulSoup(html.text, 'lxml')
    data = soup.findAll('span', {'class': 'g47SY'})
    text = data[0].get('content').split()
    user = '%s %s %s' % (text[-3], text[-2], text[-1])
    followers = text[0]
Any help is appreciated!!
(NOTE: not tested, as I doubt scraping is allowed.)
I see that value in the page source inside a script tag, which means you may be able to regex it out as follows:
import requests, re

r = requests.get('https://www.instagram.com/brandonator24/', headers={'User-Agent': 'Mozilla/5.0'}).text
print(int(re.search(r'"edge_follow":{"count":(\d+)}', r).groups(0)[0]))
This of course rests on the broad assumption that other pages have a similar set-up.
Regex meaning: the pattern matches the literal text "edge_follow":{"count": and captures the run of digits that follows as group 1.
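As a quick illustration on a made-up fragment of page source (the fragment is invented; only the pattern comes from the code above):
import re

# Hypothetical fragment of page source, for illustration only
sample = '{"edge_follow":{"count":312}}'
print(re.search(r'"edge_follow":{"count":(\d+)}', sample).group(1))  # prints 312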
Going by the API URL you gave, you can get what you want directly:
import requests
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
res = requests.get(f'https://www.instagram.com/{user}/?__a=1', headers=headers)
print(res.json()['graphql']['user']['username'])
print(res.json()['graphql']['user']['edge_followed_by']['count'])
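Folded back into the shape of the question's method, that would look roughly like this (a sketch, reusing the headers dict defined above and assuming the ?__a=1 endpoint keeps responding with the JSON structure shown):
def follower_amt(self, user):
    # assumes `headers` from the snippet above is in scope
    res = requests.get(f'https://www.instagram.com/{user}/?__a=1', headers=headers)
    data = res.json()['graphql']['user']
    return data['edge_followed_by']['count']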
I'm trying to make a bot that sends me an email once a new product appears on a website.
I tried to do that with requests and BeautifulSoup.
This is my code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.vinted.fr/vetements?search_text=football&size_id[]=207&price_from=0&price_to=15&order=newest_first'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", class_="c-box")
print(len(products))
Next, I want to compare the number of products before and after a new request, in a loop.
But when I try to see the number of products found, I get an empty list: []
I don't know how to fix that.
The div that I use is nested inside other divs; I don't know if that is related.
Thanks in advance.
The problem is with the website you are trying to parse.
The website in your code generates the elements you are looking for (div.c-box) after the page is fully loaded, using JavaScript, on the client side. So it goes like this:
Browser gets the HTML source from the server --(1)--> JS files are loaded as the browser parses the HTML source --> JS files add elements to the HTML source --(2)--> those elements are rendered in the browser
You cannot fetch the data you want with requests.get, because requests.get can only see the HTML source as it is at point (1), while the website adds the data at point (2). To fetch such data, you should use an automated-browser module such as selenium.
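A minimal selenium sketch for this page, assuming Chrome is installed (the div.c-box selector is the one from your own code):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

URL = 'https://www.vinted.fr/vetements?search_text=football&size_id[]=207&price_from=0&price_to=15&order=newest_first'

driver = webdriver.Chrome()
driver.get(URL)

# Wait until the client-side script has inserted at least one product box
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'div.c-box')))

soup = BeautifulSoup(driver.page_source, 'html.parser')
print(len(soup.find_all('div', class_='c-box')))
driver.quit()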
You should always check the data.
Convert your BeautifulSoup object to a string (for example with soup.decode('utf-8')) and write it to a file. Then check what you actually get from the website. In this case, there is no element with the c-box class.
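For example (a tiny sketch; prettify() here is just a convenient way to get the parsed HTML back as text):
# Write what requests actually received to a file, then search it for "c-box"
with open('vinted_dump.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())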
You should use selenium instead of requests.
I'd like to scrape Google search result URLs with Python.
Here's my code
import requests
from bs4 import BeautifulSoup

def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:])
    return result

search('computer')
Then I get results. The first URL in the list is the Wikipedia one, which comes back as:
'https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQFggTMAA&usg=AOvVaw2nvT-2sO4iJenW_fkyCS3i',
'?q=computer&num=100&ie=UTF-8&prmd=ivnsbp&tbm=isch&tbo=u&source=univ&sa=X&ved=0ahUKEwixyfu7q5HdAhWR3lQKHUfoDcsQsAQIHg'
I want to get the clean URL, which in this case is 'https://en.wikipedia.org/wiki/Computer', and the same for all the other search results.
How can I modify my code?
Edited: as you can see in the image below, I want to get the real URL (marked yellow), not the messy, long URL above.
How about appending
.split('&')[0]
to your code, so that it becomes:
import requests
from bs4 import BeautifulSoup

def search(keyword):
    html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword)).text
    soup = BeautifulSoup(html, 'html.parser')
    result = []
    for i in soup.find_all('h3', {'class': 'r'}):
        result.append(i.find('a', href=True)['href'][7:].split('&')[0])
    return result

search('computer')
[EDIT]
Taking https://en.wikipedia.org/wiki/Computer as an example:
Through chrome developer tools the url looks clean.
Since it belongs to <h3 class="r">, your code should work fine and return the clean url.
Instead, if you replace
result.append(i.find('a', href = True) ['href'][7:])
with
print(i)
then in my terminal it returns the following for the above link:
/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ
you can see that /url?q= has been prepended, and &sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ
has been appended.
By looking at other links as well, I observed that the prepended part always looks like /url?q=, and the appended part always begins with a &.
Therefore it's my belief that my original answer should work:
result.append(i.find('a', href = True) ['href'][7:].split('&')[0])
[7:] removes the prepended string, and split('&')[0] the appended string.
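If you would rather not rely on fixed offsets and splitting, the standard library's urllib.parse can pull the q parameter out of the /url?q=... redirect directly (a sketch using the example link above):
from urllib.parse import urlparse, parse_qs

raw = '/url?q=https://en.wikipedia.org/wiki/Computer&sa=U&ved=0ahUKEwinqcqdypHdAhVhKH0KHVWIBEUQFggfMAU&usg=AOvVaw1pduIWw_TSCJUxtP9W_kHJ'
clean = parse_qs(urlparse(raw).query)['q'][0]
print(clean)  # https://en.wikipedia.org/wiki/Computer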
I found a solution.
This modification in the search function works:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
html = requests.get('https://www.google.co.kr/search?q={}&num=100&sourceid=chrome&ie=UTF-8'.format(keyword), headers = headers).text