I have never really worked that much with Python, but I'm basically trying to get a small line of text from a website by web scraping, like this:
import requests
from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/quote/GME/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# grab the text of the first div with the class seen in the inspector
text = soup.find_all('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'})[0].text
print(text)
and then somehow include the result of the "print(text)" line in a separate but local HTML file. The problem is that I'm not sure how I would go about making my .py program interact with an .html file, or whether that's even possible.
To put it simply: I have the information from the URL and I now want it to appear on an HTML page.
I'm not even sure if this is possible; I would appreciate some input on this idea.
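A minimal sketch of one way to do this, assuming you just want to regenerate a local HTML file each time the script runs (the output file name and page markup here are made up for illustration):

import requests
from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/quote/GME/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
scraped = soup.find_all('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'})[0].text

# write the scraped text into a simple local HTML page
with open('output.html', 'w', encoding='utf-8') as f:
    f.write(f"<html><body><p>{scraped}</p></body></html>")

Opening output.html in a browser then shows the scraped text; for anything more elaborate, a templating library such as Jinja2 would be the usual next step.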
This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site Tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the Tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. In the Tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was doing exactly. I added this line:
print(soup)
When printing it (I do not show the printed soup here because it is too long), I notice that this is not the same text as what appears when I right-click and "Inspect" the Winamax webpage. So what is the BeautifulSoup function doing exactly? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in HTML and I'm a beginner in Python, so some of the terminology might be wrong.
That's because the website uses JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything. In your case the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).
There is no <button> tag in the page source.
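A quick way to check this yourself is to fetch the raw page and search it for the class you saw in the inspector; this is just a diagnostic sketch, assuming the page can be fetched with plain requests:

import requests

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
html = requests.get(url).text

# if this prints False, the element is rendered by JavaScript
# and will not be visible to BeautifulSoup
print('ui-touchlink-needsclick price odd-price' in html)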
So I suggest using Selenium as the solution; I tried a basic scrape of the website.
Here is the code I used:
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'

browser = webdriver.Chrome(executable_path=r'Your chromedriver.exe file path', options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

# grab every <span> on the rendered page and print its text
span_tags = browser.find_elements_by_tag_name('span')
for span_tag in span_tags:
    print(span_tag.text)
browser.quit()
This is the output (screenshot not reproduced here):
There is some junk data present in it, but that's for you to figure out what you need and what you don't!
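If the goal is just the betting rates, one option is to keep only the span texts that look like a decimal odd; the sample values and the pattern below are assumptions about how the rates are formatted:

import re

# hypothetical sample of span texts as the Selenium scrape above might return them
span_texts = ["Football", "Paris SG", "1.85", "3,40", "Nice"]

# keep only entries that look like decimal odds, e.g. "1.85" or "3,40"
odds_pattern = re.compile(r'^\d+[.,]\d{2}$')
odds = [t for t in span_texts if odds_pattern.match(t)]
print(odds)  # ['1.85', '3,40']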
Hi, I want to get the text (the number 18) from the em tag shown in the picture above.
When I ran my code, it did not work and gave me only an empty list. Can anyone help me? Thank you~
Here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable JavaScript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website; then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
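For example, with a local Splash instance (typically started via Docker), you could fetch the rendered HTML through its render.html endpoint; this is only a sketch assuming Splash is running on its default port:

import requests
from bs4 import BeautifulSoup

# ask Splash (running locally) to render the page, then parse the result
rendered = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://blog.naver.com/kwoohyun761/221945923725', 'wait': 2},
)
soup = BeautifulSoup(rendered.text, 'lxml')
print(soup.find_all('em', class_='u_cnt _count'))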
EDIT:
First of all, I missed that you were using urlopen incorrectly; the correct way is described here: https://docs.python.org/3/howto/urllib2.html . I'm assuming you are using Python 3, which seems to be the case judging by the print statement.
Furthermore, looking at the issue again, it is a bit more complicated. When you look at the source code of the page, it actually loads an iframe, and in that iframe you have the actual content. Hit Ctrl+U to see the source code of the original URL, since the site seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"
with urlopen(url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')

# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']

# do your crawling
with urlopen(real_url) as response:
    html = response.read()

soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original URL and the URL in the iframe. At first glance it looked like the iframe URL can be constructed from the original URL.
You'll still need a rendered version of the iframe url to grab your desired value.
I don't know what this site is about, but it seems they do not want to be crawled, so maybe you should respect that.
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site?
Thanks.
Looking at the page itself, while it may look static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I did to assess that was simply to log the page that BS4 actually got and open it in a text editor:
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for solutions to your problem: BS4 is not able to load JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do some more complex web scraping.
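As a rough illustration of one possible workaround (not necessarily the best one), you could let a real browser render the page and then reuse the same PDF-link selection; this sketch assumes a working Selenium/ChromeDriver setup:

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
folder_location = r'E:\webscraping'

# let the browser execute the page's JavaScript, then hand the result to BS4
driver = webdriver.Chrome()
driver.get(url)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)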
I am trying to loop through a list of URLs and scrape some data from each link. Here is my code.
from bs4 import BeautifulSoup as bs
import webbrowser
import requests
url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']
for link in url_list:
    File = webbrowser.open(link)   # opens the link in a browser tab
    File = requests.get(link)      # then fetches the same link with requests
    data = File.text
    soup = bs(data, "lxml")
    tspans = soup.find_all("tspan")
    print(tspans)
I think this is pretty close, but I'm getting nothing for the 'tspans' variable. I get no error; 'tspans' just shows [].
This is an internal corporate intranet, so I can't share the exact details, but I think it's just a matter of grabbing all the HTML elements named 'tspan' and writing all of them to a text file or a CSV file. That's my ultimate goal: I want to collate everything into a large list and write it all to a file.
As an aside, I was going to use Selenium to log into this site, which requires creds, but it seems like the code I'm testing now lets you open new tabs in a browser, and everything loads up fine if you are already logged in. Is this the best practice, or should I use the full login creds + Selenium? I'm just trying to keep things simple.
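A minimal sketch of the collating step described above, assuming the tspan elements do come back once the pages are fetched from an authenticated session (the output file name is just an example):

import csv
import requests
from bs4 import BeautifulSoup as bs

url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']

all_rows = []
for link in url_list:
    soup = bs(requests.get(link).text, "lxml")
    # collect the text of every <tspan> on the page, tagged with its source URL
    for tspan in soup.find_all("tspan"):
        all_rows.append([link, tspan.get_text(strip=True)])

# write everything to a single CSV file
with open('tspans.csv', 'w', newline='') as f:
    csv.writer(f).writerows(all_rows)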
I'm studying Python. I want to get content from one URL: get all the text in one title on the website and save it to a .txt file. Can you show me some example code?
By "get all text in one title on the website" I assume you mean getting the title of the page?
Firstly, you'll need BeautifulSoup.
If you have pip, use:
pip install beautifulsoup4
Now onto the code:
from bs4 import BeautifulSoup
from requests import get

url = 'https://example.com'  # replace this placeholder with the URL you want to scrape
r = get(url).text
soup = BeautifulSoup(r, 'html.parser')
title = soup.title.string  # save the title to a variable rather than just printing it
with open('url.txt', 'w') as f:
    f.write(title)
Now, wherever you have the script saved, there will be a file called url.txt containing the page title.