I'm studying Python. I want to get the content from one URL, grab all the text in the title on the website, and save it to a .txt file. Can you show me a code example?
By "get all text in one title on the website" I assume you mean get the title of the page?
Firstly, you'll need BeautifulSoup
If you have pip, use
pip install beautifulsoup4
Now onto the code:
from bs4 import BeautifulSoup
from requests import get

url = 'https://example.com'  # put the URL you want to scrape here
r = get(url).text
soup = BeautifulSoup(r, 'html.parser')
title = soup.title.string  # I save the title to a variable rather than just writing it directly

with open('url.txt', 'w') as f:
    f.write(title)
Now, wherever you have the script saved, there will be a file called url.txt containing the page title.
I have never really worked that much with Python, but I'm basically trying to get a small line of text from a website by web scraping, like this:
import requests
from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/quote/GME/"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
text = soup.find_all('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'})[0].text
print(text)
and then somehow include the result of the print(text) line in a separate but local HTML file. The problem here is that I'm not sure how I would go about making my .py program interact with an .html file, or whether that's even possible.
To put it simply, I have the information from the URL and I now want it to appear on an HTML page. I'm not even sure if this is possible; I would appreciate some input on this idea.
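For reference, here is a minimal sketch of one way to do it (Python can write any text file, including HTML, so you can build the markup as a string and save it; page.html and the wrapper tags are placeholder choices):

import requests
from bs4 import BeautifulSoup

url = "http://finance.yahoo.com/quote/GME/"
soup = BeautifulSoup(requests.get(url).text, 'lxml')
# The class string comes from the question; generated class names like this
# change often, so re-inspect the page if the lookup returns nothing
value = soup.find_all('div', {'class': 'My(6px) Pos(r) smartphone_Mt(6px)'})[0].text

# Wrap the scraped value in HTML and write it to a local file;
# opening page.html in a browser will then display the value
with open('page.html', 'w', encoding='utf-8') as f:
    f.write('<html><body><p>{}</p></body></html>'.format(value))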
This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. In the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was doing exactly. I added this line:
print(soup)
When printing it (I do not show the printed soup here because it is too long), I noticed that it is not the same text as what appears when I right-click and "inspect" the Winamax webpage. So what is the BeautifulSoup function doing exactly? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in HTML and I'm a beginner in Python, so some of the terminology might be wrong.
That's because the website uses JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, try to find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything! In your case, the button/span tags were not in the page source, meaning they are hidden or pulled in through a script. You can check this yourself, as the sketch below shows: there is no <button> tag in the page source.
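A quick check you can run yourself (a sketch; it simply looks for the class name from your question in the raw HTML the server returns):

import urllib.request

url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
page = urllib.request.urlopen(url).read().decode('utf-8', errors='ignore')

# If this prints False, the buttons are injected later by JavaScript,
# so BeautifulSoup alone will never see them
print('ui-touchlink-needsclick' in page)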
So I suggest using Selenium as the solution, and I tried a basic scrape of the website.
Here is the code I used:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'

service = Service(executable_path=r'Your chromedriver.exe file path')
browser = webdriver.Chrome(service=service, options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")

# Grab every <span> on the fully rendered page and print its text
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)

browser.quit()
The output contains some junk data mixed in with the text you want, but that's for you to figure out what you need and what you don't!
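Building on the script above, you could also target the price buttons directly instead of dumping every <span> (a sketch; the class names come from the question's inspection and may well have changed since):

# Reuses the `browser` object from the script above
price_tags = browser.find_elements(By.CSS_SELECTOR, 'button.ui-touchlink-needsclick.price.odd-price')
for tag in price_tags:
    print(tag.text)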
I've tried the code with different websites and elements, but nothing was working.
import requests
from lxml import html

page = requests.get('https://www.instagram.com/username.html')
tree = html.fromstring(page.content)
# XPath attribute tests use @ (e.g. @class), not #
follow = tree.xpath('//span[@class="g47SY"]/text()')
print(follow)
input()  # keep the console window open
Above is the code I tried to use to acquire the number of Instagram followers someone has.
One issue with web scraping Instagram is that a lot of content, including tag attribute values, is rendered dynamically. So the class you are using to fetch followers may change.
If you are able to use the Beautiful Soup library in Python, you might have an easier time parsing the page and getting the data. You can install it using pip install bs4. You can then search for the og:description descriptor, which follows the Open Graph protocol, and parse it to get follower counts.
Here's an example script that should get the follower count for a particular user:
import requests
from bs4 import BeautifulSoup

username = 'google'
html = requests.get('https://www.instagram.com/' + username)
bs = BeautifulSoup(html.text, 'lxml')

# The og:description meta tag holds a summary such as
# "12M Followers, 30 Following, 1.5K Posts - ..."
item = bs.select_one("meta[property='og:description']")

# The preceding og:title tag holds the account name, e.g. "Google (@google) • ..."
name = item.find_previous_sibling().get("content").split("•")[0]
follower_count = item.get("content").split(",")[0]
print(follower_count)
I'm trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran something to get all the PDFs on this page. However, these PDFs have links inside them that point to other PDFs.
I want to get the terminal articles from all the PDF links.
This is how I got all the PDFs from the page http://cis-ca.org/islamscience1.php:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://cis-ca.org/islamscience1.php"

# If there is no such folder, the script will create one automatically
folder_location = r'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
I'd like to get the articles linked inside these PDFs.
Thanks in advance
Take a look at this link: https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python. It shows how to identify hyperlinks and sanitize the PDF document. You could follow it up to the identification part and then perform an operation to store the hyperlinks instead of sanitizing them.
Alternatively, take a look at this library: https://github.com/metachris/pdfx
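With pdfx, extracting the links from each downloaded PDF takes only a few lines (a sketch based on the library's documented interface; the filename below is a placeholder for one of the PDFs you already downloaded):

import pdfx

# Point pdfx at one of the PDFs saved by your download script
pdf = pdfx.PDFx('webscraping/some_issue.pdf')

# get_references_as_dict() groups what it found by type, e.g. 'url' and 'pdf';
# the 'pdf' entries are the nested article links you could then fetch in turn
references = pdf.get_references_as_dict()
print(references.get('pdf', []))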
I am parsing some content from the web and then saving it to a file. So far I have been creating the filename manually.
Here's my code:
import requests
url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
file.write(html)
How could I automate the process of creating and saving the HTML file under the following filename, taken from the URL:
The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore URL that I posted, which was probably modified to avoid product advertising.
Thanks!
You can use BeautifulSoup to get the title text from the page, I would let requests handle the encoding with .content:
url = "http://rads.stackoverflow.com/amzn/click/1593271840"
html = requests.get(url).content
from bs4 import BeautifulSoup
print(BeautifulSoup(html).title.text)
with open("{}.html".format(BeautifulSoup(html).title.text), "wb") as file:
file.write(html)
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
For that particular page, if you just want "The Google Way: How One Company is Revolutionizing Management As We Know It", the product title is in the class a-size-large:
text = BeautifulSoup(html, "html.parser").find("span", attrs={"class": "a-size-large"}).text

with open("{}.html".format(text), "wb") as file:
    file.write(html)
The URL containing The-Google-Way-Revolutionizing-Management is in the canonical link tag:
link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part you need to parse it:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html, "html.parser").find("link", attrs={"rel": "canonical"})

with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)
You could parse the web page using Beautiful Soup, get the title of the page, then slugify it and use that as the file name, or generate a random filename with something like the tempfile module.
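A minimal sketch of the slugify route using only the standard library (the third-party python-slugify package is a more robust option; the regex below is a deliberately simple stand-in):

import re
import requests
from bs4 import BeautifulSoup

def slugify(text):
    # Lowercase, collapse runs of non-alphanumerics into hyphens, trim the ends
    return re.sub(r'[^a-z0-9]+', '-', text.lower()).strip('-')

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).content
title = BeautifulSoup(html, "html.parser").title.text

# e.g. "the-google-way-how-one-company-is-revolutionizing-management-..."
with open("{}.html".format(slugify(title)), "wb") as f:
    f.write(html)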