I am parsing some content from the web and then saving it to a file. So far I manually create the filename.
Here's my code:
import requests
url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
    file.write(html)
How could I automate the process of creating and saving the following HTML filename from the URL:
The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore URL that I posted, which was probably modified to avoid product advertising.
Thanks!
You can use BeautifulSoup to get the title text from the page; I would let requests handle the encoding by using .content:
url = "http://rads.stackoverflow.com/amzn/click/1593271840"
html = requests.get(url).content
from bs4 import BeautifulSoup
print(BeautifulSoup(html).title.text)
with open("{}.html".format(BeautifulSoup(html).title.text), "wb") as file:
    file.write(html)
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
For that particular page, if you just want The Google Way: How One Company is Revolutionizing Management As We Know It, the product title is in a span with the class a-size-large:
text = BeautifulSoup(html).find("span",attrs={"class":"a-size-large"}).text
with open("{}.html".format(text), "wb") as file:
    file.write(html)
The link containing The-Google-Way-Revolutionizing-Management is in the canonical link tag:
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part you need to parse it:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
with open("{}.html".format(link["href"].split("/")[3]),"wb") as file:
    file.write(html)
You could parse the web page using BeautifulSoup, get the title of the page, then slugify it and use that as the file name, or generate a random filename with something like the tempfile module.
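For example, a minimal sketch of the slugify route (the slugify helper here is just a hand-rolled illustration, not a library function):
import re
import requests
from bs4 import BeautifulSoup

def slugify(text):
    # keep letters, digits and hyphens; collapse everything else into single hyphens
    return re.sub(r"[^A-Za-z0-9]+", "-", text).strip("-")

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).content
title = BeautifulSoup(html, "html.parser").title.text

with open("{}.html".format(slugify(title)), "wb") as f:
    f.write(html)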
I want to read all the text information from an HTML page that I have stored locally. I managed to get it to read all the page's information, but it is also reading the HTML tags and JavaScript code.
I am trying to get the information from a downloaded HTML file, not from a URL on a website. I want a method that only gets the text from the HTML page and works with my code below.
How can I make it so that it only writes the text that is in the HTML page into the text file?
Here is my code:
with open("ct.html", "r", encoding='utf') as f:
    data = f.read()

with open("test.txt", "w", encoding='utf-8-sig') as f:
    for line in data:
        f.write(line)
You can also try some newer methods, for example the simplified_scrapy library:
from simplified_scrapy import SimplifiedDoc, utils, req
html = utils.getFileContent('test.html')
doc = SimplifiedDoc(html)
utils.appendFile('test.txt', doc.text)
# Or
utils.appendFile('test2.txt', doc.title.text)
utils.appendFile('test2.txt', doc.body.text)
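If you would rather stay with BeautifulSoup, here is a minimal sketch using the ct.html and test.txt names from the question; get_text() drops the tags, and the script/style blocks are removed first so their code does not end up in the output:
from bs4 import BeautifulSoup

with open("ct.html", "r", encoding="utf-8") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# remove script and style blocks so only the visible text remains
for tag in soup(["script", "style"]):
    tag.decompose()

with open("test.txt", "w", encoding="utf-8-sig") as f:
    f.write(soup.get_text(separator="\n", strip=True))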
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of my files from the maths revision site?
Thanks.
Looking at the page itself: while it may look static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I did to check that was simply to save the page that BS4 actually got and open it in a text editor:
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for solutions to your problem: BS4 is not able to execute JavaScript, so you need something that can. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
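If you do need the JavaScript-rendered HTML, here is a minimal sketch with Selenium; assume Chrome and a matching driver are installed, and treat the fixed sleep as a crude placeholder for a proper wait:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"

driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # crude wait so the JavaScript can fill in the placeholders
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# the rest of the original script works unchanged on this soup
print(len(soup.select("a[href$='.pdf']")))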
I'm trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran something to get all the PDFs on this page. However, these PDFs have links inside them that point to other PDFs.
I want to get the terminal articles from all the PDF links.
This is what got all the PDFs from the page http://cis-ca.org/islamscience1.php:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "http://cis-ca.org/islamscience1.php"
# If there is no such folder, the script will create one automatically
folder_location = r'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
I'd like to get the articles linked inside these PDFs.
Thanks in advance
https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python
Take a look at this link. It shows how to identify hyperlinks and sanitize the PDF document. You could follow it up to the identification part and then store the hyperlinks instead of sanitizing them.
Alternatively, take a look at this library: https://github.com/metachris/pdfx
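A rough sketch of the pdfx route, based on its README (treat the exact method names as an assumption and check the project documentation):
import pdfx

# "some_issue.pdf" is a placeholder for one of the files downloaded by the script above
pdf = pdfx.PDFx("some_issue.pdf")
references = pdf.get_references_as_dict()

# print the nested PDF links found inside the document
for link in references.get("pdf", []):
    print(link)

# pdfx can reportedly also download the referenced PDFs directly:
# pdf.download_pdfs("nested_pdfs")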
I'm studying Python. I want to get content from one URL, get all the text in one title on the website, and save it to a .txt file. Can you show me some example code?
By "get all text in one title on the website" I assume you mean get the title of the page?
Firstly, you'll need BeautifulSoup
If you have pip, use
pip install beautifulsoup4
Now onto the code:
from bs4 import BeautifulSoup
from requests import get

url = "https://example.com"  # placeholder: replace with the page you want
r = get(url).text
soup = BeautifulSoup(r, 'html.parser')
title = soup.title.string  # save the title to a variable rather than just printing it

with open('url.txt', 'w') as f:
    f.write(title)
Now, wherever you have the script saved, there will be a file called url.txt containing the page title.
I have a folder full of Windows .URL files. I'd like to translate them into a list of MLA citations for my paper.
Is this a good application of Python? How can I get the page titles? I'm on Windows XP with Python 3.1.1.
This is a fantastic use for Python! The .URL file format has a syntax like this:
[InternetShortcut]
URL=http://www.example.com/
OtherStuff=irrelevant
To parse your .URL files, start with ConfigParser, which will read this and make an InternetShortcut section that you can read the URL from. Once you have a list of URLs, you can then use urllib or urllib2 to load the URL, and use a dumb regex to get the page title (or BeautifulSoup as Alex suggests).
Once you have that, you have a list of URLs and page titles...not enough for a full MLA citation, but should be enough to get you started, no?
Something like this (very rough, coding in the SO window):
from glob import glob
from urllib2 import urlopen
from ConfigParser import ConfigParser
from re import search

# I use RE here, you might consider BeautifulSoup because RE can be stupid
TITLE = r"<title>([^<]+)</title>"

result = []
for file in glob("*.url"):
    config = ConfigParser()
    config.read(file)
    url = config.get("InternetShortcut", "URL")

    # Get the title
    page = urlopen(url).read()
    try:
        title = search(TITLE, page).groups()[0]
    except:
        title = "Couldn't find title"
    result.append((url, title))

for url, title in result:
    print "'%s' <%s>" % (title, url)
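Since the question mentions Python 3.1, here is a rough, untested sketch of the same idea with the Python 3 module names (configparser, urllib.request, and the print function):
import re
from glob import glob
from configparser import ConfigParser
from urllib.request import urlopen

TITLE = r"<title>([^<]+)</title>"

result = []
for path in glob("*.url"):
    config = ConfigParser()
    config.read(path)
    url = config.get("InternetShortcut", "URL")

    # decode so the regex runs on text; still the same dumb regex as above
    page = urlopen(url).read().decode("utf-8", errors="replace")
    match = re.search(TITLE, page, re.IGNORECASE)
    title = match.group(1) if match else "Couldn't find title"
    result.append((url, title))

for url, title in result:
    print("'{}' <{}>".format(title, url))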
Given a file that contains an HTML page, you can parse it to extract its title, and BeautifulSoup is the recommended third-party library for the job. Get the BeautifulSoup version compatible with Python 3.1 here, install it, then:
parse each file's contents into a soup object e.g. with:
from BeautifulSoup import BeautifulSoup
html = open('thefile.html', 'r').read()
soup = BeautifulSoup(html)
get the title tag, if any, and print its string contents (if any):
title = soup.find('title')
if title is None: print('No title!')
else: print('Title: ' + title.string)