Python saves empty HTML file

import requests

url = 'https://hkpropel.humankinetics.com/mylibrary.htm'
r = requests.get(url)
with open('saving.html', 'wb+') as f:
    f.write(r.content)
    f.close()
I'm trying to save a web page to an HTML file. It works fine for other websites, but for this one it always saves empty data.

The page you're trying to copy likely requires you to be logged in. If you're not, it redirects you to the login page and you end up getting a copy of that instead.
Also, I'm pretty new to Python, but I don't think you need f.close() here: the with statement already closes the file for you.
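One quick way to check this theory (a minimal sketch, not specific to this site; the cookie name below is hypothetical) is to look at where the request actually lands and, if a login is required, reuse a logged-in browser session's cookie:

import requests

url = 'https://hkpropel.humankinetics.com/mylibrary.htm'

# Confirm the redirect theory: see where we actually ended up and what came back.
r = requests.get(url)
print(r.url)            # final URL after redirects - probably a login page
print(r.history)        # any intermediate 30x responses
print(len(r.content))   # size of the body that was actually saved

# If the site uses a session cookie, one option is to log in in a browser,
# copy that cookie, and pass it along. The cookie name here is hypothetical.
session = requests.Session()
session.cookies.set('SESSIONID', '<value copied from your browser>')
r = session.get(url)
with open('saving.html', 'wb') as f:
    f.write(r.content)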

Related

Downloading PDFs using Python web scraping not working

Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site would be appreciated.
Thanks.
Looking at the page itself, while it may look static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I did to check that was simply to save the page that BS4 actually got and open it in a text editor:
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for solutions to your problem: BS4 is not able to execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
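As a rough illustration of that route, here is a minimal sketch that renders the page with Selenium and then hands the resulting HTML to BeautifulSoup; it assumes Selenium and a Chrome driver are installed, and reuses the selector from the original script:

import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://mathsmadeeasy.co.uk/gcse-maths-revision/")
time.sleep(5)  # crude wait for the JavaScript to fill in the placeholders

# Parse the rendered HTML instead of the raw response
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

pdf_links = [a['href'] for a in soup.select("a[href$='.pdf']")]
print(len(pdf_links), "PDF links found")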

Writing multiple strings to a txt file, only the last string saves?

I know this is a repeated question, however from all the answers on the web I could not find a solution, as all of them throw errors.
I am simply trying to scrape headers from the web and save them to a txt file.
The scraping code works well; however, it saves only the last string, skipping all the headers before it.
I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different methods of scraping, but all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())

    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip() + "\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice:
from bs4 import BeautifulSoup
import requests

def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")

    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip() + "\n")
This code works, but throws an error in Jupyter when the file is opened there; however, when the file is opened outside Jupyter, the headers are saved...

How to loop through a vector of URLs and scrape some basic tags from each

I am trying to loop through a list of URLs and scrape some data from each link. Here is my code.
from bs4 import BeautifulSoup as bs
import webbrowser
import requests

url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']

for link in url_list:
    File = webbrowser.open(link)
    File = requests.get(link)
    data = File.text
    soup = bs(data, "lxml")
    tspans = soup.find_all("tspan")

tspans
I think this is pretty close, but I'm getting nothing for the tspans variable. I get no error; tspans just shows [].
This is an internal corporate intranet, so I can't share the exact details, but I think it's just a matter of grabbing all the HTML elements named tspan and writing all of them to a text file or a CSV file. That's my ultimate goal: I want to collate everything into a large list and write it all to a file.
As an aside, I was going to use Selenium to log into this site, which requires creds, but it seems like the code I'm testing now lets you open new tabs in a browser, and everything loads fine if you are already logged in. Is this the best practice, or should I use the full login creds + Selenium? I'm just trying to keep things simple.
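One way to approach the collate-and-write step is sketched below. It assumes the pages are reachable with requests while authenticated; note that webbrowser.open only opens a browser tab and does not pass your login to requests, so you would either copy a session cookie from the browser (the cookie name below is hypothetical) or fall back to Selenium. Also, if the tspan elements are drawn by JavaScript, as SVG graphs often are, requests will never see them and you would need Selenium's page_source instead:

import csv
import requests
from bs4 import BeautifulSoup as bs

url_list = ['https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy',
            'https://corp-intranet.com/admin/graph?dag_id=emm1_daily_legacy_history']

session = requests.Session()
# session.cookies.set('session', '<value copied from your browser>')  # hypothetical cookie name

rows = []
for link in url_list:
    response = session.get(link)
    soup = bs(response.text, "lxml")
    for tspan in soup.find_all("tspan"):
        rows.append([link, tspan.get_text(strip=True)])

# Collate everything into one CSV: one row per tspan, tagged with its source URL
with open('tspans.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'tspan_text'])
    writer.writerows(rows)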

Loading more content in a webpage and issues writing to a file

I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file (under a single column), and finally scraping the information from those links and storing it in a text file.
I am currently stuck with two issues.
1. Only the first few links are scraped. I'm unable to extract links from the other pages (the website contains a load-more button). I don't know how to use the XHR object in the code.
2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
from pprint import pprint
import requests
import lxml
import csv
import urllib2
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content, "lxml")
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
    results = soup.findAll('a', {'rel': 'bookmark'})
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])

pprint(get_url_for_search_key('digital advertising'))

with open('ctp_output.csv', 'w+') as f:
    f.write('\n'.join(get_url_for_search_key('digital advertising')))
    f.seek(0)
Reading the CSV file, scraping the respective content, and storing it in a .txt file:
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
    soup = BeautifulSoup(urllib2.urlopen(url))
    with open('ctp_output.txt', 'a+') as f2:
        for tag in soup.find_all('p'):
            f2.write(tag.text.encode('utf-8') + '\n')
Regarding your second problem, your mode is off. You'll need to convert w+ to a+. In addition, your indentation is off.
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
The + suffix will create the file if it doesn't exist. However, w+ will erase all contents before writing at each iteration, whereas a+ will append to the file if it exists, or create it if it does not.
For your first problem, there's no option but to switch to something that can automate clicking browser buttons and whatnot. You'd have to look at Selenium. The alternative is to manually find that button, extract the URL from its href or text, and then make a second request. I leave that to you.
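A rough sketch of the button-clicking route with Selenium follows; the load-more selector is hypothetical, so inspect the real page to find the right one:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('http://www.marketing-interactive.com/?s=digital+advertising')

# Keep clicking "load more" until the button disappears
while True:
    try:
        button = driver.find_element(By.CSS_SELECTOR, 'a.load-more')  # hypothetical selector
        button.click()
        time.sleep(2)  # crude wait for the extra results to load
    except NoSuchElementException:
        break

links = [a.get_attribute('href')
         for a in driver.find_elements(By.CSS_SELECTOR, "a[rel='bookmark']")]
driver.quit()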
If there are more pages of results, observe what changes in the URL when you manually click through to the next page of results.
I can guarantee 100% that a small piece of the URL will have either a subpage number or some other variable encoded in it that relates strictly to the subpage.
Once you have figured out the combination, you just fit it into a for loop where you .format() the subpage number into the URL you want to scrape, and keep navigating through all the subpages of the results this way, as sketched below.
As for the last subpage number, you have to inspect the HTML of the site you are scraping, find the variable responsible for it, and extract its value. See if there is "class": "Page" or an equivalent in their code; it may contain the number you will need for your for loop.
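As a sketch of that loop (the query-string pattern and the last page number here are assumptions; check the site's real URL pattern and pagination markup):

import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern - replace it with whatever actually changes
# when you click through to the next page of results.
base_url = 'http://www.marketing-interactive.com/page/{}/?s=digital+advertising'
last_page = 5  # assumed; read it from the pagination element instead

all_links = []
for page_number in range(1, last_page + 1):
    response = requests.get(base_url.format(page_number))
    soup = BeautifulSoup(response.content, "lxml")
    all_links.extend(a['href'] for a in soup.findAll('a', {'rel': 'bookmark'}))

print(len(all_links), "links collected across", last_page, "pages")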
Unfortunately there is no magic "navigate through the subresults" option...
But this gets pretty close :).
Good luck.

Inherent way to save web page source

I have read a lot of answers regarding web scraping that talk about BeautifulSoup, Scrapy, etc. to perform web scraping.
Is there a way to do the equivalent of saving a page's source from a web browser?
That is, is there a way in Python to point it at a website and get it to save the page's source to a text file using just the standard Python modules?
Here is where I got to:
import urllib
f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")
#somehow save the web page source
f.close()
Not much, I know, but I'm looking for code to actually pull the source of the page so I can write it. I gather that urlopen just makes the connection.
Perhaps there is a readlines() equivalent for reading lines of a web page?
You may try urllib2:
import urllib2

page = urllib2.urlopen('http://stackoverflow.com')
page_content = page.read()
with open('page_content.html', 'w') as fid:
    fid.write(page_content)
Updated code for Python 3, where urllib2 has been replaced by urllib.request:
from urllib.request import urlopen

html = urlopen("http://www.google.com/")
with open('page_content.html', 'w') as fid:
    fid.write(html)
The answer from SoHei will not work because it's missing html.read(), and the file must be opened with the 'wb' parameter instead of just 'w'. The 'b' indicates that the data will be written in binary mode (since .read() returns a sequence of bytes).
The fully working code is:
from urllib.request import urlopen

html = urlopen("http://www.google.com/")
page_content = html.read()
with open('page_content.html', 'wb') as fid:
    fid.write(page_content)
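If you would rather write the file in plain 'w' text mode, a small variant (assuming the server reports its charset, with UTF-8 as a fallback) is to decode the bytes first:

from urllib.request import urlopen

response = urlopen("http://www.google.com/")
charset = response.headers.get_content_charset() or "utf-8"
page_content = response.read().decode(charset, errors="replace")

# Text mode works here because we are writing str, not bytes
with open('page_content.html', 'w', encoding=charset) as fid:
    fid.write(page_content)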
