So, I am a bit new to Python, and I can't wrap my head around why this code snippet is not working.
In short, I have a list of 500 sites, one per line, all in the following format: https://www.domain . com/subfolder/subfolder, and I am trying to download them. This is the code:
import wget

f = open("500_sites.txt", "r")
content = f.readlines()
url = ""
for x in range(1, len(content)):
    print(content[x])
    wget.download(content[x], 'index.html')
    input("wait a bit")
I am expecting the code to read the text file line by line into the content list. Then, I would like the wget.download() function to download the whole source code of the webpage at content[x].
Calling wget.download() with a hard-coded URL works perfectly:
...
url = "https://domain . com/subfolder/subfolder"
wget.download(url, 'index.html')
...
Thanks in advance!
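For what it's worth, the usual culprit here is that readlines() keeps the trailing newline on every line, so wget.download() receives "https://...\n" rather than a clean URL. A minimal sketch of the fix, under that assumption (it iterates over every line; keep skipping the first one only if your file really has a header):

import wget

with open("500_sites.txt", "r") as f:
    for line in f:
        url = line.strip()  # each line still carries its trailing '\n'
        if not url:
            continue  # skip blank lines
        print(url)
        wget.download(url, 'index.html')  # note: this overwrites index.html each pass
        input("wait a bit")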
Related
I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded Pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content
<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network
<_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, since the printed '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>' suggests the pages are being saved to the correct path.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful
soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)
for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))
    # save each downloaded page to the 'Downloaded Pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)
    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
    downloadedPage.close()
Appreciate any advice, thanks.
The problem is that parsing the basename of a page works when the URL ends in a page name like something.html, but when the URL doesn't specify one, like "http://python.org/", the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths, as @Thyebri said.
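To see the empty-basename problem concretely, here is the bracket-printing check mentioned above as a quick two-line sketch:

import os

print('[' + os.path.basename('https://www.python.org/about.html') + ']')  # prints [about.html]
print('[' + os.path.basename('https://www.python.org/') + ']')            # prints [] -- basename is empty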
Also, remember that the file name you write cannot contain characters like '/', '\' or '?'.
So, I don't know whether the following code is messy or not, but using the re library I would do the following:
import re

filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded_Pages', filename), 'wb')
So first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that are present in URLs with a dash '-', and that is the name that will be given to the file.
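Putting the pieces together, a minimal sketch of how the loop could look with the sanitized filename. One assumption about intent: this version fetches each linked page with its own requests.get call, whereas the original wrote the homepage response res for every link; the URL-joining is kept from the question's code, and anchors without an href are skipped:

import re, os, requests, bs4

res = requests.get('https://www.python.org/')
res.raise_for_status()
soupObj = bs4.BeautifulSoup(res.text, 'html.parser')
linkElem = soupObj.select('a')

os.makedirs('Downloaded_Pages', exist_ok=True)
for elem in linkElem:
    href = elem.get('href')
    if not href:
        continue  # skip anchors without an href
    linkUrlToOpen = 'https://www.python.org' + href
    # sanitize the full URL into a safe file name
    filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
    linkRes = requests.get(linkUrlToOpen)  # fetch the linked page itself
    with open(os.path.join('Downloaded_Pages', filename), 'wb') as downloadedPage:
        for chunk in linkRes.iter_content(100000):
            downloadedPage.write(chunk)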
Hope it works!
I have an Excel file with a column filled with 4000+ URLs, each one in a different cell. I need to use Python to open each one with Chrome, scrape some of the data from the website, and paste it into Excel. And then do the same steps for the next URL. Could you please help me with that?
Export the Excel file to a CSV file and read data from it as follows:
def data_collector(url):
    # do your code here and return the data that you want to write in place of the url
    return url

with open("myfile.csv") as fobj:
    content = fobj.read()

# the line below will return the urls in the form of a list
urls = content.replace(",", " ").split()

for url in urls:
    data_to_be_written = data_collector(url)
    # added extra quotes to prevent csv from breaking; it is prescribed
    # to use the csv module to write csv files, but for ease of understanding
    # I did it like this, hoping you will correct it by yourself
    content = "\"" + content.replace(url, data_to_be_written) + "\""

with open("new_file.csv", "wt") as fnew:
    fnew.write(content)
After running this code you will get new_file.csv; opening it with Excel, you will see your desired data in place of each URL.
If you want the URL together with its data, just concatenate them into a single string separated by a colon.
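For reference, a minimal sketch of the same idea using the csv module, which handles the quoting automatically (the file names and data_collector are the ones assumed above):

import csv

def data_collector(url):
    # placeholder: fetch the page and return whatever data you need
    return url

with open("myfile.csv", newline="") as fobj:
    rows = list(csv.reader(fobj))

with open("new_file.csv", "w", newline="") as fnew:
    writer = csv.writer(fnew)
    for row in rows:
        # replace each URL cell with the collected data; csv quotes as needed
        writer.writerow([data_collector(cell) for cell in row])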
I have a list of URLs saved in a .txt file and I would like to feed them, one at a time, to a variable named url to which I apply methods from the newspaper3k python library. The program extracts the URL content, authors of the article, a summary of the text, etc, then prints the info to a new .txt file. The script works fine when you give it one URL as user input, but what should I do in order to read from a .txt with thousands of URLs?
I am only beginning with Python, as a matter of fact this is my first script, so I have tried to simply say url = (myfile.txt), but I realized this wouldn't work because I have to read the file one line at a time. So I have tried to apply read() and readlines() to it, but it wouldn't work properly because 'str' object has no attribute 'read' or 'readlines'. What should I use to read those URLs saved in a .txt file, each beginning in a new line, as the input of my simple script? Should I convert string to something else?
Extract from the code, lines 1-18:
from newspaper import Article
from newspaper import fulltext
import requests
url = input("Article URL: ")
a = Article(url, language='pt')
html = requests.get(url).text
text = fulltext(html)
download = a.download()
parse = a.parse()
nlp = a.nlp()
title = a.title
publish_date = a.publish_date
authors = a.authors
keywords = a.keywords
summary = a.summary
Later I have built some functions to display the info in a desired format and save it to a new .txt. I know this is a very basic one, but I am honestly stuck... I have read other similar questions here but I couldn't properly understand or apply the suggestions. So, what is the best way to read URLs from a .txt file in order to feed them, one at a time, to the url variable, to which the other methods are then applied to extract the content?
This is my first question here and I understand the forum is aimed at more experienced programmers, but I would really appreciate some help. If I need to edit or clarify something in this post, please let me know and I will correct immediately.
Here is one way you could do it:
from newspaper import Article
from newspaper import fulltext
import requests
with open('myfile.txt', 'r') as f:
    for line in f:
        # do not forget to strip the trailing newline
        url = line.rstrip("\n")
        a = Article(url, language='pt')
        html = requests.get(url).text
        text = fulltext(html)
        download = a.download()
        parse = a.parse()
        nlp = a.nlp()
        title = a.title
        publish_date = a.publish_date
        authors = a.authors
        keywords = a.keywords
        summary = a.summary
This could help you:
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    print(url)
url_file.close()
You can apply it to your code as follows:
from newspaper import Article
from newspaper import fulltext
import requests
url_file = open('myfile.txt', 'r')
for url in url_file.readlines():
    url = url.strip()  # drop the trailing newline before using the URL
    a = Article(url, language='pt')
    html = requests.get(url).text
    text = fulltext(html)
    download = a.download()
    parse = a.parse()
    nlp = a.nlp()
    title = a.title
    publish_date = a.publish_date
    authors = a.authors
    keywords = a.keywords
    summary = a.summary
url_file.close()
I'm currently creating an app that's supposed to take an input in the form of a URL (here, a PDF file), recognize it as a PDF, and then upload it to a tmp folder I have on a server.
I have absolutely no idea how to proceed with this. I've already made a form which contains a FileField, which works perfectly, but when it comes to URLs I have no clue.
Thank you for all answers, and sorry about my lacking English skills.
The first 4 bytes of a PDF file are %PDF, so you could just download the first 4 bytes from that URL and compare them to %PDF. If they match, then download the whole file.
Example:
import urllib2

url = 'your_url'
req = urllib2.urlopen(url)
first_four_bytes = req.read(4)
if first_four_bytes == '%PDF':
    pdf_content = urllib2.urlopen(url).read()
    # save to temp folder
else:
    pass  # file is not a PDF
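Note that urllib2 is Python 2 only. A minimal Python 3 sketch of the same check would use urllib.request instead, and the comparison must be against a bytes literal since read() returns bytes:

import urllib.request

url = 'your_url'
with urllib.request.urlopen(url) as req:
    first_four_bytes = req.read(4)

if first_four_bytes == b'%PDF':  # bytes literal, not a str
    with urllib.request.urlopen(url) as req:
        pdf_content = req.read()
    # save to temp folder
else:
    pass  # file is not a PDF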
I'm trying to take a URL from a list (~1500 entries) and access them one by one using the twill lib for python. The reason that I'm using twill is because I like it and I might have to perform basic formfilling later on.
The problem I have is declaring the contents of the loop.
I'm sure this is actually pretty simple to solve, but the solution just won't come to my mind at the moment.
from twill.commands import *

CONTAINER = open('urls.txt')  # opening file
CONTAINER_CONTENTS = CONTAINER.readlines()  # reading
CONTAINER_CONTENTS = map(lambda s: s.strip(), CONTAINER_CONTENTS)  # this is just to remove the ^N (newline) that was appended to each URL
for i in CONTAINER_CONTENTS:
    <educate me>
    ..
    go(url)
    etc.
Thanks in Advance.
from twill.commands import *

with open('urls.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        go(url)
        # now do something with the page