Parsing the file name from a list of URL links - Python

OK, so I am using a script that downloads files from URLs listed in urls.txt.
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    urllib.request.urlretrieve(link)
Unfortunately they are saved as temporary files due to the lack of a second argument in my urllib.request.urlretrieve call. As there are thousands of links in my text file, naming them separately is not an option. The thing is that the name of the file is contained in those links, i.e. /DocumentXML2XLSDownload.vm?firsttime=true&repengback=true&documentId=XXXXXX&xslFileName=rher2xml.xsl&outputFileName=XXXX_2017_06_25_4.xls, where the name of the file comes after outputFileName=
Is there an easy way to parse the file names and then use them as the second argument to urllib.request.urlretrieve? I was thinking of extracting those names in Excel and placing them in another text file that would be read in a similar fashion to urls.txt, but I'm not sure how to implement it in Python. Or is there a way to do it exclusively in Python, without using Excel?

You could parse the link on the fly.
Example using a regular expression:
import re
import urllib.request

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    link = link.rstrip()
    regexp = r'((?<=\?outputFileName=)|(?<=&outputFileName=))[^&]+'
    match = re.search(regexp, link)
    if match is None:
        # Make the user aware that something went wrong, e.g. raise an exception
        # and/or just print something
        print("WARNING: Couldn't find file name in link [" + link + "]. Skipping...")
    else:
        file_name = match.group(0)
        urllib.request.urlretrieve(link, file_name)

You can use urlparse and parse_qs to get the query string (in Python 3 they live in urllib.parse):
from urllib.parse import urlparse, parse_qs

parse = urlparse('http://www.cwi.nl:80/%7Eguido/Python.html?name=Python&version=2')
print(parse_qs(parse.query)['name'][0])  # prints Python
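Applied to the original loop, a minimal sketch could look like the following (it assumes every link actually carries an outputFileName query parameter; links without it are skipped):
import urllib.request
from urllib.parse import urlparse, parse_qs

with open("urls.txt", "r") as file:
    linkList = file.readlines()

for link in linkList:
    link = link.strip()
    # parse_qs maps each query parameter to a list of values
    params = parse_qs(urlparse(link).query)
    names = params.get("outputFileName")
    if not names:
        print("WARNING: no outputFileName in link [" + link + "]. Skipping...")
        continue
    urllib.request.urlretrieve(link, names[0])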

Related

Search for a word in webpage and save to TXT in Python

I am trying to: load links from a .txt file, search for a specific word, and if the word exists on that webpage, save the link to another .txt file. But I am getting the error: No scheme supplied. Perhaps you meant http://<_io.TextIOWrapper name='import.txt' mode='r' encoding='cp1250'>?
Note: the links have HTTPS://
The code:
import requests

list_of_pages = open('import.txt', 'r+')
save = open('output.txt', 'a+')
word = "Word"
save.truncate(0)

for page_link in list_of_pages:
    res = requests.get(list_of_pages)
    if word in res.text:
        response = requests.request("POST", url)
        save.write(str(response) + "\n")
Can anyone explain why? Thank you in advance!
Try putting http:// in front of the links.
When you use res = requests.get(list_of_pages) you're creating an HTTP connection to list_of_pages. But requests.get takes a URL string as a parameter (e.g. http://localhost:8080/static/image01.jpg), and look at what list_of_pages is: it's an already opened file, not a string. You have to use either the requests library or the file I/O API, not both.
If you have an already opened file, you don't need to create an HTTP request at all. You don't need this requests.get(). Parse list_of_pages like a normal, local file.
Or, if you would like to go the other way, don't open the text file into list_of_pages; make it a string with the URL of that file.
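A minimal corrected sketch, assuming you want to iterate over page_link and save the matching links themselves (the original wrote a POST response object, and the undefined url variable is dropped here):
import requests

word = "Word"

with open('import.txt', 'r') as list_of_pages, open('output.txt', 'w') as save:
    for page_link in list_of_pages:
        page_link = page_link.strip()   # remove the trailing newline
        if not page_link:
            continue
        res = requests.get(page_link)   # pass the URL string, not the file object
        if word in res.text:
            save.write(page_link + "\n")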

Python3. How to save downloaded webpages to a specified dir?

I am trying to save all the <a> links within the Python homepage into a folder named 'Downloaded pages'. However, after 2 iterations through the for loop I receive the following error:
www.python.org#content <_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>
www.python.org#python-network <_io.BufferedWriter name='Downloaded Pages/www.python.org#python-network'>
Traceback (most recent call last):
  File "/Users/Lucas/Python/AP book exercise/Web Scraping/linkVerification.py", line 26, in <module>
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
IsADirectoryError: [Errno 21] Is a directory: 'Downloaded Pages/'
I am unsure why this happens, as it appears the pages are being saved correctly, given output like '<_io.BufferedWriter name='Downloaded Pages/www.python.org#content'>', which suggests to me it's the correct path.
This is my code:
import requests, os, bs4

# Create a new folder to download webpages to
os.makedirs('Downloaded Pages', exist_ok=True)

# Download webpage
url = 'https://www.python.org/'
res = requests.get(url)
res.raise_for_status()  # Check if the download was successful

soupObj = bs4.BeautifulSoup(res.text, 'html.parser')  # Collects all text from the webpage

# Find all 'a' links on the webpage
linkElem = soupObj.select('a')
numOfLinks = len(linkElem)

for i in range(numOfLinks):
    linkUrlToOpen = 'https://www.python.org' + linkElem[i].get('href')
    print(os.path.basename(linkUrlToOpen))

    # save each downloaded page to the 'Downloaded pages' folder
    downloadedPage = open(os.path.join('Downloaded Pages', os.path.basename(linkUrlToOpen)), 'wb')
    print(downloadedPage)

    if linkElem == []:
        print('Error, link does not work')
    else:
        for chunk in res.iter_content(100000):
            downloadedPage.write(chunk)
    downloadedPage.close()
Appreciate any advice, thanks.
The problem is that when you take the basename of a URL that ends in a file name like .html it works, but when you try to do it with one that doesn't specify it in the URL, like "http://python.org/", the basename is actually empty (you can try printing first the URL and then the basename between brackets or something to see what I mean). So to work around that, the easiest solution would be to use absolute paths like #Thyebri said.
Also, remember that the file name you write cannot contain characters like '/', '\' or '?'
So, I don't know if the following code is messy or not, but using the re library I would do the following:
filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
downloadedPage = open(os.path.join('Downloaded Pages', filename), 'wb')
So, first I remove the "https://" part, and then with the regular expressions library I replace all the usual symbols that are present in URL links with a dash '-', and that is the name that will be given to the file.
Hope it works!
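Putting it together, here is a minimal sketch of the loop with the sanitised file names (illustrative only, not the original poster's code; it still writes the homepage content for every link, exactly as the code in the question does):
import os
import re
import requests, bs4

os.makedirs('Downloaded Pages', exist_ok=True)

res = requests.get('https://www.python.org/')
res.raise_for_status()
soupObj = bs4.BeautifulSoup(res.text, 'html.parser')

for linkElem in soupObj.select('a'):
    href = linkElem.get('href')
    if not href:
        continue
    linkUrlToOpen = 'https://www.python.org' + href
    # Strip the scheme and replace characters that are not allowed in file names
    filename = re.sub(r'[\\/*:"?]+', '-', linkUrlToOpen.split("://")[1])
    with open(os.path.join('Downloaded Pages', filename), 'wb') as downloadedPage:
        downloadedPage.write(res.content)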

Python - html2text write to file

I have text files which contain html tags which I want to remove using html2text with Python:
import html2text
html = open("textFileWithHtml.txt").read()
print html2text.html2text(html)
My question is: how can I write the output to a .txt file? (I want to create the new text file without the HTML elements; the file does not previously exist.)
You need to open another file for writing.
import html2text
html = open("textFileWithHtml.txt")
f = html.read()
w = open("out.txt", "w")
w.write(html2text.html2text(f).encode('utf-8'))
html.close()
w.close()
You should open a file and write to it.
import html2text

# Open your file
with open("textFileWithHtml.txt", 'r') as f_html:
    html = f_html.read()

# Open a file and write to it
with open('your_file.txt', 'w') as f:
    f.write(html2text.html2text(html).encode('utf-8'))
It is good practice to use the with keyword when dealing with file objects, and it is more Pythonic too.
See the documentation on reading and writing files: https://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Edit
If you have issues with encoding, try using .encode('utf-8'). I've added it in my code snippet. Look up Python unicode if you have issues regarding this (https://docs.python.org/2/howto/unicode.html)
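Note that both answers above are written for Python 2 (the question's print statement suggests that is what is being used). On Python 3, writing the result of .encode('utf-8') to a file opened in text mode raises a TypeError; a minimal Python 3 sketch would instead pass the encoding to open() and skip the manual encode:
import html2text

# Read the HTML, letting open() handle the decoding
with open("textFileWithHtml.txt", 'r', encoding='utf-8') as f_html:
    html = f_html.read()

# Write plain text out; no manual .encode() needed in text mode
with open('out.txt', 'w', encoding='utf-8') as f:
    f.write(html2text.html2text(html))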

Python not printing beautifulsoup data to .txt file (or I can't find it)

I'm trying to put all the anchor text on a page in a txt file
print(anchor.get('href'))
with open('file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
and the script executes with no errors but I cannot find file.txt anywhere on my computer. Am I missing out on something really obvious?
With file open mode 'a' (append), the file will be created if it doesn't already exist; otherwise writes will be appended to the end of the file. As written in the question, the file will be created in the current working directory of the running process... look there.
from bs4 import BeautifulSoup
import requests

response = requests.get('http://httpbin.org/')
soup = BeautifulSoup(response.text, 'html.parser')

with open('file.txt', 'a') as fd:
    links = (link.get('href') for link in soup.find_all('a'))
    fd.write('\n'.join(links) + '\n')
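If the file still seems to be missing, printing the absolute path shows exactly where the running process created it (a small illustrative addition, not part of the original answer):
import os

print(os.getcwd())                   # current working directory of the process
print(os.path.abspath('file.txt'))   # where 'file.txt' ends up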
Try setting a full path for the file name, so you know where to look for it.
Example:
print(anchor.get('href'))
with open('/home/wbk/file.txt', 'a') as fd:
    fd.write(anchor.get('href') + '\n')
And yes, you can use 'a' to create a new file, although this is not good practice. The 'a' mode is for appending data to the end of an existing file, as the docs describe.
The following code works for me:
with open("example.txt","a") as a:
a.write("hello")
are you sure that the file isn't in your computer? In which ways you have demonstrated it? In my case, i work with eclipse IDE, so I have to refresh the Eclipse file explorer to see it..
Try this:
with open("prova.txt","a") as a:
a.write("ciao")
try: open("prova.txt","r")
except: print "file doesn't exist"

Parse each file in a directory with BeautifulSoup/Python, save out as new file

New to Python & BeautifulSoup. I have a Python program that opens a file called "example.html", runs a BeautifulSoup action on it, then runs a Bleach action on it, then saves the result as file "example-cleaned.html". So far it is working for all contents of "example.html".
I need to modify it so that it opens each file in folder "/posts/", runs the program on it, then saves it out as "/posts-cleaned/X-cleaned.html" where X is the original filename.
Here's my code, minimised:
from bs4 import BeautifulSoup
import bleach
import re
text = BeautifulSoup(open("posts/example.html"))
text.encode("utf-8")
tag_black_list = ['iframe', 'script']
tag_white_list = ['p','div']
attr_white_list = {'*': ['title']}
# Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
[s.decompose() for s in text(tag_black_list)]
pretty = (text.prettify())
# Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
fout = open("posts/example-cleaned.html", "w")
fout.write(cleaned.encode("utf-8"))
fout.close()
print "Done"
Assistance & pointers to existing solutions gladly received!
You can use os.listdir() to get a list of all files in a directory. If you want to recurse all the way down the directory tree, you'll need os.walk().
I would move all of this code for handling a single file into a function, and then write a second function to handle parsing the whole directory. Something like this:
def clean_dir(directory):
    os.chdir(directory)
    for filename in os.listdir(directory):
        clean_file(filename)

def clean_file(filename):
    tag_black_list = ['iframe', 'script']
    tag_white_list = ['p', 'div']
    attr_white_list = {'*': ['title']}
    with open(filename, 'r') as fhandle:
        text = BeautifulSoup(fhandle)
        text.encode("utf-8")
        # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
        [s.decompose() for s in text(tag_black_list)]
        pretty = (text.prettify())
        # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
        cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
        # this appends -cleaned to the file;
        # relies on the file having a '.'
        dot_pos = filename.rfind('.')
        cleaned_filename = '{0}-cleaned{1}'.format(filename[:dot_pos], filename[dot_pos:])
        with open(cleaned_filename, 'w') as fout:
            fout.write(cleaned.encode("utf-8"))
    print "Done"
Then you just call clean_dir('/posts') or what not.
I'm appending "-cleaned" to the files, but I think I like your idea of using a whole new directory better. That way you won't have to handle conflicts if -cleaned already exists for some file, etc.
I'm also using the with statement to open files here as it closes them and handles exceptions automatically.
Answer to my own question, for others who might find the Python docs for os.listdir a bit unhelpful:
from bs4 import BeautifulSoup
import bleach
import re
import os, os.path

tag_black_list = ['iframe', 'script']
tag_white_list = ['p', 'div']
attr_white_list = {'*': ['title']}

postlist = os.listdir("posts/")

for post in postlist:
    # HERE: you need to specify the directory again, the value of "post" is just the filename:
    text = BeautifulSoup(open("posts/" + post))
    text.encode("utf-8")
    # Step one, with BeautifulSoup: Remove tags in tag_black_list, destroy contents.
    [s.decompose() for s in text(tag_black_list)]
    pretty = (text.prettify())
    # Step two, with Bleach: Remove tags and attributes not in whitelists, leave tag contents.
    cleaned = bleach.clean(pretty, strip="TRUE", attributes=attr_white_list, tags=tag_white_list)
    fout = open("posts-cleaned/" + post, "w")
    fout.write(cleaned.encode("utf-8"))
    fout.close()
I cheated and made a separate folder called "posts-cleaned/", because saving files there was easier than splitting the filename, adding "cleaned", and re-joining it, although if anyone wants to show me a good way to do that, that would be even better.
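For the record, a minimal sketch of the split/re-join approach using os.path.splitext (an illustrative addition, not from the original thread):
import os.path

def cleaned_name(filename):
    # "example.html" -> "example-cleaned.html"
    root, ext = os.path.splitext(filename)
    return root + "-cleaned" + ext

# cleaned_name("example.html") gives "example-cleaned.html"
This avoids the rfind('.') bookkeeping and still behaves sensibly when a file name has no extension.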
