I need to access the source code of a locally saved file, but I need to automate this because there are multiple files in one folder. I've looked at the inspect module and the selenium module, but I still understand what to do. After accessing the source code, I need to use bs4 to extract from it.
I've read several posts on here and elsewhere with similar problems, but the thing is that my file does not open in the source code (it's written in xml and so far everything needs to be in source code before you can use these modules). If I open the file, it just uses my browser to open a regular page and then I have to click view page source.
How can I automate this so that it will open the page, go to the source code, and save it so I can stick it into a soup for later parsing?
path_g_jurt = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt'
file = r'C:\Users\g\Desktop\t\SDU\jurt htmls\jurt\meta jurt' + "/" + file
for file in path_g_jurt:
if file.endswith(".xhtml"):
with open(file, encoding = "utf-8") as mdata_jurt:
soup = BeautifulSoup(mdata_jurt)
main = file.find("jcid").get_text()
misc_links = []
for item in file.find_all("regelgeving"):
misc = item.find("misc:link")
misc_links.append(misc.get("misc:jcid"))
Any help would be appreciated.
Related
I want to ask how to download file using Python from a link like this, I crawled through the stack for a while and didn't find anything that works.
I got a link to a file, something like this:
https://w3.google.com/tools/cio/forms/anon/org/contentload?content=https://w3.ibm.com/tools/cio/forms/secure/org/data/f48f2294-495b-48f5-8d4e-e418f4b25a48/F_Form1/attachment/bba4ddfd-837d-47a6-87ef-2114f6b3da08 (link doesn't work, just showing you how it should look)
And after clicking on it it opens a browser and starts opening file:
I don't know how the file will be named or what format file will have, I only have a URL that links to file like this image up.
I tried this:
def Download(link):
r = requests.get(link)
with open('filename.docx', 'wb') as f:
f.write(r.content)
But this definitely doesn't work, as you can see I manually put the name of the file because it desperate but it doesn't work either, it makes file but only 1kb size and nothing in it.
I don't know how to code it to automatically download it from links like this? Can you help?
use urlretrieve from urllib. See here
You can use urllib.request.urlretrieve to get the contents of the file.
Example:
import urllib.request
with open('filename.docx', 'wb') as f:
f.write(urllib.request.urlretrieve("https://w3.google.com/tools/cio/forms/anon/org/contentload?content=https://w3.ibm.com/tools/cio/forms/secure/org/data/f48f2294-495b-48f5-8d4e-e418f4b25a48/F_Form1/attachment/bba4ddfd-837d-47a6-87ef-2114f6b3da08"))
I've been attempting to create a function that iterates over inputs from a text file that contains URLs, using webbrowser package. It works fine when I create a empty list to which URLs are literally appended, as in:
import webbrowser
list = []
list.append(url1)
list.append(url2)
def webbrowsing(list)
for i in range(0, len(list)):
webbrowser.open(list[i])
where url1 and url2 are any valid URLs. And webbrowser.open() opens the URLs in Chrome and it is really good.
However, when I try and do the same thing with inputs from a text file of URLs, webbrowser opens the URLs from the file in Internet Explorer. I gave it a try using webbrowser.get(), explicitly directing it to use Chrome, but that didn't work.
I am not very sure why it does not open the URLs in Chrome, when almost everything seems the same as when the list is used as mentioned above. Chrome is set as my default web browser, and I rarely use the IE. I'd really appreciate any tips on that issue.
How do you define the 'webbrowser' object? I use something like this:
driver = webdriver.Chrome(driverPath) #driverPath contains the path to the 'chromedriver.exe' file
driver.get(url)
For example I would like to save the .pdf file # http://arxiv.org/pdf/1506.07825 with the filename: 'Data Assimilation- A Mathematical Introduction' at the location 'D://arXiv'.
But I have many such files. So, my input is of the form of a .csv file with rows given by (semi-colon is the delimiter):
url; file name; location.
I found some code here: https://github.com/ravisvi/IDM
But that is a bit advanced for me to parse. I want to start with something simpler. The above seems to have more functionality than I need right now - threading, pausing etc.
So can you please write me a very minimal code to do the above:
save the file 'Data Assimilation- A Mathematical Introduction'
from 'http://arxiv.org/pdf/1506.07825'
at 'D://arXiv'?
I think I will be able to generalize it to deal with a .csv file.
Or, hint me a place to get started. (The github repository already has a solution, and it is too perfect! I want something simpler.) My guess is, with Python, a task as above should be possible with no more than 10 lines of code. So tell me important ingredients of the code, and perhaps I can figure it out.
Thanks!
I would use the requests module, you can just pip install requests.
Then, the code is simple:
import requests
response = requests.get(url)
if response.ok:
file = open(file_path, "wb+") # write, binary, allow creation
file.write(response.content)
file.close()
else:
print("Failed to get the file")
Using Python 3.6.5
Here is a method that can create a folder and save the file in a folder.
dataURL - Complete URL path
data_path - Where the file needs to be saved.
tgz_path - Name of the datafile with the extension.
def fetch_data_from_tar(data_url,data_path,tgz_path):
if not os.path.isdir(data_path):
os.mkdir(data_path)
print ("Data Folder Created # Path", data_path)
else:
print("Folder path already exists")
tgz_path = os.path.join(data_path,tgz_path)
urllib.request.urlretrieve(data_url,filename=tgz_path)
data_tgz = tarfile.open(tgz_path)
data_tgz.extractall(path=data_path)
data_tgz.close()
I see that you can set where to download a file to through Webdriver, as follows:
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",getcwd())
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/csv")
browser = webdriver.Firefox(firefox_profile=fp)
But, I was wondering if there is a similar way to give the file a name when it is downloaded? Preferably, probably not something that is associated with the profile, as I will be downloading ~6000 files through one browser instance, and do not want to have to reinitiate the driver for each download.
I would suggest a little bit strange way: do not download files with the use of Selenium if possible.
I mean get the file URL and use urllib library to download the file and save it to disk in a 'manual' way. The issue is that selenium doesn't have a tool to handle Windows dialogs, such as 'save as' dialog. I'm not sure, but I doubt that it can handle any OS dialogs at all, please correct me I'm wrong. :)
Here's a tiny example:
import urllib
urllib.urlretrieve( "http://www.yourhost.com/yourfile.ext", "your-file-name.ext")
The only job for us here is to make sure that we handle all the urllib Exceptions. Please see http://docs.python.org/2/library/urllib.html#urllib.urlretrieve for more info.
I do not know if there is a pure Selenium handler for this, but here is what I have done when I needed to do something with the downloaded file.
Set a loop that polls your download directory for the latest file that does not have a .part extension (this indicates a partial download and would occasionally trip things up if not accounted for. Put a timer on this to ensure that you don't go into an infinite loop in the case of timeout/other error that causes the download not to complete. I used the output of the ls -t <dirname> command in Linux (my old code uses commands, which is deprecated so I won't show it here :) ) and got the first file by using
# result = output of ls -t
result = result.split('\n')[1].split(' ')[-1]
If the while loop exits successfully, the topmost file in the directory will be your file, which you can then modify using os.rename (or anything else you like).
Probably not the answer you were looking for, but hopefully it points you in the right direction.
Solution with code as suggested by the selected answer. Rename the file after each one is downloaded.
import os
os.chdir(SAVE_TO_DIRECTORY)
files = filter(os.path.isfile, os.listdir(SAVE_TO_DIRECTORY))
files = [os.path.join(SAVE_TO_DIRECTORY, f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))
newest_file = files[-1]
os.rename(newest_file, docName + ".pdf")
This answer was posted as an edit to the question naming a file when downloading with Selenium Webdriver by the OP user1253952 under CC BY-SA 3.0.
I have a rather simple program that writes HTML code ready for use.
It works fine, except that if one were to run the program from the Python command line, as is the default, the HTML file that is created is created where python.exe is, not where the program I wrote is. And that's a problem.
Do you know a way of getting the .write() function to write a file to a specific location on the disc (e.g. C:\Users\User\Desktop)?
Extra cool-points if you know how to open a file browser window.
The first problem is probably that you are not including the full path when you open the file for writing. For details on opening a web browser, read this fine manual.
import os
target_dir = r"C:\full\path\to\where\you\want\it"
fullname = os.path.join(target_dir,filename)
with open(fullname,"w") as f:
f.write("<html>....</html>")
import webbrowser
url = "file://"+fullname.replace("\\","/")
webbrowser.open(url,True,True)
BTW: the code is the same in python 2.6.
I'll admit I don't know Python 3, so I may be wrong, but in Python 2, you can just check the __file__ variable in your module to get the name of the file it was loaded from. Just create your file in that same directory (preferably using os.path.dirname and os.path.join to remain platform-independent).