When I try to download a file using this code:
import urllib
urllib.urlretrieve("http://e4ftl01.cr.usgs.gov/MOLT/MOD11A1.005/2012.07.11/MOD11A1.A2012193.h22v10.005.2012196013617.hdf","1.hdf")
the file is correctly downloaded.
But my objective is to build a function that will download files depending on some inputs that are parts of the file name.
There are many files on the webpage. Some parts of the file names are the same for every file (e.g. "/MOLT/MOD11A1.005/"), so this is not a problem. Some other parts change from file to file following some well-defined rules (e.g. "h22v10"), and I have solved this using %s (e.g. h%sv%s), so this isn't a problem either. The problem is that some parts of the names change without any rule (e.g. "2012196013617"). These parts of the name do not matter, and I want to ignore them. So, I want to download files whose names contain the first two parts (the part that does not change, and the part that changes under a rule) and WHATEVER else.
I thought, I could use wildcards for WHATEVER, so I tried this:
import urllib
def download(url, date, h, v):
    urllib.urlretrieve("%s/MOLT/MOD11A1.005/%s/MOD11A1.*.h%sv%s.005.*.hdf" %
                       (url, date, h, v), "2.hdf")
download("http://e4ftl01.cr.usgs.gov", "2012.07.11", "22", "10")
This does not download the requested file, but instead generates an error file that says:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head>
<title>404 Not Found</title>
</head>
<body>
<h1>Not Found</h1>
<p>The requested URL /MOLT/MOD11A1.005/2012.07.11/MOD11A1\*\h22v10.005\*\.hdf was not found on this server.</p>
</body>
</html>
It seems like wildcards do not work with HTTP. Do you have any idea how to solve this?
The problem is that some parts of the names change without any rule (e.g. "2012196013617"). These parts of the name do not matter, and I want to ignore them.
That is not possible. HTTP URLs do not support 'wildcards'. You must provide an existing URL.
Here is a solution. It assumes that PartialName is a string with the first part of the filename (as much as is known and constant), that URLtoSearch is the URL where the file can be found (also a string), and that FileExtension is a string of the form ".ext", ".mp3", ".zip", etc.
def findURLFile(PartialName, URLtoSearch, FileExtension):
    import urllib2
    sourceURL = urllib2.urlopen(URLtoSearch)
    readURL = sourceURL.read()
    # find the first instance of PartialName and get the index
    # of its first character in the string (an integer)
    fileIndexStart = readURL.find(PartialName)
    # find the first instance of the file extension after that,
    # and add the extension's length to get past it
    fileIndexEnd = readURL[fileIndexStart:].find(FileExtension) + len(FileExtension)
    # get the filename
    fileName = readURL[fileIndexStart:fileIndexStart + fileIndexEnd]
    # stop reading the URL - not sure if this is necessary
    sourceURL.close()
    # build the URL to download the file from
    downloadURL = URLtoSearch + fileName
    return downloadURL
I am rather new at coding Python, and this could probably benefit from some exception handling and perhaps a while loop. It works for what I need, but I will likely refine the code and make it more elegant.
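For the original MODIS question, the same idea can be expressed with a regular expression standing in for the wildcards: fetch the directory's index page, search its HTML for the first name matching the known pattern, then download that exact file. Here is a minimal sketch (Python 2, to match the question's urllib usage; the regex is an assumption based on the file names shown above):
import re
import urllib
import urllib2

def download(url, date, h, v, dest):
    # Fetch the HTML index page that lists the files for this date
    folder = "%s/MOLT/MOD11A1.005/%s/" % (url, date)
    listing = urllib2.urlopen(folder).read()
    # The acquisition day-of-year and the processing timestamp vary
    # without a rule, so match them with \d{7} and \d{13}
    pattern = r"MOD11A1\.A\d{7}\.h%sv%s\.005\.\d{13}\.hdf" % (h, v)
    match = re.search(pattern, listing)
    if match is None:
        raise ValueError("no file matching %s" % pattern)
    urllib.urlretrieve(folder + match.group(0), dest)

download("http://e4ftl01.cr.usgs.gov", "2012.07.11", "22", "10", "2.hdf")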
How do I access the tags attribute here in the Windows File Properties panel?
Are there any modules I can use? Most Google searches yield properties related to media files or file access times, but not much related to metadata properties like Tags, Description, etc.
The exif module was able to access a lot more properties than most of what I've been able to find, but still, it wasn't able to read the 'Tags' property.
The Description -> Tags property is what I want to read and write to a file.
There's an entire module dedicated to exactly what I wanted: IPTCInfo3.
import iptcinfo3, os, sys, random, string
# Random string generator
rnd = lambda length=3: ''.join(random.choices(string.ascii_letters, k=length))
# Path to the file, open a IPTCInfo object
path = os.path.join(sys.path[0], 'DSC_7960.jpg')
info = iptcinfo3.IPTCInfo(path)
# Show the keywords
print(info['keywords'])
# Add a keyword and save
info['keywords'] = [rnd()]
info.save()
# Remove the weird ghost file created after saving
os.remove(path + '~')
I'm not particularly sure what the ghost file is or does; it looks to be an exact copy of the original file, since the file size remains the same, but regardless, I remove it since it's completely useless for the read/write purposes I need.
There have been some weird behaviours I've noticed while setting the keywords, like some get swallowed up into the file (the file size changed, I know they're there, but Windows doesn't acknowledge this), and only after manually deleting the keywords do they reappear suddenly. Very strange.
I'm using the Python FTP lib for the first time. My goal is simply to connect to an FTP site, get a directory listing, and then download all files which are newer than a certain date (e.g. all files created or modified within the last 5 days).
This turned out to be a bit more complicated than I expected for a few reasons. Firstly, I've discovered that there is no real "standard" FTP file list format. Most FTP sites conventionally use the UNIX ls format, but this isn't guaranteed.
So, my initial thought was to simply parse the UNIX ls format: it's not so bad after all, and it seems most mainstream FTP servers will use it in response to the LIST command.
This was easy enough to code with Python's ftplib:
import ftplib

def callback(line):
    print(line)

ftp = ftplib.FTP("ftp.example.com")
result = ftp.login(user="myusername", passwd="XXXXXXXX")
dirlist = ftp.retrlines("LIST", callback)
This works, except the problem is that the date given in the UNIX list format returned by the FTP server I'm dealing with doesn't have a year. A typical entry is:
-rw-rw-r-- 1 user user 1505581 Dec 9 21:53 somefile.txt
So the problem here is that I'd have to code in extra logic to sort of "guess" if the date refers to the current year or not. Except really, I'd much rather not code some complex logic like that when it seems so unnecessary - there's no reason the FTP server shouldn't be able to give me the year.
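For what it's worth, the usual heuristic is short: ls prints a time instead of a year for entries from roughly the last six months, so assume the current year and roll back one year if that would put the entry in the future. A sketch, assuming the fields have been split out of lines like the one above:
from datetime import datetime

def parse_ls_date(month, day, year_or_time):
    now = datetime.now()
    if ":" in year_or_time:
        # No year given: assume this year, unless that lands in the future
        guess = datetime.strptime("%s %s %d %s" % (month, day, now.year, year_or_time),
                                  "%b %d %Y %H:%M")
        if guess > now:
            guess = guess.replace(year=now.year - 1)
        return guess
    # Year given explicitly (the entry is older than about six months)
    return datetime.strptime("%s %s %s" % (month, day, year_or_time), "%b %d %Y")

print(parse_ls_date("Dec", "9", "21:53"))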
Okay, so after Googling around for some alternative ways to get LIST information, I've found that many FTP servers support the MLST and MLSD commands, which apparently provide a directory listing in a "machine-readable" format, i.e. a list format which is much more amenable to automatic processing. Great. So, I try the following:
dirlist = ftp.sendcmd("MLST")
print(dirlist)
This produces a single line response, giving me data about the current working directory, but NOT a list of files.
250-Start of list for /
modify=20151210094445;perm=flcdmpe;type=cdir;unique=808U6EC0051;UNIX.group=1003;UNIX.mode=0775;UNIX.owner=1229; /
250 End of list
So this looks great, and easy to parse, and it also has a modify date with the year. Except it seems the MLST command is showing information about the directory itself, rather than a listing of files.
So, I've Googled around and read the relevant RFCs, but can't seem to figure out how to get a listing of files in "MLST" format. It seems the MLSD command is what I want, but I get a 425 error when I try that:
File "temp8.py", line 8, in <module>
dirlist = ftp.sendcmd("MLSD")
File "/usr/lib/python3.2/ftplib.py", line 255, in sendcmd
return self.getresp()
File "/usr/lib/python3.2/ftplib.py", line 227, in getresp
raise error_temp(resp)
ftplib.error_temp: 425 Unable to build data connection: Invalid argument
So how can I get a full directory listing in MLST/MLSD format here?
There is another module, ftputil, which is built on top of ftplib and has many features emulating os, os.path and shutil. I found it pretty easy to use and robust in related operations. Maybe you could give it a try.
As for your purpose, the introductory example code in its documentation solves it exactly.
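For instance, the goal stated in the question might look roughly like this with ftputil (a sketch; the host name and credentials are placeholders):
import time
import ftputil

with ftputil.FTPHost("ftp.example.com", "myusername", "XXXXXXXX") as host:
    cutoff = time.time() - 5 * 24 * 3600  # five days ago
    for name in host.listdir(host.curdir):
        # getmtime returns the remote file's modification time as a Unix timestamp
        if host.path.isfile(name) and host.path.getmtime(name) >= cutoff:
            host.download(name, name)  # remote name, local name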
You could try ftplib's mlsd() method, which issues MLSD and parses the machine-readable listing, and see if you can get what you need:
for name, facts in ftp.mlsd('directory'):
    print(name, facts)
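As background, the 425 error from ftp.sendcmd("MLSD") is expected: MLSD, like LIST, returns its results over a data connection, and sendcmd never opens one, whereas mlsd() (available since Python 3.3) does. Building on that, the original goal might look like the following sketch, assuming the server reports modify and type facts as in the sample output above:
from datetime import datetime, timedelta

cutoff = datetime.utcnow() - timedelta(days=5)
for name, facts in ftp.mlsd():
    # 'modify' is a UTC timestamp of the form 20151210094445
    if facts.get("type") == "file" and "modify" in facts:
        if datetime.strptime(facts["modify"], "%Y%m%d%H%M%S") >= cutoff:
            with open(name, "wb") as f:
                ftp.retrbinary("RETR " + name, f.write)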
I am working on something similar, where I need to parse the content of a directory and all subdirectories within. However, the server that I am working with does not allow the MLSD command, so I accomplished what I need with these steps:
parse the main directory content
loop through the main directory content
append each loop's output to a pandas DataFrame
import pandas as pd

test = pd.Series(ftp.nlst('/target directory/'))
df_server_content = pd.DataFrame()
for i in test:
    data_dir = '/target directory/' + i
    server_series = pd.Series(ftp.nlst(data_dir))
    df_server_content = df_server_content.append(server_series)
Let me preface this by saying I'm not exactly sure what is happening with my code; I'm fairly new to programming.
I've been working on an individual final project for my Python CS class that checks my teacher's website on a daily basis and determines whether he's changed any of the web pages on his website since the last time the program ran.
The step I'm working on right now is as follows:
import requests
from bs4 import BeautifulSoup

def write_pages_files():
    '''
    Writes the various page files from the website's links
    '''
    links = get_site_links()
    for page in links:
        site_page = requests.get(root_url + page)
        soup = BeautifulSoup(site_page.text, "html.parser")
        with open(page + ".txt", mode='wt', encoding='utf-8') as out_file:
            out_file.write(str(soup))
The links look similar to this:
/site/sitename/class/final-code
And the error I get is as follows:
with open(page + ".txt", mode='wt', encoding='utf-8') as out_file:
FileNotFoundError: [Errno 2] No such file or directory: '/site/sitename/class.txt'
How can I write the site pages with these types of names (/site/sitename/nameofpage.txt)?
You cannot have / in the file basename on Unix or Windows. You could replace / with . (stripping the leading slash first, so the file name doesn't start with a dot):
page.lstrip("/").replace("/", ".") + ".txt"
Python presumes /site etc. is a directory.
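Alternatively, if you'd rather mirror the URL path as real folders, create the intermediate directories before writing (a sketch; page stands in for a link like the one in the question):
import os

page = "/site/sitename/class/final-code"
out_path = page.lstrip("/") + ".txt"  # write under the current directory
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, mode="wt", encoding="utf-8") as out_file:
    out_file.write("page contents here")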
On Unix/Mac OS, for the middle slashes, you can use :, which is displayed as / when viewed, but does not trigger the subfolder behaviour that / does.
site/sitename/class/final-code -> final-code file in a class folder in a sitename folder in a site folder in the current folder
site:sitename:class:final-code -> site/sitename/class/final-code file in the current folder.
Related to the title of the question, though not the specifics, if you really want your file names to include something that looks like a slash, you can use the unicode character "∕" (DIVISION SLASH), aka u'\u2215'.
This isn't useful in most circumstances (and could be confusing), but can be useful when the standard nomenclature for a concept you wish to include in a filename includes slashes.
I am attempting to check for active web site folders against a list that was created using robots.txt (this is for learning security; I'm doing this on a server that I own and control). I am using Python 2.7 on Kali Linux.
My code works if I just do one web address at a time, as I get the proper 200 or 404 response for folders that are active or absent, respectively.
When I attempt this against the entire list, I get a string of 404 errors. When I print out the actual addresses that the script is creating, everything looks correct.
Here is the code that I am doing:
import requests

attempt = open('info.txt', 'r')
folders = attempt.readlines()

for line in folders:
    host = 'http://10.0.1.66/mutillidae' + line
    attempt = requests.get(host)
    print attempt
This results in a string of 404 errors. If I take the loop out, and try each one individually, I get a 200 response back showing that it is up and running.
I have also printed out the address using the same loop against the text document that contains the correct folders, and the addresses seem to look fine which I verified through copy and pasting. I have tried this with a file containing multiple folders and a single folder listed, and always get a 404 when attempting to read from the file.
The info.txt file contains the following:
/passwords/
/classes/
/javascript/
/config
/owasp-esapi-php/
/documentation/
Any advice is appreciated.
Lines returned by file.readlines() contain trailing newlines, which you must remove before passing them to requests.get. Replace the statement:
host = 'http://10.0.1.66/mutillidae'+line
with:
host = 'http://10.0.1.66/mutillidae' + line.rstrip()
and the problem will go away.
Note that your code would be easier to read if you refrained from using the same generic variable name, such as attempt, for different purposes in the same scope. Also, one should try to use variable names that reflect their usage; for example, host would be better named url, as it doesn't hold the host name, but the entire URL.
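Putting both suggestions together, the loop might read (a sketch; the address and file name are taken from the question):
import requests

with open('info.txt', 'r') as folder_file:
    folders = folder_file.readlines()

for line in folders:
    url = 'http://10.0.1.66/mutillidae' + line.rstrip()
    response = requests.get(url)
    print response.status_code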
I have searched quite a bit for help on this, but most information about XML and Python covers how to read and parse files, not how to write them. I currently have a script that outputs the file system of a folder as an XML file, but I want this file to include HTML tags that can be read on the web.
Also, I am looking for a graphical, folder-style display of the XML information. I have seen this solved in JavaScript, but I don't have any experience in JavaScript, so is this possible in Python?
import os
import sys
from xml.sax.saxutils import quoteattr as xml_quoteattr

dirname = sys.argv[1]
a = open("output2.xml", "w")

def DirAsLessXML(path):
    result = '<dir name=%s>\n' % xml_quoteattr(os.path.basename(path))
    for item in os.listdir(path):
        itempath = os.path.join(path, item)
        if os.path.isdir(itempath):
            result += '\n'.join('  ' + line for line in
                                DirAsLessXML(os.path.join(path, item)).split('\n'))
        elif os.path.isfile(itempath):
            result += '  <file name=%s />\n' % xml_quoteattr(item)
    result += '</dir>'
    return result

if __name__ == '__main__':
    a.write('<structure>\n' + DirAsLessXML(dirname) + '\n</structure>')
    a.close()
Generally it's not a good idea to code HTML directly in Python by building strings as you do it. You won't understand what kind of HTML you produce if you look at your code in a month or so. I recommend using a templating engine.
Alternatively, as you are converting from XML to XML, XSLT (an XML transformation language) seems like a good solution.
Concerning JavaScript, you'll need it if you want an interactive folder view. But this does not influence how to generate the HTML page. Whatever approach you use, just add a script include tag to the HTML header and implement the JavaScript part in an extra file (the one you include, of course).
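To make the templating point concrete, here is one way the same directory walk could emit nested HTML lists directly, which CSS or a small JavaScript snippet can then render as a collapsible tree (a sketch; the function name and CSS classes are made up for illustration):
import os
from xml.sax.saxutils import escape

def DirAsHTML(path):
    # One <li> per directory, with a nested <ul> for its children
    result = '<li class="dir">%s\n<ul>\n' % escape(os.path.basename(path))
    for item in sorted(os.listdir(path)):
        itempath = os.path.join(path, item)
        if os.path.isdir(itempath):
            result += DirAsHTML(itempath)
        else:
            result += '<li class="file">%s</li>\n' % escape(item)
    result += '</ul></li>\n'
    return result

print('<ul>\n' + DirAsHTML('.') + '</ul>')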