Python: parsing XML directly from a web address

Hey. I tried to find a way but I can't. I have set up an xml.sax parser in Python and it works perfectly when I read a local file (for example calendar.xml), but I need to read an XML file from a web address.
I figured it would work if I did this:
toursxml='http://api.songkick.com/api/3.0/artists/mbid:'+mbid+'/calendar.xml?apikey=---------'
toursurl=urllib2.urlopen(toursxml)
toursurl=toursurl.read()
parser.parse(toursurl)
But it doesn't. I'm sure there's an easy way, but I can't find it.
I can easily go to the URL, download the file, and open it locally by doing
parser.parse("calendar.xml")
As a workaround I've set it up to fetch the data, create the file locally, close the file, and then read it. But as you can guess, it's slow as hell.
Is there any way to read the XML directly? Also note that the URL does not end in ".xml", so that may be a problem later.

First, your example is mixed up. Please don't reuse variables.
toursurl = urllib2.urlopen(toursxml)
toursurl_string = toursurl.read()
xml.sax.parseString(toursurl_string, handler)
This reads the entire response into a string named toursurl_string. To parse a string, you use xml.sax.parseString(string, handler), which is a module-level function, not a method on the parser.
http://docs.python.org/library/xml.sax.html#xml.sax.parseString
If you want to combine reading and parsing, you have to pass the "stream" or filename to parse.
toursurl= urllib2.urlopen(toursxml)
parser.parse(toursurl)
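For completeness, here is a minimal end-to-end sketch of the streaming approach; the handler class and the URL are placeholders, not the asker's actual code:
import urllib2
import xml.sax

class CalendarHandler(xml.sax.ContentHandler):
    # hypothetical handler: just print each element name
    def startElement(self, name, attrs):
        print name

parser = xml.sax.make_parser()
parser.setContentHandler(CalendarHandler())

# urlopen() returns a file-like object, so parse() can stream from it
toursurl = urllib2.urlopen('http://example.com/calendar.xml')
parser.parse(toursurl)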

parser.parse(xyz)
expects xyz to be a file; you are looking for
xml.sax.parseString(xyz, handler)
which expects xyz to be a string containing XML (note that parseString is a module-level function in xml.sax, not a method on the parser object).

Related

How to resume parsing with new open file [duplicate]

What would be the best practice, if there is any, for parsing multiple config files?
I want to parse the MySQL server configuration and also write the configuration back out.
The configuration format allows directives like:
!includedir /etc/mysql.d/
So the interesting thing is that some configuration may be located in the main file while other parts may be located in sub files.
I think pyparsing only works on ONE single file or one content string.
So I would probably first need to read all the files and maybe restructure the contents, for example by adding headers for the different files:
====main file====
[mysql]
....
!includedir /etc/mysql.d/
====/etc/mysql.d/my.cnf====
[client]
.....
I would only have one pyparsing call.
Then I could parse everything into one big data object, group the file sections and have the file names as keys. This way I could also write the data back to the disk...
The other possibility would be to parse the main file and programmatically parse all other files that were found in the main file.
Thus I would have several pyparsing calls.
What do you think?
In your pyparsing code, attach a parse action to the expression that matches the include statements; have it parse the contents of the referenced files or directory of files, then merge those results into the current parse output. The parse action makes the successive calls to parseString, while your code makes only a single call; a sketch follows below.
See this new example added to the pyparsing examples directory: https://github.com/pyparsing/pyparsing/blob/master/examples/include_preprocessor.py
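A minimal sketch of that idea; the grammar below is deliberately simplified and illustrative, not a complete my.cnf parser:
import glob
import pyparsing as pp

config = pp.Forward()   # the full grammar, defined at the bottom

def expand_includedir(tokens):
    # Parse action: recursively parse every file in the included
    # directory and splice those results into the current parse.
    merged = []
    for path in sorted(glob.glob(tokens.dirname.strip() + '/*')):
        with open(path) as f:
            merged.extend(config.parseString(f.read()))
    return merged

include_dir = (pp.Suppress('!includedir')
               + pp.restOfLine('dirname')).setParseAction(expand_includedir)
section = pp.Suppress('[') + pp.Word(pp.alphanums)('section') + pp.Suppress(']')
option = pp.Word(pp.alphas + '_-')('name') + pp.Optional(
    pp.Suppress('=') + pp.restOfLine('value'))

config <<= pp.ZeroOrMore(include_dir | section | option)

# One top-level call; the parse action handles the included files.
results = config.parseString(open('/etc/mysql/my.cnf').read())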

Python: How to get the URL to a file when the file is received from a pipe?

I created, in Python, an executable whose input is the URL to a file and whose output is the file, e.g.,
file:///C:/example/folder/test.txt --> url2file --> the file
Actually, the URL is stored in a file (url.txt) and I run it from a DOS command line using a pipe:
type url.txt | url2file
That works great.
I want to create, in Python, an executable whose input is a file and whose output is the URL to the file, e.g.,
a file --> file2url --> URL
Again, I am using DOS and connecting executables via pipes:
type url.txt | url2file | file2url
Question: file2url is receiving a file. How do I get the file's URL (or path)?
In general, you probably can't.
If the URL is not stored in the file itself, it seems very difficult to get it back. Imagine someone reading a text aloud to you: without further information, you have no way of knowing which book it comes from.
However, there are certain use cases where you can do it.
Pipe the URL together with the file
If you need the URL and you can arrange it, keep the URL together with the file: make url2file pipe the URL first and then the file.
Restructure your pipeline
Maybe you don't need to find the URL for the file at all if you restructure your pipeline.
Index your files
If only certain files could potentially be piped into file2url, you could precalculate a hash for each file and store it in your program together with the URL. In Python you would do this with a dict where the key is the file's hash (as a string) and the value is the URL. You could use pickle to write the dict to a file and load it at the start of your program.
Then you could simply look up the URL in this dict.
You might want to research how databases or the search functions of file explorers handle indexing, or look into alternative solutions.
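A minimal sketch of that index idea, assuming SHA-256 hashes; the file names, URLs, and pickle path below are made up:
import hashlib
import pickle
import sys

# Build the index once; paths and URLs here are illustrative.
index = {}
for path, url in [('a.txt', 'file:///C:/a.txt'), ('b.txt', 'file:///C:/b.txt')]:
    with open(path, 'rb') as f:
        index[hashlib.sha256(f.read()).hexdigest()] = url
with open('index.pkl', 'wb') as f:
    pickle.dump(index, f)

# Inside file2url: hash whatever arrived on stdin and look it up.
data = sys.stdin.read()
print index.get(hashlib.sha256(data).hexdigest(), 'unknown')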
Searching for the file
You could take one significant line of the file and use something like grep or head on Linux to search all the files on your computer for that line. Note that grep and head are programs, not Python functions; for DOS, you might need to google the equivalent programs.
FYI: grep searches for one line of text inside a file.
head prints the first few lines of a file. I suggest comparing only the first few lines of files to avoid reading through huge files.
Searching all files on the computer might take very long.
You could restrict the search to files with the same size as your piped input, as sketched below.
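As a rough sketch of that narrowing strategy in Python (the root directory is a placeholder, and the size check is only approximate if line endings were translated anywhere along the pipe):
import os
import sys

data = sys.stdin.read()
first_line = data.splitlines()[0] if data else ''

# Walk a directory tree, skip files whose size differs from the piped
# input, then compare only the first line.
for dirpath, dirnames, filenames in os.walk(r'C:\example'):
    for name in filenames:
        path = os.path.join(dirpath, name)
        if os.path.getsize(path) != len(data):
            continue
        with open(path, 'rb') as f:
            if f.readline().rstrip('\r\n') == first_line:
                print 'candidate:', path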
Use url.txt
If file2url knows the location of the file url.txt, then you could look up all the files listed in url.txt until you find one identical to the file that was piped into your program. You could combine this with the hashing/indexing solution.
file2url receives the data via standard input (just like keyboard input).
The data is transferred by the kernel and doesn't necessarily have any file-system representation at all. So if there's no file, there's no URL or path for you to get.
Let's try to do it the obvious way (test.py prints its own stdin back, so the output of the command below echoes the script itself and then the "filename"):
$ cat test.py | python test.py
import sys
print ''.join(sys.stdin.readlines())
print sys.stdin.name
<stdin>
So the filename is "<stdin>": as far as Python is concerned, there is no filename, only input.
Another approach would be system-dependent, for example finding the command line that invoked the program, but there is no guarantee that it will work.

How do I get the snapshot length of a .pcap file using dpkt?

I am trying to get the snapshot length of a .pcap file. I have gone to the man page for pcap and pcap_snapshot but have not been able to get the function to work.
I am running Fedora 20 in a VM, and the code is written in Python.
First I tried to import the file that the man page says to include, but I get a syntax error on the import and on the pcap_snapshot() call.
I am new to Python, so I imagine it's something simple, but I'm not sure what it is. Any help is much appreciated!
import <pcap/pcap.h>
import dpkt
myPcap = open('mycapture.pcap')
myFile = dpkt.pcap.Reader(myPcap)
print "Snapshot length = ", myFile.pcap_snapshot()
Don't read the man page first unless you're writing code in C, C++, or Objective-C.
If you're not using a C-flavored language, you'll need to use a wrapper for libpcap, and should read the documentation for the wrapper first, as you won't be calling the C functions from libpcap, you'll be calling functions from the wrapper. If you try to import a C-language header file, such as pcap/pcap.h, in Python, that will not work. If you try to directly call a C-language function, such as pcap_snapshot(), that won't work, either.
Dpkt is not a wrapper; it is, instead, a library to parse packets and to read pcap files, with the code to read pcap files being independent of libpcap. Therefore, it won't offer wrappers for libpcap APIs such as pcap_snapshot().
Dpkt's documentation is, well, rather limited. A quick look at its pcap.py module seems to suggest that
print "Snapshot length = ", myFile.snaplen
would work; give that a try.
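Putting that together, a minimal sketch; mycapture.pcap is a placeholder, and snaplen is the attribute that dpkt's pcap.py appears to expose:
import dpkt

# Open in binary mode; dpkt.pcap.Reader parses the pcap global header,
# which carries the snapshot length.
with open('mycapture.pcap', 'rb') as f:
    reader = dpkt.pcap.Reader(f)
    print 'Snapshot length =', reader.snaplen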

create a tar file in a string using python

I need to generate a tar file, but as a string in memory rather than as an actual file. What I have as input is a single filename and a string containing the associated contents. I'm looking for a Python library I can use so I avoid having to roll my own.
A little more digging turned up these functions, but using a memory stream object seems a little... inelegant. And making it accept input from strings looks even more... inelegant. OTOH it works, I assume, as most of it is new to me. Anyone see any bugs in it?
Use tarfile in conjunction with cStringIO:
import cStringIO
import tarfile

c = cStringIO.StringIO()
t = tarfile.open(mode='w', fileobj=c)
# here: do your work on t, then...:
t.close()         # close first, so the archive's trailing blocks are written
s = c.getvalue()  # extract the bytestring you need
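To cover the question's exact input, a single filename plus a string of contents, here is a minimal sketch using TarInfo and addfile; the helper name string_to_tar is made up:
import cStringIO
import tarfile

def string_to_tar(filename, contents):
    # Return a tar archive, as a string, containing exactly one file.
    buf = cStringIO.StringIO()
    t = tarfile.open(mode='w', fileobj=buf)
    info = tarfile.TarInfo(name=filename)
    info.size = len(contents)
    t.addfile(info, cStringIO.StringIO(contents))
    t.close()  # flushes the end-of-archive blocks
    return buf.getvalue()

tar_bytes = string_to_tar('hello.txt', 'Hello, world!\n')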

How to separate content from a file that is a container for binary and other forms of content

I am trying to parse some .txt files. These files serve as containers for a variable number of 'children' files that are set off, or identified, within the container with SGML tags. With Python I can easily separate the children files. However, I am having trouble writing the binary content back out as a binary file (say a GIF or JPG). In the simplest case the container might have an embedded HTML file followed by a graphic that is called by the HTML. I am assuming that my problem is that I am reading the original .txt file using open(filename, 'r'), but that seems to be the only way to find the SGML tags used to split the file.
I would appreciate any help to identify some relevant reading material.
I appreciate the suggestions, but I am still struggling with the most basic questions. For example, when I open the file with WordPad and scroll down to a section tagged as a GIF, I see this:
<FILENAME>h65803h6580301.gif
<DESCRIPTION>GRAPHIC
<TEXT>
begin 644 h65803h6580301.gif
M1TE&.#EA(P)I`=4#`("`#,#`P$!`0+^_OW]_?_#P\*"#H.##X-#0T&!#8!`0
M$+"PL"`#('!P<)"0D#`P,%!04#\_/^_O[Y^?GZ^OK]_?WX^/C\_/SV]O;U]?
I can handle finding the section easily enough, but where does the GIF file begin? Does the header start with the 644, with the blanks after the word begin, or with the line beginning with M1TE?
Next, when the file is read into Python, does Python do anything to the binary content that has to be undone when it is written back out?
I can find the lines where the graphics begin:
import re

filerefbin = open('myfile.txt', 'rb')
wholeFile = filerefbin.read()
graphicReg = re.compile('<DESCRIPTION>GRAPHIC')
locationGraphics = graphicReg.finditer(wholeFile)
graphicsTags = []
for match in locationGraphics:
    graphicsTags.append(match.span())
I can easily use the same process to get to the word begin, or to identify the filename and get to the end of the filename in the 'first' line. I have also successfully gotten to the end of the embedded GIF file. But I can't seem to write out the correct combination of things so that, when I double-click on h65803h6580301.gif after it has been isolated and saved, I actually get to see the graphic.
Interestingly, when I open the file in 'rb' mode, the line endings still appear to be present even though they don't seem to have any effect in Notepad. So that is clearly one of my problems; I might need to readlines() and join the lines back together after stripping out the \n.
I love this site and I love PYTHON
This was too easy once I read bendin's post. I just had to snip the section that began with the word begin and save that in a txt file and then run the following command:
import uu
uu.decode(r'c:\test2.txt',r'c:\test.gif')
I have to work with some other stuff for the rest of the day, but I will post more here as I look at this more closely. The first thing I need to discover is how to use something other than a file: since I read the whole .txt file into memory and clipped out the section that has the image, I need to work with the clipped section directly instead of writing it out to test2.txt. I am sure that can be done; it's just a matter of figuring out how.
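Since the follow-up asks how to skip the intermediate test2.txt: uu.decode also accepts file-like objects rather than filenames, so a sketch along these lines should work (here clipped stands for the already-isolated uuencoded section, from the begin line through the trailing end line):
import uu
import cStringIO

infile = cStringIO.StringIO(clipped)   # uuencoded text already in memory
outfile = cStringIO.StringIO()
uu.decode(infile, outfile)
open('test.gif', 'wb').write(outfile.getvalue())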
What you're looking at isn't "binary", it's uuencoded. Python's standard library includes the module uu, to handle uuencoded data.
The uu module is usually shown working with real files for encoding and decoding. You can avoid temporary files by handing it file-like objects, or by using Python's codecs module like this:
import codecs
data = "Let's just pretend that this is binary data, ok?"
uuencode = codecs.getencoder("uu")
data_uu, n = uuencode(data)
uudecode = codecs.getdecoder("uu")
decoded, m = uudecode(data_uu)
print """* The initial input:
%(data)s
* Encoding these %(n)d bytes produces:
%(data_uu)s
* When we decode these %(m)d bytes, we get the original data back:
%(decoded)s""" % globals()
You definitely need to be reading in binary mode if the content includes JPEG images.
As well, Python includes an SGML parser: http://docs.python.org/library/sgmllib.html
There is no example there, but all you need to do is set up do_ methods to handle the SGML tags you want; a tiny sketch follows.
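A minimal sketch of that, assuming Python 2's sgmllib and the <FILENAME> tag from the container shown above:
import sgmllib

class ContainerParser(sgmllib.SGMLParser):
    # do_<tagname> is called for tags that have no end tag, e.g. <FILENAME>;
    # sgmllib lowercases tag names, hence do_filename.
    def do_filename(self, attributes):
        print 'hit a <FILENAME> tag'
    def handle_data(self, data):
        pass  # the text after each tag (e.g. the actual file name) lands here

p = ContainerParser()
p.feed(open('myfile.txt', 'rb').read())
p.close()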
You need to open(filename, 'rb') to open the file in binary mode. Be aware that on some operating systems this means Python will hand you the raw two-byte line endings, which can be confusing.
