For a college project I need to handle XML files. After doing some research I chose lxml, but I can't find a good tutorial to get me started, and I can't decide which type of parsing I should use. My XML files don't contain that much data, but speed is my main concern, not memory.
Can anyone point me to a tutorial that would help me, or a book I can look up? I have already tried the tutorial on the lxml site, but that didn't help me much. Is there some small application I can study to get the hang of parsing XML with lxml?
No applications, but some examples:
http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
http://infohost.nmt.edu/tcc/help/pubs/pylxml/pylxml.pdf
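Since the question says speed matters more than memory, here is a minimal sketch of the two usual lxml approaches (the filename data.xml and the tag names record and name are placeholders for whatever your files contain):

from lxml import etree

# Whole-tree parse: simplest, and fast when the files are small
tree = etree.parse("data.xml")
for elem in tree.iter("record"):
    print(elem.findtext("name"))

# iterparse: streams the document and lets you discard elements as you go;
# this is the technique the IBM hiperfparse article above covers in depth
for event, elem in etree.iterparse("data.xml", tag="record"):
    print(elem.findtext("name"))
    elem.clear()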
Related
I need your help finding a tutorial or any other information about web scraping with Python. (Legally: this is part of the data collection for my thesis, so I need legal ways to scrape the data, please.)
Would you please help me find useful sources for this?
https://medium.com/analytics-vidhya/web-scraping-instagram-with-selenium-python-b8e77af32ad4
Medium usually provides fairly good instructions.
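If you follow the Selenium route that article takes, the core loop is small. A minimal sketch, assuming a matching chromedriver is installed; the URL and CSS selector are placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # requires chromedriver on your PATH
driver.get("https://example.com/some-page")  # placeholder URL
for el in driver.find_elements(By.CSS_SELECTOR, "div.post"):  # placeholder selector
    print(el.text)
driver.quit()

On the legal side, whatever tool you use, check the site's terms of service and robots.txt before collecting data.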
This is my first question, so I will try to do everything as properly as possible.
I am currently using LaTeX to write my documents at my university because I want to use the powerful citing capabilities provided by BibTeX. For ease of use, I am writing scripts that integrate my .bib files into my .tex files and make managing my .bib files easier. As I am using Arch Linux, I did this in bash, but it is a little clunky. Therefore I wanted to switch to Python, as I came across the TexSoup library for Python.
My issue now is that I cannot find resources on using TexSoup for .bib files; I can only find resources on .tex files. Does anybody know if, and if yes how, I can use TexSoup to find books, articles, or other entries in my .bib files with Python (or the TexSoup library)?
from TexSoup import TexSoup  # the import the snippet was missing

with open("bib_complete.bib") as f:
    soup = TexSoup(f.read())  # parse the file contents
print(soup)
This is a code sample I am trying to use, but I don't know how to look for entry names or entry types with the package. I would really appreciate it if someone could guide me to good resources, if they exist.
I hope my writing was comprehensible and not too long.
Thanks everybody!
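In the absence of TexSoup documentation for .bib files, here is a plain-Python fallback sketch (no TexSoup involved) that pulls entry types and citation keys out of a .bib file; the regex assumes the conventional @type{key, layout of BibTeX entries:

import re

# Matches "@article{knuth1984," and captures ("article", "knuth1984")
entry_re = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)", re.IGNORECASE)

with open("bib_complete.bib") as f:
    for entry_type, key in entry_re.findall(f.read()):
        if entry_type.lower() in ("book", "article"):  # the types asked about
            print(entry_type, key)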
I did some googling and didn't find anything complete for my problem, but it is so generic that there has to be something.
I need a feed-parsing tool for my Django app (I want to fetch an Atom feed from somewhere and store its contents). I have found some references to feedparser.py, but the actual site seems long gone.
Could you provide some pointers?
feedparser is still pretty much the canonical solution for this in Python. It's very far from gone: see its documentation and its project page.
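A minimal sketch of the usual flow (the feed URL is a placeholder; title, link, and updated are attributes feedparser normalizes for both Atom and RSS, so they map naturally onto Django model fields):

import feedparser

d = feedparser.parse("https://example.com/atom.xml")  # placeholder URL
print(d.feed.title)
for entry in d.entries:
    # store these in your Django models as you see fit
    print(entry.title, entry.link, entry.get("updated", ""))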
I am working with Python 3.x
I want to extract text from several webpages. What is a good library to let me do just that?
Thanks,
Barry.
http://www.crummy.com/software/BeautifulSoup/
and the documentation to get you started
http://www.crummy.com/software/BeautifulSoup/documentation.html
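A minimal sketch of pulling the visible text out of a page (written against the current bs4 package name; the URL is a placeholder):

import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://example.com/").read()
soup = BeautifulSoup(html, "html.parser")
# get_text flattens the parse tree to just the human-readable text
print(soup.get_text(separator="\n", strip=True))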
mechanize is a good library, but unfortunately it isn't ready for Python 3; you can take a look at lxml.html instead.
I would suggest using Beautiful Soup; then it's just a matter of going through the returned structure for anything that looks like an email address.
You could also just use urllib2 for this, but Beautiful Soup takes care of a lot of syntax issues for you.
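For the email-address case specifically, "going through the returned structure" can be as simple as a regex over the soup's text. A rough sketch (the pattern is deliberately naive, not RFC-compliant, and the URL is a placeholder; on Python 3, urllib.request replaces urllib2):

import re
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("https://example.com/contact").read()
soup = BeautifulSoup(html, "html.parser")
# crude first-pass pattern for things shaped like email addresses
emails = set(re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", soup.get_text()))
print(emails)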
You don't say what you want to do with the extracted text, and that makes a big difference in how much effort you are willing to go to in order to get it out.
If you are trying to get the body text of a web page minus all of the site-related cruft (a nontrivial task), take a look at boilerpipe. It is written in Java, but it does an amazingly good job at getting essential text out of random web pages.
One of my hobbies over the next few weeks is recreating the core logic of boilerpipe in Python. We need the functionality it provides for a project, but don't want to haul the 10-ton rock that is the JVM around with it. I'm pretty certain we will be releasing it once it is fairly stable.
I keep getting mismatched tag errors all over the place. I'm not sure why exactly; it's the text of the Craigslist homepage, which looks fine to me, but I haven't skimmed it thoroughly. Is there perhaps something more forgiving I could use, or is this my best bet for HTML parsing with the standard library?
The mismatched tag errors are likely caused by mismatched tags. Browsers are famous for accepting sloppy HTML and have made it easy for web page coders to write badly formed HTML, so there's a lot of it. There's no reason to believe that Craigslist should be immune to bad web page designers.
You need to use a grammar that allows for these mismatches. If the parser you are using won't let you redefine the grammar appropriately, you are stuck. (There may be a better Python library for this, but I don't know of one.)
One alternative is to run the web page through a tool like Tidy that cleans up such mismatches, and then run your parser on that.
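For the Tidy route, the pytidylib binding is one way to drive it from Python. A minimal sketch, assuming the HTML Tidy C library is installed and with a placeholder filename:

from tidylib import tidy_document  # pip install pytidylib

with open("page.html") as f:  # placeholder filename
    dirty = f.read()

clean, errors = tidy_document(dirty)
# 'clean' is now well-formed markup a strict parser should accept;
# 'errors' lists what Tidy had to repair
print(errors or "no problems reported")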
The best library for parsing unpredictable HTML is BeautifulSoup. Here's a quote from the project page:
You didn't write that awful page. You're just trying to get some data out of it. Right now, you don't really care what HTML is supposed to look like.

Neither does this parser.
However, it isn't well supported for Python 3; there's more information about this at the end of the linked page.
Parsing HTML is not an easy problem, so using a library is definitely the solution here. The two common libraries for parsing HTML that isn't well formed are BeautifulSoup and lxml.
lxml supports Python 3, and its HTML parser handles unpredictable HTML well. It's awesome and fast too, since it uses C libraries under the hood. I highly recommend it.
BeautifulSoup 3.1 supports Python 3, but it is also deemed "a failed experiment" and you are told not to use it, so in practice BeautifulSoup doesn't support Python 3 yet, leaving lxml as the only alternative.
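A minimal sketch of what that looks like: lxml.html quietly repairs mismatched tags instead of raising the errors the standard-library parser gives you (the broken markup here is just an illustration):

import lxml.html

broken = "<html><body><p>First<p>Second<div>never closed</body>"
doc = lxml.html.fromstring(broken)
print(doc.text_content())  # all the text, pulled from the repaired tree
for p in doc.xpath("//p"):  # the tree is queryable as if it were well formed
    print(p.text)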