parsing wikipedia stopwords html with nltk - python

Related to this question, I am working on a program to extract the introduction of Wikipedia entities. As you can read in the above link, I have already succeeded in querying the API and am now focusing on processing the XML returned by the API call. I use nltk to process the XML, where I use
import nltk
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()  # e.g. the WordNet lemmatizer
wikiwords = nltk.word_tokenize(introtext)
for wikiword in wikiwords:
    wikiword = lemmatizer.lemmatize(wikiword.lower())
    ...
But with this I end up recording tokens like </, /p, <, and so on. Since I am not using the structure of the XML, simply ignoring all markup would work, I guess. Is there an nltk tool for this, or is there a stopwords list available? I would just like to know what the best practice is.

You didn't specify what exact query you are using, but it seems that what you have now is HTML, not XML, which you extracted from the XML response.
If you want to strip all HTML tags and keep only the text, you should use an HTML library for that, such as BeautifulSoup.
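For example, here is a minimal sketch that cleans the markup before the tokenization step from the question (assuming introtext holds the HTML fragment returned by the API):

from bs4 import BeautifulSoup
import nltk

# Strip the markup first, then tokenize only the visible text
soup = BeautifulSoup(introtext, "html.parser")
plain_text = soup.get_text(separator=" ")

wikiwords = nltk.word_tokenize(plain_text)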

Related

python-docx How to get content/body of a section

I am using Word's sections so that each page can have a different header, where I mark each page with some markup like {page1}.
Using python-docx I get sections by:
from docx import Document

doc = Document(my_file)
doc_sections = doc.sections
doc_page_one = doc_sections[0]
I am able to get header and footer of each page and their texts:
doc_page_one.header.paragraphs[0].text
But I don't see the actual page content/body or shapes; while debugging, I was not able to find where they live.
Does python-docx have this possibility?
At present, python-docx does not have API support for getting what I would imagine would be the "block-items" (paragraphs + tables) that are "contained" in a certain section.
You would have to navigate the underlying XML if you wanted it badly enough, probably starting at document._body._body.xml. You could get an idea of what that looks like with:
print(document._body._body.xml)
Basically you'd be looking for w:sectPr elements, each of which ends a section. There is some more detail on the XML schema involved in the python-docx analysis page here: https://python-docx.readthedocs.io/en/latest/dev/analysis/features/sections.html
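If you do want to go that route, here is a rough sketch of how the partitioning could look, assuming access through the internal _body._body lxml element mentioned above (an internal attribute that may change between versions; the file name is a placeholder):

from docx import Document

doc = Document("my_file.docx")      # placeholder file name
body = doc._body._body              # internal <w:body> lxml element, not public API

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

sections, current = [], []
for child in body.iterchildren():
    current.append(child)
    # A <w:sectPr> nested in a paragraph's <w:pPr> ends a section
    if child.tag == W + "p" and child.find(W + "pPr/" + W + "sectPr") is not None:
        sections.append(current)
        current = []
if current:
    sections.append(current)        # the final section's sectPr sits directly under <w:body>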

python-docx - replacing characters

I am trying to build a small program in which I open a docx document and replace characters with others, to do some old-school Caesar-style encrypting. After checking the documentation (https://python-docx.readthedocs.io) I am afraid I can't find the object methods and attributes; the documentation just kind of explains how to do certain things, like creating paragraphs and sections, but I can't find anything on retrieving document data and parsing it. I would like to find a list of the objects in the document so I can parse through them.
I would like to do something like this:
from docx import Document
document = Document('essay.docx')
paragraph = []
for i in document:
    paragraph.append(i)
for i in paragraph:
    for y in i:
        y.replace("a", "y")
...
Can python-docx do something like this? If so, where would I find the documentation that shows me how to do it?
If I am using the wrong library, I would also appreciate it if you could point that out.
The API documentation is indexed (i.e. its table of contents appears) on the page you link to and describes all the objects and methods. https://python-docx.readthedocs.io/en/latest/#api-documentation
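For the replacement itself, a minimal sketch using the documented paragraph and run objects might look like this (note that it only touches body paragraphs, not tables, headers or footers; the output file name is made up):

from docx import Document

document = Document('essay.docx')
for paragraph in document.paragraphs:
    for run in paragraph.runs:
        # str.replace returns a new string, so assign the result back to the run
        run.text = run.text.replace("a", "y")
document.save('essay_encrypted.docx')   # placeholder output name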
I think I found something useful, in case future readers are interested. The problem with python-docx is that I could only get paragraphs individually, and it would take a lot of time. I don't even know whether titles, footers and headers count as paragraphs.
But there is a library called textract that can read docx and other files; it integrates with python-docx, or at least that's what the short documentation says. What I can do is save my docx file as a PDF and use:
import textract

text = textract.process(
    'path/to/norwegian.pdf',
    method='pdftofile',
    language='nor',
)
This allows you to get all the text as a string while preserving the layout of the PDF. I haven't tested it yet; I will edit this post if it doesn't work as intended.
http://textract.readthedocs.io/en/latest/python_package.html#python-package

Using Beautifulsoup and regex to traverse javascript in page

I'm fetching webpages with a bunch of JavaScript on them, and I'm interested in parsing through the JavaScript portion of the pages for certain relevant info. Right now I have the following code in Python/BeautifulSoup/regex:
scriptResults = soup('script',{'type' : 'text/javascript'})
which yields a list of scripts, which I can then loop over to search for the text I'd like:
for script in scriptResults:
    for block in script:
        if *patterniwant* in block:
            **extract pattern from line using regex**
(Text in asterisks is pseudocode, of course.)
I was wondering whether there is a better way to just use a regex to find the pattern in the soup itself, searching only through the scripts. My implementation works, but it seems really clunky, so I wanted something more elegant and/or efficient and/or Pythonic.
Thanks in advance!
A lot of websites have client-side data in JSON format. In that case I would suggest extracting the JSON part from the JavaScript code and parsing it with Python's json module (e.g. json.loads). As a result you will get a standard dictionary object.
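A minimal sketch of that approach, assuming the page embeds something like var data = {...}; in a script tag (the variable name and the regex are only illustrative, and json.loads only works if the embedded object is strict JSON):

import json
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")   # html fetched elsewhere
for script in soup.find_all("script", {"type": "text/javascript"}):
    if not script.string:
        continue
    # Illustrative pattern: a JSON object assigned to a JavaScript variable named "data"
    match = re.search(r"var\s+data\s*=\s*(\{.*?\});", script.string, re.DOTALL)
    if match:
        data = json.loads(match.group(1))   # a plain Python dict
        print(data)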
Another option is to check in your browser what sort of AJAX requests the application makes. Quite often these also return structured data as JSON.
I would also check whether the page has any structured data already available (e.g. OpenGraph, microformats, RDFa, RSS feeds). A lot of websites include this to improve SEO and to integrate better with social network sharing.

Twitter RSS feed double-escaping special characters?? And how do I deal with this using the Universal Feed Parser?

I'm parsing a set of feeds using the Universal Feed Parser.
It looks like when Twitter generates an RSS feed, it double-escapes certain special characters in the <description /> field. For example, let's say I tweet:
I can't parse this!
Which is actually
I can&apos;t parse this!
in HTML entities.
Then, when you look at the bare XML from Twitter's RSS or Atom feed, it is rendered thusly:
I can&amp;apos;t parse this!
Universal Feed Parser appears to have some serious problems with this. When you parse out one of the entries and look at how it parses this, you end up with:
I can&apost parse this!
which renders on the screen as
I can&apost parse this!
Any ideas how I can get this to behave? When I open the feed in Firefox, the entities are handled correctly, so clearly it's possible to parse the string correctly.
I'm pretty sure that the Universal Feed Parser's behavior is incorrect, but I'm having a hard time finding what part of the code needs to be fixed.
I'm also perplexed because right on the website it states: "3000 unit tests."
Surely one of those tests looks at a feed that contains entities?
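To make the double-escaping concrete, here is a small sketch; the extra html.unescape pass is one possible workaround rather than something from the original post, and the feed URL is a placeholder:

import html

import feedparser

feed = feedparser.parse("https://example.com/twitter.rss")   # placeholder feed URL
entry = feed.entries[0]

# With double-escaped input, one level of escaping survives parsing,
# so an extra unescape pass recovers the intended text.
print(html.unescape(entry.description))   # e.g. "I can't parse this!"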

Extracting the introduction part of a Wikipedia article, by python

I want to extract the introduction part of a Wikipedia article (ignoring all other stuff, including tables, images and other parts). I looked at the HTML source of the articles, but I don't see any special tag that this part is wrapped in.
Can anyone give me a quick solution to this? I'm writing Python scripts.
Thanks
You may want to check mwlib to parse the Wikipedia source.
Alternatively, use the wikidump lib.
Or do HTML screen scraping through BeautifulSoup.
Ah, there are already questions on SO on this topic:
Parsing a Wikipedia dump
How to parse/extract data from a mediawiki marked-up article via python
I think you can often get to the intro text by taking the full page, stripping out all the tables, and then looking for the first sequence of <p>...</p> blocks after the <!-- bodytext --> marker. That last bit would be this regex:
/<!-- bodytext -->.*?(<p>.*?<\/p>\s*)+/
With the re.S option to make . match newlines...
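A rough sketch of that approach in Python, assuming the article HTML is already in page_html and that the <!-- bodytext --> marker is actually present in the skin being served (treat both as assumptions):

import re

# Grab the first run of <p>...</p> blocks after the <!-- bodytext --> marker
pattern = re.compile(r"<!-- bodytext -->.*?((?:<p>.*?</p>\s*)+)", re.S)

match = pattern.search(page_html)
if match:
    intro_html = match.group(1)
    intro_text = re.sub(r"<[^>]+>", "", intro_html)   # crude tag stripping, for illustration
    print(intro_text)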
