Split Documents into Paragraphs - python

I have a large stockpile of PDF documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs ambiguous: some documents use the standard single \n between paragraphs, but others use a \n between lines within the same paragraph and a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).
Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.
Is there training data for that? Should I try some complex regular expression that might work?

I will try to give an easier way to deal with your problem: check each file for a double \n. If you find a double \n, split the data on that; if you do not, split on a single \n.
(The question describes the separator as \n, the newline character, so the example below checks for '\n\n'; adjust the check if your files use something else.)
Rough example to detect which paragraph convention a file uses:
# read the whole file and look for a double newline
with open('yourfile', 'r') as f:
    a = f.read()

temp = 0
if '\n\n' in a:
    temp = 1
# temp is 1 if paragraphs are separated by a double \n, otherwise 0
After this you can use simple string operations to check for single or double \n and replace them as needed to distinguish a new line from a new paragraph. (Please read the file in chunks if the file is very big, otherwise you might run into memory problems or slow code.)

You say
some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs
so I would preprocess all the files to detect which use the double newline between paragraphs. The files with double \n need to be stripped of all single newline characters, and all double newlines reduced to single ones.
You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
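A minimal sketch of that preprocessing step (normalize_paragraph_breaks is a hypothetical helper name; it assumes the double-newline check from the previous answer is a good enough signal):
import re

def normalize_paragraph_breaks(text):
    # hypothetical helper: if paragraphs are separated by blank lines,
    # join the lines inside each paragraph and use one \n per break
    if '\n\n' in text:
        paragraphs = re.split(r'\n{2,}', text)
        paragraphs = [' '.join(p.split()) for p in paragraphs]
        return '\n'.join(p for p in paragraphs if p)
    return text

Every file then uses a single \n per paragraph break and can go through the same splitting stage.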

from nltk import tokenize

# nltk.download('punkt') may be needed once before this works
a = 'para here'
sentences = tokenize.sent_tokenize(a)
# output: a list of sentences; that's all you need

Related

Python regex that matches superscripted text

My question is very simple, but I couldn't figure it out by myself: how do I match superscripted text with a regex in Python? I'd like to match patterns like [a-zA-Z0-9,[]], but only if the text is superscripted.
The main problem is that information about "superscript" and "subscript" text is not conveyed at the character level.
Unicode does define some characters to be used as sub- and superscripts. Most notably, every decimal digit has a corresponding superscript character, but only a handful of other Latin letters have a full superscript or subscript character with its own code point. Check: https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts
So, if you want to match digits only, you can just put the corresponding characters in the regular expression: "\u2070" and "\u2074" through "\u2079", plus "\u00B9", "\u00B2" and "\u00B3" (the table in the linked Wikipedia article has all the characters).
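For instance, a minimal sketch that matches runs of those superscript digits:
import re

# character class of the Unicode superscript digits listed above
sup_digits = re.compile("[\u2070\u00B9\u00B2\u00B3\u2074-\u2079]+")
print(sup_digits.findall("E = mc\u00B2, x\u2074 + y\u00B3"))
# ['²', '⁴', '³']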
For the remaining characters, superscripting is formatting information carried in a protocol separate from the characters themselves: for example, if you are dealing with HTML markup, it is the text inside <sup> </sup> tags.
Just as happens with HTML, any superscript text has to be marked in a text protocol "outside" the characters themselves, and therefore outside what you could look up in the characters with a regular expression.
If you are dealing with HTML text, you can search your text for the "<sup>" tag, for example. However, if it is formatted text inside a web page, there are dozens of ways of marking the superscript text, as the transformation can be specified in CSS, and the CSS may be applied to the page text in several different ways.
Other text protocols exist that might encode superscript text, like rich text (RTF files). Otherwise, you have to say how the text you are dealing with is encoded, and how it encodes the markup for superscript text, for a proper regular expression to be built.
If it is plain HTML using "<sup>" tags, it could be as simple as:
re.findall(r"<sup.*?>(.*?)</sup>", text)
Otherwise, you should inspect your text stream, find out the superscript markup, and use an appropriate regexp, or even a better parsing tool (for HTML/XML you are usually better off using beautifulsoup or other XML tools than regexps, for example).
And, of course, all of that applies only if the information about which text is superscripted is embedded in the text channel as some kind of markup. It might instead be on a side channel: another data block telling at which text indexes the superscript effect should apply. In that case, you essentially have to figure out that format and then use the information directly.

Regex Python: Adding . after every 15 terms

I have a text file containing clean tweets, and after every 15th term I need to insert a period.
In Python, how do I add a character after a specific word using regex? Right now I am parsing the line word by word, and I don't understand regex well enough to write the code.
Basically, the goal is that each line becomes its own string after a period.
Or is there an alternative way to split a paragraph into individual sentences?
Splitting paragraphs into sentences can be achieved with functions in the nltk package. Please refer to this answer: Python split text on sentences
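For the period-insertion part of the question, a minimal sketch without regex (add_period_every is a hypothetical helper name):
def add_period_every(line, n=15):
    # append a period to every n-th word of the line
    words = line.split()
    out = [w + '.' if i % n == 0 else w for i, w in enumerate(words, 1)]
    return ' '.join(out)

print(add_period_every('one two three four five', 5))
# one two three four five.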

Accommodate two types of quotes in a regex

I am using a regex to replace quotes within in an input string. My data contains two 'types' of quotes -
" and “
There's a very subtle difference between the two. Currently, I am explicitly mentioning both these types in my regex
\"*\“*
I am afraid though that in future data I may get a different 'type' of quote on which my regex may fail. How many different types of quotes exist? Is there a way to normalize these to just one type so that my regex won't break on unseen data?
Edit -
My input data consists of HTML files, and I am unescaping HTML entities and URL quoting while decoding to ASCII
escaped_line = HTMLParser.HTMLParser().unescape(urllib.unquote(line.decode('ascii','ignore')))
where line specifies each line in the HTML file. I need to 'ignore' non-ASCII characters, as the files in my database don't all have the same encoding and I don't know the encoding before reading the file.
Edit2
I am unable to do so using the replace function. I tried replace('"', ''), but it doesn't replace the other type of quote, '“'. If I add that in another replace call, it throws a non-ASCII character error.
Condition
No external libraries allowed, only native python libraries could be used.
I don't think there is a "quotation marks" character class in Python's regex implementation so you'll have to do the matching yourself.
You could keep a list of common quotation mark unicode characters (here's a list for a good start) and build the part of regex that matches quotation marks programmatically.
I can only help you with the original question about quotation marks. As it turns out, Unicode defines many properties per character, and these are all available through the Unicode Character Database. "Quotation mark" is one of these properties.
How many different types of quotes exist?
29, according to Unicode, see below.
The Unicode standard brings us a definitive text file on Unicode properties, PropList.txt, among which a list of quotation marks. Since Python does not support all Unicode properties in regular expressions, you cannot currently use \p{QuotationMark}. However, it's trivial to create a regular expression character class:
# placed on multiple lines for readability; remove the spaces
# and then use it in your regex in place of the current quotes
[\u0022 \u0027 \u00AB \u00BB
\u2018 \u2019 \u201A \u201B
\u201C \u201D \u201E \u201F
\u2039 \u203A \u300C \u300D
\u300E \u300F \u301D \u301E
\u301F \uFE41 \uFE42 \uFE43
\uFE44 \uFF02 \uFF07 \uFF62
\uFF63]
As "tchrist" pointed out above, you can save yourself the trouble by using Matthew Barnett's regex library which supports \p{QuotationMark}.
Turns out there's a much easier way to do this. Just prefix the regex literal you write in Python with u so it is a unicode string:
regexp = u'\"*\“*'
Make sure you use the re.UNICODE flag when you compile/search/match your regex against your string.
re.findall(regexp, string, re.UNICODE)
Don't forget to include the
#!/usr/bin/python
# -*- coding:utf-8 -*-
at the start of the source file to make sure unicode strings can be written in your source file.

Split string with caret character in python

I have a huge text file where each line looks like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") contains white spaces, in the second part (a subtitle) the words are separated with the "_" character, and the line ends with a number (a page number). I want to split each line into its three (obvious) parts, because I want to create some sort of directory in Python.
I was trying the re module, but as the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help me?
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want the three non-empty pieces you can get them with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that it will produce an empty string between consecutive caret characters. You could protect against this by either collapsing consecutive carets down into a single one, or by detecting empty strings in the resulting list.
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file is using a format compatible with the csv module; you could look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably best; a sketch of the csv route follows below.
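A sketch of the csv route ('menu.txt' is a hypothetical file name):
import csv

# 'menu.txt' is a made-up name; delimiter='^' matches the format above
with open('menu.txt', newline='') as f:
    for row in csv.reader(f, delimiter='^'):
        print(row)  # e.g. ['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']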
Also, for the re module, any special character can be escaped with \, like r'\^'. I'd suggest, before jumping to re, that you 1) learn how to write regular expressions, and 2) first look for a solution that doesn't need them: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»
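For completeness, a sketch with the escaped caret that also collapses runs of '^', avoiding the empty fields mentioned above:
import re

line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
print(re.split(r'\^+', line))
# ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']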

Extra characters Extracted with XPath and Python (html)

I have been using XPath with scrapy to extract text from HTML tags online, but when I do I get extra characters attached. An example is trying to extract a number like "204" from a <td> tag and getting [u'204'] instead. In some cases it's much worse: for instance, trying to extract "1 - MathOverflow" and instead getting [u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t ']. Is there a way to prevent this, or to trim the strings so that the extra characters aren't part of the result? (I'm using items to store the data.) It looks like it has something to do with formatting, so how do I get XPath to not pick that stuff up?
What does the line of code look like that returns [u'204']? It looks like what is being returned is a Python list containing a unicode string with the value you want. Nothing wrong there; just subscript it. As for the carriage returns, linefeeds and tabs, as Wai Yip Tung just answered, strip() will take them out.
Probably
my_answer = item1['Title'][0].strip()
Or if you are expecting several matches:
for ans_i in item1['Title']:
    do_something_with(ans_i.strip())
The standard XPath function normalize-space() has exactly the effect you want.
It deletes leading and trailing white space and replaces any run of inner whitespace with just one space.
So, you could use:
normalize-space(someExpression)
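For instance, in a scrapy selector (the XPath here is hypothetical, and this assumes a scrapy version with the .get() shorthand):
# strips and collapses whitespace before the value reaches your item
title = response.xpath('normalize-space(//td/text())').get()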
Use strip() to remove the leading and trailing white spaces.
>>> u'\r\n\t\t 1 \u2013 MathOverflow\r\n\t\t '.strip()
u'1 \u2013 MathOverflow'
