Splitting words in running text using Python?

I am writing a piece of code that extracts words from running text. The text can contain delimiters such as \r and \n.
I want to discard these delimiters and extract only full words. How can I do this with Python? Is there any library available for crunching text in Python?

Assuming your definition of "word" agrees with that of the regular expression module (re), that is, letters, digits and underscores, it's easy:
import re
fullwords = re.findall(r'\w+', thetext)
where thetext is the string in question (e.g., coming from an f.read() of a file object f open for reading, if that's where you get your text from).
If you define words differently (e.g. you want to include apostrophes so that "it's" counts as one word), it isn't much harder: just pass the appropriate pattern as the first argument to findall, e.g. r"[\w']+" for the apostrophe case.
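For instance, a quick check of that apostrophe-friendly pattern (the sample sentence is just for illustration):
import re
text = "It's a fine day, isn't it?"
# keep apostrophes inside words so contractions stay whole
words = re.findall(r"[\w']+", text)
# ["It's", 'a', 'fine', 'day', "isn't", 'it']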
If you need to be very, very sophisticated (e.g., deal with languages that use no breaks between words), then the problem suddenly becomes much harder and you'll need some third-party package like nltk.

Assuming your delimiters are whitespace characters (like space, \r and \n), then basic str.split() does what you want:
>>> "asdf\nfoo\r\nbar too\tbaz".split()
['asdf', 'foo', 'bar', 'too', 'baz']

Related

Split joined/concatenated words list of different languages

I'm trying to split words from different languages that are joined.
My expected result:
input = ['françaisenglishtext']
output = ['français','english','text']
The first word in the output is French and the other words are English.
I tried to use the Python wordninja library from this question: How to split text without spaces into list of words. The idea was to first split the whole string with it (using English), then remove the non-English words with pyenchant and keep only the English ones.
The problem with this method is that wordninja also splits the French part as if it were English, so I cannot tell which part of the output is French after splitting. It also removes the special French characters.
My current result:
result_with_wordninja = ['fran', 'a', 'is', 'english', 'text']
Finally, I tried changing wordninja's dictionary, but I still face the same problem.
I've also checked this answer: Split a paragraph containing words in different languages, but it doesn't work for my case since I only have Latin characters in my list.
Is there a specific library or method that can split joined words from different languages?
Thank you,

Detect abbreviations in the text in python

I want to find abbreviations in the text and remove them. What I am currently doing is identifying consecutive capital letters and removing them.
But I see that this does not remove abbreviations such as MOOCs, M.O.O.C, M.O.O.Cs. Is there an easy way of doing this in Python? Or are there any libraries that I can use instead?
The re regex library is probably the tool for the job.
In order to remove every string of consecutive uppercase letters, the following code can be used:
import re
mytext = "hello, look an ACRONYM"
mytext = re.sub(r"\b[A-Z]{2,}\b", "", mytext)
Here, the regex "\b[A-Z]{2,}\b" searches for multiple consecutive (indicated by [...]{2,}) capital letters (A-Z), forming a complete word (\b...\b). It then replaces them with the second string, "".
The convenient thing about regex is how easily it can be modified for more complex cases. For example:
mytext = re.sub(r"\b[A-Z\.]{2,}\b", "", mytext)
Will replace consecutive uppercase letters and full stops, removing acronyms like A.B.C.D as well as ABCD. Strictly speaking, the \ before the . is not required inside a character class, where . is treated literally; outside a class an unescaped . acts as a wildcard, so the escape does no harm and makes the intent clear.
The ? specifier could also be used to remove acronyms that end in s, for example:
mytext = re.sub(r"\b[A-Z\.]{2,}s?\b", "", mytext)
This regex will remove acronyms like ABCD, A.B.C.D, and even A.B.C.Ds. If other forms of acronym need to be removed, the regex can easily be modified to accommodate them.
The re library also includes functions like findall and match, which allow programs to locate and process each acronym individually. This might come in handy if you want to, for example, look at a list of the acronyms being removed and check that there are no legitimate words among them.
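As a small illustration of that, a minimal sketch (the sample text is made up):
import re
mytext = "MOOCs and M.O.O.C.s are popular; NASA too."
# list what the pattern would remove before actually removing it
acronyms = re.findall(r"\b[A-Z\.]{2,}s?\b", mytext)
# ['MOOCs', 'M.O.O.C.s', 'NASA']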
An intuitive way would be to use a regex.
This regular expression does the job: ([A-Z]\.*){2,}s?
Which gives, in Python:
import re
re.sub("([A-Z]\.*){2,}s?","", your_text)
Please visit the re documentation in case of doubt:
https://docs.python.org/2/library/re.html#re.sub
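A quick run on a made-up sample, just to see the effect:
import re
sample = "Remove MOOCs and M.O.O.C.s from this text"
print(re.sub(r"([A-Z]\.*){2,}s?", "", sample))
# 'Remove  and  from this text'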

Splitting on regex without removing delimiters

So, I would like to split this text into sentences.
s = "You! Are you Tom? I am Danny."
so I get:
["You!", "Are you Tom?", "I am Danny."]
That is, I want to split the text on the regex '[.!\?]' without removing the delimiters. What is the most Pythonic way to achieve this in Python?
I am aware of these questions:
JS string.split() without removing the delimiters
Python split() without removing the delimiter
But my case involves several delimiters (. ? !), which complicates the problem.
You can use re.findall with the regex .*?[.!\?]; the lazy quantifier *? makes sure each match extends only up to the next delimiter:
import re
s = """You! Are you Tom? I am Danny."""
re.findall(r'.*?[.!\?]', s)
# ['You!', ' Are you Tom?', ' I am Danny.']
Strictly speaking, you don't want to split on '!?.', but rather on the whitespace that follows those characters. The following will work:
>>> import re
>>> re.split(r'(?<=[\.\!\?])\s*', s)
['You!', 'Are you Tom?', 'I am Danny.']
This splits on whitespace, but only if it is preceded by either a ., !, or ? character.
If Python supported split by zero-length matches, you could achieve this by matching an empty string preceded by one of the delimiters:
(?<=[.!?])
Demo: https://regex101.com/r/ZLDXr1/1
Unfortunately, older versions of Python do not support splitting on zero-length matches (the re module gained that ability in Python 3.7). The solution may still be useful in other languages that support lookbehinds.
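For example, on Python 3.7 or newer this works directly (note the trailing empty string when the text ends with a delimiter):
import re
s = "You! Are you Tom? I am Danny."
re.split(r'(?<=[.!?])', s)
# ['You!', ' Are you Tom?', ' I am Danny.', '']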
However, based on your input/output data samples, you rather need to split on the whitespace that is preceded by one of the delimiters. So the regex would be:
(?<=[.!?])\s+
Demo: https://regex101.com/r/ZLDXr1/2
Python demo: https://ideone.com/z6nZi5
If the spaces are optional, the re.findall solution suggested by #Psidom is the best one, I believe.
If you prefer to use the split method rather than match, one solution is to split with a capturing group:
splitted = list(filter(None, re.split(r'(.*?[\.!\?])', s)))
filter removes the empty strings, if any.
This will work even if there are no spaces between sentences, or if you need to catch a trailing sentence that ends with a different punctuation sign, such as a Unicode ellipsis (or has no terminating punctuation at all).
It is even possible to keep your regex as-is (with the escaping corrected and parentheses added):
splitted = list(filter(None, re.split(r'([\.!\?])', s)))
Then merge the even- and odd-indexed elements and strip the extra spaces.
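A rough sketch of that merge step, using the second split above:
import re
s = "You! Are you Tom? I am Danny."
parts = list(filter(None, re.split(r'([\.!\?])', s)))
# pair each sentence body (even index) with the delimiter that follows it (odd index)
sentences = [(body + delim).strip() for body, delim in zip(parts[::2], parts[1::2])]
# ['You!', 'Are you Tom?', 'I am Danny.']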
Easiest way is to use nltk.
import nltk
nltk.sent_tokenize(s)
It will return a list of all your sentences without losing the delimiters.

Split string with caret character in python

I have a huge text file, each line seems like this:
Some sort of general menu^a_sub_menu_title^^pagNumber
Notice that the first part ("general menu") contains whitespace, in the second part (a subtitle) the words are separated with the "_" character, and finally there is a number (a page number). I want to split each line into 3 (obvious) parts, because I want to create some sort of directory in Python.
I was trying with the re module, but since the caret character has a special meaning there, I couldn't figure out how to do it.
Could someone please help me????
>>> "Some sort of general menu^a_sub_menu_title^^pagNumber".split("^")
['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']
If you only want the three non-empty pieces, you can accomplish this with a list comprehension:
line = 'Some sort of general menu^a_sub_menu_title^^pagNumber'
pieces = [x for x in line.split('^') if x]
# pieces => ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']
What you need to do is to "escape" the special characters, like r'\^'. But better than regular expressions in this case would be:
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
(menu, title, dummy, page) = line.split('^')
That gives you the components in a much more straightforward fashion.
You could just say string.split("^") to divide the string into a list containing each segment. The only caveat is that consecutive caret characters will produce an empty string in the result. You could protect against this by either collapsing consecutive carets down into a single one, or detecting empty strings in the resulting list.
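For instance, a minimal sketch of the collapsing option, treating any run of carets as one split point:
import re
line = "Some sort of general menu^a_sub_menu_title^^pagNumber"
parts = re.split(r"\^+", line)
# ['Some sort of general menu', 'a_sub_menu_title', 'pagNumber']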
For more information see http://docs.python.org/library/stdtypes.html
Does that help?
It's also possible that your file uses a format compatible with the csv module; you could look into that, especially if the format allows quoting, because then line.split would break. If the format doesn't use quoting and it's just delimiters and text, line.split is probably best.
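A minimal sketch of the csv approach, assuming the lines live in a file (the filename "menu.txt" is made up):
import csv
with open("menu.txt", newline="") as f:
    # the csv module accepts any single-character delimiter, including '^'
    for row in csv.reader(f, delimiter="^"):
        print(row)  # e.g. ['Some sort of general menu', 'a_sub_menu_title', '', 'pagNumber']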
Also, for the re module, any special character can be escaped with \, like r'\^'. But before jumping to re, I'd suggest 1) learning how to write regular expressions, and 2) first looking for a simpler solution to the problem: «Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.»

Python regex \w doesn't match combining diacritics?

I have a UTF-8 string with combining diacritics. I want to match it with the \w regex sequence. It matches precomposed accented characters, but not a Latin character followed by combining diacritics.
>>> re.match("a\w\w\wz", u"aoooz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> print u"ao\u00F3oz"
aoóoz
>>> re.match("a\w\w\wz", u"ao\u00F3oz", re.UNICODE)
<_sre.SRE_Match object at 0xb7788f38>
>>> re.match("a\w\w\wz", u"aoo\u0301oz", re.UNICODE)
>>> print u"aoo\u0301oz"
aóooz
(Looks like the SO markdown processor is having trouble with the combining diacritics in the above, but there is a ́ on the last line)
Is there any way to match combining diacritics with \w? I don't want to normalise the text, because it comes from a filename and I don't want to have to do a whole 'filename unicode normalization' yet. This is Python 2.5.
I've just noticed a new "regex" package on PyPI. (If I understand correctly, it is a test version of a new package that will someday replace the stdlib re package.)
It seems to have (among other things) more possibilities with regard to unicode. For example, it supports \X, which is used to match a single grapheme (whether it uses combining or not). It also supports matching on unicode properties, blocks and scripts, so you can use \p{M} to refer to combining marks. The \X mentioned before is equivalent to \P{M}\p{M}* (a character that is NOT a combining mark, followed by zero or more combining marks).
Note that this makes \X more or less the unicode equivalent of ., not of \w, so in your case, \w\p{M}* is what you need.
It is (for now) a non-stdlib package, and I don't know how ready it is (and it doesn't come in a binary distribution), but you might want to give it a try, as it seems to be the easiest/most "correct" answer to your question. (Otherwise, I think you're down to explicitly using character ranges, as described in my comment to the previous answer.)
See also this page with information on unicode regular expressions, that might also contain some useful information for you (and can serve as documentation for some of the things implemented in the regex package).
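A minimal sketch of what that might look like, assuming the third-party regex package is installed (pip install regex):
import regex
# \w\p{M}* = a word character plus any combining marks that follow it
print(regex.match(r"a(?:\w\p{M}*){3}z", u"aoo\u0301oz"))
# prints a match object covering the whole string, unlike the stdlib re attempt above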
You can use unicodedata.normalize to compose the combining diacritics into one unicode character.
>>> import re
>>> from unicodedata import normalize
>>> re.match(u"a\w\w\wz", normalize("NFC", u"aoo\u0301oz"), re.UNICODE)
<_sre.SRE_Match object at 0x00BDCC60>
I know you said you didn't want to normalize, but I don't think there will be a problem with this solution, as you're only normalizing the string to match against, and do not have to change the filename itself or something.
