I was able to get the expected output from this book, page 4, "Searching Text". When I tried to apply it to my own case I got "No matches", which was not what I expected. I think I'm tokenizing at the wrong level (word instead of character) but am unsure how to correct that. Any suggestions? The output I want is every hyphen lined up vertically with its surrounding context.
>>> f = open('hyphen.txt')
>>> raw = f.read()
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> text = nltk.Text(tokens)
>>> text.concordance("-")
No matches
>>> text
<Text: Fog Air-Flow Switch stuck off ? Bubble Tower...>
(Python 3.4.3)
EDIT
I think I'm close by using regular expressions but I don't know how to remove the 'NoneType' objects. Any suggestions?
The output I'd want to see would look like this:
Fog Air-Flow Switch stuck off?
Bubble Tower Check-Valve stuck closed?
Chamber Drain-Trap broken, dry, or missing?
Chamber Exhaust-Vent blocked or restricted?
etc.
It's okay if the context is wider than the sentence containing the hyphen; all that matters to me is that the hyphens are lined up vertically with their surrounding context.
You need to change your code a little bit.
import nltk
f = open("/path/to/file")  # path of the file
raw = f.read()
text = nltk.Text(raw)  # passing the raw string gives character-level tokens
text.concordance("-")
Required output: concordance lines with every "-" centered.
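This works because nltk.Text, given a plain string instead of a list of word tokens, treats it as a sequence of characters, so every "-" becomes a token that concordance can find and center (the context is then printed character by character). If you want output closer to the sample in the question, a minimal plain-Python sketch along these lines lines the hyphens up directly (the width of 30 is an arbitrary choice):

import re

raw = open('hyphen.txt').read()
width = 30  # characters of context on each side
for m in re.finditer('-', raw):
    left = raw[max(0, m.start() - width):m.start()].replace('\n', ' ')
    right = raw[m.end():m.end() + width].replace('\n', ' ')
    print(left.rjust(width) + '-' + right.ljust(width))  # hyphen in a fixed column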
I want to get the last words of a text, going back to a stopword.
Imagine I have the text:
first_part = "This is a text that with the blue paper"
Going backwards from the end, I would like to get "blue paper".
In order to do that I use the regex module:
import regex as re
print(first_part)
result=re.search(r"(?r)(?<=(\s*\b(an|a|the|for)\b\s*))(?P<feature>.*?)(?=\s*)$",first_part)
print(result)
Regex explanation:
(?r) = reverse
(?<=(\s*\b(an|a|the|for)\b\s*)) = look behind for any of the stop words, with word boundaries \b
(?P<feature>.*?) = a named group that lazily matches anything
$ = anchored at the end of the string
This works just fine, but I am using the regex module in order to be able to use "(?r)", meaning reverse matching.
Does anyone know if it would be possible to do this using re? I need to implement this functionality with the standard library.
If you add a greedy match in front and a lazy one at the back, you will just get the last words. Not 100% sure this is what you want, though.
>>> import re
>>> first_part = "This is a text that with the blue paper"
>>> m = re.match(r"(?:.*)(?:an|a|the|for)\W(.+?)$", first_part)
>>> m[1]
'blue paper'
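Note that indexing a match object (m[1]) needs Python 3.6+; m.group(1) works everywhere. If you prefer to avoid the backtracking trick entirely, a hedged standard-re alternative is to find every stopword and slice after the last one:

import re

first_part = "This is a text that with the blue paper"
# find every stopword followed by whitespace, then take what follows the last one
matches = list(re.finditer(r"\b(?:an|a|the|for)\b\s+", first_part))
if matches:
    print(first_part[matches[-1].end():])  # 'blue paper'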
I am looking for a more robust replace method in Python, because I am building a spellchecker for words in an OCR context.
Let's say we have the following text in python:
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
It is easy to realize that instead of "his text" the right phrase would be "this text".
And if I do text.replace('his', 'this') then I replace every single 'his' with 'this', so I would get errors like "tthis is a text".
When I do a replacement, I would like to replace only the whole word 'his', not the 'his' inside 'this'.
Why not try this?
import re

word_to_replace = 'his'
corrected_word = 'this'
# the \b must be raw (r'\b'); in a plain string '\b' is a backspace character
corrected_text = re.sub(r'\b' + word_to_replace + r'\b', corrected_word, text)
corrected_text
Awesome, we did it, but the problem is... what if the word to correct contains a special character like '|'? For example,
'|ights are on' instead of 'lights are on'. Trust me, it happened to me; re.sub is a disaster in that case.
The question is: have you encountered the same problem? Is there any method to solve this? Replacement is the most robust option.
I tried text.replace(' '+word_to_replace+' ', ' '+corrected_word+' ') and this solves a lot of cases, but it still
has the problem of phrases like "his is a text", because the replacement doesn't work when 'his' is at the beginning of a sentence
and therefore not surrounded by spaces (' his ' for ' this ').
Is there any replacement method in Python that takes the whole word, like \b word_to_correct \b in regexes, as input?
After a few days I solved the problem that I had. I hope this can be helpful for someone else. Let me know if you have any questions.
text = """
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, his text is very difficult to work with.
"""
# Assume you have already corrected your word via OCR
# and you just have to replace it in the text (I did it with my OCR spellchecker).
# So we have the following word2correct and corrected_word (the word after the spellchecking system):
word2correct = 'his'
corrected_word = 'this'
# Now we replace the word together with its context.
import re

def context_replace(old_word, new_word, text):
    # Match the word between \b word boundaries with up to 10 characters of
    # context on each side; this captures 'his' with its context but not 'this'.
    phrase2correct = re.findall(r'.{1,10}\b' + old_word + r'\b.{1,10}', text)[0]
    # Once the context is matched, put in the new word.
    phrase_corrected = phrase2correct.replace(old_word, new_word)
    # Now replace the old phrase (phrase2correct) with the corrected one.
    text = text.replace(phrase2correct, phrase_corrected)
    return text
Test if the function works...
print(context_replace(old_word=word2correct,new_word=corrected_word,text=text))
Output:
this is a text, generated using optical character recognition.
this ls having a lot of errors because
the scanned pdf has too bad resolution.
Unfortunately, this text is very difficult to work with.
It worked for my purpose. I hope this is helpful for someone else.
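One caveat worth adding: context_replace still interpolates the raw word into a regex, so a word containing a metacharacter such as '|' (the '|ights' case above) would break it, and only the first occurrence gets fixed. A hedged sketch of a hypothetical helper that handles both, using re.escape plus lookarounds in place of \b:

import re

def replace_whole_word(old, new, text):
    # re.escape neutralizes metacharacters such as '|'; (?<!\w) and (?!\w)
    # behave like \b but also match at the edges of tokens that start or
    # end with non-word characters.
    pattern = r'(?<!\w)' + re.escape(old) + r'(?!\w)'
    return re.sub(pattern, new, text)

print(replace_whole_word('|ights', 'lights', '|ights are on'))  # lights are on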
I am doing a sentiment analysis project, and first I need to clean the text data. Some of the text contains Chinese or Tagalog, and what I am doing now is trying to translate it to English. But so far, all the Chinese characters in this data file have a Unicode representation like:
<U+5C16>
which cannot be handled with Python's usual encoding/decoding steps. So I want to transform this kind of pattern into:
\u5c16
Then I think we could use the following code to get the Chinese characters I want:
text.encode('latin-1').decode('unicode_escape')
So the question now is: how do I use a regex to transform <U+5C16> into \u5c16?
Thank you very much!
Update: I think the most difficult thing here is that the 5c16 part of \u5c16 needs to be the lowercase of the 5C16 in <U+5C16>. And in my social media dataset, what I see most is text data like the following:
<U+5C16><U+6C99><U+5480><U+9418><U+6A13>
If I could transform the above text to '\u5c16\u6c99\u5480\u9418\u6a13' and print it in Python, I would get what I really want:
尖沙咀鐘樓
But how could I do this? Any insights and hints would be appreciated!
The required regex is something like this:
find: r'<U\+([A-Fa-f0-9]+?)>'
replace with: r'\\u\1' (the backslash must be doubled: recent Python versions reject a bare \u in the replacement template as a bad escape)
To turn the resulting string into the actual characters, use s.encode().decode('unicode_escape')
Example:
re.sub(r'<U\+([A-Fa-f0-9]+?)>', r'\\u\1', s).encode().decode('unicode_escape')
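Note that the hex case does not matter: unicode_escape accepts 5C16 and 5c16 alike, so no lowercasing step is needed. Applied to the longer string from the question, for example:

import re

s = '<U+5C16><U+6C99><U+5480><U+9418><U+6A13>'
fixed = re.sub(r'<U\+([A-Fa-f0-9]+?)>', r'\\u\1', s)
print(fixed)                                    # \u5C16\u6C99\u5480\u9418\u6A13
print(fixed.encode().decode('unicode_escape'))  # 尖沙咀鐘樓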
If your file is exactly as you describe, here's how to convert it:
text = "text with <U+5C16> and so on"
ready = re.sub(r"<U\+([0-9a-fA-F]{4})>", r"\u\1", text)
go = re.sub(r"<U\+([0-9a-fA-F]{4})>", r"\u\1", text) # BMP: 4 hex digits
go = re.sub(r"<U\+([0-9a-fA-F]{5})>", r"\U000\1", go) # SMP: 5 -> 8 hex digits
print(go.encode("ascii").decode('unicode_escape'))
(The line marked "SMP" is only needed if you have characters outside the "basic multilingual plane".)
Output: text with 尖 and so on
I want to find all words which have a # attached to them.
I tried:
import re
text = "I was searching my #source to make a big desk yesterday."
re.findall(r'\b#\w+', text)
but it does not work...
Here's a small regex to do that:
>>> import re
>>> s = "I was searching my #source to make a big desk yesterday."
>>> re.findall(r"#(\w+)", s)
['source']
If you want to include the hashtag then use:
>>> re.findall(r"#.\w+", s)
['#source']
You can use:
re.findall(r"#.+?\b", text)
which gives:
['#source']
Pasting this into regex101 gives in-depth insight into what each part does.
Basically what is happening is:
the # matches the '#' character literally
the . matches any character
the + captures one or more of those characters
the ? makes the + lazy, so the match stops as early as possible
the \b is a word boundary and marks where the match ends
Update
As pointed out by @AnthonySottile, there is a case where the above regex will fail, namely:
hello#fred
where a match is made when it shouldn't be.
To get around this problem, a \s could be added to the front of the regex to make sure the # comes after some whitespace, but this fails when the hashtag comes right at the start of the string. A \b also won't suffice, as the # makes the hashtag not count as a word.
So, to get around these, I came up with this rather ugly solution of adding a space to the start of the string before doing the findall:
re.findall(r"\s(#.+?)\b", " " + text)
It's not very neat, I know, but I couldn't find a tidier way with this pattern. I tried using an OR at the start to match whitespace or the start of the string, as in (^|\s), but this produces multiple groups (as tuples) in the list returned from re.findall, so it would require some post-processing, which is even less neat.
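A negative lookbehind sidesteps both the added space and the tuples (the original \b#\w+ failed because a space and a # are both non-word characters, so there is no word boundary between them). A hedged variant, switching to \w+ for the tag body, with a test string invented for illustration:

import re

text = "hello#fred #start and my #source"
# match a # only when it is not preceded by a word character
print(re.findall(r"(?<!\w)#\w+", text))  # ['#start', '#source']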
You do not need regex to solve this problem:
text = "I was searching my #source to make a big desk yesterday."
final_text = [i for i in text.split() if i.startswith('#')]
Output:
['#source']
However, this regex will work:
import re
text = "I was searching my #source to make a big desk yesterday."
final_text = list(filter(None, re.findall(r'(?<=^)|(?<=\s)#\w+(?=\s)|(?=$)', text)))
Output:
['#source']
I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like ifI instead of if I, or thatposition instead of that position, or andhe's instead of and he's.
My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
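For what it's worth, the character-by-character splitting check described above is straightforward to sketch with NLTK's own word list (a hypothetical split_unknown helper; this assumes the words corpus has been downloaded with nltk.download('words')):

from nltk.corpus import words

vocab = set(w.lower() for w in words.words())

def split_unknown(token):
    # Try every split point; return the first split whose halves are both
    # in the vocabulary, otherwise return the token unchanged.
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left.lower() in vocab and right.lower() in vocab:
            return left + ' ' + right
    return token

print(split_unknown('thatposition'))  # 'that position', if both halves are listed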
I would suggest that you consider using pyenchant instead, since it is a more robust solution for this sort of problem. You can install pyenchant with pip install pyenchant. Here is an example of how you would obtain your results after you install it:
>>> text = "IfI am inthat position, Idon't think I will." # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept a suggestion only if it has exactly the same characters
...         # as the error, in the same order, ignoring spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
>>> checker.get_text()
"If I am in that position, I don't think I will." # text is now fixed