How to tokenize continuous words with no whitespace delimiters? - python

I'm using Python with nltk. I need to process some English text that contains no whitespace, but nltk's word_tokenize function can't deal with input like that. So how can I tokenize text that has no whitespace? Is there any tool for this in Python?

I am not aware of such a tool, but the solution to your problem depends on the language.
For Turkish you can scan the input text letter by letter and accumulate letters into a word. As soon as the accumulated letters form a valid dictionary word, you save it as a separate token, clear the buffer, and continue the process.
You can try this for English, but you will likely hit situations where the end of one word is also the beginning of another dictionary word, and that can cause you some problems.
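A rough sketch of that greedy approach (the tiny word list here is a stand-in for a real dictionary, e.g. set(nltk.corpus.words.words()), and the example input is made up):

# Toy word list; in practice load a real dictionary.
dictionary = {"the", "quick", "brown", "fox"}

def naive_segment(text, dictionary):
    tokens = []
    buffer = ""
    for letter in text:
        buffer += letter
        if buffer in dictionary:   # accumulated letters form a known word
            tokens.append(buffer)
            buffer = ""            # start accumulating the next word
    return tokens

print(naive_segment("thequickbrownfox", dictionary))  # ['the', 'quick', 'brown', 'fox']

As noted above, a greedy match like this goes wrong as soon as a short dictionary word is a prefix of a longer one.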

Related

Split joined/concatenated words list of different languages

I'm trying to split words from different languages that are joined.
My expected result:
input = ['françaisenglishtext']
output = ['français','english','text']
The first word in the output is French and the other words are English.
I tried to use the Python wordninja library from this question: How to split text without spaces into list of words, by first running it on the input as English, then removing the non-English words with pyenchant and keeping only the English ones.
The problem with this method is that wordninja also splits the French part as if it were English, so after splitting I cannot tell which part of the output was French. It also removes the special French characters.
My current result:
result_with_wordninja = ['fran', 'a', 'is', 'english', 'text']
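For reference, the approach described above could be sketched roughly as follows (the exact script is not in the question, so the details are illustrative; wordninja.split and enchant.Dict are the actual library calls):

import enchant
import wordninja

english_checker = enchant.Dict("en_US")

for text in ['françaisenglishtext']:
    pieces = wordninja.split(text)   # gives something like ['fran', 'a', 'is', 'english', 'text']
    english_only = [p for p in pieces if english_checker.check(p)]
    print(pieces, english_only)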
Finally, I tried to change the dictionary of wordninja, but I still face the same problem.
I've also checked this answer: Split a paragraph containing words in different languages, but it doesn't work for my case since I only have Latin characters in my list.
Is there a specific library or method for splitting joined words from different languages?
Thank you,

Writing a tokenizer for an english text in python with nltk and regex

I want to write a tokenizer for an English text and I'm working with the RegExp tokenizer from the nltk module in python.
This is the expression I currently use to split the words:
[\w\.]+
(The "." is there so that something like u.s.a doesn't get butchered.)
Problem: at the same time I want to remove the punctuation from the word, so that u.s.a becomes usa.
Of course I can do it in separate steps but I thought there has to be a smoother way than iterating over the whole text again just to remove punctuation.
Since it needs to be scalable I want to optimize the runtime as best as I can.
I'm pretty new to Regular Expressions and have a really hard time, so I'm really happy for any help I can get.
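For context, the current setup presumably looks something like this (only the pattern is given in the question; the sample sentence is made up):

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'[\w\.]+')
print(tokenizer.tokenize("The U.S.A. joined, really."))
# ['The', 'U.S.A.', 'joined', 'really.']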
The nltk module uses more than regular expressions alone (specifically, pre-trained models) and does a pretty good job on its own with abbreviations, really:
from nltk import sent_tokenize, word_tokenize

text = """
In recent times, the U.S. has had to endure difficult
political times and many trials and tribulations.
Maybe things will get better soon - but only with the
right punctuation marks. Am I right, Dr.?"""

words = []
for nr, sent in enumerate(sent_tokenize(text), 1):  # number the sentences from 1
    print("{}. {}".format(nr, sent))
    for word in word_tokenize(sent):
        words.append(word)
print(words)
Don't reinvent the wheel here with your own regular expressions.
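If the remaining periods should also be stripped (so that "U.S." becomes "US"), a small post-processing pass over the tokens is enough. This is an extra suggestion on top of the answer above, continuing from the words list built in the snippet:

# Drop pure-punctuation tokens and remove periods from the rest.
cleaned = [w.replace(".", "") for w in words if any(ch.isalnum() for ch in w)]
print(cleaned)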

Print something when a word is in a word list

So I am currently trying to build a Caesar cipher cracker that automatically tries all the possibilities and compares them against a big list of words to see whether they are real words, so some sort of dictionary attack I guess.
I found a list with a lot of German words, and they are even split so that each word is on a new line. Currently I am struggling to compare my candidate sentence against the whole word list, so that when the program sees that a word in my sentence is also a word in the word list, it prints out that this is a real word and possibly the right sentence.
This is how far I currently am; I have not included the code with which I try all 26 shifts, only my way of looking through the word list and comparing it to a sentence. Maybe someone can tell me what I am doing wrong and why it doesn't work:
No idea why it doesn't work. I have also tried it with regular expressions, but nothing works. The list is really long (166k words).
There is a \n at the end of each word in the list you created from the file, so they will never be equal to the words you compare them against.
Remove the newline character before appending (for example, wordlist.append(line.rstrip())).
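A minimal sketch of that fix (the file name and the candidate sentence are made up; the important parts are the rstrip() when building the list and using a set for fast lookups):

wordlist = []
with open("german_words.txt", encoding="utf-8") as f:
    for line in f:
        wordlist.append(line.rstrip())   # strip the trailing newline

word_set = set(wordlist)                 # set membership is fast even for 166k words

candidate = "dies ist ein ganz normaler satz"
for word in candidate.split():
    if word in word_set:
        print(word, "is a real word")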

NLP: How do I combine stemming and tagging?

I'm trying to write code which passes in text that has been tokenized and had the stop words filtered out, and then stems and tags it. However, I'm not sure in what order I should stem and tag. This is what I have at the moment:
#### Stemming
ps = PorterStemmer()  # PorterStemmer imported from nltk.stem
stemText = []
for word in swFiltText:  # Tagged text w/o stop words
    stemText.append(ps.stem(word))

#### POS Tagging
def tagging():
    tagTot = []
    try:
        for i in stemText:
            words = nltk.word_tokenize(i)  # I need to tokenize again (idk why?)
            tagged = nltk.pos_tag(words)
            tagTot = tagTot + tagged  # Combine tagged words into list
    except Exception as e:
        print(str(e))
    return tagTot

tagText = tagging()
At first glance, this works just fine. However, because I stemmed first, pos_tag often mislabels words. For example, it marked "hous" as an adjective, when the original word was really the noun "house". But when I try to stem after tagging, it gives me an error about how pos_tag can't deal with 'tuples' - I'm guessing this has something to do with the way that the stemmer formats the word list as [('come', 'VB'), ('hous', 'JJ'), ...].
Should I be using a different stemmer/tagger? Or is the error in my code?
Thanks in advance!
You should tag the text before you apply stemming or lemmatisation to it.
Removing the endings of words takes away crucial clues about what part-of-speech tag a word can be.
The reason that you got hous as an adjective is that any tagger expects unprocessed tokens, and words ending in -ous in English are usually adjectives (nefarious, serious). If you tag it first, it can be recognised (even without context) as either a noun or a verb. The tagger can then use context (preceded by the? -> noun) to disambiguate which is the most likely one.
A good lemmatiser can take the part of speech into account, e.g. housing could be a noun (lemma: housing) or a verb (lemma: house). With POS information a lemmatiser can make the correct choice there.
Whether you use stemming or lemmatisation depends on your application. For many purposes they will be equivalent. The main differences in my experience are:
Stemming is a lot faster, as stemmers have a few rules on how to handle various endings
Lemmatisation gives you 'proper' words which you can look up in dictionaries (if you want to get glosses in other languages or definitions)
Stemmed strings sometimes don't look anything like the original word, and if you present them to a human user they might get confused
Stemmers conflate words which have similar meanings but different lemmas, so for information retrieval they might be more useful
Stemmers don't need a word list, so if you want to write your own stemmer, it's quicker than writing a lemmatiser (if you're processing languages for which no ready-made tools exist)
I would suggest using lemmatization over stemming; stemming just chops letters off the end until the root/stem word is reached. Lemmatization also looks at the surrounding text to determine the given word's part of speech.
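As a rough illustration of tagging first and then lemmatising with the tag taken into account (a sketch, not the asker's exact pipeline; the token list is made up):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags to the coarse tags WordNetLemmatizer expects.
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = ["they", "come", "to", "the", "house"]
tagged = nltk.pos_tag(tokens)  # list of (word, Penn Treebank tag) pairs
lemmas = [(word, lemmatizer.lemmatize(word, wordnet_pos(tag))) for word, tag in tagged]
print(lemmas)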

Reading sentences from a text file and appending into a list with Python 3 [closed]

I'm having trouble figuring out how I would take a text file of a lengthy document and append each sentence within that text file to a list. Not all sentences will end in a period, so all ending characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cut off my search through a sentence at a period. I'm assuming this could be fixed by adding a condition that the period should be followed by a space, but I have no idea how to set this up so that each sentence from the text file is put into a list as an element.
The program I'm writing will essentially allow user input of a keyword to search for (key) and a number of sentences to return (value) before and after the sentence in which the keyword is found. So it's more or less a research assistant, so that the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (nltk), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
Getting your sentences is then a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk

with open('text.txt', 'r') as in_file:
    text = in_file.read()

sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
But fails on inputs like: ["This is a sentence with.", "a period right in the middle."]
while passing on inputs like: ["This is a sentence wit.h a period right in the middle"]
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
Hope this helps :)
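Once the sentences are in a list, the keyword-and-context lookup described in the question could be sketched roughly like this (not from the original answer; key and value stand in for the user inputs, and sents comes from the snippet above):

key = "keyword"   # search term entered by the user
value = 2         # sentences of context on each side

for i, sent in enumerate(sents):
    if key.lower() in sent.lower():
        start = max(0, i - value)
        print("\n".join(sents[start:i + value + 1]))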
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how re.split() can be used to break up the string:
import re

# Read the file line by line and join the lines into one string.
with open("test.txt", "r") as file:
    doclist = [line for line in file]
docstr = ''.join(doclist)

# Split on sentence-ending punctuation.
sentences = re.split(r'[.!?]', docstr)
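As a small follow-up (my addition, not part of the answer): the raw re.split() output keeps leading spaces and can end with an empty string, so a cleanup pass is usually worthwhile:

sentences = [s.strip() for s in sentences if s.strip()]
print(sentences)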
