I know how to split a string into a list of words, like this:
some_string = "Siva is belongs to new York and he was living in park meadows mall apartment "
some_string.split()
# ['Siva', 'is', 'belongs', 'to', 'new', 'York', 'and', 'he', 'was', 'living', 'in', 'park', 'meadows', 'mall', 'apartment']
However, some of the words should not be separated, for example, "New York" and "Park Meadows Mall". I have saved such special cases in a list called 'some_list':
some_list = [('new York'), ('park meadows mall')]
where the desired result would be:
['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
Any ideas on how I can get this done?
You can reconstruct the split elements into their compound form. Ideally, you want to scan over your split string only once, checking each element against all possible replacements.
A naive approach is to transform some_list into a lookup table of all possible compound sequences, keyed by their first word. For example, the element 'new' indicates a potential replacement for 'new', 'York'. Such a table can be built by splitting off the first word of each compound word:
replacements = {}
for compound in some_list:
    words = compound.split()  # 'new York' => ['new', 'York']
    try:  # 'new' => [['new', 'York'], ['new', 'Orleans']]
        replacements[words[0]].append(words)
    except KeyError:  # 'new' => [['new', 'York']]
        replacements[words[0]] = [words]
Using this, you can traverse your split string and test for each word whether it might be part of a compound word. The tricky part is to avoid adding the trailing parts of compound words a second time.
splitted_string = some_string.split()
compound_string = []
insertion_offset = 0
for index, word in enumerate(splitted_string):
    # we already added a compound word, skip its remaining members
    if len(compound_string) + insertion_offset > index:
        continue
    # check if a compound word starts here...
    try:
        candidate_compounds = replacements[word]
    except KeyError:
        # definitely not, just keep the word
        compound_string.append(word)
    else:
        # try all possible compound words...
        for compound in candidate_compounds:
            if splitted_string[index:index + len(compound)] == compound:
                # each compound consumes len(compound) words but adds only one element
                insertion_offset += len(compound) - 1
                compound_string.append(' '.join(compound))
                break
        # ...but otherwise, just keep the word
        else:
            compound_string.append(word)
This will stitch together all individual pieces of compound words:
>>> print(compound_string)
['Siva', 'is', 'belongs', 'to', 'new York', 'and', 'he', 'was', 'living', 'in', 'park meadows mall', 'apartment']
Note that the ideal structure of the replacements table depends on the words in some_list. If no two compound words share a first word, you can skip the list of candidates and store just one compound word per key. If there are many collisions, you may have to nest several tables to avoid trying all candidates. The latter is especially important if some_string is large.
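For instance, assuming no two entries in some_list share a first word, a minimal sketch of that simpler table could look like this:
simple_replacements = {compound.split()[0]: compound.split() for compound in some_list}
# assumes every compound in some_list starts with a distinct word, e.g.
# {'new': ['new', 'York'], 'park': ['park', 'meadows', 'mall']}
# the lookup then yields a single candidate instead of a list to try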
I need to take a list of tuples that includes sentences that are preprocessed like this (the 0 is an integer that corresponds to the publication, and the set at the end contains all unique words in the sentence):
(0, 'political commentators on both sides of the political divide agreed that clinton tried to hammer home the democrats theme that trump is temperamentally unfit with the line about his tweets and nuclear weapons', {'weapons', 'political', 'theme', 'line', 'and', 'sides', 'commentators', 'of', 'tried', 'about', 'is', 'agreed', 'clinton', 'the', 'home', 'to', 'divide', 'tweets', 'that', 'democrats', 'unfit', 'on', 'temperamentally', 'both', 'hammer', 'his', 'nuclear', 'with', 'trump'})
and returns a dictionary that has the words as the keys, and a list of integers that are the "index" positions of the word as the value. I.e. if this sentence were the 12th in the list, the dictionary value would contain 12 next to all the words present in it.
I know that I need to enumerate the original set of documents and then take the words from the set in the tuple, but I'm having a hard time finding the proper syntax to iterate into the sets of words within the tuple. Right now I'm stumped as to even where to start. If you want to see my code for how I produced the tuples from an original document of lines, here it is.
import re

def makeDocuments(filename):
    with open(filename) as f:
        g = [l for l in f]
    return [tuple([int(l[0:2]), re.sub(r'\W', ' ', (l[2:-1])), set(re.findall(r'[a-zA-Z%]+', l))]) for l in g]
A test case was provided for me; upon searching for a given key, the results should look something like:
assert index['happiness'] == [16495, 66139, 84943,
                              85998, 91589, 93472,
                              120070, 133078, 193349]
where the word 'happiness' occurs inside the sentences at those index positions.
Parsing that string is hard and you have pretty much just done a brute-force extraction of data. Instead of trying to guess whether that's going to work on all possible input, you can use Python's ast module to convert literals (what you type into a Python program to represent strings, tuples, sets and so forth) into Python objects for processing. After that, it's just a question of associating the words in the newly created tuple with the indexes.
import ast

def makeDocuments(filename):
    catalog = {}
    with open(filename) as f:
        for line in f:
            # each line is expected to hold one literal tuple: (index, text, words)
            index, text, words = ast.literal_eval(line)
            for word in words:
                if word not in catalog:
                    catalog[word] = []
                catalog[word].append(index)
    return catalog
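Usage could then look something like this (the filename here is just a placeholder):
index = makeDocuments('documents.txt')  # placeholder filename
print(sorted(index['happiness']))       # e.g. [16495, 66139, 84943, ...]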
For example, in the sentence below, where the lemmatizer has affected 5 words, the number 5 should be displayed in the output.
lemmatizer = WordNetLemmatizer()
sentence = "The striped bats are hanging on their feet for best"
print([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in nltk.word_tokenize(sentence)])
#> ['The', 'strip', 'bat', 'be', 'hang', 'on', 'their', 'foot', 'for', 'best']
Probably not the most elegant way, but as a workaround you could compare each element of the tokenized sentence with the corresponding element of the lemmatized sentence (as far as I know, lemmatization does not remove elements, so this should work).
Something like this:
count = 0
for i, el in enumerate(tokenized):
    if el != lemmatized[i]:
        count += 1
The value of count will be the number of elements that differ between the two lists, and thus the number of elements affected by lemmatization.
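Assuming both lists have the same length, the same count can also be written more compactly with zip and sum:
# tokenized is nltk.word_tokenize(sentence); lemmatized is the list of lemmas from above
count = sum(1 for tok, lem in zip(tokenized, lemmatized) if tok != lem)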
I'm trying to convert some text into a list. The text contains special characters, numbers, and line breaks. Ultimately I want to have a list with each word as an item in the list without any special characters, numbers, or spaces.
excerpt from text:
I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. <the< I
Currently I'm using this line to split each word into an item in the list:
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k) \
for k in content.split(" ")]
print(text_list)
This code is leaving in spaces and combining words in each item of the list like below
Result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between St ', 'Petersburgh', 'and', 'Archangel ', ' lt the lt I']
I would like to split the words into individual items of the list and remove the string ' lt ' and numbers from my list items.
Expected result:
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post', 'road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
Please help me resolve this issue.
Thanks
Since it looks like you're parsing html text, it's likely all entities are enclosed in & and ;. Removing those makes matching the rest quite easy.
import re
content = 'I have no ambition to lose my life on the post-road between St. Petersburgh and Archangel. <the< I'
# first, remove entities, the question mark makes sure the expression isn't too greedy
content = re.sub(r'&[^ ]+?;', '', content)
# then just match anything that meets your rules
text_list = re.findall(r"[a-zA-Z0-9]+", content)
print(text_list)
Note that 'between St ' likely got stuck together because the character between 'between' and 'St.' probably isn't a regular space, but a non-breaking space. If this were just html, I'd expect there to be &nbsp; or something of the sort, but it's possible that in your case there's some UTF non-breaking space character there.
That should not matter with the code above, but if you use a solution using .split(), it likely won't see that character as a space.
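To illustrate, assuming the character really is the Unicode non-breaking space U+00A0:
import re
s = 'between\u00a0St. Petersburgh'          # assumed non-breaking space after 'between'
print(s.split(' '))                          # ['between\xa0St.', 'Petersburgh'], i.e. not split there
print(re.findall(r"[a-zA-Z0-9]+", s))        # ['between', 'St', 'Petersburgh']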
In case the < is not your mistake but is actually in the original, this works as a replacement for the .sub() statement:
content = re.sub(r'&[^ ;]+?(?=[ ;]);?', '', content)
Clearly a bit more complicated: it substitutes any string that starts with & [&], followed by one or more characters that are not a space or ;, taking as little as possible [[^ ;]+?], but only if they are then followed by a space or a ; [(?=[ ;])], and in that case that ; is also matched [;?].
Here is what can be done. You just need to replace any known entity code in advance.
import re

# define the special entity codes that we want to remove
special_syntax = r"&(lt|nbsp|gt|amp|quot|apos|cent|pound|yen|euro|copy|reg|)[; ]"

# remove the entity codes first, then split and substitute special characters again
text_list = [re.sub(r"[^a-zA-Z0-9]+", ' ', k).strip()
             for k in re.sub(special_syntax, ' ', content).split(" ")]

# remove empty strings from the list
filter_object = filter(lambda x: x != "", text_list)
list(filter_object)
Output
['I', 'have', 'no', 'ambition', 'to', 'lose', 'my', 'life', 'on', 'the',
'post road', 'between', 'St', 'Petersburgh', 'and', 'Archangel', 'the', 'I']
I need to pull possible titles out of a chunk of text. For instance, I want to match phrases like "Joe Smith", "The Firm", or "United States of America". I now need to modify my expression to also match names that begin with a title of some kind (such as "Dr. Joe Smith"). Here's the regular expression I have:
import re

NON_CAPPED_WORDS = (
    # Articles
    'the',
    'a',
    'an',
    # Prepositions
    'about',
    'after',
    'as',
    'at',
    'before',
    'by',
    'for',
    'from',
    'in',
    'into',
    'like',
    'of',
    'on',
    'to',
    'upon',
    'with',
    'without',
)

TITLES = (
    'Dr\.',
    'Mr\.',
    'Mrs\.',
    'Ms\.',
    'Gov\.',
    'Sen\.',
    'Rep\.',
)

# These are words that don't match the normal title case regex, but are still allowed
# in matches
IRREGULAR_WORDS = NON_CAPPED_WORDS + TITLES

non_capped_words_re = r'[\s:,]+|'.join(IRREGULAR_WORDS)

TITLE_RE = re.compile(r"""(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|{0})+\s*)""".format(non_capped_words_re))
Which builds the following regular expression:
(?P<title>([A-Z0-9&][a-zA-Z0-9]*[\s,:-]*|the[\s:,]+|a[\s:,]+|an[\s:,]+|about[\s:,]+|after[\s:,]+|as[\s:,]+|at[\s:,]+|before[\s:,]+|by[\s:,]+|for[\s:,]+|from[\s:,]+|in[\s:,]+|into[\s:,]+|like[\s:,]+|of[\s:,]+|on[\s:,]+|to[\s:,]+|upon[\s:,]+|with[\s:,]+|without[\s:,]+|Dr\.[\s:,]+|Mr\.[\s:,]+|Mrs\.[\s:,]+|Ms\.[\s:,]+|Gov\.[\s:,]+|Sen\.[\s:,]+|Rep\.)+\s*)
This doesn't seem to be working though:
>>> whitelisting.TITLE_RE.findall('Dr. Joe Smith')
[('Dr', 'Dr'), ('Joe Smith', 'Smith')]
Can someone who has better regex-fu help me fix this mess of a regex?
The problem seems to be that the first part of the expression, [A-Z0-9&][a-zA-Z0-9]*[\s,:-]*, is gobbling up the initial characters in your "prefix titles", since they are title-cased until you get to the period. So, when the + is repeating the subexpression and encounters 'Dr.', that initial part of the expression matches 'Dr', and leaves only the non-matching period.
One easy fix is to simply move the "special cases" to the front of the expression, so they're matched as a first resort, not a last resort (this essentially just moves {0} from the end of the expression to the front):
TITLE_RE = re.compile(r"""(?P<title>({0}|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
Result:
>>> TITLE_RE.findall('Dr. Joe Smith');
[('Dr. Joe Smith', 'Smith')]
I would probably go further and modify the expression to avoid all the repetition of [\s:,]+, but I'm not sure there's any real benefit, aside from making the formatted expression look a little nicer:
non_capped_words_re = '|'.join(IRREGULAR_WORDS)
TITLE_RE = re.compile(r"""(?P<title>((?:{0})[\s:,]+|[A-Z0-9&][a-zA-Z0-9]*[\s,:-]*)+\s*)""".format(non_capped_words_re))
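Running that variant against the same input should still give the same result:
>>> TITLE_RE.findall('Dr. Joe Smith')
[('Dr. Joe Smith', 'Smith')]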
Sorry if the question is a bit confusing. This is similar to this question.
I think the above question is close to what I want, but it is in Clojure.
There is another question. I need something like that, but instead of the '[br]' in that question, there is a list of strings that need to be searched and removed.
Hope I made myself clear.
I think that this is due to the fact that strings in Python are immutable.
I have a list of noise words that need to be removed from a list of strings.
If I use a list comprehension, I end up searching the same string again and again, so only "of" gets removed and not "the". My lists and code look like this:
places = ['New York', 'the New York City', 'at Moscow' and many more]
noise_words_list = ['of', 'the', 'in', 'for', 'at']
for place in places:
    stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
I would like to know what mistake I'm making.
Without regexp you could do it like this:
places = ['of New York', 'of the New York']
noise_words_set = {'of', 'the', 'at', 'for', 'in'}
stuff = [' '.join(w for w in place.split() if w.lower() not in noise_words_set)
         for place in places]
print(stuff)
Here is my stab at it. This uses regular expressions.
import re
pattern = re.compile("(of|the|in|for|at)\W", re.I)
phrases = ['of New York', 'of the New York']
map(lambda phrase: pattern.sub("", phrase), phrases) # ['New York', 'New York']
Sans lambda:
[pattern.sub("", phrase) for phrase in phrases]
Update
Fix for the bug pointed out by gnibbler (thanks!):
pattern = re.compile("\\b(of|the|in|for|at)\\W", re.I)
phrases = ['of New York', 'of the New York', 'Spain has rain']
[pattern.sub("", phrase) for phrase in phrases] # ['New York', 'New York', 'Spain has rain']
@prabhu: the above change avoids snipping off the trailing "in" from "Spain". To verify, run both versions of the regular expression against the phrase "Spain has rain".
>>> import re
>>> noise_words_list = ['of', 'the', 'in', 'for', 'at']
>>> phrases = ['of New York', 'of the New York']
>>> noise_re = re.compile('\\b(%s)\\W'%('|'.join(map(re.escape,noise_words_list))),re.I)
>>> [noise_re.sub('',p) for p in phrases]
['New York', 'New York']
Since you would like to know what you are doing wrong, this line:
stuff = [place.replace(w, "").strip() for w in noise_words_list if place.startswith(w)]
takes place, and then begins to loop over the words. First it checks for "of". Your place (e.g. "of the New York") is checked to see if it starts with "of". It is transformed (calls to replace and strip) and added to the result list. The crucial thing here is that the result is never examined again. For every word you iterate over in the comprehension, a new result is added to the result list. So the next word is "the", and your place ("of the New York") doesn't start with "the", so no new result is added.
I assume the result you got eventually is the concatenation of your place variables. A simpler to read and understand procedural version would be (untested):
results = []
for place in places:
    for word in words:
        if place.startswith(word):
            place = place.replace(word, "").strip()
    results.append(place)
Keep in mind that replace() will remove the word anywhere in the string, even if it occurs as a simple substring. You can avoid this by using regexes with a pattern something like ^the\b.
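A rough sketch of that regex idea, assuming you only want to strip noise words from the start of each string:
import re

noise_words_list = ['of', 'the', 'in', 'for', 'at']
# repeatedly strip any leading noise word followed by whitespace
noise_re = re.compile(r'^(?:(?:%s)\s+)+' % '|'.join(map(re.escape, noise_words_list)), re.I)
results = [noise_re.sub('', place) for place in places]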