I have a text file and a task in Python about reading it:
Find the names of people who are referred to as Mr. XXX. Save the result in a dictionary with the name as key and number of times it is used as value. For example:
If Mr. Churchill is in the novel, then include {'Churchill' : 2}
If Mr. Frank Churchill is in the novel, then include {'Frank Churchill' : 4}
The file is .txt and it contains around 10-15 paragraphs.
Do you have ideas about how it can be improved? (It gives me an error after some words; I guess the error happens because one of the 'Mr.' occurrences is at the end of a line.)
orig_text = open('emma.txt', encoding='UTF-8')
lines = orig_text.readlines()[32:16267]
counts = dict()
for line in lines:
    wordsdirty = line.split()
    try:
        print(wordsdirty[wordsdirty.index('Mr.') + 1])
    except ValueError:
        continue
Try this:
text = "When did Mr. Churchill told Mr. James Brown about the fish"
m = [x[0] for x in re.findall('(Mr\.( [A-Z][a-z]*)+)', text)]
You get:
['Mr. Churchill', 'Mr. James Brown']
To solve the line issue, simply read the entire file instead of reading line by line:
text = open('emma.txt', encoding='UTF-8').read()
Then, to count the occurrences, simply run:
from collections import Counter
Counter(m)
Finally, if you'd like to drop 'Mr. ' from all your dictionary entries, use x[0][4:] instead of x[0].
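Putting those pieces together, here is a minimal end-to-end sketch. One assumption on my part: I match with \s instead of a literal space, so a 'Mr.' at the end of a line still picks up the name that starts the next line.

import re
from collections import Counter

# Read the whole file so a 'Mr.' at the end of a line isn't separated
# from the name that follows it.
with open('emma.txt', encoding='UTF-8') as f:
    text = f.read()

# Capture 'Mr.' plus one or more capitalized words, then drop the
# leading 'Mr. ' (4 characters) from each match.
matches = [x[0][4:] for x in re.findall(r'(Mr\.(\s[A-Z][a-z]*)+)', text)]
counts = dict(Counter(matches))  # e.g. {'Churchill': 2, 'Frank Churchill': 4}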
This can easily be done using a regex with a capturing group.
Take a look here for reference; in this scenario you might want to do something like:
import re
from collections import Counter

# retrieve a list of strings that match your regex
matches = re.findall(r"Mr\. ([a-zA-Z]+)", your_entire_file)  # not sure about the regex
# then create a dictionary and count the occurrences of each match;
# if you are allowed to use modules, this can be done using Counter
Counter(matches)
To access the entire file like that, you might want to map it into memory; take a look at this question.
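If the file is large, here is a minimal sketch of the memory-mapping idea (assuming the same emma.txt; note that regexes over an mmap must use bytes patterns):

import mmap
import re
from collections import Counter

with open('emma.txt', 'rb') as f:
    # Map the file read-only; re can scan the mmap object directly.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    matches = [m.decode('utf-8') for m in re.findall(rb'Mr\. ([a-zA-Z]+)', mm)]
    mm.close()

counts = Counter(matches)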
I am looking for a way to create several lists and for the keywords in those lists to be extracted and matched with a response.
User Input: This is a good day I am heading out for a jog.
List 1 : Keywords : good day, great day, awesome day, best day.
List 2 : Keywords : a run, a swim, a game.
But for a huge database of words, can this be linked to just the lists? Or does it need to be specific words?
Also, would you recommend Python for a huge database of keywords?
The first thing to do is to break the input string up into tokens. A token is just a piece of the string that you want to match. In your case, it looks like your token size is 2 words (but it doesn't have to be). You might also want to strip all punctuation from the input string as well.
Then for your input, your tokens are
['This is', 'is a', 'a good', 'good day', 'day I', 'I am', 'am heading', 'heading out', 'out for', 'for a', 'a jog']
Then you can iterate over the tokens and check to see if they're contained in each one of the lists. Might look like this:
# the two keyword lists from the question
list1 = ['good day', 'great day', 'awesome day', 'best day']
list2 = ['a run', 'a swim', 'a game']
user_input = 'This is a good day I am heading out for a jog'  # 'input' would shadow a builtin
words = user_input.split(' ')
tokens = [' '.join(words[i:i + 2]) for i in range(len(words) - 1)]
for token in tokens:
    if token in list1:
        print('{} is in list1'.format(token))
    if token in list2:
        print('{} is in list2'.format(token))
One thing you will likely want to do to optimize this is to use sets for list1 and list2, instead of lists.
set1 = set(list1)
Sets offer average O(1) membership tests, as opposed to O(n) for lists, which is critical if your keyword lists are large.
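Continuing the example above, a minimal sketch:

set2 = set(list2)
# Each 'token in set1' / 'token in set2' test is now O(1) on average.
matched = [t for t in tokens if t in set1 or t in set2]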
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this, so it needs to split on a period at the end of a sentence, but not at decimals, abbreviations, or titles before a name, or when the sentence has a .com in it. This is my attempt at a regex, but it doesn't work:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r' *[\.\?!][\'"\)\]]* *', text)
for stuff in sentences:
    print(stuff)
Example output of what it should look like
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s
Try this; split your string with it. You can also check the demo:
http://regex101.com/r/nG1gU7/27
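For reference, a runnable sketch applying that pattern to the question's sample text (it produces the five sentences from the expected output):

import re

text = ("Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he "
        "paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. "
        "In any case, this isn't true... Well, with a probability of .9 it isn't.")
for s in re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text):
    print(s)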
Ok so sentence-tokenizers are something I looked at in a little detail, using regexes, nltk, CoreNLP, spaCy. You end up writing your own and it depends on the application. This stuff is tricky and valuable and people don't just give their tokenizer code away. (Ultimately, tokenization is not a deterministic procedure, it's probabilistic, and also depends very heavily on your corpus or domain, e.g. legal/financial documents vs social-media posts vs Yelp reviews vs biomedical papers...)
In general you can't rely on one single Great White infallible regex, you have to write a function which uses several regexes (both positive and negative); also a dictionary of abbreviations, and some basic language parsing which knows that e.g. 'I', 'USA', 'FCC', 'TARP' are capitalized in English.
To illustrate how easily this can get seriously complicated, let's try to write you that functional spec for a deterministic tokenizer just to decide whether single or multiple period ('.'/'...') indicates end-of-sentence, or something else:
function isEndOfSentence(leftContext, rightContext)
Return False for decimals inside numbers or currency, e.g. 1.23, $1.23, "That's just my $.02". Consider also section references like 1.2.A.3.a, European date formats like 09.07.2014, IP addresses like 192.168.1.1, MAC addresses...
Return False (and don't tokenize into individual letters) for known abbreviations, e.g. "U.S. stocks are falling"; this requires a dictionary of known abbreviations. Anything outside that dictionary you will get wrong, unless you add code to detect unknown abbreviations like A.B.C. and add them to a list.
Ellipses '...' at ends of sentences are terminal, but in the middle of sentences are not. This is not as easy as you might think: you need to look at the left context and the right context, specifically whether the RHS is capitalized, and again consider capitalized words like 'I' and abbreviations. Here's an example proving the ambiguity: She asked me to stay... I left an hour later. (Was that one sentence or two? Impossible to determine.)
You may also want to write a few patterns to detect and reject miscellaneous non-sentence-ending uses of punctuation: emoticons :-), ASCII art, spaced ellipses . . . and other stuff, esp. Twitter. (Making that adaptive is even harder.) How do we tell if #midnight is a Twitter user, the show on Comedy Central, text shorthand, or simply unwanted/junk/typo punctuation? Seriously non-trivial.
After you handle all those negative cases, you could arbitrarily say that any isolated period followed by whitespace is likely to be an end of sentence. (Ultimately, if you really want to buy extra accuracy, you end up writing your own probabilistic sentence-tokenizer which uses weights, and training it on a specific corpus (e.g. legal texts, broadcast media, StackOverflow, Twitter, forum comments, etc.).) Then you have to manually review exemplars and training errors. See the Manning and Jurafsky book or the Coursera course [a].
Ultimately you get as much correctness as you are prepared to pay for.
All of the above is clearly specific to English-language abbreviations and US number/time/date formats. If you want to make it country- and language-independent, that's a bigger proposition: you'll need corpora, native-speaking people to label and QA it all, etc.
All of the above is still only ASCII, which is, practically speaking, only about 95 printable characters. Allow the input to be Unicode, and things get harder still (and the training set necessarily must be either much bigger or much sparser).
In the simple (deterministic) case, function isEndOfSentence(leftContext, rightContext) would return boolean, but in the more general sense, it's probabilistic: it returns a float 0.0-1.0 (confidence level that that particular '.' is a sentence end).
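To make the shape of this concrete, here is a minimal deterministic sketch of that function. Two assumptions on my part: the two contexts exclude the period itself, and the abbreviation list is a toy one; a real dictionary would be far larger.

import re

# Toy abbreviation set, for illustration only.
ABBREVIATIONS = {'mr', 'mrs', 'dr', 'jr', 'sr', 'i.e', 'e.g', 'u.s'}

def is_end_of_sentence(left_context, right_context):
    # A period at the very end of the text is terminal.
    if not right_context.strip():
        return True
    # Decimals, section references, IPs: a digit on both sides of the dot.
    if left_context[-1:].isdigit() and right_context[:1].isdigit():
        return False
    # Known abbreviations never end a sentence in this naive version.
    words = left_context.split()
    if words and words[-1].lower().rstrip('.') in ABBREVIATIONS:
        return False
    # Otherwise: whitespace followed by a capital (or quote) looks terminal.
    return bool(re.match(r'\s+["\']?[A-Z]', right_context))

A probabilistic version, as described above, would return a confidence in 0.0-1.0 instead of a bool.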
References: [a] Coursera video: "Basic Text Processing 2-5 - Sentence Segmentation - Stanford NLP - Professor Dan Jurafsky & Chris Manning" [UPDATE: an unofficial version used to be on YouTube, was taken down]
Try splitting the input on the spaces that follow a sentence terminator, rather than on the dot or '?' itself; that way the dot or '?' is kept in the final result.
>>> import re
>>> s = """Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't."""
>>> m = re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', s)
>>> for i in m:
...     print(i)
...
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it.
Did he mind?
Adam Jones Jr. thinks he didn't.
In any case, this isn't true...
Well, with a probability of .9 it isn't.
sent = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
for s in sent:
    print(s)
Here the regex used is : (?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)
First block: (?<!\w\.\w.): a negative lookbehind ((?<!) means "not preceded by") that rejects split points preceded by a word character (\w), a full stop (\.), another word character (\w), and one more character (.); this protects abbreviations like i.e. and domains like cheapsite.com.
Second block: (?<![A-Z][a-z]\.): a negative lookbehind that rejects split points preceded by an uppercase letter ([A-Z]), a lowercase letter ([a-z]), and a dot (\.), i.e. titles like Mr. and Jr.
Third block: (?<=\.|\?): a positive lookbehind requiring a dot (\.) OR a question mark (\?) immediately before the split point.
Fourth block: (\s|[A-Z].*): this pattern matches after the dot or question mark from the third block. It matches blank space (\s) OR any sequence of characters starting with an uppercase letter ([A-Z].*); since it is a capturing group, re.split also returns it.
This block is important for splitting input such as
Hello world.Hi I am here today.
i.e. it works whether or not there is a space after the dot.
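A quick check of that no-space case (note that re.split returns the captured delimiter group, which is what carries the second sentence here):

import re

text = "Hello world.Hi I am here today."
parts = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)(\s|[A-Z].*)', text)
print(parts)  # ['Hello world.', 'Hi I am here today.', '']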
A naive approach for proper English sentences that don't start with non-alphanumeric characters and don't contain quoted parts of speech:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
EndPunctuation = re.compile(r'([\.\?\!]\s+)')
NonEndings = re.compile(r'(?:Mrs?|Jr|i\.e)\.\s*$')
parts = EndPunctuation.split(text)
sentence = []
for part in parts:
    if len(part) and len(sentence) and EndPunctuation.match(sentence[-1]) and not NonEndings.search(''.join(sentence)):
        print(''.join(sentence))
        sentence = []
    if len(part):
        sentence.append(part)
if len(sentence):
    print(''.join(sentence))
False positive splitting may be reduced by extending NonEndings a bit. Other cases will require additional code. Handling typos in a sensible way will prove difficult with this approach.
You will never reach perfection with this approach. But depending on the task it might just work "enough"...
I'm not great at regular expressions, but a simpler, "brute force" version of the above is
sentence = re.compile(r"([\'\"][A-Z]|([A-Z][a-z]*\. )|[A-Z])(([a-z]*\.[a-z]*\.)|([A-Za-z0-9]*\.[A-Za-z0-9])|([A-Z][a-z]*\. [A-Za-z]*)|[^\.?]|[A-Za-z])*[\.?]")
which means:
the acceptable start units are '[A-Z] or "[A-Z]
Please note that alternation in regular expressions is tried left to right, so the order is very important when using | (or); that's why I have written the i.e.-style pattern first, and forms like Inc. come after.
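Presumably the compiled pattern is then driven with finditer; a hypothetical driver, assuming text holds the paragraph to split (finditer is used because findall would return the inner capture groups rather than whole matches):

for match in sentence.finditer(text):
    print(match.group().strip())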
Try this:
(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)
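Note that Python's built-in re module rejects this pattern, because the lookbehind alternatives have different widths and re only supports fixed-width lookbehind. A sketch using the third-party regex package, which does support variable-width lookbehind (assuming text holds the input):

import regex  # pip install regex

pattern = r'(?<!\b(?:[A-Z][a-z]|\d|[i.e]))\.(?!\b(?:com|\d+)\b)'
for s in regex.split(pattern, text):
    print(s.strip())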
I wrote this taking into consideration smci's comments above. It is a middle-of-the-road approach that doesn't require external libraries and doesn't use regex. It allows you to provide a list of abbreviations, and it accounts for sentences ended by terminators inside wrappers, such as a period followed by a quote: [.", ?', .)].
abbreviations = {'dr.': 'doctor', 'mr.': 'mister', 'bro.': 'brother', 'bro': 'brother', 'mrs.': 'mistress', 'ms.': 'miss', 'jr.': 'junior', 'sr.': 'senior', 'i.e.': 'for example', 'e.g.': 'for example', 'vs.': 'versus'}
terminators = ['.', '!', '?']
wrappers = ['"', "'", ')', ']', '}']
def find_sentences(paragraph):
    end = True
    sentences = []
    while end > -1:
        end = find_sentence_end(paragraph)
        if end > -1:
            sentences.append(paragraph[end:].strip())
            paragraph = paragraph[:end]
    sentences.append(paragraph)
    sentences.reverse()
    return sentences

def find_sentence_end(paragraph):
    [possible_endings, contraction_locations] = [[], []]
    contractions = abbreviations.keys()
    sentence_terminators = terminators + [terminator + wrapper for wrapper in wrappers for terminator in terminators]
    for sentence_terminator in sentence_terminators:
        t_indices = list(find_all(paragraph, sentence_terminator))
        possible_endings.extend(([] if not len(t_indices) else [[i, len(sentence_terminator)] for i in t_indices]))
    for contraction in contractions:
        c_indices = list(find_all(paragraph, contraction))
        contraction_locations.extend(([] if not len(c_indices) else [i + len(contraction) for i in c_indices]))
    possible_endings = [pe for pe in possible_endings if pe[0] + pe[1] not in contraction_locations]
    if len(paragraph) in [pe[0] + pe[1] for pe in possible_endings]:
        max_end_start = max([pe[0] for pe in possible_endings])
        possible_endings = [pe for pe in possible_endings if pe[0] != max_end_start]
    possible_endings = [pe[0] + pe[1] for pe in possible_endings if sum(pe) > len(paragraph) or (sum(pe) < len(paragraph) and paragraph[sum(pe)] == ' ')]
    end = (-1 if not len(possible_endings) else max(possible_endings))
    return end

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1:
            return
        yield start
        start += len(sub)
I used Karl's find_all function from this entry: Find all occurrences of a substring in Python
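For example, with the definitions above in scope (note that the abbreviation lookup is case-sensitive as written, so forms like 'Mr.' are only skipped if the capitalized spelling is added to the dictionary):

paragraph = "Smith paid 1.5 million dollars, i.e. a lot, for cheapsite.com. Did he mind? Probably not."
for s in find_sentences(paragraph):
    print(s)
# Smith paid 1.5 million dollars, i.e. a lot, for cheapsite.com.
# Did he mind?
# Probably not.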
My example is based on Ali's example, adapted for Brazilian Portuguese. Thanks, Ali.
import re

ABREVIACOES = ['sra?s?', 'exm[ao]s?', 'ns?', 'nos?', 'doc', 'ac', 'publ', 'ex', 'lv', 'vlr?', 'vls?',
'exmo(a)', 'ilmo(a)', 'av', 'of', 'min', 'livr?', 'co?ls?', 'univ', 'resp', 'cli', 'lb',
'dra?s?', '[a-z]+r\(as?\)', 'ed', 'pa?g', 'cod', 'prof', 'op', 'plan', 'edf?', 'func', 'ch',
'arts?', 'artigs?', 'artg', 'pars?', 'rel', 'tel', 'res', '[a-z]', 'vls?', 'gab', 'bel',
'ilm[oa]', 'parc', 'proc', 'adv', 'vols?', 'cels?', 'pp', 'ex[ao]', 'eg', 'pl', 'ref',
'[0-9]+', 'reg', 'f[ilĂ]s?', 'inc', 'par', 'alin', 'fts', 'publ?', 'ex', 'v. em', 'v.rev']
ABREVIACOES_RGX = re.compile(r'(?:{})\.\s*$'.format('|\s'.join(ABREVIACOES)), re.IGNORECASE)
def sentencas(texto, min_len=5):
    # based on https://stackoverflow.com/questions/25735644/python-regex-for-splitting-text-into-sentences-sentence-tokenizing
    texto = re.sub(r'\s\s+', ' ', texto)
    EndPunctuation = re.compile(r'([\.\?\!]\s+)')
    parts = EndPunctuation.split(texto)
    sentencas = []
    sentence = []
    for part in parts:
        txt_sent = ''.join(sentence)
        q_len = len(txt_sent)
        if len(part) and len(sentence) and q_len >= min_len and \
                EndPunctuation.match(sentence[-1]) and \
                not ABREVIACOES_RGX.search(txt_sent):
            sentencas.append(txt_sent)
            sentence = []
        if len(part):
            sentence.append(part)
    if sentence:
        sentencas.append(''.join(sentence))
    return sentencas
Full code at: https://github.com/luizanisio/comparador_elastic
If you want to break up sentences at three periods (not sure if this is what you want), you can use this regular expression:
import re
text = """\
Mr. Smith bought cheapsite.com for 1.5 million dollars, i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.
"""
sentences = re.split(r'\.{3}', text)
for stuff in sentences:
    print(stuff)
I hope someone can point me in the right direction.
What would be an efficient way to translate data within a row[x]?
For example, I want to convert the following: street, avenue, road, court to st, ave, rd, ct.
I was thinking of using a dictionary, the reason being that sometimes the first letter will be capitalized and sometimes it won't, e.g.: {'ave': ['Avenue', 'avenue', 'AVENUE', 'av', 'AV']}
Having said that, could I also do something (prior to translating) like converting all the data to lower case (in the original csv file), to avoid working with data that contains mixed caps?
This is for csv files with anywhere between 500-1000 lines.
Thank you.
Edit: I should add that the row[x] string would be something like '123 main street', and that is what I'm looking to translate to '123 main st'.
Edit #2:
mydict = {'avenue': 'ave', 'street': 'st', 'road': 'rd', 'court': 'ct'}
add1 = '123 MAIN ROAD'
newadd1 = []
for i in add1.lower().split(' '):
    newtext = mydict.get(i, i)  # the input is already lower-cased above
    newadd1.append(newtext)
print(' '.join(newadd1))
Thank you, everyone.
The way I would tackle it would be, as you suggested, constructing a dictionary. For example, say that I'd like to display any form of Avenue as "Ave":
mapper = {'ave': 'Ave', 'avenue': 'Ave', 'av': 'Ave', 'st': 'Street', 'street': 'Street', ...}
and then use it with every word in the address field as follows:
word = mapper.get(word.lower(), word)
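An end-to-end sketch with a trimmed-down, hypothetical mapper (here mapping toward the abbreviations the question asked for) and the question's sample address:

mapper = {'avenue': 'Ave', 'ave': 'Ave', 'av': 'Ave',
          'street': 'St', 'st': 'St', 'road': 'Rd', 'court': 'Ct'}

address = '123 MAIN ROAD'
# Look each word up case-insensitively; unknown words pass through unchanged.
print(' '.join(mapper.get(word.lower(), word) for word in address.split()))
# 123 MAIN Rd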
I'm sorry, but I haven't been able to find a working solution in any of the answers Google's been giving me (a couple of "recipes" on some site were pretty close, but way old), and I haven't found anything that gives me the result I'm looking for.
I'm renaming files, so I have a function that spits out the filename; for this I'm just using test strings:
All the dots (and underscores) and such are removed first, since those are the most common things all these professors do differently, and they make all this stuff impossible to deal with (or look at) without removing them.
5 Examples:
test_string_1 = 'legal.studies.131.race.relations.in.the.United.States.'
'legal.studies' --> 'Legal Studies'
test_string_2 = 'mediastudies the triumph of bluray over hddvd'
'mediastudies' --> 'Media Studies', 'bluray' --> 'Blu-ray', 'hddvd' --> 'HD DVD'
test_string_3 = 'computer Science Microsoft vs unix'
'computer Science' --> 'Computer Science', 'unix' --> 'UNIX'
test_string_4 = 'Perception - metamers dts'
'Perception' would already be good (but who cares); the big picture is they want to keep the audio information in there, so 'dts' --> 'DTS'
test_string_5 = 'Perception - Cue Integration - flashing dot example aac20 xvid'
'aac20' --> 'AAC2.0', 'xvid' --> 'XviD'
Currently I'm running this through something like:
new_string = re.sub(r'(?i)Legal(\s|-|)Studies', 'Legal Studies', re.sub(r'(?i)Sociology', 'Sociology', re.sub(r'(?i)Media(\s|-|)Studies', 'Media Studies', re.sub(r'(?i)UNIX', 'UNIX', re.sub(r'(?i)Blu(\s|-|)ray', 'Blu-ray', re.sub(r'(?i)HD(\s|-|)DVD', 'HD DVD', re.sub(r'(?i)xvid(\s|-|)', 'XviD', re.sub(r'(?i)aac(\s|-|)2(\s|-|\.|)0', 'AAC2.0', re.sub(r'(?i)dts', 'DTS', re.sub(r'\.', r' ', original_string.title()))))))))))
I have them all smushed together on one line because I'm not changing/updating it much, and (the way my brain/ADD works) it's easier to have it as minimal and out-of-the-way as possible while I'm doing other things, once I'm not messing with this part anymore.
So, with my example:
new_test_string_1 = 'Legal Studies 131 Race Relations In The United States'
new_test_string_2 = 'Media Studies The Triumph Of Blu-ray Over HD DVD'
new_test_string_3 = 'Computer Science Microsoft Vs UNIX'
new_test_string_4 = 'Perception - Metamers DTS'
new_test_string_5 = 'Perception - Cue Integration - Flashing Dot Example AAC2.0 XviD'
However, as I have more and more of these, it's really starting to become the kind of thing I want a dictionary or something for. I don't want to blow up the code into anything crazy, but I'd like to be able to add new replacements as real-life examples come up that need to be added (for example, there are a lot of audio codecs/containers/whatevers out there, and it looks like I might have to just throw them all in). I have no opinion about the method used for this master list/dictionary/whatever.
Big picture: I'm fixing spaces and underscores in the filenames and replacing a bunch of shit with capitalization fixes (at the moment, universally title-casing, with the exception of the re.subs I'm making, which deal with plenty of cases where the capitalization isn't perfect and there may or may not be a space, dash, or dot in the input that the output should have).
Similarly, a one-liner, unnamed function (such as a lambda) would be preferable.
P.S.
Sorry for some of the weirdness and some of the initial lack of clarity. One of the problems here is that in my major/studies most of the stuff is actually pretty straightforward; it's other classes that need all the Blu-ray, HD DVD, DTS, AAC2.0, XviD, etc.
>>> import re
>>> def string_fix(text, substitutions):
...     text_no_dots = text.replace('.', ' ').strip()
...     for key, substitution in substitutions.items():
...         text_no_dots = re.sub(key, substitution, text_no_dots, flags=re.IGNORECASE)
...     return text_no_dots
...
>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> d = {
...     r'Legal(\s|-|)Studies': 'Legal Studies',
...     r'Sociology': 'Sociology',
...     r'Media(\s|-|)Studies': 'Media Studies'
... }
>>> string_fix(teststring, d)
'Legal Studies 131 race relations in the U S'
And here is a much better way of doing it without a dictionary
>>> teststring = 'legal.studies.131.race.relations.in.the.U.S.'
>>> def repl(match):
...     return ' '.join(re.findall(r'\w+', match.group())).title()
...
>>> re.sub(r'Legal(\s|-|)Studies|Sociology|Media(\s|-|)Studies', repl, teststring.replace('.', ' ').strip(), flags=re.IGNORECASE)
'Legal Studies 131 race relations in the U S'
import re

def string_fix(filename, substitutions):
    # swap the dot separators for spaces, then apply each regex in turn
    filename = filename.replace('.', ' ')
    for key, val in substitutions.items():
        filename = re.sub(key, val, filename, flags=re.IGNORECASE)
    return filename

# renamed from 'dict' so the builtin isn't shadowed
substitutions = {
    r'Legal[\s\-_]?Studies': 'Legal Studies',
    r'Media[\s\-_]?Studies': 'Media Studies',
    r'dts': 'DTS',
    r'hd[\s\-_]?dvd': 'HD DVD',
    r'blu[\s\-_]?ray': 'Blu-ray',
    r'unix': 'UNIX',
    r'aac[\s\-_]?2[\.]?0': 'AAC2.0',
    r'xvid': 'XviD',
    r'computer[\s\-_]?science': 'Computer Science'
}

string_1 = 'legal.studies.131.race.relations.in.the.United.States.'
string_2 = 'mediastudies the triumph of bluray over hddvd'
string_3 = 'computer Science Microsoft vs unix'
string_4 = 'Perception - metamers dts'
string_5 = 'Perception - Cue Integration - flashing dot example aac20 xvid'

print(string_fix(string_1, substitutions))
print(string_fix(string_2, substitutions))
print(string_fix(string_3, substitutions))
print(string_fix(string_4, substitutions))
print(string_fix(string_5, substitutions))