I am trying to extract the text between two keywords. My current solution uses split() to turn the string into a list. It works, but I was wondering whether there is a more efficient or elegant way to achieve this. Below is my code:
words = "Your meeting with Dr Green at 8pm"
list_words = words.split()
before = "with"
after = "at"
title = list_words[list_words.index(before) + 1]
name = list_words[list_words.index(after) - 1]
if title != name:
    var = title + " " + name
    print(var)
else:
    print(title)
Results:
>>> Dr Green
I'd prefer a solution that is configurable, as the text I'm searching for can be dynamic: Dr Green could be replaced by a name with 4 words or 1 word.
Sounds like a job for regular expressions. This uses the pattern (?:with)(.*?)(?:at) to look for 'with', and 'at', and lazily match anything in-between.
import re
words = 'Your meeting with Dr Green at 8pm'
start = 'with'
end = 'at'
pattern = r'(?:{})(.*?)(?:{})'.format(start, end)
match = re.search(pattern, words).group(1).strip()
print(match)
Outputs:
Dr Green
Note that the regex actually matches the spaces on either side of Dr Green, so I've included a simple .strip() to remove the leading and trailing whitespace.
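If the keywords come from user input, it may be worth escaping them and anchoring them on word boundaries, so that 'at' cannot match inside a longer word such as 'Matthew'. A minimal sketch, assuming a hypothetical helper named text_between (it is not part of the answer above) and keywords that are ordinary words:

```python
import re

def text_between(text, start, end):
    """Return the text between the words `start` and `end`, or None if absent."""
    # re.escape protects keywords that contain regex metacharacters;
    # \b keeps 'at' from matching inside words such as 'Matthew'.
    pattern = r'\b{}\b\s*(.*?)\s*\b{}\b'.format(re.escape(start), re.escape(end))
    match = re.search(pattern, text)
    return match.group(1) if match else None

print(text_between('Your meeting with Dr Matthew Green at 8pm', 'with', 'at'))
# Dr Matthew Green
```

Returning None instead of raising also avoids the AttributeError that re.search(...).group(1) produces when a keyword is missing.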
Using RE
import re
words = "Your meeting with Dr Green at 8pm"
before = "Dr"
after = "at"
result = re.search('%s(.*)%s' % (before, after), words).group(1)
print before + result
Output :
Dr Green
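One thing to watch with this version: .* is greedy, so if the closing keyword occurs more than once, the match runs up to its last occurrence. A small sketch with an invented sentence, comparing it with the lazy .*? form:

```python
import re

# Invented example sentence in which 'at' appears twice.
words = "Your meeting with Dr Green at 8pm at the clinic"

# Greedy: stops at the LAST 'at'.
greedy = re.search(r'with(.*)at', words).group(1).strip()
# Lazy: stops at the FIRST 'at' after 'with'.
lazy = re.search(r'with(.*?)at', words).group(1).strip()

print(greedy)  # Dr Green at 8pm
print(lazy)    # Dr Green
```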
How about slicing the string at start and end, then just splitting it?
words = "Your meeting with Dr Jebediah Caruseum Green at 8pm"
start = "with"
end = "at"
list_of_stuff = words[words.index(start):words.index(end)].replace(start, '', 1).split()
list_of_stuff
['Dr', 'Jebediah', 'Caruseum', 'Green']
You can do anything you like with the list. For example I would parse for title like this:
list_of_titles = ['Dr', 'Sr', 'GrandMaster', 'Pleb']
try:
    title = [i for i in list_of_stuff if i in list_of_titles][0]
except IndexError:
    # title not found, skipping
    title = ''
name = ' '.join([x for x in list_of_stuff if x != title])
print(title, name)
I have code that looks like this:
data = u"Species:cat color:orange and white with yellow spots number feet: 4"
from spacy.matcher import PhraseMatcher
import en_core_web_sm
nlp = en_core_web_sm.load()
data=data.lower()
matcher = PhraseMatcher(nlp.vocab)
terminology_list = [u"species",u"color", u"number feet"]
patterns = list(nlp.tokenizer.pipe(terminology_list))
matcher.add("TerminologyList", None, *patterns)
doc = nlp(data)
for idd, (match_id, start, end) in enumerate(matcher(doc)):
    span = doc[start:end]
    print(span.text)
I want to be able to grab everything until the next match. So that the match looks like this:
species:cat
color:orange and white with yellow spots
number feet: 4
I was trying to extend the span but I don't know how to say stop before the next match. I know that I can have it be like span = doc[start:end+4] or something but that is hard-coding how far ahead to go and I won't know how far I should extend the index.
Thank you
I have an idea that does not use spaCy.
First I split the string into tokens:
split = "Species:cat color:orange and white with yellow spots number feet: 4".replace(": ", ":").split()
Then I iterate over the list of tokens, save the key, and keep appending the following tokens to it until a new key appears:
goal = []
key_value = None
for token in split:
    if ":" in token:
        # a new key starts here; store the previous key-value pair first
        if key_value:
            goal.append(key_value)
        key_value = token
    else:
        key_value += " " + token
# store the last key-value pair
goal.append(key_value)
goal
>>>
['Species:cat', 'color:orange and white with yellow spots number', 'feet:4']
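In the output above, "number" ends up attached to the color value because the key "number feet" contains a space. If the keys are known in advance (as in the question's terminology list), one option is to build a regex alternation from them. This is only a sketch of that idea; the names keys, alternation and pairs are chosen here for illustration:

```python
import re

data = "Species:cat color:orange and white with yellow spots number feet: 4"
keys = ["species", "color", "number feet"]

# Longest keys first, so a longer key wins over any shorter key it contains.
alternation = "|".join(re.escape(k) for k in sorted(keys, key=len, reverse=True))

# A key and its colon, then (lazily) everything up to the next key or the end.
pattern = re.compile(
    r"({keys})\s*:\s*(.*?)\s*(?=(?:{keys})\s*:|$)".format(keys=alternation),
    re.IGNORECASE,
)

pairs = [(m.group(1), m.group(2)) for m in pattern.finditer(data)]
print(pairs)
# [('Species', 'cat'), ('color', 'orange and white with yellow spots'), ('number feet', '4')]
```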
I found that the spaCy matcher returns its matches ordered by their position in the document, regardless of the order in which the terms appear in the terminology list. So I can simply end each span right before the start index of the next match.
Code to show what I mean:
data = u"Species:cat color:orange and white with yellow spots number feet: 4"
from spacy.matcher import PhraseMatcher
import en_core_web_sm
nlp = en_core_web_sm.load()
data=data.lower()
matcher = PhraseMatcher(nlp.vocab)
terminology_list = [u"species",u"color", u"number feet"]
patterns = list(nlp.tokenizer.pipe(terminology_list))
matcher.add("Terms", None, *patterns)
doc = nlp(data)
matches=matcher(doc)
matched_phrases={}
for idd, (match_id, start, end) in enumerate(matches):
    key_match = doc[start:end]
    if idd != len(matches)-1:
        end_index=matches[idd+1][1]
    else:
        end_index=len(doc)
    phrase = doc[end:end_index]
    if phrase.text != '':
        matched_phrases[key_match] = phrase
print(matched_phrases)
I want to print a specific word in a different color every time it appears in the text. The existing code prints the lines that contain the relevant word, "one".
import json
from colorama import Fore
fh = open(r"fle.json")
corpus = json.loads(fh.read())
for m in corpus['smsCorpus']['message']:
    identity = m['#id']
    text = m['text']['$']
    strtext = str(text)
    utterances = strtext.split()
    if 'one' in utterances:
        print(identity,text, sep ='\t')
I imported Fore but I don't know where to use it. I want to use it to have the word "one" in a different color.
Output (excerpt):
44814 Ohhh that's the one Johnson told us about...can you send it to me?
44870 Kinda... I went but no one else did, I so just went with Sarah to get lunch xP
44951 No, it was directed in one place loudly and stopped when I stoppedmore or less
44961 Because it raised awareness but no one acted on their new awareness, I guess
44984 We need to do a fob analysis like our mcs onec
Thank you
You could also just use the ANSI color codes in your strings:
# define aliases to the color-codes
red = "\033[31m"
green = "\033[32m"
blue = "\033[34m"
reset = "\033[39m"
t = "That was one hell of a show for a one man band!"
utterances = t.split()
if "one" in utterances:
    # figure out the list indices of occurrences of "one"
    idxs = [i for i, x in enumerate(utterances) if x == "one"]
    # modify the occurrences by wrapping them in ANSI sequences
    for i in idxs:
        utterances[i] = red + utterances[i] + reset
# join the list back into a string and print
utterances = " ".join(utterances)
print(utterances)
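Comparing whole tokens with == misses occurrences that are followed by punctuation, such as "one," or "one?". A possible variant, sketched here with a hypothetical helper named colour_word, uses re.sub with word boundaries so that every standalone occurrence is wrapped while words like "everyone" are left alone:

```python
import re

RED = "\033[31m"
RESET = "\033[39m"

def colour_word(text, word, colour=RED):
    """Wrap every standalone occurrence of `word` in ANSI colour codes."""
    # \b matches at word boundaries, so 'one' inside 'everyone' is skipped.
    pattern = r"\b{}\b".format(re.escape(word))
    return re.sub(pattern, lambda m: colour + m.group(0) + RESET, text)

print(colour_word("Which one? That one, everyone said.", "one"))
```

Note that a hyphen counts as a word boundary, so "no-one" would also be coloured with this pattern.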
If you only have one coloured word per string, I think you can use this; you can expand the logic for n coloured words:
from colorama import Fore, Style

our_str = "Ohhh that's the one Johnson told us about...can you send it to me?"
def colour_one(our_str):
    if "one" in our_str:
        # split only at the first occurrence so the unpacking cannot fail
        str1, str2 = our_str.split("one", 1)
        new_str = str1 + Fore.RED + 'one' + Style.RESET_ALL + str2
    else:
        new_str = our_str
    return new_str
I think this is an ugly solution, and I'm not even sure it works, but it's an option if you can't find anything else.
I use the colour module from this link or the colored module from that link.
Furthermore, if you don't want to use a module for coloring, you can refer to this link or that link.
I am trying to allow the user to do this:
Let's say the text initially says:
"hello world hello earth"
When the user searches for "hello", it should display:
|hello| world |hello| earth
Here's what I have:
m = re.compile(pattern)
i = 0
match = False
while i < len(self.fcontent):
    content = " ".join(self.fcontent[i])
    i = i + 1
    for find in m.finditer(content):
        print i,"\t"+content[:find.start()]+"|"+content[find.start():find.end()]+"|"+content[find.end():]
        match = True
        pr = raw_input( "(n)ext, (p)revious, (q)uit or (r)estart? ")
        if (pr == 'q'):
            break
        elif (pr == 'p'):
            i = i - 2
        elif (pr == 'r'):
            i = 0
if match is False:
    print "No matches in the file!"
where:
pattern = user-specified pattern
fcontent = contents of a file, read in and stored as a list of lines, each of which is a list of words, e.g.:
[['line','1'],['line','2','here'],['line','3']]
However, it prints:
|hello| world hello earth
hello world |hello| earth
How can I merge the two lines so that they are displayed as one?
Thanks
Edit:
This is part of a larger search function where the pattern (in this case the word "hello") is passed in by the user, so I have to use regex search/match/finditer to find the pattern. Sadly, replace and other string methods won't work, because the user can choose to search for "[0-9]$", which would mean putting the ending number between |'s.
If you're just doing that, use str.replace.
print self.content.replace(m.find, "|%s|" % m.find)
You can use a regular expression as follows:
import re
src = "hello world hello earth"
dst = re.sub('hello', '|hello|', src)
print dst
or use string replace:
dst = src.replace('hello', '|hello|')
Ok, going back to original solution since OP confirmed that word would stand on its own (ie not be a substring of another word).
target = 'hello'
line = 'hello world hello earth'
rep_target = '|{}|'.format(target)
line = line.replace(target, rep_target)
yields:
|hello| world |hello| earth
As has been pointed out, based on your example, using str.replace is the easiest. If more complex criteria are required, then you can adapt the following...
import re
def highlight(string, words, boundary='|'):
    # basestring exists only in Python 2; use str on Python 3
    if isinstance(words, basestring):
        words = [words]
    # join the alternatives with the regex alternation operator '|', independent of `boundary`
    rs = '({})'.format('|'.join(sorted(map(re.escape, words), key=len, reverse=True)))
    return re.sub(rs, lambda L: '{0}{1}{0}'.format(boundary, L.group(1)), string)
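Since the question's edit says the search term is itself a regular expression (for example "[0-9]$"), the words must not be escaped in that case. A minimal Python 3 sketch, using a hypothetical function name highlight_matches, that wraps whatever each match actually contains:

```python
import re

def highlight_matches(text, pattern, boundary="|"):
    """Surround every match of the regex `pattern` with `boundary`."""
    # m.group(0) is the exact text matched, so this works for any pattern.
    return re.sub(pattern, lambda m: boundary + m.group(0) + boundary, text)

print(highlight_matches("hello world hello earth", "hello"))  # |hello| world |hello| earth
print(highlight_matches("room 42 floor 7", "[0-9]$"))         # room 42 floor |7|
```

Because re.sub handles all matches in one pass, the whole line is printed once with every match marked, rather than once per match.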
I have the following string which forces my Python script to quit:
"625 625 QUAIL DR UNIT B"
I need to delete the extra spaces in the middle of the string so I am trying to use the following split join script:
import arcgisscripting
import logging
logger = logging.getLogger()
gp = arcgisscripting.create(9.3)
gp.OverWriteOutput = True
gp.Workspace = "C:\ZP4"
fcs = gp.ListWorkspaces("*","Folder")
for fc in fcs:
    print fc
    rows = gp.UpdateCursor(fc + "//Parcels.shp")
    row = rows.Next()
    while row:
        Name = row.GetValue('SIT_FULL_S').join(s.split())
        print Name
        row.SetValue('SIT_FULL_S', Name)
        rows.updateRow(row)
        row = rows.Next()
    del row
    del rows
Your source code and your error do not match: the error states that you didn't define the variable SIT_FULL_S.
I am guessing that what you want is:
Name = ' '.join(row.GetValue('SIT_FULL_S').split())
Use the re module...
>>> import re
>>> str = 'A B C'
>>> re.sub(r'\s+', ' ', str)
'A B C'
I believe you should use regular expressions to match all the places where you find two or more spaces and then replace each occurrence with a single space.
This can be done with a short piece of code:
re.sub(r'\s{2,}', ' ', your_string)
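The split/join and re.sub approaches in this thread are close but not identical: split() with no arguments also drops leading and trailing whitespace, while re.sub only collapses runs. A short comparison, using an address string whose spacing is invented for illustration:

```python
import re

# Spacing invented for illustration, including two trailing spaces.
address = "625   625 QUAIL DR     UNIT B  "

joined = " ".join(address.split())       # collapses runs and trims both ends
subbed = re.sub(r"\s+", " ", address)    # collapses runs, keeps one trailing space

print(repr(joined))  # '625 625 QUAIL DR UNIT B'
print(repr(subbed))  # '625 625 QUAIL DR UNIT B '
```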
It's a bit unclear, but I think what you need is:
" ".join(row.GetValue('SIT_FULL_S').split())
I’m going to explain in detail what I want to achieve.
I have 2 programs about dictionaries.
The code for program 1 is here:
import re
words = {'i':'jeg','am':'er','happy':'glad'}
text = "I am happy.".split()
translation = []
for word in text:
    word_mod = re.sub('[^a-z0-9]', '', word.lower())
    punctuation = word[-1] if word[-1].lower() != word_mod[-1] else ''
    if word_mod in words:
        translation.append(words[word_mod] + punctuation)
    else:
        translation.append(word)
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
This program has the following advantages:
You can write more than one sentence.
The first letter after “.” is capitalized.
The program appends untranslated words to the output (“translation = []”).
Here is the code for program 2:
words = {('i',): 'jeg', ('read',): 'leste', ('the', 'book'): 'boka'}
max_group = len(max(words))
text = "I read the book".lower().split()
translation = []
position = 0
while text:
    for m in range(max_group - 1, -1, -1):
        word_mod = tuple(text[:position + m])
        if word_mod in words:
            translation.append(words[word_mod])
            text = text[position + m:]
    position += 1
translation = ' '.join(translation).split('. ')
print('. '.join(s.capitalize() for s in translation))
With this code you can translate idiomatic expressions, or “the book” to “boka”.
Here is how the program processes the input.
This is the output:
1
('i',)
['jeg']
['read', 'the', 'book']
0
()
1
('read', 'the')
0
('read',)
['jeg', 'leste']
['the', 'book']
1
('the', 'book')
['jeg', 'leste', 'boka']
[]
0
()
Jeg leste boka
What I want is to carry over some of the code from program 1 into program 2.
I have tried many times with no success…
Here is my dream…:
If I change the text to the following…:
text = "I read the book. I read the book! I read the book? I read the book.".lower().split()
I want the output to be:
Jeg leste boka. Jeg leste boka! Jeg leste boka? Jeg leste boka.
So please, tweak your brain and help me with a solution…
I appreciate any reply very much!
Thank you very much in advance!
My solution flow would be something like this:
dict = ...
max_group = len(max(dict))
input = ...
textWPunc = input.lower().split()
textOnly = [re.sub('[^a-z0-9]', '', x) for x in input.lower().split()]
translation = []
while textOnly:
    for m in [max_group..0]:
        if textOnly[:m] in words:
            check for punctuation here using textWPunc[:m]
            if punctuation present in textWPunc[:m]:
                Append translated words + punctuation
            else:
                Append only translated words
            textOnly = textOnly[m:]
            textWPunc = textWPunc[m:]
join translation to finish
The key part is that you keep two parallel lists of words: one that you check for words to translate, and another that you check for punctuation whenever the translation search comes up with a hit. To check for punctuation, I fed the word group I was examining into re.sub() like so: re.sub('[a-z0-9]', '', wordGroup), which strips out all letters and digits and leaves only the punctuation.
Last thing was that your indexing looks kind of weird to me with that position variable. Since you're truncating the source string as you go, I'm not sure that's really necessary. Just check the leftmost x words as you go instead of using that position variable.
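The flow described above could be made runnable roughly as follows. This is only a sketch under some assumptions: the function name translate is invented here, punctuation is only looked for on the last word of a matched group, and unknown words are passed through unchanged (as in program 1).

```python
import re

words = {('i',): 'jeg', ('read',): 'leste', ('the', 'book'): 'boka'}
# Length of the longest key, i.e. the largest word group to try.
max_group = max(len(key) for key in words)

def translate(sentence):
    with_punc = sentence.lower().split()
    # Parallel list with punctuation stripped, used for dictionary lookups.
    text_only = [re.sub(r'[^a-z0-9]', '', w) for w in with_punc]
    translation = []
    while text_only:
        # Try the longest word group first, then shorter ones.
        for m in range(max_group, 0, -1):
            group = tuple(text_only[:m])
            if group in words:
                n = len(group)
                # Whatever is left after removing letters and digits is punctuation.
                punctuation = re.sub(r'[a-z0-9]', '', with_punc[n - 1])
                translation.append(words[group] + punctuation)
                break
        else:
            # No group matched: keep the first word untranslated.
            n = 1
            translation.append(with_punc[0])
        text_only = text_only[n:]
        with_punc = with_punc[n:]
    joined = ' '.join(translation)
    # Capitalize the first letter of the text and after '.', '!' or '?'.
    return re.sub(r'(^|[.!?]\s+)([a-z])',
                  lambda mo: mo.group(1) + mo.group(2).upper(), joined)

print(translate("I read the book. I read the book! I read the book? I read the book."))
# Jeg leste boka. Jeg leste boka! Jeg leste boka? Jeg leste boka.
```

Because both lists are truncated by the same amount on every pass, no separate position variable is needed.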