I want to replace the values in the first variable with the values from the second variable, but I want to keep the commas. I used regex, but I don't know whether this is possible because I'm still learning it. Here is my code.
import re
names = 'Mat,Rex,Jay'
nicknames = 'AgentMat LegendRex KillerJay'
split_nicknames = nicknames.split(' ')
for a in range(len(split_nicknames)):
    replace = re.sub('\\w+', split_nicknames[a], names)
print(replace)
My output is:
KillerJay,KillerJay,KillerJay
And I want output like this:
AgentMat,LegendRex,KillerJay
I suspect what you are looking for should resemble something like this:
import re
testString = 'This is my complicated test string where Mat, Rex and Jay are all having a lark, but MatReyRex is not changed'
mapping = { 'Mat' : 'AgentMat',
'Jay' : 'KillerJay',
'Rex' : 'LegendRex'
}
reNames = re.compile(r'\b('+'|'.join(mapping)+r')\b')
res = reNames.sub(lambda m: mapping[m.group(0)], testString)
print(res)
Executing this results in the mapped result:
This is my complicated test string where AgentMat, LegendRex and KillerJay are all having a lark, but MatReyRex is not changed
We can build the mapping from the two strings as follows:
import re
names = 'Mat,Rex,Jay'
nicknames = 'AgentMat LegendRex KillerJay'
my_dict = dict(zip(names.split(','), nicknames.split(' ')))
replace = re.sub(r'\b\w+\b', lambda m:my_dict[m[0]], names)
print(replace)
Then use lambda to apply the mapping.
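The two answers above can be combined into one helper. This is only a sketch: the function name replace_names is made up for illustration, and it assumes the names are comma-separated and the nicknames whitespace-separated, in the same order. Unlike the \b\w+\b version, words that are not in the mapping are left alone instead of raising KeyError, and re.escape guards against names that contain regex metacharacters.

```python
import re

def replace_names(text, names, nicknames):
    # Pair each name with the nickname in the same position.
    mapping = dict(zip(names.split(','), nicknames.split()))
    # Longest keys first, so a longer name is tried before a shorter one.
    keys = sorted(mapping, key=len, reverse=True)
    # Escape the keys and match them only as whole words.
    pattern = re.compile(r'\b(?:' + '|'.join(re.escape(k) for k in keys) + r')\b')
    # The lambda looks up whichever name was actually matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

names = 'Mat,Rex,Jay'
nicknames = 'AgentMat LegendRex KillerJay'
print(replace_names(names, names, nicknames))
# AgentMat,LegendRex,KillerJay
```

Because the separators are never part of a match, the commas in the original string are kept as they are.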
I have a string, "this-is-a-big-tool", and I want to swap out THIS and TOOL for different words while keeping BIG.
import re
test = "this-is-a-big-tool"
s = [("a","b"), ("a","d"), ("c","d")]
for a,b in s:
    result = re.sub("this-[\w]+-[\w]+-[big|giant]-tool", "%s-moves-big-%s" % (a,b), test)
    print(result)
The only things I care about are THIS, BIG and TOOL. I want to swap THIS and TOOL but keep BIG, and I don't care about the other words.
So my goal is to do something like:
a-is-a-big-b
a-is-a-giant-d
c-is-a-giant-d
I have figured out the regex, but how do I pass BIG or GIANT into the replacement part of the code?
result = re.sub("this-[\w]+-[\w]+-[big|giant]-tool", "%s-moves-big-%s" % (a,b), test)
How do I pass whatever [big|giant] matched in the pattern into the hard-coded big in the replacement string?
You can try this:
import re
test = "this-is-a-big-tool"
s = [("a","b"), ("a","d"), ("c","d")]
new_results = [re.sub('this|tool', '{}', test).format(*i) for i in s]
Output:
['a-is-a-big-b', 'a-is-a-big-d', 'c-is-a-big-d']
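To answer the literal question of passing the matched word into the replacement, one option is a capturing group. The sketch below is an illustration, not the answerer's method: swap_ends is a hypothetical helper name, and it assumes the string always has the shape this-WORD-WORD-(big|giant)-tool. Note that [big|giant] in the question is a character class that matches a single character; (big|giant) is the alternation that was probably intended.

```python
import re

test = "this-is-a-big-tool"
pairs = [("a", "b"), ("a", "d"), ("c", "d")]

# Group 1 keeps the two middle words, group 2 captures "big" or "giant".
pattern = re.compile(r"this-(\w+-\w+)-(big|giant)-tool")

def swap_ends(text, first, last):
    # The lambda receives the match object, so the captured groups
    # can be copied into the replacement unchanged.
    return pattern.sub(
        lambda m: f"{first}-{m.group(1)}-{m.group(2)}-{last}", text
    )

results = [swap_ends(test, a, b) for a, b in pairs]
print(results)
# ['a-is-a-big-b', 'a-is-a-big-d', 'c-is-a-big-d']
```

With an input of "this-is-a-giant-tool" the same call keeps giant instead of big, because the word comes from the match rather than from the replacement template.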
Intro
Hello, I'm working on a project that requires me to replace dictionary keys within a pandas column of text with values - but with potential misspellings. Specifically I am matching names within a pandas column of text and replacing them with "First Name". For example, I would be replacing "tommy" with "First Name".
However, I realize there's the issue of misspelled names in the column of strings, which won't be replaced by my dictionary. For example, "tommmy" has an extra m and is not a first name in my dictionary.
import pandas as pd
#Create df
d = {'message' : pd.Series(['awesome', 'my name is tommmy , please help with...', 'hi tommy , we understand your quest...'])}
names = ["tommy", "zelda", "marcon"]
#create dict
namesdict = {r'(^|\s){}($|\s)'.format(el): r'\1FirstName\2' for el in names}
#replace
d['message'].replace(namesdict, regex = True)
#output
Out:
0 awesome
1 my name is tommmy , please help with...
2 hi FirstName , we understand your quest...
dtype: object
So "tommmy" doesn't match "tommy" in the dictionary, and I need to deal with misspellings. I thought about handling this before the actual dictionary key and value replacement, i.e. scanning through the pandas data frame and replacing the misspelled words within the column of strings ("message") with the appropriate name. I've seen a similar approach using an index on specific strings like this one
but how do you match and replace words within sentences in a pandas df, using a list of correct spellings? Can I do this within the Series replace argument? Should I stick with a regex string replace?
Any suggestions appreciated.
Update: trying Yannis's answer
I'm trying Yannis's answer, but I need to use a list from an outside source, specifically the US Census list of first names, for matching. With the list I download, however, it does not match on whole names.
d = {'message' : pd.Series(['awesome', 'my name is tommy , please help with...', 'hi tommy , we understand your quest...'])}
import requests
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
#US Census first names (5000 +)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
#turn list to string, force lower case
fnstring = ', '.join('"{0}"'.format(w) for w in firstnamelist )
fnstring = ','.join(firstnamelist)
fnstring = (fnstring.lower())
##turn to list, prepare it so it matches the name preceded by either the beginning of the string or whitespace.
names = [x.strip() for x in fnstring.split(',')]
#import jellyfish
import difflib
def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None
def fuzzy_replace(x, y):
    names = y # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
d["message"].apply(lambda x: fuzzy_replace(x, names))
Results in:
Out:
0 FirstName
1 FirstName name is tommy , please help with...
2 FirstName tommy , we understand your quest...
But if I use a smaller list like this it works:
names = ["tommy", "caitlyn", "kat", "al", "hope"]
d["message"].apply(lambda x: fuzzy_replace(x, names))
Is it something with the longer list of names that's causing a problem?
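One likely explanation, offered as a guess rather than a diagnosis: difflib.get_close_matches uses a default cutoff of 0.6, and with thousands of names almost every short word is within that distance of some name, so the first token of each message gets replaced. The sketch below uses a tiny made-up name list ("amy" stands in for the many short names in the census file) and a hypothetical stricter cutoff of 0.85 to show the effect. It also checks every token instead of stopping at the first hit.

```python
import difflib

def fuzzy_replace(text, names, cutoff=0.85, placeholder="FirstName"):
    # Compare every whitespace-separated token against the name list and
    # replace it only if the best candidate reaches the similarity cutoff.
    out = []
    for token in text.split():
        close = difflib.get_close_matches(token.lower(), names, n=1, cutoff=cutoff)
        out.append(placeholder if close else token)
    return " ".join(out)

names = ["tommy", "amy"]  # illustrative list, not the census data
msg = "my name is tommmy , please help"

loose = fuzzy_replace(msg, names, cutoff=0.6)    # difflib's default cutoff
strict = fuzzy_replace(msg, names, cutoff=0.85)  # assumed stricter value

print(loose)   # FirstName name is FirstName , please help
print(strict)  # my name is FirstName , please help
```

At the default cutoff "my" is considered close to "amy" (similarity 0.8), while "tommmy" against "tommy" scores about 0.91 and survives the stricter setting. The right cutoff for a real name list would need to be tuned on real messages.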
Edit:
Changed my solution to use difflib. The core idea is to tokenize your input text and match each token against a list of names. If best_match finds a match then it reports the position (and the best matching string), so then you can replace the token with "FirstName" or anything you want. See the complete example below:
import pandas as pd
import difflib
df = pd.DataFrame(data=[(0,"my name is tommmy , please help with"), (1, "hi FirstName , we understand your quest")], columns=["A", "message"])
def best_match(tokens, names):
    for i,t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None
def fuzzy_replace(x):
    names = ["tommy", "john"] # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, names)
    if res is not None:
        pos, replacement = res
        tokens[pos] = "FirstName"
        return u" ".join(tokens)
    return x
df.message.apply(lambda x: fuzzy_replace(x))
And the output you should get is the following
0 my name is FirstName , please help with
1 hi FirstName , we understand your quest
Name: message, dtype: object
Edit 2
After the discussion, I decided to have another go, using NLTK for part-of-speech tagging and running the fuzzy matching against the name list only for tokens tagged NNP (proper nouns). The problem is that the tagger sometimes gets the tag wrong, e.g. "Hi" might also be tagged as a proper noun. However, if the names in the list are lowercased, get_close_matches doesn't match Hi against a name but still matches all the other names. I recommend not lowercasing df["message"], to increase the chances that NLTK tags the names properly. One can also experiment with StanfordNER, but nothing will work 100% of the time. Here is the code:
import pandas as pd
import difflib
from nltk import pos_tag, wordpunct_tokenize
import requests
import re
r = requests.get('http://deron.meranda.us/data/census-derived-all-first.txt')
# US Census first names (5000 +)
firstnamelist = re.findall(r'\n(.*?)\s', r.text, re.DOTALL)
# turn list to string, force lower case
# simplified things here
names = [w.lower() for w in firstnamelist]
df = pd.DataFrame(data=[(0,"My name is Tommmy, please help with"),
(1, "Hi Tommy , we understand your question"),
(2, "I don't talk to Johhn any longer"),
(3, 'Michale says this is stupid')
], columns=["A", "message"])
def match_names(token, tag):
    print(token, tag)
    if tag == "NNP":
        best_match = difflib.get_close_matches(token, names, n=1)
        if len(best_match) > 0:
            return "FirstName" # or best_match[0] if you want to return the name found
        else:
            return token
    else:
        return token
def fuzzy_replace(x):
    tokens = wordpunct_tokenize(x)
    pos_tokens = pos_tag(tokens)
    # Every token is a tuple (token, tag)
    result = [match_names(token, tag) for token, tag in pos_tokens]
    x = u" ".join(result)
    return x
df['message'].apply(lambda x: fuzzy_replace(x))
And I get in the output:
0 My name is FirstName , please help with
1 Hi FirstName , we understand your question
2 I don ' t talk to FirstName any longer
3 FirstName says this is stupid
Name: message, dtype: object
Okay, so I have the following piece of code.
out = out + re.sub('\{\{([A-z]+)\}\}', values[re.search('\{\{([A-z]+)\}\}',item).group().strip('{}')],item) + " "
Or, more broken down:
out = out + re.sub(
    '\{\{([A-z]+)\}\}',
    values[
        re.search(
            '\{\{([A-z]+)\}\}',
            item
        ).group().strip('{}')
    ],
    item
) + " "
So, basically, if you give it a string that contains {{reference}}, it will find instances of that and replace them with the given reference. The issue with it in its current form is that it only works based on the first reference. For example, say my values dictionary was
values = {
'bob': 'steve',
'foo': 'bar'
}
and we passed it the string
item = 'this is a test string for {{bob}}, made using {{foo}}'
I want it to put into out
'this is a test string for steve, made using bar'
but what it currently outputs is
'this is a test string for steve, made using steve'
How can I change the code so that it takes into account the position in the loop?
Note that a word split would not work, as the code needs to work even if the input is {{foo}}{{steve}}
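I got the output using the following code:
import re
replace_dict = { 'bob': 'steve','foo': 'bar'}
item = 'this is a test string for {{foo}}, made using {{steve}}'
# [A-Za-z] instead of [A-z], which also matches some punctuation characters
replace_lst = re.findall(r'\{\{([A-Za-z]+)\}\}', item)
for r in replace_lst:
    # references that are not in the dictionary (here: steve) are left as they are
    if r in replace_dict:
        item = item.replace('{{' + r + '}}', replace_dict[r])
print(item)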
How's this?
import re
values = {
'bob': 'steve',
'foo': 'bar'
}
item = 'this is a test string for {{bob}}, made using {{foo}}'
pat = re.compile(r'\{\{(.*?)\}\}')
fields = pat.split(item)
# split() with a capturing group puts the captured names at the odd indexes
for i in range(1, len(fields), 2):
    fields[i] = values[fields[i]]
print(''.join(fields))
If you can change the format of a reference from {{reference}} to {reference}, you can do this with just the format method (instead of using regex):
values = {
'bob': 'steve',
'foo': 'bar'
}
item = 'this is a test string for {bob}, made using {foo}'
print(item.format(**values))
# prints: this is a test string for steve, made using bar
In your code, re.search will start looking from the beginning of the string each time you call it, thus always returning the first match {{bob}}.
You can access the match object you are currently replacing by passing a function as replacement to re.sub:
import re
values = { 'bob': 'steve','foo': 'bar'}
item = 'this is a test string for {{bob}}, made using {{foo}}'
pattern = r'{{([A-Za-z]+)}}'
# replacement function: called once per match, with that match object
def get_value(match):
    return values[match.group(1)]
result = re.sub(pattern, get_value, item)
print(result)  # this is a test string for steve, made using bar
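A possible variation on the replacement-function approach, for inputs such as {{foo}}{{steve}} where one reference has no entry in the dictionary. This is a sketch under one assumption that the question does not settle: unknown references are left untouched rather than raising KeyError. The helper name fill is made up for illustration.

```python
import re

values = {'bob': 'steve', 'foo': 'bar'}
pattern = re.compile(r'\{\{([A-Za-z]+)\}\}')

def fill(template, values):
    # m.group(1) is the name inside the braces, m.group(0) the whole {{name}}.
    # Falling back to m.group(0) leaves unknown references as they were.
    return pattern.sub(lambda m: values.get(m.group(1), m.group(0)), template)

print(fill('this is a test string for {{bob}}, made using {{foo}}', values))
# this is a test string for steve, made using bar
print(fill('{{foo}}{{steve}}', values))
# bar{{steve}}
```

Because re.sub calls the function once per match, adjacent references with no whitespace between them are handled the same way as separated ones.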
I am trying to replace any i's in a string with capital I's. I have the following code:
str.replace('i ','I ')
However, it does not replace anything in the string. I am looking to include a space after the I to differentiate between any I's in words and out of words.
Thanks if you can provide help!
The exact code is:
new = old.replace('i ','I ')
new = old.replace('-i-','-I-')
new = old.replace('i ','I ')
new = old.replace('-i-','-I-')
You throw away the first new when you assign the result of the second operation over it.
Either do
new = old.replace('i ','I ')
new = new.replace('-i-','-I-')
or
new = old.replace('i ','I ').replace('-i-','-I-')
or use regex.
I think you need something like this.
>>> import re
>>> s = "i am what i am, indeed."
>>> re.sub(r'\bi\b', 'I', s)
'I am what I am, indeed.'
This replaces only standalone i's with I; the i's that are part of other words are left untouched.
For your example from comments, you may need something like this:
>>> s = 'i am sam\nsam I am\nThat Sam-i-am! indeed'
>>> re.sub(r'\b(-?)i(-?)\b', r'\1I\2', s)
'I am sam\nsam I am\nThat Sam-I-am! indeed'
I have the following string which forces my Python script to quit:
"625 625 QUAIL DR UNIT B"
I need to delete the extra spaces in the middle of the string so I am trying to use the following split join script:
import arcgisscripting
import logging
logger = logging.getLogger()
gp = arcgisscripting.create(9.3)
gp.OverWriteOutput = True
gp.Workspace = "C:\ZP4"
fcs = gp.ListWorkspaces("*","Folder")
for fc in fcs:
    print fc
    rows = gp.UpdateCursor(fc + "//Parcels.shp")
    row = rows.Next()
    while row:
        Name = row.GetValue('SIT_FULL_S').join(s.split())
        print Name
        row.SetValue('SIT_FULL_S', Name)
        rows.updateRow(row)
        row = rows.Next()
    del row
    del rows
Your source code and your error do not match; the error states that you didn't define the variable SIT_FULL_S.
I am guessing that what you want is:
Name = ' '.join(row.GetValue('SIT_FULL_S').split())
Use the re module...
>>> import re
>>> s = 'A   B   C'
>>> re.sub(r'\s+', ' ', s)
'A B C'
I believe you should use a regular expression to match every run of two or more whitespace characters and replace each occurrence with a single space.
This can be done in one short line:
re.sub(r'\s{2,}', ' ', your_string)
It's a bit unclear, but I think what you need is:
" ".join(row.GetValue('SIT_FULL_S').split())
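The split/join and regex suggestions above are not quite equivalent, which may matter for address fields. A small comparison, using a made-up sample string (the exact spacing of the original value is not known) and hypothetical helper names:

```python
import re

def collapse_with_split(text):
    # split() with no arguments drops leading/trailing whitespace
    # and splits on any run of whitespace.
    return " ".join(text.split())

def collapse_with_regex(text):
    # Only runs of two or more whitespace characters are replaced,
    # so a collapsed leading or trailing run still leaves one space.
    return re.sub(r"\s{2,}", " ", text)

sample = "  625  QUAIL   DR UNIT B  "
print(repr(collapse_with_split(sample)))  # '625 QUAIL DR UNIT B'
print(repr(collapse_with_regex(sample)))  # ' 625 QUAIL DR UNIT B '
```

If the value should also be trimmed, the split/join form does both steps at once; the regex form would need an extra .strip().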