I have a string (as an output of a model that generates sequences) in the format --
<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>
Because this is a collection of elements generated as a sentence/sequence, I would like to reformat it to a list/dictionary (to evaluate the quality of responses) --
[ [ent1, rel1_ent1, rel2_ent1], [ent2, rel1_ent2] ] or
{ "ent1" : ["rel1_ent1", "rel2_ent1"], "ent2" : ["rel1_ent2"] }
So far I have been approaching this by splitting the string on the <bos> and/or <eos> special tokens -- test_string.split("<bos>")[1].split("<eos>")[0].split("<rel>")[1:]. But I am not sure how to generalize this across a large set of sequences of varying length (i.e. varying numbers of rel_ents associated with a given ent).
Also, I feel there might be a cleaner way to do this (without ugly splitting and looping) -- maybe regex? Either way, I am unsure and would appreciate a better solution.
Added note: the special tokens <bos>, <new_gen>, <gen>, <eos> can be entirely removed from the generated output if that helps.
Well, there could be a smoother way that avoids, as you put it, "ugly splitting and looping", but re.finditer could be a good option here. Find each substring of interest with the pattern:
<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)
See an online demo. We can then use capture group 1 as our keys and capture group 2 as a substring that we split again into lists:
import regex as re
s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
d = {}
for match_obj in result:
    d[match_obj.group(1)] = match_obj.group(2).split(' <gen> ')
print(d)
Prints:
{'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}
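If you prefer the nested-list format also mentioned in the question, a minimal variation of the same loop (same pattern, same split) might look like this:

result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
# Build [[ent, rel1, rel2, ...], ...] instead of a dict
lst = [[m.group(1)] + m.group(2).split(' <gen> ') for m in result]
print(lst)  # [['ent1', 'rel1_ent1', 'rel2_ent1'], ['ent2', 'rel1_ent2']]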
Hi everyone, I have Python code to find the origin of a word, and I get the result as a list. Now I want to separate or split it on the comma (,).
origin=ety.origins(wordtodo)
print(origin)
>>[Word(how, Middle English (1100-1500) [enm]), Word(haugr, Old Norse [non])]
In the result, I want the text inside the (...) brackets and to store it in different variables,
e.g.
forigin=(how, Middle English (1100-1500) [enm])
and
sorigin=(haugr, Old Norse [non])
# repr() of a Word looks like "Word(how, Middle English (1100-1500) [enm])";
# slicing off the first 4 characters ("Word") leaves just the part in parentheses.
forigin = repr(origin[0])[4:]
sorigin = repr(origin[1])[4:]
Author of ety here 🙂
ety.origins returns a list of Word objects.
Get the properties of a Word with the specific fields .word and .language, or use .pretty to get a string version of the word/language in the format {word} ({lang}), e.g. 'how (Middle English (1100-1500))'.
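For example, a short sketch using those fields (assuming the attributes behave as described above; check your ety version):

import ety

origins = ety.origins("how")            # list of Word objects, as above
first, second = origins[0], origins[1]

# Individual fields
print(first.word, first.language)

# Or the ready-made "{word} ({lang})" string
forigin = first.pretty                  # e.g. 'how (Middle English (1100-1500))'
sorigin = second.pretty
print(forigin, sorigin)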
I am trying to extract elements using a regex, while needing to also distinguish which lines have "-External" at the end. The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
match = re.search(r'(?<=: ).+', test1)
print match.group(0)
This gives me "Brussels-BRU". I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the :.
Afterwards, I need to know when a line has "-External". Is there a way I can treat the existence of "-External" as True and its absence as None?
I suggest that regexes are not needed here, and that a simple split or two can get you what you are after. Here is a way to split() the line into pieces from which you can then select what you are interested in:
Code:
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
    'Neo1: Brussels-BRU-External',
    'Neo1: Brussels-BRU',
)

for test in tests:
    print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.
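If you do prefer a regex (as in the original attempt), a sketch with an optional named group also works; when '-External' is absent that group is simply None, which gives the True/None style check asked about:

import re

pattern = re.compile(r': (?P<country>[^-]+)-(?P<code>[^-]+)(?P<external>-External)?$')

for test in ('Neo1: Brussels-BRU-External', 'Neo1: Brussels-BRU'):
    m = pattern.search(test)
    # m.group('external') is '-External' or None
    print(m.group('country'), m.group('code'), m.group('external') is not None)
# Brussels BRU True
# Brussels BRU False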
I am working with regular expressions in Python. I have spent the whole week and I can't understand what is wrong with my code. It seems obvious that multiple strings should match, but I only get a few of them, such as "model" and "US", and I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether there is a match for any string in the array, and which string it matched.
I have a string such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"',
        '"x_dpi"': '515.1539916992188', '"environment"': '{"sdk_version"',
        '"time_zone"': '"America\\/Wash"', '"user"': '{}}',
        '"density_default"': '560}}', '"resolution_width"': '1440',
        '"package_name"': '"com.okcupid.okcupid"',
        '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"',
        '"country"': '"US"}', '"now"': '1.515384841291E9',
        '{"extras"': '{"sessions"', '"device"': '{"android_version"',
        '"y_dpi"': '37abc5afce16xxx', '"model"': '"Nexus 6P"',
        '"new"': 'true}]', '"only_respond_with"': '["triggers"]}\n0\r\n\r\n',
        '"start_time"': '1.51538484115E9', '"version_code"': '1057',
        '"-104.99875"': '"0"', '"no_acks"': 'true}',
        '"display"': '{"resolution_height"'}
An array with multiple strings:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
    pattern = r"^.*"+str(x)+"^.*"
    if re.findall(pattern, str(values1), re.M):
        print "Match"
        print x
    else:
        print "Not Match"
Your code's goal is a bit confusing, so this answer assumes you want to check which items from the Keywords list also appear in the text dictionary.
In your code, it looks like you only compare the regex to the dictionary values, not the keys (assuming that's what the values1 variable is).
Also, instead of using the regex "^.*" to match strings, you can simply do:
for X in Keywords:
    if X in yourDictionary.keys():
        doSomething
    if X in yourDictionary.values():
        doSomethingElse
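Applied to the data from the question, a minimal sketch of that membership test might look like the following (doSomething/doSomethingElse replaced by simple prints); note that the keys and many values in text carry literal quote characters, so an exact comparison only succeeds when the keyword string is identical:

matched_keys = [k for k in Keywords if k in text]              # keyword == a dict key
matched_values = [k for k in Keywords if k in text.values()]   # keyword == a dict value

print(matched_keys)    # [] -- every key in text includes the surrounding quotes
print(matched_values)  # ['37abc5afce16xxx']

# Closer in spirit to the original findall approach: a plain substring search
found_anywhere = [k for k in Keywords if k in str(text)]
print(found_anywhere)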
I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like " ' " , "-") to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname.strip() # get rid of leading/trailing white space
Fullname.lower() # make everything lower case
... # after bad chars converted/removed
Fullname.replace(' ', '.') # replace spaces with periods
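Note that each of these calls returns a new string rather than modifying Fullname in place, so chained together (leaving accent handling aside, which the answers below cover) a hypothetical helper might look like:

import re

def to_login(fullname):
    name = fullname.strip().lower()        # trim whitespace, lowercase
    name = re.sub(r"['\-]", "", name)      # drop apostrophes and hyphens
    return name.replace(" ", ".")          # spaces -> periods

print(to_login("Adrian O'Brien"))             # adrian.obrien
print(to_login("Carie Burchfield-Thompson"))  # carie.burchfieldthompson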
Take a look at this link [redacted]
Here is the code from the page
# Note: this is Python 2 code (has_key, unicode(), print statements).
def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It creates a string with all the characters in the latin-1 character set,
# then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode (it's also on PyPI).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (crossing fingers). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually (= reduced risk of it being incomplete).
The advantage of this module compared to the unicode normalization technique is this: Unicode normalization does not replace all characters. A good example is a character like "æ". Unicode normalisation will see it as "Letter, lowercase" (Ll). This means using the normalize method will give you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII. So you'll get errors.
The mentioned module does a better job at this. This will actually replace the "æ" with "ae". Which is actually useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof though, as the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle it with care for the more exotic characters.
Usage of the module is straightforward:
>>> from unidecode import unidecode
>>> var_utf8 = "æは".decode("utf8")
>>> unidecode(var_utf8).encode("ascii")
'aeha'
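The session above is Python 2. On Python 3, where strings are already Unicode, the same call is simply (a sketch; unidecode returns a plain str):

from unidecode import unidecode

print(unidecode("æは"))  # aeha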
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away. I may have missed some.
The following function is generic:
import unicodedata
def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)
# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja
# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα
Obviously, you should know the encoding of your strings.
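The snippet above targets Python 2 (note the .decode()/.encode() calls and filter() returning a string). On Python 3, where text is already str, a sketch of the same NFD idea might be:

import unicodedata

def strip_accents(text):
    # Decompose characters (NFD), then drop the combining marks ('Mn')
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(strip_accents("déjà"))      # deja
print(strip_accents("καλημέρα"))  # καλημερα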
I would do something like this
# coding=utf-8
def alnum_dot(name, replace={}):
    import re
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})
The second argument is a dict of the characters you want replaced; all non a-z and "." characters that are not replaced will be stripped.
The translate method allows you to delete arbitrary characters.
Fullname.translate(None,"'-\"")
If you want to delete whole classes of characters, you might want to use the re module.
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
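Note that the two-argument form of translate above is Python 2; on Python 3 the equivalent builds a deletion table with str.maketrans, e.g. (a sketch using a name from the question):

Fullname = "Adrian O'Brien"
# str.maketrans('', '', chars) builds a table that deletes those characters
print(Fullname.translate(str.maketrans("", "", "'-\"")))  # Adrian OBrien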