I have a string (as an output of a model that generates sequences) in the format --
<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>
Because this is a collection of elements generated as a sentence/sequence, I would like to reformat it to a list/dictionary (to evaluate the quality of responses) --
[ [ent1, rel1_ent1, rel2_ent1], [ent2, rel1_ent2] ] or
{ "ent1" : ["rel1_ent1", "rel2_ent1"], "ent2" : ["rel1_ent2"] }
So far I have been approaching this by splitting the string on the <bos> and/or <eos> special tokens -- test_string.split("<bos>")[1].split("<eos>")[0].split("<rel>")[1:]. But I am not sure how to generalize this across a large set of sequences of varying length (i.e. varying numbers of rel_ents associated with a given ent).
Also, I feel there might be a cleaner way to do this (without ugly splitting and looping) -- maybe regex? Either way, I am unsure and would appreciate a better solution.
Added note: the special tokens <bos>, <new_gen>, <gen>, <eos> can be entirely removed from the generated output if that helps.
Well, there could be a smoother way that avoids, as you put it, "ugly splitting and looping", but re.finditer could be a good option here. Find each substring of interest with the pattern:
<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)
See an online demo. We can then use capture group 1 as our keys and capture group 2 as a substring that we split again into lists:
import regex as re
s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
d = {}
for match_obj in result:
    d[match_obj.group(1)] = match_obj.group(2).split(' <gen> ')
print(d)
Prints:
{'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}
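If you prefer the nested-list format also mentioned in the question, a minimal variation of the same loop (same pattern, same split) might look like this:

result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
# Build [[ent, rel1, rel2, ...], ...] instead of a dict
lst = [[m.group(1)] + m.group(2).split(' <gen> ') for m in result]
print(lst)  # [['ent1', 'rel1_ent1', 'rel2_ent1'], ['ent2', 'rel1_ent2']]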
Hi everyone, I have Python code to find the origin of a word, and I get the result as a list. Now I want to separate or split it on the comma (,).
origin=ety.origins(wordtodo)
print(origin)
>>[Word(how, Middle English (1100-1500) [enm]), Word(haugr, Old Norse [non])]
In the result, I want the text inside the (...) brackets and to store it in different variables,
e.g.
forigin=(how, Middle English (1100-1500) [enm])
and
sorigin=(haugr, Old Norse [non])
# repr() of a Word looks like "Word(how, Middle English (1100-1500) [enm])";
# slicing off the first 4 characters ("Word") leaves just the part in parentheses.
forigin = repr(origin[0])[4:]
sorigin = repr(origin[1])[4:]
Author of ety here 🙂
ety.origins returns a list of Word objects.
Get the properties of a Word with the specific fields .word and .language, or use .pretty to get a string version of the word/language in the format {word} ({lang}), e.g. 'how (Middle English (1100-1500))'.
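For example, a short sketch using those fields (assuming the attributes behave as described above; check your ety version):

import ety

origins = ety.origins("how")            # list of Word objects, as above
first, second = origins[0], origins[1]

# Individual fields
print(first.word, first.language)

# Or the ready-made "{word} ({lang})" string
forigin = first.pretty                  # e.g. 'how (Middle English (1100-1500))'
sorigin = second.pretty
print(forigin, sorigin)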
I am trying to extract elements using a regex, while needing to also distinguish which lines have "-External" at the end. The naming structure I am working with is:
<ServerName>: <Country>-<CountryCode>
or
<ServerName>: <Country>-<CountryCode>-External
For example:
test1 = 'Neo1: Brussels-BRU-External'
test2 = 'Neo1: Brussels-BRU'
match = re.search(r'(?<=: ).+', test1)
print match.group(0)
This gives me "Brussels-BRU". I am trying to extract "Brussels" and "BRU" separately, while not caring about anything to the left of the :.
Afterwards, I need to know when a line has "-External". Is there a way I can treat the existence of "-External" as True and its absence as None?
I suggest that regexes are not needed here, and that a simple split or two can get you what you are after. Here is a way to split() the line into pieces from which you can then select what you are interested in:
Code:
def split_it(a_string):
    on_colon = a_string.split(':')
    return on_colon[0], on_colon[1].strip().split('-')
Test Code:
tests = (
    'Neo1: Brussels-BRU-External',
    'Neo1: Brussels-BRU',
)

for test in tests:
    print(split_it(test))
Results:
('Neo1', ['Brussels', 'BRU', 'External'])
('Neo1', ['Brussels', 'BRU'])
Analysis:
The length of the list can be used to determine if the additional field 'External' is present.
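If you do prefer a regex (as in the original attempt), a sketch with an optional named group also works; when '-External' is absent that group is simply None, which gives the True/None style check asked about:

import re

pattern = re.compile(r': (?P<country>[^-]+)-(?P<code>[^-]+)(?P<external>-External)?$')

for test in ('Neo1: Brussels-BRU-External', 'Neo1: Brussels-BRU'):
    m = pattern.search(test)
    # m.group('external') is '-External' or None
    print(m.group('country'), m.group('code'), m.group('external') is not None)
# Brussels BRU True
# Brussels BRU False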
I am working with regular expressions in Python. I have spent the whole week and I can't understand what is wrong with my code. It seems obvious that multiple strings should match, but I only get a few of them, such as "model" and "US", and I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether there is a match for any string in the array, and which string it matched.
I have a string such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"',
        '"x_dpi"': '515.1539916992188', '"environment"': '{"sdk_version"',
        '"time_zone"': '"America\\/Wash"', '"user"': '{}}',
        '"density_default"': '560}}', '"resolution_width"': '1440',
        '"package_name"': '"com.okcupid.okcupid"',
        '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"',
        '"country"': '"US"}', '"now"': '1.515384841291E9',
        '{"extras"': '{"sessions"', '"device"': '{"android_version"',
        '"y_dpi"': '37abc5afce16xxx', '"model"': '"Nexus 6P"',
        '"new"': 'true}]', '"only_respond_with"': '["triggers"]}\n0\r\n\r\n',
        '"start_time"': '1.51538484115E9', '"version_code"': '1057',
        '"-104.99875"': '"0"', '"no_acks"': 'true}',
        '"display"': '{"resolution_height"'}
An array with multiple strings:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
    pattern = r"^.*"+str(x)+"^.*"
    if re.findall(pattern, str(values1), re.M):
        print "Match"
        print x
    else:
        print "Not Match"
Your code's goal is a bit confusing, so this answer assumes you want to check which items from the Keywords list also appear in the text dictionary.
In your code, it looks like you only compare the regex to the dictionary values, not the keys (assuming that's what the values1 variable is).
Also, instead of using the regex "^.*" to match strings, you can simply do:
for X in Keywords:
    if X in yourDictionary.keys():
        doSomething
    if X in yourDictionary.values():
        doSomethingElse
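Applied to the data from the question, a minimal sketch of that membership test might look like the following (doSomething/doSomethingElse replaced by simple prints); note that the keys and many values in text carry literal quote characters, so an exact comparison only succeeds when the keyword string is identical:

matched_keys = [k for k in Keywords if k in text]              # keyword == a dict key
matched_values = [k for k in Keywords if k in text.values()]   # keyword == a dict value

print(matched_keys)    # [] -- every key in text includes the surrounding quotes
print(matched_values)  # ['37abc5afce16xxx']

# Closer in spirit to the original findall approach: a plain substring search
found_anywhere = [k for k in Keywords if k in str(text)]
print(found_anywhere)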
I have a database full of names like:
John Smith
Scott J. Holmes
Dr. Kaplan
Ray's Dog
Levi's
Adrian O'Brien
Perry Sean Smyre
Carie Burchfield-Thompson
Björn Árnason
There are a few foreign names with accents in them that need to be converted to strings with non-accented characters.
I'd like to convert the full names (after stripping characters like " ' " , "-") to user logins like:
john.smith
scott.j.holmes
dr.kaplan
rays.dog
levis
adrian.obrien
perry.sean.smyre
carie.burchfieldthompson
bjorn.arnason
So far I have:
Fullname.strip() # get rid of leading/trailing white space
Fullname.lower() # make everything lower case
... # after bad chars converted/removed
Fullname.replace(' ', '.') # replace spaces with periods
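Note that each of these calls returns a new string rather than modifying Fullname in place, so chained together (leaving accent handling aside, which the answers below cover) a hypothetical helper might look like:

import re

def to_login(fullname):
    name = fullname.strip().lower()        # trim whitespace, lowercase
    name = re.sub(r"['\-]", "", name)      # drop apostrophes and hyphens
    return name.replace(" ", ".")          # spaces -> periods

print(to_login("Adrian O'Brien"))             # adrian.obrien
print(to_login("Carie Burchfield-Thompson"))  # carie.burchfieldthompson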
Take a look at this link [redacted]
Here is the code from the page
# Note: this is Python 2 code (has_key, unicode(), print statements).
def latin1_to_ascii(unicrap):
    """This replaces UNICODE Latin-1 characters with
    something equivalent in 7-bit ASCII. All characters in the standard
    7-bit ASCII range are preserved. In the 8th bit range all the Latin-1
    accented letters are stripped of their accents. Most symbol characters
    are converted to something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
        0xc6:'Ae', 0xc7:'C',
        0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
        0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
        0xd0:'Th', 0xd1:'N',
        0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
        0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
        0xdd:'Y', 0xde:'th', 0xdf:'ss',
        0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
        0xe6:'ae', 0xe7:'c',
        0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
        0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
        0xf0:'th', 0xf1:'n',
        0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
        0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
        0xfd:'y', 0xfe:'th', 0xff:'y',
        0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
        0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
        0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
        0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
        0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
        0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
        0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
        0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
        0xd7:'*', 0xf7:'/'
    }

    r = ''
    for i in unicrap:
        if xlate.has_key(ord(i)):
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass
        else:
            r += i
    return r

# This gives an example of how to use latin1_to_ascii().
# It creates a string with all the characters in the latin-1 character set,
# then converts the string to plain 7-bit ASCII.
if __name__ == '__main__':
    s = unicode('', 'latin-1')
    for c in range(32, 256):
        if c != 0x7f:
            s = s + unicode(chr(c), 'latin-1')
    print 'INPUT:'
    print s.encode('latin-1')
    print
    print 'OUTPUT:'
    print latin1_to_ascii(s)
If you are not afraid to install third-party modules, then have a look at the Python port of the Perl module Text::Unidecode (it's also on PyPI).
The module does nothing more than use a lookup table to transliterate the characters. I glanced over the code and it looks very simple, so I suppose it works on pretty much any OS and any Python version (crossing fingers). It's also easy to bundle with your application.
With this module you don't have to create your lookup table manually (= reduced risk of it being incomplete).
The advantage of this module compared to the unicode normalization technique is this: Unicode normalization does not replace all characters. A good example is a character like "æ". Unicode normalisation will see it as "Letter, lowercase" (Ll). This means using the normalize method will give you neither a replacement character nor a useful hint. Unfortunately, that character is not representable in ASCII. So you'll get errors.
The mentioned module does a better job at this. This will actually replace the "æ" with "ae". Which is actually useful and makes sense.
The most impressive thing I've seen is that it goes much further. It even replaces Japanese kana characters mostly properly. For example, it replaces "は" with "ha", which is perfectly fine. It's not fool-proof though, as the current version replaces "ち" with "ti" instead of "chi", so you'll have to handle it with care for the more exotic characters.
Usage of the module is straightforward:
>>> from unidecode import unidecode
>>> var_utf8 = "æは".decode("utf8")
>>> unidecode(var_utf8).encode("ascii")
'aeha'
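The session above is Python 2. On Python 3, where strings are already Unicode, the same call is simply (a sketch; unidecode returns a plain str):

from unidecode import unidecode

print(unidecode("æは"))  # aeha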
Note that I have nothing to do with this module directly. It just happens that I find it very useful.
Edit: The patch I submitted fixed the bug concerning the Japanese kana. I've only fixed the ones I could spot right away. I may have missed some.
The following function is generic:
import unicodedata
def not_combining(char):
    return unicodedata.category(char) != 'Mn'

def strip_accents(text, encoding):
    unicode_text = unicodedata.normalize('NFD', text.decode(encoding))
    return filter(not_combining, unicode_text).encode(encoding)
# in a cp1252 environment
>>> print strip_accents("déjà", "cp1252")
deja
# in a cp1253 environment
>>> print strip_accents("καλημέρα", "cp1253")
καλημερα
Obviously, you should know the encoding of your strings.
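The snippet above targets Python 2 (note the .decode()/.encode() calls and filter() returning a string). On Python 3, where text is already str, a sketch of the same NFD idea might be:

import unicodedata

def strip_accents(text):
    # Decompose characters (NFD), then drop the combining marks ('Mn')
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')

print(strip_accents("déjà"))      # deja
print(strip_accents("καλημέρα"))  # καλημερα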
I would do something like this
# coding=utf-8
def alnum_dot(name, replace={}):
    import re
    for k, v in replace.items():
        name = name.replace(k, v)
    return re.sub("[^a-z.]", "", name.strip().lower())

print alnum_dot(u"Frédrik Holmström", {
    u"ö": "o",
    " ": "."
})
The second argument is a dict of the characters you want replaced; all non a-z and "." characters that are not replaced will be stripped.
The translate method allows you to delete arbitrary characters.
Fullname.translate(None,"'-\"")
If you want to delete whole classes of characters, you might want to use the re module.
re.sub('[^a-z0-9 ]', '', Fullname.strip().lower())
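Note that the two-argument form of translate above is Python 2; on Python 3 the equivalent builds a deletion table with str.maketrans, e.g. (a sketch using a name from the question):

Fullname = "Adrian O'Brien"
# str.maketrans('', '', chars) builds a table that deletes those characters
print(Fullname.translate(str.maketrans("", "", "'-\"")))  # Adrian OBrien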