Matching multiple strings from an array against multiline text (regular expression) in Python - python

I am working with regular expressions in Python. I have spent a whole week on this and can't understand what is wrong with my code. It seems obvious that multiple strings should match, but I only get a few of them, such as "model" and "US"; I can't match 37abc5afce16xxx or "-104.99875". My goal is just to tell whether any string in the array matches and, if so, which one.
I have text such as:
text = {'"version_name"': '"8.5.2"', '"abi"': '"arm64-v8a"', '"x_dpi"':
'515.1539916992188', '"environment"': '{"sdk_version"',
'"time_zone"':
'"America\\/Wash"', '"user"': '{}}', '"density_default"': '560}}',
'"resolution_width"': '1440', '"package_name"':
'"com.okcupid.okcupid"', '"d44bcbfb-873454-4917-9e02-2066d6605d9f"': '{"language"', '"country"':
'"US"}', '"now"': '1.515384841291E9', '{"extras"': '{"sessions"',
'"device"': '{"android_version"', '"y_dpi"': '37abc5afce16xxx',
'"model"': '"Nexus 6P"', '"new"': 'true}]', '"only_respond_with"':
'["triggers"]}\n0\r\n\r\n', '"start_time"': '1.51538484115E9',
'"version_code"': '1057', '"-104.99875"': '"0"', '"no_acks"': 'true}',
'"display"': '{"resolution_height"'}
The array contains multiple strings:
Keywords =["37abc5afce16xxx","867686022684243", "ffffffff-f336-7a7a-0f06-65f40033c587", "long", "Lat", "uuid", "WIFI", "advertiser", "d44bcbfb-873454-4917-9e02-2066d6605d9f","deviceFinger", "medialink", "Huawei","Andriod","US","local_ip","Nexus", "android2.10.3","WIFI", "operator", "carrier", "angler", "MMB29M", "-104.99875"]
My code is:
for x in Keywords:
    pattern = r"^.*"+str(x)+"^.*"
    if re.findall(pattern, str(values1),re.M):
        print "Match"
        print x
    else:
        print "Not Match"

The goal of your code is a bit unclear, so this answer assumes you want to check which items from the Keywords list also appear in the text dictionary.
In your code, it looks like you only compare the regex against the dictionary values, not the keys (assuming that is what the values1 variable holds).
Also, instead of using the regex "^.*" to match strings, you can simply do:
for X in Keywords:
    if X in yourDictionary.keys():
        doSomething
    if X in yourDictionary.values():
        doSomethingElse
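Two details may explain why only some keywords match. The second "^" in the question's pattern is a line-start anchor, so the pattern can only succeed where the keyword is immediately followed by the start of a line. And "X in yourDictionary.values()" tests whole-item equality, so Nexus does not equal '"Nexus 6P"'. A minimal sketch of a substring search over both keys and values follows, in Python 3 syntax; the helper name find_keywords and the trimmed dictionary are illustrative and not from the original post.

```python
import re

# Trimmed-down copy of the dictionary from the question.
text = {
    '"model"': '"Nexus 6P"',
    '"country"': '"US"}',
    '"y_dpi"': '37abc5afce16xxx',
    '"-104.99875"': '"0"',
}
keywords = ["37abc5afce16xxx", "Huawei", "US", "Nexus", "-104.99875"]


def find_keywords(mapping, keywords):
    """Return the keywords that occur as substrings of any key or value."""
    # Search keys and values together, one "key value" pair per line.
    haystack = "\n".join("%s %s" % (k, v) for k, v in mapping.items())
    # re.escape makes characters such as "." and "-" match literally.
    return [kw for kw in keywords if re.search(re.escape(kw), haystack)]


matches = find_keywords(text, keywords)
for kw in keywords:
    print(kw, "Match" if kw in matches else "Not Match")
```

Since no wildcards are needed here, a plain "kw in haystack" test would give the same result; re.search is kept only to stay close to the question.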

Related

Reformat a string with special tokens to a list/dictionary containing its tokens as elements?

I have a string (as an output of a model that generates sequences) in the format --
<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>
Because this is a collection of elements generated as a sentence/sequence, I would like to reformat it to a list/dictionary (to evaluate the quality of responses) --
[ [ent1, rel1_ent1, rel2_ent1], [ent2, rel1_ent2] ] or
{ "ent1" : ["rel1_ent1", "rel2_ent1"], "ent2" : ["rel1_ent2"] }
So far, the way I have been looking at this is via splitting the string by <bos> and/or <eos> special tokens -- test_string.split("<bos>")[1].split("<eos>")[0].split("<rel>")[1:]. But I am not sure how to handle generality if I do this across a large set of sequences with varying length (i.e. # of rel_ents associated with a given ent).
Also, I feel there might be a cleaner way to do this (without ugly splitting and looping), maybe with a regex? Either way, I am unsure and looking for a better solution.
Added note: the special tokens <bos>, <new_gen>, <gen>, <eos> can be entirely removed from the generated output if that helps.
Well, there could be a smoother way without what you called "ugly splitting and looping", but re.finditer could be a good option here. Find each substring of interest with the pattern:
<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)
We can then use capture group 1 as the dictionary key and capture group 2 as a substring that we split again into a list:
import re  # the standard-library module is enough for this pattern
s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
result = re.finditer(r'<new_gen>\s(\w+)\s<gen>\s(\w+(?:\s<gen>\s\w+)*)', s)
d = {}
for match_obj in result:
    # group(1) is the entity, group(2) the "<gen>"-separated relations
    d[match_obj.group(1)] = match_obj.group(2).split(' <gen> ')
print(d)
Prints:
{'ent1': ['rel1_ent1', 'rel2_ent1'], 'ent2': ['rel1_ent2']}
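For the list-of-lists form the question also asks about, a split-based sketch may be enough. Unlike the regex above, it also keeps an entity that has no <gen> items after it. The function name parse_generation is hypothetical, and the sketch assumes the special tokens appear exactly as in the question.

```python
def parse_generation(sequence):
    """Split a generated sequence into one list per <new_gen> group."""
    # Drop the sentence delimiters; they carry no grouping information.
    body = sequence.replace("<bos>", "").replace("<eos>", "")
    groups = []
    for chunk in body.split("<new_gen>"):
        # Each chunk looks like " ent <gen> rel <gen> rel ".
        parts = [part.strip() for part in chunk.split("<gen>")]
        parts = [part for part in parts if part]
        if parts:
            groups.append(parts)
    return groups


s = '<bos> <new_gen> ent1 <gen> rel1_ent1 <gen> rel2_ent1 <new_gen> ent2 <gen> rel1_ent2 <eos>'
groups = parse_generation(s)
# The dictionary form follows directly: first item is the key, the rest are values.
as_dict = {group[0]: group[1:] for group in groups}
print(groups)
print(as_dict)
```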

regexp value elements in array on Python 2.7

I am using Python 2.7.
I have an array of objects like:
[{"TEMPLATE_NAME": "HP_LaserJet_P2055dn_USB_S29HDY6_HPLIP",
"PRINTER_INFO": "HP LaserJet P2055dn",
"PRINTER_LOCATION": "Локальный принтер",
"DEVICE_URI": "hp:/usb/HP_LaserJet_P2055dn?serial=S29HDY6"},
{"TEMPLATE_NAME": "HP_LaserJet_P2055dn",
"PRINTER_INFO": "HP LaserJet P2055dn",
"PRINTER_LOCATION": "Локальный принтер",
"DEVICE_URI": "usb://HP/LaserJet%20P2055dn?serial=S29HDY6"}]
I need to get the first object in the array in which any field matches the argument. Currently it is done like this:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters if ArgPrinter in [name['PRINTER_INFO'], name['DEVICE_URI'], name['TEMPLATE_NAME'], name['PRINTER_LOCATION']])
print ArgInListFindNewPrinters
>> {"TEMPLATE_NAME": "HP_LaserJet_P2055dn_49A71E", "PRINTER_INFO": "HP HP LaserJet P2055dn", "PRINTER_LOCATION": "Локальный принтер", "DEVICE_URI": "dnssd://HP%20LaserJet%20P2055dn%20%5B49A71E%5D._pdl-datastream._tcp.local/"}
The disadvantage of this method is that it only finds an exact match between the argument and a whole field, but I need any case-insensitive occurrence.
Example: ArgPrinter = "LaserJe", ArgPrinter = "=S29HD"
The main problem is finding any occurrence of a substring in a string.
===========================================================================
I found a solution, but it is not very practical because converting the object to a string requires changing the encoding:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters if re.search(ArgPrinter, str(name), re.IGNORECASE))
If there is a better way to do this, I would be grateful.
Convert both the target string and the searched string to lowercase to perform a case-insensitive search.
Use if x in string to match substrings.
There may be a way to do this more nicely, but this works:
ArgInListFindNewPrinters = next(name for name in ListFindNewPrinters
    if ArgPrinter.lower() in name['PRINTER_INFO'].lower()
    or ArgPrinter.lower() in name['DEVICE_URI'].lower()
    or ArgPrinter.lower() in name['TEMPLATE_NAME'].lower()
    or ArgPrinter.lower() in name['PRINTER_LOCATION'].lower())
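The same idea can be written more compactly with any() over the dictionary values, plus a default for next() so that a missing match does not raise StopIteration. This is only a sketch in Python 3 syntax; the function name find_first_printer is hypothetical, and under Python 2.7 the non-ASCII values would likely need to be unicode objects for lower() to behave as expected.

```python
printers = [
    {"TEMPLATE_NAME": "HP_LaserJet_P2055dn_USB_S29HDY6_HPLIP",
     "PRINTER_INFO": "HP LaserJet P2055dn",
     "PRINTER_LOCATION": "Локальный принтер",
     "DEVICE_URI": "hp:/usb/HP_LaserJet_P2055dn?serial=S29HDY6"},
    {"TEMPLATE_NAME": "HP_LaserJet_P2055dn",
     "PRINTER_INFO": "HP LaserJet P2055dn",
     "PRINTER_LOCATION": "Локальный принтер",
     "DEVICE_URI": "usb://HP/LaserJet%20P2055dn?serial=S29HDY6"},
]


def find_first_printer(printers, needle):
    """Return the first dict with any value containing needle, ignoring case."""
    needle = needle.lower()
    return next(
        (printer for printer in printers
         if any(needle in str(value).lower() for value in printer.values())),
        None,  # returned when nothing matches, instead of raising StopIteration
    )


print(find_first_printer(printers, "LaserJe"))
print(find_first_printer(printers, "=S29HD"))
```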

Python and regex: create a template

I need to find a lot of substrings in a string, but it takes a lot of time, so I want to combine them into a pattern.
I need to find strings like
003.ru/%[KEYWORD]%
1click.ru/%[KEYWORD]%
3dnews.ru/%[KEYWORD]%
where % stands for any characters
and [KEYWORD] can be any of ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
I tried to do the search with
keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
for i, key in enumerate(keywords):
    coding['keyword_url'] = coding.url.apply(lambda x: x.replace('[KEYWORD]', key).replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+') if '[KEYWORD]' in x else x.replace('%', '[a-zA-Z0-9-_\.\?!##$%^&*+=]+'))
    for (domain, keyword_url) in zip(coding.domain.values.tolist(), coding.keyword_url.values.tolist()):
        df.loc[df.event_address.str.contains(keyword_url), 'domain'] = domain
Here df contains only event_address (URLs), and coding looks like this:
domain          url
003.ru          003.ru/%[KEYWORD]%
1CLICK          1click.ru/%[KEYWORD]%
33033.ru        33033.ru/%[KEYWORD]%
3D NEWS         3dnews.ru/%[KEYWORD]%
96telefonov.ru  96telefonov.ru/%[KEYWORD]%
How can I improve my pattern to make it faster?
First, you should consider using the re module. Look at the re.compile function to compile your patterns once, and then match against the compiled patterns.
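One way to act on this is to fold all keywords into a single alternation per URL template, so each template is compiled once instead of once per keyword. The sketch below uses plain Python rather than pandas. The helper names are hypothetical, and it assumes "%" may match any run of characters, including an empty one, which is looser than the character class in the question.

```python
import re

keywords = ['sony%xperia', 'iphone', 'samsung%galaxy', 'lenovo_a706']
templates = {
    '003.ru': '003.ru/%[KEYWORD]%',
    '3D NEWS': '3dnews.ru/%[KEYWORD]%',
}


def wildcard_to_regex(fragment):
    """Escape literal text and turn each '%' into '.*' (assumed meaning)."""
    return '.*'.join(re.escape(part) for part in fragment.split('%'))


def compile_template(template, keywords):
    """Compile one template, with all keywords as a single alternation."""
    if '[KEYWORD]' not in template:
        return re.compile(wildcard_to_regex(template))
    before, _, after = template.partition('[KEYWORD]')
    alternation = '(?:' + '|'.join(wildcard_to_regex(k) for k in keywords) + ')'
    return re.compile(wildcard_to_regex(before) + alternation + wildcard_to_regex(after))


# Compile every template once, up front.
compiled = {domain: compile_template(url, keywords)
            for domain, url in templates.items()}


def find_domain(url):
    """Return the first domain label whose pattern occurs in url, else None."""
    for domain, pattern in compiled.items():
        if pattern.search(url):
            return domain
    return None


print(find_domain('http://003.ru/catalog/sony-xperia-z'))
```

With pandas, the combined pattern string (for example compiled['003.ru'].pattern) could then be passed to str.contains once per domain, rather than once per domain and keyword.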

filtering the tag set and make the sequence of bigram

I'm sorry to ask another question about the same text file.
Below is the text I am working with.
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd
As you can see, this string consists of tokens in the "word/tag" format. From this string, I want to keep only "adjective + noun" sequences and turn them into bigrams. For example, "Grand/jj-tl Jury/nn-tl" is exactly the kind of word sequence I want. (nn means noun, jj means adjective, and suffixes such as "-tl" are additional information about the tag.)
This may be an easy job. I first used a regex for filtering. Below is my code.
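import re
# textfile holds the path to the tagged text file
with open(textfile) as f:
    raw = f.read()
# an adjective token (jj) followed by a noun token (nn), each with an optional suffix
tag_list = re.findall(r"\w+/jj-?\w* \w+/nn-?\w*", raw)
print(tag_list)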
This code gives me exactly the words I want. However, what I want is bigram data, and the code only gives me a list of strings, like this:
['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
I want this data to be converted to something like the following:
[('Grand/jj-tl, Jury/nn-tl'), ('recent/jj ,primary/nn'), ('Executive/jj-tl , Committee/nn-tl')]
i.e. a list of bigrams. I need your advice.
I think once you have tag_list, the rest should be easy using a list comprehension:
>>> tag_list = ['Grand/jj-tl Jury/nn-tl', 'recent/jj primary/nn', 'Executive/jj-tl Committee/nn-tl']
>>> [tag.replace(' ', ', ') for tag in tag_list]
['Grand/jj-tl, Jury/nn-tl', 'recent/jj, primary/nn', 'Executive/jj-tl, Committee/nn-tl']
In your desired output, I am not sure why you have ('Grand/jj-tl, Jury/nn-tl'), and I am also not sure why you would like to join these bigrams with a comma.
I think it would be better to have a list of lists where each inner list holds one bigram:
>>> [tag.split() for tag in tag_list]
[['Grand/jj-tl', 'Jury/nn-tl'], ['recent/jj', 'primary/nn'], ['Executive/jj-tl', 'Committee/nn-tl']]
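If actual pairs are wanted rather than strings, another option is to put a capture group around each half of the pattern; re.findall then returns a list of tuples directly. A small sketch, using the sample sentence from the question as input:

```python
import re

raw = ("The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd "
       "Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj "
       "primary/nn election/nn produced/vbd")

# One capture group per token, so each match becomes an (adjective, noun) tuple.
bigrams = re.findall(r"(\w+/jj-?\w*) (\w+/nn-?\w*)", raw)
print(bigrams)
```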

python, re.search / re.split for phrases which look like a title, i.e. starting with an upper case letter

I have a list of phrases (input by the user) that I'd like to locate in a text file, for example:
titles = ['Blue Team', 'Final Match', 'Best Player',]
text = 'In today Final match, The Best player is Joe from the Blue Team and the second best player is Jack from the Red team.'
1./ I can find all the occurrences of these phrases like so
titre = re.compile(r'(?P<title>%s)' % '|'.join(titles), re.M)
list = [ t for t in titre.split(text) if titre.search(t) ]
(For simplicity, I am assuming a perfect spacing.)
2./ I can also find variants of these phrases, e.g. 'Blue team', 'final Match', 'best player' ... using re.I, if they ever appear in the text.
But I want to restrict the search to variants of the input phrases whose first letter is upper-cased, e.g. 'Blue team' in the text, regardless of how they were entered as input, e.g. 'bluE tEAm'.
Is it possible to write something to "block" the re.I flag for a portion of a phrase? In pseudocode, I imagine generating something like '[B]lue Team|[F]inal Match'.
Note: My primary goal is not, for example, calculating frequency of the input phrases in the text but extracting and analyzing the text fragments between or around them.
I would use re.I and modify the list-comp to:
l = [ t for t in titre.split(text) if titre.search(t) and t[0].isupper() ]
I think regular expressions won't let you restrict the ignore-case flag to just a region of the pattern. However, you can generate a new version of the text in which every character is lower-cased except the first one of each word:
new_text = ' '.join([word[0] + word[1:].lower() for word in text.split()])
This way, a regular expression without the ignore-case flag effectively only has to take casing into account for the first character of each word.
How about modifying the input so that it is in the correct case before you use it in the regular expression?
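For readers on Python 3.6 or later, the re module does support scoped inline flags of the form (?i:...), which apply case-insensitivity to only part of a pattern. That makes the "block re.I for a portion" idea expressible directly. The sketch below normalizes each input phrase so its first character is upper-cased and matched exactly, while the rest is matched case-insensitively; the helper name title_pattern is hypothetical.

```python
import re


def title_pattern(titles):
    """Build a pattern: exact upper-cased first letter, case-insensitive rest."""
    alternatives = []
    for title in titles:
        title = title.strip()
        first, rest = title[0].upper(), title[1:]
        # (?i:...) limits the ignore-case flag to the remainder (Python 3.6+).
        alternatives.append(re.escape(first) + "(?i:" + re.escape(rest) + ")")
    return re.compile("(?P<title>" + "|".join(alternatives) + ")")


titles = ['bluE tEAm', 'Final Match', 'Best Player']
text = ('In today Final match, The Best player is Joe from the Blue Team '
        'and the second best player is Jack from the Red team.')

titre = title_pattern(titles)
found = titre.findall(text)
print(found)
```

Because the pattern has a single capturing group, titre.split(text) can still be used as in the question to get the fragments between the matched phrases.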
