Matching variable number of occurrences of token using regex in python

I am trying to match a token multiple times, but I only get back the last occurrence. I understand this is the normal behavior as per this answer, but I haven't been able to get the solution presented there to work for my example.
My text looks something like this:
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
So basically multiple lines, each with a starting string, spaces, then a variable number of key=value pairs. If you are wondering where this comes from, it is a Robot Framework variables file that I am trying to transform into a Python variables file.
I will be iterating per line to match the key=value pairs and construct a Python dictionary from them.
My current regex pattern is:
&{([^ ]+)}=[ ]{2,}(?:[ ]{2,}([^\s=]+)=([^\s=]+))+
This correctly gets me the dict name but the key pairs only match the last occurrence, as mentioned above. How can I get it to return a tuple containing: ("dict1_name","key1","key1value"..."keyn","keynvalue") so that I can then iterate over this and construct the python dictionary like so:
dict1_name= {"key1": "key1value",..."keyn": "keynvalue"}
Thanks!

As you point out, you will need to work around the fact that capture groups will only catch the last match. One way to do so is to take advantage of the fact that lines in a file are iterable, and to use two patterns: one for the "line name", and one for its multiple keyvalue pairs:*
import re
dname = re.compile(r'^&{(?P<name>\w+)}=')
keyval = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')
data = {}
with open('input/keyvals.txt') as f:
    for line in f:
        name = dname.search(line)
        if name:
            name = name.group('name')
            data[name] = dict(keyval.findall(line))
*Admittedly, this is a tad inefficient since you're conducting two searches per line. But for moderately sized files, you should be fine.
Result:
>>> from pprint import pprint
>>> pprint(data)
{'d5': {'key1': '28f_s', 'key2': 'key2value'},
'name1': {'key1': '5', 'key2': 'x'},
'othername2': {'key1': 'key1value', 'key2': '7'}}
Note that \w matches Unicode word characters.
Sample input, keyvals.txt:
&{name1}= key1=5 key2=x
&{othername2}= key1=key1value key2=7
&{d5}= key1=28f_s key2=aaa key2=key2value
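The question also asked for a flat tuple of the form ("dict1_name", "key1", "key1value", ...). A minimal sketch of how the same two-pattern idea could produce that shape is below. The helper name `flatten` and the in-memory `lines` list are assumptions for illustration; the answer above reads from a file instead.

```python
import re

# Hypothetical in-memory sample standing in for the variables file.
lines = [
    "&{dict1_name}=    key1=key1value    key2=key2value",
    "&{dict2_name}=    key1=key1value",
]

dname = re.compile(r"^&\{(?P<name>\w+)\}=")
keyval = re.compile(r"(\w+)=(\w+)")


def flatten(line):
    """Return (name, key1, value1, ..., keyn, valuen), or None if the
    line does not start with a &{name}= header."""
    header = dname.search(line)
    if header is None:
        return None
    flat = [header.group("name")]
    # Start scanning after the header so only the key=value pairs are seen.
    for key, value in keyval.findall(line, header.end()):
        flat.extend((key, value))
    return tuple(flat)


rows = [flatten(line) for line in lines]
print(rows)
```

Each repeated pair is picked up by `findall` rather than by a repeated capture group, which is what avoids the "last occurrence only" behavior.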

You could use two regexes one for the names and other for the items, applying the one for the items after the first space:
import re
lines = ['&{dict1_name}= key1=key1value key2=key2value',
'&{dict2_name}= key1=key1value']
name = re.compile(r'^&\{(\w+)\}=')  # raw strings keep the backslashes intact
item = re.compile(r'(\w+)=(\w+)')
for line in lines:
    n = name.search(line).group(1)
    i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
    exec('{} = {}'.format(n, i))
    print(locals()[n])
Output
{'key2': 'key2value', 'key1': 'key1value'}
{'key1': 'key1value'}
Explanation
The '^&\{(\w+)\}=' matches an '&' followed by a word (\w+) surrounded by curly braces '\{', '\}'. The second regex matches any words joined by a '='. The line:
i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
creates a dictionary literal as a string. Finally, exec creates a dictionary with the required name. You can access the value of the dictionary by looking the name up in locals().

Use two expressions in combination with a dict comprehension:
import re
junkystring = """
lorem ipsum
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
lorem ipsum
"""
rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
result = {m_outer.group('dict_name'): {m_inner.group('key'): m_inner.group('value')
                                       for m_inner in rx_inner.finditer(m_outer.group('values'))}
          for m_outer in rx_outer.finditer(junkystring)}
print(result)
Which produces
{'dict1_name': {'key1': 'key1value', 'key2': 'key2value'},
'dict2_name': {'key1': 'key1value'}}
With the two expressions being
^&{(?P<dict_name>[^{}]+)}(?P<values>.+)
# the outer format
See a demo on regex101.com. And the second
(?P<key>\w+)=(?P<value>\w+)
# the key/value pairs
See a demo for the latter on regex101.com as well.
The rest is simply a matter of nesting the two expressions in the dict comprehension.

Building off of Brad's answer, I made some modifications. As mentioned in my comment on his reply, it failed at empty lines or comment lines. I modified it to ignore these and continue. I also added handling of spaces: it now matches spaces in dictionary names but replaces them with underscores, since Python variable names cannot contain spaces. Keys are left untouched since they are strings.
import re
def robot_to_python(filename):
    """
    This function can be used to convert robot variable files containing dicts to a python
    variables file containing python dict that can be imported by both python and robot.
    """
    dname = re.compile(r"^&{(?P<name>.+)}=")
    keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")
    data = {}
    with open(filename + '.robot') as f:
        for line in f:
            n = dname.search(line)
            if n:
                # reuse the match object instead of searching the line twice
                name = n.group("name").replace(" ", "_")
                if name:
                    data[name] = dict(keyval.findall(line))
    # the with block closes the file, so no explicit close() is needed
    with open(filename + '.py', 'w') as file:
        for dict_name, keyvals in data.items():
            file.write(dict_name + " = { \n")
            for k in sorted(keyvals.keys()):
                file.write("'%s':'%s', \n" % (k, keyvals[k]))
            file.write("}\n\n")
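The write loop above formats each pair by hand as '%s':'%s', which would produce invalid Python if a key or value ever contained a quote or a backslash. As a hedged sketch of one alternative, the standard `pprint.pformat` can render each dictionary with proper quoting. The function name `render_python_vars`, the identifier check, and the `sample` data are assumptions of this sketch, not part of the answer above.

```python
import io
import keyword
from pprint import pformat


def render_python_vars(data):
    """Render a {name: dict} mapping as Python assignment statements."""
    out = io.StringIO()
    for name, mapping in sorted(data.items()):
        # Assumption: refuse names that cannot be used as Python variables.
        if not name.isidentifier() or keyword.iskeyword(name):
            raise ValueError("not a valid variable name: %r" % name)
        # pformat quotes and escapes keys and values the way repr() does.
        out.write("%s = %s\n\n" % (name, pformat(mapping)))
    return out.getvalue()


# Hypothetical parsed data; note the apostrophe in one value.
sample = {"dict1_name": {"key1": "key1value", "key2": "it's"}}
source = render_python_vars(sample)
print(source)
```

The returned text could then be written to the `.py` file in a single `file.write(source)` call.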

Related

python script not joining strings as expected

I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and sequence does not appear in one of the lists in the lists (i.e. no redFish or blueFish in the first list of testNames) I want to add in a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?

The main issue with your code, which makes it very hard to understand, is you're not really leveraging the language elements that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
['aaaa', 'cccc'],
['tt', 'gg'],
['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
['>xx_redFish |zxx', '>xx_blueFish |zxx'],
['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']
def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])
result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and for each pair, tries to find the sequence (seq) for the name (name) that matches and yields it, or yields a string of the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
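As for why the code in the question printed only empty strings: Python strings are immutable, and `str.join` returns a new string instead of changing the one it is called on. The question's loop discards every return value, so `complete` never changes. A small sketch of the difference (the variable names `unchanged` and `joined` are mine):

```python
complete = ['']

# join() returns a new string; nothing is assigned, so complete[0] stays ''.
complete[0].join('aaaa')
unchanged = complete[0]

# join() also does something different from appending: the string it is
# called on is placed between the items of its argument.
joined = '-'.join('abc')

# Concatenating and assigning back is what actually grows the entry.
complete[0] += 'aaaa'
complete[0] += '-' * 2

print(repr(unchanged), joined, complete)
```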

Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []
for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))
for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']

Extract data using regular expressions in python

I'm trying to parse a file with Serial numbers and part numbers etc and sort them into a structure. I would like to parse this file by tagging off of the identifiers but then I only really need the actual numbers/codes for my data structure. I need to assume that all the numbers/codes are of varied length however I can depend on the identifiers to precede the numbers/codes and also depend on the end line after each value.
//Text file with serials and information
Serial: 523524234235
Part Number: MHC-1251-A
Manufacturer: KNL-ETA
Serial: 523524281238
Part Number: QLC-851
Manufacturer: MHQ-MCE
.
.
.

On each line you can apply a regular expression to extract the desired part, like this:
>>> import re
>>> text = "Serial: 523524234235"
>>> m = re.search(r'Serial: (\d+)', text)
>>> m.group(1)
'523524234235'
You can also use split to get the two parts of each line and then check the first part to see what kind of token it is (Serial, Part Number, etc.).
Your regular expression can be improved to tolerate extra whitespace around the value:
m = re.search(r'Serial: (\d+)', text) ==> m = re.search(r'Serial:\s*(\d+)\s*', text)
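To go from a single line to the whole file with a regex, one possible sketch is a generic "label: value" pattern applied with `re.M`, starting a new record at each Serial line. The pattern, the `records` list, and the assumption that every record begins with its Serial line are mine; the sample text is taken from the question.

```python
import re

text = """Serial: 523524234235
Part Number: MHC-1251-A
Manufacturer: KNL-ETA
Serial: 523524281238
Part Number: QLC-851
Manufacturer: MHQ-MCE"""

# label = identifier before the first colon, value = rest of the line, trimmed
field = re.compile(r'^(?P<label>[^:\n]+):[ \t]*(?P<value>\S.*?)[ \t]*$', re.M)

records = []
for m in field.finditer(text):
    if m.group('label') == 'Serial':
        records.append({})  # assumption: a Serial line opens a new record
    if records:
        records[-1][m.group('label')] = m.group('value')

print(records)
```

Lines without a colon are simply not matched, so stray lines between records are skipped.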

Open the file, read its lines, and split each one on ':' to get your numbers. You can use a regex if the values are not laid out one per line.

I agree with @loki; from what you describe, a regex is not necessary. An appropriate structure extracted from a file like yours might be set up like:
parts = {}  # data structure
entry = {}  # single set
for line in open('file.dat', 'r'):
    flds = [fld.strip() for fld in line.split(':')[:2]]
    if len(flds) > 1:
        k, v = flds
        if k == 'Serial':  # use serial number as key for corresponding entry
            entry = {}
            parts[v] = entry
        else:
            entry[k] = v  # save information in data set
Result:
{'523524234235': {'Part Number': 'MHC-1251-A', 'Manufacturer': 'KNL-ETA'}, '523524281238': {'Part Number': 'QLC-851', 'Manufacturer': 'MHQ-MCE'}, ...}

Python - get subset from 2nd, to last element

I have a bunch of strings, all on one line, separated by a single space.
I would like to store these values in a map, with the first string as the key, and a set of the remaining values.
I am trying
map = {}
input = raw_input().split()
map[input[0]] = input[1:-1]
which works, apart from leaving off the last element.
I have found
map[input[0]] = input[1:len(input)]
works, but I would much rather use something more like the former
(for example, input is something like "key value1 value2 value3"
I want a map like
{'key' : ['value1', 'value2', 'value3']}
but my current method gives me
{'key' : ['value1', 'value2']}
)

That's because you are specifying -1 as the index to go to - simply leave the index out to go to the end of the list. E.g:
input[1:]
See here for more on the list slicing syntax.
Note an alternative (which I feel is far nicer and more readable), if you are using Python 3.x, is to use extended iterable unpacking:
key, *values = input().split()
map[key] = values
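The three variants discussed here can be compared side by side. This is a small sketch using the example input from the question; the variable names are mine, and the last form requires Python 3.

```python
tokens = "key value1 value2 value3".split()

without_last = tokens[1:-1]  # a stop index of -1 excludes the last element
to_end = tokens[1:]          # an omitted stop index runs through the end
key, *values = tokens        # extended iterable unpacking (Python 3)

lookup = {key: values}
print(without_last, to_end, lookup)
```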

myDict = {}
for line in lines:
    tokens = line.split()
    myDict[tokens[0]] = tokens[1:]
Alternatively:
def lineToPair(line):
    tokens = line.split()
    return tokens[0], tokens[1:]
myDict = dict(lineToPair(x) for x in lines)

appending regex matches to a dictionary

I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears afterwards (the numbers) to a dictionary the keys of which are the name of the animals in the beginning of each line.

Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())

t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
    d.setdefault(p, []).append(q)
print d

Why don't you use the string find method to locate the index of the colon, which you can then use to slice the string?
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file as a text and process the lines individually as above.

Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
    animal = l.split('_')[0]
    number = l.split(':')[-1]
    dictionary[animal].append(number)
Just make sure your data is well formatted.

Python: create dict from list and auto-gen/increment the keys (list is the actual key values)?

I've searched pretty hard and can't find a question that exactly pertains to what I want to do.
I have a file called "words" that has about 1000 lines of random A-Z sorted words...
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
abacus
abalone
abandon
abase
abash
abate
abater
abbas
abbe
abbey
abbot
Abbott
abbreviate
abc
abdicate
abdomen
abdominal
abduct
Abe
abed
Abel
Abelian
I am trying to load this file into a dictionary, where the words are the values and the keys are auto-generated/auto-incremented for each word,
e.g. {0:10th, 1:1st, 2:2nd} ...etc.
Below is the code I've cobbled together so far. It sort of works, but it only shows me the last entry in the file as the only dict pair element.
f3data = open('words')
mydict = {}
for line in f3data:
    print line.strip()
    cmyline = line.split()
    key = +1
    mydict [key] = cmyline
print mydict

key = +1
+1 is the same thing as 1. I assume you meant key += 1. I also can't see a reason why you'd split each line when there's only one item per line.
However, there's really no reason to do the looping yourself.
with open('words') as f3data:
    mydict = dict(enumerate(line.strip() for line in f3data))
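If you do want to keep an explicit counter, note that `key += 1` only works once `key` has been given a starting value before the loop. A minimal sketch, using an `io.StringIO` object as a hypothetical stand-in for the 'words' file so that it runs on its own:

```python
import io

# Hypothetical stand-in for open('words'); iterating it yields lines the same way.
f3data = io.StringIO("10th\n1st\n2nd\n")

mydict = {}
key = 0  # initialise the counter before the loop
for line in f3data:
    mydict[key] = line.strip()
    key += 1  # 'key = +1' would assign 1 on every pass instead

print(mydict)
```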

dict(enumerate(x.rstrip() for x in f3data))
But your error is that key = +1 should be key += 1.

f3data = open('words')
print f3data.readlines()

The use of zero-based numeric keys in a dict is very suspicious. Consider whether a simple list would suffice.
Here is an example using a list comprehension:
>>> mylist = [word.strip() for word in open('/usr/share/dict/words')]
>>> mylist[1]
'A'
>>> mylist[10]
"Aaron's"
>>> mylist[100]
"Addie's"
>>> mylist[1000]
"Armand's"
>>> mylist[10000]
"Loyd's"
I use str.strip() to remove whitespace and newlines, which are present in /usr/share/dict/words. This may not be necessary with your data.
However, if you really need a dictionary, Python's enumerate() built-in function is your friend here, and you can pass the output directly into the dict() function to create it:
>>> mydict = dict(enumerate(word.strip() for word in open('/usr/share/dict/words')))
>>> mydict[1]
'A'
>>> mydict[10]
"Aaron's"
>>> mydict[100]
"Addie's"
>>> mydict[1000]
"Armand's"
>>> mydict[10000]
"Loyd's"

With keys that dense, you don't want a dict, you want a list.
with open('words') as fp:
    data = list(map(str.strip, fp.readlines()))  # list() so this is a real list on Python 3 too
But if you really can't live without a dict:
with open('words') as fp:
    data = dict(enumerate(X.strip() for X in fp))

{index: x.strip() for index, x in enumerate(open('filename.txt'))}
This code uses a dictionary comprehension and the enumerate built-in, which takes an input sequence (in this case, the file object, which yields each line when iterated through) and returns an index along with the item. Then, a dictionary is built up with the index and text.
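To make the role of `enumerate` concrete, here is a small sketch that shows the (index, line) pairs it produces before they are turned into a dictionary. The `io.StringIO` object is a hypothetical stand-in for the opened file, and the `start=1` variant is an extra illustration not mentioned above.

```python
import io

# Hypothetical stand-in for open('filename.txt').
fake_file = io.StringIO("10th\n1st\n2nd\n")

# enumerate pairs each line with a running index, starting at 0.
pairs = list(enumerate(fake_file))

# The comprehension strips the newline and keeps the index as the key.
by_index = {index: text.strip() for index, text in pairs}

# enumerate also accepts a start value if keys should begin at 1.
from_one = dict(enumerate(["10th", "1st"], start=1))

print(pairs, by_index, from_one)
```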
One question: why not just use a list if all of your keys are integers?
Finally, your original code should be
f3data = open('words')
mydict = {}
for index, line in enumerate(f3data):
    cmyline = line.strip()
    mydict[index] = cmyline
print mydict

Putting the words in a dict makes no sense. If you're using numbers as keys you should be using a list.
from __future__ import with_statement
with open('words.txt', 'r') as f:
    lines = f.readlines()
words = {}
for n, line in enumerate(lines):
    words[n] = line.strip()
print words
