appending regex matches to a dictionary

appending regex matches to a dictionary - python

I have a file in which there is the following info:
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
I'm looking for a way to match the colon and append whatever appears afterwards (the numbers) to a dictionary the keys of which are the name of the animals in the beginning of each line.

Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises.
Assuming that data is a variable containing the string that you listed above:
dict(item.split(":") for item in data.split())

t = """
dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478
"""
import re
d = {}
for p, q in re.findall(r'^(.+?)_.+?:(.+)', t, re.M):
d.setdefault(p, []).append(q)
print d

why dont you use the python find method to locate the index of the colons which you can use to slice the string.
>>> x='dogs_3351.txt:34.13559322033898'
>>> key_index = x.find(':')
>>> key = x[:key_index]
>>> key
'dogs_3351.txt'
>>> value = x[key_index+1:]
>>> value
'34.13559322033898'
>>>
Read in each line of the file as a text and process the lines individually as above.

Without regex and using defaultdict:
from collections import defaultdict
data = """dogs_3351.txt:34.13559322033898
cats_1875.txt:23.25581395348837
cats_2231.txt:22.087912087912088
elephants_3535.txt:37.092592592592595
fish_1407.txt:24.132530120481928
fish_2078.txt:23.470588235294116
fish_2041.txt:23.564705882352943
fish_666.txt:23.17241379310345
fish_840.txt:21.77173913043478"""
dictionary = defaultdict(list)
for l in data.splitlines():
animal = l.split('_')[0]
number = l.split(':')[-1]
dictionary[animal] = dictionary[animal] + [number]
Just make sure your data is well formatted

Related

Get list from string with exec in python

I have:
"[15765,22832,15289,15016,15017]"
I want:
[15765,22832,15289,15016,15017]
What should I do to convert this string to list?
P.S. Post was edited without my permission and it lost important part. The type of line that looks like list is 'bytes'. This is not string.
P.S. №2. My initial code was:
import urllib.request, re
f = urllib.request.urlopen("http://www.finam.ru/cache/icharts/icharts.js")
lines = f.readlines()
for line in lines:
m = re.match('var\s+(\w+)\s*=\s*\[\\s*(.+)\s*\]\;', line.decode('windows-1251'))
if m is not None:
varname = m.group(1)
if varname == "aEmitentIds":
aEmitentIds = line #its type is 'bytes', not 'string'
I need to get list from line
line from web page looks like
[15765, 22832, 15289, 15016, 15017]

Assuming s is your string, you can just use split and then cast each number to integer:
s = [int(number) for number in s[1:-1].split(',')]
For detailed information about split function:
Python3 split documentation

What you have is a stringified list. You could use a json parser to parse that information into the corresponding list
import json
test_str = "[15765,22832,15289,15016,15017]"
l = json.loads(test_str) # List that you need.
Or another way to do this would be to use ast
import ast
test_str = "[15765,22832,15289,15016,15017]"
data = ast.literal_eval(test_str)
The result is
[15765, 22832, 15289, 15016, 15017]
To understand why using eval() is bad practice you could refer to this answer

You can also use regex to pull out numeric values from the string as follows:
import re
lst = "[15765,22832,15289,15016,15017]"
lst = [int(number) for number in re.findall('\d+',lst)]
Output of the above code is,
[15765, 22832, 15289, 15016, 15017]

Extract list of words from filenames

I need to get a list of words, that files contains. Here is the files:
sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
I need to get that goes after task-<>_, so my list should looks:
['FmriPictures','FmriVernike','FmriWgWords','RestingState']
how can I implement it in python3?

Here's a Python Solution for this which uses Regex.
>>> import re
>>> test_str = 'sub-Dzh_task-FmriPictures_space-
MNI152NLin2009cAsym_desc-preproc_bold_mask-
Language_sub01_component_ica_s1_.nii'
>>> re.search('task-(.*?)_', test_str).group(1)
'FmriPictures'
I think you can do the same for every string.

l=["sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii",
"sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii"]
k=[]
for i in l:
k.append(i.split('-')[2].replace("_space",""))
print(k)
thats just approach.

You can loop over your list and use regex to get the names from the strings like this example:
import re
a = ['sub-Dzh_task-FmriPictures_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriVernike_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-FmriWgWords_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii',
'sub-Dzh_task-RestingState_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii']
out = []
for elm in a:
condition = re.search(r'_task-(.*?)_', elm)
if bool(condition):
out.append(condition.group(1))
print(out)
Output:
['FmriPictures', 'FmriVernike', 'FmriWgWords', 'RestingState']

I would just simply replace
sub-Dzh_task-
and
_space-MNI152NLin2009cAsym_desc-preproc_bold_mask-Language_sub01_component_ica_s1_.nii
with null. Just empty those lines out and you'll get the file names.

Matching variable number of occurrences of token using regex in python

I am trying to match a token multiple times, but I only get back the last occurrence, which I understand is the normal behavior as per this answer, but I haven't been able to get the solution presented there in my example.
My text looks something like this:
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
So basically multiple lines, each with a starting string, spaces, then a variable number of key pairs. If you are wondering where this comes from, it is a robot framework variables file that I am trying to transform into a python variables file.
I will be iterating per line to match the key pairs and construct a python dictionary from them.
My current regex pattern is:
&{([^ ]+)}=[ ]{2,}(?:[ ]{2,}([^\s=]+)=([^\s=]+))+
This correctly gets me the dict name but the key pairs only match the last occurrence, as mentioned above. How can I get it to return a tuple containing: ("dict1_name","key1","key1value"..."keyn","keynvalue") so that I can then iterate over this and construct the python dictionary like so:
dict1_name= {"key1": "key1value",..."keyn": "keynvalue"}
Thanks!

As you point out, you will need to work around the fact that capture groups will only catch the last match. One way to do so is to take advantage of the fact that lines in a file are iterable, and to use two patterns: one for the "line name", and one for its multiple keyvalue pairs:*
import re
dname = re.compile(r'^&{(?P<name>\w+)}=')
keyval = re.compile(r'(?P<key>\w+)=(?P<val>\w+)')
data = {}
with open('input/keyvals.txt') as f:
for line in f:
name = dname.search(line)
if name:
name = name.group('name')
data[name] = dict(keyval.findall(line))
*Admittedly, this is a tad inefficient since you're conducting two searches per line. But for moderately sized files, you should be fine.
Result:
>>> from pprint import pprint
>>> pprint(data)
{'d5': {'key1': '28f_s', 'key2': 'key2value'},
'name1': {'key1': '5', 'key2': 'x'},
'othername2': {'key1': 'key1value', 'key2': '7'}}
Note that \w matches Unicode word characters.
Sample input, keyvals.txt:
&{name1}= key1=5 key2=x
&{othername2}= key1=key1value key2=7
&{d5}= key1=28f_s key2=aaa key2=key2value

You could use two regexes one for the names and other for the items, applying the one for the items after the first space:
import re
lines = ['&{dict1_name}= key1=key1value key2=key2value',
'&{dict2_name}= key1=key1value']
name = re.compile('^&\{(\w+)\}=')
item = re.compile('(\w+)=(\w+)')
for line in lines:
n = name.search(line).group(1)
i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
exec('{} = {}'.format(n, i))
print(locals()[n])
Output
{'key2': 'key2value', 'key1': 'key1value'}
{'key1': 'key1value'}
Explanation
The '^&\{(\w+)\}=' matches an '&' followed by a word (\w+) surrounded by curly braces '\{', '\}'. The second regex matches any words joined by a '='. The line:
i = '{{{}}}'.format(','.join("'{}' : '{}'".format(m.group(1), m.group(2)) for m in item.finditer(' '.join(line.split()[1:]))))
creates a dictionary literal, finally you create a dictionary with the required name using exec. You can access the value of the dictionary querying locals.

Use two expressions in combination with a dict comprehension:
import re
junkystring = """
lorem ipsum
&{dict1_name}= key1=key1value key2=key2value
&{dict2_name}= key1=key1value
lorem ipsum
"""
rx_outer = re.compile(r'^&{(?P<dict_name>[^{}]+)}(?P<values>.+)', re.M)
rx_inner = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
result = {m_outer.group('dict_name'): {m_inner.group('key'): m_inner.group('value')
for m_inner in rx_inner.finditer(m_outer.group('values'))}
for m_outer in rx_outer.finditer(junkystring)}
print(result)
Which produces
{'dict1_name': {'key1': 'key1value', 'key2': 'key2value'},
'dict2_name': {'key1': 'key1value'}}
With the two expressions being
^&{(?P<dict_name>[^{}]+)}(?P<values>.+)
# the outer format
See a demo on regex101.com. And the second
(?P<key>\w+)=(?P<value>\w+)
# the key/value pairs
See a demo for the latter on regex101.com as well.
The rest is simply sorting the different expressions in the dict comprehension.

Building off of Brad's answer, I made some modifications. As mentioned in my comment on his reply, it failed at empty lines or comment lines. I modified it to ignore these and continue. I also added handling of spaces: it now matches spaces in dictionary names but replaces them with underscore since python cannot have spaces in variable names. Keys are left untouched since they are strings.
import re
def robot_to_python(filename):
"""
This function can be used to convert robot variable files containing dicts to a python
variables file containing python dict that can be imported by both python and robot.
"""
dname = re.compile(r"^&{(?P<name>.+)}=")
keyval = re.compile(r"(?P<key>[\w|:]+)=(?P<val>[\w|:]+)")
data = {}
with open(filename + '.robot') as f:
for line in f:
n = dname.search(line)
if n:
name = dname.search(line).group("name").replace(" ", "_")
if name:
data[name] = dict(keyval.findall(line))
with open(filename + '.py', 'w') as file:
for dictionary in data.items():
dict_name = dictionary[0]
file.write(dict_name + " = { \n")
keyvals = dictionary[1]
for k in sorted(keyvals.keys()):
file.write("'%s':'%s', \n" % (k, keyvals[k]))
file.write("}\n\n")
file.close()

How do you split a string at a specific point?

I am new to python and want to split what I have read in from a text file into two specific parts. Below is an example of what could be read in:
f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]
So what I want to achieve is to be able to execute the second part of the program is:
words = ['Cats','like','dogs','as','much','cats.']
numbers = [1,2,3,4,5,4,3,2,6]
I have tried using:
words,numbers = f.split("][")
However, this removes the double bracets from the two new variable which means the second part of my program which recreates the original text does not work.
Thanks.

I assume f is a string like
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
then we can find the index of ][ and add one to find the point between the brackets
i = f.index('][')
a, b = f[:i+1], f[i+1:]
print(a)
print(b)
output:
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]

Another Alternative if you want to still use split()
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
d="]["
print f.split(d)[0]+d[0]
print d[1]+f.split(d)[1]

If you can make your file look something like this:
[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]
then you could simply use Python's json module to do this for you. Note that the JSON format requires double quotes rather than single.
import json
f = '[["Cats","like","dogs","as","much","cats."],[1,2,3,4,5,4,3,2,6]]'
a, b = json.loads(f)
print(a)
print(b)
Documentation for the json library can be found here: https://docs.python.org/3/library/json.html

An alternative to Patrick's answer using regular expressions:
import re
data = "f = ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
pattern = 'f = (?P<words>\[.*?\])(?P<numbers>\[.*?\])'
match = re.match(pattern, data)
words = match.group('words')
numbers = match.group('numbers')
print(words)
print(numbers)
Output
['Cats','like','dogs','as','much','cats.']
[1,2,3,4,5,4,3,2,6]

If I understand correctly, you have a text file that contains ['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6] and you just need to split that string at the transition between brackets. You can do this with the string.index() method and string slicing. See my console output below:
>>> f = open('./catsdogs12.txt', 'r')
>>> input = f.read()[:-1] # Read file without trailing newline (\n)
>>> input
"['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
>>> bracket_index = input.index('][') # Get index of transition between brackets
>>> bracket_index
41
>>> words = input[:bracket_index + 1] # Slice from beginning of string
>>> words
"['Cats','like','dogs','as','much','cats.']"
>>> numbers = input[bracket_index + 1:] # Slice from middle of string
>>> numbers
'[1,2,3,4,5,4,3,2,6]'
Note that this will leave you with a python string that looks visually identical to a list (array). If you needed the data represented as python native objects (i.e. so that you can actually use it like a list), you'll need to use some combination of string[1:-1].split(',') on both strings and list.map() on the numbers list to convert the numbers from strings to numbers.
Hope this helps!

Another thing you can do is first replace ][ with ]-[ and then do a split or partition using - but i will suggest you to do split as we don't want that delimiter.
SPLIT
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b = f.split('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(b)
[1,2,3,4,5,4,3,2,6]
PARTITION
f = "['Cats','like','dogs','as','much','cats.'][1,2,3,4,5,4,3,2,6]"
f = f.replace('][',']-[')
a,b,c = f.partition('-')
Output
>>> print(a)
['Cats','like','dogs','as','much','cats.']
>>> print(c)
[1,2,3,4,5,4,3,2,6]

Python: create dict from list and auto-gen/increment the keys (list is the actual key values)?

i've searched pretty hard and cant find a question that exactly pertains to what i want to..
I have a file called "words" that has about 1000 lines of random A-Z sorted words...
10th
1st
2nd
3rd
4th
5th
6th
7th
8th
9th
a
AAA
AAAS
Aarhus
Aaron
AAU
ABA
Ababa
aback
abacus
abalone
abandon
abase
abash
abate
abater
abbas
abbe
abbey
abbot
Abbott
abbreviate
abc
abdicate
abdomen
abdominal
abduct
Abe
abed
Abel
Abelian
I am trying to load this file into a dictionary, where using the word are the key values and the keys are actually auto-gen/auto-incremented for each word
e.g {0:10th, 1:1st, 2:2nd} ...etc..etc...
below is the code i've hobbled together so far, it seems to sort of works but its only showing me the last entry in the file as the only dict pair element
f3data = open('words')
mydict = {}
for line in f3data:
print line.strip()
cmyline = line.split()
key = +1
mydict [key] = cmyline
print mydict

key = +1
+1 is the same thing as 1. I assume you meant key += 1. I also can't see a reason why you'd split each line when there's only one item per line.
However, there's really no reason to do the looping yourself.
with open('words') as f3data:
mydict = dict(enumerate(line.strip() for line in f3data))

dict(enumerate(x.rstrip() for x in f3data))
But your error is key += 1.

f3data = open('words')
print f3data.readlines()

The use of zero-based numeric keys in a dict is very suspicious. Consider whether a simple list would suffice.
Here is an example using a list comprehension:
>>> mylist = [word.strip() for word in open('/usr/share/dict/words')]
>>> mylist[1]
'A'
>>> mylist[10]
"Aaron's"
>>> mylist[100]
"Addie's"
>>> mylist[1000]
"Armand's"
>>> mylist[10000]
"Loyd's"
I use str.strip() to remove whitespace and newlines, which are present in /usr/share/dict/words. This may not be necessary with your data.
However, if you really need a dictionary, Python's enumerate() built-in function is your friend here, and you can pass the output directly into the dict() function to create it:
>>> mydict = dict(enumerate(word.strip() for word in open('/usr/share/dict/words')))
>>> mydict[1]
'A'
>>> mydict[10]
"Aaron's"
>>> mydict[100]
"Addie's"
>>> mydict[1000]
"Armand's"
>>> mydict[10000]
"Loyd's"

With keys that dense, you don't want a dict, you want a list.
with open('words') as fp:
data = map(str.strip, fp.readlines())
But if you really can't live without a dict:
with open('words') as fp:
data = dict(enumerate(X.strip() for X in fp))

{index: x.strip() for index, x in enumerate(open('filename.txt'))}
This code uses a dictionary comprehension and the enumerate built-in, which takes an input sequence (in this case, the file object, which yields each line when iterated through) and returns an index along with the item. Then, a dictionary is built up with the index and text.
One question: why not just use a list if all of your keys are integers?
Finally, your original code should be
f3data = open('words')
mydict = {}
for index, line in enumerate(f3data):
cmyline = line.strip()
mydict[index] = cmyline
print mydict

Putting the words in a dict makes no sense. If you're using numbers as keys you should be using a list.
from __future__ import with_statement
with open('words.txt', 'r') as f:
lines = f.readlines()
words = {}
for n, line in enumerate(lines):
words[n] = line.strip()
print words

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

appending regex matches to a dictionary - python

Actually, regular expressions are unnecessary, provided that your data is well formatted and contains no surprises. Assuming that data is a variable containing the string that you listed above: dict(item.split(":") for item in data.split())

Related

Get list from string with exec in python

Extract list of words from filenames

Matching variable number of occurrences of token using regex in python

How do you split a string at a specific point?

Python: create dict from list and auto-gen/increment the keys (list is the actual key values)?

Categories

Resources