I have a string with the following format:
author="PersonsName" date="1183050420" format="1.1" version="1.2"
I want to turn it into a Python dict, like so:
{'author': 'PersonsName', 'date': '1183050420', 'format': '1.1', 'version': '1.2'}
I tried to do this using re.split on the string, like so:
attribs = (re.split('(=?" ?)', twikiattribs))
thinking I would get a list back like:
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2']
that then I could turn into a dict, but instead I'm getting:
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
So, before I go further down the re.split route: is there a generally better way to achieve what I'm trying to do? And if the solution does involve re.split, how can I write a regex that splits on any of the strings =", "_ (where "_" is a space character) or just ", so that I get a list with the keys in the odd positions and the values in the even positions?
Use re.findall():
dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
re.findall(), when presented with a pattern with multiple capturing groups, returns a list of tuples, each nested tuple containing the captured groups. dict() happily takes that output and interprets each nested tuple as a key-value pair.
Demo:
>>> import re
>>> twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> re.findall(r'(\w+)="([^"]+)"', twikiattribs)
[('author', 'PersonsName'), ('date', '1183050420'), ('format', '1.1'), ('version', '1.2')]
>>> dict(re.findall(r'(\w+)="([^"]+)"', twikiattribs))
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
re.split() also behaves differently based on capturing groups; the text on which you split is included in the output if grouped. Compare the output with and without the capturing group:
>>> re.split('(=?" ?)', twikiattribs)
['author', '="', 'PersonsName', '" ', 'date', '="', '1183050420', '" ', 'format', '="', '1.1', '" ', 'version', '="', '1.2', '"', '']
>>> re.split('=?" ?', twikiattribs)
['author', 'PersonsName', 'date', '1183050420', 'format', '1.1', 'version', '1.2', '']
The re.findall() output is far easier to convert to a dictionary, however.
You can also do it without re, in one line:
>>> data = '''author="PersonsName" date="1183050420" format="1.1" version="1.2"'''
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split(" ")]}
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
If whitespace is allowed inside the values, you can split on '" ' instead:
>>> {k:v.strip('"') for k,v in [i.split("=",1) for i in data.split('" ')]}
The way I'd personally parse it:
import shlex
s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
dict(x.split('=', 1) for x in shlex.split(s))  # maxsplit=1 keeps any '=' inside a value intact
Out[12]:
{'author': 'PersonsName',
'date': '1183050420',
'format': '1.1',
'version': '1.2'}
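One reason to prefer shlex here is that it keeps quoted values together even when they contain spaces. A minimal sketch of that, assuming a hypothetical input whose author value contains a space (this input is not from the question):

```python
import shlex

# Hypothetical input: the author value contains a space.
s = 'author="Persons Name" date="1183050420"'

# shlex.split honours the quotes, so each key="value" pair stays one token
# and the quotes themselves are removed.
tokens = shlex.split(s)

# Split each token on the first '=' only.
attribs = dict(token.split('=', 1) for token in tokens)
print(attribs)
```

A plain s.split() would break the author value in two at the space.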
A non-regex list comprehension one-liner (note that the values keep their surrounding quotes):
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> print(dict([tuple(x.split('=')) for x in s.split()]))
{'date': '"1183050420"', 'format': '"1.1"', 'version': '"1.2"', 'author': '"PersonsName"'}
The problem is that you included parentheses in your regex, which makes it a capturing group, so the matched delimiters are included in the result. Assign attribs like this:
attribs = (re.split('=?" ?', twikiattribs))
and it will work as expected. This does return a trailing empty string (because of the final " in your input string), so you'll want to use attribs[:-1] when creating the dictionary.
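To go from that flat list to the dictionary, one option is to pair the alternating items with zip. A minimal sketch, reusing the twikiattribs string from the question (the name result is only for illustration):

```python
import re

twikiattribs = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'

# Split without a capturing group and drop the trailing empty string.
attribs = re.split('=?" ?', twikiattribs)[:-1]

# Items at even indices are keys, items at odd indices are values.
result = dict(zip(attribs[::2], attribs[1::2]))
print(result)
```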
Try
>>> s = 'author="PersonsName" date="1183050420" format="1.1" version="1.2"'
>>> eval('dict(' + s.replace(" ", ",") + ')')
{'date': '1183050420', 'format': '1.1', 'version': '1.2', 'author': 'PersonsName'}
assuming, as before, that the values contain no spaces.
Beware of using eval(), though. Bad things may happen with unexpected input. Don't use it on user input.
This might help people for whom re.findall() doesn't work.
# grabbing input: a string holding a Python literal (dict, list, etc.); this value is only an example
input1 = "[[1, 2], [3, 4]]"
# creating a phantom variable: build an assignment statement as a string
Phantom = 'variable_name = ' + input1
# executing the phantom (exec itself returns None; its effect is to create variable_name)
exec(Phantom)
# storing the phantom variable in a live one
output = variable_name
# printing the stored phantom variable
print(output)
What this essentially does is prepend a variable name to your input and execute the resulting assignment, which creates that variable.
For example, if your input arrives as the string "[[1, 2], [3, 4]]", this executes as variable_name = [[1, 2], [3, 4]],
which turns the string back into a real list.
It does trigger a linter warning, since the variable doesn't exist until the code runs.
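If the input is only ever a plain literal (a list, dict, number or string), a different technique from the one above is ast.literal_eval from the standard library, which parses literals without executing arbitrary code. A minimal sketch, assuming a hypothetical input string:

```python
import ast

# Hypothetical input: a string containing a Python list literal.
text = "[[1, 2], [3, 4]]"

# literal_eval only accepts literals; anything else raises ValueError.
output = ast.literal_eval(text)
print(output)
```

This is not a drop-in replacement for exec: it cannot run statements or call functions, which is the point when the input is untrusted.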
I have some complex code which reads values into a nested defaultdict.
Then there is a loop going through the keys in the dictionary and working with them, basically assigning them to another nested defaultdict.
The problem is that when I try to access values from the dictionary and pass them to a function, I get either an empty {} or something like this: defaultdict(<function tree at 0x2aff774309d8>
I tried printing the dict to see whether it is really empty. Part of my code:
if (not families_data[family]['cell_db']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
The really fun part is, when this "if" is true, then I get the following output:
Output for adfull is empty.
{'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''}
If I change the second line in the if to
print(families_data[family]['celldb'])
I get the following output:
defaultdict(<function tree at 0x2b45844059d8>, {'name': 'adfullx05_b', 'family': 'adfull', 'drive_strength': 0.5, 'template': 'adfull', 'category': '', 'pinmap': '', 'output': 'CO S', 'inout': '', 'input': 'A B CI', 'rail_supply': 'VDD VSS', 'well_supply': '', 'description': ''})
Why is the "if" even true, when there is a value 'CO S' in the output key?
Why am I getting {} when trying to access any value like families_data[family]['cell_db']['input'] and passing it to a function as a parameter?
What the heck am I doing wrong?
The "cell_db" key in the if statement has an underscore while it does not in the print statement.
This should fix it:
if (not families_data[family]['celldb']['output']):
    print(rf"Output for {family} is empty.")
    print(dict(families_data[family]['celldb']))
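The typo goes unnoticed because a recursive defaultdict creates missing keys on access instead of raising KeyError. A minimal sketch of that behaviour, assuming tree is defined as the usual recursive factory (the question only shows its name, so the definition here is a guess):

```python
from collections import defaultdict

def tree():
    # Assumed definition: every missing key becomes another nested tree.
    return defaultdict(tree)

families_data = tree()
families_data['adfull']['celldb']['output'] = 'CO S'

# The mistyped key does not fail; it silently creates an empty branch.
typo_value = families_data['adfull']['cell_db']['output']

print(bool(typo_value))                  # the new branch is empty, hence falsy
print(sorted(families_data['adfull']))   # both spellings now exist as keys
```

That empty branch is what makes the if condition true and what shows up as {} when passed to a function.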
I am trying to further split an already split string to further clean it up and remove unnecessary bits of info. This is a URL split by '/'
['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
I would like to be able to make it:
['https:', '', 'expressjs','com', 'en', 'starter', 'hello-world','html']
Any thoughts?
re.split can split a string on every match of your regex:
>>> re.split(r'[/.]', 'https://expressjs.com/en/starter/hello-world.html')
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']
[/.] matches any forward slash or period character (inside a character class the dot needs no escaping).
Try this:
L = ['https:', '', 'expressjs.com', 'en', 'starter', 'hello-world.html']
L = [subitem for item in L for subitem in item.split('.')]
print(L)
Output:
['https:', '', 'expressjs', 'com', 'en', 'starter', 'hello-world', 'html']
Here is my issue. Given the list below:
a = ['COP' , '\t\t\t', 'Basis', 'Notl', 'dv01', '6m', '9m', '1y',
'18m', '2y', '3y', "15.6", 'mm', '4.6', '4y', '5y', '10', 'mm',
'4.6', '6y', '7y', '8y', '9y', '10y', '20y', 'TOTAL', '\t\t9.2' ]
I'm trying to get output like the sample below. The important point is how the rows are built:
each entry ending in "y" or "m" is followed by a number only if one comes right after it in the list.
Example: ('3y', '15.6', '')
SAMPLE OUTPUT (ignore the tuple structure, I just want the values):
('6m', '', '')
('9m', '', '')
('1y', '', '')
('18m', '', '')
('2y', '', '')
('3y', '15.6', '')
('4y', '', '')
('5y', '10', '')
('6y', '', '')
('7y', '', '')
('8y', '', '')
('9y', '', '')
('10y', '', '')
('20y', '', '')
I used the following regex, which I expected to match:
all numbers followed by "y" or "m" => (\b\d+[ym]\b)
and then any number (integer or not) if it appears (meaning zero or more times) =>
(\b[0-9]+.[0-9]\b)
Here is what I did, using Python 3 and re.findall(), but I still got no results:
rule2 = re.compile(r"(\b\d+[ym]\b)(\b[0-9]+.*[0-9]*\b)+")
a_str = " ".join(a)
OUT2 = re.findall(rule2, a_str)
print(OUT2)
# OUT2 >>[]
Why am I not getting the correct result?
You cannot put two word boundaries back to back. \b is zero-width, so (...\b)(\b...) requires the number to start exactly where the first token ends, with nothing in between. Since the items are separated by non-word characters, match those with \W+ instead.
Then escape the dot and make it optional, or you won't match 10. Don't use .*, since it is greedy and will match too much.
This yields more or less what you're looking for (note that strictly matching numbers, integers or floats, is trickier than this, so it isn't perfect):
rule2 = re.compile(r"\b(\d+[ym])\W+([0-9]+\.?[0-9]*)\b")
a_str = " ".join(a)
OUT2 = re.findall(rule2, a_str)
print(OUT2)
[('3y', '15.6'), ('5y', '10')]
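The regex above only returns the entries that do have a number. To get one row per entry, as in the sample output, it may be simpler to walk the list rather than the joined string. A minimal sketch, assuming the third column is always empty as in the sample (the names tenor, number and rows are only for illustration):

```python
import re

a = ['COP', '\t\t\t', 'Basis', 'Notl', 'dv01', '6m', '9m', '1y',
     '18m', '2y', '3y', "15.6", 'mm', '4.6', '4y', '5y', '10', 'mm',
     '4.6', '6y', '7y', '8y', '9y', '10y', '20y', 'TOTAL', '\t\t9.2']

tenor = re.compile(r"\d+[ym]")         # e.g. 6m, 18m, 10y
number = re.compile(r"\d+(?:\.\d+)?")  # integer or simple decimal

rows = []
for i, item in enumerate(a):
    if tenor.fullmatch(item):
        # Look at the next item, if any, and keep it only if it is a number.
        following = a[i + 1] if i + 1 < len(a) else ""
        value = following if number.fullmatch(following) else ""
        # Assumption: the third column stays empty, as in the sample output.
        rows.append((item, value, ""))

for row in rows:
    print(row)
```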
I have a folder path that contains a database table name and an ID. It looks like:
path = '/something/else/TableName/000/123/456/789'
Of course, I can match TableName/000/123/456/789 and then split it in Python:
import re
matched = re.findall(r'.*?/(\w+(?:/\d+){4})', path)[0] # TableName/000/123/456/789
split_text = matched.split('/') # ['TableName', '000', '123', '456', '789']
table_name = split_text[0] # 'TableName'
id = int(''.join(split_text[1:])) # 123456789
But I want to know whether regex alone can do this in one step. I've tried these approaches:
re.match(r'.*?/(?P<table_name>\w+)(?:/(?P<id>\d+)){4}', path).groupdict() # {'table_name': 'TableName', 'id': '789'}
re.split(r'.*?/(\w+)(?:/(\d+)){4}', path) # ['', 'TableName', '789', '']
re.sub(r'(.*?/)\w+(?:(/)\d+){4}', '', path) # '', full string has been replaced
Is there any other way, or must I use the Python code above? I'd like the result to be {'table_name': 'TableName', 'id': '000123456789'} or ('TableName', '000123456789'), or at least ('TableName', '000', '123', '456', '789').
The simplest way is to avoid the quantifier and write out each group:
re.findall(r'(\w+)/(\d+)/(\d+)/(\d+)/(\d+)', path)
[('TableName', '000', '123', '456', '789')]
The easiest way would be to expand the grouping:
>>> match=re.search(r'.*?/(\w+)(?:/(\d+))(?:/(\d+))(?:/(\d+))(?:/(\d+))',path)
>>> match.groups()
('TableName', '000', '123', '456', '789')
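A repeated capturing group only keeps its last match, which is why the attempts with {4} return just '789'. One way around that is to capture the whole repeated part as a single group and strip the slashes afterwards. A minimal sketch of that idea (it still needs one small post-processing step, so it is not purely regex):

```python
import re

path = '/something/else/TableName/000/123/456/789'

# Capture the table name, then all four "/digits" segments as ONE group.
match = re.search(r'(?P<table_name>\w+)(?P<id>(?:/\d+){4})$', path)

result = {
    'table_name': match.group('table_name'),
    # The id group is '/000/123/456/789'; removing slashes joins the digits.
    'id': match.group('id').replace('/', ''),
}
print(result)
```

Keeping the id as a string preserves the leading zeros, which int() would drop.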
So I have some lines of text that are stored in a list as follows:
lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
I want to create:
[{'1.9': 'comment'},
{'1.11*': ''},
{'1.5': 'another comment'},
{'1.23': ''},
{'3.10.3*': 'commennnnnt'},
{'1.2': ''} ]
In other words, I want to take the list apart and pair each decimal number with either the comment (starting with '#'; we can assume that no numbers occur in it) that appears right after it on the same line, or with an empty string if there is no comment (e.g., the next thing after it is another number).
Specifically, a 'decimal number' is a single digit, followed by a dot and then one or two digits, optionally followed by another dot and one or two more digits. A '*' may appear at the very end. So something like this(?): r'\d\.\d{1,2}(\.\d{1,2})?\*?'
I've tried a few things with re.split() to get started. For example, splitting the first list item on either the crazy decimal regex or #, before worrying about the dict pairings:
>>> crazy=r'\d\.\d{1,2}(\.\d{1,2})?\*?'
>>> re.split(r'({0})|#'.format(crazy), results[0])
Result:
[u'',
u'1.9',
None,
u' ',
None,
None,
u'comment ',
u'1.11',
None,
u' ',
u'1.5',
None,
u' ',
None,
None,
u' test comment']
This looks like something I can filter and work with, but is there a better way? (also, wow...it seems the parentheses in my crazy regex allow me to keep the decimal number delimiters as desired!)
The following seems to work:
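import re

lines = ['1.9 #comment 1.11* 1.5 # another comment',
'1.23',
'3.10.3* #commennnnnt 1.2 ']
# capture each number (with optional '*'), then any comment text that follows it
entries = re.findall(r'([0-9.]+\*?)\s+((?:[\# ])?[a-zA-Z ]*)', " ".join(lines))
# strip the leading '#' and surrounding spaces from each comment
ldict = [{k: v.strip(" #")} for k,v in entries]
print(ldict)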
This displays:
[{'1.9': 'comment'}, {'1.11*': ''}, {'1.5': 'another comment'}, {'1.23': ''}, {'3.10.3*': 'commennnnnt'}, {'1.2': ''}]
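If you would rather keep the stricter number pattern from the question, a variant is to make the comment an optional group and process each line separately. A minimal sketch, relying on the question's assumption that comments contain no digits (the names pattern and pairs are only for illustration):

```python
import re

lines = ['1.9 #comment 1.11* 1.5 # another comment',
         '1.23',
         '3.10.3* #commennnnnt 1.2 ']

# Group 1: the number as described in the question.
# Group 2 (optional): comment text after '#', up to the next digit or '#'.
pattern = re.compile(r'(\d\.\d{1,2}(?:\.\d{1,2})?\*?)\s*(?:#([^#\d]*))?')

pairs = []
for line in lines:
    for m in pattern.finditer(line):
        # group(2) is None when no comment follows the number.
        comment = (m.group(2) or '').strip()
        pairs.append({m.group(1): comment})

print(pairs)
```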