Use Python format string in reverse for parsing

I've been using the following python code to format an integer part ID as a formatted part number string:
pn = 'PN-{:0>9}'.format(id)
I would like to know if there is a way to use that same format string ('PN-{:0>9}') in reverse to extract the integer ID from the formatted part number. If that can't be done, is there a way to use a single format string (or regex?) to create and parse?

The parse module "is the opposite of format()".
Example usage:
>>> import parse
>>> format_string = 'PN-{:0>9}'
>>> id = 123
>>> pn = format_string.format(id)
>>> pn
'PN-000000123'
>>> parsed = parse.parse(format_string, pn)
>>> parsed
<Result ('123',) {}>
>>> parsed[0]
'123'

You might find simulating scanf interesting.
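In the spirit of simulating scanf with a regular expression, here is a minimal sketch that keeps the format string and a matching pattern side by side, so that both directions are defined in one place. The names PN_FORMAT, PN_PATTERN, format_pn and parse_pn are made up for this illustration, and the pattern assumes the ID always fits in nine digits:

```python
import re

# Assumption: IDs are non-negative integers of at most nine digits.
PN_FORMAT = 'PN-{:0>9}'
PN_PATTERN = re.compile(r'PN-(\d{9})')

def format_pn(part_id):
    """Build the part number string from an integer ID."""
    return PN_FORMAT.format(part_id)

def parse_pn(pn):
    """Extract the integer ID from a part number string."""
    match = PN_PATTERN.fullmatch(pn)
    if match is None:
        raise ValueError('not a part number: {!r}'.format(pn))
    return int(match.group(1))  # int() drops the zero padding

print(format_pn(123))            # PN-000000123
print(parse_pn('PN-000000123'))  # 123
```

The two constants still have to be kept in sync by hand; the sketch only puts them next to each other.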

Here's a solution in case you don't want to use the parse module. It converts format strings into regular expressions with named groups. It makes a few assumptions (described in the docstring) that were okay in my case, but may not be okay in yours.
import re

def match_format_string(format_str, s):
    """Match s against the given format string, return dict of matches.

    We assume all of the arguments in format string are named keyword arguments (i.e. no {} or
    {:0.2f}). We also assume that all chars are allowed in each keyword argument, so separators
    need to be present which aren't present in the keyword arguments (i.e. '{one}{two}' won't work
    reliably as a format string but '{one}-{two}' will if the hyphen isn't used in {one} or {two}).
    We raise if the format string does not match s.

    Example:
        fs = '{test}-{flight}-{go}'
        s = fs.format(test='first', flight='second', go='third')
        match_format_string(fs, s) -> {'test': 'first', 'flight': 'second', 'go': 'third'}
    """
    # First split on any keyword arguments, note that the names of keyword arguments will be in the
    # 1st, 3rd, ... positions in this list
    tokens = re.split(r'\{(.*?)\}', format_str)
    keywords = tokens[1::2]
    # Now replace keyword arguments with named groups matching them. We also escape between keyword
    # arguments so we support meta-characters there. Re-join tokens to form our regexp pattern
    tokens[1::2] = map(u'(?P<{}>.*)'.format, keywords)
    tokens[0::2] = map(re.escape, tokens[0::2])
    pattern = ''.join(tokens)
    # Use our pattern to match the given string, raise if it doesn't match
    matches = re.match(pattern, s)
    if not matches:
        raise Exception("Format string did not match")
    # Return a dict with all of our keywords and their values
    return {x: matches.group(x) for x in keywords}
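A variation on the same idea, offered as a sketch rather than a drop-in replacement: the standard library's string.Formatter().parse already splits a format string into literal text and field names, so the regular expression can be assembled from its output. The helper name format_to_regex is hypothetical, format specs and conversions are ignored, and every field is assumed to match at least one character:

```python
import re
from string import Formatter

def format_to_regex(format_str):
    """Compile a format string with named fields into a regex with named groups."""
    parts = []
    for literal, field, _spec, _conversion in Formatter().parse(format_str):
        # Literal text between fields is matched verbatim.
        parts.append(re.escape(literal))
        if field is None:
            continue  # trailing literal text, no field follows
        if not field.isidentifier():
            raise ValueError('only named fields are supported: {!r}'.format(field))
        # Assumption: a non-greedy "anything" group per field.
        parts.append('(?P<{}>.+?)'.format(field))
    return re.compile(''.join(parts))

pattern = format_to_regex('{test}-{flight}-{go}')
match = pattern.fullmatch('first-second-third')
print(match.groupdict())
# {'test': 'first', 'flight': 'second', 'go': 'third'}
```

Using fullmatch forces the last non-greedy group to extend to the end of the string. The same separator caveat applies: adjacent fields without a literal between them cannot be split reliably.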

How about:
id = int(pn.split('-')[1])
This splits the part number at the dash, takes the second component and converts it to integer.
P.S. I've kept id as the variable name so that the connection to your question is clear. It is a good idea to rename that variable so that it doesn't shadow the built-in function.

Use lucidity
import lucidity
template = lucidity.Template('model', '/jobs/{job}/assets/{asset_name}/model/{lod}/{asset_name}_{lod}_v{version}.{filetype}')
path = '/jobs/monty/assets/circus/model/high/circus_high_v001.abc'
data = template.parse(path)
print(data)
# Output
# {'job': 'monty',
# 'asset_name': 'circus',
# 'lod': 'high',
# 'version': '001',
# 'filetype': 'abc'}

Related

Replace placeholders in string with replacements sequence

I have a location string with placeholders, used as '#'. Another string which are replacements for the placeholders. I want to replace them sequentially, (like format specifiers). What is the way to do it in Python?
location = '/tmp/#/dir1/#/some_dirx/dir/var/2/#/dir3'
replacements = 'xyz'
result = '/tmp/x/dir1/y/some_dirx/dir/var/2/z/dir3'

You should use the replace method of a string as follows:
for replacement in replacements:
    location = location.replace('#', replacement, 1)
It is important you use the third argument, count, in order to replace that placeholder just once. Otherwise, it will replace every time you find your placeholder.

If your location string does not contain format specifiers ({}) you could do:
location = '/tmp/#/dir1/#/some_dirx/dir/var/2/#/dir3'
replacements='xyz'
print(location.replace("#", "{}").format(*replacements))
Output
/tmp/x/dir1/y/some_dirx/dir/var/2/z/dir3
As an alternative you could use the fact that repl in re.sub can be a function:
import re
from itertools import count
location = '/tmp/#/dir1/#/some_dirx/dir/var/2/#/dir3'
def repl(match, replacements='xyz', index=count()):
    return replacements[next(index)]
print(re.sub('#', repl, location))
Output
/tmp/x/dir1/y/some_dirx/dir/var/2/z/dir3
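One thing to watch with a counter stored in a default argument is that it is created once and keeps counting across calls. A minimal sketch that creates a fresh iterator for every call instead; the function name fill_placeholders and its placeholder parameter are made up for this illustration:

```python
import re

def fill_placeholders(template, replacements, placeholder='#'):
    """Replace each placeholder in turn with the next item of replacements."""
    values = iter(replacements)  # fresh iterator on every call
    # The callback ignores the match and just hands out the next value.
    return re.sub(re.escape(placeholder), lambda match: next(values), template)

location = '/tmp/#/dir1/#/some_dirx/dir/var/2/#/dir3'
print(fill_placeholders(location, 'xyz'))
# /tmp/x/dir1/y/some_dirx/dir/var/2/z/dir3
```

If there are more placeholders than replacements, next() raises StopIteration, which is probably what you want to see rather than a silently incomplete result.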

How to copy changing substring in string?

How can I copy data from changing string?
I tried to slice, but length of slice is changing.
For example in one case I should copy number 128 from string '"edge_liked_by":{"count":128}', in another I should copy 15332 from "edge_liked_by":{"count":15332}

You could use a regular expression:
import re
string = '"edge_liked_by":{"count":15332}'
number = re.search(r'{"count":(\d*)}', string).group(1)

Really depends on the situation, however I find regular expressions to be useful.
To grab the numbers from the string without caring about their location, you would do as follows:
import re
def get_string(string):
    return re.search(r'\d+', string).group(0)
>>> get_string('"edge_liked_by":{"count":128}')
'128'
To only get numbers from the end of the string, you can use an anchor to ensure the result is pulled from the far end. The following example grabs any unbroken sequence of digits that is preceded by a colon and ends within 5 characters of the end of the string:
import re
def get_string(string):
    rval = None
    string_match = re.search(r':(\d+).{0,5}$', string)
    if string_match:
        rval = string_match.group(1)
    return rval
>>> get_string('"edge_liked_by":{"count":128}')
'128'
>>> get_string('"edge_liked_by":{"1321":1}')
'1'
In the above example, adding the colon will ensure that we only pick values and don't match keys such as the "1321" that I added in as a test.
If you just want anything after the last colon, but excluding the bracket, try combining split with slicing:
>>> '"edge_liked_by":{"count":128}'.split(':')[-1][0:-1]
'128'
Finally, considering this looks like a JSON object, you can add curly brackets to the string and treat it as such. Then it becomes a nested dict you can query:
>>> import json
>>> string = '"edge_liked_by":{"count":128}'
>>> string = '{' + string + '}'
>>> string = json.loads(string)
>>> string.get('edge_liked_by').get('count')
128
The first three return a string; the final one returns a number, because the value is parsed as JSON.

It looks like the type of string you are working with is read from JSON, maybe you are getting it as the output of some API you are working with?
If it is JSON, you've probably gone one step too far in atomizing it to a string like this. I'd work with the original output, if possible, if I were you.
If not, to make it more JSON-like, I'd convert it to JSON by wrapping it in {} and then parse it with the json.loads function.
import json
string = '"edge_liked_by":{"count":15332}'
string = "{"+string+"}"
json_obj = json.loads(string)
count = json_obj['edge_liked_by']['count']
count will have the desired output. I prefer this option to using regular expressions because you can rely on the structure of the data and reuse the code in case you wish to parse out other attributes, in a very intuitive way. With regular expressions, the code you use will change if the data are decimal, or negative, or contain non-numeric characters.
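To illustrate the point about negative and decimal values, here is a small sketch; the helper name read_count and its key parameter are hypothetical, and the fragment is assumed to become valid JSON once wrapped in braces:

```python
import json

def read_count(fragment, key='edge_liked_by'):
    """Wrap the fragment in braces, parse it as JSON and return the count."""
    obj = json.loads('{' + fragment + '}')
    return obj[key]['count']

print(read_count('"edge_liked_by":{"count":15332}'))  # 15332
print(read_count('"edge_liked_by":{"count":-7}'))     # -7
print(read_count('"edge_liked_by":{"count":12.5}'))   # 12.5
```

The same code handles all three inputs, whereas a \d+ pattern would need changing for the sign and the decimal point.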

Does this help ?
a='"edge_liked_by":{"count":128}'
import re
b=re.findall(r'\d+', a)[0]
b
Out[16]: '128'

Check if a variable substring is in a string

I receive an input string having values expressed in two possible formats. E.g.:
#short format
data = '"interval":19'
>>> "interval":19
#extended format
data = '"interval":{"t0":19,"tf":19}'
>>> "interval":{"t0":19,"tf":19}
I would like to check whether a short format is used and, in case, make it extended.
Considering that the string could be composed of multiple values, i.e.
data = '"interval":19,"interval2":{"t0":10,"tf":15}'
>>> "interval":19,"interval2":{"t0":10,"tf":15}
I cannot just say:
if ":{" not in data:
#then short format is used
I would like to code something like:
if ":$(a general int/float/double number)" in data:
#extract the number
#replace ":{number}" with the extended format
I know how to code the replacing part.
I need help for implementing if condition: in my mind, I model it like a variable substring, in which the variable part is the number inside it, while the rigid format is the $(value name) + ":" part.
"some_value":19
^            ^
rigid format variable part
EDIT - WHY NOT PARSE IT?
I know the string is "JSON-friendly" and I can convert it into a dictionary, easily accessing then the values.
Indeed, I already have this solution in my code. But I don't like it since the input string could be multilevel and I need to iterate on the leaf values of the resulting dictionary, independently from the dictionary levels. The latter is not a simple thing to do.
So I was wondering whether a way to act directly on the string exists.

If you replace all keys, except t0, tf, followed by numbers, it should work.
I show you an example on a multilevel string, probably to be put in a better shape:
import re
s = '"interval": 19,"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval":23}}'
gex = r'("(?!(t0|tf)")\w+":)\s*(\d+)'
new_s = re.sub(gex, r'\1 {"t0": \3, "tf": \3}', s)
print(new_s)
"interval": {"t0": 19, "tf": 19},"t0interval2":{"t0":10,"tf":15},{"deeper": {"other_interval": {"t0": 23, "tf": 23}}}

You could use a regular expression. ("interval":)(\d+) will look for the string '"interval":' followed by any number of digits.
Let's test this
import re
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'xxx', data)
print(result)
# -> xxx,"interval2":{"t0":10,"tf":15},xxx
We see that we found the correct places. Now we're going to create your target format. Here the matched groups come in handy. In the regular expression ("interval":) is group 1, (\d+) is group 2.
Now we use the content of those groups to create your wanted result.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"interval":25'
result = re.sub(r'("interval":)(\d+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"interval":{"t0":25,"tf":25}
If there are floating point values involved you'll have to change (\d+) to ([.\d]+).
If you want any Unicode standard word characters and not only interval you can use the special sequence \w and because it could be multiple characters the expression will be \w+.
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'("\w+":)([.\d]+)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":{"t0":10,"tf":10},"tf":{"t0":15,"tf":15}},"Monty":{"t0":25.4,"tf":25.4}
Dang! Yes, we found "Monty" but now the values from the second part are found too. We'll have to fix this somehow. Let's see. We don't want ("\w+") if it's preceded by { so we're going to use a negative lookbehind assertion: (?<!{)("\w+"). And after the number part (\d+) we don't want a } or another digit so we're using a negative lookahead assertion here: ([.\d]+)(?!})(?!\d).
data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
result = re.sub(r'(?<!{)("\w+":)([.\d]+)(?!})(?!\d)', r'\1{"t0":\2,"tf":\2}', data)
print(result)
# -> "interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}
Hooray, it works!
Regular expressions are powerful and fun, but if you start to add more constraints this might become unmanageable.
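If the constraints do grow, the parse-first route mentioned in the question can stay fairly small. A minimal sketch, assuming the string becomes valid JSON once wrapped in braces and that any dict with exactly the keys t0 and tf is already in extended format; the function name expand_intervals is made up here:

```python
import json

def expand_intervals(node):
    """Recursively turn every bare number into {'t0': n, 'tf': n}."""
    if isinstance(node, dict):
        if set(node) == {'t0', 'tf'}:
            return node  # already in extended format
        return {key: expand_intervals(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_intervals(item) for item in node]
    if isinstance(node, bool):
        return node  # bool is a subclass of int, leave it alone
    if isinstance(node, (int, float)):
        return {'t0': node, 'tf': node}
    return node

data = '"interval":19,"interval2":{"t0":10,"tf":15},"Monty":25.4'
expanded = expand_intervals(json.loads('{' + data + '}'))
print(json.dumps(expanded, separators=(',', ':')))
# {"interval":{"t0":19,"tf":19},"interval2":{"t0":10,"tf":15},"Monty":{"t0":25.4,"tf":25.4}}
```

The recursion visits leaf values at any depth, so the nesting level of the input does not matter.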

Python: re.findall() does not work for overlapping substrings

I want to match a string with a list of values. These can be overlapping, so for example string = "test1 test2" and values = ["test1", "test1 test2"].
EDIT: Below is my entire code for a simple example
import regex
string = "This is a test string"
values = ["test", "word", "string", "test string"]
pattern = r'\b({})\b'.format('|'.join(map(regex.escape, values)))
matches = set(map(str.lower, regex.findall(pattern, string, regex.IGNORECASE)))
output = ([x.upper() for x in values if x.lower() in matches])
print(output) # ['TEST', 'STRING']
# Expected output: ['TEST', 'STRING', 'TEST STRING']

As Wiktor commented, if you want to find all matches, you cannot
use alternatives: findall returns non-overlapping matches, and at each
position the regex engine tries the alternatives in order and returns
only the first one that matches.
So your program has to use a separate pattern for each value to test,
but for performance reasons you can compile all of them in advance.
Another difference I spotted between your Python installation and mine
is import regex. regex is a third-party package, an alternative to the
standard library's re module. I use import re (Python 3.7), which is
sufficient here; Python 2.7.15 provides import re as well.
The script can look like below:
import re
def mtch(pat, str):
    s = pat.search(str)
    return s.group().upper() if s else None
# Strings to look for
values = ["test", "word", "string", "test string"]
# Compile patterns
patterns = [ re.compile(r'\b({})\b'.format(re.escape(v)),
    re.IGNORECASE) for v in values ]
# The string to check
string = "This is a test string"
# What has been found
list(filter(None, [ mtch(pat, string) for pat in patterns ]))
mtch function returns the text found by pat (the compiled pattern)
in str (source string) or None if the match failed.
patterns contains a list of compiled patterns.
Then there is [ mtch(pat, string) for pat in patterns ] a list
comprehension, generating match result list (with None values
if the match attempt failed).
To filter out None values I used filter function.
And finally list gathers all filtered strings and prints:
['TEST', 'STRING', 'TEST STRING']
If you want to perform this search for multiple source strings,
run only the last statement for each source string, probably adding
the result (and some indication of what string has been searched)
to some result list.
If your source list is very long, you should not attempt to read them all.
Instead, you should read them one by one in a loop and run the check
only for the current input string.
Edit concerning comment as of 2019-02-18 10:00Z
As I read from your comment, the code reading strings is as follows:
with open("Test_data.csv") as f:
    for entry in f:
        entry = entry.split(',')
        string = entry[2] + " " + entry[3] + " " + entry[6]
Note that you overwrite string in every loop, so after the loop completed,
you have there the result from the last row (only).
Or maybe just after reading you run the search for patterns for the current
string?
Some more hints on changing the code:
Avoid reusing one variable for two things, e.g. entry first holding
the whole row string and then the list produced by splitting it.
Maybe a more readable variant is:
for row in f:
    entry = row.split(',')
After you read a row and before doing anything else, check whether the row
just read is not empty. If the row is empty, omit it.
A quick way to test it is to strip the row and use the result in if (an empty
string evaluates to False; a blank line read from a file still contains its
newline, so strip it first).
for row in f:
    if row.strip():
        entry = row.split(',')
        ...
Before string = entry[2] + " " + entry[3] + " " + entry[6] check
whether entry list has at least 7 items (numeration is from 0).
Maybe some of your input rows contain smaller number of fragments
and hence your program attempts to read from a non-existing element of
this list?
To be sure, what strings you are checking, write a short program
which only splits the input and prints resulting strings. Then look at them, maybe you find something wrong.
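Putting these hints together, a sketch of the row loop could look like the following. It swaps in the standard csv module for the manual split, and uses an in-memory sample instead of Test_data.csv; the function names find_values and scan_rows and the sample rows are invented for the illustration:

```python
import csv
import io
import re

values = ["test", "word", "string", "test string"]
# One compiled pattern per value, built once up front.
patterns = [re.compile(r'\b{}\b'.format(re.escape(v)), re.IGNORECASE)
            for v in values]

def find_values(text):
    """Return the upper-cased values that occur in text."""
    return [v.upper() for v, pat in zip(values, patterns) if pat.search(text)]

def scan_rows(lines):
    """Check columns 2, 3 and 6 of every sufficiently long CSV row."""
    results = []
    for entry in csv.reader(lines):
        if len(entry) < 7:
            continue  # blank or too-short row, skip it
        text = ' '.join((entry[2], entry[3], entry[6]))
        results.append((text, find_values(text)))
    return results

# Hypothetical stand-in for the real file: one good row, one blank, one short.
sample = io.StringIO('id,name,This is,a test,x,y,string\n\nshort,row\n')
for text, found in scan_rows(sample):
    print(text, found)
# This is a test string ['TEST', 'STRING', 'TEST STRING']
```

Keeping the searched text next to its result makes it easy to see which row produced which matches.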

If you determine that foobar is in the text, you don't need to search the text separately for foo and bar: you know the answer already.
First group your searches:
searches = ['test', 'word', 'string', 'test string', 'wo', 'wordy']
unique = set(searches)
ordered = sorted(unique, key = len)
grouped = {}
while unique:
    s1 = ordered.pop()
    if s1 in unique:
        unique.remove(s1)
        grouped[s1] = [s1]
        redundant = [s2 for s2 in unique if s2 in s1]
        for s2 in redundant:
            unique.remove(s2)
            grouped[s1].append(s2)
for s, dups in grouped.items():
    print(s, dups)
# Output:
# test string ['test string', 'string', 'test']
# wordy ['wordy', 'word', 'wo']
Once you have things grouped, you can confine the searching to just the top-level searches (the keys of grouped).
Also, if scale and performance are concerns, do you really need regular expressions? Your current examples could be handled with ordinary in tests, which are faster. If you do indeed need regular expressions, the idea of grouping the searches is harder -- but perhaps not impossible under some conditions.
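One way the grouped searches might be used, as a sketch built on plain in tests: when a top-level search is found, everything grouped under it is known to be present; when it is not found, this version falls back to checking the shorter members one by one (that fallback is an assumption of the sketch, not something the grouping itself gives you). The function name found_terms is hypothetical and grouped is written out literally so the example is self-contained:

```python
# Grouping as produced for the example searches.
grouped = {
    'test string': ['test string', 'string', 'test'],
    'wordy': ['wordy', 'word', 'wo'],
}

def found_terms(text, grouped):
    """Return every search term contained in text, using the groups as a shortcut."""
    found = []
    for top, members in grouped.items():
        if top in text:
            # The longest term is present, so all of its substrings are too.
            found.extend(members)
        else:
            # Shortcut not available: test the shorter members individually.
            found.extend(m for m in members if m != top and m in text)
    return found

print(sorted(found_terms('This is a test string', grouped)))
# ['string', 'test', 'test string']
print(sorted(found_terms('a word', grouped)))
# ['wo', 'word']
```

Note that plain substring tests ignore word boundaries, so 'wo' counts as found inside 'word'.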

Replace named captured groups with arbitrary values in Python

I need to replace the value inside a capture group of a regular expression with some arbitrary value; I've had a look at the re.sub, but it seems to be working in a different way.
I have a string like this one :
s = 'monthday=1, month=5, year=2018'
and I have a regex matching it with captured groups like the following :
regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
now I want to replace the group named d with aaa, the group named m with bbb and group named Y with ccc, like in the following example :
'monthday=aaa, month=bbb, year=ccc'
basically I want to keep all the non matching string and substitute the matching group with some arbitrary value.
Is there a way to achieve the desired result ?
Note
This is just an example, I could have other input regexs with different structure, but same name capturing groups ...
Update
Since it seems like most of the people are focusing on the sample data, I add another sample, let's say that I have this other input data and regex :
input = '2018-12-12'
regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
as you can see I still have the same number of capturing groups(3) and they are named the same way, but the structure is totally different... What I need though is as before replacing the capturing group with some arbitrary text :
'ccc-bbb-aaa'
replace capture group named Y with ccc, the capture group named m with bbb and the capture group named d with aaa.
In case regexes are not the best tool for the job, I'm open to other proposals that achieve my goal.

This is a completely backwards use of regex. The point of capture groups is to hold text you want to keep, not text you want to replace.
Since you've written your regex the wrong way, you have to do most of the substitution operation manually:
import re

def replace_groups(pattern, string, replacements):
    """Replace the text captured by named groups."""
    pattern = re.compile(pattern)
    # create a dict of {group_index: group_name} for use later
    groupnames = {index: name for name, index in pattern.groupindex.items()}

    def repl(match):
        # we have to split the matched text into chunks we want to keep and
        # chunks we want to replace
        # captured text will be replaced. uncaptured text will be kept.
        text = match.group()
        # match.start(i) and match.end(i) are positions in the whole string,
        # so shift them to be relative to the matched text
        offset = match.start()
        chunks = []
        lastindex = 0
        for i in range(1, pattern.groups+1):
            groupname = groupnames.get(i)
            if groupname not in replacements:
                continue
            # keep the text between this group and the previous one
            chunks.append(text[lastindex:match.start(i) - offset])
            # then instead of the captured text, insert the replacement text for this group
            chunks.append(replacements[groupname])
            lastindex = match.end(i) - offset
        chunks.append(text[lastindex:])
        # join all the chunks to obtain the final string with replacements
        return ''.join(chunks)

    # for each occurrence call our custom replacement function
    return pattern.sub(repl, string)

>>> replace_groups(regex, s, {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
'monthday=aaa, month=bbb, year=ccc'

You can use string formatting with a regex substitution:
import re
s = 'monthday=1, month=5, year=2018'
s = re.sub(r'(?<==)\d+', '{}', s).format(*['aaa', 'bbb', 'ccc'])
Output:
'monthday=aaa, month=bbb, year=ccc'
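Edit: given an arbitrary input string and regex, you can use formatting in the same way, as long as the pattern matches each value separately (one {} per match). Note that re.sub replaces each whole match, so a pattern that matches the entire date at once, like the one below, yields a single placeholder and new_s becomes 'aaa':
input = '2018-12-12'
regex = r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))'
new_s = re.sub(regex, '{}', input).format(*["aaa", "bbb", "ccc"])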
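To substitute each named group separately, wherever it sits in the pattern, one option is to splice the replacements into the matched text by span, working from right to left so that earlier positions stay valid. This is only a sketch; the function name replace_named_groups is made up, and nested named groups are not handled:

```python
import re

def replace_named_groups(pattern, string, replacements):
    """Replace the text of each named group with replacements[name]."""
    def repl(match):
        text = match.group()
        base = match.start()  # spans are positions in the whole string
        spans = sorted(
            ((match.span(name), value)
             for name, value in replacements.items()
             if name in match.re.groupindex and match.start(name) != -1),
            reverse=True,
        )
        # Rightmost group first, so earlier offsets are not shifted.
        for (start, end), value in spans:
            text = text[:start - base] + value + text[end - base:]
        return text
    return re.sub(pattern, repl, string)

words = {'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'}
print(replace_named_groups(
    r'((?P<Y>20\d{2})-(?P<m>[0-1]?\d)-(?P<d>\d{2}))', '2018-12-12', words))
# ccc-bbb-aaa
```

Because the lookup is by group name, the same call works for the monthday=..., month=..., year=... pattern from the question.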

Extended Python 3.x solution on extended example (re.sub() with replacement function):
import re
d = {'d':'aaa', 'm':'bbb', 'Y':'ccc'} # predefined dict of replace words
pat = re.compile(r'(monthday=)(?P<d>\d{1,2})|(month=)(?P<m>\d{1,2})|(year=)(?P<Y>20\d{2})')
def repl(m):
    pair = next(t for t in m.groupdict().items() if t[1])
    k = next(filter(None, m.groups())) # preceding `key` for currently replaced sequence (i.e. 'monthday=' or 'month=' or 'year=')
    return k + d.get(pair[0], '')
s = 'Data: year=2018, monthday=1, month=5, some other text'
result = pat.sub(repl, s)
print(result)
The output:
Data: year=ccc, monthday=aaa, month=bbb, some other text
For Python 2.7 :
change the line k = next(filter(None, m.groups())) to:
k = filter(None, m.groups())[0]

I suggest you use a loop
import re
regex = re.compile(r'monthday=(?P<d>\d{1,2}), month=(?P<m>\d{1,2}), year=(?P<Y>20\d{2})')
s = 'monthday=1, month=1, year=2017 \n'
s+= 'monthday=2, month=2, year=2019'
regex_as_str = 'monthday={d}, month={m}, year={Y}'
matches = [match.groupdict() for match in regex.finditer(s)]
for match in matches:
    s = s.replace(
        regex_as_str.format(**match),
        regex_as_str.format(**{'d': 'aaa', 'm': 'bbb', 'Y': 'ccc'})
    )
You can do this multiple times with your different regex patterns,
or you can join ("or") both patterns together.
