Can difflib's charjunk be used to ignore whitespace? - python

I'd like to compare differences between two lists of strings. For my purposes, whitespace is noise and these differences do not need to be shown. Reading into difflib's documentation, "the default [for charjunk] is module-level function IS_CHARACTER_JUNK(), which filters out whitespace characters". Perfect, except I don't see it working, or making much difference (<- pun!).
import difflib
A = ['3  4\n']
B = ['3 4\n']
print(''.join(difflib.ndiff(A, B)))  # default: charjunk=difflib.IS_CHARACTER_JUNK
outputs:
- 3  4
?   -
+ 3 4
I've tried a few other linejunk options, but none that actually ignore the differences as a result of whitespace. Do I have the wrong interpretation for what charjunk is for?
As a side note, I can side-step this limitation by pre-processing my strings, substituting runs of whitespace with a single space using re.sub(r'\s+', ' ', 'foo\t bar').
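As far as I can tell, charjunk only influences the intraline '?' hints that ndiff emits, not whether two lines compare as equal, so normalizing whitespace up front is the practical fix. A minimal sketch of that pre-processing (normalize_ws is my own helper name, not part of difflib):

```python
import difflib
import re

def normalize_ws(lines):
    # collapse each run of whitespace to a single space and strip the
    # ends, so whitespace-only differences disappear before diffing
    return [re.sub(r'\s+', ' ', line).strip() + '\n' for line in lines]

A = ['3  4\n']
B = ['3 4\n']
changes = [d for d in difflib.ndiff(normalize_ws(A), normalize_ws(B))
           if d.startswith(('-', '+'))]
print(changes)  # [] -- the whitespace-only difference is gone
```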


Add spaces at the beginning of the print output in python

I'm wondering how I'm supposed to add 4 spaces at the beginning of the print output with an f-string or format in Python.
This is what I use to print now:
print('{:<10}'.format('Constant'),'{:{width}.{prec}f}'.format(122, width=10, prec=3))
and my output looks like this:
Constant      122.000
but what I want is to have 4 spaces before the constant in the output like:
( 4 spaces here )Constant 122.000
Any ideas? Thanks a lot!
You can use '    ' + string (as suggested), but a more robust approach could be:
string = "Test String leading space to be added"
spaces_to_add = 4
string_length = len(string) + spaces_to_add  # will be adding 4 extra spaces
string_revised = string.rjust(string_length)
result:
'    Test String leading space to be added'
There are a couple of ways you could do it:
If Constant is really an unchanging constant, why not just
print(f"    Constant", ...)
before your other string?
With your current implementation, you are left-aligning to a width of 10 characters. If you swap that to right-align, like '{:>12}'.format('Constant') ("Constant" is 8 characters, so 12 - 8 = 4 spaces), it will put 4 spaces in front of the string.
Here's a Python f-string syntax cheat sheet I've used before:
https://myshell.co.uk/blog/2018/11/python-f-string-formatting-cheatsheet/
And the official docs: PEP 3101
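Combining both suggestions, a minimal sketch (the four leading spaces are literal; label and value are just example names):

```python
label, value = 'Constant', 122

# four literal spaces, then the label left-aligned in 10 characters,
# then the value right-aligned as a 10-wide float with 3 decimals
line = f'    {label:<10}{value:10.3f}'
print(line)
```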

Python: Use Regex to Match Phone Number And Print Tuple (w/Formatting Constraints)

I want to write code that can parse American phone numbers (i.e. "(664)298-4397"). Below are the constraints:
allow leading and trailing white spaces
allow white spaces that appear between area code and local numbers
no white spaces in area code or the seven digit number XXX-XXXX
Ultimately I want to print a tuple of strings (area_code, first_three_digits_local, last_four_digits_local)
I have two sets of questions.
Question 1:
Below are inputs my code should accept and print the tuple for:
'(664) 298-4397', '(664)298-4397', ' (664) 298-4397'
Below is the code I tried:
import re
regex_parse1 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664) 298-4397')
print(f'groups are: {regex_parse1.groups()} \n')
regex_parse2 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(664)298-4397')
print(f'groups are: {regex_parse2.groups()} \n')
regex_parse3 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' (664) 298-4397')
print(f'groups are: {regex_parse3.groups()}')
The string input for all three are valid and should return the tuple:
('664', '298', '4397')
But instead I'm getting the output below for all three:
groups are: ('', '', '4397')
What am I doing wrong?
Question 2:
The following two chunks of code should raise a 'NoneType' object has no attribute 'groups' error, because the input phone number strings violate the constraints. But instead, I get output for both.
regex_parse4 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', '(404)555 -1212')
print(f'groups are: {regex_parse4.groups()}')
regex_parse5 = re.match(r'^([\s]*[(]*[0-9]*[)]*[\s]*)+([\s]*[0-9]*)-([0-9]*[\s]*)$', ' ( 404)121-2121')
print(f'groups are: {regex_parse5.groups()}')
Expected output: should be an error, but instead I get this for both:
groups are: ('', '', '2121')
What is wrong with my regex code?
In general, your regex overuses the asterisk *. Details follow:
You have 3 capturing groups:
([\s]*[(]*[0-9]*[)]*[\s]*)
([\s]*[0-9]*)
([0-9]*[\s]*)
You use an asterisk on every single item, including the opening and closing parentheses; almost everything in your regex is quantified with an asterisk. As a result, the capturing groups also match empty strings, which is why your first and second capturing groups return empty strings. The only item without an asterisk is the hyphen - just before the third capturing group, which is also why your regex can still capture the third group, as in 4397 and 2121.
To solve your problem, use the asterisk only where it is actually needed.
In fact, your regex still has plenty of room for improvement. For example, it currently matches runs of digits of any length (instead of 3- or 4-digit chunks). It also allows an area code not enclosed in parentheses (because of the asterisks around the parenthesis symbols).
For this kind of common regex, you don't need to reinvent the wheel. You can refer to ready-made regexes easily found on the Internet. For example, you can refer to this post. Although that post uses JavaScript instead of Python, the regex is much the same.
Try:
regex_parse4 = re.match(r'\s*\(([0-9]{3})\)\s*([0-9]{3})-([0-9]{4})', number)
Assumes a 3-digit area code in parentheses, followed by XXX-XXXX.
re.match returns None when there is no match.
If above does not work, here is a helpful regex tool:
https://regex101.com
Edit:
Another suggestion is to clean the data before applying a simpler regex. This handles abnormal spacing and gets rid of the parentheses and '-'.
clean_number = re.sub("[^0-9]", "", original_number)
regex_parse = re.match(r'([0-9]{3})([0-9]{3})([0-9]{4})', clean_number)
print(f'groups are: {regex_parse.groups()}')
groups are: ('xxx', 'xxx', 'xxxx')
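Putting all the constraints together, one tightened pattern (a sketch; anchors and explicit {3}/{4} counts replace the asterisks):

```python
import re

# \s* at both ends allows leading/trailing whitespace; \s* between the
# area code and the local number allows an optional space there; everything
# else is exact, so stray spaces inside the digits fail to match
PHONE = re.compile(r'^\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*$')

for s in ['(664) 298-4397', '(664)298-4397', ' (664) 298-4397']:
    print(PHONE.match(s).groups())   # ('664', '298', '4397') each time

print(PHONE.match('(404)555 -1212'))   # None: space inside the local number
print(PHONE.match(' ( 404)121-2121'))  # None: space inside the area code
```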

Python - Removing \n when using default split()?

I'm working on strings where I'm taking input from the command line. For example, with this input:
format driveName "datahere"
when I go string.split(), it comes out as:
>>> input.split()
['format', 'driveName', '"datahere"']
which is what I want.
However, when I specify it to be string.split(" ", 2), I get:
>>> input.split(' ', 2)
['format\n', 'driveName\n', '"datahere"']
Does anyone know why and how I can resolve this? I thought it could be because I'm creating it on Windows and running on Unix, but the same problem occurs when I use nano in unix.
The third argument (data) could contain newlines, so I'm cautious not to use a sweeping newline remover.
The default separator in split() is any whitespace, which includes newlines (\n) and tabs as well as spaces.
Here is what the docs on split say:
str.split([sep[, maxsplit]])
If sep is not specified or is None, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single
separator, and the result will contain no empty strings at the start
or end if the string has leading or trailing whitespace.
When you define a new sep it only uses that separator to split the strings.
Use None to get the default whitespace splitting behaviour with a limit:
input.split(None, 2)
This leaves the whitespace at the end of input() untouched.
Or you could strip the values afterwards; this removes whitespace from the start and end, not the middle, of each resulting string, just like input.split() would:
[v.strip() for v in input.split(' ', 2)]
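A quick comparison of the two behaviours (the sample line is made up):

```python
line = 'format  driveName\tdata here\n'

# splitting on a literal space keeps the tab and newline, and the run of
# two spaces produces an empty string in the result
print(line.split(' ', 2))   # ['format', '', 'driveName\tdata here\n']

# sep=None splits on any whitespace run; the remainder after maxsplit
# keeps its trailing newline untouched
print(line.split(None, 2))  # ['format', 'driveName', 'data here\n']
```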
The default str.split targets a number of "whitespace characters", including also tabs and others. If you do str.split(' '), you tell it to split only on ' ' (a space). You can get the default behavior by specifying None, as in str.split(None, 2).
There may be a better way of doing this, depending on what your actual use-case is (your example does not replicate the problem...). As your example output implies newlines as separators, you should consider splitting on them explicitly.
inp = """
format
driveName
datahere
datathere
"""
inp.strip().split('\n', 2)
# ['format', 'driveName', 'datahere\ndatathere']
This allows you to have spaces (and tabs etc) in the first and second item as well.

Regex include the negative lookbehind

I'm trying to filter a string before passing it through eval in python. I want to limit it to math functions, but I'm not sure how to strip it with regex. Consider the following:
s = 'math.pi * 8'
I want that to basically translate to 'math.pi*8', stripped of spaces. I also want to strip any letters [A-Za-z] that are not preceded by math\..
So if s = 'while(1): print "hello"', I want any executable part of it to be stripped:
s would ideally equal something like ():"" in that scenario (all letters gone, because they were not preceded by math\.).
Here's the regex I've tried:
(?<!math\.)[A-Za-z\s]+
and the python:
re.sub(r'(?<!math\.)[A-Za-z\s]+', r'', 'math.pi * 8')
But the result is '.p*8', because 'math' is not preceded by math., and 'i' is not preceded by math..
How can I strip letters that are not part of 'math' and are not preceded by math.?
What I ended up doing
I followed @Thomas's answer, but also stripped square brackets, spaces, and underscores from the string, in the hope that no Python function can be executed other than through the math module:
s = re.sub(r'(\[.*?\]|\s+|_)', '', s)
s = eval(s, {
'__builtins__' : None,
'math' : math
})
As @Carl says in a comment, look at what lybniz does for something better. But even this is not enough!
The technique described at the link is the following:
print eval(raw_input(), {"__builtins__":None}, {'pi':math.pi})
But this doesn't prevent something like
([x for x in 1.0.__class__.__base__.__subclasses__()
if x.__name__ == 'catch_warnings'][0]()
)._module.__builtins__['__import__']('os').system('echo hi!')
Source: Several of Ned Batchelder's posts on sandboxing, see http://nedbatchelder.com/blog/201302/looking_for_python_3_builtins.html
Edit: it was pointed out that square brackets and spaces are stripped, so:
1.0.__class__.__base__.__subclasses__().__getitem__(i)()._module.__builtins__.get('__import__')('os').system('echo hi')
where you just try a lot of values for i.
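Given how easy the string-stripping approach is to bypass, a commonly suggested alternative is to whitelist AST nodes instead of filtering text before eval(). This is only a sketch, not a hardened sandbox, and safe_math_eval is my own name for it:

```python
import ast
import math
import operator

# map of allowed binary operator node types to their implementations
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def safe_math_eval(expr):
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        # attribute access is allowed only on the literal name 'math',
        # e.g. math.pi -- anything else is rejected
        if (isinstance(node, ast.Attribute) and
                isinstance(node.value, ast.Name) and node.value.id == 'math'):
            return getattr(math, node.attr)
        raise ValueError('disallowed expression')
    return ev(ast.parse(expr, mode='eval'))

print(safe_math_eval('math.pi * 8'))  # 8 * pi as a float
```

Anything outside the whitelist, such as a function call or subscript, raises ValueError instead of being evaluated.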

python regex in pyparsing

How do you express the regex below in pyparsing? It should return a list of tokens matching the regex.
Any help would be greatly appreciated! Thank you!
python regex example in the shell:
>>> re.split(r"(\w+)(lab)(\d+)", "abclab1", 3)
['', 'abc', 'lab', '1', '']
I tried this in pyparsing, but I can't seem to figure out how to get it right, because the first match is greedy, i.e. the first token will be 'abclab' instead of the two tokens 'abc' and 'lab'.
pyparsing example (high level, i.e non working code):
name = 'abclab1'
location = Word(alphas).setResultsName('location')
lab = CaselessLiteral('lab').setResultsName('environment')
identifier = Word(nums).setResultsName('identifier')
expr = location + lab + identifier
match, start, end = expr.scanString(name).next()
print match.asDict()
Pyparsing's classes are pretty much left-to-right, with lookahead implemented using explicit expressions like FollowedBy (for positive lookahead) and NotAny or the '~' operator (for negative lookahead). This allows you to detect a terminator which would normally match an item that is being repeated. For instance, OneOrMore(Word(alphas)) + Literal('end') will never find a match in strings like "start blah blah end", because the terminating 'end' will get swallowed up in the repetition expression in OneOrMore. The fix is to add negative lookahead in the expression being repeated: OneOrMore(~Literal('end') + Word(alphas)) + Literal('end') - that is, before reading another word composed of alphas, first make sure it is not the word 'end'.
This breaks down when the repetition is within a pyparsing class, like Word. Word(alphas) will continue to read alpha characters as long as there is no whitespace to stop the word. You would have to break into this repetition using something very expensive, like Combine(OneOrMore(~Literal('lab') + Word(alphas, exact=1))) - I say expensive because composition of simple tokens using complex Combine expressions will make for a slow parser.
You might be able to compromise by using a regex wrapped in a pyparsing Regex object:
>>> labword = Regex(r'(\w+)(lab)(\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
This does the right kind of grouping and detection, but does not expose the groups themselves. To do that, add names to each group - pyparsing will treat these like results names, and give you access to the individual fields, just as if you had called setResultsName:
>>> labword = Regex(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)')
>>> print labword.parseString("abclab1").dump()
['abclab1']
- env: lab
- identifier: 1
- locn: abc
>>> print labword.parseString("abclab1").asDict()
{'identifier': '1', 'locn': 'abc', 'env': 'lab'}
The only other non-regex approach I can think of would be to define an expression to read the whole string, and then break up the parts in a parse action.
If you remove the capturing groups (the parentheses), you'll get the right answer :)
>>> re.split(r"\w+lab\d+", "abclab1")
['', '']
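For comparison, the named-group regex from the earlier answer also works with plain re: unlike pyparsing's Word, the regex engine backtracks, so the greedy \w+ backs off to 'abc' to let 'lab' match:

```python
import re

m = re.match(r'(?P<locn>\w+)(?P<env>lab)(?P<identifier>\d+)', 'abclab1')
print(m.groupdict())  # {'locn': 'abc', 'env': 'lab', 'identifier': '1'}
```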
