Ignore a certain word in an if condition - Python

How can I write an if condition on a string that fails if "foo- " is followed by anything other than "Ignore keyword"?
For example, the following two strings should pass and fail, respectively:
success = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword'
fail = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword\nfoo- this should fail'
I was trying something along these lines but wasn't able to make it work:
In [80]: if 'foo- ' in fail or re.search('.*Ignore.*keyword', fail):
    print('fail')
In [81]: if 'foo- ' in success or re.search('.*Ignore.*keyword', success):
    print('fail')

This does exactly what you asked for:
success = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword'
fail = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword\nfoo- this should fail'
# to be successful only "Ignore keyword" after "foo- "
elem_to_search = "foo- "
keyword = "Ignore keyword"
def check_sentence(elem_to_search, keyword, string_to_check):
    index = 0
    keyword_first = len(elem_to_search)
    keyword_last = len(elem_to_search) + len(keyword)
    while index < len(string_to_check):
        # jump to the next occurrence of the marker, if any
        index = string_to_check.find(elem_to_search, index)
        if index == -1:
            break
        elif string_to_check[index + keyword_first : index + keyword_last] == keyword:
            # marker is followed by the keyword: skip past it and keep looking
            index += len(elem_to_search)
        else:
            print("fail")
            break
check_sentence("foo- ", "Ignore keyword", success) passes (it prints nothing).
check_sentence("foo- ", "Ignore keyword", fail) prints "fail".
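The same check can also be expressed as a regular expression. This is a sketch of an alternative to the loop above, not the answerer's method: a negative lookahead matches any "foo- " that is not immediately followed by "Ignore keyword". The function name has_bad_marker is made up for illustration.

```python
import re

def has_bad_marker(text, marker="foo- ", keyword="Ignore keyword"):
    """Return True if `marker` occurs anywhere without `keyword` right after it."""
    # (?!...) is a negative lookahead: it succeeds only when the keyword does NOT follow.
    pattern = re.escape(marker) + "(?!" + re.escape(keyword) + ")"
    return re.search(pattern, text) is not None

success = '\nrandom stuff,foo- Ignore keyword\nfoo+ Ignore keyword\n random stuff;\nfoo- Ignore keyword'
fail = success + '\nfoo- this should fail'

if has_bad_marker(fail):
    print("fail")
```

Because the marker and keyword are passed through re.escape, characters such as "-" or "+" are treated literally, so "foo+ " never counts as an occurrence of "foo- ".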


What is the difference between "t = V1+' ~ '+V2" and "t = V1+' ~ '+V2"? I am getting error "invalid non-printable character U+00A0" with one of it

I am working on a statistical test; the code below raises an error.
if (data[V1].dtypes == 'float64') or (data[V1].dtypes == 'int64'):
    if (data[V2].dtypes == 'float64') or (data[V2].dtypes == 'int64'):
        print ('Correlation between', V1, 'and',V2,'is',round(corre,2))
    elif (data[V2].dtypes == 'object'):
        #issue with V2, V1
        t = V1+' ~ '+V2
        model = ols(t , data=data).fit()
        anovres = sm.stats.anova_lm(model, typ=2)
    else:
        print('invalid type')
elif (data[V1].dtypes == 'object'):
    if (data[V2].dtypes == 'float64') or (data[V2].dtypes == 'int64'):
        #issue with V2, V1
        t = V1+' ~ '+V2
        model = ols(t , data=data).fit()
        anovres = sm.stats.anova_lm(model, typ=2)
    elif (data[V2].dtypes == 'object'):
        Observed_Values = data_table.values
        from scipy.stats import chi2
        chi_square=sum([(o-e)**2./e for o,e in zip(Observed_Values,Expected_Values)])
        print('significance level:',alpha)
        print('degree of freedom:',ddof)
        if p_value<=alpha:
            print ('reject H0,There is a relationship between',V1,'and',V2)
        else:
            print ('reject H0,There is no relationship between', V1, 'and',V2)
    else:
        print('invalid type')
else:
    print('invalid type')
The code above raises an error on line 8. If I replace it with other data, I get the right output.
Error in One data
Output coming as expected in Other
"invalid non-printable character U+00A0" says it all. That's the Unicode Non-breaking space character. You can use it to control how spaces are displayed. A regular space character is U+0020. Python doesn't recognize that character as valid white space, so parsing breaks.
The solution is to delete that character and use a regular space instead.
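Because the two characters look identical on screen, it can help to locate the offending one programmatically. The following is a minimal sketch, assuming the source text is available as a string; find_nbsp and replace_nbsp are illustrative names, not part of any library.

```python
NBSP = "\u00a0"  # Unicode NO-BREAK SPACE

def find_nbsp(source):
    """Return (line, column) pairs, both 1-based, for every U+00A0 in `source`."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, char in enumerate(line, start=1):
            if char == NBSP:
                hits.append((lineno, col))
    return hits

def replace_nbsp(source):
    """Swap every non-breaking space for a regular space (U+0020)."""
    return source.replace(NBSP, " ")

broken = "t = V1+'\u00a0~ '+V2"   # looks identical to the clean line on screen
print(find_nbsp(broken))
print(replace_nbsp(broken) == "t = V1+' ~ '+V2")
```

Note that a blanket replacement also changes non-breaking spaces inside string literals, which is what is wanted here but may not be in other files.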

Python Regex: US phone number parsing

I am a complete newbie in Regex.
I need to parse US phone numbers in various formats into 3 strings: area code (without the '()'), next 3 digits, last 4 digits, with no '-'.
I also need to reject (with an error message):
916-111-1111 ('-' after the area code)
(916)111 -1111 (white space before '-')
( 916)111-1111 (any space inside the area code parentheses); (916 ) must be rejected too
(a56)111-1111 (any non-digits inside the area code)
lack of '()' around the area code
This should be OK: ' (916) 111-1111 ' (spaces anywhere except as above)
here is my regex:
This took me 2 days.
It did not reject 916-111-1111 (the '-' after the area code). I am sure there are other deficiencies.
I would appreciate your help very much. Even hints.
'(916) 111-1111'
'(916)111-1111 '
' (916) 111-1111'
'916-111-1111' - no () or '-' after area code
'(916)111 -1111' - no space before '-'
'( 916)111-1111' - no space inside ()
'(abc) 111-11i1' because of non-digits
You can do this:
import re
r = r'\((\d{3})\)\s*?(\d{3})\-(\d{4,5})'
l = ['(916) 111-11111', '(916)111-1111 ', ' (916) 111-1111', '916-111-1111', '(916)111 -1111', '( 916)111-1111', '(abc) 111-11i1']
print([re.findall(r, x) for x in l])
# [[('916', '111', '11111')], [('916', '111', '1111')], [('916', '111', '1111')], [], [], [], []]
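Note that re.findall searches anywhere in the string, so trailing or leading junk is not rejected. A stricter variant, sketched below as one possible reading of the requirements, anchors the whole input with re.fullmatch and accepts exactly four final digits. It assumes spaces are allowed only before the "(", after the ")", and after the last digit; parse_phone is a hypothetical helper name.

```python
import re

# optional spaces, "(ddd)", optional spaces, "ddd-dddd", optional spaces
PHONE_RE = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*")

def parse_phone(text):
    """Return (area_code, prefix, line_number) or None if the format is rejected."""
    match = PHONE_RE.fullmatch(text)
    return match.groups() if match else None

for candidate in ['(916) 111-1111', ' (916) 111-1111 ', '916-111-1111',
                  '(916)111 -1111', '( 916)111-1111', '(abc) 111-11i1']:
    print(repr(candidate), '->', parse_phone(candidate) or 'Error')
```

With fullmatch, an input such as '(916) 111-11111' is rejected rather than silently truncated to its first four trailing digits.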
You can simplify the regex if you consider (1) providing easy user interface rather than asking users to modify inputs or (2) taking the numbers to a backend storage as follows:
Since you want to print the error message, the found regex groups should be rechecked according to the error messages as follows:
import re

def get_failed_reason(s):
    space_regex = r"(\s+)"
    # loose pattern: captures whatever non-digits surround the area code
    area_code_regex = r"\s*(\D*)(\d{1,3})(\D*)(\d{3})(\D*)-(\d{4})"
    results = re.findall(area_code_regex, s)
    if 0 == len(results):
        area_code_alpha_regex = r"\((\D+)\)"
        results = re.findall(area_code_alpha_regex, s)
        if len(results) > 0:
            return "because of non-digits"
        return "no matches"

    results = results[0]
    # whitespace directly before or after the area code digits
    space_results = re.findall(space_regex, results[0])
    if 0 == len(space_results):
        space_results = re.findall(space_regex, results[2])
    if 0 != len(space_results):
        return "no space inside ()"

    alpha_code_regex = r"(\D+)"
    alpha_results = re.findall(alpha_code_regex, results[0])
    if 0 == len(alpha_results):
        alpha_results = re.findall(alpha_code_regex, results[2])
    if 0 != len(alpha_results):
        if "(" not in results[0] or ")" not in results[2]:
            return "no () or '-' after area code"
        if 0 != len(results[-2]):
            return "no space before '-'"
        return "because of non-digits in area code"

    return "unspecified"

if __name__ == '__main__':
    phone_numbers = ["916-111-1111", "(916)111-1111", "(916)111 -1111 ", " (916) 111-1111", "(916 )111-1111", "( 916)111-1111", "- (916 )111-1111", "(a56)111-1111", "(56a)111-1111", "(916) 111-1111 ", "(abc) 111-1111"]
    valid_regex = r"\s*(\()(\d{1,3})(\))(\D*)(\d{3})([^\d\s]*)-(\d{4})"
    for phone_number_str in phone_numbers:
        results = re.findall(valid_regex, phone_number_str)
        if 0 == len(results):
            reason = get_failed_reason(phone_number_str)
            phone_number_str = f"[{phone_number_str}]"
            print(f"[main] Failed:\t{phone_number_str: <30}- {reason}")
            continue
        area_code = results[0][1]
        first_number = results[0][4]
        second_number = results[0][6]
        phone_number_str = f"[{phone_number_str}]"
        print(f"[main] Valid:\t{phone_number_str: <30}- Area code: {area_code}, First number: {first_number}, Second number: {second_number}")
[main] Failed: [916-111-1111] - no () or '-' after area code
[main] Valid: [(916)111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916)111 -1111 ] - no space before '-'
[main] Valid: [ (916) 111-1111] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(916 )111-1111] - no space inside ()
[main] Failed: [( 916)111-1111] - no space inside ()
[main] Failed: [- (916 )111-1111] - no space inside ()
[main] Failed: [(a56)111-1111] - because of non-digits in area code
[main] Failed: [(56a)111-1111] - because of non-digits in area code
[main] Valid: [(916) 111-1111 ] - Area code: 916, First number: 111, Second number: 1111
[main] Failed: [(abc) 111-1111] - because of non-digits
Note: \D matches any non-digit character.

PyParsing: parse if not a keyword

I am trying to parse a file as follows:
title = Test Suite A;
timeout = 10000
exp_delay = 500;
log = TRUE;
type = typeA;
name = "HelloWorld";
output_log = "c:\test\out.log";
name = "GoodbyeAll";
type = typeB;
comm1_req = 0xDEADBEEF;
comm1_resp = (int, 1234366);
The file first contains a section with parameters and then some sects. I can parse a file containing just parameters and I can parse a file just containing sects but I can't parse both.
from pyparsing import *
from pathlib import Path
command_req = Word(alphanums)
command_resp = "(" + delimitedList(Word(alphanums)) + ")"
kW = Word(alphas+'_', alphanums+'_') | command_req | command_resp
keyName = ~Literal("sect") + Word(alphas+'_', alphanums+'_') + FollowedBy("=")
keyValue = dblQuotedString.setParseAction( removeQuotes ) | OneOrMore(kW,stopOn=LineEnd())
param = dictOf(keyName, Suppress("=")+keyValue+Optional(Suppress(";")))
node = Group(Literal("sect") + Literal("{") + OneOrMore(param) + Literal("};"))
final = OneOrMore(node) | OneOrMore(param)
try:
    p = Path(__file__).with_name("testp.txt")
    with open(p) as f:
        x = final.parseFile(f, parseAll=True)
    dx = x.asDict()
except ParseException as pe:
    print("Exception raised:" + str(pe))
The issue I have is that param matches against sect so it expects a =. So I tried putting in ~Literal("sect") in keyName but that just leads to another error:
Exception raised:Found unwanted token, "sect", found '\n' (at char 188), (line:4, col:56)
Expected end of text, found 's' (at char 190), (line:6, col:1)
How do I get it to use one parse method for sect and another (param) if it is not a sect?
My final goal would be to have the whole lot in a Dict with the global params and sects included.
Think I've figured it out:
This line...
final = OneOrMore(node) | OneOrMore(param)
...should be:
final = ZeroOrMore(param) + ZeroOrMore(node)
But I wonder if there is a more structured way (as I'd ultimately like a dict)?

printing to console in Python while working with 3rd party modules

I'm using third-party modules and fighting with an error raised while calling those modules.
Here is what the interpreter is showing:
C:\Users\Dmitry\AppData\Local\Enthought\Canopy\edm\envs\User\lib\site-packages\backtrader\feeds\csvgeneric.py in _loadline(self, linetokens)
148 # get it from the token
149 csvfield = linetokens[csvidx]
--> 150 print(csvidx)
152 if csvfield == '':
IndexError: list index out of range
I deliberately added print(csvidx) to see the value of csvidx, but it's not showing up on console. What am I doing wrong? Thanks a lot.
Here is the code:
def _loadline(self, linetokens):
    # Datetime needs special treatment
    dtfield = linetokens[self.p.datetime]
    if self._dtstr:
        dtformat = self.p.dtformat
        if self.p.time >= 0:
            # add time value and format if it's in a separate field
            dtfield += 'T' + linetokens[self.p.time]
            dtformat += 'T' + self.p.tmformat
        dt = datetime.strptime(dtfield, dtformat)
    else:
        dt = self._dtconvert(dtfield)
    if self.p.timeframe >= TimeFrame.Days:
        # check if the expected end of session is larger than parsed
        if self._tzinput:
            dtin = self._tzinput.localize(dt) # pytz compatible-ized
        else:
            dtin = dt
        dtnum = date2num(dtin) # utc'ize
        dteos = datetime.combine(dt.date(), self.p.sessionend)
        dteosnum = self.date2num(dteos) # utc'ize
        if dteosnum > dtnum:
            self.lines.datetime[0] = dteosnum
        else:
            # Avoid reconversion if already converted dtin == dt
            self.l.datetime[0] = date2num(dt) if self._tzinput else dtnum
    else:
        self.lines.datetime[0] = date2num(dt)
    # The rest of the fields can be done with the same procedure
    for linefield in (x for x in self.getlinealiases() if x != 'datetime'):
        # Get the index created from the passed params
        csvidx = getattr(self.params, linefield)
        if csvidx is None or csvidx < 0:
            # the field will not be present, assign the "nullvalue"
            csvfield = self.p.nullvalue
        else:
            # get it from the token
            csvfield = linetokens[csvidx]
        if csvfield == '':
            # if empty ... assign the "nullvalue"
            csvfield = self.p.nullvalue
        # get the corresponding line reference and set the value
        line = getattr(self.lines, linefield)
        line[0] = float(float(csvfield))
    return True
csvidx = getattr(self.params, linefield)
if csvidx is None or csvidx < 0:
    # the field will not be present, assign the "nullvalue"
    csvfield = self.p.nullvalue
else:
    # get it from the token
    csvfield = linetokens[csvidx]
csvidx is looked up in self.params, and it is obviously found.
It is neither None nor < 0, so it must have a numeric value >= 0.
The IndexError: list index out of range clearly indicates that linetokens doesn't contain as many items as csvidx expects.
Since the name self.params suggests user input, whatever value you have given is probably greater than the number of tokens actually available in linetokens.
The code seems to be executed in one of those canned Python environments, judging by this line:
--> 150 print(csvidx)
That is certainly not the usual Python console output. If that canned environment (and not the 3rd party packages) really allows you to print to the console, it would be advisable to print a lot sooner, as in:
for linefield in (x for x in self.getlinealiases() if x != 'datetime'):
    # Get the index created from the passed params
    csvidx = getattr(self.params, linefield)
    print('linefield {} -> csvidx {}'.format(linefield, csvidx))
    if csvidx is None or csvidx < 0:
        # the field will not be present, assign the "nullvalue"
        csvfield = self.p.nullvalue
    else:
        # get it from the token
        csvfield = linetokens[csvidx]
You should see each relationship linefield -> csvidx before the exception is triggered.
If your environment allows it, run everything with python -u which uses unbuffered output. (Strongly recommended under Windows where flushing on newline is known to either not work or have problems)
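If restarting with python -u is not an option, print itself can be told to flush. The sketch below combines that with an explicit bounds check so the mismatch is reported before the IndexError occurs. field_or_null is a hypothetical debugging helper, not part of backtrader, and falling back to the null value on an out-of-range index is an assumption made only for illustration.

```python
def field_or_null(linetokens, csvidx, nullvalue='nan'):
    """Return linetokens[csvidx], or nullvalue when the index cannot be used."""
    if csvidx is None or csvidx < 0:
        # the field is declared as not present
        return nullvalue
    if csvidx >= len(linetokens):
        # flush=True pushes the message out immediately, even if stdout is buffered
        print('csvidx {} out of range for {} tokens'.format(csvidx, len(linetokens)),
              flush=True)
        return nullvalue
    return linetokens[csvidx]

tokens = ['2017-01-02', '10.5', '11.0']
print(field_or_null(tokens, 1))   # index inside the row
print(field_or_null(tokens, 5))   # index beyond the row: reports and falls back
```

In real use the message would more likely lead to fixing the column parameters than to silently substituting a null value.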

pretty print assertEqual() for HTML strings

I want to compare two strings in a python unittest which contain html.
Is there a method which outputs the result in a human friendly (diff like) version?
A simple method is to strip whitespace from the HTML and split it into a list. Python 2.7's unittest (or the backported unittest2) then gives a human-readable diff between the lists.
import re

def split_html(html):
    # one list item per line, with surrounding whitespace removed
    return re.split(r'\s*\n\s*', html.strip())

def test_render_html(self):
    expected = ['<div>', '...', '</div>']
    got = split_html(render_html())
    self.assertEqual(expected, got)
If I'm writing a test for working code, I usually first set expected = [], insert a self.maxDiff = None before the assert and let the test fail once. The expected list can then be copy-pasted from the test output.
You might need to tweak how whitespace is stripped depending on what your HTML looks like.
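To see what the list diff looks like without writing a full test class, the comparison can be run on a bare TestCase instance. This is a sketch that assumes Python 3, where unittest.TestCase() can be instantiated without a method name; list_diff_message is an illustrative helper, not a unittest API.

```python
import re
import unittest

def split_html(html):
    """Split HTML into lines with surrounding whitespace removed."""
    return re.split(r'\s*\n\s*', html.strip())

def list_diff_message(expected_html, got_html):
    """Return unittest's failure message for the two documents, or '' if equal."""
    case = unittest.TestCase()   # bare instance, used only for its assert helpers
    case.maxDiff = None          # do not truncate long diffs
    try:
        case.assertEqual(split_html(expected_html), split_html(got_html))
    except AssertionError as exc:
        return str(exc)
    return ''

expected = "<div>\n  <p>hello</p>\n</div>"
got = "<div>\n    <p>goodbye</p>\n</div>\n"
print(list_diff_message(expected, got))
```

The message names the first differing list element, so a changed line of markup is reported by position even though the indentation of the two documents differs.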
I submitted a patch to do this some years back. The patch was rejected but you can still view it on the python bug list.
I doubt you would want to hack your unittest.py to apply the patch (if it even still works after all this time), but here's the function for reducing two strings to a manageable size while still keeping at least part of what differs. As long as you don't need the complete differences, this might be what you want:
def shortdiff(x,y):
    '''
    Compare strings x and y and display differences.
    If the strings are too long, shorten them to fit
    in one line, while still keeping at least some difference.
    '''
    import difflib
    LINELEN = 79  # assumed line width; the original value is not shown
    def limit(s):
        if len(s) > LINELEN:
            return s[:LINELEN-3] + '...'
        return s
    def firstdiff(s, t):
        span = 1000
        for pos in range(0, max(len(s), len(t)), span):
            if s[pos:pos+span] != t[pos:pos+span]:
                for index in range(pos, pos+span):
                    if s[index:index+1] != t[index:index+1]:
                        return index
    left = LINELEN // 4
    index = firstdiff(x, y)
    if index > left + 7:
        x = x[:left] + '...' + x[index-4:index+LINELEN]
        y = y[:left] + '...' + y[index-4:index+LINELEN]
    else:
        x, y = x[:LINELEN+1], y[:LINELEN+1]
        left = 0
    cruncher = difflib.SequenceMatcher(None)
    xtags = ytags = ""
    cruncher.set_seqs(x, y)
    editchars = { 'replace': ('^', '^'),
                  'delete': ('-', ''),
                  'insert': ('', '+'),
                  'equal': (' ',' ') }
    for tag, xi1, xi2, yj1, yj2 in cruncher.get_opcodes():
        lx, ly = xi2 - xi1, yj2 - yj1
        edits = editchars[tag]
        xtags += edits[0] * lx
        ytags += edits[1] * ly
    # Include ellipsis in edits line.
    if left:
        xtags = xtags[:left] + '...' + xtags[left+3:]
        ytags = ytags[:left] + '...' + ytags[left+3:]
    diffs = [ x, xtags, y, ytags ]
    if max([len(s) for s in diffs]) < LINELEN:
        return '\n'.join(diffs)
    diffs = [ limit(s) for s in diffs ]
    return '\n'.join(diffs)
This may be a rather verbose solution. You could add a new equality function for a user-defined type (e.g. HTMLString), which you have to define first:
class HTMLString(str):
    pass
Now you have to define a type equality function:
def assertHTMLStringEqual(first, second, msg=None):
    # unittest calls this as func(first, second, msg=msg)
    if first != second:
        message = ... # TODO here: format your message, e.g a diff
        raise AssertionError(message)
All you have to do is format your message as you like. You can also use a class method in your specific TestCase as a type equality function. This gives you more functionality to format your message, since unittest.TestCase does this a lot.
Now you have to register this equality function in your unittest.TestCase:
def __init__(self, *args, **kwargs):
    unittest.TestCase.__init__(self, *args, **kwargs)
    self.addTypeEqualityFunc(HTMLString, assertHTMLStringEqual)
The same for a class method:
def __init__(self, *args, **kwargs):
    unittest.TestCase.__init__(self, *args, **kwargs)
    self.addTypeEqualityFunc(HTMLString, 'assertHTMLStringEqual')
And now you can use it in your tests:
def test_something(self):
    htmlstring1 = HTMLString(...)
    htmlstring2 = HTMLString(...)
    self.assertEqual(htmlstring1, htmlstring2)
This should work well with python 2.7.
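Putting the pieces together, a complete version might look like the sketch below. It assumes Python 3, registers the function in setUp rather than __init__ (a swapped-in choice that avoids overriding the constructor), and fills the message with a difflib unified diff. HTMLTestCase and the "HTML differs" wording are illustrative, not taken from the answer.

```python
import difflib
import unittest

class HTMLString(str):
    """Marker type so assertEqual can dispatch to an HTML-aware comparison."""

class HTMLTestCase(unittest.TestCase):
    def setUp(self):
        # Registered per test instance; unittest passes msg as a keyword argument.
        self.addTypeEqualityFunc(HTMLString, self.assertHTMLStringEqual)

    def assertHTMLStringEqual(self, first, second, msg=None):
        if first != second:
            diff = '\n'.join(difflib.unified_diff(
                first.splitlines(), second.splitlines(),
                fromfile='first', tofile='second', lineterm=''))
            raise self.failureException(msg or 'HTML differs:\n' + diff)

# Demonstration outside a test runner; inside a test, just call self.assertEqual.
case = HTMLTestCase()
case.setUp()
try:
    case.assertEqual(HTMLString('<div>\na\n</div>'), HTMLString('<div>\nb\n</div>'))
except AssertionError as exc:
    print(exc)
```

The registered function is only used when both arguments have exactly the type HTMLString, so plain str comparisons in the same test class keep their default behaviour.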
I (the one asking this question) use BeautifulSoup now:
def assertEqualHTML(string1, string2, file1='', file2=''):
    u'''
    Compare two unicode strings containing HTML.
    A human friendly diff goes to logging.error() if they
    are not equal, and an exception gets raised.
    '''
    from BeautifulSoup import BeautifulSoup as bs
    import difflib
    import logging
    def short(mystr):
        max_len = 20  # assumed cut-off; the original value is not shown
        if len(mystr) > max_len:
            return mystr[:max_len]
        return mystr
    p = []
    for mystr, file in [(string1, file1), (string2, file2)]:
        if not isinstance(mystr, unicode):
            raise Exception(u'string is not unicode: %r %s' % (short(mystr), file))
        # assumed step: prettify each document so the diff works line by line
        p.append(bs(mystr).prettify())
    if p[0]!=p[1]:
        for line in difflib.unified_diff(p[0].splitlines(), p[1].splitlines(), fromfile=file1, tofile=file2):
            logging.error(line)
        raise Exception('Not equal %s %s' % (file1, file2))

