I have an error message that spans multiple (2-3) lines. I want to catch it and embed it in a warning, and I think substituting the newlines with spaces is acceptable.
My question is: which method is best practice? I know this is not the best kind of question, but I want to code it properly, and I might also be missing something. So far I have come up with three methods:
string.replace()
regular expression
string.translate()
I was leaning towards string.translate(), but after reading how it works, I think it's overkill to convert every character into itself except '\n'. A regular expression also seems like overkill for such a simple task.
Is there another method designated for this, or should I pick one of the aforementioned? I care about portability and robustness more than speed, but speed is still somewhat relevant.
Just use the replace method:
>>> "\na".replace("\n", " ")
' a'
>>>
It is the simplest solution. Using a regex is overkill and also means you have to import the re module. translate is a little better, but still doesn't give you anything that replace doesn't (except more typing, of course).
replace should run faster too.
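To tie this back to the original use case, here is a minimal sketch of flattening a multi-line error message before embedding it in a warning (the message text and warning category are made up for illustration):

import warnings

# hypothetical multi-line error message captured from elsewhere
err_msg = "Something went wrong:\nthe parser failed\non line 3"

# collapse the newlines to spaces and emit a single-line warning
warnings.warn(err_msg.replace("\n", " "), RuntimeWarning)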
If you want to leave all these implementation details up to the Python implementation, you could do:
s = "This\nis\r\na\rtest"
print " ".join(s.splitlines())
# prints: This is a test
Note:
This method uses the universal newlines approach to splitting lines, which the glossary defines as follows:
universal newlines: A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
A benefit of splitting lines over replacing linefeeds is that you can filter out lines you don't need, e.g. to avoid clutter in your log. For example, if you have this output of traceback.format_exc():
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
ZeroDivisionError: integer division or modulo by zero
And you need to add only the last line(s) to your log:
import traceback

try:
    1/0
except:  # of course you wouldn't catch exceptions like this in real code
    print traceback.format_exc().splitlines()[-1]
    # prints: ZeroDivisionError: integer division or modulo by zero
For reference:
http://docs.python.org/2/library/stdtypes.html#str.splitlines
http://docs.python.org/2/library/stdtypes.html#str.join
http://docs.python.org/2/glossary.html#term-universal-newlines
http://www.python.org/dev/peps/pep-0278/
http://docs.python.org/2/library/traceback.html
This is another fast, portable option. It is more or less the same as replace, but less readable:
errMsg = """Something went wrong
This message is long"""
" ".join(errMsg.splitlines())
Timing results follow, though they will of course differ depending on message length:
>>> import timeit
>>> s = """\
' '.join('''Something went wrong
This message is long'''.splitlines())"""
>>> timeit.timeit(stmt=s, number=100000)
0.06071170746817329
>>> q = """'''\
Something went wrong
This message is long'''.replace("\\n",' ')"""
>>> timeit.timeit(stmt=q, number=100000)
0.049164684830429906
This should work on both Windows and Linux (string here is the variable holding your text):
string.replace('\r\n', ' ').replace('\n', ' ')
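Note that the order matters: replacing '\r\n' before '\n' avoids leaving stray '\r' characters behind. A quick sketch with a made-up string containing mixed line endings:

>>> s = "line one\r\nline two\nline three"
>>> s.replace('\r\n', ' ').replace('\n', ' ')
'line one line two line three'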
Is the usage of escaped characters such as \t allowed by PEP8 in something like print statements?
Is there a more idiomatic way to left indent some of the printout without importing non standard libraries?
Yes, that's fine; the tab is a fundamental ASCII character, and PEP 8 would not deny its use, since it may be essential to your end result (say an API needed tabs or something). PEP 8 is all about styling your source code; I wouldn't consider a character in a string to be something that can be decreed by a style guide.
Though there is nothing wrong with using \t, you might want to use the textwrap module so that your indented text displays more naturally in your source code. As an alternative to msg = '\teggs\tmilk\tbread', you can write:
import textwrap

def show_list():
    msg = """\
        eggs
        milk
        bread"""
    print(textwrap.indent(textwrap.dedent(msg), "\t"))
Then show_list() produces the output
	eggs
	milk
	bread
When you indent the definition of msg, the whitespace becomes part of the literal. textwrap.dedent removes the common leading whitespace from each line of the string, and textwrap.indent then prefixes each line with, specifically, a tab character.
There is nothing wrong with using the tab character in a string, at all. See e.g. the Wikipedia link for some common usages. You may be confused by this PEP 8 rule:
Use 4 spaces per indentation level.
This is similar to Joe Iddon's answer. To be clear: writing text (not code, of course) is something different from writing code. Texts and their usages are very heterogeneous, so setting rules for how to format your text makes little sense (as long as the text is not code).
But you also asked: "Is there a more idiomatic way to left indent some of the printout without importing non standard libraries?"
Since Python 3.6 you can use formatted string literals to get additional spaces (indentation) in the strings you want to print. (If you're using Python 3.5 or lower, you can use str.format instead, for example.)
The usage is like this:
>>> text = "Hello World"
>>> print(f"\t{text}")
	Hello World
This is just a toy example, of course; f-strings become more useful with more complex strings. If you don't have such complex strings, you can also consider using the arguments of print(), like this, for example:
>>> print("Foo", "Bar", "Foo", "Bar", sep="\t\t") # doubled "\t" only for better displaying
Foo		Bar		Foo		Bar
But often it is simply quite enough to include the tab character in your string, e.g. "Hello World!\tHow are you doing?\tThat's it.". As already said, don't do that with code (PEP 8), but in text it is fine.
If you want to use a module for this (it is a built-in module), I recommend textwrap. See chepner's answer for more information on how to use it.
Does anyone know why Python allows you to put an unlimited amount of whitespace between an object and the name of the method being called, around the "."?
Here are some examples:
>>> x = []
>>> x. insert(0, 'hi')
>>> print x
['hi']
Another example:
>>> d = {}
>>> d ['hi'] = 'there'
>>> print d
{'hi': 'there'}
It is the same for classes as well.
>>> myClass = type('hi', (), {'there': 'hello'})
>>> myClass. there
'hello'
I am using Python 2.7.
I tried doing some Google searches and looking at the Python source code, but I cannot find any reason why this is allowed.
The . acts like an operator. You can do obj . attr the same way you can do this + that or this * that or the like. The language reference says:
Except at the beginning of a logical line or in string literals, the whitespace characters space, tab and formfeed can be used interchangeably to separate tokens.
Because this rule is so general, I would assume the code that implements it runs very early in the parsing process. It's nothing specific to the . token: the tokenizer simply ignores whitespace everywhere except at the beginning of a line or inside a string.
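As a quick sketch (using Python 2's tokenize module, to match the question), you can confirm that both spellings produce identical token streams; only the token positions differ:

from cStringIO import StringIO
from tokenize import generate_tokens

def token_values(source):
    # keep only (token type, token string); the positions differ,
    # but the token values do not
    return [(tok[0], tok[1])
            for tok in generate_tokens(StringIO(source).readline)]

print token_values("obj.attr\n") == token_values("obj  .  attr\n")
# prints: True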
An explanation of how/why it works this way has been given elsewhere, but no mention was made of any benefits of doing so.
An interesting benefit of this can occur when methods return an instance of the class. For example, many of the methods on a string return an instance of a string, so you can chain multiple method calls together. Like this:
escaped_html = text.replace('&', '&amp;').replace('<', '&lt;').replace('>', '&gt;')
However, sometimes the arguments passed in might be rather long, and it would be nice to wrap the calls onto multiple lines. Perhaps like this:
fooinstance \
    .bar('a really long argument is passed in here') \
    .baz('and another long argument is passed in here')
Of course, the newline escapes \ are needed for that to work, which is not ideal. Nevertheless, that is a potentially useful reason for the feature. In fact, in some other languages (where all/most whitespace is insignificant), it is quite common to see code formatted that way.
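As a side note (not part of the original answer, but standard practice), wrapping the whole expression in parentheses lets you break the chain across lines without any backslashes:

fooinstance = (fooinstance
    .bar('a really long argument is passed in here')
    .baz('and another long argument is passed in here'))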
For comparison, in Python we would generally see this instead:
fooinstance = fooinstance.bar('a really long argument is passed in here')
fooinstance = fooinstance.baz('and another long argument is passed in here')
Each has its place.
Because it would be obnoxious to disallow it. The initial stage of an interpreter or compiler is a tokenizer (aka "lexer"), which chunks a program's flat text into meaningful units. In modern programming languages (C and beyond, for discussion's sake), in order to be nice to programmers, whitespace between tokens is generally ignored. In Python, of course, whitespace at the beginning of lines is very significant; but elsewhere, within lines and multiline expressions, it isn't. [Yes, these are very broad statements, but modulo corner-case counterexamples, they're true.]
Besides, sometimes it's desirable; e.g.:
obj.deeply.\
    nested.\
    chain.of.\
    attributes
Backslash, the continuation character, wipes out the newlines, but the whitespace preceding e.g. nested remains, so nested follows the . after deeply separated only by spaces; that is exactly the kind of whitespace the tokenizer ignores.
In expressions with deeper nesting, a little extra whitespace can yield a big gain in readability. Compare:
x = your_map[my_func(some_big_expr[17])]
vs
x = your_map[ my_func( some_big_expr[17] ) ]
Caveats: If your employer, client, team, or professor has style rules or guidelines, you should adhere to them. The second example above doesn't comply with Python's style guide, PEP8, which most Python shops adopt or adapt. But that document is a collection of guidelines, not religious or civil edicts.
This might be a silly question but I'd like to know how other people handle this or if there's a standard/recommended way of going about it.
Below are two approaches to splitting a long text line when printing it to screen in python. Which one should be used?
Option 1
if some_condition:  # Senseless indenting.
    if another_condition:  # Senseless indenting.
        print 'This is a very long line of text that goes beyond the 80\n\
character limit.'
Option 2
if some_condition:  # Senseless indenting.
    if another_condition:  # Senseless indenting.
        print 'This is a very long line of text that goes beyond the 80'
        print 'character limit.'
I personally find Option 1 ugly, but Option 2 seems like it would go against the Pythonic way of keeping things simple, since it uses a second print call.
One way to do it is with parentheses:
print ('This is a very long line of text that goes beyond the 80\n'
       'character limit.')
Of course, there are several ways of doing it. Another way (as suggested in the comments) is a triple-quoted string:
print '''This is a very long line of text that goes beyond the 80
character limit.'''
Personally I don't like that one much because it seems to break the indentation, but that's just me.
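As a sketch of a middle ground (assuming you don't mind importing the standard-library textwrap module), textwrap.dedent lets the literal stay indented with the code while the common leading whitespace is stripped at runtime:

import textwrap

if some_condition:
    if another_condition:
        print textwrap.dedent('''\
            This is a very long line of text that goes beyond the 80
            character limit.''')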
If you have a long string and want to insert line breaks at appropriate points, the textwrap module provides functionality to do just that. For example:
import textwrap

def format_long_string(long_string):
    wrapper = textwrap.TextWrapper()
    wrapper.width = 80
    return wrapper.fill(long_string)

long_string = ('This is a really long string that is raw and unformatted '
               'that may need to be broken up into little bits')

print format_long_string(long_string)
This results in the following being printed:
This is a really long string that is raw and unformatted that may need to be
broken up into little bits
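For a one-off, the module-level textwrap.fill convenience function does the same thing without constructing a TextWrapper explicitly:

import textwrap

# equivalent shorthand; width is a keyword argument of textwrap.fill
print textwrap.fill(long_string, width=80)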
I am expecting users to upload a CSV file of max size 1MB to a web form that should fit a given format similar to:
"<String>","<String>",<Int>,<Float>
That will be processed later. I would like to verify that the file fits a specified format, so that the program which will later use the file doesn't receive unexpected input and there are no security concerns (say, some injection attack against the parsing script that does some calculations and a DB insert).
(1) What would be the best way to go about doing this that would be fast and thorough? From what I've researched, I could go the path of regex or something more like this. I've looked at the Python csv module, but that doesn't appear to have any built-in verification.
(2) Assuming I go for a regex, can anyone direct me towards the best way to do this? Do I match on illegal characters and reject on that (e.g. no '/', '\', '<', '>', '{', '}', etc.), or match on all legal ones, e.g. [a-zA-Z0-9]{1,10} for the string component? I'm not too familiar with regular expressions, so pointers or examples would be appreciated.
EDIT:
Strings should contain no commas or quotes; each would just contain a name (i.e. first name, last name). And yes, I forgot to add that they would be double quoted.
EDIT #2:
Thanks for all the answers. Cutplace is quite interesting, but it is a standalone tool. I decided to go with pyparsing in the end because it gives more flexibility should I add more formats.
Pyparsing will process this data, and it is tolerant of unexpected things like spaces before and after commas, commas within quotes, etc. (the csv module is too, but regex solutions force you to add "\s*" bits all over the place).
from pyparsing import *
integer = Regex(r"-?\d+").setName("integer")
integer.setParseAction(lambda tokens: int(tokens[0]))
floatnum = Regex(r"-?\d+\.\d*").setName("float")
floatnum.setParseAction(lambda tokens: float(tokens[0]))
dblQuotedString.setParseAction(removeQuotes)
COMMA = Suppress(',')
validLine = dblQuotedString + COMMA + dblQuotedString + COMMA + \
            integer + COMMA + floatnum + LineEnd()
tests = """\
"good data","good2",100,3.14
"good data" , "good2", 100, 3.14
bad, "good","good2",100,3.14
"bad","good2",100,3
"bad","good2",100.5,3
""".splitlines()
for t in tests:
    print t
    try:
        print validLine.parseString(t).asList()
    except ParseException, pe:
        print pe.markInputline('?')
        print pe.msg
    print
Prints
"good data","good2",100,3.14
['good data', 'good2', 100, 3.1400000000000001]
"good data" , "good2", 100, 3.14
['good data', 'good2', 100, 3.1400000000000001]
bad, "good","good2",100,3.14
?bad, "good","good2",100,3.14
Expected string enclosed in double quotes
"bad","good2",100,3
"bad","good2",100,?3
Expected float
"bad","good2",100.5,3
"bad","good2",100?.5,3
Expected ","
You will probably be stripping those quotation marks off at some future time; pyparsing can do that at parse time (as already done in the code above) by adding:
dblQuotedString.setParseAction(removeQuotes)
If you want to add comment support to your input file, say a '#' followed by the rest of the line, you can do this:
comment = '#' + restOfLine
validLine.ignore(comment)
You can also add names to these fields, so that you can access them by name instead of index position (which I find gives more robust code in light of changes down the road):
validLine = dblQuotedString("key") + COMMA + dblQuotedString("title") + COMMA + \
            integer("qty") + COMMA + floatnum("price") + LineEnd()
And your post-processing code can then do this:
data = validLine.parseString(t)
print "%(key)s: %(title)s, %(qty)d in stock at $%(price).2f" % data
print data.qty*data.price
I'd vote for parsing the file: check that you've got 4 components per record, that the first two components are strings, that the third is an int (checking for NaN conditions), and that the fourth is a float (also checking for NaN conditions).
Python would be an excellent tool for the job.
I'm not aware of any libraries in Python for validating CSV files against a spec, but it really shouldn't be too hard to write.
import csv
import math

def data_is_valid(filename):
    dataChecker = csv.reader(open(filename))
    for row in dataChecker:
        if len(row) != 4:
            print 'Invalid row length.'
            return False
        try:
            my_int = int(row[2])
            my_float = float(row[3])
        except ValueError:
            print 'Bad int or float found.'
            return False
        # int() can never produce NaN, but float('nan') parses fine
        if math.isnan(my_float):
            print 'Bad float found.'
            return False
    print 'All good!'
    return True
Here's a small snippet I made:
import csv

f = csv.reader(open("test.csv"))

for value in f:
    value[0] = str(value[0])
    value[1] = str(value[1])
    value[2] = int(value[2])
    value[3] = float(value[3])
If you run that with a file that doesn't have the format you specified, you'll get an exception:
$ python valid.py
Traceback (most recent call last):
  File "valid.py", line 8, in <module>
    value[2] = int(value[2])
ValueError: invalid literal for int() with base 10: 'a3'
You can then wrap those conversions in a try/except ValueError to catch the error and let the users know what they did wrong.
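A minimal sketch of that try/except, assuming the same test.csv and four-column format as above (the error message wording is made up):

import csv

f = csv.reader(open("test.csv"))

for line_number, value in enumerate(f, start=1):
    try:
        value[2] = int(value[2])
        value[3] = float(value[3])
    except ValueError as e:
        print 'Row %d is invalid: %s' % (line_number, e)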
There can be a lot of corner cases in parsing CSV, so you probably don't want to try doing it "by hand". At least start with a package/library built into the language that you're using, even if it doesn't do all the verification you can think of.
Once you get there, examine the fields for your list of "illegal" characters, or examine the values in each field to determine whether they're valid (if you can do so). You don't necessarily need a regex for this task, but it may be more concise to do it that way.
You might also want to disallow embedded \r, \n, \0 or \t. Just loop through the fields and check them after you've loaded the data with your CSV library.
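A sketch of that loop, assuming the rows have already been read with the csv module as in the other answers (the set of banned characters is just the ones mentioned above):

import csv

banned = set('\r\n\0\t')

for row in csv.reader(open('data.csv')):
    for field in row:
        if banned & set(field):
            print 'Control character found in field: %r' % field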
Try Cutplace. It verifies that tabular data conforms to an interface control document.
Ideally, you want your filtering to be as restrictive as possible: the fewer things you allow, the fewer potential avenues of attack. For instance, a float or int field has a very small number of characters (and very few configurations of those characters) which should actually be allowed. String filtering should ideally be restricted to only the characters people would have a reason to input; without knowing the larger context it's hard to tell you exactly which you should allow, but at a bare minimum the string match regex should require quoting of strings and disallow anything that would terminate the string early.
Keep in mind, however, that some names may contain things like single quotes ("O'Neil", for instance) or dashes, so you couldn't necessarily rule those out.
Something like...
/"[a-zA-Z' -]+"/
...would probably be ideal for double-quoted strings which are supposed to contain names. You could replace the + with a {x,y} length min/max if you wanted to enforce certain lengths as well.
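A sketch of that check using Python's re module, validating the whole-line format from the question (quoted string, quoted string, int, float); the exact character class and the {1,40} length bounds are illustrative, not a recommendation:

import re

# quoted name fields, an integer, and a float, per the question's format
line_re = re.compile(
    r'^"[a-zA-Z\' -]{1,40}","[a-zA-Z\' -]{1,40}",-?\d+,-?\d+\.\d+$')

print bool(line_re.match('"John","O\'Neil",42,3.14'))  # True
print bool(line_re.match('"bad<chars>","x",1,2.0'))    # False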
I am working on a LaTeX document that will require typesetting significant amounts of Python source code. I'm using Pygments (the Python module, not the online demo) to encapsulate this Python in LaTeX, which works well except in the case of long individual lines, which simply continue off the page. I could manually wrap these lines, but that just doesn't seem an elegant solution to me, and I prefer spending time puzzling over crazy automated solutions to doing repetitive tasks.
What I would like is some way of processing the Python source code to wrap the lines to a certain maximum character length, while preserving functionality. I've had a play around with some Python, and the closest I've come is inserting \\\n at the last whitespace before the maximum line length; but of course, if this ends up in strings and comments, things go wrong. Quite frankly, I'm not sure how to approach this problem.
So, is anyone aware of a module or tool that can process source code so that no lines exceed a certain length, or at least of a good way to start coding something like that?
You might want to extend your current approach a bit by using the tokenize module from the standard library to determine where to put your line breaks. That way you can see the actual tokens (COMMENT, STRING, etc.) of your source code, rather than just the whitespace-separated words.
Here is a short example of what tokenize can do:
>>> from cStringIO import StringIO
>>> from tokenize import tokenize
>>>
>>> python_code = '''
... def foo(): # This is a comment
...     print 'foo'
... '''
>>>
>>> fp = StringIO(python_code)
>>>
>>> tokenize(fp.readline)
1,0-1,1: NL '\n'
2,0-2,3: NAME 'def'
2,4-2,7: NAME 'foo'
2,7-2,8: OP '('
2,8-2,9: OP ')'
2,9-2,10: OP ':'
2,11-2,30: COMMENT '# This is a comment'
2,30-2,31: NEWLINE '\n'
3,0-3,4: INDENT '    '
3,4-3,9: NAME 'print'
3,10-3,15: STRING "'foo'"
3,15-3,16: NEWLINE '\n'
4,0-4,0: DEDENT ''
4,0-4,0: ENDMARKER ''
I use the listings package in LaTeX to insert source code; it does syntax highlighting, line breaks et al.
Put the following in your preamble:
\usepackage{listings}
%\lstloadlanguages{Python} % Load only these languages
\newcommand{\MyHookSign}{\hbox{\ensuremath\hookleftarrow}}

\lstset{
    % Language
    language=Python,
    % Basic setup
    %basicstyle=\footnotesize,
    basicstyle=\scriptsize,
    keywordstyle=\bfseries,
    commentstyle=,
    % Looks
    frame=single,
    % Linebreaks
    breaklines,
    prebreak={\space\MyHookSign},
    % Line numbering
    tabsize=4,
    stepnumber=5,
    numbers=left,
    firstnumber=1,
    %numberstyle=\scriptsize,
    numberstyle=\tiny,
    % Above and beyond ASCII!
    extendedchars=true
}
The package has hooks for inline code, including entire files, showing listings as figures, and more.
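For completeness, a minimal usage sketch with this setup (the filename is a placeholder): inline listings go in a lstlisting environment, and whole files can be pulled in with \lstinputlisting:

\begin{lstlisting}
def greet(name):
    return "Hello, %s!" % name
\end{lstlisting}

% or include an entire source file
\lstinputlisting{myscript.py}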
I'd check out the reformat tool in an editor like NetBeans.
When you reformat Java, it properly fixes the lengths of lines both inside and outside of comments; if the same algorithm were applied to Python, it would work.
For Java it allows you to set any wrapping width and a bunch of other parameters. I'd be pretty surprised if that didn't exist, either natively or as a plugin.
Can't tell for sure just from the description, but it's worth a try:
http://www.netbeans.org/features/python/