When using Python's textwrap library, how can I turn this:
short line,
long line xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
into this:
short line,
long line xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxx
I tried:
w = textwrap.TextWrapper(width=90,break_long_words=False)
body = '\n'.join(w.wrap(body))
But I get:
short line, long line xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(spacing not exact in my examples)
try
w = textwrap.TextWrapper(width=90,break_long_words=False,replace_whitespace=False)
that seemed to fix the problem for me
I worked that out from what I read here (I've never used textwrap before)
body = '\n'.join(['\n'.join(textwrap.wrap(line, 90,
break_long_words=False, replace_whitespace=False))
for line in body.splitlines() if line.strip() != ''])
How about wrap only lines longer then 90 characters?
new_body = ""
lines = body.split("\n")
for line in lines:
if len(line) > 90:
w = textwrap.TextWrapper(width=90, break_long_words=False)
line = '\n'.join(w.wrap(line))
new_body += line + "\n"
TextWrapper is not designed to handle text that already has newlines in it.
There are a two things you may want to do when your document already has newlines:
1) Keep old newlines, and only wrap lines that are longer than the limit.
You can subclass TextWrapper as follows:
class DocumentWrapper(textwrap.TextWrapper):
def wrap(self, text):
split_text = text.split('\n')
lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
return lines
Then use it the same way as textwrap:
d = DocumentWrapper(width=90)
wrapped_str = d.fill(original_str)
Gives you:
short line,
long line xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxx
2) Remove the old newlines and wrap everything.
original_str.replace('\n', '')
wrapped_str = textwrap.fill(original_str, width=90)
Gives you
short line, long line xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(TextWrapper doesn't do either of these - it just ignores the existing newlines, which leads to a weirdly formatted result)
It looks like it doesn't support that. This code will extend it to do what I need though:
http://code.activestate.com/recipes/358228/
lines = text.split("\n")
lists = (textwrap.TextWrapper(width=90,break_long_words=False).wrap(line) for line in lines)
body = "\n".join("\n".join(list) for list in lists)
Here is a little module that can wrap text, break lines, handle extra indents (eg.a bulleted list), and replace characters/words with markdown!
class TextWrap_Test:
def __init__(self):
self.Replace={'Sphagnum':'$Sphagnum$','Equisetum':'$Equisetum$','Carex':'$Carex$',
'Salix':'$Salix$','Eriophorum':'$Eriophorum$'}
def Wrap(self,Text_to_fromat,Width):
Text = []
for line in Text_to_fromat.splitlines():
if line[0]=='-':
wrapped_line = textwrap.fill(line,Width,subsequent_indent=' ')
if line[0]=='*':
wrapped_line = textwrap.fill(line,Width,initial_indent=' ',subsequent_indent=' ')
Text.append(wrapped_line)
Text = '\n\n'.join(text for text in Text)
for rep in self.Replace:
Text = Text.replace(rep,self.Replace[rep])
return(Text)
Par1 = "- Fish Island is a low center polygonal peatland on the transition"+\
" between the Mackenzie River Delta and the Tuktoyaktuk Coastal Plain.\n* It"+\
" is underlain by continuous permafrost, peat deposits exceede the annual"+\
" thaw depth.\n* Sphagnum dominates the polygon centers with a caonpy of Equisetum and sparse"+\
" Carex. Dwarf Salix grows allong the polygon rims. Eriophorum and carex fill collapsed ice wedges."
TW=TextWrap_Test()
print(TW.Wrap(Par1,Text_W))
Will output:
Fish Island is a low center polygonal peatland on the
transition between the Mackenzie River Delta and the
Tuktoyaktuk Coastal Plain.
It is underlain by continuous permafrost, peat
deposits exceede the annual thaw depth.
$Sphagnum$ dominates the polygon centers with a
caonpy of $Equisetum$ and sparse $Carex$. Dwarf $Salix$
grows allong the polygon rims. $Eriophorum$ and
carex fill collapsed ice wedges.
Characters between the $$ would be in italics if you were working in matplotlib for instance, but the $$ won't count towards the line spacing since they are added after!
So if you did:
fig,ax = plt.subplots(1,1,figsize = (10,7))
ax.text(.05,.9,TW.Wrap(Par1,Text_W),fontsize = 18,verticalalignment='top')
ax.get_xaxis().set_visible(False)
ax.get_yaxis().set_visible(False)
You'd get:
I had to a similar problem formatting dynamically generated docstrings. I wanted to preserve the newlines put in place by hand and split any lines over a certain length. Reworking the answer by #far a bit, this solution worked for me. I only include it here for posterity:
import textwrap
wrapArgs = {'width': 90, 'break_long_words': True, 'replace_whitespace': False}
fold = lambda line, wrapArgs: textwrap.fill(line, **wrapArgs)
body = '\n'.join([fold(line, wrapArgs) for line in body.splitlines()])
Split, wrap+join, and rejoin:
def wrap(s):
return "\n".join("\n".join(textwrap.wrap(x)) for x in s.splitlines())
Related
I am writing a Python program that reads a file and then writes its contents to another one, with added margins. The margins are user-input and the line length must be at most 80 characters.
I wrote a recursive function to handle this. For the most part, it is working. However, the 2 lines before any new paragraph display the indentation that was input for the right side, instead of keeping the left indentation.
Any clues on why this happen?
Here's the code:
left_Margin = 4
right_Margin = 5
# create variable to hold the number of characters to withhold from line_Size
avoid = right_Margin
num_chars = left_Margin
def insertNewlines(i, line_Size):
string_length = len(i) + avoid + right_Margin
if len(i) <= 80 + avoid + left_Margin:
return i.rjust(string_length)
else:
i = i.rjust(len(i)+left_Margin)
return i[:line_Size] + '\n' + ' ' * left_Margin + insertNewlines(i[line_Size:], line_Size)
with open("inputfile.txt", "r") as inputfile:
with open("outputfile.txt", "w") as outputfile:
for line in inputfile:
num_chars += len(line)
string_length = len(line) + left_Margin
line = line.rjust(string_length)
words = line.split()
# check if num of characters is enough
outputfile.write(insertNewlines(line, 80 - avoid - left_Margin))
For input of left_Margin=4 and right_Margin = 5, I expect this:
____Poetry is a form of literature that uses aesthetic and rhythmic
____qualities of language—such as phonaesthetics, sound symbolism, and
____metre—to evoke meanings in addition to, or in place of, the prosai
____c ostensible meaning.
____Poetry has a very long history, dating back to prehistorical ti
____mes with the creation of hunting poetry in Africa, and panegyric an
____d elegiac court poetry was developed extensively throughout the his
____tory of the empires of the Nile, Niger and Volta river valleys.
But The result is:
____Poetry is a form of literature that uses aesthetic and rhythmic
______qualities of language—such as phonaesthetics, sound symbolism, and
______metre—to evoke meanings in addition to, or in place of, the prosai
________c ostensible meaning.
_____Poetry has a very long history, dating back to prehistorical ti
_____mes with the creation of hunting poetry in Africa, and panegyric an
_____d elegiac court poetry was developed extensively throughout the his
_____tory of the empires of the Nile, Niger and Volta river valleys.
This isn't really a good fit for a recursive solution in Python. Below is an imperative/iterative solution of the formatting part of your question (I'm assuming you can take this and write it to a file instead). The code assumes that paragraphs are indicated by two consecutive newlines ('\n\n').
txt = """
Poetry is a form of literature that uses aesthetic and rhythmic qualities of language—such as phonaesthetics, sound symbolism, and metre—to evoke meanings in addition to, or in place of, the prosaic ostensible meaning.
Poetry has a very long history, dating back to prehistorical times with the creation of hunting poetry in Africa, and panegyric and elegiac court poetry was developed extensively throughout the history of the empires of the Nile, Niger and Volta river valleys.
"""
def format_paragraph(paragraph, length, left, right):
"""Format paragraph ``p`` so the line length is at most ``length``
with ``left`` as the number of characters for the left margin,
and similiarly for ``right``.
"""
words = paragraph.split()
lines = []
curline = ' ' * (left - 1) # we add a space before the first word
while words:
word = words.pop(0) # process the next word
# +1 in the next line is for the space.
if len(curline) + 1 + len(word) > length - right:
# line would have been too long, start a new line
lines.append(curline)
curline = ' ' * (left - 1)
curline += " " + word
lines.append(curline)
return '\n'.join(lines)
# we need to work on one paragraph at a time
paragraphs = txt.split('\n\n')
print('0123456789' * 8) # print a ruler..
for paragraph in paragraphs:
print(format_paragraph(paragraph, 80, left=4, right=5))
print() # next paragraph
the output of the above is:
01234567890123456789012345678901234567890123456789012345678901234567890123456789
Poetry is a form of literature that uses aesthetic and rhythmic
qualities of language such as phonaesthetics, sound symbolism, and
metre to evoke meanings in addition to, or in place of, the prosaic
ostensible meaning.
Poetry has a very long history, dating back to prehistorical times with
the creation of hunting poetry in Africa, and panegyric and elegiac
court poetry was developed extensively throughout the history of the
empires of the Nile, Niger and Volta river valleys.
I am surprising, I am using python to slice a long DNA Sequence (4699673 character)to a specific length supstring, it's working properly with a problem in result, after 71 good result \n start apear in result for few slices then correct slices again and so on for whole long file
the code:
import sys
filename = open("out_filePU.txt",'w')
sys.stdout = filename
my_file = open("GCF_000005845.2_ASM584v2_genomic_edited.fna")
st = my_file.read()
length = len(st)
print ( 'Sequence Length is, :' ,length)
for i in range(0,len(st[:-9])):
print(st[i:i+9], i)
figure shows the error from the result file
please i need advice on that.
Your sequence file contains multiple lines, and at the end of each line there is a line break \n. You can remove them with st = my_file.read().replace("\n", "").
Try st = re.sub('\\s', '', my_file.read()) to replace any newlines or other whitespace (you'll need to add import re at the top of your script).
Then for i in range(0,len(st[:-9]),9): to step through your data in increments of nine characters. Otherwise you're only advancing by one character each time: that's why you can see the diagonal patterns in your output.
I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
what I want to do, is check that the % of each "res" gets to 100%
def addPoll(pid,party,state,res,filetype):
with open('Polls.txt', 'a+') as file: # open file temporarly for writing and reading
lines = file.readlines() # get all lines from file
file.seek(0)
next(file) # go to next line --
#this is suppose to skip the 1st line with pid/pary/state/res
for line in lines: # loop
line = line.split(',', 3)[3]
y = line.split()
print y
#else:
#file.write(pid + "," + party + "," + state + "," + res+"\n")
#file.close()
return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split the last ',' and enter it into a list, but im not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re
for line in lines:
numbers = re.findall(r'\d+', line)
numbers = [int(n) for n in numbers]
print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
reader = csv.DictReader(f)
for row in reader:
# Do something with row['res']
pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each - separated part (the last thing should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regex are also a fine solution too for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
# Handle bad percent (not ending in %)
pass
else:
# Throws ValueError if any of the percents aren't integers
percents = [int(p[:-1]) for p in percents]
if sum(percents) != 100:
# Handle bad total
pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
# Handle bad total here
pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
I'm a python learner. If I have a lines of text in a file that looks like this
"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Can I split the lines around the inverted commas? The only constant would be their position in the file relative to the data lines themselves. The data lines could range from 10 to 100+ characters (they'll be nested network folders). I cannot see how I can use any other way to do those markers to split on, but my lack of python knowledge is making this difficult.
I've tried
optfile=line.split("")
and other variations but keep getting valueerror: empty seperator. I can see why it's saying that, I just don't know how to change it. Any help is, as always very appreciated.
Many thanks
You must escape the ":
input.split("\"")
results in
['\n',
'Y:\\DATA\x0001\\SERVER\\DATA.TXT',
' ',
'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT',
'\n']
To drop the resulting empty lines:
[line for line in [line.strip() for line in input.split("\"")] if line]
results in
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
I'll just add that if you were dealing with lines that look like they could be command line parameters, then you could possibly take advantage of the shlex module:
import shlex
with open('somefile') as fin:
for line in fin:
print shlex.split(line)
Would give:
['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
No regex, no split, just use csv.reader
import csv
sample_line = '10.0.0.1 foo "24/Sep/2015:01:08:16 +0800" www.google.com "GET /" -'
def main():
for l in csv.reader([sample_line], delimiter=' ', quotechar='"'):
print l
The output is
['10.0.0.1', 'foo', '24/Sep/2015:01:08:16 +0800', 'www.google.com', 'GET /', '-']
shlex module can help you.
import shlex
my_string = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
shlex.split(my_string)
This will spit
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
Reference: https://docs.python.org/2/library/shlex.html
Finding all regular expression matches will do it:
input=r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
re.findall('".+?"', # or '"[^"]+"', input)
This will return the list of file names:
["Y:\DATA\00001\SERVER\DATA.TXT", "V:\DATA2\00002\SERVER2\DATA2.TXT"]
To get the file name without quotes use:
[f[1:-1] for f in re.findall('".+?"', input)]
or use re.finditer:
[f.group(1) for f in re.finditer('"(.+?)"', input)]
The following code splits the line at each occurrence of the inverted comma character (") and removes empty strings and those consisting only of whitespace.
[s for s in line.split('"') if s.strip() != '']
There is no need to use regular expressions, an escape character, some module or assume a certain number of whitespace characters between the paths.
Test:
line = r'"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
output = [s for s in line.split('"') if s.strip() != '']
print(output)
>>> ['Y:\\DATA\\00001\\SERVER\\DATA.TXT', 'V:\\DATA2\\00002\\SERVER2\\DATA2.TXT']
I think what you want is to extract the filepaths, which are separated by spaces. That is you want to split the line about items contained within quotations. I.e with a line
"FILE PATH" "FILE PATH 2"
You want
["FILE PATH","FILE PATH 2"]
In which case:
import re
with open('file.txt') as f:
for line in f:
print(re.split(r'(?<=")\s(?=")',line))
With file.txt:
"Y:\DATA\00001\SERVER\DATA MINER.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"
Outputs:
>>>
['"Y:\\DATA\\00001\\SERVER\\DATA MINER.TXT"', '"V:\\DATA2\\00002\\SERVER2\\DATA2.TXT"']
This was my solution. It parses most sane input exactly the same as if it was passed into the command line directly.
import re
def simpleParse(input_):
def reduce_(quotes):
return '' if quotes.group(0) == '"' else '"'
rex = r'("[^"]*"(?:\s|$)|[^\s]+)'
return [re.sub(r'"{1,2}',reduce_,z.strip()) for z in re.findall(rex,input_)]
Use case: Collecting a bunch of single shot scripts into a utility launcher without having to redo command input much.
Edit:
Got OCD about the stupid way that the command line handles crappy quoting and wrote the below:
import re
tokens = list()
reading = False
qc = 0
lq = 0
begin = 0
for z in range(len(trial)):
char = trial[z]
if re.match(r'[^\s]', char):
if not reading:
reading = True
begin = z
if re.match(r'"', char):
begin = z
qc = 1
else:
begin = z - 1
qc = 0
lc = begin
else:
if re.match(r'"', char):
qc = qc + 1
lq = z
elif reading and qc % 2 == 0:
reading = False
if lq == z - 1:
tokens.append(trial[begin + 1: z - 1])
else:
tokens.append(trial[begin + 1: z])
if reading:
tokens.append(trial[begin + 1: len(trial) ])
tokens = [re.sub(r'"{1,2}',lambda y:'' if y.group(0) == '"' else '"', z) for z in tokens]
I know this got answered a million year ago, but this works too:
input = '"Y:\DATA\00001\SERVER\DATA.TXT" "V:\DATA2\00002\SERVER2\DATA2.TXT"'
input = input.replace('" "','"').split('"')[1:-1]
Should output it as a list containing:
['Y:\\DATA\x0001\\SERVER\\DATA.TXT', 'V:\\DATA2\x0002\\SERVER2\\DATA2.TXT']
My question Python - Error Caused by Space in argv Arument was marked as a duplicate of this one. We have a number of Python books doing back to Python 2.3. The oldest referred to using a list for argv, but with no example, so I changed things to:-
repoCmd = ['Purchaser.py', 'task', repoTask, LastDataPath]
SWCore.main(repoCmd)
and in SWCore to:-
sys.argv = args
The shlex module worked but I prefer this.
I have a python script that is writing text to images using the PIL. Everything this is working fine except for when I encounter strings with carriage returns in them. I need to preserve the carriage returns in the text. Instead of writing the carriage return to the image, I get a little box character where the return should be. Here is the code that is writing the text:
<code>
draw = ImageDraw.Draw(blankTemplate)
draw.text((35 + attSpacing, 570),str(attText),fill=0,font=attFont)
</code>
attText is the variable that I am having trouble with. I'm casting it to a string before writing it because in some cases it is a number.
Thanks for you help.
Let's think for a moment. What does a "return" signify? It means go to the left some distance, and down some distance and resume displaying characters.
You've got to do something like the following.
y, x = 35, 570
for line in attText.splitlines():
draw.text( (x,y), line, ... )
y = y + attSpacing
You could try the the following code which works perfectly good for my needs:
# Place Text on background
lineCnt = 0
for line in str(attText):
draw = ImageDraw.Draw(blankTemplate)
draw.text((35 + attSpacing,570 + 80 * lineCnt), line, font=attFont)
lineCnt = lineCnt +1
The expression "y+80*lineCnt" moves the text down y position depending on the line no. (the factor "80" for the shift must be adapted according the font).