Get rid of all whitespace in between lines - python

So say I had some text like this:
line num 1
line num 2
line num 3
line num 4
I am trying to get rid of all the new lines between line 2 and 3 and between line 3 and 4, while keeping each line num on its own line. How would I accomplish this? I have already tried putting them into a list, then looping through them and taking out all of the lone '\n'.
ex:
obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']
a = -1
for i in obj:
    a += 1
    if i == '\n':
        print 'yes'
        del obj[a]
print obj
output:
['line num 1', 'line num 2', 'line num 3', '\n', 'line num4']
It catches some but not all.

In short: don't erase elements while iterating over a list.
Here you will find many ways to do this: Remove items from a list while iterating
Note: this is probably the shortest and most Pythonic (in Python 3, wrap it in list(...), since filter returns an iterator there):
filter(lambda x: x != '\n', obj)
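A list comprehension sidesteps the Python 2/3 filter difference and avoids the delete-while-iterating trap entirely. A minimal sketch, using the obj list from the question:

```python
obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']

# Build a new list instead of deleting from obj while looping over it;
# in-place deletion shifts the remaining items left, so the loop skips
# the element right after each deleted one.
cleaned = [x for x in obj if x != '\n']
print(cleaned)  # ['line num 1', 'line num 2', 'line num 3', 'line num4']
```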

I'd simply use regex on the whole file content:
>>> s = """line num 1
line num 2
line num 3
line num 4"""
>>> import re
>>> print re.sub('\n+', '\n', s)
line num 1
line num 2
line num 3
line num 4
P.S. You should never modify a list while iterating over it.

Maybe if not item.isspace() gives you something more readable:
>>> obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']
>>> [item for item in obj if not item.isspace()]
['line num 1', 'line num 2', 'line num 3', 'line num4']

def remove_new_line(obj):
    if "\n" in obj:
        obj.remove("\n")
        remove_new_line(obj)
    return obj

obj = ['line num 1', 'line num 2', '\n', 'line num 3', '\n', '\n', 'line num4']
print remove_new_line(obj)

You can also try this:
import re

f = open("your file.txt", 'r')
values = f.read()
val = re.split(r"\n+", values)
print val
output:
['line num 1', 'line num 2', 'line num 3', 'line num 4']

Related

How to get correct output from regex.split()?

import re

number_with_both_parentheses = "(\(*([\d+\.]+)\))"

def process_numerals(text):
    k = re.split(number_with_both_parentheses, text)
    k = list(filter(None, k))
    for elem in k:
        print(elem)

INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
expected_output = ['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']
current_output = ['Statement 1', '(1)', '1', 'Statement 2', '(1.1)', '1.1', 'Statement 3']
My input is INPUT. I am getting current_output when I call the method 'process_numerals' with the input text. How do I get to the expected output?
Your regex seems off. You realize that \(* checks for zero or more left parentheses?
>>> import re
>>> INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'
>>> re.split('\((\d+(?:\.\d+)?)\)', INPUT)
['Statement 1 ', '1', ' Statement 2 ', '1.1', ' Statement 3']
If you really want the literal parentheses to be included, put them inside the capturing parentheses.
The non-capturing parentheses (?:...) allow you to group without capturing. I guess that's what you are mainly looking for.
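Concretely, a sketch of that suggestion (the strip() call to drop the surrounding spaces is an extra step, not part of the original answer):

```python
import re

INPUT = 'Statement 1 (1) Statement 2 (1.1) Statement 3'

# The literal parentheses now sit inside the single capturing group,
# so re.split keeps '(1)' and '(1.1)' intact in the result.
parts = [p.strip() for p in re.split(r'(\(\d+(?:\.\d+)?\))', INPUT)]
print(parts)  # ['Statement 1', '(1)', 'Statement 2', '(1.1)', 'Statement 3']
```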

How do I split a multi-line string into multiple lines in python?

I have a multi-line string:
inputString = "Line 1\nLine 2\nLine 3"
I want to have an array where each element has a maximum of 2 lines in it, as below:
outputStringList = ["Line 1\nLine 2", "Line 3"]
Can I convert inputString to outputStringList in Python? Any help will be appreciated.
You could try to match two lines at a time (with a lookahead inside the pattern to avoid capturing the line feed), or just one line (to process the last, odd line). I expanded your example to show that it works for more than 3 lines, with a little "cheat": appending a newline at the end to handle all cases:
import re
s = "Line 1\nLine 2\nLine 3\nline4\nline5"
result = re.findall(r'(.+?\n.+?(?=\n)|.+)', s+"\n")
print(result)
result:
['Line 1\nLine 2', 'Line 3\nline4', 'line5']
the "add newline cheat" allows to process that properly:
s = "Line 1\nLine 2\nLine 3\nline4\nline5\nline6"
result:
['Line 1\nLine 2', 'Line 3\nline4', 'line5\nline6']
Here is an alternative using the grouper itertools recipe to group any number of lines together.
Note: you can implement this recipe by hand, or you can optionally install a third-party library that implements this recipe for you, i.e. pip install more_itertools.
Code
from more_itertools import grouper

def group_lines(iterable, n=2):
    # Note: recent more_itertools releases changed the signature to
    # grouper(iterable, n, ...); this uses the older grouper(n, iterable, ...).
    return ["\n".join(line for line in lines if line)
            for lines in grouper(n, iterable.split("\n"), fillvalue="")]
Demo
s1 = "Line 1\nLine 2\nLine 3"
s2 = "Line 1\nLine 2\nLine 3\nLine4\nLine5"
group_lines(s1)
# ['Line 1\nLine 2', 'Line 3']
group_lines(s2)
# ['Line 1\nLine 2', 'Line 3\nLine4', 'Line5']
group_lines(s2, n=3)
# ['Line 1\nLine 2\nLine 3', 'Line4\nLine5']
Details
group_lines() splits the string into lines and then groups the lines by n via grouper.
list(grouper(2, s1.split("\n"), fillvalue=""))
[('Line 1', 'Line 2'), ('Line 3', '')]
Finally, for each group of lines, only non-empty strings are rejoined with a newline character.
See more_itertools docs for more details on grouper.
I'm hoping I get your logic right - if you want a list of strings, each with at most one newline delimiter, then the following code snippet will work:
# Newline-delimited string
a = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5\nLine 6\nLine 7"
# Resulting list
b = []
# First split the string into "1-line-long" pieces
a = a.split("\n")
for i in range(1, len(a), 2):
    # Then join the pieces by 2's and append to the resulting list
    b.append(a[i - 1] + "\n" + a[i])
    # Account for the possibility of an odd-sized list
    if i == len(a) - 2:
        b.append(a[i + 1])
print(b)
>>> ['Line 1\nLine 2', 'Line 3\nLine 4', 'Line 5\nLine 6', 'Line 7']
Although this solution isn't the fastest nor the best, it's easy to understand and it does not involve extra libraries.
I wanted to post the grouper recipe from the itertools docs as well, but PyToolz' partition_all is actually a bit nicer.
from toolz import partition_all

s = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
result = ['\n'.join(tup) for tup in partition_all(2, s.splitlines())]
# ['Line 1\nLine 2', 'Line 3\nLine 4', 'Line 5']
Here's the grouper solution for the sake of completeness:
from itertools import zip_longest

# Recipe from the itertools docs.
def grouper(iterable, n, fillvalue=None):
    "Collect data into fixed-length chunks or blocks"
    # grouper('ABCDEFG', 3, 'x') --> ABC DEF Gxx
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

result = ['\n'.join((a, b)) if b else a for a, b in grouper(s.splitlines(), 2)]
Use str.splitlines() to split the full input into lines:
>>> inputString = "Line 1\nLine 2\nLine 3"
>>> outputStringList = inputString.splitlines()
>>> print(outputStringList)
['Line 1', 'Line 2', 'Line 3']
Then, join the first lines to obtain the desired result:
>>> result = ['\n'.join(outputStringList[:-1])] + outputStringList[-1:]
>>> print(result)
['Line 1\nLine 2', 'Line 3']
Bonus: write a function that does the same for any number of desired lines:
def split_to_max_lines(inputStr, n):
    lines = inputStr.splitlines()
    # This defines which element in the list becomes the 2nd in the
    # final result. For n = 2, index = -1; for n = 4, index = -3; etc.
    split_index = -(n - 1)
    result = ['\n'.join(lines[:split_index])]
    result += lines[split_index:]
    return result

print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 2))
print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 4))
print(split_to_max_lines("Line 1\nLine 2\nLine 3\nline 4\nLine 5\nLine 6", 5))
Returns:
['Line 1\nLine 2\nLine 3\nline 4\nLine 5', 'Line 6']
['Line 1\nLine 2\nLine 3', 'line 4', 'Line 5', 'Line 6']
['Line 1\nLine 2', 'Line 3', 'line 4', 'Line 5', 'Line 6']
I'm not sure what you mean by "a maximum of 2 lines" and how you'd hope to achieve that. However, splitting on newlines is fairly simple.
'Line 1\nLine 2\nLine 3'.split('\n')
This will result in:
['line 1', 'line 2', 'line 3']
To get the weird allowance for "some" line splitting, you'll have to write your own logic for that.
b = "a\nb\nc\nd".split("\n", 3)
c = ["\n".join(b[:-1]), b[-1]]
print c
gives
['a\nb\nc', 'd']
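For completeness, plain slicing on the result of splitlines() gives the same pairwise grouping with no imports and no regex. A sketch, generalized by a chunk size n:

```python
s = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
n = 2  # lines per chunk

lines = s.splitlines()
# Step through the list n lines at a time and rejoin each chunk.
chunks = ['\n'.join(lines[i:i + n]) for i in range(0, len(lines), n)]
print(chunks)  # ['Line 1\nLine 2', 'Line 3\nLine 4', 'Line 5']
```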

sorting using python for complex strings [closed]

Closed. This question needs debugging details. It is not currently accepting answers.
Edit the question to include desired behavior, a specific problem or error, and the shortest code necessary to reproduce the problem. This will help others answer the question.
Closed 7 years ago.
I have an array that contains numbers and characters, e.g. ['A 3', 'C 1', 'B 2'], and I want to sort it using the numbers in each element.
I tried the code below but it did not work:
def getKey(item):
    item.split(' ')
    return item[1]

x = ['A 3', 'C 1', 'B 2']
print sorted(x, key=getKey(x))
To be safe, I'd recommend you to strip everything but the digits.
>>> import re
>>> x = ['A 3', 'C 1', 'B 2', 'E']
>>> print sorted(x, key=lambda n: int(re.sub(r'\D', '', n) or 0))
['E', 'C 1', 'B 2', 'A 3']
With your method:
def getKey(item):
    return int(re.sub(r'\D', '', item) or 0)
>>> print sorted(x, key=getKey)
['E', 'C 1', 'B 2', 'A 3']
What you have, plus comments to what's not working :P
def getKey(item):
    item.split(' ')  # Without assigning the result to anything? This doesn't change item.
                     # Also, split() splits on whitespace naturally.
    return item[1]   # Returns a string, which will not sort correctly as a number.

x = ['A 3', 'C 1', 'B 2']
print sorted(x, key=getKey(x))  # This passes the *result* of getKey(x) as key, which is
                                # nonsensical; key should be the function itself.
What it should be
print sorted(x, key=lambda i: int(i.split()[1]))
This is one way to do it:
>>> x = ['A 3', 'C 1', 'B 2']
>>> y = [i[::-1] for i in sorted(x)]
>>> y.sort()
>>> y = [i[::-1] for i in y]
>>> y
['C 1', 'B 2', 'A 3']
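Note that this reversal trick compares the digits as characters, so it can misorder multi-digit numbers, whereas an int() key stays correct. A sketch illustrating the difference (the two-element input is a hypothetical example, not from the question):

```python
x = ['A 10', 'B 2']

# String reversal sorts '01 A' before '2 B' because '0' < '2' ...
y = sorted(i[::-1] for i in x)
print([i[::-1] for i in y])  # ['A 10', 'B 2'] -- numerically wrong order

# ... while a numeric key compares 10 and 2 as integers.
print(sorted(x, key=lambda i: int(i.split()[1])))  # ['B 2', 'A 10']
```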

Parse line data until keyword with pyparsing

I'm trying to parse line data and then group it into lists.
Here is my script:
from pyparsing import *
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
EOL = LineEnd().suppress()
start = Keyword('START').suppress() + EOL
end = Keyword('END').suppress() + EOL
line = SkipTo(LineEnd()) + EOL
lines = start + OneOrMore(start | end | Group(line))
start.setDebug()
end.setDebug()
line.setDebug()
result = lines.parseString(data)
results_list = result.asList()
print(results_list)
This code was inspired by another stackoverflow question:
Matching nonempty lines with pyparsing
What I need is to parse everything from START to END line by line and save it to one list per group (everything from START to the matching END is one group). However, this script puts every line in a new group.
This is the result:
[['line 2'], ['line 3'], ['line 4'], ['line a'], ['line b'], ['line c'], ['']]
And I want it to be:
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
Also, it parses an empty string at the end.
I'm a pyparsing beginner so I ask you for your help.
Thanks
You could use a nestedExpr to find the text delimited by START and END.
If you use
In [322]: pp.nestedExpr('START', 'END').searchString(data).asList()
Out[322]:
[[['line', '2', 'line', '3', 'line', '4']],
[['line', 'a', 'line', 'b', 'line', 'c']]]
then the text is split on whitespace. (Notice we have 'line', '2' above where we want 'line 2' instead). We'd rather it just split only on '\n'. So to fix this we can use the pp.nestedExpr function's content parameter which allows us to control what is considered an item inside the nested list.
The source code for nestedExpr defines
content = (Combine(OneOrMore(~ignoreExpr +
                             ~Literal(opener) + ~Literal(closer) +
                             CharsNotIn(ParserElement.DEFAULT_WHITE_CHARS, exact=1))
           ).setParseAction(lambda t: t[0].strip()))
by default, where pp.ParserElement.DEFAULT_WHITE_CHARS is
In [324]: pp.ParserElement.DEFAULT_WHITE_CHARS
Out[324]: ' \n\t\r'
This is what causes nestedExpr to split on all whitespace.
So if we reduce that to simply '\n', then nestedExpr splits the content by
lines instead of by all whitespace.
import pyparsing as pp
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""
opener = 'START'
closer = 'END'
content = pp.Combine(pp.OneOrMore(~pp.Literal(opener)
                                  + ~pp.Literal(closer)
                                  + pp.CharsNotIn('\n', exact=1)))
expr = pp.nestedExpr(opener, closer, content=content)
result = [item[0] for item in expr.searchString(data).asList()]
print(result)
yields
[['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
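If pyparsing isn't a hard requirement, the same START/END grouping can be done with a plain line-by-line loop. A stdlib-only sketch, not the pyparsing approach the question asks about:

```python
data = """START
line 2
line 3
line 4
END
START
line a
line b
line c
END
"""

groups = []
current = None
for line in data.splitlines():
    if line == 'START':
        current = []            # open a new group
    elif line == 'END':
        groups.append(current)  # close and store the group
        current = None
    elif current is not None:
        current.append(line)    # collect lines inside START..END

print(groups)  # [['line 2', 'line 3', 'line 4'], ['line a', 'line b', 'line c']]
```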

Python - Faster way of making a custom list of text in a file

I'm trying to make a list of text in a text file like it was being typed.. Kinda like this:
T
Te
Tex
Text
I dunno how to explain it well, so here's an example:
Text file contents:
Line 1
Line 2
Line 3
The list of the first line will be like: ['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 1', 'Line 1\n'].
And the complete list will be: [['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 1', 'Line 1\n'], ['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 2', 'Line 2\n'], ['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 3']]
This is my current code:
lines = open('foo.txt', 'r').readlines()
letters = []
cnt = 0
for line in lines:
    letters.append([])
    for letter in line:
        if len(letters[cnt]) > 0:
            letters[cnt].append(letters[cnt][len(letters[cnt]) - 1] + letter)
        else:
            letters[cnt].append(letter)
    cnt += 1
print letters
The output is exactly like the complete list above.
The problem is this code is kinda slow on bigger files.. Is there any faster way to achieve the same output?
result = []
for line in open('foo.txt'):
    result.append([line[:i+1] for i in xrange(len(line))])
print result
Using a list comprehension:
In [66]: with open("data.txt") as f:
   ....:     print [[line[0:i+1] for i in range(len(line))] for line in f]
   ....:
[['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 1', 'Line 1\n'],
['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 2', 'Line 2\n'],
['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 3', 'Line 3\n']]
The reason why this gets slow is because you collect huge lists with only redundant information. Do you really need these lists or would something like that do the trick, too?
for line in lines:
    for i in range(0, len(line) - 1):
        for j, letter in enumerate(line):
            print letter,
            if j >= i:
                print ''
                break
This outputs
T
T h
T h i
T h i s
T h i s
T h i s i
T h i s i s
T h i s i s
T h i s i s t
T h i s i s t h
T h i s i s t h e
T h i s i s t h e
T h i s i s t h e f
T h i s i s t h e f i
T h i s i s t h e f i r
T h i s i s t h e f i r s
T h i s i s t h e f i r s t
T h i s i s t h e f i r s t
T h i s i s t h e f i r s t l
T h i s i s t h e f i r s t l i
T h i s i s t h e f i r s t l i n
T h i s i s t h e f i r s t l i n e
and I assume this is what you want (except for the whitespaces between the letters but I assume we can get rid of them somehow, too).
This seems like a particularly good case for Python 2's buffer objects (the forerunner of memoryview): with them you do not create substrings of the original string, merely views of it. The performance gain on a large file with lines longer than a few characters should be substantial.
results = []
with open("data.txt") as f:
    for line in f:
        letters = tuple(buffer(line, 0, i + 1) for i in xrange(len(line)))
        results.append(letters)
If the list of all prefixes does not need to be fully expanded at the same time, using generators can be considered.
Note:
If timing without printing, the following should be hard to beat ;-)
with open("data.txt") as f:
    results = (buffer(line, 0, i + 1) for line in f for i in xrange(len(line)))
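On Python 3, where buffer is gone, itertools.accumulate gives a compact way to build the running prefixes of a line, since string concatenation is its default operation; unlike buffer views, though, it does materialize each substring. A sketch:

```python
from itertools import accumulate

line = "Line 1\n"
# accumulate repeatedly applies + , yielding 'L', 'Li', 'Lin', ...
prefixes = list(accumulate(line))
print(prefixes)  # ['L', 'Li', 'Lin', 'Line', 'Line ', 'Line 1', 'Line 1\n']
```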
