I am reading a file line by line and doing some text processing in order to get output in a certain format.
My string-processing code goes as follows:
file1 = open('/myfolder/testfile.txt')
scanlines = file1.readlines()
string = ''
for line in scanlines:
    if line.startswith('>from'):
        continue
    if line.startswith('*'):
        continue
    string.join(line.rstrip('\n'))
The output of this code is as follows:
abc
def
ghi
Is there a way to join these physical lines into one logical line, e.g.:
abcdefghi
Basically, how can I concatenate multiple strings into one large string?
If I were reading from a file with very long strings, is there a risk of overflow when concatenating multiple physical lines into one logical line?
There are several ways to do this. For example, just using + should do the trick:
"abc" + "def" # produces "abcdef"
If you want to concatenate multiple strings, you can use the join method:
', '.join(('abc', 'def', 'ghi')) # produces 'abc, def, ghi'
If you want no delimiter, join on the empty string: ''.join().
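For example, joining the pieces from the question with an empty-string separator gives exactly the output asked for:
''.join(('abc', 'def', 'ghi'))  # produces 'abcdefghi'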
Cleaning things up a bit, it would be easiest to append to a list and then join the result:
def joinfile(filename):
    sarray = []
    with open(filename) as fd:
        for line in fd:
            if line.startswith('>from') or line.startswith('*'):
                continue
            sarray.append(line.rstrip('\n'))
    return ''.join(sarray)
If you wanted to get really cute, you could also do the following:
fd = open(filename)
result = ''.join([line.rstrip('\n') for line in fd if not (line.startswith('>from') or line.startswith('*'))])
Yes, of course you could read a file big enough to overflow memory.
Use string addition
>>> s = 'a'
>>> s += 'b'
>>> s
'ab'
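One caveat worth noting (a general Python point, not something from the answer above): repeated += on a long string copies the string each time, so for many pieces it is usually better to collect them in a list and join once. A minimal sketch:
parts = []
for chunk in ('abc', 'def', 'ghi'):
    parts.append(chunk)
s = ''.join(parts)  # 'abcdefghi', built without repeated copying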
I would prefer:
oneLine = reduce(lambda x, y: x + y,
                 [line[:-1] for line in open('/myfolder/testfile.txt')
                  if not line.startswith('>from') and
                  not line.startswith('*')])
line[:-1] removes the trailing \n from each line.
The second argument of reduce is a list comprehension that extracts all the lines you are interested in and removes the \n from each.
The reduce (only if you actually need it) makes one string out of the list of strings.
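Note that in Python 3 reduce lives in functools, so the snippet above would need an import; a minimal sketch of the same idea:
from functools import reduce

reduce(lambda x, y: x + y, ['abc', 'def', 'ghi'])  # 'abcdefghi'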
I'm pretty new to Python and am looking for a way to get the following result from a long string.
I am reading in lines of a text file, where each line looks like this:
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;
After data processing, the data shall be stored in another text file with this data.
A short example:
2:55:12;66,81;66,75;35,38;
The real string is much longer, but always with the same pattern:
; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38; Puff2OG;30,25; Puff1OG;29,25; PuffFB;23,50; ....
So this means:
remove the leading semicolon
keep the second element
remove the third element
keep the fourth element
remove the fifth element
keep the sixth element
and so on
The number of elements can vary, so I guess as a first step I have to parse the string to get the number of elements, and then do some looping through the string and assign each part that shall be kept to a variable.
I have tried some variations of the command .split() but with no success.
Would it be easier to store all elements in a list and then for-loop through the list keeping and dropping elements?
If yes, how would this look, so that at the end I have a file stored with lines like this:
2:55:12 ; 66,81 ; 66,75 ; 35,38 ;
2:56:12 ; 67,15 ; 74;16 ; 39,15 ;
etc. ....
best regards Stefan
This solution works independently of the content between the semicolons.
One line, though it's a bit messier:
result = ' ; '.join(string.split(';')[1::2])
Getting rid of the leading semicolon:
Just slice it off!
string = string[2:]
Splitting by semicolon & every second element:
Given a string, we can split by semicolon:
arr = string.split(';')[1::2]
The [1::2] slice takes every second element, starting at index 1. This keeps the second, fourth, sixth, etc. elements.
Resulting string
To produce the string result you want, simply .join:
result = ' ; '.join(arr)
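Putting the steps together on the sample line from the question (the .strip() per element is my own addition, to drop the stray spaces):
line = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
result = ' ; '.join(part.strip() for part in line.split(';')[1::2])
print(result)  # 2:55:12 ; 66,81 ; 66,75 ; 35,38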
A regex-based solution, which operates on the original input:
import re

inp = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
output = re.sub(r'\s*[A-Z][^;]*?;', '', inp)[2:]
print(output)
This prints:
2:55:12;66,81;66,75;35,38;
This shows how to do it for one line of input, if the same pattern repeats itself every time:
input_str = "; 2:55:12;PuffDG;66,81; Puff4OG;66,75; Puff3OG;35,38;"
f = open('output.txt', 'w')  # open text file to write to
output_list = input_str.split(';')[1::2]  # create list with the numbers of interest
# write to file
for out in output_list:
    f.write(f"{out.strip()} ; ")
# end the line and close the file
f.write("\n")
f.close()
Thank you very much for the quick response. You are awesome.
Your solutions are very compact.
In the meantime I found another solution, but this solution needs more lines of code.
best regards Stefan
I'm not familiar with how to insert code as a code-section properly
So I add it as plain text
fobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_2min.log")
wobj = open(r"C:\Users\Stefan\AppData\Local\Programs\Python\Python38-32\Heizung_number_2min.log", "w")
for TextLine in fobj:  # iterate directly; mixing this with readline() would skip every other line
    print(TextLine)
    myList = TextLine.split(';')
    TextLine = ""
    for index, item in enumerate(myList):
        if index % 2 == 1:
            TextLine += item
            TextLine += ";"
    TextLine += '\n'
    print(TextLine)
    wobj.write(TextLine)
fobj.close()
wobj.close()
I have a problem stripping '[' from my string (read from a file).
Code:
data = open(Koorpath1, 'r')
for x in data:
    print(x)
    print(x.strip('['))
Result:
[["0.9986130595207214","26.41608428955078"],["39.44521713256836","250.2412109375"],["112.84327697753906","120.34269714355469"],["260.63800048828125","15.424667358398438"],["273.6199645996094","249.74160766601562"]]
"0.9986130595207214","26.41608428955078"],["39.44521713256836","250.2412109375"],["112.84327697753906","120.34269714355469"],["260.63800048828125","15.424667358398438"],["273.6199645996094","249.74160766601562"]]
Desired output:
"0.9986130595207214","26.41608428955078","39.44521713256836","250.2412109375","112.84327697753906","120.34269714355469","260.63800048828125","15.424667358398438","273.6199645996094","249.74160766601562"
Thanks
It strips only the first two '[' at the very beginning; it seems you have one long string, so you have to split it first:
datalist = data.split(',')
for x in datalist:
    pass  # your code here
If you don't want to split it and want it all in one string, you need replace, not strip (strip only works at the beginning and the end):
data = data.replace('[','')
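Since the desired output also has the closing brackets removed, you could chain a second replace (my own extension of the idea above):
data = data.replace('[', '').replace(']', '')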
If the data is JSON, then parse it into a Python list and treat it from there:
from itertools import chain
import json
nums = json.loads(x)
print(','.join('"%s"' % num for num in chain.from_iterable(nums)))
chain.from_iterable helps you "flatten" the list of lists, and join concatenates everything into one long output.
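A tiny illustration of the flattening step on its own:
from itertools import chain

list(chain.from_iterable([[1, 2], [3, 4]]))  # [1, 2, 3, 4]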
I'm still learning Python, and I have a question I haven't been able to solve. I have a very long string (millions of lines long) which I would like to split into smaller strings based on a specified number of occurrences of a delimiter.
For instance:
ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//
In this case I would want to split based on "//" and return a string of all lines before the nth occurrence of the delimiter.
So an input of splitting the string by // by 1 would return:
ABCDEF
an input of splitting the string by // by 2 would return:
ABCDEF
//
GHIJKLMN
an input of splitting the string by // by 3 would return:
ABCDEF
//
GHIJKLMN
//
OPQ
And so on... However, the length of the original 2-million-line string appeared to be a problem when I simply tried to split the entire string by "//" and work with the individual indexes (I was getting a memory error). Perhaps Python can't handle so many lines in one split? So I can't do that.
I'm looking for a way that doesn't require splitting the entire string into hundreds of thousands of indexes when I may only need 100, but instead starts from the beginning, stops at a certain point, and returns everything before it, which I assume may also be faster. I hope my question is as clear as possible.
Is there a simple or elegant way to achieve this? Thanks!
If you want to work with files instead of strings in memory, here is another answer.
This version is written as a function that reads lines and immediately prints them out until the specified number of delimiters have been found (no extra memory needed to store the entire string).
def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()  # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)

file_split('data.txt', '//', 3)
You can use this to write the output to a new file like this:
python split.py > newfile.txt
With a little extra work, you can use argparse to pass parameters to the program.
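A possible argparse wrapper, as a sketch (the argument names here are illustrative, not part of the original answer):
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('file_name')
parser.add_argument('delimiter')
parser.add_argument('-n', type=int, default=1)
args = parser.parse_args()
file_split(args.file_name, args.delimiter, args.n)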
As a more efficient way, you can read just the first N chunks separated by your delimiter; if you are sure that all of your lines are split by the delimiter, you can use itertools.islice to do the job:
from itertools import islice
with open('filename') as f:
    lines = islice(f, 0, 2 * N - 1)
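For example, with N = 3 (N is assumed to be defined by you), this would print the first three records together with the two delimiters between them:
from itertools import islice

N = 3  # number of records wanted (assumed)
with open('filename') as f:
    for line in islice(f, 0, 2 * N - 1):
        print(line.rstrip('\n'))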
The method that comes to my mind when I read your question uses a for loop, where you cut the string up into several chunks (for example the 100 you mentioned) and iterate through each substring:
thestring = ""  # your string
steps = 100  # length of the chunks you are going to iterate through
log = 0
substring = thestring[:log + steps]  # this is the string you will split and iterate through
thelist = substring.split("//")
for element in thelist:
    if element_you_want(element):  # pseudocode: replace with your own test
        pass  # do your thing with the line
    else:
        log = log + steps
        # and go again from the start, only with this offset
Now you can go through all the elements of the whole 2 million(!) line string.
The best thing to do here is actually to make a recursive function out of this (if that is what you want):
thestring = ""  # your string
steps = 100  # length of the chunks you are going to iterate through

def iterateThroughHugeString(beginning):
    substring = thestring[:beginning + steps]  # this is the string you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        if element_you_want(element):  # pseudocode: replace with your own test
            pass  # do your thing with the line
        else:
            iterateThroughHugeString(beginning + steps)
            # and go again from the start, only with this offset
For instance:
i = 0
s = ""
fd = open("...")
for l in fd:
    if l[:-1] == delimiter:  # skip last '\n'
        i += 1
        if i >= max_split:
            break
    s += l
fd.close()
Since you are learning Python, it would be a good challenge to model a complete dynamic solution. Here's a notion of how you can model one.
Note: the following code snippet only works for files in the given format (see the 'For instance' in the question). Hence, it is a static solution.
num = int(input("Enter the number of delimiters: ")) * 2
with open("./data.txt") as myfile:
    print([next(myfile) for x in range(num - 1)])
Now that you have the idea, you can use pattern matching and so on.
How does one remove a header from a long string of text?
I have a program that displays a FASTA file as
...TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG...
The string is large and contains multiple headers like this.
The headers that need to be trimmed start with a > and end with a $.
There are multiple headers, ranging from IonTorrenttrimmedcontig1 to IonTorrenttrimmedcontig25.
How can I cut on the > and the $, remove everything in between, and separate the code before and after into separate list elements?
The file is read from a standard FASTA file, so I'd be very happy to hear possible solutions for the input step as well.
As it is part of a FASTA file, you can split it like this:
>>> import re
>>> a = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
>>> re.split(r">[^$]*\$", a)
['TCGATCATCGATCG', 'CCGTAGGTGAACCTGCGGAAG']
Also, some people are answering with slicing on '>ion1'. That's totally wrong!
I believe your problem is solved! I am also adding the bioinformatics tag to this question!
I would use the re module for that:
>>> s = "blablabla>ion1$foobar>ion2$etc>ion3$..."
>>> import re
>>> re.split(r">[^$]*\$", s)
['blablabla', 'foobar', 'etc', '...']
And if you have 1 string on each line:
>>> with open("foo.txt", "r") as f:
...     for line in f:
...         re.split(r">[^$]*\$", line[:-1])
...
['blablabla', 'foobar', 'etc', '...']
['fofofofofo', 'barbarbar', 'blablabla']
If you are reading over every line, there are a few ways to do this. You could use partition (partition returns a tuple containing 3 elements: the text before the specified string, the specified string, and the text after):
for line in file:
    stripped_header = line.partition(">")[2].partition("$")[0]
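For example, on one of the strings from the question, partition picks out the header like this:
line = "TCGATCATCGATCG>IonTorrenttrimmedcontig1$CCGTAGGTGAACCTGCGGAAG"
line.partition(">")[2].partition("$")[0]  # 'IonTorrenttrimmedcontig1'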
You could use split:
for line in file:
    stripped_header = line.split(">")[1].split("$")[0]
You could loop over all the characters in the string and only append after you pass ">" but before "$" (however, this will not be nearly as efficient):
for line in file:
    in_header = False  # renamed from `bool` so the built-in isn't shadowed
    stripped_header = ""
    for char in line:
        if char == ">":
            in_header = True
        elif in_header:
            if char != "$":
                stripped_header += char
            else:
                in_header = False
Or alternatively use a regular expression, but it seems like my peers have already beaten me to it!
Hi,
I'm new to Python and I have Python 3.2!
I have a file which has some sort of format like this:
Number of segment pairs = 108570; number of pairwise comparisons = 54234
'+' means given segment; '-' means reverse complement
Overlaps Containments No. of Constraints Supporting Overlap
******************* Contig 1 ********************
E_180+
E_97-
******************* Contig 2 ********************
E_254+
E_264+ is in E_254+
E_276+
******************* Contig 3 ********************
E_256-
E_179-
I want to count the number of non-empty lines between the ***** Contig # ***** markers,
and I want to get a result like this:
contig1=2
contig2=3
contig3=2
Probably, it's best to use regular expressions here. You can try the following:
import re
text = open(file).read()  # renamed from `str` so the built-in isn't shadowed
pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
pairs is a list of tuples, where each tuple has the form ('Contig x', '...').
The second component of each tuple contains the text after the marker.
Afterwards, you can count the number of '\n' characters in those texts; this is most easily done via a list comprehension:
[(contig, txt.count('\n')) for (contig,txt) in pairs]
(Edit: if you don't want to count empty lines, you can try:
[(contig, txt.count('\n')-txt.count('\n\n')) for (contig,txt) in pairs]
)
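Put together, a minimal end-to-end sketch (assuming the sample text from the question is saved in report.txt, a file name of my choosing):
import re

text = open('report.txt').read()
pairs = re.findall(r'\*+ (Contig \d+) \*+\n([^*]*)', text)
print([(contig, txt.count('\n') - txt.count('\n\n')) for (contig, txt) in pairs])
# e.g. [('Contig 1', 2), ('Contig 2', 3), ('Contig 3', 2)]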
def give(filename):
    with open(filename) as f:
        for line in f:
            if 'Contig' in line:
                category = line.strip('* \r\n')
                break
        cnt = 0
        aim = []
        for line in f:
            if 'Contig' in line:
                yield (category + '=' + str(cnt), aim)
                category = line.strip('* \r\n')
                cnt = 0
                aim = []
            elif line.strip():
                cnt += 1
                if 'is in' in line:
                    aim.append(line.strip())
        yield (category + '=' + str(cnt), aim)
for a, b in give('input.txt'):
    print(a)
    if b:
        print(b)
Result:
Contig 1=2
Contig 2=3
['E_264+ is in E_254+']
Contig 3=2
The function give() isn't a normal function; it is a generator function. See the docs, and if you have questions, I will answer.
strip() is a function that eliminates characters at the beginning and at the end of a string.
When used without an argument, strip() removes whitespace (that is to say \f, \n, \r, \t, \v, and blank spaces). When a string is passed as an argument, all the characters present in the string argument are removed from the beginning and end of the treated string. The order of characters in the string argument doesn't matter: the argument doesn't designate a substring but a set of characters to be removed.
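For example, with the character set used in the code above:
'******* Contig 2 *******\n'.strip('* \r\n')  # 'Contig 2'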
line.strip() is a means to know whether a line contains any characters that aren't whitespace.
The fact that elif line.strip(): is situated after the line if 'Contig' in line:, and that it is written elif and not if, is important: if it were the contrary, line.strip() would be true for a line such as
******** Contig 2 *********\n
I suppose that you will be interested to know the content of the lines like this one:
E_264+ is in E_254+
because it is this kind of line that makes a difference in the counting.
So I edited my code so that the function give() also produces the information from these kinds of lines.