How to join second line to end of first line in python? - python

I tried to read lines like below:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1
xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1
xT
Where each Even lines is a continuation of ODD lines but which is split by "\n\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s\s" so I want to replace those '\n\s(n)' to '' and join back to end of ODD lines .
FOR EXAMPLE:
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
TO
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
CODE:
import os
import sys
import re
lines=["A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1"," xQ,1xT","A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x"," H,1xY","A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1"," xT"]
for i in lines:
print i.replace(" ","")
Here,I just replaced spaces by empty space but i didnt get how to join those replaced EVEN lines to end of ODD lines.
So could some one help me to do the same.
Thanking you in advance.
Hi guys , First of all Many more thanks for your kind replies. I tried all the ways but the followed one works correct:
WILD= open("INPUT.txt", 'r')
merged = []
for line in WILD:
if line.startswith(" "):
merged[-1] += line.strip()
else:
merged.append(line.replace("\n",""))
OUTPUT:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1xT

Instead of that replace statement, you can just use str.strip to strip away whitespace at the beginning or the end of the string. Also, you can use zip to iterate pairs of lines.
for x, y in zip(l[::2],l[1::2]):
print "".join([x, y.strip()])
Or use next to get the next line if this is an iterator, like a file.
for x in iterator:
y = next(iterator)
print "".join([x, y.strip()])
Both ways, all the even lines (0, 2, ...) go to x and all the odd ones (1, 3, ...) to y.
Of course, this is assuming that all the entries in the list/file are spanning exactly two lines.
If they can span an arbitrary number of lines (just one, or two, or maybe five), then this will get more complicated. In this case, you might try something like this:
merged = []
for line in lines:
if line.startswith(" "):
merged[-1] += line.strip()
else:
merged.append(line)
Note: If thoses are indeed lines from a file, you might have to apply strip to all the lines, i.e. also x.strip() and merged.append(line.strip()), as each line will be terminated by \n which you might want to get rid of.

Read the entire file as a single string, then replace the entire whitespace with a single tab:
filepointer = open("INPUT.txt")
text = filepointer.read()
text = re.sub(r"\n\s{20,}", "\t", text)
This matches and removes sequences of a newline followed by 20 or more spaces, replacing them with a tab. (That way I don't have to count the precise number of spaces, and the program still works if some lines are slightly different).
If you don't want a tab between the joined lines, just use a space (" ") instead of "\t".
And if you must have the result as a list of lines, split text afterwards:
merged = text.splitlines()

Related

Python DNA sequence slice gives \N as wrong content in slice result

I am surprising, I am using python to slice a long DNA Sequence (4699673 character)to a specific length supstring, it's working properly with a problem in result, after 71 good result \n start apear in result for few slices then correct slices again and so on for whole long file
the code:
import sys
filename = open("out_filePU.txt",'w')
sys.stdout = filename
my_file = open("GCF_000005845.2_ASM584v2_genomic_edited.fna")
st = my_file.read()
length = len(st)
print ( 'Sequence Length is, :' ,length)
for i in range(0,len(st[:-9])):
print(st[i:i+9], i)
figure shows the error from the result file
please i need advice on that.
Your sequence file contains multiple lines, and at the end of each line there is a line break \n. You can remove them with st = my_file.read().replace("\n", "").
Try st = re.sub('\\s', '', my_file.read()) to replace any newlines or other whitespace (you'll need to add import re at the top of your script).
Then for i in range(0,len(st[:-9]),9): to step through your data in increments of nine characters. Otherwise you're only advancing by one character each time: that's why you can see the diagonal patterns in your output.

Parsing numbers in strings from a file

I have a txt file as here:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
what I want to do, is check that the % of each "res" gets to 100%
def addPoll(pid,party,state,res,filetype):
with open('Polls.txt', 'a+') as file: # open file temporarly for writing and reading
lines = file.readlines() # get all lines from file
file.seek(0)
next(file) # go to next line --
#this is suppose to skip the 1st line with pid/pary/state/res
for line in lines: # loop
line = line.split(',', 3)[3]
y = line.split()
print y
#else:
#file.write(pid + "," + party + "," + state + "," + res+"\n")
#file.close()
return "pass"
print addPoll("123","Democratic","OH","bla bla 50%-Asd ASD 50%",'f')
So in my code I manage to split the last ',' and enter it into a list, but im not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re
for line in lines:
numbers = re.findall(r'\d+', line)
numbers = [int(n) for n in numbers]
print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv
with open('Polls.txt') as f:
reader = csv.DictReader(f)
for row in reader:
# Do something with row['res']
pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each - separated part (the last thing should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regex are also a fine solution too for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
# Handle bad percent (not ending in %)
pass
else:
# Throws ValueError if any of the percents aren't integers
percents = [int(p[:-1]) for p in percents]
if sum(percents) != 100:
# Handle bad total
pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
# Handle bad total here
pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!

Regex remove certain characters from a file

I'd like to write a python script that reads a text file containing this:
FRAME
1 J=1,8 SEC=CL1 NSEG=2 ANG=0
2 J=8,15 SEC=CL2 NSEG=2 ANG=0
3 J=15,22 SEC=CL3 NSEG=2 ANG=0
And output a text file that looks like this:
1 1 8
2 8 15
3 15 22
I essentially don't need the commas or the SEC, NSEG and ANG data. Could someone help me use regex to do this?
So far I have this:
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
with open('RawDataFile_445.txt') as a:
# open all 4 files with a meaningful name
file=[open(outputfile.txt","w")
for line in a:
Without regex:
for line in file:
keep = []
line = line.strip()
if line.startswith('FRAME'):
continue
first, second, *_ = line.split()
keep.append(first)
first, second = second.split('=')
keep.extend(second.split(','))
print(' '.join(keep))
My advice? Since I don't write many regex's I avoid writing big ones all at once. Since you've already done that I would try to verify it a small chunk at a time, as illustrated in this code.
import re
r = re.compile(r"\s*(\d+)\s+J=(\S+)\s+SEC=(\S+)\s+NSEG=(\S+)+ANG=(\S+)\s")
r = re.compile(r"\s*(\d+)")
r = re.compile(r"\s*(\d+)\s+J=(\d+)")
with open('RawDataFile_445.txt') as a:
a.readline()
for line in a.readlines():
result = r.match(line)
if result:
print (result.groups())
The first regex is your entire brute of an expression. The next line is the first chunk I verified. The next line is the second, bigger chunk that worked. Notice the slight change.
At this point I would go back, make the correction to the original, whole regex and then copy a bigger chunk to try. And re-run.
Let's focus on an example string we want to parse:
1 J=1,8
We have space(s), digit(s), more space(s), some characters, then digit(s), a comma, and more digit(s). If we replace them with regex characters, we get (\d+)\s+J=(\d+),(\d+), where + means we want 1 or more of that type. Note that we surround the digits with parentheses so we can capture them later with .groups() or .group(#), where # is the nth group.

How to write specific line lengths of a file?

I have this sequences (over 9000) like this:
>TsM_000224500
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL
The lines containing the ">" are the ID's and the lines with the letters are the amino acid (aa) sequences. I need to delete (or move to another files) the sequences below 40 aa and over 4000 aa.
Then, the resulting file, should contain only the sequences within this range (>= 40 aa and <= 4K aa).
I've tried writing the following script:
def read_seq(file_name):
with open(file_name) as file:
return file.read().split('\n')[0:]
ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")
tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')
for x in range(len(ts)):
if ([x][0:1] != '>'):
if (len([x]) > 40 or len([x]) < 4000):
tsf.write('%s\n'%(x))
tsf.close()
print "OK!"
I've done some modifications, but all I'm getting are empty files or with all the +9000 sequences.
In your for loop, x is an iterating integer due to using range() (i.e, 0,1,2,3,4...). Try this instead:
for x in ts:
This will give you each element in ts as x
Also, you don't need the brackets around x; Python can iterate over the characters in strings on its own. When you put brackets around a string, you put it into a list, and thus if you tried, for example, to get the second character in x: [x][1], Python will try to get the second element in the list that you put x in, and will run into problems.
EDIT: To include IDs, try this:
NOTE: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) > 40 and len(x) < 4000) -- using and instead of or will give you the result you're looking for.
for i, x in enumerate(ts): #NEW: enumerate ts to get the index of every iteration (stored as i)
if (x[0] != '>'):
if (len(x) > 40 and len(x) < 4000):
tsf.write('%s\n'%(ts[i-1])) #NEW: write the ID number found on preceding line
tsf.write('%s\n'%(x))
Try this, simple and easy to understand. It does not load the entire file into memory, instead iterates over the file line by line.
tsf=open('output.txt','w') # open the output file
with open("yourfile",'r') as ts: # open the input file
for line in ts: # iterate over each line of input file
line=line.strip() # removes all whitespace at the start and end, including spaces, tabs, newlines and carriage returns.
if line[0]=='>': # if line is an ID
continue # move to the next line
else: # otherwise
if (len(line)>40) or (len(line)<4000): # if line is in required length
tsf.write('%s\n'%line) # write to output file
tsf.close() # done
print "OK!"
FYI, you could also use awk for a one line solution if working in unix environment:
cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt

Extracting Data from Multiple TXT Files and Creating a Summary CSV File in Python

I have a folder with about 50 .txt files containing data in the following format.
=== Predictions on test data ===
inst# actual predicted error distribution (OFTd1_OF_Latency)
1 1:S 2:R + 0.125,*0.875 (73.84)
I need to write a program that combines the following: my index number (i), the letter of the true class (R or S), the letter of the predicted class, and each of the distribution predictions (the decimals less than 1.0).
I would like it to look like the following when finished, but preferably as a .csv file.
ID True Pred S R
1 S R 0.125 0.875
2 R R 0.105 0.895
3 S S 0.945 0.055
. . . . .
. . . . .
. . . . .
n S S 0.900 0.100
I'm a beginner and a bit fuzzy on how to get all of that parsed and then concatenated and appended. Here's what I was thinking, but feel free to suggest another direction if that would be easier.
for i in range(1, n):
s = str(i)
readin = open('mydata/output/output'+s+'out','r')
#The files are all named the same but with different numbers associated
output = open("mydata/summary.csv", "a")
storage = []
for line in readin:
#data extraction/concatenation here
if line.startswith('1'):
id = i
true = # split at the ':' and take the letter after it
pred = # split at the second ':' and take the letter after it
#some have error '+'s and some don't so I'm not exactly sure what to do to get the distributions
ds = # split at the ',' and take the string of 5 digits before it
if pred == 'R':
dr = #skip the character after the comma but take the have characters after
else:
#take the five characters after the comma
lineholder = id+' , '+true+' , '+pred+' , '+ds+' , '+dr
else: continue
output.write(lineholder)
I think using the indexes would be another option, but it might complicate things if the spacing is off in any of the files and I haven't checked this for sure.
Thank you for your help!
Well first of all, if you want to use CSV, you should use CSV module that comes with python. More about this module here: https://docs.python.org/2.7/library/csv.html I won't demonstrate how to use it, because it's pretty simple.
As for reading the input data, here's my suggestion how to break down every line of the data itself. I assume that lines of data in the input file have their values separated by spaces, and each value cannot contain a space:
def process_line(id_, line):
pieces = line.split() # Now we have an array of values
true = pieces[1].split(':')[1] # split at the ':' and take the letter after it
pred = pieces[2].split(':')[1] # split at the second ':' and take the letter after it
if len(pieces) == 6: # There was an error, the + is there
p4 = pieces[4]
else: # There was no '+' only spaces
p4 = pieces[3]
ds = p4.split(',')[0] # split at the ',' and take the string of 5 digits before it
if pred == 'R':
dr = p4.split(',')[0][1:] #skip the character after the comma but take the have??? characters after
else:
dr = p4.split(',')[0]
return id_+' , '+true+' , '+pred+' , '+ds+' , '+dr
What I mainly used here was split function of strings: https://docs.python.org/2/library/stdtypes.html#str.split and in one place this simple syntax of str[1:] to skip the first character of the string (strings are arrays after all, we can use this slicing syntax).
Keep in mind that my function won't handle any errors or lines formated differently than the one you posted as an example. If the values in every line are separated by tabs and not spaces you should replace this line: pieces = line.split() with pieces = line.split('\t').
i think u can separte floats and then combine it with the strings with the help of re module as follows:
import re
file = open('sample.txt','r')
strings=[[num for num in re.findall(r'\d+\.+\d+',i) for i in file.readlines()]]
print (strings)
file.close()
file = open('sample.txt','r')
num=[[num for num in re.findall(r'\w+\:+\w+',i) for i in file.readlines()]]
print (num)
s= num+strings
print s #[['1:S','2:R'],['0.125','0.875','73.84']] output of the code
this prog is written for one line u can use it for multiple line as well but u need to use a loop for that
contents of sample.txt:
1 1:S 2:R + 0.125,*0.875 (73.84)
2 1:S 2:R + 0.15,*0.85 (69.4)
when you run the prog the result will be:
[['1:S,'2:R'],['1:S','2:R'],['0.125','0.875','73.84'],['0.15,'0.85,'69.4']]
simply concatenate them
This uses regular expressions and the CSV module.
import re
import csv
matcher = re.compile(r'[[:blank:]]*1.*:(.).*:(.).* ([^ ]*),[^0-9]?(.*) ')
filenametemplate = 'mydata/output/output%iout'
output = csv.writer(open('mydata/summary.csv', 'w'))
for i in range(1, n):
for line in open(filenametemplate % i):
m = matcher.match(line)
if m:
output.write([i] + list(m.groups()))

Categories

Resources