Add part of line found after #solution to file - python

I have a script that writes the line starting with #Solution 1 to a new file, together with the name of the input file. But I want to write the value from the Major column of the input file instead. Can someone please help me figure out how to get that piece of text?
The script now:
#!/usr/bin/env python3
import os

dr = "/home/nwalraven/Result_pgx/Runfolder/Runres_Aldy"
outdr = "/home/nwalraven/Result_pgx/Runfolder/Aldy_res_txt"
tag = ".aldy"

for f in os.listdir(dr):
    if f.endswith(tag):
        print(f)
        new_file_name = f.split('_')[0]+'.txt' # get the name of the file before the '_' and add '.txt' to it
        with open(dr+"/"+f) as file:
            for line in file.readlines():
                if line.startswith("#Solution 1"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(f.split('.')[0] + "\n")
                        new_file.write(line + "\n")
                if line.startswith("#Solution 2"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(line + "\n")
                    # Dutch for: "Multiple solutions found! Check the Aldy file"
                    print("Meerdere oplossingen gevonden! Check Aldy bestand" )
The input:
file = EMQN3-S3_COMT.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *Met, *ValB
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19950234 C>T 530 H62= rs4633
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19951270 G>A 651 V158M rs4680
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 1 ValB
file = EMQN3-S3_CYP2B6.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *1.001, *1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 0 1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 1 1.001
The result it gives right now:
EMQN3-S3_COMT.aldy
#Solution 1: *Met, *ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1.001, *1.001
The result I need:
EMQN3-S3_COMT.aldy
#Solution 1: *Met/*ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1/*1

If you write out the whole line, you could use a regular expression to replace text before writing it.
On the other hand, if you know the line always starts with a fixed number of characters, it's easier and faster to edit the line manually.
With regex:
# Importing regular expressions
import re
# Setting up regex replacement to replace ", " with "/"
regex = r", "
replacement = "/"
...
# Format the line before writing it
line_formatted = re.sub(regex, replacement, line)
new_file.write(line_formatted + "\n") # edited
...

Try to replace this part of your script:
...
if line.startswith("#Solution 1"):
    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
        new_file.write(f.split('.')[0] + "\n")
        solution = "/".join([x.strip().split(".")[0] for x in line.split(",")])
        new_file.write(solution + "\n")
...
It will do the following:
split the string into two tokens, based on the comma
strip them
remove the decimal part (if any) from the token
rejoin the string using the slash.
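The four steps above can be tried in isolation with a small sketch. The helper name format_solution_line is hypothetical, and the sketch assumes that the only dots in the line are the ones separating a major allele from its minor suffix; a name containing an earlier dot would be cut at that dot.

```python
def format_solution_line(line):
    """Join the comma-separated alleles with '/' and drop any '.NNN' suffix."""
    # split on the comma, strip whitespace, keep only the part before the first dot
    tokens = [token.strip().split(".")[0] for token in line.split(",")]
    return "/".join(tokens)


print(format_solution_line("#Solution 1: *1.001, *1.001\n"))  # #Solution 1: *1/*1
print(format_solution_line("#Solution 1: *Met, *ValB\n"))     # #Solution 1: *Met/*ValB
```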
Hope it helps.

Related

python regex: Parsing file name

I have a text file (filenames.txt) that contains the file name with its file extension.
filename.txt
[AW] One Piece - 629 [1080P][Dub].mkv
EP.585.1080p.mp4
EP609.m4v
EP 610.m4v
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One_Piece_0745_Sons'_Cups!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One Piece - 621 1080P.mkv
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
These are example filenames with their extensions. I need to rename each file to its episode number (without changing its extension).
Example:
Input:
EP609.m4v
EP 610.m4v
EP.585.1080p.mp4
One Piece - 621 1080P.mkv
[AW] One Piece - 629 [1080P][Dub].mkv
One_Piece_0745_Sons'_Cups!.mp4
One Piece 0696 A Tearful Reunion! Rebecca and Kyros!.mp4
One Piece - 591 (1080P Funi Web-Dl -Ks-)-1.m4v
One_Piece_S10E577_Zs_Ambition_A_Great_and_Desperate_Escape_Plan.mp4
Expected Output:
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4 (or) 0745.mp4
696.mp4 (or) 0696.mp4
591.m4v
577.mp4
Hope someone will help me parse and rename these filenames. Thanks in advance!!!
As you tagged python, I guess you are willing to use python.
(Edit: I've realized a loop in my original code is unnecessary.)
import re

with open('filename.txt', 'r') as f:
    files = f.read().splitlines() # read filenames

# assume: an episode number consists of 3 digits, possibly preceded by 0
p = re.compile(r'0?(\d{3})')
for file in files:
    if m := p.search(file):
        print(m.group(1) + '.' + file.split('.')[-1])
    else:
        print(file)
This will output
609.m4v
610.m4v
585.mp4
621.mkv
629.mkv
745.mp4
696.mp4
591.m4v
577.mp4
Basically, it searches for the first 3-digit number, possibly preceded by 0.
I strongly advise you to check the output; in particular, you would want to run sort OUTPUTFILENAME | uniq -d to see whether there are duplicate target names.
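If you would rather do the duplicate check in Python than with sort and uniq, something like the following sketch could work. It reuses the 0?(\d{3}) pattern from the code above; the helper names target_name and duplicate_targets and the sample file names are made up for illustration.

```python
import re
from collections import Counter

EPISODE = re.compile(r'0?(\d{3})')  # same pattern as in the answer


def target_name(filename):
    """Return '<episode>.<ext>', or the unchanged name if no episode is found."""
    match = EPISODE.search(filename)
    if match is None:
        return filename
    return match.group(1) + '.' + filename.split('.')[-1]


def duplicate_targets(filenames):
    """Return the sorted target names that more than one input file maps to."""
    counts = Counter(target_name(name) for name in filenames)
    return sorted(name for name, n in counts.items() if n > 1)


# hypothetical input: two different files that both look like episode 609
names = ["EP609.m4v", "EP 609 [1080P].m4v", "One Piece - 621 1080P.mkv"]
print(duplicate_targets(names))  # ['609.m4v']
```

An empty list means every file gets a unique new name, so renaming will not overwrite anything.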
(Original answer:)
p = re.compile(r'\d{3,4}')
for file in files:
    for m in p.finditer(file):
        ep = m.group(0)
        if int(ep) < 1000:
            print(ep.lstrip('0') + '.' + file.split('.')[-1])
            break # go to next file if ep found (avoid the else clause)
    else: # if ep not found, just print the filename as is
        print(file)
A program that parses the episode number and renames the file accordingly.
Modules used:
re - To parse File Name
os - To rename File Name
full/path/to/folder - is the path to the folder where your file lives
import re
import os

for file in os.listdir(path="full/path/to/folder/"):
    # searches for the first 3 or 4 digit number less than 1000 for each line.
    for match_obj in re.finditer(r'\d{3,4}', file):
        episode = match_obj.group(0)
        if int(episode) < 1000:
            new_filename = episode.lstrip('0') + '.' + file.split('.')[-1]
            old_name = "full/path/to/folder/" + file
            new_name = "full/path/to/folder/" + new_filename
            os.rename(old_name, new_name)
            # go to next file if ep found (avoid the else clause)
            break
    else:
        # if episode not found, just leave the filename as it is
        pass

Deleting specific columns of each row

I'm new to Python and right now I'm out of ideas.
What I'm trying to do: I have a file, for
example:
254 578 name1 *--21->28--* secname1
854 548 name2 *--21->28--* secname2
944 785 name3 *--21->28--* secname3
1025 654 name4 *--21->28--* secname4
Between those fields are a lot of spaces, and I want to remove specific spaces between "name*" and "secname*" in each row. I don't know how to remove the characters/spaces at positions 21 -> 28, as marked in the example.
What i got so far:
fobj_in = open("85488_66325_R85V54.txt")
fobj_out = open("85488_66325_R85V54.txt","w")
for line in fobj_in:
fobj_in.close()
fobj_out.close()
At the end it should look like:
254 578 name1 secname1
854 548 name2 secname2
944 785 name3 secname3
1025 654 name4 secname4
To remove characters at specific index positions, use slicing:
for line in open('85488_66325_R85V54.txt'):
    newline = line[:21] + line[29:]
    print(newline, end='')  # the line already ends with a newline
removes the characters in column 21:28 (which are all whitespaces in your example)
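To see the slice boundaries in action, here is a small sketch. The helper drop_columns is a hypothetical name, and the sample row is built by hand under the assumption of a fixed-width layout (21 characters of data, 8 filler spaces, then the second name), since the exact spacing of the real file is not shown.

```python
def drop_columns(line, start, stop):
    """Remove the characters at index positions start..stop (inclusive)."""
    return line[:start] + line[stop + 1:]


# hypothetical fixed-width row: data padded to 21 characters, 8 filler spaces, rest
row = "254  578  name1".ljust(21) + " " * 8 + "secname1"
shortened = drop_columns(row, 21, 28)
print(repr(row))
print(repr(shortened))
print(len(row) - len(shortened))  # 8 characters were removed
```

Note that the stop index is inclusive here, which is why the slice in the answer above resumes at 29 rather than 28.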
Just split the line and pop the element you don't need.
fobj_in = open('85488_66325_R85V54','r')
fobj_out = open('85488_66325_R85V54.txt', 'a')
for line in fobj_in:
    items = line.split()
    items.pop(3)
    fobj_out.write(' '.join(items)+'\n')
fobj_in.close()
fobj_out.close()
You can just use the string object's split method, like so:
f = open('my_file.txt', 'r')
data = f.readlines()
final_data = []
for line in data:
    bits = line.split()
    final_data.append([bits[0], bits[1], bits[2], bits[4]])
Basically I'm just illustrating how to use that split method to break each line into individual chunks, at which point you can do whatever you wish, like print all of those bits and selectively discard one of the columns.
I can suggest a robust method to correct the input line.
#!/usr/bin/env python3
# -----------------------------------
line = '254 578 name1 *--21->28--* secname1'
# -----------------------------------
def correctline(line, marker='*'):
    skipping = False  # True while we are between two markers
    lineout = ''
    for val in line:
        if val == marker:
            skipping = not skipping
            continue
        if not skipping:
            lineout = lineout + val
    # -----------------------------------
    # collapse runs of spaces into a single space
    while '  ' in lineout:
        lineout = lineout.replace('  ', ' ')
    return lineout
# ------------------------------------
print(correctline(line))
Basically, it loops through the characters of the input line. Everything between a pair of markers is skipped, and at the end any run of multiple spaces is replaced with a single space.
If the names are of varying lengths and you don't want to just remove a set number of spaces between them, you can search for blank characters to find where secname begins and name ends:
# open file in "read" mode
fobj_in = open("85488_66325_R85V54.txt", "r")
# use readlines to create a list, each member containing a line of 85488_66325_R85V54.txt
lines = fobj_in.readlines()
# For each line search from the end backwards for the first " " char
# when this char is found create first_name which is a list containing the
# elements of line from here onwards and a second list which is the elements up to
# this point. Then search for a non " " char and remove the blank spaces.
# remaining_line and first_name can then be concatenated back together using
# + with the desired number of spaces between then (in this case 12).
for line_number, line in enumerate(lines):
    first_name_found = False
    new_line_created = False
    for i in range(len(line)):
        if line[-i] == " " and not first_name_found:
            first_name = line[-i+1:]
            remaining_line = line[:-i+1]
            first_name_found = True
    for j in range(len(remaining_line)):
        if remaining_line[-j-1] != " " and not new_line_created:
            new_line = remaining_line[0:-j] + " "*12 + first_name
            new_line_created = True
    lines[line_number] = new_line
Then just write lines back to 85488_66325_R85V54.txt.
You could try it as follows:
for line in fobj_in:
    setstring = line
    print(setstring.replace(" ", ""))

Python get previous few elements if match checked element

I have some structured data in a text file:
Parse.txt
name1
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
Some of the detail4 entries have no data, and their value is replaced by "-":
name2
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
How do I parse the data to get the elements below detail1, detail2 and detail3, but only for the records whose detail4 is empty ("-")?
So far I have partially working code, but the problem is that it collects each item 40 times. Please help.
Code:
data = []
with open("parse.txt","r",encoding="utf-8") as text_file:
    for line in text_file:
        data.append(line)
det4li = []
finali = []
for elem,det4 in zip(data,data[1:]):
    if "detail4" in elem:
        det4li.append(det4)
        if "-" in det4:
            for elem1,det1,det2,det3 in zip(data,data[1:],data[3:],data[5:]):
                if "detail1:" in elem1:
                    finali.append(det1.strip() + "," + det2.strip() + "," + det3)
Current Output: 40 records of dddddddd,eeeeeeee,ffffffff
Desired Output: dddddddd,eeeeeeee,ffffffff
Don't try to look ahead. Look behind, by storing preceding data:
final = []
with open("parse.txt","r",encoding="utf-8") as text_file:
    section = {}
    last_header = None
    for line in text_file:
        line = line.strip()
        if line.startswith('detail'):
            # detail line, record for later use
            last_header = line.rstrip(':')
        elif not last_header:
            # name line, store as such
            section['name'] = line
        else:
            section[last_header] = line
            if last_header == 'detail4':
                # section complete, process
                if line == '-':
                    # A section we want to keep
                    final.append(section)
                # reset section data
                section, last_header = {}, None
This has the added advantage that you now don't need to read the whole file into memory. If you turn this into a generator (by putting it into a function and replacing the final.append(section) line with yield section), you can even process those matching sections as you read the file without sacrificing readability.
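The generator variant mentioned above might look roughly like this. It is a sketch: the function name empty_detail4_sections is made up, it takes any iterable of lines (an open file works too), and skipping blank lines is an added assumption that the original loop does not make.

```python
def empty_detail4_sections(lines):
    """Yield one dict per section whose detail4 value is '-'."""
    section, last_header = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith('detail'):
            # header line, remember it for the value lines that follow
            last_header = line.rstrip(':')
        elif last_header is None:
            # no header seen yet in this section, so this is the name line
            section['name'] = line
        else:
            section[last_header] = line
            if last_header == 'detail4':
                if line == '-':
                    yield section
                # section complete, start a fresh one
                section, last_header = {}, None


sample = """name1
detail:
aaaaaaaa
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
name2
detail:
aaaaaaaa
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
""".splitlines()

for section in empty_detail4_sections(sample):
    print(",".join(section[key] for key in ("detail1", "detail2", "detail3")))
# prints: dddddddd,eeeeeeee,ffffffff
```

With a real file you would pass the open file object instead of sample, so only one section is held in memory at a time.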

Python Regex or Filename Function

A question about renaming files in a folder. My file names look like this:
EPG CRO 24 Kitchen 09.2013.xsl
with spaces in the name. I used code like this:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import os

# Remove whitespace from files where EPG named with space " " replace with "_"
for filename in os.listdir("."):
    if filename.find("2013|09 ") > 0:
        newfilename = filename.replace(" ","_")
        os.rename(filename, newfilename)
With this code I removed the whitespace, but how can I also remove the date from the file name, so that it looks like this: EPG_CRO_24_Kitchen.xsl? Can you suggest a solution?
Regex
As utdemir was alluding to, regular expressions can really help in situations like these. If you have never been exposed to them, they can be confusing at first. Check out https://www.debuggex.com/r/4RR6ZVrLC_nKYs8g for a useful tool that helps you construct regular expressions.
Solution
An updated solution would be:
import os
import re

def rename_file(filename):
    if filename.startswith('EPG') and ' ' in filename:
        # \s+ means 1 or more whitespace characters
        # [0-9]{2} means exactly 2 characters of 0 through 9
        # \. means find a '.' character
        # [0-9]{4} means exactly 4 characters of 0 through 9
        newfilename = re.sub(r"\s+[0-9]{2}\.[0-9]{4}", '', filename)
        newfilename = newfilename.replace(" ","_")
        os.rename(filename, newfilename)
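Before renaming anything on disk, it can help to check the name transformation on its own. The sketch below is one way to do that; the function name clean_name is hypothetical, and the lookahead that ties the date to the position right before the extension is an addition of mine, not part of the regex above.

```python
import re

# ' MM.YYYY' only when it sits directly before the final extension (assumed layout)
DATE_SUFFIX = re.compile(r"\s+[0-9]{2}\.[0-9]{4}(?=\.[^.]+$)")


def clean_name(filename):
    """Drop a trailing ' MM.YYYY' before the extension and replace spaces with '_'."""
    return DATE_SUFFIX.sub("", filename).replace(" ", "_")


print(clean_name("EPG CRO 24 Kitchen 09.2013.xsl"))  # EPG_CRO_24_Kitchen.xsl
```

Because clean_name never touches the filesystem, you can print old and new names side by side first and only call os.rename once the result looks right.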
Side Note
# Remove whitespace from files where EPG named with space " " replace with "_"
for filename in os.listdir("."):
    if filename.find("2013|09 ") > 0:
        newfilename = filename.replace(" ","_")
        os.rename(filename, newfilename)
Unless I'm mistaken, judging from the comment you made above, filename.find("2013|09 ") > 0 won't work: str.find searches for the literal substring, so the "|" is not treated as "or".
Given the following:
In [76]: filename = "EPG CRO 24 Kitchen 09.2013.xsl"
In [77]: filename.find("2013|09 ")
Out[77]: -1
And your described comment, you might want something more like:
In [80]: if filename.startswith('EPG') and ' ' in filename:
   ....:     print('process this')
   ....:
process this
If all file names have the same format (NAME_20XX_XX.xsl), then you can use Python's string slicing instead of a regex:
name.replace(' ','_')[:-12] + '.xsl'
If the dates are always formatted the same way:
>>> import re
>>> s = "EPG CRO 24 Kitchen 09.2013.xsl"
>>> re.sub(r"\s+\d{2}\.\d{4}\..{3}$", "", s)
'EPG CRO 24 Kitchen'
How about a little slicing:
newfilename = input1[:input1.rfind(" ")].replace(" ","_")+input1[input1.rfind("."):]

patterns searching in text

I have a text file, seq.txt, as follows:
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I need to count occurrences of a pattern in these sequences, and wrote this Python script to do it:
import re

infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
    line = line.strip("\n")
    if line.startswith('>'):
        name = line
    else:
        s = re.findall(pattern,line)
        print('%s:%s' %(name,s))
        out.write('%s:\t%s\n' %(name,len(s)))
But it is giving the wrong result. The script is reading line by line.
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?
Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.
This code might be helpful for you:
import re

pattern = re.compile("GAAAT", flags=re.IGNORECASE)
with open('seq.txt') as f:
    sections = f.read().split('\n\n')
for section in sections:
    lines = section.split()
    name = lines[0].lstrip('>')
    data = ''.join(lines[1:])
    print('{0}: {1}'.format(name, len(pattern.findall(data))))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)
You should gather input until you enter the next pattern. This would also solve the corner case of where your pattern crosses a line boundary (not sure if that "can" happen with your data, but it looks like it).
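A minimal sketch of that idea, assuming FASTA-style input where each record starts with a ">" header line: collect the sequence lines per record, join them, and only then search. The function name count_per_record and the short sample sequences are invented for illustration.

```python
import re


def count_per_record(lines, motif="GAAAT"):
    """Return {record name: number of non-overlapping motif hits}."""
    pattern = re.compile(motif, flags=re.IGNORECASE)
    chunks = {}   # record name -> list of sequence lines
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            current = line[1:]
            chunks[current] = []
        elif line and current is not None:
            chunks[current].append(line)
    # join each record's lines first so hits that span a line break are found
    return {rec: len(pattern.findall("".join(parts))) for rec, parts in chunks.items()}


# hypothetical data: in S1 the motif is split across two lines
sample = [">S1", "ACGTGA", "AATCC", ">S2", "GAAATGAAAT", "TTTT"]
print(count_per_record(sample))  # {'S1': 1, 'S2': 2}
```

A line-by-line count would report 0 for S1 here, because neither "ACGTGA" nor "AATCC" contains the motif on its own.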
Use a counter. Also, you have your print call inside the for loop, so it runs once for every line that reaches the else branch instead of once per sequence. Note that it's also not a good idea to use the variable line as both the iterator variable in the for loop and as another variable. It makes the code more confusing.
counter_dict = {}
for line in infile:
    if line[0] == '>':
        name = line[1:].strip()
        counter_dict[name] = 0
    else:
        counter_dict[name] += len(re.findall(pattern,line))
for (key, val) in counter_dict.items():
    print('%s:%s' %(key, val))
    out.write('%s:\t%s\n' %(key, val))
