Find the first pattern before the specific substring in python

Find the first pattern before the specific substring in python - python

In Python 3.6.5, say I have a string, read from file, like this:
# comments
newmtl material_0_2_8
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
map_Kd ../images/texture0.png
newmtl material_1_24
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
newmtl material_20_1_8
Kd 1 1 1
Ka 0 0 0
Ks 0.4 0.4 0.4
Ke 0 0 0
Ns 10
illum 2
d 1.0
map_Kd ../images/texture0.jpg
... and so on ...
I'm looping for each texture and I need to get the corresponding material code.
I want to retrieve the substring material_* corresponding to a certain texture*, which I know the name.
So for example, if I have texture0.jpg, I want to return material_20_1_8; if I have texture0.png then I want to have material_0_2_8.
How can I do it in this way?
f=open('path/to/file', "r")
if f.mode == 'r':
contents =f.read() # contains the string shown above
for texture in textures: # textures is the list of the texture names
material_code = ?
Or any other way, if you think you know a better one.

Try this:
mapping = {}
with open('input.txt', 'r') as fin:
for line in fin:
if line.startswith('newmtl'):
material = line[len('newmtl '):-1]
elif line.startswith('map_Kd'):
file = line.split('/')[-1][:-1]
mapping[file] = material
Then mapping is a dict with the relations you want:
{'texture0.jpg': 'material_20_1_8', 'texture0.png': 'material_0_2_8'}

Iteratively:
import re
textures = ('texture0.jpg', 'texture0.png')
with open('input.txt') as f:
pat = re.compile(r'\bmaterial_\S+')
for line in f:
line = line.strip()
m = pat.search(line)
if m:
material = m.group()
elif line.endswith(textures):
print(line.split('/')[-1], material)
The output:
texture0.png material_0_2_8
texture0.jpg material_20_1_8

Who likes regular expressions may like this approach for its readability and efficiency.
re.findall() returns a sequence of matched groups (the parts of the regexp enclosed in brackets) for all matches of the regular expression in the input data. The regular expression thus finds all occurences of the "newmtl" line with the nearest following "map_Kd" line and extracts the value parts from those lines using regex groups. The values are then just reversed to create the needed dictionary through dictionary comprehension.
I like this solution because it is compact and efficient. Notice, I added just one (well, multiline) expression into the original example (and one import, to be exact). If you can read regular expressions, it is also well readable.
import re
f = open('path/to/file', "r")
if f.mode == 'r':
contents = f.read() # contains the string shown above
materials = {
filename: material for material, filename in
re.findall(r'^newmtl (material_\S+)$.*?^map_Kd \.\./images/(.+?)$',
contents, re.MULTILINE | re.DOTALL)
}
for texture in textures: # textures is the list of the texture names
material_code = materials[texture]
The regular expression in this example works with given data. If you need to be more strict or more permissive regarding whitespace or other kind of variability in the source data, it may need to be further tuned.

Related

Separate lines in Python

I have a .txt file. It has 3 different columns. The first one is just numbers. The second one is numbers which starts with 0 and it goes until 7. The final one is a sentence like. And I want to keep them in different lists because of matching them for their numbers. I want to write a function. How can I separate them in different lists without disrupting them?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3-----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]

With your current approach, you are reading the entire file as a string, and performing a split on a whitespace (you'd much rather split on newlines instead, because each line is separated by a newline). Furthermore, you're not segregating your data into disparate columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). The None implies splitting on space. Each group will be stored as key-list pairs inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict
o = OrderedDict()
with open('input.txt', 'r') as f:
for line in f:
i, j, k = line.strip().split(None, 2)
o.setdefault(i, []).append([int(i), int(j), k])
print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.

Maybe you want something like that:
import re
# Collect data from inpu file
h = {}
with open('input.txt', 'r') as f:
for line in f:
res = re.match("^(\d+)\s+(\d+)\s+(.*)$", line)
if res:
if not res.group(1) in h:
h[res.group(1)] = []
h[res.group(1)].append((res.group(2), res.group(3)))
# Output result
for i, x in enumerate(sorted(h.keys())):
print("-------- %s -----------" % (i+1))
for y in sorted(h[x]):
print("%s %s %s" % (x, y[0], y[1]))
The result is as follow (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (module re in python). This is a good tool when you want to match simple line based patterns.
Here it relies on spaces as columns separators but it can as easily be adapted for fixed width columns.
The results is collected in a dictionary of lists. each list containing tuples (pairs) of position and text.
The program waits output for sorting items.

It's a quite ugly code but it's quite easy to understand.
raw = []
with open("input.txt", "r") as file:
for x in file:
raw.append(x.strip().split(None, 2))
raw = sorted(raw)
title = raw[0][0]
refined = []
cluster = []
for x in raw:
if x[0] == title:
cluster.append(x)
else:
refined.append(cluster)
cluster = []
title = x[0]
cluster.append(x)
refined.append(cluster)
for number, group in enumerate(refined):
print("-"*10+str(number)+"-"*10)
for line in group:
print(*line)

Break line after specific expression and add to running list

I have very long text files with running measurements. These measurements are divided by some information that has almost the same style within my text files. Here an original extract:
10:10 10 244.576 0 0
10:20 10 244.612 0 0
10:30 10 244.563 0 0
HBCHa 9990 Seite 4
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 2. Januar 2000 10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
I would like a running list only with the measurements. As you can see, one measurement stands within an information line, in the line that starts with Sonntag. My problem is that I want to break the line after 2000 and add the second part of the broken line, 10:40 10 244.555 0 0, as a separate line.
My target is this:
10:20 10 244.612 0 0
10:30 10 244.563 0 0
10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
Until now I managed to choose the lines only that start with the time:
if i.startswith("0") or i.startswith("1") or i.startswith("2"):
and add it to new list.
And I can select the lines that contain the expression "tag":
f = open(source_file, "r")
data = f.readlines()
for lines in data:
if re.match("(.*)tag(.*)", lines):
print lines
There are no other lines that match with "tag"!

There's no need to worry about the invalid information if you can precisely match the valid information. So we'll use a regular expression to match only the data we want.
import re
MEASUREMENT_RE = re.compile(r"\b\d{2}:\d{2} \d{2} \d{3}.\d{3} \d \d\b")
with open(source_file, mode="r") as f:
print "\n".join(MEASUREMENT_RE.findall(f.read()))
Changes:
context manager (with block) used to open the file so the file closes automatically
read used instead of readlines since there's no point in applying a regular expression to each line instead of to all lines
measurements found with a regular expression that checks for exactly the digits you're looking for (if you need to match more digits in any section, it should be altered)
word boundaries (\b) used in regular expression to enforce whitespace or beginning/end of string is found around the match

This one matches digit sequences of variable length separated by colon, space and full stop:
import re
p = re.compile(r'\d+:\d+ \d+ \d+.\d+ \d+ \d+')
with open(source_file, "r") as f:
for line in f:
line_clean = p.findall(line)
if any(line_clean):
print "".join(line_clean)

Python 'for loop' to parse results

I am a beginning python user (trying to learn for bioinformatics) and I am having difficulties in getting my final 'for loop' correct. I have used a web-based bioinformatic program to assess the subcellular localization of certain proteins (protein names and sequences contained within ORFs) and I am trying to parse the results (contained within targetp). The web-based program that I've used truncates the names of the proteins (and does not include sequences), and I would like to parse my results file such that I have the complete name and sequence of each protein in FASTA format (this entails having a '>' + the protein name on one line, and the protein sequence on the subsequent line). I think that everything is going well until the last block of code; I end up with the proper protein names, but they are all appended to the same sequence. I know that there must be something simple that I am doing wrong, but I just can't figure it out. Any ideas?
Thanks!
The ORFs file looks like this (it's FASTA, but the " shouldn't be there, only >):
">HsaNP_000700 branched chain keto acid dehydrogenase E1, alpha polypeptide
MAVAIAAARVWRLNRGLSQAALLLLRQPGARGLARSHPPRQQQQFSSLDDKPQFPGASAEFIDKLEFIQPNVISGIPIYRVMDRQGQIINPSEDPHLPKEKVLKLYKSMTLLNTMDRILYESQRQGRISFYMTNYGEEGTHVGSAAALDNTDLVFGQYREAGVLMYRDYPLELFMAQCYGNISDLGKGRQMPVHYGCKERHFVTISSPLATQIPQAVGAAYAAKRANANRVVICYFGEGAASEGDAHAGFNFAATLECPIIFFCRNNGYAISTPTSEQYRGDGIAARGPGYGIMSIRVDGNDVFAVYNATKEARRRAVAENQPFLIEAMTYRIGHHSTSDDSSAYRSVDEVNYWDKQDHPISRLRHYLLSQGWWDEEQEKAWRKQSRRKVMEAFEQAERKPKPNPNLLFSDVYQEMPAQLRKQQESLARHLQTYGEHYPLDHFDK
">HsaNP_060914 pyruvate dehydrogenase phosphatase precursor
MPAPTQLFFPLIRNCELSRIYGTACYCHHKHLCCSSSYIPQSRLRYTPHPAYATFCRPKENWWQYTQGRRYASTPQKFYLTPPQVNSILKANEYSFKVPEFDGKNVSSILGFDSNQLPANAPIEDRRSAATCLQTRGMLLGVFDGHAGCACSQAVSERLFYYIAVSLLPHETLLEIENAVESGRALLPILQWHKHPNDYFSKEASKLYFNSLRTYWQELIDLNTGESTDIDVKEALINAFKRLDNDISLEAQVGDPNSFLNYLVLRVAFSGATACVAHVDGVDLHVANTGDSRAMLGVQEEDGSWSAVTLSNDHNAQNERELERLKLEHPKSEAKSVVKQDRLLGLLMPFRAFGDVKFKWSIDLQKRVIESGPDQLNDNEYTKFIPPNYHTPPYLTAEPEVTYHRLRPQDKFLVLATDGLWETMHRQDVVRIVGEYLTGMHHQQPIAVGGYKVTLGQMHGLLTERRTKMSSVFEDQNAATHLIRHAVGNNEFGTVDHERLSKMLSLPEELARMYRDDITIIVVQFNSHVVGAYQNQE
The targetp file looks like this (the M is in position 57, but the formatting here throws this off):
HsaNP_000700 445 0.939 0.020 0.089 M 1
HsaNP_060914 537 0.309 0.073 0.629 _ 4
The leftmost column in targetp is the identifier (part of the header line in each protein sequence above), and I want to return only entries with an 'M' (i.e., not '_') in position 57, along with the protein name from ORFs (header line).
My script is:
#!/usr/bin/python
ORFs = open('Human.MitoCarta.fasta', 'U')
targetp = open('MitoCarta_TargetP_combined.out', 'U')
report = targetp.readlines()
protfile = open('mitocarta_no_mTP.fasta','w')
protid = []
seqdict = {}
for seq in ORFs:
seq = seq.rstrip()
if seq[0] == '':
continue
if seq[0] == '>':
name = seq[1:]
seqdict[name] = ''
continue
seqdict[name] += seq
for entry in report:
if entry.startswith('HsaNP'):
if entry[57] != 'M':
protid.append(entry[0:20])
protid = [x.strip(' ') for x in protid]
nameslist = seqdict.keys()
c = 0
for i in protid:
if i in nameslist[c]:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[name]))
c += 1
protfile.close()

Yes, you are writing nameslist[c] and seqdict[name] but you never change 'name'. So you need to change 'name' if you want to get the different sequences. You should write:
protfile.write('>%s\n%s\n\n' % (nameslist[c], seqdict[nameslist[c]]))
That way you should get it right.

Return the average mark for all student in that Section

I know it was asked already but the answers the super unclear
The first requirement is to open a file (sadly I have no idea how to do that)
The second requirement is a section of code that does the following:
Each line represents a single student and consists of a student number, a name, a section code and a midterm grade, all separated by whitespace
So I don't think i can target that element due to it being separate by whitespace?
Here is an excerpt of the file, showing line structure
987654322 Xu Carolyn L0101 19.5
233432555 Jones Billy Andrew L5101 16.0
555432345 Patel Amrit L0101 13.5
888332441 Fletcher Bobby L0201 18
777998713 Van Ryan Sarah Jane L5101 20
877633234 Zhang Peter L0102 9.5
543444555 Martin Joseph L0101 15
876543222 Abdolhosseini Mohammad Mazen L0102 18.5
I was provided the following hints:
Notice that the number of names per student varies.
Use rstrip() to get rid of extraneous whitespace at the end of the lines.
I don't understand the second hint.
This is what I have so far:
counter = 0
elements = -1
for sets in the_file
elements = elements + 1
if elements = 3
I know it has something to do with readlines() and the targeting the section code.

marks = [float(line.strip().split()[-1]) for line in open('path/to/input/file')]
average = sum(marks)/len(marks)
Hope this helps

Open and writing to files
strip method
Something like this?
data = {}
with open(filename) as f:#open a file
for line in f.readlines():#proceed through file lines
#next row is to split data using spaces and them skip empty using strip
stData = [x.strip() for x in line.split() if x.strip()]
#assign to variables
studentN, studentName, sectionCode, midtermGrade = stData
if sectionCode not in data:
data[sectionCode] = []
#building dict, key is a section code, value is a tuple with student info
data[sectionCode].append([studentN, studentName, float(midtermGrade)]
#make calculations
for k,v in data.iteritems():#iteritems returns you (key, value) pair on each iteration
print 'Section:' + k + ' Grade:' + str(sum(x[2] for x in v['grade']))

more or less:
infile = open('grade_file.txt', 'r')
score = 0
n = 0
for line in infile.readlines():
score += float(line.rstrip().split()[-1])
n += 1
avg = score / n

patterns searching in text

I have text file as follows seq.txt
>S1
AACAAGAAGAAAGCCCGCCCGGAAGCAGCTCAATCAGGAGGCTGGGCTGGAATGACAGCG
CAGCGGGGCCTGAAACTATTTATATCCCAAAGCTCCTCTCAGATAAACACAAATGACTGC
GTTCTGCCTGCACTCGGGCTATTGCGAGGACAGAGAGCTGGTGCTCCATTGGCGTGAAGT
CTCCAGGGCCAGAAGGGGCCTTTGTCGCTTCCTCACAAGGCACAAGTTCCCCTTCTGCTT
CCCCGAGAAAGGTTTGGTAGGGGTGGTGGTTTAGTGCCTATAGAACAAGGCATTTCGCTT
CCTAGACGGTGAAATGAAAGGGAAAAAAAGGACACCTAATCTCCTACAAATGGTCTTTAG
TAAAGGAACCGTGTCTAAGCGCTAAGAACTGCGCAAAGTATAAATTATCAGCCGGAACGA
GCAAACAGACGGAGTTTTAAAAGATAAATACGCATTTTTTTCCGCCGTAGCTCCCAGGCC
AGCATTCCTGTGGGAAGCAAGTGGAAACCCTATAGCGCTCTCGCAGTTAGGAAGGAGGGG
TGGGGCTGTCCCTGGATTTCTTCTCGGTCTCTGCAGAGACAATCCAGAGGGAGACAGTGG
ATTCACTGCCCCCAATGCTTCTAAAACGGGGAGACAAAACAAAAAAAAACAAACTTCGGG
TTACCATCGGGGAACAGGACCGACGCCCAGGGCCACCAGCCCAGATCAAACAGCCCGCGT
CTCGGCGCTGCGGCTCAGCCCGACACACTCCCGCGCAAGCGCAGCCGCCCCCCCGCCCCG
GGGGCCCGCTGACTACCCCACACAGCCTCCGCCGCGCCCTCGGCGGGCTCAGGTGGCTGC
GACGCGCTCCGGCCCAGGTGGCGGCCGGCCGCCCAGCCTCCCCGCCTGCTGGCGGGAGAA
ACCATCTCCTCTGGCGGGGGTAGGGGCGGAGCTGGCGTCCGCCCACACCGGAAGAGGAAG
TCTAAGCGCCGGAAGTGGTGGGCATTCTGGGTAACGAGCTATTTACTTCCTGCGGGTGCA
CAGGCTGTGGTCGTCTATCTCCCTGTTGTTC
>S2
ACACGCATTCACTAAACATATTTACTATGTGCCAGGCACTGTTCTCAGTGCTGGGGATAT
AGCAGTGAAGAAACAGAAACCCTTGCACTCACTGAGCTCATATCTTAGGGTGAGAAACAG
TTATTAAGCAAGATCAGGATGGAAAACAGATGGTACGGTAGTGTGAAATGCTAAAGAGAA
AAATAACTACGGAAAAGGGATAGGAAGTGTGTGTATCGCAGTTGACTTATTTGTTCGCGT
TGTTTACCTGCGTTCTGTCTGCATCTCCCACTAAACTGTAAGCTCTACATCTCCCATCTG
TCTTATTTACCAATGCCAACCGGGGCTCAGCGCAGCGCCTGACACACAGCAGGCAGCTGA
CAGACAGGTGTTGAGCAAGGAGCAAAGGCGCATCTTCATTGCTCTGTCCTTGCTTCTAGG
AGGCGAATTGGGAAATCCAGAGGGAAAGGAAAAGCGAGGAAAGTGGCTCGCTTTTGGCGC
TGGGGAAGAGGTGTACAGTGAGCAGTCACGCTCAGAGCTGGCTTGGGGGACACTCTCACG
CTCAGGAGAGGGACAGAGCGACAGAGGCGCTCGCAGCAGCGCGCTGTACAGGTGCAACAG
CTTAGGCATTTCTATCCCTATTTTTACAGCGAGGGACACTGGGCCTCAGAAAGGGAAGTG
CCTTCCCAAGCTCCAACTGCTCATAAGCAGTCAACCTTGTCTAAGTCCAGGTCTGAAGTC
CTGGAGCGATTCTCCACCCACCACGACCACTCACCTACTCGCCTGCGCTTCACCTCACGT
GAGGATTTTCCAGGTTCCTCCCAGTCTCTGGGTAGGCGGGGAGCGCTTAGCAGGTATCAC
CTATAAGAAAATGAGAATGGGTTGGGGGCCGGTGCAAGACAAGAATATCCTGACTGTGAT
TGGTTGAATTGGCTGCCATTCCCAAAACGAGCTTTGGCGCCCGGTCTCATTCGTTCCCAG
CAGGCCCTGCGCGCGGCAACATGGCGGGGTCCAGGTGGAGGTCTTGAGGCTATCAGATCG
GTATGGCATTGGCGTCCGGGCCCGCAAGGCG
.
.
.
.
I have to count patterns in these sequences to achieve python script
import re
infile = open("seq.txt", 'r')
out = open("pat.txt", 'w')
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
for line in infile:
line = line.strip("\n")
if line.startswith('>'):
name = line
else:
s = re.findall(pattern,line)
print '%s:%s' %(name,s)
out.write('%s:\t%s\n' %(name,len(s)))
But it is giving the wrong result. The script is reading line by line.
S1 : 0
S1 : 0
S1 : 0
S1 : 0
S2 : 0
S2 : 1
S2 : 0
S2 : 1
But I want output as follows:
S1 : 0
S2 : 2
Can anybody help?

Use a hit counter, zero it if line.startswith('>'). Increment by len(s) otherwise.

This code might be helpful for you:
import re
pattern = re.compile("GAAAT", flags=re.IGNORECASE)
with open('seq.txt') as f:
sections = f.read().split('\n\n')
for section in sections:
lines = section.split()
name = lines[0].lstrip('>')
data = ''.join(lines[1:])
print '{0}: {1}'.format(name, len(pattern.findall(data)))
Example output:
S1: 1
S2: 2
Notes:
It's assumed that two newline characters are used to separate every section as in the example.
It's assumed that every section name is preceded by a greater than (>) character as in the example.
If you already have a pattern, use pattern.findall(data) instead of re.findall(pattern, data)

You should gather input until you enter the next pattern. This would also solve the corner case of where your pattern crosses a line boundary (not sure if that "can" happen with your data, but it looks like it).

Use a counter. Also, have your print function inside the for loop, so it's going to iterate as many times as the else condition. Note that it's also not a good idea to use the variable line as both the iterator variable in the for loop and as another variable. It makes the code more confusing.
counter_dict = {}
for line in infile:
if line[0] == '>':
name = line[1:len(line) - 2]
counter_dict[name] = 0
else:
counter_dict[name] += len(re.findall(pattern,line))
for (key, val) in counter_dict.items():
print '%s:%s' %(key, val)
out.write('%s:\t%s\n' %(key, val)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Find the first pattern before the specific substring in python - python

Related

Separate lines in Python

Break line after specific expression and add to running list

Python 'for loop' to parse results

Return the average mark for all student in that Section

patterns searching in text

Categories

Resources