How to split array elements with Python 3

I am writing a script to gather results from an output file of a programme. The file contains headers, captions, and data in scientific format. I only want the data and I need a script that can do this repeatedly for different output files with the same results format.
This is the data:
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
This is my code at the moment. I want it to open the file, search for the keyword 'INPUT:BETA' which indicates the start of the results I want to extract. It then takes the information between this input keyword and the end identifier that signals the end of the data I want. I don't think this section needs changing but I have included it just in case.
I have then tried to use regex to specify the lines that start with VELOCITY (m/s) as these contain the data I need. This works and extracts each line, whitespace and all, into an array. However, I want each numerical value to be a single element, so the next line is supposed to strip the whitespace out and split the lines into individual array elements.
import re

with open(file_name) as f:
    t = f.read()
t = t[t.find('INPUT:BETA'):]
t = t[t.find(start_identifier):t.find(end_identifier)]
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, t)
res = [s.split() for s in res]
print(res)
print(len(res))
This isn't working; here is the output:
[['33.2405E+06', '30.8868E+06', '27.9475E+06', '25.2880E+06', '22.8815E+06', '21.1951E+06', '20.1614E+06', '18.7338E+06'], ['16.9510E+06', '15.7017E+06', '14.9359E+06', '14.2075E+06', '13.5146E+06', '12.8555E+06', '11.6805E+06', '10.5252E+06']]
2
It's taking out the whitespace, but the values end up grouped into one sub-list per line rather than each being a separate element of a single flat list, which is what I need for the next stage of the processing. The first VELOCITY line is also missing from the output.
My question is therefore:
How can I extract each value into a separate array element, leaving the rest of the data behind, in a way that will work with different output files with different data?

Here is how you can flatten your list of lists so that each value becomes a separate element.
import re
text = """
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
"""
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, text)
res = [s.split() for s in res]
res = [value for lst in res for value in lst]
print(res)
print(len(res))
Note that when I run this regex on the sample text it does match the first VELOCITY line, so the missing line in your output suggests the error is elsewhere in your code, most likely in the start_identifier/end_identifier slicing.
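As a hedged follow-up (my addition, not part of the original answer): once the values are in a flat list of strings, you will likely want numbers for the next processing stage, and Python's float() understands the E+06 notation directly:

```python
# Convert the flattened strings to floats; the values are from the sample data.
values = ['59.4604E+06', '55.5297E+06', '52.4463E+06']
numbers = [float(v) for v in values]
print(numbers[0])  # 59460400.0
```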

Related

Reading numbers after a keyword, and using only certain numbers from the list

I would like my Python code to read the numbers following a keyword such as 'OUTPUT', all the way to the end of the .dat file I'm working with. However, not all the numbers need to be read by the program, just the numbers attributed to certain positions in the .dat file. For example, this is roughly what my file looks like:
VARIABLES= "a", "b", "c", "d" , "e", "f", "g"
OUTPUT=
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
...
and I would like to only read and work with the numbers attributed to a, d, and g (all the way down to the end of the document).
For instance, the expected output for these three lines would be 1, 4, 7 & 8, 11, 14 & 15, 18, 21.
I've already tried a few different approaches to this problem, but none seem to work out too well for me. The closest I've gotten to solving this problem is with the code down below:
with open('C:/Users/filename', "r") as input:
    for line in input:
        if line.startswith('OUTPUT'):
            print(next(input), end='')
            continue
        break
The problem with this code is that I'm only able to read the first line of code following "OUTPUT", and nothing else. I'm also not able to figure out how to read only the numbers related to the three specific letters.
Please let me know how I might solve this problem, or give me some code that works better than what I have now!
thank you!!
If you don't need to know the variables/letters and already know that you want data from a, d, g (i.e. the 0th, 3rd and 6th elements):
with open('C:/Users/filename', "r") as input:
    start_print = False
    output = []
    for line in input:
        if start_print:
            output.extend([line.split()[i] for i in [0, 3, 6]])
        if line.startswith('OUTPUT'):
            start_print = True
print(','.join(output))
OUTPUT:
1,4,7,8,11,14,15,18,21
This solution uses itemgetter to get the indexes of the wanted columns (you need to supply the letters to the program).
EDIT: The wanted column names may contain more than one letter (a, b, c, ...), so the solution has been modified to account for that.
import re
from operator import itemgetter

file = """\
VARIABLES= "a", "b", "c", "d" , "e", "f", "g"
OUTPUT=
1 2 3 4 5 6 7
8 9 10 11 12 13 14
15 16 17 18 19 20 21
""".splitlines()

idx = []
found = False
wanted = ('a', 'd', 'g')
for line in file:
    if line.startswith('VARIABLES'):
        # var = re.findall('[a-z]+', line)
        _, *var = re.split(r'\W+', line)
        for letter in wanted:
            idx.append(var.index(letter))
        get_indexes = itemgetter(*idx)
    elif line.startswith('OUTPUT'):
        found = True
    elif found:
        data = line.split()
        print(' '.join(get_indexes(data)), end=' ')
Prints:
1 4 7 8 11 14 15 18 21
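To clarify the key building block of this answer: itemgetter builds a callable that picks several positions from a sequence at once. A quick illustration (my own example, not from the answer above):

```python
from operator import itemgetter

# picks elements 0, 3 and 6 from any indexable sequence, returned as a tuple
get_indexes = itemgetter(0, 3, 6)
row = ['1', '2', '3', '4', '5', '6', '7']
picked = get_indexes(row)
print(picked)  # ('1', '4', '7')
```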

Writing Specific lines and characters from an input file into an output file

I have been working on a project that requires me to count the number of 1's and 0's, which describe the effect an amino acid would have on the stability of the peptide. There are about 300 different peptide sequences in the file. I want my code to recognize the start of a peptide sequence from my text file, count its length, then count the number of 1's and 0's each amino acid records. So far I have been working to get my code to recognize the start of a sequence using its index numbering; here's what I have:
input_file01 = open(r'C:/Users/12345/Documents/Dr Blan Research/MHC I 17 NOV2016.txt')
Output_file01 = open('MHC I 17 NOV2016OUT.txt', 'w')
for line in input_file01:
    templist = line.split()
    a = line[0]
    for i in range(0, len(a)):
        if a[i] == 1:
            b = line[0 + 1]
            index = i
            count = +1
            Output_file01.write(a)
            Output_file01.write(b)
        else:
            break
Here is an example of the content in the file. I want my code to count the peptide sequence length, count the number of 1's and 0's, and find their ratio within each peptide sequence.
# 1 - Amino acid number
# 2 - One letter code
# 3 - ANCHOR probability value
# 4 - ANCHOR output
#
1 A 0.3129 0
2 P 0.4044 0
3 K 0.5258 1
4 R 0.6358 1
5 P 0.7277 1
6 P 0.7895 1
7 S 0.8710 1
8 A 0.9358 1
9 F 0.9680 1
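The counting step described above can be sketched as follows. This is a hedged illustration of my own (not the asker's eventual solution), assuming the four-column format shown in the sample block:

```python
# Tally the 1's and 0's in the last (ANCHOR output) column of the sample block.
sample = """\
1 A 0.3129 0
2 P 0.4044 0
3 K 0.5258 1
4 R 0.6358 1
5 P 0.7277 1
6 P 0.7895 1
7 S 0.8710 1
8 A 0.9358 1
9 F 0.9680 1
""".splitlines()

ones = zeros = 0
for line in sample:
    parts = line.split()
    # a data line has four fields and ends with the 0/1 ANCHOR flag
    if len(parts) == 4 and parts[-1] in ('0', '1'):
        if parts[-1] == '1':
            ones += 1
        else:
            zeros += 1

length = ones + zeros  # length of the peptide sequence
print(length, ones, zeros)
```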

Python lines concatenate themselves when writing to a file

I'm using Python to generate training and testing data for 10-fold cross-validation, and to write the datasets to 2x10 separate files (each fold writes a training file and a testing file). The weird thing is that when writing data to a file, there is always a line "missing". Actually, it might not even be "missing", since I discovered later that some line (only one line) in the middle of the file gets concatenated onto its previous line. An output file should look something like the following (there should be 39150 lines in total):
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 4
50 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
However, I keep getting 39149 lines, and somewhere in the middle of the file seems to mess up like this:
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 450 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
My code:
import math
import random

def k_fold(myfile, myseed=1, k=10):
    # Load data
    data = open(myfile).readlines()
    # Shuffle input
    random.seed = myseed
    random.shuffle(data)
    # Compute partition size given input k
    len_total = len(data)
    len_part = int(math.floor(len_total / float(k)))
    # Create one partition per fold
    train = {}
    test = {}
    for i in range(k):
        test[i] = data[i * len_part:(i + 1) * len_part]
        train[i] = data[0:i * len_part] + data[(i + 1) * len_part:len_total]
    return train, test

if __name__ == "__main__":
    path = '....'  # some path and input
    input = '...'
    # Generate data
    [train, test] = k_fold(input)
    # Write data to files
    for i in range(10):
        train_old = path + 'tmp_train_' + str(i)
        test_old = path + 'tmp_test_' + str(i)
        trainF = open(train_old, 'a')
        testF = open(test_old, 'a')
        print(len(train[i]))
The strange thing is that I'm doing the same thing for the training and the testing datasets. The testing dataset outputs the correct file (4350 lines), but the training dataset has the above problem. I'm sure that the function returns the 39150 lines of training data, so I think the problem is in the file-writing part. Does anybody have any ideas about what I could have done wrong? Thanks in advance!
I assume that the first half of the double length line is the last line of the original file.
The lines returned by readlines (or by iterating over the file) will all still end with the LF character '\n' except the last line if the file doesn't end with an empty line. In that case, the shuffling that you do will put that '\n'-less line somewhere in the middle of 'data'.
Either append an empty line to your original file or strip all lines before processing and add the newline to each line when writing back to a file.
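The second option can be illustrated with a tiny example (my own, not from the answer): strip every trailing newline up front, then add exactly one back to each line when writing.

```python
# Strip-then-rejoin fix: normalize the lines first so their shuffled order no
# longer matters, then append exactly one '\n' per line on output.
data = ["44 1 90\n", "55 -3 95\n", "50 -3 81"]   # last line has no newline
clean = [line.rstrip("\n") for line in data]
output = "".join(line + "\n" for line in clean)
print(output, end="")
```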

Break line after specific expression and add to running list

I have very long text files with running measurements. These measurements are interrupted by some header information that has almost the same style in all my text files. Here is an original extract:
10:10 10 244.576 0 0
10:20 10 244.612 0 0
10:30 10 244.563 0 0
HBCHa 9990 Seite 4
16.02.16
Hafenpegel
Pegel müM Var 0 Pegelstand
Datum Zeit Intervall müM Q Art
Sonntag, 2. Januar 2000 10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
I would like a running list with only the measurements. As you can see, one measurement sits inside an information line, the line that starts with Sonntag. My problem is that I want to break that line after 2000 and add the second part of the broken line, 10:40 10 244.555 0 0, as a separate line.
My target is this:
10:20 10 244.612 0 0
10:30 10 244.563 0 0
10:40 10 244.555 0 0
10:50 10 244.592 0 0
11:00 10 244.595 0 0
11:10 10 244.593 0 0
...
Until now I have managed to select only the lines that start with the time:
if i.startswith("0") or i.startswith("1") or i.startswith("2"):
and add them to a new list.
And I can select the lines that contain the expression "tag":
f = open(source_file, "r")
data = f.readlines()
for lines in data:
    if re.match("(.*)tag(.*)", lines):
        print lines
There are no other lines that match with "tag"!
There's no need to worry about the invalid information if you can precisely match the valid information. So we'll use a regular expression to match only the data we want.
import re

MEASUREMENT_RE = re.compile(r"\b\d{2}:\d{2} \d{2} \d{3}\.\d{3} \d \d\b")
with open(source_file, mode="r") as f:
    print "\n".join(MEASUREMENT_RE.findall(f.read()))
Changes:
- context manager (with block) used to open the file, so the file closes automatically
- read used instead of readlines, since there's no point in applying a regular expression to each line instead of to all lines at once
- measurements found with a regular expression that checks for exactly the digits you're looking for (if you need to match more digits in any section, it should be altered)
- word boundaries (\b) used in the regular expression to enforce that whitespace or the beginning/end of the string surrounds the match
This one matches digit sequences of variable length separated by colons, spaces and full stops:
import re

p = re.compile(r'\d+:\d+ \d+ \d+\.\d+ \d+ \d+')
with open(source_file, "r") as f:
    for line in f:
        line_clean = p.findall(line)
        if any(line_clean):
            print "".join(line_clean)

Adding in-between columns, skipping and keeping some rows/columns

I am new to programming but I have started looking into both Python and Perl.
I am looking for data in two input files that are partly CSV, selecting some of it, and putting it into a new output file.
Maybe Python's csv module or pandas can help here, but I'm a bit stuck when it comes to skipping/keeping rows and columns.
Also, I don't have any headers for my columns.
Input file 1:
-- Some comments
KW1
'Z1' 'F' 30 26 'S'
KW2
'Z1' 30 26 1 1 5 7 /
'Z1' 30 26 2 2 6 8 /
'Z1' 29 27 4 4 12 13 /
Input file 2:
-- Some comments
-- Some more comments
KW1
'Z2' 'F' 40 45 'S'
KW2
'Z2' 40 45 1 1 10 10 /
'Z2' 41 45 2 2 14 15 /
'Z2' 41 46 4 4 16 17 /
Desired output file:
KW_NEW
'Z_NEW' 1000 30 26 1 /
'Z_NEW' 1000 30 26 2 /
'Z_NEW' 1000 29 27 4 /
'Z_NEW' 1000 40 45 1 /
'Z_NEW' 1000 41 45 2 /
'Z_NEW' 1000 41 46 4 /
So what I want to do is:
Do not include anything in either of my two input files before I reach KW2
Replace KW2 with KW_NEW
Replace either 'Z1' or 'Z2' with 'Z_NEW' in the first column
Add a new second column with a constant value e.g. 1000
Copy the next three columns as they are
Leave out any remaining columns before printing the slash / at the end
Could anyone give me at least some general hints/tips how to approach this?
Your files are not "partly csv" (there is not a comma in sight); they are (partly) space delimited. You can read the files line-by-line, use Python's .split() method to convert the relevant strings into lists of substrings, and then re-arrange the pieces as you please. The splitting and re-assembly might look something like this:
input_line = "'Z1' 30 26 1 1 5 7 /" # test data
input_items = input_line.split()
output_items = ["'Z_NEW'", '1000']
output_items.append(input_items[1])
output_items.append(input_items[2])
output_items.append(input_items[3])
output_items.append('/')
output_line = ' '.join(output_items)
print(output_line)
The final print() statement shows that the resulting string is
'Z_NEW' 1000 30 26 1 /
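Scaling this up, here is a hedged sketch of the whole pipeline under the stated requirements (skip everything through KW2, then rewrite each data line). The function name, structure, and sample data are my own, inferred from the requirements above:

```python
def transform(lines):
    out = ['KW_NEW']               # replace KW2 with KW_NEW in the output
    seen_kw2 = False
    for line in lines:
        if not seen_kw2:
            # skip everything up to and including the KW2 line
            seen_kw2 = line.strip() == 'KW2'
            continue
        items = line.split()
        if len(items) < 4:
            continue               # ignore blank or short lines
        # new label, constant 1000, the next three columns, trailing slash
        out.append(' '.join(["'Z_NEW'", '1000'] + items[1:4] + ['/']))
    return out

sample = ["-- Some comments", "KW1", "'Z1' 'F' 30 26 'S'",
          "KW2", "'Z1' 30 26 1 1 5 7 /"]
result = transform(sample)
print('\n'.join(result))
```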
Is your file format static? (This is not actually CSV, by the way :P) You might want to investigate a standardized file format like JSON or strict CSV to store your data, so that you can use already-existing tools to parse your input files. Python has great JSON and CSV libraries that can do all the hard stuff for you.
If you're stuck with this file format, I would try something along these lines.
path = '<input_path>'
kws = ['KW1', 'KW2']
desired_kw = kws[1]

def parse_columns(line):
    array = line.split()
    if array and array[-1] == '/':
        # get rid of the trailing slash
        array = array[:-1]
    return array

def is_kw(cols):
    if len(cols) > 0 and cols[0] in kws:
        return cols[0]

# parse only the section denoted by the desired keyword
with open(path, 'r') as input_fp:
    matrix = []
    reading_file = False
    for line in input_fp:
        cols = parse_columns(line)
        line_is_kw = is_kw(cols)
        if line_is_kw:
            if not reading_file:
                if line_is_kw == desired_kw:
                    reading_file = True
                continue
            else:
                break
        if reading_file:
            matrix.append(cols)
    print(matrix)
From there you can use stuff like slice notation and basic list manipulation to get your desired array. Good luck!
Here is a way to do it with Perl:
#!/usr/bin/perl
use strict;
use warnings;

# initialize output array
my @output = ('KW_NEW');

# process the first file
open my $fh1, '<', 'in1.txt' or die "unable to open file1: $!";
while (<$fh1>) {
    # consider only lines after KW2
    if (/KW2/ .. eof) {
        # don't treat the KW2 line itself
        next if /KW2/;
        # split the current line on spaces and keep only the first five elements
        my @l = (split ' ', $_)[0..4];
        # change the first element
        $l[0] = 'Z_NEW';
        # insert 1000 at the second position
        splice @l, 1, 0, 1000;
        # push into output array
        push @output, "@l";
    }
}

# process the second file
open my $fh2, '<', 'in2.txt' or die "unable to open file2: $!";
while (<$fh2>) {
    if (/KW2/ .. eof) {
        next if /KW2/;
        my @l = (split ' ', $_)[0..4];
        $l[0] = 'Z_NEW';
        splice @l, 1, 0, 1000;
        push @output, "@l";
    }
}

# write the array to the output file
open my $fh3, '>', 'out.txt' or die "unable to open file3: $!";
print $fh3 $_, "\n" for @output;
