Python lines concatenate themselves when writing to a file

I'm using Python to generate training and testing data for 10-fold cross-validation, and to write the datasets to 2×10 separate files (each fold writes one training file and one testing file). The weird thing is that when writing data to a file, there is always one line "missing". Actually, it might not even be missing: I discovered later that some line (only one) in the middle of the file gets concatenated onto its previous line. So an output file should look something like the following (there should be 39150 lines in total):
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 4
50 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
However, I keep getting 39149 lines, and somewhere in the middle of the file seems to mess up like this:
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 450 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
My code:
import math
import random

def k_fold(myfile, myseed=1, k=10):
    # Load data
    data = open(myfile).readlines()
    # Shuffle input
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_total = len(data)
    len_part = int(math.floor(len_total / float(k)))
    # Create one partition per fold
    train = {}
    test = {}
    for i in range(k):
        test[i] = data[i * len_part:(i + 1) * len_part]
        train[i] = data[0:i * len_part] + data[(i + 1) * len_part:len_total]
    return train, test

if __name__ == "__main__":
    path = '....'  # some path and input
    input = '...'
    # Generate data
    [train, test] = k_fold(input)
    # Write data to files
    for i in range(10):
        train_old = path + 'tmp_train_' + str(i)
        test_old = path + 'tmp_test_' + str(i)
        trainF = open(train_old, 'a')
        testF = open(test_old, 'a')
        print(len(train[i]))
The strange thing is that I'm doing the same thing for the training and the testing datasets. The testing dataset outputs the correct file (4350 lines), but the training dataset has the problem above. I'm sure the function returns the 39150 lines of training data, so I think the problem is in the file-writing part. Does anybody have an idea what I could have done wrong? Thanks in advance!

I assume that the first half of the double-length line is the last line of the original file.
The lines returned by readlines (or by iterating over the file) all still end with the LF character '\n', except the last line if the file doesn't end with a newline. In that case, the shuffling that you do puts that '\n'-less line somewhere in the middle of data, where the following line is written straight onto its end.
Either append a final newline to your original file, or strip all lines before processing and add the newline back to each line when writing to a file.
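A minimal sketch of the second option, reusing the question's variable names: strip the newline from every line when loading, then write exactly one back per line.

import random

# strip any trailing '\n' so a missing final newline can't glue
# two shuffled lines together
data = [line.rstrip('\n') for line in open(myfile)]
random.shuffle(data)

# ...partition into train/test as before, then write one newline per line:
with open(train_old, 'a') as trainF:
    trainF.write('\n'.join(train[i]) + '\n')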

Related

How to split array elements with Python 3

I am writing a script to gather results from an output file of a programme. The file contains headers, captions, and data in scientific format. I only want the data and I need a script that can do this repeatedly for different output files with the same results format.
This is the data:
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
This is my code at the moment. I want it to open the file, search for the keyword 'INPUT:BETA' which indicates the start of the results I want to extract. It then takes the information between this input keyword and the end identifier that signals the end of the data I want. I don't think this section needs changing but I have included it just in case.
I have then tried to use regex to specify the lines that start with VELOCITY (m/s) as these contain the data I need. This works and extracts each line, whitespace and all, into an array. However, I want each numerical value to be a single element, so the next line is supposed to strip the whitespace out and split the lines into individual array elements.
import re

with open(file_name) as f:
    t = f.read()
t = t[t.find('INPUT:BETA'):]
t = t[t.find(start_identifier):t.find(end_identifier)]

regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, t)
res = [s.split() for s in res]
print(res)
print(len(res))
This isn't working; here is the output:
[['33.2405E+06', '30.8868E+06', '27.9475E+06', '25.2880E+06', '22.8815E+06', '21.1951E+06', '20.1614E+06', '18.7338E+06'], ['16.9510E+06', '15.7017E+06', '14.9359E+06', '14.2075E+06', '13.5146E+06', '12.8555E+06', '11.6805E+06', '10.5252E+06']]
2
It's taking out the whitespace but not putting the values into separate elements, which I need for the next stage of the processing.
My question is therefore:
How can I extract each value into a separate array element, leaving the rest of the data behind, in a way that will work with different output files with different data?
Here is how you can flatten your nested list into a single list of values:
import re
text = """
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
"""
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, text)
res = [s.split() for s in res]
res = [value for lst in res for value in lst]
print(res)
print(len(res))
Your regex isn't what's skipping your first VELOCITY line, though, so there must be an error in the rest of your code (most likely in the start_identifier/end_identifier slicing).
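Since the next processing stage presumably needs numbers rather than strings, one hedged follow-up (not part of the original answer): float parses the 59.4604E+06 scientific notation directly.

values = [float(v) for v in res]  # e.g. '59.4604E+06' -> 59460400.0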

How do you split a list by space in python?

How do you split a list by space? The code below reads a file with 4 lines of 7 numbers separated by spaces. When it takes the file and then splits it, it splits it by number, so if I print item[0], 5 will print instead of 50. Here is the code:
def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        print(data)
        item = data.split()
        print(data[0])

main()
the file looks like this
50 60 15 100 60 15 40 /n
100 145 20 150 145 20 45 /n
50 245 25 120 245 25 50 /n
100 360 30 180 360 30 55 /n
split takes as an optional argument the character you want to split your string on; called with no argument, it splits on any whitespace.
I invite you to read the documentation of the methods you are using. :)
EDIT: By the way, readline returns a string, not a list.
However, split does return a list.
import nltk
tokens = nltk.word_tokenize(TextInTheFile)
Try this once you have opened that file.
TextInTheFile is a variable containing the text you read from the file.
There's not a lot wrong with what you are doing, except that you are printing the wrong thing.
Instead of
print(data[0])
use
print(item[0])
data[0] is the first character of the string you read from the file. You split this string into a list called item, so that's what you should print.
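For completeness, a minimal corrected sketch of the question's loop (same logic, only the print fixed, plus a with block so the file is closed automatically):

def main():
    filename = input("Enter the name of the file: ")
    with open(filename, "r") as infile:
        for i in range(4):
            data = infile.readline()
            item = data.split()
            print(item[0])  # first value of the line, e.g. '50'

main()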

Reading specific column from file when last few rows are not equivalent in python

I have a problem reading a text file in Python. Basically, what I need is to get the 4th column into a list.
With this small function I achieve it without any great issues:
def load_file(filename):
    f = open(filename, 'r')
    # skip the first useless row
    line = list(f.readlines()[1:])
    total_sp = []
    for i in line:
        t = i.strip().split()
        total_sp.append(int(t[4]))
    return total_sp
but now I have to manage files that, in the last row(s), contain random numbers that don't respect the text format. An example of a text file that does not work is:
#generated file
well10_1 3 18 6 1 2 -0.01158 0.01842 142
well5_1 1 14 6 1 2 0.009474 0.01842 141
well4_1 1 13 4 1 2 -0.01842 -0.03737 125
well7_1 3 10 1 1 2 -0.002632 0.009005 101
well3_1 1 10 9 1 2 -0.03579 -0.06368 157
well8_1 3 10 10 1 2 -0.06895 -0.1021 158
well9_1 3 10 18 1 2 0.03053 0.02158 176
well2_1 1 4 4 1 2 -0.03737 -0.03737 128
well6_1 3 4 5 1 2 -0.07053 -0.1421 127
well1_1 -2 3 1 1 2 0.006663 -0.02415 128
1 0.9259
2 0.07407
where the lines 1 0.9259 and 2 0.07407 have to be dumped.
In fact, using the function above with this text file, I get the following error because of the two additional last rows:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "/tmp/tmpqi8Ktw.py", line 21, in load_obh
total_sp.append(int(t[4]))
IndexError: list index out of range
How can I get rid of the last lines in the line variable?
Thanks to all
There are many ways to handle this; one such way is to handle the IndexError by surrounding the erroneous code with try and except, something like this:
try:
    total_sp.append(int(t[4]))
except IndexError:
    pass
This will only append to total_sp when the index exists. It also handles any other row that lacks data at that particular index.
Alternatively, if you are interested in removing just the last two rows (elements), you can use the slice operator, replacing line = list(f.readlines()[1:]) with line = f.readlines()[1:-2].
f.readlines already returns a list. Just as you provide a start index to slice from, you can specify "2 before the end" using negative indexing as below:
line = f.readlines()[1:-2]
Should do the trick.
EDIT: To handle an arbitrary number of lines at the end:
def load_file(filename):
    f = open(filename, 'r')
    # skip the first useless row
    line = f.readlines()[1:]
    total_sp = []
    for i in line:
        t = i.strip().split()
        # check if enough columns were found
        if len(t) >= 5:
            total_sp.append(int(t[4]))
    return total_sp
There is also a solution specific to your case, assuming the extra trailing rows start with whitespace:
for i in line:
    if not i.startswith(' '):
        t = i.strip().split()
        total_sp.append(int(t[4]))
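The same idea can be condensed into a hedged sketch (column index 4 and the skipped header row are taken from the question):

def load_file(filename):
    # keep only rows with at least 5 columns, skipping the header
    # row and any short trailing rows
    with open(filename) as f:
        rows = (line.split() for line in f.readlines()[1:])
        return [int(t[4]) for t in rows if len(t) >= 5]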

python script slow read and write gz files

I have a xxx.wig.gz file that has 3,000,000,000 lines in the following format:
fixedStep chrom=chr1 start=1 step=1
0
0
0
0
0
1
2
3
4
5
6
7
8
9
10
...
fixedStep chrom=chr2 start=1 step=1
0
0
0
0
0
11
12
13
14
15
16
17
18
19
20
...
and I want to:
break it down by "chrom": every time I read a line that starts with "fixedStep", I create a new file and close the old one;
produce 0/1 output by comparing each value to a "threshold": pass = 1, otherwise 0.
Below is my Python script, which runs super slow (I am projecting it to finish in ~10 hours; so far 2 chromosomes are done after ~1 hour).
Can someone help me improve it?
#!/bin/env python
import gzip
import re
import os
import sys

fn = sys.argv[1]
f = gzip.open(fn)
fo_base = os.path.basename(fn).rstrip('.wig').rstrip('.wig.gz')
fo_ext = '.bt.gz'
thres = 100
fo = None
for l in f:
    if l.startswith("fixedStep"):
        if fo is not None:
            fo.flush()
            fo.close()
        fon = re.search(r'chrom=(\w*)', l).group(0).split('=')[-1]
        fo = gzip.open(fo_base + "_" + fon + fo_ext, 'wb')
    else:
        if int(l.strip()) >= thres:
            fo.write("1\n")
        else:
            fo.write("0\n")
if fo is not None:
    fo.flush()
    fo.close()
f.close()
PS. I assume awk could do it much faster, but I am not great with awk.
Thanks Summer for editing the text.
I added buffered read/write to the script and it is now several times faster (still relatively slow, though):
import io

f = io.BufferedReader(gzip.open(fn))
fo = io.BufferedWriter(gzip.open(fo_base + "." + fon + fo_ext, 'wb'))
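Beyond buffering, most of the remaining cost is one gzip write call per line. Below is a hedged Python 3 sketch of the same loop with batched writes; the chunk size and the suffix-stripping regex are my own choices, not part of the original script.

import gzip
import os
import re
import sys

fn = sys.argv[1]
# strip the .wig/.wig.gz suffix; note that rstrip() removes characters, not suffixes
fo_base = re.sub(r'\.wig(\.gz)?$', '', os.path.basename(fn))
fo_ext = '.bt.gz'
thres = 100
CHUNK = 65536  # lines per write; tune as needed

fo = None
buf = []

def flush_buf():
    # write the accumulated 0/1 flags in a single call
    if fo is not None and buf:
        fo.write(''.join(buf))
        del buf[:]

with gzip.open(fn, 'rt') as f:  # 'rt' yields text lines in Python 3
    for l in f:
        if l.startswith('fixedStep'):
            flush_buf()
            if fo is not None:
                fo.close()
            chrom = re.search(r'chrom=(\w+)', l).group(1)
            fo = gzip.open(fo_base + '_' + chrom + fo_ext, 'wt')
        else:
            buf.append('1\n' if int(l) >= thres else '0\n')
            if len(buf) >= CHUNK:
                flush_buf()

flush_buf()
if fo is not None:
    fo.close()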

Adding in-between columns, skipping and keeping some rows/columns

I am new to programming but I have started looking into both Python and Perl.
I am looking for data in two input files that are partly CSV, selecting some of it and putting it into a new output file.
Maybe Python's csv module or pandas can help here, but I'm a bit stuck when it comes to skipping/keeping rows and columns.
Also, I don't have any headers for my columns.
Input file 1:
-- Some comments
KW1
'Z1' 'F' 30 26 'S'
KW2
'Z1' 30 26 1 1 5 7 /
'Z1' 30 26 2 2 6 8 /
'Z1' 29 27 4 4 12 13 /
Input file 2:
-- Some comments
-- Some more comments
KW1
'Z2' 'F' 40 45 'S'
KW2
'Z2' 40 45 1 1 10 10 /
'Z2' 41 45 2 2 14 15 /
'Z2' 41 46 4 4 16 17 /
Desired output file:
KW_NEW
'Z_NEW' 1000 30 26 1 /
'Z_NEW' 1000 30 26 2 /
'Z_NEW' 1000 29 27 4 /
'Z_NEW' 1000 40 45 1 /
'Z_NEW' 1000 41 45 2 /
'Z_NEW' 1000 41 46 4 /
So what I want to do is:
Do not include anything in either of my two input files before I reach KW2
Replace KW2 with KW_NEW
Replace either 'Z1' or 'Z2' with 'Z_NEW' in the first column
Add a new second column with a constant value e.g. 1000
Copy the next three columns as they are
Leave out any remaining columns before printing the slash / at the end
Could anyone give me at least some general hints/tips how to approach this?
Your files are not "partly CSV" (there is not a comma in sight); they are (partly) space-delimited. You can read the files line by line, use Python's .split() method to convert the relevant strings into lists of substrings, and then re-arrange the pieces as you please. The splitting and re-assembly might look something like this:
input_line = "'Z1' 30 26 1 1 5 7 /" # test data
input_items = input_line.split()
output_items = ["'Z_NEW'", '1000']
output_items.append(input_items[1])
output_items.append(input_items[2])
output_items.append(input_items[3])
output_items.append('/')
output_line = ' '.join(output_items)
print(output_line)
The final print() statement shows that the resulting string is
'Z_NEW' 1000 30 26 1 /
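Extending that single-line demo into a hedged end-to-end sketch (the file names in1.txt, in2.txt, and out.txt are assumptions; everything before KW2 in each file is skipped, matching the question's first requirement):

out_lines = ['KW_NEW']
for name in ('in1.txt', 'in2.txt'):  # assumed input file names
    with open(name) as fh:
        in_section = False
        for raw in fh:
            line = raw.strip()
            if line == 'KW2':        # everything before KW2 is ignored
                in_section = True
                continue
            if not in_section or not line:
                continue
            items = line.split()
            # new first column, constant 1000, next three columns, trailing slash
            out_lines.append(' '.join(["'Z_NEW'", '1000'] + items[1:4] + ['/']))

with open('out.txt', 'w') as out:
    out.write('\n'.join(out_lines) + '\n')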
Is your file format static? (This is not actually CSV, by the way. :P) You might want to investigate a standardized file format like JSON or strict CSV to store your data, so that you can use already-existing tools to parse your input files. Python has great JSON and CSV libraries that can do all the hard stuff for you.
If you're stuck with this file format, I would try something along these lines:
path = '<input_path>'
kws = ['KW1', 'KW2']
desired_kw = kws[1]

def parse_columns(line):
    array = line.split()
    if array and array[-1] == '/':
        # get rid of the trailing slash
        array = array[:-1]
    return array

def is_kw(cols):
    # return the keyword if this line is a keyword line, else None
    if len(cols) > 0 and cols[0] in kws:
        return cols[0]

# parse the section denoted by the desired keyword
with open(path, 'r') as input_fp:
    matrix = []
    reading_file = False
    for line in input_fp:
        cols = parse_columns(line)
        line_is_kw = is_kw(cols)
        if line_is_kw:
            if not reading_file:
                if line_is_kw == desired_kw:
                    reading_file = True
                else:
                    continue
            else:
                break
        elif reading_file:
            matrix.append(cols)
    print(matrix)
From there you can use stuff like slice notation and basic list manipulation to get your desired array. Good luck!
Here is a way to do it with Perl:
#!/usr/bin/perl
use strict;
use warnings;

# initialize output array
my @output = ('KW_NEW');

# process first file
open my $fh1, '<', 'in1.txt' or die "unable to open file1: $!";
while (<$fh1>) {
    # consider only lines after KW2
    if (/KW2/ .. eof) {
        # don't treat the KW2 line itself
        next if /KW2/;
        # split the current line on spaces and keep only the first four elements
        my @l = (split ' ', $_)[0..3];
        # change the first element
        $l[0] = "'Z_NEW'";
        # insert 1000 at second position
        splice @l, 1, 0, 1000;
        # add the trailing slash back
        push @l, '/';
        # push into output array
        push @output, "@l";
    }
}

# process second file
open my $fh2, '<', 'in2.txt' or die "unable to open file2: $!";
while (<$fh2>) {
    if (/KW2/ .. eof) {
        next if /KW2/;
        my @l = (split ' ', $_)[0..3];
        $l[0] = "'Z_NEW'";
        splice @l, 1, 0, 1000;
        push @l, '/';
        push @output, "@l";
    }
}

# write array to output file
open my $fh3, '>', 'out.txt' or die "unable to open file3: $!";
print $fh3 "$_\n" for @output;
