I have a very large (10GB) data file of the form:
A B C D
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
1 2 3 4
2 2 3 4
3 2 3 4
4 2 3 4
5 2 3 4
I would like to read just the B column of the file and rearrange it in the form
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
It takes a very long time to read the data and rearrange it. Could someone give me an efficient method to do this in Python?
This is the code that I used in MATLAB to process the data:
fid = fopen('hpts.out', 'r');                          % Open text file
InputText = textscan(fid, '%s', 1, 'delimiter', '\n'); % Read header line
HeaderLines = InputText{1}
A = textscan(fid, '%n %n %n %n %n', 'HeaderLines', 1);
t = A{1};
vz = A{4};
L = 1;
for j = 1:1:5000
    for i = 1:1:14999
        V1(j,i) = vz(L);   % reshape the column vector into a 5000x14999 matrix
        L = L + 1;
    end
end
imagesc(V1);
You can use Python for this, but I think this is exactly the sort of job where a shell script is better, since it's a lot shorter and easier:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
tail removes the first (header) line;
awk gets the second column;
tr puts it on a single line;
and fmt makes lines a maximum of 10 characters.
Since this is a streaming operation, it should not take much memory, and performance is mostly limited by disk I/O (although shell pipes do introduce some overhead).
Example:
$ tail -n+2 input_file | awk '{print $2}' | tr '\n' ' ' | fmt -w 10
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
2 2 2 2 2
This streaming approach should perform well:
from itertools import zip_longest  # this was izip_longest in Python 2

with open('yourfile', 'r') as fin, open('newfile', 'w') as fout:
    # discard header row
    next(fin)
    # make a generator for the second column
    col2values = (line.split()[1] for line in fin)
    # zip into groups of five;
    # fillvalue makes a partial last row look good
    for row in zip_longest(*[col2values]*5, fillvalue=''):
        fout.write(' '.join(row) + '\n')
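The *[col2values]*5 trick is worth a note: it passes the same generator object five times, so each output row consumes five consecutive values. A minimal demonstration of the idiom on dummy data (the names here are illustrative only):

from itertools import zip_longest

values = iter(['2', '2', '2', '2', '2', '2', '2'])  # seven dummy column values
for row in zip_longest(*[values]*5, fillvalue=''):
    print(' '.join(row))
# first row: '2 2 2 2 2'; second row: '2 2' padded with empty strings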
Don't read the whole file at once! Read the file line by line:
def read_data():
    with open("filename.txt", 'r') as f:
        next(f)  # skip the header row
        for line in f:
            yield line.split()[1]

with open('file_to_save.txt', 'w') as f:
    for i, data in enumerate(read_data()):
        f.write(data + ' ')
        if (i + 1) % 5 == 0:  # start a new row after every fifth value
            f.write('\n')
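For completeness, NumPy is another common route for a job like this; here is a minimal sketch, assuming NumPy is available, the file really is whitespace-delimited with one header row, and 'newfile' is a placeholder output name:

import numpy as np

# usecols=1 reads only column B; skiprows=1 drops the header
col = np.loadtxt('hpts.out', usecols=1, skiprows=1)
# trim any leftover values, then reshape into rows of five
col = col[:len(col) - len(col) % 5].reshape(-1, 5)
np.savetxt('newfile', col, fmt='%g')

Note that np.loadtxt still parses every line and holds the whole column in memory, so on a 10GB input the streaming answers above may be gentler on RAM.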
Related
I have a (big) file containing values such as:
1 2 3 4 5 6 7
8 9 10 ... N
I want to be able to transpose the data values inside this file onto one line to get the final result:
1 2 3 4 ... N
tr '\n' ' ' < inputfile
Thanks to John Gordon.
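If you'd rather stay in Python, a minimal equivalent sketch (the filenames inputfile and outputfile are placeholders):

with open('inputfile') as fin, open('outputfile', 'w') as fout:
    # replace every newline with a space, just as tr does
    fout.write(fin.read().replace('\n', ' '))

This reads the whole file into memory at once; for a very large file you would want to process it line by line instead.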
I have a file kind of like this:
===
1 2 3 4
===
2 3 4 5
===
3 4 5 6
and I am trying to make a program to turn the file into this
p
===
1 2 3 4
p
===
2 3 4 5
p
===
3 4 5 6
Is there any way I could do this in python?
You can use:
with open('my_file.txt') as fp:
    lines = fp.readlines()

for i, l in enumerate(lines):
    if l == '===\n':
        lines[i] = 'p\n===\n'

with open('my_file.txt', 'w') as fp:
    fp.write(''.join(lines))
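This reads the whole file into memory, which is fine for small files. For a large file, a hedged alternative is to stream to a second file instead (my_file_out.txt is a placeholder name):

with open('my_file.txt') as fin, open('my_file_out.txt', 'w') as fout:
    for line in fin:
        if line.rstrip('\n') == '===':
            fout.write('p\n')  # emit the marker line before each ===
        fout.write(line)

Using rstrip('\n') in the comparison also catches a final === line that has no trailing newline.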
This method takes a file name and the directory of the file. The file contains a matrix of data, and the method needs to copy the first 20 columns of each row, after the given row number and the corresponding letter for the row. The first 3 lines of each file are skipped because they hold unimportant information, and the data at the bottom of the file isn't needed either.
For example a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list of each row as a string, but I'm trying to figure out the best way to approach the problem: I don't know how to ignore the unimportant information at the end of the files, how to retrieve only the first 20 columns after the letter in each row, or how to ignore the row number and the row letter.
def pssmMatrix(self, ipFileName, directory):
    dir = directory
    filename = ipFileName
    my_lst = []
    #takes every file in fasta folder and put in files list
    for f in os.listdir(dir):
        #splits the file name into file name and its extension
        file, file_ext = os.path.splitext(f)
        if file == ipFileName:
            with open(os.path.join(dir, f)) as file_object:
                for _ in range(3):
                    next(file_object)
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution:
import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""

ignore_start = 4  # skip indices 0-3: the leading blank line plus the three header lines
expected_array = []
for index, line in enumerate(text.splitlines()):
    if index >= ignore_start:
        match = reg.search(line)
        if match:
            result = match.group(0).strip()
            # Use Result
            expected_array.append(result)
print(expected_array)
# Result: [
# '-1 2 -3 4 5 6 7',
# '3 -1 3 4 0 -2 1',
# '3 -1 3 6 0 -2 5'
# ]
OK, so it looks like you have a file where the lines you want always start with a number followed by a letter. We can apply a regular expression that matches only lines fitting that pattern and captures just the numbers after it.
The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
for line in file:
    match = reg.search(line)
    if match:
        result = match.group(0).strip()
        # Use Result
        my_lst.append(result)
Hope that helps
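Since the regex answers don't enforce the 20-column limit mentioned in the question, here is an alternative sketch based on plain splitting; it assumes each data row looks like "number letter value value ..." as in the example:

for line in file_object:
    parts = line.split()
    # keep only rows shaped like: number, letter, values...
    if len(parts) > 2 and parts[0].isdigit() and parts[1].isalpha():
        my_lst.append(' '.join(parts[2:22]))  # first 20 values after the letter

This also skips the blank lines and the trailing "unimportant information" lines automatically, because they don't match the number-letter shape.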
I have a tab-separated file, see below:
The first column contains the position number (1 to the end), and the remaining columns hold frequency numbers.
Position A B C D
1 117 0 1 0
2 4 0 0 16
3 0 5 11 0
4 0 0 0 5
5 0 15 0 0
6 100 0 108 0
7 0 0 147 0
I would like to reformat this file to have two columns, the first is the position column kept as is, and the second contains the highest frequency for each position.
ideal output:
Position HighFreq
1 117
2 16
3 11
4 5
5 15
6 108
7 147
What I have so far is a function that selects the highest number and prints it:
awk '{max=$1; for(i=2;i<=NF; i++) {if($i>max){max=$i;}};printf"%f\n",max}' file.tsv
I'm trying to write a bash solution for this problem, but Perl/Python is most welcome!
$ perl -MList::Util=max -F/\t/ -lane 'print join "\t", $. == 1 ? qw(Position HighFreq) : ( $F[0], max(@F[1..$#F]) )'
Explanation
-MList::Util=max
Load List::Util::max
-F/\t/ -a
Activate auto-split and set delimiter to /\t/
-l -n -e
Automatically append the appropriate line ending (-l), process the input line by line (-n), and apply the one-liner code (-e)
print join "\t", ...
print tab-separated
$. == 1 ? ... : ...
Handle column headings
max( @F[1..$#F] )
returns the max of all but the first element of @F
awk 'BEGIN{print"Position\tHighFreq"}{if(NR==1)next; max=0;for(i=2;i<=NF; i++) {if($i>max){max=$i;}} printf"%d\t%d\n",$1,max;}' file.tsv
output:
Position HighFreq
1 117
2 16
3 11
4 5
5 15
6 108
7 147
As you've chosen a Python tag, this could be done in Python as follows:
import sys
import csv

with open(sys.argv[1], newline='') as f_input:
    tsv = csv.reader(f_input, delimiter='\t')
    next(tsv)  # skip the header row
    data = []
    for row in tsv:
        row = list(map(int, row))  # list() needed in Python 3, where map is lazy
        data.append([row[0], max(row[1:])])

with open(sys.argv[1], 'w', newline='') as f_output:
    tsv = csv.writer(f_output, delimiter='\t')
    tsv.writerow(['Position', 'HighFreq'])
    tsv.writerows(data)
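Note that the script rewrites the input file in place (it collects all rows first, then reopens the same path for writing), so run it on a copy if you want to keep the original, e.g. python highfreq.py file.tsv (the script name here is just a placeholder).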
In Perl
use strict;
use warnings 'all';
use feature 'say';
use autodie;
use List::Util 'max';
open my $fh, '<', 'freq.txt';
<$fh>;    # discard the header line

say join "\t", qw/ Position HighFreq /;

while ( <$fh> ) {
    my ($n, @fields) = split;
    say join "\t", $n, max(@fields);
}
output
Position HighFreq
1 117
2 16
3 11
4 5
5 15
6 108
7 147
This seems like a simple question, but I can't find an answer.
Input:
a 3 4
b 1 4
c 8 3
d 3 8
Wanted output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
Note: the actual .txt input file has many more rows than shown.
You didn't ask for it, but would you want awk? You could do:
awk '{$1=$1 OFS $1}1' Input
Here $1=$1 OFS $1 rewrites the first field as two copies joined by the output field separator, and the trailing 1 is awk shorthand for "print the rebuilt line". Or the more obvious but less flexible:
awk '{print $1, $1, $2, $3}' Input
Assuming you've read your results into an array, you want:
values = ["a",1,2,3]
values.insert(0,values[0])
This inserts the value of index 0 (in this case "a") at position 0, moving all the other contents of values to the right.
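After this, values is ['a', 'a', 1, 2, 3], which you can join back into a line with, e.g., ' '.join(map(str, values)).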
This also works on strings, so if your results are read in as a single string you can do the following; note that the string includes the spaces after each item, and the approach is a bit different:
values = "a 1 2 3"
values = values[:2] + values
In this example we take the first two characters of the string (values[:2], equivalently values[0:2]) and prepend them to the existing string, which yields "a a 1 2 3".
Hope this helps!
Try this:
with open("text.txt") as fin:
    content = fin.readlines()

for elem in content:
    elem = elem.rstrip('\n')  # strip the newline rather than slicing, so the last line isn't truncated
    print(elem[0], elem)
Output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8
with open("sample.csv") as inputs:
for line in inputs:
trimed_line = line.strip()
parts = trimed_line.split()
print("{0} {1}".format(parts[0], trimed_line))
output:
a a 3 4
b b 1 4
c c 8 3
d d 3 8