I am new to programming but I have started looking into both Python and Perl.
I am looking for data in two input files that are partly CSV-like; I want to select some of it and put it into a new output file.
Maybe Python's csv module or pandas can help here, but I'm a bit stuck when it comes to skipping/keeping rows and columns.
Also, I don't have any headers for my columns.
Input file 1:
-- Some comments
KW1
'Z1' 'F' 30 26 'S'
KW2
'Z1' 30 26 1 1 5 7 /
'Z1' 30 26 2 2 6 8 /
'Z1' 29 27 4 4 12 13 /
Input file 2:
-- Some comments
-- Some more comments
KW1
'Z2' 'F' 40 45 'S'
KW2
'Z2' 40 45 1 1 10 10 /
'Z2' 41 45 2 2 14 15 /
'Z2' 41 46 4 4 16 17 /
Desired output file:
KW_NEW
'Z_NEW' 1000 30 26 1 /
'Z_NEW' 1000 30 26 2 /
'Z_NEW' 1000 29 27 4 /
'Z_NEW' 1000 40 45 1 /
'Z_NEW' 1000 41 45 2 /
'Z_NEW' 1000 41 46 4 /
So what I want to do is:
Do not include anything in either of my two input files before I reach KW2
Replace KW2 with KW_NEW
Replace either 'Z1' or 'Z2' with 'Z_NEW' in the first column
Add a new second column with a constant value e.g. 1000
Copy the next three columns as they are
Leave out any remaining columns before printing the slash / at the end
Could anyone give me at least some general hints/tips how to approach this?
Your files are not "partly csv" (there is not a comma in sight); they are (partly) space delimited. You can read the files line-by-line, use Python's .split() method to convert the relevant strings into lists of substrings, and then re-arrange the pieces as you please. The splitting and re-assembly might look something like this:
input_line = "'Z1' 30 26 1 1 5 7 /" # test data
input_items = input_line.split()
output_items = ["'Z_NEW'", '1000']
output_items.append(input_items[1])
output_items.append(input_items[2])
output_items.append(input_items[3])
output_items.append('/')
output_line = ' '.join(output_items)
print(output_line)
The final print() statement shows that the resulting string is
'Z_NEW' 1000 30 26 1 /
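Putting the pieces together, a minimal end-to-end sketch might look like the following. The filenames in1.txt, in2.txt, and out.txt are assumptions, since the question doesn't name its files:
output_lines = ['KW_NEW']
for path in ('in1.txt', 'in2.txt'):  # assumed input filenames
    with open(path) as fh:
        seen_kw2 = False
        for line in fh:
            if line.strip() == 'KW2':
                seen_kw2 = True  # start keeping data from the next line on
                continue
            if not seen_kw2 or not line.strip():
                continue
            items = line.split()
            # keep the three numbers after the label, drop the rest
            output_lines.append(' '.join(["'Z_NEW'", '1000'] + items[1:4] + ['/']))

with open('out.txt', 'w') as out:
    out.write('\n'.join(output_lines) + '\n')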
Is your file format static? (This is not actually CSV, by the way.) You might want to investigate a standardized file format like JSON or strict CSV to store your data, so that you can use already-existing tools to parse your input files. Python has great JSON and CSV libraries that can do all the hard work for you.
If you're stuck with this file format, I would try something along these lines:
path = '<input_path>'
kws = ['KW1', 'KW2']
desired_kw = kws[1]

def parse_columns(line):
    array = line.split()
    # get rid of a trailing slash
    if array and array[-1] == '/':
        array = array[:-1]
    return array

def is_kw(cols):
    # return the keyword if this is a keyword line, otherwise None
    if len(cols) > 0 and cols[0] in kws:
        return cols[0]
    return None

# parse the section denoted by the desired keyword
with open(path, 'r') as input_fp:
    matrix = []
    reading_section = False
    for line in input_fp:
        cols = parse_columns(line)
        line_is_kw = is_kw(cols)
        if line_is_kw:
            if not reading_section:
                if line_is_kw == desired_kw:
                    reading_section = True
                continue
            else:
                # a different keyword ends the section we wanted
                break
        if reading_section and cols:
            matrix.append(cols)
print(matrix)
From there you can use stuff like slice notation and basic list manipulation to get your desired array. Good luck!
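For instance, once matrix holds the data rows of the KW2 section, the rearrangement from your bullet list might look like this (a sketch building on the code above):
output_lines = ['KW_NEW']
for row in matrix:
    # row is e.g. ["'Z1'", '30', '26', '1', '1', '5', '7']
    output_lines.append(' '.join(["'Z_NEW'", '1000'] + row[1:4] + ['/']))
print('\n'.join(output_lines))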
Here is a way to do it with Perl:
#!/usr/bin/perl
use strict;
use warnings;

# initialize output array
my @output = ('KW_NEW');

# process first file
open my $fh1, '<', 'in1.txt' or die "unable to open file1: $!";
while (<$fh1>) {
    # consider only lines from KW2 to the end of the file
    if (/KW2/ .. eof) {
        # don't treat the KW2 line itself
        next if /KW2/;
        # split the current line on whitespace and keep only the first four elements
        my @l = (split ' ', $_)[0..3];
        # change the first element
        $l[0] = "'Z_NEW'";
        # insert 1000 at second position
        splice @l, 1, 0, 1000;
        # push into output array, with the trailing slash
        push @output, "@l /";
    }
}

# process second file
open my $fh2, '<', 'in2.txt' or die "unable to open file2: $!";
while (<$fh2>) {
    if (/KW2/ .. eof) {
        next if /KW2/;
        my @l = (split ' ', $_)[0..3];
        $l[0] = "'Z_NEW'";
        splice @l, 1, 0, 1000;
        push @output, "@l /";
    }
}

# write the array to the output file
open my $fh3, '>', 'out.txt' or die "unable to open file3: $!";
print $fh3 $_, "\n" for @output;
I am writing a script to gather results from the output file of a program. The file contains headers, captions, and data in scientific notation. I only want the data, and I need the script to do this repeatedly for different output files with the same results format.
This is the data:
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
This is my code at the moment. I want it to open the file and search for the keyword 'INPUT:BETA', which indicates the start of the results I want to extract. It then takes the information between this keyword and the end identifier that signals the end of the data I want. I don't think this section needs changing, but I have included it just in case.
I have then tried to use a regex to match the lines that start with VELOCITY (m/s), as these contain the data I need. This works and extracts each line, whitespace and all, into an array. However, I want each numerical value to be a single element, so the next line is supposed to strip out the whitespace and split the lines into individual array elements.
import re

with open(file_name) as f:
    t = f.read()
t = t[t.find('INPUT:BETA'):]
t = t[t.find(start_identifier):t.find(end_identifier)]

regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, t)
res = [s.split() for s in res]
print(res)
print(len(res))
This isn't working; here is the output:
[['33.2405E+06', '30.8868E+06', '27.9475E+06', '25.2880E+06', '22.8815E+06', '21.1951E+06', '20.1614E+06', '18.7338E+06'], ['16.9510E+06', '15.7017E+06', '14.9359E+06', '14.2075E+06', '13.5146E+06', '12.8555E+06', '11.6805E+06', '10.5252E+06']]
2
It's taking out the whitespace, but the values end up grouped into sub-lists rather than in one flat list of separate elements, which is what I need for the next stage of the processing.
My question is therefore:
How can I extract each value into a separate array element, leaving the rest of the data behind, in a way that will work with different output files with different data?
Here is how you can flatten your list, which is the first part of your question.
import re
text = """
GROUP 1 2 3 4 5 6 7 8
VELOCITY (m/s) 59.4604E+06 55.5297E+06 52.4463E+06 49.3329E+06 45.4639E+06 41.6928E+06 37.7252E+06 34.9447E+06
GROUP 9 10 11 12 13 14 15 16
VELOCITY (m/s) 33.2405E+06 30.8868E+06 27.9475E+06 25.2880E+06 22.8815E+06 21.1951E+06 20.1614E+06 18.7338E+06
GROUP 17 18 19 20 21 22 23 24
VELOCITY (m/s) 16.9510E+06 15.7017E+06 14.9359E+06 14.2075E+06 13.5146E+06 12.8555E+06 11.6805E+06 10.5252E+06
"""
regex = r"VELOCITY \(m\/s\)\s(.*)"
res = re.findall(regex, text)
res = [s.split() for s in res]
res = [value for lst in res for value in lst]
print(res)
print(len(res))
Your regex isn't what is skipping your first VELOCITY line, though; run on the text above, it matches all three. The missing row must come from the slicing earlier in your code.
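As an aside, the flattening can also be done with itertools.chain.from_iterable instead of the double comprehension:
from itertools import chain

res = re.findall(regex, text)
res = list(chain.from_iterable(s.split() for s in res))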
I am working on a Linux system using Python 3 with a file in .psl format, which is common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in input_psl:
            #Extract only rows matching chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right; additional brackets, quotes, and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement:
awk '$4 == 9' input.psl > output.psl
But in Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        split_line = line.split()
        if len(split_line) > 3 and split_line[3] == '9':
            out_line = '\t'.join(split_line)
            write_psl.write(out_line + "\n")
write_psl.close()
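If you would rather keep the csv module from your original attempt, note that the bug there is iterating over input_psl instead of csv_reader, so each line is a raw string and line[3] is a single character. A sketch of a fixed variant that also writes the output with csv.writer (which adds no extra quotes here, since the fields contain no tabs):
import csv

with open("input.psl") as input_psl, open("output.psl", "w", newline="") as output_psl:
    reader = csv.reader(input_psl, delimiter='\t')
    writer = csv.writer(output_psl, delimiter='\t')
    for row in reader:
        # keep only rows for the chromosome of interest
        if len(row) > 3 and row[3] == '9':
            writer.writerow(row)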
How do you split a string by spaces? The code below reads a file with 4 lines of 7 numbers separated by spaces. When it takes the file and then splits it, it splits by character, so if I print item[0], 5 will print instead of 50. Here is the code:
def main():
    filename = input("Enter the name of the file: ")
    infile = open(filename, "r")
    for i in range(4):
        data = infile.readline()
        print(data)
        item = data.split()
        print(data[0])

main()
The file looks like this:
50 60 15 100 60 15 40 /n
100 145 20 150 145 20 45 /n
50 245 25 120 245 25 50 /n
100 360 30 180 360 30 55 /n
split() takes as an optional argument the character you want to split your string on; called with no argument, it splits on any run of whitespace. I invite you to read the documentation of the methods you are using. :)
EDIT: By the way, readline returns a string, not a list.
split, however, does return a list.
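For example:
data = "50 60 15 100 60 15 40"
items = data.split()  # with no argument, splits on any whitespace
print(items)          # ['50', '60', '15', '100', '60', '15', '40']
print(items[0])       # 50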
import nltk
tokens = nltk.word_tokenize(TextInTheFile)
Try this once you have opened that file. TextInTheFile is a variable holding the file's contents.
There's not a lot wrong with what you are doing, except that you are printing the wrong thing.
Instead of
print(data[0])
use
print(item[0])
data[0] is the first character of the string you read from the file. You split this string into a list called item, so that's what you should print.
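A quick demonstration of the difference:
data = "50 60 15 100 60 15 40"
item = data.split()
print(data[0])  # 5  (the first character of the string)
print(item[0])  # 50 (the first element of the list)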
I'm using Python to generate training and testing data for 10-fold cross-validation and to write the datasets to 2x10 separate files (each fold writes a training file and a testing file). The weird thing is that when writing data to a file, there is always one line "missing". Actually, it might not even be missing: I discovered later that some line (only one line) in the middle of the file gets concatenated onto its previous line. An output file should look something like the following (there should be 39150 lines in total):
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 4
50 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
However, I keep getting 39149 lines, and somewhere in the middle of the file seems to mess up like this:
44 1 90 0 44 0 45 46 0 1
55 -3 95 0 44 22 40 51 12 450 -3 81 0 50 0 31 32 0 1
44 -4 76 0 42 -30 32 34 2 1
My code:
import math
import random

def k_fold(myfile, myseed=1, k=10):
    # Load data
    data = open(myfile).readlines()
    # Shuffle input
    random.seed(myseed)
    random.shuffle(data)
    # Compute partition size given input k
    len_total = len(data)
    len_part = int(math.floor(len_total / float(k)))
    # Create one partition per fold
    train = {}
    test = {}
    for i in range(k):
        test[i] = data[i * len_part:(i + 1) * len_part]
        train[i] = data[0:i * len_part] + data[(i + 1) * len_part:len_total]
    return train, test

if __name__ == "__main__":
    path = '....' #some path and input
    input = '...'
    # Generate data
    [train, test] = k_fold(input)
    # Write data to files
    for i in range(10):
        train_old = path + 'tmp_train_' + str(i)
        test_old = path + 'tmp_test_' + str(i)
        trainF = open(train_old, 'a')
        testF = open(test_old, 'a')
        print(len(train[i]))
The strange thing is that I'm doing the same thing for the training and the testing datasets. The testing dataset outputs the correct file (4350 lines), but the training dataset has the above problem. I'm sure that the function returns the 39150 lines of training data, so I think the problem must be in the file-writing part. Does anybody have any idea what I could have done wrong? Thanks in advance!
I assume that the first half of the double length line is the last line of the original file.
The lines returned by readlines (or by iterating over the file) will all still end with the LF character '\n' except the last line if the file doesn't end with an empty line. In that case, the shuffling that you do will put that '\n'-less line somewhere in the middle of 'data'.
Either append an empty line to your original file or strip all lines before processing and add the newline to each line when writing back to a file.
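In code, the second suggestion might look like this (a sketch; trainF and train[i] are the file handle and fold from the question):
for line in train[i]:
    # strip any existing newline, then add exactly one back
    trainF.write(line.rstrip('\n') + '\n')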
I have two files. One, ref.txt, has two columns. The other, file.txt, has three columns.
In ref.txt,
1 2
2 3
3 5
In file.txt,
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
I would like to compare the first two columns of each file, then print only the lines in file.txt that match a line in ref.txt.
So, the output should be,
1 2 4
2 3 10
3 5 7
I thought of a comparison with two dictionaries, something like this:
mydict = {}
mydict1 = {}

with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]

with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
How can I appropriately compare the two files, with dictionaries or something else? I found some Perl and Python code that handles the case where both files have the same number of columns, but in my case one file has two columns and the other has three. How can I compare the two files and print only the matching lines?
Here's another option:
use strict;
use warnings;

my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;
push @ARGV, $file;

while (<>) {
    print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs output to a file.
Output on your datasets:
1 2 4
2 3 10
3 5 7
Hope this helps!
grep -Ff ref.txt file.txt
is enough if the amount of whitespace between the characters is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2}' ref.txt | xargs -I {} grep -E {} file.txt
combining three of my favorite utilities: awk, grep, and xargs... This latter method also ensures that the match only occurs at the start of the line (comparing column 1 with column 1, and column 2 with column 2).
Here's a revised and commented version that should work on your larger data set:
#read in your reference and the file
reference = open("ref.txt").read()
filetext = open("file.txt").read()

#split the reference file into a list of strings, splitting each time you encounter a new line
splitReference = reference.split("\n")

#do the same for the file
splitFile = filetext.split("\n")

#then, for each line in the reference,
for referenceLine in splitReference:
    #split that line into a list of strings, splitting on each stretch of whitespace
    referenceCells = referenceLine.split()
    #then, for each line in your 'file',
    for fileLine in splitFile:
        #split that line the same way
        lineCells = fileLine.split()
        #check that both rows have at least two values (this also skips blank lines)
        if len(referenceCells) > 1 and len(lineCells) > 1:
            #compare the values in the first columns
            if referenceCells[0] == lineCells[0]:
                #if those are equal, compare the values in the second columns;
                #if they are equal too, print the file line
                if referenceCells[1] == lineCells[1]:
                    print(fileLine)
Output:
1 2 4
2 3 10
3 5 7
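If your real files are large, the nested loops above do len(reference) * len(file) comparisons. A faster alternative (my own sketch, not part of the answer above) is to load the reference pairs into a set and test the first two columns of each file.txt line against it:
with open("ref.txt") as ref:
    pairs = {tuple(line.split()) for line in ref if line.strip()}

with open("file.txt") as f:
    for line in f:
        cols = line.split()
        if tuple(cols[:2]) in pairs:
            print(line.rstrip('\n'))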