I am working on a Linux system using Python 3 with a file in .psl format, common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            # Extract only rows matching the chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right: additional brackets, quotes, and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk one-liner (awk splits on whitespace by default, so $4 is the fourth column):
awk '$4 == 9' input.psl > output.psl
But with Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        split_line = line.split()
        if split_line[3] == '9':
            out_line = '\t'.join(split_line)
            write_psl.write(out_line + "\n")
write_psl.close()
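If you would rather keep using the csv module from your first snippet, a csv.writer configured with a tab delimiter also writes the rows back without added brackets or quotes. A minimal sketch, reusing read_psl_transcripts() from above:
import csv

# write the filtered rows back out tab-separated; the default quoting only
# kicks in for fields containing a tab, so the comma-separated cells pass
# through untouched
with open("output.psl", "w", newline="") as out_f:
    writer = csv.writer(out_f, delimiter='\t')
    writer.writerows(read_psl_transcripts())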
I have a text file that looks something like this:
original--
0 1 2 3 4 5
1 3
3 1
5 4
4 5

expected output--
SET : {0,1,2,3,4,5}
RELATION:{(1,3),(3,1),(5,4),(4,5)}
REFLEXIVE : NO
SYMMETRIC : YES
Part of the task is having it print out the first line in curly braces, and the rest within one giant pair of curly braces with each binary pair in parentheses. I am still a beginner, but I wanted to know if there is some way in Python to make one loop that treats the first line differently than the rest?
Try this, where filename.txt is your file:
with open("filename.txt", "r") as file:
set_firstline = []
first_string = file.readline()
list_of_first_string = list(first_string)
for i in range(len(list_of_first_string)):
if str(i) in first_string:
set_firstline.append(i)
print(set_firstline)
OUTPUT : [0, 1, 2, 3, 4, 5]
I'm new as well, so I hope I can help you.
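Building on that, here is a fuller sketch of one way to treat the first line differently from the rest (my reading of the exercise; the reflexive and symmetric checks are the standard definitions):
with open("filename.txt") as file:
    lines = [line for line in file.read().splitlines() if line.strip()]

the_set = [int(tok) for tok in lines[0].split()]
# every line after the first is one binary pair of the relation
relation = [tuple(int(tok) for tok in line.split()) for line in lines[1:]]

print("SET : {" + ",".join(str(x) for x in the_set) + "}")
print("RELATION:{" + ",".join("({},{})".format(a, b) for a, b in relation) + "}")

# reflexive: every element of the set must be paired with itself
print("REFLEXIVE : " + ("YES" if all((x, x) in relation for x in the_set) else "NO"))
# symmetric: for every (a, b), the pair (b, a) must also be present
print("SYMMETRIC : " + ("YES" if all((b, a) in relation for a, b in relation) else "NO"))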
I have a data.dat file that has 3 columns; the 3rd column is just the numbers 1 to 6 repeated again and again.
(In reality, column 3 has numbers from 1 to 1917, but for a minimal working example, let's stick to 1 to 6.)
# Title
127.26 134.85 1
127.26 135.76 2
127.26 135.76 3
127.26 160.97 4
127.26 160.97 5
127.26 201.49 6
125.88 132.67 1
125.88 140.07 2
125.88 140.07 3
125.88 165.05 4
125.88 165.05 5
125.88 203.06 6
137.20 140.97 1
137.20 140.97 2
137.20 148.21 3
137.20 155.37 4
137.20 155.37 5
137.20 184.07 6
I would like to:
1) extract the lines that contain 1 in the 3rd column and save them to a file called mode_1.dat.
2) extract the lines that contain 2 in the 3rd column and save them to a file called mode_2.dat.
3) extract the lines that contain 3 in the 3rd column and save them to a file called mode_3.dat.
.
.
.
6) extract the lines that contain 6 in the 3rd column and save them to a file called mode_6.dat.
In order to accomplish this, I have:
a) defined a variable factor = 6
b) created a one_to_factor list that has the numbers 1 to 6
c) used an re.search statement to extract the lines for each value of one_to_factor; %s is the i from the one_to_factor list
d) appended these results to an empty LINES list.
However, this does not work. I cannot manage to extract the lines that contain i in the 3rd column and save them to a file called mode_i.dat.
I would appreciate it if you could help me.
import re

factor = 6
one_to_factor = range(1, factor+1)
LINES = []
f_2 = open('data.dat', 'r')
for line in f_2:
    for i in one_to_factor:
        if re.search(r' \b%s$' % i, line):
            print 'line = ', line
            LINES.append(line)
            print 'LINES =', LINES
I would do it like this:
- no regexes, just use str.split() to split on whitespace
- use the last item (the digit) of the current line to generate the filename
- use a dictionary to open each file the first time its id is encountered, and reuse the handle for subsequent matches (write the title line at file open)
- close all handles at the end
code:
title_line="# Vol \t Freq \t Mod \n"
handles = dict()
next(f_2) # skip title
for line in f_2:
toks = line.split()
filename = "mode_{}.dat".format(toks[-1])
# create files first time id encountered
if filename in handles:
pass
else:
handles[filename] = open(filename,"w")
handles[filename].write(title_line) # write title
handles[filename].write(line)
# close all files
for v in handles.values():
v.close()
EDIT: that's the fastest way, but the problem is that if you have too many distinct ids (like in your real example), you'll hit a "too many open files" exception. So for that case there's a slightly less efficient method that works too:
import glob, os

# pre-processing: clean up old files if any
for f in glob.glob("mode_*.dat"):
    os.remove(f)

title_line = "# Vol \t Freq \t Mod \n"
f_2 = open('data.dat', 'r')
next(f_2)  # skip title
s = set()
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    with open(filename, "a") as f:
        # on the first write to this file, write the title before the data
        if filename not in s:
            s.add(filename)
            f.write(title_line)
        f.write(line)
f_2.close()
It basically opens each file in append mode, writes the line, and closes the file.
(The set is used to detect the first write to each file, so the title can be written before the data.)
There's a directory cleanup first to ensure that no data is left over from a previous run: append mode expects that no file exists, and if the input data set changes there could be an identifier not present in the new dataset, which would leave an "orphan" file behind from the previous run.
First, instead of looping over your one_to_factor, you can get the index in one step:
index = line.split()[-1]  # last token on the line
Then, you can check if int(index) is in your one_to_factor list.
You should create a dictionary of lists to store your lines.
Something like:
{ "1" : [line1, line7, ...],
  "2" : ...,
}
And then you can use the keys of the dictionary to create the files and populate them with the lines.
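A minimal sketch of that idea (the file names follow the question; using collections.defaultdict is just a convenience):
from collections import defaultdict

lines_by_mode = defaultdict(list)
with open('data.dat') as f_2:
    next(f_2)  # skip the title line
    for line in f_2:
        if line.strip():
            # key on the last token of the line (the mode number)
            lines_by_mode[line.split()[-1]].append(line)

# use each dictionary key to build an output filename and write its lines
for index, lines in lines_by_mode.items():
    with open('mode_{}.dat'.format(index), 'w') as out:
        out.writelines(lines)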
What my text is
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using
import re
fo = open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count = 1
while (count <= num_lines):
    line1 = fo.readline()
    j = line1[17:72]
    j1 = re.findall('\d+', j)
    k = map(int, j1)
    if (k == [30411]):
        count1 = count - 4
        line2 = fo.readlines()[count1]
        r1 = line2[10:72]
        r11 = str(r1)
        r2 = "new label"
        r22 = str(r2)
        newdata = line2.replace(r11, r22)
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(newdata)
    else:
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(line1)
    count = count + 1
The problem is in the writing of the lines. Once 30411 is found, the code has to go 3 lines back and change the label to the new one. The new output text should have all the lines the same as before, except the label line. But it is not writing properly. Can anyone help?
Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
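For example, here is a minimal sketch of the in-memory approach (my own illustration, not the poster's code; it assumes, as in the sample, that the $LABEL line always sits exactly 3 lines before the line containing 30411):
import re
from collections import deque

window = deque()  # holds the last 4 lines read
with open("test5.txt") as fin, open("output7.txt", "w") as fout:
    for line in fin:
        window.append(line)
        if len(window) > 4:
            fout.write(window.popleft())  # flush lines that can no longer change
        if "30411" in line and len(window) == 4:
            # the label line is the oldest buffered line, 3 lines back
            window[0] = re.sub(r"(\$LABEL\s*=).*", r"\1 new label", window[0])
    fout.writelines(window)  # flush whatever is still buffered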
I have previously found a way to count the prefixes, as shown below. Is there a similar way for suffixes that is so obvious I'm missing it completely?
for i in range(0, len(hardprefix)):
    if len(word) > len(hardprefix[i]):
        if word.startswith(hardprefix[i]):
            hardprefixcount += 1
            break
I need this code to use the first column of the file and count the number of occurrences of a set array of suffixes found within those words.
This is what I have so far:
for i in range(0, len(easysuffix)):
    if len(word) > len(easysuffix[i]):
        if word.endswith(easysuffix[i]):
            easysuffixcount += 1
            break
Below is a sample of my data from the csv file, with the arrays of suffixes below that:
on 1
only 4
our 1
own 1
part 7
piece 4
pieces 4
place 1
pressed 1
riot 1
september 1
shape 3
hardsuffix = ['ism']
easysuffix = ['ity', 'esome', 'ece']
Your input file is tab-delimited CSV, so you can use the csv module to process it.
import csv

suffixes = ['ity', 'esome', 'ece']

with open('input.csv') as words:
    suffix_count = 0
    reader = csv.reader(words, delimiter='\t')
    for word, _ in reader:
        if any(word.endswith(suffix) for suffix in suffixes):
            suffix_count += 1

print "Found {} suffix(es)".format(suffix_count)
I have two files. One has two columns, ref.txt. The other has three columns, file.txt.
In ref.txt,
1 2
2 3
3 5
In file.txt,
1 2 4 <---here matching
3 4 5
6 9 4
2 3 10 <---here matching
4 7 9
3 5 7 <---here matching
I would like to compare the first two columns of each file, then print only the lines in file.txt whose first two columns match a line in ref.txt.
So, the output should be,
1 2 4
2 3 10
3 5 7
I thought of a comparison using two dictionaries, something like:
mydict = {}
mydict1 = {}

with open('ref.txt') as f1:
    for line in f1:
        key, key1 = line.split()
        sp1 = mydict[key, key1]

with open('file.txt') as f2:
    for lines in f2:
        item1, item2, value = lines.split()
        sp2 = mydict1[item1, item2]
        if sp1 == sp2:
            print value
How can I compare the two files appropriately, with dictionaries or otherwise?
I found some Perl and Python code for the case where both files have the same number of columns.
In my case, one file has two columns and the other has three.
How do I compare the two files and only print the matching lines?
Here's another option:
use strict;
use warnings;
my $file = pop;
my %hash = map { chomp; $_ => 1 } <>;
push @ARGV, $file;
while (<>) {
print if /^(\d+\s+\d+)/ and $hash{$1};
}
Usage: perl script.pl ref.txt file.txt [>outFile]
The last, optional parameter directs output to a file.
Output on your datasets:
1 2 4
2 3 10
3 5 7
Hope this helps!
grep -Ff ref.txt file.txt
is enough if the amount of whitespace between the characters is the same in both files. If it is not, you can do
awk '{print "^" $1 "[[:space:]]+" $2}' ref.txt | xargs -I {} grep -E {} file.txt
combining three of my favorite utilities: awk, grep, and xargs... This latter method also ensures that the match only occurs at the start of the line (comparing column 1 with column 1, and column 2 with column 2).
Here's a revised and commented version that should work on your larger data set:
# read in your reference and the file
reference = open("ref.txt").read()
filetext = open("file.txt").read()

# split the reference file into a list of strings, splitting each time you encounter a new line
splitReference = reference.splitlines()

# do the same for the file
splitFile = filetext.splitlines()

# then, for each line in the reference,
for referenceLine in splitReference:
    # split that line into a list of strings, splitting each time you encounter a stretch of whitespace
    referenceCells = referenceLine.split()
    # then, for each line in your 'file',
    for fileLine in splitFile:
        # split that line into a list of strings, splitting each time you encounter a stretch of whitespace
        lineCells = fileLine.split()
        # check that both rows have at least two columns before indexing into them
        if len(referenceCells) > 1 and len(lineCells) > 1:
            # compare the values in the first and second columns; if both match, print the file line
            if referenceCells[0] == lineCells[0] and referenceCells[1] == lineCells[1]:
                print fileLine
Output:
1 2 4
2 3 10
3 5 7