Problems counting rows and columns with no spaces in a matrix - python

I'm trying to find the number of rows and columns in a matrix file. The matrix doesn't have spaces between the characters but does have separate lines. The sample down below should return 3 rows and 5 columns but that's not happening.
Also, when I print the matrix, each line still has \n in it, and I want to remove that. I tried .split('\n') but that didn't help. I ran this script earlier with a different data set that was separated by commas; with line.split(',') in the code it worked: it returned the correct number of rows and columns and printed the matrix without \n. I'm not sure what changed by removing the comma from line.split().
import sys
import numpy
with open(sys.argv[1], "r") as f:
    m = [[char for char in line.split(' ')] for line in f if line.strip('\n')]
    m_size = numpy.shape(m)
print(m)
print("%s, %s" % m_size)
Sample data:
aaaaa
bbbbb
ccccc
Output:
[['aaaaa\n'], ['bbbbb\n'], ['ccccc']]
3, 1

IIUC:
import sys
import numpy as np

with open(sys.argv[1]) as f:
    m = np.array([[char for char in line.strip()] for line in f])
>>> m
array([['a', 'a', 'a', 'a', 'a'],
['b', 'b', 'b', 'b', 'b'],
['c', 'c', 'c', 'c', 'c']], dtype='<U1')
>>> m.shape
(3, 5)
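If you only need the counts and not a NumPy array, a plain-Python version of the same idea works as well. A minimal sketch, assuming the file name is passed as the first command-line argument as in the question:
import sys

# Build a list of character lists, skipping blank lines
# and dropping the trailing newline from each line.
with open(sys.argv[1]) as f:
    m = [list(line.strip()) for line in f if line.strip()]

rows = len(m)                     # number of non-empty lines
cols = len(m[0]) if m else 0      # characters per line
print("%s, %s" % (rows, cols))    # "3, 5" for the sample data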

Related

read line and split line of csv file but has comma in a cell

I have a string line and tried split with the delimiter ','
tp = 'A,B,C,"6G,1A",1,2\r\n'
tp.split(',')
and get the results as with the length of 7
['A', 'B', 'C', '"6G', '1A"', '1', '2\r\n']
but I want to get the result as to get the length 6
['A', 'B', 'C', '"6G,1A"', '1', '2\r\n']
how can I do that?
https://docs.python.org/3/library/csv.html#csv.reader
>>> import csv
>>> tp = 'A,B,C,"6G,1A",1,2\r\n'
>>> rows = list(csv.reader([tp]))
>>> rows
[['A', 'B', 'C', '6G,1A', '1', '2']]
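The same approach works when reading directly from a file. A small sketch, assuming the data lives in a file named data.csv (a hypothetical name):
import csv

# csv.reader understands quoted fields, so "6G,1A" stays a single field.
with open('data.csv', newline='') as f:
    for row in csv.reader(f):
        print(len(row), row)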
You can also use a regex once you know how to write one; unlike csv.reader, it keeps the quoted field intact, quotes included:
import re

tp = 'A,B,C,"6G,1A",1,2\r\n'
ab = [s for s in re.split(r"""(,|".*?"|'.*?')""", tp) if s.strip(",")]
print(len(ab))
print(ab)
6
['A', 'B', 'C', '"6G,1A"', '1', '2\r\n']

Program without module

I have the following program:
from collections import Counter
counter=0
lst=list()
fhandle = open('DNAInput.txt', 'r')
for line in fhandle:
    if line.startswith('>'):
        continue
    else:
        lst.append(line)
while counter != len(lst[0]):
    lst2 = list()
    for word in lst:
        lst2.append(word[counter])
    for letter in lst2:
        mc = Counter(lst).most_common(5)
    counter = counter + 1
    print(mc)
which takes the following input file:
>1
GATCA
>2
AATC
>3
AATA
>4
ACTA
And prints out the letter that repeats the most in each column.
How can I make a program that does exactly the same thing without the "from collections import Counter" import?
If I understand what you are trying to do (find the most common character in each column), here is how you can do it:
def most_common(col, exclude_char='N'):
    col = list(filter((exclude_char).__ne__, col))
    return max(set(col), key=col.count)

sequences = []
with open('DNAinput.txt', 'r') as file:
    for line in file:
        if line[0] == '>':
            continue
        else:
            sequences.append(line.strip())

m = max([len(v) for v in sequences])
matrix = [list(v) for v in sequences]
for seq in matrix:
    seq.extend(list('N' * (m - len(seq))))

transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
for column in transposed_matrix:
    print(most_common(column))
This works by:
Opening your file and reading it into a list like this:
# This is the `sequences` list
['GATCA', 'AATC', 'AATA', 'ACTA']
Get the length of the longest DNA sequence:
# m = max([len(v) for v in sequences])
5
Create a matrix (list of lists) from these sequences:
# matrix = [list(v) for v in sequences]
[['G', 'A', 'T', 'C', 'A'],
['A', 'A', 'T', 'C'],
['A', 'A', 'T', 'A'],
['A', 'C', 'T', 'A']]
Pad the matrix so all the sequences are the same length:
# for seq in matrix:
# seq.extend(list('N' * (m - len(seq))))
[['G', 'A', 'T', 'C', 'A'],
['A', 'A', 'T', 'C', 'N'],
['A', 'A', 'T', 'A', 'N'],
['A', 'C', 'T', 'A', 'N']]
Transpose the matrix so columns go top -> bottom (not left -> right). This places all the characters from the same position into a list together.
# [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
[['G', 'A', 'A', 'A'],
['A', 'A', 'A', 'C'],
['T', 'T', 'T', 'T'],
['C', 'C', 'A', 'A'],
['A', 'N', 'N', 'N']]
Finally, iterate over each list in the transposed matrix and call most_common with the sub-list as input:
# for column in transposed_matrix:
# print(most_common(column))
A
A
T
C
A
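As a side note, the transposition step can also be written with zip, which builds the same padded-column lists as the nested list comprehension; a small sketch under the same assumptions:
# zip(*matrix) pairs up the i-th character of every padded sequence,
# i.e. it produces the same columns as the nested list comprehension above.
transposed_matrix = [list(column) for column in zip(*matrix)]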
There are caveats to this approach: firstly, the most_common function included here will return an arbitrary one of the tied values when two nucleotides occur the same number of times in a single position (see position four, which could have been either A or C). Furthermore, most_common rescans the column once per distinct character, so it can be considerably slower than using Counter from collections.
For these reasons, I would strongly recommend using the following script instead, since collections is part of Python's standard library:
from collections import Counter

sequences = []
with open('DNAinput.txt', 'r') as file:
    for line in file:
        if line[0] == '>':
            continue
        else:
            sequences.append(line.strip())

m = max([len(v) for v in sequences])
matrix = [list(v) for v in sequences]
for seq in matrix:
    seq.extend(list('N' * (m - len(seq))))

transposed_matrix = [[matrix[j][i] for j in range(len(matrix))] for i in range(m)]
for column in transposed_matrix:
    print(Counter(column).most_common(5))
You would have to go to the collections module, which in my case is located here:
C:\Python27\Lib\collections.py
and grab the parts you need, copying them into your script; in your case you need the Counter class.
This could get complicated if the Counter class pulls in other things from that script or from other imported modules. You could go to those modules and copy their code into your script as well, but they in turn could reference still more modules.
What is the reason you don't want to import a module in your script? Maybe there is a better solution to your problem than avoiding imports altogether.
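If the goal is simply to avoid the import, a plain dictionary can do the counting instead of Counter. A minimal sketch (the helper name is mine, not from the original post):
def most_common_no_import(column):
    # Tally each character with an ordinary dict.
    counts = {}
    for char in column:
        counts[char] = counts.get(char, 0) + 1
    # Return the character with the highest count.
    return max(counts, key=counts.get)

print(most_common_no_import(['G', 'A', 'A', 'A']))  # -> A
This does the same job as the most_common helper above without any import; ties are still broken arbitrarily.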

Remove rows from a csv file where some column matches a specific regex

I have the following csv file:
ID,PDBID,FirstResidue,FirstChain,SecondResidue,SecondChain,ThirdResidue,ThirdChain,FourthResidue,FourthChain,Pattern
RZ_AUTO_505,1hmh,A22L,C,A22L,A,G21L,A,A23L,A,AA/GA Naked ribose
RZ_AUTO_506,1hmh,A22L,C,A22L,A,G114,A,A23L,A,AA/GA Naked ribose
RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_511,1hmh,G113,C,C112,C,G21L,A,A23L,A,GC/GA Single ribose
RZ_AUTO_512,1hmh,G113,C,C112,C,G114,A,A23L,A,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose
I need to remove the csv rows if the value of FirstResidue or SecondResidue or ThirdResidue or FourthResidue matches the regex: '[A-Za-z]$'.
The output should look something like below.
RZ_AUTO_507,1hmh,A130,E,A90,A,G80,A,A130,A,AA/GA Naked ribose
RZ_AUTO_508,1hmh,A140,E,A90,E,G120,A,A90,A,AA/GA Naked ribose
RZ_AUTO_509,1hmh,G102,A,C103,A,G102,E,A90,E,GC/GA Single ribose
RZ_AUTO_510,1hmh,G102,A,C103,A,G120,E,A90,E,GC/GA Single ribose
RZ_AUTO_513,1hnw,C1496,A,G1497,A,A1518,A,A1519,A,CG/AA Canonical ribose
RZ_AUTO_514,1hnw,C1496,A,G1497,A,A1519,A,A1518,A,CG/AA Canonical ribose
RZ_AUTO_515,1hnw,C221,A,U222,A,A195,A,A196,A,CU/AA Canonical ribose
RZ_AUTO_516,1hnw,C221,A,U222,A,A196,A,A195,A,CU/AA Canonical ribose
So far I've saved each column as a list but I'm not sure how to proceed next. Here is my code:
import csv
import re
rzid = []
pdbid = []
first_residue = []
first_chain = []
second_residue = []
second_chain = []
third_residue = []
third_chain = []
fourth_residue = []
fourth_chain = []
rz_pattern = []
#open csv file rz45.csv
f = open( 'rz45.csv', 'rU' ) #open the file in read universal mode
for line in f:
    cells = line.split(",")
    rzid.append(cells[0])
    pdbid.append(cells[1])
    first_residue.append(cells[2])
    first_chain.append(cells[3])
    second_residue.append(cells[4])
    second_chain.append(cells[5])
    third_residue.append(cells[6])
    third_chain.append(cells[7])
    fourth_residue.append(cells[8])
    fourth_chain.append(cells[9])
    rz_pattern.append(cells[10])
f.close()
Can someone please help? Thanks
UPDATE 1
import re
import csv
output = []
regex = '[AUGC]\d{1,4}'
#open csv file test_regex.csv
f = open( 'test_regex.csv', 'rU' ) #open the file in read universal mode
for line in f:
    cells = line.split(",")
    output.append([cells[2], cells[4], cells[6], cells[8]])
    match = re.search(regex, str(output))
    if match:
        print line
f.close()
I've made some changes to my code but I'm still not sure how to check that all values in cells [2,4,6,8] fulfill the given regex. Can someone advise on how to proceed next?
Something like this works (at least on your example):
import csv
import re
tgt = ['FirstResidue', 'SecondResidue', 'ThirdResidue', 'FourthResidue']
with open(file) as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:
        di = {k: v for k, v in zip(header, row)}
        if any(re.search(r'[A-Za-z]$', s) for s in [di[x] for x in tgt]):
            continue
        print row
Prints:
['RZ_AUTO_507', '1hmh', 'A130', 'E', 'A90', 'A', 'G80', 'A', 'A130', 'A', 'AA/GA Naked ribose']
['RZ_AUTO_508', '1hmh', 'A140', 'E', 'A90', 'E', 'G120', 'A', 'A90', 'A', 'AA/GA Naked ribose']
['RZ_AUTO_509', '1hmh', 'G102', 'A', 'C103', 'A', 'G102', 'E', 'A90', 'E', 'GC/GA Single ribose']
['RZ_AUTO_510', '1hmh', 'G102', 'A', 'C103', 'A', 'G120', 'E', 'A90', 'E', 'GC/GA Single ribose']
['RZ_AUTO_513', '1hnw', 'C1496', 'A', 'G1497', 'A', 'A1518', 'A', 'A1519', 'A', 'CG/AA Canonical ribose']
['RZ_AUTO_514', '1hnw', 'C1496', 'A', 'G1497', 'A', 'A1519', 'A', 'A1518', 'A', 'CG/AA Canonical ribose']
['RZ_AUTO_515', '1hnw', 'C221', 'A', 'U222', 'A', 'A195', 'A', 'A196', 'A', 'CU/AA Canonical ribose']
['RZ_AUTO_516', '1hnw', 'C221', 'A', 'U222', 'A', 'A196', 'A', 'A195', 'A', 'CU/AA Canonical ribose']
Once you have filtered the data based on your regex, you have row exactly as you want it. Either write it to a new csv or do whatever else you wish with it.
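For completeness, writing the surviving rows to a new file only needs csv.writer on top of the same loop. A small sketch, assuming hypothetical file names rz45.csv and rz45_filtered.csv:
import csv
import re

tgt = ['FirstResidue', 'SecondResidue', 'ThirdResidue', 'FourthResidue']

with open('rz45.csv') as src, open('rz45_filtered.csv', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    # writer.writerow(header)  # uncomment to keep the header line
    for row in reader:
        di = dict(zip(header, row))
        # Skip rows where any residue column ends in a letter.
        if any(re.search(r'[A-Za-z]$', di[x]) for x in tgt):
            continue
        writer.writerow(row)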

Python: How do I make a for loop append data to a list when the format is non-standard?

I'm looking to read some marker data into data structures using Python. So far, I have successfully read every Marker name into a single list (there are 2,000 of those).
The data I have was originally in Excel, but I converted it into a .txt file.
The header data in the file was removed and assigned to variables using readline().
Every line with a marker name begins with a double quotation mark (") so I was able to easily gain that information and store it as a list.
Each line with the data for that marker is indented 2 spaces, and those lines begin with either "a", "b", or "h". I want to get these into a data structure. I've tried both lists and strings, but both come back empty. The data under each marker name is a block of the three letters "a", "b", and "h", with each letter representing an individual in a population (there are 250). The tricky thing is that the letters come in groups of 5 separated by a single space, and those 5-letter blocks are separated from each other by two spaces.
Example:
"BK_12 (a,h,b) ; 1"
b a a a b a b a a a b a b a a a a a a a a a a b b a a b a h b
a a a a a a a a a a a a a a a a b a a a a h a a a a a a a a h
a a b a a a h a a a a h a h a a a a a a a a b a a a a a a h a
a a a b a a a a a a a a b a a b b a b a h a b a a a b a a a h
a a a a
That part I don't really need help with, but just included for reference of how the file looks. My ultimate goal is to use phenotype data to find markers associated with a specific phenotype.
I used a for loop to accomplish this so far. My code is below. EDIT: I tried indexing from position 2, rather than searching from position 0 for an empty space. I thought this would work. The else: statement was meant to tell me whether or not it was recognizing the elif statements. Nothing was returned, so I'm assuming it is working in that regard, but it isn't appending.
Markers = []
Genotype_Data = []
for line in infile:
    line = line.rstrip()
    if (line[0] == '"'):
        line = line.rstrip()
        Markers.append(line)
    elif (line[2] == 'a'):
        line = line.rstrip()
        Genotype_Data.append(line)
    elif (line[2] == 'b'):
        line = line.rstrip()
        Genotype_Data.append(line)
    elif (line[2] == 'h'):
        line = line.rstrip()
        Genotype_Data.append(line)
    else:
        print("Something isn't right!")
I do not understand what your goal is.
Maybe this helps you achieve it:
>>> print(line.split()) # just a and b, ...
['b', 'a', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'a', 'b', 'a', 'b', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'b', 'a', 'h', 'b']
>>> print(line.split(' ')) # a b, ... and '' where a new block starts
['', '', 'b', 'a', 'a', 'a', 'b', '', 'a', 'b', 'a', 'a', 'a', '', 'b', 'a', 'b', 'a', 'a', '', 'a', 'a', 'a', 'a', 'a', '', 'a', 'a', 'a', 'b', 'b', '', 'a', 'a', 'b', 'a', 'h', '', 'b', '', '', '']
>>> ' x x '.strip()
'x x'
I'm still not clear about what format you want the data to end up in in the Genotype_Data list, but you should be able to tweak that portion of the following as required:
Markers = []
Genotype_Data = []
INDIVIDUALS = set('abh')
with open('genotype_data.txt', mode='rt') as infile:
    line = infile.next().rstrip()  # read first line of file
    if line[0] == '"':
        Markers.append(line)
    else:
        raise ValueError('marker line expected')

    geno_accumulator = []
    for line in infile:  # read remainder of file
        line = line.rstrip()
        if line[0] == '"':
            Genotype_Data.append(geno_accumulator)
            geno_accumulator = []
            Markers.append(line)
        elif line[2] in INDIVIDUALS:
            geno_accumulator.append(line)
        else:
            raise ValueError('unrecognized line of input data encountered')

    if geno_accumulator:  # append the final bit of genotype data
        Genotype_Data.append(geno_accumulator)

print 'Markers:', Markers
print 'Genotype_Data:', Genotype_Data
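Once each marker's lines are collected in geno_accumulator, the double spaces between the 5-letter blocks stop mattering if you split on whitespace, as the first answer hints. A small sketch of flattening one accumulated block into a flat list of calls (the sample lines here are made up for illustration):
# Raw lines for one marker, as collected above.
lines = ['  b a a a b  a b a a a', '  a a a a a  a h a a b']

# str.split() with no argument collapses any run of whitespace,
# so single and double spaces are treated the same.
genotypes = [call for line in lines for call in line.split()]
print(genotypes)   # ['b', 'a', 'a', 'a', 'b', 'a', 'b', ...]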

Python - File manipulation using headings

How would I get python to run through a .txt document, find a specific heading and then put information from each line into a list for printing? And then once finished, look for another heading and do the same with the information there...
If you had a csv file as follows:
h1,h2,h3
a,b,c
d,e,f
g,h,i
Then the following would do as you request (if I understood you correctly)
def getColumn(title, file):
    result = []
    with open(file) as f:
        headers = f.readline().split(',')
        index = headers.index(title)
        for l in f.readlines():
            result.append(l.rstrip().split(',')[index])
    return result
For example:
print(getColumn("h1", 'cf.csv'))
>>> ['a', 'd', 'g']
File test.txt
a
b
c
heading1
d
e
f
heading2
g
h
heading3
>>> from itertools import takewhile, imap
>>> with open('test.txt') as f:
...     for heading in ('heading1', 'heading2', 'heading3'):
...         items = list(takewhile(heading.__ne__, imap(str.rstrip, f)))
...         print items
['a', 'b', 'c']
['d', 'e', 'f']
['g', 'h']
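Note that itertools.imap no longer exists in Python 3; map is already lazy there, so a rough Python 3 equivalent of the same idea is:
from itertools import takewhile

with open('test.txt') as f:
    for heading in ('heading1', 'heading2', 'heading3'):
        # Consume lines up to (and including) the next heading.
        items = list(takewhile(heading.__ne__, map(str.rstrip, f)))
        print(items)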
