Parsing information out of sequencing data - Python

I have a txt file, converted from a fasta file, that contains just the particular region I'm interested in analyzing. It looks like this:
CTGGCCGCGCTGACTCCTCTCGCT
CTCGCAGCACTGACTCCTCTTGCG
CTAGCCGCTCTGACTCCGCTAGCG
CTCGCTGCCCTCACACCTCTTGCA
CTCGCAGCACTGACTCCTCTTGCG
CTCGCAGCACTAACACCCCTAGCT
CTCGCTGCTCTGACTCCTCTCGCC
CTGGCCGCGCTGACTCCTCTCGCT
I am currently using Excel to perform some calculations on the nucleotide diversity at each position. Some of the files have around 200,000 reads, which makes the Excel files unwieldy. I figure there must be an easier way to do this using Python or R.
Basically, I want to take the .txt file with the list of sequences and measure nucleotide diversity at each position using the equation -p*log2(p), summed over the nucleotides observed at that position. Does anyone know how this might be done in a way besides Excel?
Thanks so much in advance for any help.

If you can work from the fasta file, that might be better, as there are
packages specifically designed to work with that format.
Here, I give a solution in R, using the packages seqinr and also
dplyr (part of tidyverse) for manipulating data.
If this were your fasta file (based on your sequences):
>seq1
CTGGCCGCGCTGACTCCTCTCGCT
>seq2
CTCGCAGCACTGACTCCTCTTGCG
>seq3
CTAGCCGCTCTGACTCCGCTAGCG
>seq4
CTCGCTGCCCTCACACCTCTTGCA
>seq5
CTCGCAGCACTGACTCCTCTTGCG
>seq6
CTCGCAGCACTAACACCCCTAGCT
>seq7
CTCGCTGCTCTGACTCCTCTCGCC
>seq8
CTGGCCGCGCTGACTCCTCTCGCT
You can read it into R using the seqinr package:
# Load the packages
library(tidyverse) # I use this package for manipulating data.frames later on
library(seqinr)
# Read the fasta file - use the path relevant for you
seqs <- read.fasta("~/path/to/your/file/example_fasta.fa")
This returns a list object, which contains as many elements as there are
sequences in your file.
For your particular question - calculating diversity metrics for each position -
we can use two useful functions from the seqinr package:
getFrag() to subset the sequences
count() to calculate the frequency of each nucleotide
For example, if we wanted the nucleotide frequencies for the first position of
our sequences, we could do:
# Get position 1
pos1 <- getFrag(seqs, begin = 1, end = 1)
# Calculate frequency of each nucleotide
count(pos1, wordsize = 1, freq = TRUE)
a c g t
0 1 0 0
Showing us that the first position only contains a "C".
Below is a way to programmatically "loop" through all positions and do the
calculations we might be interested in:
# Obtain fragment lengths - assuming all sequences are the same length!
l <- length(seqs[[1]])
# Use the `lapply` function to estimate frequency for each position
p <- lapply(1:l, function(i, seqs){
  # Obtain the nucleotide for the current position
  pos_seq <- getFrag(seqs, i, i)
  # Get the frequency of each nucleotide
  pos_freq <- count(pos_seq, 1, freq = TRUE)
  # Convert to data.frame, rename variables more sensibly
  ## and add information about the nucleotide position
  pos_freq <- pos_freq %>%
    as.data.frame() %>%
    rename(nuc = Var1, freq = Freq) %>%
    mutate(pos = i)
}, seqs = seqs)
# The output of the above is a list.
## We now bind all tables to a single data.frame
## Remove nucleotides with zero frequency
## And estimate entropy and expected heterozygosity for each position
diversity <- p %>%
  bind_rows() %>%
  filter(freq > 0) %>%
  group_by(pos) %>%
  summarise(shannon_entropy = -sum(freq * log2(freq)),
            het = 1 - sum(freq^2),
            n_nuc = n())
The output of these calculations now looks like this:
head(diversity)
# A tibble: 6 x 4
    pos shannon_entropy     het n_nuc
  <int>           <dbl>   <dbl> <int>
1     1        0.000000 0.00000     1
2     2        0.000000 0.00000     1
3     3        1.298795 0.53125     3
4     4        0.000000 0.00000     1
5     5        0.000000 0.00000     1
6     6        1.561278 0.65625     3
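As a check on the -p*log2(p) formula: at position 3 the observed nucleotides are C (5/8), G (2/8) and A (1/8), so the entropy is -(0.625*log2(0.625) + 0.25*log2(0.25) + 0.125*log2(0.125)) ≈ 0.424 + 0.5 + 0.375 ≈ 1.299, matching the shannon_entropy value for pos 3 in the table above.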
And here is a more visual view of it (using ggplot2, also part of tidyverse package):
ggplot(diversity, aes(pos, shannon_entropy)) +
  geom_line() +
  geom_point(aes(colour = factor(n_nuc))) +
  labs(x = "Position (bp)", y = "Shannon Entropy",
       colour = "Number of\nnucleotides")
Update:
To apply this to several fasta files, here's one possibility
(I did not test this code, but something like this should work):
# Find all the fasta files of interest
## use a pattern that matches the file extension of your files
fasta_files <- list.files("~/path/to/your/fasta/directory",
                          pattern = ".fa", full.names = TRUE)
# Use lapply to apply the code above to each file
my_diversities <- lapply(fasta_files, function(f){
  # Read the fasta file
  seqs <- read.fasta(f)
  # Obtain fragment lengths - assuming all sequences are the same length!
  l <- length(seqs[[1]])
  # .... ETC - Copy the code above until ....
  diversity <- p %>%
    bind_rows() %>%
    filter(freq > 0) %>%
    group_by(pos) %>%
    summarise(shannon_entropy = -sum(freq * log2(freq)),
              het = 1 - sum(freq^2),
              n_nuc = n())
})
# The output is a list of tables.
## You can then bind them together,
## ensuring the name of the file is added as a new column "file_name"
names(my_diversities) <- basename(fasta_files) # name the list elements
my_diversities <- bind_rows(my_diversities, .id = "file_name") # bind tables
This will give you a table of diversities for each file. You can then use ggplot2 to visualise it, similarly to what I did above, but perhaps using facets to separate the diversity from each file into different panels.

You can open and read your file:
plist = []
with open('test.txt', 'r') as infile:
    for line in infile:
        # make calculation of 'p' for each line here
        plist.append(p)
And then use your plist to make the calculation of your entropy.
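If it helps, here is a minimal sketch of the per-position calculation itself (an illustration only, not tested on your data; it assumes the file is called 'test.txt', has one sequence per line, and all sequences are the same length):
import math
from collections import Counter

with open('test.txt', 'r') as infile:
    seqs = [line.strip() for line in infile if line.strip()]

# walk over every position and compute -sum(p * log2(p)) for that column
for pos in range(len(seqs[0])):
    counts = Counter(seq[pos] for seq in seqs)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    print(pos + 1, round(entropy, 4))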

Combining and tabulating several blocks of text

The Problem:
I need a generic approach for the following problem. For one of many files, I have been able to grab a large block of text which takes the form:
Index
1 2 3 4 5 6
eigenvalues: -15.439 -1.127 -0.616 -0.616 -0.397 0.272
1 H 1 s 0.00077 -0.03644 0.03644 0.08129 -0.00540 0.00971
2 H 1 s 0.00894 -0.06056 0.06056 0.06085 0.04012 0.03791
3 N s 0.98804 -0.11806 0.11806 -0.11806 0.15166 0.03098
4 N s 0.09555 0.16636 -0.16636 0.16636 -0.30582 -0.67869
5 N px 0.00318 -0.21790 -0.50442 0.02287 0.27385 0.37400
7 8 9 10 11 12
eigenvalues: 0.373 0.373 1.168 1.168 1.321 1.415
1 H 1 s -0.77268 0.00312 -0.00312 -0.06776 0.06776 0.69619
2 H 1 s -0.52651 -0.03358 0.03358 0.02777 -0.02777 0.78110
3 N s -0.06684 0.06684 -0.06684 -0.01918 0.01918 0.01918
4 N s 0.23960 -0.23960 0.23961 -0.87672 0.87672 0.87672
5 N px 0.01104 -0.52127 -0.24407 -0.67837 -0.35571 -0.01102
13 14 15
eigenvalues: 1.592 1.592 2.588
1 H 1 s 0.01433 0.01433 -0.94568
2 H 1 s -0.18881 -0.18881 1.84419
3 N s 0.00813 0.00813 0.00813
4 N s 0.23298 0.23298 0.23299
5 N px -0.08906 0.12679 -0.01711
The problem is that I need to extract only the coefficients, and I need to be able to reformat the table so that the coefficients can be read in rows, not columns. The resulting array would have the form:
[[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.21790]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[-0.00540, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.37400]
[-0.77268, -0.52651, -0.06684, 0.23960, 0.01104]
[0.00312, -0.03358, 0.06684, -0.23960, -0.52127]
...
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]]
This would be manageable for me if it wasn't for the fact that the number of columns changes with different files.
What I have tried:
I had earlier managed to get the eigenvalues by:
eigenvalues = []
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            if 'eigenvalues' in line:
                eigenvalues.append(line.split()[1:])
flatten = [item for sublist in eigenvalues for item in sublist]
which gives:
['-15.439', '-1.127', '-0.616', '-0.616', '-0.397', '0.272', '0.373', '0.373', '1.168', '1.168', '1.321', '1.415', '1.592', '1.592', '2.588']
I have attempted several variants of this; in the most recent approach I tried:
dir = {}
with open('text', 'r+') as f:
    for n, line in enumerate(f):
        if (n >= start_section) and (n <= end_section):
            for i in range(1, number_of_coefficients+1):
                if str(i) in line.split()[0]:
                    if line.split()[1].isdigit() == False:
                        if line.split()[3] in ['s', 'px', 'py', 'pz']:
                            dir[str(i)].append(line.split()[4:])
                        else:
                            dir[str(i)].append(line.split()[3:])
This seemed to get me close; however, I got a strange duplication of numbers in random order.
The idea was that I would then be able to convert the dictionary into the array.
Please HELP!!
EDIT:
The letters in the 3rd and sometimes 4th column are also variable (varying among s, px, py, pz).
Here's one way to do it. This approach has a few noteworthy aspects.
First -- and this is key -- it processes the data section-by-section rather than line by line. To do that, you have to write some code to read the input lines and then yield them to the rest of the program in meaningful sections. Quite often, this preliminary step will radically simplify a parsing problem.
Second, once we have a section's worth of "rows" of coefficients, the other challenge is to reorient the data -- specifically to transpose it. I figured that someone smarter than I had already figured out a slick way to do this in Python, and StackOverflow did not disappoint.
Third, there are various ways to grab the coefficients from a section of input lines, but this type of fixed-width, report-style data output has a useful characteristic that can help with parsing: everything is vertically aligned. So rather than thinking of a clever way to grab the coefficients, we just grab the columns of interest -- line[20:].
import sys

def get_section(fh):
    # Takes an open file handle.
    # Yields each section of lines having coefficients.
    lines = []
    start = False
    for line in fh:
        if 'eigenvalues' in line:
            start = True
            if lines:
                yield lines
                lines = []
        elif start:
            lines.append(line)
            if 'px' in line:
                start = False
    if lines:
        yield lines

def main():
    coeffs = []
    with open(sys.argv[1]) as fh:
        for sect in get_section(fh):
            # Grab the rows from a section.
            rows = [
                [float(c) for c in line[20:].split()]
                for line in sect
            ]
            # Transpose them. See https://stackoverflow.com/questions/6473679
            transposed = list(map(list, zip(*rows)))
            # Add to the list-of-lists of coefficients.
            coeffs.extend(transposed)
    # Check.
    for cs in coeffs:
        print(cs)

main()
Output:
[0.00077, 0.00894, 0.98804, 0.09555, 0.00318]
[-0.03644, -0.06056, -0.11806, 0.16636, -0.2179]
[0.03644, 0.06056, 0.11806, -0.16636, -0.50442]
[0.08129, 0.06085, -0.11806, 0.16636, 0.02287]
[-0.0054, 0.04012, 0.15166, -0.30582, 0.27385]
[0.00971, 0.03791, 0.03098, -0.67869, 0.374]
[-0.77268, -0.52651, -0.06684, 0.2396, 0.01104]
[0.00312, -0.03358, 0.06684, -0.2396, -0.52127]
[-0.00312, 0.03358, -0.06684, 0.23961, -0.24407]
[-0.06776, 0.02777, -0.01918, -0.87672, -0.67837]
[0.06776, -0.02777, 0.01918, 0.87672, -0.35571]
[0.69619, 0.7811, 0.01918, 0.87672, -0.01102]
[0.01433, -0.18881, 0.00813, 0.23298, -0.08906]
[0.01433, -0.18881, 0.00813, 0.23298, 0.12679]
[-0.94568, 1.84419, 0.00813, 0.23299, -0.01711]

Python: How do I read a .txt file of numbers to find the median of said numbers without using median() call?

I am working on an assignment that requires us to read in a .txt file of numbers, then figure out a way to use the length of the list to determine the middle index for both odd and even list lengths, and calculate the median without using the median() call. I do not understand how I would go about this; anything helps! (I am also still fairly new to Python.)
debug = print

# assign name of file to be read
file = "numbers_even.txt"

# open file to be read
def get_median(med):
    with open(file, mode='r') as my_file:
        # assign middle index values for list
        m2 = len(file) // 2
        debug("index2", m2)
        m1 = m2 - 1
        value1 = file[m1]
        debug(m1, value1)
        value2 = file[m2]
        middle = (value1 + value2) / 2
        debug("val1:", value1, "val2:", value2, "mid", middle)
    # end with
# end function

get_median(file)
I would recommend pulling all the numbers into a list. Then you can sort the list and choose the middle index.
Assuming your text file (numbers_even.txt) looks something like this:
1
2
3
4
5
6
7
8
9
10
11
You can do this:
with open('numbers_even.txt', 'r') as f:
    numbers = sorted(int(line) for line in f if line.strip())

n = len(numbers)
if n % 2 == 0:
    # even number of values: average the two middle ones
    print((numbers[n // 2 - 1] + numbers[n // 2]) / 2)
else:
    # odd number of values: take the single middle one
    print(numbers[n // 2])
Output:
6
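For reference, here is a sketch of the same logic wrapped in a function shaped like the get_median() the question started with, so it can be reused for other files (converting each line to int and sorting first are assumptions about the input):
def get_median(path):
    # read the numbers, convert them, and sort them
    with open(path, 'r') as f:
        numbers = sorted(int(line) for line in f if line.strip())
    n = len(numbers)
    if n % 2 == 0:
        # even count: average the two middle values
        return (numbers[n // 2 - 1] + numbers[n // 2]) / 2
    # odd count: take the single middle value
    return numbers[n // 2]

print(get_median("numbers_even.txt"))  # 6 for the 11-line example above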

Reading a binary file using np.fromfile()

I have a binary file that has numerous sections. Each section has its own pattern (i.e. the placement of integers, floats, and strings).
The pattern of each section is known; however, the number of times that pattern occurs within a section is unknown. Each record is wrapped between two identical integers that indicate the size of the record in bytes. A section name is a record whose leading and trailing record-length values are both 8. Also within each section, there are multiple records (which are known).
Header
---------------------
Known header pattern
---------------------
8 Section One 8
---------------------
Section One pattern repeating i times
---------------------
8 Section Two 8
---------------------
Section Two pattern repeating j times
---------------------
8 Section Three 8
---------------------
Section Three pattern repeating k times
---------------------
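(For concreteness, the record framing described above -- a 4-byte length, the payload, then the same 4-byte length repeated -- can be walked with a small generator like the sketch below. This is only an illustration: it assumes little-endian 4-byte lengths and uses struct in place of the unpack_int/unpack_string helpers that appear in the sample code further down.)
import struct

def records(f):
    # yield (length, payload) for each length-prefixed record
    while True:
        head = f.read(4)
        if len(head) < 4:
            return
        (n,) = struct.unpack('<i', head)
        payload = f.read(n)
        f.read(4)  # trailing copy of the record length
        yield n, payload

with open('m13.op2', 'rb') as f:
    f.seek(108, 1)  # skip header, as in the sample code below
    for n, payload in records(f):
        if n == 8:
            print('section:', payload.decode('ascii', 'replace').strip())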
Here was my approach:
Loop through and read each record using f.read(record_length); if the record is 8 bytes long, convert it to a string -- this will be the section name.
Then i call: np.fromfile(file,dtype=section_pattern,count=n)
I am calling np.fromfile for each section.
The issue I am having is twofold:
How do I determine n for each section without doing a first pass read?
Reading each record to find a section name seems rather inefficient. Is there a more efficient way to accomplish this?
The section names are always between two integer record variables: 8 and 8.
Here is some sample code; note that in this case I do not have to specify count, since the OES section is the last section:
with open('m13.op2', "rb") as f:
    filesize = os.fstat(f.fileno()).st_size
    f.seek(108, 1)  # skip header
    while True:
        rec_len_1 = unpack_int(f.read(4))
        record_bytes = f.read(rec_len_1)
        rec_len_2 = unpack_int(f.read(4))
        record_num = record_num + 1
        if rec_len_1 == 8:
            tablename = unpack_string(record_bytes).strip()
            if tablename == 'OES':
                OES = [
                    # Top keys
                    ('1','i4',1),('op2key7','i4',1),('2','i4',1),
                    ('3','i4',1),('op2key8','i4',1),('4','i4',1),
                    ('5','i4',1),('op2key9','i4',1),('6','i4',1),
                    # Record 2 -- IDENT
                    ('7','i4',1),('IDENT','i4',1),('8','i4',1),
                    ('9','i4',1),
                    ('acode','i4',1),
                    ('tcode','i4',1),
                    ('element_type','i4',1),
                    ('subcase','i4',1),
                    ('LSDVMN','i4',1),      # Load set number
                    ('UNDEF(2)','i4',2),    # Undefined
                    ('LOADSET','i4',1),     # Load set number or zero or random code identification number
                    ('FCODE','i4',1),       # Format code
                    ('NUMWDE(C)','i4',1),   # Number of words per entry in DATA record
                    ('SCODE(C)','i4',1),    # Stress/strain code
                    ('UNDEF(11)','i4',11),  # Undefined
                    ('THERMAL(C)','i4',1),  # =1 for heat transfer and 0 otherwise
                    ('UNDEF(27)','i4',27),  # Undefined
                    ('TITLE(32)','S1',32*4),    # Title
                    ('SUBTITL(32)','S1',32*4),  # Subtitle
                    ('LABEL(32)','S1',32*4),    # Label
                    ('10','i4',1),
                    # Record 3 -- Data
                    ('11','i4',1),('KEY1','i4',1),('12','i4',1),
                    ('13','i4',1),('KEY2','i4',1),('14','i4',1),
                    ('15','i4',1),('KEY3','i4',1),('16','i4',1),
                    ('17','i4',1),('KEY4','i4',1),('18','i4',1),
                    ('19','i4',1),
                    ('EKEY','i4',1),  # Element key = 10*EID+Device Code. EID = (Element key)//10
                    ('FD1','f4',1),
                    ('EX1','f4',1),
                    ('EY1','f4',1),
                    ('EXY1','f4',1),
                    ('EA1','f4',1),
                    ('EMJRP1','f4',1),
                    ('EMNRP1','f4',1),
                    ('EMAX1','f4',1),
                    ('FD2','f4',1),
                    ('EX2','f4',1),
                    ('EY2','f4',1),
                    ('EXY2','f4',1),
                    ('EA2','f4',1),
                    ('EMJRP2','f4',1),
                    ('EMNRP2','f4',1),
                    ('EMAX2','f4',1),
                    ('20','i4',1)]
                nparr = np.fromfile(f, dtype=OES)
        if f.tell() == filesize:
            break

Python: slow for loop performance on reading, extracting and writing from a list of thousands of files

I am extracting 150 different cell values from 350,000 (20 kb) ASCII raster files. My current code is fine for processing the 150 cell values from hundreds of the ASCII files; however, it is very slow when running on the full data set.
I am still learning Python, so are there any obvious inefficiencies in, or suggestions to improve, the code below?
I have tried closing the 'dat' file in the 2nd function; no improvement.
dat = None
First: I have a function which returns the row and column locations from a cartesian grid.
def world2Pixel(gt, x, y):
    ulX = gt[0]
    ulY = gt[3]
    xDist = gt[1]
    yDist = gt[5]
    rtnX = gt[2]
    rtnY = gt[4]
    pixel = int((x - ulX) / xDist)
    line = int((ulY - y) / xDist)
    return (pixel, line)
Second: a function to which I pass lists of 150 'id', 'x' and 'y' values in a for loop. The first function is called within it and used to extract the cell value, which is appended to a new list. I also have a list of files 'asc_list' and corresponding times in 'date_list'. Please ignore count/enumerate as I use them later, unless they are impeding efficiency.
def asc2series(id, x, y):
    #count = 1
    ls_id = []
    ls_p = []
    ls_d = []
    for n, (asc, date) in enumerate(zip(asc, date_list)):
        dat = gdal.Open(asc_list)
        gt = dat.GetGeoTransform()
        pixel, line = world2Pixel(gt, east, nort)
        band = dat.GetRasterBand(1)
        #dat = None
        value = band.ReadAsArray(pixel, line, 1, 1)[0, 0]
        ls_id.append(id)
        ls_p.append(value)
        ls_d.append(date)
Many thanks
In world2Pixel you are setting rtnX and rtnY, which you don't use.
You probably meant gdal.Open(asc) -- not asc_list.
You could move gt = dat.GetGeoTransform() out of the loop. (Rereading made me realize you can't really.)
You could cache calls to world2Pixel.
You're opening the dat file for each pixel -- you should probably turn the logic around to open each file only once and look up all the pixels mapped to that file, as in the sketch below.
Benchmark, check the links in this podcast to see how: http://talkpython.fm/episodes/show/28/making-python-fast-profiling-python-code
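Putting that restructuring into practice might look something like this sketch. It is only an outline (the points list and its layout are assumptions; asc_list, date_list and world2Pixel come from the question): each raster is opened once and all 150 points are read from it before moving on.
from osgeo import gdal

def extract_points(asc_list, date_list, points):
    # points is a list of (id, x, y) tuples
    rows = []
    for asc, date in zip(asc_list, date_list):
        dat = gdal.Open(asc)
        gt = dat.GetGeoTransform()
        band = dat.GetRasterBand(1)
        for pid, x, y in points:
            pixel, line = world2Pixel(gt, x, y)
            value = band.ReadAsArray(pixel, line, 1, 1)[0, 0]
            rows.append((pid, date, value))
        dat = None  # release the dataset before moving to the next file
    return rows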

Computing averages of records from multiple files with python

Dear all,
I am a beginner in Python. I am looking for the best way to do the following in Python: let's assume I have three text files, each with m rows and n columns of numbers, named A, B, and C. For the following, the contents can be indexed as A[i][j], B[k][l], and so on. I need to compute the average of A[0][0], B[0][0], and C[0][0] and write it to file D at D[0][0], and the same for the remaining records. For instance, let's assume that:
A:
1 2 3
4 5 6
B:
0 1 3
2 4 5
C:
2 5 6
1 1 1
Therefore, file D should be
D:
1 2.67 4
2.33 3.33 4
My actual files are of course larger than the present ones, on the order of a few MB. I am unsure about the best approach: whether to read all the file contents into a nested structure indexed by filename, or to read each file line by line and compute the mean as I go. After reading the manual, the fileinput module is not useful in this case because it does not read the lines "in parallel", as I need here, but reads them "serially". Any guidance or advice is highly appreciated.
Have a look at numpy. It can read the three files into three arrays (using fromfile), calculate the average and export it to a text file (using tofile).
import numpy as np
# sep=' ' makes fromfile parse whitespace-separated text instead of raw binary
a = np.fromfile('A.csv', dtype=int, sep=' ')
b = np.fromfile('B.csv', dtype=int, sep=' ')
c = np.fromfile('C.csv', dtype=int, sep=' ')
d = (a + b + c) / 3.0
d.tofile('D.csv', sep=' ')
Size of "some MB" should not be a problem.
In case of text files, try this:
def readdat(data, sep=','):
    # split the text into lines, then each non-empty line into floats
    step1 = data.split('\n')
    step2 = []
    for line in step1:
        if line:
            step2.append([float(v) for v in line.split(sep)])
    return step2

def formatdat(data, sep=','):
    # join each row of numbers back into a line of text
    step1 = []
    for row in data:
        step1.append(sep.join(str(v) for v in row))
    return '\n'.join(step1)
and then use these functions to parse the text into lists and to format the lists back into text.
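For example (a sketch, with the file names assumed and sep=' ' because the sample data is space-separated), the two helpers could be combined like this:
with open('A.txt') as fa, open('B.txt') as fb, open('C.txt') as fc:
    A = readdat(fa.read(), sep=' ')
    B = readdat(fb.read(), sep=' ')
    C = readdat(fc.read(), sep=' ')

# average the three values at each [i][j] position
D = [[(a + b + c) / 3.0 for a, b, c in zip(ra, rb, rc)]
     for ra, rb, rc in zip(A, B, C)]

with open('D.txt', 'w') as fd:
    fd.write(formatdat(D, sep=' '))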
Just for reference, here's how you'd do the same sort of thing without numpy (less elegant, but more flexible):
files = zip(open("A.dat"), open("B.dat"), open("C.dat"))
outfile = open("D.dat", "w")
for rowgrp in files:  # e.g. ("1 2 3\n", "0 1 3\n", "2 5 6\n")
    intsbyfile = [[int(a) for a in row.strip().split()] for row in rowgrp]
    # [[1, 2, 3], [0, 1, 3], [2, 5, 6]]
    intgrps = zip(*intsbyfile)  # [(1, 0, 2), (2, 1, 5), (3, 3, 6)]
    # use float() to ensure we get true division in Python 2.
    averages = [float(sum(intgrp)) / len(intgrp) for intgrp in intgrps]
    outfile.write(" ".join(str(a) for a in averages) + "\n")
In Python 3, zip will only read the files as they are needed. In Python 2, if they're too big to load into memory, use itertools.izip instead.
