I am trying to replace data at a set frequency, but only in selected columns. With the help of AKX here, I found a solution to randomly replace values across an entire list. I feel bad asking because I already asked a similar question, but I can't seem to find a solution for this case. What I am trying to do exactly is: if I have a list that contains 4 values, I want to be able to select which values are randomly replaced based on their indices. For example, if I select indices 2 and 4, I only want to replace values at those indices, while indices 1 and 3 remain unaltered.
import random

vals = ["*"]

def replace_random(lst, min_n, max_n, replacements):
    # Pick how many values to replace, then choose that many random indices.
    n = random.randint(min_n, max_n)
    if n == 0:
        return lst
    indexes = set(random.sample(range(len(lst)), n))
    return [
        random.choice(replacements) if index in indexes else value
        for index, value in enumerate(lst)
    ]
Example of applying it:
with open("test2.txt", "w") as out, open("test.txt", "rt") as f:
    for line in f:
        li = line.strip()
        tabs = li.split("\t")
        geno = tabs[1:]
        new_geno = replace_random(geno, 0, 5, vals)
        print(new_geno)
Here is an example of what I have been trying to do to achieve this:
M = [1, 3]

with open("test2.txt", "w") as out, open("test.txt", "rt") as f:
    for line in f:
        li = line.strip()
        tabs = li.split("\t")
        geno = tabs[1:]
        new_geno = replace_random(geno[M], 0, 1, vals)
        print(new_geno)
However, I get the following error when I try this:
TypeError: list indices must be integers or slices, not list
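As far as I can tell, this happens because M is a list, and a list can't be used directly as an index; picking out several positions has to be done one index at a time, something like this sketch (with made-up values):

geno = ["1", "2", "1", "4"]   # made-up example row
M = [1, 3]

# geno[M] raises the TypeError above; index per element instead:
selected = [geno[i] for i in M]   # ['2', '4']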
Example data:
Input:
123 1 2 1 4
234 - 2 0 4
345 - 2 - 4
456 0 2 1 4
567 1 2 1 4
678 0 2 0 4
789 - 2 1 4
890 0 2 1 4
Output:
123 1 * 1 4
234 - 2 0 4
345 - 2 - *
456 0 2 1 *
567 1 2 1 4
678 0 2 0 4
789 - 2 1 4
890 0 * 1 4
Edit:
One thing I forgot to mention: I thought about just removing the indices that I did not want to edit, performing the replacement on the remaining values, and then joining everything back together, but I wasn't sure how to put the values back in their original order. Here is what I tried (a sketch of the missing re-merge step follows the code):
with open("test2.txt", "w") as out, open("start.test.txt", "rt") as f:
for line in f:
li = line.strip()
tabs = li.split("\t")
geno = tabs[1:]
geno_alt = [i for j, i in enumerate(geno) if j not in M]
geno_alt = replace_random(geno_alt,0,1,vals)
print(geno_alt)
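For completeness, here is one way the untouched values and the replaced values could be stitched back together in their original order (just a sketch I did not have at the time, reusing geno, M, vals, and replace_random from above):

# Replace only the values whose index is NOT in M, then merge them back
# so that positions listed in M keep their original values.
replaced = iter(replace_random([v for j, v in enumerate(geno) if j not in M], 0, 1, vals))
merged = [geno[j] if j in M else next(replaced) for j in range(len(geno))]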
If all you're trying to do is replace values at specific indices on each line of a file (taking the example data you provided), making n replacements per line (with n randomly selected from some range) and drawing each replacement randomly from some values, this would work:
from random import sample, choice

def make_replacements(fn_in, fn_out, indices, values, frequency):
    with open(fn_out, "w") as out, open(fn_in, "r") as f:
        for line in f:
            indices_sample = sample(indices, choice(frequency))
            line = '\t'.join(
                choice(values) if n in indices_sample else v
                for n, v in enumerate(line.strip().split())
            ) + '\n'
            out.write(line)
make_replacements("start.test.txt", "out.txt", [2, 4], ['*'], [0, 1])
An example output:
123 1 2 1 *
234 - 2 0 4
345 - 2 - 4
456 0 2 1 4
567 1 2 1 *
678 0 * 0 4
789 - 2 1 *
890 0 2 1 *
I've updated the code and example output according to your changes in the question and the comments, and I believe this is what you were after.
Based on the answer given by Grismar and the answer given by AKX in my last post here, I was able to come up with this answer to solve my problem.
import random

def select_replacements(lst, indices, min_n, max_n, values):
    # Replace a random number (between min_n and max_n) of the allowed indices.
    n = random.randint(min_n, max_n)
    if n == 0:
        return lst
    indices_sample = random.sample(indices, n)
    return [
        random.choice(values) if i in indices_sample else v
        for i, v in enumerate(lst)
    ]
The program:
with open("test2.txt", "w") as out, open("start.test2.txt", "rt") as f:
for line in f:
li = line.strip()
tabs = li.split("\t")
geno = tabs[1:]
geno = select_replacements(geno, [1, 3], ['*'], 0,2)
geno = '\t'.join(geno)
merge = (f"{tabs[0]}\t{geno}\n")
out.write(merge)
input:
123 1 2 1 4
234 - 2 0 4
345 - 2 - 4
456 0 2 1 4
567 1 2 1 4
678 0 2 0 4
789 - 2 1 4
890 0 2 1 4
output:
123 1 2 1 4
234 - * 0 4
345 - * - *
456 0 * 1 4
567 1 * 1 *
678 0 * 0 *
789 - 2 1 4
890 0 * 1 *
Related
I am trying to edit a 5 * 5 square matrix in Python. I initialize every element of this matrix to 0, building it from lists with this code:
h = []
for i in range(5):
    h.append([0,0,0,0,0])
And now I want to change the matrix to something like this.
4 5 0 0 0
0 4 5 0 0
0 0 4 5 0
0 0 0 4 5
5 0 0 0 4
Here is the piece of code -
i = 0
a = 0
while i < 5:
    h[i][a] = 4
    h[i][a+1] = 5
    a += 1
    i += 1
where h[i][j] is the 2D matrix. But the output always shows something like this:
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
4 4 4 4 4
Can you guys tell me what is wrong with it?
Do the update as follows using the modulo operator %:
for i in range(5):
    h[i][i % 5] = 4
    h[i][(i+1) % 5] = 5
The % 5 in the first line isn't strictly necessary, but it underlines the general principle for matrices of other sizes. More generally, for arbitrary dimensions:
for i, row in enumerate(h):
    n = len(row)
    row[i % n] = 4
    row[(i+1) % n] = 5
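To sanity-check the result, a complete run on a freshly initialized matrix looks like this (just the question's setup plus the loop above):

h = [[0, 0, 0, 0, 0] for _ in range(5)]   # independent rows

for i, row in enumerate(h):
    n = len(row)
    row[i % n] = 4
    row[(i + 1) % n] = 5

for row in h:
    print(' '.join(map(str, row)))
# 4 5 0 0 0
# 0 4 5 0 0
# 0 0 4 5 0
# 0 0 0 4 5
# 5 0 0 0 4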
Question answered here: 2D list has weird behavior when trying to modify a single value
This should work:
#m = [[0]*5]*5  # Don't do this.
m = []
for i in range(5):
    m.append([0]*5)

i = a = 0
while i < 5:
    m[i][a] = 4
    if a < 4:
        m[i][a+1] = 5
    a += 1
    i += 1
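To see why the commented-out initialization is dangerous (and why an all-4 matrix can show up with it), note that it creates five references to the same row object; a minimal demonstration:

m = [[0] * 5] * 5                  # five references to ONE row list
m[0][0] = 4
print(m[1][0])                     # 4 -- every "row" changed together

m = [[0] * 5 for _ in range(5)]    # five independent rows
m[0][0] = 4
print(m[1][0])                     # 0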
matrix = []
for index, value in enumerate(['A','C','G','T']):
    matrix.append([])
    matrix[index].append(value + ':')
    for i in range(len(lines[0])):
        total = 0
        for sequence in lines:
            if sequence[i] == value:
                total += 1
        matrix[index].append(total)

unity = ''
for i in range(len(lines[0])):
    column = []
    for row in matrix:
        column.append(row[1:][i])
    maximum = column.index(max(column))
    unity += ['A', 'C', 'G', 'T'][maximum]

print("Unity: " + unity)
for row in matrix:
    print(' '.join(map(str, row)))
OUTPUT:
Unity: GGCTACGC
A: 1 2 0 2 3 2 0 0
C: 0 1 4 2 1 3 2 4
G: 3 3 2 0 1 2 4 1
T: 3 1 1 3 2 0 1 2
With this code I get the matrix above, but I want to format it like this:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2
But I don't know how. I hope someone can help me. Thanks in advance for any answers.
The sequences are:
AGCTACGT
TAGCTAGC
TAGCTACG
GCTAGCGC
TGCTAGCC
GGCTACGT
GTCACGTC
You need to transpose your matrix. I've added comments in the code below to explain what was changed to produce the table.
matrix = []
for index, value in enumerate(['A','C','G','T']):
    matrix.append([])
    # Don't put colons in column headers
    matrix[index].append(value)
    for i in range(len(lines[0])):
        total = 0
        for sequence in lines:
            if sequence[i] == value:
                total += 1
        matrix[index].append(total)

unity = ''
for i in range(len(lines[0])):
    column = []
    for row in matrix:
        column.append(row[1:][i])
    maximum = column.index(max(column))
    unity += ['A', 'C', 'G', 'T'][maximum]

# Transpose matrix
matrix = list(map(list, zip(*matrix)))
# Print header with tabs to make it look pretty
print('\t' + '\t'.join(matrix[0]))
# Print rows in matrix
for row, unit in zip(matrix[1:], unity):
    print(unit + ':\t' + '\t'.join(map(str, row)))
The following will be printed:
A C G T
G: 1 0 3 3
G: 2 1 3 1
C: 0 4 2 1
T: 2 2 0 3
A: 3 1 1 2
C: 2 3 2 0
G: 0 2 4 1
C: 0 4 1 2
I think the best way is to convert your matrix to a pandas DataFrame and then use its transpose function.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.transpose.html
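A sketch of that idea, using a hypothetical three-position matrix in place of the one built in the question (pandas assumed to be installed; unity as computed above):

import pandas as pd

# Hypothetical three-position example standing in for the full matrix.
matrix = [['A', 1, 2, 0], ['C', 0, 1, 4], ['G', 3, 3, 2], ['T', 3, 1, 1]]
unity = 'GGC'  # consensus base for each of the three positions

df = pd.DataFrame([row[1:] for row in matrix], index=[row[0] for row in matrix])
df = df.transpose()                   # rows become positions, columns become A C G T
df.index = [u + ':' for u in unity]   # label each position with its consensus base
print(df)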
I would like to ask how to transform a file that looks like this:
123 111 1
146 204 2
178 398 1
...
...
The first column is x, the second is y, and the third is the number in each square.
My matrix is 400x400. I would like to convert it to a simple matrix file.
My file doesn't contain every square (for example, 0 0 doesn't exist, which means that in the output file I would like a 0 in the first position of the first row).
My output file should look like this
0 0 1 0 0 0 1 0 7 9 3 0 2 0 ...
8 0 0 1 0 0 0 0 0 0 0 0 0 0 ...
7 8 9 0 7 5 0 0 3 2 4 5 5 7 ...
...
...
How can I change my file?
From the first file I would like to reach the second format: a text file with 400 lines, each with 400 values separated by " " (blank space).
Just initialize your matrix as a list of lists of zeros, then iterate over the lines in the file and set the values in the matrix accordingly. Cells that are not in the file will remain 0.
matrix = [[0 for i in range(400)] for k in range(400)]

with open("filename") as data:
    for row in data:
        (x, y, n) = map(int, row.split())
        matrix[x][y] = n
Finally, write that matrix to another file:
with open("outfile", "w") as outfile:
for row in matrix:
outfile.write(" ".join(map(str, row)) + "\n")
You could also use numpy to create the zero matrix:

import numpy
matrix = numpy.zeros((400, 400), dtype=numpy.int8)
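A sketch of the full numpy route, reusing the placeholder file names from above ("filename" and "outfile") and a wider dtype in case the counts exceed 127:

import numpy as np

matrix = np.zeros((400, 400), dtype=np.int32)

# Fill in the cells listed in the sparse "x y n" file.
with open("filename") as data:
    for row in data:
        x, y, n = map(int, row.split())
        matrix[x, y] = n

# Write one space-separated row of 400 values per line.
np.savetxt("outfile", matrix, fmt="%d", delimiter=" ")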
Two input files. Regions input file:
Start: 1 id123 1234
Stop: 1 id234 3456
Start: 2 id456 34523
Stop: 2 id231 35234
Positions input file:
1 123
1 1234
1 1256
1 1390
1 1490
1 3456
1 3560
1 5000
2 345
2 456
2 34523
2 34589
2 35234
2 40000
I want to add a third field to the positions file, indicating whether each position falls inside one of my regions. This is what I wrote, Option 1:
regions = open(fileone, 'r')
positions = open(filetwo, 'r').readlines()

for start in regions:
    stop = regions.next()
    a = start.split()
    b = stop.split()
    if 'Start' in a[0] and 'Stop' in b[0]:
        for line in positions:
            pos = line.split()
            if pos[0] == a[1] and pos[1] >= a[3] and pos[1] <= b[3]:
                pos.append("1")
            else:
                pos.append("0")
            print("\t".join(pos))
Alternative, option 2:
regions = open(fileone, 'r')
positions = open(filetwo, 'r')

d = {}
for start in regions:
    stop = regions.next()
    a = start.split()
    b = stop.split()
    if 'Start' in a[0] and 'Stop' in b[0]:
        d[a[1]] = [a[3], b[3]]

for line in positions:
    pos = line.split()
    chr = d.keys()
    beg = d.values()[0][0]
    end = d.values()[0][1]
    if pos[0] == chr and pos[1] >= beg and pos[1] <= end:
        pos.append("1")
    else:
        pos.append("0")
    print("\t".join(pos))
Option 1 returns the file twice, with only one region annotated in each repetition:
1 123 0
1 1234 1
1 1256 1
1 1390 1
1 1490 1
1 3456 1
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 0
2 34589 0
2 35234 0
2 40000 0
1 123 0
1 1234 0
1 1256 0
1 1390 0
1 1490 0
1 3456 0
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 1
2 34589 1
2 35234 1
2 40000 0
Option 2 just returns all 0s in column 3.
What I would like is a combination of the two, where the second region is also annotated the first go around. I know I could run it once for each region and then combine, but that would get messy with the volume of my real data so I'd rather avoid combining them after the fact.
Thanks in advance :)
Desired output:
1 123 0
1 1234 1
1 1256 1
1 1390 1
1 1490 1
1 3456 1
1 3560 0
1 5000 0
2 345 0
2 456 0
2 34523 1
2 34589 1
2 35234 1
2 40000 0
I propose:
def one(regions):
    with open(regions, 'r') as f:
        for line in f:
            a = line.split()
            b = f.next().split()
            assert(a[0] == 'Start:' and b[0] == 'Stop:')
            assert(a[1] == b[1])
            yield (a[1], (int(a[3]), int(b[3])))

def two(positions, regions):
    d = dict(one(regions))
    with open(positions, 'r') as g:
        for line in g:
            ls = tuple(line.split())
            yield ls + (1 if d[ls[0]][0] <= int(ls[1]) <= d[ls[0]][1] else 0,)

print list(two('filetwo.txt', 'fileone.txt'))
print '================================'
print '\n'.join('%s\t%s\t%s' % x for x in two('filetwo.txt', 'fileone.txt'))
EDIT
As an improvement, the following code seems to do what you ask:
def two(positions, regions):
    d = defaultdict(list)
    for k, v in one(regions):
        d[k].append(v)
    with open(positions, 'r') as g:
        for line in g:
            ls = tuple(line.split())
            yield ls + (1 if any(x <= int(ls[1]) <= y for (x, y) in d[ls[0]]) else 0,)
with
from collections import defaultdict
at the beginning
I have the following data:
1 3 4 2 6 7 8 8 93 23 45 2 0 0 0 1
0 3 4 2 6 7 8 8 90 23 45 2 0 0 0 1
0 3 4 2 6 7 8 6 93 23 45 2 0 0 0 1
-1 3 4 2 6 7 8 8 21 23 45 2 0 0 0 1
-1 3 4 2 6 7 8 8 0 23 45 2 0 0 0 1
The above data is in a file. I want to count the number of 1's, 0's, and -1's, but only in the 1st column. I am reading the file from standard input, but the only way I could think of is to do it like this:
cnt = 0
cnt1 = 0
cnt2 = 0
for line in sys.stdin:
    (t1, <having 15 different variables as that many columns are in files>) = re.split("\s+", line.strip())
    if re.match("+1", t1):
        cnt = cnt + 1
    if re.match("-1", t1):
        cnt1 = cnt1 + 1
    if re.match("0", t1):
        cnt2 = cnt2 + 1
How can I make it better, especially the 15 different variables part, as that's the only place where I will be using those variables?
Use collections.Counter:
from collections import Counter
with open('abc.txt') as f:
    c = Counter(int(line.split(None, 1)[0]) for line in f)
print c
Output:
Counter({0: 2, -1: 2, 1: 1})
Here str.split(None, 1) splits the line just once:
>>> s = "1 3 4 2 6 7 8 8 93 23 45 2 0 0 0 1"
>>> s.split(None, 1)
['1', '3 4 2 6 7 8 8 93 23 45 2 0 0 0 1']
Numpy makes it even easier:
>>> import numpy as np
>>> from collections import Counter
>>> Counter(np.loadtxt('abc.txt', usecols=(0,), dtype=np.int))
Counter({0: 2, -1: 2, 1: 1})
If you only want the first column, then only split off the first column, and use a dictionary to store the counts for each value.
count = dict()
for line in sys.stdin:
    (t1, rest) = line.split(' ', 1)
    try:
        count[t1] += 1
    except KeyError:
        count[t1] = 1

for item in count:
    print '%s occurs %i times' % (item, count[item])
Instead of using tuple unpacking, where you need a number of variables exactly equal to the number of parts returned by split(), you can just use the first element of those parts:
parts = re.split("\s+", line.strip())
t1 = parts[0]
or equivalently, simply
t1 = re.split("\s+", line.strip())[0]
import collections
def countFirstColum(fileName):
    res = collections.defaultdict(int)
    with open(fileName) as f:
        for line in f:
            key = line.split(" ")[0]
            res[key] += 1
    return res
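A hypothetical call, assuming the sample data above is saved as abc.txt (the same file name used in the Counter answer):

print dict(countFirstColum("abc.txt"))
# e.g. {'1': 1, '0': 2, '-1': 2} (key order may vary)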
rows = []
for line in f:
    column = line.strip().split(" ")
    rows.append(column)
Then you get a 2-dimensional array.
1st column:
for row in rows:
    print row[0]
output:
1
0
0
-1
-1
This is from a script of mine that reads an input file; it also works if the lines come from standard input:

dictionary = {}
lines = someInfile.readlines()   # read once so the lines can be scanned twice;
                                 # for standard input, use sys.stdin.readlines()
for line in lines:
    f = line.strip('\n').split()
    dictionary[f[0]] = 0
for line in lines:
    f = line.strip('\n').split()
    dictionary[f[0]] += 1
print dictionary