Counting Gene segments in python and print them in columns - python

I need to convert a text file into species names and counts of gene segments. For this I want to create a dictionary whose keys are the species names I find with a pattern. Every key should hold 3 counters, each starting at 0. With three other patterns I want to look for the gene segments and, if one is found, increase the matching counter.
I'm searching for 3 different gene segments, which is why I only want to increase item1, item2 or item3. Is there a way to do this with Python?
This is the code I have written so far, but I don't know how to continue.
matrix = {}
pattern = re.compile(r"[A-Za-z ]*")
pattern_v = re.compile(r";[A_Z]+V[0-9]?;")
pattern_d = re.compile(r";[A_Z]+D[0-9]?;")
pattern_j = re.compile(r";[A_Z]+J[0-9]?;")
for i in file.readlines():
    name = pattern.search(i)
    if pattern_v.search:
        if name.group() not in matrix:
            matrix.update(name.group(), (1,0,0))
        else:
            matrix[(name.group()[0]] = matrix[(name.group()[0]]+1
...
As you can see, if pattern_v was found, I want to increase the item at position zero.
I know that the last statement doesn't work; I just wrote it to explain what I want to do.
EDIT: I got the algorithm working, but now I have the problem that I can't print the result the way I want.
{'Mus cookii': [0, 0, 0], 'Ovis aries': [0, 7, 9], 'Camelus dromedarius': [2, 0, 0], 'Danio rerio': [1, 1, 5], 'Mus saxicola': [0, 0, 0], 'Homo sapiens': [21, 6, 33], 'Rattus norvegicus': [0, 1, 12], 'Sus scrofa': [0, 5, 13], 'Vicugna pacos': [0, 9, 7], 'Macaca nemestrina': [0, 0, 0], 'Mus spretus': [4, 0, 2], 'Mus musculus': [30, 5, 28], 'Mus minutoides': [0, 0, 0], 'Oncorhynchus mykiss': [0, 11, 16], 'Canis lupus familiaris': [4, 2, 0], 'Bos taurus': [2, 5, 12], 'Cercocebus atys': [0, 0, 0], 'Oryctolagus cuniculus': [0, 0, 10], 'Rattus rattus': [0, 0, 0], 'Ornithorhynchus anatinus': [0, 4, 9], 'Macaca mulatta': [1, 3, 16], 'Papio anubis anubis': [0, 0, 0], 'Macaca fascicularis': [0, 0, 0], 'Mus pahari': [0, 0, 0]}
is the output, but I need to make it easier to read. The idea is to print columns (name, v, d, j). I tried:
def printStatistics(dict):
    for i in range(0,len(dict)):
        print(" {0:30s}{1:30d}{2:30d}{3:30d}".format(dict[i],dict[i] [0],dict[i][1],dict[i][2]), sep = "")
but I get
"TypeError: non-empty format string passed to object.format"

You can make your algorithm work with collections.defaultdict:
input data
import re
from collections import defaultdict
import numpy as np
data= '''Bos taurus;TRGV8-1;F;Bos taurus T cell receptor gamma variable 8-1;1;4;4q3.1;AY644517;-;
Bos taurus;TRGV8-2;(F) F;Bos taurus T cell receptor gamma variable 8-2;2;4;4q3.1;AY644517;-;
Camelus dromedarius;TRDV1S3;F;Camelus dromedarius T cell receptor delta variable 1S3;1;-;-;FN298223;-;
Camelus dromedarius;TRDV1S4;F;Camelus dromedarius T cell receptor delta variable 1S4;2;-;-;FN298224;-;
Canis lupus familiaris;TRBD2;F;Canis lupus familiaris T cell receptor beta diversity 2;1;16;-;HE653929;-;'''
patterns = [
re.compile(r"TR.V"),
re.compile(r"TR.D"),
re.compile(r"TR.J")
]
result = defaultdict(lambda:np.array([0,0,0]))
script
for line in data.splitlines():
    result[line.split(';')[0]]+=np.array([len(pattern.findall(line)) for pattern in patterns])
print(result)
output
defaultdict(<function <lambda> at 0x7f622f81c140>, {'Camelus dromedarius': array([2, 0, 0]), 'Canis lupus familiaris': array([0, 1, 0]), 'Bos taurus': array([2, 0, 0])})
defaultdict works like a dictionary, but a missing key is initialized on first access by calling a factory of your choice. Here the factory is lambda: np.array([0,0,0]) (a plain lambda: [0,0,0] works the same way), so you can increment the counts for a species immediately instead of checking for the key and calling update first.
I used NumPy arrays because they support element-wise addition, which keeps the loop short; you could also do it without NumPy.
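As a rough illustration of the "without NumPy" remark, the same counting can be sketched with a defaultdict of plain lists. The sample lines are shortened versions of the data above, and matching only the second field (the gene name) is an assumption made for this sketch, not something the original answer does.

```python
import re
from collections import defaultdict

# Shortened sample lines (species;gene;...), trimmed from the data above.
data = """Bos taurus;TRGV8-1;F
Bos taurus;TRGV8-2;(F) F
Camelus dromedarius;TRDV1S3;F
Camelus dromedarius;TRDV1S4;F
Canis lupus familiaris;TRBD2;F"""

patterns = [re.compile(r"TR.V"), re.compile(r"TR.D"), re.compile(r"TR.J")]

# Each unseen species starts with its own fresh [V, D, J] counter list.
result = defaultdict(lambda: [0, 0, 0])

for line in data.splitlines():
    species, gene = line.split(";")[:2]
    for idx, pattern in enumerate(patterns):
        # Assumption: only the gene-name field decides the segment type.
        if pattern.match(gene):
            result[species][idx] += 1

print(dict(result))
```

Because the factory returns a new list on every call, the species do not share counters.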

I found a solution using defaultdict:
def find_name(file):
    gene_count = defaultdict(lambda:[0,0,0])
    pattern = re.compile(r"[A-Za-z ]*")
    pattern_v = re.compile(r"\;[A-Z]+V[0-9]?\;")
    pattern_d = re.compile(r"\;[A-Z]+D[0-9]?\;")
    pattern_j = re.compile(r"\;[A-Z]+J[0-9]?\;")
    for i in file.readlines():
        name = pattern.search(i)
        name = name.group()
        if name not in gene_count and name != "Species":
            gene_count.update({name:[0,0,0]})
        if pattern_v.search(i):
            gene_count[name][0] += 1
        elif pattern_d.search(i):
            gene_count[name][1] += 1
        elif pattern_j.search(i):
            gene_count[name][2] += 1
    return gene_count
PRINTING:
def printStatistics(dict):
    print(" {0:<30s}{1:<15s}{2:<15s}{3:<15s}".format("Species", "V Count", "D Count", "J Count"), sep = "")
    for item in dict:
        print(" {0:<30s}{1:<15d}{2:<15d}{3:<15d}".format(item,dict[item][0],dict[item][1],dict[item][2]), sep = "")
Thx 4 help!
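For reference, a small variation on the printing function, sketched under the assumption that building the rows first and printing them afterwards is acceptable. The name format_statistics and the two-species sample are hypothetical; the column widths are the ones used above.

```python
def format_statistics(counts):
    """Return the table as a list of left-aligned, fixed-width lines."""
    row = "{0:<30s}{1:<15}{2:<15}{3:<15}"
    lines = [row.format("Species", "V Count", "D Count", "J Count")]
    for species in sorted(counts):
        v, d, j = counts[species]  # unpack the three counters
        lines.append(row.format(species, v, d, j))
    return lines

# Hypothetical sample taken from the dictionary shown in the question.
sample = {"Homo sapiens": [21, 6, 33], "Bos taurus": [2, 5, 12]}
lines = format_statistics(sample)
print("\n".join(lines))
```

Iterating over the keys (rather than over range(len(...))) is what makes the lookup work, since the dictionary is keyed by species name and not by position.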

Related

Python Numpy - Slicing assignment not assigning correctly

I have a 2d numpy array called arm_resets that has positive integers. The first column has all positive integers < 360. For all columns other than the first, I need to replace all values over 360 with the value that is in the same row in the 1st column. I thought this would be a relatively easy thing to do, here's what I have:
i = 300
over_360 = arm_resets[:, [i]] >= 360
print(arm_resets[:, [i]][over_360])
print(arm_resets[:, [0]][over_360])
arm_resets[:, [i]][over_360] = arm_resets[:, [0]][over_360]
print(arm_resets[:, [i]][over_360])
And here's what prints:
[3600 3609 3608 ... 3600 3611 3605]
[ 0 9 8 ... 0 11 5]
[3600 3609 3608 ... 3600 3611 3605]
Since all numbers that are being shown in the first print (first 3 and last 3) are above 360, they should be getting replaced by the 2nd print in the 3rd print. Why is this not working?
edit: reproducible example:
df = pd.DataFrame({"start":[1,2,5,6],"freq":[1,5,6,9]})
periods = 6
arm_resets = df[["start"]].values
freq = df[["freq"]].values
arm_resets = np.pad(arm_resets,((0,0),(0,periods-1)))
for i in range(1,periods):
    arm_resets[:,[i]] = arm_resets[:,[i-1]] + freq
    #over_360 = arm_resets[:,[i]] >= periods
    #arm_resets[:,[i]][over_360] = arm_resets[:,[0]][over_360]
arm_resets
Given commented out code here's what prints:
array([[ 1, 2, 3, 4, 5, 6],
[ 2, 7, 12, 17, 22, 27],
[ 3, 9, 15, 21, 27, 33],
[ 4, 13, 22, 31, 40, 49]])
What I would expect:
array([[ 1, 2, 3, 4, 5, 1],
[ 2, 2, 2, 2, 2, 2],
[ 3, 3, 3, 3, 3, 3],
[ 4, 4, 4, 4, 4, 4]])
Now if it helps, the final 2d array I'm actually trying to create is a 1/0 array that indicates which are filled in, so in this example I'd want this:
array([[ 0, 1, 1, 1, 1, 1],
[ 0, 0, 1, 0, 0, 0],
[ 0, 0, 0, 1, 0, 0],
[ 0, 0, 0, 0, 1, 0]])
The code I use to achieve this from the above arm_resets is this:
fin = np.zeros((len(arm_resets),periods),dtype=int)
for i in range(len(arm_resets)):
    fin[i,a[i]] = 1
The slice arm_resets[:, [i]] is a fancy index, and therefore makes a copy of the ith column of the data. arm_resets[:, [i]][over_360] = ... therefore calls __setitem__ on a temporary array that is discarded as soon as the statement executes. If you want to assign to the mask, call __setitem__ on the sliced object directly:
arm_resets[over_360, [i]] = ...
You also don't need to make the index into a list. It's generally better to use simple indices, especially when doing assignments, since they create views rather than copies:
arm_resets[over_360, i] = ...
With slicing, even the following should work, since it calls __setitem__ on a view:
arm_resets[:, i][over_360] = ...
This index does not help you process each row of the data, since i is a column. In fact, you can process the entire matrix in one step, without looping, if you use integer indices rather than a boolean mask. Indices are useful here because they let you pick the replacement from the matching row of the first column. Note that cols is relative to the slice arm_resets[:, 1:], so it needs a +1 offset, and the first column is column 0:
rows, cols = np.nonzero(arm_resets[:, 1:] >= 360)
arm_resets[rows, cols + 1] = arm_resets[rows, 0]
You can use np.where()
first_col = arm_resets[:,0] # first column
first_col = first_col.reshape(first_col.size,1) # reshape into a 2-D column so it broadcasts across rows
arm_resets = np.where(arm_resets >= 360,first_col,arm_resets)
np.where evaluates arm_resets >= 360 element by element: where the condition is true it takes the value from first_col (broadcast across the columns), and where it is false it keeps the arm_resets value.
Edit: As suggested by Mad Physicist, you can use arm_resets[:,0,None] directly instead of creating the first_col variable.
arm_resets = np.where(arm_resets >= 360,arm_resets[:,0,None],arm_resets)
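To make the broadcasting step concrete, here is a minimal sketch on made-up data. It shows that arm_resets[:, 0, None] is an (n, 1) column, which np.where stretches across every column of the condition.

```python
import numpy as np

# Made-up data: column 0 holds the replacement values.
arm_resets = np.array([[1, 400, 3],
                       [2, 5, 500],
                       [3, 700, 9]])

first_col = arm_resets[:, 0, None]  # shape (3, 1), one value per row
print(first_col.shape)

# Where the condition holds, take the row's first-column value;
# elsewhere keep the original entry.
fixed = np.where(arm_resets >= 360, first_col, arm_resets)
print(fixed)
```

Unlike the in-place assignments, np.where returns a new array and leaves arm_resets untouched until you rebind the name.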

Detecting edge on square wave

I have two lists, one for time and other for amplitude.
time = [0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20] # (seconds for example) the step isn't fixed
ampli = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0] # ugly space to facilitate the reading
I want to know when there's a change from '0' to '1' or vice-versa, but I only care if the event happens after verify_time = X. So, if verify_time = 12.5 it would return time[8] = 13 and time[10] = 16.
What I have so far is:
time = [0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20] # (seconds for example) the step isn't fixed
ampli = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0] # ugly spacing to facilitate the reading
verify_time = 12.5
start_end = []
for i, (t, a) in enumerate(zip(time, ampli)):
    if t >= verify_time: # should check the values from here
        if ampli[i-1] and (a != ampli[i-1]): # there's a change from 0 to 1 or vice-versa
            start_end.append(i)
print(f"Start: {time[start_end[0]]}s")
print(f"End: {time[start_end[1]]}s")
This will print:
Start: 13s
End: 17s
Question 1) Shouldn't it print End: 16s? I'm kind of lost with this logic because the number of '1's is three (3).
Question 2) Is there another way to have the same results without using this for if if? I find it awkward, in Matlab I would use the diff() function
If you don't mind using NumPy, the easiest way (and the faster one for large lists) is to find edges by computing differences, unless your waveforms are so large that they no longer fit in memory.
import numpy as np
verify_time = 12.5
time = np.array([0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20])
ampli = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
ind = time>verify_time
time = time[ind]
ampli = ampli[ind]
d_ampli = np.diff(ampli)
starts = np.where(d_ampli>0)[0]
ends = np.where(d_ampli<0)[0]-1
UPDATE
I forgot to change the diff properly; it should be d_ampli = np.diff(ampli, prepend=ampli[0])
UPDATE
As you noted, the original answer returns an empty starts array. The reason is that after filtering, ampli begins with [1, 1, ...], so there is no edge left to detect. A philosophical question arises here: does the edge really start before 12.5 or after it? We don't know, and I'm fairly sure you won't care. What you want here is a backward difference, which np.diff does not give you directly, so we shift everything forward by one index and filter by time afterwards:
import numpy as np
verify_time = 12.5
time = np.array([0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20])
ampli = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
d_ampli = np.r_[[0], np.diff(ampli)]
starts = np.where(d_ampli>0)[0]
ends = np.where(d_ampli<0)[0]-1
starts = starts[time[starts]>verify_time]
ends = ends[time[ends]>verify_time]
starts, ends
(array([8], dtype=int64), array([10], dtype=int64))
It prints 17s because you record the first sample after the change, which is 17: the first 0 after the end of the square pulse.
I've simplified the logic into a list comprehension, so it should make more sense:
assert len(time) == len(ampli)
start_end = [i for i in range(len(time)) if time[i] >= verify_time and ampli[i-1] is not None and (ampli[i] != ampli[i-1])]
print(f"Start: {time[start_end[0]]}s")
print(f"End: {time[start_end[1]]}s")
Also, you had an issue: the test if ampli[i-1] is False whenever the previous sample is 0, so those transitions were skipped. I fixed that too. It would be most accurate to take the average of time[start_end[0]] and time[start_end[0]-1], since all you know at your resolution is that the transition occurred somewhere between those two samples.
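The averaging idea can be sketched with np.diff. This is a minimal illustration using the question's data; the names change and midpoints are introduced here and are not part of any answer above.

```python
import numpy as np

time = np.array([0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20])
ampli = np.array([0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
verify_time = 12.5

# np.diff(ampli)[k] compares ampli[k + 1] with ampli[k], so adding 1
# gives the index of the first sample after each change.
change = np.flatnonzero(np.diff(ampli) != 0) + 1

# Keep only the changes observed at or after verify_time.
change = change[time[change] >= verify_time]

# The transition happened somewhere between the two neighbouring samples.
midpoints = (time[change] + time[change - 1]) / 2

print(change)     # [ 8 11]
print(midpoints)  # [12.  16.5]
```

Whether the edge is reported at the sample before the change, the sample after it, or the midpoint is a convention; the index arithmetic only differs by the +1 offset.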
I've made the below solution to have a straightforward algorithm. In summary, it goes as follows:
Convert lists to NumPy arrays
Find closest value in time array to verify_time, cut off all indexes that occur beforehand.
NumPy's "diff" function is great for finding rising and falling edges. Once those edges are found, we can use NumPy's "where" function to look up their indexes and then return the times found at the same indexes in the time array.
Coding Environment
Python 3.6 (Minimum Requirement for the print statements)
NumPy 1.15.2 (Older versions are probably fine)
import numpy as np
# inputs
time = [0, 1, 2, 3, 6, 7, 10, 11, 13, 15, 16, 17, 18, 20] # (seconds for example) the step isn't fixed
ampli = [0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0] # ugly spacing to facilitate the reading
verify_time = 12.5
# ------------------------------------------
# Solution
# Step 1) Convert lists to Numpy Arrays
npTime = np.array(time)
npAmplitude = np.array(ampli) # Amplitude
# Step 2) Find closest Value in time array to 'verify_time'.
# Strategy:
# i) Subtract 'verify_time' from each value in array. (Produces an array of Diffs)
# ii) The Diff that is nearest to zero, or better yet is zero is the best match for 'verify_time'
# iii) Get the array index of the Diff selected in step ii
# Step i
npDiffs = np.abs(npTime - float(verify_time))
# Step ii
smallest_value = np.amin(npDiffs)
# Step iii (Use numpy.where to lookup array index)
first_index_we_care_about = (np.where(npDiffs == smallest_value)[0])[0]
first_index_we_care_about = first_index_we_care_about - 1 # Below edge detection requires previous index
# Remove the beginning parts of the arrays that the question doesn't care about
npTime = npTime[first_index_we_care_about:len(npTime)]
npAmplitude = npAmplitude[first_index_we_care_about:len(npAmplitude)]
# Step 3) Edge Detection: Find the rising and falling edges
# Generates a 1 when rising edge is found, -1 for falling edges, 0s for no change
npEdges = np.diff(npAmplitude)
# For Reference
# Here you can see that numpy diff placed a 1 before all rising edges, and a -1 before falling
# ampli [ 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0]
# npEdges [ 0, 1, 0, -1, 0, 0, 0, 1, 0, 0, -1, 0, 0]
# Get array indexes where a 1 is found (I.e. A Rising Edge)
npRising_edge_indexes = np.where(npEdges == 1)[0]
# Get array indexes where a -1 is found (I.e. A Falling Edge)
npFalling_edge_indexes = np.where(npEdges == -1)[0]
# Print times that edges are found after 'verify_time'
# Note: Adjust edge detection index by '+1' to answer question correctly (yes this is consistent)
print(f'Start: {npTime[npRising_edge_indexes[0]+1]}s')
print(f'End: {npTime[npFalling_edge_indexes[0]+1]}s')
Output
Start: 13s
End: 17s

how to append keys and values from nested dictionaries to lists

I built three nested dictionaries to analyze my data. I want to analyze the values inside them and make a scatter plot, so I am creating lists, appending my data to them, and then plotting with matplotlib. My problem is that I get an error when I try to append: TypeError: unhashable type: 'list'. I am not sure whether I should change the structure of my dictionaries or whether I can handle this with what I have already created.
The structure of my dictionaries is as follows:
data_geo1:
'ENSG00000268358': {'Sample_19-leish_023_v2': 0, 'Sample_4-leish_012_v3': 0, 'Sample_25-leish027_v2': 0, 'Sample_6-leish_015_v3': 0, 'Sample_23-leish026_v2': 1, 'Sample_20-leish_023_v3': 0, 'Sample_18-leish_022_v3': 0, 'Sample_10-leish_017_v3': 0, 'Sample_13-leish_019_v2': 0, 'Sample_1-Leish_011_v2': 0, 'Sample_11-leish_018_v2': 0, 'Sample_3-leish_012_v2': 0, 'Sample_2-leish_011_v3': 0, 'Sample_29-leish032_v2': 0, 'Sample_8-leish_016_v3': 0, 'Sample_28-leish028_v3': 0, 'Sample_27-leish028_v2': 1, 'Sample_26-leish027_v3': 0, 'Sample_12-leish_018_v3': 0, 'Sample_5-leish_015_v2': 0, 'Sample_16-leish_021_v3': 0, 'Sample_21-leish_024_v2': 0, 'Sample_9-leish_017_v2': 0, 'Sample_24-leish026_v3': 1, 'Sample_22-leish_024_v3': 0, 'Sample_14-leish_019_v3': 0, 'Sample_30-leish032_v3': 0, 'Sample_7-leish_016_v2': 0, 'Sample_15-leish_021_v2': 0, 'Sample_17-leish_022_v2': 1}
data_ali:
{'ENSG00000268358': {'Sample_19-leish_023_v2': 0, 'Sample_16-leish_021_v3': 2, 'Sample_20': 0, 'Sample_24-leish026_v3': 1, 'Sample_6-leish_015_v3': 0, 'Sample_12-leish_018_v3': 0, 'Sample_22-leish_024_v3': 0, 'Sample_23-leish026_v2': 2, 'Sample_25-leish027_v2': 0, 'Sample_18-leish_022_v3': 1, 'Sample_14': 0, 'Sample_2-leish_011_v3': 0, 'Sample_13-leish_019_v2': 0, 'Sample_1-Leish_011_v2': 0, 'Sample_11-leish_018_v2': 0, 'Sample_20-leish_023_v3': 0, 'Sample_3-leish_012_v2': 0, 'Sample_10-leish_017_v3': 1, 'Sample_7': 0, 'Sample_29-leish032_v2': 1, 'Sample_8-leish_016_v3': 0, 'Sample_6': 0, 'Sample_7-leish_016_v2': 0, 'Sample_9': 0, 'Sample_8': 0, 'Sample_27-leish028_v2': 0, 'Sample_26-leish027_v3': 0, 'Sample_5': 1, 'Sample_4': 0, 'Sample_3': 0, 'Sample_19': 0, 'Sample_1': 0, 'Sample_2': 0, 'Sample_9-leish_017_v2': 0, 'Sample_5-leish_015_v2': 0, 'Sample_4-leish_012_v3': 0, 'Sample_21-leish_024_v2': 0, 'Sample_18': 0, 'Sample_13': 0, 'Sample_12': 0, 'Sample_11': 0, 'Sample_10': 1, 'Sample_17': 0, 'Sample_16': 0, 'Sample_15': 1, 'Sample_14-leish_019_v3': 0, 'Sample_30-leish032_v3': 0, 'Sample_28-leish028_v3': 1, 'Sample_15-leish_021_v2': 0, 'Sample_17-leish_022_v2': 0}
Here is all of my code from the beginning. As you can see, in the last lines I tried to create a list and append my values to it, but without success.
import os
import numpy as np
import matplotlib.pyplot as plt
path = "/home/ali/Desktop/data/"
root = "/home/ali/Desktop/SAMPLES/"
data_geo1={}
with open(path+"GSE98212_H_DE_genes_count.txt","rt") as fin: #data for sample 1-30
    h = fin.readline()
    sample1 = h.split()
    sample_names = [s.strip('"') for s in sample1[1:31]]
    for l in fin.readlines():
        l = l.strip().split()
        if l:
            gene1= l[0].strip('"')
            data_geo1[gene1] = {}
            for i, x in enumerate(l[1:31]):
                data_geo1[gene1][sample_names[i]] = int(x)
#print(data_geo1)
data_geo2={}
with open (path+"GSE98212_L_DE_genes_count.txt","rt") as fin:
    h= fin.readline()
    sample2=h.split()
    sample_names=sample2[1:21]
    for l in fin.readlines():
        l = l.strip().split()
        if l:
            gene2= l[0].strip()
            data_geo2[gene2]={}
            for i,x in enumerate (l[1:21]):
                data_geo2[gene2][sample_names[i]]= int(x)
#print(data_geo2)
data_ali={}
for sample_name in os.listdir(root):
    with open(os.path.join(root, sample_name, "counts.txt"), "r") as fin:
        for line in fin.readlines():
            gene, reads = line.split()
            reads = int(reads)
            if gene.startswith('ENSG'):
                data_ali.setdefault(gene, {})[sample_name] = reads
gene = l[0].strip()
#print(data_ali)
list_samples= data_ali[gene].keys()
#print(list_samples)
for sample in list_samples:
    reads_data_ali = []
    for gene in data_ali.keys():
        reads_data_ali.append(data_ali[gene][sample_name])
I expect output like:
[[0, 0], [0, 2], [11, 12], [4, 4], [18, 17], [2, 2], [381, 383], [1019, 1020], [198, 194], [66, 65], [2223, 2230], [30, 30], [0, 0], [33, 34], [0, 0], [411, 409], [804, 803], [11829, 7286], [137, 139], [277, 278], [3475, 3482], [5, 5], [2, 1], [70, 70], [48, 48], [234, 232], [121, 120], [928, 925], [220, 159], [165, 165], [702, 700], [1645, 1643], [79, 78], [1064, 1067], [971, 972], [0, 0]]
You can avoid a KeyError by checking whether the key exists in your dictionary before calling .append(...). Have a look at the dictionary .get() method; it is a good way to prevent this type of error.
From your description, I suppose your code for building the dictionaries data_ali and data_geo1 produces the right output, so the problem is probably in the last part, where you build the list.
I see two issues:
1. In the loop for gene in data_ali.keys():, the line reads_data_geo1.append(data_geo1[gene1][sample_names]) indexes with [gene1] instead of the loop variable gene.
2. In the loop for sample in list_samples:, you should probably use reads_data_ali.append(data_ali[gene][sample]).
Try revising these variable names and see if it works.
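A minimal sketch of how the paired list might be built with .get(), assuming the goal is one [first, second] pair per gene for a single sample. The tiny dictionaries, the sample name "S1" and the helper paired_reads are all hypothetical stand-ins for the real data.

```python
# Hypothetical miniature versions of the two nested dictionaries.
data_geo1 = {
    "ENSG1": {"S1": 0, "S2": 4},
    "ENSG2": {"S1": 11, "S2": 18},
}
data_ali = {
    "ENSG1": {"S1": 2, "S2": 4},
    "ENSG2": {"S1": 12, "S2": 17},
    "ENSG3": {"S1": 5},  # gene missing from data_geo1, so it is skipped
}

def paired_reads(first, second, sample):
    """Collect [first, second] read counts for genes present in both dicts."""
    pairs = []
    for gene, samples in first.items():
        a = samples.get(sample)                 # None if the sample is absent
        b = second.get(gene, {}).get(sample)    # None if gene or sample is absent
        if a is not None and b is not None:
            pairs.append([a, b])
    return pairs

pairs_s1 = paired_reads(data_geo1, data_ali, "S1")
print(pairs_s1)  # [[0, 2], [11, 12]]
```

The key used for the inner lookup is a single sample name (a string). Passing a list of names there is what raises TypeError: unhashable type: 'list'.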

Converting Matrix Definition to Zero-Indexed Notation - Numpy

I am trying to construct a NumPy array (a 2-dimensional array, i.e. a matrix) from a paper that uses non-standard indexing to define the matrix, i.e. the top-left element is $q_{1,2}$ instead of $q_{0,0}$.
Define the $n \times (n-2)$ matrix $Q$ by its elements $q_{i,j}$ for $i = 1, \dots, n$ and $j = 2, \dots, n-1$, given by
$q_{j-1,j} = h_{j-1}^{-1}$, $q_{j,j} = -h_{j-1}^{-1} - h_j^{-1}$ and $q_{j+1,j} = h_j^{-1}$. (I have posted this in LaTeX form here: http://www.texpaste.com/n/8vwds4fx)
I have tried to implement in python like this:
# n = u_s.size
# n = 299 for this example
n = 299
Q = np.zeros((n,n-2))
for i in range(0,n+1):
    for j in range(2,n):
        Q[j-1,j] = 1.0/h[j-1]
        Q[j,j] = -1.0/h[j-1] - 1.0/h[j]
        Q[j+1,j] = 1.0/h[j]
But I always get the error:
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-54-c07a3b1c81bb> in <module>()
      1 for i in range(1,n+1):
      2     for j in range(2,n-1):
----> 3         Q[j-1,j] = 1.0/h[j-1]
      4         Q[j,j] = -1.0/h[j-1] - 1.0/h[j]
      5         Q[j+1,j] = 1.0/h[j]
IndexError: index 297 is out of bounds for axis 1 with size 297
I initially thought I could decrement both i and j in my for loop to keep edge cases safe, as a quick way to move to zero-indexed notation, but this hasn't worked. I also tried incrementing and modifying the range().
Is there a way to convert this definition to one that python can handle? Is this a common issue?
Simplifying the problem to make the assignment pattern obvious:
In [228]: h=np.arange(10,15)
In [229]: Q=np.zeros((5,5),int)
In [230]: for j in range(1,5):
     ...:     Q[j-1:j+2,j] = h[j-1:j+2]
In [231]: Q
Out[231]:
array([[ 0, 10, 0, 0, 0],
[ 0, 11, 11, 0, 0],
[ 0, 12, 12, 12, 0],
[ 0, 0, 13, 13, 13],
[ 0, 0, 0, 14, 14]])
Assignment to the partial first and last columns may need tweaking. Here's the equivalent built from diagonals:
In [232]: np.diag(h,0)+np.diag(h[:-1],1)+np.diag(h[1:],-1)
Out[232]:
array([[10, 10, 0, 0, 0],
[11, 11, 11, 0, 0],
[ 0, 12, 12, 12, 0],
[ 0, 0, 13, 13, 13],
[ 0, 0, 0, 14, 14]])
With the h[j-1], h[j] indexing this diagonal assignment probably needs tweaking, but it should be a useful starting point.
Selecting h values more like what you use (skipping the 1/h for now):
In [238]: Q=np.zeros((5,5),int)
In [239]: for j in range(1,4):
     ...:     Q[j-1:j+2,j] =[h[j-1],h[j-1]+h[j], h[j]]
     ...:
In [240]: Q
Out[240]:
array([[ 0, 10, 0, 0, 0],
[ 0, 21, 11, 0, 0],
[ 0, 11, 23, 12, 0],
[ 0, 0, 12, 25, 0],
[ 0, 0, 0, 13, 0]])
I'm skipping the two partial end columns for now. The first slicing approach allowed me to be a bit sloppy, since it's ok to slice 'off the end'. The end columns, if set, will require their own expressions.
In [241]: j=0; Q[j:j+2,j] =[h[j], h[j]]
In [242]: j=4; Q[j-1:j+1,j] =[h[j-1],h[j-1]+h[j]]
In [243]: Q
Out[243]:
array([[10, 10, 0, 0, 0],
[10, 21, 11, 0, 0],
[ 0, 11, 23, 12, 0],
[ 0, 0, 12, 25, 13],
[ 0, 0, 0, 13, 27]])
The relevant diagonal pieces are still evident:
In [244]: h[1:]+h[:-1]
Out[244]: array([21, 23, 25, 27])
The equation doesn't contain any value for i; it refers only to j. Q should be a matrix of dimension (n+2) x (n+2). For j = 1, it refers to Q[0,1], Q[1,1] and Q[2,1]. For j = n, it refers to Q[n-1,n], Q[n,n] and Q[n+1,n]. So Q should have indices from 0 to n+1, which is n+2 values.
I don't think you need the i loop. You can get your result with only the j loop from 1 to n, but Q should be indexed from 0 to n+1.
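One way to translate the paper's indexing literally is to keep j as the paper's column index and subtract fixed offsets when touching the arrays. The sketch below assumes that h holds the n-1 values h_1, ..., h_{n-1} stored 0-based (so h_j is h[j-1]), and it takes the signs from the question's code; neither assumption is confirmed by the answers above. The function name build_Q is made up.

```python
import numpy as np

def build_Q(h):
    """Build the n x (n-2) matrix Q from spacings h_1..h_{n-1} (assumed layout)."""
    h = np.asarray(h, dtype=float)
    n = h.size + 1
    Q = np.zeros((n, n - 2))
    for j in range(2, n):       # paper's column index j = 2, ..., n-1
        col = j - 2             # paper column j  -> array column j-2
        # paper row r -> array row r-1, and h_j -> h[j-1]
        Q[j - 2, col] = 1.0 / h[j - 2]                     # q_{j-1,j}
        Q[j - 1, col] = -1.0 / h[j - 2] - 1.0 / h[j - 1]   # q_{j,j}
        Q[j, col] = 1.0 / h[j - 1]                         # q_{j+1,j}
    return Q

Q = build_Q([1.0, 2.0, 4.0])    # n = 4, so Q has shape (4, 2)
print(Q)
```

No i loop is needed, because each column has only three non-zero entries and all of them are determined by j.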

Why cycle behaves differently in just one iteration?

I have this code:
gs = open("graph.txt", "r")
gp = gs.readline()
gp_splitIndex = gp.find(" ")
gp_nodeCount = int(gp[0:gp_splitIndex])
gp_edgeCount = int(gp[gp_splitIndex+1:-1])
matrix = [] # predeclare the array
for i in range(0, gp_nodeCount):
    matrix.append([])
    for y in range(0, gp_nodeCount):
        matrix[i].append(0)
for i in range(0, gp_edgeCount-1):
    gp = gs.readline()
    gp_splitIndex = gp.find(" ") # get the index of space, dividing the 2 numbers on a row
    gp_from = int(gp[0:gp_splitIndex])
    gp_to = int(gp[gp_splitIndex+1:-1])
    matrix[gp_from][gp_to] = 1
print matrix
The file graph.txt contains this:
5 10
0 1
1 2
2 3
3 4
4 0
0 3
3 1
1 4
4 2
2 0
The first two numbers tell me that the graph has 5 nodes and 10 edges. The following number pairs describe the edges between nodes. For example, "1 4" means an edge between node 1 and node 4.
Problem is, the output should be this:
[[0, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 0, 0]]
But instead of that, I get this:
[[0, 1, 0, 1, 0], [0, 0, 1, 0, 1], [0, 0, 0, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 0, 0]]
Only one number is different, and I can't understand why this is happening. The edge "3 1" is not present. Can someone explain where the problem is?
Change for i in range(0, gp_edgeCount-1): to
for i in range(0, gp_edgeCount):
The range() function already does the "-1" operation. range(0,3) "==" [0,1,2]
And it is not the "3 1" edge that is missing, it is the "2 0" edge that is missing, and that is the last edge. The matrices start counting at 0.
Matthias has it; you don't need edgeCount - 1 since the range function doesn't include the end value in the iteration.
There are several other things you can do to clean up your code:
The with statement is preferred for opening files, since it closes them automatically for you
You don't need to call find and manually slice, split already does what you want.
You can convert and assign directly to a pair of numbers using a generator expression and iterable unpacking
You can call range with just an end value, the 0 start is implicit.
The multiplication operator is handy for initializing lists
With all of those changes:
with open('graph.txt', 'r') as graph:
    node_count, edge_count = (int(n) for n in graph.readline().split())
    matrix = [[0]*node_count for _ in range(node_count)]
    for i in range(edge_count):
        src, dst = (int(n) for n in graph.readline().split())
        matrix[src][dst] = 1
print matrix
# [[0, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 0, 0, 1, 0], [0, 1, 0, 0, 1], [1, 0, 1, 0, 0]]
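The listings in this thread use the Python 2 print statement. For readers on Python 3, the same approach can be sketched as follows; reading from an in-memory io.StringIO instead of graph.txt is an assumption made so the example runs on its own, and read_adjacency is a made-up name.

```python
import io

# Contents of graph.txt from the question, kept in memory for this sketch.
GRAPH_TEXT = """5 10
0 1
1 2
2 3
3 4
4 0
0 3
3 1
1 4
4 2
2 0
"""

def read_adjacency(stream):
    """Read 'nodes edges' followed by one 'src dst' pair per line."""
    node_count, edge_count = (int(n) for n in stream.readline().split())
    matrix = [[0] * node_count for _ in range(node_count)]
    # range(edge_count) already stops before edge_count, so no "- 1" here.
    for _ in range(edge_count):
        src, dst = (int(n) for n in stream.readline().split())
        matrix[src][dst] = 1
    return matrix

matrix = read_adjacency(io.StringIO(GRAPH_TEXT))
print(matrix)
```

With all ten edge lines consumed, the entry for the last edge, "2 0", is set as well.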
Keeping your code and style (of course, it could be much more readable):
gs = open("graph.txt", "r")
gp = gs.readline()
gp_splitIndex = gp.split(" ")
gp_nodeCount = int(gp_splitIndex[0])
gp_edgeCount = int(gp_splitIndex[1])
matrix = [] # predeclare the array
for i in range(0, gp_nodeCount):
    matrix.append([])
    for y in range(0, gp_nodeCount):
        matrix[i].append(0)
for i in range(0, gp_edgeCount):
    gp = gs.readline()
    gp_Index = gp.split(" ") # split the row on the space dividing the 2 numbers
    gp_from = int(gp_Index[0])
    gp_to = int(gp_Index[1])
    matrix[gp_from][gp_to] = 1
print matrix
It is exactly the last line that is not used: the 2 0 from your file. Hence the missing 1. Have a nice day!
The other answers are correct; here is another version, similar to tzaman's:
with open('graph.txt', mode='r') as txt_file:
    lines = [l.strip() for l in txt_file.readlines()]
number_pairs = [[int(n) for n in line.split(' ')] for line in lines]
header = number_pairs[0]
edge_pairs = number_pairs[1:]
num_nodes, num_edges = header
edges = [[0] * num_nodes for _ in xrange(num_nodes)]
for edge_start, edge_end in edge_pairs:
    edges[edge_start][edge_end] = 1
print edges
