Python Dynamic Data Structures

I am going to read the lines of a given text file and select several chunks of data whose format is (int, int\n). The number of lines is different every time, so I need a dynamically sized data structure in Python. I would also like to store those chunks in a 2D data structure. If you are familiar with MATLAB programming, I'd like to have something like a structure A{n}, where n is the number of chunks of data and each chunk includes several lines of the data mentioned above.
Which type of data structure would you recommend, and how do I implement it?
i.e. A{0} = ([1,2],[2,3],[3,4]), A{1} = ([1,1],[2,2],[5,5],[7,4]), and so on.
Thank you

A Python list can contain lists as well as any other data type.
l = []
l.append(2)      # l is now [2]
l.extend([3,2])  # l is now [2, 3, 2]
l.append([4,5])  # l is now [2, 3, 2, [4, 5]]
list.append appends whatever it is given as a single element at the end of the list,
while list.extend appends each element of the given iterable to the end of the list.
I guess your required list would look somewhat like this:
l = [[[1,2],[2,3],[3,4]], [[1,1],[2,2],[5,5],[7,4]]]
so that l[0] and l[1] play the role of your A{0} and A{1}.
PS: Here's a link to give you a jump-start learning Python:
https://learnxinyminutes.com/docs/python/

Just keep in mind that if you are reading data from a text file, every value comes in as a string, so you need to use int() to convert each string to an int.
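For example, assuming the two ints on a line are separated by a comma (use split() with no argument instead if they are whitespace-separated):

line = "1,2\n"                             # a raw line as read from the file
pair = [int(x) for x in line.split(',')]   # int() ignores the trailing newline
print(pair)                                # [1, 2]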

The issue was resolved by appending to the list in two steps.
import numpy as np

file = 'data.txt'
f = open(file)
str2 = '.PEN_DOWN\n'
str3 = '.PEN_UP\n'
A = []   # lines of the current chunk
B = []   # list of all chunks
for line in f.readlines():
    switch_end = False
    if (line == str2) or (not switch_end):   # 'not', not '~': '~' is bitwise negation
        if line[0].isdigit():
            A.append(line[:-1])   # drop the trailing newline
        elif line == str3:
            switch_end = True
            B.append(A)   # first append: close the finished chunk
            A = []
B.append(A)   # second append: push the last chunk after the loop
f.close()
print(np.shape(A))
print(np.shape(B))
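For reference, here is a slightly tidier sketch of the same idea. It assumes the same '.PEN_DOWN'/'.PEN_UP' markers and that each data line holds two comma-separated ints (adjust the split if your separator differs):

chunks = []    # list of chunks; plays the role of MATLAB's A{n}
current = []   # the chunk currently being collected
with open('data.txt') as f:
    for line in f:
        line = line.strip()
        if line == '.PEN_UP':
            chunks.append(current)   # close the finished chunk
            current = []
        elif line and line[0].isdigit():
            current.append([int(x) for x in line.split(',')])

print(len(chunks))   # number of chunks
print(chunks[0])     # e.g. [[1, 2], [2, 3], [3, 4]]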

Related

read data from file to a 2d array [Python]

I have a .txt file like this:
8.3713312149,0.806817531586,0.979428482338,0.20179159543
5.00263547897,2.33208847046,0.55745770379,0.830205341157
0.0087910592556,4.98708152771,0.56425779093,0.825598658777
and I want the data to be saved in a 2D array like this:
array = [[8.3713312149,0.806817531586,0.979428482338,0.20179159543],[5.00263547897,2.33208847046,0.55745770379,0.830205341157],[0.0087910592556,4.98708152771,0.56425779093,0.825598658777]]
I tried with this code
#!/usr/bin/env python

checkpoints_from_file[][]

def read_checkpoints():
    global checkpoints_from_file
    with open("checkpoints.txt", "r") as f:
        lines = f.readlines()
        for line in lines:
            checkpoints_from_file.append(line.split(","))
    print checkpoints_from_file

if __name__ == '__main__':
    read_checkpoints()
but it does not work.
Can you guys tell me how to fix this? Thank you
You have two errors in your code. The first is that checkpoints_from_file[][] is not a valid way to initialize a multidimensional array in Python. Instead, you should write
checkpoints_from_file = []
This initializes a one-dimensional array, and you then append arrays to it in your loop, which creates a 2D array with your data.
You are also storing the entries in your array as strings, but you likely want them as floats. You can use the float function as well as a list comprehension to accomplish this.
checkpoints_from_file.append([float(x) for x in line.split(",")])
Reading from your file,
def read_checkpoints():
    checkpoints_from_file = []
    with open("checkpoints.txt", "r") as f:
        lines = f.readlines()
        for line in lines:
            checkpoints_from_file.append([float(x) for x in line.split(",")])
    print(checkpoints_from_file)

if __name__ == '__main__':
    read_checkpoints()
Or assuming you can read this data successfully, using a string literal,
lines = """8.3713312149,0.806817531586,0.979428482338,0.20179159543
5.00263547897,2.33208847046,0.55745770379,0.830205341157
0.0087910592556,4.98708152771,0.56425779093,0.825598658777"""
and a list comprehension,
list_ = [[decimal for decimal in line.split(",")] for line in lines.split("\n")]
Expanded,
checkpoints_from_file = []
for line in lines.split("\n"):
    list_of_decimals = []
    for decimal in line.split(","):
        list_of_decimals.append(decimal)
    checkpoints_from_file.append(list_of_decimals)
print(checkpoints_from_file)
Your errors:
Unlike in some languages, in Python you don't initialize a list with checkpoints_from_file[][]; instead, you initialize a one-dimensional list with checkpoints_from_file = []. Then you can insert more lists into it with Python's list.append().
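Putting both fixes together, a minimal sketch (I've made it return the list rather than use a global, which is a small design change of my own):

def read_checkpoints(path="checkpoints.txt"):
    checkpoints = []
    with open(path, "r") as f:
        for line in f:
            checkpoints.append([float(x) for x in line.split(",")])
    return checkpoints

if __name__ == '__main__':
    print(read_checkpoints())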

Parsing sequences from a FASTA file in python

I have a text file:
>name_1
data_1
>name_2
data_2
>name_3
data_3
>name_4
data_4
>name_5
data_5
I want to store the headers (name_1, name_2, ...) in one list and the data (data_1, data_2, ...) in another list in a Python program.
def parse_fasta_file(fasta):
    desc = []
    seq = []
    seq_strings = fasta.strip().split('>')
    for s in seq_strings:
        if len(s):
            sects = s.split()
            k = sects[0]
            v = ''.join(sects[1:])
            desc.append(k)
            seq.append(v)

for l in sys.stdin:
    data = open('D:\python\input.txt').read().strip()
    parse_fasta_file(data)
    print seq
This is the code I have tried, but I am not able to get the answer.
The most fundamental error is trying to access a variable outside of its scope.
def function (stuff):
    seq = whatever
function('data')
print seq ############ error
You cannot access seq outside of function. The usual way to do this is to have function return a value, and capture it in a variable within the caller.
def function (stuff):
    seq = whatever
    return seq
s = function('data')
print s
(I have deliberately used different variable names inside the function and outside. Inside function you cannot access s or data, and outside, you cannot access stuff or seq. Incidentally, it would be quite okay, but confusing to a beginner, to use a different variable with the same name seq in the mainline code.)
With that out of the way, we can attempt to write a function which returns a list of sequences and a list of descriptions for them.
def parse_fasta (lines):
    descs = []
    seqs = []
    data = ''
    for line in lines:
        if line.startswith('>'):
            if data:   # have collected a sequence, push to seqs
                seqs.append(data)
                data = ''
            descs.append(line[1:])   # Trim '>' from beginning
        else:
            data += line.rstrip('\r\n')
    # there will be yet one more to push when we run out
    seqs.append(data)
    return descs, seqs
This isn't particularly elegant, but should get you started. A better design would be to return a list of (description, data) tuples where the description and its data are closely coupled together.
descriptions, sequences = parse_fasta(open('file', 'r').read().split('\n'))
The sys.stdin loop in your code does not appear to do anything useful.
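If you want the coupled design mentioned above, here is a minimal untested sketch (the name parse_fasta_pairs is mine) that returns (description, sequence) tuples instead of two parallel lists:

def parse_fasta_pairs(lines):
    pairs = []
    desc, data = None, ''
    for line in lines:
        if line.startswith('>'):
            if desc is not None:          # close the previous record
                pairs.append((desc, data))
            desc, data = line[1:], ''
        else:
            data += line.rstrip('\r\n')
    if desc is not None:                  # push the final record
        pairs.append((desc, data))
    return pairs

# e.g. [('name_1', 'data_1'), ('name_2', 'data_2'), ...]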

Custom sort method in Python is not sorting list properly

I'm a student in a Computing class and we have to write a program which contains file handling and a sort. I've got the file handling done and I wrote out my sort (it's a simple sort) but it doesn't sort the list. My code is this:
namelist = []
scorelist = []

hs = open("hst.txt", "r")
namelist = hs.read().splitlines()

hss = open("hstscore.txt","r")
for line in hss:
    scorelist.append(int(line))

scorelength = len(scorelist)
for i in range(scorelength):
    for j in range(scorelength + 1):
        if scorelist[i] > scorelist[j]:
            temp = scorelist[i]
            scorelist[i] = scorelist[j]
            scorelist[j] = temp
return scorelist
I've not been doing Python for very long, so I know the code may not be efficient, but I really don't want to use a completely different method for sorting it, and we're not allowed to use .sort() or sorted() since we have to write our own sort function. Is there something I'm doing wrong?
def super_simple_sort(my_list):
    switched = True
    while switched:
        switched = False
        for i in range(len(my_list)-1):
            if my_list[i] > my_list[i+1]:
                my_list[i], my_list[i+1] = my_list[i+1], my_list[i]
                switched = True

super_simple_sort(some_list)
print some_list
is a very simple sorting implementation. It is equivalent to yours but takes advantage of a few things to speed it up: we only need one for loop, and we only need to keep passing over the list as long as it is out of order; also, Python doesn't require a temp variable for swapping values.
Since it changes the actual list values in place, you don't even need to return anything.
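For example, on a small made-up list of scores (your scorelist would come from hstscore.txt):

scores = [40, 10, 30, 20]
super_simple_sort(scores)
print scores    # [10, 20, 30, 40]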

Python: argument conversion during string format error /w dictionary/list reads

I'm new to these boards and understand there is protocol; any critique is appreciated. I began Python programming a few days ago and am trying to play catch-up. The point of the program is to read a file and convert each occurrence of a specific string into a dictionary of its positions within the document. Issues abound; I'll take all responses.
Here is my code:
f = open('C:\CodeDoc\Mm9\sampleCpG.txt', 'r')
cpglist = f.read()

def buildcpg(cpg):
    return "\t".join(["%d" % (k) for k in cpg.items()])

lookingFor = 'CG'
i = 0
index = 0
cpgdic = {}
try:
    while i < len(cpglist):
        index = cpglist.index(lookingFor, i)
        i = index + 1
        for index in range(len(cpglist)):
            if index not in cpgdic:
                cpgdic[index] = index
        print (buildcpg(cpgdic))
except ValueError:
    pass
f.close()
The cpgdic is supposed to act as a dictionary of the position references obtained from index. Each read of index should enter cpgdic as a new value, and the print(buildcpg(cpgdic)) line is my hunch of where the logic fails. I believe(??) it is passing cpgdic into the buildcpg function, where it should be returned as an output of all the positions of 'CG'; however, the error "TypeError: not all arguments converted during string formatting" shows up. Your turn!
ps. this destroys my 2GB memory; I need to improve with much more reading
cpg.items is yielding tuples. As such, k is a tuple (length 2) and then you're trying to format that as a single integer.
As a side note, you'll probably be a bit more memory efficient if you leave off the [ and ] in the join line. This will turn your list comprehension to a generator expression which is a bit nicer. If you're on python2.x, you could use cpg.iteritems() instead of cpg.items() as well to save a little memory.
It also makes little sense to store a dictionary where the keys and the values are the same. In this case, a simple list is probably more elegant. I would probably write the code this way:
with open('C:\CodeDoc\Mm9\sampleCpG.txt') as fin:
    cpgtxt = fin.read()
    indices = [i for i, _ in enumerate(cpgtxt) if cpgtxt[i:i+2] == 'CG']
    print '\t'.join(str(i) for i in indices)   # join() needs strings, not ints
Here it is in action:
>>> s = "CGFOOCGBARCGBAZ"
>>> indices = [i for i,_ in enumerate(s) if s[i:i+2] == 'CG']
>>> print indices
[0, 5, 10]
Note that
i for i,_ in enumerate(s)
is roughly the same thing as
i for i in range(len(s))
except that I don't like range(len(s)), and the former version will work with any iterable, not just sequences.
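If memory is the bottleneck (you mention it destroys your 2GB), a str.find loop is a possible alternative sketch that only visits the actual matches instead of every position; it assumes cpgtxt is the file contents as in the answer above:

def cg_positions(text, target='CG'):
    # yield the index of every occurrence of target in text
    i = text.find(target)
    while i != -1:
        yield i
        i = text.find(target, i + 1)

print '\t'.join(str(i) for i in cg_positions("CGFOOCGBARCGBAZ"))   # prints 0, 5 and 10, tab-separated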

Smart filter with python

Hi
I need to filter out all rows that don't contain symbols from a huge "necessary" list. Example code:
def any_it(iterable):
    for element in iterable:
        if element: return True
    return False
regexp = re.compile(r'fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ...] # huge list of 10 000 members
f = open("huge_file", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()
## File rows like, let's say:
# 1 djhds fruit=REDSOMETHING sdkjld
# 2 sdhfkjk fruit=GREENORANGE lkjfldk
# 3 dskjldsj fruit=YELLOWDOG sldkfjsdl
# 4 gfhfg fruit=REDSOMETHINGELSE fgdgdfg
filtered = (line for line in lines if any_it(regexp.findall(line)[0].startswith(x) for x in necessary))
I have python 2.4, so I can't use built-in any().
I wait a long time for this filtering, but is there some way to optimize it? For example, rows 1 and 4 contain the "RED.." pattern; if we find that the "RED.." pattern is OK, can we skip the search through the 10000-member list for the same pattern in row 4?
Is there some another way to optimize filtering?
Thank you.
...edited...
UPD: See real example data in the comments to this post. I'm also interested in sorting the result by "fruits". Thanks!
...end edited...
If you organized the necessary list as a trie, then you could look in that trie to check if the fruit starts with a valid prefix. That should be faster than comparing the fruit against every prefix.
For example (only mildly tested):
import bisect
import re

class Node(object):
    def __init__(self):
        self.children = []
        self.children_values = []
        self.exists = False

    # Based on code at http://docs.python.org/library/bisect.html
    def _index_of(self, ch):
        i = bisect.bisect_left(self.children_values, ch)
        if i != len(self.children_values) and self.children_values[i] == ch:
            return (i, self.children[i])
        return (i, None)

    def add(self, value):
        if len(value) == 0:
            self.exists = True
            return
        i, child = self._index_of(value[0])
        if not child:
            child = Node()
            self.children.insert(i, child)
            self.children_values.insert(i, value[0])
        child.add(value[1:])

    def contains_prefix_of(self, value):
        if self.exists:
            return True
        i, child = self._index_of(value[0])
        if not child:
            return False
        return child.contains_prefix_of(value[1:])

necessary = ['RED', 'GREEN', 'BLUE', 'ORANGE', 'BLACK',
             'LIGHTRED', 'LIGHTGREEN', 'GRAY']

trie = Node()
for value in necessary:
    trie.add(value)

# Find lines that match values in the trie
filtered = []
regexp = re.compile(r'fruit=([A-Z]+)')
for line in open('whatever-file'):
    fruit = regexp.findall(line)[0]
    if trie.contains_prefix_of(fruit):
        filtered.append(line)
This changes your algorithm from O(N * k), where N is the number of elements of necessary and k is the length of fruit, to just O(k) (more or less). It does take more memory though, but that might be a worthwhile trade-off for your case.
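As a quick sanity check of contains_prefix_of (the values here are made up):

t = Node()
for value in ['RED', 'GREEN', 'LIGHTRED']:
    t.add(value)

print t.contains_prefix_of('REDSOMETHING')   # True  ('RED' is a prefix)
print t.contains_prefix_of('LIGHTREDDISH')   # True  ('LIGHTRED' is a prefix)
print t.contains_prefix_of('YELLOWDOG')      # False (no prefix matches)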
I'm convinced Zach's answer is on the right track. Out of curiosity, I've implemented another version (incorporating Zach's comments about using a dict instead of bisect) and folded it into a solution that matches your example.
#!/usr/bin/env python
import re
from trieMatch import PrefixMatch # https://gist.github.com/736416
pm = PrefixMatch(['YELLOW', 'GREEN', 'RED', ]) # huge list of 10 000 members
# if the list is static, it might be worth pickling "pm" to avoid rebuilding it each time
f = open("huge_file.txt", "r") ## file with > 100 000 lines
lines = f.readlines()
f.close()
regexp = re.compile(r'^.*?fruit=([A-Z]+)')
filtered = (line for line in lines if pm.match(regexp.match(line).group(1)))
For brevity, implementation of PrefixMatch is published here.
If your list of necessary prefixes is static or changes infrequently, you can speed up subsequent runs by pickling and reusing the PrefixMatch object instead of rebuilding it each time.
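For instance, a rough sketch of that pickling step (assuming the PrefixMatch object pickles cleanly, which I have not verified):

import cPickle as pickle   # Python 2

# build once and save to disk
pm = PrefixMatch(['YELLOW', 'GREEN', 'RED'])
with open('pm.pickle', 'wb') as f:
    pickle.dump(pm, f, pickle.HIGHEST_PROTOCOL)

# on later runs, load it instead of rebuilding
with open('pm.pickle', 'rb') as f:
    pm = pickle.load(f)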
update (on sorted results)
According to the changelog for Python 2.4:
key should be a single-parameter function that takes a list element and returns a comparison key for the element. The list is then sorted using the comparison keys.
also, in the source code, line 1792:
/* Special wrapper to support stable sorting using the decorate-sort-undecorate
pattern. Holds a key which is used for comparisons and the original record
which is returned during the undecorate phase. By exposing only the key
.... */
This means that your regex pattern is only evaluated once for each entry (not once for each compare), hence it should not be too expensive to do:
sorted_generator = sorted(filtered, key=lambda line: regexp.match(line).group(1))
I personally like your code as is, since you consider "fruit=COLOR" as a pattern, which the others do not. I think you want to find some solution like memoization, which lets you skip the test for an already-solved problem, but this is not the case, I guess.
from itertools import ifilter   # Python 2

def any_it(iterable):
    for element in iterable:
        if element: return True
    return False

necessary = ['YELLOW', 'GREEN', 'RED', ...]
predicate = lambda line: any_it("fruit=" + color in line for color in necessary)
filtered = ifilter(predicate, open("testest"))
Tested (but unbenchmarked) code:
import re
import fileinput

regexp = re.compile(r'^.*?fruit=([A-Z]+)')
necessary = ['YELLOW', 'GREEN', 'RED', ]

filtered = []
for line in fileinput.input(["test.txt"]):
    try:
        key = regexp.match(line).group(1)
    except AttributeError:
        continue # no match
    for p in necessary:
        if key.startswith(p):
            filtered.append(line)
            break

# "filtered" now holds your results
print "".join(filtered)
Diff to code in question:
We do not first load the whole file into memory (as is done when you use file.readlines()). Instead, we process each line as the file is read in. I use the fileinput module here for brevity, but one can also use line = file.readline() and a while line: loop.
We stop iterating through the necessary list once a match is found.
We modified the regex pattern and use re.match instead of re.findall. That's assuming that each line would only contain one "fruit=..." entry.
update
If the format of the input file is consistent, you can squeeze out a little more performance by getting rid of regex altogether.
try:
    # with line = "2 asdasd fruit=SOMETHING asdasd...."
    key = line.split(" ", 3)[2].split("=")[1]
except:
    continue # no match
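For instance, on one of the sample lines from the question:

line = "2 sdhfkjk fruit=GREENORANGE lkjfldk"
key = line.split(" ", 3)[2].split("=")[1]
print key   # GREENORANGE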
filtered = []
for line in open('huge_file'):
    found = regexp.findall(line)
    if found:
        fruit = found[0]
        for x in necessary:
            if fruit.startswith(x):
                filtered.append(line)
                break
or maybe :
necessary = ['fruit=%s' % x for x in necessary]

filtered = []
for line in open('huge_file'):
    for x in necessary:
        if x in line:
            filtered.append(line)
            break
I'd make a simple list of ['fruit=RED', 'fruit=GREEN', ...] etc. with ['fruit='+n for n in necessary], then use in rather than a regex to test them. I don't think there's any way to do it really quickly, though.
filtered = (line for line in f if any(a in line for a in necessary_simple))
(The any() function is doing the same thing as your any_it() function)
Oh, and get rid of file.readlines(), just iterate over the file.
Untested code:
filtered = []
for line in lines:
    value = line.split('=', 1)[1].split(' ', 1)[0]
    if value not in necessary:
        filtered.append(line)
That should be faster than pattern matching 10 000 patterns onto a line.
Possibly there are even faster ways. :)
It shouldn't take too long to iterate through 100,000 strings, but I see you have a 10,000-string list, which means you iterate over the strings 10,000 * 100,000 = 1,000,000,000 times, so I don't know what you expected...
As for your question: if you encounter a word from the list and you only need one or more matches (if you want exactly one, you need to iterate through the whole list), you can skip the rest; that should optimize the search operation.

Categories

Resources