How to extract a specific string into multiple variables - Python

I am trying to extract specific lines as variables from a file.
This is the content of my test.txt:
#first set
Task Identification Number: 210CT1
Task title: Assignment 1
Weight: 25
fullMark: 100
Description: Program and design and complexity running time.
#second set
Task Identification Number: 210CT2
Task title: Assignment 2
Weight: 25
fullMark: 100
Description: Shortest Path Algorithm
#third set
Task Identification Number: 210CT3
Task title: Final Examination
Weight: 50
fullMark: 100
Description: Close Book Examination
This is my code:
with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        taskNumber, taskTitle, weight, fullMark, desc = line.strip(' ').split(": ")
        print(taskNumber)
        print(taskTitle)
        print(weight)
        print(fullMark)
        print(desc)
Here is what I'm trying to do:
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time
and loop until the third set
but an error occurs in the output:
ValueError: not enough values to unpack (expected 5, got 2)
Response to SwiftsNamesake
I tried out your code and I am still getting an error:
ValueError: too many values to unpack (expected 5)
This is my attempt using your code:
from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

with open(home + '\\Desktop\\PADS Assignment\\210CT.txt', 'r') as mod:
    for group in chunks(mod.readlines(), 5+2, fillvalue=''):
        # Choose the item after the colon, excluding the extraneous rows
        # that don't have one.
        # You could probably find a more elegant way of achieving the same thing
        l = [item.split(': ')[1].strip() for item in group if ':' in item]
        taskNumber, taskTitle, weight, fullMark, desc = l
        print(taskNumber, taskTitle, weight, fullMark, desc, sep='|')

As previously mentioned, you need some sort of chunking. To chunk it usefully we'd also need to ignore the irrelevant lines of the file. I've implemented such a function with some nice Python witchcraft below.
It might also suit you to use a namedtuple to store the values. A namedtuple is a pretty simple type of object, that just stores a number of different values - for example, a point in 2D space might be a namedtuple with an x and a y field. This is the example given in the Python documentation. You should refer to that link for more info on namedtuples and their uses, if you wish. I've taken the liberty of making a Task class with the fields ["number", "title", "weight", "fullMark", "desc"].
As your variables are all properties of a task, using a named tuple might make sense in the interest of brevity and clarity.
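For reference, the 2D point example from the Python documentation looks like this:

from collections import namedtuple

# A lightweight, immutable record type with named fields
Point = namedtuple("Point", ["x", "y"])

p = Point(x=11, y=22)
print(p.x, p.y)  # 11 22
x, y = p         # namedtuples also unpack like ordinary tuples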
Aside from that, I've tried to generally stick to your approach, splitting by the colon. My code produces the output
================================================================================
number is 210CT1
title is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
================================================================================
number is 210CT2
title is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
================================================================================
number is 210CT3
title is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
which seems to be roughly what you're after - I'm not sure how strict your output requirements are. It should be relatively easy to modify to that end, though.
Here is my code, with some explanatory comments:
from collections import namedtuple

# defines a simple class 'Task' which stores the given properties of a task
Task = namedtuple("Task", ["number", "title", "weight", "fullMark", "desc"])

# chunk a file (or any iterable) into groups of n (as an iterable of n-tuples)
def n_lines(n, read_file):
    return zip(*[iter(read_file)] * n)

# used to strip out empty lines and lines beginning with #, as those don't appear to contain any information
def line_is_relevant(line):
    return line.strip() and line[0] != '#'

with open("input.txt") as in_file:
    # filters the file for relevant lines, and then chunks into 5 lines
    for task_lines in n_lines(5, filter(line_is_relevant, in_file)):
        # for each line of the task, strip it, split it by the colon and take the second element
        # (ie the remainder of the string after the colon), and build a Task from this
        task = Task(*(line.strip().split(": ")[1] for line in task_lines))
        # just to separate each parsed task
        print("=" * 80)
        # iterate over the field names and values in the task, and print them
        for name, value in task._asdict().items():
            print("{} is {}".format(name, value))
You can also reference each field of the Task, like this:
print("The number is {}".format(task.number))
If the namedtuple approach is not desired, feel free to replace the content of the main for loop with
taskNumber, taskTitle, weight, fullMark, desc = (line.strip().split(": ")[1] for line in task_lines)
and then your code will be back to normal.
Some notes on other changes I've made:
filter does what it says on the tin, only iterating over lines that meet the predicate (line_is_relevant(line) is True).
The * in the Task instantiation unpacks the iterator, so each parsed line is an argument to the Task constructor.
The expression (line.strip().split(": ")[1] for line in task_lines) is a generator. This is needed because we're doing multiple lines at once with task_lines, so for each line in our 'chunk' we strip it, split it by the colon and take the second element, which is the value.
The n_lines function works by passing a list of n references to the same iterator to the zip function (documentation). zip then tries to yield the next element from each element of this list, but as each of the n elements is an iterator over the file, zip yields n lines of the file. This continues until the iterator is exhausted.
The line_is_relevant function uses the idea of "truthiness". A more verbose way to implement it might be
def line_is_relevant(line):
    return len(line.strip()) > 0 and line[0] != '#'
However, in Python, every object can implicitly be used in boolean logic expressions. An empty string ("") in such an expression acts as False, and a non-empty string acts as True, so conveniently, if line.strip() is empty it will act as False and line_is_relevant will therefore be False. The and operator will also short-circuit if the first operand is falsy, which means the second operand won't be evaluated and therefore, conveniently, the reference to line[0] will not cause an IndexError.
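A quick REPL illustration of both behaviours:
>>> bool("".strip())
False
>>> bool("# comment".strip())
True
>>> "" and ""[0]  # and short-circuits, so the indexing never runs
''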
Ok, here's my attempt at a more extended explanation of the n_lines function:
Firstly, the zip function lets you iterate over more than one 'iterable' at once. An iterable is something like a list or a file, that you can go over in a for loop, so the zip function can let you do something like this:
>>> for i in zip(["foo", "bar", "baz"], [1, 4, 9]):
...     print(i)
...
('foo', 1)
('bar', 4)
('baz', 9)
The zip function returns a 'tuple' of one element from each list at a time. A tuple is basically a list, except it's immutable, so you can't change it, as zip isn't expecting you to change any of the values it gives you, but to do something with them. A tuple can be used pretty much like a normal list apart from that. Now a useful trick here is using 'unpacking' to separate each of the bits of the tuple, like this:
>>> for a, b in zip(["foo", "bar", "baz"], [1, 4, 9]):
... print("a is {} and b is {}".format(a, b))
...
a is foo and b is 1
a is bar and b is 4
a is baz and b is 9
A simpler unpacking example, which you may have seen before (Python also lets you omit the parentheses () here):
>>> a, b = (1, 2)
>>> a
1
>>> b
2
Although the n_lines function doesn't use this. Now, zip can also work with more than one argument - you can zip three, four, or (pretty much) as many lists as you like.
>>> for i in zip([1, 2, 3], [0.5, -2, 9], ["cat", "dog", "apple"], "ABC"):
...     print(i)
...
(1, 0.5, 'cat', 'A')
(2, -2, 'dog', 'B')
(3, 9, 'apple', 'C')
Now the n_lines function passes *[iter(read_file)] * n to zip. There are a couple of things to cover here - I'll start with the second part. Note that the first * has lower precedence than everything after it, so it is equivalent to *([iter(read_file)] * n). Now, what iter(read_file) does, is constructs an iterator object from read_file by calling iter on it. An iterator is kind of like a list, except you can't index it, like it[0]. All you can do is 'iterate over it', like going over it in a for loop. It then builds a list of length 1 with this iterator as its only element. It then 'multiplies' this list by n.
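For example, here is an iterator being consumed step by step:
>>> it = iter([10, 20, 30])
>>> next(it)
10
>>> next(it)
20
>>> list(it)  # whatever hasn't been consumed yet
[30]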
In Python, using the * operator with a list concatenates it to itself n times. If you think about it, this kind of makes sense as + is the concatenation operator. So, for example,
>>> [1, 2, 3] * 3 == [1, 2, 3] + [1, 2, 3] + [1, 2, 3] == [1, 2, 3, 1, 2, 3, 1, 2, 3]
True
By the way, this uses Python's chained comparison operators - a == b == c is equivalent to a == b and b == c, except b only has to be evaluated once, which shouldn't matter 99% of the time.
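For example:
>>> a, b, c = 1, 1, 2
>>> a == b == c
False
>>> 1 <= 2 < 3  # chaining works with mixed operators too
True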
Anyway, we now know that the * operator copies a list n times. It also has one more property - it doesn't build any new objects. This can be a bit of a gotcha -
>>> l = [object()] * 3
>>> id(l[0])
139954667810976
>>> id(l[1])
139954667810976
>>> id(l[2])
139954667810976
Here l is a list of three elements - but they're all in reality the same object (you might think of this as three 'pointers' to the same object). If you were to build a list of more complex objects, such as lists, and perform an in-place operation like sorting them, it would affect all elements of the list.
>>> l = [ [3, 2, 1] ] * 4
>>> l
[[3, 2, 1], [3, 2, 1], [3, 2, 1], [3, 2, 1]]
>>> l[0].sort()
>>> l
[[1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2, 3]]
So [iter(read_file)] * n is equivalent to
it = iter(read_file)
l = [it, it, it, it... n times]
Now the very first *, the one with the low precedence, 'unpacks' this, again, but this time doesn't assign it to a variable, but to the arguments of zip. This means zip receives each element of the list as a separate argument, instead of just one argument that is the list. Here is an example of how unpacking works in a simpler case:
>>> def f(a, b):
...     print(a + b)
...
>>> f([1, 2])  # doesn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: f() missing 1 required positional argument: 'b'
>>> f(*[1, 2])  # works just like f(1, 2)
3
So in effect, now we have something like
it = iter(read_file)
return zip(it, it, it... n times)
Remember that when you 'iterate' over a file object in a for loop, you iterate over each line of the file, so when zip tries to 'go over' each of the n objects at once, it draws one line from each object - but because each object is the same iterator, this line is 'consumed' and the next line it draws is the next line from the file. One 'round' of iteration from each of its n arguments yields n lines, which is what we want.
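You can see the whole trick in miniature with a plain list standing in for the file:
>>> lines = ["line 1\n", "line 2\n", "line 3\n", "line 4\n"]
>>> it = iter(lines)
>>> list(zip(it, it))  # the same iterator twice -> pairs of consecutive lines
[('line 1\n', 'line 2\n'), ('line 3\n', 'line 4\n')]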

Your line variable gets only Task Identification Number: 210CT1 as its first input. You're trying to extract 5 values from it by splitting it by :, but there are only 2 values there.
What you want is to divide your for loop into 5, read each set as 5 lines, and split each line by :.
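As a minimal sketch of that idea (assuming, as in your file, that every set is exactly five key: value lines once the # line and any blank line are skipped; the task_sets helper name is made up for illustration):

from itertools import islice

def task_sets(path):
    with open(path) as f:
        # keep only the real data lines
        data_lines = (ln for ln in f if ln.strip() and not ln.startswith("#"))
        while True:
            chunk = list(islice(data_lines, 5))  # grab the next five lines
            if not chunk:
                break
            yield [ln.split(": ", 1)[1].strip() for ln in chunk]

for taskNumber, taskTitle, weight, fullMark, desc in task_sets("test.txt"):
    print(taskNumber, taskTitle, weight, fullMark, desc)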

The problem here is that you are splitting the lines by :, and each line contains only one :, so there are only 2 values.
In this line:
taskNumber , taskTile , weight, fullMark , desc = line.strip(' ').split(": ")
you are telling it that there are 5 values, but it only finds 2, so it raises an error.
One way to fix this is to check the first word of each line and sort the data into different lists, since you are not allowed to change the format of the file:
import re

Identification = []
title = []
weight = []
fullmark = []
Description = []

with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for line in mod:
        list_of_line = re.findall(r'\w+', line)
        if len(list_of_line) == 0:
            pass
        else:
            if list_of_line[0] == 'Task':
                if list_of_line[1] == 'Identification':
                    Identification.append(line[28:-1])
                if list_of_line[1] == 'title':
                    title.append(line[12:-1])
            if list_of_line[0] == 'Weight':
                weight.append(line[8:-1])
            if list_of_line[0] == 'fullMark':
                fullmark.append(line[10:-1])
            if list_of_line[0] == 'Description':
                Description.append(line[13:-1])

print('taskNumber is %s' % Identification[0])
print('taskTitle is %s' % title[0])
print('Weight is %s' % weight[0])
print('fullMark is %s' % fullmark[0])
print('desc is %s' % Description[0])
print('\n')
print('taskNumber is %s' % Identification[1])
print('taskTitle is %s' % title[1])
print('Weight is %s' % weight[1])
print('fullMark is %s' % fullmark[1])
print('desc is %s' % Description[1])
print('\n')
print('taskNumber is %s' % Identification[2])
print('taskTitle is %s' % title[2])
print('Weight is %s' % weight[2])
print('fullMark is %s' % fullmark[2])
print('desc is %s' % Description[2])
print('\n')
Of course you can use a loop for the prints, but I was too lazy, so I copied and pasted :).
If you need any help or have any questions, please ask!
This code assumes that you are not that advanced in coding.
Good luck!

As another poster (@Cuber) has already stated, you're looping over the lines one by one, whereas the data sets are split across five lines. The error message is essentially stating that you're trying to unpack five values when all you have is two. Furthermore, it looks like you're only interested in the value on the right-hand side of the colon, so you really only have one value.
There are multiple ways of resolving this issue, but the simplest is probably to group the data into fives (plus the padding, making it seven) and process it in one go.
First we'll define chunks, with which we'll turn this somewhat fiddly process into one elegant loop (from the itertools docs).
from itertools import zip_longest

def chunks(iterable, n, fillvalue=None):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)
Now, we'll use it with your data. I've omitted the file boilerplate.
for group in chunks(mod.readlines(), 5+2, fillvalue=''):
    # Choose the item after the colon, excluding the extraneous rows
    # that don't have one.
    # You could probably find a more elegant way of achieving the same thing
    l = [item.split(': ')[1].strip() for item in group if ':' in item]
    taskNumber, taskTitle, weight, fullMark, desc = l
    print(taskNumber, taskTitle, weight, fullMark, desc, sep='|')
The 2 in 5+2 is for the padding (the comment above and the empty line below).
The implementation of chunks may not make sense to you at the moment. If so, I'd suggest looking into Python generators (and the itertools documentation in particular, which is a marvellous resource). It's also a good idea to get your hands dirty and tinker with snippets inside the Python REPL.
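To get a feel for what chunks yields, you can try it on a small sequence in the REPL:
>>> list(chunks("ABCDEFG", 3, fillvalue='x'))
[('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]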

You can still read in lines one by one, but you will have to help the code understand what it's parsing. We can use an OrderedDict to lookup the appropriate variable name.
import os
import collections as ct

def printer(dict_, lookup):
    for k, v in lookup.items():
        print("{} is {}".format(v, dict_[k]))
    print()

names = ct.OrderedDict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark", "fullMark"),
    ("Description", "desc"),
])

filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    for line in f.readlines():
        line = line.strip()
        if line.startswith("#"):
            header = line
            d = {}
            continue
        elif line:
            k, v = line.split(":")
            d[k] = v.strip(" ")
        else:
            printer(d, names)
    printer(d, names)
Output
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination

You're trying to get more data than is present on one line; the five pieces of data are on separate lines.
As SwiftsNamesake suggested, you can use itertools to group the lines:
import itertools

def keyfunc(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # True when it finds this line.
    return line == "\n"

with open(home + '\\Desktop\\PADS Assignment\\test.txt', 'r') as mod:
    for k, g in itertools.groupby(mod, keyfunc):
        if not k:  # Does not process lines that are separators.
            for line in g:
                data = line.strip().partition(": ")
                print(f"{data[0]} is {data[2]}")
                # print(data[0] + " is " + data[2])  # If python < 3.6
            print("")  # Prints a newline to separate groups at the end of each group.
If you want to use the data in other functions, output it as a dictionary from a generator:
from collections import OrderedDict
import itertools

def isSeparator(line):
    # Ignores comments in the data file.
    if len(line) > 0 and line[0] == "#":
        return True
    # The separator is an empty line between the data sets, so it returns
    # True when it finds this line.
    return line == "\n"

def parseData(data):
    for line in data:
        k, s, v = line.strip().partition(": ")
        yield k, v

def readData(filePath):
    with open(filePath, "r") as mod:
        for key, g in itertools.groupby(mod, isSeparator):
            if not key:  # Does not process lines that are separators.
                yield OrderedDict((k, v) for k, v in parseData(g))

def printData(data):
    for d in data:
        for k, v in d.items():
            print(f"{k} is {v}")
            # print(k + " is " + v)  # If python < 3.6
        print("")  # Prints a newline to separate groups at the end of each group.

data = readData(home + '\\Desktop\\PADS Assignment\\test.txt')
printData(data)

Inspired by the itertools-related solutions, here is another using the more_itertools.grouper tool from the more-itertools library. It behaves similarly to @SwiftsNamesake's chunks function.
import collections as ct
import more_itertools as mit

names = dict([
    ("Task Identification Number", "taskNumber"),
    ("Task title", "taskTitle"),
    ("Weight", "weight"),
    ("fullMark", "fullMark"),
    ("Description", "desc"),
])

filepath = home + '\\Desktop\\PADS Assignment\\test.txt'
with open(filepath, "r") as f:
    lines = (line.strip() for line in f.readlines())
    for group in mit.grouper(7, lines):
        for line in group[1:]:
            if not line:
                continue
            k, v = line.split(":")
            print("{} is {}".format(names[k], v.strip()))
        print()
Output
taskNumber is 210CT1
taskTitle is Assignment 1
weight is 25
fullMark is 100
desc is Program and design and complexity running time.
taskNumber is 210CT2
taskTitle is Assignment 2
weight is 25
fullMark is 100
desc is Shortest Path Algorithm
taskNumber is 210CT3
taskTitle is Final Examination
weight is 50
fullMark is 100
desc is Close Book Examination
Care was taken to print the variable name with the corresponding value.

Related

Wrong list output compared to what was expected

So I have to iterate through this list, divide the even numbers by 2, and multiply the odd numbers by 3, but when I join the list together to print, it gives me [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]. I printed each value inside the loop to check for an arithmetic error, but it prints the correct value. Using lambda, I have found that it rewrites the data every time it is called, so I'm trying to find other ways to do this while still using the map function. The constraint for the code is that it needs to be done using a map function. Here is a snippet of the code:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
i = 0
while i < len(data):
    if (data[i] % 2) == 0:
        data_list1 = list(map(lambda a: a / 2, data))
        print(data_list1[i])
        i += 1
    else:
        data_list1 = list(map(lambda a: a * 3, data))
        print(data_list1[i])
        i += 1
print(list(data_list1))
Edit: Error has been fixed.
The easiest way for me to do this is as follows:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
for i in range(len(data)):
    if (data[i] % 2) == 0:
        data_list1 = data_list1 + [int(data[i] / 2)]
    elif (data[i] % 2) == 1:  # alternatively an else: would do, but only if every entry in data is an int()
        data_list1 = data_list1 + [data[i] * 3]
print(data_list1)
In your case a for loop makes the code much more easy to read, but a while loop works just as well.
In your original code the issue is your map() function. If you look into the documentation for it, you will see that map() affects every item in the iterable. You do not want this, instead you want to change only the specified entry.
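You can see the problem in miniature: one call to map transforms the whole list, so the odd entries get halved too:
>>> data = [1, 2, 3, 4]
>>> list(map(lambda a: a / 2, data))
[0.5, 1.0, 1.5, 2.0]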
Edit: If you want to use lambda for some reason, here's a (pretty useless) way to do it:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
data_list1 = []
for i in range(len(data)):
    if (data[i] % 2) == 0:
        x = lambda a: a / 2
        data_list1.append(x(data[i]))
    else:
        y = lambda a: a * 3
        data_list1.append(y(data[i]))
print(data_list1)
If you have additional design constraints, please specify them in your question, so we can help.
Edit 2: Once more unto the breach: since you added your constraints, here's how to do it with a mapping function:
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def changer(a):
    if a % 2 == 0:
        return a / 2
    else:
        return a * 3

print(list(map(changer, data)))
If you want it to be in a new list, just add data_list1=list(map(changer,data)).
Hope this is what you were looking for!
You can format the string for the output like this:
print(','.join(["%2.1f " % (.5*x if (x%2)==0 else 3*x) for x in data]))
From your latest comment I completely edited the answer below (old version can be found in the edit-history of this post).
From your update, I see that your constraint is to use map. So let's address how this works:
map is a function which exists in many languages, and it might be surprising at first because it takes a function as an argument. One analogy could be: you give a craftsman (the "map" function) pieces of metal (the list of values) and a tool (the function passed into "map"), and you tell him to use the tool on each piece of metal and give you back the modified pieces.
A very important thing to understand is that map takes a complete list/iterable and returns a new iterable all by itself. map takes care of the looping so you don't have to.
If you hand him a hammer as tool, each piece of metal will have a dent in it.
If you hand him a scriber, each piece of metal will have a scratch in it.
If you hand him a forge as tool, each piece of metal will be returned molten.
The core thing to understand here is that "map" will take any list (or more precisely, an "iterable") and will apply whatever function you give it to each item, then return the modified list (again, the return value is not really a list but a new "iterable").
So for example (using strings):
def scribe(piece_of_metal):
    """
    This function takes a string and appends " with a scratch" at the end
    """
    return "%s with a scratch" % piece_of_metal

def hammer(piece_of_metal):
    """
    This function takes a string and appends " with a dent" at the end
    """
    return "%s with a dent" % piece_of_metal

def forge(piece_of_metal):
    """
    This function takes a string and prepends it with "molten "
    """
    return "molten %s" % piece_of_metal

metals = ["iron", "gold", "silver"]
scribed_metals = map(scribe, metals)
dented_metals = map(hammer, metals)
molten_metals = map(forge, metals)

for row in scribed_metals:
    print(row)

for row in dented_metals:
    print(row)

for row in molten_metals:
    print(row)
I have deliberately not responded to the core of your question as it is homework, but I hope this post gives you a practical example of using map which helps with the exercise.
Another, more practical example, saving data to disk
The above example is deliberately contrived to keep it simple. But it's not very practical. Here is another example which could actually be useful: storing documents on disk. We assume that we have a function fetch_documents which returns a list of strings, where the strings are the text content of the documents. We want to store those into .txt files. As filenames we will use the MD5 hash of the contents. The reason MD5 is chosen is to keep things simple. This way we still only require one argument to the "mapped" function, and it is sufficiently unique to avoid overwrites:
from assume_we_have import fetch_documents
from hashlib import md5

def store_document(contents):
    """
    Store the contents into a unique filename and return the generated filename.
    """
    hash = md5(contents)
    filename = '%s.txt' % hash.hexdigest()
    with open(filename, 'w') as outfile:
        outfile.write(contents)
    return filename

documents = fetch_documents()
stored_filenames = map(store_document, documents)
The last line which is using map could be replaced with:
stored_filenames = []
for document in documents:
    filename = store_document(document)
    stored_filenames.append(filename)

Avoiding off-by-one errors when removing columns based on indices in a python list

I have a target file called TARGFILE of the form:
10001000020002002001100100200000111
10201001020000120210101100110010011
02010010200000011100012021001012021
00102000012001202100101202100111010
My idea here was to leave this as a string, and use slicing in python to remove the indices.
The removal will occur based on a list of integers called INDICES like so:
[1, 115654, 115655, 115656, 2, 4, 134765, 134766, 18, 20, 21, 23, 24, 17659, 92573, 30, 32, 88932, 33, 35, 37, 110463, 38, 18282, 46, 18458, 48, 51, 54]
I want to remove every position of every line in TARGFILE that matches INDICES. For instance, the first digit in INDICES is 1, so the first column of TARGFILE, containing 1, 1, 0, 0, would be removed. However, I am wary of doing this incorrectly due to off-by-one errors and changing index positions if everything is not removed at the same time.
Thus, a solution that removed every column from each row at the same time would likely be both much faster and safer than using a nested loop, but I am unsure of how to code this.
My code so far is here:
#!/usr/bin/env python

import fileinput

SRC_FILES = open('YCP.txt', 'r')
for line in SRC_FILES:
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split(' ')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    TARGFILE = open(EUR, 'r')
It seems to me that a solution using enumerate might be possible, but I have not found it, and that might be suboptimal in the first place...
EDIT: in response to concerns about memory: the longest lines are ~180,000 items, but I should be able to get this into memory without a problem; I have access to a cluster.
I like the simplicity of Peter's answer, even though it's currently off-by-one. My thought is that you can get rid of the index-shifting problem, by sorting INDICES, and doing the process from the back to the front. That led to remove_indices1, which is really inefficient. I think 2 is better, but simplest is 3, which is Peter's answer.
I may do timing in a bit for some large numbers, but my intuition says that my remove_indices2 will be faster than Peter's remove_indices3 if INDICES is very sparse. (Because you don't have to iterate over each character, but only over the indices that are being deleted.)
BTW - If you can sort INDICES once, then you don't need to make the local copy to sort/reverse, but I didn't know if you could do that.
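For instance, a sketch of that variant (remove_indices2_presorted is a made-up name; it assumes the caller sorts INDICES once, up front):

def remove_indices2_presorted(row, desc_indices):
    # desc_indices must already be sorted in descending order
    front = row
    chunks = []
    for i in desc_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

# sort once, reuse for every row:
# INDICES.sort(reverse=True)
# trimmed = [remove_indices2_presorted(row, INDICES) for row in rows]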
rows = [
    '0000000001111111111222222222233333333334444444444555555555566666666667',
    '1234567890123456789012345678901234567890123456789012345678901234567890',
]

def remove_nth_character(row, n):
    return row[:n-1] + row[n:]

def remove_indices1(row, indices):
    local_indices = indices[:]
    retval = row
    local_indices.sort()
    local_indices.reverse()
    for i in local_indices:
        retval = remove_nth_character(retval, i)
    return retval

def remove_indices2(row, indices):
    local_indices = indices[:]
    local_indices.sort()
    local_indices.reverse()
    front = row
    chunks = []
    for i in local_indices:
        chunks.insert(0, front[i:])
        front = front[:i-1]
    chunks.insert(0, front)
    return "".join(chunks)

def remove_indices3(row, indices):
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices)

indices = [1, 11, 4, 54, 33, 20, 7]

for row in rows:
    print remove_indices1(row, indices)
print ""

for row in rows:
    print remove_indices2(row, indices)
print ""

for row in rows:
    print remove_indices3(row, indices)
EDIT: Adding timing info, plus a new winner!
As I suspected, my algorithm (remove_indices2) wins when there aren't many indices to remove. It turns out that the enumerate-based one, though, gets worse even faster as there are more indices to remove. Here's the timing code (bigrows rows have 210000 characters):
import time

bigrows = []
for row in rows:
    bigrows.append(row * 30000)

for indices_len in [10, 100, 1000, 10000, 100000]:
    print "indices len: %s" % indices_len
    indices = range(indices_len)
    # for func in [remove_indices1, remove_indices2, remove_indices3, remove_indices4]:
    for func in [remove_indices2, remove_indices4]:
        start = time.time()
        for row in bigrows:
            func(row, indices)
        print "%s: %s" % (func.__name__, (time.time() - start))
And here are the results:
indices len: 10
remove_indices1: 0.0187089443207
remove_indices2: 0.00184297561646
remove_indices3: 1.40601491928
remove_indices4: 0.692481040955
indices len: 100
remove_indices1: 0.0974130630493
remove_indices2: 0.00125503540039
remove_indices3: 7.92742991447
remove_indices4: 0.679095029831
indices len: 1000
remove_indices1: 0.841033935547
remove_indices2: 0.00370812416077
remove_indices3: 73.0718669891
remove_indices4: 0.680690050125
So, why does 3 do so much worse? Well, it turns out that the in operator isn't efficient on a list. It's got to iterate through all of the list items to check. remove_indices4 is just 3 but converting indices to a set first, so the inner loop can do a fast hash-lookup, instead of iterating through the list:
def remove_indices4(row, indices):
    indices_set = set(indices)
    return ''.join(c for i, c in enumerate(row) if i+1 not in indices_set)
And, as I originally expected, this does better than my algorithm for high densities:
indices len: 10
remove_indices2: 0.00230097770691
remove_indices4: 0.686790943146
indices len: 100
remove_indices2: 0.00113391876221
remove_indices4: 0.665997982025
indices len: 1000
remove_indices2: 0.00296902656555
remove_indices4: 0.700706005096
indices len: 10000
remove_indices2: 0.074893951416
remove_indices4: 0.679219007492
indices len: 100000
remove_indices2: 6.65899395943
remove_indices4: 0.701599836349
If you've got fewer than 10000 indices to remove, 2 is fastest (even faster if you do the indices sort/reverse once outside the function). But, if you want something that is pretty stable in time, no matter how many indices, use 4.
The simplest way I can see would be something like:
>>> for line in TARGFILE:
...     print ''.join(c for i, c in enumerate(line) if (i+1) not in INDICES)
...
100000200020020100200001
100010200001202010110001
010102000000111021001021
000000120012021012100110
(Substituting print for writing to your output file etc)
This relies on being able to load each line into memory which may or may not be reasonable given your data.
Edit: explanation:
The first line is straightforward:
>>> for line in TARGFILE:
Just iterates through each line in TARGFILE. The second line is a bit more complex:
''.join(...) concatenates a list of strings together with an empty joiner (''). join is often used with a comma like: ','.join(['a', 'b', 'c']) == 'a,b,c', but here we just want to join each item to the next.
enumerate(...) takes an iterable and returns pairs of (index, item) for each item in the iterable. For example, list(enumerate('abc')) == [(0, 'a'), (1, 'b'), (2, 'c')].
So the line says,
Join together each character of line whose index are not found in INDICES
However, as John pointed out, Python indexes are zero-based, so we add 1 to the value from enumerate.
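Incidentally, enumerate takes an optional start argument, so the off-by-one can be folded in directly:
>>> ''.join(c for i, c in enumerate('abcdef', start=1) if i not in {2, 5})
'acdf'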
The script I ended up using is the following:
#!/usr/bin/env python

def remove_indices(row, indices):
    indices_set = set(indices)
    return ''.join(c for i, c in enumerate(row) if (i+1) in indices_set)

SRC_FILES = open('YCP2.txt', 'r')
CEUDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/CEU/PARSED/'
YRIDIR = '/USER/ScriptsAndLists/LAMP/LAMPLDv1.1/IN/aps/4bogdan/omni/YRI/PARSED/'

i = 0
for line in SRC_FILES:
    i += 1
    EUR_YRI_ADM = line.strip('\n')
    EUR, YRI, ADM = EUR_YRI_ADM.split('\t')
    ADMFO = open(ADM, 'r')
    lines = ADMFO.readlines()
    INDICES = [int(val) for val in lines[0].split()]
    INDEXSORT = sorted(INDICES, key=int)
    EURF = open(EUR, 'r')
    EURFOUT = open(CEUDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype in EURF:
        TRIMLINE = remove_indices(haplotype, INDEXSORT)
        EURFOUT.write(TRIMLINE + '\n')
    EURFOUT.close()
    AFRF = open(YRI, 'r')
    AFRFOUT = open(YRIDIR + 'chr' + str(i) + 'anc.hap.txt', 'a')
    for haplotype2 in AFRF:
        TRIMLINE = remove_indices(haplotype2, INDEXSORT)
        AFRFOUT.write(TRIMLINE + '\n')
    AFRFOUT.close()

Unable to parse just sequences from FASTA file

How can I remove ids like '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n' from sequences?
I have this code:
with open('sequence.fasta', 'r') as f:
    while True:
        line1 = f.readline()
        line2 = f.readline()
        line3 = f.readline()
        if not line3:
            break
        fct([line1[i:i+100] for i in range(0, len(line1), 100)])
        fct([line2[i:i+100] for i in range(0, len(line2), 100)])
        fct([line3[i:i+100] for i in range(0, len(line3), 100)])
Output:
['>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n']
['CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n']
['AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n']
['CCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAA\n']
['AGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGA\n']
['ATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGAT\n']
['AAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCA\n']
['GGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCC\n']
['AGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGT\n']
['TTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTT\n']
['GTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGAT\n']
['GTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC\n']
['\n']
...
My function is:
def fct(input_string):
    code = {"a": 0, "c": 1, "g": 2, "t": 3}
    p = [code[i] for i in input_string]
    n = len(input_string)
    c = 0
    for i, n in enumerate(range(n, 0, -1)):
        c += p[i]*(4**(n-1))
        return c+1
fct() returns an integer from a string. For example, ACT gives 8
i.e. my function must take as input string sequences containing just the bases A, C, G, T.
But when I use my function it gives:
KeyError: '>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n'
I tried to remove the ids by stripping out lines that start with > and writing the rest to a text file, so that my text file output.txt contains just the sequences without ids, but when I use my function fct I get the same kind of error:
KeyError: 'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n'
What can I do?
I see two major problems in your code: You're having problems parsing FASTA sequences, and your function is not properly iterating over each sequence.
Parsing FASTA data
Might I suggest using the excellent Biopython package? It has excellent FASTA support (reading and writing) built in (see Sequences in the Tutorial).
To parse sequences from a FASTA file:
from Bio import SeqIO

for record in SeqIO.parse("seqs.fasta", "fasta"):
    print record.description  # gi|2765658|emb|Z78533.1...
    print record.seq          # a Seq object, call str() to get a simple string

>>> print record.id
'gi|2765658|emb|Z78533.1|CIZ78533'
>>> print record.description
'gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA'
>>> print record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
>>> print str(record.seq)
'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACC'  # (truncated)
Iterating over sequence data
In your code, you have a list of strings being passed to fct() (input_string is not actually a string, but a list of strings). The solution is just to build one input string, and iterate over that.
Other errors in fct:
You need to capitalize the keys to your dictionary: case matters
You should have the return statement after the for loop. Keeping it nested means c is returned immediately.
Why bother constructing p when you can just index into code when iterating over the sequence?
You write over the sequence's length (n) by using it in your for loop as a variable name
Here is the modified code (with proper PEP 8 formatting), with variables renamed to make their meaning clearer (I still have no idea what c is supposed to be):
from Bio import SeqIO

def dna_seq_score(dna_seq):
    nucleotide_code = {"A": 0, "C": 1, "G": 2, "T": 3}
    c = 0
    for i, k in enumerate(range(len(dna_seq), 0, -1)):
        nucleotide = dna_seq[i]
        code_num = nucleotide_code[nucleotide]
        c += code_num * (4 ** (k - 1))
    return c + 1

for record in SeqIO.parse("test.fasta", "fasta"):
    dna_seq_score(record.seq)

Finding a value within a dictionary of ranges - python

I'm comparing 2 files with an initial identifier column, start value, and end value. The second file contains corresponding identifiers and another value column.
Ex.
File 1:
A 200 900
A 1000 1200
B 100 700
B 900 1000
File 2:
A 103
A 200
A 250
B 50
B 100
B 150
I would like to find all values from the second file that are contained within the ranges found in the first file so that my output would look like:
A 200
A 250
B 100
B 150
For now I have created a dictionary from the first file with a list of ranges:
Ex.
if Identifier in Dictionary:
    Dictionary[Identifier].extend(range(Start, (End+1)))
else:
    Dictionary[Identifier] = range(Start, (End+1))
I then go through the second file and search for the value within the dictionary of ranges:
Ex.
if Identifier in Dictionary:
    if Value in Dictionary[Identifier]:
        OutFile.write(Line + "\n")
While not optimal, this works for relatively small files; however, I have several large files, and this program is proving terribly inefficient. I need to optimize my program so that it will run much faster.
from collections import defaultdict

ident_ranges = defaultdict(list)

with open('file1.txt', 'r') as f1:
    for row in f1:
        ident, start, end = row.split()
        start, end = int(start), int(end)
        ident_ranges[ident].append((start, end))

with open('file2.txt', 'r') as f2, open('out.txt', 'w') as output:
    for line in f2:
        ident, value = line.split()
        value = int(value)
        if any(start <= value <= end for start, end in ident_ranges[ident]):
            output.write(line)
Notes: Using a defaultdict allows you to add ranges to your dictionary without first checking for the existence of a key. Using any allows for short-circuiting of the range check. Using chained comparison is a nice Python syntactic shortcut (start <= value <= end).
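A quick demonstration of all three idioms together:
>>> from collections import defaultdict
>>> d = defaultdict(list)
>>> d['A'].append((200, 900))  # no existence check needed
>>> d['A'].append((1000, 1200))
>>> any(start <= 250 <= end for start, end in d['A'])
True
>>> any(start <= 950 <= end for start, end in d['A'])
False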
Do you need to construct range(START, END)? That seems quite wasteful when you can do:
if START <= x <= END:
    # process
Checking if the value is in the range is slow because a) you've had to construct the list and b) perform a linear search over the list to find it.
You can try something like this:
In [27]: ranges = defaultdict(list)

In [28]: with open("file1") as f:
   ....:     for line in f:
   ....:         name, st, end = line.split()
   ....:         st, end = int(st), int(end)
   ....:         ranges[name].append([st, end])
   ....:

In [30]: ranges
Out[30]: defaultdict(<type 'list'>, {'A': [[200, 900], [1000, 1200]], 'B': [[100, 700], [900, 1000]]})

In [29]: with open("file2") as f:
   ....:     for line in f:
   ....:         name, val = line.split()
   ....:         val = int(val)
   ....:         if any(y[0] <= val <= y[1] for y in ranges[name]):
   ....:             print name, val
   ....:
A 200
A 250
B 100
B 150
Neat trick: Python lets you do in comparisons with xrange objects, which is much faster than doing in with a range, and much more memory efficient.
So, you can do
from collections import defaultdict

rangedict = defaultdict(list)
...
rangedict[ident].append(xrange(start, end+1))
...
for i in rangedict:
    for r in rangedict[i]:
        if v in r:
            print >>outfile, line
Since you've got large ranges and your problem is essentially just a bunch of comparisons, it's almost certainly faster to store a start/end tuple than the whole range (especially since what you have now is going to duplicate most of the numbers in the ranges if two happen to overlap).
# Building the dict
if ident not in d:
    d[ident] = (lo, hi)
else:
    old_lo, old_hi = d[ident]
    d[ident] = (min(lo, old_lo), max(hi, old_hi))
Then your comparisons just look like:
# comparing...
if ident in d:
    if d[ident][0] <= val <= d[ident][1]:
        outfile.write(line + '\n')
Both parts of this will go faster if you aren't making separate checks for if ident in d. Python dictionaries are nice and fast, so just make the call to it in the first place. You've got the ability to provide defaults to the dictionary, so use it. I haven't benchmarked this or anything to see what the speedup is, but you'd certainly get some, and it certainly works:
# These both make use of the following somewhat silly hack:
# In Python, None is treated as less than everything (even -float('inf'))
# and empty containers (e.g. (), [], {}) are treated as greater than everything.
# So we use the tuple ((), None) as if it was (float('inf'), float('-inf'))
for line in file1:
    ident, lo, hi = line.split()
    lo = int(lo)
    hi = int(hi)
    old_lo, old_hi = d.get(ident, ((), None))
    d[ident] = (min(lo, old_lo), max(hi, old_hi))

# comparing:
for line in file2:
    ident, val = line.split()
    val = int(val)
    lo, hi = d.get(ident, ((), None))
    if lo <= val <= hi:
        outfile.write(line)  # unless you stripped it off, this still has a \n
The above code is what I was using to test; it runs on a file2 of a million lines in a couple seconds.

translate my sequence?

I have to write a script to translate this sequence:
dict = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
a=""
for y in range( 0, len ( seq)):
c=(seq[y:y+3])
#print(c)
for k, v in dict.items():
if seq[y:y+3] == k:
alle_amino = v[::3] #alle aminozuren op rijtje, a1.1 -a2.1- a.3.1-a1.2 enzo
print (v)
With this script I get the amino acids from the 3 frames under each other, but how can I sort this and get all the amino acids from frame 1 next to each other, and all the amino acids from frame 2 next to each other, and the same for frame 3?
For example, my results must be:
+3 SerIleLeuAlaStpProLysTrpGluProProTyrValAlaStpProIleTyrIleTyrTle
+2 PheAsnThrSerMetThrLysValGlyThrProLeuArgSerMetThrHisIleTyrIleTyr
+1 PheGlnTyrStpHisAspGlnSerGlyAsnProLeuThrStpHisAspProTyrIleTyrIle
TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA
I use Python 3.
I had one more question: can I get these results with some changes to my own script?
You can use the following (note this would be much easier using Biopython's translate method):
dicti = {your dictionary here}

def translate(seq):
    x = 0
    aaseq = []
    while True:
        try:
            aaseq.append(dicti[seq[x:x+3]])
            x += 3
        except (IndexError, KeyError):
            break
    return aaseq

seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"

for frame in range(3):
    print('+%i' % (frame+1), ''.join(item.split('|')[1] for item in translate(seq[frame:])))
Note I changed the name of your dictionary to dicti (so as not to shadow the built-in dict).
Some comments to help you understand:
translate takes your sequence and returns it as a list in which each item corresponds to the amino acid translation of the triplet coding that position, like:
aaseq = ["L|Leu", "L|Leu", "P|Pro", ...]
You could process this data further (taking only the one- or three-letter code) inside translate, or return it as it is and process it later, as I have done.
translate is called in
''.join(item.split('|')[1] for item in translate(seq[frame:]))
for each frame. For a frame value of 0, 1 or 2 it sends seq[frame:] as a parameter to translate. That is, you are sending the sequences corresponding to the three different reading frames and processing them in series. Then, in
''.join(item.split('|')[1]
I split the one- and three-letter codes for each amino acid and take the one at index 1 (the second). They are then joined into a single string.
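Concretely, for a single item:
>>> "F|Phe".split('|')
['F', 'Phe']
>>> "F|Phe".split('|')[1]
'Phe'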
Not too pretty, but it does what you want:
dct = {"TTT":"F|Phe","TTC":"F|Phe","TTA":"L|Leu","TTG":"L|Leu","TCT":"S|Ser","TCC":"S|Ser",
"TCA":"S|Ser","TCG":"S|Ser", "TAT":"Y|Tyr","TAC":"Y|Tyr","TAA":"*|Stp","TAG":"*|Stp",
"TGT":"C|Cys","TGC":"C|Cys","TGA":"*|Stp","TGG":"W|Trp", "CTT":"L|Leu","CTC":"L|Leu",
"CTA":"L|Leu","CTG":"L|Leu","CCT":"P|Pro","CCC":"P|Pro","CCA":"P|Pro","CCG":"P|Pro",
"CAT":"H|His","CAC":"H|His","CAA":"Q|Gln","CAG":"Q|Gln","CGT":"R|Arg","CGC":"R|Arg",
"CGA":"R|Arg","CGG":"R|Arg", "ATT":"I|Ile","ATC":"I|Ile","ATA":"I|Ile","ATG":"M|Met",
"ACT":"T|Thr","ACC":"T|Thr","ACA":"T|Thr","ACG":"T|Thr", "AAT":"N|Asn","AAC":"N|Asn",
"AAA":"K|Lys","AAG":"K|Lys","AGT":"S|Ser","AGC":"S|Ser","AGA":"R|Arg","AGG":"R|Arg",
"GTT":"V|Val","GTC":"V|Val","GTA":"V|Val","GTG":"V|Val","GCT":"A|Ala","GCC":"A|Ala",
"GCA":"A|Ala","GCG":"A|Ala", "GAT":"D|Asp","GAC":"D|Asp","GAA":"E|Glu",
"GAG":"E|Glu","GGT":"G|Gly","GGC":"G|Gly","GGA":"G|Gly","GGG":"G|Gly"}
seq = "TTTCAATACTAGCATGACCAAAGTGGGAACCCCCTTACGTAGCATGACCCATATATATATATATA"
def get_amino_list(s):
    for y in range(3):
        yield [s[x:x+3] for x in range(y, len(s) - 2, 3)]

for n, amn in enumerate(get_amino_list(seq), 1):
    print("+%d " % n + "".join(dct[x][2:] for x in amn))
print(seq)
Here's my solution. I've called your "dict" variable "aminos". The function method3 returns a list of the values to the right of the "|". To merge them into a single string, just join them on "".
From looking at your code, I believe that your aminos dict contains all possible three-letter combinations. Therefore, I've removed the checks that verify this. It should run a lot faster as a result.
def overlapping_groups(seq, group_len=3):
    """Yields successive groups of `group_len` adjacent items from a sequence,
    stepping a full group at a time so the codons don't overlap."""
    for i in range(0, len(seq) - group_len + 1, group_len):
        yield seq[i:i+group_len]

def method3(seq, aminos):
    return [aminos[k][2:] for k in overlapping_groups(seq, 3)]

for i in range(3):
    print("%d: %s" % (i, "".join(method3(seq[i:], aminos))))
