I have some extremely large lists of character strings I need to parse. I need to break them into smaller lists based on a pre-defined character string, and I figured out a way to do it, but I worry that this will not be performant on my real data. Is there a better way to do this?
My goal is to turn this list:
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
Into this list:
[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]
What I tried:
# List that replicates my data. `string_to_split_on` is a fixed character string I want to break my list up on
my_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
# Inspect list
print(my_list)
# Create empty lists to store data in
new_list = []
good_letters = []
# Iterate over each string in the list
for i in my_list:
    # If the string is the separator, append data to new_list, reset `good_letters` and move to the next string
    if i == 'string_to_split_on':
        new_list.append(good_letters)
        good_letters = []
        continue
    # Append letter to the list of good letters
    else:
        good_letters.append(i)
# I just like printing things this way because it's easy to read
for item in new_list:
    print(item)
    print('-'*100)
### Output
['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
['a', 'b']
----------------------------------------------------------------------------------------------------
['c', 'd', 'e', 'f', 'g']
----------------------------------------------------------------------------------------------------
['h', 'i', 'j', 'k']
----------------------------------------------------------------------------------------------------
You can also use one line of code:
original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g', 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']
split_string = 'string_to_split_on'
new_list = [sublist.split() for sublist in ' '.join(original_list).split(split_string) if sublist]
print(new_list)
This approach uses itertools.groupby and avoids joining everything into one big string, which should scale better on large data sets:
import itertools
new_list = [list(j) for k, j in itertools.groupby(original_list, lambda x: x != split_string) if k]
print(new_list)
[['a', 'b'], ['c', 'd', 'e', 'f', 'g'], ['h', 'i', 'j', 'k']]
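If the list is too large to hold a second full copy comfortably, a generator is another option. The sketch below is only an illustration (the name `split_on` is made up here, not from the posts above). It compares whole elements, so unlike the join/split one-liner it does not assume that elements are free of whitespace or of the separator text. Like the groupby version, it skips empty runs.

```python
from typing import Iterable, Iterator, List


def split_on(items: Iterable[str], separator: str) -> Iterator[List[str]]:
    """Yield the runs of items found between occurrences of separator."""
    chunk: List[str] = []
    for item in items:
        if item == separator:
            if chunk:  # skip empty runs, as the groupby version does
                yield chunk
            chunk = []
        else:
            chunk.append(item)
    if chunk:  # items after the last separator, if any
        yield chunk


original_list = ['a', 'b', 'string_to_split_on', 'c', 'd', 'e', 'f', 'g',
                 'string_to_split_on', 'h', 'i', 'j', 'k', 'string_to_split_on']

for chunk in split_on(original_list, 'string_to_split_on'):
    print(chunk)
```

Because it yields one chunk at a time, you can process each sublist and discard it instead of building the whole result first.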
I have a list that contains multiple strings created from a FASTA format file.
The list is like this:
data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
I want to get the characters at the first index of all the strings in the list and transfer them to another list and I do it like this:
list1 = []
x = 0  # index of the current string in data
z = 0  # character position to extract
while x < len(data):
    list1.append(data[x][z])
    x += 1
Now that I have the first index, how do I do that for every index of all the strings? Assuming they are all the same length.
Assuming they all have the same length, you can zip the strings.
The first string in the result contains all the first chars, the second all the second chars, and so on.
>>> res = ["".join(el) for el in zip(*data)]
>>> res
['AGAATAA', 'TGTATTT', 'CGGGGGG', 'CCGCGCG', 'AAAAACC', 'GATAAAA', 'CCCCCTC', 'TTTCTTT']
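One thing to keep in mind: zip() stops at the shortest input, so a truncated sequence would silently shorten every column. If that matters for your FASTA data, a small check up front makes the assumption explicit. This is just a sketch, and `columns` is a name invented for this example:

```python
def columns(seqs):
    """Return one string per character position, requiring equal-length inputs."""
    lengths = {len(s) for s in seqs}
    if len(lengths) > 1:
        raise ValueError(f"sequences differ in length: {sorted(lengths)}")
    return ["".join(col) for col in zip(*seqs)]


data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
print(columns(data))

# Without the check, zip() just truncates to the shortest string:
print(["".join(col) for col in zip('ATCC', 'GG', 'ATG')])  # ['AGA', 'TGT']
```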
If all your strings are of same length, you can use zip() to achieve this:
>>> data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
>>> my_lists = list(zip(*data))  # list() is needed in Python 3, where zip() returns an iterator
>>> my_lists[0] # chars from `0`th index of each string
('A', 'G', 'A', 'A', 'T', 'A', 'A')
>>> my_lists[1] # chars from `1`st index of each string
('T', 'G', 'T', 'A', 'T', 'T', 'T')
# ... so on
If you want each of these lists stored in separate variables, then you can also unpack these like:
a, b, c, d, e, f, g, h = zip(*data)
# where:
# a = ('A', 'G', 'A', 'A', 'T', 'A', 'A') ## chars from `0`th index
# b = ('T', 'G', 'T', 'A', 'T', 'T', 'T') ## chars from `1`st index
In case your strings are of different length, you can use itertools.zip_longest() in Python 3 (or itertools.izip_longest() in Python 2) as:
>>> from itertools import zip_longest # In Python 3
# OR, from itertools import izip_longest # In Python 2
>>> my_list = ['abc', 'de', 'fghi', 'j']
>>> list(zip_longest(*my_list, fillvalue=''))
[('a', 'd', 'f', 'j'), ('b', 'e', 'g', ''), ('c', '', 'h', ''), ('', '', 'i', '')]
Skipping fillvalue param in above example will fill the empty elements with None like this:
[('a', 'd', 'f', 'j'), ('b', 'e', 'g', None), ('c', None, 'h', None), (None, None, 'i', None)]
All you really need is a for loop that loops for every item in the list. This is how I would do it:
data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
first_index = []
for i in data:
    first_index.append(i[0])
print(first_index)
This outputs a list which looks like this:
['A', 'G', 'A', 'A', 'T', 'A', 'A']
So, I've taken your question to mean getting the first letter of each string in the data.
(ATCCAGCT).
The way I would do it is:
data = ['.', '..', '...']  # your values
list1 = []  # empty list
for x in data:  # for each entry
    list1.append(x[0])  # add the first letter
Use a list comprehension. If you want to use a loop, then deconstruct this to the loop form. You went to a lot of indirect effort in your posted code.
first = [seq[0] for seq in data]
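To extend the same idea from the first index to every index (which is what the question ends up asking), you can nest the comprehension. A minimal sketch, assuming all strings are the same length as `data[0]`; the name `by_position` is just for illustration:

```python
data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']

# One inner list per character position, built from every sequence in turn
by_position = [[seq[i] for seq in data] for i in range(len(data[0]))]

print(by_position[0])  # ['A', 'G', 'A', 'A', 'T', 'A', 'A']
print(by_position[1])  # ['T', 'G', 'T', 'A', 'T', 'T', 'T']
```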
data = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC', 'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
mydict = {}
for position in range(len(data[0])):
    mydict[str(position)] = []
for seq in data:
    for position, nucleotide in enumerate(seq):
        mydict[str(position)].append(nucleotide)
for position in mydict.keys():
    print(position, mydict[position], "".join(mydict[position]))
Output:
0 ['A', 'G', 'A', 'A', 'T', 'A', 'A'] AGAATAA
1 ['T', 'G', 'T', 'A', 'T', 'T', 'T'] TGTATTT
2 ['C', 'G', 'G', 'G', 'G', 'G', 'G'] CGGGGGG
3 ['C', 'C', 'G', 'C', 'G', 'C', 'G'] CCGCGCG
4 ['A', 'A', 'A', 'A', 'A', 'C', 'C'] AAAAACC
5 ['G', 'A', 'T', 'A', 'A', 'A', 'A'] GATAAAA
6 ['C', 'C', 'C', 'C', 'C', 'T', 'C'] CCCCCTC
7 ['T', 'T', 'T', 'C', 'T', 'T', 'T'] TTTCTTT
If I have a list
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
and I want to split it into new lists at each 'k' (dropping the 'k') and turn the result into a tuple, so that I get
(['a'],['b', 'c'], ['d', 'e', 'g'])
I am thinking about first splitting them into different list by using a for loop.
new_lst = []
for element in lst:
    if element != 'k':
        new_lst.append(element)
This does remove all the 'k's, but the remaining elements all end up together in one list. I do not know how to split them into different lists. To turn a list into a tuple of lists, I would need a list of lists:
a = [['a'],['b', 'c'], ['d', 'e', 'g']]
tuple(a) == (['a'], ['b', 'c'], ['d', 'e', 'g'])
True
So the question would be how to split the list into a list with sublist.
You are close. You can append to another list called sublist, and when you find a 'k', append sublist to new_lst:
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
new_lst = []
sublist = []
for element in lst:
    if element != 'k':
        sublist.append(element)
    else:
        new_lst.append(sublist)
        sublist = []
if sublist:  # add the last sublist
    new_lst.append(sublist)
result = tuple(new_lst)
print(result)
# (['a'], ['b', 'c'], ['d', 'e', 'g'])
If you're feeling adventurous, you can also use groupby. The idea is to group elements as "k" or "non-k" and use groupby on that property:
from itertools import groupby
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
result = tuple(list(gp) for is_k, gp in groupby(lst, "k".__eq__) if not is_k)
print(result)
# (['a'], ['b', 'c'], ['d', 'e', 'g'])
Thanks @YakymPirozhenko for the simpler generator expression
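If groupby is new to you, it may help to look at the raw groups before the filter is applied. This small sketch only prints what groupby produces for the same `lst`; nothing in it goes beyond the standard library:

```python
from itertools import groupby

lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']

# Each group is a run of consecutive elements sharing the same key value
groups = [(is_k, list(gp)) for is_k, gp in groupby(lst, "k".__eq__)]
for is_k, members in groups:
    print(is_k, members)
# False ['a']
# True ['k']
# False ['b', 'c']
# True ['k']
# False ['d', 'e', 'g']
```

Keeping only the groups where the key is False gives the sublists between the separators.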
tuple(list(i) for i in ''.join(lst).split('k'))
Output:
(['a'], ['b', 'c'], ['d', 'e', 'g'])
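A caveat on the join/split trick: it relies on every element being a single character. With longer elements the joined string loses the element boundaries. A quick sketch with made-up data to show the difference, next to an element-wise split via groupby:

```python
from itertools import groupby

lst = ['ab', 'k', 'cd']  # hypothetical data with two-character elements

# Joining first loses the element boundaries
joined = tuple(list(i) for i in ''.join(lst).split('k'))
print(joined)  # (['a', 'b'], ['c', 'd'])

# Comparing whole elements keeps them intact
elementwise = tuple(list(gp) for is_k, gp in groupby(lst, 'k'.__eq__) if not is_k)
print(elementwise)  # (['ab'], ['cd'])
```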
Here's a different approach, using re.split from the re module, and map:
import re
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
tuple(map(list, re.split('k',''.join(lst))))
(['a'], ['b', 'c'], ['d', 'e', 'g'])
smallerlist = [l.split(',') for l in ','.join(lst).split('k')]
print(smallerlist)
Outputs
[['a', ''], ['', 'b', 'c', ''], ['', 'd', 'e', 'g']]
Then you can strip the empty strings out of each sublist:
smallerlist = [' '.join(l).split() for l in smallerlist]
print(smallerlist)
Outputs
[['a'], ['b', 'c'], ['d', 'e', 'g']]
How about slicing, without appending and joining?
def isplit_list(lst, v):
    while True:
        try:
            end = lst.index(v)
        except ValueError:
            break
        yield lst[:end]
        lst = lst[end+1:]
    if len(lst):
        yield lst
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g', 'k']
results = tuple(isplit_list(lst, 'k'))
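One possible refinement, in case the list is very long: `lst = lst[end+1:]` copies the rest of the list on every separator. list.index() accepts a start position, so you can keep an offset instead. This is a sketch of that variant (the name `isplit_list_by_index` is invented here) and it should yield the same pieces:

```python
def isplit_list_by_index(lst, v):
    """Like isplit_list, but tracks a start offset instead of re-slicing the tail."""
    start = 0
    while True:
        try:
            end = lst.index(v, start)  # search only from the current offset
        except ValueError:
            break
        yield lst[start:end]
        start = end + 1
    if start < len(lst):  # leftover items after the last separator
        yield lst[start:]


lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g', 'k']
results = tuple(isplit_list_by_index(lst, 'k'))
print(results)  # (['a'], ['b', 'c'], ['d', 'e', 'g'])
```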
Try this, works and doesn't need any imports!
>>> l = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
>>> t = []
>>> for s in ''.join(l).split('k'):
...     t.append(list(s))
...
>>> t
[['a'], ['b', 'c'], ['d', 'e', 'g']]
>>> t = tuple(t)
>>> t
(['a'], ['b', 'c'], ['d', 'e', 'g'])
Why don't you make a method which will take a list as an argument and return a tuple like so.
>>> def list_to_tuple(l):
...     t = []
...     for s in l:
...         t.append(list(s))
...     return tuple(t)
...
>>> l = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
>>> l = ''.join(l).split('k')
>>> l = list_to_tuple(l)
>>> l
(['a'], ['b', 'c'], ['d', 'e', 'g'])
Another approach, using the third-party more_itertools package:
import more_itertools
lst = ['a', 'k', 'b', 'c', 'k', 'd', 'e', 'g']
print(tuple(more_itertools.split_at(lst, lambda x: x == 'k')))
gives
(['a'], ['b', 'c'], ['d', 'e', 'g'])
One of the review questions for my midterm is to understand what this function does and I cannot read it because I don't understand where the parameters come from and how it works. Can somebody with programming experience help?
def enigma(numList, n, pos):
    length = len(numList)
    if pos == length:
        print('Error')
        return
    newList = []
    for i in range(pos):
        newList = newList + [numList[i]]
    newList = newList + [n]
    tailLength = length - pos
    counter = tailLength
    while counter < length:
        newList = newList + [numList[counter]]
        counter = counter + 1
    return newList
Many times trying some test data reveals the functionality quickly:
>>> enigma('abcdefghijklm', 'X', 0)
['X']
>>> enigma('abcdefghijklm', 'X', 1)
['a', 'X', 'm']
>>> enigma('abcdefghijklm', 'X', 2)
['a', 'b', 'X', 'l', 'm']
>>> enigma('abcdefghijklm', 'X', 3)
['a', 'b', 'c', 'X', 'k', 'l', 'm']
>>> enigma('abcdefghijklm', 'X', 4)
['a', 'b', 'c', 'd', 'X', 'j', 'k', 'l', 'm']
>>> enigma('abcdefghijklm', 'X', 12)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'X', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm']
>>> enigma('abcdefghijklm', 'X', 13)
Error
The code starts with an empty newList and builds it up in three sections:
1. Repeating the first pos elements
2. Adding the n element
3. Repeating the trailing pos elements
As pos becomes larger than the midpoint, the leading and trailing sections cross over and repeat some items.
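Written with slices, the three sections are easier to see. This is only a sketch for checking your understanding, assuming 0 <= pos <= len(numList); `enigma_sliced` is a name made up for this example and is not part of the original exercise:

```python
def enigma_sliced(numList, n, pos):
    """Slice-based equivalent of enigma, assuming 0 <= pos <= len(numList)."""
    length = len(numList)
    if pos == length:
        print('Error')
        return None
    head = list(numList[:pos])           # first pos elements
    tail = list(numList[length - pos:])  # last pos elements
    return head + [n] + tail


print(enigma_sliced('abcdefghijklm', 'X', 3))
# ['a', 'b', 'c', 'X', 'k', 'l', 'm']
```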
I have a string like this:
G O S J A J E K R A L J
I would like to print it like this:
['G', 'O', 'S', 'J', 'A'....
I tried with:
print s,
print list(s),
Any ideas?
Try:
>>> l = "G O S J A J E K R A L J"
>>> l.split()
['G', 'O', 'S', 'J', 'A', 'J', 'E', 'K', 'R', 'A', 'L', 'J']
>>> ''.join(l.split())
'GOSJAJEKRALJ'
It seems that you are trying to split a string given the string and the delimiter that you wish to split on; in this case the space character. Python provides functionality to do this using the split method. A couple examples are as follows:
>>> s = "A B C D E"
>>> t = "A:B:C:D:E"
>>> s.split(" ")
['A', 'B', 'C', 'D', 'E']
>>> t.split(":")
['A', 'B', 'C', 'D', 'E']
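Note that split(" ") and split() with no argument are not quite the same. With an explicit " " every single space is a delimiter, so repeated or trailing spaces produce empty strings. With no argument, runs of whitespace are treated as one separator and leading/trailing whitespace is ignored. A short sketch with a made-up input:

```python
s = "G  O S "  # hypothetical input with a double space and a trailing space

explicit = s.split(" ")
default = s.split()

print(explicit)  # ['G', '', 'O', 'S', '']
print(default)   # ['G', 'O', 'S']
```

For input like the string in the question, split() without an argument is usually the safer choice.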
I think you are trying to split the string:
>>> s = "G O S J A J E K R A L J"
>>> s.split()
['G', 'O', 'S', 'J', 'A', 'J', 'E', 'K', 'R', 'A', 'L', 'J']
My answer would be the same: use split for that.
But another solution (just for fun) is [x for x in l if x != ' ']:
>>> l = "G O S J A J E K R A L J"
>>> [x for x in l if x != ' ']
['G', 'O', 'S', 'J', 'A', 'J', 'E', 'K', 'R', 'A', 'L', 'J']
>>> l.split()
['G', 'O', 'S', 'J', 'A', 'J', 'E', 'K', 'R', 'A', 'L', 'J']