Transform character array into integers with python

I have a piece of data which is in the form of character array:
cgcgcg
aacacg
cgcaag
cgcacg
agaacg
cacaag
agcgcg
cgcaca
cacaca
agaacg
cgcacg
cgcgaa
Notice that each column consists of only two types of characters. I need to transform them into the integers 0 or 1, based on their frequency in the column. For instance, in the 1st column there are 8 c's and 4 a's, so c is in the majority; it should be coded as 0 and the other character as 1.
Using zip() I can transpose this array in python, and get each column into a list:
In [28]: lines = [l.strip() for l in open(inputfn)]
In [29]: list(zip(*lines))
Out[29]:
[('c', 'a', 'c', 'c', 'a', 'c', 'a', 'c', 'c', 'a', 'c', 'c'),
('g', 'a', 'g', 'g', 'g', 'a', 'g', 'g', 'a', 'g', 'g', 'g'),
('c', 'c', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'a', 'c', 'c'),
('g', 'a', 'a', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'a', 'g'),
('c', 'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'c', 'a'),
('g', 'g', 'g', 'g', 'g', 'g', 'g', 'a', 'a', 'g', 'g', 'a')]
It's not necessary to transform them strictly into integers, i.e. 'c' to '0' or 'c' to int(0) will both be ok, since we are going to write them to a tab delimited file anyway.

Something like this:
lis = [('c', 'a', 'c', 'c', 'a', 'c', 'a', 'c', 'c', 'a', 'c', 'c'),
('g', 'a', 'g', 'g', 'g', 'a', 'g', 'g', 'a', 'g', 'g', 'g'),
('c', 'c', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'a', 'c', 'c'),
('g', 'a', 'a', 'a', 'a', 'a', 'g', 'a', 'a', 'a', 'a', 'g'),
('c', 'c', 'a', 'c', 'c', 'a', 'c', 'c', 'c', 'c', 'c', 'a'),
('g', 'g', 'g', 'g', 'g', 'g', 'g', 'a', 'a', 'g', 'g', 'a')]
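def solve(lis):
    for row in lis:
        # each column is assumed to contain exactly two distinct characters
        item1, item2 = set(row)
        c1, c2 = row.count(item1), row.count(item2)
        # the majority character maps to 0, the minority to 1
        dic = {item1: int(c1 < c2), item2: int(c2 < c1)}
        yield [dic[x] for x in row]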
>>> list(solve(lis))
[[0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]
Using collections.Counter:
from collections import Counter
from pprint import pprint
def solve(lis):
    for row in lis:
        c = Counter(row)
        maxx = max(c.values())
        # characters with the highest count get 0, all others 1
        yield [int(c[x] < maxx) for x in row]
>>> pprint(list(solve(lis)))
[[0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]
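Putting the pieces together, here is a minimal sketch of the whole pipeline from the question: transpose the rows, encode each column, transpose back and print tab-delimited text. The names encode_column and encode_rows are illustrative and not from the answer above. The tie-breaking rule (on an equal split, the character seen first in the column gets 0) is an assumption, since the question does not say what should happen in that case. A column containing a single character is encoded as all 0.

```python
from collections import Counter

def encode_column(column):
    """Return 0 for the majority character in a column and 1 for any other."""
    # most_common(1) gives the most frequent character; on a tie, Counter
    # keeps first-seen order (Python 3.7+), so the first-seen character wins.
    majority = Counter(column).most_common(1)[0][0]
    return [0 if ch == majority else 1 for ch in column]

def encode_rows(lines):
    """Encode each column of equal-length strings, then restore row order."""
    encoded_columns = [encode_column(col) for col in zip(*lines)]
    return [list(row) for row in zip(*encoded_columns)]

lines = ["cgcgcg", "aacacg", "cgcaag", "cgcacg"]
for row in encode_rows(lines):
    # one tab-delimited line per original row
    print("\t".join(map(str, row)))
```

The same loop could write to a file object instead of printing.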

Related

Convert a list of string to category integer in Python

Given a list of strings,
['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
I would like to convert it to integer-category form:
[0, 0, 2, 0, 0, 0, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 1, 1, 1, 3, 1, 1, 1]
This can be achieved using numpy.unique as below:
import numpy as np
ipt = ['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
_, opt = np.unique(np.array(ipt), return_inverse=True)
But I am curious whether there is an alternative that does not require importing numpy.

If you are solely interested in finding integer representation of factors, then you can use a dict comprehension along with enumerate to store the mapping, after using set to find unique values:
lst = ['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
d = {x: i for i, x in enumerate(set(lst))}
lst_new = [d[x] for x in lst]
print(lst_new)
# e.g. [3, 3, 0, 3, 3, 3, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 1, 1, 1, 2, 1, 1, 1] (the exact integers depend on set ordering and may differ between runs)
This approach can be used for general factors, i.e., the factors do not have to be 'a', 'b' and so on, but can be 'dog', 'bus', etc. One drawback is that the integers assigned depend on the arbitrary iteration order of the set. If you want the integers to follow the sorted order of the factors, you can use sorted:
d = {x: i for i, x in enumerate(sorted(set(lst)))}
lst_new = [d[x] for x in lst]
print(lst_new)
# [0, 0, 2, 0, 0, 0, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 1, 1, 1, 3, 1, 1, 1]
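If "order" means order of first appearance rather than alphabetical order, a sketch along the following lines may help. It relies on dict.fromkeys keeping insertion order (guaranteed from Python 3.7); the example list and variable names are illustrative.

```python
lst = ['b', 'b', 'dog', 'a', 'dog', 'b', 'a']

# dict.fromkeys drops duplicates while keeping first-appearance order.
codes = {value: i for i, value in enumerate(dict.fromkeys(lst))}
lst_new = [codes[value] for value in lst]
print(lst_new)
# [0, 0, 1, 2, 1, 0, 2]
```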

You could take a note out of the functional programming book:
ipt=['a', 'a', 'c', 'a', 'a', 'a', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'd', 'd', 'd', 'd', 'c', 'b', 'b', 'b', 'd', 'b', 'b', 'b']
opt = list(map(lambda x: ord(x)-97, ipt))
This code iterates through the input list and passes each element through the lambda function, which takes the ASCII value of the character and subtracts 97, mapping the lowercase letters 'a' to 'z' to 0-25.
If each string isn't a single character, then the lambda function may need to be adapted.

You could write a custom function to do the same thing as you are using numpy.unique() for.
def unique(my_list):
    ''' Takes a list and returns two lists, a list of each unique entry and the index of
    each unique entry in the original list
    '''
    unique_list = []
    int_cat = []
    for item in my_list:
        if item not in unique_list:
            unique_list.append(item)
        int_cat.append(unique_list.index(item))
    return unique_list, int_cat
Or if you wanted your indexing to be ordered.
def unique_ordered(my_list):
    ''' Takes a list and returns two lists, an ordered list of each unique entry and the
    index of each unique entry in the original list
    '''
    # Unique list
    unique_list = []
    for item in my_list:
        if item not in unique_list:
            unique_list.append(item)
    # Sorting unique list alphabetically
    unique_list.sort()
    # Integer category list
    int_cat = []
    for item in my_list:
        int_cat.append(unique_list.index(item))
    return unique_list, int_cat
Comparing the computation time for these two vs numpy.unique() for 100,000 iterations of your example list, we get:
numpy = 2.236004s
unique = 0.460719s
unique_ordered = 0.505591s
This shows that either option is faster than numpy for simple lists. More complicated strings slow down unique() and unique_ordered() much more than numpy.unique(). Doing 10,000 iterations of a random, 100-element list of 20-character strings, we get times of:
numpy = 0.45465s
unique = 1.56963s
unique_ordered = 1.59445s
So if efficiency was important and your list had more complex/a larger variety of strings, it would likely be better to use numpy.unique()
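Part of the slowdown with a larger variety of strings probably comes from "item not in unique_list" and unique_list.index(item), which both scan the list linearly. A possible variant, sketched below under that assumption, keeps the codes in a dict so that each lookup is a hash lookup. unique_dict is a hypothetical name, and no timings were measured for it.

```python
def unique_dict(my_list):
    """Return (unique entries in first-seen order, integer category per item)."""
    codes = {}      # maps each entry to its integer category
    int_cat = []
    for item in my_list:
        # setdefault assigns the next free integer the first time an item is seen
        int_cat.append(codes.setdefault(item, len(codes)))
    return list(codes), int_cat

unique_list, int_cat = unique_dict(['a', 'a', 'c', 'a', 'd', 'c', 'b'])
print(unique_list)  # ['a', 'c', 'd', 'b']
print(int_cat)      # [0, 0, 1, 0, 2, 1, 3]
```

Like unique() above, this numbers the entries in order of first appearance rather than alphabetically.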

Plotting strings in a list as datapoints in a linegraph

I have a list of lists as so:
lsofls = [0,0,0,0,0],["a",0,0,0,0],[0,"a",0,0,0],["b","a",0,0,0],["b",0,"a",0,0],["b",0,0,"a",0],[0,"b",0,"a",0],["c","b",0,"a",0],["c",0,"b","a",0],[0,"c","b","a",0],["d","c","b","a",0], ["d","c","b",0,"a"],["d","c","b",0,0],["d","c",0,"b",0]
And I wish to plot this, whereby each string in each list acts as its own datapoint. Each list in the list of lists is a point in time starting at t0 at the zeroth list. Each element in a list within the list of lists is a point in the sequence. I struggle to explain what I mean, but by printing the list of lists with each list as a new line it becomes clearer:
for s in lsofls:
    print(s)
This gives:
[0, 0, 0, 0, 0]
['a', 0, 0, 0, 0]
[0, 'a', 0, 0, 0]
['b', 'a', 0, 0, 0]
['b', 0, 'a', 0, 0]
['b', 0, 0, 'a', 0]
[0, 'b', 0, 'a', 0]
['c', 'b', 0, 'a', 0]
['c', 0, 'b', 'a', 0]
[0, 'c', 'b', 'a', 0]
['d', 'c', 'b', 'a', 0]
['d', 'c', 'b', 0, 'a']
['d', 'c', 'b', 0, 0]
['d', 'c', 0, 'b', 0]
I essentially want to rotate this output 90 degrees anticlockwise, as a linegraph.
I am unsure how to do this, as I usually plot using integers.
I hope I am being clear enough, I am unsure how to phrase the question.
EDIT:
The solution provided by @ce.teuf is very close to what I need. However, I need a string to be able to rejoin at position 1 in the graph. So if you look at this list here:
lsofls = [0, 0, 0, 0, 0], ['a', 0, 0, 0, 0], [0, 'a', 0, 0, 0], ['b', 'a', 0, 0, 0], ['b', 0, 'a', 0, 0], ['b', 0, 0, 'a', 0], [0, 'b', 0, 'a', 0], ['c', 'b', 0, 'a', 0], ['c', 0, 'b', 'a', 0], [0, 'c', 'b', 'a', 0], ['d', 'c', 'b', 'a', 0], ['d', 'c', 'b', 0, 'a'], ['d', 'c', 'b', 0, 0], ['d', 'c', 0, 'b', 0], ['d', 'c', 0, 0, 'b'], ['d', 0, 'c', 0, 'b'], [0, 'd', 'c', 0, 'b'], ['a', 'd', 'c', 0, 'b']
for s in lsofls:
    print(s)
So I need a way for each string to rejoin in the graph if that makes sense.

Using numpy, essentially:
import numpy as np
import matplotlib.pyplot as plt
z = [[0, 0, 0, 0, 0],
['a', 0, 0, 0, 0],
[0, 'a', 0, 0, 0],
['b', 'a', 0, 0, 0],
['b', 0, 'a', 0, 0],
['b', 0, 0, 'a', 0],
[0, 'b', 0, 'a', 0],
['c', 'b', 0, 'a', 0],
['c', 0, 'b', 'a', 0],
[0, 'c', 'b', 'a', 0],
['d', 'c', 'b', 'a', 0],
['d', 'c', 'b', 0, 'a'],
['d', 'c', 'b', 0, 0],
['d', 'c', 0, 'b', 0]]
flat_list = [item for sublist in z for item in sublist]
# every label except the 0 placeholder (set order is arbitrary, so filter explicitly)
series = sorted(set(flat_list) - {0})
y_len = len(z[0])
z2 = np.rot90(z)
for s in series:
    z3 = np.argwhere(z2 == s)
    z3[:, 0] = (z3[:, 0] - y_len) * -1
    y, x = z3[:, 0], z3[:, 1]
    plt.plot(np.sort(x[::-1]), y[::-1])
plt.show()
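For the "rejoin" case in the question's edit, one possible approach (a sketch, not part of the answer above) is to first split each label's positions into runs of consecutive time steps, so that a label that disappears and later reappears gets a separate line segment. label_segments is a hypothetical helper; it assumes 0 marks an empty slot and that a label appears at most once per time step. The frames list is a shortened, made-up example.

```python
def label_segments(frames, empty=0):
    """Map each label to a list of segments; a segment is a list of (t, position)."""
    segments = {}
    last_seen = {}  # time step at which each label was last present
    for t, frame in enumerate(frames):
        for pos, label in enumerate(frame):
            if label == empty:
                continue
            # start a new segment if the label was absent at the previous step
            if last_seen.get(label) != t - 1:
                segments.setdefault(label, []).append([])
            segments[label][-1].append((t, pos + 1))  # positions counted from 1
            last_seen[label] = t
    return segments

frames = [[0, 0, 0], ['a', 0, 0], [0, 'a', 0], ['b', 0, 'a'], ['b', 0, 0], ['a', 'b', 0]]
segments = label_segments(frames)
print(segments['a'])  # [[(1, 1), (2, 2), (3, 3)], [(5, 1)]]
print(segments['b'])  # [[(3, 1), (4, 1), (5, 2)]]
```

Each segment could then be drawn separately, for example with plt.plot(*zip(*segment), marker='o'), where the marker keeps single-point segments visible.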

One-liner for splicing lists

I am looking for a pythonic way to splice two lists based on the values in one of them. One-liner would be preferred.
Say we have
[0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
and
['a', 'b', 'c', 'd', 'e', 'f']
and the result has to look like this:
[0, 'a', 'b', 0, 0, 'c', 'd', 'e', 0, 'f']

You can use next with iter:
d = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
d1 = ['a', 'b', 'c', 'd', 'e', 'f']
new_d = iter(d1)
result = [i if not i else next(new_d) for i in d]
Output:
[0, 'a', 'b', 0, 0, 'c', 'd', 'e', 0, 'f']

One liner:
d = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
d1 = ['a', 'b', 'c', 'd', 'e', 'f']
print( [d1.pop(0) if i==1 else i for i in d] )
Prints:
[0, 'a', 'b', 0, 0, 'c', 'd', 'e', 0, 'f']
EDIT (More efficient approach):
d = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
d1 = ['a', 'b', 'c', 'd', 'e', 'f'][::-1]
# d1 is reversed, so pop() removes items in their original order in O(1)
print( [d1.pop() if i==1 else i for i in d] )

Similar to @Ajax1234's answer, but on a single line:
d = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
d1 = ['a', 'b', 'c', 'd', 'e', 'f']
result = [d[i] if not d[i] else d1[d[:i].count(1)] for i in range(len(d))]
Result:
[0, 'a', 'b', 0, 0, 'c', 'd', 'e', 0, 'f']
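The variants above differ mainly in whether they consume or mutate d1. As a sketch, the iterator approach can be wrapped in a small function that leaves both inputs untouched. splice is a hypothetical name, and raising ValueError when the values run out is an assumption about the desired behaviour.

```python
def splice(mask, values):
    """Replace each truthy entry of mask with the next item from values."""
    it = iter(values)
    try:
        return [next(it) if flag else flag for flag in mask]
    except StopIteration:
        # next() ran out of items before the mask was exhausted
        raise ValueError("mask has more truthy entries than values") from None

d = [0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
d1 = ['a', 'b', 'c', 'd', 'e', 'f']
print(splice(d, d1))
# [0, 'a', 'b', 0, 0, 'c', 'd', 'e', 0, 'f']
```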

iterating on next item in sublist with condition in python

I have a list that is sorted and grouped by the second element of each sublist, as below:
[[[2178393, 'a', 'online', 0, 20], [2178394, 'a', 'away', 0, 30], [2178395, 'a', 'away', 0, 40]],[[2178389, 'b', 'online', 0, 10], [2178390, 'b', 'online', 0, 15], [2178392, 'b', 'online', 1, 25], [2178391, 'b', 'online', 1, 30], [2178397, 'b', 'away', 1, 40]], [[2178388, 'c', 'online', 0, 15], [2178396, 'c', 'away', 0, 20], [2178402, 'c', 'online', 0,25], [2178408, 'c', 'online', 1, 50]]]
Above there are 3 sublists, each containing several lists. Within each sublist, I want to append the 5th element (index 4) of the next list to the current list.
The output should be:
[[[2178393, 'a', 'online', 0, 20,30], [2178394, 'a', 'away', 0, 30,40], [2178395, 'a', 'away', 0, 40]],[[2178389, 'b', 'online', 0, 10,15], [2178390, 'b', 'online', 0, 15,25], [2178392, 'b', 'online', 1, 25,30], [2178391, 'b', 'online', 1, 30,40], [2178397, 'b', 'away', 1, 40]], [[2178388, 'c', 'online', 0, 15,20], [2178396, 'c', 'away', 0, 20,25], [2178402, 'c', 'online', 0,25,50], [2178408, 'c', 'online', 1, 50]]]
Please help me.

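Here is code that does this in place (data is the nested list from the question; the name list is avoided because it shadows the built-in):
for outer in range(len(data)):
    for inner in range(len(data[outer]) - 1):
        data[outer][inner].append(data[outer][inner + 1][4])
This produces the desired output.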

Use a nested list comprehension along with zip_longest. This takes advantage of the fact that each of the innermost lists just needs the last element of the next list to be appended to it, with the last innermost list being unchanged.
from itertools import zip_longest
data = [[[2178393, 'a', 'online', 0, 20], [2178394, 'a', 'away', 0, 30], [2178395, 'a', 'away', 0, 40]],[[2178389, 'b', 'online', 0, 10], [2178390, 'b', 'online', 0, 15], [2178392, 'b', 'online', 1, 25], [2178391, 'b', 'online', 1, 30], [2178397, 'b', 'away', 1, 40]], [[2178388, 'c', 'online', 0, 15], [2178396, 'c', 'away', 0, 20], [2178402, 'c', 'online', 0,25], [2178408, 'c', 'online', 1, 50]]]
expected = [[[2178393, 'a', 'online', 0, 20,30], [2178394, 'a', 'away', 0, 30,40], [2178395, 'a', 'away', 0, 40]],[[2178389, 'b', 'online', 0, 10,15], [2178390, 'b', 'online', 0, 15,25], [2178392, 'b', 'online', 1, 25,30], [2178391, 'b', 'online', 1, 30,40], [2178397, 'b', 'away', 1, 40]], [[2178388, 'c', 'online', 0, 15,20], [2178396, 'c', 'away', 0, 20,25], [2178402, 'c', 'online', 0,25,50], [2178408, 'c', 'online', 1, 50]]]
result = [[bottom_list_first + bottom_list_second[-1:]
           for bottom_list_first, bottom_list_second
           in zip_longest(middle_list, middle_list[1:], fillvalue=[])]
          for middle_list in data]
print(result == expected)
Output:
True

Generate list of all combinations and maintain index position

I'm looking for a method to generate a list of all combinations in which each element keeps its original index position.
So far I've been using this method:
import itertools
stuff = ['A', 'B', 'C']
result = []
for L in range(0, len(stuff)+1):
    for subset in itertools.combinations(stuff, L):
        result.append(subset)
which gives:
[(),
('A',),
('B',),
('C',),
('A', 'B'),
('A', 'C'),
('B', 'C'),
('A', 'B', 'C')]
What I'm looking for is a solution that gives, or can be converted to, the list below:
[(0, 0, 0),
('A', 0, 0),
('B', 0, 0),
('C', 0, 0),
('A', 'B', 0),
('A', 0, 'C'),
(0, 'B', 'C'),
('A', 'B', 'C')]
Best,
Christian

Instead of directly appending the result, you can generate a new list in the desired format using a list comprehension [i if i in subset else 0 for i in stuff]
Here you go:
import itertools
stuff, result = ['A', 'B', 'C'], []
for L in range(0, len(stuff)+1):
    for subset in itertools.combinations(stuff, L):
        result.append([i if i in subset else 0 for i in stuff])
And now if you check the result,
>>> print(result)
[[0, 0, 0], ['A', 0, 0], [0, 'B', 0], [0, 0, 'C'], ['A', 'B', 0], ['A', 0, 'C'], [0, 'B', 'C'], ['A', 'B', 'C']]
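Note that "i in subset" tests by value, so it cannot tell repeated items in stuff apart. A possible variant, sketched here with the hypothetical name masked_combinations, picks combinations of indices instead and yields tuples as in the question.

```python
from itertools import combinations

def masked_combinations(items, fill=0):
    """Yield every combination of items, with fill in the unselected positions."""
    n = len(items)
    for size in range(n + 1):
        # choose positions rather than values, so duplicates stay distinguishable
        for chosen in combinations(range(n), size):
            keep = set(chosen)
            yield tuple(item if i in keep else fill for i, item in enumerate(items))

for combo in masked_combinations(['A', 'B', 'C']):
    print(combo)
```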
