I have a large file that is read into a DataFrame with a column 'features' whose values are string representations of lists. The elements of each "list" are sometimes strings and sometimes numbers, as shown below; in reality the lists can be very long, depending on the data source.
df = pd.DataFrame(["['a', 'b', 1, 2, 3, 'c', -5]",
"['a', 'b', 1, 2, 4, 'd', 3]",
"['a', 'b', 1, 2, 3, 'c', -5]"],
columns=['features'])
df
features
0 ['a', 'b', 1, 2, 3, 'c', -5]
1 ['a', 'b', 1, 2, 4, 'd', 3]
2 ['a', 'b', 1, 2, 3, 'c', -5]
# Looking at first two characters in first row for example--
df.features[0][0:2]
"['"
I am trying to use pd.json_normalize() to turn the column into a "flat table" so that it is easier to operate on various elements of the features column (not all of them, but different sets of them depending on the operation). However, I can't figure out how to get this to work.
How can I use json_normalize() properly here?
Above, you are storing each row as a single string, not as a list. What you should be doing is storing each row as an actual list.
import pandas as pd
df = pd.DataFrame({'features' : [['a', 'b', 1, 2, 3, 'c', -5],
['a', 'b', 1, 2, 4, 'd', 3],
['a', 'b', 1, 2, 3, 'c', -5]]})
will give you
features
0 [a, b, 1, 2, 3, c, -5]
1 [a, b, 1, 2, 4, d, 3]
2 [a, b, 1, 2, 3, c, -5]
Notice the missing quotes around the characters?
So now, when you run df.features[0][0:2]
you get
['a', 'b']
How are you getting the data for your DataFrame in the first place?
If you have to start from a DataFrame of strings like that, you can convert the column:
df = pd.DataFrame(["['a', 'b', 1, 2, 3, 'c', -5]",
"['a', 'b', 1, 2, 4, 'd', 3]",
"['a', 'b', 1, 2, 3, 'c', -5]"],
columns=['features'])
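# regex=False treats '[' and ']' as literal characters; note that every element, including the numbers, ends up as a string
df.features = df.features.str.replace(']', '', regex=False).str.replace('[', '', regex=False).str.replace(' ', '', regex=False).str.replace("'", '', regex=False).str.split(',')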
then df.features[0][0:2]
will give you
['a', 'b']
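As an alternative to stripping characters by hand, the standard library's ast.literal_eval can parse each string into a real list and keeps the numbers as numbers. A minimal sketch, assuming every row is a valid Python list literal; raw_rows and parsed_rows are names made up for this example:

```python
import ast

# Hypothetical sample rows, in the same shape as the 'features' column
raw_rows = [
    "['a', 'b', 1, 2, 3, 'c', -5]",
    "['a', 'b', 1, 2, 4, 'd', 3]",
]

# literal_eval only evaluates Python literals, so it is safer than eval()
parsed_rows = [ast.literal_eval(row) for row in raw_rows]

print(parsed_rows[0][0:2])  # ['a', 'b']
print(parsed_rows[0][-1])   # -5 (an int, not the string '-5')
```

On the DataFrame itself the same idea would be df['features'] = df['features'].apply(ast.literal_eval), after which pd.DataFrame(df['features'].tolist()) should give one column per list position.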
I wonder if the following can be done better:
import numpy as np
def label_items(items):
    data = np.array(items)
    labels = np.zeros(len(items), dtype='int')
    for label, value in enumerate(set(items)):
        labels[data==value] = label
    return labels
for example:
label_items(['a', 'a', 'c', 'd', 'e', 'b', 'e', 'e', 'd', 'c'])
will return
array([0, 0, 1, 4, 3, 2, 3, 3, 4, 1])
Addendum: the letters are merely an example, it could be a list of anything. That's why I called the function "label_items". The order of the labels doesn't matter.
If order is not important, you can use numpy.unique:
import numpy as np
def label_items(arr):
    return np.unique(arr, return_inverse=True)
vals, labels = label_items(['a', 'a', 'c', 'd', 'e', 'b', 'e', 'e', 'd', 'c'])
print(vals)
['a' 'b' 'c' 'd' 'e']
print(labels)
[0 0 2 3 4 1 4 4 3 2]
If the items are single lowercase letters, you can use a simple map (here a is the input list):
list(map(lambda x: ord(x) - ord('a'), a))
Result:
[0, 0, 2, 3, 4, 1, 4, 4, 3, 2]
The ord function returns an integer representing the Unicode code point of a character. Then ord(x) - ord('a'), where x is a lowercase letter, gives the letter's zero-based position in the alphabet.
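If the items are not letters, or NumPy is not available, a dictionary can hand out labels in first-seen order. A minimal sketch, assuming the items are hashable; label_items_first_seen is a name invented for this example:

```python
def label_items_first_seen(items):
    """Label each item with an integer, numbering distinct items in order of first appearance."""
    labels = {}
    # setdefault stores len(labels) as the label only when the item is new
    return [labels.setdefault(item, len(labels)) for item in items]


print(label_items_first_seen(['a', 'a', 'c', 'd', 'e', 'b', 'e', 'e', 'd', 'c']))
# [0, 0, 1, 2, 3, 4, 3, 3, 2, 1]
```

Unlike the set-based loop, the labels here do not depend on set iteration order, so repeated runs give the same result.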
To start, I have two lists, plus a third list that says in what order I should merge them.
For example, the first list is [a, b, c], the second list is [d, e], and the 'merging' list is [0, 1, 0, 0, 1].
That means: to build the merged list, I take an element from the first list, then the second, then the first, then the first, then the second. I end up with [a, d, b, c, e].
To solve this I just used a for loop and two "pointers", but I was wondering whether there is a more Pythonic way. I looked for functions that could help, but found nothing useful.
You could create iterators from those lists, loop through the ordering list, and call next on one of the iterators:
i1 = iter(['a', 'b', 'c'])
i2 = iter(['d', 'e'])
# Select the iterator to advance: `i2` if `x` == 1, `i1` otherwise
print([next(i2 if x else i1) for x in [0, 1, 0, 0, 1]]) # ['a', 'd', 'b', 'c', 'e']
It's possible to generalize this solution to any number of lists, as shown below:
def ordered_merge(lists, selector):
    its = [iter(l) for l in lists]
    for i in selector:
        yield next(its[i])
In [4]: list(ordered_merge([[3, 4], [1, 5], [2, 6]], [1, 2, 0, 0, 1, 2]))
Out[4]: [1, 2, 3, 4, 5, 6]
If the ordering list contains strings, floats, or any other objects that can't be used as list indexes, use a dictionary:
def ordered_merge(mapping, selector):
    its = {k: iter(v) for k, v in mapping.items()}
    for i in selector:
        yield next(its[i])
In [6]: mapping = {'A': [3, 4], 'B': [1, 5], 'C': [2, 6]}
In [7]: list(ordered_merge(mapping, ['B', 'C', 'A', 'A', 'B', 'C']))
Out[7]: [1, 2, 3, 4, 5, 6]
Of course, you can use integers as dictionary keys as well.
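One thing to watch for: if the selector asks for more items than a list holds, next() raises StopIteration inside the generator, which Python 3.7+ turns into a RuntimeError. A sketch of one way to report that more clearly, assuming integer selectors; ordered_merge_checked is a hypothetical name:

```python
def ordered_merge_checked(lists, selector):
    """Like ordered_merge, but raise ValueError when a list runs out early."""
    iterators = [iter(lst) for lst in lists]
    sentinel = object()  # unique marker meaning "iterator exhausted"
    for position, index in enumerate(selector):
        item = next(iterators[index], sentinel)
        if item is sentinel:
            raise ValueError(
                f"list {index} is exhausted at selector position {position}"
            )
        yield item


print(list(ordered_merge_checked([[3, 4], [1, 5], [2, 6]], [1, 2, 0, 0, 1, 2])))
# [1, 2, 3, 4, 5, 6]
```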
Alternatively, you could remove elements from the left side of each of the original lists one by one and add them to the resulting list. Quick example:
In [8]: A = ['a', 'b', 'c']
...: B = ['d', 'e']
...: selector = [0, 1, 0, 0, 1]
...:
In [9]: [B.pop(0) if x else A.pop(0) for x in selector]
Out[9]: ['a', 'd', 'b', 'c', 'e']
I would expect the first approach to be more efficient (list.pop(0) is slow).
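If you prefer the "consume from the left" style but want to avoid the cost of list.pop(0), collections.deque offers popleft(), which runs in constant time. A small sketch under that assumption; merge_with_deques is a name chosen for this example, and the input lists are left unmodified:

```python
from collections import deque


def merge_with_deques(first, second, selector):
    """Merge two lists in the order given by selector (0 = first, 1 = second)."""
    # Copy the inputs into deques so the original lists are not consumed
    queues = (deque(first), deque(second))
    return [queues[1 if flag else 0].popleft() for flag in selector]


print(merge_with_deques(['a', 'b', 'c'], ['d', 'e'], [0, 1, 0, 0, 1]))
# ['a', 'd', 'b', 'c', 'e']
```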
How about this,
list1 = ['a', 'b', 'c']
list2 = ['d', 'e']
options = [0,1,0,0,1]
list1_iterator = iter(list1)
list2_iterator = iter(list2)
new_list = [next(list2_iterator) if option else next(list1_iterator) for option in options]
print(new_list)
# Output
['a', 'd', 'b', 'c', 'e']
I'd like to make additions/replacements to a digram list that looks similar to this:
[[a,b],[b,c],[c,d],[d,c],[c,b],[b,a]]
Flattened, the list would read [a,b,c,d,c,b,a], but that is only to describe the structure; it is not the issue.
Note that each digram has exactly two items, and adjacent digrams overlap: the second item of a digram is the first item of the next one. Only the first item of the first digram and the last item of the last digram occur once. See item a.
My question is: how can I replace items in the list with a digram so that the calls below return the results shown in the comments?
replace([['d','d']], 1, ['a', 0]) # should return: [['d', 'd']]
replace([['d',1]], 1, ['a', 0]) # should return: [['d', 'a'], ['a', 0]]
replace([[1,'d']], 1, ['a', 0]) # should return: [['a', 0], [0, 'd']]
replace([[1,'d'],['d', 1]], 1, ['a', 0]) # should return: [['a', 0], [0, 'd'], ['d', 'a'], ['a', 0]]
replace([['d',1],[1,'d']], 1, ['a', 0]) # should return: [['d','a'], ['a', 0], [0, 'd']]
replace([[1,1]], 1, ['a', 0]) # should return: [['a', 0], [0, 'a'], ['a', 0]]
replace([[1,1],[1,1]], 1, ['a', 0]) # should return: [['a', 0], [0, 'a'], ['a', 0], [0, 'a'], ['a', 0]]
I have tried the following approach, but it has some issues. In particular, the part under j == 1 has special cases that don't work.
def replace(t, a, b):
    """
    1. argument is the target list
    2. argument is the index value to be used on replacement
    3. argument is the digram to be inserted
    """
    # copy possibly not needed, im not sure
    t1 = t[:]
    for i, x in enumerate(t1):
        for j, y in enumerate(x):
            # if there is a digram match, lets make replacement / addition
            if y == a:
                if j == 0:
                    c = t1[i:]
                    del t1[i:]
                    t1 += [b] + c
                    c[0][0] = b[1]
                if j == 1:
                    c = t1[i:]
                    del t1[i:]
                    t1 += c + [b]
                    c[len(c)-1][1] = b[0]
                    #c[0][1] = b[0]
                    #t1 += c
    print (t, t1)
Can you suggest some tips to improve the function or have alternative ways to do the task?
Addition
This is my enhanced version of the function. It gives the right answers, but the "annoying" part of it, or the whole approach, could probably be optimized. This question could shift more toward code optimization:
def replace(t, a, b):
    """
    1. argument is the target list
    2. argument is the index value to be used on replacement
    3. argument is the digram to be inserted
    """
    l = len(t)
    i = 0
    while i < l:
        for j, y in enumerate(t[i]):
            # if there is a digram match, lets make replacement / addition
            if y == a:
                if j == 0:
                    c = t[i:]
                    del t[i:]
                    t += [b] + c
                    c[0][0] = b[1]
                    # increase both index and length
                    # this practically jumps over the inserted digram to the next one
                    i += 1
                    l += 1
                elif j == 1:
                    c = t[i:]
                    del t[i:]
                    # this is the annoying part of the algorithm...
                    if len(c) > 1 and c[1][0] == a:
                        t += c
                    else:
                        t += c + [b]
                        c[-1][1] = b[0]
                    t[i][1] = b[0]
        i += 1
    return t
I also provide a test function to check actual outputs against expected outputs:
def test(ins, outs):
    try:
        assert ins == outs
        return (True, 'was', outs)
    except AssertionError:
        return (False, 'was', ins, 'should be', outs)
for i, result in enumerate([
        [replace([['d','d']], 1, ['a', 0]), [['d', 'd']]],
        [replace([['d',1]], 1, ['a', 0]), [['d', 'a'], ['a', 0]]],
        [replace([[1,'d']], 1, ['a', 0]), [['a', 0], [0, 'd']]],
        [replace([[1,'d'],['d', 1]], 1, ['a', 0]), [['a', 0], [0, 'd'], ['d', 'a'], ['a', 0]]],
        [replace([['d',1],[1,'d']], 1, ['a', 0]), [['d','a'], ['a', 0], [0, 'd']]],
        [replace([[1,1]], 1, ['a', 0]), [['a', 0], [0, 'a'], ['a', 0]]],
        [replace([[1,1],[1,1]], 1, ['a', 0]), [['a', 0], [0, 'a'], ['a', 0], [0, 'a'], ['a', 0]]],
        [replace([['d',1],[1,1]], 1, ['a', 0]), [['d', 'a'], ['a', 0], [0, 'a'], ['a', 0]]],
        [replace([[1,1],[1,'d']], 1, ['a', 0]), [['a', 0], [0, 'a'], ['a', 0], [0, 'd']]]
        ]):
    print (i+1, test(*result))
This is my approach. Explanation below.
def replace(t, a, b):
    # Flatten the list
    t = [elem for sub in t for elem in sub]
    replaced = []
    # Iterate the elements of the flattened list
    # Keep the elements that do not match a, and replace the ones that
    # do match with the elements of b
    for elem in t:
        if elem == a:  # this element matches, replace with b
            replaced.extend(b)
        else:  # this element does not, add it
            replaced.append(elem)
    # break up the replaced, flattened list with groups of 2 elements
    return [replaced[x:x+2] for x in range(len(replaced)-1)]
You start with some list of lists. So first, we can flatten that.
[[1,'d'],['d', 1]] becomes [1,'d','d', 1]
Now we can loop through the flattened list and anywhere we find a match on a we can extend our replaced list with the contents of b. If the element does not match a we simply append it to replaced. We end up with:
['a', 0, 'd', 'd', 'a', 0]
Now we want to take all of these in groups of 2, moving our index 1 at a time.
[['a', 0] ...]
[['a', 0], [0, 'd'], ...]
[['a', 0], [0, 'd'], ['d', 'd'], ...]
If your data were substantially longer than your examples and performance mattered, you could skip building the flattened list and instead walk t with a nested loop, so that you make a single pass through t.
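The single-pass idea might look like the sketch below. It keeps the same behaviour as the function above, including the fact that an item shared by two adjacent digrams is seen twice; replace_single_pass is a name made up for this example:

```python
def replace_single_pass(t, a, b):
    """Same logic as the flatten-then-replace version, without the intermediate flat list."""
    replaced = []
    for sub in t:          # walk each digram
        for elem in sub:   # walk both items of the digram
            if elem == a:
                replaced.extend(b)   # matching item: splice in the digram b
            else:
                replaced.append(elem)
    # overlapping groups of 2, moving the index 1 at a time
    return [replaced[x:x + 2] for x in range(len(replaced) - 1)]


print(replace_single_pass([[1, 'd']], 1, ['a', 0]))
# [['a', 0], [0, 'd']]
```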
EDIT
def replace(t, a, b):
    t = [elem for sub in t for elem in sub]
    inner_a_matches_removed = []
    for i, elem in enumerate(t):
        if not i % 2 or elem != a:
            inner_a_matches_removed.append(elem)
            continue
        if i < len(t) - 1 and t[i+1] == a:
            continue
        inner_a_matches_removed.append(elem)
    replaced = []
    for elem in inner_a_matches_removed:
        if elem == a:
            replaced.extend(b)
        else:
            replaced.append(elem)
    return [replaced[x:x+2] for x in range(len(replaced)-1)]
And here is an addition for testing:
args_groups = [
([['d','d']], 1, ['a', 0]),
([['d',1]], 1, ['a', 0]),
([[1,'d']], 1, ['a', 0]),
([[1,'d'],['d', 1]], 1, ['a', 0]),
([['d',1],[1,'d']], 1, ['a', 0]),
([[1,1]], 1, ['a', 0]),
([[1,1],[1,1]], 1, ['a', 0]),
]
for args in args_groups:
    print("replace({}) => {}".format(", ".join(map(str, args)), replace(*args)))
Which outputs:
replace([['d', 'd']], 1, ['a', 0]) => [['d', 'd']]
replace([['d', 1]], 1, ['a', 0]) => [['d', 'a'], ['a', 0]]
replace([[1, 'd']], 1, ['a', 0]) => [['a', 0], [0, 'd']]
replace([[1, 'd'], ['d', 1]], 1, ['a', 0]) => [['a', 0], [0, 'd'], ['d', 'd'], ['d', 'a'], ['a', 0]]
replace([['d', 1], [1, 'd']], 1, ['a', 0]) => [['d', 'a'], ['a', 0], [0, 'd']]
replace([[1, 1]], 1, ['a', 0]) => [['a', 0], [0, 'a'], ['a', 0]]
replace([[1, 1], [1, 1]], 1, ['a', 0]) => [['a', 0], [0, 'a'], ['a', 0], [0, 'a'], ['a', 0]]
I guess I still don't understand case #4, but you seem to have solved it yourself which is Great!
Here is your modified code:
def replace(t, a, b):
    # Flatten the list: take only the first item of every digram
    # except the last one, which contributes both of its items
    t1 = []
    l = len(t)-1
    for items in [t[i][0:(1 if i>-1 and i<l else 2)] for i in range(0,l+1)]:
        t1.extend(items)
    replaced = []
    # Iterate the elements of the flattened list
    # Keep the elements that do not match a, and replace the ones that
    # do match with the elements of b
    for elem in t1:
        if elem == a:  # this element matches, replace with b
            replaced.extend(b)
        else:  # this element does not, add it
            replaced.append(elem)
    # break up the replaced, flattened list with groups of 2 elements
    return [replaced[x:x+2] for x in range(len(replaced)-1)]
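The same idea can be split into small helpers, which may make the overlap handling easier to follow. This is only a sketch, assuming the input really is a chain of overlapping digrams; the names digrams_to_chain, chain_to_digrams and replace_in_digrams are invented for this example:

```python
def digrams_to_chain(digrams):
    """[[a, b], [b, c]] -> [a, b, c]: first item of each digram plus the final item."""
    if not digrams:
        return []
    return [pair[0] for pair in digrams] + [digrams[-1][1]]


def chain_to_digrams(chain):
    """[a, b, c] -> [[a, b], [b, c]]: pair each item with its successor."""
    return [list(pair) for pair in zip(chain, chain[1:])]


def replace_in_digrams(digrams, target, replacement):
    """Replace every occurrence of target in the chain with the items of replacement."""
    chain = []
    for item in digrams_to_chain(digrams):
        if item == target:
            chain.extend(replacement)
        else:
            chain.append(item)
    return chain_to_digrams(chain)


print(replace_in_digrams([[1, 'd'], ['d', 1]], 1, ['a', 0]))
# [['a', 0], [0, 'd'], ['d', 'a'], ['a', 0]]
```

Because every digram in the result is a fresh list, the returned structure does not share the replacement list object between positions.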
Quick Summary:
need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
I want to reorder the elements of each inner list of need_to_reorder (the x in need_to_reorder[0][x]) using my sorting array:
sorting_array = [1, 3, 0, 2]
Required result: need_to_reorder will equal
[['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
Searching for an answer, I tried using NumPy:
import numpy as np
sorting_array = [1, 3, 0, 2]
i = np.array(sorting_array)
print(i)  ## Results: [1 3 0 2] <-- No Commas?
need_to_reorder[:,i]
RESULTS:
TypeError: list indices must be integers, not tuple
I'm looking for a correction to the code above or an entirely different approach.
You can try a simple nested comprehension
>>> l = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
>>> s = [1, 3, 0, 2]
>>> [[j[i] for i in s] for j in l]
[['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
If you need this as a function, a very simple one will do:
def reorder(need_to_reorder, sorting_array):
    return [[j[i] for i in sorting_array] for j in need_to_reorder]
Note that this can also be solved with the map function. In this case, however, a list comprehension is preferred because the map variant would require a lambda function. The difference between map and a list comprehension is discussed at length in this answer.
def order_with_sort_array(arr, sort_arr):
    assert len(arr) == len(sort_arr)
    return [arr[i] for i in sort_arr]
sorting_array = [1, 3, 0, 2]
need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]
# list() is needed on Python 3, where map returns a lazy iterator
after_reordered = list(map(lambda arr: order_with_sort_array(arr, sorting_array),
                           need_to_reorder))
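Another standard-library option is operator.itemgetter, which picks several positions in one call. A minimal sketch, assuming the sorting array has at least two entries (with a single index, itemgetter returns the bare item instead of a tuple); pick and reordered are names chosen for this example:

```python
from operator import itemgetter

sorting_array = [1, 3, 0, 2]
need_to_reorder = [['a', 'b', 'c', 'd'], [1, 2, 3, 4]]

# itemgetter(1, 3, 0, 2)(row) returns (row[1], row[3], row[0], row[2])
pick = itemgetter(*sorting_array)
reordered = [list(pick(row)) for row in need_to_reorder]

print(reordered)  # [['b', 'd', 'a', 'c'], [2, 4, 1, 3]]
```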
This should work:
import numpy as np
ntr = np.array([['a', 'b', 'c', 'd'], [1, 2, 3, 4]])
sa = np.array([1, 3, 0, 2])
# Index every row with the sorting array; mixing letters and numbers makes NumPy store everything as strings
print(ntr[:, sa])
>> [['b' 'd' 'a' 'c']
 ['2' '4' '1' '3']]