Weird Behaviour of Enumerate while using pandas dataframe - python

I have a dataframe (df):
df = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5],'f':[6]},index=[0])
I am using enumerate on the rows:
res = [tuple(x) for x in enumerate(df.values)]
print(res)
>>> [(1, 1, 6, 4, 2, 3, 5)] ### the elements are int type
Now when I change the datatype of one column of my dataframe:
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
and again use enumerate, I get:
res2 = [tuple(x) for x in enumerate(df2.values)]
print(res2)
>>> [(1, 1.0, 6.0, 4.0, 2.0, 3.0, 5.5)] ### the elements data type has changed
I don't understand why this happens.
I am also looking for a solution where each element keeps its own datatype.
For example:
res = [(1, 1, 6, 4, 2, 3, 5.5)]
How can I achieve this?

This has nothing to do with enumerate, that's a red herring. The issue is you are looking for mixed type output whereas Pandas prefers storing homogeneous data.
What you are looking for is not recommended with Pandas. Your data type should be int or float, not a combination. This has performance repercussions, since the only straightforward alternative is to use object dtype series, which only permit operations at Python speed. Converting to object dtype is inefficient.
So here's what you can do:
res2 = df2.astype(object).values.tolist()[0]
print(res2)
[1, 6, 4, 2, 3, 5.5]
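To confirm each element keeps its native Python type after the object conversion, here is a quick check (a sketch; the exact column order can differ across pandas versions):
print([type(v).__name__ for v in res2])
# e.g. ['int', 'int', 'int', 'int', 'int', 'float']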
One method which avoids the object conversion:
from itertools import chain
from operator import itemgetter, methodcaller
iter_series = map(itemgetter(1), df2.items())
res2 = list(chain.from_iterable(map(methodcaller('tolist'), iter_series)))
[1, 6, 4, 2, 3, 5.5]
Performance benchmarking
If you want a list of tuples as output, one tuple for each row, then the series-based solution performs better:
# Python 3.6.0, Pandas 0.19.2
df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])

from itertools import chain
from operator import itemgetter, methodcaller

n = 10**5
df2 = pd.concat([df2]*n)

def jpp_series(df2):
    iter_series = map(itemgetter(1), df2.items())
    return list(zip(*map(methodcaller('tolist'), iter_series)))

def jpp_object1(df2):
    return df2.astype(object).values.tolist()

def jpp_object2(df2):
    return list(map(tuple, df2.astype(object).values.tolist()))

assert jpp_series(df2) == jpp_object2(df2)

%timeit jpp_series(df2)   # 39.7 ms per loop
%timeit jpp_object1(df2)  # 43.7 ms per loop
%timeit jpp_object2(df2)  # 68.2 ms per loop

The issue is that calling df2.values will cause df2's data to be returned as a numpy array having a single dtype, where all the integers are also coerced to float.
You can prevent this coercion by operating on object arrays: use astype(object) to convert the underlying numpy array to object dtype:
>>> [(i, *x) for i, x in df2.astype(object).iterrows()]
[(0, 1, 2, 3, 4, 5.5, 6)]
Or,
>>> [(i, *x) for i, x in enumerate(df2.astype(object).values)]
[(0, 1, 2, 3, 4, 5.5, 6)]
Or, on older versions,
>>> [(i,) + tuple(x) for i, x in enumerate(df2.astype(object).values)]
[(0, 1, 2, 3, 4, 5.5, 6)]

Your df2 has mixed dtypes:
In [23]: df2 = pd.DataFrame({'a':[1],'l':[2],'m':[3],'k':[4],'s':[5.5],'f':[6]},index=[0])
...:
In [24]: df2.dtypes
Out[24]:
a      int64
f      int64
k      int64
l      int64
m      int64
s    float64
dtype: object
therefore, using .values will "upcast" to the lowest common denominator. From the docs:
The dtype will be a lower-common-denominator dtype (implicit
upcasting); that is to say if the dtypes (even of numeric types) are
mixed, the one that accommodates all will be chosen. Use this with
care if you are not dealing with the blocks.
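A quick check confirms the upcast on the same df2 (a minimal sketch):
df2.values.dtype
# dtype('float64') -- every column has been upcast to float64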
It looks like you actually just want .itertuples:
In [25]: list(df2.itertuples())
Out[25]: [Pandas(Index=0, a=1, f=6, k=4, l=2, m=3, s=5.5)]
Note that this conveniently returns a list of namedtuple objects. If you really just want plain tuples, map tuple onto it:
In [26]: list(map(tuple, df2.itertuples()))
Out[26]: [(0, 1, 6, 4, 2, 3, 5.5)]
But there's really no need.
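If you do want plain tuples without the extra map, itertuples also accepts name=None on reasonably recent pandas versions (an assumption; check your version), which yields regular tuples directly:
In [27]: list(df2.itertuples(name=None))
Out[27]: [(0, 1, 6, 4, 2, 3, 5.5)]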

Related

Numpy/Pandas: Merge two numpy arrays based on one array efficiently

I have two numpy arrays composed of two-element tuples:
a = [(1, "alpha"), (2, 3), ...]
b = [(1, "zylo"), (1, "xen"), (2, "potato"), ...]
The first element in the tuple is the identifier and shared between both arrays, so I want to create a new numpy array which looks like this:
[(1, "alpha", "zylo", "xen"), (2, 3, "potato"), etc...]
My current solution works, but it's way too inefficient for me. Looks like this:
aggregate_collection = []
for tuple_set in a:
    for tuple_set2 in b:
        if tuple_set[0] == tuple_set2[0] and other_condition:
            temp_tup = (tuple_set[0], other tuple values)
            aggregate_collection.append(temp_tup)
How can I do this efficiently?
I'd concatenate these into a data frame and just groupby+agg
(pd.concat([pd.DataFrame(a), pd.DataFrame(b)])
.groupby(0)
.agg(lambda s: [s.name, *s])[1])
where 0 and 1 are the default column names given by creating a dataframe via pd.DataFrame. Change them to your actual column names.
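A self-contained variant of the same concat + groupby + agg idea that makes the group key explicit instead of relying on s.name (a sketch; .agg(list) needs a reasonably recent pandas):
import pandas as pd

a = [(1, "alpha"), (2, 3)]
b = [(1, "zylo"), (1, "xen"), (2, "potato")]

merged = pd.concat([pd.DataFrame(a), pd.DataFrame(b)]).groupby(0)[1].agg(list)
result = [(k, *v) for k, v in merged.items()]
# [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]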
In [278]: a = [(1, "alpha"), (2, 3)]
...: b = [(1, "zylo"), (1, "xen"), (2, "potato")]
In [279]: a
Out[279]: [(1, 'alpha'), (2, 3)]
In [280]: b
Out[280]: [(1, 'zylo'), (1, 'xen'), (2, 'potato')]
Note that if I try to make an array from a I get something quite different.
In [281]: np.array(a)
Out[281]:
array([['1', 'alpha'],
       ['2', '3']], dtype='<U21')
In [282]: _.shape
Out[282]: (2, 2)
defaultdict is a handy tool for collecting like-keyed values
In [283]: from collections import defaultdict
In [284]: dd = defaultdict(list)
In [285]: for tup in a+b:
     ...:     k,v = tup
     ...:     dd[k].append(v)
     ...:
In [286]: dd
Out[286]: defaultdict(list, {1: ['alpha', 'zylo', 'xen'], 2: [3, 'potato']})
which can be cast as a list of tuples with:
In [288]: [(k,*v) for k,v in dd.items()]
Out[288]: [(1, 'alpha', 'zylo', 'xen'), (2, 3, 'potato')]
I'm using a+b to join the lists, since it apparently doesn't matter where the tuples occur.
Out[288] is even a poor numpy fit, since the tuples differ in size, and items (other than the first) might be strings or numbers.

Python: Calculate difference between all elements in a set of integers

I want to calculate the absolute difference between all elements in a set of integers. I am trying to do abs(x - y) where x and y are two elements in the set. I want to do that for all combinations and save the resulting list in a new set.
I want to calculate absolute difference between all elements in a set of integers (...) and save the resulting list in a new set.
You can use itertools.combinations:
from itertools import combinations

s = { 1, 4, 7, 9 }
{ abs(i - j) for i,j in combinations(s, 2) }
=>
set([8, 2, 3, 5, 6])
combinations returns all r-length tuples (here, pairs) of elements from s without replacement, i.e.:
list(combinations(s, 2))
=>
[(9, 4), (9, 1), (9, 7), (4, 1), (4, 7), (1, 7)]
As sets do not maintain order, you may use something like an ordered-set and iterate till the last-but-one element.
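If only the differences between adjacent values are needed, here is a minimal sketch of that idea using sorted() in place of an ordered-set:
s = { 1, 4, 7, 9 }
ordered = sorted(s)
adjacent_diffs = {b - a for a, b in zip(ordered, ordered[1:])}
# {2, 3} -- differences between consecutive elements only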
For completeness, here's a solution based on Numpy ndarray's and pdist():
In [69]: import numpy as np
In [70]: from scipy.spatial.distance import pdist
In [71]: s = {1, 4, 7, 9}
In [72]: set(pdist(np.array(list(s))[:, None], 'cityblock'))
Out[72]: {2.0, 3.0, 5.0, 6.0, 8.0}
Here is another solution based on numpy:
import numpy as np

data = np.array([33,22,21,1,44,54])
minn = np.inf
index = np.array(range(data.shape[0]))
for i in range(data.shape[0]):
    to_sub = (index[:i], index[i+1:])
    temp = np.abs(data[i] - data[np.hstack(to_sub)])
    min_temp = np.min(temp)
    if min_temp < minn: minn = min_temp
print('Min difference is', minn)
Output: "Min difference is 1"
Here is another way using combinations:
from itertools import combinations

def find_differences(lst):
    """Find all differences, min & max difference."""
    d = [abs(i - j) for i, j in combinations(set(lst), 2)]
    return min(d), max(d), d
Test:
list_of_nums = [1, 9, 7, 13, 56, 5]
min_, max_, diff_ = find_differences(list_of_nums)
print(f'All differences: {diff_}\nMaximum difference: {max_}\nMinimum difference: {min_}')
Result:
All differences: [4, 6, 8, 12, 55, 2, 4, 8, 51, 2, 6, 49, 4, 47, 43]
Maximum difference: 55
Minimum difference: 2

python map/reduce: emit multiple keys values from single map lambda

Is there a canonical way to emit multiple keys from a single item in the input sequence so that they form a continuous sequence and I don't need to use a reduce(...) just to flatten the sequence?
e.g. if I wanted to expand each digit in a series of numbers into individual numbers in a sequence
[1,12,123,1234,12345] => [1,1,2,1,2,3,1,2,3,4,1,2,3,4,5]
then I'd write some python that looked a bit like this:
somedata = [1,12,123,1234,12345]
listified = map(lambda x:[int(c) for c in str(x)], somedata)
flattened = reduce(lambda x,y: x+y,listified,[])
but would prefer not to have to call the flattened = reduce(...) if there was a neater (or maybe more efficient) way to express this.
map(func, *iterables) will always call func as many times as the length of the shortest iterable (assuming no Exception is raised). Functions always return a single object. So list(map(func, *iterables)) will always have the same length as the shortest iterable.
Thus list(map(lambda x:[int(c) for c in str(x)], somedata)) will always have the same length as somedata. There is no way around that.
If the desired result (e.g. [1,1,2,1,2,3,1,2,3,4,1,2,3,4,5]) has more items than the input (e.g. [1,12,123,1234,12345]) then something other than map must be used to produce it.
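A quick check of that point, using the question's data (a sketch):
somedata = [1, 12, 123, 1234, 12345]
mapped = list(map(lambda x: [int(c) for c in str(x)], somedata))
print(len(mapped), len(somedata))  # 5 5 -- one (nested) item per input item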
You could, for example, use itertools.chain.from_iterable to flatten 2 levels of nesting:
In [31]: import itertools as IT
In [32]: somedata = [1,12,123,1234,12345]
In [33]: list(map(int, IT.chain.from_iterable(map(str, somedata))))
Out[33]: [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
or, to flatten a list of lists, sum(..., []) suffices:
In [44]: sum(map(lambda x:[int(c) for c in str(x)], somedata), [])
Out[44]: [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
but note that this is much slower than using IT.chain.from_iterable (see below).
Here is a benchmark (using IPython's %timeit) testing the various methods on a list of 10,000 integers from 0 to a million:
In [4]: import random
In [8]: import functools
In [49]: somedata = [random.randint(0, 10**6) for i in range(10**4)]
In [50]: %timeit list(map(int, IT.chain.from_iterable(map(str, somedata))))
100 loops, best of 3: 9.35 ms per loop
In [13]: %timeit [int(i) for i in list(''.join(str(somedata)[1:-1].replace(', ','')))]
100 loops, best of 3: 12.2 ms per loop
In [52]: %timeit [int(j) for i in somedata for j in str(i)]
100 loops, best of 3: 12.3 ms per loop
In [51]: %timeit sum(map(lambda x:[int(c) for c in str(x)], somedata), [])
1 loop, best of 3: 869 ms per loop
In [9]: %timeit listified = map(lambda x:[int(c) for c in str(x)], somedata); functools.reduce(lambda x,y: x+y,listified,[])
1 loop, best of 3: 871 ms per loop
Got two ideas, one with list comprehensions:
print [int(j) for i in somedata for j in list(str(i)) ]
Something new (from the comments): a string is already iterable, so it would be:
print [int(j) for i in somedata for j in str(i) ]
The second uses operations on strings and a list comprehension:
print [int(i) for i in list(''.join(str(somedata)[1:-1].replace(', ','')))]
Output for both:
[1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]
Here's how the transformation goes:
1. Convert every item (int) to a string: 12 -> '12'
2. Convert every item (str) to a list of strings: '12' -> ['1', '2']
3. Flatten every item (list of str): ['1', '2'] -> '1', '2'
4. Convert every item (str) to an int: '1' -> 1
We can use Pyterator for this:
from pyterator import iterate
(
iterate([1, 12, 123, 1234, 12345])
.flat_map(lambda x: list(str(x))) # Steps 1-3
.map(int) # Step 4
.to_list()
)

Python: turn single array of sorted, repeat values into an array of arrays?

I have a sorted array with some repeated values. How can this array be turned into an array of arrays with the subarrays grouped by value (see below)? In actuality, my_first_array has ~8 million entries, so the solution would preferably be as time efficient as possible.
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
itertools.groupby makes this trivial:
import itertools
wanted_array = [list(grp) for _, grp in itertools.groupby(my_first_array)]
With no key function, it just yields groups consisting of runs of identical values, so you list-ify each one in a list comprehension; easy-peasy. You can think of it as basically a within-Python API for doing the work of the GNU toolkit program, uniq, and related operations.
In CPython (the reference interpreter), groupby is implemented in C, and it operates lazily and linearly; the data must already appear in runs matching the key function, so sorting might make it too expensive, but for already sorted data like you have, there is nothing that will be more efficient.
Note: If the inputs might be value identical, but different objects, it may make sense for memory reasons to change list(grp) for _, grp to [k] * len(list(grp)) for k, grp. The former would retain the original (possibly value but not identity duplicate) objects in the final result, the latter would replicate the first object from each group instead, reducing the final cost per group to the cost of N references to a single object, instead of N references to between 1 and N objects.
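A minimal sketch of that memory-saving variant (it assumes the grouping key itself is the value you want repeated):
import itertools

my_first_array = [1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23]
wanted_array = [[k] * len(list(grp)) for k, grp in itertools.groupby(my_first_array)]
# [[1, 1, 1], [3], [5, 5], [9, 9, 9, 9, 9], [10], [23, 23]]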
I am assuming that the input is a NumPy array and you are looking for a list of arrays as output. You can split the input array with np.split at the indices where the groups of repeats have their boundaries, i.e. where the value shifts. To find such indices, there are two ways: using np.unique with its optional argument return_index set as True, or a combination of np.where and np.diff. Thus, we have the two approaches listed next.
With np.unique -
import numpy as np
_,idx = np.unique(my_first_array, return_index=True)
out = np.split(my_first_array, idx)[1:]
With np.where and np.diff -
idx = np.where(np.diff(my_first_array)!=0)[0] + 1
out = np.split(my_first_array, idx)
Sample run -
In [28]: my_first_array
Out[28]: array([ 1, 1, 1, 3, 5, 5, 9, 9, 9, 9, 9, 10, 23, 23])
In [29]: _,idx = np.unique(my_first_array, return_index=True)
...: out = np.split(my_first_array, idx)[1:]
...:
In [30]: out
Out[30]:
[array([1, 1, 1]),
 array([3]),
 array([5, 5]),
 array([9, 9, 9, 9, 9]),
 array([10]),
 array([23, 23])]
In [31]: idx = np.where(np.diff(my_first_array)!=0)[0] + 1
...: out = np.split(my_first_array, idx)
...:
In [32]: out
Out[32]:
[array([1, 1, 1]),
 array([3]),
 array([5, 5]),
 array([9, 9, 9, 9, 9]),
 array([10]),
 array([23, 23])]
Here is a solution, although it might not be very efficient:
my_first_array = [1,1,1,3,5,5,9,9,9,9,9,10,23,23]
wanted_array = [ [1,1,1], [3], [5,5], [9,9,9,9,9], [10], [23,23] ]
new_array = [ [my_first_array[0]] ]
count = 0
for i in range(1,len(my_first_array)):
    a = my_first_array[i]
    if a == my_first_array[i - 1]:
        new_array[count].append(a)
    else:
        count += 1
        new_array.append([])
        new_array[count].append(a)
new_array == wanted_array
This is O(n):
a = [1,1,1,3,5,5,9,9,9,9,9,10,23,23,24]
res = []
s = 0
e = 0
length = len(a)
while s < length:
    b = []
    while e < length and a[s] == a[e]:
        b.append(a[s])
        e += 1
    res.append(b)
    s = e
print(res)

Checking for and indexing non-unique/duplicate values in a numpy array

I have an array traced_descIDs containing object IDs and I want to identify which items are not unique in this array. Then, for each of these duplicated IDs, I need to identify which indices of traced_descIDs are associated with it.
As an example, if we take the traced_descIDs here, I want the following process to occur:
traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]
I'm currently finding out which objects have more than 1 entry by:
mentions = np.array([len(np.argwhere( traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)
however, this takes too long as len( traced_descIDs ) is around 150,000. Is there a faster way to achieve the same result?
Any help greatly appreciated. Cheers.
While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:
import numpy as np

a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
If you are using a version of numpy earlier than 1.9, then that last line would have to be:
unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)
The contents of the three arrays we have created are:
>>> unq
array([ 1, 23, 90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])
To get the repeated items:
cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]
>>> dup_ids
array([ 1, 345])
Getting the indices is a little more involved, but pretty straightforward:
cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
>>> dup_idx
[array([0, 5]), array([1, 3])]
There is scipy.stats.itemfreq which would give the frequency of each item:
>>> import numpy as np
>>> import scipy as sp
>>> import scipy.stats  # makes sp.stats available
>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[  1,   2],
       [ 23,   1],
       [ 90,   1],
       [345,   2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
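Note that itemfreq is deprecated in recent SciPy releases; the same table can be built from np.unique with return_counts (a sketch):
>>> vals, counts = np.unique(xs, return_counts=True)
>>> dup_vals = vals[counts > 1]
>>> [(xs == w).nonzero()[0] for w in dup_vals]
[array([0, 5]), array([1, 3])]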
Your current approach is O(N**2); use a dictionary to do it in O(N) time:
>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
... d[x].append(i)
...
>>> for k, v in d.items():
... if len(v) == 1:
... del d[k]
...
>>> d
defaultdict(<type 'list'>, {1: [0, 5], 345: [1, 3]})
And to get the items and indices:
>>> from itertools import izip
>>> dupIds, dupInds = izip(*d.iteritems())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))
Note that if you want to preserve the order of items in dupIds then use collections.OrderedDict and the dict.setdefault() method.
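A minimal sketch of that order-preserving variant:
>>> from collections import OrderedDict
>>> d = OrderedDict()
>>> for i, x in enumerate(traced_descIDs):
...     d.setdefault(x, []).append(i)
...
>>> [(k, v) for k, v in d.items() if len(v) > 1]
[(1, [0, 5]), (345, [1, 3])]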
td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]
That gives you:
array([ 1, 345])
I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
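One possible way to finish that idea with the same sort (a sketch; it groups the argsort positions wherever the sorted values change):
td = np.array(traced_descIDs)
si = np.argsort(td, kind='mergesort')        # stable sort keeps equal values in original order
boundaries = np.flatnonzero(np.diff(td[si])) + 1
groups = np.split(si, boundaries)            # indices of td, grouped by value
dup_idx = [g for g in groups if len(g) > 1]  # keep only the duplicated IDs
# [array([0, 5]), array([1, 3])]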
A solution of the same vectorized efficiency as proposed by Jaime is embedded in the numpy_indexed package (disclaimer: I am its author):
import numpy_indexed as npi
print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))
This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any python loops and staying entirely vectorized, we can go a little lower level, and do:
g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)
np.unique for Ndims
I had a similar problem with an ndarray in which I wanted to find which rows are duplicated.
x = np.arange(60).reshape(5,4,3)
x[1] = x[0]
Rows 0 and 1 should be duplicates along axis 0. I used np.unique with all of its return options, then used Jaime's method to locate the duplicates.
_, i, _, c = np.unique(x, return_index=True, return_inverse=True, return_counts=True, axis=0)
x_dup = x[i[c > 1]]
I request return_inverse even though it is not needed here, just for clarity. Here is the result:
>>> print(x_dup)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]]
