Sorting a scipy.stats.itemfreq result containing strings - python

The Problem
I'm attempting to count the frequency of each string in a list and sort the result in descending order. scipy.stats.itemfreq generates the frequency results, which are output as a numpy array of string elements. This is where I'm stumped: how do I sort it?
So far I have tried operator.itemgetter, which appeared to work for a small list until I realised that it sorts by the first string character rather than converting the string to an integer, so '5' > '11' because it compares '5' against '1', not 5 against 11.
I'm using python 2.7, numpy 1.8.1, scipy 0.14.0.
Example Code:
from scipy.stats import itemfreq
import operator as op
items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']
items = itemfreq(items)
items = sorted(items, key=op.itemgetter(1), reverse=True)
print items
print items[0]
Output:
[array(['platypus duck', '5'],
dtype='|S13'), array(['dog', '3'],
dtype='|S13'), array(['', '2'],
dtype='|S13'), array(['bird', '2'],
dtype='|S13'), array(['cat', '11'],
dtype='|S13'), array(['elephant', '1'],
dtype='|S13')]
['platypus duck' '5']
Expected Output:
I'm after the ordering so something like:
[array(['cat', '11'],
dtype='|S13'), array(['platypus duck', '5'],
dtype='|S13'), array(['dog', '3'],
dtype='|S13'), array(['', '2'],
dtype='|S13'), array(['bird', '2'],
dtype='|S13'), array(['elephant', '1'],
dtype='|S13')]
['cat', '11']
Summary
My question is: how do I sort the array (which in this case is a string array) in descending order of counts? Please feel free to suggest alternative and faster/improved methods to my code sample above.
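One minimal fix for the code in the question is to convert the count to an integer inside the sort key instead of using itemgetter directly. A sketch on plain lists (itemfreq rows behave the same way):

```python
# Rows shaped like itemfreq's output: [item, count-as-string]
rows = [['platypus duck', '5'], ['dog', '3'], ['cat', '11'], ['bird', '2']]

# Converting the count to int inside the key makes '11' outrank '5' numerically
rows_sorted = sorted(rows, key=lambda row: int(row[1]), reverse=True)
# rows_sorted[0] is ['cat', '11']
```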

It is unfortunate that itemfreq returns the unique items and their counts in the same array. For your case, it means the counts are converted to strings, which is just dumb.
If you can upgrade numpy to version 1.9, then instead of using itemfreq, you can use numpy.unique with the argument return_counts=True (see below for how to accomplish this in older numpy):
In [29]: items = ['platypus duck','platypus duck','platypus duck','platypus duck','cat','dog','platypus duck','elephant','cat','cat','dog','bird','','','cat','dog','bird','cat','cat','cat','cat','cat','cat','cat']
In [30]: values, counts = np.unique(items, return_counts=True)
In [31]: values
Out[31]:
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'],
dtype='|S13')
In [32]: counts
Out[32]: array([ 2, 2, 11, 3, 1, 5])
Get indices that puts counts in decreasing order:
In [38]: idx = np.argsort(counts)[::-1]
In [39]: values[idx]
Out[39]:
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'],
dtype='|S13')
In [40]: counts[idx]
Out[40]: array([11, 5, 3, 2, 2, 1])
For older versions of numpy, you can combine np.unique and np.bincount, as follows:
In [46]: values, inv = np.unique(items, return_inverse=True)
In [47]: counts = np.bincount(inv)
In [48]: values
Out[48]:
array(['', 'bird', 'cat', 'dog', 'elephant', 'platypus duck'],
dtype='|S13')
In [49]: counts
Out[49]: array([ 2, 2, 11, 3, 1, 5])
In [50]: idx = np.argsort(counts)[::-1]
In [51]: values[idx]
Out[51]:
array(['cat', 'platypus duck', 'dog', 'bird', '', 'elephant'],
dtype='|S13')
In [52]: counts[idx]
Out[52]: array([11, 5, 3, 2, 2, 1])
In fact, the above is exactly what itemfreq does. Here's the definition of itemfreq in the scipy source code (without the docstring):
def itemfreq(a):
    items, inv = np.unique(a, return_inverse=True)
    freq = np.bincount(inv)
    return np.array([items, freq]).T

A much simpler way of achieving your task - obtaining the frequency of an item and having the items sorted by frequency - is to use the pandas function value_counts (for the original post and more suggestions see here):
import pandas as pd
import numpy as np
x = np.array(["bird","cat","dog","dog","cat","cat"])
pd.value_counts(x)
cat 3
dog 2
bird 1
dtype: int64
Getting only the number of occurrences, sorted:
y = pd.value_counts(x).values
array([3, 2, 1])
Getting only the unique names of the items you want to count, sorted:
z = pd.value_counts(x).index
Index(['cat', 'dog', 'bird'], dtype='object')
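Applied to the original question's data, value_counts already returns the counts in descending order, so no extra sorting step is needed. A sketch (using the Series method form, which also works on newer pandas where the top-level pd.value_counts is deprecated):

```python
import pandas as pd

items = (['platypus duck'] * 5 + ['cat'] * 11 + ['dog'] * 3
         + ['bird'] * 2 + [''] * 2 + ['elephant'])

counts = pd.Series(items).value_counts()  # sorted descending by default
top_item, top_count = counts.index[0], counts.iloc[0]
# top_item == 'cat', top_count == 11
```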

Related

How can I get rid of "array([...], dtype='&lt;U12')" in my output?

I am making a code to analyze the frequencies of the sells of different products (StockCode), so this is the code to get the frequencies:
stockCode = df['StockCode'].values.tolist()
non_repeated_list = []
frequencies = []
for i in stockCode:
    if i not in non_repeated_list:
        non_repeated_list.append(i)
for i in non_repeated_list:
    a = stockCode.count(i)
    frequencies.append(a)
And then stack both lists in a 2D list with list2d = np.column_stack((non_repeated_list, frequencies)) so I can sort them with:
print(sorted(list2d,key=lambda x:x[-1], reverse=True))
But when I print it out it says:
[array(['22139', '993'], dtype='<U12'), array(['22911', '99'], dtype='<U12'), array(['17012D', '99'], dtype='<U12')...
So I wanted to ask: how can I get just the values, without the array(...) wrappers?
Since you didn't provide a minimal, reproducible example, I'll try to recreate one.
I'm guessing df is a dataframe, and df['StockCode'] is a Series containing strings:
In [287]: ds = pd.Series(['one','two','one','three','two'])
In [288]: ds
Out[288]:
0 one
1 two
2 one
3 three
4 two
dtype: object
then we get a list of strings:
In [289]: x = ds.values.tolist()
In [290]: x
Out[290]: ['one', 'two', 'one', 'three', 'two']
and find the unique ones:
In [291]: alist = []
In [292]: for i in x:
...: if i not in alist:
...: alist.append(i)
...:
In [293]: alist
Out[293]: ['one', 'two', 'three']
and count them:
In [294]: freq = []
In [295]: for i in alist:
...: freq.append(x.count(i))
...:
In [296]: freq
Out[296]: [2, 2, 1]
column_stack of the two strings produces a 2d array, string dtype:
In [297]: np.column_stack((alist, freq))
Out[297]:
array([['one', '2'],
['two', '2'],
['three', '1']], dtype='<U21')
column_stack can't produce a list or array of arrays, so you must be doing something more.
Python's sorted treats the 2d array as a list of rows, the same as `list(...)` or a list comprehension on the array:
In [298]: [a for a in _]
Out[298]:
[array(['one', '2'], dtype='<U21'),
array(['two', '2'], dtype='<U21'),
array(['three', '1'], dtype='<U21')]
You could rejoin the arrays with:
In [299]: np.vstack(_)
Out[299]:
array([['one', '2'],
['two', '2'],
['three', '1']], dtype='<U21')
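To answer the question directly: converting each row back to a plain Python list drops the array(..., dtype=...) wrapper from the printed output. A sketch using the small example above:

```python
import numpy as np

list2d = np.column_stack((['one', 'two', 'three'], [2, 2, 1]))
rows = sorted(list2d, key=lambda x: int(x[-1]), reverse=True)

# .tolist() on each row turns it back into a plain list, so printing
# no longer shows the array(..., dtype=...) repr
plain = [row.tolist() for row in rows]
# plain == [['one', '2'], ['two', '2'], ['three', '1']]
```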

Using numpy.delete on an array built from [('aa',1),('bb',2)]: after deleting the first column, the second column becomes a string

I use numpy.delete(...) on an array built from [('aa',1),('bb',2)]. After deleting the first column, the second column becomes a string, e.g. [('1'),('2')], but I want the second column to keep its original int type. How can I do this? Thanks for the help.
A list 'delete':
In [92]: alist = [('aa',1),('bb',2)]
In [93]: [(row[1],) for row in alist]
Out[93]: [(1,), (2,)]
If we make an array from this list, we get a string dtype:
In [94]: np.array(alist)
Out[94]:
array([['aa', '1'],
['bb', '2']], dtype='<U2')
In [95]: np.delete(_, 0, 1)
Out[95]:
array([['1'],
['2']], dtype='<U2')
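One way to actually keep the integer type is to avoid the homogeneous string array in the first place and use a structured dtype, then "delete" the first column by selecting the other field. A sketch (the field names 'name' and 'val' are my own choice):

```python
import numpy as np

# A structured dtype keeps each column's own type instead of
# upcasting everything to strings
arr = np.array([('aa', 1), ('bb', 2)], dtype=[('name', 'U2'), ('val', 'i4')])

vals = arr['val']  # "delete" the first column by selecting the second field
# vals is array([1, 2]) with an integer dtype, not strings
```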

How to get a value of a column based on the id's given from another table

I want to extract the values of one column given a column of ids from a different dataset.
DF-1:
ID A B
1 cat 22
2 dog 33
3 mamal 44
4 rat 55
5 rabbit 66
6 puppy 77
DF-2:
name fav_animal
x 1,2,3
y 2,3
z 3,4
I wanted to see the fav animals of x in a new list say name_animal.
code:
# storing all the ids of x first
list_id = [1, 2, 3]
name_animal = []
for i in list_id:
    name_animal.extend(df1.loc[df1.ID == i, 'A'].tolist())
Output:
list_id = [1,2,3]
name_animal = ['cat','dog','mamal']
First find the fav_animal value with boolean indexing; next with iter returns an empty list if no name matched:
a = next(iter(df2.loc[df2['name'] == 'x', 'fav_animal']), [])
Then split values and convert them to integers:
list_id = list(map(int, a.split(',')))
print (list_id)
[1, 2, 3]
And last filter by isin first DataFrame:
name_animal = df1.loc[df1.ID.isin(list_id), 'A'].values.tolist()
print (name_animal)
['cat', 'dog', 'mamal']
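Putting the three steps together, with the DataFrames rebuilt from the question's tables, gives a self-contained version:

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6],
                    'A': ['cat', 'dog', 'mamal', 'rat', 'rabbit', 'puppy'],
                    'B': [22, 33, 44, 55, 66, 77]})
df2 = pd.DataFrame({'name': ['x', 'y', 'z'],
                    'fav_animal': ['1,2,3', '2,3', '3,4']})

# boolean indexing, with [] as the fallback when no name matches
a = next(iter(df2.loc[df2['name'] == 'x', 'fav_animal']), [])
list_id = list(map(int, a.split(',')))
name_animal = df1.loc[df1.ID.isin(list_id), 'A'].values.tolist()
# name_animal == ['cat', 'dog', 'mamal']
```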
You can use this function for example (note that we look up the ids by ID, not by positional index, otherwise the result is off by one):
def get_names(df, df2, name):
    ids = df2.loc[df2['name'] == name, 'fav_animal'].values[0].split(',')
    indices = list(map(int, ids))
    return indices, df.set_index('ID').loc[indices, 'A'].tolist()
So, for example, if you want the fav_animals for name x:
list_id, name_animal = get_names(df1, df2, 'x')
print(list_id)
[1, 2, 3]
print(name_animal)
['cat', 'dog', 'mamal']
I think what you're looking for is this:
df1 = pd.DataFrame({'ID': np.arange(1, 7),
                    'A': ['cat', 'dog', 'mamal', 'rat', 'rabbit', 'puppy'],
                    'B': [22, 33, 44, 55, 66, 77]})
df2 = pd.DataFrame({'name': ['x', 'y', 'z'],
                    'fav_animal': ['1,2,3', '2,3', '3,4']})
df2.loc[df2.name == 'x', 'fav_animal'].str.split(',')[0]
['1', '2', '3']
This returns a list of strings, so you need to convert the values to integers using map:
mask = list(map(int, df2.loc[df2.name == 'x', 'fav_animal'].str.split(',')[0]))
df1.loc[df1.ID.isin(mask), 'A'].values.tolist()
>['cat', 'dog', 'mamal']
Something like this? (Setting ID as the index first, so the lookup is by ID rather than by row position.)
for i in df2.fav_animal.tolist():
    print(df1.set_index("ID").loc[list(map(int, i.split(","))), "A"].tolist())
Output:
['cat', 'dog', 'mamal']
['dog', 'mamal']
['mamal', 'rat']
Alternative:
print([df1.set_index("ID").loc[list(map(int, i.split(","))), "A"].tolist() for i in df2.fav_animal.tolist()])
Output:
[['cat', 'dog', 'mamal'], ['dog', 'mamal'], ['mamal', 'rat']]

concatenate arrays with mixed types

consider the np.array a
a = np.concatenate(
    [np.arange(2).reshape(-1, 1),
     np.array([['a'], ['b']])],
    axis=1)
a
array([['0', 'a'],
['1', 'b']],
dtype='|S11')
How can I execute this concatenation such that the first column of a remains integers?
You can mix types in a numpy array by using a numpy.object as the dtype:
>>> import numpy as np
>>> a = np.empty((2, 0), dtype=np.object)
>>> a = np.append(a, np.arange(2).reshape(-1,1), axis=1)
>>> a = np.append(a, np.array([['a'],['b']]), axis=1)
>>> a
array([[0, 'a'],
[1, 'b']], dtype=object)
>>> type(a[0,0])
<type 'int'>
>>> type(a[0,1])
<type 'str'>
A suggested duplicate recommends making a recarray or structured array.
Store different datatypes in one NumPy array?
In this case:
In [324]: a = np.rec.fromarrays((np.arange(2).reshape(-1,1), np.array([['a'],['b']])))
In [325]: a
Out[325]:
rec.array([[(0, 'a')],
[(1, 'b')]],
dtype=[('f0', '<i4'), ('f1', '<U1')])
In [326]: a['f0']
Out[326]:
array([[0],
[1]])
In [327]: a['f1']
Out[327]:
array([['a'],
['b']],
dtype='<U1')
(I have reopened this because I think both approaches need to be acknowledged. Plus the object answer was already given and accepted.)

dtypes. Difference between S1 and S2 in Python

I have two arrays of strings:
In [51]: r['Z']
Out[51]:
array(['0', '0', '0', ..., '0', '0', '0'],
dtype='|S1')
In [52]: r['Y']
Out[52]:
array(['X0', 'X0', 'X0', ..., 'X0', 'X1', 'X1'],
dtype='|S2')
What is the difference between S1 and S2? Is it just that they hold entries of different length?
What if my arrays have strings of different lengths?
Where can I find a list of all possible dtypes and what they mean?
See the dtypes documentation.
The |S1 and |S2 strings are data type descriptors; the first means the array holds strings of length 1, the second of length 2. The | pipe symbol is the byteorder flag; in this case there is no byte order flag needed, so it's set to |, meaning not applicable.
For storing strings of variable length in a numpy array you could store them as python objects. For example:
In [456]: x=np.array(('abagd','ds','asdfasdf'),dtype=np.object_)
In [457]: x[0]
Out[457]: 'abagd'
In [459]: map(len,x)
Out[459]: [5, 2, 8]
In [460]: x[1]=='ds'
Out[460]: True
In [461]: x
Out[461]: array([abagd, ds, asdfasdf], dtype=object)
In [462]: str(x)
Out[462]: '[abagd ds asdfasdf]'
In [463]: x.tolist()
Out[463]: ['abagd', 'ds', 'asdfasdf']
In [464]: map(type,x)
Out[464]: [str, str, str]
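To see how the length in S1/S2 (or U1/U2 for unicode strings on Python 3) gets chosen, a quick sketch:

```python
import numpy as np

# numpy sizes a fixed-width string dtype to the longest element
a = np.array(['a', 'bb', 'ccc'])   # unicode strings on Python 3 -> U kind
b = np.array([b'a', b'bb'])        # byte strings -> S kind

# a.dtype == dtype('<U3'), b.dtype == dtype('S2')
```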
