I am trying to count the frequencies of the elements of an array.
Following this post, I put the array into a DataFrame and grouped it, which gives me a Series:
>>> a = np.array([1, 1, 5, 0, 1, 2, 2, 0, 1, 4])
>>> df = pd.DataFrame(a, columns=['a'])
>>> b = df.groupby('a').size()
>>> b
a
0 2
1 4
2 2
4 1
5 1
dtype: int64
>>> b.iloc[:,-1]
When I try to get the last column this way, I get this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 2013, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 220, in _has_valid_tuple
    raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
How do I get the last column of b?
Since a pandas.Series is a
One-dimensional ndarray with axis labels
it has no columns to index. If you want to get just the frequencies, i.e. the values of your series, use:
b.tolist()
or, alternatively:
b.to_dict()
to keep both labels and frequencies.
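For the series above, these give:
>>> b.tolist()
[2, 4, 2, 1, 1]
>>> b.to_dict()
{0: 2, 1: 4, 2: 2, 4: 1, 5: 1}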
P.S.: For your specific task, consider also the collections module:
>>> from collections import Counter
>>> a = [1, 1, 5, 0, 1, 2, 2, 0, 1, 4]
>>> c = Counter(a)
>>> [count for _, count in sorted(c.items())]
[2, 4, 2, 1, 1]
(Counter preserves first-encounter order, so the items are sorted here to match the sorted key order that groupby produces.)
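To keep both labels and counts, mirroring to_dict() above, a plain dict works too:
>>> dict(sorted(c.items()))
{0: 2, 1: 4, 2: 2, 4: 1, 5: 1}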
The problem is that the output of GroupBy.size is a Series, and a Series has no columns, so it is only possible to get the last value:
b.iloc[-1]
If you use:
b.iloc[:,-1]
it returns the last column of a DataFrame: here : means all rows, and -1 in the second position means the last column.
So if you create a DataFrame from the Series:
b1 = df.groupby('a').size().reset_index(name='count')
it works as expected.
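A quick check, with the output reconstructed from the sample data above:
>>> b1 = df.groupby('a').size().reset_index(name='count')
>>> b1
   a  count
0  0      2
1  1      4
2  2      2
3  4      1
4  5      1
>>> b1.iloc[:, -1]
0    2
1    4
2    2
3    1
4    1
Name: count, dtype: int64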
I have two arrays like
[2,2,0,1,1,1,2] and [2,2,0,1,1,1,0]
I need to count (e.g. with bincount) the occurrences of each element of the first array at the positions where the two arrays are equal.
So in this case we get [1, 3, 2], because 0 occurs once at the same position in both arrays, 1 occurs three times at the same positions, and 2 occurs twice at the same positions.
I tried this, but it does not give the desired result:
np.bincount(a[a==b])
Can someone help me?
You must put your lists into NumPy array format first:
import numpy as np
a = np.array([2,2,0,1,1,1,2])
b = np.array([2,2,0,1,1,1,0])
np.bincount(a, weights=(a==b)) # [1, 3, 2]
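This works because np.bincount sums the weights that fall into each bin: (a == b) is 1.0 where the two arrays match and 0.0 elsewhere, so each bin ends up holding the count of matching positions for that value.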
Another option is the datatable library:
from datatable import dt, f, by
df = dt.Frame(
col1=[2, 2, 0, 1, 1, 1, 2],
col2=[2, 2, 0, 1, 1, 1, 0]
)
df['equal'] = dt.ifelse(f.col1 == f.col2, 1, 0)
df_sub = df[:, {"sum": dt.sum(f.equal)}, by('col1')]
yourlist = df_sub['sum'].to_list()[0]
yourlist
[1, 3, 2]
Or, with an explicit loop over both arrays:
import numpy as np

array_1 = np.array([2, 2, 0, 1, 1, 1, 2])
array_2 = np.array([2, 2, 0, 1, 1, 1, 0])
# set up a bins array for the results:
if array_1.max() > array_2.max():
bins = np.zeros(array_1.max()+1)
else:
bins = np.zeros(array_2.max()+1)
# fill the bin values:
for val1, val2 in zip(array_1, array_2):
if val1 == val2:
bins[val1] += 1
# convert bins to a list with int values
bins = bins.astype(int).tolist()
And the results:
[1, 3, 2]
I have a pandas dataframe with a hierarchical row index:
def stack_example():
i = pd.DatetimeIndex([ '2011-04-04',
'2011-04-06',
'2011-04-12', '2011-04-13'])
cols = pd.MultiIndex.from_product([['milk', 'honey'],[u'jan', u'feb'], [u'PRICE','LITERS']])
df = pd.DataFrame(np.random.randint(12, size=(len(i), 8)), index=i, columns=cols)
df.columns.names = ['food', 'month', 'measure']
df.index.names = ['when']
df = df.stack('food')
df = df.stack('month')
df['constant_col'] = "foo"
df['liters_related_col'] = df['LITERS']*99
return df
I can add new columns to this dataframe based on constants or based on calculations involving other columns.
I would like to add new columns based in part on calculations involving the index.
For example, just repeat the food name twice:
df.index
MultiIndex(levels=[[2011-04-04 00:00:00, 2011-04-06 00:00:00, 2011-04-12 00:00:00, 2011-04-13 00:00:00], [u'honey', u'milk'], [u'feb', u'jan']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'when', u'food', u'month'])
df.index.values[4][1]*2
'honeyhoney'
But I can't figure out the syntax for creating something like this:
df['xcol'] = df.index.values[2]*2
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2519, in __setitem__
self._set_item(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2585, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2760, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\series.py", line 3080, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
I've also tried variations like df['xcol'] = df.index.values[:][2]*2
In the case of df.index.values[4][1] * 2, where the value is a string (honeyhoney), it's fine to assign that to a column:
df['col1'] = df.index.values[4][1] * 2
df.col1
when food month
2011-04-04 honey feb honeyhoney
jan honeyhoney
milk feb honeyhoney
jan honeyhoney
In your second example, though, the one with the error, you're not actually performing an operation on a single value:
df.index.values[2]*2
(Timestamp('2011-04-04 00:00:00'),
'milk',
'feb',
Timestamp('2011-04-04 00:00:00'),
'milk',
'feb')
You could still smush all that into a string, or into some other format, depending on your needs:
df['col2'] = ''.join([str(x) for x in df.index.values[2]*2])
But the main issue is that df.index.values[2]*2 produces a six-element tuple, which doesn't map onto the existing structure of df.
New columns in df can either be a single scalar value (in which case it's replicated automatically to fit the number of rows in df), or they can have exactly as many entries as len(df).
UPDATE (per comments)
IIUC, you can use get_level_values() to apply an operation to an entire level of a MultiIndex:
df.index.get_level_values(1).values*2
array(['honeyhoney', 'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney',
'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney',
'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney', 'milkmilk',
'milkmilk'], dtype=object)
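Since this array has exactly len(df) entries, it can be assigned directly as a new column; a minimal sketch (the column name xcol is taken from the question):
df['xcol'] = df.index.get_level_values(1).values * 2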
I have a large dataframe with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'family_id', 'house_value'])
Months go from 0 to 239, family_ids go up to 10900, and house values vary, so the dataframe has more than two and a half million rows.
I want to filter the dataframe to keep only the families whose final house value differs from their initial house value.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I thought of code that would look something like this:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The code above gives KeyError: False and a ValueError. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
.groupby('family_id')
.filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
# family_id house_value month
#1 1 10 0
#6 1 11 1
#8 1 11 239
As @Bharath commented, your approach errors out because a boolean filter expects the boolean series to have the same length as the original data frame. That is not true in either of your cases, because the masks you applied before the comparison already shortened the series.
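If you prefer building a full-length boolean mask instead, a sketch using groupby.transform, which broadcasts each family's first and last house value back onto every row so the mask lines up with f:
srt = f.sort_values('month')
grp = srt.groupby('family_id')['house_value']
# transform results keep f's original index labels, so the mask aligns with f
mask = grp.transform('first') != grp.transform('last')
g = f[mask]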
I would like to formulate an array holding, for each column, the index (item number, not value) of the maximum across the 3 rows.
E.g.
In: arr=([(1,2,3,4), (4,5,16,0), (7,8,9,2)]) # maximum of columns 0, 1, 2, 3
Out: array([2,2,1,0]) # As: 7 > 4 > 1, 8 > 5 > 2, 16 > 9 > 3, and 4 > 2 > 0
Current (non-working solution):
np.argmax([arr['f0'], arr['f1'], arr['f2']])
You can specify the axis keyword in numpy.argmax, which makes it operate along a chosen axis of a NumPy array. In your case you want to work on each column individually, finding the row index of the maximum of each column, so specify axis=0. Here's a sample run given your data in IPython:
In [10]: import numpy as np
In [11]: arr=np.array([(1,2,3), (4,5,16), (7,8,9)])
In [12]: np.argmax(arr, axis=0)
Out[12]: array([2, 2, 1])
The above example was what you had before you edited your post. With your new data in your edit, here's a sample run:
In [13]: arr=np.array([(1,2,3,4), (4,5,16,0), (7,8,9,2)])
In [14]: np.argmax(arr, axis=0)
Out[14]: array([2, 2, 1, 0])
More information about numpy.argmax can be found here: http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
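If arr really is a structured array, as the arr['f0'] field access in your non-working attempt suggests, a hedged sketch (the dtype below is an assumption) is to stack the fields into a plain 2D array first:
import numpy as np

# assumed structured layout; each field is one of the original columns
arr = np.array([(1, 2, 3, 4), (4, 5, 16, 0), (7, 8, 9, 2)],
               dtype=[('f0', int), ('f1', int), ('f2', int), ('f3', int)])

# one row per field/column, then argmax across that row's three entries
plain = np.vstack([arr[name] for name in arr.dtype.names])
print(np.argmax(plain, axis=1))  # [2 2 1 0]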
I am trying to do something very similar to that previous question, but I get an error.
I have a pandas dataframe containing features and a label, and I need to do some conversion to send the features and the label variable into a machine-learning object:
import pandas
import milk
from scikits.statsmodels.tools import categorical
then I have:
trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'
labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets
The console output first shows:
features
[[ 0.38846334 0.97681855]
[ 3.8318634 0.5724734 ]
[ 0.67710876 1.01816444]
[ 1.12024943 0.91508699]
[ 7.51749674 1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]
I then get the following error:
Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'
Is it possible to convert the categorical variable 'type' within the dataframe into an int type? 'type' can take the values 'single', 'touching', 'nuclei', 'dusts', and I need to map them to int values such as 0, 1, 2, 3.
The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.
For a Series:
In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
For a DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]),
Index([u'single', u'touching', u'nuclei', u'dusts'],
dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
labels labels_enc
0 single 0
1 touching 1
2 nuclei 2
3 dusts 3
4 touching 1
5 single 0
6 nuclei 2
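In current pandas you can also go through the categorical dtype, which keeps the category/code mapping on the column itself; a minimal sketch:
df['labels_enc'] = df['labels'].astype('category').cat.codes
Note that cat.codes numbers the categories in sorted order ('dusts' = 0, 'nuclei' = 1, ...), not in first-appearance order as factorize does.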
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has attributes labels and levels:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.
BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.
I am answering the question for Pandas 0.10.1. Factor.from_array seems to do the trick.
>>> s = pandas.Series(['a', 'b', 'a', 'c', 'a', 'b', 'a'])
>>> s
0 a
1 b
2 a
3 c
4 a
5 b
6 a
>>> f = pandas.Factor.from_array(s)
>>> f
Categorical:
array([a, b, a, c, a, b, a], dtype=object)
Levels (3): Index([a, b, c], dtype=object)
>>> f.labels
array([0, 1, 0, 2, 0, 1, 0])
>>> f.levels
Index([a, b, c], dtype=object)
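Factor and Factor.from_array have since been removed; in modern pandas the equivalent is pd.Categorical, whose codes and categories attributes play the roles of labels and levels above:
>>> cat = pd.Categorical(s)
>>> cat.codes
array([0, 1, 0, 2, 0, 1, 0], dtype=int8)
>>> cat.categories
Index(['a', 'b', 'c'], dtype='object')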
Because none of these work for arrays with more than one dimension, I made some code that works for any numpy array dimensionality:
import numpy as np

def encode_categorical(array):
    # map each unique value to an integer code
    uniques = np.unique(array)
    d = {key: value for key, value in zip(uniques, np.arange(len(uniques)))}
    shape = array.shape
    array = array.ravel()
    new_array = np.zeros(array.shape, dtype=int)
    for i in range(len(array)):
        new_array[i] = d[array[i]]
    return new_array.reshape(shape)
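For example:
>>> arr = np.array([['a', 'b'], ['b', 'c'], ['a', 'c']])
>>> encode_categorical(arr)
array([[0, 1],
       [1, 2],
       [0, 2]])
As a side note, np.unique(array, return_inverse=True) already computes these integer codes, so the dictionary loop can be replaced by np.unique(array, return_inverse=True)[1].reshape(array.shape).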