I would like to slice the result of a pandas value_counts():
>sur_perimetre[col].value_counts()
44341006.0 610
14231009.0 441
12131001.0 382
12222009.0 364
12142001.0 354
But I get an error:
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same happens with ix:
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that?
EDIT
Maybe:
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
Method 1:
Note that value_counts() returns a Series object. You can process it like any other Series and pull out its values; you can even construct a new DataFrame from it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [9]: df2
Out[9]:
count value
0 2 3
1 1 5
2 1 4
3 1 2
4 1 1
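As a side note, reset_index() on the Series builds much the same two-column frame in one step (the counts column takes the Series name, C1, and the old index becomes a column named index), which is essentially what your EDIT does:
In [10]: vc.reset_index()
Out[10]:
   index  C1
0      3   2
1      5   1
2      4   1
3      2   1
4      1   1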
Method 2:
I then tried to reproduce the error you mentioned, but with a single-column DataFrame I didn't get any error using the same notation.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3 2
5 1
4 1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3 2
5 1
4 1
2 1
1 1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
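As for the original error, a likely cause (an assumption on my part, judging from your output): sur_perimetre[col] holds floats, so value_counts() produces a float64 index, and vc[:5] is then interpreted as a label-based slice up to the label 5.0 rather than the first five rows, hence KeyError: 5.0. Positional slicing sidesteps the index entirely:
vc = sur_perimetre[col].value_counts()
vc.iloc[:5]  # positional: first five rows, regardless of index dtype
vc.head(5)   # same thing, more readable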
Hope it helps!
I see many questions on SO about how to filter an array (usually with pandas or numpy). The examples are very simple:
df = pd.DataFrame({ 'val': [1, 2, 3, 4, 5, 6, 7] })
a = df[df.val > 3]
Intuitively I understand the df[df.val > 3] statement, but it confuses me from a syntax point of view. In other languages, I would expect a lambda function instead of df.val > 3.
The question is: why is this style possible, and what is going on under the hood?
Update 1:
To be clearer about the confusing part: in other languages I have seen syntax like this:
df.filter(row => row.val > 3)
There I understand what is going on under the hood: on each iteration, the lambda is called with the row as an argument and returns a boolean. But df.val > 3 doesn't make sense to me, because df.val looks like a column.
Moreover, I can write df[df > 3] and it executes successfully, which baffles me because I don't understand how a whole DataFrame object can be compared to a number.
Create an array and a DataFrame from it:
In [104]: arr = np.arange(1,8); df = pd.DataFrame({'val':arr})
In [105]: arr
Out[105]: array([1, 2, 3, 4, 5, 6, 7])
In [106]: df
Out[106]:
val
0 1
1 2
2 3
3 4
4 5
5 6
6 7
numpy arrays have methods and operators that act on the whole array. For example, you can multiply the array by a scalar, add a scalar to all elements, or, as in this case, compare each element to a scalar. That's all implemented by the class (numpy.ndarray), not by special Python syntax.
In [107]: arr>3
Out[107]: array([False, False, False, True, True, True, True])
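Under the hood, arr > 3 is just a method call: Python translates the comparison operator into arr.__gt__(3), and ndarray implements that method element-wise. A minimal sketch of the same mechanism with a toy class (MyArray is hypothetical, purely for illustration):
class MyArray:
    def __init__(self, data):
        self.data = data
    def __gt__(self, scalar):
        # Python dispatches `self > scalar` here; compare element-wise
        return [x > scalar for x in self.data]

MyArray([1, 2, 3, 4]) > 2   # [False, False, True, True]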
Similarly, pandas implements these methods (or uses the numpy methods on the underlying arrays). Selecting a column of the frame, with df['val'] or:
In [108]: df.val
Out[108]:
0 1
1 2
2 3
3 4
4 5
5 6
6 7
Name: val, dtype: int32
This is a pandas Series (slight difference in display).
It can be compared to a scalar, just as with the array:
In [110]: df.val>3
Out[110]:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: val, dtype: bool
And the boolean array can be used to index the frame:
In [111]: df[arr>3]
Out[111]:
val
3 4
4 5
5 6
6 7
The boolean Series also works:
In [112]: df[df.val>3]
Out[112]:
val
3 4
4 5
5 6
6 7
Boolean array indexing works the same way:
In [113]: arr[arr>3]
Out[113]: array([4, 5, 6, 7])
Here I use the indexing to fetch values; setting values is analogous.
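For example, the same mask assigns in place:
In [114]: arr[arr > 3] = 0
In [115]: arr   # df is unaffected; it holds its own copy of the data
Out[115]: array([1, 2, 3, 0, 0, 0, 0])
As for df[df > 3] from the question: comparing the whole frame to a scalar gives a boolean DataFrame of the same shape, and indexing with it keeps values where the mask is True and leaves NaN elsewhere:
In [116]: df[df > 3]
Out[116]:
   val
0  NaN
1  NaN
2  NaN
3  4.0
4  5.0
5  6.0
6  7.0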
While using the apply function to process a DataFrame, the data type of a column changed unexpectedly. What should I do to prevent this?
For example:
In [1]: import pandas as pd
In [2]: from pandas import DataFrame
In [3]: tmp = DataFrame({'item':[1,2,3]})
In [4]: tmp['score'] = 0.0
In [5]: tmp.dtypes
Out[5]:
item int64
score float64
dtype: object
In [6]: tmp
Out[6]:
item score
0 1 0.0
1 2 0.0
2 3 0.0
In [7]: def Test(x):
...: return x
...:
In [8]: tmp = tmp.apply(Test,axis=1)
In [9]: tmp.dtypes
Out[9]:
item float64
score float64
dtype: object
The data type of tmp['item'] was changed to float. How can I keep its original data type?
This is happening because .apply essentially iterates over rows (when axis=1) and applies the function to a Series representing each row. Since a Series must hold a single data type, a Series built from a row of mixed int and float values promotes the ints to float:
In [4]: def test(x): return x
In [5]: tmp.iloc[0]
Out[5]:
item 1.0
score 0.0
Name: 0, dtype: float64
In [6]: tmp.apply(test, axis=1)
Out[6]:
item score
0 1.0 0.0
1 2.0 0.0
2 3.0 0.0
Note what happens when we select a column, though:
In [7]: tmp.iloc[:,0]
Out[7]:
0 1
1 2
2 3
Name: item, dtype: int64
In [8]: tmp.apply(test, axis=0)
Out[8]:
item score
0 1 0.0
1 2 0.0
2 3 0.0
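If you do need axis=1, one workaround (a sketch, assuming the values survive the round-trip through float unchanged) is to cast the affected columns back afterwards with astype:
In [9]: tmp.apply(test, axis=1).astype({'item': 'int64'}).dtypes
Out[9]:
item       int64
score    float64
dtype: object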
I have a Series of strings and I need to apply boolean indexing using len() on it.
In one case it works, in another case it does not:
The working case is a groupby on a DataFrame, followed by unique() on the resulting Series and an apply(str) to turn the resulting numpy.ndarray entries into strings:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,3,4,5,4,4]})
dg = df.groupby('A')['B'].unique().apply(str)
db = dg[len(dg) > 2]
This just works fine and yields the desired result:
>>db
Out[119]: '[1 2 3]'
The following, however, throws KeyError: True:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[len(ss) > 2]
Both objects dg and ss are just Series of strings:
>>type(dg)
Out[113]: pandas.core.series.Series
>>type(ss)
Out[114]: pandas.core.series.Series
>>type(dg['a'])
Out[115]: str
>>type(ss[0])
Out[116]: str
I'm following the syntax as described in the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
I can see a potential conflict because len(ss) on its own returns the length of the Series itself and now that exact command is used for boolean indexing ss[len(ss) > 2], but then I'd expect neither of the two examples to work.
Right now this behaviour seems inconsistent, unless I'm missing something obvious.
I think you need str.len, because you need the length of each value of the Series:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
print (ss.str.len())
0 1
1 1
2 2
3 2
4 4
5 2
6 3
dtype: int64
print (ss.str.len() > 2)
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
ls = ss[ss.str.len() > 2]
print (ls)
4 eeee
6 ggg
dtype: object
If you use len, you get the length of the Series itself:
print (len(ss))
7
Another solution is to apply len:
ss = pd.Series(['a','b','cc','dd','eeee','ff','ggg'])
ls = ss[ss.apply(len) > 2]
print (ls)
4 eeee
6 ggg
dtype: object
Your first script is wrong as well; you need to apply len there too:
df = pd.DataFrame({'A':['a','a','a','a','b','b','b','b'],'B':[1,2,2,2,4,5,4,6]})
dg = df.groupby('A')['B'].unique()
print (dg)
A
a [1, 2]
b [4, 5, 6]
Name: B, dtype: object
db = dg[dg.apply(len) > 2]
print (db)
A
b [4, 5, 6]
Name: B, dtype: object
If you cast the list to str, you get a different length (the length of the data plus the brackets and whitespace):
dg = df.groupby('A')['B'].unique().apply(str)
print (dg)
A
a [1 2]
b [4 5 6]
Name: B, dtype: object
print (dg.apply(len))
A
a 5
b 7
Name: B, dtype: int64
I have a pandas dataframe:
banned_titles =
TitleId RelatedTitleId
0 89989 32598
1 89989 3085083
2 95281 3085083
When I apply groupby as follows:
In [84]: banned_titles.groupby('TitleId').groups
Out[84]: {89989: [0, 1], 95281: [2]}
This is close, but not what I want.
What I want is:
{89989: [32598, 3085083], 95281: [3085083]}
Is there a way to do this?
Try this:
In [8]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist()).to_dict()
Out[8]: {89989: [32598, 3085083], 95281: [3085083]}
Or, as a Series of lists:
In [10]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist())
Out[10]:
TitleId
89989 [32598, 3085083]
95281 [3085083]
Name: RelatedTitleId, dtype: object
data:
In [9]: x
Out[9]:
TitleId RelatedTitleId
0 89989 32598
1 89989 3085083
2 95281 3085083
Or in one line, passing list directly (no lambda):
dict(df.groupby('TitleId')['RelatedTitleId'].apply(list))
# {89989: [32598, 3085083], 95281: [3085083]}
I have a pandas series with some values like 19.99-20.99 (i.e. two numbers separated by a dash).
How would you just take the left or right value?
Use split("-") on each string and then access the result with index notation, i.e. split_result[1].
Here's an example:
In [5]: my_series = pandas.Series(['19.22-20.11','18.55-34.22','12.33-22.00','13.33-34.23'])
In [6]: my_series[0]
Out[6]: '19.22-20.11'
In [7]: my_series[0].split("-")
Out[7]: ['19.22', '20.11']
In [8]: my_series[0].split("-")[0]
Out[8]: '19.22'
In [9]: my_series[0].split("-")[1]
Out[9]: '20.11'
Alternatively, the vectorized .str accessor does the split for the whole Series at once:
In [1]: s = pd.Series(['19.99-20.99', '20.99-21.99'])
In [2]: s.str.split('-').str[0]
Out[2]:
0 19.99
1 20.99
dtype: object
In [3]: s.str.split('-').str[1]
Out[3]:
0 20.99
1 21.99
dtype: object
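If you want both parts at once, str.split with expand=True returns a two-column DataFrame, and astype(float) then turns the strings into numbers:
In [4]: s.str.split('-', expand=True).astype(float)
Out[4]:
       0      1
0  19.99  20.99
1  20.99  21.99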