Extract on "-" delimited in pandas - python

I have a pandas series with some values like 19.99-20.99 (i.e. two numbers separated by a dash).
How would you just take the left or right value?

Call split("-") on each string, then index into the resulting list, e.g. split_result[1].
Here's an example:
In [5]: my_series = pandas.Series(['19.22-20.11','18.55-34.22','12.33-22.00','13.33-34.23'])
In [6]: my_series[0]
Out[6]: '19.22-20.11'
In [7]: my_series[0].split("-")
Out[7]: ['19.22', '20.11']
In [8]: my_series[0].split("-")[0]
Out[8]: '19.22'
In [9]: my_series[0].split("-")[1]
Out[9]: '20.11'

To do this for the whole Series at once, use the vectorized .str accessor:
In [1]: s = pd.Series(['19.99-20.99', '20.99-21.99'])
In [2]: s.str.split('-').str[0]
Out[2]:
0 19.99
1 20.99
dtype: object
In [3]: s.str.split('-').str[1]
Out[3]:
0 20.99
1 21.99
dtype: object
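If both sides are needed as numbers, the split can be expanded into columns in one step. This is a minimal sketch rather than part of the answers above: the column names low and high are made up for illustration, and it assumes every value holds exactly one dash and no negative numbers.

```python
import pandas as pd

s = pd.Series(['19.99-20.99', '20.99-21.99'])

# expand=True returns a DataFrame with one column per piece
# instead of a Series of lists.
parts = s.str.split('-', expand=True).astype(float)
parts.columns = ['low', 'high']  # hypothetical names, chosen for readability

print(parts)
```

The strings become floats here, so parts['low'] and parts['high'] can be used directly in arithmetic.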

Related

How do numpy functions operate on pandas objects internally?

NumPy functions, e.g. np.mean(), np.var(), etc., accept an array-like argument such as an np.array or a list.
But passing in a pandas DataFrame also works. This means a DataFrame can pass for a NumPy array, which I find a little strange (even though I know the underlying values of a DataFrame are NumPy arrays).
For an object to be array-like, I thought it should be sliceable with integer indexing the way a NumPy array is. So, for instance, df[1:3, 2:3] should work, but it raises an error.
So perhaps a DataFrame is converted into a NumPy array inside the function. But if that is the case, why does np.mean(numpy_array) give a different result from np.mean(df)?
a = np.random.rand(4,2)
a
Out[13]:
array([[ 0.86688862, 0.09682919],
[ 0.49629578, 0.78263523],
[ 0.83552411, 0.71907931],
[ 0.95039642, 0.71795655]])
np.mean(a)
Out[14]: 0.68320065182041034
This gives a different result from the following:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[18]:
0 1
0 0.866889 0.096829
1 0.496296 0.782635
2 0.835524 0.719079
3 0.950396 0.717957
np.mean(df)
Out[21]:
0 0.787276
1 0.579125
dtype: float64
The former output is a single number, whereas the latter is a column-wise mean. How does a NumPy function know about the structure of a DataFrame?
If you step through np.mean(df) in the debugger:
--Call--
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2796)mean()
-> def mean(a, axis=None, dtype=None, out=None, keepdims=False):
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2877)mean()
-> if type(a) is not mu.ndarray:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2878)mean()
-> try:
(Pdb) s
> d:\winpython-64bit-3.4.3.5\python-3.4.3.amd64\lib\site-packages\numpy\core\fromnumeric.py(2879)mean()
-> mean = a.mean
You can see that the type is not an ndarray, so NumPy tries to call a.mean, which in this case is df.mean():
In [6]:
df.mean()
Out[6]:
0 0.572999
1 0.468268
dtype: float64
This is why the output is different.
Code to reproduce the above:
In [3]:
a = np.random.rand(4,2)
a
Out[3]:
array([[ 0.96750329, 0.67623187],
[ 0.44025179, 0.97312747],
[ 0.07330062, 0.18341157],
[ 0.81094166, 0.04030253]])
In [4]:
np.mean(a)
Out[4]:
0.52063384885403818
In [5]:
df = pd.DataFrame(data=a, index=range(np.shape(a)[0]),
columns=range(np.shape(a)[1]))
df
Out[5]:
0 1
0 0.967503 0.676232
1 0.440252 0.973127
2 0.073301 0.183412
3 0.810942 0.040303
numpy output:
In [7]:
np.mean(df)
Out[7]:
0 0.572999
1 0.468268
dtype: float64
If you call .values to get the underlying NumPy array, the output matches np.mean(a):
In [8]:
np.mean(df.values)
Out[8]:
0.52063384885403818
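The delegation seen in the debugger trace can be reproduced without pandas. The sketch below uses a made-up class, ColumnMeans, as a stand-in for a DataFrame. It only assumes what the trace shows: for anything that is not exactly an ndarray, np.mean looks for a mean method on the object and calls it. What np.mean(df) returns in practice depends on how the installed pandas version interprets axis=None, which is why a stand-in is used here.

```python
import numpy as np

class ColumnMeans:
    """Hypothetical stand-in for a DataFrame: wraps a 2-D array
    and provides its own mean() method."""

    def __init__(self, data):
        self.data = np.asarray(data)

    def mean(self, axis=None, dtype=None, out=None, **kwargs):
        # Mimic the behaviour shown above: with no axis given,
        # average each column instead of the whole array.
        if axis is None:
            axis = 0
        return self.data.mean(axis=axis)

a = np.array([[1.0, 2.0], [3.0, 4.0]])
wrapped = ColumnMeans(a)

overall = np.mean(a)          # plain ndarray: mean of all elements
delegated = np.mean(wrapped)  # not an ndarray: NumPy calls wrapped.mean(...)

print(overall)
print(delegated)
```

To get the same answer regardless of the object's own mean method, convert to an array first, as the answer does with df.values.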

Pandas groupby two columns then get dict for values

I have a pandas dataframe:
banned_titles =
TitleId RelatedTitleId
0 89989 32598
1 89989 3085083
2 95281 3085083
When I apply groupby as follows:
In [84]: banned_titles.groupby('TitleId').groups
Out[84]: {89989: [0, 1], 95281: [2]}
This is close, but not what I want.
What I want is:
{89989: [32598, 3085083], 95281: [3085083]}
Is there a way to do this?
Try this:
In [8]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist()).to_dict()
Out[8]: {89989: [32598, 3085083], 95281: [3085083]}
Or, as a Series of lists:
In [10]: x.groupby('TitleId')['RelatedTitleId'].apply(lambda x: x.tolist())
Out[10]:
TitleId
89989 [32598, 3085083]
95281 [3085083]
Name: RelatedTitleId, dtype: object
data:
In [9]: x
Out[9]:
TitleId RelatedTitleId
0 89989 32598
1 89989 3085083
2 95281 3085083
Or pass list directly, in one line with no lambda:
dict(df.groupby('TitleId')['RelatedTitleId'].apply(list))
# {89989: [32598, 3085083], 95281: [3085083]}
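For reference, here is a self-contained version of the approach above that can be pasted into a fresh session. The frame is rebuilt from the three rows in the question; the variable name result is only illustrative.

```python
import pandas as pd

banned_titles = pd.DataFrame({
    'TitleId': [89989, 89989, 95281],
    'RelatedTitleId': [32598, 3085083, 3085083],
})

# Group on TitleId, collect each group's RelatedTitleId values
# into a list, then turn the resulting Series into a dict.
result = banned_titles.groupby('TitleId')['RelatedTitleId'].apply(list).to_dict()

print(result)
```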

Pandas - how to slice value_counts?

I would like to slice the result of a pandas value_counts():
>sur_perimetre[col].value_counts()
44341006.0 610
14231009.0 441
12131001.0 382
12222009.0 364
12142001.0 354
But I get an error:
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same happens with ix:
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that?
EDIT
Maybe:
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
Method 1:
You need to observe that value_counts() returns a Series object. You can process it like any other series and get the values. You can even construct a new dataframe out of it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [9]: df2
Out[9]:
count value
0 2 3
1 1 5
2 1 4
3 1 2
4 1 1
Method 2:
I then tried to reproduce the error you mentioned. With a single integer column in the DataFrame, the same notation did not raise any error for me.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3 2
5 1
4 1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3 2
5 1
4 1
2 1
1 1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
Hope it helps!
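One likely difference between the two cases is the index: in the question the counted values are floats, so the index of the value_counts() result is a float index, and a plain [:5] may be treated as a label lookup rather than a position. A minimal sketch, assuming that is the cause, is to slice by position explicitly with iloc or head. The data below is made up to resemble the question's values.

```python
import pandas as pd

# Float values, so value_counts() returns a Series with a float index.
s = pd.Series([44341006.0] * 3 + [14231009.0] * 2 + [12131001.0])
vc = s.value_counts()

# Positional selection does not depend on the index labels.
top2_iloc = vc.iloc[:2]
top2_head = vc.head(2)

print(top2_iloc)
```

Both forms return the first two rows of the counts, whatever type the index holds.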

Vectorized Splitting of String on Character Count using Numpy or Pandas

Is there a way to split a Numpy Array in a vectorized manner based upon character count for each element?
Input:
In [1]: import numpy as np
In [2]: y = np.array([ 'USC00013160194806SNOW','USC00013160194806SNOW','USC00013160194806SNOW' ])
In [3]: y
Out[3]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
I want each element of the array split according to a certain number of characters.
Desired Output:
In [3]: y
Out[3]:
array(['USC00013160', 'USC00013160',
'USC00013160'],
dtype='|S21')
I've done this using standard Python loops, but I'm dealing with millions of values, so I'm trying to find the fastest method.
You can create a view using a data type with the same size as y's dtype that has subfields corresponding to the parts that you want. For example,
In [22]: y
Out[22]:
array(['USC00013160194806SNOW', 'USC00013160194806SNOW',
'USC00013160194806SNOW'],
dtype='|S21')
In [23]: dt = np.dtype([('part1', 'S11'), ('part2', 'S6'), ('part3', 'S4')])
In [24]: v = y.view(dt)
In [25]: v['part1']
Out[25]:
array(['USC00013160', 'USC00013160', 'USC00013160'],
dtype='|S11')
In [26]: v['part2']
Out[26]:
array(['194806', '194806', '194806'],
dtype='|S6')
In [27]: v['part3']
Out[27]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Note that these are all views of the same data in y. If you modify them in place, you are also modifying y. For example,
In [32]: v3 = v['part3']
In [33]: v3
Out[33]:
array(['SNOW', 'SNOW', 'SNOW'],
dtype='|S4')
Change v3[1] to 'RAIN':
In [34]: v3[1] = 'RAIN'
In [35]: v3
Out[35]:
array(['SNOW', 'RAIN', 'SNOW'],
dtype='|S4')
Now see that y[1] is also changed:
In [36]: y
Out[36]:
array(['USC00013160194806SNOW', 'USC00013160194806RAIN',
'USC00013160194806SNOW'],
dtype='|S21')
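The view technique above can be condensed into a runnable sketch. The field names station, date and element are made up for illustration, and the array is created explicitly as bytes (dtype 'S21'), as in the transcript. With Python 3 str data the array would have a 'U21' dtype, and the sub-fields would presumably need to be 'U11', 'U6' and 'U4' so that the sizes still add up.

```python
import numpy as np

y = np.array(['USC00013160194806SNOW'] * 3, dtype='S21')

# The field widths must add up to the item size of y (11 + 6 + 4 = 21 bytes).
dt = np.dtype([('station', 'S11'), ('date', 'S6'), ('element', 'S4')])
assert dt.itemsize == y.dtype.itemsize

v = y.view(dt)          # no copy: same memory, reinterpreted as three fields
station = v['station']  # the first 11 characters of every element

print(station)
print(np.shares_memory(station, y))
```

Because nothing is copied, call station.copy() if the result must be independent of y.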
One possible solution I've found is to do the operation with a pandas Series, but I'm wondering whether it can be done using only NumPy array slicing. If not, that's fine; I'm mostly curious about the best practice.
Starting Pandas Series:
In [33]: x = pd.read_csv("data.txt", delimiter='\n', dtype=str, squeeze=True)
In [34]: x
Out[34]:
0 USC00013160194807SNOW
1 USC00013160194808SNOW
2 USC00013160194809SNOW
3 USC00013160194810SNOW
4 USC00013160194811SNOW
dtype: object
Vectorized String Processing based on Character Count:
In [37]: k = x.str[0:11]
Output:
In [38]: k
Out[38]:
0 USC00013160
1 USC00013160
2 USC00013160
3 USC00013160
4 USC00013160
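The same .str slicing extends to all three fixed-width fields. In this sketch the Series is built inline instead of being read from data.txt, and the column names station, date and element are assumptions made for readability.

```python
import pandas as pd

x = pd.Series(['USC00013160194807SNOW', 'USC00013160194808SNOW'])

# Each .str[...] slice is applied to every element of the Series.
fields = pd.DataFrame({
    'station': x.str[:11],
    'date': x.str[11:17],
    'element': x.str[17:],
})

print(fields)
```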

Comparing pandas.Series for equality when they are in different orders

Pandas automatically aligns data indices of Series objects before applying the binary operators such as addition and subtraction, but this is not done when checking for equality. Why is this, and how do I overcome it?
Consider the following example:
In [15]: x = pd.Series(index=["A", "B", "C"], data=[1,2,3])
In [16]: y = pd.Series(index=["C", "B", "A"], data=[3,2,1])
In [17]: x
Out[17]:
A 1
B 2
C 3
dtype: int64
In [18]: y
Out[18]:
C 3
B 2
A 1
dtype: int64
In [19]: x==y
Out[19]:
A False
B True
C False
dtype: bool
In [20]: x-y
Out[20]:
A 0
B 0
C 0
dtype: int64
I am using pandas 0.12.0.
You can overcome it with:
In [5]: x == y.reindex(x.index)
Out[5]:
A True
B True
C True
dtype: bool
or
In [6]: x.sort_index() == y.sort_index()
Out[6]:
A True
B True
C True
dtype: bool
The 'why' is explained here: https://github.com/pydata/pandas/issues/1134#issuecomment-5347816
Update: there is an issue that discusses this (https://github.com/pydata/pandas/issues/1134) and a PR to fix it (https://github.com/pydata/pandas/pull/6860).
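The two workarounds above, together with the aligning eq method, fit into one sketch that does not rely on how the == operator treats differently ordered indexes in a given pandas version. The variable names are illustrative only.

```python
import pandas as pd

x = pd.Series(index=["A", "B", "C"], data=[1, 2, 3])
y = pd.Series(index=["C", "B", "A"], data=[3, 2, 1])

# Put y in x's label order before comparing element by element.
by_reindex = x == y.reindex(x.index)

# Series.eq aligns on index labels first, like x - y does.
by_eq = x.eq(y)

# For a single True/False answer, compare the sorted Series as a whole.
same = x.sort_index().equals(y.sort_index())

print(by_reindex)
print(same)
```

The first two give a boolean Series per label; equals gives one boolean for the whole comparison.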
