Create three arrays from pandas series - python

For example, I have pandas data series like this:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['foo', 'bar', 'ololo'] * 4,
                   'B': np.random.randn(12),
                   'C': np.random.randint(0, 2, 12)})
ga = df.groupby(['A'])['C'].value_counts()
print(ga)
A
bar    1    3
       0    1
foo    0    3
       1    1
ololo  0    4
I want to create three arrays, like this:
First array
bar, foo, ololo
Second array (number of '1')
2 3 1
Third array (number of '0')
2 1 3
What's the simplest way to do this?

Starting with:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': ['foo', 'bar', 'ololo'] * 4,
    'B': np.random.randn(12),
    'C': np.random.randint(0, 2, 12)
})
counts = df.groupby('A')['C'].value_counts()
Gives (for counts):
A
bar    1    4
foo    1    4
ololo  0    3
       1    1
dtype: int64
So, effectively we want to unstack and transpose so that 0/1 are the index, which we do by:
reshaped = counts.unstack().T.reindex([0, 1]).fillna(0)
DSM points out it's possible to avoid .reindex by doing the following:
reshaped = counts.unstack().T.loc[[0, 1]].fillna(0)
Which gives:
A  bar  foo  ololo
0    0    0      3
1    4    4      1
We force a .reindex to ensure the index always contains 0 and 1 (in case the randomness means one of them never turns up) and force the missing values to 0 with .fillna(0). You can then get your arrays by doing the following:
arrays = reshaped.columns.values, reshaped.loc[1].values, reshaped.loc[0].values
Which gives you:
(array(['bar', 'foo', 'ololo'], dtype=object),
array([ 4., 4., 1.]),
array([ 0., 0., 3.]))
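As a side note (not part of the original answer), pd.crosstab can build the same 0/1-by-group table in one step; a minimal sketch, assuming the same df as above:
reshaped = pd.crosstab(df['C'], df['A']).reindex([0, 1], fill_value=0)  # rows 0/1, columns bar/foo/ololo
arrays = reshaped.columns.values, reshaped.loc[1].values, reshaped.loc[0].values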

Related

A Lexicographical Bug in Pandas?

Please take this question lightly, as it is asked out of curiosity:
As I was trying to see how slicing works on a MultiIndex, I came across the following situation:
import pandas as pd
import numpy as np

# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
Returns:
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32
NOTE that the index is not in sorted order, i.e. the order is a, c, b, which should produce the expected error when we slice.
# When we do slicing
data.loc["a":"c"]
Errors like:
UnsortedIndexError
----> 1 data.loc["a":"c"]
UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'
That's expected. But now, after doing the following steps:
# Making a DataFrame
data = data.unstack()
# Reindexing - to unsort the indices like before
data = data.reindex(["a", "c", "b"])
# Which looks like
   1  2
a  5  0
c  8  6
b  6  3
# Then again making series
data = data.stack()
# Reindex Again!
data = data.reindex(["a", "c", "b"], level=0)
# Which looks like before
a  1    5
   2    0
c  1    8
   2    6
b  1    6
   2    3
dtype: int32
The Problem
So, now the process is: Series → Unstack → DataFrame → Stack → Series
Now, if I do the same slicing as before (with the index still unsorted), we don't get any error!
# The same slicing
data.loc["a":"c"]
Results without an error:
a 1 5
2 0
c 1 8
2 6
dtype: int32
Even though data.index.is_monotonic → False, why can we still slice?
So the question is: WHY?
I hope the situation is clear: the same series that raised an error before no longer raises one after the unstack and stack operations.
So is that a bug, or a new concept that I am missing here?
Thanks!
Aayush ∞ Shah
UPDATE:
I have used data.reindex() to unsort the index once more. Please have a look again.
The difference between your two series is the following:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.randint(10, size=6), index=index)
data2 = data.unstack().reindex(["a", "c", "b"]).stack()
>>> data.index.codes
FrozenList([[0, 0, 2, 2, 1, 1], [0, 1, 0, 1, 0, 1]])
>>> data2.index.codes
FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
Even though your two indexes look the same from the outside (same values), the internal codes are different.
Check this method of MultiIndex:
Create a new MultiIndex from the current to monotonically sorted
items IN the levels. This does not actually make the entire MultiIndex
monotonic, JUST the levels.
The resulting MultiIndex will have the same outward
appearance, meaning the same .values and ordering. It will also
be .equals() to the original.
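To see that difference directly, a small check sketch (this assumes MultiIndex.is_lexsorted(), which tests whether the codes are sorted; it existed in the pandas versions this question targets and is deprecated in later releases):
data.index.is_lexsorted()     # False -> data.loc["a":"c"] raises UnsortedIndexError
data2.index.is_lexsorted()    # True  -> data2.loc["a":"c"] slices without error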
Old answer
# Making a DataFrame
data = data.unstack()
# Which looks like           # <- WRONG
   1  2                      #    1  2
a  5  0                      # a  8  0
c  8  6                      # b  4  1
b  6  3                      # c  7  6
# Then again making series
data = data.stack()
# Which looks like before    # <- WRONG
a  1    5                    # a  1    2
   2    0                    #    2    1
c  1    8                    # b  1    0
   2    6                    #    2    1
b  1    6                    # c  1    3
   2    3                    #    2    9
dtype: int32
If you want to use slicing, you have to check if the index is monotonic:
# Simple MultiIndex Creation
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
# Making Series with that MultiIndex
data = pd.Series(np.random.randint(10, size=6), index=index)
>>> data.index.is_monotonic
False
>>> data.unstack().stack().index.is_monotonic
True
>>> data.sort_index().index.is_monotonic
True
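If all you need is a working label slice, a minimal sketch is to sort the index first (note that the rows then come back in sorted order, so 'b' falls inside the 'a':'c' slice):
sorted_data = data.sort_index()
sorted_data.loc["a":"c"]    # no UnsortedIndexError; returns the 'a', 'b' and 'c' rows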

Question about iterating through Pandas series with conditional statements

I'm trying to generate a column that would have zeros everywhere except when a specific condition is met.
Right now, I have an existing series of 0s and 1s saved as a Series object. Let's call this Series A. I've created another series of the same size filled with zeros, let's call this Series B. What I'd like to do is, whenever I hit the last 1 in a sequence of 1s in Series A, then the next six rows of Series B should replace the 0s with 1s.
For example:
Series A
0
0
0
0
1
1
0
0
0
0
0
0
0
0
0
0
1
1
1
0
0
0
0...
Should produce Series B
0
0
0
0
0
0
1
1
1
1
1
1
0
0
0
0
0
0
0
1
1
1
1...
Here's what I've tried so far:
for row in SeriesA:
    if row == 1:
        continue
    if SeriesA[row] == 1 and SeriesA[row + 1] == 0:
        SeriesB[row] = 1
        SeriesB[row + 1] = 1
        SeriesB[row + 2] = 1
        SeriesB[row + 3] = 1
        SeriesB[row + 4] = 1
        SeriesB[row + 5] = 1
However, this just generates Series B full of zeros, except for the first five rows, which become 1s. (Series A is all zeros until at least row 50.)
I think I'm not understanding how iterating works with Pandas, so any help is appreciated!
EDIT: Full(ish) code
import os
import numpy as np
import pandas as pd
df = pd.read_csv("Python_Datafile.csv", names = fields) #fields is a list with names for each column, the first column is called "Date".
df["Date"] = pd.to_datetime(df["Date"], format = "%m/%Y")
df.set_index("Date", inplace = True)
Recession = df["NBER"] # This is series A
Rin6 = Recession*0 # This is series B
gps = Recession.ne(Recession.shift(1)).where(Recession.astype(bool)).cumsum()
idx = Recession[::-1].groupby(gps).idxmax()
to_one = np.hstack(pd.date_range(start=x+pd.offsets.DateOffset(months=1), freq='M', periods=6) for x in idx)
Rin6[Rin6.index.isin(to_one)]= 1
Rin6.unique() # Returns -> array([0], dtype=int64)
You can create an ID for consecutive groups of 1s using .shift + .cumsum:
gps = s.ne(s.shift(1)).where(s.astype(bool)).cumsum()
Then you can get the last index for each group by:
idx = s[::-1].groupby(gps).idxmax()
# 0
# 1.0     5
# 2.0    18
# Name: 0, dtype: int64
Form the list of all indices with np.hstack:
import numpy as np
np.hstack(np.arange(x+1, x+7, 1) for x in idx)
#array([ 6, 7, 8, 9, 10, 11, 19, 20, 21, 22, 23, 24])
And set those indices to 1 in the second Series:
s2[np.hstack(np.arange(x+1, x+7, 1) for x in idx)] = 1
s2.ravel()
# array([0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0.,..
Update from your comment: Assuming you have a Series s whose index consists of datetimes, and another Series s2 which has the same index but all values 0 and a MonthStart frequency, you can proceed in a similar fashion:
s = pd.Series([0,0,0,0,0,0,0,0,0,1,1]*5, index=pd.date_range('2010-01-01', freq='MS', periods=55))
s2 = s*0
gps = s.ne(s.shift(1)).where(s.astype(bool)).cumsum()
idx = s[::-1].groupby(gps).idxmax()
#1.0 2010-11-01
#2.0 2011-10-01
#3.0 2012-09-01
#4.0 2013-08-01
#5.0 2014-07-01
#dtype: datetime64[ns]
to_one = np.hstack(pd.date_range(start=x+pd.offsets.DateOffset(months=1), freq='MS', periods=6) for x in idx)
s2[s2.index.isin(to_one)]= 1
# I check .isin in case the indices extend beyond the indices in s2
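As an aside (not from the original answer), the six-row flag can also be built without groupby, by turning each run-end marker into six consecutive ones with a rolling window; a minimal sketch on the integer-indexed example from the question:
import pandas as pd
A = pd.Series([0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,1,1,1,0,0,0,0])
# mark the first 0 after each run of 1s (the row right after the run ends)
ends = (A.shift(1).eq(1) & A.eq(0)).astype(int)
# a rolling max over a 6-row window turns each marker into six consecutive 1s
B = ends.rolling(6, min_periods=1).max().astype(int)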

Python Dask map_partitions

Probably a continuation of this question, working from the dask docs examples for map_partitions.
import pandas as pd
import dask.dataframe as dd
from random import randint

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1., 2., 3., 4., 5.]})
ddf = dd.from_pandas(df, npartitions=2)

def myadd(df):
    new_value = df.x + randint(1, 4)
    return new_value

res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()
res
In the above code, randint is only being called once, not once per row as I would expect. How come?
Output:
x  y  z
1  1  4
2  2  5
3  3  6
4  4  7
5  5  8
If you performed the same operation (df.x + randint(1, 4)) on the original pandas dataframe, you would only get one random number, added to every existing value of the column. This is doing exactly the same as the pandas case, except that it is called once for each partition - this is what map_partitions does.
If you wanted a new random number for every row, you should first think of how you would achieve this with pandas. I can immediately think of two:
df.x.map(lambda x: x + random.randint(1, 4))
or
df.x + np.random.randint(1, 4, size=len(df.x))
If you replace your new_value = line with one of these, it will work as expected.
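For completeness, here is a sketch of the second option wired back into the dask example (assuming the same ddf as above):
import numpy as np
def myadd(df):
    # draw one random integer per row instead of one per partition
    return df.x + np.random.randint(1, 4, size=len(df))
res = ddf.map_partitions(lambda df: df.assign(z=myadd)).compute()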

Pandas - how to slice value_counts?

I would like to slice a pandas value_counts() :
> sur_perimetre[col].value_counts()
44341006.0    610
14231009.0    441
12131001.0    382
12222009.0    364
12142001.0    354
But I get an error :
> sur_perimetre[col].value_counts()[:5]
KeyError: 5.0
The same with ix :
> sur_perimetre[col].value_counts().ix[:5]
KeyError: 5.0
How would you deal with that?
EDIT
Maybe:
pd.DataFrame(sur_perimetre[col].value_counts()).reset_index()[:5]
Method 1:
You need to observe that value_counts() returns a Series object. You can process it like any other series and get the values. You can even construct a new dataframe out of it.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: vc = df.C1.value_counts()
In [4]: type(vc)
Out[4]: pandas.core.series.Series
In [5]: vc.values
Out[5]: array([2, 1, 1, 1, 1])
In [6]: vc.values[:2]
Out[6]: array([2, 1])
In [7]: vc.index.values
Out[7]: array([3, 5, 4, 2, 1])
In [8]: df2 = pd.DataFrame({'value':vc.index, 'count':vc.values})
In [8]: df2
Out[8]:
   count  value
0      2      3
1      1      5
2      1      4
3      1      2
4      1      1
Method 2:
Then I tried to reproduce the error you mentioned, but using a single column in a DataFrame I didn't get any error with the same notation you used.
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([1,2,3,3,4,5], columns=['C1'])
In [3]: df['C1'].value_counts()[:3]
Out[3]:
3 2
5 1
4 1
Name: C1, dtype: int64
In [4]: df.C1.value_counts()[:5]
Out[4]:
3 2
5 1
4 1
2 1
1 1
Name: C1, dtype: int64
In [5]: pd.__version__
Out[5]: u'0.17.1'
Hope it helps!
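As a side note (not part of the original answers), the KeyError in the question most likely comes from the float-valued index: [:5] is then interpreted as a label slice and pandas looks for the label 5.0. Purely positional selection sidesteps this:
vc = sur_perimetre[col].value_counts()
vc.iloc[:5]    # first five rows by position
vc.head(5)     # equivalent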

Get dot-product of dataframe with vector, and return dataframe, in Pandas

I am unable to find an entry for the dot() method in the official documentation. However, the method is there and I can use it. Why is this?
On this topic, is there a way to compute an element-wise multiplication of every row in a data frame with another vector (and obtain a dataframe back)? I.e. similar to dot(), but rather than computing the dot product, one computes the element-wise product.
mul is doing essentially an outer-product, while dot is an inner product. Let me expand on the accepted answer:
In [13]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [14]: v1 = np.array([2,2,2,3,3,3])
In [15]: v2 = np.array([2,3])
In [16]: df.shape
Out[16]: (6, 2)
In [17]: v1.shape
Out[17]: (6,)
In [18]: v2.shape
Out[18]: (2,)
In [24]: df.mul(v2)
Out[24]:
   A   B
0  2   3
1  2   6
2  2   9
3  4  12
4  4  15
5  4  18
In [26]: df.dot(v2)
Out[26]:
0     5
1     8
2    11
3    16
4    19
5    22
dtype: float64
So:
df.mul(v2) takes a matrix of shape (6,2) and a vector of shape (2,) and returns a matrix of shape (6,2).
While:
df.dot(v2) takes a matrix of shape (6,2) and a vector of shape (2,) and returns a vector of shape (6,).
These are not the same operation; they are outer and inner products, respectively.
Here is an example of how to multiply a DataFrame by a vector:
In [60]: df = pd.DataFrame({'A': [1., 1., 1., 2., 2., 2.], 'B': np.arange(1., 7.)})
In [61]: vector = np.array([2,2,2,3,3,3])
In [62]: df.mul(vector, axis=0)
Out[62]:
   A   B
0  2   2
1  2   4
2  2   6
3  6  12
4  6  15
5  6  18
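To make the shape bookkeeping concrete, the same results can be reproduced with plain NumPy broadcasting (a sketch assuming the df, v2 and vector defined above):
import numpy as np
np.allclose(df.mul(v2).values, df.values * v2)                           # column-wise broadcast, shape (6, 2)
np.allclose(df.mul(vector, axis=0).values, df.values * vector[:, None])  # row-wise broadcast, shape (6, 2)
np.allclose(df.dot(v2).values, df.values @ v2)                           # inner product, shape (6,)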
It's quite hard to say with any degree of accuracy.
Often, a method exists and is undocumented because it's considered internal by the vendor, and may be subject to change.
It could, of course, be a simple oversight by the folks who put together the documentation.
Regarding your second question; I don't really know about that - but it might be better to make a new S/O question for it.
Just scanning the API, could you do something with the DataFrame's .applymap(function) feature?
