Pandas: Confused when extending DataFrame vs Series (Column/Index). Why the difference?

First off, let me say that I've already looked over various responses to similar questions, but so far, none of them has really made it clear to me why (or why not) the Series and DataFrame methodologies are different.
Also, some of the Pandas documentation is not clear. For example, looking up Series.reindex,
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reindex.html
the examples suddenly switch to DataFrame rather than Series, even though the two methods don't seem to overlap exactly.
So, now to it, first with a DataFrame.
> df = pd.DataFrame(np.random.randn(6,4), index=range(6), columns=list('ABCD'))
> df
Out[544]:
A B C D
0 0.136833 -0.974500 1.708944 0.435174
1 -0.357955 -0.775882 -0.208945 0.120617
2 -0.002479 0.508927 -0.826698 -0.904927
3 1.955611 -0.558453 -0.476321 1.043139
4 -0.399369 -0.361136 -0.096981 0.092468
5 -0.130769 -0.075684 0.788455 1.640398
Now, to add new columns, I can do something simple (2 ways, same result).
> df[['X','Y']] = (99,-99)
> df.loc[:,['X','Y']] = (99,-99)
> df
Out[557]:
A B C D X Y
0 0.858615 -0.552171 1.225210 -1.700594 99 -99
1 1.062435 -1.917314 1.160043 -0.058348 99 -99
2 0.023910 1.262706 -1.924022 -0.625969 99 -99
3 1.794365 0.146491 -0.103081 0.731110 99 -99
4 -1.163691 1.429924 -0.194034 0.407508 99 -99
5 0.444909 -0.905060 0.983487 -4.149244 99 -99
Now, with a Series, I have hit a (mental?) block trying the same.
I'm going to be using a loop to construct a list of Series that will eventually become a data frame, but I want to deal with each 'row' as a Series first (to make development easier).
> ss = pd.Series(np.random.randn(4), index=list('ABCD'))
> ss
Out[552]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
dtype: float64
> ss['X','Y'] = (99,-99)
Traceback (most recent call last):
...
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
Same for,
> ss[['X','Y']] = (99,-99)
> ss.loc[['X','Y']] = (99,-99)
KeyError: "None of [Index(['X', 'Y'], dtype='object')] are in the [index]"
The only way I can get this working is rather clumsy (IMHO),
> ss['X'],ss['Y'] = (99,-99)
> ss
Out[560]:
A 0.078013
B 1.707052
C -0.177543
D -1.072017
X 99.000000
Y -99.000000
dtype: float64
I did think that, perhaps, reindexing the Series to add the new indices prior to assignment might solve the problem. It would, but then I hit an issue trying to change the index.
> ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
> xs = pd.Series([99,-99], index=['X','Y'], name='z')
Here I can concat my 2 Series to create a new one, and I can also concat the Series indices, e.g.,
> ss.index.append(xs.index)
Index(['A', 'B', 'C', 'D', 'X', 'Y'], dtype='object')
But I can't extend the current index with,
> ss.index = ss.index.append(xs.index)
ValueError: Length mismatch: Expected axis has 4 elements, new values have 6 elements
So, what intuitive leap must I make to understand why the former Series methods don't work, but (what looks like an equivalent) DataFrame method does work?
It makes passing multiple outputs back from a function into new Series elements a bit clunky. I can't make up new Series index names 'on the fly' to insert values into my existing Series object.

I don't think you can directly modify the Series in place to add multiple values at once.
If having a new object is not an issue:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')
# new object with updated index
ss = ss.reindex(ss.index.union(xs.index))
ss.update(xs)
Output:
A -0.369182
B -0.239379
C 1.099660
D 0.655264
X 99.000000
Y -99.000000
Name: z, dtype: float64
An in-place alternative using a function:
ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99,-99], index=['X','Y'], name='z')
def extend(s1, s2):
    s1.update(s2)  # update common indices
    # add others
    for idx, val in s2[s2.index.difference(s1.index)].items():
        s1[idx] = val
extend(ss, xs)
Updated ss:
A 0.279925
B -0.098150
C 0.910179
D 0.317218
X 99.000000
Y -99.000000
Name: z, dtype: float64
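As a footnote: if the two Series are guaranteed not to share any labels, the pd.concat route the question already experiments with gets there in one step. A minimal sketch, assuming no index overlap (unlike update, concat keeps duplicate labels rather than resolving conflicts):
import numpy as np
import pandas as pd

ss = pd.Series(np.random.randn(4), index=list('ABCD'), name='z')
xs = pd.Series([99, -99], index=['X', 'Y'], name='z')

# concat returns a new, longer Series; the name is kept when both names match
ss = pd.concat([ss, xs])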

While I have accepted @mozway's answer above, since it nicely handles extending the Series even when there are possible index conflicts, I'm adding this 'answer' to demonstrate my point about the inconsistency in the extend operation between Series and DataFrame.
If I create my Series as single-row DataFrames, as below, I can now extend the 'series' as I expected.
z=pd.Index(['z'])
ss = pd.DataFrame(np.random.randn(1,4), columns=list('ABCD'),index=z)
xs = pd.DataFrame([[99,-99]], columns=['X','Y'],index=z)
ss
Out[619]:
A B C D
z 1.052589 -0.337622 -0.791994 -0.266888
ss[['x','y']] = xs
ss
Out[620]:
A B C D x y
z 1.052589 -0.337622 -0.791994 -0.266888 99 -99
type(ss)
Out[621]: pandas.core.frame.DataFrame
Note that, as a DataFrame, I don't even need a Series as the extending object.
ss[['X','Y']] = [123,-123]
ss
Out[633]:
A B C D X Y
z 0.600981 -0.473031 0.216941 0.255252 123 -123
So I've simply extended the DataFrame, but it's still a DataFrame of 1 row.
I can now either 'squeeze' the DataFrame,
zz1=ss.squeeze()
type(zz1)
Out[624]: pandas.core.series.Series
zz1
Out[625]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Alternatively, I can use 'iloc[0]' to get a Series directly. Note that 'loc' will return a DataFrame, not a Series, and will still require 'squeezing'.
zz2=ss.iloc[0]
type(zz2)
Out[629]: pandas.core.series.Series
zz2
Out[630]:
A 1.052589
B -0.337622
C -0.791994
D -0.266888
x 99.000000
y -99.000000
Name: z, dtype: float64
Please note, I'm not a Pandas 'wizard' so there may be other insights that I lack.

Pandas: filter on multiple columns [duplicate]

I am working in Pandas, and I want to apply multiple filters to a data frame across multiple fields.
I am working with another, more complex data frame, but I am simplifying the context for this question. Here is the setup for a sample data frame:
dates = pd.date_range('20170101', periods=16)
rand_df = pd.DataFrame(np.random.randn(16,4), index=dates, columns=list('ABCD'))
Applying one filter to this data frame is well documented and simple:
rand_df.loc[lambda df: df['A'] < 0]
Since the lambda looks like a simple boolean expression, it is tempting to do the following. This does not work: instead of being a boolean expression, the lambda is a callable, and multiple callables cannot be combined the way boolean expressions can:
rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-dfa05ab293f9> in <module>()
----> 1 rand_df.loc[lambda df: df['A'] < 0 and df['B'] < 0]
I have found two ways to successfully implement this. I will add them to the potential answers, so you can comment directly on them as solutions. However, I would like to solicit other approaches, since I am not really sure that either of these is a very standard approach for filtering a Pandas data frame.
In [3]: rand_df.query("A < 0 and B < 0")
Out[3]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
or:
In [6]: rand_df[rand_df[['A','B']].lt(0).all(1)]
Out[6]:
A B C D
2017-01-02 -0.701682 -1.224531 -0.273323 -1.091705
2017-01-05 -1.262971 -0.531959 -0.997451 -0.070095
2017-01-06 -0.065729 -1.427199 1.202082 0.136657
2017-01-08 -1.445050 -0.367112 -2.617743 0.496396
2017-01-12 -1.273692 -0.456254 -0.668510 -0.125507
P.S. You will find a lot of examples in the Pandas docs:
rand_df[(rand_df.A < 0) & (rand_df.B <0)]
To use the lambda, combine the conditions with & rather than and:
rand_df.loc[lambda x: (x.A < 0) & (x.B < 0)]
# Or
# rand_df[lambda x: (x.A < 0) & (x.B < 0)]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
You can speed up the evaluation by using boolean numpy arrays:
c1 = rand_df.A.values < 0
c2 = rand_df.B.values < 0
rand_df[c1 & c2]
A B C D
2017-01-12 -0.460918 -1.001184 -0.796981 0.328535
2017-01-14 -0.146846 -1.088095 -1.055271 -0.778120
Here is an approach that 'chains' use of the 'loc' operation:
rand_df.loc[lambda df: df['A'] < 0].loc[lambda df: df['B'] < 0]
Here is an approach which includes writing a method to do the filtering. I am sure that some filters will be sufficiently complex that a dedicated method is the best way to go (this case is not so complex). Also, when I am using Pandas and I write a 'for' loop, I feel like I am doing it wrong.
def lt_zero_ab(df):
    result = []
    for index, row in df.iterrows():
        if row['A'] < 0 and row['B'] < 0:
            result.append(index)
    return result
rand_df.loc[lt_zero_ab]

Is there a query method or similar for pandas Series (pandas.Series.query())?

The pandas.DataFrame.query() method is of great use for (pre/post-)filtering data when loading or plotting. It comes in particularly handy for method chaining.
I find myself often wanting to apply the same logic to a pandas.Series, e.g. after a method such as df.value_counts, which returns a pandas.Series.
Example
Let's assume there is a huge table with the columns Player, Game, Points, and I want to plot a histogram of the players who scored 3 points more than 14 times. I first have to sum the points of each player (groupby -> agg), which will return a Series of ~1000 players and their overall points. Applying the .query logic, it would look something like this:
df = pd.DataFrame({
'Points': [random.choice([1,3]) for x in range(100)],
'Player': [random.choice(["A","B","C"]) for x in range(100)]})
(df
.query("Points == 3")
.Player.value_counts()
.query("> 14")
.hist())
The only solutions I find force me to do an unnecessary assignment and break the method chaining:
points_series = df.query("Points == 3").groupby("Player").size()
points_series[points_series > 100].hist()
Method chaining, as well as the query method, helps to keep the code legible, whereas the subsetting/filtering style can get messy quite quickly.
# just to make my point :)
series_bestplayers_under_100[series_prefiltered_under_100 > 0].shape
Please help me out of my dilemma! Thanks
If I understand correctly, you can add query("Points > 100"):
df = pd.DataFrame({'Points':[50,20,38,90,0, np.Inf],
'Player':['a','a','a','s','s','s']})
print (df)
Player Points
0 a 50.000000
1 a 20.000000
2 a 38.000000
3 s 90.000000
4 s 0.000000
5 s inf
points_series = df.query("Points < inf").groupby("Player").agg({"Points": "sum"})['Points']
print (points_series)
a = points_series[points_series > 100]
print (a)
Player
a 108.0
Name: Points, dtype: float64
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})
                   .query("Points > 100"))
print (points_series)
Points
Player
a 108.0
Another solution is Selection By Callable:
points_series = (df.query("Points < inf")
                   .groupby("Player")
                   .agg({"Points": "sum"})['Points']
                   .loc[lambda x: x > 100])
print (points_series)
Player
a 108.0
Name: Points, dtype: float64
Edited answer for the edited question:
np.random.seed(1234)
df = pd.DataFrame({
'Points': [np.random.choice([1,3]) for x in range(100)],
'Player': [np.random.choice(["A","B","C"]) for x in range(100)]})
print (df.query("Points == 3").Player.value_counts().loc[lambda x: x > 15])
C 19
B 16
Name: Player, dtype: int64
print (df.query("Points == 3").groupby("Player").size().loc[lambda x: x > 15])
Player
B 16
C 19
dtype: int64
Why not convert from Series to DataFrame, do the querying, and then convert back?
df["Points"] = df["Points"].to_frame().query('Points > 100')["Points"]
Here, .to_frame() converts to DataFrame, while the trailing ["Points"] converts to Series.
The method .query() can then be used consistently whether or not the Pandas object has 1 or more columns.
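Applied to the example from the question, the round trip keeps the chain intact. A sketch, where the column name "n" is an arbitrary label supplied via Series.to_frame:
import random
import pandas as pd

df = pd.DataFrame({
    'Points': [random.choice([1, 3]) for _ in range(100)],
    'Player': [random.choice(["A", "B", "C"]) for _ in range(100)]})

# to_frame("n") names the counts so DataFrame.query can filter them;
# the trailing ["n"] converts back to a Series for plotting
(df.query("Points == 3")
   .groupby("Player").size()
   .to_frame("n")
   .query("n > 14")["n"]
   .hist())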
Instead of query you can use pipe:
s.pipe(lambda x: x[x>0]).pipe(lambda x: x[x<10])
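For example, pipe slots straight into the original chain (a sketch, reusing the df defined in the question above):
(df.query("Points == 3")
   .groupby("Player").size()
   .pipe(lambda s: s[s > 14])  # filter the Series mid-chain
   .hist())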

Summing 3 columns in a dataframe

This should be easy:
I have a data frame with the following columns
a, b, min, w, w_min
all I want to do is sum up the columns min, w, and w_min and read the result into another data frame.
I've looked, but I cannot find a previously asked question that directly relates to this. Everything I've found seems much more complex than what I'm trying to do.
You can just pass a list of cols and select these to perform the summation on:
In [64]:
df = pd.DataFrame(columns=['a','b','min','w','w_min'], data = np.random.randn(10,5) )
df
Out[64]:
a b min w w_min
0 0.626671 0.850726 0.539850 -0.669130 -1.227742
1 0.856717 2.108739 -0.079023 -1.107422 -1.417046
2 -1.116149 -0.013082 0.871393 -1.681556 -0.170569
3 -0.944121 -2.394906 -0.454649 0.632995 1.661580
4 0.590963 0.751912 0.395514 0.580653 0.573801
5 -1.661095 -0.592036 -1.278102 -0.723079 0.051083
6 0.300866 -0.060604 0.606705 1.412149 0.916915
7 -1.640530 -0.398978 0.133140 -0.628777 -0.464620
8 0.734518 1.230869 -1.177326 -0.544876 0.244702
9 -1.300137 1.328613 -1.301202 0.951401 -0.693154
In [65]:
cols=['min','w','w_min']
df[cols].sum()
Out[65]:
min -1.743700
w -1.777642
w_min -0.525050
dtype: float64
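If the result really does need to land in another data frame, or if a per-row total was intended, two variants (a sketch, assuming the df above; 'row_total' is an arbitrary name):
cols = ['min', 'w', 'w_min']

# column totals as a one-row DataFrame instead of a Series
totals = df[cols].sum().to_frame().T

# per-row total across the three columns, if that was the intent
df['row_total'] = df[cols].sum(axis=1)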

calling function with dataframe data gives error (cannot convert the series to <class 'float'>)

I have an option pricing model (very simple Black Scholes) that works fine with data in this fashion:
In [18]:
BS2(100.,100.,1.,.001,.3)
Out[18]:
11.96762435837207
the function is here:
Black Sholes Function
def BS2(S,X,T,r,v):
    d1 = (log(S/X)+(.001+v*v/2)*T)/(v*sqrt(T))
    d2 = d1-v*sqrt(T)
    return (S*CND(d1)-X*exp(-.001*T)*CND(d2))
I do not think it matters for this question, but BS2 calls this:
Cumulative normal distribution function
def CND(X):
    (a1,a2,a3,a4,a5) = (0.31938153, -0.356563782, 1.781477937,
                        -1.821255978, 1.330274429)
    L = abs(X)
    K = 1.0 / (1.0 + 0.2316419 * L)
    w = 1.0 - 1.0 / sqrt(2*pi)*exp(-L*L/2.) * (a1*K + a2*K*K + a3*pow(K,3) +
                                               a4*pow(K,4) + a5*pow(K,5))
    if X < 0:
        w = 1.0 - w
    return w
I tried to modify the working BS2 function to accept data from a df, but I seem to have done something wrong:
def BS(df):
    d1 = (log(S/X)+(.001+v*v/2)*T)/(v*sqrt(T))
    d2 = d1-v*sqrt(T)
    return pd.Series((S*CND(d1)-X*exp(-.001*T)*CND(d2)))
my data is very straightforward:
In [13]:
df
Out[13]:
S X T r v
0 100 100 1 0.001 0.3
1 50 50 1 0.001 0.3
and are all float64
In [14]:
df.dtypes
Out[14]:
S float64
X float64
T float64
r float64
v float64
dtype: object
I also tried assigning the df columns to names before calling the function (I tried it both with and without this assignment):
S=df['S']
X=df['X']
T=df['T']
r=df['r']
v=df['v']
at the risk of sending too much info, here is the error message:
In [18]:
BS(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-18-745e7dd0eb2c> in <module>()
----> 1 BS(df)
<ipython-input-17-b666a39cd530> in BS(df)
3 def BS(df):
4 CallPutFlag='c'
----> 5 d1 = (log(S/X)+(.001+v*v/2)*T)/(v*sqrt(T))
6 d2 = d1-v*sqrt(T)
7 cp = ((S*CND(d1)-X*exp(-.001*T)*CND(d2)))
C:\Users\camcompco\AppData\Roaming\Python\Python34\site-packages\pandas\core\series.py in wrapper(self)
74 return converter(self.iloc[0])
75 raise TypeError(
---> 76 "cannot convert the series to {0}".format(str(converter)))
77 return wrapper
78
TypeError: cannot convert the series to <class 'float'>
any assistance would be greatly appreciated.
John
I think it would be easier to use dataframe.apply()
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.apply.html
then the syntax would be df.apply(func, axis = 1) to apply the function func to each row.
The answer to this question is similar:
Apply function to each row of pandas dataframe to create two new columns
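Concretely, with the BS2 already defined in the question, that would look something like this (a sketch; the column names follow the question's df, and 'prices' is an arbitrary name):
# apply BS2 row by row; axis=1 passes each row to the lambda as a Series
prices = df.apply(lambda row: BS2(row['S'], row['X'], row['T'],
                                  row['r'], row['v']), axis=1)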
@JonD's answer is good, but here's an alternate answer that will be faster if your dataframe has more than a few rows:
import numpy as np
from scipy.stats import norm

def BS2(df):
    d1 = (np.log(df.S/df.X)+(.001+df.v*df.v/2)*df['T'])/(df.v*np.sqrt(df['T']))
    d2 = d1-df.v*np.sqrt(df['T'])
    return (df.S*norm.cdf(d1)-df.X*np.exp(-.001*df['T'])*norm.cdf(d2))
Changes:
Main point is to vectorize the function. Syntax-wise the main change is to explicitly use numpy versions of sqrt, log, and exp. Otherwise you don't have to change much because numpy/pandas support basic math operations in an elementwise manner.
Replaced the user-written CND with norm.cdf from scipy. Much faster, because built-in functions are almost always as fast as possible.
This is minor: I went with shortcut notation for df.X and the others, but df['T'] needs to be written out, since df.T would be interpreted as df.transpose(). I guess this is a good example of why you should avoid the shortcut notation, but I'm lazy...
Btw, if you want even more speed, the next thing to try would be to do it in numpy rather than pandas. You could also check if others have already written Black-Scholes functions/libraries (probably, though I don't know anything about it).

DataFrame Subset

I have a dataframe already and am subsetting some of it to another dataframe.
I do that like this:
D = njm[['svntygene', 'intgr', 'lowgr', 'higr', 'lumA', 'lumB', 'wndres', 'nlbrst', 'Erneg', 'basallike']]
I want to try to select it by integer position though, something like this:
D = njm.iloc[1:, 2:, 3:, 7:]
But I get an error. How would I do this part? I read the docs but could not find a clear answer.
Also, is it possible to pass a list to this as values too?
Thanks.
This is covered in the iloc section of the documentation: you can pass a list with the desired indices.
>>> df = pd.DataFrame(np.random.random((5,5)),columns=list("ABCDE"))
>>> df
A B C D E
0 0.605594 0.229728 0.390391 0.754185 0.516801
1 0.384228 0.106261 0.457507 0.833473 0.786098
2 0.364943 0.664588 0.330835 0.846941 0.229110
3 0.025799 0.681206 0.235821 0.418825 0.878566
4 0.811800 0.761962 0.883281 0.932983 0.665609
>>> df.iloc[:,[1,2,4]]
B C E
0 0.229728 0.390391 0.516801
1 0.106261 0.457507 0.786098
2 0.664588 0.330835 0.229110
3 0.681206 0.235821 0.878566
4 0.761962 0.883281 0.665609
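iloc also accepts a row slice together with a column list, which is closer to what the question attempted (using the same df as above):
>>> df.iloc[1:, [0, 2]]
          A         C
1  0.384228  0.457507
2  0.364943  0.330835
3  0.025799  0.235821
4  0.811800  0.883281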
