I'm trying to divide a Pandas DataFrame column by its lagged value (a lag of 1 in this example).
First I create the DataFrame. This example has only 1 column, even though my real data has dozens:
dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
When I try this vector division (I'm a Python newbie coming from R):
dTest.ix[range(1,4),'Open'] / dTest.ix[range(0,3),'Open']
I get this output:
0   NaN
1     1
2     1
3   NaN
But I'm expecting:
1.0004327915052085
1.0013682367854484
0.9988446159101413
There's clearly something that I don't understand about the data structure. I'm expecting 3 values but it's outputting 4. What am I missing?
What you tried fails because the two slices only share the middle two index labels, and pandas aligns on the index before dividing. Use shift to lag the column and achieve what you want:
In [166]:
dTest['Open'] / dTest['Open'].shift()
Out[166]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can also use div:
In [159]:
dTest['Open'].div(dTest['Open'].shift(), axis=0)
Out[159]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
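As an aside, if the quantity you ultimately want is the period-over-period ratio, pct_change computes the same thing minus one; a small sketch, assuming dTest as defined in the question:
dTest['Open'].pct_change() + 1
# 0         NaN
# 1    1.000433
# 2    1.001368
# 3    0.998845
# Name: Open, dtype: float64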
Coming back to your original attempt: you can see that the two slices have different indices, so when you divide with / only the common index labels produce a non-NaN result:
In [164]:
dTest.ix[range(0,3),'Open']
Out[164]:
0 0.99355
1 0.99398
2 0.99534
Name: Open, dtype: float64
In [165]:
dTest.ix[range(1,4),'Open']
Out[165]:
1 0.99398
2 0.99534
3 0.99419
Name: Open, dtype: float64
Here is the intersection of the two indices:
In [168]:
dTest.ix[range(0,3),'Open'].index.intersection(dTest.ix[range(1,4),'Open'].index)
Out[168]:
Int64Index([1, 2], dtype='int64')
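If you want exactly the three ratios with no index alignment at all (closer to the positional behaviour of R vectors), one option is to drop down to the underlying NumPy arrays, which divide purely by position. A minimal sketch using the dTest from the question:
import numpy as np

open_vals = dTest['Open'].values          # plain NumPy array, no index attached
ratios = open_vals[1:] / open_vals[:-1]   # positional, elementwise division
# array([ 1.00043279,  1.00136824,  0.99884462])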
Related
I have the following data frame:
pa=pd.DataFrame({'a':np.array([[1.,4.],[2.],[3.,4.,5.]])})
I want to select the column 'a' and then only a particular element of each list (i.e. the first: 1., 2., 3.)
What do I need to add to:
pa.loc[:,['a']]
?
pa.loc[row] selects the row with label row.
pa.loc[row, col] selects the cells at the intersection of row and col.
pa.loc[:, col] selects all rows and the column named col. Note that although this works, it is not the idiomatic way to refer to a column of a DataFrame; for that you should use pa['a'].
Now, since the cells of your column contain lists, you can use the vectorized .str accessor to pick elements out of those lists by position, like so:
pa['a'].str[0]   # first value in each list
pa['a'].str[-1]  # last value in each list
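With the pa from the question, those two lines should return something like the following (a quick illustration of the values only):
pa['a'].str[0]
# 0    1.0
# 1    2.0
# 2    3.0

pa['a'].str[-1]
# 0    4.0
# 1    2.0
# 2    5.0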
Storing lists as values in a Pandas DataFrame tends to be a mistake because
it prevents you from taking advantage of fast NumPy or Pandas vectorized operations.
Therefore, you might be better off converting your DataFrame of lists of numbers into a wider DataFrame with native NumPy dtypes:
import numpy as np
import pandas as pd
pa = pd.DataFrame({'a':np.array([[1.,4.],[2.],[3.,4.,5.]])})
df = pd.DataFrame(pa['a'].values.tolist())
# 0 1 2
# 0 1.0 4.0 NaN
# 1 2.0 NaN NaN
# 2 3.0 4.0 5.0
Now, you could select the first column like this:
In [36]: df.iloc[:, 0]
Out[36]:
0 1.0
1 2.0
2 3.0
Name: 0, dtype: float64
or the first row like this:
In [37]: df.iloc[0, :]
Out[37]:
0 1.0
1 4.0
2 NaN
Name: 0, dtype: float64
If you wish to drop NaNs, use .dropna():
In [38]: df.iloc[0, :].dropna()
Out[38]:
0 1.0
1 4.0
Name: 0, dtype: float64
and .tolist() to retrieve the values as a list:
In [39]: df.iloc[0, :].dropna().tolist()
Out[39]: [1.0, 4.0]
but if you wish to leverage NumPy/Pandas for speed, you'll want to express your calculation as vectorized operations on df itself without converting back to Python lists.
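For example, here is a minimal sketch of staying vectorized on the wide df built above (the particular statistics are just illustrations, not something the question asked for):
first_elements = df.iloc[:, 0]           # first element of each original list
row_sums = df.sum(axis=1)                # 5.0, 2.0, 12.0 -- NaNs skipped automatically
row_lengths = df.notnull().sum(axis=1)   # how many elements each original list had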
I am using pandas 0.18.1 on a large dataframe. I am confused by the behaviour of value_counts(). This is my code:
print df.phase.value_counts()
def normalise_phase(x):
    print x
    return int(str(x).split('/')[0])
df['phase_normalised'] = df['phase'].apply(normalise_phase)
This prints the following:
2 35092
3 26248
1 24646
4 22189
1/2 8295
2/3 4219
0 1829
dtype: int64
1
nan
Two questions:
1. Why is nan printing as an output of normalise_phase, when nan is not listed as a value in value_counts?
2. Why does value_counts show dtype as int64 if it has string values like 1/2 and nan in it too?
You need to pass dropna=False for NaNs to be tallied (see the docs).
int64 is the dtype of the resulting Series, i.e. of the counts. The values themselves make up the index, and the dtype of that index is object, as you can check:
ser = pd.Series([1, '1/2', '1/2', 3, np.nan, 5])
ser.value_counts(dropna=False)
Out:
1/2 2
5 1
3 1
1 1
NaN 1
dtype: int64
ser.value_counts(dropna=False).index
Out: Index(['1/2', 5, 3, 1, nan], dtype='object')
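Coming back to the first question in practical terms: the nan you see printed means df['phase'] genuinely contains missing values (value_counts simply hid them by default), and int(str(x).split('/')[0]) will fail on them. A hedged sketch of a NaN-tolerant version, assuming you want missing phases to stay missing:
import numpy as np
import pandas as pd

def normalise_phase(x):
    # let missing values pass through instead of attempting int('nan')
    if pd.isnull(x):
        return np.nan
    return int(str(x).split('/')[0])

df['phase_normalised'] = df['phase'].apply(normalise_phase)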
I'm just getting going with Pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So say here as an example, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at dataframe.max(), which gives me a Series. Can I use that, or somehow use a lambda function in select()?
Use df.max() to build a boolean mask to index with.
In [19]: from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
Which you can finally pass back to df to select those columns:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0 you can use any value you want as the cutoff for dropping columns.
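Applied to the frame in the question, a hedged sketch that assumes Date is an ordinary column, so it is set aside before the numeric comparison and re-attached afterwards:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Date': ['2014.10.10', '2014.9.9', '2010.8.8'],
    'Stock1': [74.75, np.nan, np.nan],
    'Stock2': [np.nan, 100.95, np.nan],
    'Stock3': [np.nan, np.nan, 120.45],
})

x = 80
stocks = df.drop('Date', axis=1)          # numeric stock columns only
keep = stocks.columns[stocks.max() > x]   # columns whose max exceeds the cutoff
result = df[['Date'] + list(keep)]
#          Date  Stock2  Stock3
# 0  2014.10.10     NaN     NaN
# 1    2014.9.9  100.95     NaN
# 2    2010.8.8     NaN  120.45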
I'm trying to concatenate several columns, which mostly contain NaNs, into one. Here is an example with just 2:
2013-06-18 21:46:33.422096-05:00 A NaN
2013-06-18 21:46:35.715770-05:00 A NaN
2013-06-18 21:46:42.669825-05:00 NaN B
2013-06-18 21:46:45.409733-05:00 A NaN
2013-06-18 21:46:47.130747-05:00 NaN B
2013-06-18 21:46:47.131314-05:00 NaN B
This could go on for 3 or 4 or 10 columns, with always exactly 1 value being pd.notnull() and the rest NaN.
I want to concatenate these into 1 column the fastest way possible. How can I do this?
Since each row has exactly one non-null string and the other cells are NaN, the operation to apply is simply a row-wise max:
df.max(axis=1)
As noted in the comments, if this doesn't work in Python 3 (where strings and NaN can't be compared), convert the NaNs to empty strings first:
df.fillna('').max(axis=1)
You could do
In [278]: df = pd.DataFrame([[1, np.nan], [2, np.nan], [np.nan, 3]])
In [279]: df
Out[279]:
0 1
0 1 NaN
1 2 NaN
2 NaN 3
In [280]: df.sum(1)
Out[280]:
0 1
1 2
2 3
dtype: float64
Since NaNs are treated as 0 when summed, they don't show up.
A couple of caveats: you need to be sure that only one of the columns has a non-NaN value in each row for this to work, and it will also only work on numeric data.
You can also use
df.fillna(method='ffill', axis=1).iloc[:, -1]
The last column will now contain all the valid observations, since the valid values have been forward-filled across each row. See the fillna documentation for details. This approach should be more flexible but slower; iloc[:, -1] then selects every row of that last column.
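To make the two approaches concrete on string data shaped like the question's (the column names here are invented for the sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['A', 'A', np.nan, 'A', np.nan, np.nan],
    'col2': [np.nan, np.nan, 'B', np.nan, 'B', 'B'],
})

# both collapse the two columns into a single one: A, A, B, A, B, B
combined_max = df.fillna('').max(axis=1)
combined_ffill = df.fillna(method='ffill', axis=1).iloc[:, -1]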