fill missing indices in pandas - python

I have data like follows:
import pandas as pd
from datetime import datetime
x = pd.Series([1, 2, 4], [datetime(2013,11,1), datetime(2013,11, 2), datetime(2013, 11, 4)])
The missing index at November 3rd corresponds to a zero value, and I want it to look like this:
y = pd.Series([1,2,0,4], pd.date_range('2013-11-01', periods = 4))
What's the best way to convert x to y? I've tried
y = pd.Series(x, pd.date_range('2013-11-1', periods = 4)).fillna(0)
This sometimes throws an index error I can't interpret ("Index length did not match values"), even though the index and data have the same length. Is there a better way to do this?

You can use pandas.Series.resample() for this:
>>> x.resample('D').fillna(0)
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
There's a fill_method parameter in the resample() function, but I don't know whether it can be used to replace NaN during resampling. It looks like you can use the how parameter to take care of it, though, like this:
>>> x.resample('D', how=lambda x: x.mean() if len(x) > 0 else 0)
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
I don't know which method is the preferred one. Please also take a look at @AndyHayden's answer - reindex() with fill_value=0 is probably the most efficient way to do this, but you should run your own tests.

I think I would use resample (note that if there are duplicate timestamps it takes the mean by default):
In [11]: x.resample('D') # you could use how='first'
Out[11]:
2013-11-01 1
2013-11-02 2
2013-11-03 NaN
2013-11-04 4
Freq: D, dtype: float64
In [12]: x.resample('D').fillna(0)
Out[12]:
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
Freq: D, dtype: float64
If you would prefer duplicates to raise an error, use reindex instead:
In [13]: x.reindex(pd.date_range('2013-11-1', periods=4), fill_value=0)
Out[13]:
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
Freq: D, dtype: float64
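Note that in more recent pandas versions resample('D') returns a Resampler object rather than a filled Series, so the chained fillna(0) calls above need an intermediate step. A minimal sketch of how the same result might be obtained today (assuming a current pandas release):
import pandas as pd
from datetime import datetime
x = pd.Series([1, 2, 4], [datetime(2013, 11, 1), datetime(2013, 11, 2), datetime(2013, 11, 4)])
# reindex onto a complete daily range, filling the gaps with 0
y = x.reindex(pd.date_range('2013-11-01', periods=4), fill_value=0)
# or upsample to daily frequency and fill the slots created by upsampling
y2 = x.resample('D').asfreq(fill_value=0)
# x.asfreq('D', fill_value=0) should give the same result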

Related

Odd behaviour when applying `Pandas.Timedelta.total_seconds`

I have a pandas DataFrame with a column of Timedelta type. I used groupby with a separate month column to group these Timedeltas by month, then tried to use agg with min, max and mean on the Timedelta column, which triggered DataError: No numeric types to aggregate.
As a workaround I tried using total_seconds() together with apply() to get a numeric representation of the column, but the behaviour seems strange to me: the NaT values in my Timedelta column were turned into -9.223372e+09, whereas they result in NaN when total_seconds() is called on a scalar without apply().
A minimal example:
import numpy as np
import pandas as pd
test = pd.Series([np.datetime64('nat'), np.datetime64('nat')])
res = test.apply(pd.Timedelta.total_seconds)
print(res)
which produces:
0 -9.223372e+09
1 -9.223372e+09
dtype: float64
whereas:
res = test.iloc[0].total_seconds()
print(res)
yields:
nan
The behaviour of the second example is desired as I wish to perform aggregations etc and propagate missing/invalid values. Is this a bug ?
You should use the .dt.total_seconds() accessor instead of applying the pd.Timedelta.total_seconds function to a datetime64[ns] column:
In [232]: test
Out[232]:
0 NaT
1 NaT
dtype: datetime64[ns] # <----
In [233]: pd.to_timedelta(test)
Out[233]:
0 NaT
1 NaT
dtype: timedelta64[ns] # <----
In [234]: pd.to_timedelta(test).dt.total_seconds()
Out[234]:
0 NaN
1 NaN
dtype: float64
Another demo:
In [228]: s = pd.Series(pd.to_timedelta(['03:33:33','1 day','aaa'], errors='coerce'))
In [229]: s
Out[229]:
0 0 days 03:33:33
1 1 days 00:00:00
2 NaT
dtype: timedelta64[ns]
In [230]: s.dt.total_seconds()
Out[230]:
0 12813.0
1 86400.0
2 NaN
dtype: float64
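To close the loop on the original groupby use case, here is a minimal sketch (the column names and sample data are my own, not from the question): convert the Timedelta column to seconds first, then aggregate, so that NaT propagates as NaN instead of raising:
import pandas as pd
df = pd.DataFrame({'month': [1, 1, 2, 2],
                   'delta': pd.to_timedelta(['03:33:33', pd.NaT, '1 day', '02:00:00'])})
# NaT becomes NaN, so min/max/mean skip it
df['seconds'] = df['delta'].dt.total_seconds()
df.groupby('month')['seconds'].agg(['min', 'max', 'mean'])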

Pandas time interpolation - NaNs

I'm having problems interpolating over time in Pandas, so I've taken it back to a very basic example and I still see the same problem.
c is the dataframe, a is the index (a datetime64 array), and b is the data (a float array):
In [104]: c
Out[104]:
b
a
2009-04-01 386.928680
2009-06-01 386.502686
In [105]: a
Out[105]:
0 2009-04-01
1 2009-06-01
dtype: datetime64[ns]
In [106]: b
Out[106]:
0 386.928680
1 386.502686
dtype: float64
upsampled = c.resample('M')
interpolated = upsampled.interpolate(method='linear')
In [107]: interpolated
Out[107]:
b
a
2009-04-30 NaN
2009-05-31 NaN
2009-06-30 NaN
I've tried changing the interpolation method and setting the limit keyword, but nothing seems to help; I just get all NaNs.
You need to change your resample frequency to 'MS' (month start) so that the new index lines up with your original timestamps; with 'M' the new index falls on month ends, where your data has no values, so there is nothing to interpolate from.
c.resample('MS').asfreq().interpolate(method='linear')
Output:
b
a
2009-04-01 386.928680
2009-05-01 386.715683
2009-06-01 386.502686
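For reference, a self-contained sketch that rebuilds the frame from the printed output above and upsamples it, first to month starts and then to daily frequency (the construction details are assumptions on my part):
import pandas as pd
c = pd.DataFrame({'b': [386.928680, 386.502686]},
                 index=pd.to_datetime(['2009-04-01', '2009-06-01']))
c.index.name = 'a'
# 'MS' places the new index at month starts, where the original points live,
# so interpolate() has real values to work from
monthly = c.resample('MS').asfreq().interpolate(method='linear')
# the same idea works at finer frequencies, e.g. daily
daily = c.resample('D').asfreq().interpolate(method='linear')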

pandas dividing a column by lagged values

I'm trying to divide a Pandas DataFrame column by a lagged value, which is 1 in this example.
Create the dataframe. This example only has one column, even though my real data has dozens:
dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
When I try this vector division (I'm a Python newbie coming from R):
dTest.ix[range(1,4),'Open'] / dTest.ix[range(0,3),'Open']
I get this output:
0    NaN
1    1.0
2    1.0
3    NaN
Name: Open, dtype: float64
But I'm expecting:
1.0004327915052085
1.0013682367854484
0.9988446159101413
There's clearly something that I don't understand about the data structure. I'm expecting 3 values but it's outputting 4. What am I missing?
What you tried fails because the two sliced ranges only overlap on the middle two index labels (1 and 2); pandas aligns on the index before dividing. Use shift() to lag the column instead:
In [166]:
dTest['Open'] / dTest['Open'].shift()
Out[166]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can also use div:
In [159]:
dTest['Open'].div(dTest['Open'].shift(), axis=0)
Out[159]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can see that the indices differ when you slice, so when you use / only the common index labels are aligned and everything else becomes NaN:
In [164]:
dTest.ix[range(0,3),'Open']
Out[164]:
0 0.99355
1 0.99398
2 0.99534
Name: Open, dtype: float64
In [165]:
dTest.ix[range(1,4),'Open']
Out[165]:
1 0.99398
2 0.99534
3 0.99419
Name: Open, dtype: float64
Here is the overlap between the two indices:
In [168]:
dTest.ix[range(0,3),'Open'].index.intersection(dTest.ix[range(1,4),'Open'].index)
Out[168]:
Int64Index([1, 2], dtype='int64')
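As a side note (my addition, not part of the original answers): .ix has since been removed from pandas, so here are a couple of equivalent ways to express the same lagged ratio in current versions:
import pandas as pd
dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
# shift-based ratio, as in the accepted answer
ratio = dTest['Open'] / dTest['Open'].shift()
# pct_change() computes x / x.shift() - 1, so adding 1 gives the same ratio
ratio2 = dTest['Open'].pct_change() + 1
# positional division on the raw values, if you only want the three numbers
vals = dTest['Open'].to_numpy()
ratio3 = vals[1:] / vals[:-1]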

Python Pandas drop columns based on max value of column

I'm just getting going with Pandas as a tool for munging two-dimensional arrays of data. It's super overwhelming, even after reading the docs. You can do so much that I can't figure out how to do anything, if that makes any sense.
My dataframe (simplified):
Date Stock1 Stock2 Stock3
2014.10.10 74.75 NaN NaN
2014.9.9 NaN 100.95 NaN
2010.8.8 NaN NaN 120.45
So each column only has one value.
I want to remove all columns that have a max value less than x. So, as an example, if x = 80, then I want a new DataFrame:
Date Stock2 Stock3
2014.10.10 NaN NaN
2014.9.9 100.95 NaN
2010.8.8 NaN 120.45
How can this be achieved? I've looked at dataframe.max(), which gives me a Series. Can I use that, or somehow use a lambda function in select()?
Use df.max() to index with.
In [19]: import numpy as np; from pandas import DataFrame
In [23]: df = DataFrame(np.random.randn(3,3), columns=['a','b','c'])
In [36]: df
Out[36]:
a b c
0 -0.928912 0.220573 1.948065
1 -0.310504 0.847638 -0.541496
2 -0.743000 -1.099226 -1.183567
In [24]: df.max()
Out[24]:
a -0.310504
b 0.847638
c 1.948065
dtype: float64
Next, we make a boolean expression out of this:
In [31]: df.max() > 0
Out[31]:
a False
b True
c True
dtype: bool
Next, you can index df.columns by this (this is called boolean indexing):
In [34]: df.columns[df.max() > 0]
Out[34]: Index([u'b', u'c'], dtype='object')
Which you can finally pass to DF:
In [35]: df[df.columns[df.max() > 0]]
Out[35]:
b c
0 0.220573 1.948065
1 0.847638 -0.541496
2 -1.099226 -1.183567
Of course, instead of 0 you can use any value you want as the cutoff for dropping.
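Putting it together on the frame from the question (a minimal sketch; the data is reconstructed from the post, the dates are used as the index, and df.loc[:, mask] is just the one-step form of the column selection above):
import numpy as np
import pandas as pd
stocks = pd.DataFrame({'Stock1': [74.75, np.nan, np.nan],
                       'Stock2': [np.nan, 100.95, np.nan],
                       'Stock3': [np.nan, np.nan, 120.45]},
                      index=pd.to_datetime(['2014-10-10', '2014-09-09', '2010-08-08']))
# keep only the columns whose max reaches the cutoff; max() skips NaN by default
x = 80
stocks.loc[:, stocks.max() > x]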

Filtering all rows with NaT in a column in Dataframe python

I have a df like this:
a b c
1 NaT w
2 2014-02-01 g
3 NaT x
df=df[df.b=='2014-02-01']
will give me
a b c
2 2014-02-01 g
How do I get a dataframe of all rows with NaT in column b?
df=df[df.b==None] #Doesn't work
I want this:
a b c
1 NaT w
3 NaT x
isnull and notnull work with NaT so you can handle them much the same way you handle NaNs:
>>> df
a b c
0 1 NaT w
1 2 2014-02-01 g
2 3 NaT x
>>> df.dtypes
a int64
b datetime64[ns]
c object
Just use isnull to select:
df[df.b.isnull()]
a b c
0 1 NaT w
2 3 NaT x
Using your example dataframe:
df = pd.DataFrame({"a": [1, 2, 3],
                   "b": [pd.NaT, pd.to_datetime("2014-02-01"), pd.NaT],
                   "c": ["w", "g", "x"]})
Until v0.17, this didn't work:
df.query('b != b')
and you had to do:
df.query('b == "NaT"') # yes, surprisingly, this works!
Since v0.17 though, both methods work, although I would only recommend the first one.
For those interested: in my case I wanted to drop the NaT values contained in the DatetimeIndex of a dataframe. I could not directly use the notnull construction as suggested by Karl D.; you first have to create a temporary column out of the index, then apply the mask, and then delete the temporary column again.
df["TMP"] = df.index.values # index is a DateTimeIndex
df = df[df.TMP.notnull()] # remove all NaT values
df.drop(["TMP"], axis=1, inplace=True) # delete TMP again
I feel that the comment by @DSM is worth an answer of its own, because it answers the fundamental question.
The misunderstanding comes from the assumption that pd.NaT acts like None. However, while None == None returns True, pd.NaT == pd.NaT returns False: pandas' NaT behaves like a floating-point NaN, which is not equal to itself.
As the previous answers explain, you should use
df[df.b.isnull()] # or notnull(), respectively
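A quick demonstration of that point:
import pandas as pd
pd.NaT == pd.NaT   # False - NaT, like NaN, is not equal to itself
None == None       # True  - which is why the == None intuition breaks down
pd.isna(pd.NaT)    # True  - so test with isnull()/isna()/notnull() instead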
