Setting values with pandas.DataFrame - python

Having this DataFrame:
import pandas
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)
df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well.
However, I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
The real DataFrame has more columns in the multi-index. Not all of them are necessarily integers; more generally they are numerical and categorical. The index is unique (i.e. there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data, and numerical data. Any single column is always single-typed (either numerical, or categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.

This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas complains about the chained indexing, but plain label-based assignment runs into problems when you assign a whole Series to a single cell. With iat you can force the assignment through; note that iat takes positional indices, which is why df.columns.get_loc('foo') is needed to translate the column label into a position. I don't think it is the preferable way to do it, but it seems like a working solution.
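For completeness, a quick check after the assignment (using the exact setup from the question):
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
print(df['foo'][0])
# 2016-01-01    10
# Freq: D, dtype: int64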

Simply set df.is_copy = False before assigning the new value.
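In full, for the pandas 0.18.1 from the question, that would look like the following (later pandas versions deprecate and eventually remove is_copy, so treat this as a legacy escape hatch):
# suppress the SettingWithCopyWarning on this frame (pre-1.0 pandas only)
df.is_copy = False
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()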

Hierarchical data in pandas
It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndexing and the DatetimeIndex. This will allow you to still operate on an index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).
Restructured Data
import pandas as pd
# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Place Series in Hierarchical DataFrame
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)
print(df)
a 1
b 2
bar 8
2016-01-01 00:00:00 0
2016-01-01 01:00:00 1
2016-01-01 02:00:00 2
2016-01-01 03:00:00 3
2016-01-01 04:00:00 4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()
print(df_resampled)
a 1
b 2
bar 8
2016-01-01 10
Update (from data description)
If the data has variable-length Series, each with a different index, and non-numeric categories, that is ok. Let's make an example:
# Define Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Define Series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))
df = pd.concat([df1, df2])
print(df)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 1.0 NaN
2016-01-01 02:00:00 2.0 NaN
2016-01-01 03:00:00 3.0 NaN
2016-01-01 04:00:00 4.0 NaN
2016-01-14 00:00:00 NaN -200.0
2016-01-14 01:00:00 NaN 10.0
2016-01-14 02:00:00 NaN 24.0
2016-01-14 03:00:00 NaN 30.0
2016-01-14 04:00:00 NaN 40.0
2016-01-14 05:00:00 NaN 100.0
The only issue is that, after resampling, you will want to pass how='all' when dropping NaN rows, like this:
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')
print(df_resampled)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 10.0 NaN
2016-01-14 NaN 4.0
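One nice consequence of this layout (the selection syntax below is my sketch, not from the original answer): an individual series can be pulled back out by indexing the column MultiIndex directly, or one level can be sliced with xs:
# recover the (a=1, b=2, bar=8, c='cat1') series, minus the padding NaNs
s_back = df[(1, 2, 8, 'cat1')].dropna()
# or select every column with a == 2 across the other levels
sub = df.xs(2, axis=1, level='a')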

Related

How to shift a column by 1 year in Python

With the pandas shift function, you are able to offset values by a number of rows. I'm looking to offset values by a specified time instead, which is 1 year in this case.
Here is my sample data frame. The value_py column is what I'm trying to reproduce with a shift function. This is an oversimplified example of my problem. How do I specify a date as the offset parameter instead of rows?
import pandas as pd
import numpy as np
test_df = pd.DataFrame({'dt': ['2020-01-01', '2020-08-01', '2021-01-01', '2022-01-01'],
                        'value': [10, 13, 15, 14]})
test_df['dt'] = pd.to_datetime(test_df['dt'])
test_df['value_py'] = [np.nan, np.nan, 10, 15]
I have tried this, but the index values get shifted by 1 year rather than the value column:
test_df.set_index('dt')['value'].shift(12, freq='MS')
This should solve your problem:
test_df['new_val'] = test_df['dt'].map(test_df.set_index('dt')['value'].shift(12, freq='MS'))
test_df
dt value value_py new_val
0 2020-01-01 10 NaN NaN
1 2020-08-01 13 NaN NaN
2 2021-01-01 15 10.0 10.0
3 2022-01-01 14 15.0 15.0
Use .map() to map the values at the shifted dates back onto the original dates.
Also, you should use 12 as your shift parameter, not -12.
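To see what .map() is looking up, it helps to inspect the shifted series on its own (the name shifted below is mine, not part of the answer):
shifted = test_df.set_index('dt')['value'].shift(12, freq='MS')
print(shifted)
# dt
# 2021-01-01    10
# 2021-08-01    13
# 2022-01-01    15
# 2023-01-01    14
# Name: value, dtype: int64
Each value is now keyed by its date plus twelve month-starts, so looking up an original date in this series returns the value recorded one year earlier.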

How to lag data by x specific days on a multi index pandas dataframe?

I have a DataFrame that has dates, assets, and then price/volume data. I'm trying to pull in data from 7 days ago, but the issue is that I can't use shift() because my table has missing dates in it.
date cusip price price_7daysago
1/1/2017 a 1
1/1/2017 b 2
1/2/2017 a 1.2
1/2/2017 b 2.3
1/8/2017 a 1.1 1
1/8/2017 b 2.2 2
I've tried creating a lambda function to try to use loc and timedelta to create this shifting, but I was only able to output empty numpy arrays:
def row_delta(x, df, days, colname):
    if datetime.strptime(x['recorddate'], '%Y%m%d') - timedelta(days) in [datetime.strptime(x, '%Y%m%d') for x in df['recorddate'].unique().tolist()]:
        return df.loc[(df['recorddate_date'] == df['recorddate_date'] - timedelta(days)) & (df['cusip'] == x['cusip']), colname]
    else:
        return 'nothing'
I also thought of doing something similar to this in order to fill in missing dates, but my issue is that I have multiple indexes, the dates and the cusips so I can't just reindex on this.
Merge the DataFrame with itself while adding 7 days to the date column of the right frame. Use the suffixes argument to name the columns appropriately.
import pandas as pd

df['date'] = pd.to_datetime(df.date)
df.merge(df.assign(date=df.date + pd.Timedelta(days=7)),
         on=['date', 'cusip'],
         how='left', suffixes=['', '_7daysago'])
Output:
date cusip price price_7daysago
0 2017-01-01 a 1.0 NaN
1 2017-01-01 b 2.0 NaN
2 2017-01-02 a 1.2 NaN
3 2017-01-02 b 2.3 NaN
4 2017-01-08 a 1.1 1.0
5 2017-01-08 b 2.2 2.0
You can set date and cusip as the index and use unstack and shift together. Pass freq="D" so the shift moves the dates by 7 calendar days rather than by 7 rows (remember the dates have gaps):
shifted = df.set_index(["date", "cusip"]).unstack().shift(7, freq="D").stack()
Then simply merge shifted with your original df, as sketched below.
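A sketch of that merge step (the variable names and the _7daysago label are my assumptions, not from the original answer; it also assumes df['date'] is already datetime, as in the answer above):
shifted = (df.set_index(['date', 'cusip'])
             .unstack()
             .shift(7, freq='D')   # move the date index forward 7 calendar days
             .stack()
             .rename(columns={'price': 'price_7daysago'})
             .reset_index())
out = df.merge(shifted, on=['date', 'cusip'], how='left')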

Rolling row filter/crosstab in pandas?

I have a highly sparse dataframe (only one non-zero value per row), indexed by non-regular Timestamps, for which I am trying to do the following.
For each non-zero value in a given column, I want to count the number of other non-zero values in other columns within a given timedelta. In a way, I am trying to compute something similar to a rolling crosstab.
My solution so far is ugly and slow, as I haven't figured out how to do this using slicing and rolling. It looks something like:
delta = 1
values = pd.DataFrame(0, index=df.columns, columns=df.columns)
for j in df.columns:
    for i in range(len(df[df[j] != 0].index) - 1):
        # min is used to avoid overlapping
        values[j] += df[(df.index < min((df[df[j] != 0].index + pd.tseries.timedeltas.to_timedelta(delta, unit='h'))[i],
                                        df[df[j] != 0].index[i + 1]))
                        & (df.index >= df[df[j] != 0].index[i])].astype(bool).sum()
values = values.T
and a toy-example dataframe is:
df = pd.DataFrame.from_dict({"2016-01-01 10:00.00": [0, 1],
                             "2016-01-01 10:30.00": [1, 0],
                             "2016-01-01 12:00.00": [0, 1],
                             "2016-01-01 14:00.00": [1, 0]},
                            orient="index")
df.columns = ['a', 'b']
df.index = pd.to_datetime(df.index)
a b
2016-01-01 10:00:00 0 1
2016-01-01 10:30:00 1 0
2016-01-01 12:00:00 0 1
2016-01-01 14:00:00 1 0
The desired output should look like (with the counts depending on the timedelta):
a b
a 1 0
b 1 1
Hard to tell exactly what you want, but it sounded kind of like this.
I want to use a new feature of pandas 0.19: time-aware rolling. In order to use it, we need a sorted index.
d1 = df.sort_index()
Now, let's assume we want to count within plus or minus one hour. Let's start by adding two hours to every element of the index
d1.index = d1.index + pd.offsets.Hour(2)
Then we'll roll through, looking back four hours. This will be like looking forward two hours and backwards two hours relative to the original indices.
d2 = d1.rolling('4H').sum()
d2.index = d2.index - pd.offsets.Hour(2)
d2
a b
2016-01-01 10:00:00 0.0 1.0
2016-01-01 10:30:00 1.0 1.0
2016-01-01 12:00:00 1.0 2.0
2016-01-01 14:00:00 2.0 1.0
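If you then want the column-by-column table from the question, one possibility (my own sketch, not part of this answer, and it uses the plus-or-minus two hour window above rather than the question's non-overlapping one hour window) is to total the windowed counts at the timestamps where each column fires:
d0 = df.sort_index()  # the original, unshifted frame
# for each column, sum the windowed counts over the rows where that column
# is non-zero; note the diagonal includes each event counting itself
counts = pd.DataFrame({col: d2[d0[col] != 0].sum() for col in df.columns}).T
print(counts)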

Dropping column values that don't meet a requirement

I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 characters are always digits, but it fails on my data set, as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to datetime dtype; you can drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
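If you need plain integer years rather than the float result above (the NaN entries force the float dtype), one hedged follow-up is to drop the NaT rows first and then cast:
years = pd.to_datetime(df['date'], errors='coerce').dt.year.dropna().astype(int)
print(years)
# 0    1977
# 3    2016
# Name: date, dtype: int64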

reindex to add missing dates to pandas dataframe

I am trying to parse a CSV file which looks like this:
dd.mm.yyyy value
01.01.2000 1
02.01.2000 2
01.02.2000 3
I need to add the missing dates and fill the corresponding values with NaN. I used Series.reindex as in this question:
import pandas as pd
ts=pd.read_csv(file, sep=';', parse_dates='True', index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in the result, values for certain dates are swapped because of the date format (i.e. mm/dd instead of dd/mm):
01.01.2000 1
02.01.2000 3
03.01.2000 NaN
...
...
31.01.2000 NaN
01.02.2000 2
I tried several ways (e.g. adding dayfirst=True to read_csv) to do it right but still can't figure it out. Please help.
Set parse_dates to the first column with parse_dates=[0]:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx)  # missing dates are filled with NaN by default; fill_value='NaN' would insert the string 'NaN'
print(ts)
prints:
value
2000-01-01 1
2000-01-02 2
2000-01-03 NaN
...
2000-01-31 NaN
2000-02-01 3
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
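As a self-contained check of the whole pipeline (io.StringIO stands in for the original file here, which is my assumption):
import io
import pandas as pd

csv = "date;value\n01.01.2000;1\n02.01.2000;2\n01.02.2000;3\n"
ts = pd.read_csv(io.StringIO(csv), sep=';', parse_dates=[0], index_col=0, dayfirst=True)
ts = ts.reindex(pd.date_range('2000-01-01', '2000-02-01'))
print(ts.head(3))
#             value
# 2000-01-01    1.0
# 2000-01-02    2.0
# 2000-01-03    NaN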
