reindex to add missing dates to pandas dataframe - python

I am trying to parse a CSV file that looks like this:
dd.mm.yyyy value
01.01.2000 1
02.01.2000 2
01.02.2000 3
I need to add the missing dates and fill the corresponding values with NaN. I used Series.reindex as in this question:
import pandas as pd
ts = pd.read_csv(file, sep=';', parse_dates='True', index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in the result, values for certain dates are swapped because of the date format (mm/dd was assumed instead of dd/mm):
01.01.2000 1
02.01.2000 3
03.01.2000 NaN
...
...
31.01.2000 NaN
01.02.2000 2
I have tried several things (e.g. adding dayfirst=True to read_csv) but still can't get it right. Please help.

Set parse_dates to the first column with parse_dates=[0]:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)  # missing dates are filled with NaN by default
print(ts)
prints:
value
2000-01-01 1
2000-01-02 2
2000-01-03 NaN
...
2000-01-31 NaN
2000-02-01 3
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
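For a self-contained illustration, here is a minimal sketch with the file contents inlined via io.StringIO (the ';' separator and the column names are assumptions based on the read_csv call above):
import io
import pandas as pd

# Sample data in dd.mm.yyyy format, ';'-separated as in the question.
data = io.StringIO("01.01.2000;1\n02.01.2000;2\n01.02.2000;3")

# parse_dates=[0] parses the first column as dates; dayfirst=True makes
# 01.02.2000 the 1st of February rather than January 2nd.
ts = pd.read_csv(data, sep=';', header=None, names=['date', 'value'],
                 parse_dates=[0], index_col=0, dayfirst=True)

# Reindex against the full daily range; missing dates get NaN.
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)
print(ts.head())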

Bug/Feature for pandas where a multi-indexed dataframe filtered by date returns all the unfiltered dates when extracting the date index level

This is easiest to explain with code, so here goes; imagine the commands in an IPython/Jupyter notebook:
from io import StringIO
import pandas as pd
test = StringIO("""Date,Ticker,x,y
2008-10-23,A,0,10
2008-10-23,B,1,11
2008-10-24,A,2,12
2008-10-24,B,3,13
2008-10-25,A,4,14
2008-10-25,B,5,15
2008-10-26,A,6,16
2008-10-26,B,7,17
""")
# Multi-index by Date and Ticker
df = pd.read_csv(test, index_col=[0, 1], parse_dates=True)
df
# Output to the command line
x y
Date Ticker
2008-10-23 A 0 10
B 1 11
2008-10-24 A 2 12
B 3 13
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
ts = pd.Timestamp(2008, 10, 25)
# Filter the data by Date >= ts
filtered_df = df.loc[ts:]
# output the filtered data
filtered_df
x y
Date Ticker
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
# Get all the level 0 data (i.e. the dates) in the filtered dataframe
dates = filtered_df.index.levels[0]
# output the dates in the filtered dataframe:
dates
DatetimeIndex(['2008-10-23', '2008-10-24', '2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
# WTF!!!??? This was ALL of the dates in the original dataframe - I asked for the dates in the filtered dataframe!
# The correct output should have been:
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
So clearly, when you filter a multi-indexed dataframe, the filtered dataframe's index retains all of the index values of the original, even though only the used ones show when you view the whole dataframe. When you then look at the data by index level, there appears to be a bug (feature, somehow?) where the entire index, including the unused entries, is used for the operation I performed above to extract the dates.
This is actually explained in the MultiIndex's User Guide (emphasis added):
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ... This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In your case:
>>> filtered_df.index.get_level_values(0)
DatetimeIndex(['2008-10-25', '2008-10-25', '2008-10-26', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
Which is what you expected.
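If you want the index itself to forget the unused dates, more recent pandas (0.20+) also offers MultiIndex.remove_unused_levels(), which rebuilds the levels; a small sketch:
# Rebuild the levels so only the used dates remain (pandas >= 0.20).
trimmed = filtered_df.index.remove_unused_levels()
trimmed.levels[0]
# DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
Alternatively, filtered_df.index.get_level_values(0).unique() gives just the distinct used dates.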

Setting values with pandas.DataFrame

Having this DataFrame:
import pandas
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)
df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well. However, I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data, and numerical data. Any single column is always single-typed (either numerical, categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.
This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas is complaining about chained indexing, but when you avoid chaining it has trouble assigning a whole Series to a single cell. With iat you can force the assignment. I don't think it is the preferable way to do this, but it seems to be a working solution.
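For completeness, a minimal end-to-end sketch of the iat approach on the setup from the question (the exact behavior of assigning a Series into a cell can vary between pandas versions):
import pandas as pd

dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
df = pd.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)

# Address the cell by integer position, then assign the resampled Series.
col = df.columns.get_loc('foo')
df.iat[0, col] = df['foo'].iloc[0].resample('D').sum().dropna()
print(df['foo'].iloc[0])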
Alternatively, simply set df.is_copy = False before assigning the new value (note that is_copy has since been deprecated in newer pandas versions).
Hierarchical data in pandas
It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndex and DatetimeIndex. That would let you operate on the index in the usual way while still selecting across the hierarchical columns (a, b, and bar).
Restructured Data
import pandas as pd
# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Place Series in Hierarchical DataFrame
hierIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=hierIndex)
print(df)
a 1
b 2
bar 8
2016-01-01 00:00:00 0
2016-01-01 01:00:00 1
2016-01-01 02:00:00 2
2016-01-01 03:00:00 3
2016-01-01 04:00:00 4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()
print(df_resampled)
a 1
b 2
bar 8
2016-01-01 10
Update (from data description)
If the data has variable-length Series, each with a different index, and non-numeric categories, that is fine. Let's make an example:
# Define Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Define another Series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))
df = pd.concat([df1, df2])
print(df)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 1.0 NaN
2016-01-01 02:00:00 2.0 NaN
2016-01-01 03:00:00 3.0 NaN
2016-01-01 04:00:00 4.0 NaN
2016-01-14 00:00:00 NaN -200.0
2016-01-14 01:00:00 NaN 10.0
2016-01-14 02:00:00 NaN 24.0
2016-01-14 03:00:00 NaN 30.0
2016-01-14 04:00:00 NaN 40.0
2016-01-14 05:00:00 NaN 100.0
The only issue is that, after resampling, you will want to pass how='all' when dropping NaN rows, like this:
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')
print(df_resampled)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 10.0 NaN
2016-01-14 NaN 4.0
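Once the data is in this shape, selecting across the hierarchical columns is easy too, for example with xs (a small sketch on the df_resampled built above):
# All columns where level 'a' equals 2 (that level is dropped from the result).
df_resampled.xs(2, axis=1, level='a')

# All columns where level 'bar' equals 5.
df_resampled.xs(5, axis=1, level='bar')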

pandas - read_csv with missing values in headline

I have this kind of CSV file:
date,a,b,c
2014,12,29,7,12,45
2014,12,30,7,13,12
2014,12,31,6.5,6,5
So the first row does not explicitly specify all columns, and kind of assumes that you understand that the date is the first 3 columns.
How do I tell read_csv to consider the first three columns as one date column (while keeping the other labels)?
You can parse the columns directly as a date if you use the parse_dates argument.
From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict, default False
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column. {'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'. A fast-path exists for iso8601-formatted dates.
For your file, you can do something like this:
pd.read_csv(file_path, names=['y', 'm', 'd', 'a', 'b', 'c'], header=0,
            parse_dates={'date': [0, 1, 2]}, index_col='date')
a b c
date
2014-12-29 7.0 12 45
2014-12-30 7.0 13 12
2014-12-31 6.5 6 5
The problem of the missing values in the header line is solved by passing the names argument together with header=0 (to overwrite the existing header). It is then possible to specify which columns should be combined and parsed as a date.
See another example here.
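As a self-contained sketch, with the sample from the question inlined via io.StringIO in place of a file path:
import io
import pandas as pd

csv = io.StringIO("""date,a,b,c
2014,12,29,7,12,45
2014,12,30,7,13,12
2014,12,31,6.5,6,5
""")

# names= supplies all six column names and header=0 discards the incomplete
# header row; parse_dates={'date': [0, 1, 2]} combines y/m/d into one date.
df = pd.read_csv(csv, names=['y', 'm', 'd', 'a', 'b', 'c'], header=0,
                 parse_dates={'date': [0, 1, 2]}, index_col='date')
print(df)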

Slice by date in pandas without re-indexing

I have a pandas dataframe where one of the columns is made up of strings representing dates, which I then convert to python timestamps by using pd.to_datetime().
How can I select the rows in my dataframe that meet conditions on the date?
I know you can use the index (like in this question) but my timestamps are not unique.
How can I select the rows where the 'Date' field is say, after 2015-03-01?
You can use a mask on the date, e.g.
df[df['date'] > '2015-03-01']
Here is a full example:
>>> import numpy as np
>>> df = pd.DataFrame({'date': pd.date_range('2015-02-15', periods=5, freq='W'),
...                    'val': np.random.random(5)})
>>> df
date val
0 2015-02-15 0.638522
1 2015-02-22 0.942384
2 2015-03-01 0.133111
3 2015-03-08 0.694020
4 2015-03-15 0.273877
>>> df[df.date > '2015-03-01']
date val
3 2015-03-08 0.694020
4 2015-03-15 0.273877
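The comparison works just as well against a real timestamp, and a date range can be expressed with Series.between; a short sketch on the same df:
# Compare against a Timestamp instead of a string.
df[df['date'] > pd.Timestamp('2015-03-01')]

# Rows within an inclusive date range.
df[df['date'].between('2015-03-01', '2015-03-15')]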

How to make pandas read_csv distinguish strings based on quoting

I want pandas.io.parsers.read_csv to distinguish between strings and the rest of data types based on the fact that strings in my csv file are always "quoted". Is it possible?
I have the following csv example:
"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||
It should give me a dataframe in which all the quoted values are strings. Most likely pandas will make everything else np.float64, but I can deal with that afterwards. I would rather not use dtype, because I have many columns and don't want to map types for all of them. I would like to try making it purely quote-based, if possible.
I tried using quotechar='"' and quoting=3, but quotechar doesn't seem to do anything at all, while quoting keeps the "" values, which I don't want either. It seems to me the pandas parsers should be able to do this, since quoting is the way strings are distinguished in CSV files.
Specifying dtypes would be the more straightforward way, but if you don't want to do that I'd suggest using quoting=3 and cleaning up afterwards.
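(For reference, s in the session below is assumed to hold the CSV text from the question; a small setup sketch:)
from io import StringIO

s = '''"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||'''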
strip_char = lambda x: x.strip('"')
In [40]: df = pd.read_csv(StringIO(s), sep='|', quoting=3)
In [41]: df
Out[41]:
"ID" "DATE" "NAME" "YEAR" "FLOAT" "BOOL"
0 "01" 2000-01-01 "Name1" 1975 1.2 1
1 "02" NaN "" NaN NaN NaN
[2 rows x 6 columns]
In [42]: df = df.rename(columns=strip_char)
In [43]: df[['ID', 'NAME']] = df[['ID', 'NAME']].applymap(strip_char)
In [44]: df
Out[44]:
   ID        DATE   NAME  YEAR  FLOAT  BOOL
0  01  2000-01-01  Name1  1975    1.2     1
1  02         NaN          NaN    NaN   NaN
[2 rows x 6 columns]
In [45]: df.dtypes
Out[45]:
ID object
DATE object
NAME object
YEAR float64
FLOAT float64
BOOL float64
dtype: object
EDIT: Then you can set the index:
In [11]: df = df.set_index('ID')
In [12]: df
Out[12]:
          DATE   NAME  YEAR  FLOAT  BOOL
ID
01  2000-01-01  Name1  1975    1.2     1
02         NaN          NaN    NaN   NaN
[2 rows x 5 columns]
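And for completeness, the dtype route mentioned at the start, declaring the string columns explicitly and letting read_csv strip the quotes with its default quoting, might look like this sketch:
# With default quoting, read_csv strips the quotes itself; dtype pins the
# string columns so values like "01" keep their leading zero.
df = pd.read_csv(StringIO(s), sep='|', dtype={'ID': str, 'NAME': str})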
