Dropping column values that don't meet a requirement - python

I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first four characters are always digits, but it fails on my data set as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?

I think it would be better to just use to_datetime to convert to datetime dtype; you can then drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
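Applied back to the question, a minimal sketch (X and date_of_birth are the names from the question):
import pandas as pd

# Coerce garbage values to NaT, drop those rows, then take the year
X['date_of_birth'] = pd.to_datetime(X['date_of_birth'], errors='coerce')
X = X.dropna(subset=['date_of_birth'])
X['year'] = X['date_of_birth'].dt.year.astype(int)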

Related

Odd behaviour when applying `Pandas.Timedelta.total_seconds`

I have a pandas dataframe with a column that is of Timedelta type. I used groupby with a separate month column to create groups of these Timedeltas by month; I then tried to use the agg function along with min, max, mean on the Timedelta column, which triggered DataError: No numeric types to aggregate
As a solution I tried to use the total_seconds() function along with apply() to get a numeric representation of the column. However, the behaviour seems strange to me: the NaT values in my Timedelta column were turned into -9.223372e+09, whereas they result in NaN when total_seconds() is used on a scalar without apply()
A minimal example:
import numpy as np
import pandas as pd

test = pd.Series([np.datetime64('nat'), np.datetime64('nat')])
res = test.apply(pd.Timedelta.total_seconds)
print(res)
which produces:
0 -9.223372e+09
1 -9.223372e+09
dtype: float64
whereas:
res = test.iloc[0].total_seconds()
print(res)
yields:
nan
The behaviour of the second example is desired, as I wish to perform aggregations etc. and propagate missing/invalid values. Is this a bug?
You should use the .dt.total_seconds() method instead of applying the pd.Timedelta.total_seconds function onto a datetime64[ns] dtype column:
In [232]: test
Out[232]:
0 NaT
1 NaT
dtype: datetime64[ns] # <----
In [233]: pd.to_timedelta(test)
Out[233]:
0 NaT
1 NaT
dtype: timedelta64[ns] # <----
In [234]: pd.to_timedelta(test).dt.total_seconds()
Out[234]:
0 NaN
1 NaN
dtype: float64
Another demo:
In [228]: s = pd.Series(pd.to_timedelta(['03:33:33','1 day','aaa'], errors='coerce'))
In [229]: s
Out[229]:
0 0 days 03:33:33
1 1 days 00:00:00
2 NaT
dtype: timedelta64[ns]
In [230]: s.dt.total_seconds()
Out[230]:
0 12813.0
1 86400.0
2 NaN
dtype: float64
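Back in the original groupby setting, a sketch of how this avoids the DataError (the column names 'month' and 'delta' are assumptions, not from the question):
# assumed: 'delta' is the timedelta64[ns] column, 'month' the grouping column
df['secs'] = df['delta'].dt.total_seconds()  # NaT becomes NaN and is skipped by agg
print(df.groupby('month')['secs'].agg(['min', 'max', 'mean']))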

Setting values with pandas.DataFrame

Having this DataFrame:
import pandas
dates = pandas.date_range('2016-01-01', periods=5, freq='H')
s = pandas.Series([0, 1, 2, 3, 4], index=dates)
df = pandas.DataFrame([(1, 2, s, 8)], columns=['a', 'b', 'foo', 'bar'])
df.set_index(['a', 'b'], inplace=True)
df
I would like to replace the Series in there with a new one that is simply the old one, but resampled to a day period (i.e. x.resample('D').sum().dropna()).
When I try:
df['foo'][0] = df['foo'][0].resample('D').sum().dropna()
That seems to work well.
However, I get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
The question is, how should I do this instead?
Notes
Things I have tried but do not work (resampling or not, the assignment raises an exception):
df.iloc[0].loc['foo'] = df.iloc[0].loc['foo']
df.loc[(1, 2), 'foo'] = df.loc[(1, 2), 'foo']
df.loc[df.index[0], 'foo'] = df.loc[df.index[0], 'foo']
A bit more information about the data (in case it is relevant):
The real DataFrame has more columns in the multi-index. Not all of them necessarily integers, but more generally numerical and categorical. The index is unique (i.e.: there is only one row with a given index value).
The real DataFrame has, of course, many more rows in it (thousands).
There are not necessarily only two columns in the DataFrame, and there may be more than one column containing a Series type. Columns usually contain series, categorical data and numerical data as well. Any single column is always single-typed (either numerical, or categorical, or series).
The series contained in each cell usually have a variable length (i.e.: two series/cells in the DataFrame do not, unless pure coincidence, have the same length, and will probably never have the same index anyway, as dates vary as well between series).
Using Python 3.5.1 and Pandas 0.18.1.
This should work:
df.iat[0, df.columns.get_loc('foo')] = df['foo'][0].resample('D').sum().dropna()
Pandas is complaining about chained indexing, but when you don't do it that way it has problems assigning a whole Series to a cell. With iat you can force an assignment like that. I don't think it is a preferable thing to do, but it seems like a working solution.
Simply set df.is_copy = False before assigning the new value. (This applies to older pandas such as the 0.18 used here; is_copy was deprecated in later versions.)
Hierarchical data in pandas
It really seems like you should consider restructuring your data to take advantage of pandas features such as MultiIndexing and DatetimeIndex. This will allow you to still operate on the index in the typical way while being able to select on multiple columns across the hierarchical data (a, b, and bar).
Restructured Data
import pandas as pd

# Define Index
dates = pd.date_range('2016-01-01', periods=5, freq='H')
# Define Series
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Place Series in Hierarchical DataFrame
heirIndex = pd.MultiIndex.from_arrays([[1], [2], [8]], names=['a', 'b', 'bar'])
df = pd.DataFrame(s, columns=heirIndex)
print(df)
a 1
b 2
bar 8
2016-01-01 00:00:00 0
2016-01-01 01:00:00 1
2016-01-01 02:00:00 2
2016-01-01 03:00:00 3
2016-01-01 04:00:00 4
Resampling
With the data in this format, resampling becomes very simple.
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna()
print(df_resampled)
a 1
b 2
bar 8
2016-01-01 10
Update (from data description)
If the data has variable-length Series, each with a different index, and non-numeric categories, that is OK. Let's make an example:
# Define first Series
dates = pd.date_range('2016-01-01', periods=5, freq='H')
s = pd.Series([0, 1, 2, 3, 4], index=dates)
# Define second Series
dates2 = pd.date_range('2016-01-14', periods=6, freq='H')
s2 = pd.Series([-200, 10, 24, 30, 40, 100], index=dates2)
# Define DataFrames
df1 = pd.DataFrame(s, columns=pd.MultiIndex.from_arrays([[1], [2], [8], ['cat1']], names=['a', 'b', 'bar', 'c']))
df2 = pd.DataFrame(s2, columns=pd.MultiIndex.from_arrays([[2], [5], [5], ['cat3']], names=['a', 'b', 'bar', 'c']))
df = pd.concat([df1, df2])
print(df)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 00:00:00 0.0 NaN
2016-01-01 01:00:00 1.0 NaN
2016-01-01 02:00:00 2.0 NaN
2016-01-01 03:00:00 3.0 NaN
2016-01-01 04:00:00 4.0 NaN
2016-01-14 00:00:00 NaN -200.0
2016-01-14 01:00:00 NaN 10.0
2016-01-14 02:00:00 NaN 24.0
2016-01-14 03:00:00 NaN 30.0
2016-01-14 04:00:00 NaN 40.0
2016-01-14 05:00:00 NaN 100.0
The only issue is that, after resampling, you will want to use how='all' when dropping NaN rows, like this:
# Simple Direct Resampling
df_resampled = df.resample('D').sum().dropna(how='all')
print(df_resampled)
a 1 2
b 2 5
bar 8 5
c cat1 cat3
2016-01-01 10.0 NaN
2016-01-14 NaN 4.0
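With the hierarchy in place, selecting across levels is straightforward; for example, a sketch using xs on the df_resampled above:
# all columns where level 'a' equals 2
print(df_resampled.xs(2, axis=1, level='a'))
# all columns where level 'bar' equals 5
print(df_resampled.xs(5, axis=1, level='bar'))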

How to Bin/Count based on dates in Python

I have a pandas Series which contains datetime.date objects ranging from 1/2013 to 12/2015, the month a product was sold. What I would like to do is count and bin by month the number of products sold.
Is there an efficient way of doing this with pandas?
I recommend using datetime64, that is, first apply pd.to_datetime. If you set the result as the index, then you can use resample:
In [11]: s = pd.date_range('2015-01', '2015-03-02', freq='5D') # DatetimeIndex
In [12]: pd.Series(1, index=s).resample('M').count()
Out[12]:
2015-01-31 7
2015-02-28 5
2015-03-31 1
Freq: M, dtype: int64
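Concretely, for a Series of datetime.date objects like the one described, a sketch (the name sold is an assumption):
import pandas as pd

# assumed: `sold` is a Series of datetime.date objects (month of sale)
sold = pd.to_datetime(sold)                           # -> datetime64[ns]
counts = sold.groupby(sold.dt.to_period('M')).size()  # products sold per month
print(counts)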

How to make pandas read_csv distinguish strings based on quoting

I want pandas.io.parsers.read_csv to distinguish between strings and the rest of data types based on the fact that strings in my csv file are always "quoted". Is it possible?
I have the following csv example:
"ID"|"DATE"|"NAME"|"YEAR"|"FLOAT"|"BOOL"
"01"|2000-01-01|"Name1"|1975|1.2|1
"02"||""||||
It should give me a dataframe where all the quoted values are strings. Most likely pandas will make everything else np.float64, but I can deal with that afterwards. I want to hold off on using dtype, because I have many columns and I don't want to map types for all of them. I would like to try to make it purely "quote"-based, if possible.
I tried to use quotechar='"' and quoting=3, but quotechar doesn't do anything at all, while quoting keeps the "" quotes, which I don't want either. It seems to me pandas parsers should be able to do it, since this is the way to distinguish strings in csv files.
Specifying dtypes would be the more straightforward way, but if you don't want to do that I'd suggest using quoting=3 and cleaning up afterwards.
from io import StringIO  # here `s` holds the CSV text shown above
strip_char = lambda x: x.strip('"')
In [40]: df = pd.read_csv(StringIO(s), sep='|', quoting=3)
In [41]: df
Out[41]:
"ID" "DATE" "NAME" "YEAR" "FLOAT" "BOOL"
0 "01" 2000-01-01 "Name1" 1975 1.2 1
1 "02" NaN "" NaN NaN NaN
[2 rows x 6 columns]
In [42]: df = df.rename(columns=strip_char)
In [43]: df[['ID', 'NAME']] = df[['ID', 'NAME']].applymap(strip_char)
In [44]: df
Out[44]:
ID DATE NAME YEAR FLOAT BOOL
0 01 2000-01-01 Name1 1975 1.2 1
1 02 NaN NaN NaN NaN
[2 rows x 6 columns]
In [45]: df.dtypes
Out[45]:
ID object
DATE object
NAME object
YEAR float64
FLOAT float64
BOOL float64
dtype: object
EDIT: Then you can set the index:
In [11]: df = df.set_index('ID')
In [12]: df
Out[12]:
DATE NAME YEAR FLOAT BOOL
ID
01 2000-01-01 Name1 1975 1.2 1
02 NaN NaN NaN NaN
[2 rows x 5 columns]
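If there are many quoted columns, an alternative to listing them by hand in In [43] is to strip every object-dtype column before setting the index; a sketch:
# strip quotes from all string-typed columns at once
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].applymap(strip_char)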

Calculating date_range over GroupBy object in pandas

I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so max-min is 4 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps, we have simply subtracted them (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
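If a plain number is easier to work with than a timedelta, the result converts readily, e.g. to days (a sketch against the groupby above):
rng = g.apply(lambda x: x.index.max() - x.index.min())
print(rng / pd.Timedelta('1D'))  # range as fractional days
print(rng.dt.days)               # range as whole days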
