I have a pandas dataframe where one of the columns holds strings representing dates, which I then convert to pandas Timestamps using pd.to_datetime().
How can I select the rows in my dataframe that meet a condition on the date?
I know you can use the index (like in this question) but my timestamps are not unique.
How can I select the rows where the 'Date' field is, say, after 2015-03-01?
You can use a mask on the date, e.g.
df[df['date'] > '2015-03-01']
Here is a full example:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'date': pd.date_range('2015-02-15', periods=5, freq='W'),
...                    'val': np.random.random(5)})
>>> df
date val
0 2015-02-15 0.638522
1 2015-02-22 0.942384
2 2015-03-01 0.133111
3 2015-03-08 0.694020
4 2015-03-15 0.273877
>>> df[df.date > '2015-03-01']
date val
3 2015-03-08 0.694020
4 2015-03-15 0.273877
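If you need both bounds, combine two boolean masks; a minimal sketch (this assumes the 'date' column is already datetime64, so comparing against a date string is well-defined):
# Rows strictly after 2015-03-01 and up to 2015-03-15, inclusive:
mask = (df['date'] > '2015-03-01') & (df['date'] <= '2015-03-15')
df.loc[mask]
# The same lower bound with an explicit Timestamp instead of a string:
df[df['date'] > pd.Timestamp('2015-03-01')]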
I am trying to pivot the dataframe given below. It has a time column in hh:mm:ss format, and I want to pivot the dataframe using an aggfunc on that column.
import pandas as pd
data = {'Type':['A', 'B', 'C', 'C'],'Name':['ab', 'ef','gh', 'ij'],'Time':['02:00:00', '03:02:00', '04:00:30','01:02:20']}
df = pd.DataFrame(data)
print (df)
pivot = (
    df.pivot_table(index=['Type'], values=['Time'], aggfunc='sum')
)
  Type Name      Time
0    A   ab  02:00:00
1    B   ef  03:02:00
2    C   gh  04:00:30
3    C   ij  01:02:20
The pivot output is:
                  Time
Type
A             02:00:00
B             03:02:00
C     04:00:3001:02:20
I want the C row to be the sum of the two times: 05:02:50.
This looks more like groupby sum than pivot_table.
Convert with to_timedelta to get the appropriate dtype for durations (this makes mathematical operations behave as expected).
Then groupby Type and sum Time to get the total duration per Type.
# Convert to TimeDelta (appropriate dtype)
df['Time'] = pd.to_timedelta(df['Time'])
new_df = df.groupby('Type')['Time'].sum().reset_index()
new_df:
Type Time
0 A 0 days 02:00:00
1 B 0 days 03:02:00
2 C 0 days 05:02:50
Optional convert back to string:
new_df['Time'] = new_df['Time'].dt.to_pytimedelta().astype(str)
new_df:
Type Time
0 A 2:00:00
1 B 3:02:00
2 C 5:02:50
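One caveat: str() on a timedelta switches to a "1 day, 2:00:00" style once a total passes 24 hours. If you always want plain HH:MM:SS, you could format the timedelta column yourself instead of the astype(str) line above (fmt_hms is a hypothetical helper, not a pandas function):
# Hypothetical helper: render a Timedelta as HH:MM:SS, even past 24 hours.
def fmt_hms(td):
    total = int(td.total_seconds())
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

new_df['Time'] = new_df['Time'].apply(fmt_hms)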
I have the dataframe df1 with the columns type, Date and amount.
My goal is to create a DataFrame df2 containing a subset of the dates from df1, with one column per type holding that type's amount for each date.
Input Dataframe:
df1 =
,type,Date,amount
0,42,2017-02-01,4
1,42,2017-02-02,5
2,42,2017-02-03,7
3,42,2017-02-04,2
4,48,2017-02-01,6
5,48,2017-02-02,8
6,48,2017-02-03,3
7,48,2017-02-04,6
8,46,2017-02-01,3
9,46,2017-02-02,8
10,46,2017-02-03,3
11,46,2017-02-04,4
Desired Output, if the subset of Dates are 2017-02-02 and 2017-02-04:
df2 =
,Date,42,48,46
0,2017-02-02,5,8,8
1,2017-02-04,2,6,4
I tried it like this:
types = list(df1["type"].unique())
dates = ["2017-02-02","2017-02-04"]
df2 = pd.DataFrame()
df2["Date"]=dates
for t in types:
    df2[t] = df1[(df1["type"]==t)&(df1[df1["type"]==t][["Date"]]==df2["Date"])][["amount"]]
but with this solution I get a lot of NaNs, it seems my comparison condition is wrong.
This is the output I get:
,Date,42,48,46
0,2017-02-02,,,
1,2017-02-04,,,
You can use .pivot_table() and then filter data:
df2 = df1.pivot_table(
index="Date", columns="type", values="amount", aggfunc="sum"
)
dates = ["2017-02-02", "2017-02-04"]
print(df2.loc[dates].reset_index())
Prints:
type Date 42 46 48
0 2017-02-02 5 8 8
1 2017-02-04 2 4 6
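If you also want the columns back in the original type order and without the "type" label over them, here is a small optional cleanup sketch (order is just the unique types as they appear in df1):
order = df1["type"].unique()              # array([42, 48, 46])
out = df2.loc[dates, order].reset_index()
out.columns.name = None                   # drop the "type" axis label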
This is easiest to explain with code, so here goes; imagine the commands being run in an IPython/Jupyter notebook:
from io import StringIO
import pandas as pd
test = StringIO("""Date,Ticker,x,y
2008-10-23,A,0,10
2008-10-23,B,1,11
2008-10-24,A,2,12
2008-10-24,B,3,13
2008-10-25,A,4,14
2008-10-25,B,5,15
2008-10-26,A,6,16
2008-10-26,B,7,17
""")
# Multi-index by Date and Ticker
df = pd.read_csv(test, index_col=[0, 1], parse_dates=True)
df
# Output to the command line
x y
Date Ticker
2008-10-23 A 0 10
B 1 11
2008-10-24 A 2 12
B 3 13
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
ts = pd.Timestamp(2008, 10, 25)
# Filter the data by Date >= ts
filtered_df = df.loc[ts:]
# output the filtered data
filtered_df
x y
Date Ticker
2008-10-25 A 4 14
B 5 15
2008-10-26 A 6 16
B 7 17
# Get all the level 0 data (i.e. the dates) in the filtered dataframe
dates = filtered_df.index.levels[0]
# output the dates in the filtered dataframe:
dates
DatetimeIndex(['2008-10-23', '2008-10-24', '2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
# WTF!!!??? This was ALL of the dates in the original dataframe - I asked for the dates in the filtered dataframe!
# The correct output should have been:
DatetimeIndex(['2008-10-25', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
So clearly, when one filters a multi-indexed dataframe, the filtered dataframe's index retains all of the level values of the original, and only the surviving rows show when viewing the dataframe. But when inspecting the index level by level, the entire set of values, including the unused ones, is used, which looks like a bug (a feature somehow?) in the operation I performed above to extract the dates.
This is actually explained in the MultiIndex's User Guide (emphasis added):
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. ... This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.
In your case:
>>> filtered_df.index.get_level_values(0)
DatetimeIndex(['2008-10-25', '2008-10-25', '2008-10-26', '2008-10-26'], dtype='datetime64[ns]', name='Date', freq=None)
Which is what you expected.
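Note that get_level_values returns one entry per row, so dates repeat. If you want each remaining date exactly once, deduplicate the values or rebuild the levels:
# One entry per distinct date:
filtered_df.index.get_level_values(0).unique()
# Or drop the unused level entries altogether:
filtered_df.index.remove_unused_levels().levels[0]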
Given the following example DataFrame:
>>> df
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
3 05/11/2017 08:25:20 4
4 05/11/2017 08:30:14 5
5 05/11/2017 08:30:35 6
I want to subset this DataFrame by the 'Times' column, matching a partial string up to the hour. For example, I want to subset using the partial strings "05/10/2017 01:" and "05/11/2017 08:", which breaks the data up into two new data frames:
>>> df1
Times Values
0 05/10/2017 01:01:03 1
1 05/10/2017 01:05:00 2
2 05/10/2017 01:06:10 3
and
>>> df2
0 05/11/2017 08:25:20 4
1 05/11/2017 08:30:14 5
2 05/11/2017 08:30:35 6
Is it possible to make this subset iterative in Pandas, for multiple dates/times that similarly have the date/hour as the common identifier?
First, cast your Times column into a datetime format, and set it as the index:
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace = True)
Then use the groupby method with a pd.Grouper (pd.TimeGrouper in old pandas versions; it was removed in pandas 1.0):
g = df.groupby(pd.Grouper(freq='h'))
Iterating over g yields (timestamp, sub-dataframe) pairs. If you just want the sub-dfs, you can do [sub for _, sub in g].
A caveat: the sub-dfs are indexed by the timestamp, and pd.Grouper as used here needs the times in the index (it also accepts a key= argument to group on a column instead). If you want to keep the timestamp as a column, you could instead do:
df['Times'] = pd.to_datetime(df['Times'])
df['time_hour'] = df['Times'].dt.floor('1h')
g = df.groupby('time_hour')
Alternatively, you could just call .reset_index() on each of the dfs from the former method, but this will probably be much slower.
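For example, to materialize the hourly sub-frames from either grouping (a small usage sketch; the unpack works here because there are exactly two distinct hours):
# Each iteration yields (hour, sub-DataFrame); collect them in time order.
sub_dfs = [sub for _, sub in g]
df1, df2 = sub_dfs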
Convert Times to an hourly period, groupby, and then extract each group as a DataFrame.
df1, df2 = [g.drop(columns='hour') for n, g in
            df.assign(hour=pd.DatetimeIndex(df.Times).to_period('h'))
              .groupby('hour')]
df1
Out[874]:
Times Values
0 2017-05-10 01:01:03 1
1 2017-05-10 01:05:00 2
2 2017-05-10 01:06:10 3
df2
Out[875]:
Times Values
3 2017-05-11 08:25:20 4
4 2017-05-11 08:30:14 5
5 2017-05-11 08:30:35 6
First, make sure that the Times column is of datetime type.
Second, set the Times column as the index.
Third, use the between_time method.
df['Times'] = pd.to_datetime(df['Times'])
df.set_index('Times', inplace=True)
df1 = df.between_time('1:00:00', '1:59:59')
df2 = df.between_time('8:00:00', '8:59:59')
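In newer pandas (1.4+), the same half-open hour can be written with the inclusive keyword instead of a :59:59 endpoint:
# Keep the start of each window, exclude the end.
df1 = df.between_time('01:00', '02:00', inclusive='left')
df2 = df.between_time('08:00', '09:00', inclusive='left')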
If you use the datetime type you can extract things like hours and days.
times = pd.to_datetime(df['Times'])
hours = times.dt.hour
df1 = df[hours == 1]
You can use the str[] accessor to truncate the string representation of your date (you might have to cast with astype(str) if your column is a datetime), and then use groupby.groups to access the group indices as a dictionary whose keys are your truncated date values:
>>> df.groupby(df.Times.astype(str).str[0:13]).groups
{'2017-05-10 01': DatetimeIndex(['2017-05-10 01:01:03', '2017-05-10 01:05:00',
'2017-05-10 01:06:10'],
dtype='datetime64[ns]', name='time', freq=None),
'2017-05-11 08': DatetimeIndex(['2017-05-11 08:25:20', '2017-05-11 08:30:14',
'2017-05-11 08:30:35'],
dtype='datetime64[ns]', name='time', freq=None)}
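To pull an actual sub-DataFrame rather than its row index, get_group works on the same grouping (a short usage sketch, assuming the column is a datetime so its string form is ISO, as in the output above):
g = df.groupby(df.Times.astype(str).str[0:13])
df1 = g.get_group('2017-05-10 01')
df2 = g.get_group('2017-05-11 08')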
I try to parse a CSV file which looks like this:
dd.mm.yyyy value
01.01.2000 1
02.01.2000 2
01.02.2000 3
I need to add missing dates and fill according values with NaN. I used Series.reindex like in this question:
import pandas as pd
ts=pd.read_csv(file, sep=';', parse_dates='True', index_col=0)
idx = pd.date_range('01.01.2000', '02.01.2000')
ts.index = pd.DatetimeIndex(ts.index)
ts = ts.reindex(idx, fill_value='NaN')
But in result, values for certain dates are swapped due to date format (i.e. mm/dd instead of dd/mm):
01.01.2000 1
02.01.2000 3
03.01.2000 NaN
...
...
31.01.2000 NaN
01.02.2000 2
I tried several things (e.g. adding dayfirst=True to read_csv) to get it right but still can't figure it out. Please help.
Set parse_dates to the first column with parse_dates=[0] and pass dayfirst=True:
ts = pd.read_csv(file, sep=';', parse_dates=[0], index_col=0, dayfirst=True)
# parse_dates=[0] already yields a DatetimeIndex, so no extra cast is needed
idx = pd.date_range('2000-01-01', '2000-02-01')
ts = ts.reindex(idx)  # missing dates are filled with NaN by default
print(ts)
print(ts)
prints:
value
2000-01-01 1
2000-01-02 2
2000-01-03 NaN
...
2000-01-31 NaN
2000-02-01 3
parse_dates=[0] tells pandas to explicitly parse the first column as dates. From the docs:
parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index.
If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
{'foo' : [1, 3]} -> parse columns 1, 3 as date and call result 'foo'
A fast-path exists for iso8601-formatted dates.
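As a quick check of what dayfirst changes, parse one ambiguous dotted date both ways:
# '02.01.2000' is Jan 2 with dayfirst=True, Feb 1 without it.
pd.to_datetime('02.01.2000', dayfirst=True)   # Timestamp('2000-01-02 00:00:00')
pd.to_datetime('02.01.2000', dayfirst=False)  # Timestamp('2000-02-01 00:00:00')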