I'm working with a large data frame and I'm struggling to find an efficient way to eliminate specific dates. Note that I'm trying to eliminate any measurements from a specific date.
Pandas has this great function, where you can call:
df.ix['2016-04-22']
and pull all rows from that day. But what if I want to eliminate all rows from '2016-04-22'?
I want a function like this:
df.ix[~'2016-04-22']
(but that doesn't work)
Also, what if I want to eliminate a list of dates?
Right now, I have the following solution:
import numpy as np
import pandas as pd
from numpy import random
###Create a sample data frame
dates = [pd.Timestamp('2016-04-25 06:48:33'), pd.Timestamp('2016-04-27 15:33:23'), pd.Timestamp('2016-04-23 11:23:41'), pd.Timestamp('2016-04-28 12:08:20'), pd.Timestamp('2016-04-21 15:03:49'), pd.Timestamp('2016-04-23 08:13:42'), pd.Timestamp('2016-04-27 21:18:22'), pd.Timestamp('2016-04-27 18:08:23'), pd.Timestamp('2016-04-27 20:48:22'), pd.Timestamp('2016-04-23 14:08:41'), pd.Timestamp('2016-04-27 02:53:26'), pd.Timestamp('2016-04-25 21:48:31'), pd.Timestamp('2016-04-22 12:13:47'), pd.Timestamp('2016-04-27 01:58:26'), pd.Timestamp('2016-04-24 11:48:37'), pd.Timestamp('2016-04-22 08:38:46'), pd.Timestamp('2016-04-26 13:58:28'), pd.Timestamp('2016-04-24 15:23:36'), pd.Timestamp('2016-04-22 07:53:46'), pd.Timestamp('2016-04-27 23:13:22')]
values = random.normal(20, 20, 20)
df = pd.DataFrame(index=dates, data=values, columns ['values']).sort_index()
### This is the list of dates I want to remove
removelist = ['2016-04-22', '2016-04-24']
This for loop basically grabs the index for the dates I want to remove, then eliminates it from the index of the main dataframe, then positively selects the remaining dates (ie: the good dates) from the dataframe.
for r in removelist:
elimlist = df.ix[r].index.tolist()
ind = df.index.tolist()
culind = [i for i in ind if i not in elimlist]
df = df.ix[culind]
Is there anything better out there?
I've also tried indexing by the rounded date+1 day, so something like this:
df[~((df['Timestamp'] < r+pd.Timedelta("1 day")) & (df['Timestamp'] > r))]
But this gets really cumbersome and (at the end of the day) I'll still be using a for loop when I need to eliminate n specific dates.
There's got to be a better way! Right? Maybe?
You can create a boolean mask using a list comprehension.
>>> df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
values
2016-04-21 15:03:49 28.059520
2016-04-23 08:13:42 -22.376577
2016-04-23 11:23:41 40.350252
2016-04-23 14:08:41 14.557856
2016-04-25 06:48:33 -0.271976
2016-04-25 21:48:31 20.156240
2016-04-26 13:58:28 -3.225795
2016-04-27 01:58:26 51.991293
2016-04-27 02:53:26 -0.867753
2016-04-27 15:33:23 31.585201
2016-04-27 18:08:23 11.639641
2016-04-27 20:48:22 42.968156
2016-04-27 21:18:22 27.335995
2016-04-27 23:13:22 13.120088
2016-04-28 12:08:20 53.730511
Same idea as #Alexander, but using properties of the DatetimeIndex and numpy.in1d:
mask = ~np.in1d(df.index.date, pd.to_datetime(removelist).date)
df = df.loc[mask, :]
Timings:
%timeit df.loc[~np.in1d(df.index.date, pd.to_datetime(removelist).date), :]
1000 loops, best of 3: 1.42 ms per loop
%timeit df[[d.date() not in pd.to_datetime(removelist) for d in df.index]]
100 loops, best of 3: 3.25 ms per loop
Related
I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that "action" can only take place during business hours (which we assume to start 8:30am).
I would like to create another column named created_time_adjusted, which adjusts the created_time to 08:30am if the created_time is before 08:30am).
I can parse out the date and time string that I need, as follows:
df['elapsed_time'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you convert the columns to datetime dtypes from the go. Then, its just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time-df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted']=df.created_time.apply(lambda x:
x.replace(hour=8,minute=30,second=0)
if x.time()<time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
df['created_time']=pd.to_datetime(df['created_time'])#Coerce to datetime
df1=df.set_index(df['created_time']).between_time('00:00:00', '08:30:00', include_end=False)#Isolate earlier than 830 into df
df1['created_time']=df1['created_time'].dt.normalize()+ timedelta(hours=8,minutes=30, seconds=0)#Adjust time
df2=df1.append(df.set_index(df['created_time']).between_time('08:30:00','00:00:00', include_end=False)).reset_index(drop=True)#Knit before and after 830 together
df2
I am having performance issues with iterrows in on my dataframe as I start to scale up my data analysis.
Here is the current loop that I am using.
for ii, i in a.iterrows():
for ij, j in a.iterrows():
if ii != ij:
if i['DOCNO'][-5:] == j['DOCNO'][4:9]:
if i['RSLTN1'] > j['RSLTN1']:
dl.append(ij)
else:
dl.append(ii)
elif i['DOCNO'][-5:] == j['DOCNO'][-5:]:
if i['RSLTN1'] > j['RSLTN1']:
dl.append(ij)
else:
dl.append(ii)
c = a.drop(a.index[dl])
The point of the loop is to find 'DOCNO' values that are different in the dataframe but are known to be equivalent denoted by the 5 characters that are equivalent but spaced differently in the string. When found I want to drop the smaller number from the associated 'RSLTN1' column. Additionally, my data set may have multiple entries for a unique 'DOCNO' that I want to drop the lower number 'RSLTN1' result.
I was successful running this will small quantities of data (~1000 rows) but as I scale up 10x I am running into performance issues. Any suggestions?
Sample from dataset
In [107]:a[['DOCNO','RSLTN1']].sample(n=5)
Out[107]:
DOCNO RSLTN1
6815 MP00064958 72386.0
218 MP0059189A 65492.0
8262 MP00066187 96497.0
2999 MP00061663 43677.0
4913 MP00063387 42465.0
How does this fit you needs?
import pandas as pd
s = '''\
DOCNO RSLTN1
MP00059189 72386.0
MP0059189A 65492.0
MP00066187 96497.0
MP00061663 43677.0
MP00063387 42465.0'''
# Recreate dataframe
df = pd.read_csv(pd.compat.StringIO(s), sep='\s+')
# Create mask
# We sort to make sure we keep only highest value
# Remove all non-digit according to: https://stackoverflow.com/questions/44117326/
m = (df.sort_values(by='RSLTN1',ascending=False)['DOCNO']
.str.extract('(\d+)', expand=False)
.astype(int).duplicated())
# Apply inverted `~` mask
df = df.loc[~m]
Resulting df:
DOCNO RSLTN1
0 MP00059189 72386.0
2 MP00066187 96497.0
3 MP00061663 43677.0
4 MP00063387 42465.0
In this example the following row was removed:
MP0059189A 65492.0
I have the following dataframe with datetime, lon and lat variables. This data is collected for each second which means each date is repeated 60 times
I am doing some calculations using lat, lon values and at the end I need to write
this data to Postgres table.
2016-07-27 06:43:45 50.62 3.15
2016-07-27 06:43:46 50.67 3.22
2016-07-28 07:23:45 52.32 3.34
2016-07-28 07:24:46 52.67 3.45
Currently I have 10 million records . It is taking longer time if I use whole dataframe for computing.
How can I loop this for each date, write it to DB and clear the dataframe??
I have converted the datetime variable to date format
df['date'] = df['datetime'].dt.date
df = df.sort(['datetime'])
my computation is
df.loc[(df['lat'] > 50.10) & (df['lat'] <= 50.62), 'var1'] = 1
df.loc[(df['lan'] > 3.00) & (df['lan'] <= 3.20), 'var2'] = 1
Writing it to DB
df.to_sql('Table1', engine,if_exists = "replace",index = False)
Have you considered using the groupby() function? You can use it to treat each 'date' as a seperate DataFrame and then run your computations.
for sub_df in df.groupby('date'):
# your computations
I am working with hourly monitoring data which consists of incomplete time series, i.e. several hours during a year (or during several years) will be absent from my dataframe.
I would like to determine the data capture, i.e. the percentage of values present in a month, a season, or a year.
This works with the following code (for demonstration written for monthly resampling) - however that piece of code appears somewhat inefficient, because I need to create a second hourly dataframe and I need to resample two dataframes.
Is there a more elegant solution to this?
import numpy as np
import pandas as pd
# create dummy series
t1 = pd.date_range(start="1997-01-01 05:00", end="1997-04-25 17:00", freq="H")
t2 = pd.date_range(start="1997-06-11 15:00", end="1997-06-15 12:00", freq="H")
t3 = pd.date_range(start="1997-06-18 00:00", end="1997-08-22 23:00", freq="H")
df1 = pd.DataFrame(np.random.randn(len(t1)), index=t1)
df2 = pd.DataFrame(np.random.randn(len(t2)), index=t2)
df3 = pd.DataFrame(np.random.randn(len(t3)), index=t3)
df = pd.concat((df1, df2, df3))
# create time index with complete hourly coverage over entire years
tstart = "%i-01-01 00:00"%(df.index.year[0])
tend = "%i-12-31 23:00"%(df.index.year[-1])
tref = pd.date_range(start=tstart, end=tend, freq="H")
dfref = pd.DataFrame(np.zeros(len(tref)), index=tref)
# count number of values in reference dataframe and actual dataframe
# Example: monthly resampling
cntref = dfref.resample("MS", "count")
cnt = df.resample("MS", "count").reindex(cntref.index).fillna(0)
for i in range(len(cnt.index)):
print cnt.index[i], cnt.values[i], cntref.values[i], cnt.values[i] / cntref.values[i]
pandas' Timedelta will do the trick:
# Time delta between rows of the df
df['index'] = df.index
pindex = df['index'].shift(1)
delta = df['index'] - pindex
# Any delta > 1H means a missing data period
missing_delta = delta[delta > pd.Timedelta('1H')]
# Sum of missing data periods divided by total period
ratio_missing = missing_delta.sum() / (df.index[-1] - df.index[0])
You can use TimeGrouper.
# Create an hourly index spanning the range of your data.
idx = pd.date_range(pd.Timestamp(df.index[0].strftime('%Y-%m-%d %H:00')),
pd.Timestamp(df.index[-1].strftime('%Y-%m-%d %H:00')),
freq='H')
# Use TimeGrouper to calculate the fraction of observations from `df` that are in the
# hourly time index.
>>> (df.groupby(pd.TimeGrouper('M')).size() /
pd.Series(idx).reindex(idx).groupby(pd.TimeGrouper('M')).size())
1997-01-31 1.000000
1997-02-28 1.000000
1997-03-31 1.000000
1997-04-30 0.825000
1997-05-31 0.000000
1997-06-30 0.563889
1997-07-31 1.000000
1997-08-31 1.000000
Freq: M, dtype: float64
As there have been no further suggestions, it appears as if the originally posted solution is most efficient.
Not sure about performance, but for a (very long) one liner you can do this once you have created 'df'... It at least has the benefits of not requiring a dummy dataframe. It should work for any period of data input and resampling.
month_counts = df.resample('H').mean().resample('M').count() / df.resample('H').ffill().fillna(1).resample('M').count()
I am trying to do some simple analyses on the Kenneth French industry portfolios (first time with Pandas/Python), data is in txt format (see link in the code). Before I can do computations, first want to load it in a Pandas dataframe properly, but I've been struggling with this for hours:
import urllib.request
import os.path
import zipfile
import pandas as pd
import numpy as np
# paths
url = 'http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/48_Industry_Portfolios_CSV.zip'
csv_name = '48_Industry_Portfolios.CSV'
local_zipfile = '{0}/data.zip'.format(os.getcwd())
local_file = '{0}/{1}'.format(os.getcwd(), csv_name)
# download data
if not os.path.isfile(local_file):
print('Downloading and unzipping file!')
urllib.request.urlretrieve(url, local_zipfile)
zipfile.ZipFile(local_zipfile).extract(csv_name, os.path.dirname(local_file))
# read from file
df = pd.read_csv(local_file,skiprows=11)
df.rename(columns={'Unnamed: 0' : 'dates'}, inplace=True)
# build new dataframe
first_stop = df['dates'][df['dates']=='201412'].index[0]
df2 = df[:first_stop]
# convert date to datetime object
pd.to_datetime(df2['dates'], format = '%Y%m')
df2.index = df2.dates
All the columns, except dates, represent financial returns. However, due to the file formatting, these are now strings. According to Pandas docs, this should do the trick:
df2.convert_objects(convert_numeric=True)
But the columns remain strings. Other suggestions are to loop over the columns (see for example pandas convert strings to float for multiple columns in dataframe):
for d in df2.columns:
if d is not 'dates':
df2[d] = df2[d].map(lambda x: float(x)/100)
But this gives me the following warning:
home/<xxxx>/Downloads/pycharm-community-4.5/helpers/pydev/pydevconsole.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
try:
I have read the documentation on views vs copies, but having difficulty to understand why it is a problem in my case, but not in the code snippets in the question I linked to. Thanks
Edit:
df2=df2.convert_objects(convert_numeric=True)
Does the trick, although I receive a depreciation warning (strangely enough that is not in the docs at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html)
Some of df2:
dates Agric Food Soda Beer Smoke Toys Fun \
dates
192607 192607 2.37 0.12 -99.99 -5.19 1.29 8.65 2.50
192608 192608 2.23 2.68 -99.99 27.03 6.50 16.81 -0.76
192609 192609 -0.57 1.58 -99.99 4.02 1.26 8.33 6.42
192610 192610 -0.46 -3.68 -99.99 -3.31 1.06 -1.40 -5.09
192611 192611 6.75 6.26 -99.99 7.29 4.55 0.00 1.82
Edit2: the solution is actually more simple than I thought:
df2.index = pd.to_datetime(df2['dates'], format = '%Y%m')
df2 = df2.astype(float)/100
I would try the following to force convert everything into floats:
df2=df2.astype(float)
You can convert specific column to float(or any numerical type for that matter) by
df["column_name"] = pd.to_numeric(df["column_name"])
Posting this because pandas.convert_objects is deprecated in pandas 0.20.1
You need to assign the result of convert_objects as there is no inplace param:
df2=df2.convert_objects(convert_numeric=True)
you refer to the rename method but that one has an inplace param which you set to True.
Most operations in pandas return a copy and some have inplace param, convert_objects is one that does not. This is probably because if the conversion fails then you don't want to blat over your data with NaNs.
Also the deprecation warning is to split out the different conversion routines, presumably so you can specialise the params e.g. format string for datetime etc..