I am working with a big dataset (more than 2 million rows x 10 columns) that has a date column. Some of the rows are formatted correctly (e.g. 2020/04/08), but others are not (concretely, they are formatted as 20200408).
I want to fix the wrongly formatted ones, but I don't want to iterate through all the rows.
Normally, for a small dataset I would do
for i in range(len(df)):
    cell = str(df.iloc[i]['date'])
    if len(cell) == 8:
        df.iat[i, df.columns.get_loc('date')] = datetime.strptime(cell, '%Y%m%d').strftime('%Y-%m-%d')
but I know this is far from optimal.
How can I use the power of pandas to avoid the loop here?
Thanks!
Build a boolean mask with Series.str.len, select the matching rows of the column with DataFrame.loc, convert them to datetimes with to_datetime, and finally write them back in the custom format with Series.dt.strftime:
m = df['date'].str.len() == 8
df.loc[m, 'date'] = pd.to_datetime(df.loc[m, 'date'], format='%Y%m%d').dt.strftime('%Y-%m-%d')
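For reference, here is a self-contained sketch of the same mask approach on made-up sample data. The astype(str) cast is my own assumption, added in case some values were read in as integers, and I write the result with slashes to match the 2020/04/08 format from the question (swap in '%Y-%m-%d' if you prefer dashes).

```python
import pandas as pd

# Hypothetical sample data mixing the two formats from the question.
df = pd.DataFrame({"date": ["2020/04/08", "20200408", "2019/12/31", "20191231"]})

# Assumption: cast to str so the length check also works for non-string values.
as_text = df["date"].astype(str)
mask = as_text.str.len() == 8

# Parse only the 8-character rows and write them back in the target format.
fixed = pd.to_datetime(as_text[mask], format="%Y%m%d").dt.strftime("%Y/%m/%d")
df.loc[mask, "date"] = fixed

print(df["date"].tolist())
```

Only the masked rows are parsed, so the rows that were already correct are never touched.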
Try
df['date'] = df['date'].apply(pd.to_datetime)  # parses each value separately; slower than a vectorized call on large frames
I have a data frame with many columns, but I'm only interested in the data spanning from, say, 01/01/2009 to 01/01/2019. I want to keep all the data in that range and get rid of everything else.
Assuming the date column is named date_column and already has a datetime dtype (if it holds strings, convert it with pd.to_datetime first):
df_new = df[(df['date_column'] > '01/01/2009') & (df['date_column'] <= '01/01/2019')]
print(df_new)
If the dates are correctly formatted (a datetime column, or ISO yyyy-mm-dd strings):
df_new = df[df['date_col'].between('2009-01-01', '2019-01-01')]
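This works for both dd-mm-yyyy and yyyy-mm-dd strings, because it compares only the year (for dd-mm-yyyy input, pass dayfirst=True to pd.to_datetime). Note that it keeps everything from 2009 through the end of 2019:
years = pd.to_datetime(df['Date']).dt.year
df[(years >= 2009) & (years <= 2019)]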
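If you need the exact day boundaries rather than whole years, one option is to parse the column once and use between on the parsed values. This is only a sketch: the sample data, the dd-mm-yyyy format string, and the value column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data with dd-mm-yyyy strings around both boundaries.
df = pd.DataFrame({
    "Date": ["31-12-2008", "01-01-2009", "15-06-2014", "01-01-2019", "02-01-2019"],
    "value": [1, 2, 3, 4, 5],
})

# Parse once with an explicit format so day and month cannot be swapped.
dates = pd.to_datetime(df["Date"], format="%d-%m-%Y")

# between() is inclusive on both ends by default.
in_range = dates.between(pd.Timestamp("2009-01-01"), pd.Timestamp("2019-01-01"))
df_new = df[in_range]

print(df_new)
```

The original string column is left as it was; only the mask is built from the parsed dates.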
I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for printing a piecewise graph with matplotlib). In other words, I need a new DF which would be a subset of the first one.
More precisely I need to take all rows that are between 9 and 16 o'clock but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? thanks
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f') # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')
df.set_index('time', inplace=True)
new_df = df[startTime:endTime] # startTime and endTime are strings in the same format as the index, which should be sorted
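Since the question already has a DatetimeIndex, another option is to slice the date range with .loc and then filter the time of day with DataFrame.between_time. A minimal sketch, with made-up timestamps and a hypothetical value column:

```python
import pandas as pd

# Hypothetical frame with a sorted DatetimeIndex.
idx = pd.to_datetime([
    "2021-03-01 08:00", "2021-03-01 12:00", "2021-03-02 09:00",
    "2021-03-02 16:00", "2021-03-02 17:30", "2021-03-05 10:00",
])
df = pd.DataFrame({"value": range(6)}, index=idx)

# Step 1: restrict to the date range (string slicing on a DatetimeIndex includes the end date).
in_dates = df.loc["2021-03-01":"2021-03-03"]

# Step 2: keep only the rows whose time of day is between 09:00 and 16:00.
subset = in_dates.between_time("09:00", "16:00")

print(subset)
```

This keeps the index as real timestamps, so the result can go straight to matplotlib.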
I have a column in a data frame with dates in the format of “1/4/2021 0:00”. And I would like to get rid of everything after the first space, including the first space so that way it becomes “1/4/2021”.
How can I do that in Python? Also, does the column already have to be a specific data type in order to complete this task?
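If you are using pandas you can try the following, assuming the entire column follows the same format and has already been converted to a datetime dtype (for example with pd.to_datetime), since the .dt accessor does not work on plain strings.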
Your dataframe is called df, and your column of dates is date.
df['date'] = df['date'].dt.date
or
df['date'] = pd.to_datetime(df['date'].dt.date)
or
df['date'] = df['date'].dt.normalize()
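Which one to use depends on the type you want for the date column: the first gives Python date objects (object dtype), while the other two keep a datetime64 column with the time set to midnight.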
Try this:
df['date'] = df['date'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else x)
Note that this code only changes values that are strings; anything else in the column is left untouched.
In order to check the data type, run: df.dtypes.
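If the whole column really is strings, the same split can be done without apply by using the vectorized .str accessor. A small sketch on made-up values in the format from the question:

```python
import pandas as pd

# Hypothetical string column in the "1/4/2021 0:00" format.
df = pd.DataFrame({"date": ["1/4/2021 0:00", "12/25/2021 0:00", "7/9/2022 13:30"]})

# Split on the space and keep the first piece of every value.
df["date"] = df["date"].str.split(" ").str[0]

print(df["date"].tolist())
```

The column stays a string column afterwards; convert it with pd.to_datetime if you need real dates.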
I have a list of dates in a DF that have been converted to a YYYY-MM format and need to select a range. This is what I'm trying:
# create dataframe
data = ['2016-01','2016-02','2016-09','2016-10','2016-11','2017-04','2017-05','2017-06','2017-07','2017-08']
df = pd.DataFrame(data, columns=['date'])  # a list, not a set, for the column names
# lookup range
df[df["date"].isin(pd.date_range('2016-01', '2016-06'))]
It doesn't seem to be working because the date column is no longer a datetime column. The format has to be in YYYY-MM. So I guess the question is, how can I make a datetime column with YYYY-MM? Can someone please help?
Thanks.
You do not need an actual datetime-type column or query values for this to work. Keep it simple:
df[df.date.between('2016-01', '2016-06')]
That gives:
      date
0  2016-01
1  2016-02
It works because ISO 8601 date strings sort chronologically when compared as plain strings: '2016-06' comes after '2016-05' and so on.
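To check that claim, here is a quick sketch that runs the string comparison next to a comparison on parsed datetimes, using the data from the question. The parsed column and the Timestamp bounds are my additions for the cross-check only.

```python
import pandas as pd

data = ["2016-01", "2016-02", "2016-09", "2016-10", "2016-11",
        "2017-04", "2017-05", "2017-06", "2017-07", "2017-08"]
df = pd.DataFrame(data, columns=["date"])

# Plain string comparison on the YYYY-MM values.
by_string = df[df["date"].between("2016-01", "2016-06")]

# Same selection on parsed datetimes (each value becomes the first day of its month).
parsed = pd.to_datetime(df["date"], format="%Y-%m")
by_datetime = df[parsed.between(pd.Timestamp("2016-01-01"), pd.Timestamp("2016-06-01"))]

print(by_string["date"].tolist())
print(by_string.equals(by_datetime))
```

Both selections pick the same rows here, and the date column keeps its YYYY-MM text.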
I have been trying to convert values with commas in a pandas dataframe to floats with little success. I also tried .replace(",","") but it doesn't work. How can I go about changing the Close_y column to float and the Date column to date values so that I can plot them? Any help would be appreciated.
Convert 'Date' using to_datetime. For the other column, use str.replace(',','.') and then cast the type:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Close_y'] = df['Close_y'].str.replace(',','.').astype(float)
Series.replace looks for exact matches of the whole value, whereas what you're trying to do is replace the comma wherever it occurs inside the string, which is what Series.str.replace does.
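The difference is easy to see on a tiny example. The values below are made up, and the comma is treated as a decimal separator as in the answer above; if your commas are thousands separators, replace them with an empty string instead.

```python
import pandas as pd

# Hypothetical price strings that use a comma as the decimal separator.
s = pd.Series(["1,5", "2,25", "1,5"])

# Series.replace compares whole values: no cell is exactly ",", so nothing changes.
unchanged = s.replace(",", ".")

# Series.str.replace works on substrings inside each value.
converted = s.str.replace(",", ".", regex=False).astype(float)

print(unchanged.tolist())
print(converted.tolist())
```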
pandas.read_clipboard accepts the same kwargs as pandas.read_table, including the thousands and parse_dates kwargs.
Try loading your data with:
df = pd.read_clipboard(thousands=',', parse_dates=[0])
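This assumes that the Date column is the first column (index 0). If you have a large amount of data, on older pandas versions you may also try the infer_datetime_format kwarg to speed things up (it is deprecated since pandas 2.0, where a consistent format is inferred by default).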
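The clipboard is awkward to demonstrate, so here is a sketch of the same two kwargs with pd.read_csv on an in-memory string. The tab-separated sample text is invented, and here the comma is assumed to be a thousands separator.

```python
import io

import pandas as pd

# Hypothetical tab-separated text, standing in for what would be on the clipboard.
raw = "Date\tClose_y\n01/02/2020\t1,234.50\n01/03/2020\t987.25\n"

# thousands=',' strips the separators while parsing; parse_dates=[0] parses the first column.
df = pd.read_csv(io.StringIO(raw), sep="\t", thousands=",", parse_dates=[0])

print(df.dtypes)
print(df)
```

Doing the conversion at load time avoids any string cleanup afterwards.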