How to delete everything after the first space in Python?

I have a column in a data frame with dates in the format of “1/4/2021 0:00”. And I would like to get rid of everything after the first space, including the first space so that way it becomes “1/4/2021”.
How can I do that in Python? Also, does the column already have to be a specific data type in order to complete this task?

If you are using pandas, you can try one of the following, provided the entire column follows a similar datetime format and already has datetime dtype.
Say your dataframe is called df and your column of dates is date.
df['date'] = df['date'].dt.date
or
df['date'] = pd.to_datetime(df['date'].dt.date)
or
df['date'] = df['date'].dt.normalize()
Depending on what you want the format of your date column to be.
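To answer the dtype question: the .dt accessor only works on columns that are already datetime64, so a string column like "1/4/2021 0:00" must be converted with pd.to_datetime first. A minimal sketch (the column name date and the "%m/%d/%Y %H:%M" format are assumptions based on the example in the question):

```python
import pandas as pd

# Sample data in the question's "M/D/YYYY H:MM" string format
df = pd.DataFrame({'date': ['1/4/2021 0:00', '2/5/2021 13:30']})

# .dt only works on datetime64 columns, so convert the strings first
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y %H:%M')

# Drop the time portion; the column now holds datetime.date objects
df['date'] = df['date'].dt.date

print(df['date'].tolist())
```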

Try this:
df['date'] = df['date'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else x)
Note that the split is only applied to string values; anything else passes through unchanged, so this is intended for a column with string (object) dtype.
To check the data type, run: df.dtypes.
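A small runnable demonstration of the split approach on toy data (the column name date is assumed); non-string values such as None/NaN simply pass through:

```python
import pandas as pd

# Column stored as plain strings, as in the question, plus a missing value
df = pd.DataFrame({'date': ['1/4/2021 0:00', '2/5/2021 13:30', None]})

# Keep only the text before the first space; non-strings pass through unchanged
df['date'] = df['date'].apply(lambda x: x.split(' ')[0] if isinstance(x, str) else x)

print(df['date'].tolist())
```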

Related

Change dtype of column from object to datetime64

I have a column with dates (Format: 2022-05-15) with the current dtype: object. I want to change the dtype to datetime with the following code:
df['column'] = pd.to_datetime(df['column'])
I receive the error:
ParserError: Unknown string format: DU2999
I'm changing multiple columns (e.g. another date column with the format dd-mm-yyyy hh-mm-ss), but I get the error only for the column mentioned above.
Thank you very much for your help in advance.
If you want to handle this error by setting the resulting datetime value to NaT whenever the input value is "DU2999" (or another string that does not match the expected format), you can use:
df['column'] = pd.to_datetime(df['column'], errors='coerce')
See https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html.
If you want to manually correct this specific case, you could use print(df.loc[df['column']=="DU2999"]) to view that row of the dataframe and decide what to overwrite it with.
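A minimal sketch of what errors='coerce' does with a value like DU2999:

```python
import pandas as pd

s = pd.Series(['2022-05-15', 'DU2999', '2022-06-01'])

# errors='coerce' turns unparseable strings into NaT instead of raising
parsed = pd.to_datetime(s, errors='coerce')

print(parsed)
```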
As @Naveed said, there are invalid date strings in the date column, such as DU2999. What you can do is simply find out which strings are not in a date format:
temp_date = pd.to_datetime(df['Date_column'], errors='coerce', dayfirst=True)
# Rows where parsing failed (NaT) are the problematic ones
mask = temp_date.isna()
df_problematic = df[mask]
print(df_problematic)
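The same mask technique on toy data (the column name Date_column is assumed):

```python
import pandas as pd

df = pd.DataFrame({'Date_column': ['15-05-2022', 'DU2999', '01-06-2022'],
                   'value': [1, 2, 3]})

# Coerce bad strings to NaT, then use the NaT positions as a row mask
temp_date = pd.to_datetime(df['Date_column'], errors='coerce', dayfirst=True)
df_problematic = df[temp_date.isna()]

print(df_problematic)
```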

In pandas, how to infer date typed columns with a custom format

I am trying to parse a csv file using pandas, with read_csv, and I am running into an issue where dates are not properly parsed, since they are in the format "%d.%m.%Y" (example: 22.01.2022).
I understand a custom date parser is needed, so I passed one in input, such as here:
data = pd.read_csv(p, skiprows=[0, 1, 2, 4], keep_default_na=False,
                   date_parser=lambda x: datetime.strptime(x, "%d.%m.%Y").date(),
                   sep="\t")
This data extraction doesn't parse the dates as expected.
If I pass the list of columns that I expect to have dates in it, then those columns are properly parsed as dates, so I assume my custom date parser works:
data = pd.read_csv(p, skiprows=[0, 1, 2, 4], keep_default_na=False,
                   date_parser=lambda x: datetime.strptime(x, "%d.%m.%Y").date(),
                   parse_dates=['date1', 'date2'],
                   sep="\t")
But I would like to avoid having to manually specify which columns pandas should be trying to parse as date columns, since the data source could evolve. I would like to have pandas guess which columns contain dates, like it does when the dates match a more standard format.
Since the pandas behaviour I was looking for turned out to not exist, here is the solution that I went with, which involves building the dataframe, and then turning the appropriate columns to date.
First I find the columns where the string data matches my format, and then apply the type:
date_cols = [col for col in df.columns
             if df[col].astype(str).str.contains(r'^\d{2}\.\d{2}\.\d{4}$', regex=True).any()]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%d.%m.%Y')
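A self-contained sketch of that detect-then-convert approach (the toy column names date1 and name are assumptions):

```python
import pandas as pd

# Toy frame: one "%d.%m.%Y" date column, one plain text column
df = pd.DataFrame({'date1': ['22.01.2022', '03.02.2022'],
                   'name': ['a', 'b']})

# Detect columns whose values match dd.mm.yyyy, then convert only those
pattern = r'^\d{2}\.\d{2}\.\d{4}$'
date_cols = [col for col in df.columns
             if df[col].astype(str).str.contains(pattern, regex=True).any()]
for col in date_cols:
    df[col] = pd.to_datetime(df[col], format='%d.%m.%Y')

print(df.dtypes)
```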

.fillna breaking .dt.normalize()

I am trying to clean up some data, by formatting my floats to show no decimal points and my date/time to only show date. After this, I want to fill in my NaNs with an empty string, but when I do that, my date goes back to showing both date/time. Any idea why? Or how to fix it.
This is before I run the fillna() method, with a picture of what my data looks like:
#Creating DataFrame from path variable
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include date (excluding time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
#daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[screenshot: code with NaNs]
This is when I run the fillna() method:
daily_production_df = pd.read_excel(path)
#Reformatted Date series to only include date (excluding time)
daily_production_df['Date'] = daily_production_df['Date'].dt.normalize()
pd.options.display.float_format = '{:,.0f}'.format
daily_production_df = daily_production_df.fillna('')
#Called only the 16 rows that have data, including the columns/header
daily_production_df.head(16)
[screenshot: date_time]
Using normalize() does not change the dtype of the column; pandas just stops displaying the time portion when printing, because every value shares the same midnight time. Once you run fillna(''), the column is converted to object dtype and the full timestamps reappear.
I would recommend converting the column to actual datetime.date values instead of using normalize():
df['date'] = pd.to_datetime(df['date']).dt.date
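A minimal sketch contrasting the two: normalize() keeps the datetime64 dtype, while dt.date stores plain datetime.date objects, which keep their short form even after fillna(''):

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.to_datetime(['2021-01-04 08:30', '2021-01-05 12:00'])})

# normalize() keeps datetime64 dtype; only the display hides the midnight times
normalized = df['Date'].dt.normalize()
print(normalized.dtype)

# dt.date produces plain datetime.date objects (object dtype), so filling
# NaNs with '' afterwards does not bring the time portion back
df['Date'] = df['Date'].dt.date
df = df.fillna('')
print(df['Date'].tolist())
```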

Faster solution for date formatting

I am trying to change the format of the date in a pandas dataframe.
If I check the date in the beginning, I have:
df['Date'][0]
Out[158]: '01/02/2008'
Then, I use:
df['Date'] = pd.to_datetime(df['Date']).dt.date
To change the format to
df['Date'][0]
Out[157]: datetime.date(2008, 1, 2)
However, this takes a veeeery long time, since my dataframe has millions of rows.
All I want to do is change the date format from MM-DD-YYYY to YYYY-MM-DD.
How can I do it in a faster way?
You should first collapse by Date using the groupby method to reduce the dimensionality of the problem.
Then you parse the dates into the new format and merge the results back into the original DataFrame.
This requires some time because of the merging, but it takes advantage from the fact that many dates are repeated a large number of times. You want to convert each date only once!
You can use the following code:
from datetime import datetime

date_parser = lambda x: datetime.strptime(str(x), '%m/%d/%Y')  # pd.datetime has been removed; use the stdlib datetime
df['date_index'] = df['Date']
dates = df.groupby(['date_index']).first()['Date'].apply(date_parser)
df = df.set_index(['date_index'])
df['New Date'] = dates
df = df.reset_index()
df.head()
In my case, the execution time for a DataFrame with 3 million lines reduced from 30 seconds to about 1.5 seconds.
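The same deduplication idea can be written more compactly by parsing each distinct date string once and mapping the results back; a sketch on toy data, not the answer's exact code:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'Date': ['01/02/2008', '03/04/2008', '01/02/2008'] * 2})

# Parse each distinct date string exactly once, then map the results back
lookup = {s: datetime.strptime(s, '%m/%d/%Y').date() for s in df['Date'].unique()}
df['New Date'] = df['Date'].map(lookup)

print(df['New Date'].tolist())
```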
I'm not sure if this will help with the performance issue, as I haven't tested with a dataset of your size, but at least in theory, this should help. Pandas has a built in parameter you can use to specify that it should load a column as a date or datetime field. See the parse_dates parameter in the pandas documentation.
Simply pass in a list of columns that you want to be parsed as a date and pandas will convert the columns for you when creating the DataFrame. Then, you won't have to worry about looping back through the dataframe and attempting the conversion after.
import pandas as pd
df = pd.read_csv('test.csv', parse_dates=[0,2])
The above example would try to parse the 1st and 3rd columns (indices 0 and 2) as dates.
The type of each resulting column value will be a pandas timestamp and you can then use pandas to print this out however you'd like when working with the dataframe.
Following a lead in @pygo's comment, I found that my mistake was to try to read the data as
df['Date'] = pd.to_datetime(df['Date']).dt.date
This would be, as this answer explains:
This is because pandas falls back to dateutil.parser.parse for parsing the strings when it has a non-default format or when no format string is supplied (this is much more flexible, but also slower).
As you have shown above, you can improve the performance by supplying a format string to to_datetime. Or another option is to use infer_datetime_format=True
When using any of the custom date parsers from the answers above, pandas falls back to a slow element-by-element loop. The same happens when we specify the format we want (instead of the format the data actually has) in pd.to_datetime.
Hence, instead of doing
df['Date'] = pd.to_datetime(df['Date'],format='%Y-%m-%d')
or
df['Date'] = pd.to_datetime(df['Date']).dt.date
we should do
df['Date'] = pd.to_datetime(df['Date'],format='%m/%d/%Y').dt.date
By supplying the format the data is currently in, it is parsed really fast into datetime format. Then, using .dt.date, it is fast to change it to the new format without the slow parser.
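A minimal before/after sketch of that fix:

```python
import pandas as pd

s = pd.Series(['01/02/2008', '12/31/2008'])

# Slow path (commented out): no format, pandas may fall back to per-element parsing
# dates_slow = pd.to_datetime(s).dt.date

# Fast path: tell pandas the format the data is *in*, then take the date part
dates_fast = pd.to_datetime(s, format='%m/%d/%Y').dt.date

print(dates_fast.tolist())
```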
Thank you to everyone who helped!

Converting values with commas in a pandas dataframe to floats.

I have been trying to convert values with commas in a pandas dataframe to floats with little success. I also tried .replace(",","") but it doesn't work? How can I go about changing the Close_y column to float and the Date column to date values so that I can plot them? Any help would be appreciated.
Convert 'Date' using to_datetime; for the other column, use str.replace(',', '.') and then cast the type:
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df['Close_y'] = df['Close_y'].str.replace(',','.').astype(float)
replace looks for exact matches of entire cell values by default, while what you're trying to do is replace every occurrence inside each string, which is what str.replace does.
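A small demonstration of that difference on toy data with decimal commas:

```python
import pandas as pd

s = pd.Series(['3,14', '2,71'])

# Series.replace matches whole cell values by default, so nothing changes here
print(s.replace(',', '.').tolist())

# str.replace substitutes inside each string, so the float cast then works
floats = s.str.replace(',', '.', regex=False).astype(float)
print(floats.tolist())
```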
pandas.read_clipboard implements the same kwargs as pandas.read_table, among which are the thousands and parse_dates kwargs.
Try loading your data with:
df = pd.read_clipboard(thousands=',', parse_dates=[0])
Assuming that the Dates column is at index 0. If you have a large amount of data, you may also try using the infer_datetime_format kwarg to speed things up.
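Since the clipboard can't be shown in a snippet, here is the same thousands/parse_dates combination demonstrated with read_csv on an in-memory buffer (read_clipboard forwards the same kwargs); the tab-separated toy data is an assumption:

```python
import io
import pandas as pd

# Tab-separated toy data: a date column and a number with a thousands comma
data = "Date\tClose_y\n01/02/2008\t1,234\n01/03/2008\t2,500\n"

# thousands=',' strips the separators; parse_dates=[0] parses the first column
df = pd.read_csv(io.StringIO(data), sep='\t', thousands=',', parse_dates=[0])

print(df.dtypes)
```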
