Splitting Date in a Pandas Dataframe - python

Hi, I have a column ['release_date'] in the format month, day, year. I was trying to split this column by doing
test['date_added'].str.split(' ',expand=True) #code_1
but it creates 4 columns. What seems to be happening is that, for a few rows, the values are shifted one column to the right, which produces the 4th column.
This is the error I am facing
I tried splitting ['release_date'] and expected it to be split into 3 columns, but for some reason a few rows are shifted into another column.
If you want to inspect the dataframe, you can download it in Google Colab with:
!gdown 1x-_Kq9qYrybB9-DxJHoeVlPabmAm6xbQ

you can use:
df['day'] = pd.DatetimeIndex(df['date_added']).day
df['Month'] = pd.DatetimeIndex(df['date_added']).month
df['year'] = pd.DatetimeIndex(df['date_added']).year

# assumes df['date_added'] already holds datetime values
day, month, year = zip(*[(d.day, d.month, d.year) for d in df['date_added']])
df = df.assign(day = day, month = month, year = year)
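One possible cause of the extra column in the question is stray whitespace: if a few values start with a space (or contain a double space), splitting on a literal ' ' yields an empty first piece and pushes everything one column to the right. This is an assumption about the data, not something confirmed in the question. A minimal sketch with hypothetical sample values:

```python
import pandas as pd

# Hypothetical sample: the second value has a leading space.
test = pd.DataFrame({"date_added": ["September 25, 2021", " August 4, 2017"]})

# Splitting on a literal space keeps the empty piece before the leading space,
# so the second row produces four pieces instead of three.
naive = test["date_added"].str.split(" ", expand=True)

# Splitting without a pattern splits on runs of whitespace and
# ignores leading/trailing whitespace, so every row gives three pieces.
parts = test["date_added"].str.split(expand=True)
parts.columns = ["month", "day", "year"]
parts["day"] = parts["day"].str.rstrip(",")  # drop the trailing comma from the day

print(naive.shape, parts.shape)
```

If the goal is real dates rather than strings, stripping the column first with `.str.strip()` and then converting it to datetime is another option.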

Related

How to extract year and month from string in a dataframe

1. Question
I have a dataframe, and the Year-Month column contains the year and month which I want to extract.
For example, an element in this column is "2022-10". And I want to extract year=2022, month=10 from it.
My current solution is to use apply and lambda function:
df['xx_month'] = df['Year-Month'].apply(lambda x: int(x.split('-')[1]))
But it's super slow on a huge dataframe.
How to do it more efficiently?
2. Solutions
Thanks for your wisdom, I summarized each one's solution with the code:
(1) split by '-' and join #Vitalizzare
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
(2) convert the datatype from object (str) into datetime format #Neele22
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")
(3) use regex or datetime to extract year and month #mozway
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
# If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Or use datetime:
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
3. Follow up question
But there is a problem if I want to subtract 'Year-Month' from other datetime columns after converting the incomplete 'Year-Month' column from string to datetime.
For example, suppose I want to get the records whose timestamp is no later than 2 months after the 'Year-Month' of each record.
import dateutil # dateutil is a better package than datetime package according to my experience
df[(df['timestamp'] - df['Year-Month'])>= dateutil.relativedelta.relativedelta(months=0) and (df['timestamp'] - df['Year-Month'])<= datetime.timedelta(months=2)]
This code raises a TypeError when subtracting the converted Year-Month column from the actual datetime column:
TypeError: Cannot subtract tz-naive and tz-aware datetime-like objects
The types for these two columns are:
Year-Month is datetime64[ns]
timestamp is datetime64[ns, UTC]
Then, I tried to specify utc=True when changing Year-Month to datetime type:
df[["Year-Month"]] = pd.to_datetime(df[["Year-Month"]],utc=True,format="%Y-%m")
But I got a ValueError:
ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing
4. Take away
If the [day, month, year] information is incomplete for the elements in a column (as in my case, where I only have year and month), I could not convert the column from string to a tz-aware datetime type for calculations with the approach above. Instead, use the extracted year and month to do the calculations.
If you don't need to do calculations between the incomplete datetime column and other datetime columns, you can convert the incomplete datetime string to datetime type and extract the year and month from it. It's easier than using regex, split and join.
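A note on the ValueError in the follow-up: pd.to_datetime treats a DataFrame argument as a set of year/month/day columns to assemble, and df[["Year-Month"]] (double brackets) is a one-column DataFrame. Passing the Series df["Year-Month"] (single brackets) instead is likely to avoid the error. The sketch below uses hypothetical sample data and pd.DateOffset for the two-month window; both are assumptions, not the asker's actual setup.

```python
import pandas as pd

# Hypothetical sample data: a "YYYY-MM" string column and a tz-aware timestamp column.
df = pd.DataFrame({
    "Year-Month": ["2022-10", "2022-11", "2022-12"],
    "timestamp": pd.to_datetime(
        ["2022-10-15 12:00", "2023-03-01 00:00", "2022-12-31 23:59"], utc=True
    ),
})

# Single brackets select a Series, so to_datetime parses the strings;
# missing day information defaults to the first day of the month.
df["Year-Month"] = pd.to_datetime(df["Year-Month"], format="%Y-%m", utc=True)

# Both columns are now tz-aware (UTC), so they can be compared directly.
lower = df["Year-Month"]
upper = df["Year-Month"] + pd.DateOffset(months=2)

# Combine element-wise conditions with &, not the Python keyword `and`.
mask = (df["timestamp"] >= lower) & (df["timestamp"] <= upper)
result = df[mask]
print(result)
```

Comparing against a shifted date avoids having to express "two months" as a fixed duration, which a timedelta cannot represent.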
df = pd.DataFrame({'Year-Month':['2022-10','2022-11','2022-12']})
df = df.join(
df['Year-Month']
.str.split('-', expand=True)
.set_axis(['year','month'], axis='columns')
)
pandas.Series.str.split - split strings of a series, if expand=True then return a data frame with each part in a separate column;
pandas.DataFrame.set_axis - if axis='columns' then rename column names of a data frame;
pandas.DataFrame.join - if the indices are equal, then the frames stacked together horizontally are returned.
You can use a regex for that.
Creating a new DataFrame:
df['Year-Month'].str.extract(r'(?P<year>\d+)-(?P<month>\d+)').astype(int)
If you want to assign the output to the same DataFrame while removing the original Year-Month:
df[['year', 'month']] = df.pop('Year-Month').str.extract(r'(\d+)-(\d+)').astype(int)
Example input:
Year-Month
0 2022-10
output:
year month
0 2022 10
alternative using datetime:
You can also use a datetime intermediate
date = pd.to_datetime(df['Year-Month'])
df['year'] = date.dt.year
df['month'] = date.dt.month
output:
Year-Month year month
0 2022-10 2022 10
You can also convert the datatype from object (str) into datetime format. This will make it easier to work with the dates.
import pandas as pd
df['Year-Month'] = pd.to_datetime(df['Year-Month'], format="%Y-%m")

Pandas Dataframe return rows where state, city, and date occur multiple times

Firstly, this is my first post, so my apologies if it is formatted poorly.
So I have this dataframe, which I have attached a picture of. It contains UFO sightings, and I want to return the rows where the city and state are the same and the dates are also the same. In other words, I am trying to find sightings that occurred on the same day in the same city and state. Please let me know if more info is required.
Thank you in advance!
Alternative, boolean indexing to keep the duplicated rows:
df['date'] = pd.to_datetime(df['occurred_date_time']).dt.normalize()
df2 = df[df.duplicated(['date','city','state'], keep=False)]
If you don't want the new column:
df2 = df[df.assign(date=pd.to_datetime(df['occurred_date_time'])
.dt.normalize())
.duplicated(['date','city','state'], keep=False)]
Try this.
# Create a column converting date_time to just date
df['date'] = pd.to_datetime(df['occurred_date_time']).dt.normalize()
# group by date, city and state, and broadcast each group's size to its rows,
# then create a boolean Series where that size is greater than 1
# (selecting one column keeps m a Series, so df[m] filters rows)
m = df.groupby(['date','city','state'])['date'].transform("size") > 1
# boolean filter the dataframe rows with that series, m.
df[m]
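To make the two approaches above easier to compare, here is a small self-contained sketch. The sample rows are hypothetical; only the column names (occurred_date_time, city, state) come from the answers.

```python
import pandas as pd

# Hypothetical sample of sightings.
df = pd.DataFrame({
    "occurred_date_time": [
        "2020-01-01 20:00",
        "2020-01-01 23:30",
        "2020-01-02 01:00",
        "2020-01-01 21:00",
    ],
    "city": ["Austin", "Austin", "Austin", "Dallas"],
    "state": ["TX", "TX", "TX", "TX"],
})

# Truncate each timestamp to midnight so that only the calendar day matters.
date = pd.to_datetime(df["occurred_date_time"]).dt.normalize()
keyed = df.assign(date=date)

# Approach 1: mark every row whose (date, city, state) occurs more than once.
dup_mask = keyed.duplicated(["date", "city", "state"], keep=False)

# Approach 2: broadcast the group size to each row and compare with 1.
size_mask = keyed.groupby(["date", "city", "state"])["city"].transform("size") > 1

same_day = df[dup_mask]
print(same_day)
```

Both masks select the same rows here; duplicated(..., keep=False) is the shorter of the two.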

retrieve only months with at least 28 sample days - pandas dataframe

Hello to the people of the web,
I have a dataframe containing 'DATE' (datetime) as index and TMAX as column with values:
tmax dataframe
What I'm trying to do is check, for every month (of each year), the number of samples (each TMAX column value is considered a sample).
If a month has fewer than 28 samples, I want to drop that particular month (of that particular year) and all its samples.
I have the following code:
import pandas as pd

if __name__ == '__main__':
    df = pd.read_csv("2961941.csv")
    # parse 'DATE' as datetime and set it as the index
    # (set_index removes the 'DATE' column, which avoids repetition)
    df['DATE'] = pd.to_datetime(df['DATE'])
    df.set_index('DATE', inplace=True)
    # create new table out of 'DATE' and 'TMAX'
    tmax = df.filter(['DATE', 'TMAX'], axis=1)
    # erase rows with missing data (dropna returns a new frame, so reassign)
    tmax = tmax.dropna()
    # create snow table & delete rows with missing info
    snow = df.filter(['DATE', 'SNOW']).dropna()
    # for index, row in tmax.iterrows():
Thanks for the help.
I can suggest trying the following.
Here I store the per-month sample counts, broadcast to every row, in a variable 'a'.
Then I keep only the rows that belong to months with at least 28 samples.
It worked for me.
a = df.groupby(pd.Grouper(level='DATE', freq="M")).transform('count')
print(df[a['TMAX'] >= 28])
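The same idea can be sketched on a small self-contained example. The data below is hypothetical (one complete month and one incomplete month), and grouping by index.to_period("M") is a swapped-in alternative to pd.Grouper(freq="M"), chosen here because newer pandas versions warn about the "M" frequency alias for timestamps.

```python
import pandas as pd

# Hypothetical data: January 2020 is complete (31 days), February 2020 has only 10 days.
idx = pd.date_range("2020-01-01", "2020-01-31").append(
    pd.date_range("2020-02-01", "2020-02-10")
)
df = pd.DataFrame({"TMAX": range(len(idx))}, index=idx)
df.index.name = "DATE"

# Label each row with its calendar month (year and month together).
month = df.index.to_period("M")

# Count the non-missing TMAX samples per month and broadcast the count to each row.
counts = df.groupby(month)["TMAX"].transform("count")

# Keep only rows from months with at least 28 samples.
kept = df[counts >= 28]
print(kept.index.min(), kept.index.max(), len(kept))
```

Because transform returns one value per original row, the result can be used directly as a boolean mask on df.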

Access only second row of column names in a dataframe

I want to read an excel file where the second line is a date in a string format and the first line is the weekday that corresponds to each date, and then change the second line from string to datetime. If I only read the second line as index, and completely skip the first line with the days, I do the following to convert it to a datetime:
Receipts_tbl.columns = pd.to_datetime(Receipts_tbl.columns)
How do I do that if I have a multiindexed dataframe, where the first line of the indices remains as weekdays, and I want the second to be converted to datetime?
Thanx
You didn't give an example of what your data source looks like, so I'm inferring.
If you use pd.read_excel with header=None, it will treat the first two rows as data and you can manipulate them to achieve your goal. Here's a minimal example, with an example "real" data row beneath:
df = pd.DataFrame([['Mon', 'Tues'], ['10-02-1995', '11-23-1997'],
[12, 32]])
# 0 1
#0 Mon Tues
#1 10-02-1995 11-23-1997
#2 12 32
Next, convert the second row (index 1) to datetime as you said in your question.
df.loc[1] = pd.to_datetime(df.loc[1])
Create a multi-index from the first two rows, and set it as the dataframe's columns
df.columns = df.T.set_index([0,1]).index.set_names(['DOW', 'Date'])
Lastly, select from the third row down (index 2), as the first two rows are now in the columns.
df = df.loc[2:].reset_index(drop=True)
df
#DOW Mon Tues
#Date 812592000000000000 880243200000000000
#0 12 32
Note that DOW and Date are now a multilevel index for the columns, and the 'data' rows have been reindexed to start at 0.
Please let me know if I misunderstood your question.
Assuming you have this data in the clipboard
Day Date Data
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Mo 2018-08-06 blah
Tu 2018-08-07 blah
Try
import pandas as pd
df = pd.read_clipboard().set_index(['Day', 'Date'])
to get a multiindexed example
Then change the Date to Datetime
df2 = df.reset_index()
df2.Date = pd.to_datetime(df2.Date, yearfirst=True)
Afterwards you can set the multiindex again, if you want.
Note, check out the documentation on to_datetime if your
datetime string is formatted differently. It assumes
month first, unless you set dayfirst or yearfirst to True.
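If the frame already has two-level columns (for example, as produced by pd.read_excel with header=[0, 1]), another option is to rebuild the column MultiIndex with the second level converted. This is a sketch under assumptions: the level names DOW and Date and the month-day-year format are taken from the first answer's example, not from the asker's file.

```python
import pandas as pd

# Hypothetical frame with weekday / date-string column levels.
cols = pd.MultiIndex.from_arrays(
    [["Mon", "Tues"], ["10-02-1995", "11-23-1997"]], names=["DOW", "Date"]
)
df = pd.DataFrame([[12, 32]], columns=cols)

# Rebuild the column index: keep the weekday level, parse the date level.
df.columns = pd.MultiIndex.from_arrays(
    [
        df.columns.get_level_values("DOW"),
        pd.to_datetime(df.columns.get_level_values("Date"), format="%m-%d-%Y"),
    ],
    names=df.columns.names,
)

print(df.columns.get_level_values("Date"))
```

Rebuilding from get_level_values keeps each column's weekday and date paired in their original order.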

Python Pandas - Day and Month mix up

I have a 'myfile.csv' file which has a 'timestamp' column which starts at
(01/05/2015 11:51:00)
and finishes at
(07/05/2015 23:22:00)
A total span of 9,727 minutes
'myfile.csv' also has a column named 'A' which holds some numerical value. There are multiple values for 'A' within each minute, each with a unique timestamp to the nearest second.
I have code as follows
df = pd.read_csv('myfile.csv')
df = df.set_index('timestamp')
df.index = df.index.to_datetime()
df.sort_index(inplace=True)
df = df['A'].resample('1Min').mean()
df.index = (df.index.map(lambda t: t.strftime('%Y-%m-%d %H:%M')))
My problem is that python seems to think 'timestamp' starts at
(01/05/2015 11:51:00)
-> 5th January
and finishes at
(07/05/2015 23:22:00)
-> 5th July
But really 'timestamp' starts at the
1st May
and finishes at the
7th of May
So the above code produces a dataframe with 261,332 rows, OMG, when it should really only have 9,727 rows.
Somehow Python is mixing up the month with the day, misinterpreting the dates, how do I sort this out?
There are many arguments to read_csv that can help you parse dates from a csv straight into your pandas DataFrame. Here we can set parse_dates to the columns you want parsed as dates and then use dayfirst. dayfirst defaults to False, so setting it to True as below should do what you want, assuming the dates are in the first column.
df = pd.read_csv('myfile.csv', parse_dates=[0], dayfirst=True)
If the dates column is not the first column, just change the 0 to the column number.
The format of the dates that you have included in your question doesn't seem to match your strftime filter. Take a look at this to fix your string parameter.
It looks to me that it should be something in the lines of:
'%d/%m/%Y %H:%M:%S'
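Passing that format string explicitly to pd.to_datetime removes the day/month ambiguity altogether. A minimal sketch, using a small in-memory CSV with hypothetical values in place of 'myfile.csv':

```python
import io

import pandas as pd

# Hypothetical stand-in for 'myfile.csv' (day/month/year timestamps).
csv_text = (
    "timestamp,A\n"
    "01/05/2015 11:51:00,1.0\n"
    "01/05/2015 11:51:30,3.0\n"
    "07/05/2015 23:22:00,5.0\n"
)
df = pd.read_csv(io.StringIO(csv_text))

# An explicit format means the day always comes first, with no guessing.
df["timestamp"] = pd.to_datetime(df["timestamp"], format="%d/%m/%Y %H:%M:%S")
df = df.set_index("timestamp").sort_index()

# Average 'A' within each minute.
per_minute = df["A"].resample("1min").mean()
print(per_minute.index.min(), per_minute.index.max())
```

With the dates parsed as 1 May to 7 May, the resampled index covers about a week of minutes rather than half a year.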
