I have year, month and date in three columns, I am concatenating them to one column then trying to make this column to YYYY/mm/dd format as follows:
dfyz_m_d['dt'] = '01'# to bring one date of each of the month
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].map(str) + dfyz_m_d['mon'].map(str) + dfyz_m_d['dt'].map(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y%m%d')
but for both 1 ( jan) and 10 ( Oct) months I am getting only oct in final outcome (CalendarWeek comun doesn't have any Jan. Basically it is retaining all records but Jan month also it is formatting to Oct
The issue is Jan is single digit numerically, so you end up with something like 2021101 which will be interpreted as Oct instead of Jan. Make sure your mon column is always converted to two digit months with leading zeros if needed using .zfill(2):
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.zfill(2) + dfyz_m_d['dt'].astype(str)
zfill example:
df = pd.DataFrame({'mon': [1,2,10]})
df.mon.astype(str).str.zfill(2)
0 01
1 02
2 10
Name: mon, dtype: object
I usually do
pd.to_datetime(df.mon,format='%m').dt.strftime('%m')
0 01
1 02
2 10
Name: mon, dtype: object
Also , if you name the column correctly , notice the name as year month and day
df['day'] = '01'
df['new'] = pd.to_datetime(df.rename(columns={'mon':'month'})).dt.strftime('%m/%d/%Y')
df
year mon day new
0 2020 1 1 01/01/2020
1 2020 1 1 01/01/2020
I like str.pad :)
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.pad(2, 'left', '0') + dfyz_m_d['dt'].astype(str)
It will pad zeros to the left to ensure that the length of the strings will be two. SO 1 becomes 01, but 10 stays to be 10.
You should be able to use pandas.to_datetime with your input dataframe. You may need to rename your columns.
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'dt': [4, 5]})
print(pd.to_datetime(df.rename(columns={"dt": "day"})))
Output
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
You can add / between year, mon and dt and amend the format string to include it, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d')
Data Input
year mon dt
0 2021 1 01
1 2021 2 01
2 2021 10 01
3 2021 11 01
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021-01-01
1 2021 2 01 2021/2/01 2021-02-01
2 2021 10 01 2021/10/01 2021-10-01
3 2021 11 01 2021/11/01 2021-11-01
If you want the final output date format be YYYY/mm/dd, you can further use .dt.strftime after pd.to_datetime, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d').dt.strftime('%Y/%m/%d')
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021/01/01
1 2021 2 01 2021/2/01 2021/02/01
2 2021 10 01 2021/10/01 2021/10/01
3 2021 11 01 2021/11/01 2021/11/01
Related
The time in my csv file is divided into 4 columns, (year, julian day, hour/minut(utc) and second), and I wanted to convert to a single column so that it looks like this: 14/11/2017 00:16:00.
Is there a easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My array looks like that:
The file: csv file sample
The "Hour/minut (UTC)" column has the first two digits referring to the Local Time and the last two digits referring to the minute.
The beginning of the time in the "Hour/minut (UTC)" column starts at 016 which refers to 0 hour UTC and minute 16.
and goes up to hour 12 UTC and minute 03.
I wanted to unify everything into a single datetime column so from the beginning to the end of the array:
1 - 2017
1412 - 14/11/2017 12:03:30
but the column "Hour/minut (UTC)" from hour 0 to hour 9 only has one value like this:
9
instead of 09
How do I create the array with the correct datetime?
You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
df = pd.DataFrame(dict)
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column field on the DataFrame, with the values extracted from the Year Month and Date columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation instead of using df.month use df['Julian Day'] since the column name is different. To understand more on why this is, read here
The data in the new column will be as string with the way you like to format it. You can also substitute the dash '-' with a slash '/' or however you need to format the outcome. You just need to convert the integers into strings with .astype(str)
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything as you would on a dataframe object.
If you only need it for data analysis you can do it with the function .groupBy() which groups the data fields and performs the analysis.
source
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
Date1.append(dt.split("-")[0])
Date2.append(dt.split("-")[1])
Year = []
try:
for yr in Date1:
Year.append(int(yr.Date1))
except:
for yr in Date2:
Year.append(int(yr.Date2))
You can make use of the extract dataframe string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and have a Year1 column and Year2 columns for either position. Then use np.where to create a single Year column pulls from each of these other year columns.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
split_dates["Year1"].notna(),
split_dates["Year1"],
split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
if len(moYr[0]) == 2:
Month.append(moYr[1])
Year.append(moYr[0])
else:
Month.append(moYr[0])
Year.append(moYr[1])
This took me hours!
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime
def parse_dt(x):
try:
return datetime.strptime(x, "%y-%b")
except:
return datetime.strptime(x, "%b-%y")
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot command shuch that
pt = pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum") \
.reset_index() \
.set_index("Date"))
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
And I can then rename rename the columns using the loop
for column in pt:
pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
I need a Python function to return a Pandas DataFrame with range of dates, only year and month, for example, from November 2016 to March 2017 and have this as result:
year month
2016 11
2016 12
2017 01
2017 02
2017 03
My dates are in string format Y-m (from = '2016-11', to = '2017-03'). I'm not sure on turning them to datetime type or to separate them into two different integer values.
Any ideas on how to achieve it properly?
Are you looking at something like this?
pd.date_range('November 2016', 'April 2017', freq = 'M')
You get
DatetimeIndex(['2016-11-30', '2016-12-31', '2017-01-31', '2017-02-28',
'2017-03-31'],
dtype='datetime64[ns]', freq='M')
To get dataframe
index = pd.date_range('November 2016', 'April 2017', freq = 'M')
df = pd.DataFrame(index = index)
pd.Series(pd.date_range('2016-11', '2017-4', freq='M').strftime('%Y-%m')) \
.str.split('-', expand=True) \
.rename(columns={0: 'year', 1: 'month'})
year month
0 2016 11
1 2016 12
2 2017 01
3 2017 02
4 2017 03
You can use a combination of pd.to_datetime and pd.date_range.
import pandas as pd
start = 'November 2016'
end = 'March 2017'
s = pd.Series(pd.date_range(*(pd.to_datetime([start, end]) \
+ pd.offsets.MonthEnd()), freq='1M'))
Construct a dataframe using the .dt accessor attributes.
df = pd.DataFrame({'year' : s.dt.year, 'month' : s.dt.month})
df
month year
0 11 2016
1 12 2016
2 1 2017
3 2 2017
4 3 2017
I have a very large dataframe in which one of the columns, ['date'], datetime (dtype is string still) is formatted as below.. sometimes it is displayed as hh:mm:ss and sometimes as h:mm:ss (with hours 9 and earlier)
Tue Mar 1 9:23:58 2016
Tue Mar 1 9:29:04 2016
Tue Mar 1 9:42:22 2016
Tue Mar 1 09:43:50 2016
pd.to_datetime() won't work when I'm trying to convert the string into datetime format so I was hoping to find some help in getting 0's in front of the time where missing.
Any help is greatly appreciated!
import pandas as pd
date_stngs = ('Tue Mar 1 9:23:58 2016','Tue Mar 1 9:29:04 2016','Tue Mar 1 9:42:22 2016','Tue Mar 1 09:43:50 2016')
a = pd.Series([pd.to_datetime(date) for date in date_stngs])
print a
output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
time = df[0].str.split(' ').str.get(3).str.split('').str.get(0).str.strip().str[:8]
year = df[0].str.split('--').str.get(0).str[-5:].str.strip()
daynmonth = df[0].str[:10].str.strip()
df_1['date'] = daynmonth + ' ' +year + ' ' + time
df_1['date'] = pd.to_datetime(df_1['date'])
Found this to work myself when rearranging the order
Assuming you have a one column DataFrame with strings as above and column name is 0 then the following will split the strings by space and then take the third string and zero-fill it with zfill
Assuming starting df
0
0 Tue Mar 1 9:23:58 2016
1 Tue Mar 1 9:29:04 2016
2 Tue Mar 1 9:42:22 2016
3 Tue Mar 1 09:43:50 2016
df1 = df[0].str.split(expand=True)
df1[3] = df1[3].str.zfill(8)
pd.to_datetime(df1.apply(lambda x: ' '.join(x.tolist()), axis=1))
Output
0 2016-03-01 09:23:58
1 2016-03-01 09:29:04
2 2016-03-01 09:42:22
3 2016-03-01 09:43:50
dtype: datetime64[ns]