The time in my CSV file is split across 4 columns (year, Julian day, hour/minute (UTC) and second), and I want to combine them into a single column that looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is:
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My array looks like that:
The file: csv file sample
In the "Hour/minut (UTC)" column, the first two digits are the hour and the last two digits are the minute.
The "Hour/minut (UTC)" column starts at 016, which means hour 0 UTC and minute 16,
and goes up to hour 12 UTC and minute 03.
I want to unify everything into a single datetime column, so that from the beginning to the end of the array I get:
1 - 2017
1412 - 14/11/2017 12:03:30
but from hour 0 to hour 9 the "Hour/minut (UTC)" column has only a single hour digit, like this:
9
instead of 09
How do I create the array with the correct datetime?
You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
df = pd.DataFrame(dict)
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with values built from the year, month and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation, use df['Julian day'] instead of df.month, since the column name is different. To understand more on why this is, read here
The data in the new column will be strings in whatever format you like. You can also replace the dash '-' with a slash '/' or any other separator you need. You just need to convert the integers to strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything as you would on a dataframe object.
If you only need it for data analysis, you can use the .groupby() method, which groups the rows and lets you run the analysis per group.
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
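As an alternative sketch (not part of the answer above), the four columns can also be zero-padded and concatenated into one string, then parsed in a single pd.to_datetime call using the %j day-of-year directive. This assumes the same column names as the sample dataframe above and that every row holds valid values; the intermediate name stamp is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019],
    "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201],
    "second": [0, 30, 1, 2],
})

# Build a fixed-width string YYYYJJJHHMMSS; zfill restores the leading
# zeros that were lost when the values were read as integers.
stamp = (
    df["year"].astype(str)
    + df["julian day"].astype(str).str.zfill(3)
    + df["hour/minut(utc)"].astype(str).str.zfill(4)
    + df["second"].astype(str).str.zfill(2)
)

# %j is the day of the year (001-366)
df["datetime"] = pd.to_datetime(stamp, format="%Y%j%H%M%S")
print(df["datetime"].dt.strftime("%d/%m/%Y %H:%M:%S"))
```

The fixed width matters here: without zfill, a value such as 16 in the hour/minute column could not be told apart from other digit groupings.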
I have year, month and date in three columns. I am concatenating them into one column and then trying to convert that column to YYYY/mm/dd format as follows:
dfyz_m_d['dt'] = '01'# to bring one date of each of the month
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].map(str) + dfyz_m_d['mon'].map(str) + dfyz_m_d['dt'].map(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y%m%d')
but for both month 1 (Jan) and month 10 (Oct) I get only Oct in the final outcome: the CalendarWeek column doesn't have any Jan. It retains all the records, but the Jan rows are also being formatted as Oct.
The issue is Jan is single digit numerically, so you end up with something like 2021101 which will be interpreted as Oct instead of Jan. Make sure your mon column is always converted to two digit months with leading zeros if needed using .zfill(2):
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.zfill(2) + dfyz_m_d['dt'].astype(str)
zfill example:
df = pd.DataFrame({'mon': [1,2,10]})
df.mon.astype(str).str.zfill(2)
0 01
1 02
2 10
Name: mon, dtype: object
I usually do
pd.to_datetime(df.mon,format='%m').dt.strftime('%m')
0 01
1 02
2 10
Name: mon, dtype: object
Also, if you name the columns correctly (note the names year, month and day), you can pass them straight to pd.to_datetime:
df['day'] = '01'
df['new'] = pd.to_datetime(df.rename(columns={'mon':'month'})).dt.strftime('%m/%d/%Y')
df
year mon day new
0 2020 1 1 01/01/2020
1 2020 1 1 01/01/2020
I like str.pad :)
dfyz_m_d['year'].astype(str) + dfyz_m_d['mon'].astype(str).str.pad(2, 'left', '0') + dfyz_m_d['dt'].astype(str)
It pads zeros on the left so that each string has length two. So 1 becomes 01, but 10 stays 10.
You should be able to use pandas.to_datetime with your input dataframe. You may need to rename your columns.
import pandas as pd
df = pd.DataFrame({'year': [2015, 2016],
'month': [2, 3],
'dt': [4, 5]})
print(pd.to_datetime(df.rename(columns={"dt": "day"})))
Output
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
You can add / between year, mon and dt and amend the format string to include it, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d')
Data Input
year mon dt
0 2021 1 01
1 2021 2 01
2 2021 10 01
3 2021 11 01
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021-01-01
1 2021 2 01 2021/2/01 2021-02-01
2 2021 10 01 2021/10/01 2021-10-01
3 2021 11 01 2021/11/01 2021-11-01
If you want the final output date format be YYYY/mm/dd, you can further use .dt.strftime after pd.to_datetime, as follows:
dfyz_m_d['dt'] = '01'
dfyz_m_d['CalendarWeek1'] = dfyz_m_d['year'].astype(str) + '/' + dfyz_m_d['mon'].astype(str) + '/' + dfyz_m_d['dt'].astype(str)
dfyz_m_d['CalendarWeek'] = pd.to_datetime(dfyz_m_d['CalendarWeek1'], format='%Y/%m/%d').dt.strftime('%Y/%m/%d')
Output
year mon dt CalendarWeek1 CalendarWeek
0 2021 1 01 2021/1/01 2021/01/01
1 2021 2 01 2021/2/01 2021/02/01
2 2021 10 01 2021/10/01 2021/10/01
3 2021 11 01 2021/11/01 2021/11/01
I have a dataframe (df):
year month ETP
0 2021 1 49.21
1 2021 2 34.20
2 2021 3 31.27
3 2021 4 29.18
4 2021 5 33.25
5 2021 6 24.70
I would like to add a column that gives me the number of working days for each row, excluding holidays and weekends (for a specific country, e.g. France or the US),
so the output will be :
year month ETP work_day
0 2021 1 49.21 20
1 2021 2 34.20 20
2 2021 3 31.27 21
3 2021 4 29.18 19
4 2021 5 33.25 20
5 2021 6 24.70 19
code :
import numpy as np
import pandas as pd
days = np.busday_count( '2021-01', '2021-06' )
df.insert(3, "work_day", [days])
and I got this error :
ValueError: Length of values does not match length of index
Any suggestions?
Thank you for your help
Assuming you are the one who will input the workdays, I suppose you can do it like this:
data = {'year': [2020, 2020, 2021, 2023, 2022],
'month': [1, 2, 3, 4, 6]}
df = pd.DataFrame(data)
df.insert(2, "work_day", [20,20,23,21,22])
Here 2 is the position of the new column (so that it is not simply appended at the end), work_day is its name, and the list holds the value for every row.
EDIT: With NumPy
import numpy as np
import pandas as pd
days = np.busday_count( '2021-02', '2021-03' )
data = {'year': [2021],
'month': ['february']}
df = pd.DataFrame(data)
df.insert(2, "work_day", [days])
With busday_count you specify the start and end dates between which to count the working days (the end date itself is not counted).
The result:
year month work_day
0 2021 february 20
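To get one count per row, as the question asks, a minimal sketch is to build the first day of each row's month and the first day of the following month, then pass both arrays to np.busday_count. The holidays array below is an illustrative assumption only (New Year's Day and Easter Monday 2021), not a complete calendar for any country, so the counts will differ from the ones in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "year": [2021] * 6,
    "month": [1, 2, 3, 4, 5, 6],
    "ETP": [49.21, 34.20, 31.27, 29.18, 33.25, 24.70],
})

# First day of each row's month, and first day of the following month
starts = pd.to_datetime(df[["year", "month"]].assign(day=1))
ends = starts + pd.offsets.MonthBegin(1)

# Illustrative holiday list only (assumption): replace with a full calendar
holidays = np.array(["2021-01-01", "2021-04-05"], dtype="datetime64[D]")

# busday_count works element-wise and excludes the end date itself
df["work_day"] = np.busday_count(
    starts.to_numpy().astype("datetime64[D]"),
    ends.to_numpy().astype("datetime64[D]"),
    holidays=holidays,
)
print(df)
```

Because the two date arrays have the same length as the dataframe, the assignment no longer raises the length mismatch error from the question.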
I would like to create two columns "Year" and "Month" from a Date column that contains different year and month arrangements. Some are YY-Mmm and the others are Mmm-YY.
import pandas as pd
dataSet = {
"Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
"Quantity": [3476, 20, 789, 409, 81, 640],
}
df = pd.DataFrame(dataSet, columns=["Date", "Quantity"])
My attempt is as follows:
Date1 = []
Date2 = []
for dt in df.Date:
    Date1.append(dt.split("-")[0])
    Date2.append(dt.split("-")[1])
Year = []
try:
    for yr in Date1:
        Year.append(int(yr.Date1))
except:
    for yr in Date2:
        Year.append(int(yr.Date2))
You can make use of the str.extract string method to split the date strings up. Since the year can precede or follow the month, we can get a bit creative and have Year1 and Year2 columns for either position. Then use np.where to create a single Year column that pulls from each of these other year columns.
For example:
import numpy as np
split_dates = df["Date"].str.extract(r"(?P<Year1>\d+)?-?(?P<Month>\w+)-?(?P<Year2>\d+)?")
split_dates["Year"] = np.where(
split_dates["Year1"].notna(),
split_dates["Year1"],
split_dates["Year2"],
)
split_dates = split_dates[["Year", "Month"]]
With result for split_dates:
Year Month
0 18 Jan
1 18 Jan
2 18 Feb
3 18 Feb
4 17 Oct
5 17 Oct
Then you can merge back with your original dataframe with pd.merge, like so:
pd.merge(df, split_dates, how="inner", left_index=True, right_index=True)
Which yields:
Date Quantity Year Month
0 18-Jan 3476 18 Jan
1 18-Jan 20 18 Jan
2 18-Feb 789 18 Feb
3 18-Feb 409 18 Feb
4 Oct-17 81 17 Oct
5 Oct-17 640 17 Oct
Thank you for your help. I managed to get it working with what I've learned so far, i.e. for loop, if-else and split() and with the help of another expert.
# Split the Date column and store it in an array
dA = []
for dP in df.Date:
    dA.append(dP.split("-"))
# Append month and year to respective lists based on if conditions
Month = []
Year = []
for moYr in dA:
    if len(moYr[0]) == 2:
        Month.append(moYr[1])
        Year.append(moYr[0])
    else:
        Month.append(moYr[0])
        Year.append(moYr[1])
This took me hours!
Try using Python datetime strptime(<date>, "%y-%b") on the date column to convert it to a Python datetime.
from datetime import datetime
def parse_dt(x):
    # Try the YY-Mmm layout first, then fall back to Mmm-YY
    try:
        return datetime.strptime(x, "%y-%b")
    except ValueError:
        return datetime.strptime(x, "%b-%y")
df['timestamp'] = df['Date'].apply(parse_dt)
df
Date Quantity timestamp
0 18-Jan 3476 2018-01-01
1 18-Jan 20 2018-01-01
2 18-Feb 789 2018-02-01
3 18-Feb 409 2018-02-01
4 Oct-17 81 2017-10-01
5 Oct-17 640 2017-10-01
Then you can just use .month and .year attributes, or if you prefer the month as its abbreviated form, use Python datetime.strftime('%b').
df['year'] = df.timestamp.apply(lambda x: x.year)
df['month'] = df.timestamp.apply(lambda x: x.strftime('%b'))
df
Date Quantity timestamp year month
0 18-Jan 3476 2018-01-01 2018 Jan
1 18-Jan 20 2018-01-01 2018 Jan
2 18-Feb 789 2018-02-01 2018 Feb
3 18-Feb 409 2018-02-01 2018 Feb
4 Oct-17 81 2017-10-01 2017 Oct
5 Oct-17 640 2017-10-01 2017 Oct
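The same two-format idea can be sketched without a Python-level function: parse the column once per layout with errors="coerce", so that non-matching rows become NaT, and fill the gaps of the first result with the second. This is only a sketch that assumes every value follows one of the two layouts and that month abbreviations are in English (the %b directive depends on the locale); the name ts is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["18-Jan", "18-Jan", "18-Feb", "18-Feb", "Oct-17", "Oct-17"],
    "Quantity": [3476, 20, 789, 409, 81, 640],
})

# Rows that do not match a layout become NaT instead of raising
yy_first = pd.to_datetime(df["Date"], format="%y-%b", errors="coerce")
mon_first = pd.to_datetime(df["Date"], format="%b-%y", errors="coerce")

# Take the YY-Mmm parse where it worked, otherwise the Mmm-YY parse
ts = yy_first.fillna(mon_first)

df["Year"] = ts.dt.year
df["Month"] = ts.dt.strftime("%b")
print(df)
```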
I have the following pandas dataframe:
               Cost
Year Month ID
2016 1     10    40
     2     11    50
2017 4     1     60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
               Cost
Year Month ID
2016 Jan   10    40
     Feb   11    50
2017 Apr   Jan   60  # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
Use set_levels on the index and assign the result back:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(
    df.index.levels[1].map(m),
    level=1)
print(df)
               Cost
Year Month ID
2016 Jan   10    40
     Feb   11    50
2017 Apr   1     60
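Another option, sketched here under the assumption that the index levels are named Year, Month and ID as in the question, is to keep the original rename call and restrict it to one level with the level argument of DataFrame.rename. The dataframe below is rebuilt from the question's printout for illustration.

```python
import calendar

import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=["Year", "Month", "ID"],
)
df = pd.DataFrame({"Cost": [40, 50, 60]}, index=index)

# Map 1..12 to 'Jan'..'Dec' (month_abbr follows the current locale)
month_names = {i: calendar.month_abbr[i] for i in range(1, 13)}

# level="Month" limits the relabelling to that index level only,
# so the ID value 1 is left untouched
df = df.rename(index=month_names, level="Month")
print(df)
```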
I have different data in two dataframes. Both have a Date column and a column of data corresponding to those dates. However, the dates in the two dataframes have different frequencies.
Dataframe1 contains data at the end of month. So there is only one entry for every month. Dataframe2 contains dates which are not separated evenly. That is it may contain multiple dates from same month. For example if Dataframe1 contains 30 Apr 2014, Dataframe2 may contain 01 May 2014, 07 May 2014 and 22 May 2014.
I want to merge the data frames in a way so that data from Dataframe1 corresponding to 30 Apr 2014 appears against all dates in May 2014 in Dataframe2. Is there any simple way to do it?
My approach would be to add a month column to df1 that holds the current month + 1 (you'll need to roll December over to January, which just means replacing 13 with 1). Then I'd set the index of df1 to this 'month' column and call map on the month of the 'date' column in df2; this performs a lookup and assigns the 'val' value:
In [70]:
# create df1
import datetime as dt
import pandas as pd
df1 = pd.DataFrame({'date':[dt.datetime(2014,4,30), dt.datetime(2014,5,31)], 'val':[12,3]})
df1
Out[70]:
date val
0 2014-04-30 12
1 2014-05-31 3
In [74]:
# create df2
df2 = pd.DataFrame({'date':['01 May 2014', '07 May 2014', '22 May 2014', '23 Jun 2014']})
df2['date'] = pd.to_datetime(df2['date'], format='%d %b %Y')
df2
Out[74]:
date
0 2014-05-01
1 2014-05-07
2 2014-05-22
3 2014-06-23
In [75]:
# add a month column holding the following month; wrap 13 (after December) back to 1
df1['month'] = df1['date'].dt.month+1
df1['month'] = df1['month'].replace(13,1)
df1
Out[75]:
date val month
0 2014-04-30 12 5
1 2014-05-31 3 6
In [76]:
# now call map on the month attribute and pass df1 with the index set to month
df2['val'] = df2['date'].dt.month.map(df1.set_index('month')['val'])
df2
Out[76]:
date val
0 2014-05-01 12
1 2014-05-07 12
2 2014-05-22 12
3 2014-06-23 3
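The month number alone does not distinguish years, so if the data spans more than one year, a year-aware variant may be safer. The sketch below is an assumption-laden alternative rather than the method above: it keys both frames on the first day of a month (the helper column key and the sample dates are made up for illustration) and joins them with a left merge.

```python
import pandas as pd

df1 = pd.DataFrame({
    "date": pd.to_datetime(["2014-04-30", "2014-12-31"]),
    "val": [12, 3],
})
df2 = pd.DataFrame({
    "date": pd.to_datetime(["2014-05-01", "2014-05-22", "2015-01-15"]),
})

# Key for df1: first day of the month AFTER each month-end date
df1["key"] = (df1["date"].dt.to_period("M") + 1).dt.to_timestamp()
# Key for df2: first day of each date's own month
df2["key"] = df2["date"].dt.to_period("M").dt.to_timestamp()

# A left merge keeps every df2 row, even when no df1 value exists for it
merged = (
    df2.merge(df1[["key", "val"]], on="key", how="left")
    .drop(columns="key")
)
print(merged)
```

Working with monthly periods handles the December to January rollover without any special case, because adding 1 to a 2014-12 period gives 2015-01.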