pandas rename: change values of index for a specific column only - python

I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?

Use set_levels. It returns a new index, so assign the result back to df.index:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values,
    level=1)
print(df)
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr 1 60
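As a possible alternative, DataFrame.rename accepts a level argument that restricts the mapping to a single level of a MultiIndex. The sketch below rebuilds the question's frame by hand; the construction code is an assumption, since the question only shows the printed output.

```python
import calendar
import pandas as pd

# Rebuild the example frame with a three-level index (Year, Month, ID).
idx = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=["Year", "Month", "ID"],
)
df = pd.DataFrame({"Cost": [40, 50, 60]}, index=idx)

# Map month numbers to abbreviated names, applied to the Month level only.
month_names = {i: calendar.month_abbr[i] for i in range(1, 13)}
renamed = df.rename(index=month_names, level="Month")
print(renamed)
```

Because the mapping is limited to the Month level, the ID value 1 in the last row is left untouched.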

Related

Converting different columns to a datetime

The time in my csv file is divided into 4 columns, (year, julian day, hour/minut(utc) and second), and I wanted to convert to a single column so that it looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
In the "Hour/minut (UTC)" column, the first two digits are the hour and the last two digits are the minute.
The values start at 016, which means hour 0 UTC and minute 16, and go up to hour 12 UTC and minute 03.
I wanted to unify everything into a single datetime column so from the beginning to the end of the array:
1 - 2017
1412 - 14/11/2017 12:03:30
However, for hours 0 to 9 the "Hour/minut (UTC)" column has only a single hour digit, like this:
9
instead of 09
How do I create the array with the correct datetime?
You can create a new column which also adds the data from other columns.
For example, if you have a dataframe like so:
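df = pd.DataFrame({'year': [2010, 2010, 2020], 'month': ['jan', 'feb', 'mar'], 'day': [1, 2, 3],
                   'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})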
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with the values built from the year, month and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation, use df['Julian day'] instead of df.month. The column name contains a space, so it cannot be accessed as an attribute.
The data in the new column will be strings in the format you chose. You can replace the dash '-' with a slash '/' or any other separator. You just need to convert the integer columns to strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can do anything as you would on a dataframe object.
If you only need it for data analysis, you can use .groupby(), which groups the rows and lets you aggregate each group.
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
"year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
"second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
pd.to_datetime(df["year"].astype("str"), format="%Y")
+ pd.to_timedelta(df["julian day"] - 1, unit="days")
+ pd.to_timedelta(df["hour"], unit="hours")
+ pd.to_timedelta(df["minute"], unit="minutes")
+ pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
"year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
"hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
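Another way to get the same result, sketched here under the assumption that the seconds are whole numbers, is to zero-pad every field to a fixed width and let pd.to_datetime parse the concatenated string. The %j directive stands for the day of the year. The variable name stamp is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019],
    "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201],
    "second": [0, 30, 1, 2],
})

# Zero-pad each field so the concatenated string has a fixed layout:
# 4-digit year, 3-digit day of year, 4-digit HHMM, 2-digit second.
stamp = (
    df["year"].astype(str)
    + df["julian day"].astype(str).str.zfill(3)
    + df["hour/minut(utc)"].astype(str).str.zfill(4)
    + df["second"].astype(str).str.zfill(2)
)

# %j is the day of the year (001-366).
df["datetime"] = pd.to_datetime(stamp, format="%Y%j%H%M%S")
print(df["datetime"].dt.strftime("%d/%m/%Y %H:%M:%S"))
```

This keeps the column as real datetimes and only formats it for display, which is usually easier to work with than strings.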

how to calculate number of working days with python

I have a dataframe (df):
year month ETP
0 2021 1 49.21
1 2021 2 34.20
2 2021 3 31.27
3 2021 4 29.18
4 2021 5 33.25
5 2021 6 24.70
I would like to add a column that gives the number of working days for each row, excluding holidays and weekends (for a specific country, e.g. France or the US),
so the output will be:
year month ETP work_day
0 2021 1 49.21 20
1 2021 2 34.20 20
2 2021 3 31.27 21
3 2021 4 29.18 19
4 2021 5 33.25 20
5 2021 6 24.70 19
code :
import numpy as np
import pandas as pd
days = np.busday_count( '2021-01', '2021-06' )
df.insert(3, "work_day", [days])
and I got this error :
ValueError: Length of values does not match length of index
Any suggestions?
Thank you for your help
Assuming you will enter the working days yourself, you can do it like this:
import pandas as pd
data = {'year': [2020, 2020, 2021, 2023, 2022],
        'month': [1, 2, 3, 4, 6]}
df = pd.DataFrame(data)
df.insert(2, "work_day", [20,20,23,21,22])
Here 2 is the position at which the new column is inserted (so it does not have to go at the end), work_day is its name, and the list holds one value per row.
EDIT: With NumPy
import numpy as np
import pandas as pd
days = np.busday_count( '2021-02', '2021-03' )
data = {'year': [2021],
'month': ['february']}
df = pd.DataFrame(data)
df.insert(2, "work_day", [days])
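With busday_count you specify the start date (inclusive) and the end date (exclusive) between which to count working days. By default it counts Monday to Friday and ignores holidays unless you pass them through the holidays argument.
The result: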
year month work_day
0 2021 february 20
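To fill the column for every row at once, busday_count can also be given arrays of start and end dates. The sketch below counts from the first day of each month up to the first day of the next month. The holidays list is a hand-written, illustrative sample and not an authoritative calendar, so replace it with the official dates for your country.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"year": [2021] * 6, "month": [1, 2, 3, 4, 5, 6]})

# First day of each month, and first day of the following month (exclusive end).
start = pd.to_datetime(df[["year", "month"]].assign(day=1))
end = start + pd.offsets.MonthBegin(1)

# Illustrative sample only: these dates are an assumption, not an official list.
holidays = np.array(
    ["2021-01-01", "2021-04-05", "2021-05-13", "2021-05-24"],
    dtype="datetime64[D]",
)

# busday_count works element-wise on datetime64[D] arrays.
df["work_day"] = np.busday_count(
    start.values.astype("datetime64[D]"),
    end.values.astype("datetime64[D]"),
    holidays=holidays,
)
print(df)
```

Holidays that fall on a weekend do not change the count, because weekends are already excluded.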

How to extract a portion of a dataframe column and create another column with that extraction

I have a Pandas dataframe like this:
id Month
1 Month 01
1 Month 05
2 Month 12
...
And I wanted to extract the value from the Month column and add that extraction to a new column Month_no and obtain this output:
id Month Month_no
1 Month 01 1
1 Month 05 5
2 Month 12 12
...
Assuming Month column has Month and number separated by a whitespace, you can use str.split:
df['Month_no'] = df['Month'].str.split().str[1].astype(int)
Example:
In [1168]: df
Out[1168]:
id Month
0 1 Month 01
1 1 Month 05
2 2 Month 12
In [1169]: df['Month_no'] = df['Month'].str.split().str[1].astype(int)
In [1170]: df
Out[1170]:
id Month Month_no
0 1 Month 01 1
1 1 Month 05 5
2 2 Month 12 12
Alternatively:
df['Month_no'] = df['Month'].str.strip('Month').astype(int)
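Note that str.strip('Month') removes any of the characters M, o, n, t and h from both ends of the string instead of the literal prefix. It works for this data, but it may misbehave on other values. A regex-based sketch that pulls out the digits directly, using the example data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "Month": ["Month 01", "Month 05", "Month 12"],
})

# Capture the first run of digits in each string and convert it to an integer.
df["Month_no"] = df["Month"].str.extract(r"(\d+)", expand=False).astype(int)
print(df)
```

With expand=False and a single capture group, str.extract returns a Series, so the result can be assigned straight to the new column.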

Need to group the data using pandas based on months in the column data

I would like to group the data based on the month January and February. Here is a sample of the data set that I have.
Date Count
01.01.2019 1
01.02.2019 7
02.01.2019 4
03.01.2019 4
04.01.2019 1
04.02.2019 5
I want to group the data as follows, where total count is summed up of count based on month 1(Jan) and 2(Feb):
Month Total_Count
Jan 10
Feb 12
Cast to datetime, group by the dt.month_name and sum:
(df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
.dt.month_name()
.str[:3])['Count']
.sum()
.rename_axis('Month')
.reset_index(name='Total_Count'))
Month Total_Count
0 Feb 12
1 Jan 10
To sort the index by month, we could instead do:
s = df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.month)['Count'].sum()
s.index = pd.to_datetime(s.index, format='%m').month_name().str[:3]
s.rename_axis('Month').reset_index(name='Total_Count')
Month Total_Count
0 Jan 10
1 Feb 12

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot command such that
pt = pd.pivot_table(df, index = "Date",
                    columns = "Datediff",
                    values = "Cumulative_sum") \
       .reset_index() \
       .set_index("Date")
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
And I can then rename the columns using the loop
for column in pt:
pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
columns = "Datediff",
values = "Cumulative_sum")
.add_prefix('Day-'))
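For reference, a self-contained sketch of the whole pipeline on the question's sample data. The rename_axis call is an optional extra that is not part of the answer above; it only clears the leftover "Datediff" label on the columns axis.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["01 January 2019", "02 January 2019", "02 January 2019",
             "01 January 2019", "01 January 2019"],
    "Datediff": [1, 1, 2, 2, 3],
    "Cumulative_sum": [5, 7, 15, 8, 13],
})

pt = (
    pd.pivot_table(df, index="Date", columns="Datediff", values="Cumulative_sum")
    .add_prefix("Day-")          # 1, 2, 3 -> Day-1, Day-2, Day-3
    .rename_axis(columns=None)   # drop the leftover "Datediff" columns label
)
print(pt)
```

Keep in mind that pivot_table aggregates with the mean by default, so the values come back as floats and the missing Day-3 entry for 02 January 2019 is NaN.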
