Merge pandas dataframe where columns don't match - python

I have different data in two dataframes. Both have two columns, Date and the data corresponding to those dates, but the dates are at different frequencies.
Dataframe1 contains data at the end of each month, so there is only one entry per month. Dataframe2 contains dates that are not evenly spaced; that is, it may contain multiple dates from the same month. For example, if Dataframe1 contains 30 Apr 2014, Dataframe2 may contain 01 May 2014, 07 May 2014 and 22 May 2014.
I want to merge the dataframes so that the data from Dataframe1 corresponding to 30 Apr 2014 appears against all dates in May 2014 in Dataframe2. Is there any simple way to do it?

My approach would be to add a month column to df1 that is the current month + 1 (you'll need to roll December over to January, which just means replacing 13 with 1). Then set the index of df1 to this 'month' column and call map on the month of df2's 'date' column; this performs a lookup and assigns the 'val' value:
In [70]:
# create df1
import datetime as dt
import pandas as pd
df1 = pd.DataFrame({'date':[dt.datetime(2014,4,30), dt.datetime(2014,5,31)], 'val':[12,3]})
df1
Out[70]:
date val
0 2014-04-30 12
1 2014-05-31 3
In [74]:
# create df2
df2 = pd.DataFrame({'date':['01 May 2014', '07 May 2014', '22 May 2014', '23 Jun 2014']})
df2['date'] = pd.to_datetime(df2['date'], format='%d %b %Y')
df2
Out[74]:
date
0 2014-05-01
1 2014-05-07
2 2014-05-22
3 2014-06-23
In [75]:
# add month column; December (12 + 1 = 13) rolls over to January
df1['month'] = df1['date'].dt.month + 1
df1['month'] = df1['month'].replace(13, 1)
df1
Out[75]:
date val month
0 2014-04-30 12 5
1 2014-05-31 3 6
In [76]:
# now call map on the month attribute and pass df1 with the index set to month
df2['val'] = df2['date'].dt.month.map(df1.set_index('month')['val'])
df2
Out[76]:
date val
0 2014-05-01 12
1 2014-05-07 12
2 2014-05-22 12
3 2014-06-23 3
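Another option, if your pandas version has it (0.19+), is pd.merge_asof, which matches each row of df2 to the most recent df1 date at or before it, so the December rollover needs no special handling. A sketch on the same sample data:

```python
import datetime as dt
import pandas as pd

df1 = pd.DataFrame({'date': [dt.datetime(2014, 4, 30), dt.datetime(2014, 5, 31)],
                    'val': [12, 3]})
df2 = pd.DataFrame({'date': pd.to_datetime(['01 May 2014', '07 May 2014',
                                            '22 May 2014', '23 Jun 2014'],
                                           format='%d %b %Y')})

# match each df2 date to the last df1 date that is <= it;
# both frames must be sorted on the 'on' column
merged = pd.merge_asof(df2.sort_values('date'), df1.sort_values('date'),
                       on='date', direction='backward')
print(merged)
```

This reproduces the output above: the April month-end value 12 is assigned to all three May dates and the May month-end value 3 to the June date.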

Related

Converting different columns to a datetime

The time in my csv file is divided into 4 columns (year, julian day, hour/minute (UTC) and second), and I wanted to convert it to a single column so that it looks like this: 14/11/2017 00:16:00.
Is there an easy way to do this?
A sample of the code is
cols = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
D14 = pd.read_csv(r'C:\Users\William Jacondino\Desktop\DadosTimeSeries\PIRATA-PROFILE\Dados FLUXO\Dados_brutos_copy-20220804T151840Z-002\Dados_brutos_copy\tm_data_2017_11_14_0016.dat', header=None, usecols=cols, names=["Year","Julian day", "Hour/minut (UTC)", "Second", "Bateria (V)", "PTemp (°C)", "Latitude", "Longitude", "Magnectic_Variation (arb)", "Altitude (m)", "Course (º)", "WS", "Nmbr_of_Satellites (arb)", "RAD", "Tar", "UR", "slp",], sep=',')
D14 = D14.loc[:, ["Year","Julian day", "Hour/minut (UTC)", "Second", "Latitude", "Longitude","WS", "RAD", "Tar", "UR", "slp"]]
My data looks like the attached csv sample (image not reproduced here).
The "Hour/minut (UTC)" column packs the hour and minute together: the last two digits are the minute and the leading digit(s) the hour.
The column starts at 016, which means hour 0 UTC and minute 16,
and goes up to hour 12 UTC and minute 03.
I wanted to unify everything into a single datetime column, so that the array runs from 14/11/2017 00:16:00 at the beginning to 14/11/2017 12:03:30 at the end.
But for hours 0 to 9, the hour part in the "Hour/minut (UTC)" column has only one digit, e.g. 9 instead of 09.
How do I create the array with the correct datetime?
You can create a new column that combines the data from the other columns.
For example, if you have a dataframe like so:
df = pd.DataFrame({'year': [2010, 2010, 2020], 'month': ['jan', 'feb', 'mar'],
                   'day': [1, 2, 3], 'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
# Print df:
year month day a b c
0 2010 jan 1 1 4 7
1 2010 feb 2 2 5 8
2 2020 mar 3 3 6 9
You can add a new column to the DataFrame, with values built from the year, month and day columns.
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)
Edit: In your situation, use df['Julian day'] instead of df.month, since your column names are different (and names containing spaces can only be accessed with bracket notation, not attribute access).
The new column will hold strings in whatever format you choose; you can substitute the dash '-' with a slash '/' or however you need to format the outcome. You just need to convert the integers into strings with .astype(str).
Output:
year month day a b c newColumn
0 2010 jan 1 1 4 7 2010-jan-1
1 2010 feb 2 2 5 8 2010-feb-2
2 2020 mar 3 3 6 9 2020-mar-3
After that you can work with the new column as with any other dataframe column.
If you only need it for data analysis, you can also use .groupby(), which groups rows by a column and performs aggregations on the groups.
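Since the concatenated column is plain strings, it can also be fed back through pd.to_datetime to get real datetimes in the target format. A minimal sketch on the sample frame above (column names taken from the example):

```python
import pandas as pd

df = pd.DataFrame({'year': [2010, 2010, 2020], 'month': ['jan', 'feb', 'mar'],
                   'day': [1, 2, 3]})
df['newColumn'] = df.year.astype(str) + '-' + df.month + '-' + df.day.astype(str)

# parse the concatenated strings into a real datetime column;
# %b matches abbreviated month names case-insensitively
df['dt'] = pd.to_datetime(df['newColumn'], format='%Y-%b-%d')

# render in the dd/mm/yyyy style asked for in the question
df['formatted'] = df['dt'].dt.strftime('%d/%m/%Y')
print(df['formatted'])
```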
If your dataframe looks like
import pandas as pd
df = pd.DataFrame({
    "year": [2017, 2017], "julian day": [318, 318], "hour/minut(utc)": [16, 16],
    "second": [0, 30],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
then you could use pd.to_datetime() and pd.to_timedelta() to do
df["datetime"] = (
    pd.to_datetime(df["year"].astype("str"), format="%Y")
    + pd.to_timedelta(df["julian day"] - 1, unit="days")
    + pd.to_timedelta(df["hour/minut(utc)"], unit="minutes")
    + pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
and get
year julian day hour/minut(utc) second datetime
0 2017 318 16 0 14/11/2017 00:16:00
1 2017 318 16 30 14/11/2017 00:16:30
The column datetime now contains strings. Remove the .dt.strftime("%d/%m/%Y %H:%M:%S") part at the end, if you want datetimes instead.
Regarding your comment: If I understand correctly, you could try the following:
df["hours_min"] = df["hour/minut(utc)"].astype("str").str.zfill(4)
df["hour"] = df["hours_min"].str[:2].astype("int")
df["minute"] = df["hours_min"].str[2:].astype("int")
df = df.drop(columns=["hours_min", "hour/minut(utc)"])
df["datetime"] = (
    pd.to_datetime(df["year"].astype("str"), format="%Y")
    + pd.to_timedelta(df["julian day"] - 1, unit="days")
    + pd.to_timedelta(df["hour"], unit="hours")
    + pd.to_timedelta(df["minute"], unit="minutes")
    + pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
Result for the sample dataframe df
df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})
year julian day hour/minut(utc) second
0 2017 318 16 0
1 2017 318 16 30
2 2018 10 234 1
3 2019 50 1201 2
would be
year julian day second hour minute datetime
0 2017 318 0 0 16 14/11/2017 00:16:00
1 2017 318 30 0 16 14/11/2017 00:16:30
2 2018 10 1 2 34 10/01/2018 02:34:01
3 2019 50 2 12 1 19/02/2019 12:01:02
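The zero-padding round-trip through strings can also be avoided with integer arithmetic: divmod splits the packed hour/minute value directly. A sketch on the same sample frame (an alternative to the zfill approach above, not from the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2017, 2017, 2018, 2019], "julian day": [318, 318, 10, 50],
    "hour/minut(utc)": [16, 16, 234, 1201], "second": [0, 30, 1, 2],
})

# divmod by 100 splits e.g. 1201 -> (12, 1) and 16 -> (0, 16),
# so single-digit hours need no special casing
df["hour"], df["minute"] = divmod(df["hour/minut(utc)"], 100)

df["datetime"] = (
    pd.to_datetime(df["year"].astype("str"), format="%Y")
    + pd.to_timedelta(df["julian day"] - 1, unit="days")
    + pd.to_timedelta(df["hour"], unit="hours")
    + pd.to_timedelta(df["minute"], unit="minutes")
    + pd.to_timedelta(df["second"], unit="seconds")
).dt.strftime("%d/%m/%Y %H:%M:%S")
print(df["datetime"])
```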

Sorting a date column in a pandas dataframe that is not in datetime format and has format mmm dd, yyyy

Hi all, I am working with a pandas dataframe that contains a date column. I would like to sort this column in ascending order, so that the most recent date is at the bottom of the dataframe. The problem I am running into is that the date column stores the dates in the following format:
"Nov 3, 2020"
How can I sort these dates? The advice I have found online is to convert the column to datetime, sort, and then convert back to this format. Is there a simpler way? I have tried
new_df.sort_values(by=["Date"], ascending=True)
where new_df is the dataframe, but this does not seem to work.
Any ideas on how can do this? I essentially want the output to have something like
Date
----
Oct 31, 2020
Nov 1, 2020
Nov 12,2020
.
.
.
I would reformat the date column first, then convert to datetime, and then sort:
dates = ['Nov 1, 2020','Nov 12,2020','Oct 31, 2020']
df = pd.DataFrame({'Date':dates, 'Col1':[2,3,1]})
# Date Col1
# 0 Nov 1, 2020 2
# 1 Nov 12,2020 3
# 2 Oct 31, 2020 1
df['Date'] = pd.to_datetime(df['Date'].apply(lambda x: "-".join(x.replace(',',' ').split())))
df = df.sort_values('Date')
# Date Col1
# 2 2020-10-31 1
# 0 2020-11-01 2
# 1 2020-11-12 3
# and if you want to get the dates back in their original format
df['Date'] = df['Date'].apply(lambda x: "{} {}, {}".format(x.month_name()[:3],x.day,x.year))
# Date Col1
# 2 Oct 31, 2020 1
# 0 Nov 1, 2020 2
# 1 Nov 12, 2020 3
Alternatively, on pandas >= 1.1.0 you can pass a key function to sort_values and skip the conversion entirely:
df.sort_values(by = "Date", key = pd.to_datetime)

Need to group the data using pandas based on months in the column data

I would like to group the data based on the month January and February. Here is a sample of the data set that I have.
Date Count
01.01.2019 1
01.02.2019 7
02.01.2019 4
03.01.2019 4
04.01.2019 1
04.02.2019 5
I want to group the data as follows, where total count is summed up of count based on month 1(Jan) and 2(Feb):
Month Total_Count
Jan 10
Feb 12
Cast to datetime, group by the dt.month_name and sum:
(df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
            .dt.month_name()
            .str[:3])['Count']
   .sum()
   .rename_axis('Month')
   .reset_index(name='Total_Count'))
Month Total_Count
0 Feb 12
1 Jan 10
To sort the index by month, we could instead do:
s = df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.month)['Count'].sum()
s.index = pd.to_datetime(s.index, format='%m').month_name().str[:3]
s.rename_axis('Month').reset_index(name='Total_Count')
Month Total_Count
0 Jan 10
1 Feb 12
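Another way to get calendar order without re-parsing the index (not from the answers above) is to make Month an ordered Categorical. A sketch on the sample data:

```python
import calendar
import pandas as pd

df = pd.DataFrame({'Date': ['01.01.2019', '01.02.2019', '02.01.2019',
                            '03.01.2019', '04.01.2019', '04.02.2019'],
                   'Count': [1, 7, 4, 4, 1, 5]})

out = (df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
                    .dt.month_name()
                    .str[:3])['Count']
         .sum()
         .rename_axis('Month')
         .reset_index(name='Total_Count'))

# an ordered Categorical makes Jan sort before Feb instead of alphabetically
out['Month'] = pd.Categorical(out['Month'],
                              categories=list(calendar.month_abbr[1:]),
                              ordered=True)
out = out.sort_values('Month').reset_index(drop=True)
print(out)
```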

Pivot and rename Pandas dataframe

I have a dataframe in the format
Date Datediff Cumulative_sum
01 January 2019 1 5
02 January 2019 1 7
02 January 2019 2 15
01 January 2019 2 8
01 January 2019 3 13
and I want to pivot the column Datediff from the dataframe such that the end result looks like
Index Day-1 Day-2 Day-3
01 January 2019 5 8 13
02 January 2019 7 15
I have used the pivot command such that
pt = (pd.pivot_table(df, index = "Date",
                     columns = "Datediff",
                     values = "Cumulative_sum")
        .reset_index()
        .set_index("Date"))
which returns the pivoted table
1 2 3
01 January 2019 5 8 13
02 January 2019 7 15
And I can then rename the columns using the loop
for column in pt:
    pt.rename(columns = {column : "Day-" + str(column)}, inplace = True)
which returns exactly what I want. However, I was wondering if there is a faster way to rename the columns when pivoting and get rid of the loop altogether.
Use DataFrame.add_prefix:
df.add_prefix('Day-')
In your solution:
pt = (pd.pivot_table(df, index = "Date",
                     columns = "Datediff",
                     values = "Cumulative_sum")
        .add_prefix('Day-'))
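Put together, a runnable sketch on sample data shaped like the question's frame:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['01 January 2019', '02 January 2019', '02 January 2019',
             '01 January 2019', '01 January 2019'],
    'Datediff': [1, 1, 2, 2, 3],
    'Cumulative_sum': [5, 7, 15, 8, 13],
})

# pivot and prefix the new columns in one chained expression,
# with no rename loop needed
pt = (pd.pivot_table(df, index='Date',
                     columns='Datediff',
                     values='Cumulative_sum')
        .add_prefix('Day-'))
print(pt)
```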

pandas rename: change values of index for a specific column only

I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the values within every column in the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
Use set_levels:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values,
    level=1)
print(df)
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr 1 60
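Alternatively, the rename call from the question can be fixed directly: DataFrame.rename accepts a level argument that restricts the index mapping to a single level. A sketch rebuilding the question's frame (Cost values taken from the sample):

```python
import calendar
import pandas as pd

# rebuild the MultiIndex frame from the question
idx = pd.MultiIndex.from_tuples(
    [(2016, 1, 10), (2016, 2, 11), (2017, 4, 1)],
    names=['Year', 'Month', 'ID'])
df = pd.DataFrame({'Cost': [40, 50, 60]}, index=idx)

# with level='Month', the mapping is applied only to the Month
# level, so ID values like 1 are left untouched
df = df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)},
               level='Month')
print(df)
```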
