Merge two data frames on three columns in Python

I have two data frames and I would like to merge them on the three columns Date, Latitude and Longitude. The resulting df should include all columns.
df1:
Date Latitude Longitude LST
0 2019-01-01 66.33 17.100 -8.010004
1 2019-01-09 66.33 17.100 -6.675005
2 2019-01-17 66.33 17.100 -21.845003
3 2019-01-25 66.33 17.100 -26.940004
4 2019-02-02 66.33 17.100 -23.035009
... ... ... ... ...
and df2:
Station_Number Date Latitude Longitude Elevation Value
0 CA002100636 2019-01-01 69.5667 -138.9167 1.0 -18.300000
1 CA002100636 2019-01-09 69.5667 -138.9167 1.0 -26.871429
2 CA002100636 2019-01-17 69.5667 -138.9167 1.0 -19.885714
3 CA002100636 2019-01-25 69.5667 -138.9167 1.0 -17.737500
4 CA002100636 2019-02-02 69.5667 -138.9167 1.0 -13.787500
... ... ... ... ... ... ...
I have tried LST_1 = pd.merge(df1, df2, how='inner'), but merging that way I lose several data points that are present in both data frames.

I am not sure if you want to merge on a specific column; if so, you need to pick one with overlapping identifiers, for instance the "Date" column.
df_ = pd.merge(df1, df2, on="Date")
print(df_)
Date Latitude_x Longitude_x ... Longitude_y Elevation Value
0 01.01.2019 66.33 17.1 ... -138.9167 1.0 -18.300000
1 09.01.2019 66.33 17.1 ... -138.9167 1.0 -26.871429
2 17.01.2019 66.33 17.1 ... -138.9167 1.0 -19.885714
3 25.01.2019 66.33 17.1 ... -138.9167 1.0 -17.737500
4 02.02.2019 66.33 17.1 ... -138.9167 1.0 -13.787500
[5 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5 non-null object
1 Latitude_x 5 non-null float64
2 Longitude_x 5 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Latitude_y 5 non-null int64
6 Longitude_y 5 non-null int64
7 Elevation 5 non-null float64
8 Value 5 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 400.0+ bytes
Since both frames contain the same column names, pandas appends the suffixes _x and _y to Latitude and Longitude.
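If the default suffixes are unhelpful, pd.merge accepts a suffixes argument; a small sketch (the suffix names here are just an example):
# Override the default _x/_y suffixes for overlapping column names.
df_ = pd.merge(df1, df2, on="Date", suffixes=("_sat", "_station"))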
If you want all the columns and each row's data is independent of the others, you can use pd.concat. However, this will create some NaN values where a column exists in only one frame.
df_1 = pd.concat([df1, df2])
print(df_1)
Date Latitude Longitude ... Station_Number Elevation Value
0 01.01.2019 66.33 17.1 ... NaN NaN NaN
1 09.01.2019 66.33 17.1 ... NaN NaN NaN
2 17.01.2019 66.33 17.1 ... NaN NaN NaN
3 25.01.2019 66.33 17.1 ... NaN NaN NaN
4 02.02.2019 66.33 17.1 ... NaN NaN NaN
0 01.01.2019 69.56 -138.9167 ... CA002100636 1.0 -18.300000
1 09.01.2019 69.56 -138.9167 ... CA002100636 1.0 -26.871429
2 17.01.2019 69.56 -138.9167 ... CA002100636 1.0 -19.885714
3 25.01.2019 69.56 -138.9167 ... CA002100636 1.0 -17.737500
4 02.02.2019 69.56 -138.9167 ... CA002100636 1.0 -13.787500
df_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 4
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 10 non-null object
1 Latitude 10 non-null float64
2 Longitude 10 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Elevation 5 non-null float64
6 Value 5 non-null object
dtypes: float64(3), object(4)
memory usage: 640.0+ bytes
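If the goal is to keep every row from both frames while matching on all three shared columns, an outer merge on the full key is worth trying; a sketch using the column names above:
# Unmatched rows are kept and filled with NaN in the other frame's columns.
LST_1 = pd.merge(df1, df2, how="outer", on=["Date", "Latitude", "Longitude"])
Note that matching on float columns such as Latitude and Longitude only succeeds on exact values; rounding both frames first (e.g. with .round(4)) may be needed.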

Related

How does the pandas groupby() function make a difference in this code?

import pandas as pd

data = {'Company': ['GOOG', 'MSFT', 'FB', 'GOOG', 'MSFT', 'FB'],
        'Dates': ["1970-01-01 01:00:00", "1970-01-01 01:00:02",
                  "1970-01-01 01:00:03", "1970-01-01 01:00:04",
                  "1970-01-01 01:00:05", "1970-01-01 01:00:06"]}
df = pd.DataFrame(data)
df["Dates"] = pd.to_datetime(df["Dates"])
df.Dates.diff().dt.total_seconds() / 3600
This code gives me output
0 NaN
1 0.000556
2 0.000278
3 0.000278
4 0.000278
5 0.000278
Name: Dates, dtype: float64
and
df.groupby("Company").Sales.diff().dt.total_seconds()/3600
this gives me output
0 NaN
1 NaN
2 NaN
3 0.001111
4 0.000833
5 0.000833
Name: Dates, dtype: float64
Can you explain what the groupby function does here?
You get three NaN values because there are three different company names in df: groupby splits the dataframe into three groups, applies diff to each of them, and concatenates the results back.
Detail:
df["Dates"] = pd.to_datetime(df["Dates"])

for x, y in df.groupby('Company'):
    print(y)
    print(y['Dates'].diff().dt.total_seconds())
Company Dates
2 FB 1970-01-01 01:00:03
5 FB 1970-01-01 01:00:06
2 NaN
5 3.0
Name: Dates, dtype: float64
Company Dates
0 GOOG 1970-01-01 01:00:00
3 GOOG 1970-01-01 01:00:04
0 NaN
3 4.0
Name: Dates, dtype: float64
Company Dates
1 MSFT 1970-01-01 01:00:02
4 MSFT 1970-01-01 01:00:05
1 NaN
4 3.0
Name: Dates, dtype: float64
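To make the split-apply-combine explicit, the grouped diff can be reproduced by hand; a sketch assuming the df above:
# Diff each group separately, then concatenate and restore the row order;
# the result matches df.groupby("Company").Dates.diff().
parts = [y["Dates"].diff() for _, y in df.groupby("Company")]
print(pd.concat(parts).sort_index().dt.total_seconds() / 3600)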

Datetime fails when setting astype, date mangled

I am importing a CSV of 20 variables and 1500 records. There are 5 date columns in UK date format dd/mm/yyyy, which import as str.
I need to be able to subtract one date from another. These are hospital admissions; I need to subtract the admission date from the discharge date to get the length of stay.
I have had a number of problems.
To illustrate, I have used 2 columns.
import pandas as pd
import numpy as np
from datetime import datetime

df = pd.read_csv("/Users........csv", usecols=['ADMIDATE', 'DISDATE'])
df
ADMIDATE DISDATE
0 04/02/2018 07/02/2018
1 25/07/2017 1801-01-01
2 28/06/2017 01/07/2017
3 22/06/2017 1801-01-01
4 11/12/2017 15/12/2017
... ... ...
1503 25/01/2019 27/01/2019
1504 31/08/2018 1801-01-01
1505 20/09/2018 05/11/2018
1506 28/09/2018 1801-01-01
1507 21/02/2019 24/02/2019
1508 rows × 2 columns
I removed about 100 records with a DISDATE of 1801-01-01; these are likely bad data from the patient still being in hospital when the data was collected.
To convert the dates to datetime, I used .astype('datetime64[ns]'), because I didn't know how to use pd.to_datetime on multiple columns.
df[['ADMIDATE', 'DISDATE']] = df[['ADMIDATE', 'DISDATE']].astype('datetime64[ns]')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null int64
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](2), int64(1)
memory usage: 32.9 KB
So, the conversion appears to have worked.
However on examining the data, ADMIDATE has become yyyy-mm-dd and DISDATE yyyy-dd-mm.
df.head(20)
Unnamed: 0 ADMIDATE DISDATE
0 0 2018-04-02 2018-07-02
1 2 2017-06-28 2017-01-07
2 4 2017-11-12 2017-12-15
3 5 2017-09-04 2017-12-04
4 6 2017-05-30 2017-01-06
5 7 2017-02-08 2017-07-08
6 8 2017-11-17 2017-11-18
7 9 2018-03-14 2018-03-20
8 10 2017-04-26 2017-03-05
9 11 2017-05-16 2017-05-17
10 12 2018-01-17 2018-01-19
11 13 2017-12-18 2017-12-20
12 14 2017-02-10 2017-04-10
13 16 2017-03-30 2017-07-04
14 17 2017-01-12 2017-12-18
15 18 2017-12-07 2017-07-14
16 19 2017-05-04 2017-08-04
17 20 2017-10-30 2017-01-11
18 21 2017-06-19 2017-06-22
19 22 2017-04-05 2017-08-05
So when I subtract ADMIDATE from DISDATE I get negative values.
df['DISDATE'] - df['ADMIDATE']
0 91 days
1 -172 days
2 33 days
3 91 days
4 -144 days
...
1394 188 days
1395 -291 days
1396 2 days
1397 -132 days
1398 3 days
Length: 1399, dtype: timedelta64[ns]
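A minimal sketch of what is going wrong here (value taken from the data above): the default parser is month-first, so UK-format strings get flipped:
import pandas as pd

s = pd.Series(['04/02/2018'])              # 4 February 2018 in UK format
print(s.astype('datetime64[ns]'))          # parsed month-first: 2018-04-02
print(pd.to_datetime(s, dayfirst=True))    # parsed day-first: 2018-02-04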
I would like a method that works on all my date columns, keeps the UK format and allows me to do basic operations on the date fields.
After the suggestion from @code-different below, which seems very sensible:
for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
The format is unchanged despite dayfirst=True.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1399 entries, 0 to 1398
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 1399 non-null datetime64[ns]
1 ADMIDATE 1399 non-null datetime64[ns]
2 DISDATE 1391 non-null datetime64[ns]
dtypes: datetime64[ns](3)
memory usage: 32.9 KB
df.head()
Unnamed: 0 ADMIDATE DISDATE
0 1970-01-01 00:00:00.000000000 2018-04-02 2018-07-02
1 1970-01-01 00:00:00.000000002 2017-06-28 2017-01-07
2 1970-01-01 00:00:00.000000004 2017-11-12 2017-12-15
3 1970-01-01 00:00:00.000000005 2017-09-04 2017-12-04
4 1970-01-01 00:00:00.000000006 2017-05-30 2017-01-06
I have also tried format='%d%m%Y' and still the year comes first. Would datetime.strptime be any good?
Just tell pandas.to_datetime to use a specific and adequate format, e.g.:
import pandas as pd
import numpy as np
df = pd.DataFrame({'ADMIDATE': ['04/02/2018', '25/07/2017',
                                '28/06/2017', '22/06/2017', '11/12/2017'],
                   'DISDATE': ['07/02/2018', '1801-01-01',
                               '01/07/2017', '1801-01-01', '15/12/2017']}
                  ).replace({'1801-01-01': np.datetime64('NaT')})

for col in ['ADMIDATE', 'DISDATE']:
    df[col] = pd.to_datetime(df[col], format='%d/%m/%Y')
# df
# ADMIDATE DISDATE
# 0 2018-02-04 2018-02-07
# 1 2017-07-25 NaT
# 2 2017-06-28 2017-07-01
# 3 2017-06-22 NaT
# 4 2017-12-11 2017-12-15
# Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 ADMIDATE 5 non-null datetime64[ns]
# 1 DISDATE 3 non-null datetime64[ns]
# dtypes: datetime64[ns](2)
Note: replace '1801-01-01' with np.datetime64('NaT') so you don't have to ignore errors when calling pd.to_datetime.
to_datetime is the function you want. It does not support multiple columns, so you simply loop over the columns one by one. The strings are in UK format (day-first), so you tell to_datetime that:
df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE']).replace({'1801-01-01': pd.NA})

for col in df.columns:
    df[col] = pd.to_datetime(df[col], dayfirst=True, errors='coerce')
astype('datetime64[ns]') is too inflexible for what you need.
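If you would rather parse at read time, read_csv can do the conversion directly; a sketch (the path is a placeholder, and the 1801-01-01 sentinels still need replacing afterwards):
df = pd.read_csv('/path/to/file.csv', usecols=['ADMIDATE', 'DISDATE'],
                 parse_dates=['ADMIDATE', 'DISDATE'], dayfirst=True)
# Length of stay in whole days; negative values would flag parsing problems.
los = (df['DISDATE'] - df['ADMIDATE']).dt.days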

Pandas Merge returning only null values

I am trying to use pandas to merge a product's packaging information with each order record for that product. The data frame information is below.
BreakerOrders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774010 entries, 0 to 3774009
Data columns (total 2 columns):
Material object
Quantity float64
dtypes: float64(1), object(1)
memory usage: 86.4+ MB
manh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1381 entries, 0 to 1380
Data columns (total 4 columns):
Material 1381 non-null object
SUBPACK_QTY 202 non-null float64
PACK_QTY 591 non-null float64
PALLET_QTY 809 non-null float64
dtypes: float64(3), object(1)
memory usage: 43.2+ KB
When attempting the merge using the code below, I get the following table with all NaN values for packaging quantities.
BreakerOrders.merge(manh,how='left',on='Material')
Material Quantity SUBPACK_QTY PACK_QTY PALLET_QTY
HOM230CP 5.0 NaN NaN NaN
QO115 20.0 NaN NaN NaN
QO2020CP 20.0 NaN NaN NaN
QO220CP 50.0 NaN NaN NaN
HOM115CP 50.0 NaN NaN NaN
HOM120 100.0 NaN NaN NaN
I was having the same issue and was able to solve it by just flipping the DataFrames. So instead of:
df2 = df.merge(df1)
try
df2 = df1.merge(df)
Looks silly, but it solved my issue.
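If flipping the frames does not help, the usual culprit is a mismatch in the join key rather than argument order; a diagnostic sketch, assuming the frames above:
# Stray whitespace or differing dtypes in the key column produce all-NaN merges.
BreakerOrders['Material'] = BreakerOrders['Material'].astype(str).str.strip()
manh['Material'] = manh['Material'].astype(str).str.strip()
# Share of order keys that actually appear in the packaging table:
print(BreakerOrders['Material'].isin(manh['Material']).mean())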

Pandas import excel export HDF5

I am working with pandas and PyTables. I begin by importing a table from Excel containing columns of integers and floats, as well as other columns containing strings and even tuples. The Excel import offers a limited number of options and, unfortunately, unlike the CSV import process, datatypes must be converted from their inferred types after the import and cannot be specified during it.
That being said, all non-numeric data is apparently imported as unicode text, which is incompatible with a later export to HDF5. Is there a simple way to convert all unicode columns (as well as all column headings) to an HDF5-compatible string format?
MORE DETAILS:
>>> metaFrame.head()
ProjectName Company ContactName \
LocationID
935 PCS Petaluma High School Site Testco Test Name
937 PCS Casa Grande High School Testco Test Name
3465 FUSD Fowler High School Testco Test Name
3466 FUSD Sutter Middle School Testco Test Name
3467 FUSD Fremont Elementary School Testco Test Name
Contactemail \
LocationID
935 test.address@email.com
937 test.address@email.com
3465 test.address@email.com
3466 test.address@email.com
3467 test.address@email.com
Link Systemsize(kW) \
LocationID
935 https://internal.testco.com/locations/935/syst... NaN
937 https://internal.testco.com/locations/937/syst... 675.39
3465 https://internal.testco.com/locations/3465/sys... 384.30
3466 https://internal.testco.com/locations/3466/sys... 198.90
3467 https://internal.testco.com/locations/3467/sys... 35.10
SystemCheckStartdate SystemCheckActive \
LocationID
935 2013-10-01 00:00:00 True
937 2013-10-01 00:00:00 True
3465 2013-10-01 00:00:00 True
3466 2013-10-01 00:00:00 True
3467 2013-10-01 00:00:00 True
YTDProductionPriortostartdate NumberofInverters/cktsmonitored \
LocationID
935 NaN NaN
937 NaN NaN
3465 NaN NaN
3466 NaN NaN
3467 NaN NaN
InverterMfg InverterModel \
LocationID
935 PV Powered : PVP260KW NaN
937 PV Powered : PVP260KW NaN
3465 Advanced Energy Industries : Solaron 333kW (31... NaN
3466 PV Powered : PVP260KW NaN
3467 PV Powered : PVP35KW-480 NaN
InverterCECefficiency ModuleMfg Modulemodel \
LocationID
935 97.0 NaN NaN
937 97.0 NaN NaN
3465 97.5 NaN NaN
3466 97.0 NaN NaN
3467 96.0 NaN NaN
Moduleirradiancefactor Moduleirradiancefactorslope \
LocationID
935 NaN NaN
937 NaN NaN
3465 NaN NaN
3466 NaN NaN
3467 NaN NaN
StraightLineIntercept ModuleTemp-PwrDerate MeterDK
LocationID
935 NaN 0.005 3291 ...
937 NaN 0.005 11548 ...
3465 NaN 0.005 19248 ...
3466 NaN 0.005 15846 ...
3467 NaN 0.005 15847 ...
[5 rows x 27 columns]
>>> metaFrame.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 935 to 5844
Data columns (total 27 columns):
ProjectName 43 non-null values
Company 43 non-null values
ContactName 43 non-null values
Contactemail 43 non-null values
Link 43 non-null values
Systemsize(kW) 42 non-null values
SystemCheckStartdate 37 non-null values
SystemCheckActive 43 non-null values
YTDProductionPriortostartdate 0 non-null values
NumberofInverters/cktsmonitored 2 non-null values
InverterMfg 42 non-null values
InverterModel 8 non-null values
InverterCECefficiency 33 non-null values
ModuleMfg 0 non-null values
Modulemodel 0 non-null values
Moduleirradiancefactor 0 non-null values
Moduleirradiancefactorslope 0 non-null values
StraightLineIntercept 0 non-null values
ModuleTemp-PwrDerate 43 non-null values
MeterDK 43 non-null values
Genfieldname 43 non-null values
WSDK 34 non-null values
WSirradianceField 43 non-null values
WSCellTempField 43 non-null values
MiscDerate 1 non-null values
InverterDKs 37 non-null values
Invertergenfields 37 non-null values
dtypes: bool(1), datetime64[ns](1), float64(9), object(16)
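One approach, as a sketch (the file and key names are placeholders): cast every object column, plus the column labels, to plain str before exporting:
# Cast unicode/object columns and the column labels to str, then write HDF5.
for col in metaFrame.select_dtypes(include='object').columns:
    metaFrame[col] = metaFrame[col].astype(str)
metaFrame.columns = [str(c) for c in metaFrame.columns]
metaFrame.to_hdf('meta.h5', key='meta', format='table')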

Resample data with pandas

My initial data.head() gives:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 45993 entries, 2009-11-17 14:14:00 to 2012-12-16 14:26:00
Data columns (total 4 columns):
rain 45993 non-null values
temp 45993 non-null values
windspeed 45993 non-null values
dew_point 45993 non-null values
dtypes: float64(4)
2009-11-17 14:14:00 0 22.5 4.9 12.3
2009-11-17 14:44:00 0 22.3 6.1 12.1
2009-11-17 15:14:00 0 22.1 5.3 12.5
2009-11-17 15:44:00 0 22.2 3.3 12.0
2009-11-17 16:14:00 0 20.4 4.9 11.7
When I resample:
data = data.resample('30min', how='sum')
data.head()
I get:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 68861 entries, 2009-01-12 00:00:00 to 2012-12-16 14:00:00
Freq: 30T
Data columns (total 4 columns):
rain 45987 non-null values
temp 45987 non-null values
windspeed 45987 non-null values
dew_point 45987 non-null values
dtypes: float64(4)
2009-01-12 00:00:00 0 17.4 7.1 14.6
2009-01-12 00:30:00 0 17.4 7.2 14.7
2009-01-12 01:00:00 0 18.0 10.5 14.3
2009-01-12 01:30:00 0 18.3 9.6 14.2
2009-01-12 02:00:00 0 18.4 10.8 14.8
As you can see, my initial date is 2009-11-17 14:14:00, but the resampled data starts at 2009-01-12. Can anyone explain why that happens?
EDIT: I found the problem, so for others, the provided dataset had:
2009-01-12 00:00:00 value
2009-01-12 00:30:00 value
but the next line was:
2009-01-12 01:00 value
The missing :00 seconds caused all the confusion.
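As an aside, the how= argument has since been removed from pandas; the modern equivalent of the resample call above is:
data = data.resample('30min').sum()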
