Getting a ValueError on merging two dataframes in pandas - python

I tried to merge two dataframes using pandas, but this is the error that I get:
ValueError: You are trying to merge on datetime64[ns] and datetime64[ns, UTC] columns. If you wish to proceed you should use pd.concat
I have tried different solutions found online but nothing works. The code was provided to me and it seems to work on other PCs, but not on my computer.
This is my code:
import sys
import os
from datetime import datetime
import numpy as np
import pandas as pd
# --------------------------------------------------------------------
# -- price, consumption and production --
# --------------------------------------------------------------------
fn = '../data/np_data.csv'
if os.path.isfile(fn):
    df_data = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- temp. data --
# --------------------------------------------------------------------
fn = '../data/temp.csv'
if os.path.isfile(fn):
    dtemp = pd.read_csv(fn, header=[0], parse_dates=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- price data --
# -- first date: 2014-01-13 --
# -- last date: 2020-02-01 --
# --------------------------------------------------------------------
fn = '../data/eprice.csv'
if os.path.isfile(fn):
    eprice = pd.read_csv(fn, header=[0])
else:
    sys.exit('Could not open data file {}'.format(fn))
# --------------------------------------------------------------------
# -- combine dataframes (and save as CSV file) --
# --------------------------------------------------------------------
#
df = df_data.merge(dtemp, on='time', how='left')  # This is where I get the error.
print(df.info())
print(eprice.info())
#
# add eprice
df = df.merge(eprice, on='date', how='left')
#
# eprice is only available on trading days
# fill in missing values; the last observation is carried forward
df = df.fillna(method='ffill')
#
# keep only the relevant time period
df = df[df.date > '2014-01-23']
df = df[df.date < '2020-02-01']
df.to_csv('../data/my_data.csv',index=False)
The imported datasets look normal, with the expected number of columns and observations. The pandas version I have is 1.0.3.
Edit:
This is the output (df) when I first merge df_data and dtemp:
time price_sys price_no1 ... temp_no3 temp_no4 temp_no5
0 2014-01-23 00:00:00+00:00 32.08 32.08 ... NaN NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 31.60 ... -2.5 -8.7 2.5
2 2014-01-24 00:00:00+00:00 30.96 31.02 ... -2.5 -8.7 2.5
3 2014-01-24 00:00:00+00:00 30.84 30.79 ... -2.5 -8.7 2.5
4 2014-01-24 00:00:00+00:00 31.58 31.10 ... -2.5 -8.7 2.5
[5 rows x 25 columns]
This is the output for eprice before I merge:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
This is what happens when I merge df and eprice:
<bound method NDFrame.head of date gas price oil price coal price carbon price
0 2014-01-24 00:00:00 66.00 107.88 79.42 6.89
1 2014-01-27 00:00:00 64.20 106.69 79.43 7.04
2 2014-01-28 00:00:00 63.75 107.41 79.29 7.20
3 2014-01-29 00:00:00 63.20 107.85 78.52 7.21
4 2014-01-30 00:00:00 62.60 107.95 78.18 7.46
... ... ... ... ...
1608 2020-03-25 00:00:00 22.30 27.39 67.81 17.51
1609 2020-03-26 00:00:00 21.55 26.34 70.35 17.35
1610 2020-03-27 00:00:00 18.90 24.93 72.46 16.39
1611 2020-03-30 00:00:00 19.20 22.76 71.63 17.06
1612 2020-03-31 00:00:00 18.00 22.74 71.13 17.68
[1613 rows x 5 columns]>
time price_sys ... coal price carbon price
0 2014-01-23 00:00:00+00:00 32.08 ... NaN NaN
1 2014-01-24 00:00:00+00:00 31.56 ... NaN NaN
2 2014-01-24 00:00:00+00:00 30.96 ... NaN NaN
3 2014-01-24 00:00:00+00:00 30.84 ... NaN NaN
4 2014-01-24 00:00:00+00:00 31.58 ... NaN NaN
[5 rows x 29 columns]

Try doing df['time'] = pd.to_datetime(df['time'], utc=True) on both time columns before joining (or rather, just the one without UTC needs to go through this!).
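A minimal sketch of how that might look with the code above, assuming dtemp['time'] is the naive datetime64[ns] column and df_data['time'] is the tz-aware one (swap them if it is the other way around):
# Make both 'time' columns timezone-aware UTC before merging.
# Naive timestamps are assumed to already represent UTC here;
# if they are local times, localize/convert them accordingly.
df_data['time'] = pd.to_datetime(df_data['time'], utc=True)
dtemp['time'] = pd.to_datetime(dtemp['time'], utc=True)
df = df_data.merge(dtemp, on='time', how='left')
The opposite direction also works: dropping the timezone from the aware column, e.g. df_data['time'] = df_data['time'].dt.tz_localize(None), makes both columns naive instead.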

Related

How to calculate the mean of a dataframe column in python

I have this dataframe and I want to calculate the mean temperature for each day:
Dates Temp
13 2019-08-02 24.5
20 2019-08-02 24.3
27 2019-08-03 24.1
34 2019-08-03 23.7
41 2019-08-04 23.6
I use this code that seemed good to me:
df.groupby('Dates', as_index=False)['Temp'].mean()
But the final result is this, which is not the output I want, as I would like the mean temperature for each day of the year:
Dates Temp
0 2019-08-02 24.4
1 2019-08-03 23.9
2 2019-08-04 23.6
Any idea?
If the data has the same year, use date_range with Series.reindex:
df['Dates'] = pd.to_datetime(df['Dates'])
y = df['Dates'].dt.year.min()
r = pd.date_range(f'{y}-01-01', f'{y}-12-31', name='Dates')
df1 = df.groupby('Dates')['Temp'].mean().reindex(r).reset_index()
print (df1)
Dates Temp
0 2019-01-01 NaN
1 2019-01-02 NaN
2 2019-01-03 NaN
3 2019-01-04 NaN
4 2019-01-05 NaN
.. ... ...
360 2019-12-27 NaN
361 2019-12-28 NaN
362 2019-12-29 NaN
363 2019-12-30 NaN
364 2019-12-31 NaN
[365 rows x 2 columns]
If there are multiple years:
y1, y2 = df['Dates'].dt.year.min(), df['Dates'].dt.year.max()
r = pd.date_range(f'{y1}-01-01', f'{y2}-12-31')
df.groupby('Dates')['Temp'].mean().reindex(r).reset_index()

how to use pandas resample method?

I want to resample a pandas datetime series using the resample method, but I don't understand the output I get.
I was expecting a '5s' resampling, but I'm getting 17,460,145 rows from an original dataframe of 100 rows. What is the correct use of resample?
import numpy as np
import pandas as pd
def random_dates(start, end, n=100):
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2022-01-01')
end = pd.to_datetime('2023-01-01')
rd=random_dates(start, end)
clas = np.random.choice(['A','B','C'],size=100)
value = np.random.randint(0,100,size=100)
df = pd.DataFrame.from_dict({'ts': rd, 'cl': clas, 'vl': value}).set_index('ts').sort_index()
df
Out[48]:
cl vl
ts
2022-01-04 17:25:10 B 27
2022-01-06 19:17:35 C 34
2022-01-17 22:55:25 B 1
2022-01-23 00:33:25 A 20
2022-01-27 18:26:56 A 55
.. ..
2022-12-14 07:46:50 C 22
2022-12-18 02:33:52 C 52
2022-12-22 17:35:10 A 52
2022-12-28 04:55:20 A 57
2022-12-29 03:19:00 A 60
[100 rows x 2 columns]
df.groupby(by='cl').resample('5s').mean()
Out[49]:
vl
cl ts
A 2022-01-23 00:33:25 20.0
2022-01-23 00:33:30 NaN
2022-01-23 00:33:35 NaN
2022-01-23 00:33:40 NaN
2022-01-23 00:33:45 NaN
...
C 2022-12-18 02:33:30 NaN
2022-12-18 02:33:35 NaN
2022-12-18 02:33:40 NaN
2022-12-18 02:33:45 NaN
2022-12-18 02:33:50 52.0
[17460145 rows x 1 columns]
Use pd.Grouper. With groupby(...).resample('5s'), each group is upsampled so that a row exists for every empty 5-second bin between its first and last timestamp, which is why you get millions of rows; grouping with pd.Grouper only keeps the bins that actually contain data:
>>> df.groupby(['cl', pd.Grouper(freq='5s')]).mean()
vl
cl ts
A 2022-01-22 11:53:30 31.0
2022-02-01 21:24:55 60.0
2022-03-20 06:01:05 24.0
2022-04-03 00:04:05 55.0
2022-04-03 06:30:10 81.0
... ...
C 2022-11-23 23:17:20 92.0
2022-11-25 07:07:45 27.0
2022-12-07 00:18:05 88.0
2022-12-25 10:37:25 77.0
2022-12-28 14:29:25 33.0
[100 rows x 1 columns]

Filtering Pandas dataframe on thousands of conditions

I currently have a list of tuples that look like this:
time_constraints = [
('001', '01/01/2020 10:00 AM', '01/01/2020 11:00 AM'),
('001', '01/03/2020 05:00 AM', '01/03/2020 06:00 AM'),
...
('999', '01/07/2020 07:00 AM', '01/07/2020 08:00 AM')
]
where:
- each tuple contains an id, lower_bound, and upper_bound
- none of the time frames overlap for a given id
- len(time_constraints) can be on the order of 10^4 to 10^5.
My goal is to quickly and efficiently filter a relatively large (millions of rows) Pandas dataframe (df) to include only the rows that match on the id column and fall between the specified lower_bound and upper_bound times (inclusive).
My current plan is to do this:
import pandas as pd
output = []
for i, lower, upper in time_constraints:
    indices = list(df.loc[(df['id'] == i) & (df['timestamp'] >= lower) & (df['timestamp'] <= upper)].index)
    output.extend(indices)

output_df = df.loc[df.index.isin(output)].copy()
However, using a for-loop isn't ideal. I was wondering if there was a better solution (ideally vectorized) using Pandas or NumPy arrays that would be faster.
Edited:
Here are some sample rows of df:
id  timestamp
1   01/01/2020 9:56 AM
1   01/01/2020 10:32 AM
1   01/01/2020 10:36 AM
2   01/01/2020 9:42 AM
2   01/01/2020 9:57 AM
2   01/01/2020 10:02 AM
I already answered a similar case.
To test, I used 100,000 constraints (tc) and 5,000,000 records (df).
Is this what you expect?
>>> df
id timestamp
0 565 2020-08-16 05:40:55
1 477 2020-04-05 22:21:40
2 299 2020-02-22 04:54:34
3 108 2020-08-17 23:54:02
4 041 2020-09-10 10:01:31
... ... ...
4999995 892 2020-12-27 16:16:35
4999996 373 2020-08-29 05:44:34
4999997 659 2020-05-23 20:48:15
4999998 858 2020-09-08 22:58:20
4999999 710 2020-04-10 08:03:14
[5000000 rows x 2 columns]
>>> tc
id lower_bound upper_bound
0 000 2020-01-01 00:00:00 2020-01-04 14:00:00
1 000 2020-01-04 15:00:00 2020-01-08 05:00:00
2 000 2020-01-08 06:00:00 2020-01-11 20:00:00
3 000 2020-01-11 21:00:00 2020-01-15 11:00:00
4 000 2020-01-15 12:00:00 2020-01-19 02:00:00
... ... ... ...
99995 999 2020-12-10 09:00:00 2020-12-13 23:00:00
99996 999 2020-12-14 00:00:00 2020-12-17 14:00:00
99997 999 2020-12-17 15:00:00 2020-12-21 05:00:00
99998 999 2020-12-21 06:00:00 2020-12-24 20:00:00
99999 999 2020-12-24 21:00:00 2020-12-28 11:00:00
[100000 rows x 3 columns]
# from tqdm import tqdm
from itertools import chain
# df = pd.DataFrame(data, columns=['id', 'timestamp'])
tc = pd.DataFrame(time_constraints, columns=['id', 'lower_bound', 'upper_bound'])
g1 = df.groupby('id')
g2 = tc.groupby('id')
indexes = []
# for id_ in tqdm(tc['id'].unique()):
for id_ in tc['id'].unique():
    df1 = g1.get_group(id_)
    df2 = g2.get_group(id_)
    ii = pd.IntervalIndex.from_tuples(list(zip(df2['lower_bound'],
                                               df2['upper_bound'])),
                                      closed='both')
    indexes.append(pd.cut(df1['timestamp'], bins=ii).dropna().index)

out = df.loc[chain.from_iterable(indexes)]
Performance:
100%|█████████████████████████████████████████████████| 1000/1000 [00:17<00:00, 58.40it/s]
Output result:
>>> out
id timestamp
1326 000 2020-11-10 05:51:00
1685 000 2020-10-07 03:12:48
2151 000 2020-05-08 11:11:18
2246 000 2020-07-06 07:36:57
3995 000 2020-02-02 04:39:11
... ... ...
4996406 999 2020-02-19 15:27:06
4996684 999 2020-02-05 11:13:56
4997408 999 2020-07-09 09:31:31
4997896 999 2020-04-10 03:26:13
4999674 999 2020-04-21 22:57:04
[4942976 rows x 2 columns] # 57024 records filtered
You can use boolean indexing, like this:
output_df = df[pd.Series(list(zip(df['id'],
                                  df['lower_bound'],
                                  df['upper_bound']))).isin(time_constraints)]
The zip function creates tuples from each column, which are then compared with your list of tuples. pd.Series is used to create a boolean series.

Extract date from pandas.core.series.Series in pandas dataframe columns

To download German bank holidays via a web API and convert the JSON data into a pandas dataframe, I use the following code (Python 3):
import datetime
import requests
import pandas as pd
now = datetime.datetime.now()
year = now.year
URL ='https://feiertage-api.de/api/?jahr='+ str(year)
r = requests.get(URL)
df = pd.DataFrame(r.json())
The goal is a pandas dataframe looking like (picture = section of the dataframe):
The problem: the "columns" are pandas.core.series.Series objects, and I cannot figure out how to extract the date using various versions of
df['BW'].str.split(", ", n = 0, expand = True)
See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html
Please, can anyone help me to turn df into a "proper" dataframe with columns that only contain dates?
One approach would be to do df.applymap(lambda x: '' if pd.isna(x) else x['datum']):
In [21]: df.applymap(lambda x: '' if pd.isna(x) else x['datum'])
Out[21]:
BW BY BE BB HB ... SN ST SH TH NATIONAL
1. Weihnachtstag 2019-12-25 2019-12-25 2019-12-25 2019-12-25 2019-12-25 ... 2019-12-25 2019-12-25 2019-12-25 2019-12-25 2019-12-25
2. Weihnachtstag 2019-12-26 2019-12-26 2019-12-26 2019-12-26 2019-12-26 ... 2019-12-26 2019-12-26 2019-12-26 2019-12-26 2019-12-26
Allerheiligen 2019-11-01 2019-11-01 ...
Augsburger Friedensfest 2019-08-08 ...
Buß- und Bettag 2019-11-20 ... 2019-11-20
Christi Himmelfahrt 2019-05-30 2019-05-30 2019-05-30 2019-05-30 2019-05-30 ... 2019-05-30 2019-05-30 2019-05-30 2019-05-30 2019-05-30
Frauentag 2019-03-08 ...
Fronleichnam 2019-06-20 2019-06-20 ... 2019-06-20 2019-06-20
Gründonnerstag 2019-04-18 ...
Heilige Drei Könige 2019-01-06 2019-01-06 ... 2019-01-06
Karfreitag 2019-04-19 2019-04-19 2019-04-19 2019-04-19 2019-04-19 ... 2019-04-19 2019-04-19 2019-04-19 2019-04-19 2019-04-19
Mariä Himmelfahrt 2019-08-15 ...
Neujahrstag 2019-01-01 2019-01-01 2019-01-01 2019-01-01 2019-01-01 ... 2019-01-01 2019-01-01 2019-01-01 2019-01-01 2019-01-01
Ostermontag 2019-04-22 2019-04-22 2019-04-22 2019-04-22 2019-04-22 ... 2019-04-22 2019-04-22 2019-04-22 2019-04-22 2019-04-22
Ostersonntag 2019-04-21 ...
Pfingstmontag 2019-06-10 2019-06-10 2019-06-10 2019-06-10 2019-06-10 ... 2019-06-10 2019-06-10 2019-06-10 2019-06-10 2019-06-10
Pfingstsonntag 2019-06-09 ...
Reformationstag 2019-10-31 2019-10-31 2019-10-31 ... 2019-10-31 2019-10-31 2019-10-31 2019-10-31
Tag der Arbeit 2019-05-01 2019-05-01 2019-05-01 2019-05-01 2019-05-01 ... 2019-05-01 2019-05-01 2019-05-01 2019-05-01 2019-05-01
Tag der Deutschen Einheit 2019-10-03 2019-10-03 2019-10-03 2019-10-03 2019-10-03 ... 2019-10-03 2019-10-03 2019-10-03 2019-10-03 2019-10-03
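If actual datetime values are preferred to the date strings above, a possible follow-up (assuming the applymap result is stored back in df) is to convert each column afterwards, turning the empty strings into NaT:
df = df.applymap(lambda x: '' if pd.isna(x) else x['datum'])
df = df.apply(pd.to_datetime, errors='coerce')  # '' becomes NaT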
You could try to fix the shape of the input (i.e. the JSON response) before constructing the data frame and then reshape as needed.
Example:
import datetime
import requests
import pandas as pd
now = datetime.datetime.now()
year = now.year
URL ='https://feiertage-api.de/api/?jahr='+ str(year)
r = requests.get(URL)
df = pd.DataFrame(
    [(k1, k2, k3, v3)
     for k1, v1 in r.json().items()
     for k2, v2 in v1.items()
     for k3, v3 in v2.items()]
)
df.head()
# Outputs:
0 1 2 3
0 BW Neujahrstag datum 2019-01-01
1 BW Neujahrstag hinweis
2 BW Heilige Drei Könige datum 2019-01-06
3 BW Heilige Drei Könige hinweis
4 BW Gründonnerstag datum 2019-04-18
# it is easier to see what is happening if we
# fix the column names
df.columns = ['State', 'Holiday', 'value_type', 'value']
pivoted = df[df.value_type == 'datum'].set_index(['Holiday', 'State']).value.unstack(-1)
pivoted.head()
# Outputs:
State BB BE BW ... SN ST TH
Holiday ...
1. Weihnachtstag 2019-12-25 2019-12-25 2019-12-25 ... 2019-12-25 2019-12-25 2019-12-25
2. Weihnachtstag 2019-12-26 2019-12-26 2019-12-26 ... 2019-12-26 2019-12-26 2019-12-26
Allerheiligen NaN NaN 2019-11-01 ... NaN NaN NaN
Augsburger Friedensfest NaN NaN NaN ... NaN NaN NaN
Buß- und Bettag NaN NaN NaN ... 2019-11-20 NaN NaN
[5 rows x 17 columns]

parsing data in excel file to create data frame

I am analyzing data from an Excel file.
I want to create a data frame by parsing the data from Excel using Python.
The data in my Excel file looks as follows:
The first row, highlighted in yellow, contains the match, which will be one of the columns in the data frame that I want to create.
In fact, the second and fourth rows are the names of the columns that I want to create in the new data frame.
The third and fifth rows are the values of each column.
The sample here is only for one match; I have multiple matches in the Excel file.
I want to create a data frame that contains the column Match and all the names in blue in the file.
I have attached a sample file that contains multiple matches.
Download the file here.
My expected data frame is
Match 1-0 2-0 2-1 3-0 3-1 3-2 4-0 4-1 4-2 4-3.......
MOL Vivi -vs- Chelsea 14 42 20 170 85 85 225 225 225 .....
Can anyone advise me how to parse the Excel data and convert it to a data frame?
Thanks,
Zep
Use:
import pandas as pd
from datetime import datetime
df = pd.read_excel('test_match.xlsx')
# mask to check for a-z in the HOME -vs- AWAY column
m1 = df['HOME -vs- AWAY'].str.contains('[a-z]', na=False)
#create index by matches
df.index = df['HOME -vs- AWAY'].where(m1).ffill()
df.index.name = 'Match'
#remove same index and HOME -vs- AWAY column rows
df = df[df.index != df['HOME -vs- AWAY']].copy()
#test if datetime or string
m2 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, datetime))
m3 = df['HOME -vs- AWAY'].apply(lambda x: isinstance(x, str))
# select next rows and set new column names
df1 = df[m2.shift().fillna(False)]
df1.columns = df[m2].iloc[0]
#also remove only NaNs columns
df2 = df[m3.shift().fillna(False)].dropna(axis=1, how='all')
df2.columns = df[m3].iloc[0].dropna()
#join together
df = pd.concat([df1, df2], axis=1).astype(float).reset_index().rename_axis(None, axis=1)
print (df.head())
Match 2000-01-01 00:00:00 2000-02-01 00:00:00 \
0 MOL Vidi -vs- Chelsea 14.00 42.00
1 Lazio -vs- Eintracht Frankfurt 8.57 11.55
2 Sevilla -vs- FC Krasnodar 7.87 6.63
3 Villarreal -vs- Spartak Moscow 7.43 7.03
4 Rennes -vs- FC Astana 4.95 6.38
2018-02-01 00:00:00 2000-03-01 00:00:00 2018-03-01 00:00:00 \
0 20.00 170.00 85.00
1 7.87 23.80 15.55
2 7.87 8.72 8.65
3 7.07 10.00 9.43
4 7.33 12.00 13.20
2018-03-02 00:00:00 2000-04-01 00:00:00 2018-04-01 00:00:00 \
0 85.0 225.00 225.00
1 21.3 64.30 42.00
2 25.9 14.80 14.65
3 23.9 19.35 17.65
4 38.1 31.50 34.10
2018-04-02 00:00:00 ... 0-1 0-2 2018-01-02 00:00:00 \
0 225.0 ... 5.6 6.80 7.00
1 55.7 ... 11.0 19.05 10.45
2 38.1 ... 28.0 79.60 29.20
3 38.4 ... 20.9 58.50 22.70
4 81.4 ... 12.9 42.80 22.70
0-3 2018-01-03 00:00:00 2018-02-03 00:00:00 0-4 \
0 12.5 12.0 32.0 30.0
1 48.4 27.4 29.8 167.3
2 223.0 110.0 85.4 227.5
3 203.5 87.6 73.4 225.5
4 201.7 97.6 103.6 225.5
2018-01-04 00:00:00 2018-02-04 00:00:00 2018-03-04 00:00:00
0 29.0 60.0 220.0
1 91.8 102.5 168.3
2 227.5 227.5 227.5
3 225.5 225.5 225.5
4 225.5 225.5 225.5
[5 rows x 27 columns]
