Pandas import excel export HDF5 - python

I am working with pandas and PyTables. I begin by importing a table from Excel containing columns of integers and floats, alongside other columns containing strings and even tuples. The Excel import offers only a limited number of options and, unlike the CSV import, does not let datatypes be specified during the import; they must be converted from their inferred types afterwards.
That being said, all non-numeric data is apparently imported as unicode text, which is incompatible with a later export to HDF5. Is there a simple way to convert all unicode columns (as well as all column headings) to an HDF5-compatible string format?
MORE DETAILS:
>>> metaFrame.head()
ProjectName Company ContactName \
LocationID
935 PCS Petaluma High School Site Testco Test Name
937 PCS Casa Grande High School Testco Test Name
3465 FUSD Fowler High School Testco Test Name
3466 FUSD Sutter Middle School Testco Test Name
3467 FUSD Fremont Elementary School Testco Test Name
Contactemail \
LocationID
935 test.address#email.com
937 test.address#email.com
3465 test.address#email.com
3466 test.address#email.com
3467 test.address#email.com
Link Systemsize(kW) \
LocationID
935 https://internal.testco.com/locations/935/syst... NaN
937 https://internal.testco.com/locations/937/syst... 675.39
3465 https://internal.testco.com/locations/3465/sys... 384.30
3466 https://internal.testco.com/locations/3466/sys... 198.90
3467 https://internal.testco.com/locations/3467/sys... 35.10
SystemCheckStartdate SystemCheckActive \
LocationID
935 2013-10-01 00:00:00 True
937 2013-10-01 00:00:00 True
3465 2013-10-01 00:00:00 True
3466 2013-10-01 00:00:00 True
3467 2013-10-01 00:00:00 True
YTDProductionPriortostartdate NumberofInverters/cktsmonitored \
LocationID
935 NaN NaN
937 NaN NaN
3465 NaN NaN
3466 NaN NaN
3467 NaN NaN
InverterMfg InverterModel \
LocationID
935 PV Powered : PVP260KW NaN
937 PV Powered : PVP260KW NaN
3465 Advanced Energy Industries : Solaron 333kW (31... NaN
3466 PV Powered : PVP260KW NaN
3467 PV Powered : PVP35KW-480 NaN
InverterCECefficiency ModuleMfg Modulemodel \
LocationID
935 97.0 NaN NaN
937 97.0 NaN NaN
3465 97.5 NaN NaN
3466 97.0 NaN NaN
3467 96.0 NaN NaN
Moduleirradiancefactor Moduleirradiancefactorslope \
LocationID
935 NaN NaN
937 NaN NaN
3465 NaN NaN
3466 NaN NaN
3467 NaN NaN
StraightLineIntercept ModuleTemp-PwrDerate MeterDK
LocationID
935 NaN 0.005 3291 ...
937 NaN 0.005 11548 ...
3465 NaN 0.005 19248 ...
3466 NaN 0.005 15846 ...
3467 NaN 0.005 15847 ...
[5 rows x 27 columns]
>>> metaFrame.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43 entries, 935 to 5844
Data columns (total 27 columns):
ProjectName 43 non-null values
Company 43 non-null values
ContactName 43 non-null values
Contactemail 43 non-null values
Link 43 non-null values
Systemsize(kW) 42 non-null values
SystemCheckStartdate 37 non-null values
SystemCheckActive 43 non-null values
YTDProductionPriortostartdate 0 non-null values
NumberofInverters/cktsmonitored 2 non-null values
InverterMfg 42 non-null values
InverterModel 8 non-null values
InverterCECefficiency 33 non-null values
ModuleMfg 0 non-null values
Modulemodel 0 non-null values
Moduleirradiancefactor 0 non-null values
Moduleirradiancefactorslope 0 non-null values
StraightLineIntercept 0 non-null values
ModuleTemp-PwrDerate 43 non-null values
MeterDK 43 non-null values
Genfieldname 43 non-null values
WSDK 34 non-null values
WSirradianceField 43 non-null values
WSCellTempField 43 non-null values
MiscDerate 1 non-null values
InverterDKs 37 non-null values
Invertergenfields 37 non-null values
dtypes: bool(1), datetime64[ns](1), float64(9), object(16)
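One plausible approach, sketched below with no guarantees (the stringify helper is illustrative, and behaviour varies across pandas/Python versions): coerce every object-dtype column, and the column labels themselves, to plain strings before calling to_hdf.
def stringify(df):
    # Hedged sketch: coerce object (unicode) columns and column labels to str.
    out = df.copy()
    for col in out.select_dtypes(include=['object']).columns:
        out[col] = out[col].astype(str)          # unicode -> plain string
    out.columns = [str(c) for c in out.columns]  # stringify the headings too
    return out

# metaFrame = stringify(metaFrame)
# metaFrame.to_hdf('meta.h5', 'metaFrame', format='table')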

Related

Merge two data frames on three columns in Python

I have two data frames and I would like to merge them on the two columns Latitude and Longitude. The resulting df should include all columns.
df1:
Date Latitude Longitude LST
0 2019-01-01 66.33 17.100 -8.010004
1 2019-01-09 66.33 17.100 -6.675005
2 2019-01-17 66.33 17.100 -21.845003
3 2019-01-25 66.33 17.100 -26.940004
4 2019-02-02 66.33 17.100 -23.035009
... ... ... ... ...
and df2:
Station_Number Date Latitude Longitude Elevation Value
0 CA002100636 2019-01-01 69.5667 -138.9167 1.0 -18.300000
1 CA002100636 2019-01-09 69.5667 -138.9167 1.0 -26.871429
2 CA002100636 2019-01-17 69.5667 -138.9167 1.0 -19.885714
3 CA002100636 2019-01-25 69.5667 -138.9167 1.0 -17.737500
4 CA002100636 2019-02-02 69.5667 -138.9167 1.0 -13.787500
... ... ... ... ... ... ...
I have tried LST_1 = pd.merge(df1, df2, how='inner'), but merging that way I lose several data points that are present in both data frames.
I am not sure whether you want to merge on a specific column; if so, you need to pick one with overlapping identifiers, for instance the "Date" column.
df_ = pd.merge(df1, df2, on="Date")
print(df_)
Date Latitude_x Longitude_x ... Longitude_y Elevation Value
0 01.01.2019 66.33 17.1 ... -138.9167 1.0 -18.300000
1 09.01.2019 66.33 17.1 ... -138.9167 1.0 -26.871429
2 17.01.2019 66.33 17.1 ... -138.9167 1.0 -19.885714
3 25.01.2019 66.33 17.1 ... -138.9167 1.0 -17.737500
4 02.02.2019 66.33 17.1 ... -138.9167 1.0 -13.787500
[5 rows x 9 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5 entries, 0 to 4
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5 non-null object
1 Latitude_x 5 non-null float64
2 Longitude_x 5 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Latitude_y 5 non-null int64
6 Longitude_y 5 non-null int64
7 Elevation 5 non-null float64
8 Value 5 non-null object
dtypes: float64(3), int64(2), object(4)
memory usage: 400.0+ bytes
Because both frames share the column names Latitude and Longitude, pandas appends the _x and _y suffixes to disambiguate them.
If you want all the columns and each row's data is independent of the others, you can use pd.concat instead. This will, however, introduce NaN values wherever a column exists in only one frame.
df_1 = pd.concat([df1, df2])
print(df_1)
Date Latitude Longitude ... Station_Number Elevation Value
0 01.01.2019 66.33 17.1 ... NaN NaN NaN
1 09.01.2019 66.33 17.1 ... NaN NaN NaN
2 17.01.2019 66.33 17.1 ... NaN NaN NaN
3 25.01.2019 66.33 17.1 ... NaN NaN NaN
4 02.02.2019 66.33 17.1 ... NaN NaN NaN
0 01.01.2019 69.56 -138.9167 ... CA002100636 1.0 -18.300000
1 09.01.2019 69.56 -138.9167 ... CA002100636 1.0 -26.871429
2 17.01.2019 69.56 -138.9167 ... CA002100636 1.0 -19.885714
3 25.01.2019 69.56 -138.9167 ... CA002100636 1.0 -17.737500
4 02.02.2019 69.56 -138.9167 ... CA002100636 1.0 -13.787500
df_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 0 to 4
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 10 non-null object
1 Latitude 10 non-null float64
2 Longitude 10 non-null float64
3 LST 5 non-null object
4 Station_Number 5 non-null object
5 Elevation 5 non-null float64
6 Value 5 non-null object
dtypes: float64(3), object(4)
memory usage: 640.0+ bytes
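If the goal really is to key on Latitude and Longitude (or several columns at once), pd.merge accepts a list for on=. A hedged sketch follows; note that in the sample frames shown the coordinates never overlap, so how="outer" is used to keep unmatched rows rather than dropping them the way how="inner" does:
# Sketch: merge on multiple key columns at once; rows whose keys appear in
# only one frame are kept and get NaN in the other frame's columns.
merged = pd.merge(df1, df2, on=["Date", "Latitude", "Longitude"], how="outer")
print(merged.head())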

Pandas DataFrame mean of data in columns occurring before certain date time

I have a dataframe with clients' IDs and their expenses for 2014-2018. I want the mean of the expenses per ID, but only the years before a certain date may be taken into account (the 'Date' column dictates which year columns count toward the mean).
Example: for index 0 (ID 12) the date is '2016-03-08', so the mean should be taken over the columns 'y_2014' and 'y_2015', giving 111.0 for this index. If the date is too early (here, somewhere in 2014 or earlier), NaN should be returned (see indices 6 and 9).
Desired output:
y_2014 y_2015 y_2016 y_2017 y_2018 Date ID mean
0 100.0 122.0 324 632 NaN 2016-03-08 12 111.0
1 120.0 159.0 54 452 541.0 2015-04-09 96 120.0
2 NaN 164.0 687 165 245.0 2016-02-15 20 164.0
3 180.0 421.0 512 184 953.0 2018-05-01 73 324.25
4 110.0 654.0 913 173 103.0 2017-08-04 84 559.0
5 130.0 NaN 754 124 207.0 2016-07-03 26 130.0
6 170.0 256.0 843 97 806.0 2013-02-04 87 NaN
7 140.0 754.0 95 101 541.0 2016-06-08 64 447
8 80.0 985.0 184 84 90.0 2019-03-05 11 284.6
9 96.0 65.0 127 130 421.0 2014-05-14 34 NaN
The code below is what I tried:
import pandas as pd
import numpy as np


df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
 '2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})

print(df)

# the years from columns
data = df.filter(like='y_')
data_years = data.columns.str.extract('(\d+)')[0].astype(int)

# the years from Date
years = pd.to_datetime(df.Date).dt.year.values


df['mean'] = data.where(data_years<years[:,None]).mean(1)
print(df)
-> ValueError: Lengths must match to compare
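The error occurs because data_years is a length-5 pandas Series being compared against a (10, 1) array; pandas rejects the length mismatch, whereas plain NumPy arrays would broadcast to a (10, 5) mask. A minimal sketch of a fix along those lines:
# Sketch: compare raw NumPy arrays so shape (5,) broadcasts against (10, 1),
# producing a (10, 5) mask that matches the shape of `data`.
df['mean'] = data.where(data_years.values < years[:, None]).mean(axis=1)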
Solved: one possible answer to my own question
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [12,96,20,73,84,26,87,64,11,34],
"y_2014": [100,120,np.nan,180,110,130,170,140,80,96],
"y_2015": [122,159,164,421,654,np.nan,256,754,985,65],
"y_2016": [324,54,687,512,913,754,843,95,184,127],
"y_2017": [632,452,165,184,173,124,97,101,84,130],
"y_2018": [np.nan,541,245,953,103,207,806,541,90,421],
"Date": ['2016-03-08', '2015-04-09', '2016-02-15', '2018-05-01', '2017-08-04',
'2016-07-03', '2013-02-04', '2016-06-08', '2019-03-05', '2014-05-14']})
# Subset of the year columns used to calculate the mean
subset = df.loc[:, ['y_2014', 'y_2015', 'y_2016', 'y_2017', 'y_2018']]
# An expense value only counts once its year has passed, so each column is
# relabelled with the first day of the following year (e.g. 'y_2014' ->
# '2015-01-01') for comparison against the 'Date' column.
subset.columns = ['2015-01-01', '2016-01-01', '2017-01-01', '2018-01-01', '2019-01-01']

s = subset.columns.values < df.Date.values[:, None]
t = s.astype(float)
t[t == 0] = np.nan
df['mean'] = (subset * t).mean(1)

print(df)

# Additionally: gives the sum of the expenses before the date in the 'Date' column
df['sum'] = (subset * t).sum(1)

print(df)

Pandas Merge returning only null values

I am trying to use pandas to merge a product's packing information with each order record for that product. The data frame information is below.
BreakerOrders.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3774010 entries, 0 to 3774009
Data columns (total 2 columns):
Material object
Quantity float64
dtypes: float64(1), object(1)
memory usage: 86.4+ MB
manh.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1381 entries, 0 to 1380
Data columns (total 4 columns):
Material 1381 non-null object
SUBPACK_QTY 202 non-null float64
PACK_QTY 591 non-null float64
PALLET_QTY 809 non-null float64
dtypes: float64(3), object(1)
memory usage: 43.2+ KB
When attempting the merge using the code below, I get the following table with all NaN values for packaging quantities.
BreakerOrders.merge(manh,how='left',on='Material')
Material Quantity SUBPACK_QTY PACK_QTY PALLET_QTY
HOM230CP 5.0 NaN NaN NaN
QO115 20.0 NaN NaN NaN
QO2020CP 20.0 NaN NaN NaN
QO220CP 50.0 NaN NaN NaN
HOM115CP 50.0 NaN NaN NaN
HOM120 100.0 NaN NaN NaN
I was having the same issue and was able to solve it by just flipping the DataFrames. So instead of:
df2 = df.merge(df1)
try
df2 = df1.merge(df)
Looks silly, but it solved my issue.
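Flipping the frames should not by itself change whether keys match, so the underlying cause is often a dtype or whitespace mismatch in the key column. A hedged sketch of normalizing both sides before merging (assuming Material should be a stripped string in both frames):
# Sketch: all-NaN merge results usually mean the join keys never actually
# match, e.g. trailing whitespace or differing dtypes. Normalize both sides.
BreakerOrders['Material'] = BreakerOrders['Material'].astype(str).str.strip()
manh['Material'] = manh['Material'].astype(str).str.strip()
merged = BreakerOrders.merge(manh, how='left', on='Material')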

How do I get the minimal value of multiple timestamp columns

I want to get the minimal value across multiple timestamp columns. Here's my data:
Id timestamp 1 timestamp 2 timestamp 3
136 2014-08-27 17:29:23 2014-11-05 13:02:18 2014-09-29 22:26:34
245 2015-09-06 15:46:00 NaN NaN
257 2014-09-29 22:26:34 2016-02-02 17:59:54 NaN
258 NaN NaN NaN
480 2016-02-02 17:59:54 2014-11-05 13:02:18 NaN
I want a 'minimal' column holding the row-wise minimal timestamp:
Id minimal
136 2014-08-27 17:29:23
245 2015-09-06 15:46:00
257 2014-09-29 22:26:34
258 NaN
480 2014-11-05 13:02:18
Select all columns except the first with iloc, convert them to datetimes, take the minimum per row, and join the result back onto the Id column:
df = df[['Id']].join(df.iloc[:, 1:].apply(pd.to_datetime).min(axis=1).rename('min'))
print (df)
Id min
0 136 2014-08-27 17:29:23
1 245 2015-09-06 15:46:00
2 257 2014-09-29 22:26:34
3 258 NaT
4 480 2014-11-05 13:02:18
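An equivalent, more explicit sketch (assuming the column names match the sample exactly, spaces included):
import pandas as pd

# Sketch: convert the three timestamp columns and take the row-wise minimum;
# rows with no timestamps at all yield NaT.
ts = df[['timestamp 1', 'timestamp 2', 'timestamp 3']].apply(pd.to_datetime)
df['minimal'] = ts.min(axis=1)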

How to get a monthly mean in pandas using groupby

I have the next DataFrame:
data=pd.read_csv('anual.csv', parse_dates='Fecha', index_col=0)
data
DatetimeIndex: 290 entries, 2011-01-01 00:00:00 to 2011-12-31 00:00:00
Data columns (total 12 columns):
HR 290 non-null values
PreciAcu 290 non-null values
RadSolar 290 non-null values
T 290 non-null values
Presion 290 non-null values
Tmax 290 non-null values
HRmax 290 non-null values
Presionmax 290 non-null values
RadSolarmax 290 non-null values
Tmin 290 non-null values
HRmin 290 non-null values
Presionmin 290 non-null values
dtypes: float64(4), int64(8)
where:
data['HR']
Fecha
2011-01-01 37
2011-02-01 70
2011-03-01 62
2011-04-01 69
2011-05-01 72
2011-06-01 71
2011-07-01 71
2011-08-01 70
2011-09-01 40
...
2011-12-17 92
2011-12-18 78
2011-12-19 79
2011-12-20 76
2011-12-21 78
2011-12-22 80
2011-12-23 72
2011-12-24 70
In addition, some months are incomplete. My goal is to calculate each month's average from the daily data, which I do as follows:
monthly=data.resample('M', how='mean')
HR PreciAcu RadSolar T Presion Tmax
Fecha
2011-01-31 68.586207 3.744828 163.379310 17.496552 0 25.875862
2011-02-28 68.666667 1.966667 208.000000 18.854167 0 28.879167
2011-03-31 69.136364 3.495455 218.090909 20.986364 0 30.359091
2011-04-30 68.956522 1.913043 221.130435 22.165217 0 31.708696
2011-05-31 72.700000 0.550000 201.100000 18.900000 0 27.460000
2011-06-30 70.821429 6.050000 214.000000 23.032143 0 30.621429
2011-07-31 78.034483 5.810345 188.206897 21.503448 0 27.951724
2011-08-31 71.750000 1.028571 214.750000 22.439286 0 30.657143
2011-09-30 72.481481 0.185185 196.962963 21.714815 0 29.596296
2011-10-31 68.083333 1.770833 224.958333 18.683333 0 27.075000
2011-11-30 71.750000 0.812500 169.625000 18.925000 0 26.237500
2011-12-31 71.833333 0.160000 159.533333 17.260000 0 25.403333
The first error I find is in the precipitation column: all January observations are 0, yet an average of 3.74 is obtained for that month.
When I compute the averages in Excel and compare them with the results above, there is significant variation. For example, the mean of HR for February is:
mean HR using pandas = 68.66
mean HR using excel = 67
Another detail I found:
data['PreciAcu']['2011-01'].count()
returns 29, but it should be 31.
Am I doing something wrong?
How can I fix this error?
Annex csv file: https://www.dropbox.com/s/p5hl137bqm82j41/anual.csv
Your date column is being misinterpreted because it is in DD/MM/YYYY format. Pass dayfirst=True:
>>> df = pd.read_csv('anual.csv', parse_dates='Fecha', dayfirst=True, index_col=0, sep="\s+")
>>> df['PreciAcu']['2011-01'].count()
31
>>> df.resample("M").mean()
HR PreciAcu RadSolar T Presion Tmax \
Fecha
2011-01-31 68.774194 0.000000 162.354839 16.535484 0 25.393548
2011-02-28 67.000000 0.000000 193.481481 15.418519 0 25.696296
2011-03-31 59.083333 0.850000 254.541667 21.295833 0 32.325000
2011-04-30 61.200000 1.312000 260.640000 24.676000 0 34.760000
2011-05-31 NaN NaN NaN NaN NaN NaN
2011-06-30 68.428571 8.576190 236.619048 25.009524 0 32.028571
2011-07-31 81.518519 11.488889 185.407407 22.429630 0 27.681481
2011-08-31 76.451613 0.677419 219.645161 23.677419 0 30.719355
2011-09-30 77.533333 2.883333 196.100000 21.573333 0 28.723333
2011-10-31 73.120000 1.260000 196.280000 19.552000 0 27.636000
2011-11-30 71.277778 -79.333333 148.555556 18.250000 0 26.511111
2011-12-31 73.741935 0.067742 134.677419 15.687097 0 24.019355
HRmax Presionmax Tmin
Fecha
2011-01-31 92.709677 0 10.909677
2011-02-28 92.111111 0 8.325926
2011-03-31 89.291667 0 13.037500
2011-04-30 89.400000 0 17.328000
2011-05-31 NaN NaN NaN
2011-06-30 92.095238 0 19.761905
2011-07-31 97.185185 0 18.774074
2011-08-31 96.903226 0 18.670968
2011-09-30 97.200000 0 16.373333
2011-10-31 97.000000 0 13.412000
2011-11-30 94.555556 0 11.877778
2011-12-31 94.161290 0 10.070968
[12 rows x 9 columns]
(Note, though, that dayfirst=True isn't strict: dates it cannot parse day-first may be silently parsed month-first instead. Using an explicit date_parser may be safer; see the sketch below.)
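A hedged sketch of that stricter date_parser alternative (assuming the file's dates really are DD/MM/YYYY; adjust the format string otherwise):
import pandas as pd

# Sketch: parse with an explicit format so a day/month swap raises an error
# instead of silently producing wrong dates.
df = pd.read_csv(
    'anual.csv',
    sep=r'\s+',
    index_col=0,
    parse_dates=['Fecha'],
    date_parser=lambda s: pd.to_datetime(s, format='%d/%m/%Y'),
)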
