Python pandas append dataframe with array content to hdf file - python

How can I append a dataframe to another dataframe which is already saved in a file without loading it from the file? (Python 3.6 & Pandas 1.0.1)
Example:
import pandas as pd
data = [[['A01','A02'],'B0','C0'],[['A11','A12'],'B1','C1'],[['A21','A22'],'B2','C2']]
df = pd.DataFrame(data,columns=['A','B','C'])
data2 = [[['A31','A32'],'B3','C3'],[['A41','A42'],'B4','C4'],[['A51','A52'],'B5','C5']]
df2 = pd.DataFrame(data2,columns=['A','B','C'])
print(df.append(df2,ignore_index=True))
#version 1:
store = pd.HDFStore('test.h5','a')
store.append(key='foo',value=df)#, format='t', data_columns=True)
store.append(key='foo',value=df2)#, format='t', data_columns=True, append=True)
#version 2
df.to_hdf(path_or_buf='test.h5',key='foo',mode='w',format='t')
df2.to_hdf(path_or_buf='test.h5',key='foo',mode='a',append=True,format='t',data_columns=True)
#version 3
df.to_hdf(path_or_buf='test.h5',key='foo',mode='w',format='f')
df2.to_hdf(path_or_buf='test.h5',key='foo',mode='a',append=True,format='f',data_columns=True)
df3 = pd.read_hdf('test.h5',key='foo',mode='r')
print(df3)
version 1: TypeError: object of type 'int' has no len()
version 2: TypeError: object of type 'int' has no len()
version 3: ValueError: Can only append to Tables
A similar question was asked here, but quite a while ago. I tried it with an older pandas version, but that caused even more problems.
EDIT:
It seems that the issue is the arrays used as cell content. If I use only the Bs and Cs, like so, it works:
data = [['B0','C0'],['B1','C1'],['B2','C2']]
df = pd.DataFrame(data,columns=['B','C'])
data2 = [['B3','C3'],['B4','C4'],['B5','C5']]
df2 = pd.DataFrame(data2,columns=['B','C'])
Does anybody know a way to get this to work while still using arrays as content?
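One possible workaround (not from the original thread, just a sketch): the table format can only append columns whose cells are scalars, so the list column can be expanded into separate scalar columns before writing, for example:
import pandas as pd

def flatten(frame):
    # expand the list column 'A' into scalar columns A_0, A_1, ...
    # so every cell is a scalar and the frame can be appended in table format
    expanded = pd.DataFrame(frame['A'].tolist(), index=frame.index).add_prefix('A_')
    return pd.concat([expanded, frame.drop(columns='A')], axis=1)

flatten(df).to_hdf('test.h5', key='foo', mode='w', format='t')
flatten(df2).to_hdf('test.h5', key='foo', mode='a', append=True, format='t')
print(pd.read_hdf('test.h5', key='foo', mode='r'))
Reading it back gives one row per original row, with the list values split across the A_0 and A_1 columns.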

Related

for scorecardpy.woe_bin package in python I am getting "TypeError: unhashable type: 'numpy.ndarray'"

Python has the scorecardpy library for scorecard development, which is an alternative to the R package scorecard,
but while running woebin from scorecardpy as follows:
bins = sc.woebin(df_temp,y=target_var,positive='bad|1')
#df_temp is dataframe with all columns having 'float64' data type
#y is binary variable having data type 'int'
I was getting an error for a few variables:
TypeError: unhashable type: 'numpy.ndarray'
while others were getting binned properly.
I tried finding the difference between the variables where binning succeeded and those where it failed, based on a few parameters:
Null value
dtypes
shape
in order to understand the pattern, using the following code:
unhashable_nparray = []
successfully_binned = []
for j, i in enumerate(df3.columns):
    try:
        print(i)
        df_temp = pd.DataFrame()
        df_temp[i] = df3[i]
        df_temp[target_var] = y
        bins = sc.woebin(df_temp, y=target_var, positive='bad|1')
        successfully_binned.append(i)
        print(i, "---{a}---{b}----{c}--{d}---{z}----Success".format(z=df3[i].nunique(), a=df3[i].shape, b=df3[i].isna().sum(), c=type(df3[i]), d=df3[i].dtypes))
    except TypeError:
        unhashable_nparray.append(i)
        print(i, "---{a}---{b}----{c}--{d}---{z}----Fail".format(z=df3[i].nunique(), a=df3[i].shape, b=df3[i].isna().sum(), c=type(df3[i]), d=df3[i].dtypes))
but I could not find any pattern.
What could be the cause of this?
Here is the solution
This error has already been reported
and is being discussed here, but unfortunately the discussion is in Chinese, so it is difficult to read.
This error doesn't necessarily mean that your dataframe is unclean or has an unexpected datatype.
Please check your pandas version:
import pandas as pd
pd.__version__
If the output is > 1.4.0, that means you are using a more recent version of pandas than scorecardpy expects.
As of today, scorecardpy expects you to work with pandas version <= 1.3.5,
so please install pandas == 1.3.5.
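A small sketch of checking this programmatically (assuming the packaging library is installed, which it usually is alongside pip):
import pandas as pd
from packaging.version import Version

# scorecardpy currently expects pandas <= 1.3.5; warn if the installed version is newer
if Version(pd.__version__) > Version("1.3.5"):
    print("pandas", pd.__version__, "is newer than scorecardpy supports")
    print("downgrade with: pip install pandas==1.3.5")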

Python .index & .map( ) gives 'Series' object is not callable error

I have two Excel files I converted into dataframes.
DF1: contains columns 'JobKey' and 'Aircraft Numbers' (amongst other data)
DF2: contains columns 'JobKey' and 'Shortage' (amongst other data)
I want to create a column 'Short' in DF1 by mapping the JobKeys present in DF2 (effectively a VLOOKUP).
For both I set JobKey as the index:
#Import relevant libraries:
import pandas as pd
import numpy as np
DF1 = pd.read_excel('...')
DF2 = pd.read_excel('...')
DF1['Short'] = " "
DF1.set_index('JobKey', inplace = True)
DF2.set_index('JobKey', inplace = True)
Both load OK. I printed a sample of both using DF.head() and it looks fine. I want to use .index.map() as was done here:
https://towardsdatascience.com/vlookup-implementation-in-python-in-three-simple-steps-93b5a290fd72
DF1["Short"]=DF1.index.map(DF2["Shortage"])
However I get the error:
---------------------------------------------------------------------------
C:\ProgramData\Anaconda3\lib\site-packages\pandas\indexes\base.py in map(self, mapper)
2439 applied : array
2440 """
-> 2441 return self._arrmap(self.values, mapper)
2442
2443 def isin(self, values, level=None):
pandas\src\algos_common_helper.pxi in pandas.algos.arrmap_object (pandas\algos.c:46681)()
TypeError: 'Series' object is not callable
-------
Any ideas as to why? It seems pretty straightforward, yet I can't find the cause of my problem.
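Not an answer from the thread, but a sketch of the intended lookup with made-up data, assuming both frames are indexed by JobKey. On older pandas versions Index.map only accepts a callable (which is why passing a Series here raises 'Series' object is not callable), so aligning on the index avoids the problem:
import pandas as pd

DF1 = pd.DataFrame({'Aircraft Numbers': [101, 102, 103]},
                   index=pd.Index(['J1', 'J2', 'J3'], name='JobKey'))
DF2 = pd.DataFrame({'Shortage': ['yes', 'no']},
                   index=pd.Index(['J1', 'J3'], name='JobKey'))

# align DF2['Shortage'] onto DF1's index; JobKeys missing from DF2 become NaN
DF1['Short'] = DF2['Shortage'].reindex(DF1.index)
print(DF1)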

How to load a Statsmodels dataset in Python?

I am trying to load a statsmodels dataset as I saw on a tutorial, but I keep getting an error.
import statsmodels as sm
import pandas as pd
data = sm.datasets.co2.load_pandas()
co2 = data.data
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
co2.tail()
This is the error I am getting:
TypeError: new() got an unexpected keyword argument 'format'
It looks like the problem is with the original load_pandas function: the format parameter no longer exists in newer versions of pd.DatetimeIndex. For details please refer to https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DatetimeIndex.html
def load_pandas():
    data = load()
    # pandas <= 0.12.0 fails in the to_datetime regex on Python 3
    index = pd.DatetimeIndex(start=data.data['date'][0].decode('utf-8'),
                             periods=len(data.data), format='%Y%m%d',
                             freq='W-SAT')
    dataset = pd.DataFrame(data.data['co2'], index=index, columns=['co2'])
    # NOTE: this is how I got the missing values in co2.csv
    # new_index = pd.DatetimeIndex(start='1958-3-29', end=index[-1],
    #                              freq='W-SAT')
    # data.data = dataset.reindex(new_index)
    data.data = dataset
    return data
So my way of working around this is below.
Load the data into a pandas DataFrame:
co2 = pd.DataFrame(sm.datasets.co2.load().data)
Convert the bytes into strings and then to datetimes:
co2['date'] = pd.to_datetime(co2.date.apply(lambda x: x.decode("utf-8")))
Set the date as the index:
co2.set_index('date',inplace=True)
Output: the co2 values are now in a DataFrame indexed by date.
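Putting the three steps together with the renaming from the original question, a minimal sketch (using statsmodels.api, and assuming the date column still comes back as bytes, as it does in the version the question was using):
import pandas as pd
import statsmodels.api as sm

co2 = pd.DataFrame(sm.datasets.co2.load().data)
co2['date'] = pd.to_datetime(co2['date'].apply(lambda x: x.decode('utf-8')))  # bytes -> datetime
co2.set_index('date', inplace=True)

# continue with the steps from the original question/tutorial
co2['ds'] = co2.index
co2.rename(columns={'co2': 'y'}, inplace=True)
print(co2.tail())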

ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'numpy.ndarray'

I am creating a python script using pandas to read through a file which has multiple row values.
Once read, I need to build an array of these values and then assign it to a dataframe row value.
The code I have used is
import re
import numpy as np
import pandas as pd

master_data = pd.DataFrame()
temp_df = pd.DataFrame()
new_df = pd.DataFrame()

for f in data:  # data is the list of input files, defined elsewhere
    ## Reading the file in pandas, which is in Excel format
    file_df = pd.read_excel(f)
    filename = file_df['Unnamed: 1'][2]
    ## Skipping the first 24 rows to get the required reading values
    column_names = ['start_time', 'xxx_value']
    data_df = pd.read_excel(f, names=column_names, skiprows=25)
    array = np.array([])
    for i in data_df.iterrows():
        array = np.append(array, i[1][1])
    temp_df['xxx_value'] = [array]
    temp_df['Filename'] = filename
    temp_df['sub_id'] = temp_df['Filename'].str.split('_', 1).str[1].str.strip()
    temp_df['sen_site'] = temp_df['Filename'].str.split('_', 1).str[0].str.strip()
    temp_df['sampling_interval'] = 15
    temp_df['start_time'] = data_df['start_time'][2]
    new_df = new_df.append(temp_df)

new_df.index = new_df.index + 1
new_df = new_df.sort_index()
new_df.index.name = 'record_id'
new_df = new_df.drop("Filename", 1)  ## dropping Filename as it is not needed in PostgreSQL

## Rearrange to the PostgreSQL column order
column_new_df = new_df.columns.tolist()
column_new_df.insert(4, column_new_df.pop(column_new_df.index('xxx_value')))
new_df = new_df.reindex(columns=column_new_df)
print(new_df)
This code is not working when I try to insert the array data into Postgresql.
It gives me an error stating:
ProgrammingError: (psycopg2.ProgrammingError) can't adapt type
'numpy.ndarray'
I am not sure where the problem is, as I can't see in your code the part where you insert the data into Postgres.
My guess though is that you are giving Postgres a Numpy array: psycopg2 can't handle Numpy data types, but it should be fairly easy to convert them to native Python types that work with psycopg2 (e.g. by using the .tolist() method); it is difficult to give more precise information without the code.
In my opinion, the most effective way would be to make psycopg2 always aware of np.ndarray(s). One could do that by registering an adapter:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.ndarray, addapt_numpy_array)
To help working with numpy in general, my default addon to scripts/libraries dependent on psycopg2 is:
import numpy as np
from psycopg2.extensions import register_adapter, AsIs
def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def addapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def addapt_numpy_float32(numpy_float32):
    return AsIs(numpy_float32)

def addapt_numpy_int32(numpy_int32):
    return AsIs(numpy_int32)

def addapt_numpy_array(numpy_array):
    return AsIs(tuple(numpy_array))

register_adapter(np.float64, addapt_numpy_float64)
register_adapter(np.int64, addapt_numpy_int64)
register_adapter(np.float32, addapt_numpy_float32)
register_adapter(np.int32, addapt_numpy_int32)
register_adapter(np.ndarray, addapt_numpy_array)
otherwise there would be some issues even with numerical types.
I got the adapter trick from this other stackoverflow entry.
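As a hedged usage sketch (the connection string and the readings table are made up; the point is only that, once an adapter is registered, numpy scalars can be passed as query parameters directly):
import numpy as np
import psycopg2
from psycopg2.extensions import register_adapter, AsIs

def addapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

register_adapter(np.float64, addapt_numpy_float64)

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection
cur = conn.cursor()
# without the adapter this raises "can't adapt type 'numpy.float64'"
cur.execute("INSERT INTO readings (value) VALUES (%s)", (np.float64(3.14),))
conn.commit()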
Convert each numpy array element to its equivalent list using apply and tolist first, and then you should be able to write the data to Postgres:
df['column_name'] = df['column_name'].apply(lambda x: x.tolist())
We can also address the issue by extracting one element at a time. Assuming a dataframe temp_df with a sub_id column of type numpy.int64, we can extract a native Python value using iloc and item(), e.g. temp_df.iloc[0]['sub_id'].item(), and push that into the DB.

Cannot compare type 'Timestamp' with type 'int'

When running the following code:
for row, hit in hits.iterrows():
    forwardRows = data[data.index.values > row]
I get this error:
TypeError: Cannot compare type 'Timestamp' with type 'int'
If I look into what is being compared here I have these variables:
type(row)
pandas.tslib.Timestamp
row
Timestamp('2015-09-01 09:30:00')
is being compared with:
type(data.index.values[0])
numpy.datetime64
data.index.values[0]
numpy.datetime64('2015-09-01T10:30:00.000000000+0100')
I would like to understand whether this is something that can be easily fixed, or whether I should upload a subset of my data? Thanks.
Although this isn't a direct answer to your question, I have a feeling that this is what you're looking for: pandas.DataFrame.truncate
You could use it as follows:
for row, hit in hits.iterrows():
    forwardRows = data.truncate(before=row)
Here's a little toy example of how you might use it in general:
import pandas as pd
import numpy as np

# let's create some data to play with
df = pd.DataFrame(
    index=pd.date_range(start='2016-01-01', end='2016-06-01', freq='M'),
    columns=['x'],
    data=np.random.random(5)
)
# example: truncate rows before Mar 1
df.truncate(before='2016-03-01')
# example: truncate rows after Mar 1
df.truncate(after='2016-03-01')
When you use .values you move into the numpy world. Instead, try:
for row, hit in hits.iterrows():
    forwardRows = data[data.index > row]
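If you do want to stay with .values, another sketch (using the same hits and data from the question) is to convert the Timestamp to numpy.datetime64 before comparing:
import numpy as np

for row, hit in hits.iterrows():
    # both sides of the comparison are now numpy datetime64 values
    forwardRows = data[data.index.values > np.datetime64(row)]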
