Let me start off by saying that I'm fairly new to numpy and pandas. I'm trying to construct a pandas dataframe but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .NET objects (that I have very little control over) and I want to build a time series from them using a pandas DataFrame. I have an example where I have replaced the .NET class with a simplified placeholder class just for demonstration. The listOfThings in the code is basically what I get from .NET, and I want to convert it into a pandas DataFrame.
My questions are:
I construct the dataframe by first constructing a numpy array. Is this necessary? Also, this array doesn't have the shape 1000x2 that I expect. Is there a better way to use numpy here?
This code doesn't work because it doesn't seem to be able to cast the string to a datetime64. This confuses me, since the string is in ISO format and parsing works when I try it like this: np.datetime64(str(np.datetime64('now','us'))).
Code sample:
import numpy as np
import pandas as pd
class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))

    def value(self):
        return 100 * np.random.random_sample()
listOfThings = [PlaceholderClass() for i in range(1000)]
arr = np.array([(x.time(), x.value()) for x in listOfThings], dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. For example, this works perfectly fine:
import datetime
from random import randint

rd = lambda: datetime.date(randint(2005, 2025), randint(1, 12), randint(1, 28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right, even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
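For what it's worth, the structured-array route from the question also works once the datetime64 dtype carries an explicit unit; a generic np.datetime64 cannot cast the ISO strings. A sketch, reusing listOfThings from the question:

# 'datetime64[us]' gives the dtype an explicit unit, so numpy can parse
# the ISO strings produced by PlaceholderClass.time()
arr = np.array([(x.time(), x.value()) for x in listOfThings],
               dtype=[('time', 'datetime64[us]'), ('value', 'f8')])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])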
Related
Long-time listener, first-time caller. I need to be able to call functions (custom-made and otherwise, like numpy.where) as a string in a pandas DataFrame eval statement. See the example:
import pandas as pd
import numpy as np
data = [[0,10,10],[1,20,20],[0,30,30]]
df = pd.DataFrame(data,columns=['X','A','B'])
df['C'] = np.where(df.X==0, df.A+df.B, 0)   # This works
df['C'] = df.eval('np.where(X==0,A+B,0)')   # But this is how I need to implement it
df['C'] = df.eval('@np.where(X==0,A+B,0)')  # pinata swings starting here
Please help, Stack Overflow!
Update:
I found a way to solve this by changing the string and evaluating it with Python's builtin eval instead:
df['C'] = eval("np.where(df['X']==0,df['A']+df['B'],0)")
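An alternative that stays inside df.eval, for what it's worth: because the else-branch is 0, multiplying the sum by the boolean mask reproduces the np.where call. A sketch (engine='python' sidesteps any question of numexpr operator support):

# (X == 0) is a boolean mask; multiplying casts it to 0/1, so rows where
# X != 0 come out as 0, exactly like np.where(X==0, A+B, 0)
df['C'] = df.eval('(A + B) * (X == 0)', engine='python')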
I don't understand the behaviour of .loc or .at when I want to save a variable in a specific cell of a dataframe. Can somebody help me understand, please?
My failing example:
import pandas as pd
import numpy as np
print(pd.__version__)
from platform import python_version
print(python_version())
df = pd.DataFrame({'a': [np.array([1,2,3]), np.array([4,5,6]), np.array([7,8,9]),
                         np.array([10,11,12]), np.array([13,14,15])],
                   'b': [5, 5, 12, 123, 6]})
display(df)
df.loc[0,'c']='string 0'
df.loc[1,'c']='string 1'
df.loc[2,'c']='string 2'
df.loc[3,'c']='string 3'
print(df.index.values)
testdata=np.array(np.arange(0,3648,1),dtype=np.float32)
print('----------testdata----------')
print(type(testdata))
print(testdata.dtype)
print(testdata.shape)
print('----------file_handle----------')
file_handle=np.array([1],dtype=np.int64)
print(file_handle)
print(type(file_handle))
print(file_handle.dtype)
if 'new_column' not in df.columns:
    df = df.assign(new_column=None)
display(df)
df.loc[file_handle,'new_column']=[testdata]
display(df)
Result: ValueError: Must have equal len keys and value when setting with an ndarray
But with df.at[file_handle[0], 'new_column'] = [testdata] (i.e. df.at[1, 'new_column'] = [testdata]) it works, and I don't understand why. With df.loc[file_handle[0], 'new_column'] = testdata it does not work either.
In other places in my code I can use [1] as a row index to assign dicts or scalars to one specific location, but not numpy arrays.
Thank you for your explanation and insight. I would like to understand how to use .loc and .at, and which values they accept, both as row indexers and as items stored in the dataframe.
When you have an ndarray on the right-hand side, pandas does not treat it like an arbitrary Python object that can be inserted into a single cell. Instead you hit a code path that tries to set multiple values at multiple locations from that array, hence the error message about "setting with an ndarray".
Consider some working multi-location code like
df.loc[[0,1,3], ['b', 'new_column']] = np.array([[4,5], [6,7], [8,9]])
Here, the shape of the locations selected on the left side matches the shape of the array on the right side, so all the values are set successfully.
In your code, [testdata] (a list containing one array of shape (3648,)) is treated by pandas as a 2-D array of shape (1, 3648) in this operation. That shape does not match the locations selected on the left side, so pandas throws an error about not being able to match them up.
The correct way to handle this is to use .at instead: it addresses exactly one cell and never enters the ndarray-setting code path.
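A minimal sketch of the .at route, reusing df, testdata and file_handle from the question:

# .at addresses exactly one cell, so the value is stored as a single
# Python object instead of being broadcast across locations
df.at[file_handle[0], 'new_column'] = [testdata]  # the form confirmed in the question
# on recent pandas versions, assigning the bare array may also work:
# df.at[1, 'new_column'] = testdata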
First import:
import pandas as pd
import numpy as np
import hashlib
Next, consider the following:
np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Multiple executions of this snippet always yield the same hash from both prints: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.
However, if we consider the following:
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Note that there are strings in the data now. The hash of arr is still fixed across evaluations (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de), but the hash of the pandas.DataFrame changes on each run.
Is there any Pythonic way around this?
Edit: Related links:
Hashable DataFrames
A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, starting in version 0.20.1.
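A sketch of that route, reusing df and hashlib from the snippets above:

# hash_pandas_object hashes values (and, by default, the index) by content,
# returning one uint64 per row, so its bytes give a digest that is stable
# across runs even for object-dtype frames
row_hashes = pd.util.hash_pandas_object(df)
print(hashlib.sha256(row_hashes.values.tobytes()).hexdigest())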
The reason is the dtype: when you use strings as cell values, the columns are stored with dtype object, as
df.dtypes
shows. For an object column, df.values is an array of references to Python objects, and .tobytes() serializes those pointers, i.e. memory addresses, which change from run to run. That is why you get a different hash each time.
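This is easy to see by inspecting the dtypes; a quick sketch:

print(df.dtypes)        # every column: object
print(df.values.dtype)  # object -- an array of references
print(arr.dtype)        # fixed-width unicode (e.g. '<U3'), whose bytes are stable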
A naive workaround is to hash a string representation of the whole dataframe. In particular, either of the following can work:
print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())
Naturally, this is going to be very costly for big dataframes.
Still, the fact remains that the raw bytes of pd.DataFrame(arr).values do not behave like those of arr, and this is counter-intuitive.
See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691
I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.
AzureML's Python Script module requires you to return a pandas DataFrame. I want to return only a single value, so I do this:
result = 7
dataframe1 = pd.DataFrame(numpy.zeros(1))
dataframe1[0][0] = result
by which I am able to return just a single value from AzureML's Python Script module.
What is a proper way to create a pandas DataFrame with a single value?
The following code should work:
import pandas as pd
def azureml_main(dataframe1=None, dataframe2=None):
    result = pd.DataFrame({'mycol': [123]})
    return result,
As EdChum commented,
dataframe1 = pd.DataFrame([result], dtype=float)
and it works (tested), instead of
result = 7
dataframe1 = pd.DataFrame(numpy.zeros(1))
dataframe1[0][0] = result
where we don't need numpy to initialize the return value with zeros.
P.S. EdChum can make this his answer if he wants.
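For reference, a few equivalent one-liners, any of which avoids the numpy detour (the column name 'mycol' is just an example):

result = 7
pd.DataFrame([result])                       # single column named 0
pd.DataFrame([result], dtype=float)          # EdChum's variant
pd.DataFrame({'mycol': [result]})            # named column
pd.DataFrame([[result]], columns=['mycol'])  # explicit row/column layout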
I am trying to create column names for easy reference, so that I can just call a column by name from the rest of the program instead of having to know where it sits in terms of placement. The from_ column array is coming up empty. I am new to numpy, so I am wondering how this is done. Changing the data type of columns 5 and 6 was successful, though.
def array_setter():
    import os
    import glob
    import numpy as np
    os.chdir(r'C:\Users\U2970\Documents\Arcgis\Text_files\Data_exports\North_data_folder')
    for file in glob.glob('*.TXT'):
        reader = open(file)
        headerLine = reader.readlines()
        for col in headerLine:
            valueList = col.split(",")
            data = np.array([valueList])
            from_ = np.array(data[1:, [5]], dtype=np.float32)
            # trying to assign a name to columns for easy reference
            to = np.array(data[1:, [6]], dtype=np.float32)
            if data[:, [1]] == 'C005706N':
                if data[:, [from_] < 1.0]:
                    print(data[:, [from_]])

array_setter()
If you want to index array columns by name, I would recommend turning the array into a pandas DataFrame. For example,
import pandas as pd
import numpy as np
arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['f', 's'])
print(df['f'])
The nice part of this approach is that the arrays still maintain all their structure, but you also get all the optimized indexing/slicing/etc. capabilities of pandas. For example, if you wanted to find the elements of 'f' that correspond to the elements of 's' equal to some value a, you could use loc:
a = 2
print(df.loc[df['s'] == a, 'f'])
Check out the pandas docs for different ways to use the DataFrame object. Or you could read the book by Wes McKinney (pandas creator), Python for Data Analysis. Even though it was written for an older version of pandas, it's a great starting point and will set you in the right direction.
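For completeness, plain numpy also supports named columns via structured arrays; np.genfromtxt can build one straight from a delimited file's header row. A sketch (the file name and column name here are hypothetical):

import numpy as np

# names=True takes the field names from the header row; dtype=None lets
# numpy infer a per-column dtype, producing a 1-D structured array
data = np.genfromtxt('NORTH01.TXT', delimiter=',', names=True,
                     dtype=None, encoding=None)
print(data['FROM'])  # access a column by its header name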