AzureML's Python Script module requires the script to return a pandas DataFrame. I want to return only a single value, so I do this:
result=7
dataframe1=pd.DataFrame(numpy.zeros(1))
dataframe1[0][0]=result
This lets me return just a single value from Azure ML's Python Script module.
What is the proper way to create a pandas DataFrame with a single value?
The following code should work:
import pandas as pd
def azureml_main(dataframe1 = None, dataframe2 = None):
    result = pd.DataFrame({'mycol': [123]})
    return result,
As EdChum commented, you can use
dataframe1=pd.DataFrame([result], dtype=float)
which works (I tested it), instead of
result=7
dataframe1=pd.DataFrame(numpy.zeros(1))
dataframe1[0][0]=result
This way we don't need numpy to initialize the return value with zeros.
P.S. EdChum can make this his answer if he wants.
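For reference, here is a minimal sketch of the two constructions discussed above, run outside Azure ML. The column name `result` is only an illustrative choice, not something the module requires.

```python
import pandas as pd

result = 7

# From a one-element list: a single column labelled 0, stored as float.
df_from_list = pd.DataFrame([result], dtype=float)

# From a dict: the key becomes the column name ('result' is an arbitrary label).
df_from_dict = pd.DataFrame({'result': [result]})

# Read the single value back by position.
print(df_from_list.iat[0, 0])  # 7.0
print(df_from_dict.iat[0, 0])  # 7
```

Either frame has shape (1, 1); the dict form is handy when a named column is wanted downstream.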
Long-time listener, first-time caller.
I need to be able to call functions (custom-made and otherwise, like numpy.where) from a string in a pandas DataFrame eval statement. See the example:
import pandas as pd
import numpy as np
data = [[0,10,10],[1,20,20],[0,30,30]]
df = pd.DataFrame(data,columns=['X','A','B'])
df['C'] = np.where(df.X==0,df.A+df.B,0) #This works
df['C'] = df.eval('np.where(X==0,A+B,0)') #But this is how i need to implement
df['C'] = df.eval('#np.where(X==0,A+B,0)') #pinata swings starting here
Please stackoverflow! help!
Update:
I found a solution: I changed the string and evaluated it with Python's built-in eval instead:
df['C'] = eval("np.where(df['X']==0,df['A']+df['B'],0)")
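If the goal is only to keep the arithmetic in a string, one alternative that avoids Python's built-in eval is to let df.eval compute the expression and then apply the condition with Series.where. This is a sketch under that assumption, not a general way to call arbitrary functions inside df.eval.

```python
import pandas as pd

data = [[0, 10, 10], [1, 20, 20], [0, 30, 30]]
df = pd.DataFrame(data, columns=['X', 'A', 'B'])

# df.eval handles the column arithmetic given as a string...
total = df.eval('A + B')

# ...and Series.where keeps values where the condition holds, else uses 0.
df['C'] = total.where(df['X'] == 0, 0)

print(df['C'].tolist())  # [20, 0, 60]
```

The result matches the np.where version from the question, and the string passed to df.eval never has to reference np or df.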
I have the following sample df, from which I want to drop the rows with a missing value in the period1 column. I call dropna with the subset parameter given as a list (test1) and as a string (test2); both return the same result.
I am trying to find out:
Would specifying the subset parameter as a string (since I only have one column) cause any issue?
If yes, does it depend on the Python version (I am currently using 3.7)?
If yes, does it depend on the OS the code runs on (I am currently running on Windows, not on a server)?
What would the potential issue be?
Any suggestions would be greatly appreciated.
import numpy as np
import pandas as pd
df1 = { 'item':['item1','item2','item3','item4'],
'period1':[np.nan,222,5555,123],
'period2':[4567,3333,123,123],
'period3':[1234, 254,9993,321],
'period4':[999,525,2345,963]}
df1=pd.DataFrame(df1)
test1 = df1.copy()
test2 = df1.copy()
# period1 in a list
test1.dropna(subset=['period1'],inplace=True)
# period1 as a string
test2.dropna(subset='period1',inplace=True)
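One way to sidestep the question is to normalize the argument before calling dropna, since a list of labels is the form used for test1 and does not depend on how a given pandas version treats a bare string. The helper name `as_label_list` is hypothetical, introduced only for this sketch.

```python
import numpy as np
import pandas as pd

def as_label_list(subset):
    """Wrap a single column label in a list; return list-likes as a list."""
    if isinstance(subset, str):
        return [subset]
    return list(subset)

df1 = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item4'],
    'period1': [np.nan, 222, 5555, 123],
    'period2': [4567, 3333, 123, 123],
})

# Both spellings end up as subset=['period1'].
from_string = df1.dropna(subset=as_label_list('period1'))
from_list = df1.dropna(subset=as_label_list(['period1']))

print(from_string.equals(from_list))  # True
print(len(from_string))               # 3
```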
Let me start off by saying that I'm fairly new to numpy and pandas. I'm trying to construct a pandas dataframe but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .Net objects (that I have very little control over) and I want to build a time series from it using a pandas dataframe. I have an example where I have replaced the .Net class with a simplified placeholder class just for demonstration. The listOfThings in the code is basically what I get from .Net, and I want to convert that into a pandas dataframe.
My questions are:
I construct the dataframe by first constructing a numpy array. Is this necessary? Also, this array doesn't have the size 1000x2 as I expect. Is there a better way to use numpy here?
This code doesn't work because I don't seem to be able to cast the string to a datetime64. This confuses me, since the string is in ISO format and it works when I parse it like this: np.datetime64(str(np.datetime64('now','us'))).
Code sample:
import numpy as np
import pandas as pd
class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))
    def value(self):
        return 100*np.random.random_sample()
listOfThings = [PlaceholderClass() for i in range(1000)]
arr = np.array([(x.time(), x.value()) for x in listOfThings], dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. This works perfectly fine, for example:
import datetime
from random import randint
rd = lambda: datetime.date(randint(2005,2025), randint(1,12),randint(1,28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
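A self-contained variant of the same idea, as a sketch: collect the times and values in plain lists, let pd.to_datetime parse all the strings in one call, and skip the intermediate numpy array. The column name `value` and the snake_case variable names are illustrative choices.

```python
import numpy as np
import pandas as pd

class PlaceholderClass:
    """Stand-in for the .Net objects, as in the question."""
    def time(self):
        return str(np.datetime64('now', 'us'))
    def value(self):
        return 100 * np.random.random_sample()

list_of_things = [PlaceholderClass() for _ in range(1000)]

# Parse every ISO string at once; the result is a DatetimeIndex.
times = pd.to_datetime([x.time() for x in list_of_things])
values = [x.value() for x in list_of_things]

# One float column indexed by the parsed timestamps.
dataframe = pd.DataFrame({'value': values}, index=times)
print(dataframe.shape)  # (1000, 1)
```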
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right. Even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
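A short sketch of the same call with utc=True, which asks pandas to convert the parsed offset to UTC explicitly rather than relying on version-specific defaults. The repr printed for a timezone-aware result may differ from the output shown above.

```python
import pandas as pd

# ISO 8601 string with a -04:00 offset, taken from the example above.
raw = '2008-09-22T13:57:31.2311892-04:00'

# utc=True returns a timezone-aware Timestamp normalized to UTC.
ts = pd.to_datetime(raw, utc=True)

print(ts.hour, ts.minute, ts.second)  # 17 57 31
```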
Recently I needed to write a Python script to find out how many times a specific string occurs in an Excel sheet.
I noticed that xlwings.Range('A1').table.formula can do this only if the cells are contiguous. If the cells are not contiguous, how can I accomplish that?
It's a little hacky, but why not.
By the way, I'm assuming you are using python 3.x.
First we'll create a new boolean dataframe that marks the cells matching the value you are looking for.
import pandas as pd
import numpy as np
df = pd.read_excel('path_to_your_excel..')
b = df.applymap(lambda x: x == 'value_you_want_to_find' if isinstance(x, str) else False)
and then simply sum all occurrences.
print(np.count_nonzero(b.values))
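For exact matches, the same count can also be obtained with DataFrame.isin, which avoids the per-cell lambda. A minimal sketch on a small hand-made frame; the data and the target value are placeholders standing in for the Excel sheet.

```python
import pandas as pd

df = pd.DataFrame({
    'col_a': ['a', 'ab', 'c'],
    'col_b': ['ab', 'a', 'a'],
    'col_c': [1, 2, 3],
})
target = 'a'

# isin gives a boolean frame; summing twice counts the True cells.
count = int(df.isin([target]).sum().sum())
print(count)  # 3
```

Only cells equal to the whole string are counted, so 'ab' does not match 'a'.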
As clarified in the comments, if you already have a dataframe, you can simply use count (Note: there must be a better way of doing it):
df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})
string_to_search = '^a$' # should actually be a regex, in this example searching for 'a'
print(sum(df[col].str.count(string_to_search).sum() for col in df.columns))
>> 1
I'm having a problem with the apply() method of the pandas DataFrame. My issue is that apply() can return either a Series or a DataFrame, depending on the return type of the input function; however, when the frame is empty, apply() (almost) always returns a DataFrame. So I can't write code that expects a Series. Here's an example:
import pandas as pd
def area_from_row(row):
    return row['width'] * row['height']
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1)
# This works as expected.
non_empty_frame = pd.DataFrame(data=[[2, 3]], columns=['width', 'height'])
add_area_column(non_empty_frame)
# This fails!
empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])
add_area_column(empty_frame)
Is there a standard way of dealing with this? I can do the following, but it's silly:
def area_from_row(row):
    # The way we respond to an empty row tells pandas whether we're a
    # reduction or not.
    if not len(row):
        return None
    return row['width'] * row['height']
(I'm using pandas 0.11.0, but I checked this on 0.12.0-1100-g0c30665 as well.)
You can set the result_type parameter in apply to 'reduce'.
From the documentation,
By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
And then,
‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
In your code, update here:
def add_area_column(frame):
    # I know I can multiply the columns directly, but my actual function is
    # more complicated.
    frame['area'] = frame.apply(area_from_row, axis=1, result_type='reduce')
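To check the behaviour on the empty frame from the question, here is a small sketch, assuming a pandas version that has the result_type parameter (it was added in 0.23, so it is not available in the 0.11/0.12 releases mentioned in the question).

```python
import pandas as pd

def area_from_row(row):
    return row['width'] * row['height']

empty_frame = pd.DataFrame(data=None, columns=['width', 'height'])

# With result_type='reduce', apply returns a Series even for zero rows.
areas = empty_frame.apply(area_from_row, axis=1, result_type='reduce')
empty_frame['area'] = areas

print(type(areas).__name__, len(areas))  # Series 0
```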