Using np.where inside a df.eval string in Python

Long time listener, first time caller.
I need to be able to call functions (custom-made and otherwise, like numpy.where) from a string in a pandas DataFrame eval statement. See the example:
import pandas as pd
import numpy as np
data = [[0,10,10],[1,20,20],[0,30,30]]
df = pd.DataFrame(data,columns=['X','A','B'])
df['C'] = np.where(df.X==0,df.A+df.B,0) # This works
df['C'] = df.eval('np.where(X==0,A+B,0)') # But this is how I need to implement it
df['C'] = df.eval('#np.where(X==0,A+B,0)') #pinata swings starting here
Please stackoverflow! help!

Update:
I found a way to solve it by changing the string and evaluating it with Python's built-in eval instead of df.eval:
df['C'] = eval("np.where(df['X']==0,df['A']+df['B'],0)")
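A minimal sketch of the same workaround, assuming the expression string is trusted. The names expression and namespace are only illustrative. Passing an explicit namespace makes it clear which objects the string may refer to, but built-in eval still runs arbitrary code, and emptying __builtins__ is not a real sandbox.

```python
import numpy as np
import pandas as pd

data = [[0, 10, 10], [1, 20, 20], [0, 30, 30]]
df = pd.DataFrame(data, columns=['X', 'A', 'B'])

# The expression refers to df and np explicitly, because built-in eval
# does not resolve bare column names the way df.eval does.
expression = "np.where(df['X'] == 0, df['A'] + df['B'], 0)"

# Only these names are visible to the evaluated string (illustrative choice).
namespace = {'np': np, 'df': df}

df['C'] = eval(expression, {'__builtins__': {}}, namespace)
print(df)
```

With the sample data, C is A+B where X is 0 and 0 elsewhere.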

Related

Pandas dropna method subset parameter - difference between specifying the column as a string and as a list

I have the following sample df, from which I want to drop the rows with a missing value in the period1 column. I used the dropna method with the subset parameter given as a list (test1) and as a string (test2); both return the same result.
I am trying to find out:
Would specifying the subset parameter as a string (since I only have one column) cause any issues?
If yes, is it due to a different version of Python (I am currently using 3.7)?
If yes, is it due to running the code on a different OS (I am currently running on Windows, not on a server)?
What would the potential issue be?
Any suggestions would be greatly appreciated.
import numpy as np
import pandas as pd
df1 = { 'item':['item1','item2','item3','item4'],
'period1':[np.nan,222,5555,123],
'period2':[4567,3333,123,123],
'period3':[1234, 254,9993,321],
'period4':[999,525,2345,963]}
df1=pd.DataFrame(df1)
test1 = df1.copy()
test2 = df1.copy()
# period1 in a list
test1.dropna(subset=['period1'],inplace=True)
# period1 as a string
test2.dropna(subset='period1',inplace=True)
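For reference, a small sketch of the list form on the same sample data. It returns a new frame instead of using inplace=True, and the name cleaned is only illustrative. Passing a list is the conservative choice if the code may run on different pandas versions, since a list of labels is always a valid subset.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'item': ['item1', 'item2', 'item3', 'item4'],
    'period1': [np.nan, 222, 5555, 123],
    'period2': [4567, 3333, 123, 123],
    'period3': [1234, 254, 9993, 321],
    'period4': [999, 525, 2345, 963],
})

# Drop rows where period1 is missing; other columns are not inspected.
cleaned = df1.dropna(subset=['period1'])
print(cleaned)
```

Only the item1 row is dropped here, and df1 itself is left unchanged.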

pandas df.apply() not working with html.unescape()

I'm trying to decode html characters within a pandas dataframe.
I don't know why but my apply function won't work.
# requirements
import html
import pandas as pd
# This code works fine.
df = df.apply(lambda x: x + "TESTSTRING")
print(df) # "TESTSTRING" is appended to all values.
# This code also works fine. html.unescape() is working well.
fn = lambda x: html.unescape(x)
s = "Something wrong with &lt;b&gt;E&amp;S&lt;/b&gt;"
print(fn(s)) # returns "Something wrong with <b>E&S</b>"
# However, the code below doesn't work. The "&amp;" within the values doesn't get decoded.
df2 = df.apply(fn)
print(df2) # The html characters aren't decoded!
It's really frustrating that apply and html.unescape() each work well separately, but I don't know why they don't work together.
I've also tried axis=1.
I'd really appreciate your help. Thanks in advance.
The problem is that html.unescape() is not vectorized, i.e. it accepts only a single string, while df.apply passes a whole column (a Series) to the function rather than each cell.
If your df is not very large, using applymap should still be sufficiently fast:
df2 = df.applymap(lambda x: html.unescape(x))
print(df2)
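As a variation on the answer above, here is a sketch that decodes cell by cell through Series.map, assuming every cell is a string. The sample frame and its column names are made up for illustration. This route may be preferable on recent pandas, where applymap is deprecated in favour of DataFrame.map.

```python
import html
import pandas as pd

# Hypothetical sample data containing HTML entities.
frame = pd.DataFrame({
    'title': ['E&amp;S', 'A &lt; B'],
    'note': ['&quot;ok&quot;', 'plain'],
})

# apply hands each column to the lambda as a Series;
# Series.map then calls html.unescape on every single string.
decoded = frame.apply(lambda col: col.map(html.unescape))
print(decoded)
```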

Constructing pandas dataframe from a list of objects

Let me start off by saying that I'm fairly new to numpy and pandas. I'm trying to construct a pandas dataframe but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .NET objects (that I have very little control over) and I want to build a time series from them using a pandas dataframe. I have an example where I have replaced the .NET class with a simplified placeholder class just for demonstration. The listOfThings in the code is basically what I get from .NET, and I want to convert that into a pandas dataframe.
My questions are:
I construct the dataframe by first constructing a numpy array. Is this necessary? Also, this array doesn't have the size 1000x2 as I expect. Is there a better way to use numpy here?
This code doesn't work because I don't seem to be able to cast the string to a datetime64. This confuses me since the string is in ISO format and it works when I try to parse it like this: np.datetime64(str(np.datetime64('now','us'))).
Code sample:
import numpy as np
import pandas as pd
class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))
    def value(self):
        return 100*np.random.random_sample()
listOfThings = [PlaceholderClass() for i in range(1000)]
arr = np.array([(x.time(), x.value()) for x in listOfThings], dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. This works perfectly fine, for example:
import datetime
from random import randint
rd = lambda: datetime.date(randint(2005,2025), randint(1,12),randint(1,28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right, even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
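Putting Q1 and Q2 together, a sketch of one way to go from the list of objects to a time-indexed dataframe without an intermediate structured numpy array. The names records and frame are illustrative, and PlaceholderClass is the stand-in from the question, not the real .NET class.

```python
import numpy as np
import pandas as pd

class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))

    def value(self):
        return 100 * np.random.random_sample()

listOfThings = [PlaceholderClass() for _ in range(1000)]

# One (time, value) tuple per object gives a 1000x2 table directly.
records = [(x.time(), x.value()) for x in listOfThings]
frame = pd.DataFrame(records, columns=['time', 'value'])

# Parse the ISO strings in one call, then use them as the index.
frame['time'] = pd.to_datetime(frame['time'])
frame = frame.set_index('time')
print(frame.head())
```

Converting the whole column with a single pd.to_datetime call avoids parsing each string separately.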

Python Pandas DataFrame with only a single number stored?

AzureML's Python Script module requires returning a Pandas DataFrame. I want to return only a single value, and I do this:
result=7
dataframe1=pd.DataFrame(numpy.zeros(1))
dataframe1[0][0]=result
This lets me return just a single value from Azure ML's Python Script module.
What is the proper way to create a pandas DataFrame with a single value?
The following code should work:
import pandas as pd
def azureml_main(dataframe1 = None, dataframe2 = None):
    result = pd.DataFrame({'mycol': [123]})
    return result,
As EdChum commented, you can use
dataframe1=pd.DataFrame([result], dtype=float)
which works (tested), instead of
result=7
dataframe1=pd.DataFrame(numpy.zeros(1))
dataframe1[0][0]=result
This way we don't need numpy to initialize the return value with zeros.
P.S. EdChum can make this his answer if he wants.
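A short sketch of both suggestions side by side, outside of Azure ML. The column name 'result' and the variable names are only illustrative. Reading the value back with .iloc avoids the chained dataframe1[0][0] indexing used in the question.

```python
import pandas as pd

result = 7

# One-row, one-column frame with a named column.
named = pd.DataFrame({'result': [result]})

# EdChum's variant: default integer column label 0, values stored as float.
unnamed = pd.DataFrame([result], dtype=float)

# Read the single value back by position.
print(named.iloc[0, 0], unnamed.iloc[0, 0])
```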

How to use xlwings or pandas to get all the non-null cell?

Recently I needed to write a Python script to find out how many times a specific string occurs in an Excel sheet.
I noted that we can use xlwings.Range('A1').table.formula to achieve this only if the cells are contiguous. If the cells are not contiguous, how can I accomplish that?
It's a little hacky, but why not.
By the way, I'm assuming you are using Python 3.x.
First we'll create a new boolean dataframe that marks the cells matching the value you are looking for.
import pandas as pd
import numpy as np
df = pd.read_excel('path_to_your_excel..')
b = df.applymap(lambda x: x == 'value_you_want_to_find' if isinstance(x, str) else False)
and then simply sum all occurrences.
print(np.count_nonzero(b.values))
As clarified in the comments, if you already have a dataframe, you can simply use count (Note: there must be a better way of doing it):
df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})
string_to_search = '^a$' # should actually be a regex, in this example searching for 'a'
print(sum(df[col].str.count(string_to_search).sum() for col in df.columns))
>> 1
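If an exact match is enough and no regex is needed, a sketch along these lines may be simpler. The sample frame stands in for the result of pd.read_excel, and the names sheet, target, matches and non_null are illustrative. It also counts the non-null cells, since that is what the title asks about.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a sheet with gaps and mixed cell types.
sheet = pd.DataFrame({
    'col_a': ['a', np.nan, 'b'],
    'col_b': ['ab', 'a', np.nan],
    'col_c': [1, 'a', 'c'],
})
target = 'a'

# Element-wise equality: True only where a cell equals the target exactly.
matches = int(sheet.eq(target).to_numpy().sum())

# Number of cells that are not empty, wherever they sit in the sheet.
non_null = int(sheet.notna().to_numpy().sum())

print(matches, non_null)
```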
