If Else, column with several string values to match on - python

The full function will have many more conditional statements, but this is as far as I've gotten in troubleshooting: I get the error 'str' object has no attribute 'isin'. I've tried several things to no avail.
def categorise(row):
    if (row['state'] == 'FL') & (row['city'].isin(['MIAMI', 'TALLAHASSEE', 'ORLANDO'])):
        return 1
    ...

df['colF'] = df.apply(lambda row: categorise(row), axis=1)
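For context, inside apply(..., axis=1) each row['city'] is a plain string, so the Series method .isin isn't available on it. A minimal sketch of one workaround, testing membership with Python's in operator (the sample data here is made up; only the column names come from the question):
import pandas as pd

df = pd.DataFrame({'state': ['FL', 'FL', 'GA'],
                   'city': ['MIAMI', 'JACKSONVILLE', 'ATLANTA']})

def categorise(row):
    # row['city'] is a single string here, so use `in` rather than .isin
    if row['state'] == 'FL' and row['city'] in {'MIAMI', 'TALLAHASSEE', 'ORLANDO'}:
        return 1
    return 0

df['colF'] = df.apply(categorise, axis=1)
print(df)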

Related

AttributeError: 'Series' object has no attribute 'Mean_μg_L'

Why am I getting this error if the column name exists?
I have tried everything. I am out of ideas.
Since the AttributeError is raised at the first column whose name contains a mathematical symbol (µ), I would suggest one of these two solutions:
Use replace right before the loop to get rid of this special character:
df.columns = df.columns.str.replace(r"_\wg_", "_ug_", regex=True)
# change df to Table_1_1a_Tarawa_Terrace_System_1975_to_1985
Then, inside the loop, use row.Mean_ug_L, ... instead of row.Mean_µg_L, ...
Or use row["col_name"] (highly recommended) to refer to the column rather than row.col_name:
for index, row in Table_1_1a_Tarawa_Terrace_System_1975_to_1985.iterrows():
    SQL_VALUES_Tarawa = (row["Chemicals"], row["Contamminant"], row["Mean_µg_L"],
                         row["Median_µg_L"], row["Range_µg_L"],
                         row["Num_Months_Greater_MCL"], row["Num_Months_Greater_100_µg_L"])
    cursor.execute(SQL_insert_Tarawa, SQL_VALUES_Tarawa)
    counting = cursor.rowcount
    print(counting, "Record added")
conn.commit()
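For illustration, a minimal end-to-end sketch of the rename approach; the two-column frame below is a made-up stand-in for the real table, which is not shown in the question:
import pandas as pd

# made-up stand-in for Table_1_1a_Tarawa_Terrace_System_1975_to_1985
df = pd.DataFrame({"Chemicals": ["TCE", "PCE"], "Mean_µg_L": [5.0, 15.0]})

# normalize the special character once, before the loop
df.columns = df.columns.str.replace(r"_\wg_", "_ug_", regex=True)

for index, row in df.iterrows():
    # after the rename, both row["Mean_ug_L"] and row.Mean_ug_L work
    print(row["Chemicals"], row["Mean_ug_L"])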

Pandas/Python function str.contains returns an error

I am trying to make a function that I feed my dataframe into; the purpose of the function is to categorize account postings as either "Accept" or "Ignore".
The problem I then have is that on some accounts I only need to match part of a text string. If I do that outside a function it works, but inside a function I get an error.
So this works fine:
ekstrakt.query("Account== 'Car_sales'").Tekst.str.contains("Til|Fra", na=False)
But this doesn't:
def cleansing(df):
    if df['Account'] == 'Car_sales':
        if df.Tekst.str.contains("Til|Fra", na=False):
            return 'Ignore'

ekstrakt['Ignore'] = ekstrakt.apply(cleansing, axis=1)
It results in an error: "AttributeError: 'str' object has no attribute 'str'"
I need the "cleansing" function to take more arguments afterwards, but I am struggling getting past this first part.
If you use a function that processes each row separately, you cannot use pandas functions that work with whole columns, like str.contains.
A possible solution is to create the new column by chaining the masks with & (bitwise AND) and passing the result to numpy.where:
df = pd.DataFrame({'Account':['car','Car_sales','Car_sales','Car_sales'],
                   'Tekst':['Til','Franz','Text','Tilled']})
m1 = df['Account'] == 'Car_sales'
m2 = df.Tekst.str.contains("Til|Fra", na=False)
df['new'] = np.where(m1 & m2, 'Ignore', 'Accept')
print (df)
     Account   Tekst     new
0        car     Til  Accept
1  Car_sales   Franz  Ignore
2  Car_sales    Text  Accept
3  Car_sales  Tilled  Ignore
If you need the processing in a function, you can use the in operator with or, because you are working with scalars:
def cleansing(x):
    if x['Account'] == 'Car_sales':
        if pd.notna(x.Tekst):
            if ('Til' in x.Tekst) or ('Fra' in x.Tekst):
                return 'Ignore'

df['Ignore'] = df.apply(cleansing, axis=1)
print (df)
     Account   Tekst     new  Ignore
0        car     Til  Accept    None
1  Car_sales   Franz  Ignore  Ignore
2  Car_sales    Text  Accept    None
3  Car_sales  Tilled  Ignore  Ignore
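If more account or keyword rules are needed later (as the question suggests), the same mask idea generalizes with numpy.select; a sketch under that assumption, reusing the sample frame from above:
conditions = [
    (df['Account'] == 'Car_sales') & df['Tekst'].str.contains('Til|Fra', na=False),
    # further rules can be appended here as more masks
]
choices = ['Ignore']

df['new'] = np.select(conditions, choices, default='Accept')
print (df)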

Timestamp object has no attribute dt

I am trying to create a new column in a dataframe through a function based on the values in the date column, but I get an error saying "Timestamp object has no attribute dt." However, if I run this outside of a function, the dt attribute works fine.
Any guidance would be appreciated.
This code runs with no issues:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
display(dftest.info())
dftest['year'] = dftest['Date'].dt.year
dftest['month'] = dftest['Date'].dt.month
This code gives me the error message:
sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])
def CALLYMD(dftest):
    if dftest['Date'].dt.month > 9:
        return str(dftest['Date'].dt.year) + '1231'
    elif dftest['Date'].dt.month > 6:
        return str(dftest['Date'].dt.year) + '0930'
    elif dftest['Date'].dt.month > 3:
        return str(dftest['Date'].dt.year) + '0630'
    else:
        return str(dftest['Date'].dt.year) + '0331'
dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)
Lastly, I'm open to any suggestions on how to make this code better as I'm still learning.
I'm guessing you should remove .dt in the second case. When you use apply, the function is applied row by row, so dftest['Date'] inside it is a single element rather than a whole column. .dt is only needed on a Series of dates; on a single Timestamp it isn't available, and using it raises:
AttributeError: 'Timestamp' object has no attribute 'dt'
reference: https://stackoverflow.com/a/48967889/13720936
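For completeness, a minimal sketch of the function from the question with .dt removed; since apply(..., axis=1) hands the function one row at a time, row['Date'] is a single Timestamp and .year/.month are used directly:
import pandas as pd

sample = {'Date': ['2015-07-02 11:47:00', '2015-08-02 11:30:00']}
dftest = pd.DataFrame.from_dict(sample)
dftest['Date'] = pd.to_datetime(dftest['Date'])

def CALLYMD(row):
    # row['Date'] is a scalar Timestamp here, not a Series
    if row['Date'].month > 9:
        return str(row['Date'].year) + '1231'
    elif row['Date'].month > 6:
        return str(row['Date'].year) + '0930'
    elif row['Date'].month > 3:
        return str(row['Date'].year) + '0630'
    else:
        return str(row['Date'].year) + '0331'

dftest['CALLYMD'] = dftest.apply(CALLYMD, axis=1)
print(dftest)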
After looking at the Timestamp documentation, I found that removing the .dt and just using .year and .month works. However, I'm still confused as to why it works in the first code block but not in the second.
Here is how to create a YearMonth bucket using the year and month:
for key, item in df.iterrows():
    year = pd.to_datetime(item['Date']).year
    month = str(pd.to_datetime(item['Date']).month)
    df.loc[key, 'YearMonth'] = "{:.0f}{}".format(year, month.zfill(2))
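A vectorized alternative, assuming the goal is the same YYYYMM string, would skip the loop entirely:
df['YearMonth'] = pd.to_datetime(df['Date']).dt.strftime('%Y%m')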

pandas function return multiple values error - TypeError: unhashable type: 'list'

I have written a pandas function and it runs fine (the second-to-last line of my code). When I try to assign my function's output to columns in the dataframe I get an error: TypeError: unhashable type: 'list'.
I posted something similar before, and I am using the method shown in the answer to that question in the function below. But it still fails :(
import pandas as pd
import numpy as np
def benford_function(value):
    if value == '':
        return []
    if ("." in value):
        before_decimal = value.split(".")[0]
        if len(before_decimal) == 0:
            bd_first = "0"
            bd_second = "0"
        if len(before_decimal) > 1:
            before_decimal = before_decimal[:2]
            bd_first = before_decimal[0]
            bd_second = before_decimal[1]
        elif len(before_decimal) == 1:
            bd_first = "0"
            bd_second = before_decimal[0]
        after_decimal = value.split(".")[1]
        if len(after_decimal) > 1:
            ad_first = after_decimal[0]
            ad_second = after_decimal[1]
        elif len(after_decimal) == 1:
            ad_first = after_decimal[0]
            ad_second = "0"
        else:
            ad_first = "0"
            ad_second = "0"
    else:
        ad_first = "0"
        ad_second = "0"
        if len(value) > 1:
            bd_first = value[0]
            bd_second = value[1]
        else:
            bd_first = "0"
            bd_second = value[0]
    return pd.Series([bd_first, bd_second, ad_first, ad_second])
df = pd.DataFrame(data = {'a': ["123"]})
df.apply(lambda row: benford_function(row['a']), axis=1)
df[['bd_first'],['bd_second'],['ad_first'],['ad_second']]= df.apply(lambda row: benford_function(row['a']), axis=1)
Change:
df[['bd_first'],['bd_second'],['ad_first'],['ad_second']] = ...
to
df[['bd_first', 'bd_second', 'ad_first', 'ad_second']] = ...
This will fix your TypeError, since index elements must be hashable. The way you tried to index into the DataFrame, by passing a tuple of single-element lists, makes pandas interpret each of those single-element lists as an index, and lists are not hashable.
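As a quick check of the corrected assignment (assuming the benford_function defined above is in scope):
df = pd.DataFrame(data={'a': ["123.45"]})
df[['bd_first', 'bd_second', 'ad_first', 'ad_second']] = df.apply(
    lambda row: benford_function(row['a']), axis=1)
print(df)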

pandas multiindex shift on filtered values

I want to get the time differences between rows of interest.
rng = pd.date_range('1/1/2000', periods=6, freq='D')
d = pd.DataFrame({'sid': ['a']*3 + ['b']*3,
                  'src': ['m']*3 + ['t']*3,
                  'alert_v': [1, 0, 0, 0, 1, 1]}, index=rng)
I want to get the time difference between rows where alert_v == 1.
I've tried shifting, but are there other ways to take the difference between two rows in a column?
I have tried simple lambdas and more complex .loc:
def deltat(g):
    g['d1'] = g[g['alert_v'] == 1]['timeindex'].shift(1)
    g['d0'] = g[g['alert_v'] == 1]['timeindex']
    g['td'] = g['d1'] - g['d0']
    return g

d['td'] = d.groupby(['src', 'sid']).apply(lambda x: deltat(x))

def indx(g):
    d0 = g.loc[g['alert_v'] == 1]
    d1[0] = d0[0]
    d1.append(d0[:-1])
    g['tavg'] = g.apply(g.ix[d1, 'timeindex'] - g.ix[d0, 'timeindex'])
    return g
After trying a bunch of approaches, I can't seem to get past either the multi-group or the filtering issues.
What's the best way to do this?
Edit:
diff(1) produces this error:
raise TypeError('incompatible index of inserted column '
TypeError: incompatible index of inserted column with frame index
while shift(1) produces this error:
ZeroDivisionError: integer division or modulo by zero
An attempt to clean the data did not help:
if any(pd.isnull(g['timeindex'])):
    print('## timeindex not null')
    g['timeindex'].fillna(method='ffill')
For the multiindex-group, select-rows, diff, insert-new-column paradigm: this is how I got it to work with clean output.
Some groups have 0 relevant rows, which throws an exception, and shift throws a KeyError, so I'm just sticking with diff():
# -- get the interarrival time
def deltat(g):
    try:
        g['tavg'] = g[g['alert_v'] == 1]['timeindex'].diff(1)
        return g
    except:
        pass

d.sort_index(axis=0, inplace=True)
d = d.groupby(['source', 'subject_id', 'alert_t', 'variable'], as_index=False, group_keys=False).apply(lambda x: deltat(x))
print(d[d['alert_v'] == 1][['timeindex', 'tavg']])
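As a cross-check on a toy frame shaped like the small example above (sid/src/alert_v only, with the timestamps in the index rather than a timeindex column), the same select-then-diff idea without the try/except:
import pandas as pd

rng = pd.date_range('1/1/2000', periods=6, freq='D')
d = pd.DataFrame({'sid': ['a']*3 + ['b']*3,
                  'src': ['m']*3 + ['t']*3,
                  'alert_v': [1, 0, 0, 0, 1, 1]}, index=rng)

def deltat(g):
    g = g.copy()                      # avoid mutating the group passed in
    alerts = g['alert_v'] == 1
    # diff of the timestamps of the alert rows only; other rows stay NaT
    g['tavg'] = g.index.to_series()[alerts].diff(1)
    return g

d = d.groupby(['src', 'sid'], group_keys=False).apply(deltat)
print(d[d['alert_v'] == 1]['tavg'])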
