I am trying to create a new column in a dataframe within a for loop of dictionary items that uses a string literal and the key, but it throws a "ValueError: cannot set a frame with no defined index and a scalar" error message.
Dictionary definition for exp categories
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
'Professional Fees':[19,20,21,22,23], 'Fees & Assessments':[25,26,27], 'IT Expenses':[29],
'Bad Debt Expense':[31],'Miscellaneous expenses': [33,34,35,36,37],'Marketing Expenses':[40,41,42],
'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities':[59,60],
'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],'Total Mill Expense':[70,71,72,73,74,75,76,77],
'Total Taxes':[80,81],'Total Insurance Expense':[83,84,85],'Incentive Compensation':[88],
'Strategic Initiative':[89]}
Creating a new dataframe based on a master dataframe
mcon = VA.loc[:,['Expense', 'Mgrl', 'Exp Category', 'Parent Category']]
mcon.loc[:,'Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in mcon['Mgrl']]
mcon.loc[:,'Business Unit'] = 'Managerial Consolidation'
mcon = mcon[['Business Unit', 'Exp Category','Parent Category', 'Expense', 'Mgrl', 'Variance Type']]
mcon.rename(columns={'Mgrl':'Variance'}, inplace=True)
Creating a new dataframe that will be written to excel eventually
a1 = pd.DataFrame()
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
a1.to_csv('example.csv', index = False)
Output looks like this
I am trying to add a new column that says Higher/Lower budget than {key} where key stands for the expense type using the below code
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    umconm.loc[:,'Explanation'] = f'Lower than budgeted {key}'
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    fmconm.loc[:,'Explanation'] = f'Higher than budgeted {key}'
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
but using the above string literal gives me the error message "ValueError: cannot set a frame with no defined index and a scalar"
I would really appreciate any help to either correct this or find a different solution for adding this field to my dataframe. Thanks in advance!
This error occurs because this line
umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
sometimes produces an empty dataframe, and .loc cannot create a column from a scalar on a frame that has no rows. Instead of .loc, assign the column directly:
umconm['Explanation'] = f'Lower than budgeted {key}'
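As a minimal sketch of the point above (the sample frame and the key are made up for illustration, not taken from the asker's data), plain column assignment succeeds whether or not the filtered frame has rows:

```python
import pandas as pd

# Hypothetical stand-in for one slice of mcon that has no negative variances.
mcon_slice = pd.DataFrame({"Expense": ["Rent", "Power"], "Variance": [120.0, 35.5]})
key = "Total Utilities"  # hypothetical dictionary key

# The filter keeps the columns but returns zero rows for this slice.
umconm = mcon_slice.query("Variance < 0").copy()
# Bracket assignment works even though the frame is empty.
umconm["Explanation"] = f"Lower than budgeted {key}"

# The same assignment on a non-empty frame fills every row with the string.
fmconm = mcon_slice.query("Variance > 0").copy()
fmconm["Explanation"] = f"Higher than budgeted {key}"

print(umconm.empty, list(umconm.columns))
print(fmconm)
```

The .copy() calls are only there so that each filtered frame is an independent object before a column is added to it.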
So silly of me, the solution is as follows:
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    umconm['Explanation'] = f'Higher than Budget for {key}'
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    fmconm['Explanation'] = f'Lower than Budget for {key}'
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
I didn't have to use .loc when creating a new column in this dataframe!
Below is the dataset I was using (syn-retweet-done.csv in my code). The above error occurred with it.
,Unnamed: 0,created_at,tweet,category
0,0,2021-07-29 02:40:00,People Gather in numbers,Other
1,0,2021-07-29 02:40:00,No real sign of safety,Other
2,1,2021-07-27 10:40:00,President is On fire,Politics
3,1,2021-07-27 10:40:00,Election is to be held next month,Politics
Below is the codebase I worked on. It would be very helpful if someone could figure out the issue, which points to aggregated().
def aggregated():
    tweets = pd.read_csv(r'syn-retweet-done.csv')
    df = pd.DataFrame(tweets, columns=['created_at', 'tweet','category'])
    out = df.groupby(['created_at', 'category'], sort=False, as_index=False)['tweet'] \
        .apply(lambda x: ' '.join(x))[df.columns]
    # print(out)
    return out
def saveFile():
    df = pd.read_csv('test_1.csv')
    categories = df['category'].unique()
    for category in categories:
        df[df['category'] == category].to_csv(category + '.csv')
# Driver code
if __name__ == '__main__':
    print(aggregated())
    aggregated().to_csv(r'test_1.csv',index = True, header=True)
    saveFile()
I'm not sure I am approaching this in the right way.
Scenario:
I have two SQL tables that contain rent information. One table contains rent due, and the other contains rent received.
I'm trying to build a rent book which takes the data from both tables for a specific lease and generates a date ordered statement which will be displayed on a webpage.
I'm using Python, Flask and SQL Alchemy.
I am currently learning Python, so I'm not sure if my approach is the best.
I've created a dictionary which contains the keys 'Date', 'Type' and 'Amount', and under each of these keys I store a list which contains the data from my SQL queries. The bit I'm struggling with is how to sort the dictionary by the date key while keeping the values under the other keys aligned with their dates.
lease_id = 5
dates_list = []
type_list = []
amounts_list = []
rentbook_dict = {}
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
rentbook_dict.setdefault('Date',[]).append(dates_list)
rentbook_dict.setdefault('Type',[]).append(type_list)
rentbook_dict.setdefault('Amount',[]).append(amounts_list)
I was then going to use a for loop within the flask template to iterate through each value and display it in a table on the page.
Or am I approaching this in the wrong way?
I managed to get this working using zipped lists. I'm sure there is a better way to accomplish this, but I'm pleased I've got it working.
lease_id = 5
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
total_due = 0
for debit in payments_due:
    total_due = total_due + int(debit.expected_rent_amount)
total_received = 0
for income in payments_received:
    total_received = total_received + int(income.payment_amount)
balance = total_received - total_due
if balance < 0 :
    arrears = "This account is in arrears"
else:
    arrears = ""
dates_list = []
type_list = []
amounts_list = []
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
payment_data = zip(dates_list, type_list, amounts_list)
sorted_payment_data = sorted(payment_data)
tuples = zip(*sorted_payment_data)
list1, list2, list3 = [ list(tuple) for tuple in tuples]
return(render_template('rentbook.html',
    payment_data = zip(list1,list2,list3),
    total_due = total_due,
    total_received = total_received,
    balance = balance))
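A possible simplification of the zip-based approach above is to keep each payment as one record and sort the records on the date only. This is a sketch under assumptions: the tuples below are hypothetical stand-ins for the two query results, and the dictionary keys are illustrative names rather than anything from the original models.

```python
from datetime import date
from operator import itemgetter

# Hypothetical rows standing in for the two SQLAlchemy query results.
payments_due = [(date(2021, 2, 1), 500), (date(2021, 1, 1), 500)]
payments_received = [(date(2021, 1, 3), "Bank transfer", 500)]

# One dict per statement line keeps date, type and amount together.
rentbook = [
    {"date": d, "type": "Rent Due", "amount": amount}
    for d, amount in payments_due
]
rentbook += [
    {"date": d, "type": kind, "amount": amount}
    for d, kind, amount in payments_received
]

# Sort on the date only; rows with equal dates keep their original order.
rentbook.sort(key=itemgetter("date"))

total_due = sum(amount for _, amount in payments_due)
total_received = sum(amount for _, _, amount in payments_received)
balance = total_received - total_due

for row in rentbook:
    print(row["date"], row["type"], row["amount"])
```

A list of records like this can be passed to the template as a single variable and iterated row by row, so the three parallel lists never have to be re-aligned.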
I have this function to find duplicates in a dataframe:
def checkForDuplicates():
    # df is the module-level dataframe; the result gets its own name so df is not shadowed
    database_dup_first = df.drop_duplicates(subset=['name','slug','id'], keep='first')
    duplicates = database_dup_first[database_dup_first.duplicated(['name','id'], keep=False)]
    for index, value in duplicates.iterrows():
        slug = value.slug
        team = value.team
        print((slug, team))
    return duplicates
which prints tuples with repeated players (with slightly different name entries) and their teams:
df = checkForDuplicates()
('Wesley', 'Bragantino')
('Wesley Pionteck', 'Bragantino')
('Leonardo Gil', 'Vasco')
('Léo Gil', 'Vasco')
('João Paulo', 'Fortaleza')
('João Paulo Silveira', 'Fortaleza')
...
Now I need to perform a replacement on the dataset, where the last (always the last) similar entry replaces the first.
I know I could perform this manually, for all returned duplicates, like so:
df = checkForDuplicates()
for index, value in df.iterrows():
    slug = value.slug
    team = value.team
    if slug == 'Wesley' and team =='Bragantino':
        df['slug'].iloc[index] = 'Wesley Pionteck'
    if slug == 'Leonardo Gil' and team =='Vasco':
        df['slug'].iloc[index] = 'Léo Gil'
    if slug == 'João Paulo' and team =='Fortaleza':
        df['slug'].iloc[index] = 'João Paulo Silveira'
But the actual list of duplicates is huge. So how can I automate this replacement for all duplicate entries?
Your solution:
n=len(df['slug'])
for i,e in enumerate(df['slug']):
    if i < n-1:
        a = df.iloc[i,0]
        b = df.iloc[i+1,0]
        match_count = 0
        if len(a.split()) < len(b.split()):
            fcount = len(a.split())
        else:
            fcount = len(b.split())
        for a1 in a.split():
            for b1 in b.split():
                if (a1.find(b1)>=0 or b1.find(a1) >=0):
                    match_count+=1
        if (match_count == fcount) and (df.iloc[i,1] == df.iloc[i+1,1]):
            match_count = 0
            df.iloc[i,0] = df.iloc[i+1,0]
    else:
        break
# keep the last slug with unique name and id
df_slug_last = df.drop_duplicates(subset=['name','id'], keep='last')[['name','slug','id']]
df_slug_last.columns = ['name','slug_last','id']
# merge the last slug to the origin df
dfn = pd.merge(df, df_slug_last, on=['name','id'], how='left')
dfn['slug_last'] will then hold the latest slug for every unique (name, id) pair.
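The merge approach can be tried end to end on a small frame. The sample rows below are invented for illustration (only some of the slugs and teams come from the question), and overwriting slug with slug_last at the end is one assumed way to finish the replacement:

```python
import pandas as pd

# Hypothetical sample; real data would come from the asker's dataframe.
df = pd.DataFrame({
    "name": ["wesley", "wesley", "leo-gil", "leo-gil", "hulk"],
    "slug": ["Wesley", "Wesley Pionteck", "Leonardo Gil", "Léo Gil", "Hulk"],
    "id": [1, 1, 2, 2, 3],
    "team": ["Bragantino", "Bragantino", "Vasco", "Vasco", "Atlético-MG"],
})

# Keep the last slug seen for each (name, id) pair.
last = (
    df.drop_duplicates(subset=["name", "id"], keep="last")[["name", "slug", "id"]]
      .rename(columns={"slug": "slug_last"})
)

# Attach that slug to every row, then overwrite the original column.
dfn = df.merge(last, on=["name", "id"], how="left")
dfn["slug"] = dfn["slug_last"]
dfn = dfn.drop(columns="slug_last")

print(dfn)
```

Rows that have no duplicate simply get their own slug back, so the whole frame can be processed in one pass.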
I want to create comments from a dataset that details the growth rate, market share, etc. for various markets and products. The dataset is in the form of a pd.DataFrame(). I would like the comment to include keywords like increase/decrease based on the calculations. For example, if Jan 2020 has sales of 1000 and Jan 2021 has sales of 1600, that necessarily means an increase of 60%.
I defined a standalone function, shown below, and would like to know whether this method is too clumsy and, if so, how I should improve it.
from collections import namedtuple
GrowthIncDec = namedtuple('gr_tuple', ['annual_growth_rate', 'quarterly_growth_rate'])
def increase_decrease(annual_gr, quarter_gr):
    if annual_gr > 0:
        annual_growth_rate = 'increased'
    elif annual_gr < 0:
        annual_growth_rate = 'decreased'
    else:
        annual_growth_rate = 'stayed the same'
    if quarter_gr > 0:
        quarterly_growth_rate = 'increased'
    elif quarter_gr < 0:
        quarterly_growth_rate = 'decreased'
    else:
        quarterly_growth_rate = 'stayed the same'
    gr_named_tuple = GrowthIncDec(annual_growth_rate=annual_growth_rate, quarterly_growth_rate=quarterly_growth_rate)
    return gr_named_tuple
myfunc = increase_decrease(5, -1)
myfunc.annual_growth_rate
# output: 'increased'
A snippet of my main code is as follows to illustrate the use of the above function:
def get_comments(grp, some_dict: Dict[str, List[str]]):
    .......
    try:
        subdf = the dataframe
        annual_gr = subdf['Annual_Growth'].values[0]
        quarter_gr = subdf['Quarterly_Growth'].values[0]
        inc_dec_named_tup = increase_decrease(annual_gr, quarter_gr)
        inc_dec_annual_gr = inc_dec_named_tup.annual_growth_rate
        inc_dec_quarterly_gr = inc_dec_named_tup.quarterly_growth_rate
        comment = "The {} has {} by {:.1%} in {} {} compared to {} {}"\
            .format(market, inc_dec_annual_gr, annual_gr, timeperiod, curr_date, timeperiod, prev_year)
        comments_df = pd.DataFrame(columns=['Date','Comments'])
        # comments_df['Date'] = [curr_date]
        comments_df['Comments'] = [comment]
        return comments_df
    except (IndexError, KeyError) as e:
        # this is for all those nan values which is empty
        annual_gr = 0
        quarter_gr = 0
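One possible way to cut the repetition in increase_decrease is to put the sign-to-word mapping in a single helper and call it once per rate. This is only a sketch: GrowthWords and describe_change are hypothetical names, and the sample rates and sentence are made up.

```python
from typing import NamedTuple


class GrowthWords(NamedTuple):
    """Wording for the annual and quarterly growth rates."""
    annual_growth_rate: str
    quarterly_growth_rate: str


def describe_change(rate: float) -> str:
    """Map the sign of a growth rate to a word for the comment."""
    if rate > 0:
        return "increased"
    if rate < 0:
        return "decreased"
    return "stayed the same"


def increase_decrease(annual_gr: float, quarter_gr: float) -> GrowthWords:
    # Same result shape as the original function, built from one shared helper.
    return GrowthWords(describe_change(annual_gr), describe_change(quarter_gr))


annual_gr = 0.6  # 1000 -> 1600 is a 60% increase
words = increase_decrease(annual_gr, -0.01)
comment = f"Sales {words.annual_growth_rate} by {abs(annual_gr):.1%} year on year"
print(comment)
```

Callers can keep reading the result by field name, as the original snippet does.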
I have two functions: one creates a dataframe from a csv and the other manipulates that dataframe. There is no problem the first time I pass the raw data through lsc_age(import_data()). However, I get the above-referenced error (TypeError: 'DataFrame' object is not callable) on the second and every later call. Any ideas for how to solve the problem?
def import_data(csv,date1,date2):
    global data
    data = pd.read_csv(csv,header=1)
    data = data.iloc[:,[0,1,4,6,7,8,9,11]]
    data = data.dropna(how='all')
    data = data.rename(columns={"National: For Dates 9//1//"+date1+" - 8//31//"+date2:'event','Unnamed: 1':'time','Unnamed: 4':'points',\
                                'Unnamed: 6':'name','Unnamed: 7':'age','Unnamed: 8':'lsc','Unnamed: 9':'club','Unnamed: 11':'date'})
    data = data.reset_index().drop('index',axis=1)
    data = data[data.time!='Time']
    data = data[data.points!='Power ']
    data = data[data['event']!="National: For Dates 9//1//"+date1+" - 8//31//"+date2]
    data = data[data['event']!='USA Swimming, Inc.']
    data = data.reset_index().drop('index',axis=1)
    for i in range(len(data)):
        if len(str(data['event'][i])) <= 3:
            data['event'][i] = data['event'][i-1]
        else:
            data['event'][i] = data['event'][i]
    data = data.dropna()
    age = []
    event = []
    gender = []
    for row in data.event:
        gender.append(row.split(' ')[0])
        if row[:9]=='Female 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        elif row[:7]=='Male 10':
            n = 4
            groups = row.split(' ')
            age.append(' '.join(groups[1:n]))
            event.append(' '.join(groups[n:]))
        else:
            n = 2
            groups = row.split(' ')
            event.append(' '.join(groups[n:]))
            groups = row.split(' ')
            age.append(groups[1])
    data['age_group'] = age
    data['event_simp'] = event
    data['gender'] = gender
    data['year'] = date2
    return data
def lsc_age(data_two):
    global lsc, lsc_age, top, all_performers
    lsc = pd.DataFrame(data_two['event'].groupby(data_two['lsc']).count()).reset_index().sort_values(by='event',ascending=False)
    lsc_age = data_two.groupby(['year','age_group','lsc'])['event'].count().reset_index().sort_values(by=['age_group','event'],ascending=False)
    top = pd.concat([lsc_age[lsc_age.age_group=='10 & under'].head(),lsc_age[lsc_age.age_group=='11-12'].head(),\
                     lsc_age[lsc_age.age_group=='13-14'].head(),lsc_age[lsc_age.age_group=='15-16'].head(),\
                     lsc_age[lsc_age.age_group=='17-18'].head()],ignore_index=True)
    all_performers = pd.concat([lsc_age[lsc_age.age_group=='10 & under'],lsc_age[lsc_age.age_group=='11-12'],\
                                lsc_age[lsc_age.age_group=='13-14'],lsc_age[lsc_age.age_group=='15-16'],\
                                lsc_age[lsc_age.age_group=='17-18']],ignore_index=True)
    all_performers = all_performers.rename(columns={'event':'no. top 100'})
    all_performers['age_year_lsc'] = all_performers.age_group+' '+all_performers.year.astype(str)+' '+all_performers.lsc
    return all_performers
years = [i for i in range(2008,2018)]
for i in range(len(years)-1):
    lsc_age(import_data(str(years[i+1])+"national100.csv",\
                        str(years[i]),str(years[i+1])))
During the first call to your function lsc_age(), the line
lsc_age = data_two.groupby(['year','age_group','lsc'])['event'].count().reset_index().sort_values(by=['age_group','event'],ascending=False)
overwrites the function object with a dataframe. This happens because the statement
global lsc, lsc_age, top, all_performers
declares lsc_age as a global name, so the assignment rebinds the module-level name that previously referred to the function. Functions in Python are objects, and their names can be rebound like any other variable.
To solve your problem, avoid the global declarations; they do not seem to be necessary. Pass your data into the functions as arguments and return the results, and give the dataframe a name that differs from the function's.
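The effect can be reproduced without pandas. In this minimal sketch (summarise and summarise_fixed are hypothetical names, and the code is assumed to run at module level), the first function rebinds its own global name on the first call, so the second call fails; the fixed version keeps its result in a local variable:

```python
def summarise(values):
    # `global` makes the assignment below rebind the module-level name.
    global summarise
    summarise = sum(values)  # the function object is replaced by an int
    return summarise


first = summarise([1, 2, 3])  # works: `summarise` is still a function here

try:
    summarise([4, 5, 6])  # fails: `summarise` now refers to the int 6
    second_error = None
except TypeError as exc:
    second_error = str(exc)


def summarise_fixed(values):
    # A local name leaves the function's own name untouched.
    total = sum(values)
    return total


print(first, second_error)
print(summarise_fixed([1, 2, 3]), summarise_fixed([4, 5, 6]))
```

The same pattern applies to lsc_age: once the intermediate dataframe has a different, local name, the function can be called in a loop as many times as needed.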