Pandas - automate string replacement on dataframe

I have this function to find duplicates in a dataframe:
def checkForDuplicates():
    # df is the module-level dataframe; the result gets its own local name so
    # that reading df here does not raise UnboundLocalError
    database_dup_first = df.drop_duplicates(subset=['name','slug','id'], keep='first')
    dups = database_dup_first[database_dup_first.duplicated(['name','id'], keep=False)]
    for index, value in dups.iterrows():
        slug = value.slug
        team = value.team
        print((slug, team))
    return dups
which prints tuples with repeated players (with slightly different name entries) and their teams:
df = checkForDuplicates()
('Wesley', 'Bragantino')
('Wesley Pionteck', 'Bragantino')
('Leonardo Gil', 'Vasco')
('Léo Gil', 'Vasco')
('João Paulo', 'Fortaleza')
('João Paulo Silveira', 'Fortaleza')
...
Now I need to perform a replacement on the dataset, where the last (always the last) similar entry replaces the first.
I know I could perform this manually, for all returned duplicates, like so:
df = checkForDuplicates()
for index, value in df.iterrows():
    slug = value.slug
    team = value.team
    if slug == 'Wesley' and team == 'Bragantino':
        df['slug'].iloc[index] = 'Wesley Pionteck'
    if slug == 'Leonardo Gil' and team == 'Vasco':
        df['slug'].iloc[index] = 'Léo Gil'
    if slug == 'João Paulo' and team == 'Fortaleza':
        df['slug'].iloc[index] = 'João Paulo Silveira'
But the actual list of duplicates is huge. So how can I automate this replacement for all duplicate entries?

Your solution:
n = len(df['slug'])
for i, e in enumerate(df['slug']):
    if i < n-1:
        a = df.iloc[i,0]
        b = df.iloc[i+1,0]
        match_count = 0
        if len(a.split()) < len(b.split()):
            fcount = len(a.split())
        else:
            fcount = len(b.split())
        for a1 in a.split():
            for b1 in b.split():
                if (a1.find(b1)>=0 or b1.find(a1) >=0):
                    match_count+=1
        if (match_count == fcount) and (df.iloc[i,1] == df.iloc[i+1,1]):
            match_count = 0
            df.iloc[i,0] = df.iloc[i+1,0]
    else:
        break

# keep the last slug with unique name and id
df_slug_last = df.drop_duplicates(subset=['name','id'], keep='last')[['name','slug','id']]
df_slug_last.columns = ['name','slug_last','id']
# merge the last slug to the origin df
dfn = pd.merge(df, df_slug_last, on=['name','id'], how='left')
dfn['slug_last'] will then hold the latest slug for every unique (name, id) pair.
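As a possible alternative to the merge above, the same "last slug wins" rule can be sketched with a groupby transform. The sample rows below are hypothetical, and the sketch assumes the rows are already ordered so that the last row of each (name, id) group carries the slug to keep.

```python
import pandas as pd

# Hypothetical sample data modelled on the question's output.
df = pd.DataFrame({
    'name': ['Wesley', 'Wesley', 'Léo Gil', 'Léo Gil', 'Everton'],
    'id':   [10, 10, 20, 20, 30],
    'slug': ['Wesley', 'Wesley Pionteck', 'Leonardo Gil', 'Léo Gil', 'Everton'],
    'team': ['Bragantino', 'Bragantino', 'Vasco', 'Vasco', 'Grêmio'],
})

# For every (name, id) group, overwrite each slug with the group's last slug.
df['slug'] = df.groupby(['name', 'id'])['slug'].transform('last')

print(df['slug'].tolist())
```

This rewrites the slug column in place and leaves rows without duplicates unchanged, so no separate lookup frame is needed.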

Related

problem with creating column name dynamically in a dataframe

I am trying to create a pandas dataframe dynamically. So far I'm fine with capturing data within the dataframe, but not with the name of the columns.
I also want the name of the columns to be based on the 'category' of the data that comes with the record that I am reading in my function, but I always get the last one.
def funct_example(client):
    documents = [ v_document ]
    poller = client.begin_analyze_entities(documents)
    result = poller.result()
    docs = [doc for doc in result if not doc.is_error]
    i = 1
    df_final = pd.DataFrame()
    for idx, doc in enumerate(docs):
        for relation in doc.entity_relations:
            for role in relation.roles:
                name = str([format(entity.category)]) + str(i) # <---- THIS IS ALWAYS THE LAST RECORD
                d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)} # <---THIS IS OK
                df = pd.DataFrame(data=d, index=[0])
                df_final = pd.concat([df_final, df], axis=1)
                i = i + 1
    display(df_final)
    return(df_final)
df_new_2 = funct_example(client)
I've tried adding an extra loop between the statement that creates the dataframe and the concat call, like so:
for col in df.columns:
    name = str([format(entity.category)]) + str(i)
    df = df.rename(columns={col: name })
But the last category still appears in the column name...
How can I solve that?
Thank you very much in advance.
SOLUTION:
    for idx, doc in enumerate(docs):
        for relation in doc.entity_relations:
            for role in relation.roles:
                name = 'Relation_' + format(relation.relation_type) + '_' + str(i)
                d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)}
                df = pd.DataFrame(data=d, index=[0])
                df_final = pd.concat([df_final, df], axis=1)
                i = i + 1
    display(df_final)
    return(df_final)
df_relations = funct_example(client)
Regards!! :D

It's difficult to suggest a solution without knowing the properties of all the objects being used.
The 'entity' object is not defined within the function, so is it a global variable?
Does the role have an 'entity', which in turn has the 'category' property? I'm assuming so, since role does have an entity property:
d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)} # <---THIS IS OK
Besides, the name variable, while initialized, is not used.
Maybe try:
name = str([format(role.entity.category)]) + str(i)
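To illustrate the suggestion, here is a minimal sketch that reads the category from the role currently being processed. The objects are hypothetical stand-ins built with SimpleNamespace; only the attribute names that appear in the question (roles, name, entity.text, entity.category) are assumed, and the sample values are invented.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the service's result objects.
def make_role(name, text, category):
    return SimpleNamespace(name=name, entity=SimpleNamespace(text=text, category=category))

relations = [
    SimpleNamespace(roles=[make_role('Dosage', '100 mg', 'Dosage'),
                           make_role('Medication', 'ibuprofen', 'MedicationName')]),
    SimpleNamespace(roles=[make_role('Frequency', 'twice daily', 'Frequency')]),
]

columns = {}
i = 1
for relation in relations:
    for role in relation.roles:
        # Take the category from this role, not from a stale outer variable.
        name = f"{role.entity.category}{i}"
        columns[name] = f"'{role.name}' with entity '{role.entity.text}'"
        i += 1

print(list(columns))
```

Collecting everything in one dictionary also means the dataframe could be built once at the end, for example with pd.DataFrame(columns, index=[0]), instead of concatenating inside the loop.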

How to order a python dictionary containing a list of values

I'm not sure I am approaching this in the right way.
Scenario:
I have two SQL tables that contain rent information. One table contains rent due, and the other contains rent received.
I'm trying to build a rent book which takes the data from both tables for a specific lease and generates a date ordered statement which will be displayed on a webpage.
I'm using Python, Flask and SQLAlchemy.
I am currently learning Python, so I'm not sure if my approach is the best.
I've created a dictionary which contains the keys 'Date', 'Payment type' and 'Payment Amount', and under each of these keys I store a list which contains the data from my SQL queries. The bit I'm struggling with is how to sort the dictionary by the date key while keeping the values in the other keys aligned to their date.
lease_id = 5
dates_list = []
type_list = []
amounts_list = []
rentbook_dict = {}
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
rentbook_dict.setdefault('Date',[]).append(dates_list)
rentbook_dict.setdefault('Type',[]).append(type_list)
rentbook_dict.setdefault('Amount',[]).append(amounts_list)
I was then going to use a for loop within the Flask template to iterate through each value and display it in a table on the page.
Or am I approaching this in the wrong way?

So I managed to get this working just using zipped lists. I'm sure there is a better way to accomplish this, but I'm pleased I've got it working.
lease_id = 5
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
total_due = 0
for debit in payments_due:
    total_due = total_due + int(debit.expected_rent_amount)
total_received = 0
for income in payments_received:
    total_received = total_received + int(income.payment_amount)
balance = total_received - total_due
if balance < 0 :
    arrears = "This account is in arrears"
else:
    arrears = ""
dates_list = []
type_list = []
amounts_list = []
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
payment_data = zip(dates_list, type_list, amounts_list)
sorted_payment_data = sorted(payment_data)
tuples = zip(*sorted_payment_data)
list1, list2, list3 = [ list(tuple) for tuple in tuples]
return(render_template('rentbook.html',
    payment_data = zip(list1,list2,list3),
    total_due = total_due,
    total_received = total_received,
    balance = balance))
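One possible simplification, sketched here under assumptions: keep one record per payment instead of three parallel lists, and sort the records on their date. The rows below are hypothetical dictionaries standing in for the two query results, so the key names are illustrative rather than the model attributes.

```python
from datetime import date

# Hypothetical rows standing in for the two query results.
payments_due = [
    {'date': date(2023, 2, 1), 'amount': 500},
    {'date': date(2023, 1, 1), 'amount': 500},
]
payments_received = [
    {'date': date(2023, 1, 5), 'type': 'Bank transfer', 'amount': 500},
]

# One record per row keeps date, type and amount together.
rentbook = [
    {'date': p['date'], 'type': 'Rent Due', 'amount': p['amount']}
    for p in payments_due
] + [
    {'date': p['date'], 'type': p['type'], 'amount': p['amount']}
    for p in payments_received
]

# Sort on the date only, so ties never fall back to comparing types or amounts.
rentbook.sort(key=lambda row: row['date'])

for row in rentbook:
    print(row['date'].isoformat(), row['type'], row['amount'])
```

A list like this can be passed to the template as is and looped over once, with each row exposing its date, type and amount.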

Creating a new column based on the key of a dictionary?

I am trying to create a new column in a dataframe within a for loop of dictionary items that uses a string literal and the key, but it throws a "ValueError: cannot set a frame with no defined index and a scalar" error message.
Dictionary definition for exp categories
d = {'Travel & Entertainment': [1,2,3,4,5,6,7,8,9,10,11], 'Office supplies & Expenses': [13,14,15,16,17],
'Professional Fees':[19,20,21,22,23], 'Fees & Assessments':[25,26,27], 'IT Expenses':[29],
'Bad Debt Expense':[31],'Miscellaneous expenses': [33,34,35,36,37],'Marketing Expenses':[40,41,42],
'Payroll & Related Expenses': [45,46,47,48,49,50,51,52,53,54,55,56], 'Total Utilities':[59,60],
'Total Equipment Maint, & Rental Expense': [63,64,65,66,67,68],'Total Mill Expense':[70,71,72,73,74,75,76,77],
'Total Taxes':[80,81],'Total Insurance Expense':[83,84,85],'Incentive Compensation':[88],
'Strategic Initiative':[89]}
Creating a new dataframe based on a master dataframe
mcon = VA.loc[:,['Expense', 'Mgrl', 'Exp Category', 'Parent Category']]
mcon.loc[:,'Variance Type'] = ['Unfavorable' if x < 0 else 'favorable' for x in mcon['Mgrl']]
mcon.loc[:,'Business Unit'] = 'Managerial Consolidation'
mcon = mcon[['Business Unit', 'Exp Category','Parent Category', 'Expense', 'Mgrl', 'Variance Type']]
mcon.rename(columns={'Mgrl':'Variance'}, inplace=True)
Creating a new dataframe that will be written to excel eventually
a1 = pd.DataFrame()
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
a1.to_csv('example.csv', index = False)
Output looks like this
I am trying to add a new column that says "Higher/Lower than budgeted {key}", where key stands for the expense type, using the code below:
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    umconm.loc[:,'Explanation'] = f'Lower than budgeted {key}'
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    fmconm.loc[:,'Explanation'] = f'Higher than budgeted {key}'
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
However, using the string literal this way gives me the error message "ValueError: cannot set a frame with no defined index and a scalar".
I would really appreciate any help to either correct this or find a different solution for adding this field to my dataframe. Thanks in advance!

This error occurs because this line
umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
will sometimes produce an empty dataframe with no index. Instead, set the column with plain bracket assignment (not .loc):
umconm['Explanation'] = f'Lower than budgeted {key}'

So silly of me, the solution is as follows:
for key, value in d.items():
    umconm = mcon.iloc[value].query('Variance < 0').nsmallest(5, 'Variance')
    umconm['Explanation'] = f'Higher than Budget for {key}'
    fmconm = mcon.iloc[value].query('Variance > 0').nlargest(5, 'Variance')
    fmconm['Explanation'] = f'Lower than Budget for {key}'
    if umconm.empty == False or fmconm.empty == False:
        a1 = pd.concat([a1,umconm,fmconm], ignore_index = True)
    else:
        continue
I didn't have to use .loc while creating a new column in this dataframe!
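A small sketch of the empty-selection case, using a hypothetical two-row frame in place of mcon. It uses DataFrame.assign, which is a swapped-in alternative to the bracket assignment above: it returns a new frame with the extra column whether or not the selection has rows.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
mcon = pd.DataFrame({'Expense': ['Airfare', 'Hotels'], 'Variance': [120.0, 35.5]})
key = 'Travel & Entertainment'

# No row has a negative variance here, so the first selection is empty.
umconm = mcon.query('Variance < 0').nsmallest(5, 'Variance')
fmconm = mcon.query('Variance > 0').nlargest(5, 'Variance')

# assign() adds the column to a copy, for empty and non-empty frames alike.
umconm = umconm.assign(Explanation=f'Lower than budgeted {key}')
fmconm = fmconm.assign(Explanation=f'Higher than budgeted {key}')

# Concatenate only the selections that actually contain rows.
frames = [frame for frame in (umconm, fmconm) if not frame.empty]
result = pd.concat(frames, ignore_index=True)
print(result)
```

Because assign works on a copy, it also avoids modifying a slice of the original mcon frame.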

combine two for loops in to fill same dictionary

I am trying to pick two different merchants from a list of dictionaries, giving priority to merchants that have prices. If two different merchants with prices cannot be found, merchant 1 or 2 should be filled with other data from the list; if the list is not long enough, merchant 1 or 2 should be None.
That is, the for loop should return two merchants, with priority to merchants that have prices. If that is not enough to fill merchants 1 and 2, merchants with no prices are used. Finally, if merchant 1 or 2 has still not been created, it is filled with a None value.
Here is the code I have so far. It does the job, but I believe it can be combined in a more Pythonic way.
import csv
dummy_list = []  # rows read from the CSV file
with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader=csv.DictReader(csvFile)
    for row in reader:
        dummy_list.append(row)
item=dict()
index = 1
for merchant in dummy_list:
    if merchant['price']:
        if index==2:
            if item['merchant_1']==merchant['name']:
                continue
        item['merchant_%d'%index] = merchant['name']
        item['merchant_%d_price'%index] = merchant['price']
        item['merchant_%d_stock'%index] = merchant['stock']
        item['merchant_%d_link'%index] = merchant['link']
        if index==3:
            break
        index+=1
for merchant in dummy_list:
    if index==3:
        break
    if index<3:
        try:
            if item['merchant_1']==merchant['name']:
                continue
        except KeyError:
            pass
        item['merchant_%d'%index] = merchant['name']
        item['merchant_%d_price'%index] = merchant['price']
        item['merchant_%d_stock'%index] = merchant['stock']
        item['merchant_%d_link'%index] = merchant['link']
        index+=1
while index<3:
    item['merchant_%d'%index] = ''
    item['merchant_%d_price'%index] = ''
    item['merchant_%d_stock'%index] = ''
    item['merchant_%d_link'%index] = ''
    index+=1
print(item)
Here are the contents of the CSV file:
price,link,name,stock
,https://www.samsclub.com/sams/donut-shop-100-ct-k-cups/prod19381344.ip,Samsclub,
,https://www.costcobusinessdelivery.com/Green-Mountain-Original-Donut-Shop-Coffee%2C-Medium%2C-Keurig-K-Cup-Pods%2C-100-ct.product.100297848.html,Costcobusinessdelivery,
,https://www.costco.com/The-Original-Donut-Shop%2C-Medium-Roast%2C-K-Cup-Pods%2C-100-count.product.100381350.html,Costco,
57.99,https://www.target.com/p/the-original-donut-shop-regular-medium-roast-coffee-keurig-k-cup-pods-108ct/-/A-13649874,Target,Out of Stock
10.99,https://www.target.com/p/the-original-donut-shop-dark-roast-coffee-keurig-k-cup-pods-18ct/-/A-16185668,Target,In Stock
,https://www.homedepot.com/p/Keurig-Kcup-Pack-The-Original-Donut-Shop-Coffee-108-Count-110030/204077166,Homedepot,Undertermined

As you only want to keep at most 2 merchants, I would process the CSV only once, keeping a list of merchants with prices and a separate list of merchants without prices, and stopping as soon as 2 merchants with prices have been found.
After that loop, I would concatenate those 2 lists and a list of two empty merchants, and take the first 2 elements of the result. That is enough to meet your requirement of 2 distinct merchants with priority to those having prices. Finally, I would use that to fill the item dict.
The code would be:
import csv
with open('/home/timmy/testing/example/example/test.csv') as csvFile:
    reader=csv.DictReader(csvFile)
    names_price = set()
    names_no_price = set()
    merchant_price = []
    merchant_no_price = []
    item = {}
    for merchant in reader:
        if merchant['price']:
            if not merchant['name'] in names_price:
                names_price.add(merchant['name'])
                merchant_price.append(merchant)
                if len(merchant_price) == 2:
                    break
        else:
            if not merchant['name'] in names_no_price:
                names_no_price.add(merchant['name'])
                merchant_no_price.append(merchant)
void = { k: '' for k in reader.fieldnames}
merchant_list = (merchant_price + merchant_no_price + [void, void.copy()])[:2]
for index, merchant in enumerate(merchant_list, 1):
    item['merchant_%d'%index] = merchant['name']
    item['merchant_%d_price'%index] = merchant['price']
    item['merchant_%d_stock'%index] = merchant['stock']
    item['merchant_%d_link'%index] = merchant['link']
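For comparison, here is a compact variant of the same idea as a sketch. It swaps the two lists for a stable sort that moves priced rows to the front, and it uses a single set of seen names so that one merchant cannot be picked twice. The helper name pick_merchants, the shortened sample CSV and its example.com links are hypothetical, and unlike the code above it reads every row rather than stopping early.

```python
import csv
import io

# Shortened, hypothetical version of the question's CSV.
SAMPLE = """price,link,name,stock
,https://example.com/a,Samsclub,
,https://example.com/b,Costco,
57.99,https://example.com/c,Target,Out of Stock
10.99,https://example.com/d,Target,In Stock
"""

def pick_merchants(rows, count=2):
    """Return `count` rows with distinct names, priced rows first, padded with blanks."""
    # Stable sort: priced rows move to the front, original order is kept otherwise.
    ordered = sorted(rows, key=lambda row: not row['price'])
    chosen, seen = [], set()
    for row in ordered:
        if row['name'] not in seen:
            seen.add(row['name'])
            chosen.append(row)
        if len(chosen) == count:
            break
    blank = {'name': '', 'price': '', 'stock': '', 'link': ''}
    chosen += [dict(blank) for _ in range(count - len(chosen))]
    return chosen

item = {}
for index, merchant in enumerate(pick_merchants(csv.DictReader(io.StringIO(SAMPLE))), 1):
    item[f'merchant_{index}'] = merchant['name']
    for field in ('price', 'stock', 'link'):
        item[f'merchant_{index}_{field}'] = merchant[field]

print(item['merchant_1'], item['merchant_2'])
```

With the sample above, the first priced Target row fills merchant 1 and Samsclub, the first unpriced row with a different name, fills merchant 2.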

python list of dictionaries only updating 1 attribute and skipping others

I have a list of lists containing company objects:
companies_list = [companies1, companies2]
I have the following function:
def get_fund_amount_by_year(companies_list):
    companies_length = len(companies_list)
    for idx, companies in enumerate(companies_list):
        companies1 = companies.values_list('id', flat=True)
        funding_rounds = FundingRound.objects.filter(company_id__in=companies1).order_by('announced_on')
        amount_per_year_list = []
        for fr in funding_rounds:
            fr_year = fr.announced_on.year
            fr_amount = fr.raised_amount_usd
            if not any(d['year'] == fr_year for d in amount_per_year_list):
                year_amount = {}
                year_amount['year'] = fr_year
                for companies_idx in range(companies_length):
                    year_amount['amount'+str(companies_idx)] = 0
                    if companies_idx == idx:
                        year_amount['amount'+str(companies_idx)] = fr_amount
                amount_per_year_list.append(year_amount)
            else:
                for year_amount in amount_per_year_list:
                    if year_amount['year'] == fr_year:
                        year_amount['amount'+str(idx)] += fr_amount
    return amount_per_year_list
The problem is the resulting list of dictionaries has only one amount attribute updated.
As you can see "amount0" contains all "0" amounts:
[{'amount1': 12100000L, 'amount0': 0, 'year': 1999}, {'amount1': 8900000L, 'amount0': 0, 'year': 2000}]
What am I doing wrong?

The list of dictionaries was being created inside the loop, so each iteration overwrote the results of the previous one. I moved it above the loop, so the function now starts like this:
def get_fund_amount_by_year(companies_list):
    companies_length = len(companies_list)
    amount_per_year_list = []  # moved out of the loop
    for idx, companies in enumerate(companies_list):
        companies1 = companies.values_list('id', flat=True)
        funding_rounds = FundingRound.objects.filter(company_id__in=companies1).order_by('announced_on')
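The same per-year totals can also be sketched without searching the list on every round, by keying a dictionary on the year. The (year, amount) pairs below are hypothetical stand-ins for the funding rounds that the question loads through the ORM.

```python
from collections import defaultdict

# Hypothetical (year, amount) pairs, one inner list per company group.
rounds_per_group = [
    [(1999, 5_000_000), (2000, 1_000_000)],
    [(1999, 12_100_000), (2000, 8_900_000), (2000, 100_000)],
]

group_count = len(rounds_per_group)
# One totals list per year with one slot per group, shared across all groups.
totals = defaultdict(lambda: [0] * group_count)

for idx, rounds in enumerate(rounds_per_group):
    for year, amount in rounds:
        totals[year][idx] += amount

amount_per_year_list = [
    {'year': year, **{f'amount{i}': value for i, value in enumerate(values)}}
    for year, values in sorted(totals.items())
]
print(amount_per_year_list)
```

Since the totals dictionary is created once, before the loop over groups, every group adds to the same yearly entries instead of starting from an empty list.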
