I have the following Python script. In it, I am iterating through a CSV file that has rows and rows of loyalty cards. In many cases there is more than one entry per card. I am currently looping through each row, then using loc to find all other instances of the card from the current row, so I can combine them to post to an API. What I'd like to do, however, is remove all the rows I've just merged once that post is done, so the iteration doesn't hit them again.
That's the part I'm stuck on. Any ideas? Essentially I want to remove all the rows in card_list from csv before I go into the next iteration. That way, even though there might be 5 rows with the same card number, I only process that card once. I tried using
csv = csv[csv.card != row.card]
at the end of the loop, thinking it might regenerate the dataframe without any rows whose card matches the one just processed, but it didn't work.
import urllib3
import json
import pandas as pd
import os
import time
import pyfiglet
from datetime import datetime
import array as arr

for row in csv.itertuples():
    dt = datetime.now()
    vouchers = []
    if minutePassed(time.gmtime(lastrun)[4]):
        print('Getting new token...')
        token = get_new_token()
        lastrun = time.time()
    print('processing ' + str(int(row.card)))
    card_list = csv.loc[csv['card'] == int(row.card)]
    print('found ' + str(len(card_list)) + ' vouchers against this card')
    for row in card_list.itertuples():
        print('appending card ' + str(int(row.card)) + ' voucher ' + str(row.voucher))
        vouchers.append(row.voucher)
    print('vouchers, ', vouchers)
    encoded_data = json.dumps({
        "store_id": row.store,
        "transaction_id": "11111",
        "card_number": int(row.card),
        "voucher_instance_ids": vouchers
    })
    print(encoded_data)
    number += 1
    r = http.request('POST', lcs_base_path + 'customer/auth/redeem-commit', body=encoded_data,
                     headers={'x-api-key': api_key, 'Authorization': 'Bearer ' + token})
    response_data = json.loads(r.data)
    if (r.status == 200):
        print(str(dt) + ' ' + str(number) + ' done. processing card:' + str(int(row.card)) + ' voucher:' + str(row.voucher) + ' store:' + str(row.store) + ' status: ' + response_data['response_message'] + ' request:' + response_data['lcs_request_id'])
    else:
        print(str(dt) + ' ' + str(number) + ' done. failed to commit ' + str(int(row.card)) + ' voucher:' + str(row.voucher) + ' store:' + str(row.store) + ' status: ' + response_data['message'])
        new_row = {'card': row.card, 'voucher': row.voucher, 'store': row.store, 'error': response_data['message']}
        failed_csv = failed_csv.append(new_row, ignore_index=True)
        failed_csv.to_csv(failed_csv_file, index=False)
    csv = csv[csv.card != row.card]

print('script completed')
print(str(len(failed_csv)) + ' failed vouchers will be saved to failed_commits.csv')
print("--- %s seconds ---" % (time.time() - start_time))
First rule of thumb: never alter what you are iterating over. The itertuples iterator was built from the original frame, so reassigning csv inside the loop has no effect on it. Also, I think itertuples is the wrong tool here. Let's use groupby:

for card, card_list in csv.groupby('card'):
    # card_list now contains all the rows that have a specific card,
    # exactly like `card_list` in your code
    print('processing', card)
    print('found', len(card_list), 'vouchers against this card')
    # the inner `itertuples` loop is overkill -- remove it
    # for row in card_list.itertuples():
    encoded_data = json.dumps({
        "store_id": card_list['store'].iloc[0],  # same as `row.store`
        "transaction_id": "11111",
        "card_number": int(card),
        "voucher_instance_ids": list(card_list['voucher'])  # same as `vouchers`
    })
    # ... rest of your code

Since groupby visits each card exactly once, there is no longer any need to delete processed rows mid-iteration.
I have the following dataset (this is just 5 observations out of 270K):
ride_id_hash start_lat start_lon end_lat end_lon
0 3913897986482697189 52.514532 13.389043 52.513640 13.380389
1 -742299897365889600 52.515073 13.394545 52.518483 13.407965
2 -319522908151650562 52.528291 13.409570 52.526965 13.415183
3 -3247332391891858034 52.514276 13.453460 52.503649 13.448538
4 5368542961279891088 52.504448 13.468405 52.511891 13.474998
I would like to make requests to the Mapbox Directions API to retrieve the estimated distance and time for each longitude/latitude origin-destination pair. Since I have to do this for 270K rows, the process is very slow. I coded the following:
import requests
import time

access_token = 'pk.eyXXXX'

def create_route_json(row):
    """Get route JSON."""
    base_url = 'https://api.mapbox.com/directions/v5/mapbox/cycling/'
    url = base_url + str(row['Start Lon']) + \
          ',' + str(row['Start Lat']) + \
          ';' + str(row['End Lon']) + \
          ',' + str(row['End Lat'])
    print('start lon: ' + str(row['Start Lon']))
    print('start lat: ' + str(row['Start Lat']))
    print('end lon: ' + str(row['End Lon']))
    print('end lat: ' + str(row['End Lat']))
    params = {
        'geometries': 'geojson',
        'access_token': access_token
    }
    req = requests.get(url, params=params)
    route_json = req.json()['routes'][0]
    print('Duration: ' + str(route_json['duration']))
    print('Distance: ' + str(route_json['distance']))
    print('^' * 40)
    row['Estimated_duration'] = route_json['duration']
    row['Estimated_distance'] = route_json['distance']
    # mystr = str(route_json['duration']) + ';' + str(route_json['distance'])
    return row

start = time.time()
df_process.apply(create_route_json, axis=1)
I am wondering if there is a way to make 270K calls quickly. (You need your own access token for the Mapbox Directions API.)
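One common way to attack this kind of I/O-bound loop is to issue the requests concurrently. Below is a minimal sketch, not Mapbox-specific advice, using the standard library's thread pool and reusing create_route_json() and df_process from the question; max_workers=10 is an assumption you would tune against the API's rate limits.

import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# Collect the rows once, then fan the requests out across worker threads
# instead of the sequential apply().
rows = [row for _, row in df_process.iterrows()]
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(create_route_json, rows))

df_result = pd.DataFrame(results)  # same shape as the apply() output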
I'm trying to send an email to someone with information I've scraped from the web, but I can't get the contents to send; I keep receiving empty emails. Any help would be great. I've tried all sorts of different combinations of quotes and +s and I can't figure it out.
def singaporeweather():
    singaporehigh = singapore_soup.find(class_='tab-temp-high').text
    singaporelow = singapore_soup.find(class_='tab-temp-low').text
    print('There will be highs of ' + singaporehigh + ' and lows of ' +
          singaporelow + '.')

def singaporesuns():
    singaporesunsets = singapore_soup.find(class_='row col-sm-5')
    suns_singapore = singaporesunsets.find_all('time')
    sunset_singapore = suns_singapore[1].text
    sunrise_singapore = suns_singapore[0].text
    print('Sunrise: ' + sunrise_singapore)
    print('Sunset: ' + sunset_singapore)

def ukweather():
    ukhigh = uk_soup.find('span', class_='tab-temp-high').text
    uklow = uk_soup.find(class_='tab-temp-low').text
    print('There will be highs of ' + ukhigh + ' and lows of ' + uklow +
          '.')

def uksuns():
    uk_humid = uk_soup.find('div', class_='row col-sm-5')
    humidity = uk_humid.find_all('time')
    sunrise_uk = humidity[0].text
    sunset_uk = humidity[1].text
    print('Sunrise: ' + str(sunrise_uk))
    print('Sunset: ' + str(sunset_uk))

def ukdesc():
    uk_desc = uk_soup.find('div', class_='summary-text hide-xs-only')
    uk_desc_2 = uk_desc.find('span')
    print(uk_desc_2.text)

def quotes():
    quote_text = quote_soup.find(class_='b-qt qt_914910 oncl_q').text
    author = quote_soup.find(class_='bq-aut qa_914910 oncl_a').text
    print('Daily quote:\n' + '\"' + quote_text + '\"' + ' - ' + author + '\n')

def message():
    print('Subject:Testing\n\n')
    print(('Morning ' +
           nameslist[random.randint(1, len(nameslist) - 1)]).center(30, '*'),
          end='\n' * 2)
    quotes()
    print('UK'.center(30, '_') + '\n')
    ukweather()
    ukdesc()
    uksuns()
    print('\n' + 'Singapore'.center(30, '_') + '\n')
    singaporeweather()
    singaporedesc()
    singaporesuns()

smtpthing.sendmail('XXX#outlook.com', 'XXX#bath.ac.uk', str(message()))
In your functions, instead of printing the results to the console, use return statements so that you can use each function's result in your main program. As it stands, message() returns None, which is why your email is empty (the main program cannot see message()'s output unless it is returned).
Try something like:
def singaporeweather():
    singaporehigh = singapore_soup.find(class_='tab-temp-high').text
    singaporelow = singapore_soup.find(class_='tab-temp-low').text
    return ('There will be highs of ' + singaporehigh +
            ' and lows of ' + singaporelow + '.')
By using a return statement like this one, you will be able to use singaporeweather()'s result in your main program, e.g.:
result = singaporeweather()
Using returns in the rest of your methods as well, you will be able to do the following in your function message():
def message():
    body = ''  # your message
    body += 'Subject:Testing\n\n'
    body += ('Morning ' + nameslist[random.randint(1, len(nameslist) - 1)]).center(30, '*') + '\n' * 2
    body += quotes()
    body += 'UK'.center(30, '_') + '\n'
    body += ukweather()
    body += ukdesc()
    body += uksuns()
    body += '\n' + 'Singapore'.center(30, '_') + '\n'
    body += singaporeweather()
    body += singaporedesc()
    body += singaporesuns()
    # finally, don't forget to return!
    return body
Now that message() returns body, you can use its result in your main program to send your email correctly:
smtpthing.sendmail('XXX#outlook.com', 'XXX#bath.ac.uk', str(message()))
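As a side note, here is a sketch of an alternative (not part of the original answer): once message() returns the body text, the standard library's email.message API can handle the Subject header and encoding for you. This assumes smtpthing is an already connected and authenticated smtplib.SMTP instance, and that the placeholder addresses are filled in.

from email.message import EmailMessage

msg = EmailMessage()
msg['Subject'] = 'Testing'        # replaces the 'Subject:...' line inside the body
msg['From'] = 'XXX@outlook.com'   # placeholder addresses from the question
msg['To'] = 'XXX@bath.ac.uk'
msg.set_content(message())        # message() must return the body string
smtpthing.send_message(msg)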
I have two DataFrames from two Excel files.
1st file (awcProjectMaster) (1,500 records):
projectCode projectName
100101 kupwara
100102 kalaroos
100103 tangdar
2nd file (village master) (more than 10 million records):
villageCode villageName
425638 wara
783651 tangdur
986321 kalaroo
I need to compare projectName and villageName and record the percentage match.
The following code works fine, but it is slow. How can I do the same thing more efficiently?
import pandas as pd
from datetime import datetime

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")

def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    with open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a") as f:
        percentMatch = 0
        vLen = len(vName)
        prjLen = len(prjName)
        if vLen > prjLen:
            if vName.find(prjName) != -1:
                percentMatch = (prjLen / vLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
        elif prjLen >= vLen:
            if prjName.find(vName) != -1:
                percentMatch = (vLen / prjLen) * 100
                f.write(prjCode + "," + prjName + "," + vCode + "," + vName + "," + str(round(percentMatch)) + "," + stCode + "," + stName + "," + dCode + "," + dName + sdCode + "," + sdName + "\n")
            else:
                res = 0
                # print(res)
    f.close()

for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )
Your distance metric for the strings isn't too accurate, but if it works for you, fine. (You may want to look into other options like the built-in difflib or the python-Levenshtein module, though.)
If you really do need to compare 1,500 x 10,000,000 records pairwise, things are bound to take some time, but there are a couple of things we can do pretty easily to speed them up:
Open the log file only once; there's overhead, sometimes significant, in reopening it for every pair.
Refactor the comparison into a separate function, then apply the lru_cache() memoization decorator so each pair is compared only once and the result is cached in memory. (In addition, see how we sort the vName/prjName pair: since the order of the two strings doesn't matter, we end up with half the cache size.)
Then, for general cleanliness, use the csv module for streaming CSV into the file (the output format is slightly different from your code's, but you can change this with the dialect parameter to csv.writer()).
Hope this helps!
import pandas as pd
from datetime import datetime
from functools import lru_cache
import csv

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")

log_file = open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a")
log_writer = csv.writer(log_file)

@lru_cache()
def compare_vname_prjname(vName, prjName):
    vLen = len(vName)
    prjLen = len(prjName)
    if vLen > prjLen:
        if vName.find(prjName) != -1:
            return (prjLen / vLen) * 100
    elif prjLen >= vLen:
        if prjName.find(vName) != -1:
            return (vLen / prjLen) * 100
    return None

def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    # help the cache decorator out by halving the number of possible pairs:
    vName, prjName = sorted([vName, prjName])
    percent_match = compare_vname_prjname(vName, prjName)
    if percent_match is None:  # no match
        return False
    log_writer.writerow(
        [
            prjCode,
            prjName,
            vCode,
            vName,
            round(percent_match),
            stCode,
            stName,
            dCode,
            dName,
            sdCode,  # the original f.write was missing a comma between dName and sdCode
            sdName,
        ]
    )
    return True

for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )
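As a quick sanity check (not in the original code), an lru_cache-decorated function exposes hit/miss counters, so after a run you can confirm the memoization is actually paying off:

# the wrapper added by @lru_cache() tracks cache usage
print(compare_vname_prjname.cache_info())
# e.g. CacheInfo(hits=..., misses=..., maxsize=..., currsize=...)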
I've built a program to fill up a database and, for the time being, it's working. Basically, the program makes a request to the app I'm using (via REST API), returns the data I want, and then manipulates it into a form acceptable to the database.
The problem is: the GET method makes the algorithm too slow, because I'm accessing the details of particular entries, so for each entry I have to make 1 request. I have something close to 15,000 requests to do, and each row in the database takes about 1 second to produce.
Is there any possible way to make these requests faster? How can I improve the performance of this method? And by the way, any tips for measuring the performance of the code?
Thanks in advance!!
here's the code:
# Retrieving all the IDs I want to get the detailed info for
abc_ids = serializers.serialize('json', modelExample.objects.all(), fields=('id'))
abc_ids = json.loads(abc_ids)
abc_ids_size = len(abc_ids)

# Had to declare these up here because at the end of the code I use them in the
# functions that create and update the database, and Python was complaining
# about variables referenced before assignment. Picked arbitrary initial values.
age = 0
time_to_won = 0
data = '2016-01-01 00:00:00'

# First loop -> request the detailed info of ABC
for x in range(0, abc_ids_size):
    id = abc_ids[x]['fields']['id']
    response = requests.get(
        'https://api.example.com/v3/abc/' + str(id) + '?api_token=123123123')
    info = response.json()
    dealx = dict(info)
    # Second loop -> picking the info I want to update and create in the database
    for key, result in dealx['data'].items():
        # Relevant only for ModelExample -> UPDATE
        if key == 'age':
            result = dict(result)
            age = result['total_seconds']
        # Relevant only for ModelExample -> UPDATE
        elif key == 'average_time_to_won':
            result = dict(result)
            time_to_won = result['total_seconds']
        # Relevant for Model_Example2 -> CREATE
        # Storing a date here to use further ahead in a datetime manipulation
        if key == 'add_time':
            data = str(result)
        elif key == 'time_stage':
            # Each stage has a total of seconds that the user stayed in it.
            y = result['times_in_stages']
            # The user can be in any stage they want; there's no rule about the
            # order, but there's a record of the order they chose.
            z = result['order_of_stages']
            # Creating a list to fill with all stage info, for use in bulk_create.
            data_set = []
            index = 0
            # Number of repetitions based on the number of stages in the list.
            for elemento in range(0, len(z)):
                data_set_i = {}
                # The index defines the order of the stages.
                index = index + 1
                for key_1, result_1 in y.items():
                    if int(key_1) == z[elemento]:
                        data_set_i['stage_id'] = int(z[elemento])
                        data_set_i['index'] = int(index)
                        data_set_i['abc_id'] = id
                        # Datetime manipulation
                        if result_1 == 0 and index == 1:
                            data_set_i['add_date'] = data
                        # I know I totally repeated the code here; I was trying
                        # to make this part shorter but could not get it right.
                        elif result_1 > 0 and index == 1:
                            data_t = datetime.strptime(data, "%Y-%m-%d %H:%M:%S")
                            data_sum = data_t + timedelta(seconds=result_1)
                            data_sum += timedelta(seconds=3)
                            data_nova = (str(data_sum.year) + '-' +
                                         str(formaters.DateNine(data_sum.month)) + '-' +
                                         str(formaters.DateNine(data_sum.day)) + ' ' +
                                         str(data_sum.hour) + ':' +
                                         str(formaters.DateNine(data_sum.minute)) + ':' +
                                         str(formaters.DateNine(data_sum.second)))
                            data_set_i['add_date'] = str(data_nova)
                        else:
                            data_t = datetime.strptime(data_set[elemento - 1]['add_date'], "%Y-%m-%d %H:%M:%S")
                            data_sum = data_t + timedelta(seconds=result_1)
                            data_sum += timedelta(seconds=3)
                            data_nova = (str(data_sum.year) + '-' +
                                         str(formaters.DateNine(data_sum.month)) + '-' +
                                         str(formaters.DateNine(data_sum.day)) + ' ' +
                                         str(data_sum.hour) + ':' +
                                         str(formaters.DateNine(data_sum.minute)) + ':' +
                                         str(formaters.DateNine(data_sum.second)))
                            data_set_i['add_date'] = str(data_nova)
                data_set.append(data_set_i)
            Model_Example2_List = [Model_Example2(**vals) for vals in data_set]
            Model_Example2.objects.bulk_create(Model_Example2_List)
    ModelExample.objects.filter(abc_id=id).update(age=age, time_to_won=time_to_won)
If the bottleneck is in your network requests, there isn't much you can do except perhaps use gzip or deflate compression. With requests this is already handled:
The gzip and deflate transfer-encodings are automatically decoded for you.
If you want to be doubly sure, you can add the following header to the GET request.
{ 'Accept-Encoding': 'gzip,deflate'}
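For example (assuming url is the endpoint from your loop; requests sends this header by default, so this only makes it explicit):

response = requests.get(url, headers={'Accept-Encoding': 'gzip,deflate'})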
The other alternative is to use threading and have many requests operate in parallel, a good option if you have lots of bandwidth and multiple cores.
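A minimal sketch of that idea with the standard library's thread pool, reusing the placeholder endpoint and token from the question; max_workers and the hypothetical process() hook are assumptions you would adapt to your own code:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch_detail(abc_id):
    # one detail request per ID, as in the question's first loop
    r = requests.get('https://api.example.com/v3/abc/' + str(abc_id),
                     params={'api_token': '123123123'})
    r.raise_for_status()
    return abc_id, r.json()

ids = [entry['fields']['id'] for entry in abc_ids]
with ThreadPoolExecutor(max_workers=20) as pool:
    for abc_id, info in pool.map(fetch_detail, ids):
        process(abc_id, info)  # hypothetical: your existing per-record logic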
Lastly, there are lots of different ways to profile Python, including the cProfile + KCachegrind combo.
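For instance, a minimal profiling run with the stdlib, with main() standing in for your entry point:

import cProfile
import pstats

cProfile.run('main()', 'profile.out')
# print the ten functions with the highest cumulative time
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)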
I have a partial answer from here: Construct a tree from list os file paths (Python) - Performance dependent.
My specific problem requires me to go from this:
dir/file 10
dir/dir2/file2 20
dir/dir2/file3 10
dir/file3 10
dir3/file4 10
dir3/file5 10
To
dir/ **50**
dir2/ **30**
file2
file3
file
file3
dir3/ **20**
file4
file5
Basically, the numbers at the end are the file sizes, and I have been trying to figure out how to roll the sizes of all the files up to the parent directories.
Edit:
r = re.compile(r'(.+\t)(\d+)')

def prettify(d, indent=0):
    for key, value in d.iteritems():
        ss = 0
        if key == FILE_MARKER:
            if value:
                for each in value:
                    mm = r.match(each)
                    ss += int(mm.group(2))
                    print ' ' * indent + each
                print ' ' * indent + format_size(ss)   # <-- the line I added
        else:
            print ' ' * indent + str(key)
            if isinstance(value, dict):
                addSizes(value, indent + 1)
            else:
                print ' ' * (indent + 1) + str(value)
This is mac's answer from the link above, which I edited to use a regexp (the marked print line is my addition).
The solutions that occurred to me all led to creating a new dict or adding an inner function.
I have lost my whole day and wish I had asked for help earlier in the day.
Please help.
Not the most elegant thing in the world, but this should get you where you need to be. You'll need to change the tree creation function to deal with whatever form of input you are getting. Once the tree is generated it's just using a recursive tree traversal to form the output.
import re

input_dirs = """dir/file 10
dir/dir2/file2 20
dir/dir2/file3 10
dir/file 10
dir3/file4 10
dir3/file5 10
dir/dir2/dir4/file2 10"""

def create_file_tree(input_string):
    dir_dict = {}
    for file_path in input_string.split('\n'):
        path_list = re.sub('/', ' ', file_path).split()
        path_list[-1] = int(path_list[-1])
        path_dict = dir_dict
        final_item = ""
        for item in path_list[:-1]:
            parent_dict = path_dict
            last_item = item
            path_dict = path_dict.setdefault(item, {})
        parent_dict[last_item] = path_list[-1]
    return dir_dict

def pretty_file_tree(file_tree):
    def traverse(sub_dict, indent=0, total=0):
        string_out = ""
        indent += 1
        for key in sorted(sub_dict.keys()):
            if type(sub_dict[key]) == dict:
                sub_total = traverse(sub_dict[key], indent, 0)
                total += sub_total[0]
                string_out += ' ' * indent + key + ' ' + '**' + str(sub_total[0]) + '**' + '\n' + sub_total[1]
            else:
                string_out += ' ' * indent + key + '\n'
                total += sub_dict[key]
        return total, string_out
    output_string = traverse(file_tree)
    print(output_string[1])

pretty_file_tree(create_file_tree(input_dirs))
Sorry it's not following the code you posted, but I'd begun to produce this before the edit...
As you process the input, build a string with placeholders (%d) for the numbers, then print the string once the totals are known.
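A sketch of that placeholder idea, assuming the nested-dict tree produced by create_file_tree() above. A directory's total isn't known until its subtree has been walked, so we emit %d as we go and substitute all the numbers at the end. (Filenames containing a literal % would need escaping first.)

def render(tree, indent=0):
    """Return (template, sizes, total): template has one %d placeholder per
    directory, and sizes lists the totals in placeholder order."""
    out, sizes, total = "", [], 0
    for name, value in sorted(tree.items()):
        if isinstance(value, dict):                    # a directory
            sub_out, sub_sizes, sub_total = render(value, indent + 1)
            out += ' ' * indent + name + '/ **%d**\n' + sub_out
            sizes.append(sub_total)                    # parent total precedes its children's
            sizes.extend(sub_sizes)
            total += sub_total
        else:                                          # a file: value is its size
            out += ' ' * indent + name + '\n'
            total += value
    return out, sizes, total

template, sizes, _ = render(create_file_tree(input_dirs))
print(template % tuple(sizes))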