Optimizing runtime of comparing 2 Pandas datasets - python

I have an issue where I take the 2009/2010 dataset of White House visitors, a CSV with the headers described here:
https://obamawhitehouse.archives.gov/files/disclosures/visitors/WhiteHouse-WAVES-Key-1209.txt
I want to extract the names of all visitors who visited in both 2009 and 2010.
I have this function to do it, but it is far too slow. Is there a conceptually faster way to do this?
import pandas

def task3():
    culled_data = data[["NAMELAST", "NAMEFIRST", "TOA", "TOD"]]
    data9 = culled_data[culled_data["TOA"].str.contains("2009", na=False)]
    data10 = culled_data[culled_data["TOA"].str.contains("2010", na=False)]
    unique_names = pandas.DataFrame(
        {'count': data.groupby(["NAMELAST", "NAMEFIRST"]).size()}).reset_index()
    unique_names = unique_names[unique_names["count"] > 1]
    count = 0
    for index, row in unique_names.iterrows():
        if (data9[data9.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data10[data10.NAMELAST == row["NAMELAST"]].shape[0] > 0
                and data9[data9.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0
                and data10[data10.NAMEFIRST == row["NAMEFIRST"]].shape[0] > 0):
            count += 1
        else:
            unique_names = unique_names[unique_names.NAMELAST != row["NAMELAST"]]
    return count, unique_names

One way is to use python sets:
fullnames9 = set([' '.join(r) for r in data9[['NAMEFIRST', 'NAMELAST']].values])
fullnames10 = set([' '.join(r) for r in data10[['NAMEFIRST', 'NAMELAST']].values])
names_who_visited_in_both_years = fullnames9 & fullnames10 # set intersection
Note that if two different people have the same first and last name, this code will falsely conclude that they visited in both years. Also, this only gets full names. Getting the DataFrame indexes of people who visited in both years would be more useful, and is left as an exercise ;)
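For reference, one possible sketch of that exercise (assuming the data9 and data10 frames built in the question): compare the name pairs through a MultiIndex so the original row indexes are kept.

idx9 = data9.set_index(["NAMELAST", "NAMEFIRST"]).index
idx10 = data10.set_index(["NAMELAST", "NAMEFIRST"]).index

# rows (with their original DataFrame indexes) of people seen in both years
rows9_both = data9[idx9.isin(idx10)]    # their 2009 visits
rows10_both = data10[idx10.isin(idx9)]  # their 2010 visits

The same caveat applies: two different people who share a first and last name will be merged into one.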

Related

Fastest way to count event occurrences in a Pandas dataframe?

I have a Pandas dataframe with ~100,000,000 rows and 3 columns (Names str, Time int, and Values float), which I compiled from ~500 CSV files using glob.glob(path + '/*.csv').
Given that two different names alternate, the job is to go through the data and count the number of times a value associated with a specific name ABC deviates from its preceding value by ±100, given that the previous 50 values for that name did not deviate by more than ±10.
I initially solved it with a for loop function that iterates through each row, as shown below. It checks for the correct name, then checks the stability of the previous values of that name, and finally adds one to the count if there is a large enough deviation.
import numpy as np

count = 0
stabilityTime = 0
i = 0
if names[0] == "ABC":
    j = values[0]
    stability = np.full(50, values[0])
else:
    j = values[1]
    stability = np.full(50, values[1])
for name in names:
    value = values[i]
    if name == "ABC":
        if j - 10 < value < j + 10:
            stabilityTime += 1
        if stabilityTime >= 50 and np.std(stability) < 10:
            if value > j + 100 or value < j - 100:
                stabilityTime = 0
                count += 1
        stability = np.roll(stability, -1)
        stability[-1] = value
        j = value
    i += 1
Naturally, this process takes a very long computing time. I have looked at NumPy vectorization, but do not see how I can apply it in this case. Is there some way I can optimize this?
Thank you in advance for any advice!
Bonus points if you can give me a way to concatenate all the data from every CSV file in the directory that is faster than glob.glob(path + '/*.csv').
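For what it's worth, here is a rough vectorized sketch of both pieces. It is not a drop-in replacement for the loop above, since it approximates the "stable baseline" test with a rolling window over successive differences; the column names Names and Values are taken from the question.

import glob
import pandas as pd

# read and concatenate all CSVs in one go
files = glob.glob(path + '/*.csv')
df = pd.concat(map(pd.read_csv, files), ignore_index=True)

# rough vectorized approximation of the counting logic for name "ABC"
abc = df.loc[df['Names'] == 'ABC', 'Values']
step = abc.diff().abs()                                  # change from the preceding ABC value
stable_before = step.shift(1).rolling(50).max().lt(10)   # previous 50 changes all within +/-10
count = (step.gt(100) & stable_before).sum()

The concat call avoids repeatedly growing a DataFrame; whether the rolling approximation matches the loop's reset behaviour closely enough would need checking against the real data.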

How to split a series by the longest repetition of a number in python?

import pandas as pd

df = pd.DataFrame({
    'label': [f"subj_{i}" for i in range(28)],
    'data': [i for i in range(1, 14)] + [1, 0, 0, 0, 2] + [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
})
I have a dataset something like that.
I want to cut it where the longest run of 0s occurs, so I want to cut at index 18, but leave indices 14-16 intact. So far I've tried things like:
Counters
cad_recorder = 0
new_index = []
for i, row in tqdm(temp_df.iterrows()):
    if row['cadence'] == 0:
        cad_recorder += 1
        new_index.append(i)
But obviously that won't work, since the indices will be rewritten at each occurrence of zero.
I also tried a dictionary, but I'm not sure how to compare previous and next values using iterrows.
I also took the rolling mean for X rows at a time, and if it's zero then I got an index. But then I got stuck at actually inferring the range of indices, or finding the longest sequence of zeroes.
Edit: A friend of mine suggested the following logic, which gave the same results as #shubham-sharma. The poster's solution is much more pythonic and elegant.
def find_longest_zeroes(df):
    '''
    Finds the index at which the longest repetition of <1 values begins
    '''
    current_length = 0
    max_length = 0
    start_idx = 0
    max_idx = 0
    for i in range(len(df['data'])):
        if df.iloc[i, 9] <= 1:
            if current_length == 0:
                start_idx = i
            current_length += 1
            if current_length > max_length:
                max_length = current_length
                max_idx = start_idx
        else:
            current_length = 0
    return max_idx
The code I went with following #shubham-sharma's solution:
cut_us_sof = {}
og_df_sof = pd.DataFrame()
cut_df_sof = pd.DataFrame()

for lab in df['label'].unique():
    temp_df = df[df['label'] == lab].reset_index(drop=True)
    mask = temp_df['data'] <= 1  # some values in the actual dataset were 0.0000001
    counts = temp_df[mask].groupby((~mask).cumsum()).transform('count')['data']
    idx = counts.idxmax()
    # my dataset's trailing zeroes are usually after the 200th index,
    # but I also didn't want to remove trailing zeroes < 500 in length
    if (idx > 2000) & (counts.loc[idx] > 500):
        cut_us_sof[lab] = idx
        og_df_sof = og_df_sof.append(temp_df)
        cut_df_sof = cut_df_sof.append(temp_df.iloc[:idx, :])
We can use boolean masking and cumsum to identify the blocks of zeros, then group those blocks, transform with count, and take idxmax to get the starting index of the block with the most consecutive zeros:
m = df['data'].eq(0)
idx = m[m].groupby((~m).cumsum()).transform('count').idxmax()
print(idx)
18
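To make the steps easier to follow, here is the same computation broken out on the sample df from the question (the intermediate names are just for illustration):

m = df['data'].eq(0)                                     # True where data == 0
blocks = (~m).cumsum()                                   # constant within each run of zeros
run_lengths = m[m].groupby(blocks).transform('count')    # length of the run each zero belongs to
idx = run_lengths.idxmax()                               # first index of the longest run -> 18

Restricting to m[m] keeps only the zero positions, and grouping by blocks keeps zeros from different runs apart; idxmax then returns the first index of the largest run.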

Want to optimize my code for finding overlapping times in a large number of records (pandas)

I have a data table of 100,000 records with 50 columns. Each record has a start time, an end time, and an equipment key. Records are stored when a node goes down: the start time is when the node goes down, and the end time is when it comes back up. If there are multiple records with the same equipment key whose start and end times fall inside a previous record's start and end times, we say the new record has an overlapping time and we need to ignore it. To find these overlapping records I wrote a function and apply it over a dataframe, but it's taking a long time. I'm not very experienced with optimization, so I'm seeking any suggestions.
sitecode_info = []

def check_overlapping_sitecode(it):
    sitecode = it['equipmentkey']
    fo = it['firstoccurrence']
    ct = it['cleartimestamp']
    if len(sitecode_info) == 0:
        sitecode_info.append({
            'sc': sitecode,
            'fo': fo,
            'ct': ct
        })
        return 0
    else:
        for list_item in sitecode_info:
            for item in list_item.keys():
                if item == 'sc':
                    if list_item[item] == sitecode:
                        # print("matched")
                        if fo >= list_item['fo'] and ct <= list_item['ct'] or \
                                fo >= list_item['fo'] and fo <= list_item['ct'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['ct'] or \
                                fo <= list_item['fo'] and ct >= list_item['fo'] and ct <= list_item['ct']:
                            return 1
                        else:
                            sitecode_info.append({
                                'sc': sitecode,
                                'fo': fo,
                                'ct': ct
                            })
                            return 0
                    else:
                        sitecode_info.append({
                            'sc': sitecode,
                            'fo': fo,
                            'ct': ct
                        })
                        return 0
I am calling it as follows:
temp_df['false_alarms'] = temp_df.apply(check_overlapping_sitecode, axis=1)
I think you were just iterating over that list of dictionaries a touch too much.
EDIT: Added appending the fo's and ct's even when the method returns 1, for better accuracy.
import time

'''
setting an empty dictionary.
this will look like: {sc1: [[fo, ct], [fo, ct]],
                      sc2: [[fo, ct], [fo, ct]]}
the keys are just the site_code,
this way we don't have to iterate over all of the fo's and ct's,
just the ones related to that site code.
'''
sitecode_info = {}

# i set up a dataframe with 200000 rows x 50 columns
def check_overlapping_sitecode(site_code, fo, ct):
    try:
        # try to grab the existing site_code information from the sitecode_info dict.
        # if that fails, create it below while also returning 0 for that site_code
        my_list = sitecode_info[site_code]
        # if it works, go through that site's list.
        for fo_old, ct_old in my_list:
            # the first occurrence falls inside an old [fo, ct] interval
            if fo_old <= fo <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
            # same but for the cleartimestamp instead
            elif fo_old <= ct <= ct_old:
                sitecode_info[site_code].append([fo, ct])
                return 1
        # no overlap with any stored interval for this site, so record it and return 0
        sitecode_info[site_code].append([fo, ct])
        return 0
    except KeyError:
        # first record for this site_code: store it as a list in a list
        sitecode_info[site_code] = [[fo, ct]]
        return 0

t = time.time()
"""Here's the real meat and potatoes:
use a lambda to call check_overlapping_sitecode on each row,
where x is the row, and return its output into the new column.
"""
temp_df['false_alarms'] = temp_df.apply(
    lambda x: check_overlapping_sitecode(
        x['equipmentkey'], x['firstoccurrence'], x['cleartimestamp']),
    axis=1)
print(time.time() - t)
# this runs in nearly 6 seconds for me.
# then you can do whatever you want with your DF.
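If an overlap is defined strictly as "contained in an earlier record's interval", as in the question, a fully vectorized sketch is also possible (same column names assumed; this sorts within each equipment key and compares each clear timestamp against the running maximum of the earlier ones):

df = temp_df.sort_values(['equipmentkey', 'firstoccurrence'])

# running maximum of cleartimestamp within each equipment key,
# shifted so each row only sees the records before it
prev_ct_max = (df.groupby('equipmentkey')['cleartimestamp']
                 .cummax()
                 .groupby(df['equipmentkey'])
                 .shift())

# a row is a false alarm if it ends no later than some earlier record
# for the same key that started no later than it did
df['false_alarms'] = (df['cleartimestamp'] <= prev_ct_max).astype(int)

Because the frame is sorted by firstoccurrence within each key, every earlier row already starts no later than the current one, so only the end times need comparing. This flags containment only, not partial overlaps.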

Assigning salespersons to different cities program

I was working on a problem of assigning 8 salespersons to 8 different cities, represented as a grid where the columns represent salespersons and the rows represent cities.
The conditions for assigning are:
1) Only one city per person.
2) Once a city is assigned to a salesperson, the cities in the same row, column, and diagonals cannot be assigned to another person.
I am not able to recreate an example from memory, sorry for that, but the representation of cities and salespersons is correct.
I thought that, to avoid repeating a row or column for any salesperson, I could use permutations from Python, which give distinct sets of cities without row/column overlap, and from there I could check the diagonal values.
Here is my attempt.
import collections
import itertools

def display(l):
    list_count = 0
    k = ''
    for i in l:
        print(i)
        list_count = list_count + 1
        if list_count != len(l):
            k = k + ','

cities = [1, 2, 3, 4, 5, 6, 7, 8]
sales = [1, 2, 3, 4, 5, 6, 7, 8]
print_list = []
count = 0
for i in itertools.permutations([1, 2, 3, 4, 5, 6, 7, 8], 8):
    print_list.append(i)
    if count == 2:
        display(print_list)
        # print(print_list)
        # print('\n')
        for j in range(len(print_list)):
            print_list.pop()
        count = 0
    count = count + 1
I am stuck on how to check whether a salesperson is in a diagonal position relative to another salesperson. If someone could extend my approach that would be great; I would also welcome any other explanation. I would prefer Python, as I am practising it.
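For what it's worth, a common way to express the diagonal check on a permutation (a sketch, assuming each permutation maps row index to the chosen column, as in the loop above; no_diagonal_conflicts is just an illustrative helper name): two placements share a diagonal exactly when the absolute difference of their row indices equals the absolute difference of their column values.

import itertools

def no_diagonal_conflicts(assignment):
    """assignment[row] = column chosen for that row (one permutation)."""
    n = len(assignment)
    for r1, r2 in itertools.combinations(range(n), 2):
        if abs(assignment[r1] - assignment[r2]) == abs(r1 - r2):
            return False
    return True

# keep only permutations with no two placements on the same diagonal
valid = [p for p in itertools.permutations(range(8)) if no_diagonal_conflicts(p)]
print(len(valid))  # 92 for an 8x8 grid, the classic eight-queens count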

Python: keep top Nth results for csv.reader

I am doing some filtering on a csv file where for every title there are many duplicate IDs with different prediction values, so column 2 differs between them. I would like to keep only the 30 lowest values, but with unique IDs. I came up with this code, but I don't know how to keep the lowest 30 entries.
Can you please suggest how to obtain 30 entries that are unique by ID?
# title1    id1    100    7.78E-25    # example of a line
import csv

with open("test.txt") as fi:
    cmp = {}
    for R in csv.reader(fi, delimiter='\t'):
        for L in ligands:  # ligands is defined elsewhere in the script
            newR = R[0], R[1]
            if R[0] == L:
                if int(R[2]) <= 1000 and int(R[2]) != 0 and float(R[3]) < 1.0e-10:
                    if newR in cmp:
                        if float(cmp[newR][3]) > float(R[3]):
                            cmp[newR] = R[:-2]
                    else:
                        cmp[newR] = R[:-2]
Maybe try something along this line...
from bisect import insort

nth_lowest = [very_high_value] * 30

for x in my_loop:
    do_stuff()
    ...
    if x < nth_lowest[-1]:
        insort(nth_lowest, x)
        nth_lowest.pop()  # remove the highest element
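Applied to the question's data, a self-contained sketch (assuming the tab-separated layout from the example line: title, id, score, prediction; best_by_id and lowest_30 are just illustrative names) that keeps the 30 unique-ID entries with the lowest prediction values could look like this:

import csv
import heapq

# keep, for each unique ID, the row with its lowest prediction value
best_by_id = {}
with open("test.txt") as fi:
    for row in csv.reader(fi, delimiter='\t'):
        pred = float(row[3])
        key = row[1]  # the ID column
        if key not in best_by_id or pred < best_by_id[key][1]:
            best_by_id[key] = (row, pred)

# the 30 unique-ID entries with the smallest prediction values
lowest_30 = heapq.nsmallest(30, best_by_id.values(), key=lambda item: item[1])
for row, pred in lowest_30:
    print(row[0], row[1], pred)

heapq.nsmallest keeps only a small fixed number of candidates, so this stays cheap even for large files.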
