I have two dfs. base has 100k rows, Snps has 54k rows.
This is the structure of the dfs:
base:
SampleNum SampleIdInt SecondName
1 ASA2123313 A2123313
2 ARR4112234 R4112234
3 AFG4234122 G4234122
4 GGF412233 F412233
5 GTF423512 F423512
6 POL23523552 L23523552
...
And this is Snps df:
SampleNum SampleIdInt
1 ART2114155
2 KWW4112234
3 AFG4234122
4 GGR9999999
5 YUU33434324
6 POL23523552
...
Now look, for example, at the 2nd row in Snps and base. They have the same number (the first 3 chars are not important to me right now).
So I created a common list containing the numbers from Snps that also appear in base, i.e. all rows with the SAME number in both dfs (common has 15k elements).
common_list = [4112234, 4234122, 23523552]
And now I want to create three new lists.
confirmedSnps = where the whole SampleIdInt is identical to the one in base. In this example: AFG4234122. For these I'm sure that SecondName will be correct.
un_comfirmedSnps = where the number matches but the first three chars are different. Example: KWW4112234 in Snps and ARR4112234 in base. In this case I'm not sure that SecondName is correct, so I need to check it later.
And the last list, moreThanOne. That list should collect all duplicate rows. For example, if base contains both KWW4112234 and AFG4112234, both should go to that list.
I wrote some code. It works fine, but the problem is time. I have 15k elements to filter, and each element takes about 4 seconds to process, which means the whole loop would run for about 17 hours!
I'm looking for help optimizing that code.
That's my code:
comfirmedSnps = []
un_comfirmedSnps = []
moreThanOne = []
for i in range(len(common)):
    testa = baza[baza['SampleIdInt'].str.contains(common[i])]
    testa = testa.SampleIdInt.unique()
    print("StepOne")
    testb = snps[snps['SampleIdInt'].str.contains(common[i])]
    testb = testb.SampleIdInt.unique()
    print("StepTwo")
    if len(testa) == 1 and len(testb) == 1:
        if (testa == testb) == True:
            comfirmedSnps.append(testb)
        else:
            un_comfirmedSnps.append(testb)
    else:
        print("testa has more than one contains records. ")
        moreThanOne.append(testb)
    print("StepTHREE")
    print(i, "/", range(len(common)))
I added the Step prints to check which part takes most of the time. It's the code between StepOne and StepTwo; the first and third steps run instantly.
Can someone help me with this case? I'm sure many of you will see a better solution to this problem.
What you are trying to do is commonly called a join, which, annoyingly enough, is called merge in pandas. There's just the minor annoyance of the three initial letters to deal with, but that's easy:
snps['numeric_id'] = snps['SampleIdInt'].apply(lambda s: s[3:])
base['numeric_id'] = base['SampleIdInt'].apply(lambda s: s[3:])
now you can compute the three dataframes:
confirmed = snps.merge(base, on='SampleIdInt')

unconfirmed = snps.merge(base, on='numeric_id')
unconfirmed = unconfirmed[unconfirmed['SampleIdInt_x'] != unconfirmed['SampleIdInt_y']]

more_than_one = snps.groupby('numeric_id').filter(lambda g: len(g) > 1)
I haven't tested this, so it may need some tweaking, but hopefully you get the idea.
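For completeness, here is an untested sketch of how the three result sets could be built fully vectorised, shown on a tiny made-up sample. The column names come from the question; everything else (the sample values, the _snps/_base suffixes, the intermediate variable names) is just for illustration:

import pandas as pd

# Made-up miniature versions of the two dataframes from the question
base = pd.DataFrame({
    'SampleIdInt': ['ASA2123313', 'ARR4112234', 'AFG4234122', 'GGF5555555', 'AFG5555555'],
    'SecondName':  ['A2123313',  'R4112234',  'G4234122',  'F5555555',  'G5555555'],
})
snps = pd.DataFrame({'SampleIdInt': ['ART2114155', 'KWW4112234', 'AFG4234122', 'POL5555555']})

# Strip the three-letter prefix once, instead of scanning with str.contains per element
base['numeric_id'] = base['SampleIdInt'].str[3:]
snps['numeric_id'] = snps['SampleIdInt'].str[3:]

merged = snps.merge(base, on='numeric_id', suffixes=('_snps', '_base'))

# Numbers that occur more than once in base go to the "more than one" bucket
dup_ids = base.loc[base['numeric_id'].duplicated(keep=False), 'numeric_id'].unique()
more_than_one = merged[merged['numeric_id'].isin(dup_ids)]

# Everything else is either an exact match (confirmed) or a prefix mismatch (unconfirmed)
rest = merged[~merged['numeric_id'].isin(dup_ids)]
confirmed = rest[rest['SampleIdInt_snps'] == rest['SampleIdInt_base']]
unconfirmed = rest[rest['SampleIdInt_snps'] != rest['SampleIdInt_base']]

The merge and the duplicated() check are vectorised, so on 100k/54k rows this should take seconds rather than hours.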
I'm taking a course and I need to solve the following assignment:
"In this part, you should write a for loop, updating the df_users dataframe.
Go through each user, and update their zip code, to Safe Harbor specifications:
If the user is from a zip code for which the “Geographic Subdivision” is less than or equal to 20,000, change the zip code in df_users to ‘0’ (as a string)
Otherwise, zip should be only the first 3 numbers of the full zip code
Do all this by directly updating the zip column of the df_users DataFrame
Hints:
This will be several lines of code, looping through the DataFrame, getting each zip code, checking the geographic subdivision with the population in zip_dict, and setting the zip_code accordingly.
Be very aware of your variable types when working with zip codes here."
Here you can find all the data necessary to understand the context:
https://raw.githubusercontent.com/DataScienceInPractice/Data/master/
assignment: 'A4'
data_files: user_dat.csv, zip_pop.csv
After cleaning the data from user_dat.csv, leaving only the columns 'age', 'zip' and 'gender', and creating a dictionary from zip_pop.csv that contains the population for the first 3 digits of all the zip codes, I wrote this code:
# Loop through the dataframe to get each zipcode
for zipcode in df_users['zip']:
    # check if the zipcode's first 3 numbers correspond to a population of more or less than 20,000 people
    if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
        # if less, change zipcode value to string zero.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
    else:
        # If more, preserve only the first 3 digits of the zipcode.
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = zipcode[:len(zipcode) - 2]
This code works halfway and I don't understand why.
It changes the zipcode to '0' if the population is less than 20,000 people, and also changes the first zipcodes (up until the ones that start with '078'), but then it returns this error message:
KeyError                                  Traceback (most recent call last)
/var/folders/95/4vh4zhc1273fgmfs4wyntxn00000gn/T/ipykernel_44758/1429192050.py in <module>
      1 for zipcode in df_users['zip']:
----> 2     if zip_dict[zipcode[:len(zipcode) - 2]] <= 20000:
      3         df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
      4     else:
      5         df_users.loc[df_users['zip'] == zipcode, 'zip'] = str(zipcode[:len(zipcode) - 2])

KeyError: '0'
I get that the problem is in the last line of code, because I've been testing one line at a time and each of them worked until I added that last one. And if I just print the zipcodes instead of that last line, it also works!
Can anyone help me understand why my code is wrong?
You're modifying a collection of values (i.e. df_users['zip']) whilst you're iterating over it. This is a common anti-pattern. If a loop is absolutely required, then you could consider iterating over df_users['zip'].unique() instead. That creates a copy of all the unique zip codes, solving your current error, and it means that you aren't redoing work when you encounter a duplicate zip code.
If a loop is not required, then there are better (more pandas style) ways to go about your problem. I would suggest something like (untested):
zip_start = df_users['zip'].str[:-2]
df_users['zip'] = zip_start.where(zip_start.map(zip_dict) > 20000, other="0")
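If you do end up keeping the loop, a rough, untested sketch of the unique()-based variant could look like this (it assumes 5-digit zip strings and that zip_dict maps 3-digit prefixes to populations, as described in the assignment):

for zipcode in df_users['zip'].unique():
    prefix = zipcode[:3]  # first 3 digits of a 5-digit zip string
    if zip_dict[prefix] <= 20000:
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = '0'
    else:
        df_users.loc[df_users['zip'] == zipcode, 'zip'] = prefix

Because unique() returns a copy of the distinct values, nothing you assign back into df_users['zip'] changes what the loop iterates over.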
I am doing some NLP work (just starting to learn) with policy descriptions (corpus) and policy areas (classes). I have calculated the word frequency and made a list of the 100 most frequent terms. I have 27 classes. Some words are relevant for some classes but irrelevant for others, so I can't clean them from the whole data set. I want to remove the words in the list from the classes where they are present in LESS than 50% of the rows, but when I print dfen_all it gives me only the results for the last class.
It seems that somehow it is not merging the datasets for each class... I've tried a few modifications but nothing worked.
for term in most_common_words_list:
    for class_num in arr2:
        df_class = dfen[dfen['Policycat1'] == class_num]  # get dataframe of class
        num_with_term = len(df_class[df_class['corpus3'].str.contains(term, case=False)])  # count number of records with term
        num_total = len(df_class['corpus3'])
        percent = num_with_term / num_total
        if percent < 0.5:
            df_class['corpus3_clean'] = df_class['corpus3'].str.replace(term, '')  # delete term from records

# join all dataframes into a single dataframe
dfen_all = pd.DataFrame()
for class_num in class_values:
    dfen_all = pd.concat([dfen_all, df_class[df_class['Policycat1'] == class_num]])

# save the dataframe to a csv
# df_all.to_csv('all_classes.csv', index=False)

dfen_all['totalwords_c3cl'] = dfen_all['corpus3_clean'].str.split().str.len()
dfen_all
I am expecting to get all the results together in a dataframe, not only the last label.
This is my first question; I apologize if I missed something. Thanks a lot!
So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever run with the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this causes the 'check50' marking system to time out and return a negative result when testing with that large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say it's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness - I haven't profiled it or anything, but my first impulse would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). Every match of that pattern is a consecutive run of the repeat; since the question asks for the highest number of consecutive repetitions, we take the longest such run, and the number of repetitions is the length of that run divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is the longest run in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC"), which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))
    for short_tandem_repeat in short_tandem_repeats:
        # One or more consecutive copies of the current short tandem repeat
        pattern = f"(?:{short_tandem_repeat})+"
        runs = [match.group() for match in re.finditer(pattern, sequence)]
        if not runs:
            continue
        longest_run = max(runs, key=len)
        count[short_tandem_repeat] = len(longest_run) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())
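Assuming the script is saved as, say, dna.py (the file names here are just placeholders), you would run it as:

python dna.py database.csv sequence.txt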
I am trying to create a loop involving Pandas/Python and an Excel file. The column in question is named "ITERATION" and it has numbers ranging from 1 to 6. I'm trying to query the number of hits in the Excel file for the following iteration ranges:
1 to 2
3
4 to 6
I've already made a preset data frame named "df".
iteration_list = ["1,2", "3", "4,5,6"]
i = 1
for k in iteration_list:
    table = df.query('STATUS == ["Sold", "Refunded"]')
    table["ITERATION"] = table["ITERATION"].apply(str)
    table = table.query('ITERATION == ["%s"]' % k)
    table = pd.pivot_table(table, columns=["Month"], values=["ID"], aggfunc=len)
    table.to_excel(writer, startrow=i)
    i = i + 3
The snippet above works only for the number "3". The other two scenarios don't seem to work, as it literally searches for the string "1,2". I've tried other ways such as:
iteration_list = [1:2, 3, 4:6]
iteration_list = [{1:2}, 3, {4:6}]
to no avail.
Does anyone have any suggestions?
EDIT
After looking over Stidgeon's answer, I came up with the following alternatives. Stidgeon's answer DOES provide an output, but not the one I'm looking for (it gives 6 outputs, one for each iteration from 1 to 6, in each loop).
Above, my list was the following:
iteration_list = ["1,2", "3", "4,5,6"]
If you play around with the quotation marks, you can input exactly what you want, since your string is literally going to be substituted into this line where the %s is:
table = table.query('ITERATION == ["%s"]' % k)
You can essentially play around with the list to fit your precise needs with quotations. Here is a solution that could work:
list = ['1", "2', 3, '4", "5", "6']
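For instance, with the first element of that list, the %s substitution expands the query to:

table = table.query('ITERATION == ["1", "2"]')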
Just focusing on getting the values out of the list of strings, this works for me (though - as always - there may be more Pythonic approaches):
lst = ['1,2', '3', '4,5,6']
for item in lst:
    items = item.split(',')
    for _ in items:
        print(int(_))
Though instead of printing at the end, you can pass the value on to your script.
This will work if all your strings are either single numbers or numbers separated by commas. If the data aren't consistently formatted like that, you may have to tweak this code.
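Putting that together with the query from the question, an untested sketch might look like this (it assumes the ITERATION column holds plain integers, and it uses isin instead of building query strings; the variable names are taken from the question):

iteration_list = ["1,2", "3", "4,5,6"]
i = 1
for k in iteration_list:
    group = [int(x) for x in k.split(',')]  # e.g. "1,2" -> [1, 2]
    table = df.query('STATUS == ["Sold", "Refunded"]')
    table = table[table["ITERATION"].isin(group)]
    table = pd.pivot_table(table, columns=["Month"], values=["ID"], aggfunc=len)
    table.to_excel(writer, startrow=i)
    i = i + 3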
I'm creating objects derived from a rather large txt file. My code works properly but takes a long time to run. This is because the elements I'm looking for are not ordered and not (necessarily) unique. For example, I am looking for a digit code that might be used twice in the file, but could appear in the first and the last row. My idea was to check how often a certain code is used...
counter = collections.Counter([l[3] for l in self.body])
...and then loop through the counter. The advantage: if a code is only used once, you don't have to iterate over the whole file. However, you are still stuck with a lot of iterations, which makes the process really slow.
So my question really is: how can I improve my code? Another idea, of course, is to order the data first, but that could take quite a long time as well.
The crucial part is this method:
def get_pc(self):
    counter = collections.Counter([l[3] for l in self.body])
    # This returns something like this {'187': 2, '199': 1, ...}
    pcode = []
    # loop through entries of counter
    for k, v in counter.iteritems():
        i = 0
        # find post code in body
        for l in self.body:
            if i == v:
                break
            # find first appearance of key
            if l[3] == k:
                # first encounter...
                if i == 0:
                    # ...so create object
                    self.pc = CodeCana(k, l[2])
                    pcode.append(self.pc)
                i += 1
                # make attributes
                self.pc.attr((l[0], l[1]), l[4])
                if v <= 1:
                    break
    return pcode
I hope the code explains the problem sufficiently. If not, let me know and I will expand the provided information.
You are looping over body way too many times. Collapse this into one loop, and track the CodeCana items in a dictionary instead:
def get_pc(self):
    pcs = dict()
    pcode = []
    for l in self.body:
        pc = pcs.get(l[3])
        if pc is None:
            pc = pcs[l[3]] = CodeCana(l[3], l[2])
            pcode.append(pc)
        pc.attr((l[0], l[1]), l[4])
    return pcode
Counting all the items first, and then trying to limit how many times you loop over body, while still looping over all the different types of items, defeats the purpose somewhat...
You may want to consider giving the various indices in l names. You can use tuple unpacking:
for foo, bar, baz, egg, ham in self.body:
    pc = pcs.get(egg)
    if pc is None:
        pc = pcs[egg] = CodeCana(egg, baz)
        pcode.append(pc)
    pc.attr((foo, bar), ham)
but building body out of a namedtuple-based class would help in code documentation and debugging even more.
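A minimal, untested sketch of that idea; the field names here are made up, since the question only tells us that index 3 is the code and index 2 is the second CodeCana argument:

from collections import namedtuple

# Hypothetical field names -- rename them to whatever the columns actually mean
Record = namedtuple('Record', ['col_a', 'col_b', 'name', 'code', 'extra'])

body = [Record(*row) for row in raw_rows]  # raw_rows: the parsed lines from the txt file

pcs = dict()
pcode = []
for rec in body:
    pc = pcs.get(rec.code)
    if pc is None:
        pc = pcs[rec.code] = CodeCana(rec.code, rec.name)
        pcode.append(pc)
    pc.attr((rec.col_a, rec.col_b), rec.extra)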