I have two DFs from 2 excel files.
1st file(awcProjectMaster)(1500 records)
projectCode projectName
100101 kupwara
100102 kalaroos
100103 tangdar
2nd file(village master)(more than 10 million records)
villageCode villageName
425638 wara
783651 tangdur
986321 kalaroo
I need to compare the projectName and villageName along with the percentage match.
The following code works fine but it is slow. How can I do the same thing in a more efficient way.
import pandas as pd
from datetime import datetime  # NOTE(review): imported but unused in the visible code
# Load the two workbooks: project master (~1,500 rows) and village master (~10M rows).
df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")
def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    """Append one CSV row to the match log when one name contains the other.

    The match percentage is len(shorter) / len(longer) * 100, so 100 means
    the two names are identical.  Non-matching pairs write nothing.
    """
    # Orient the pair so a single containment test covers both directions.
    if len(vName) > len(prjName):
        shorter, longer = prjName, vName
    else:
        shorter, longer = vName, prjName
    # Guard against two empty names (the original raised ZeroDivisionError),
    # and skip pairs where neither name contains the other.
    if not longer or longer.find(shorter) == -1:
        return
    percent_match = (len(shorter) / len(longer)) * 100
    # NOTE(review): the r-prefix combined with doubled backslashes is kept from
    # the original; it yields literal double backslashes in the path -- confirm.
    with open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a") as f:
        # Fixed: the original omitted the "," between dName and sdCode, fusing
        # two fields into one CSV column; the f.close() inside `with` was
        # redundant and the dead `res = 0` branches are gone.
        f.write(",".join([prjCode, prjName, vCode, vName,
                          str(round(percent_match)), stCode, stName,
                          dCode, dName, sdCode, sdName]) + "\n")
# Cross join: every project row is compared against every village row
# (1,500 x ~10M compare() calls); matches are appended to the log file
# inside compare().
for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),  # lower-cased so containment is case-insensitive
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )
Your distance metric for the strings isn't too accurate, but if it works for you, fine. (You may want to look into other options like the builtin difflib, or the Python-Levenshtein module, though.)
If you really do need to compare 1,500 x 10,000,000 records pairwise, things are bound to take some time, but there are a couple things that we can do pretty easily to speed things up:
open the log file only once; there's overhead, sometimes significant, in that
refactor your comparison function into a separate unit, then apply the lru_cache() memoization decorator to make sure each pair is compared only once, and the subsequent result is cached in memory. (In addition, see how we sort the vName/prjName pair – since the actual order of the two strings doesn't matter, we end up with half the cache size.)
Then for general cleanliness,
use the csv module for streaming CSV into a file (the output format is slightly different than with your code, but you can change this with the dialect parameter to csv.writer()).
Hope this helps!
import pandas as pd
from datetime import datetime  # NOTE(review): imported but unused in the visible code
from functools import lru_cache
import csv

df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")  # project master (~1,500 rows)
df1 = pd.read_excel("C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.xlsx")  # village master (~10M rows)

# Open the log once for the whole run (the original reopened it on every call).
# NOTE(review): the handle is never explicitly closed; consider a `with` block
# or an atexit close so the last buffered rows are flushed.
log_file = open(r"C:\\Users\\Desktop\\prjToVillageStateWise\\stCodeVillage1To6.txt", "a")
log_writer = csv.writer(log_file)
# Fixed: `#lru_cache()` was a comment, not a decorator, so no memoisation
# happened at all -- contradicting the answer's stated optimisation.
# maxsize=None keeps every distinct pair; with many unique names this can use
# significant memory -- bound it (e.g. maxsize=2**20) if that becomes a problem.
@lru_cache(maxsize=None)
def compare_vname_prjname(vName, prjName):
    """Return the containment match percentage for two names, or None.

    If one name is a substring of the other, the score is
    len(shorter) / len(longer) * 100 (100.0 for identical names);
    otherwise None.  The result is symmetric in its two arguments.
    """
    vLen = len(vName)
    prjLen = len(prjName)
    if vLen > prjLen:
        if vName.find(prjName) != -1:
            return (prjLen / vLen) * 100
    # vLen <= prjLen here, so the original `elif prjLen >= vLen` was always true.
    elif prjName.find(vName) != -1:
        return (vLen / prjLen) * 100
    return None
def compare(prjCode, prjName, stCode, stName, dCode, dName, sdCode, sdName, vCode, vName):
    """Log one CSV row for a project/village name match.

    Returns True when a match was logged, False otherwise.
    """
    # Sort the pair only for the cache lookup: the comparison is symmetric,
    # so this halves the number of distinct cache keys.  The original names
    # are kept for logging -- the previous version logged the *sorted* pair,
    # which could silently swap the project and village name columns.
    a, b = sorted((vName, prjName))
    percent_match = compare_vname_prjname(a, b)
    if percent_match is None:  # no substring relationship -> nothing to log
        return False
    # Fixed: the original wrote `dName + sdCode` as one fused column.
    log_writer.writerow(
        [
            prjCode,
            prjName,
            vCode,
            vName,
            round(percent_match),
            stCode,
            stName,
            dCode,
            dName,
            sdCode,
            sdName,
        ]
    )
    return True
# Same cross join as the question's code; compare() now consults the memoised
# name comparison first.  Its boolean return value is ignored here.
for idx, row in df.iterrows():
    for idxv, r in df1.iterrows():
        compare(
            str(row["ProjectCode"]),
            row["ProjectName"].lower(),  # lower-cased for case-insensitive matching
            str(r["StateCensusCode"]),
            r["StateName"],
            str(r["DistrictCode"]),
            r["DistrictName"],
            str(r["SubDistrictCode"]),
            r["SubDistrictNameInEnglish"],
            str(r["VillageCode"]),
            r["VillageNameInEnglish"].lower(),
        )
Related
I wrote some code to let the user check the percentage of the money they spent (compared to the money they earned). Almost every step performs normally, until the final part.
a_c[('L'+row_t)].value return:
=<Cell 'Sheet1'.B5>/<Cell 'Sheet1'.J5>
yet I hope it should be some value.
Code:
# NOTE(review): fragment of a larger routine -- a_c (an openpyxl worksheet),
# st_column_r, inputtype and FORMAT_PERCENTAGE_00 are defined elsewhere.
st_column = st_column_r.capitalize()
row_s = str(a_c.max_row)       # last populated row
row_t = str(a_c.max_row + 1)   # target (next) row
row = int(row_t)
a_c[('J'+row_t)] = ('=SUM(I2,J'+row_s+')')  # total income
errorprevention = a_c[('J'+row_t)].value
a_c[(st_column+row_t)] = ('=SUM('+(st_column+'2')+','+(st_column+row_s)+')')
a_c['L'+row_t].number_format = FORMAT_PERCENTAGE_00
if errorprevention != 0:
    # BUG (the subject of this question): str(cell) yields the Cell object's
    # repr like "<Cell 'Sheet1'.B5>", not the address -- use .coordinate to
    # obtain 'B5' so a real formula is stored.
    a_c[('L'+row_t)] = ('='+str(a_c[(st_column+row_t)])+'/'+str(a_c[('J'+row_t)]))
    print('過往支出中,'+inputtype[st_column]+'類別佔總收入的比率為:'+a_c[('L'+row_t)].value)
Try changing the formula creation to;
# .coordinate returns the cell's address string (e.g. 'B5'), so the stored
# formula becomes '=B5/J5' instead of the Cell object's repr.
a_c[('L' + row_t)].value = '=' + a_c[(st_column + row_t)].coordinate + '/' + a_c[('J' + row_t)].coordinate
or use an f string
# Equivalent f-string form of the same .coordinate-based formula.
a_c[('L' + row_t)].value = f"={a_c[(st_column + row_t)].coordinate}/{a_c[('J' + row_t)].coordinate}"
So currently I have this code and everything runs just fine. My "min" if statement is returning all of the cars in the json file instead of the minimum if that makes sense. I have tried everything, I'm unsure if it's an indentation issue or what.
horsepower = request.args.get('horsepower')
minmax = request.args.get('minmax')
message = "<h3>HORSEPOWER "+minmax.upper()+" "+str(horsepower)+"</h3>"
path = os.getcwd() + "/cars.json"
with open(path) as f:
data = json.load(f)
for records in data:
Car = str(records["Car"])
Horse = int(records["Horsepower"])
if minmax == "max" and horsepower >= str(Horse):
message += str(Car) + " " + str(Horse) + str(max) + "<br>"
if minmax == "min" and horsepower <= str(Horse):
message += str(Car) + " " + str(Horse) + str(min) + "<br>"
return message
You have to compare numbers, not strings:
# Sketch of the fix: convert once at the boundary -- request.args values
# are strings, and string comparison is lexicographic, not numeric.
horsepower = int(request.args.get('horsepower'))
# ...
# Both operands are ints now, so the comparisons are numeric.
if minmax == "max" and horsepower >= int(Horse):
# ...
if minmax == "min" and horsepower <= int(Horse):
I need to slice a very long string (DNA sequences) in python, currently I'm using this:
new_seq = clean_seq[start:end]
I'm slicing about every 20000 characters, and taking 1000 long slices (approximately)
It's a 250 MB file containing a few strings, each identified by an id; this method is taking too long.
The sequence string comes from biopython module:
def fasta_from_ann(annotation, sequence, feature, windows, output_fasta):
    """Extract windowed subsequences around annotated features into a FASTA file.

    annotation   : path to a tab-separated GFF-like file (9 standard columns).
    sequence     : path to a FASTA file readable by Bio.SeqIO.
    feature      : feature type to extract (matched against the GFF column).
    windows      : number of bases of context to keep on each side.
    output_fasta : path the extracted records are written to.
    """
    df_gff = pd.read_csv(annotation, index_col=False, sep='\t', header=None)
    df_gff.columns = ['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand', 'frame', 'attribute']
    fasta_seq = SeqIO.parse(sequence, 'fasta')
    buffer = []
    for record in fasta_seq:
        # Hoisted out of the inner loop: this join over the whole (possibly
        # multi-megabase) sequence was rebuilt once per annotation row in the
        # original, which dominated the runtime.
        clean_seq = ''.join(str(record.seq).splitlines())
        seq_len = len(clean_seq)
        df_extract = df_gff[(df_gff.seqname == record.id) & (df_gff.feature == feature)]
        for k, v in df_extract.iterrows():
            # Clamp the window to the sequence boundaries.
            start = max(0, int(v.start) - windows)
            end = min(seq_len, int(v.end) + windows)
            new_seq = clean_seq[start:end]
            new_id = record.id + "_from_" + str(v.start) + "_to_" + str(v.end) + "_feature_" + v.feature
            desc = "attribute: " + v.attribute + " strand: " + v.strand
            buffer.append(SeqRecord(Seq(new_seq), id=new_id, description=desc))
        print(record.id)
    SeqIO.write(buffer, output_fasta, "fasta")
Maybe there's a more memory-friendly way to accomplish this.
So I'm looking to generate a random hex value each time this is called
# NOTE(review): the digit '0' is missing from the choice string, so any byte
# with a zero nibble (e.g. "\x0A" or "\xF0") can never be produced; the
# accepted answer's version uses "0123456789ABCDEF".
randhex = "\\x" + str(random.choice("123456789ABCDEF")) + str(random.choice("123456789ABCDEF"))
So far all I've come up with is to make different = calls (ex. randhex1 = ^^, randhex2) etc etc but that's tedious and inefficient and I don't want to do this
ErrorClass = "\\x" + str(random.choice("123456789ABCDEF")) + "\\x" + str(random.choice("123456789ABCDEF")) + "\\x" + str(random.choice("123456789ABCDEF")) + "\\x" + str(random.choice("123456789ABCDEF"))
because that doesn't look good and can be hard to tell how many there are.
I'm trying to assign it to this
ErrorClass = randhex1 + randhex2 + randhex3 + randhex4,
Flags = randhex5,
Flags2 = randhex6 + randhex7,
PIDHigh = randhex2 + randhex5,
and ideally, instead of having to assign different numbers, I want it all to be uniform or something like ErrorClass = randhex*4 which would be clean. If I do this, however, it simply copies the code to be something like this:
Input: ErrorClass = randhex + randhex + randhex + randhex
Output: \xFF\xFF\xFF\xFF
which obviously doesn't work because they are all the same then. Any help would be great.
Make a function that returns the randomly generated string. It will give you a new value every time you call it.
import random
def randhex():
    """Return one escaped hex byte such as '\\x4F', freshly randomised per call."""
    digits = "0123456789ABCDEF"
    hi = random.choice(digits)
    lo = random.choice(digits)
    return "\\x" + hi + lo
# Each call performs fresh random draws, so the concatenated bytes differ
# (unlike the original module-level `randhex` variable, evaluated only once).
ErrorClass = randhex() + randhex() + randhex() + randhex()
Flags = randhex()
Flags2 = randhex() + randhex()
PIDHigh = randhex() + randhex()
print(ErrorClass)
print(Flags)
print(Flags2)
print(PIDHigh)
Sample result:
\xBF\x2D\xA2\xC2
\x74
\x55\x34
\xB6\xF5
For additional convenience, add a size parameter to randhex so you don't have to call it more than once per assignment:
import random
def randhex(size=1):
    """Return *size* escaped hex bytes concatenated, e.g. randhex(2) -> '\\x0A\\xFF'."""
    digits = "0123456789ABCDEF"
    return "".join(
        "\\x" + random.choice(digits) + random.choice(digits)
        for _ in range(size)
    )
# size=N replaces N concatenated randhex() calls with a single call.
ErrorClass = randhex(4)
Flags = randhex()
Flags2 = randhex(2)
PIDHigh = randhex(2)
print(ErrorClass)
print(Flags)
print(Flags2)
print(PIDHigh)
In the code below, my problem is that output is written to all folders based on only one input file. Can someone give me a hint and check whether my code is looping properly?
import libxml2
import os.path
from numpy import *
from cfs_utils import *
np=[1,2,3,4,5,6,7,8]        # process counts (NOTE(review): name shadows the usual numpy alias)
n=[20,30,40,60,80,100,130]  # grid sizes; unknowns below = 3*n^3
solver=["CG_iluk", "CG_saamg", "CG_ssor", "BiCGSTABL_iluk", "BiCGSTABL_saamg", "BiCGSTABL_ssor", "cholmod", "ilu" ]
file_list=["eval_CG_iluk_default","eval_CG_saamg_default", "eval_CG_ssor_default", "eval_BiCGSTABL_iluk", "eval_BiCGSTABL_saamg", "eval_BiCGSTABL_ssor","simp_cholmod_solver_3D_evaluate ", "simp_ilu_solver_3D_evaluate" ]
# One .dat file per (solver, np) pair; each line holds unknowns/wall/cpu
# for one grid size.
for sol in solver:
    i=0
    # NOTE(review): i is reset per solver but advanced once per np value
    # below, so file_list[i] tracks the np index, not the solver -- yet the
    # names in file_list mirror those in solver.  Confirm this pairing is
    # intended.
    for cnt_np in np:
        #open write_file= "Graphs/" + "Np"+ cnt_np + "/CG_iluk.dat"
        #"Graphs/Np1/CG_iluk.dat"
        write_file = open("Graphs/"+ "Np"+ str(cnt_np) + "/" + sol + ".dat", "w")
        #loop through different unknowns
        for cnt_n in n:
            #open file "cfs_calculations_" + cnt_n +"np"+ cnt_np+ "/" + file_list(i) + "_default.info.xml"
            read_file = "cfs_calculations_" +str(cnt_n) +"np"+ str(cnt_np) + "/" + file_list[i] + ".info.xml"
            #read wall and cpu time and write one tab-separated line
            if os.path.exists(read_file):
                doc = libxml2.parseFile(read_file)
                xml = doc.xpathNewContext()
                walltime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/setup/timer/#wall")
                cputime = xpath(xml, "//cfsInfo/sequenceStep/OLAS/mechanic/solver/summary/setup/timer/#cpu")
                unknowns = 3*cnt_n*cnt_n*cnt_n
                write_file.write(str(unknowns) + "\t" + walltime + "\t" + cputime + "\n")
                doc.freeDoc()
        write_file.close()
        i=i+1
Problem solved — `i = 0` was outside the loop.