Best search algorithm to find 'similar' strings in excel spreadsheet

I am trying to figure out the most efficient way of finding values similar to a specific cell in a specified column (not all columns) of an Excel .xlsx document. My current code assumes the strings are unsorted. However, the file I am using, and the files I will be using, all have their strings sorted from A to Z. So instead of a linear search, I wonder what other search algorithm I could use (binary search, etc.), and how I could fix my code to use it.
So far I have created a function: find(). Before the function runs, the program takes a value from the user's input and uses it as the sheet name. I print all available sheet names in the Excel document to help the user. I created an empty list results[] to store the results. I created a for loop that iterates through column A only, because I only want to iterate through a custom column.
Inside the loop, a variable called start holds the coordinate of the current cell in column A (e.g. A1 or A400); it changes with each iteration. A variable called next gets compared with start. Next is technically just start + 1, but since I can't add 1 to a string, I concatenate and type cast everything so that the iteration covers A1 to A100, or however many cells are in column A.
My function getVal() is called with two parameters: the coordinate of the cell and the worksheet we are working from. The value returned from getVal() is passed to my function similar(), which just calls SequenceMatcher() from difflib and returns the percentage of how similar two strings are, e.g. similar("hello", "helloo") returns roughly 90. If the two strings are more than 40 percent similar, the value of the start cell is appended to the results[] list.
from difflib import SequenceMatcher
# wb is assumed to be an openpyxl workbook that was loaded earlier
def setSheet(ws):
    sheet = wb[ws]
    return sheet
def getVal(coordinate, worksheet):
    value = worksheet[coordinate].value
    return value
def similar(first, second):
    # ratio() is between 0 and 1, so scale it to a percentage
    percent = SequenceMatcher(None, first, second).ratio() * 100
    return percent
def find():
    column = "A"
    print("\n")
    print("These are all available sheets: ", wb.sheetnames)
    print("\n")
    name = input("What sheet are we working out of> ")
    results = []
    ws = setSheet(name)
    # Compare each cell in column A with the cell directly below it
    for i in range(1, ws.max_row):
        start = ws[column + str(i)].coordinate
        next = ws[column + str(i + 1)].coordinate
        if similar(getVal(start, ws), getVal(next, ws)) > 40:
            results.append(getVal(start, ws))
    return results
This is some nasty looking code, so I apologize in advance. The expected result should just be a list of strings that are "similar".
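Because the column is sorted A-Z, strings that share a prefix end up next to each other, so one pass over adjacent pairs already covers the comparison described above. Here is a minimal sketch, assuming the column values have first been read into a plain Python list (for example from the worksheet). The function name find_similar_neighbors and the sample data are hypothetical, and the 40 percent threshold is taken from the question.

```python
from difflib import SequenceMatcher


def similar(first, second):
    """Return the similarity of two strings as a percentage (0-100)."""
    return SequenceMatcher(None, first, second).ratio() * 100


def find_similar_neighbors(values, threshold=40):
    """Return each value that is more than `threshold` percent similar to the value after it."""
    results = []
    # zip pairs every element with its successor, so the last element is never a "start"
    for current, following in zip(values, values[1:]):
        if similar(current, following) > threshold:
            results.append(current)
    return results


# Hypothetical sorted column contents
column_a = ["apple", "apples", "banana", "band", "cherry"]
print(find_similar_neighbors(column_a))  # ['apple', 'banana']
```

Reading the cells into a list once keeps the loop free of coordinate string building. Note that a binary search only helps with exact or prefix lookups in sorted data; a similarity ratio still has to be computed pair by pair.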

Related

python comparing two lists and retaining second list index

I am taking a user input of "components", splitting it into a list, and comparing those components to a list of available components generated from column A of a Google Sheet. What I am attempting to do is return the cell value from column G that corresponds to the matching column A index, and then repeat this for all input values.
So far I am getting the first value just fine, but I'm obviously missing something to get it to cycle back through the remaining user input components. I tried some things using itertools but wasn't able to get the results I wanted. I have a feeling I will facepalm when I discover the solution, through here or on my own.
mix = select.split(',') # sets user input to string and separates elements
ws = s.worksheet("Details") # opens table in google sheet
c_list = ws.col_values(1) # sets column A to a list
modifier = [""] * len(mix) # sets size of list based on user input
list = str(c_list).lower()
for i in range(len(mix)):
    if str(mix[i]).lower() in str(c_list).lower():
        for j in range(len(c_list)):
            if str(mix[i]).lower() == str(c_list[j]).lower():
                modifier[i] = ws.cell(j+1,7).value # get value of cell from Column G corresponding to Column A for component name
print(mix)
print(modifier)

You are overcomplicating the code by writing C-like code.
I have replaced all the loops you had with a single, simpler loop, and I have left a comment above each line of code to explain what it does.
# Here we use .lower() to lower case all the values in select
# before splitting them and adding them to the list "mix"
mix = select.lower().split(",")
ws = s.worksheet("Details")
# Here we used a list comprehension to create a list of the "column A"
# values but all in lower case
c_list = [cell.lower() for cell in ws.col_values(1)]
modifier = [""] * len(mix)
# Here we loop through every item in mix, but also keep a count of iterations
# we have made, which we will use later to add the "column G" element to the
# corresponding location in the list "modifier"
for i, value in enumerate(mix):
    # Here we check if the value exists in the c_list
    if value in c_list:
        # If we find the value in the c_list, we get the index of the value in c_list
        index = c_list.index(value)
        # Here we add the value of column G that has an index of "index + 1" to
        # the modifier list at the same location of the value in list "mix"
        modifier[i] = ws.cell(index + 1, 7).value
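If there are many components, one possible variation is to build a dictionary from component name to row index once, and to read column G as a list up front instead of calling ws.cell() per match. The sketch below uses plain lists as stand-ins for the two sheet columns. The function name lookup_modifiers, the sample data, and the extra strip() of whitespace around each name are my own assumptions, not part of the code above.

```python
def lookup_modifiers(select, column_a, column_g):
    """Return the column G value for each comma-separated component in `select`."""
    # Map each lower-cased component name to the index of its first row
    row_by_name = {}
    for row, name in enumerate(column_a):
        row_by_name.setdefault(str(name).strip().lower(), row)

    modifiers = []
    for component in select.split(","):
        row = row_by_name.get(component.strip().lower())
        # Unknown components and rows missing from column G both yield ""
        if row is None or row >= len(column_g):
            modifiers.append("")
        else:
            modifiers.append(column_g[row])
    return modifiers


# Hypothetical stand-ins for the values of column A and column G
column_a = ["Resin", "Hardener", "Pigment"]
column_g = ["1.0", "0.5", "0.1"]
print(lookup_modifiers("pigment, RESIN,unknown", column_a, column_g))  # ['0.1', '1.0', '']
```

The result keeps the order of the user's input, which is what the modifier list in the answer is for.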

How to reference a Class List with another class?

I am setting up a script that will extract data from Excel and return it in lists. Right now I am trying to reorganize the data into smaller lists that share a common attribute (such as a list holding the indices of the rows that contain 'Pencil'). The smaller list keeps coming back as None.
I've checked, and the lists that extract the data are working fine, but I can't seem to get the smaller lists working.
#Create a class for the multiple lists of Columns
class Data_Column(list):
    def Fill_List (self,col): #fills the list
        for i in range(sheet.nrows):
            self.append(sheet.cell_value(i,col))
#Create a class for a specific list that has data of a common artifact
class Specific_List(list):
    def Find_And_Fill (self, listy, word):
        for i in range (sheet.nrows):
            if listy[i] == word:
                self.append(i)
#Initiate and Populate lists from excel spreadsheet
date = Data_Column()
date.Fill_List(0)
location = Data_Column()
location.Fill_List(1)
name = Data_Column()
name.Fill_List(2)
item = Data_Column()
item.Fill_List(3)
specPencil = Specific_List()
print(specPencil.Find_And_Fill(item,'Pencil'))
I expected a List that contained the indices where 'Pencil' was found such as [1,6,12,14,19].
The actual output was: None

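I needed to take the print out of the very last line. Find_And_Fill fills the list in place and has no return statement, so printing its result prints None.
specPencil.Find_And_Fill(item,'Pencil')
print(specPencil)
I knew it was a simple fix.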
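For anyone hitting the same thing: a method without a return statement returns None, just like list.append does. An alternative to moving the print is to have the method return self. This is only a sketch; the snake_case names and the sample data are hypothetical, and a plain list stands in for the spreadsheet column.

```python
class SpecificList(list):
    """A list of the indices at which a given word occurs in another list."""

    def find_and_fill(self, values, word):
        for i, value in enumerate(values):
            if value == word:
                self.append(i)
        # Returning self lets the call be printed or chained directly
        return self


# Hypothetical stand-in for the 'item' column
items = ["Pen", "Pencil", "Eraser", "Pencil"]

spec_pencil = SpecificList()
print(spec_pencil.find_and_fill(items, "Pencil"))  # [1, 3]

# Built-in mutating methods behave like the original Find_And_Fill
print([].append(1))  # None
```

Whether to return self is a style choice; Python's built-in containers deliberately return None from methods that modify them in place.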

Python: My directory is not giving individual value output

I have written code that imports data via xlrd into two dictionaries in Python.
Code:
import xlrd
#category.clear()
#term.clear()
book = xlrd.open_workbook("C:\Users\Koen\Google Drive\etc...etc..")
sheet = book.sheet_by_index(0)
num_rows = sheet.nrows
for i in range(1,num_rows,1):
    category = {i:( sheet.cell_value(i, 0))}
    term = {i:( sheet.cell_value(i, 1))}
When I print from one of the two dictionaries (category or term) using the loop variable, it presents me with a value.
print(category[i])
So far, so good.
However, when I try to access an individual value:
print(category["2"])
it consistently gives me an error:
Traceback (most recent call last):
  File "testfile", line 15, in <module>
    print(category["2"])
KeyError: '2'
The keys are indeed numbered (as determined by i).
I've already tried [], {}, "", '', etc. Nothing works.
As I need those values later on in the code, I would like to know what causes the KeyError.
Thanks in advance for taking a look!

First off, you are reassigning category and term on every iteration of the for loop. That way each dictionary only ever holds one key, ending with the last index, so if your sheet has 100 rows, the dict will only contain the key 99. To fix this, define the dictionaries outside the loop and assign the keys inside the loop, as follows:
category = {}
term = {}
for i in range(1, num_rows, 1):
    category[i] = (sheet.cell_value(i, 0))
    term[i] = (sheet.cell_value(i, 1))
Second, because you define the keys with for i in range(1, num_rows, 1):, they are integers, so you have to access the dictionary like so: category[1]. To use string keys, you need to convert them when assigning, for example category[str(i)].
I hope this clarifies the problem.
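To see both points in isolation, here is a small sketch with a plain list standing in for the spreadsheet column (the data and the name category_by_text are made up for illustration):

```python
# Hypothetical stand-in for column 0 of the sheet; row 0 is the header
rows = ["header", "fruit", "vegetable", "dairy"]

category = {}
for i in range(1, len(rows)):
    # Assigning into an existing dict keeps the earlier keys
    category[i] = rows[i]

print(category[2])      # vegetable
print(2 in category)    # True: the keys are integers
print("2" in category)  # False: the string "2" is a different key

# If string keys are preferred, convert them once
category_by_text = {str(key): value for key, value in category.items()}
print(category_by_text["2"])  # vegetable
```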

Python - Using string array, pass them as input for dataframe name for a function

I am new to Python and not sure whether this is a stupid way of doing it (because I might have thousands of stock codes). I have a list of stock codes, and for each code there is a dataframe that I created separately. I'd like to pass a different dataframe into the function based on each entry of the array. Any suggestions on the best way to do this?
def ind (in_df):
    in_df['CHG']= in_df['Open'] / in_df['Close']
    return ;
STK = ['GOOG','TSLA','AAPL']
for index in range(len(STK)):
    print (STK[index])
    ind(pd.DataFrame[STK[index]])

You can write the loop as shown below. That way you don't have to call len or index each list element separately.
STK = ['GOOG','TSLA','AAPL']
for stockCode in STK:
    print (stockCode)
    ind(pd.DataFrame[stockCode])
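One thing to watch: pd.DataFrame is the class itself, so indexing it with a stock code does not look up a dataframe you created earlier. A common way to get from a name to an object is to keep the objects in a dictionary keyed by stock code. The sketch below shows that pattern with plain dicts of lists standing in for the dataframes, so it runs without pandas; the name frames and the sample prices are made up. Real DataFrames could be stored as the dictionary values in the same way.

```python
def ind(table):
    """Add a 'CHG' column holding Open / Close for every row of the table."""
    table["CHG"] = [o / c for o, c in zip(table["Open"], table["Close"])]


# Hypothetical per-stock tables, keyed by stock code
frames = {
    "GOOG": {"Open": [10.0, 12.0], "Close": [5.0, 6.0]},
    "TSLA": {"Open": [9.0], "Close": [3.0]},
    "AAPL": {"Open": [8.0], "Close": [4.0]},
}

STK = ["GOOG", "TSLA", "AAPL"]
for stock_code in STK:
    # The string selects the matching table from the dictionary
    ind(frames[stock_code])

print(frames["GOOG"]["CHG"])  # [2.0, 2.0]
```

With thousands of codes, the dictionary also avoids having one separately named variable per stock.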

Issues with adding a variable to python gspread

I have started to use the gspread library, and I already have a sheet that I'd like to append to after the last row that has data in it. I'll retrieve the values between A1 and maxrows, loop through them, and check whether they are empty. However, I am unable to add a variable to the second line here. Perhaps I am just not escaping it correctly? I bet this is very simple:
maxrows = "A" + str(worksheet.row_count)
cell_list = worksheet.range('A1:A%s') % (maxrows)

Your variable maxrows is already in the form "An": the concatenation already contains the letter and the number.
But you are adding an extra A to it in worksheet.range('A1:A%s').
Also, you are not using string interpolation with % correctly: in your code, % is applied to the result of worksheet.range(...) and not to the range string.
It should have been one of these:
maxrows = "A" + str(worksheet.row_count)
worksheet.range('A1:%s' % maxrows)
or
worksheet.range('A1:A%d' % worksheet.row_count)
(among other possible solutions)
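To check the string on its own before handing it to worksheet.range(), you can build it in a small helper. This is just a sketch; the function name column_range is hypothetical and no gspread call is involved.

```python
def column_range(column, last_row, first_row=1):
    """Build an A1-style range string such as 'A1:A1000' for a single column."""
    # The % operator is applied to the format string itself
    return "%s%d:%s%d" % (column, first_row, column, last_row)


print(column_range("A", 1000))  # A1:A1000

# An f-string produces the same text
row_count = 1000
print(f"A1:A{row_count}")  # A1:A1000
```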
