I have started to use the gspread library and have sheet already that I'd like to append after the last row that has data in it. I'll retrieve the values between A1 and maxrows to loop through them and check if they are empty. However, I am unable to add a variable to the second line here. But perhaps I am just not escaping it correct? I bet this is very simple:
maxrows = "A" + str(worksheet.row_count)
cell_list = worksheet.range('A1:A%s') % (maxrows)
Your variable maxrows already is in the form of "An", the concatenation already contains the letter and the number
But you are adding an extra A to it here worksheet.range('A1:A%s')
Also you're not using the string interpolation correctly with % (in your code you are not applying % to the range string)
It should have been one of these
maxrows = "A" + str(worksheet.row_count)
worksheet.range('A1:%s' % maxrows)
or
worksheet.range('A1:A%d' % worksheet.row_count)
(among other possible solutions)
Related
I am trying to figure out the most efficient way of finding similar values of a specific cell in a specified column(not all columns) in an excel .xlsx document. The code I have currently assumes all of the strings are unsorted. However the file I am using and the files I will be using all have strings sorted from A-Z. So instead of doing a linear search I wonder what other search algorithm I could use as well as being able to fix my coding eg(binary search etc).
So far I have created a function: find(). Before the function runs the program takes in a value from the user's input that then gets set as the sheet name. I print out all available sheet names in the excel doc just to help the user. I created an empty array results[] to store well....the results. I created a for loop that iterates through only column A because I only want to iterate through a custom column. I created a variable called start that is the first coordinate in column A eg(A1 or A400) this will change depending on the iteration the loop is on. I created a variable called next that will get compared with the start. Next is technically just start + 1, however since I cant add +1 to a string I concatenate and type cast everything so that the iteration becomes a range from A1-100 or however many cells are in column A. My function getVal() gets called with two parameters, the coordinate of the cell and the worksheet we are working from. The value that is returned from getVal() is also passed inside my function Similar() which is just a function that calls SequenceMatcher() from difflib. Similar just returns the percentage of how similar two strings are. Eg. similar(hello, helloo) returns int 90 or something like that. Once the similar function is called if the strings are above 40 percent similar appends the coordinates into the results[] array.
def setSheet(ws):
sheet = wb[ws]
return sheet
def getVal(coordinate, worksheet):
value = worksheet[coordinate].value
return value
def similar(first, second):
percent = SequenceMatcher(None, first, second).ratio() * 100
return percent
def find():
column = "A"
print("\n")
print("These are all available sheets: ", wb.sheetnames)
print("\n")
name = input("What sheet are we working out of> ")
results = []
ws = setSheet(name)
for i in range(1, ws.max_row):
temp = str(column + str(i))
x = ws[temp]
start = ws[x].coordinate
y = str(column + str(i + 1))
next = ws[y].coordinate
if(similar(getVal(start,ws), getVal(next,ws)) > 40):
results.append(getVal(start))
return results
This is some nasty looking code so I do apologize in advance. The expected results should just be a list of strings that are "similar".
I have this data set which is in this format in this way in csv file:
1st question : I am trying to find duplicates rows in the table just created in python below?
I did try to use the set function to run the rows and the output I got is
no duplicates even though there is a duplicate row in the data set.
2nd question: is it possible to reference this table as i realized that it becomes a table when I print?So that I can use it on the next step for calculation purpose.
COL_1_WIDTH = 10
COL_2_WIDTH = 35
for row in data:
IC1 = len(str(row[0]))
IC2 = len(str(row[1]))
print( str(row[0])+ str( (COL_1_WIDTH-IC1) *' ') +\
str(row[1]) + str( (COL_2_WIDTH-IC2) *' ') +\
str(row[2]))
for row in data:
if len(set(row)) !=len(row):
print ('duplicates: ', row)
else:
print ('no duplicates:', row)
P.s. Permit to use built in function & numpy only.
Grateful for any ideas. Thank you!
You don't really explain what kind of object is 'data', so I assumed it was a list of strings.
Here's how I created mine from a csv file:
with open('/home/sebastien/Documents/answerSO.csv') as file:
data=file.read() #a string
data=data.split('\n') #a list of strings
data.pop() #to delete the last element, an empty string
(note that using the csv module may be a better idea)
Now, to look for duplicates, I used the method explained here:
How do I find the duplicates in a list and create another list with them?
seen = set()
uniq = []
for row in data:
if row not in seen:
uniq.append(row)
seen.add(row)
else:
print("found a duplicate:",row)
And about referencing it, well, it's in 'data'
So I was using openpyxl for all my Excel projects, but now I have to work with .xls files, so I was forced to change library. I've chosen pyexcel cuz it seemed to be fairly easy and well documented. So I've gone through hell with creating hundreds of variables, cuz there is no .index property, or something.
What I want to do now is to read the column in the correct file, f.e "Quantity" column, and get f.e value 12 from it, then check the same column in other file, and if it is not 12, then make it 12. Easy. But I cannot find any words about changing a single cell value in their documentation. Can you help me?
I didn't get it, wouldn't it be the most simple thing?
column_name = 'Quantity'
value_to_find = 12
sheets1 = pe.get_book(file_name='Sheet1.xls')
sheets1[0].name_columns_by_row(0)
row = sheets1[0].column[column_name].index(value_to_find)
sheets2 = pe.get_book(file_name='Sheet2.xls')
sheets2[0].name_columns_by_row(0)
if sheets2[0][row, column_name] != value_to_find:
sheets2[0][row, column_name] = value_to_find
EDIT
Strange, you can only assign values if you use cell_address indexing, must be some bug. Add this function:
def index_to_letter(n):
alphabet = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
result = []
while (n > 26):
result.insert(0, alphabet[(n % 26)])
n = n // 26
result.insert(0, alphabet[n])
return ''.join(result)
And modify the last part:
sheets2[0].name_columns_by_row(0)
col_letter = index_to_letter(sheets2[0].colnames.index(column_name))
cel_address = col_letter+str(row+1)
if sheets2[0][cel_address] != value_to_find:
sheets2[0][cel_address] = value_to_find
EDIT 2
Looks like you cannot assign only when you use the column name directly, so a around would be to find the column_name's index:
sheets2[0].name_columns_by_row(0)
col_index = sheets2[0].colnames.index(column_name)
if sheets2[0][row, col_index] != value_to_find:
sheets2[0][row, col_index] = value_to_find
Excel uses 2 sets of references to a cell. Cell name ("A1") and cell vector (Row, Column).
The PyExcel documentation tutorial states it supports both methods. caiohamamura's method tries to build the cell name - you don't need to if the cells are in the same location in each file, you can use the vector.
Once you have the cell, assigning a value to a single cell is simple - you assign the value. Example:
import pyexcel
sheet = pyexcel.get_sheet(file_name="Units.xls")
print(sheet[3,2]) # this gives me "cloud, underwater"
sheet[3,2] = "cloud, underwater, special"
sheet.save_as("Units1.xls")
Note that all I had to do was "sheet[3,2] =".
This is not explicitly stated but is hinted at in the pyexcel documentation where it states that to update a whole column you do:
sheet.column["Column 2"] = [11, 12, 13]
i.e. replace a list by assigning a new list. Same logic applies to a single cell - just assign a new value.
Bonus - the [Row, column] method gets around cell locations in columns greater than 26 (i.e. 'AA' and above).
Caveat - make sure in your comparison you are comparing like-for-like i.e. int is understood to be an int and not a str. Python should implicitly converted but in some circumstances it may not - especially if you are using python 2 and Unicode is involved.
I have the following problem:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
I need to find which elements of list2 are in list1. In actual fact the elements of list1 correspond to a numerical value which I need to obtain then change. The problem is that 'xyz2' contains 'xyz' and therefore matches also with a regular expression.
My code so far (where 'data' is a python dictionary and 'specie_name_and_initial_values' is a list of lists where each sublist contains two elements, the first being specie name and the second being a numerical value that goes with it):
all_keys = list(data.keys())
for i in range(len(all_keys)):
if all_keys[i]!='Time':
#print all_keys[i]
pattern = re.compile(all_keys[i])
for j in range(len(specie_name_and_initial_values)):
print re.findall(pattern,specie_name_and_initial_values[j][0])
Variations of the regular expression I have tried include:
pattern = re.compile('^'+all_keys[i]+'$')
pattern = re.compile('^'+all_keys[i])
pattern = re.compile(all_keys[i]+'$')
And I've also tried using 'in' as a qualifier (i.e. within a for loop)
Any help would be greatly appreciated. Thanks
Ciaran
----------EDIT------------
To clarify. My current code is below. its used within a class/method like structure.
def calculate_relative_data_based_on_initial_values(self,copasi_file,xlsx_data_file,data_type='fold_change',time='seconds'):
copasi_tool = MineParamEstTools()
data=pandas.io.excel.read_excel(xlsx_data_file,header=0)
#uses custom class and method to get the list of lists from a file
specie_name_and_initial_values = copasi_tool.get_copasi_initial_values(copasi_file)
if time=='minutes':
data['Time']=data['Time']*60
elif time=='hour':
data['Time']=data['Time']*3600
elif time=='seconds':
print 'Time is already in seconds.'
else:
print 'Not a valid time unit'
all_keys = list(data.keys())
species=[]
for i in range(len(specie_name_and_initial_values)):
species.append(specie_name_and_initial_values[i][0])
for i in range(len(all_keys)):
for j in range(len(specie_name_and_initial_values)):
if all_keys[i] in species[j]:
print all_keys[i]
The table returned from pandas is accessed like a dictionary. I need to go to my data table, extract the headers (i.e. the all_keys bit), then look up the name of the header in the specie_name_and_initial_values variable and obtain the corresponding value (the second element within the specie_name_and_initial_value variable). After this, I multiply all values of my data table by the value obtained for each of the matched elements.
I'm most likely over complicating this. Do you have a better solution?
thanks
----------edit 2 ---------------
Okay, below are my variables
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
species = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
You don't need a regex to find common elements, set.intersection will find all elements in list2 that are also in list1:
list1=['xyz','xyz2','other_randoms']
list2=['xyz']
print(set(list2).intersection(list1))
set(['xyz'])
Also if you wanted to compare 'xyz' to 'xyz2' you would use == not in and then it would correctly return False.
You can also rewrite your own code a lot more succinctly, :
for key in data:
if key != 'Time':
pattern = re.compile(val)
for name, _ in specie_name_and_initial_values:
print re.findall(pattern, name)
Based on your edit you have somehow managed to turn lists into strings, one option is to strip the []:
all_keys = set([u'Cyp26_G_R1', u'Cyp26_G_rep1', u'Time'])
specie_name_and_initial_values = set(['[Cyp26_R1R2_RARa]', '[Cyp26_SRC3_1]', '[18-OH-RA]', '[p38_a]', '[Cyp26_G_rep1]', '[Cyp26]', '[Cyp26_G_a]', '[SRC3_p]', '[mRARa]', '[np38_a]', '[mRARa_a]', '[RARa_pp_TFIIH]', '[RARa]', '[Cyp26_G_L2]', '[atRA]', '[atRA_c]', '[SRC3]', '[RARa_Ser369p]', '[p38]', '[Cyp26_mRNA]', '[Cyp26_G_L]', '[TFIIH]', '[Cyp26_SRC3_2]', '[Cyp26_G_R1R2]', '[MSK1]', '[MSK1_a]', '[Cyp26_G]', '[Basal_Kinases]', '[Cyp26_R1_RARa]', '[4-OH-RA]', '[Cyp26_G_rep2]', '[Cyp26_Chromatin]', '[Cyp26_G_R1]', '[RXR]', '[SMRT]'])
specie_name_and_initial_values = set(s.strip("[]") for s in specie_name_and_initial_values)
print(all_keys.intersection(specie_name_and_initial_values))
Which outputs:
set([u'Cyp26_G_R1', u'Cyp26_G_rep1'])
FYI, if you had lists inside the set you would have gotten an error as lists are mutable so are not hashable.
I'm using the Field Calculator in ArcMap and
I need to create a unique ID for every storm drain in my county.
An ID Should look something like this: 16-I-003
The first number is the municipal number which is in the column/field titled "Munic"
The letter is using the letter in the column/field titled "Point"
The last number is simply just 1 to however many drains there are in a municipality.
So far I have:
rec=0
def autoIncrement()
pStart=1
pInterval=1
if(rec==0):
rec=pStart
else:
rec=rec+pInterval
return "16-I-" '{0:03}'.format(rec)
So you can see that I have manually been typing in the municipal number, the letter, and the hyphens. But I would like to use the fields: Munic and Point so I don't have to manually type them in each time it changes.
I'm a beginner when it comes to python and ArcMap, so please dumb things down a little.
I'm not familiar with the ArcMap, so can't directly help you, but you might just change your function to a generator as such:
def StormDrainIDGenerator():
rec = 0
while (rec < 99):
rec += 1
yield "16-I-" '{0:03}'.format(rec)
If you are ok with that, then parameterize the generator to accept the Munic and Point values and use them in your formatting string. You probably should also parameterize the ending value as well.
Use of a generator will allow you to drop it into any later expression that accepts an iterable, so you could create a list of such simply by saying list(StormDrainIDGenerator()).
Is your question on how to get Munic and Point values into the string ID? using .format()?
I think you can use following code to do that.
def autoIncrement(a,b):
global rec
pStart=1
pInterval=1
if(rec==0):
rec=pStart
else:
rec=rec+pInterval
r = "{1}-{2}-{0:03}".format(a,b,rec)
return r
and call
autoIncrement( !Munic! , !Point! )
The r = "{1}-{2}-{0:03}".format(a,b,rec) just replaces the {}s with values of variables a,b which are actually the values of Munic and Point passed to the function.