Is there any faster way to delete Excel rows using openpyxl? - python

I have a list of 2,138 Excel row numbers that I want to delete using openpyxl. Here is the code:
delete_this_row = [1,2,....,2138]
for delete in delete_this_row:
    worksheet.delete_rows(delete)
But it's too slow. It takes 45 seconds to a minute to finish the process.
Is there any faster way to complete the task?

There's almost always a faster way to do something. Sometimes the cost is too high but not in this case, I suspect :-)
If it's just a set of contiguous rows you want to delete, you can just use:
worksheet.delete_rows(1, 2138)
Documentation here, copied below for completeness:
delete_rows(idx, amount=1): Delete row or rows from row==idx.
Your solution is slow since, every time you delete a single row, it has to shift everything beneath that point up one row then delete the final row.
By passing in the row count, it instead does one shift, shifting rows 2139..max straight up to rows 1..max-2138, then deletes all the rows that are below max-2138.
This is likely to be roughly 2,138 times faster than what you have now :-)
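If you want to see the difference for yourself, here is a rough benchmarking sketch; the sheet size and the idea of timing both approaches are mine, not from the original question, and the numbers will vary by machine:

import time
from openpyxl import Workbook

def make_sheet(rows=5000):
    # Build a throwaway in-memory sheet with one column of numbers.
    wb = Workbook()
    ws = wb.active
    for r in range(1, rows + 1):
        ws.cell(row=r, column=1, value=r)
    return ws

# One delete_rows() call per row: every call shifts everything below it up.
ws = make_sheet()
start = time.perf_counter()
for _ in range(2138):
    ws.delete_rows(1)
print(f"row-by-row: {time.perf_counter() - start:.2f}s")

# A single bulk call: one shift, then the trailing rows are dropped.
ws = make_sheet()
start = time.perf_counter()
ws.delete_rows(1, 2138)
print(f"bulk:       {time.perf_counter() - start:.2f}s")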
If you have arbitrary row numbers in your array, you can still use this approach to optimise it as much as possible.
The idea here is to first turn your row list into a tuple list where each tuple has:
the starting row; and
the number of rows to delete from there.
Ideally, you'd also generate this in reverse order so you could just process it as is. The following snippet shows how you could do this, with the openpyxl calls being printed rather than called:
def reverseCombiner(rowList):
    # Don't do anything for empty list. Otherwise,
    # make a copy and sort.
    if len(rowList) == 0: return []
    sortedList = rowList[:]
    sortedList.sort()

    # Init, empty tuple, use first item for previous and
    # first in this run.
    tupleList = []
    firstItem = sortedList[0]
    prevItem = sortedList[0]

    # Process all other items in order.
    for item in sortedList[1:]:
        # If start of new run, add tuple and use new first-in-run.
        if item != prevItem + 1:
            tupleList = [(firstItem, prevItem + 1 - firstItem)] + tupleList
            firstItem = item
        # Regardless, current becomes previous for next loop.
        prevItem = item

    # Finish off the final run and return tuple list.
    tupleList = [(firstItem, prevItem + 1 - firstItem)] + tupleList
    return tupleList
# Test data, hit me with anything :-)
myList = [1, 70, 71, 72, 98, 21, 22, 23, 24, 25, 99]
# Create tuple list, show original and that list, then process.
tuples = reverseCombiner(myList)
print(f"Original: {myList}")
print(f"Tuples: {tuples}\n")
for tuple in tuples:
    print(f"Would execute: worksheet.delete_rows({tuple[0]}, {tuple[1]})")
The output is:
Original: [1, 70, 71, 72, 98, 21, 22, 23, 24, 25, 99]
Tuples: [(98, 2), (70, 3), (21, 5), (1, 1)]
Would execute: worksheet.delete_rows(98, 2)
Would execute: worksheet.delete_rows(70, 3)
Would execute: worksheet.delete_rows(21, 5)
Would execute: worksheet.delete_rows(1, 1)
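If you want to actually apply those calls instead of printing them, a minimal sketch would look like the following; the file names and workbook handling are placeholders, not from the original question, and delete_this_row is the question's list of row numbers:

from openpyxl import load_workbook

wb = load_workbook("data.xlsx")      # placeholder file name
ws = wb.active

# The tuples come back largest-start-first, so earlier deletions
# never shift the rows that later tuples refer to.
for start, count in reverseCombiner(delete_this_row):
    ws.delete_rows(start, count)

wb.save("data_trimmed.xlsx")         # placeholder output name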

Related

How to compare a pair of values in python, to see if the next value in pair is greater than the previous?

I have the following list:
Each pair of values gives me information about a specific row. I want to compare the value in the first position of a pair to the next one and check whether the next value is less than the current value; if it is, keep the pair, otherwise delete it. So, for example, for indices 0 and 1, comparing 29 to 25, I see that 25 is less than 29, so I keep the pair. I then add two to the current index, which takes me to the pair (16, 19); here the next value, 19, is not less than 16, so I delete the pair (16, 19). I have the following code:
curr = 0
skip = 0
finapS = []
while curr < len(apS):
    if distance1[apS[skip+1]] < distance1[apS[skip]]:
        print("its less than prev")
        print(curr, skip)
        finapS.append(distance1[apS[skip]])
        finapS.append(distance1[apS[skip+1]])
        skip = skip + 2
        curr = curr + 1
        print("itterated,", skip, curr)
distance1 is a list of data-point values. apS is a list containing the indices of the important values in distance1. distance1 has all the values, but I only need the values at the indices in apS, and I need to check whether those pairs of values are in descending order. The code I tried running gives me an infinite loop and I can't understand why. Here I am adding the values to a new list, but if possible I would like to just delete those pairs of values and keep the original list.
I think this kind of logic is more easily done using a generator. You can loop through the data and only yield values if they meet your condition. e.g.
def filter_pairs(data):
    try:
        it = iter(data)
        while True:
            a, b = next(it), next(it)
            if b < a:
                yield from (a, b)
    except StopIteration:
        pass
Example usage:
>>> aps = [1, 2, 3, 1, 2, 4, 6, 5]
>>> finaps = list(filter_pairs(aps))
>>> finaps
[3, 1, 6, 5]
So it looks like you want a new list. Therefore:
apS = [29.12, 25.01, 16.39, 19.49, 14.24, 12.06]
apS_new = []

for x, y in zip(apS[::2], apS[1::2]):
    if x > y:
        apS_new.extend([x, y])

print(apS_new)
Output:
[29.12, 25.01, 14.24, 12.06]
In pure Python, I think zip is the elegant way of doing this, combined with slice steps.
Assuming your list is defined as:
>>> a = [29, 25, 16, 19, 14, 12, 22, 8, 26, 25, 26]
You can zip the list with itself, shifted by one, using a slice step of two:
>>> list(zip(a[:-1:2], a[1::2]))
[(29, 25), (16, 19), (14, 12), (22, 8), (26, 25)]
Once you have that, you can filter the sequence down to the items you want; your full solution will be:
>>> list((x, y) for (x, y) in zip(a[:-1:2], a[1::2]) if x > y)
[(29, 25), (14, 12), (22, 8), (26, 25)]
If you prefer to go the NumPy path, reshape the array into pairs and use boolean indexing (note that NumPy itself has no np.shift function; np.roll is the nearest built-in).
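For completeness, here is a minimal NumPy sketch of the same pairwise filter; it assumes an even number of values, so drop a trailing odd element first:

import numpy as np

a = np.array([29, 25, 16, 19, 14, 12, 22, 8, 26, 25])  # even-length example data
pairs = a.reshape(-1, 2)                  # one row per (current, next) pair
kept = pairs[pairs[:, 0] > pairs[:, 1]]   # keep pairs whose second value is smaller
print(kept)                               # [[29 25] [14 12] [22  8] [26 25]]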
If your test is false, you loop without incrementing the counter curr. You need an
else:
    curr += 1
(or += 2, according to the logic)
to progress through the list.
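For concreteness, here is a minimal sketch of the loop with the counter fix applied; it uses the question's names (distance1, apS, finapS) with illustrative stand-in data, and lets the pair index skip drive the loop directly, which is one way of applying the fix:

# Illustrative stand-ins for the question's data: distance1 holds the values,
# apS holds the indices of the interesting values, taken in pairs.
distance1 = [29.12, 25.01, 16.39, 19.49, 14.24, 12.06]
apS = [0, 1, 2, 3, 4, 5]

skip = 0
finapS = []
while skip + 1 < len(apS):
    a = distance1[apS[skip]]
    b = distance1[apS[skip + 1]]
    if b < a:          # keep the pair only if it is descending
        finapS.append(a)
        finapS.append(b)
    skip += 2          # advance to the next pair whether or not it was kept
print(finapS)          # [29.12, 25.01, 14.24, 12.06]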

Change value of portion of a string within a list and get sum of string

I am working on using itertools to get a list of combinations, but am stuck with manipulating those combinations once I have them. Here is what I have:
k_instances = 3  # Instances of lysine
k_modifications = {'Hydroxylation', 'Carboxylation'}  # Modifications applicable to lysine
k_combinations = itertools.combinations_with_replacement(k_modifications, k_instances)  # Possible modifications assigned
k_comb_list = list(k_combinations)  # Convert combinations to a list
k_comb_list_str = [k_comb_list[i:i+k_instances] for i in range(0, len(k_comb_list), k_instances)]
for i in range(len(k_comb_list_str)):
    k_comb_list_str[i] = 16 if k_comb_list_str[i] == 'Hydroxylation' else k_comb_list_str[i]
print(k_comb_list_str)
When running this, I get:
[[('Carboxylation', 'Carboxylation', 'Carboxylation'), ('Carboxylation', 'Carboxylation', 'Hydroxylation'), ('Carboxylation', 'Hydroxylation', 'Hydroxylation')], [('Hydroxylation', 'Hydroxylation', 'Hydroxylation')]]
My idea is to replace each of these variables with their mass, for instance replace all occurrences of Carboxylation with 16. Doing this I would like to end up with a list of strings, something like this:
[[(16,16,16),(16,16,2),(16,2,2)...]]
I would then get the sum of each of these tuples:
[[(48),(34),(20)]]
And then essentially have a list of values possible based on the combinations.
I'm sure there is a simpler way of carrying this out, so any suggestions on how to execute it would be appreciated. I have tried replacing each value using else-if statements, but it doesn't work because I can't figure out how to manipulate the contents of the string; I can only search for the whole string, which defeats the purpose.
The easiest option is to make the various molecules variables, and use those, rather than trying to do string replacement later. For example:
import itertools
Hydroxylation = 2
Carboxylation = 16
k_instances = 3
k_modifications = [Carboxylation, Hydroxylation]  # heaviest first, so the output below comes out in the order shown
k_combinations = itertools.combinations_with_replacement(k_modifications, k_instances)
k_comb_l = list(k_combinations)
print(k_comb_l)
# [(16, 16, 16), (16, 16, 2), (16, 2, 2), (2, 2, 2)]
print([sum(x) for x in k_comb_l])
# [48, 34, 20, 6]
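If you would rather keep the modification names as strings (for example because they come from another data source), a dictionary lookup is an alternative sketch; the mass values here are just the illustrative ones used above:

import itertools

masses = {'Carboxylation': 16, 'Hydroxylation': 2}  # illustrative values
k_instances = 3

combos = itertools.combinations_with_replacement(masses, k_instances)
mass_tuples = [tuple(masses[name] for name in combo) for combo in combos]

print(mass_tuples)                    # [(16, 16, 16), (16, 16, 2), (16, 2, 2), (2, 2, 2)]
print([sum(t) for t in mass_tuples])  # [48, 34, 20, 6]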

Get a list of maximum values per tuple from a list of tuples

I'm just getting into Python, and having some trouble understanding the control flow and iteration logic.
I am trying to create a function which takes a list of tuples, and I want to return a new list with the maximum element per tuple.
I know I'm missing putting the maximum element into a new list, but first I am trying to get that maximum value.
def max_element_per_tuple(tuple_list):
    maximum = tuple_list[0]
    for item in tuple_list:
        if item > maximum:
            maximum = item
    return maximum

# test it
tuple_list = [(-1,0,1), (10,20,30), (100,50,25), (55,75,65)]
print(max_element_per_tuple(tuple_list))
This returns: (100, 50, 25)
Want returned: (1, 30, 100, 75)
Or if a list (?), then: [1, 30, 100, 75]
Simply try this one-liner Pythonic solution:
>>> tuple_list = [(-1,0,1), (10,20,30), (100,50,25), (55,75,65)]
>>> [max(e) for e in tuple_list] # List
[1, 30, 100, 75]
>>> tuple(max(e) for e in tuple_list) # Tuple
(1, 30, 100, 75)
Right now you are just looping through the tuples and returning the "biggest" one; tuple comparison is lexicographic, comparing element by element from the left.
What you want is to add another loop level that finds the maximum of each tuple:
def max_element_per_tuple(tuple_list):
    res = []
    for tup in tuple_list:  # loops over the tuples in the list
        maximum = tup[0]
        for item in tup:  # loops over each item of the tuple
            if item > maximum:
                maximum = item
        res.append(maximum)
    return res
This gives as expected:
>>> max_element_per_tuple([(-1, 0, 1), (10, 20, 30), (100, 50, 25), (55, 75, 65)])
[1, 30, 100, 75]
Your function max_element_per_tuple is correct (though unnecessary, because the built-in function max() already exists). What you did wrong was calling that function with a list of tuples as the argument. This found the biggest tuple in the list ("biggest" for tuples means the one that compares greatest, element by element from the left), which happened to be the third one, (100,50,25). What you need to do is either:
result = list(map(max, tuple_list))
or
result = [max(t) for t in tuple_list]
This last one is roughly equivalent to:
result = []
for t in tuple_list:
    result.append(max(t))
If you replace max with your max_element_per_tuple the results should be the same.
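For example, with tuple_list and your original function already defined, both of these produce the same result:
>>> list(map(max_element_per_tuple, tuple_list))
[1, 30, 100, 75]
>>> [max(t) for t in tuple_list]
[1, 30, 100, 75]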
This should work:
def max_element_per_tuple(tuple_list):
    maximum = []
    for item in tuple_list:
        maximum.append(max(item))
    return maximum
and will give this output: [1, 30, 100, 75]
The issue: max_element_per_tuple(tuple_list) returns the wrong result because it is looking for the max tuple, not the max value in each tuple.
def max_element_per_tuple(tuple_list):
    maximum = tuple_list[0]  # maximum = (-1,0,1)
    for item in tuple_list:
        if item > maximum:  # compares whole tuples, e.g. (10,20,30) > (-1,0,1)
            maximum = item
    return maximum  # at the end you have (100,50,25); it's the max tuple
Try any of the options below:
tuple_list = [(-1,0,1), (10,20,30), (100,50,25), (55,75,65)]

# Get the max from each tuple using a list comprehension
max_items_list = [max(tuple_item) for tuple_item in tuple_list]         # in a list
max_items_tuple = tuple(max(tuple_item) for tuple_item in tuple_list)   # in a tuple
print(max_items_list)
print(max_items_tuple)

# Get the max from each tuple using a for loop
# (can be used with a list only; tuples are immutable)
for_max_items_list = list()
for tuple_item in tuple_list:
    max_value = max(tuple_item)           # get the max of each tuple, e.g. max((-1,0,1)) = 1
    for_max_items_list.append(max_value)  # add the max to the list
print(for_max_items_list)

removing initial items from nested lists

I have a script which imports data and I am storing these in nested lists.
I have one list which instructs how many elements from each sub-list are to be discarded.
How do I do this?
I know how to do it manually, but I want to be able to upload a csv file into my program, and then let it run.
I have run the same line of data twice in csv file to try and make it simpler for me to fix, so I have
starting_index = [203,203]
but in principle this could have a 100 or so elements of different number.
I then have a whole series of nested lists. The number of elements in my starting_index matches the number of sub-lists within each list so at the moment there are only two sub-lists in each nested list.
I wanted to define a function that I could call to pare down each list. I know what is wrong with my code, but I do not know how to make it work.
def index_filter(original_list, new_list):
    for i in starting_index:
        print(i)
        for index, value in enumerate(original_list):
            for item, element in enumerate(value):
                if item >= i:
                    new_list[index].append(element)
I realise now that this does not work, and the problem is the
for i in starting_index:
line, because when it finishes the first element in starting_index, it goes on to the next and appends more data. It doesn't error, but it does not do what I wanted it to do. I just want to remove, in this case, the first 203 elements from sub-list one and the first 203 elements from sub-list two, but in principle those numbers will change.
I try and use enumerate all the time, and perhaps it's not appropriate here.
How can I solve this?
Thanks
Edit: Some sample data:
starting_index = [2,1,3]
list_one = [[15,34,67,89,44], [44,23,67,88,45,67,233,567,41,56.4],[45,6734,5,67,29,55,6345,23,89,45,6,8,3,4,5,876]]
ideal result:
list_one = [[67,89,44],[23,67,23,67,88,45,67,233,567,41,56.4],[67,29,55,6345,23,89,45,6,8,3,4,5,876]]
I have just come across the del statement which I am looking at, and I'll also have a look at the slice suggestion. Thanks
Edit: I tried the solution but I can't get it to work.
I tried that but when I put some test data in I get back the original unaltered list.
How do I access the output?
My test script:
original_list=[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]]
starting_index=[3,6]
def index_filter(original_list, starting_index):
    return [value[i:] for i, value in zip(starting_index, original_list)]
index_filter(original_list, starting_index)
print(index_filter)
print(original_list)
Outputs a strange message and the original unaltered list
<function index_filter at 0x039CC468>
[[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]]
Thank you
You need to loop through the starting_index and original_list in parallel, so use zip().
And you can use the slice value[i:] to get the part of a list starting at an index, rather than looping.
def index_filter(original_list, starting_index):
    return [value[i:] for i, value in zip(starting_index, original_list)]
original_list=[[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]]
starting_index=[3,6]
new_list = index_filter(original_list, starting_index)
print(new_list)

Speed up code that doesn't use groupby()?

I have two pieces of code (doing the same job) which take in an array of datetimes and produce clusters of datetimes whose consecutive elements differ by 1 hour.
First piece is:
# Needs: import itertools; from itertools import groupby; from operator import itemgetter
# (Python 2 code: itertools.izip and tuple-unpacking lambda parameters were removed in Python 3.)
def findClustersOfRuns(data):
    runClusters = []
    for k, g in groupby(itertools.izip(data[0:-1], data[1:]),
                        lambda (i, x): (i - x).total_seconds() / 3600):
        runClusters.append(map(itemgetter(1), g))
Second piece is:
def findClustersOfRuns(data):
    if len(data) <= 1:
        return []
    current_group = [data[0]]
    delta = 3600
    results = []
    for current, next in itertools.izip(data, data[1:]):
        if abs((next - current).total_seconds()) > delta:
            # Here, `current` is the last item of the previous subsequence
            # and `next` is the first item of the next subsequence.
            if len(current_group) >= 2:
                results.append(current_group)
            current_group = [next]
            continue
        current_group.append(next)
    return results
The first piece of code takes 5 minutes to execute while the second piece takes a few seconds. I am trying to understand why.
The data over which I am running the code has size:
data.shape
(13989L,)
The data contents are:
data
array([datetime.datetime(2016, 10, 1, 8, 0),
datetime.datetime(2016, 10, 1, 9, 0),
datetime.datetime(2016, 10, 1, 10, 0), ...,
datetime.datetime(2019, 1, 3, 9, 0),
datetime.datetime(2019, 1, 3, 10, 0),
datetime.datetime(2019, 1, 3, 11, 0)], dtype=object)
How do I improve the first piece of code to make it run as fast?
Based on the size, you are dealing with a long list (a large len). Your second piece of code has just one explicit for loop, whereas your first approach stacks several iteration constructs on top of each other: izip, groupby with a lambda key function, and a map(itemgetter(1), g) for every group. Those extra passes over a huge list add significant cost, and they also carry more per-element function-call overhead than a plain for loop.
I made a comparison for another post which you might find useful: Comparing list comprehensions and explicit loops.
Also, calling the lambda key function for every pair adds extra time.
You may further improve the execution time of the second piece by binding results.append to a local name, say append_result, and calling append_result(current_group) inside the loop; this avoids repeating the attribute lookup on every append.
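Here is a sketch of your second function with that micro-optimisation applied; it uses Python 3's zip in place of itertools.izip and renames next to nxt so the built-in isn't shadowed, but is otherwise unchanged:

def findClustersOfRuns(data):
    if len(data) <= 1:
        return []
    current_group = [data[0]]
    delta = 3600
    results = []
    append_result = results.append            # bind the method once, outside the loop
    for current, nxt in zip(data, data[1:]):
        if abs((nxt - current).total_seconds()) > delta:
            if len(current_group) >= 2:
                append_result(current_group)  # no repeated attribute lookup here
            current_group = [nxt]
            continue
        current_group.append(nxt)
    return results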
A few more comparisons:
Python 3: Loops, list comprehension and map slower compared to Python 2
Speed/efficiency comparison for loop vs list comprehension vs other methods
