Given the two lists below:
a = ['abc','cde','efg']
b = [[1,2,3],[2,3,4],[4,5,6]]
What is an optimized way to print the output shown below?
I am looking for an optimized way because in reality I have about 100 x 100 elements.
Also keep in mind that each element of b is an integer, while each element of a is a string:
abc,1,2,3
cde,2,3,4
efg,4,5,6
To print in the exact format you specified:
print('\n'.join([a[i] + ',' + str(b[i]).strip('[').strip(']').replace(' ','') for i in range(len(a))]))
Output:
abc,1,2,3
cde,2,3,4
efg,4,5,6
100 x 100 elements is a very small amount of data for a Python program; any optimization at this scale will probably be too small for us humans to notice. To test:
%%timeit
import numpy as np
array = np.random.randn(100, 100)
print('\n'.join([str(e) for e in array]))  # prints row by row, like above
result:
148 ms ± 13.2 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Also, keep in mind that the main bottleneck should be the print itself, not the process building the strings; using zip or other tricks may not help, as they do nothing to make the terminal (or whatever else captures stdout) print faster.
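If the output path really is the limit, one thing that can help is making a single write instead of many print calls. A minimal sketch (reusing the a and b lists from the question):
import sys

a = ['abc','cde','efg']
b = [[1,2,3],[2,3,4],[4,5,6]]
# build the whole output in memory, then hand it to stdout in one call
out = '\n'.join(name + ',' + ','.join(map(str, row)) for name, row in zip(a, b))
sys.stdout.write(out + '\n')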
Use range:
for i in range(len(b)):
    print("{},{}".format(a[i], ','.join([str(x) for x in b[i]])))
# output:
abc,1,2,3
cde,2,3,4
efg,4,5,6
You can try using zip() and str.join():
>>> a = ['abc','cde','efg']
>>> b = [[1,2,3],[2,3,4],[4,5,6]]
>>> print('\n'.join(','.join(map(str, (x, *y))) for x, y in zip(a, b)))
abc,1,2,3
cde,2,3,4
efg,4,5,6
Super easy question. Let's say I have this list in Python:
variables = ['A1,A1','A2,B2','A1,C2','B3,B3','C4,C4']
Now, I only need to keep those items where the value before and after the comma differs. In this case, the output would be:
result = ['A2,B2','A1,C2']
I already have a 'not-so-elegant' solution for this:
new_list = []
for i in range(len(variables)):
    j = variables[i].split(",")
    if j[0] != j[1].replace(" ", ""):
        z = "{},{}".format(j[0], j[1])
        new_list.append(z)
note: I had to add replace to remove whitespace, but it's not important...
Is there another (better) way to do this? Maybe regex?
note II: I also tried using a list comprehension:
lista_differents = ["{},{}".format(j[0], j[1]) for i in range(len(variables)) if j[0] != j[1].replace(" ", "")]
But I still have to figure out how to add the line j = variables[i].split(",")
Any ideas?
You can try this with a list comprehension and a set. You basically split each string into its two parts and check whether the set of those parts has len > 1, which means the two parts differ.
variables = ['A1,A1','A2,B2','A1,C2','B3,B3','C4,C4']
[i for i in variables if len(set(i.split(',')))>1]
['A2,B2', 'A1,C2']
If you are concerned about runtime, try this approach, which avoids split(',') entirely. It is much faster than the fastest of the other entries in the benchmark below.
[i for i in variables if len(set(i))>3]
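Note that this shortcut relies on the structure of the sample data (each part is one letter plus one digit, so equal parts collapse the set to exactly 3 characters, as the next answer also points out). For an input like 'AB,BA' the parts differ, yet len(set('AB,BA')) is still 3, so it would be wrongly dropped.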
EDIT: Adding benchmarking results (input list of length 300000, MacBook Pro 13)
Akshay Sehgal (first) - 215 ms ± 9.77 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Akshay Sehgal (second) - 136 ms ± 195 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Aviv Yaniv - 468 ms ± 39.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
jakub - 252 ms ± 29.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Md. Ashraful Alam - 252 ms ± 29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Here is another approach which looks at the number of unique characters. Max allowed is 3 (i.e. letter, number, comma).
l = ['A1,A1','A2,B2','A1,C2','B3,B3','C4,C4']
[i for i in l if len(set(i)) > 3]
Output:
['A2,B2', 'A1,C2']
... and a friendly challenge to Akshay Sehgal's benchmarking. :-)
[v for v in variables if str.__ne__(*v.replace(" ", "").split(","))]
You can use the str.__ne__(x1, x2) function, which is equivalent to x1 != x2. The * unpacks the list into separate arguments, so the outputs of .split(",") are made into two positional arguments (assuming there is only one , character in the string).
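As a quick illustration of the unpacking (the values here are just the sample data):
>>> parts = 'A1,C2'.replace(" ", "").split(",")
>>> parts
['A1', 'C2']
>>> str.__ne__(*parts)  # same as 'A1' != 'C2'
True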
variables = ['A1,A1','A2,B2','A1,C2','B3,B3','C4,C4']
result = [a for a in variables if a.split(',')[0]!=a.split(',')[1].replace(" ", "")]
print(result)
A solution that compares the two halves in place, without splitting into new strings. Note that it collects the entries whose parts are the same, so the requested result is the complement:
variables = ['A1,A1','A2,B2','A1,C2','B3,B3','C4,C4']
def find_same_with_separator(variables, separator=','):
    same_vars = []
    for v in variables:
        # Find the separator index
        separator_index = 0
        for i in range(len(v)):
            if separator == v[i]:
                break
            separator_index += 1
        # If no separator was found, skip this entry
        if separator_index == len(v):
            continue
        # Compare the two parts character by character
        before_separator = 0
        after_separator = separator_index + 1
        the_same = True
        while after_separator < len(v):
            if v[before_separator] != v[after_separator]:
                the_same = False
                break
            before_separator += 1
            after_separator += 1
        if the_same:
            same_vars.append(v)
    return same_vars

# ['A1,A1', 'B3,B3', 'C4,C4']
print(find_same_with_separator(variables))
It's a time-honored tradition in Python to detect if items are different by putting them into a set and counting the number of elements.
[a for a in variables
if len(set(a.split(','))) > 1]
Another idiom in Python is to do "assignment" inside a list comprehension by iterating over a list with only one element. So another possible solution is:
[a for a in variables
for pair in [a.split(',')]
if pair[0] != pair[1]]
Python 3.8's walrus operator lets you write:
[a for a in variables
if (pair := a.split(','))[0] != pair[1]]
but I think that's rather ugly and hard to read.
This is just a classic problem. One suggested solution is checking for a non-trivial rotation of the string. I believe this question has been used in several job interviews.
[x for x in variables if (x+','+x).find(x, 1, -1) == -1]
Out[183]: ['A2,B2', 'A1,C2']
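To see why this works: when the two halves are equal, x occurs again inside x+','+x at a non-trivial offset, and the start/end bounds 1 and -1 exclude the two trivial occurrences; when the halves differ, no interior occurrence exists:
>>> x = 'A1,A1'
>>> (x + ',' + x).find(x, 1, -1)  # 'A1,A1,A1,A1' contains x again at index 3
3
>>> y = 'A2,B2'
>>> (y + ',' + y).find(y, 1, -1)  # 'A2,B2,A2,B2' has no interior occurrence
-1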
P.S.: This question is more about logic/algorithms than pandas or any specific programming language.
I would like to create a numpy array where the first element is a defined constant, and every subsequent element is defined as a function of the previous element, in the following way:
import numpy as np
def build_array_recursively(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = V_0
    for i in range(1, length):
        returnList[i] = function(returnList[i-1])
    return returnList
d_t = 0.05
print(build_array_recursively(20, 0.3, lambda x: x-x*d_t+x*x/2*d_t*d_t-x*x*x/6*d_t*d_t*d_t))
The print call above outputs
[0.3 0.28511194 0.27095747 0.25750095 0.24470843 0.23254756 0.22098752
0.20999896 0.19955394 0.18962586 0.18018937 0.17122037 0.16269589
0.15459409 0.14689418 0.13957638 0.13262186 0.1260127 0.11973187 0.11376316]
Is there a fast way of doing this in numpy without a for loop?
If so is there a way to handle two elements before the current one, e.g. can a Fibonacci array be constructed similarly?
I found a similar question here
Is it possible to vectorize recursive calculation of a NumPy array where each element depends on the previous one?
but it was not answered in general. In my example, the difference equation is difficult to solve manually.
This is faster for what you want to do, and you don't have to index back into the array inside the loop.
Calculate each element from the previous value held in a plain variable, append the calculated element to a list, and then convert the list to a numpy array.
def method2(length, V_0, d_t):
    k = [V_0]
    x = V_0
    for i in range(1, length):
        x = x - x * d_t + x * x / 2 * d_t * d_t - x * x * x / 6 * d_t * d_t * d_t
        k.append(x)
    return np.asarray(k)
print(method2(20,0.3, 0.05))
Running your existing method 10000 times takes 0.438 seconds, while method2 takes 0.097 seconds.
Using a function to make the code clearer (instead of the inline lambda):
def fn(x):
    return x - x*d_t + x*x/2*d_t*d_t - x*x*x/6*d_t*d_t*d_t
And a function that combines elements of build_array_recursively and method2:
def foo1(length, V_0, function):
    returnList = np.empty(length)
    returnList[0] = x = V_0
    for i in range(1, length):
        returnList[i] = x = function(x)
    return returnList
In [887]: timeit build_array_recursively(20,0.3, fn);
61.4 µs ± 63 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [888]: timeit method2(20,0.3, 0.05);
16.9 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [889]: timeit foo1(20,0.3, fn);
13 µs ± 29.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The main time saver in method2 and foo1 is carrying over x, the last value, from one iteration to the next, rather than indexing with returnList[i-1].
The accumulation method, whether assigning into a preallocated array or appending to a list, matters less; performance is usually similar.
Here the calculation is simple enough that the details of what you do in the loop make a big difference in the overall time.
All of these are loops. Some ufuncs have a reduce (and accumulate) method that can apply the function repeatedly to the elements of the input array; np.sum, np.cumsum, etc. make use of this. But you can't do that with a general Python function.
You have to use some sort of compilation tool like numba to perform this sort of loop much faster.
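For example, a minimal numba sketch (assuming numba is installed; the update rule is hard-coded in the loop body, since numba cannot cheaply call an arbitrary uncompiled Python function per step):
from numba import njit
import numpy as np

@njit
def build_array_numba(length, V_0, d_t):
    out = np.empty(length)
    x = V_0
    out[0] = x
    for i in range(1, length):
        # same update rule as the lambda in the question
        x = x - x*d_t + x*x/2*d_t*d_t - x*x*x/6*d_t*d_t*d_t
        out[i] = x
    return out

print(build_array_numba(20, 0.3, 0.05))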
I want to apply outer addition of multiple vectors/matrices. Let's say four times:
import numpy as np
x = np.arange(100)
B = np.add.outer(x,x)
B = np.add.outer(B,x)
B = np.add.outer(B,x)
Ideally, the number of additions would be a variable, e.g. n = 4 --> apply the outer addition 4 times. Is this possible?
Approach #1
Here's one with array-initialization -
n = 4  # number of iterations to add outer versions
l = len(x)
out = np.zeros([l]*n, dtype=x.dtype)
for i in range(n):
    out += x.reshape(np.insert([1]*(n-1), i, l))
Why this approach, and not iterative addition that creates new arrays at each iteration?
Creating new arrays at each iteration would require more memory and hence add memory overhead. With array-initialization, we add the elements of x into an already initialized array, so the method stays memory-efficient.
Alternative #1
We can remove one iteration by initializing with x. Hence, the changes would be -
out = np.broadcast_to(x, [l]*n).copy()
for i in range(n-1):
    out += x.reshape(np.insert([1]*(n-1), i, l))
Approach #2: With np.add.reduce -
Another way would be with np.add.reduce, which again doesn't create intermediate arrays; being a reduction method, it might be better suited here, as that's what it is implemented for -
l = len(x); n = 4
np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
Timings -
In [17]: x = np.arange(100)
In [18]: %%timeit
...: n = 4 # number of iterations to add outer versions
...: l = len(x)
...: out = np.zeros([l]*n,dtype=x.dtype)
...: for i in range(n):
...:     out += x.reshape(np.insert([1]*(n-1),i,l))
829 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: l = len(x); n = 4
In [20]: %timeit np.add.reduce([x.reshape(np.insert([1]*(n-1),i,l)) for i in range(n)])
183 ms ± 2.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I don't think there's a builtin argument to repeat this procedure several times, but you can define a custom function for it fairly easily
def recursive_outer_add(arr, num):
    # num counts the vectors combined; num == 1 just returns arr
    if num == 1:
        return arr
    x = np.add.outer(arr, arr)
    for i in range(num - 2):
        # each further pass adds one more dimension
        x = np.add.outer(x, arr)
    return x
Just as a warning: the array gets really big really fast
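(To put numbers on it: with l = 100 and n = 4 the result has 100**4 = 10**8 elements, i.e. about 800 MB at 8 bytes per element.)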
Short and reasonably fast:
from timeit import timeit
n = 4
l = 10
x = np.arange(l)
sum(np.ix_(*n*(x,)))
timeit(lambda: sum(np.ix_(*n*(x,))), number=1000)
# 0.049082988989539444
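This works because np.ix_ turns the n copies of x into n views with shapes (l,1,...,1), (1,l,1,...,1), ..., (1,...,1,l), and summing them broadcasts to the full (l,...,l) result:
>>> a, b = np.ix_(x, x)
>>> a.shape, b.shape
((10, 1), (1, 10))
>>> (a + b).shape
(10, 10)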
We can speed this up a little by going back to front:
timeit(lambda:sum(reversed(np.ix_(*n*(x,)))),number=1000)
# 0.03847671199764591
We can also build our own reversed np.ix_:
from operator import getitem
from itertools import accumulate,chain,repeat
sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem))
timeit(lambda:sum(accumulate(chain((x,),repeat((slice(None),None),n-1)),getitem)),number=1000)
# 0.02427654700295534
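Here accumulate starts from x and repeatedly applies getitem(prev, (slice(None), None)), i.e. prev[:, None], so it yields arrays of shapes (l,), (l,1), (l,1,1), ...; summing them broadcasts to the same n-dimensional result, with the axes in reversed order compared to np.ix_, which is what made the previous variant faster.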
I'm trying to help my friend clean an order-list dataframe with one million elements.
The product_name column should hold lists, but the values are strings, so I want to split them into sublists.
Here's my code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
But the split part takes a lot of time to process. I'm wondering, is there a faster way to deal with it?
Thanks~
EDIT
(I did not like the previous version of this answer, it was too confusing, so I reordered it and tested a little more systematically.)
Long story short:
For speed, just use:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

df['product_name'].apply(str_to_list).to_list()
Long story long:
Let's dissect your code:
order_ls = raw_df['product_name'].tolist()
cln_order_ls = list()
for i in order_ls:
    i = i.replace('[', '')
    i = i.replace(']', '')
    i = i.replace('\'', '')
    cln_order_ls.append(i)

new_cln_order_ls = list()
for i in cln_order_ls:
    new_cln_order_ls.append(i.split(', '))
What you would really like is to have a function, say str_to_list(), which converts your input string to a list.
For some reason you do it in multiple steps, but this is really not necessary. What you have so far can be rewritten as:
def str_to_list_OP(s):
    return s.replace('[', '').replace(']', '').replace('\'', '').split(', ')
If you can assume that [ and ] are always the first and last characters of your string, you can simplify this to:
def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')
which should also be faster.
Alternative approaches would use regular expressions, e.g.:
import re

def str_to_list_regex(s):
    regex = re.compile(r'[\[\]\']')
    return re.sub(regex, '', s).split(', ')
Note that all approaches so far use split(). This is a quite fast implementation that approaches C speed, and hardly any pure-Python construct would beat it.
All these methods are quite unsafe as they do not take into account escaping properly, e.g. all of the above would fail for the following valid Python code:
['ciao', "pippo", 'foo, bar']
More robust alternatives in this scenario would be:
ast.literal_eval, which works for any valid Python literal
json.loads, which requires valid JSON strings, so it is not really an option here
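For instance, a quick check that ast.literal_eval handles the tricky string above:
>>> import ast
>>> ast.literal_eval("['ciao', \"pippo\", 'foo, bar']")
['ciao', 'pippo', 'foo, bar']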
The speed of these solutions was compared in a benchmark (the plots are not reproduced here); as they show, safety comes at the price of speed.
The graphs were generated using benchmark scripts with the following:
def gen_input(n):
    return str([str(x) for x in range(n)])

def equal_output(a, b):
    return a == b
input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = str_to_list_OP, str_to_list, str_to_list_regex, ast.literal_eval
runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
Now let's concentrate on the looping. What you do is explicit looping, and we know that Python is typically not terribly fast at that.
However, looping inside a comprehension can be faster because it can generate more optimized code.
Another approach would be a vectorized expression using Pandas primitives, either with apply() or with .str chaining.
The following timings indicate that comprehensions are fastest for smaller inputs, although the vectorized solution (using apply) catches up and eventually surpasses the comprehension:
The following test functions were used:
import pandas as pd

def str_to_list(s):
    return s[1:-1].replace('\'', '').split(', ')

def func_OP(df):
    order_ls = df['product_name'].tolist()
    cln_order_ls = list()
    for i in order_ls:
        i = i.replace('[', '')
        i = i.replace(']', '')
        i = i.replace('\'', '')
        cln_order_ls.append(i)
    new_cln_order_ls = list()
    for i in cln_order_ls:
        new_cln_order_ls.append(i.split(', '))
    return new_cln_order_ls

def func_QuangHoang(df):
    return df['product_name'].str[1:-1].str.replace('\'', '').str.split(', ').to_list()

def func_apply_df(df):
    return df['product_name'].apply(str_to_list).to_list()

def func_compr(df):
    return [str_to_list(s) for s in df['product_name']]
with the following test code:
def gen_input(n):
    return pd.DataFrame(
        columns=('order_id', 'product_name'),
        data=[[i, "['ciao', 'pippo', 'foo', 'bar', 'baz']"] for i in range(n)])

def equal_output(a, b):
    return a == b
input_sizes = (5, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000)
funcs = func_OP, func_QuangHoang, func_apply_df, func_compr
runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)
again using the same base scripts as before.
How about:
(df['product_name']
.str[1:-1]
.str.replace('\'','')
.str.split(', ')
)
Try this:
import ast
raw_df['product_name'] = raw_df['product_name'].apply(ast.literal_eval)
Like anky_91, I was curious about the list comprehension, so I gave it a try. I run the list comprehension directly on the ndarray to save the time of calling tolist:
n = raw_df['product_name'].values
[x[1:-1].replace('\'', '').split(', ') for x in n]
Sample data:
In [1488]: raw_df.values
Out[1488]:
array([["['C1', 'None', 'None']"],
["['C1', 'C2', 'None']"],
["['C1', 'C1', 'None']"],
["['C1', 'C2', 'C3']"]], dtype=object)
In [1491]: %%timeit
...: n = raw_df['product_name'].values
...: [x[1:-1].replace('\'', '').split(', ') for x in n]
...:
16.2 µs ± 614 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [1494]: %timeit my_func_2b(raw_df)
36.1 µs ± 489 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1493]: %timeit my_func_2(raw_df)
39.1 µs ± 2.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [1492]: %timeit raw_df['product_name'].str[1:-1].str.replace('\'','').str.split(', ').tolist()
765 µs ± 41.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, the list comprehension with chained replace and split is fastest, about twice as fast as the next one. However, the time saved really comes from working on the ndarray directly without calling tolist; if I add tolist, the difference is not significant.
The list.remove() function removes the first occurrence of an item in a list. Is there a built-in function to remove the last occurrence? For instance, if I have a list, say:
X = ['sf', 'cc', 'ch', 'sc', 'sh', 'ch']
and I want to remove the last 'ch' from the list, is there a better method than what I'm currently doing, which is:
X.reverse()
X.remove('ch')
X.reverse()
I will soon also have to worry about cases where the item being removed is potentially not in the list. So methods that do not throw errors in this case would be preferred.
if 'ch' in X:
    X.reverse()
    X.remove('ch')
    X.reverse()
The most pythonic way would be to do a try/except around remove:
X.reverse()
try:
    X.remove('ch')
except ValueError:
    pass
X.reverse()
As per your comment on speed, both of these methods are O(N), as x in list and list.reverse() are both O(N), so there's not much between them. If you expect the element usually to be there, you can save the x in list check by using try/except; however, if you expect it usually not to be there, you can save the two reverse()s by checking for membership first.
There's really nothing wrong with your code at all. It works, it's clear why it works, it's hard to get wrong or misunderstand.
Yes, you could make it faster, but only by a constant factor. (Your algorithm does two reverses, of N steps each, and one remove, of up to N steps, so it is O(N). And since your data aren't sorted or anything else that would help us find a value faster, the ideal algorithm will also be O(N).) And it comes at the cost of making the code more complicated.
The obvious probably-faster way to do it is to just manually iterate from the end until we find a value, then delete that value. That also avoids having to deal with the ValueError. Using enumerate might help… but getting it right (without copying the whole thing) may be tricky.
So, let's compare these to your existing code, wrapped both in a try/except and in an if:
def f_rev_ex(xs, s):
    xs.reverse()
    try:
        xs.remove(s)
    except ValueError:
        pass
    xs.reverse()

def f_rev_if(xs, s):
    if s in xs:
        xs.reverse()
        xs.remove(s)
        xs.reverse()

def f_for(xs, s):
    for i in range(len(xs)-1, -1, -1):
        if s == xs[i]:
            del xs[i]
            break

def f_enum(xs, s):
    for i, x in reversed(list(enumerate(xs))):
        if x == s:
            del xs[i]
            break
For a list as tiny as yours, the test isn't even worth running, so I invented my own random data (in real life you have to know your data, of course):
In [58]: import random, string; x = [random.choice(string.ascii_lowercase) for _ in range(10000)]
In [59]: %timeit y = x[:]; f_rev_ex(y, 'a')
10000 loops, best of 3: 34.7 µs per loop
In [60]: %timeit y = x[:]; f_rev_if(y, 'a')
10000 loops, best of 3: 35.1 µs per loop
In [61]: %timeit y = x[:]; f_for(y, 'a')
10000 loops, best of 3: 26.6 µs per loop
In [62]: %timeit y = x[:]; f_enum(y, 'a')
1000 loops, best of 3: 604 µs per loop
Well, that last one wasn't a very good idea… but the other one is about 25% faster than what we started with. So we've saved a whole 9 microseconds, on data 4 orders of magnitude larger than your actual data. It's up to you whether that's worth the less-readable, easier-to-screw up code. (And I'm not going to show you my enumerate-based implementation without copying, because I got it wrong. :P)
Produce a reversed enumeration, preserving the original indexes, and remove the first instance you find.
X = ['sf', 'cc', 'ch', 'sc', 'sh', 'ch']
print(X)
for i, e in reversed(list(enumerate(X))):
    if e == 'ch':
        del X[i]
        break
print(X)
If it doesn't find the string it leaves the list untouched.
Without reverse() and similar to one answer above:
def RightRemove(alist, x):
    for i in range(len(alist), 0, -1):  # from end to beginning
        if alist[i-1] == x:  # element x exists
            alist.pop(i-1)  # remove it
            break  # done
Well, first you can check whether the item is in the list using an in check. Then you can reverse the list, remove the element, and reverse it back.
if "ch" in X:
    X.reverse()
    X.remove("ch")
    X.reverse()
Yet another answer...
def remove_last_occurrence(lst, element):
    '''
    Removes the last occurrence of a given element in a list (modifies the list in-place).
    :return bool:
        True if the element was found and False otherwise.
    '''
    for i, s in enumerate(reversed(lst)):
        if s == element:
            del lst[len(lst) - 1 - i]
            return True
    return False
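For example, a quick check with the list from the question:
>>> X = ['sf', 'cc', 'ch', 'sc', 'sh', 'ch']
>>> remove_last_occurrence(X, 'ch')
True
>>> X
['sf', 'cc', 'ch', 'sc', 'sh']
>>> remove_last_occurrence(X, 'zz')
False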
Yet another one ..
def remove_last_occurrence_one_liner(lst, element):
    """
    Removes the last occurrence of a given element in a list (modifies the list in-place).
    Raises the same exception as lst.index(element) if the element cannot be found.
    """
    del lst[len(lst) - lst[::-1].index(element) - 1]
But it does not beat the for loop from abarnert:
x = [random.choice(string.ascii_lowercase) for _ in range(10000)]
%timeit y = x[:]; f_rev_ex(y, 'a')
34.3 µs ± 219 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit y = x[:]; f_rev_if(y, 'a')
34.9 µs ± 195 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit y = x[:]; f_for(y, 'a')
26.9 µs ± 109 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit y = x[:]; f_enum(y, 'a')
699 µs ± 4.86 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit y = x[:]; remove_last_occurrence_one_liner(y, 'a')
49 µs ± 375 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)