Improve Pandas Merge performance - python

I don't specifically have a performance issue with Pandas merge, as other posts suggest; rather, I have a class with a lot of methods that do a lot of merges on datasets.
The class has around 10 groupby and around 15 merge calls. While the groupbys are pretty fast, out of the class's total execution time of 1.5 seconds, around 0.7 seconds goes into those 15 merge calls.
I want to speed up those merge calls. Since I will run around 4000 iterations, saving 0.5 seconds per iteration will cut the overall runtime by around 30 minutes, which would be great.
Any suggestions I should try? I tried:
Cython
Numba (and Numba was slower)
Thanks
Edit 1:
Adding sample code snippets:
My merge statements:
tmpDf = pd.merge(self.data, t1, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t2, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t3, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t4, on='APPT_NBR', how='left')
tmp = tmpDf
tmpDf = pd.merge(tmp, t5, on='APPT_NBR', how='left')
And when implementing joins instead, I use the following statements:
dat = self.data.set_index('APPT_NBR')
t1.set_index('APPT_NBR', inplace=True)
t2.set_index('APPT_NBR', inplace=True)
t3.set_index('APPT_NBR', inplace=True)
t4.set_index('APPT_NBR', inplace=True)
t5.set_index('APPT_NBR', inplace=True)
tmpDf = dat.join(t1, how='left')
tmpDf = tmpDf.join(t2, how='left')
tmpDf = tmpDf.join(t3, how='left')
tmpDf = tmpDf.join(t4, how='left')
tmpDf = tmpDf.join(t5, how='left')
tmpDf.reset_index(inplace=True)
Note: all of these are part of a function named def merge_earlier_created_values(self).
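As a side note, DataFrame.join also accepts a list of frames, so the five chained joins could be collapsed into one call. A hedged sketch, assuming t1..t5 are already indexed on APPT_NBR and have non-overlapping column names:
dat = self.data.set_index('APPT_NBR')
tmpDf = dat.join([t1, t2, t3, t4, t5], how='left').reset_index()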
And when I timed the call with timedcall from profilehooks:
@timedcall(immediate=True)
def merge_earlier_created_values(self):
and profiled the method with:
@profile(immediate=True)
def merge_earlier_created_values(self):
I get the following results. The profiling of the function using merge is as follows:
*** PROFILER RESULTS ***
merge_earlier_created_values (E:\Projects\Predictive Inbound Cartoon Estimation-MLO\Python\CodeToSubmit\helpers\get_prev_data_by_date.py:122)
function called 1 times
71665 function calls (70588 primitive calls) in 0.524 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 563 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.012 0.012 0.524 0.524 get_prev_data_by_date.py:122(merge_earlier_created_values)
14 0.000 0.000 0.285 0.020 generic.py:1901(_update_inplace)
14 0.000 0.000 0.285 0.020 generic.py:1402(_maybe_update_cacher)
19 0.000 0.000 0.284 0.015 generic.py:1492(_check_setitem_copy)
7 0.283 0.040 0.283 0.040 {built-in method gc.collect}
15 0.000 0.000 0.181 0.012 generic.py:1842(drop)
10 0.000 0.000 0.153 0.015 merge.py:26(merge)
10 0.000 0.000 0.140 0.014 merge.py:201(get_result)
8/4 0.000 0.000 0.126 0.031 decorators.py:65(wrapper)
4 0.000 0.000 0.126 0.031 frame.py:3028(drop_duplicates)
1 0.000 0.000 0.102 0.102 get_prev_data_by_date.py:264(recreate_previous_cartons)
1 0.000 0.000 0.101 0.101 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
1 0.000 0.000 0.098 0.098 get_prev_data_by_date.py:360(recreate_previous_freight_type)
10 0.000 0.000 0.092 0.009 internals.py:4455(concatenate_block_managers)
10 0.001 0.000 0.088 0.009 internals.py:4471(<listcomp>)
120 0.001 0.000 0.084 0.001 internals.py:4559(concatenate_join_units)
266 0.004 0.000 0.067 0.000 common.py:733(take_nd)
120 0.000 0.000 0.061 0.001 internals.py:4569(<listcomp>)
120 0.003 0.000 0.061 0.001 internals.py:4814(get_reindexed_values)
1 0.000 0.000 0.059 0.059 get_prev_data_by_date.py:295(recreate_previous_appt_status)
10 0.000 0.000 0.038 0.004 merge.py:322(_get_join_info)
10 0.001 0.000 0.036 0.004 merge.py:516(_get_join_indexers)
25 0.001 0.000 0.024 0.001 merge.py:687(_factorize_keys)
74 0.023 0.000 0.023 0.000 {pandas.algos.take_2d_axis1_object_object}
50 0.022 0.000 0.022 0.000 {method 'factorize' of 'pandas.hashtable.Int64Factorizer' objects}
120 0.003 0.000 0.022 0.000 internals.py:4479(get_empty_dtype_and_na)
88 0.000 0.000 0.021 0.000 frame.py:1969(__getitem__)
1 0.000 0.000 0.019 0.019 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
39 0.000 0.000 0.018 0.000 internals.py:3495(reindex_indexer)
537 0.017 0.000 0.017 0.000 {built-in method numpy.core.multiarray.empty}
15 0.000 0.000 0.017 0.001 ops.py:725(wrapper)
15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array)
24 0.000 0.000 0.014 0.001 internals.py:3625(take)
10 0.000 0.000 0.014 0.001 merge.py:157(__init__)
10 0.000 0.000 0.014 0.001 merge.py:382(_get_merge_keys)
15 0.008 0.001 0.013 0.001 ops.py:662(na_op)
234 0.000 0.000 0.013 0.000 common.py:158(isnull)
234 0.001 0.000 0.013 0.000 common.py:179(_isnull_new)
15 0.000 0.000 0.012 0.001 generic.py:1609(take)
20 0.000 0.000 0.012 0.001 generic.py:2191(reindex)
The profiling of the function using join is as follows:
65079 function calls (63990 primitive calls) in 0.550 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 592 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.016 0.016 0.550 0.550 get_prev_data_by_date.py:122(merge_earlier_created_values)
14 0.000 0.000 0.295 0.021 generic.py:1901(_update_inplace)
14 0.000 0.000 0.295 0.021 generic.py:1402(_maybe_update_cacher)
19 0.000 0.000 0.294 0.015 generic.py:1492(_check_setitem_copy)
7 0.293 0.042 0.293 0.042 {built-in method gc.collect}
10 0.000 0.000 0.173 0.017 generic.py:1842(drop)
10 0.000 0.000 0.139 0.014 merge.py:26(merge)
8/4 0.000 0.000 0.138 0.034 decorators.py:65(wrapper)
4 0.000 0.000 0.138 0.034 frame.py:3028(drop_duplicates)
10 0.000 0.000 0.132 0.013 merge.py:201(get_result)
5 0.000 0.000 0.122 0.024 frame.py:4324(join)
5 0.000 0.000 0.122 0.024 frame.py:4371(_join_compat)
1 0.000 0.000 0.111 0.111 get_prev_data_by_date.py:264(recreate_previous_cartons)
1 0.000 0.000 0.103 0.103 get_prev_data_by_date.py:231(recreate_previous_appt_scheduled_date)
1 0.000 0.000 0.099 0.099 get_prev_data_by_date.py:360(recreate_previous_freight_type)
10 0.000 0.000 0.093 0.009 internals.py:4455(concatenate_block_managers)
10 0.001 0.000 0.089 0.009 internals.py:4471(<listcomp>)
100 0.001 0.000 0.085 0.001 internals.py:4559(concatenate_join_units)
205 0.003 0.000 0.068 0.000 common.py:733(take_nd)
100 0.000 0.000 0.060 0.001 internals.py:4569(<listcomp>)
100 0.001 0.000 0.060 0.001 internals.py:4814(get_reindexed_values)
1 0.000 0.000 0.056 0.056 get_prev_data_by_date.py:295(recreate_previous_appt_status)
10 0.000 0.000 0.033 0.003 merge.py:322(_get_join_info)
52 0.031 0.001 0.031 0.001 {pandas.algos.take_2d_axis1_object_object}
5 0.000 0.000 0.030 0.006 base.py:2329(join)
37 0.001 0.000 0.027 0.001 internals.py:2754(apply)
6 0.000 0.000 0.024 0.004 frame.py:2763(set_index)
7 0.000 0.000 0.023 0.003 merge.py:516(_get_join_indexers)
2 0.000 0.000 0.022 0.011 base.py:2483(_join_non_unique)
7 0.000 0.000 0.021 0.003 generic.py:2950(copy)
7 0.000 0.000 0.021 0.003 internals.py:3046(copy)
84 0.000 0.000 0.020 0.000 frame.py:1969(__getitem__)
19 0.001 0.000 0.019 0.001 merge.py:687(_factorize_keys)
100 0.002 0.000 0.019 0.000 internals.py:4479(get_empty_dtype_and_na)
1 0.000 0.000 0.018 0.018 get_prev_data_by_date.py:328(recreate_previous_location_numbers)
15 0.000 0.000 0.017 0.001 ops.py:725(wrapper)
34 0.001 0.000 0.017 0.000 internals.py:3495(reindex_indexer)
83 0.004 0.000 0.016 0.000 internals.py:3211(_consolidate_inplace)
68 0.015 0.000 0.015 0.000 {method 'copy' of 'numpy.ndarray' objects}
15 0.000 0.000 0.015 0.001 frame.py:2011(_getitem_array)
As you can see, merge is faster than join; the difference per call is small, but over 4000 iterations that small difference adds up to minutes.
Thanks

set_index on the merge column does indeed speed this up. Below is a slightly more realistic version of julien-marrec's answer.
import pandas as pd
import numpy as np
myids=np.random.choice(np.arange(10000000), size=1000000, replace=False)
df1 = pd.DataFrame(myids, columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.random.permutation(myids), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))
%%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
#1 loop, best of 3: 664 ms per loop
%%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
#1 loop, best of 3: 354 ms per loop
%%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
#Wall time: 16 ms
%%timeit
x = df1.join(df2, how='left')
#10 loops, best of 3: 80.4 ms per loop
When the column being joined contains integers that are not in the same order in both tables, you can still expect a great speedup of about 8x.
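For completeness, a hedged sketch: once both frames are indexed, the same left join can also be written as a merge on the indexes, which avoids naming key columns (not separately timed here):
x = pd.merge(df1, df2, left_index=True, right_index=True, how='left')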

I suggest that you set your merge columns as the index, and use df1.join(df2) instead of merge; it's much faster.
Here's an example, including profiling:
In [1]:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.arange(1000000), columns=['A'])
df1['B'] = np.random.randint(0,1000,(1000000))
df2 = pd.DataFrame(np.arange(1000000), columns=['A2'])
df2['B2'] = np.random.randint(0,1000,(1000000))
Here's a regular left merge on A and A2:
In [2]: %%timeit
x = df1.merge(df2, how='left', left_on='A', right_on='A2')
1 loop, best of 3: 441 ms per loop
Here's the same, using join:
In [3]: %%timeit
x = df1.set_index('A').join(df2.set_index('A2'), how='left')
1 loop, best of 3: 184 ms per loop
Now obviously if you can set the index before looping, the gain in terms of time will be much greater:
# Do this before looping
In [4]: %%time
df1.set_index('A', inplace=True)
df2.set_index('A2', inplace=True)
CPU times: user 9.78 ms, sys: 9.31 ms, total: 19.1 ms
Wall time: 16.8 ms
Then in the loop, you'll get something that in this case is 30 times faster:
In [5]: %%timeit
x = df1.join(df2, how='left')
100 loops, best of 3: 14.3 ms per loop
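One small usage note, as a hedged sketch: if downstream code still expects A as a regular column, reset the index on the result rather than on the inputs, so the fast indexed join is kept inside the loop:
x = df1.join(df2, how='left').reset_index()  # 'A' comes back as a column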

I don't know if this deserves a new answer, but personally the following tricks helped me improve the joins I had to do on big DataFrames (millions of rows and hundreds of columns) a bit more:
Besides using set_index(index, inplace=True), you may want to sort the index using sort_index(inplace=True). This speeds up the join a lot if your index is not ordered.
For example, creating the DataFrames with
import random
import pandas as pd
import numpy as np
nbre_items = 100000
ids = np.arange(nbre_items)
random.shuffle(ids)
df1 = pd.DataFrame({"id": ids})
df1['value'] = 1
df1.set_index("id", inplace=True)
random.shuffle(ids)
df2 = pd.DataFrame({"id": ids})
df2['value2'] = 2
df2.set_index("id", inplace=True)
I got the following results:
%timeit df1.join(df2)
13.2 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And after sorting the index (which takes a limited amount of time):
df1.sort_index(inplace=True)
df2.sort_index(inplace=True)
%timeit df1.join(df2)
764 µs ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
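Since the sort itself is not free, here is a hedged sketch that only pays for it when the index is actually unordered (is_monotonic_increasing is a cheap pandas Index property):
# Sort only when needed; already-sorted indexes skip the cost entirely
for df in (df1, df2):
    if not df.index.is_monotonic_increasing:
        df.sort_index(inplace=True)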
You can split one of your DataFrames into multiple ones with fewer columns. This trick gave me mixed results, so be cautious when using it.
For example:
for i in range(0, df2.shape[1], 100):
    df1 = df1.join(df2.iloc[:, i:min(df2.shape[1], (i + 100))], how='outer')

Speed up order_by and pagination in Django

Currently I have this result (profiler screenshot omitted):
That's not bad (I guess), but I'm wondering if I can speed things up a little bit.
I've looked at the penultimate query and don't really know how to speed it up; I guess I should get rid of the join, but I don't know how.
I'm already using prefetch_related in my viewset. My viewset is:
class GameViewSet(viewsets.ModelViewSet):
    queryset = Game.objects.prefetch_related(
        "timestamp",
        "fighters",
        "score",
        "coefs",
        "rounds",
        "rounds_view",
        "rounds_view_f",
        "finishes",
        "rounds_time",
        "round_time",
        "time_coef",
        "totals",
    ).all()
    serializer_class = GameSerializer
    permission_classes = [AllowAny]
    pagination_class = StandardResultsSetPagination

    @silk_profile(name="Get Games")
    def list(self, request):
        qs = self.get_queryset().order_by("-timestamp__ts")
        page = self.paginate_queryset(qs)
        if page is not None:
            serializer = GameSerializer(page, many=True)
            return self.get_paginated_response(serializer.data)
        serializer = self.get_serializer(qs, many=True)
        return Response(serializer.data)
The join is happening because I'm ordering on a related field?
My models look like:
class Game(models.Model):
    id = models.AutoField(primary_key=True)
    ...

class Timestamp(models.Model):
    id = models.AutoField(primary_key=True)
    game = models.ForeignKey(Game, related_name="timestamp", on_delete=models.CASCADE)
    ts = models.DateTimeField(db_index=True)
    time_of_day = models.TimeField()
And my serializers:
class TimestampSerializer(serializers.Serializer):
    ts = serializers.DateTimeField(read_only=True)
    time_of_day = serializers.TimeField(read_only=True)

class GameSerializer(serializers.Serializer):
    id = serializers.IntegerField(read_only=True)
    timestamp = TimestampSerializer(many=True)
    fighters = FighterSerializer(many=True)
    score = ScoreSerializer(many=True)
    coefs = CoefsSerializer(many=True)
    rounds = RoundsSerializer(many=True)
    rounds_view = RoundsViewSerializer(many=True)
    rounds_view_f = RoundsViewFinishSerializer(many=True)
    finishes = FinishesSerializer(many=True)
    rounds_time = RoundTimesSerializer(many=True)
    round_time = RoundTimeSerializer(many=True)
    time_coef = TimeCoefsSerializer(many=True)
    totals = TotalsSerializer(many=True)
Also, results of profiling:
166039 function calls (159016 primitive calls) in 3.226 seconds
Ordered by: internal time
List reduced from 677 to 100 due to restriction <100>
ncalls tottime percall cumtime percall filename:lineno(function)
20959/20958 0.206 0.000 0.283 0.000 {built-in method builtins.isinstance}
2700 0.123 0.000 0.359 0.000 fields.py:62(is_simple_callable)
390/30 0.113 0.000 1.211 0.040 serializers.py:493(to_representation)
8943/8473 0.098 0.000 0.307 0.000 {built-in method builtins.getattr}
2700 0.096 0.000 0.650 0.000 fields.py:85(get_attribute)
14 0.068 0.005 0.130 0.009 traceback.py:388(format)
7653 0.065 0.000 0.065 0.000 {method 'append' of 'list' objects}
28 0.064 0.002 0.072 0.003 {method 'execute' of 'psycopg2.extensions.cursor' objects}
390 0.062 0.000 0.153 0.000 base.py:406(__init__)
3090 0.060 0.000 0.257 0.000 serializers.py:359(_readable_fields)
3090 0.059 0.000 0.093 0.000 _collections_abc.py:760(__iter__)
6390 0.055 0.000 0.055 0.000 {built-in method builtins.hasattr}
1440 0.054 0.000 0.112 0.000 related.py:652(get_instance_value_for_fields)
2749 0.052 0.000 0.078 0.000 abc.py:96(__instancecheck__)
388 0.052 0.000 0.107 0.000 query.py:303(clone)
2700 0.049 0.000 0.072 0.000 inspect.py:158(isfunction)
2700 0.048 0.000 0.699 0.000 fields.py:451(get_attribute)
2702 0.048 0.000 0.071 0.000 inspect.py:80(ismethod)
2701 0.047 0.000 0.070 0.000 inspect.py:285(isbuiltin)
14 0.047 0.003 0.189 0.014 traceback.py:321(extract)
3786/3426 0.043 0.000 0.123 0.000 {built-in method builtins.setattr}
4445/247 0.038 0.000 1.936 0.008 {built-in method builtins.len}
360 0.035 0.000 0.374 0.001 related_descriptors.py:575(_apply_rel_filters)
2247 0.034 0.000 0.088 0.000 traceback.py:285(line)
3203 0.029 0.000 0.029 0.000 {method 'copy' of 'dict' objects}
12 0.028 0.002 1.836 0.153 query.py:1831(prefetch_one_level)
2749 0.026 0.000 0.026 0.000 {built-in method _abc._abc_instancecheck}
2700 0.024 0.000 0.024 0.000 serializer_helpers.py:154(__getitem__)
2780 0.024 0.000 0.024 0.000 {method 'get' of 'dict' objects}
360 0.023 0.000 0.069 0.000 related_lookups.py:26(get_normalized_value)
720 0.022 0.000 0.458 0.001 related_descriptors.py:615(get_queryset)
744 0.022 0.000 0.087 0.000 related_descriptors.py:523(__get__)
360 0.022 0.000 0.057 0.000 related_descriptors.py:203(__set__)
470 0.022 0.000 0.081 0.000 local.py:46(_get_context_id)
749 0.020 0.000 0.048 0.000 linecache.py:15(getline)
720 0.019 0.000 0.031 0.000 lookups.py:252(resolve_expression_parameter)
361/1 0.018 0.000 1.211 1.211 serializers.py:655(to_representation)
296/14 0.018 0.000 0.084 0.006 copy.py:128(deepcopy)
470 0.018 0.000 0.106 0.000 local.py:82(_get_storage)
732 0.017 0.000 0.043 0.000 related_descriptors.py:560(__init__)
720 0.017 0.000 0.040 0.000 related_descriptors.py:76(__set__)
1185/1151 0.017 0.000 0.032 0.000 {method 'join' of 'str' objects}
1501 0.017 0.000 0.017 0.000 {method 'format' of 'str' objects}
732 0.016 0.000 0.026 0.000 manager.py:26(__init__)
372 0.016 0.000 0.414 0.001 query.py:951(_filter_or_exclude)
14 0.016 0.001 0.028 0.002 traceback.py:369(from_list)
759 0.015 0.000 0.040 0.000 query.py:178(__init__)
387 0.015 0.000 0.143 0.000 query.py:1308(_clone)
732 0.015 0.000 0.022 0.000 manager.py:20(__new__)
1710 0.015 0.000 0.015 0.000 {built-in method __new__ of type object at 0x7fa87d9ad940}
720 0.015 0.000 0.034 0.000 __init__.py:1818(get_prep_value)
470 0.014 0.000 0.033 0.000 sync.py:469(get_current_task)
390 0.014 0.000 0.174 0.000 base.py:507(from_db)
1365 0.014 0.000 0.014 0.000 {method 'update' of 'dict' objects}
749 0.013 0.000 0.022 0.000 linecache.py:37(getlines)
744 0.013 0.000 0.044 0.000 lookups.py:266()
12 0.013 0.001 0.054 0.004 lookups.py:230(get_prep_lookup)
749 0.013 0.000 0.020 0.000 linecache.py:147(lazycache)
720 0.013 0.000 0.066 0.000 related.py:646(get_local_related_value)
720 0.013 0.000 0.071 0.000 related.py:649(get_foreign_related_value)
720 0.013 0.000 0.019 0.000 __init__.py:824(get_prep_value)
1506 0.013 0.000 0.013 0.000 {method 'strip' of 'str' objects}
372/12 0.013 0.000 0.026 0.002 query.py:1088(resolve_lookup_value)
638 0.013 0.000 0.018 0.000 threading.py:1306(current_thread)
732 0.013 0.000 0.021 0.000 reverse_related.py:200(get_cache_name)
720 0.013 0.000 0.018 0.000 base.py:573(_get_pk_val)
360 0.012 0.000 0.021 0.000 __init__.py:543(__hash__)
470 0.011 0.000 0.117 0.000 local.py:101(__getattr__)
28 0.011 0.000 0.089 0.003 compiler.py:199(get_select)
385 0.011 0.000 0.857 0.002 query.py:265(__iter__)
320 0.011 0.000 0.019 0.000 __init__.py:515(__eq__)
387 0.011 0.000 0.120 0.000 query.py:354(chain)
780 0.011 0.000 0.021 0.000 dispatcher.py:159(send)
360 0.010 0.000 0.031 0.000 related.py:976(get_prep_value)
387 0.010 0.000 0.157 0.000 query.py:1296(_chain)
372 0.010 0.000 0.014 0.000 query.py:151(__init__)
403 0.010 0.000 0.914 0.002 query.py:45(__iter__)
1204 0.010 0.000 0.010 0.000 {built-in method builtins.iter}
360 0.010 0.000 0.016 0.000 :1017(_handle_fromlist)
372 0.010 0.000 0.434 0.001 query.py:935(filter)
360 0.010 0.000 0.023 0.000 query.py:1124(check_query_object_type)
360 0.009 0.000 0.017 0.000 mixins.py:21(is_cached)
1152 0.009 0.000 0.009 0.000 query.py:194(query)
104 0.009 0.000 0.017 0.000 fields.py:323(__init__)
732 0.009 0.000 0.009 0.000 manager.py:120(_set_creation_counter)
470 0.009 0.000 0.013 0.000 :389(parent)
470 0.009 0.000 0.014 0.000 tasks.py:34(current_task)
459 0.009 0.000 0.013 0.000 deconstruct.py:14(__new__)
870 0.009 0.000 0.009 0.000 fields.py:810(to_representation)
322 0.009 0.000 0.019 0.000 linecache.py:53(checkcache)
318/266 0.009 0.000 0.145 0.001 compiler.py:434(compile)
763 0.008 0.000 0.008 0.000 traceback.py:292(walk_stack)
494 0.008 0.000 0.014 0.000 compiler.py:417(quote_name_unless_alias)
396 0.008 0.000 0.019 0.000 related.py:710(get_path_info)
374 0.008 0.000 0.011 0.000 utils.py:237(_route_db)
796 0.008 0.000 0.008 0.000 tree.py:21(__init__)
322 0.008 0.000 0.008 0.000 {built-in method posix.stat}
413/365 0.008 0.000 1.938 0.005 query.py:1322(_fetch_all)
12 0.008 0.001 1.244 0.104 related_descriptors.py:622(get_prefetch_queryset)
732 0.008 0.000 0.008 0.000 reverse_related.py:180(get_accessor_name)
(A visual representation of the profile was also attached.)
So, my question is: how can I speed it up?
On the database side, you could set up a materialized view for your query and trigger a refresh any time a new timestamp is added (or whatever happens in your application that requires a refresh). That way the results are pre-calculated for your lookup. However, I suppose there could be edge cases where, if you update and look up at the same time, you end up with a pre-update result? It'd be a trade-off, and only you know whether that'd be worth it...
In any case, I did not come up with this myself; check out e.g. https://hashrocket.com/blog/posts/materialized-view-strategies-using-postgresql
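To make that concrete, here is a minimal hedged sketch of such a view wired up through a Django migration; PostgreSQL is assumed, and the table names (myapp_game, myapp_timestamp) and view name are hypothetical:
from django.db import migrations

# Hypothetical SQL: pre-compute games joined to their timestamp, already ordered.
CREATE_VIEW = """
CREATE MATERIALIZED VIEW games_ordered AS
SELECT g.*, t.ts
FROM myapp_game g
JOIN myapp_timestamp t ON t.game_id = g.id
ORDER BY t.ts DESC;
"""

class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]  # hypothetical app/migration name
    operations = [
        migrations.RunSQL(CREATE_VIEW, reverse_sql="DROP MATERIALIZED VIEW games_ordered;"),
    ]
The view then needs REFRESH MATERIALIZED VIEW games_ordered; whenever the underlying data changes (via a trigger, a signal handler, or a periodic task).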

Improve performance of MongoDB client (sockets)

I am using Python 2.7 (Anaconda distribution) on Windows 8.1 Pro.
I have a database of articles with their respective topics.
I am building an application which queries textual phrases in my database and associates article topics to each queried phrase. The topics are assigned based on the relevance of the phrase for the article.
The bottleneck seems to be Python socket communication with localhost.
Here are my cProfile outputs:
topics_fit (PhraseVectorizer_1_1.py:668)
function called 1 times
1930698 function calls (1929630 primitive calls) in 148.209 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 286 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 1.224 1.224 148.209 148.209 PhraseVectorizer_1_1.py:668(topics_fit)
206272 0.193 0.000 146.780 0.001 cursor.py:1041(next)
601 0.189 0.000 146.455 0.244 cursor.py:944(_refresh)
534 0.030 0.000 146.263 0.274 cursor.py:796(__send_message)
534 0.009 0.000 141.532 0.265 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 141.484 0.265 mongo_client.py:768(_reset_on_error)
534 0.019 0.000 141.482 0.265 server.py:69(send_message_with_response)
534 0.002 0.000 141.364 0.265 pool.py:225(receive_message)
535 0.083 0.000 141.362 0.264 network.py:106(receive_message)
1070 1.202 0.001 141.278 0.132 network.py:127(_receive_data_on_socket)
3340 140.074 0.042 140.074 0.042 {method 'recv' of '_socket.socket' objects}
535 0.778 0.001 4.700 0.009 helpers.py:88(_unpack_response)
535 3.828 0.007 3.920 0.007 {bson._cbson.decode_all}
67 0.099 0.001 0.196 0.003 {method 'sort' of 'list' objects}
206187 0.096 0.000 0.096 0.000 PhraseVectorizer_1_1.py:705(<lambda>)
206187 0.096 0.000 0.096 0.000 database.py:339(_fix_outgoing)
206187 0.074 0.000 0.092 0.000 objectid.py:68(__init__)
1068 0.005 0.000 0.054 0.000 server.py:135(get_socket)
1068/534 0.010 0.000 0.041 0.000 contextlib.py:21(__exit__)
1068 0.004 0.000 0.041 0.000 pool.py:501(get_socket)
534 0.003 0.000 0.028 0.000 pool.py:208(send_message)
534 0.009 0.000 0.026 0.000 pool.py:573(return_socket)
567 0.001 0.000 0.026 0.000 socket.py:227(meth)
535 0.024 0.000 0.024 0.000 {method 'sendall' of '_socket.socket' objects}
534 0.003 0.000 0.023 0.000 topology.py:134(select_server)
206806 0.020 0.000 0.020 0.000 collection.py:249(database)
418997 0.019 0.000 0.019 0.000 {len}
449 0.001 0.000 0.018 0.000 topology.py:143(select_server_by_address)
534 0.005 0.000 0.018 0.000 topology.py:82(select_servers)
1068/534 0.001 0.000 0.018 0.000 contextlib.py:15(__enter__)
534 0.002 0.000 0.013 0.000 thread_util.py:83(release)
207307 0.010 0.000 0.011 0.000 {isinstance}
534 0.005 0.000 0.011 0.000 pool.py:538(_get_socket_no_auth)
534 0.004 0.000 0.011 0.000 thread_util.py:63(release)
534 0.001 0.000 0.011 0.000 mongo_client.py:673(_get_topology)
535 0.003 0.000 0.010 0.000 topology.py:57(open)
206187 0.008 0.000 0.008 0.000 {method 'popleft' of 'collections.deque' objects}
535 0.002 0.000 0.007 0.000 topology.py:327(_apply_selector)
536 0.003 0.000 0.007 0.000 topology.py:286(_ensure_opened)
1071 0.004 0.000 0.007 0.000 periodic_executor.py:50(open)
In particular: {method 'recv' of '_socket.socket' objects} seems to cause trouble.
According to suggestions found in What can I do to improve socket performance in Python 3?, I tried gevent.
I added this snippet at the beginning of my script (before importing anything):
from gevent import monkey
monkey.patch_all()
This resulted in even slower performance...
*** PROFILER RESULTS ***
topics_fit (PhraseVectorizer_1_1.py:671)
function called 1 times
1956879 function calls (1951292 primitive calls) in 158.260 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 427 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 158.170 158.170 hub.py:358(run)
1 0.000 0.000 158.170 158.170 {method 'run' of 'gevent.core.loop' objects}
2/1 1.286 0.643 158.166 158.166 PhraseVectorizer_1_1.py:671(topics_fit)
206272 0.198 0.000 156.670 0.001 cursor.py:1041(next)
601 0.192 0.000 156.203 0.260 cursor.py:944(_refresh)
534 0.029 0.000 156.008 0.292 cursor.py:796(__send_message)
534 0.012 0.000 150.514 0.282 mongo_client.py:725(_send_message_with_response)
534 0.002 0.000 150.439 0.282 mongo_client.py:768(_reset_on_error)
534 0.017 0.000 150.437 0.282 server.py:69(send_message_with_response)
551/535 0.002 0.000 150.316 0.281 pool.py:225(receive_message)
552/536 0.079 0.000 150.314 0.280 network.py:106(receive_message)
1104/1072 0.815 0.001 150.234 0.140 network.py:127(_receive_data_on_socket)
2427/2395 0.019 0.000 149.418 0.062 socket.py:381(recv)
608/592 0.003 0.000 48.541 0.082 socket.py:284(_wait)
552 0.885 0.002 5.464 0.010 helpers.py:88(_unpack_response)
552 4.475 0.008 4.577 0.008 {bson._cbson.decode_all}
3033 2.021 0.001 2.021 0.001 {method 'recv' of '_socket.socket' objects}
7/4 0.000 0.000 0.221 0.055 hub.py:189(_import)
4 0.127 0.032 0.221 0.055 {__import__}
67 0.104 0.002 0.202 0.003 {method 'sort' of 'list' objects}
536/535 0.003 0.000 0.142 0.000 topology.py:57(open)
537/536 0.002 0.000 0.139 0.000 topology.py:286(_ensure_opened)
1072/1071 0.003 0.000 0.138 0.000 periodic_executor.py:50(open)
537/536 0.001 0.000 0.136 0.000 server.py:33(open)
537/536 0.001 0.000 0.135 0.000 monitor.py:69(open)
20/19 0.000 0.000 0.132 0.007 topology.py:342(_update_servers)
4 0.000 0.000 0.131 0.033 hub.py:418(_get_resolver)
1 0.000 0.000 0.122 0.122 resolver_thread.py:13(__init__)
1 0.000 0.000 0.122 0.122 hub.py:433(_get_threadpool)
206187 0.081 0.000 0.101 0.000 objectid.py:68(__init__)
206187 0.100 0.000 0.100 0.000 database.py:339(_fix_outgoing)
206187 0.098 0.000 0.098 0.000 PhraseVectorizer_1_1.py:708(<lambda>)
1 0.073 0.073 0.093 0.093 threadpool.py:2(<module>)
2037 0.003 0.000 0.092 0.000 hub.py:159(get_hub)
2 0.000 0.000 0.090 0.045 thread.py:39(start_new_thread)
2 0.000 0.000 0.090 0.045 greenlet.py:195(spawn)
2 0.000 0.000 0.090 0.045 greenlet.py:74(__init__)
1 0.000 0.000 0.090 0.090 hub.py:259(__init__)
1102 0.004 0.000 0.078 0.000 pool.py:501(get_socket)
1068 0.005 0.000 0.074 0.000 server.py:135(get_socket)
This performance is somewhat unacceptable for my application - I would like it to be much faster (this is timed and profiled for a subset of ~20 documents, and I need to process a few tens of thousands).
Any ideas on how to speed it up?
Much appreciated.
Edit:
Code snippet that I profiled:
# also tried monkey patching all here, see profiler
from collections import OrderedDict  # needed for self.topics below
from pymongo import MongoClient

def topics_fit(self):
    client = MongoClient()
    # tried motor for multithreading - also slow
    #client = motor.motor_tornado.MotorClient()
    # initialize DB cursors
    db_wiki = client.wiki
    # initialize topic feature dictionary
    self.topics = OrderedDict()
    self.topic_mapping = OrderedDict()
    vocabulary_keys = self.vocabulary.keys()
    num_categories = 0
    for phrase in vocabulary_keys:
        phrase_tokens = phrase.split()
        if len(phrase_tokens) > 1:
            # query for current phrase
            AND_phrase = "\"" + phrase + "\""
            cursor = db_wiki.categories.find({ "$text" : { "$search": AND_phrase } },{ "score": { "$meta": "textScore" } })
            cursor = list(cursor)
            if cursor:
                cursor.sort(key=lambda k: k["score"], reverse=True)
                added_categories = cursor[0]["category_ids"]
                for added_category in added_categories:
                    if not (added_category in self.topics):
                        self.topics[added_category] = num_categories
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [num_categories, ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(num_categories)
                        num_categories += 1
                    else:
                        if not (self.vocabulary[phrase] in self.topic_mapping):
                            self.topic_mapping[self.vocabulary[phrase]] = [self.topics[added_category], ]
                        else:
                            self.topic_mapping[self.vocabulary[phrase]].append(self.topics[added_category])
Edit 2: output of index_information():
{u'_id_':
{u'ns': u'wiki.categories', u'key': [(u'_id', 1)], u'v': 1},
u'article_title_text_article_body_text_category_names_text': {u'default_language': u'english', u'weights': SON([(u'article_body', 1), (u'article_title', 1), (u'category_names', 1)]), u'key': [(u'_fts', u'text'), (u'_ftsx', 1)], u'v': 1, u'language_override': u'language', u'ns': u'wiki.categories', u'textIndexVersion': 2}}
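One hedged observation on the snippet above: it fetches every match and sorts client-side, yet only cursor[0] is ever used. A sketch that pushes the sort and a limit of one to the server (projection, sort and limit are standard PyMongo cursor methods) should shrink the recv traffic dramatically:
# Ask the server for just the best-scoring document, with only the fields used
cursor = db_wiki.categories.find(
    {"$text": {"$search": AND_phrase}},
    {"score": {"$meta": "textScore"}, "category_ids": 1},
).sort([("score", {"$meta": "textScore"})]).limit(1)
top = next(cursor, None)
if top is not None:
    added_categories = top["category_ids"]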

Tests are 4 times slower under PyPy

I am running tests for my project using nose2:
#!/bin/sh
nose2 --config=tests/nose2.cfg "$@"
Under CPython tests run 4 times faster than under PyPy:
Python 2.7.8 (default, Oct 20 2014, 15:05:19)
[GCC 4.9.1] on linux2
...
Ran 58 tests in 25.369s
Python 2.7.9 (2.5.1+dfsg-1~ppa1+ubuntu14.04, Mar 27 2015, 19:19:42)
[PyPy 2.5.1 with GCC 4.8.2] on linux2
...
Ran 58 tests in 100.854s
What could be the cause?
Is there a way to tweak the PyPy configuration using environment variables or a configuration file on some standard path? Because in my case I am running the nose bootstrap script and cannot control the command-line options for PyPy.
Here is one specific test:
1272695 function calls (1261234 primitive calls) in 1.165 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1224 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 1.171 1.171 test_progress.py:37(test_progress)
15 0.000 0.000 1.169 0.078 __init__.py:52(api_request)
15 0.000 0.000 1.160 0.077 __init__.py:46(request)
15 0.000 0.000 1.152 0.077 test.py:695(open)
15 0.000 0.000 1.150 0.077 test.py:655(run_wsgi_app)
15 0.000 0.000 1.144 0.076 test.py:828(run_wsgi_app)
15 0.000 0.000 1.144 0.076 application.py:101(__call__)
15 0.000 0.000 1.138 0.076 sessions.py:329(__call__)
15 0.000 0.000 1.071 0.071 course_object.py:14(__call__)
15 0.000 0.000 1.005 0.067 user_auth.py:7(__call__)
15 0.000 0.000 0.938 0.063 application.py:27(application)
15 0.000 0.000 0.938 0.063 application.py:81(wsgi_app)
15 0.000 0.000 0.876 0.058 ember_backend.py:188(__call__)
15 0.000 0.000 0.875 0.058 ember_backend.py:233(handle_request)
176 0.002 0.000 0.738 0.004 __init__.py:42(instrumented_method)
176 0.003 0.000 0.623 0.004 __init__.py:58(get_stack)
176 0.000 0.000 0.619 0.004 inspect.py:1053(stack)
176 0.010 0.000 0.619 0.004 inspect.py:1026(getouterframes)
294 0.001 0.000 0.614 0.002 cursor.py:1072(next)
248 0.001 0.000 0.612 0.002 cursor.py:998(_refresh)
144 0.002 0.000 0.608 0.004 cursor.py:912(__send_message)
7248 0.041 0.000 0.607 0.000 inspect.py:988(getframeinfo)
8 0.000 0.000 0.544 0.068 test_progress.py:31(get_progress_data)
4 0.000 0.000 0.529 0.132 test_progress.py:25(finish_item)
249 0.001 0.000 0.511 0.002 base.py:1131(next)
7 0.000 0.000 0.449 0.064 ember_backend.py:240(_handle_request)
8 0.000 0.000 0.420 0.053 ember_backend.py:307(_handle_request)
8 0.000 0.000 0.420 0.053 user_state.py:13(list)
8 0.001 0.000 0.407 0.051 user_progress.py:28(update_progress)
4 0.000 0.000 0.397 0.099 entity.py:253(post)
7248 0.051 0.000 0.362 0.000 inspect.py:518(findsource)
14532 0.083 0.000 0.332 0.000 inspect.py:440(getsourcefile)
92/63 0.000 0.000 0.308 0.005 objects.py:22(__get__)
61 0.001 0.000 0.304 0.005 base.py:168(get)
139 0.000 0.000 0.250 0.002 queryset.py:65(_iter_results)
51 0.001 0.000 0.249 0.005 queryset.py:83(_populate_cache)
29 0.000 0.000 0.220 0.008 __init__.py:81(save)
29 0.001 0.000 0.219 0.008 document.py:181(save)
21780 0.051 0.000 0.140 0.000 inspect.py:398(getfile)
32 0.002 0.000 0.139 0.004 {pymongo._cmessage._do_batched_write_command}
And with PyPy:
6037905 function calls (6012014 primitive calls) in 7.475 seconds
Ordered by: cumulative time, internal time, call count
List reduced from 1354 to 40 due to restriction <40>
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 7.960 7.960 test_progress.py:37(test_progress)
15 0.000 0.000 7.948 0.530 __init__.py:52(api_request)
15 0.000 0.000 7.873 0.525 __init__.py:46(request)
15 0.000 0.000 7.860 0.524 test.py:695(open)
15 0.000 0.000 7.856 0.524 test.py:655(run_wsgi_app)
15 0.000 0.000 7.845 0.523 test.py:828(run_wsgi_app)
15 0.000 0.000 7.844 0.523 application.py:101(__call__)
15 0.000 0.000 7.827 0.522 sessions.py:329(__call__)
15 0.000 0.000 7.205 0.480 course_object.py:14(__call__)
176 0.004 0.000 6.605 0.038 __init__.py:42(instrumented_method)
15 0.000 0.000 6.591 0.439 user_auth.py:7(__call__)
176 0.008 0.000 6.314 0.036 __init__.py:58(get_stack)
176 0.001 0.000 6.305 0.036 inspect.py:1063(stack)
176 0.027 0.000 6.304 0.036 inspect.py:1036(getouterframes)
7839 0.081 0.000 6.274 0.001 inspect.py:998(getframeinfo)
15 0.000 0.000 5.983 0.399 application.py:27(application)
15 0.001 0.000 5.983 0.399 application.py:81(wsgi_app)
15 0.000 0.000 5.901 0.393 ember_backend.py:188(__call__)
15 0.000 0.000 5.899 0.393 ember_backend.py:233(handle_request)
15714/15713 0.189 0.000 5.828 0.000 inspect.py:441(getsourcefile)
294 0.002 0.000 5.473 0.019 cursor.py:1072(next)
248 0.002 0.000 5.470 0.022 cursor.py:998(_refresh)
144 0.004 0.000 5.445 0.038 cursor.py:912(__send_message)
8367 2.133 0.000 5.342 0.001 inspect.py:473(getmodule)
249 0.002 0.000 4.316 0.017 base.py:1131(next)
8 0.000 0.000 3.966 0.496 test_progress.py:31(get_progress_data)
4 0.000 0.000 3.209 0.802 test_progress.py:25(finish_item)
7839 0.098 0.000 3.185 0.000 inspect.py:519(findsource)
7 0.000 0.000 2.944 0.421 ember_backend.py:240(_handle_request)
8 0.000 0.000 2.898 0.362 ember_backend.py:307(_handle_request)
8 0.000 0.000 2.898 0.362 user_state.py:13(list)
8 0.001 0.000 2.820 0.352 user_progress.py:28(update_progress)
61 0.001 0.000 2.546 0.042 base.py:168(get)
4 0.000 0.000 2.534 0.633 entity.py:253(post)
850362/849305 2.315 0.000 2.344 0.000 {hasattr}
92/63 0.001 0.000 2.004 0.032 objects.py:22(__get__)
127 0.001 0.000 1.915 0.015 queryset.py:65(_iter_results)
51 0.001 0.000 1.914 0.038 queryset.py:83(_populate_cache)
29 0.000 0.000 1.607 0.055 __init__.py:81(save)
29 0.001 0.000 1.605 0.055 document.py:181(save)
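Both profiles point at the test instrumentation itself: get_stack calls inspect.stack(), which builds full FrameInfo records (including source-file lookups) for every frame, and frame introspection is known to be far more expensive under PyPy's JIT. A hedged sketch of a cheaper replacement, assuming the helper only needs (filename, lineno, function) tuples:
import sys

def get_stack_light(skip=1):
    # Walk frames directly instead of inspect.stack(), avoiding the
    # source-file reads that dominate the profiles above.
    frames = []
    f = sys._getframe(skip)
    while f is not None:
        frames.append((f.f_code.co_filename, f.f_lineno, f.f_code.co_name))
        f = f.f_back
    return frames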

How to speed up matrix code

I have the following simple code which estimates the probability that an h by n binary matrix has a certain property. It runs in exponential time (which is bad to start with) but I am surprised it is so slow even for n = 12 and h = 9.
#!/usr/bin/python
import numpy as np
import itertools

n = 12
h = 9
F = np.matrix(list(itertools.product([0,1], repeat=n))).transpose()
count = 0
iters = 100
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M,F)
    setofcols = set()
    for column in product.T:
        setofcols.add(repr(column))
    if (len(setofcols)==2**n):
        count = count + 1
print count*1.0/iters
I have profiled it using n = 10 and h = 7. The output is rather long, but here are the lines that took the most time.
23447867 function calls (23038179 primitive calls) in 35.785 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
2 0.002 0.001 0.019 0.010 __init__.py:1(<module>)
1 0.001 0.001 0.054 0.054 __init__.py:106(<module>)
1 0.001 0.001 0.022 0.022 __init__.py:15(<module>)
2 0.003 0.002 0.013 0.006 __init__.py:2(<module>)
1 0.001 0.001 0.003 0.003 __init__.py:38(<module>)
1 0.001 0.001 0.001 0.001 __init__.py:4(<module>)
1 0.001 0.001 0.004 0.004 __init__.py:45(<module>)
1 0.001 0.001 0.002 0.002 __init__.py:88(<module>)
307200 0.306 0.000 1.584 0.000 _methods.py:24(_any)
102400 0.026 0.000 0.026 0.000 arrayprint.py:22(product)
102400 1.345 0.000 32.795 0.000 arrayprint.py:225(_array2string)
307200/102400 1.166 0.000 33.350 0.000 arrayprint.py:335(array2string)
716800 0.820 0.000 1.162 0.000 arrayprint.py:448(_extendLine)
204800/102400 1.699 0.000 5.090 0.000 arrayprint.py:456(_formatArray)
307200 0.651 0.000 22.510 0.000 arrayprint.py:524(__init__)
307200 11.783 0.000 21.859 0.000 arrayprint.py:538(fillFormat)
1353748 1.920 0.000 2.537 0.000 arrayprint.py:627(_digits)
102400 0.576 0.000 2.523 0.000 arrayprint.py:636(__init__)
716800 2.159 0.000 2.159 0.000 arrayprint.py:649(__call__)
307200 0.099 0.000 0.099 0.000 arrayprint.py:658(__init__)
102400 0.163 0.000 0.225 0.000 arrayprint.py:686(__init__)
102400 0.307 0.000 13.784 0.000 arrayprint.py:697(__init__)
102400 0.110 0.000 0.110 0.000 arrayprint.py:713(__init__)
102400 0.043 0.000 0.043 0.000 arrayprint.py:741(__init__)
1 0.003 0.003 0.003 0.003 chebyshev.py:87(<module>)
2 0.001 0.000 0.001 0.000 collections.py:284(namedtuple)
1 0.277 0.277 35.786 35.786 counterfeit.py:3(<module>)
205002 0.222 0.000 0.247 0.000 defmatrix.py:279(__array_finalize__)
102500 0.747 0.000 1.077 0.000 defmatrix.py:301(__getitem__)
102400 0.322 0.000 34.236 0.000 defmatrix.py:352(__repr__)
102400 0.100 0.000 0.508 0.000 fromnumeric.py:1087(ravel)
307200 0.382 0.000 2.829 0.000 fromnumeric.py:1563(any)
271 0.004 0.000 0.005 0.000 function_base.py:3220(add_newdoc)
1 0.003 0.003 0.003 0.003 hermite.py:59(<module>)
1 0.003 0.003 0.003 0.003 hermite_e.py:59(<module>)
1 0.001 0.001 0.002 0.002 index_tricks.py:1(<module>)
1 0.003 0.003 0.003 0.003 laguerre.py:59(<module>)
1 0.003 0.003 0.003 0.003 legendre.py:83(<module>)
1 0.001 0.001 0.001 0.001 linalg.py:10(<module>)
1 0.001 0.001 0.001 0.001 numeric.py:1(<module>)
102400 0.247 0.000 33.598 0.000 numeric.py:1365(array_repr)
204800 0.321 0.000 1.143 0.000 numeric.py:1437(array_str)
614400 1.199 0.000 2.627 0.000 numeric.py:2178(seterr)
614400 0.837 0.000 0.918 0.000 numeric.py:2274(geterr)
102400 0.081 0.000 0.186 0.000 numeric.py:252(asarray)
307200 0.259 0.000 0.622 0.000 numeric.py:322(asanyarray)
1 0.003 0.003 0.004 0.004 polynomial.py:54(<module>)
513130 0.134 0.000 0.134 0.000 {isinstance}
307229 0.075 0.000 0.075 0.000 {issubclass}
5985327/5985305 0.595 0.000 0.595 0.000 {len}
306988 0.120 0.000 0.120 0.000 {max}
102400 0.061 0.000 0.061 0.000 {method '__array__' of 'numpy.ndarray' objects}
102406 0.027 0.000 0.027 0.000 {method 'add' of 'set' objects}
307200 0.241 0.000 1.824 0.000 {method 'any' of 'numpy.ndarray' objects}
307200 0.482 0.000 0.482 0.000 {method 'compress' of 'numpy.ndarray' objects}
204800 0.035 0.000 0.035 0.000 {method 'item' of 'numpy.ndarray' objects}
102451 0.014 0.000 0.014 0.000 {method 'join' of 'str' objects}
102400 0.222 0.000 0.222 0.000 {method 'ravel' of 'numpy.ndarray' objects}
921176 3.330 0.000 3.330 0.000 {method 'reduce' of 'numpy.ufunc' objects}
102405 0.057 0.000 0.057 0.000 {method 'replace' of 'str' objects}
2992167 0.660 0.000 0.660 0.000 {method 'rstrip' of 'str' objects}
102400 0.041 0.000 0.041 0.000 {method 'splitlines' of 'str' objects}
6 0.003 0.000 0.003 0.001 {method 'sub' of '_sre.SRE_Pattern' objects}
307276 0.090 0.000 0.090 0.000 {min}
100 0.013 0.000 0.013 0.000 {numpy.core._dotblas.dot}
409639 0.473 0.000 0.473 0.000 {numpy.core.multiarray.array}
1228800 0.239 0.000 0.239 0.000 {numpy.core.umath.geterrobj}
614401 0.352 0.000 0.352 0.000 {numpy.core.umath.seterrobj}
102475 0.031 0.000 0.031 0.000 {range}
102400 0.076 0.000 0.102 0.000 {reduce}
204845/102445 0.198 0.000 34.333 0.000 {repr}
The multiplication of the matrices seems to take a tiny fraction of the time. Is it possible to speed up the rest?
Results
There are now three answers, but one seems to have a bug currently. I have tested the remaining two with n=18, h=11 and iters=10.
bubble - 21 seconds, 185 MB of RAM. 16 seconds in "sort".
hpaulj - 7.5 seconds, 130 MB of RAM. 3 seconds in "tolist", 1.5 seconds in "numpy.core.multiarray.array", 1.5 seconds in "genexpr" (the 'set' line).
Interestingly, the time for multiplying the matrices is still a tiny fraction of the overall time taken.
To speed up the code above you should avoid loops.
import numpy as np
import itertools
def unique_rows(a):
a = np.ascontiguousarray(a)
unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
n = 12
h = 9
iters=100
F = np.matrix(list(itertools.product([0,1],repeat = n))).transpose()
M = np.random.randint(2, size=(h*iters,n))
product = np.dot(M,F)
counts = map(lambda x: len(unique_rows(x.T))==2**n, np.split(product,iters,axis=0))
prob=float(sum(counts))/iters
#All unique submatrices M (hxn) with the sophisticated property...
[np.split(M,iters,axis=0)[j] for j in range(len(counts)) if counts[j]==True]
Try replacing repr(column) with
setofcols.add(tuple(column.A1.tolist()))
set accepts a tuple. column.A1 is the matrix converted to a 1d array. The tuple is then something like (0, 1, 0), which set can easily compare.
Just replacing the expensive repr formatting lops off a lot of time (25x speedup).
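To illustrate the conversion, a small sketch:
import numpy as np
col = np.matrix([[0], [1], [0]])   # one column of the product matrix
print col.A1                       # array([0, 1, 0]) - the flattened 1d array
print tuple(col.A1.tolist())       # (0, 1, 0) - hashable, cheap to compare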
EDIT
By creating and filling the set in one statement I get a further 10x speed up. In my tests it is 2x faster than bubble's vectorization.
count = 0
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M,F)
    setofcols = set(tuple(x) for x in product.T.tolist())
    # or {tuple(x) for x in product.T.tolist()} if new enough Python
    if (len(setofcols)==2**n):
        count += 1
        # print M  # to see the unique M
print count*1.0/iters
EDIT
Here's something even faster - transform each column of 9 integers into a single integer, using dot([1,10,100,...], column). Then apply np.unique (or set) to the list of integers. It's a further 2-3x speedup.
count = 0
X = 10**np.arange(h)
for i in xrange(iters):
    M = np.random.randint(2, size=(h,n))
    product = np.dot(M,F)
    setofcols = np.unique(np.dot(X,product).A1)
    if (setofcols.size==2**n):
        count += 1
print count*1.0/iters
With this the top calls are
200 0.201 0.001 0.204 0.001 {numpy.core._dotblas.dot}
100 0.026 0.000 0.026 0.000 {method 'sort' of 'numpy.ndarray' objects}
100 0.007 0.000 0.035 0.000 arraysetops.py:93(unique)
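One hedged caveat on this packing trick: entries of product are column sums of up to n ones, so for n >= 10 the base-10 encoding can collide; using base n+1 keeps the encoding injective at essentially no extra cost:
count = 0
X = (n + 1)**np.arange(h)   # base n+1 is safe: every entry of product is <= n
for i in xrange(iters):
    M = np.random.randint(2, size=(h, n))
    product = np.dot(M, F)
    setofcols = np.unique(np.dot(X, product).A1)
    if (setofcols.size == 2**n):
        count += 1
print count*1.0/iters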
As alko and seberg pointed out, you are losing a lot of time converting your arrays to large strings to store them in your set of columns.
If I understood your code correctly, you are trying to find whether the number of different columns in your product matrix is equal to the total number of columns of this matrix. You can do that easily by sorting it and looking at differences from one column to the next:
D = (np.diff(np.sort(product.T, axis=0), axis=0) == 0)
This will give you a matrix of booleans D. You can then see whether at least one element changes from one column to the next:
C = (1 - np.prod(D, axis=1)) # i.e. 'not all(D[i,:]) for all i'
You then simply have to check whether all the values are different:
hasproperty = np.all(C)
Which gives you the complete code:
def f(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    counts = []
    for _ in xrange(iters):
        M = np.random.randint(2, size=(h,n))
        product = M.dot(F)
        D = (np.diff(np.sort(product.T, axis=1), axis=0) == 0)
        C = (1 - np.prod(D, axis=1))
        hasproperty = np.all(C)
        counts.append(1. if hasproperty else 0.)
    return np.mean(counts)
Which takes roughly 8s for f(12, 9, 100).
If you prefer comically compact expressions:
def g(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    return np.mean([np.all(1 - np.prod(np.diff(np.sort(np.random.randint(2,size=(h,n)).dot(F).T, axis=1), axis=0)==0, axis=1)) for _ in xrange(iters)])
Timing it gives:
>>> setup = """import numpy as np
import itertools
def g(n, h, iters):
    F = np.array(list(itertools.product([0,1], repeat=n))).T
    return np.mean([np.all(1 - np.prod(np.diff(np.sort(np.random.randint(2,size=(h,n)).dot(F).T, axis=1), axis=0)==0, axis=1)) for _ in xrange(iters)])
"""
>>> timeit.timeit('g(10, 7, 100)', setup=setup, number=10)
17.358669997900734
>>> timeit.timeit('g(10, 7, 100)', setup=setup, number=50)
83.06966196163967
Or approximately 1.7s per call to g(10, 7, 100).

Why is Python hash_ring so slow?

Well, I found that hash_ring.MemcacheRing is so slow that my rate test can't beat direct DB access, whereas if I replace it with memcache.Client, the rate test goes back to normal.
RPS was measured in a multithreaded environment:
memcache.Client: 2500 rps
hash_ring.MemcacheRing: 600 rps
direct DB access: 960 rps
I used hotshot to profile hash_ring.MemcacheRing, and I found there is definitely something wrong with hash_ring; as you can see below, hash_ring costs much more CPU than expected.
The hotshot profile data below was collected in a single-threaded environment:
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.066 0.000 8.297 0.001 Memcached.py:16(__getitem__)
10000 0.262 0.000 6.544 0.001 build/bdist.linux-x86_64/egg/memcache.py:818(_unsafe_get)
1 0.000 0.000 0.000 0.000 /usr/local/Python-2.6.4/lib/python2.6/socket.py:180(__init__)
20000 0.064 0.000 0.064 0.000 build/bdist.linux-x86_64/egg/memcache.py:310(_statlog)
80000 0.118 0.000 0.118 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:148()
10000 0.148 0.000 8.168 0.001 build/bdist.linux-x86_64/egg/memcache.py:812(_get)
20000 0.086 0.000 0.157 0.000 build/bdist.linux-x86_64/egg/memcache.py:1086(_get_socket)
30000 10.539 0.000 10.539 0.000 build/bdist.linux-x86_64/egg/memcache.py:1118(readline)
10000 0.102 0.000 0.102 0.000 build/bdist.linux-x86_64/egg/memcache.py:1142(recv)
20000 0.243 0.000 0.243 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:156(_hash_digest)
20000 0.103 0.000 0.103 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:124(distinct_filter)
10000 0.167 0.000 0.269 0.000 build/bdist.linux-x86_64/egg/memcache.py:965(_recv_value)
1 0.148 0.148 15.701 15.701 Memcached.py:25(foo)
10000 0.236 0.000 5.409 0.001 build/bdist.linux-x86_64/egg/memcache.py:953(_expectvalue)
20002 0.336 0.000 0.336 0.000 :1(settimeout)
10000 0.061 0.000 7.207 0.001 build/bdist.linux-x86_64/egg/memcache.py:541(set)
20000 0.116 0.000 0.606 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:141(gen_key)
20000 0.279 0.000 1.680 0.000 build/bdist.linux-x86_64/egg/hash_ring/memcache_ring.py:20(_get_server)
40000 0.346 0.000 1.236 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:111(iterate_nodes)
10000 0.144 0.000 7.146 0.001 build/bdist.linux-x86_64/egg/memcache.py:771(_set)
20000 0.129 0.000 0.247 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:150(_hash_val)
20000 0.098 0.000 0.433 0.000 build/bdist.linux-x86_64/egg/memcache.py:1111(send_cmd)
10000 0.063 0.000 8.231 0.001 build/bdist.linux-x86_64/egg/memcache.py:857(get)
10000 0.049 0.000 7.256 0.001 Memcached.py:13(__setitem__)
20000 0.144 0.000 5.510 0.000 build/bdist.linux-x86_64/egg/memcache.py:1135(expect)
20000 0.181 0.000 0.787 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:92(get_node_pos)
20000 0.070 0.000 0.070 0.000 build/bdist.linux-x86_64/egg/memcache.py:1070(_check_dead)
10000 0.084 0.000 0.084 0.000 build/bdist.linux-x86_64/egg/memcache.py:725(_val_to_store_info)
20000 0.068 0.000 0.225 0.000 build/bdist.linux-x86_64/egg/memcache.py:1076(connect)
20000 1.036 0.000 1.036 0.000 build/bdist.linux-x86_64/egg/memcache.py:1000(check_key)
10000 0.216 0.000 5.702 0.001 build/bdist.linux-x86_64/egg/memcache.py:777(_unsafe_set)
0 0.000 0.000 profile:0(profiler)
And the following is memcache.Client's profile data:
ncalls tottime percall cumtime percall filename:lineno(function)
10000 0.064 0.000 8.022 0.001 Memcached.py:16(__getitem__)
10000 0.263 0.000 6.885 0.001 build/bdist.linux-x86_64/egg/memcache.py:818(_unsafe_get)
1 0.000 0.000 0.000 0.000 /usr/local/Python-2.6.4/lib/python2.6/socket.py:180(__init__)
20000 0.069 0.000 0.069 0.000 build/bdist.linux-x86_64/egg/memcache.py:310(_statlog)
10000 0.127 0.000 7.897 0.001 build/bdist.linux-x86_64/egg/memcache.py:812(_get)
20000 0.091 0.000 0.164 0.000 build/bdist.linux-x86_64/egg/memcache.py:1086(_get_socket)
30000 11.074 0.000 11.074 0.000 build/bdist.linux-x86_64/egg/memcache.py:1118(readline)
10000 0.098 0.000 0.098 0.000 build/bdist.linux-x86_64/egg/memcache.py:1142(recv)
20000 0.259 0.000 0.568 0.000 build/bdist.linux-x86_64/egg/memcache.py:329(_get_server)
1 0.149 0.149 15.036 15.036 Memcached.py:25(foo)
10000 0.236 0.000 5.645 0.001 build/bdist.linux-x86_64/egg/memcache.py:953(_expectvalue)
20002 0.351 0.000 0.351 0.000 :1(settimeout)
10000 0.064 0.000 6.814 0.001 build/bdist.linux-x86_64/egg/memcache.py:541(set)
10000 0.124 0.000 6.751 0.001 build/bdist.linux-x86_64/egg/memcache.py:771(_set)
20000 1.039 0.000 1.039 0.000 build/bdist.linux-x86_64/egg/memcache.py:1000(check_key)
20000 0.092 0.000 0.442 0.000 build/bdist.linux-x86_64/egg/memcache.py:1111(send_cmd)
10000 0.061 0.000 7.958 0.001 build/bdist.linux-x86_64/egg/memcache.py:857(get)
10000 0.050 0.000 6.864 0.001 Memcached.py:13(__setitem__)
20000 0.148 0.000 5.812 0.000 build/bdist.linux-x86_64/egg/memcache.py:1135(expect)
20000 0.073 0.000 0.073 0.000 build/bdist.linux-x86_64/egg/memcache.py:1070(_check_dead)
10000 0.087 0.000 0.087 0.000 build/bdist.linux-x86_64/egg/memcache.py:725(_val_to_store_info)
20000 0.072 0.000 0.236 0.000 build/bdist.linux-x86_64/egg/memcache.py:1076(connect)
20000 0.072 0.000 0.072 0.000 build/bdist.linux-x86_64/egg/memcache.py:57(cmemcache_hash)
10000 0.218 0.000 5.905 0.001 build/bdist.linux-x86_64/egg/memcache.py:777(_unsafe_set)
0 0.000 0.000 profile:0(profiler)
10000 0.156 0.000 0.255 0.000 build/bdist.linux-x86_64/egg/memcache.py:965(_recv_value)
And here is further profile data, ordered by internal time (first for memcache.Client, then for hash_ring.MemcacheRing):
ncalls tottime percall cumtime percall filename:lineno(function)
20000 2.278 0.000 4.260 0.000 build/bdist.linux-x86_64/egg/memcache.py:1000(check_key)
640000 1.624 0.000 1.624 0.000 :0(ord)
20000 0.658 0.000 1.445 0.000 build/bdist.linux-x86_64/egg/memcache.py:329(_get_server)
100010 0.477 0.000 0.477 0.000 :0(len)
30000 0.460 0.000 0.827 0.000 build/bdist.linux-x86_64/egg/memcache.py:1118(readline)
110000 0.414 0.000 0.414 0.000 :0(isinstance)
20000 0.412 0.000 0.412 0.000 :0(sendall)
10000 0.349 0.000 1.974 0.000 build/bdist.linux-x86_64/egg/memcache.py:818(_unsafe_get)
10000 0.344 0.000 0.633 0.000 build/bdist.linux-x86_64/egg/memcache.py:725(_val_to_store_info)
10000 0.344 0.000 2.006 0.000 build/bdist.linux-x86_64/egg/memcache.py:777(_unsafe_set)
10000 0.325 0.000 0.624 0.000 build/bdist.linux-x86_64/egg/memcache.py:965(_recv_value)
50000 0.233 0.000 0.233 0.000 :0(find)
20000 0.200 0.000 0.808 0.000 build/bdist.linux-x86_64/egg/memcache.py:1135(expect)
10000 0.194 0.000 5.292 0.001 build/bdist.linux-x86_64/egg/memcache.py:812(_get)
20000 0.190 0.000 0.357 0.000 build/bdist.linux-x86_64/egg/memcache.py:1076(connect)
10000 0.185 0.000 4.771 0.000 build/bdist.linux-x86_64/egg/memcache.py:771(_set)
10000 0.173 0.000 0.187 0.000 build/bdist.linux-x86_64/egg/memcache.py:1142(recv)
10000 0.163 0.000 0.400 0.000 build/bdist.linux-x86_64/egg/memcache.py:953(_expectvalue)
20000 0.152 0.000 0.564 0.000 :1(sendall)
20000 0.147 0.000 0.711 0.000 build/bdist.linux-x86_64/egg/memcache.py:1111(send_cmd)
20000 0.142 0.000 0.232 0.000 build/bdist.linux-x86_64/egg/memcache.py:57(cmemcache_hash)
ncalls tottime percall cumtime percall filename:lineno(function)
20000 2.679 0.000 4.921 0.000 build/bdist.linux-x86_64/egg/memcache.py:1000(check_key)
640000 2.125 0.000 2.125 0.000 :0(ord)
20000 0.636 0.000 2.887 0.000 build/bdist.linux-x86_64/egg/hash_ring/memcache_ring.py:20(_get_server)
20000 0.582 0.000 1.673 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:92(get_node_pos)
30000 0.565 0.000 0.979 0.000 build/bdist.linux-x86_64/egg/memcache.py:1118(readline)
20000 0.345 0.000 0.630 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:156(_hash_digest)
40000 0.328 0.000 2.065 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:111(iterate_nodes)
20003 0.294 0.000 0.294 0.000 :0(recv)
10000 0.287 0.000 1.119 0.000 build/bdist.linux-x86_64/egg/memcache.py:953(_expectvalue)
10000 0.176 0.000 4.437 0.000 build/bdist.linux-x86_64/egg/memcache.py:771(_set)
10000 0.175 0.000 1.697 0.000 build/bdist.linux-x86_64/egg/memcache.py:818(_unsafe_get)
10000 0.160 0.000 5.899 0.001 build/bdist.linux-x86_64/egg/memcache.py:812(_get)
20000 0.157 0.000 1.051 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:141(gen_key)
10000 0.148 0.000 0.251 0.000 build/bdist.linux-x86_64/egg/memcache.py:965(_recv_value)
20000 0.144 0.000 0.264 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:150(_hash_val)
90000 0.122 0.000 0.122 0.000 :0(isinstance)
50000 0.120 0.000 0.120 0.000 :0(find)
80000 0.120 0.000 0.120 0.000 build/bdist.linux-x86_64/egg/hash_ring/hash_ring.py:148()
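A hedged workaround sketch, given that the extra cost sits in _get_server -> get_node_pos / iterate_nodes / _hash_digest: if the set of servers never changes at runtime, the key-to-server lookup can be memoized. The (self, key) signature of _get_server is assumed from the profile above:
from hash_ring import MemcacheRing

_server_cache = {}
_orig_get_server = MemcacheRing._get_server

def _cached_get_server(self, key):
    # Only valid while the ring is static: a cached entry never expires.
    try:
        return _server_cache[key]
    except KeyError:
        result = _server_cache[key] = _orig_get_server(self, key)
        return result

MemcacheRing._get_server = _cached_get_server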
