I want to compare two huge, identical nested lists by iterating over both of them. I'm looking for nested lists where list_a[0] is equal to list_b[1]; in that case I want to merge those lists (the order is important). The lists without a match should also appear in the output.
rows_a = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
rows_b = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
data = []
for list_a in rows_a:
    for list_b in rows_b:
        if list_a[0] == list_b[1]:
            list_b.extend(list_a)
            data.append(list_b)
        else:
            data.append(list_b)

# print(data) gives:
# [['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i'],
#  ['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i'],
#  ['a', 'b', 'z', 'b', 'e', 'f'], ['b', 'e', 'f'], ['g', 'h', 'i']]
Above is the output that I do NOT want, because it is way too much data. All this unnecessary data is caused by the double loop over both rows. A solution would be to slice an element off rows_b on every iteration of the for loop over rows_a; this would avoid many duplicate comparisons. Question: how do I skip the first element of a list every time it has been looped over from start to end?
In order to show the desired outcome, I correct the output by deleting duplicates below:
res = []
for i in data:
    if tuple(i) not in res:
        res.append(tuple(i))
print(res)
# Output: [('a', 'b', 'z', 'b', 'e', 'f'), ('b', 'e', 'f'), ('g', 'h', 'i')]
This is the output I want! But faster... and preferably without having to remove duplicates afterwards.
I managed to get what I want when I work with a small data set. However, I am using this for a very large data set and it gives me a 'MemoryError'. Even if it didn't give me the error, I realise it is a very inefficient script and it takes a lot of time to run.
Any help would be greatly appreciated.
tuple(i) not in res is not efficient since it iterates over the whole list again and again: each membership test takes linear time, so the whole loop takes quadratic time (O(n²)). You can speed this up using a set:
list({tuple(e) for e in data})
This does not preserve the order. If you want that, you can use a dict instead (this requires a fairly recent version of Python, 3.7+, where dicts preserve insertion order):
list({tuple(e): None for e in data}.keys())
This should be significantly faster. An alternative solution is to convert the elements to tuples, sort them, and compare adjacent pairs of values to remove duplicates. Note that you can also merge two sets or two dicts with their update methods.
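The sort-based alternative isn't spelled out in the answer; a minimal sketch of it (my own, assuming the rows compare element-wise once converted to tuples) could look like:

tuples = sorted(tuple(e) for e in data)
deduped = tuples[:1]  # the first element always survives (empty input stays empty)
for prev, cur in zip(tuples, tuples[1:]):
    if cur != prev:  # duplicates sit next to each other after sorting
        deduped.append(cur)
# O(n log n) overall, but the original order of `data` is lost.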
As for the memory usage, there is not much to do. The problem is CPython itself, which is clearly not designed for computing over large data with such data structures (only native data structures like NumPy arrays are efficient). Each character is encoded as a Python object taking 24-32 bytes. Lists contain references to objects, taking 8 bytes each on a 64-bit architecture. This means about 40 bytes per character while 1 byte is actually needed (and this is what a native C/C++ program can use in practice). That being said, CPython caches 1-byte characters, so it uses "only" 8 bytes per character in this specific case (which is still 8 times more than required). If you use lists of characters in your real-world application, please consider using strings instead. Otherwise, please consider using another language.
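To make that overhead concrete, a quick check with sys.getsizeof (exact numbers vary by Python version and platform):

import sys

row = ['a', 'b', 'z']
print(sys.getsizeof(row))    # size of the list object alone, excluding its items
print(sys.getsizeof('abz'))  # the same data held as one compact string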
I solved this by using a LEFT JOIN in SQL. You can do the same thing with pandas DataFrames in Python.
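This answer gives no code; a minimal sketch of the same LEFT JOIN idea with pandas.merge might look like this (the column names a0..b2 are made up for illustration):

import pandas as pd

rows_a = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]
rows_b = [['a', 'b', 'z'], ['b', 'e', 'f'], ['g', 'h', 'i']]

df_a = pd.DataFrame(rows_a, columns=['a0', 'a1', 'a2'])
df_b = pd.DataFrame(rows_b, columns=['b0', 'b1', 'b2'])

# LEFT JOIN: every row of rows_b survives; where b1 == some a0 the
# matching rows_a columns are appended, otherwise they stay NaN.
merged = df_b.merge(df_a, how='left', left_on='b1', right_on='a0')
print(merged)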
I have a dataset of multiple local store rankings that I'm looking to aggregate / combine into one national ranking, programmatically. I know that the local rankings are by sales volume, but I am not given the sales volume so must use the relative rankings to create as accurate a national ranking as possible.
As a short example, let's say that we have 3 local ranking lists, from best ranking (1st) to worst ranking (last), that represent different geographic boundaries that can overlap with one another.
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
We know that either J or Q is the highest-ranked store, as J is first in ranking_1, Q is first in ranking_3, and both appear above A, which is the highest in ranking_2. We know that O is next, as it's above A in ranking_3. A comes next, and so on...
If I did this correctly on paper, the output of this short example would be:
global_ranking = [('J',1.5),('Q',1.5),('O',3),('A',4),('H',6),('N',6),('Z',6),('K',8),('B',9),('C',10)]
Note that when we don't have enough data to determine which of two stores is ranked higher, we consider it a tie (i.e. we know that one of J or Q is the highest ranked store, but don't know which is higher, so we put them both at 1.5). In the actual dataset, there are 100+ lists of 1000+ items in each.
I've had fun trying to figure out this problem and am curious if anyone has any smart approaches to it.
A modified merge sort algorithm will help here. The modification should take incomparable stores into account and thus build groups of incomparable elements which you are willing to consider equal (like Q and J).
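The answer stays abstract, but the comparison such a modified merge would rely on can be derived from the lists themselves. A hedged sketch (outranks is my name, not from the answer):

def outranks(x, y, rankings):
    """Return True/False when some local list orders x and y directly,
    or None when they never co-occur, i.e. they are incomparable and
    the modified merge would group them as a tie (like Q and J)."""
    for rank in rankings:
        if x in rank and y in rank:
            return rank.index(x) < rank.index(y)
    return None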
This method analyzes all of the stores at the front of the ranking lists. If a store does not appear in a lower-than-first position in any other ranking list, it belongs at this front level and is added to a 'level' list. Next, those stores are removed from the front runners and all of the lists are adjusted so that there are new front runners. Repeat the process until there are no stores left.
def rank_stores(rankings):
    """
    Rank stores by sales volume, given overlapping ranking lists.

    Note: this consumes (mutates) the input lists.

    :param rankings: list of ranking lists of stores.
    :return: ordered list of sets of stores tied at the same rank.
    """
    rank_global = []
    # Evaluate all stores in the number one position; if they are not
    # below number one somewhere else, then they belong at this level.
    # Then remove them from the front of the lists, and repeat.
    while sum([len(x) for x in rankings]) > 0:
        tops = []
        # Find out which of the number one stores are not in a lower
        # position somewhere else.
        for rank in rankings:
            if not rank:
                continue
            else:
                top = rank[0]
                add = True
                for rank_test in rankings:
                    if not rank_test:
                        continue
                    elif not rank_test[1:]:
                        continue
                    elif top in rank_test[1:]:
                        add = False
                        break
                    else:
                        continue
                if add:
                    tops.append(top)
        # Now add tops to the global ranking, then go through the
        # rankings and pop the front store if it made it into tops.
        rank_global.append(set(tops))
        # Remove the stores that just made it to the top.
        for rank in rankings:
            if not rank:
                continue
            elif rank[0] in tops:
                rank.pop(0)
            else:
                continue
    return rank_global
For the rankings provided:
ranking_1 = ['J','A','Z','B','C']
ranking_2 = ['A','H','K','B']
ranking_3 = ['Q','O','A','N','K']
rankings = [ranking_1, ranking_2, ranking_3]
Then calling the function:
rank_stores(rankings)
Results in:
[{'J', 'Q'}, {'O'}, {'A'}, {'H', 'N', 'Z'}, {'K'}, {'B'}, {'C'}]
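To map these tie groups back to the numeric format asked for in the question (J and Q sharing rank 1.5), a small follow-up helper can average the positions each group occupies; levels_to_ranks is my own name, not part of the answer:

def levels_to_ranks(levels):
    """Turn ordered tie groups into (store, averaged rank) pairs."""
    ranks = []
    position = 1
    for level in levels:
        avg = position + (len(level) - 1) / 2  # mean of the occupied positions
        ranks.extend((store, avg) for store in sorted(level))
        position += len(level)
    return ranks

levels_to_ranks([{'J', 'Q'}, {'O'}, {'A'}, {'H', 'N', 'Z'}, {'K'}, {'B'}, {'C'}])
# [('J', 1.5), ('Q', 1.5), ('O', 3.0), ('A', 4.0), ('H', 6.0), ('N', 6.0),
#  ('Z', 6.0), ('K', 8.0), ('B', 9.0), ('C', 10.0)]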
In some circumstances there may not be enough information to determine a definite ranking. Take this true order:
['Z', 'A', 'B', 'J', 'K', 'F', 'L', 'E', 'W', 'X', 'Y', 'R', 'C']
We can derive the following rankings:
a = ['Z', 'A', 'B', 'F', 'E', 'Y']
b = ['Z', 'J', 'K', 'L', 'X', 'R']
c = ['F', 'E', 'W', 'Y', 'C']
d = ['J', 'K', 'E', 'W', 'X']
e = ['K', 'F', 'W', 'R', 'C']
f = ['X', 'Y', 'R', 'C']
g = ['Z', 'F', 'W', 'X', 'Y', 'R', 'C']
h = ['Z', 'A', 'E', 'W', 'C']
i = ['L', 'E', 'Y', 'R', 'C']
j = ['L', 'E', 'W', 'R']
k = ['Z', 'B', 'K', 'L', 'W', 'Y', 'R']
rankings = [a, b, c, d, e, f, g, h, i, j, k]
Calling the function:
rank_stores(rankings)
results in:
[{'Z'},
{'A', 'J'},
{'B'},
{'K'},
{'F', 'L'},
{'E'},
{'W'},
{'X'},
{'Y'},
{'R'},
{'C'}]
In this scenario there is not enough information to determine where 'J' should fall relative to 'A' and 'B'; only that it is somewhere between 'Z' and 'K'.
When multiplied among hundreds of rankings and stores, some of the stores will not be properly ranked on an absolute volume basis.
I have a DataFrame like this one:
import pandas as pd

df = pd.DataFrame({'State': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                   'County': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i'],
                   'Population': [10, 11, 12, 13, 17, 16, 15, 18, 14]})
Looking at the two most populous counties for each state, what are the two most populous states (in order of highest population to lowest population)?
I solved it by using a loop, and now I'm trying to get the same result by grouping, summing, sorting and selecting.
The following code works, but I'm sure there are many different and more elegant ways to do it.
df.groupby(['State'])['Population'].nlargest(2).groupby(['State']).sum()\
.sort_values(ascending=False)[:2].to_frame()\
.reset_index()['State'].tolist()
You can shorten this a little:
df.groupby(['State'])['Population'].nlargest(2)\
.sum(level=0).sort_values(ascending=False).index[:2].tolist()
There is no need to convert back to a DataFrame to retrieve the states; just get them from the index directly. Using sum with the level parameter is just shorthand for another groupby.
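One caveat: in recent pandas releases, Series.sum(level=...) is deprecated (and later removed) in favor of an explicit groupby on the index level, so the equivalent there would be:

df.groupby(['State'])['Population'].nlargest(2)\
  .groupby(level=0).sum().sort_values(ascending=False).index[:2].tolist()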
Here's an alternate way to do it, with explanations of each operation:
(df.sort_values('Population', ascending=False)     # order counties by population, highest first
 .groupby('State').head(2)                         # take the two most populous counties per state
 .groupby('State').sum()                           # total population of those two counties per state
 .sort_values('Population', ascending=False)[:2]   # keep the top 2 states by that total
 .index                                            # get the state names
 .tolist()                                         # convert to a list
)
UPDATE: I believe I found the solution. I've put it at the end.
Let’s say we have this list:
a = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
I want to create another list to remove the duplicates from list a, but at the same time, keep the ratio approximately intact AND maintain order.
The output should be:
b = ['a', 'b', 'a', 'c']
EDIT: To explain better: the ratio doesn't need to stay exactly intact. All that's required is a single output letter for each unique variable in the data. However, two letters might be the same but represent two different things; the counts are important for telling them apart, as I explain later. A letter representing ONE unique variable appears 3000-3400 times, so when I divide the total count by 3500 and round it, I know how many times it should appear in the end. The problem is I don't know what order they should be in.
To illustrate this I'll include one more input and desired output:
Input: ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'a', 'a', 'd', 'd', 'a', 'a']
Desired Output: ['a', 'a', 'b', 'c', 'a', 'd', 'a']
Note that 'c' is repeated three times. The ratio need not be preserved exactly; all I need to capture is how many times that variable is represented, and because it appears only 3 times in this example, that isn't considered enough to count as two.
The only difference is that here I'm assuming all letters repeating exactly twice are unique, although in the data-set, again, uniqueness is dependent on the appearance of 3000-3400 times.
Note(1): This doesn't necessarily need to be considered, but there's a possibility that not all letters will be grouped together nicely. For example (using 4 letters for uniqueness to keep it short): ['a','a','b','a','a','b','b','b','b'] should still be represented as ['a','b']. This is a minor problem in this case, however.
EDIT:
Example of what I've tried and successfully done:
full_list = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c']
# full_list is a list containing around 10k items; just using this as an example
rep = 2  # number of estimated repetitions for a unique item;
         # in the real list this was set to 3500
quant = {'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 0, 'f': 0, 'g': 0}
for x in set(full_list):
    quant[x] = round(full_list.count(x) / rep)
final = []
for x in range(len(full_list)):
    if full_list[x] in final:
        lastindex = len(full_list) - 1 - full_list[::-1].index(full_list[x])
        if lastindex == x and final.count(full_list[x]) < quant[full_list[x]]:
            final.append(full_list[x])
    else:
        final.append(full_list[x])
print(final)
My problem with the above code is two-fold:
1. If there are more than 2 repetitions of the same data, it will not count them correctly. For example: ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a'] should become ['a','b','a','c','a'] but instead it becomes ['a','b','c','a'].
2. It takes a very long time to finish, as I'm sure this is a very inefficient way to do it.
Final remark: the code I tried was more of a little hack to achieve the desired output on the most common input, but it doesn't do exactly what I intended. It's also important to note that the input changes over time. Repetitions of single letters aren't always the same, although I believe they're always grouped together, so I was thinking of setting a flag that is True when it hits a letter and becomes False as soon as it changes to a different one. But this can't account for the fact that two letters that are the same might end up right next to each other. The count for each individual letter is always between 3000-3400, so I know that if the count is above that, there is more than one.
UPDATE: Solution
Following hiro protagonist's suggestion with minor modifications, the following code seems to work:
from itertools import groupby

full = ['a', 'a', 'b', 'b', 'a', 'a', 'c', 'c', 'a', 'a']
letters_pre = [key for key, _group in groupby(full)]
letters_post = []
for x in range(len(letters_pre)):
    if x > 0 and letters_pre[x] != letters_pre[x - 1]:
        letters_post.append(letters_pre[x])
    if x == 0:
        letters_post.append(letters_pre[x])
print(letters_post)
The only problem is that it doesn't consider that sometimes letters can appear in between unique ones, as described in Note(1), but that's only a very minor issue. The bigger issue is that it doesn't handle two separate but consecutive occurrences of the same letter. For example (two for uniqueness again): ['a','a','a','a','b','b'] gets turned into ['a','b'] when the desired output is ['a','a','b'].
this is where itertools.groupby may come in handy:
from itertools import groupby
a = ["a", "a", "b", "b", "a", "a", "c", "c"]
res = [key for key, _group in groupby(a)]
print(res) # ['a', 'b', 'a', 'c']
this is a version where you can 'scale' down the unique keys (but are guaranteed to have at least one of each group in the result):
from itertools import groupby, repeat, chain

a = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'c', 'c', 'a', 'a',
     'd', 'd', 'a', 'a']
scale = 0.4

key_count = tuple((key, sum(1 for _item in group)) for key, group in groupby(a))
# (('a', 4), ('b', 2), ('c', 5), ('a', 2), ('d', 2), ('a', 2))

res = tuple(
    chain.from_iterable(
        repeat(key, round(scale * count) or 1) for key, count in key_count
    )
)
# ('a', 'a', 'b', 'c', 'c', 'a', 'd', 'a')
there may be smarter ways to determine the scale (probably based on the length of the input list a and the average group length).
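For instance (my assumption, not from the answer, based on the question's statement that one unique letter spans roughly 3000-3400 occurrences), the repeat count could be derived from an expected group size instead of a fixed scale:

expected_group = 3500  # rough size of one unique letter's run in the real data
res = tuple(
    chain.from_iterable(
        repeat(key, round(count / expected_group) or 1)
        for key, count in key_count
    )
)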
Might be a strange one, but:
b = []
for i in a:
    if next(iter(b[::-1]), None) != i:
        b.append(i)
print(b)
Output:
['a', 'b', 'a', 'c']
I'm trying to concatenate a bunch of dataframes, all of which hold the same information. But some column names are missing, and some dataframes have extra columns. However, for the columns they do have, they all follow the same order. I'd like a function to fill in the missing names. The following almost works:
def fill_missing_colnames(colnames):
    valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
    missing = list(set(valid_colnames) - set(colnames))
    if len(missing) > 0:
        for i, col in enumerate(colnames):
            if col not in valid_colnames and len(missing) > 0:
                colnames[i] = missing.pop(0)
    return colnames
But the problem is that set() doesn't preserve order (the difference comes back in arbitrary order, here alphabetical), whereas I'd like to preserve the order of the column names (or rather of the valid column names).
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']
list(set(valid_colnames) - set(colnames))
Out[9]: ['A', 'Z']
The concat looks like this:
concat_errors = {}
all_data = pd.DataFrame(list_of_dataframes[0])
for i, data in enumerate(list_of_dataframes[1:]):
try:
all_data = pd.concat([all_data, pd.DataFrame(data)], axis = 0, sort = False)
except Exception as e:
concat_errors.update({i+1:e})
You can use a list comprehension instead of a set operation.
missing = [col for col in valid_colnames if col not in colnames]
That simply selects the valid names that are missing from colnames, preserving the order of valid_colnames.
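For instance, with the names from the question, the comprehension comes back in valid_colnames order:

valid_colnames = ['Z', 'K', 'C', 'T', 'A', 'E', 'F', 'G']
colnames = ['K', 'C', 'T', 'E', 'XY', 'F', 'G']

missing = [col for col in valid_colnames if col not in colnames]
print(missing)  # ['Z', 'A'] -- order of valid_colnames preserved, unlike the set difference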