How do these nested Python for loops work?

Can anyone explain why the output of the following nested loop is {1:6, 2:6, 3:6}?
>>> {x:y for x in [1, 2, 3] for y in [4, 5, 6]}
{1: 6, 2: 6, 3: 6}

my_dict = {x:y for x in [1,2,3] for y in [4,5,6]}
is the same as creating it as follows:
my_dict = {}
for x in [1,2,3]:
    for y in [4,5,6]:
        my_dict[x] = y
Which would look like this if you unroll the loops:
my_dict = {}
my_dict[1] = 4
my_dict[1] = 5
my_dict[1] = 6
my_dict[2] = 4
my_dict[2] = 5
my_dict[2] = 6
my_dict[3] = 4
my_dict[3] = 5
my_dict[3] = 6
You are effectively inserting nine key-value pairs into the dictionary. However, each time you insert a pair whose key already exists, it overwrites the previous value. Thus you end up with only the last insert for each key, where the value is 6.

The difference from the analogous list comprehension is that you are building a dictionary: because you set a different value for the same key three times, only the last value sticks.
You are effectively doing:
dict[1] = 4
dict[1] = 5
dict[1] = 6
...
dict[3] = 4
dict[3] = 5
dict[3] = 6
So the last value sticks.

If the expectation was to create {1:4, 2:5, 3:6}, try this:
{x[0]:x[1] for x in zip([1,2,3], [4,5,6])}
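An even shorter equivalent, assuming the goal is just to pair the two lists positionally, is a sketch using dict() directly:
# zip() pairs the lists element-wise and dict() consumes the (key, value) pairs
print(dict(zip([1, 2, 3], [4, 5, 6])))   # {1: 4, 2: 5, 3: 6}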

Related

Calculate how many items in one list are in another

Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.
This can be done with:
true_count = 0
false_count = 0
for i, x in enumerate(list1):
    print(i)
    if x in list2:
        true_count += 1
    else:
        false_count += 1
print(true_count)
print(false_count)
This will do the trick; however, if you have 10 million rows, it could take quite some time. Is there some sweet function I don't know about that can do this much faster, or something entirely different?
Using Pandas
Here's one way to do it using a Pandas DataFrame (note that this checks whether every value in list2 appears in list1):
import pandas as pd
import random
list1 = [random.randint(1, 10) for i in range(10)]
list2 = [random.randint(1, 10) for i in range(10)]
df1 = pd.DataFrame({'list1': list1})
df2 = pd.DataFrame({'list2': list2})
print(df1)
print(df2)
print(all(df2.list2.isin(df1.list1).astype(int)))
I am just picking 10 rows and generating 10 random numbers:
List 1:
list1
0 3
1 5
2 4
3 1
4 5
5 2
6 1
7 4
8 2
9 5
List 2:
list2
0 2
1 3
2 2
3 4
4 3
5 5
6 5
7 1
8 4
9 1
The output of the print statement will be:
True
The random lists I checked against are:
list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]
I ran a test with 10 million random numbers in list1 and 5 million in list2; the result on my Mac came back in about 2.21 seconds.
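Note that all(...) only tells you whether every value in list2 appears in list1. If you want the actual count the question asks for, a small variation on the same isin idea (a sketch, not part of the original answer) would be:
# Count how many entries of list1 also appear somewhere in list2
count = df1.list1.isin(df2.list2).sum()
print(count)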
Using Set
Alternatively, you can convert the lists into sets and check whether one set is a subset of the other.
set1 = set(list1)
set2 = set(list2)
print(set2.issubset(set1))
Comparing the two runs, the set version is also fast: it came back in about 1.66 seconds.
You can convert the lists to sets and compute the length of the intersection between them.
len(set(list1) & set(list2))
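Keep in mind that the intersection counts each common value only once. If duplicate occurrences in list1 should each be counted, a minimal sketch that matches the question's true_count semantics:
set2 = set(list2)                            # set gives O(1) membership tests
true_count = sum(x in set2 for x in list1)   # every occurrence in list1 counts
print(true_count)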
You can also translate the lists into NumPy arrays with np.array(). Both arrays are one-dimensional, so you can use np.intersect1d() and count the common items with .size:
import numpy as np
lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]
a_list = np.array(lst)
b_list = np.array(lst2)
c = np.intersect1d(a_list, b_list)
print(c.size)
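As with the set intersection, intersect1d deduplicates before counting. If every matching element should count, np.isin (a sketch, not part of the original answer) gives a per-element boolean mask instead:
# True for each element of a_list that also occurs somewhere in b_list
print(np.isin(a_list, b_list).sum())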

Check pandas column for successive row values

I have:
hi
0 1
1 2
2 4
3 8
4 3
5 3
6 2
7 8
8 3
9 5
10 4
I have a list of lists and single integers like this:
[[2,8,3], 2, [2,8]]
For each item in the main list, I want to find out the index of when it appears in the column for the first time.
So for the single integers (e.g. 2), I want the first time the value appears in the hi column (index 1; I am not interested in later appearances such as index 6).
For the lists within the list, I want the index of the last element of where the list appears in order in that column.
So [2,8,3] appears in order at indexes 6, 7 and 8, and I want 8 to be returned. Note that its elements appear earlier too, but a 4 interrupts the run, so I am not interested in that.
I have so far used:
for c in chunks:
    # different method if single note chunk vs. multi
    if type(c) is int:
        # give first occurrence of correct single notes
        single_notes = df1[df1['user_entry_note'] == c]
        single_notes_list.append(single_notes)
    # for multi chunks
    else:
        multi_chunk = df1['user_entry_note'].isin(c)
        multi_chunk_list.append(multi_chunk)
You can do it with np.logical_and.reduce + shift. But there are a lot of edge cases to deal with:
import numpy as np

def find_idx(seq, df, col):
    if not isinstance(seq, list):      # a single value
        s = df[col].eq(seq)
        if s.sum() >= 1:               # if something matched
            idx = s.idxmax()           # index of the first match
        else:
            idx = np.nan
    elif seq:                          # a list that isn't empty
        seq = seq[::-1]                # reversed, so a match lines up on its last element
        m = np.logical_and.reduce([df[col].shift(i).eq(seq[i]) for i in range(len(seq))])
        s = df.loc[m]
        if not s.empty:                # if something matched
            idx = s.index[0]           # last index of the first full in-order match
        else:
            idx = np.nan
    else:                              # empty list
        idx = np.nan
    return idx
l = [[2,8,3], 2, [2,8]]
[find_idx(seq, df, col='hi') for seq in l]
#[8, 1, 7]
l = [[2,8,3], 2, [2,8], [], ['foo'], 'foo', [1,2,4,8,3,3]]
[find_idx(seq, df, col='hi') for seq in l]
#[8, 1, 7, nan, nan, nan, 5]

Find rows in a pandas dataframe where different rows have common values in columns storing lists

I can solve my task by writing a for loop, but I wonder how to do this in a more pandorable way.
So I have this dataframe storing some lists and want to find all the rows that have any common values in these lists,
(This code is just to obtain a df with lists:
>>> df = pd.DataFrame( {'a':['A','A','B','B','B','C'], 'b':[1,2,5,1,4,6]})
>>> df
a b
0 A 1
1 A 2
2 B 5
3 B 1
4 B 4
5 C 6
>>> d = df.groupby('a')['b'].apply(list)
)
Here we start:
>>> d
A [1, 2]
B [5, 1, 4]
C [6]
Name: b, dtype: object
I want to select rows with index 'A' and 'B', because their lists overlap by the value 1.
I could now write a for loop, or expand the dataframe at these lists (reversing the way I built it above) so that multiple rows duplicate the other values.
What would you do here? Or is there some way to use something like df.groupby(by=lambda x, y: not set(x).isdisjoint(y)) that compares two rows?
But groupby and boolean masking only look at one element at a time...
I tried overloading the equality operator for lists; since lists are not hashable, I moved on to tuples and then sets (I set __hash__ to a constant 1 to avoid identity comparison). I then used groupby and a merge of the frame with itself, but it seems to check off indexes that it has already matched.
import pandas as pd
import numpy as np

class IndexTuple(set):
    def __hash__(self):
        # constant hash: force collisions so __eq__ decides equality
        return hash(1)
    def __eq__(self, other):
        # "equal" means the two sets overlap
        return not set(self).isdisjoint(other)

l = IndexTuple((1, 7))
l1 = IndexTuple((4, 7))
print(l == l1)

df = pd.DataFrame(np.random.randint(low=0, high=4, size=(10, 2)), columns=['a', 'b']).reset_index()
d = df.groupby('a')['b'].apply(IndexTuple).to_frame().reset_index()
print(d)
print(d.groupby('b').b.apply(list))
print(d.merge(d, on='b', how='outer'))
This outputs the following (it works fine for the first element, but at [{3}] there should be [{3}, {0, 3}] instead):
True
a b
0 0 {1}
1 1 {0, 2}
2 2 {3}
3 3 {0, 3}
b
{1} [{1}]
{0, 2} [{0, 2}, {0, 3}]
{3} [{3}]
Name: b, dtype: object
a_x b a_y
0 0 {1} 0
1 1 {0, 2} 1
2 1 {0, 2} 3
3 3 {0, 3} 1
4 3 {0, 3} 3
5 2 {3} 2
Using a merge on df:
v = df.merge(df, on='b')
common_cols = set(
    np.sort(v.iloc[:, [0, -1]].query('a_x != a_y'), axis=1).ravel()
)
common_cols
{'A', 'B'}
Now, pre-filter and call groupby:
df[df.a.isin(common_cols)].groupby('a').b.apply(list)
a
A [1, 2]
B [5, 1, 4]
Name: b, dtype: object
I understand you are asking for a "pandorable" solution, but in my opinion this is not a task ideally suited to pandas.
Below is one solution using collections.defaultdict and itertools.combinations which produces your result without using a dataframe.
from collections import defaultdict
from itertools import combinations

data = {'a': ['A','A','B','B','B','C'], 'b': [1,2,5,1,4,6]}

d = defaultdict(set)
for i, j in zip(data['a'], data['b']):
    d[i].add(j)

res = {frozenset({i, j}) for i, j in combinations(d, 2) if not d[i].isdisjoint(d[j])}
# {frozenset({'A', 'B'})}
Explanation
Group values into sets via collections.defaultdict, an O(n) operation.
Iterate with itertools.combinations over pairs of keys and keep, via a set comprehension, the pairs whose value sets are not disjoint.
Use frozenset (or a sorted tuple) for the resulting pairs, as lists and sets are mutable and therefore not hashable.
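For illustration, a quick sketch of why frozenset works where a plain list or set would not:
pairs = {frozenset({'A', 'B'}): 'overlap'}   # hashable and order-insensitive
print(pairs[frozenset({'B', 'A'})])          # 'overlap'
# pairs[['A', 'B']] would raise TypeError: unhashable type: 'list'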

Python: sort tab-separated key in dict by both columns

I have a dictionary with tab separated keys.
d = {}
d["1\t1"] = "abc"
d["10\t1"] = "def"
d["1\t10"] = "ghi"
d["2\t5"] = "xyz"
d["1\t4"] = 0
How can I sort these keys by the first and then the second column?
I cannot simply use
for s in sorted(d):
    print s
because the keys are strings and would sort lexicographically ('10' would come before '2').
I want to return this:
1 1
1 4
1 10
2 5
10 1
How can this be achieved? I am not even sure if dictionaries are the right data structures.
I assume the second column means the part after \t:
sorted(d.items(), key=lambda x: (int(x[0].split('\t')[0]), int(x[0].split('\t')[-1])))
output:
[('1\t1', 'abc'), ('1\t4', 0), ('1\t10', 'ghi'), ('2\t5', 'xyz'), ('10\t1', 'def')]
print out:
for k, _ in sorted(d.items(), key=lambda x: (int(x[0].split('\t')[0]), int(x[0].split('\t')[-1]))):
    print k.split('\t')[0], k.split('\t')[1]
1 1
1 4
1 10
2 5
10 1
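If splitting each key twice bothers you, an equivalent sketch (key_fn is a hypothetical helper name) that parses each key only once:
def key_fn(item):
    a, b = item[0].split('\t')   # item is a (key, value) pair
    return (int(a), int(b))

for k, _ in sorted(d.items(), key=key_fn):
    print(k.replace('\t', ' '))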
An alternative (and probably much easier) solution to my own problem (thanks Kevin):
d = {}
d[1, 1] = "abc"
d[10, 1] = "def"
d[1, 10] = "ghi"
d[2, 5] = "xyz"
d[1, 4] = 0
for k in sorted(d):
    print(k)
The only change needed was a small revision of the input data: tuple keys instead of tab-separated strings, which sort correctly out of the box.

dict.values() doesn't provide all the values in Python

dict.values() doesn't return all the values that I store inside a for loop. I use the loop to read values from a text file.
test = {}
with open(input_file, "r") as f:   # note: don't reuse the name 'test' for the file handle
    for line in f:
        value = int(line.split()[5])
        test[value] = value
        print(value)
test_list = test.values()
print(str(test_list))
The printed values and test_list don't contain the same number of items.
The output is as follows:
From printing "value":
88
53
28
28
24
16
16
12
12
11
8
8
8
8
6
6
6
4
4
4
4
4
4
4
4
4
4
4
4
2
2
2
2
2
From printing test_list:
list values:dict_values([16, 24, 2, 4, 53, 8, 88, 12, 6, 11, 28])
Is there any way to include the duplicate values in the list too?
This line:
test[value] = value
doesn't add a new value to test if the key is already present; it simply overwrites the old value, so duplicates disappear. The values() call really does return everything that remains in the dict.
A dictionary cannot contain duplicate keys. When you do test[value] = value, the old value at key value is overwritten, so you end up with only a limited set of values.
A sample test can be
>>> {1:10}
{1: 10}
>>> {1:10,1:20}
{1: 20}
Here you can see that the duplicate key's value is overwritten by the new one.
POST-COMMENT EDIT
Since you said you want a list of values, put l = [] at the start and call l.append(value) where you currently have test[value] = value.
This is because Python dictionaries cannot have duplicate keys. Every time you run test[value] = value, it replaces the existing value for that key, or adds it if the key isn't in the dictionary yet.
For example:
>>> d = {}
>>> d['a'] = 'b'
>>> d
{'a': 'b'}
>>> d['a'] = 'c'
>>> d
{'a': 'c'}
I'd suggest making this into a list, like:
output = []
with open(input_file, "r") as test:
    for line in test:
        value = int(line.split()[5])
        output.append(value)
        print(value)
print(str(output))
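If you also need to know how many times each value occurred, a sketch using collections.Counter (not part of the original answers):
from collections import Counter

counts = Counter()
with open(input_file, "r") as f:
    for line in f:
        counts[int(line.split()[5])] += 1

print(counts)                    # value -> number of occurrences
print(list(counts.elements()))   # all values, duplicates included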
