Matrix as a dictionary; is it safe? - python

I know that the order of the keys is not guaranteed and that's OK, but what exactly does it mean that the order of the values is not guaranteed as well*?
For example, I am representing a matrix as a dictionary, like this:
signatures_dict = {}
M = 3
for i in range(1, M + 1):  # keys 1..3, one per row
    row = []
    for j in range(1, 6):  # each row gets the values 1..5
        row.append(j)
    signatures_dict[i] = row
print(signatures_dict)
Are the columns of my matrix correctly constructed? Let's say I have 3 rows, and at the signatures_dict[i] = row line, row will always be 1, 2, 3, 4, 5. What will signatures_dict be?
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
or something like
1 2 3 4 5
1 4 3 2 5
5 1 3 4 2
? I am worried about cross-platform support.
In my application, the rows are words and the columns documents, so can I say that the first column is the first document?
*Are order of keys() and values() in python dictionary guaranteed to be the same?

You are guaranteed to have 1 2 3 4 5 in each row; nothing will reorder them. The lack of ordering of values() refers to the fact that if you call signatures_dict.values(), the values could come out in any order. But those values are the rows, not the elements of each row. Each row is a list, and lists maintain their order.
If you want a dict which maintains order, Python has that too: https://docs.python.org/2/library/collections.html#collections.OrderedDict
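A minimal sketch of the difference (plain dicts only began guaranteeing insertion order in Python 3.7; the answer above predates that):
from collections import OrderedDict

# A plain dict (pre-3.7) may yield keys/values in any order.
plain = {3: 'c', 1: 'a', 2: 'b'}

# An OrderedDict remembers insertion order.
ordered = OrderedDict()
ordered[3] = 'c'
ordered[1] = 'a'
ordered[2] = 'b'
print(list(ordered.keys()))  # [3, 1, 2] -- always insertion order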

Why not use a list of lists as your matrix? It would have whatever order you gave it:
In [1]: matrix = [[i for i in range(4)] for _ in range(4)]
In [2]: matrix
Out[2]: [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]
In [3]: matrix[0][0]
Out[3]: 0
In [4]: matrix[3][2]
Out[4]: 2
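One caution with the list-of-lists construction: build each row with its own comprehension or list call. Replicating a single row object aliases it, which is exactly the pitfall in the next question:
aliased = [[0] * 4] * 4  # four references to one list, not four lists
aliased[0][0] = 9
print(aliased)  # [[9, 0, 0, 0], [9, 0, 0, 0], [9, 0, 0, 0], [9, 0, 0, 0]]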

Why does python replace every object of a column, when only referring to one, if all lines are identical? [duplicate]

This question already has answers here:
List of lists changes reflected across sublists unexpectedly (17 answers)
When I try to change one value in a matrix, Python changes every item of that column, even though I am only trying to change one. But this only happens when all rows are identical.
Example:
def print_matrix(matrix: list[list], dlm: str) -> None:
    for row in matrix:
        for col in row:
            print(col, end=dlm)
        print()

one_row = list(range(4))
test_matrix = []
for i in range(5):
    test_matrix.append(one_row)
test_matrix[0][0] = 5

sec_matrix = [
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 4]
]
sec_matrix[0][0] = 5

print_matrix(test_matrix, ' ')
print()
print_matrix(sec_matrix, ' ')
In the first matrix every 0 gets replaced with a 5, despite only referencing the first item of the first list.
In the second one it works the way I want it to, because the last list is slightly different.
Why is there a difference in the way test_matrix and sec_matrix are treated? Is this a bug, or is it intended?
Does Python just think they are the same list because they look the same?
Or are they actually the same object, to improve performance? Either way, I don't think it should happen.
I tried to update a matrix item on certain coordinates.
I expected only the desired item to be altered, instead every single one of that column got changed. Problem is fixed by not having identical rows.
The reason is that when you write test_matrix.append(one_row), you append the same list object five times, so the result looks like [[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]], but every element is a reference to the one underlying [0, 1, 2, 3]. When you then modify that single list, the change is visible through all references to it. For debugging purposes, you can check it:
print(id(test_matrix[0]))  # same id as the next line:
print(id(test_matrix[1]))  # both entries point at one list
You will see that the ids are all the same. If you want independent rows, build the matrix as below, where test_matrix = [list(range(4)) for n in range(5)] generates a fresh list on each iteration:
def print_matrix(matrix, dlm):
    for row in matrix:
        for col in row:
            print(col, end=dlm)
        print()

test_matrix = [list(range(4)) for n in range(5)]  # re-generated list per row
test_matrix[0][0] = 7

sec_matrix = [
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 3],
    [0, 1, 2, 4]
]
sec_matrix[0][0] = 5

print_matrix(test_matrix, ' ')
print()
print_matrix(sec_matrix, ' ')
Output:
7 1 2 3
0 1 2 3
0 1 2 3
0 1 2 3
0 1 2 3
5 1 2 3
0 1 2 3
0 1 2 3
0 1 2 4
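If you prefer to keep the append loop, appending a copy of the row works too; a minimal variant (same output as above, just an explicit copy per iteration):
one_row = list(range(4))
test_matrix = []
for i in range(5):
    test_matrix.append(one_row.copy())  # or one_row[:] / list(one_row)

test_matrix[0][0] = 7
print(test_matrix[0])  # [7, 1, 2, 3]
print(test_matrix[1])  # [0, 1, 2, 3] -- the other rows are unaffected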

Creating a pandas column of values with a calculation, but change the calculation every x times to a different one

I'm currently creating a new column in my pandas dataframe, which calculates a value based on a simple calculation using a value in another column, and a simple value subtracting from it. This is my current code, which almost gives me the output I desire (example shortened for reproduction):
import pandas as pd

subtraction_value = 3
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data['test'][::-1] - subtraction_value
When run, this gives me the current output:
print(data['new_column'])
[9, 1, 2, 1, -2, 0, -1, 2, 7, 6]
However, if I wanted to use a different value to subtract on the column, from position [0], then use the original subtraction value on positions [1:3] of the column, before using the second value on position [4] again, and repeat this pattern, how would I do this iteratively? I realize I could use a for loop to achieve this, but for performance reasons I'd like to do this another way. My new output would ideally look like this:
subtraction_value_2 = 6
print(data['new_column'])
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
You can use positional indexing:
subtraction_value_2 = 6
col = data.columns.get_loc('new_column')
data.iloc[0::4, col] = data['test'].iloc[0::4].sub(subtraction_value_2)
or with numpy.where:
import numpy as np

data['new_column'] = np.where(data.index % 4,
                              data['test'] - subtraction_value,
                              data['test'] - subtraction_value_2)
output:
   test  new_column
0    12           6
1     4           1
2     5           2
3     4           1
4     1          -5
5     3           0
6     2          -1
7     5           2
8    10           4
9     9           6
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = data.test - subtraction_value
data['new_column'][::4] = data.test[::4] - subtraction_value_2
print(list(data.new_column))
Output:
[6, 1, 2, 1, -5, 0, -1, 2, 4, 6]
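A caveat on that last snippet: data['new_column'][::4] = ... is chained indexing, which pandas may flag with a SettingWithCopyWarning and which stops working under copy-on-write in recent pandas versions. A sketch of the same assignment through .loc instead:
# Same step without chained indexing.
data.loc[::4, 'new_column'] = data['test'][::4] - subtraction_value_2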

Calculate how many items from one list are in another

Let's say I have two very large lists (e.g. 10 million rows) with some values or strings. I would like to figure out how many items from list1 are in list2.
As such this can be done by:
true_count = 0
false_count = 0
for i, x in enumerate(list1):
    print(i)
    if x in list2:
        true_count += 1
    else:
        false_count += 1
print(true_count)
print(false_count)
This will do the trick; however, with 10 million rows it can take quite some time, since x in list2 scans the whole list on every iteration. Is there some sweet function I don't know about that can do this much faster, or something entirely different?
Using Pandas
Here's how you can do it using a pandas DataFrame.
import pandas as pd
import random
list1 = [random.randint(1,10) for i in range(10)]
list2 = [random.randint(1,10) for i in range(10)]
df1 = pd.DataFrame({'list1':list1})
df2 = pd.DataFrame({'list2':list2})
print (df1)
print (df2)
print (all(df2.list2.isin(df1.list1).astype(int)))
I am just picking 10 rows and generating 10 random numbers:
List 1:
   list1
0      3
1      5
2      4
3      1
4      5
5      2
6      1
7      4
8      2
9      5
List 2:
   list2
0      2
1      3
2      2
3      4
4      3
5      5
6      5
7      1
8      4
9      1
The output of the all(...) check will be:
True
The random lists I checked against are:
list1 = [random.randint(1,100000) for i in range(10000000)]
list2 = [random.randint(1,100000) for i in range(5000000)]
In a test with 10 million random numbers in list1 and 5 million random numbers in list2, the result on my Mac came back in 2.207757880999999 seconds.
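Note that all(df2.list2.isin(df1.list1)) only answers whether every item of list2 appears in list1. If you want the actual counts the question asks for, isin gives them directly; a small sketch:
matches = df1.list1.isin(df2.list2)  # boolean Series, one entry per row of list1
true_count = int(matches.sum())      # items of list1 found in list2
false_count = int((~matches).sum())  # items of list1 not found in list2
print(true_count, false_count)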
Using Set
Alternatively, you can convert the lists to sets and check whether one set is a subset of the other.
set1 = set(list1)
set2 = set(list2)
print (set2.issubset(set1))
Comparing the runs, the set approach is also fast: it came back in 1.6564296570000003 seconds.
You can convert the lists to sets and compute the length of the intersection between them.
len(set(list1) & set(list2))
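Keep in mind that a set intersection counts unique values only. If duplicates in list1 should each be counted, you can still get constant-time lookups by converting just list2; a small sketch:
lookup = set(list2)                                # O(1) membership tests
true_count = sum(1 for x in list1 if x in lookup)  # duplicates in list1 still count
false_count = len(list1) - true_count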
You can also translate the lists into NumPy arrays with np.array(). Both lists then become one-dimensional array objects, so you can use np.intersect1d() and count the common items with .size. Note that, like a set intersection, np.intersect1d() returns unique common values only.
import numpy as np
lst = [1, 7, 0, 6, 2, 5, 6]
lst2 = [1, 8, 0, 6, 2, 4, 6]
a_list=np.array(lst)
b_list=np.array(lst2)
c = np.intersect1d(a_list, b_list)
print (c.size)

Organizing data to access information based on either variable depending on what is required

I am given a list of elements with their respective lists of nodes, and I want to switch how I look them up. I'd like to have a list of nodes with lists of their respective elements.
Example:
Have
E N
1 1 2 3
2 2 3 4
3 1 4
Desired
N E
1 1 3
2 1 2
3 1 2
4 2 3
I have a nested for loop solution.
# known data
elements = [
    [],  # No element 0
    [1, 2, 3],
    [2, 3, 4],
    [1, 4],
]
max_element = 3
max_node = 4

# truth
truth_nodes = [
    [],
    [1, 3],
    [1, 2],
    [1, 2],
    [2, 3],
]

# current algorithm
nodes = [[] for n in range(max_node + 1)]
for node in range(1, max_node + 1):
    for element in range(1, max_element + 1):
        if node in elements[element]:
            nodes[node].append(element)
Is there another tool in Python, either with NumPy, Pandas, or something else that might speed this up for over 300,000 elements? If this is a common algorithm, what is its name and/or how would I find it?
Edit: Is this a graph algorithm?
I can imagine vertices for my nodes above as well as my elements and using NetworkX to make an undirected graph. Then would I use a connected component algorithm?
I ended up using a pandas DataFrame with columns node and element and a row for every pair of a single node belonging to a single element.
Node Element
1 1
2 1
3 1
2 2
3 2
4 2
1 3
4 3
I believe this follows Tidy Data by Hadley Wickham. Then I can look up a given node with a boolean index, df[df.node == 1], and likewise an element, df[df.element == 3].
I am still not sure if this is a graph algorithm, but organizing my data this way allowed me to move forward with the rest of the problem.
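A minimal sketch of this layout (variable and column names are my own, for illustration), including a one-pass inversion that avoids the nested membership test:
import pandas as pd

# Element -> node lists, as in the question (element 0 unused).
elements = [
    [],
    [1, 2, 3],
    [2, 3, 4],
    [1, 4],
]

# One row per (node, element) pair -- the tidy layout.
pairs = [(node, element)
         for element, node_list in enumerate(elements)
         for node in node_list]
df = pd.DataFrame(pairs, columns=['node', 'element'])

print(df[df.node == 1])     # all elements touching node 1
print(df[df.element == 3])  # all nodes of element 3

# If the inverted lists themselves are needed:
nodes = df.groupby('node')['element'].apply(list)
print(nodes)  # node -> [elements], built in one pass over the pairs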

Collapsing identical adjacent rows in a Pandas Series

Basically if a column of my pandas dataframe looks like this:
[1 1 1 2 2 2 3 3 3 1 1]
I'd like it to be turned into the following:
[1 2 3 1]
You can write a simple function that loops through the elements of your series only storing the first element in a run.
As far as I know, there is no tool built in to pandas to do this. But it is not a lot of code to do it yourself.
import pandas

example_series = pandas.Series([1, 1, 1, 2, 2, 3])

def collapse(series):
    last = ""
    seen = []
    for element in series:
        if element != last:
            last = element
            seen.append(element)
    return seen

collapse(example_series)
The code above iterates through each element of the series and checks whether it is the same as the last element seen. If it is not, it stores it; if it is, it skips the value.
If you need to handle the return value as a series you can change the last line of the function to:
return pandas.Series(seen)
You could write a function that does the following:
x = pandas.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1])
y = x - x.shift(1)
y[0] = 1
result = x[y != 0]
You can use DataFrame's diff and indexing:
>>> df = pd.DataFrame([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df[0].diff()!=0]
0
0 1
2 2
6 3
10 1
>>> df[df[0].diff()!=0].values.ravel() # If you need an array
array([1, 2, 3, 1])
Same works for Series:
>>> df = pd.Series([1,1,2,2,2,2,3,3,3,3,1])
>>> df[df.diff()!=0].values
array([1, 2, 3, 1])
You can use shift to create a boolean mask to compare the row against the previous row:
In [67]:
s = pd.Series([1,1,2,2,2,2,3,3,3,3,4,4,5])
s[s!=s.shift()]
Out[67]:
0 1
2 2
6 3
10 4
12 5
dtype: int64
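Outside pandas, the standard library handles this as well; a small sketch with itertools.groupby, which collapses runs of equal adjacent values:
from itertools import groupby

values = [1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1]
collapsed = [key for key, _ in groupby(values)]
print(collapsed)  # [1, 2, 3, 1]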
