Python - 2D list of unknown size

I'm looking to create a 2D list to store data. I want to store an auto-incrementing ID alongside the various pieces of information. The number of rows that will be required is unknown, but the number of columns will always be fixed at 6 data values.
I want something like the following:
[0, a, b, 1, 2, 3]
[1, c, d, 4, 5, 6]
[2, e, f, 7, 8, 9]
I then want to be able to return any of the columns as needed, e.g.
[a, c, e]
At the moment I'm trying the following code:
student_array = []
student_count = 0
...
Student.student_array.append(Student.student_count)
Student.student_array.append(name)
Student.student_array.append(course_name)
Student.student_array.append(mark_one)
Student.student_array.append(mark_two)
Student.student_array.append(mark_three)
Student.student_count = Student.student_count + 1
def list_students():
    print(Student.student_array[1])
The problem I'm having at the moment is that it's obviously appending each new value onto the end of the outer list, rather than appending a new row, i.e.:
[0, 'a', 'b', 1, 2, 3, 1, 'c', 'd', 4, 5, 6]
Additionally, when it comes to pulling out the second column from each row, would the code be along these lines?
column_array = []
for row in Student.student_array:
    column_array.append(row[2])
print("Selected Data =", column_array)

The structure you have now, with all the data in a single list (list and array mean different things in Python by the way), actually makes it easier to get a column. If your record is of size r_len = 6, and you want col = 3 (the fourth column), you can do
>>> Student.student_array[col::r_len]
[1, 4, 7]
To store a 2D list, though, you need to place each student's information into a separate list in your loop:
current_student = [len(Student.student_array), name, course_name, mark1, mark2, mark3]
Student.student_array.append(current_student)
Notice that you do not need to maintain a separate count this way: the length of the outer list speaks for itself.
To get the data from col = 3 in a 2D list like this, use a comprehension:
>>> [s[col] for s in Student.student_array]
[1, 4, 7]
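A minimal sketch of both layouts, assuming a plain module-level list in place of the Student class attribute from the question. The helper names add_student and get_column are hypothetical and exist only for illustration.

```python
# Hypothetical helpers; `students` stands in for Student.student_array.
students = []

def add_student(name, course_name, mark1, mark2, mark3):
    """Append one row; the row's ID is its position in the outer list."""
    students.append([len(students), name, course_name, mark1, mark2, mark3])

def get_column(rows, col):
    """Return the value at index `col` from every row."""
    return [row[col] for row in rows]

add_student('a', 'b', 1, 2, 3)
add_student('c', 'd', 4, 5, 6)
add_student('e', 'f', 7, 8, 9)

names = get_column(students, 1)   # ['a', 'c', 'e']
marks = get_column(students, 3)   # [1, 4, 7]

# The flat layout from the question yields the same column via slicing.
flat = [value for row in students for value in row]
assert flat[3::6] == marks
```

Because the ID comes from len(students), it only stays consistent as long as rows are never removed from the middle of the list.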
Keeping related information in an unlabeled format like that is generally a poor idea though. You can either add labels by using a library like pandas, which will maintain proper tables for you, or you can encapsulate each student's info into a small class. You can write your own class, or use something like a namedtuple:
import collections
Record = collections.namedtuple('Record', ['id', 'name', 'course', 'mark1', 'mark2', 'mark3'])
...
Student.student_array.append(Record(len(Student.student_array), name, course_name, mark1, mark2, mark3))
You can now extract mark1 for each student by name rather than by a numerical index, which is liable to change and cause a maintenance problem later:
>>> [s.mark1 for s in Student.student_array]
[1, 4, 7]

Related

Python LIFO list/array - shifting data, to replace first input with newest value

My goal is to find the highest high in a set of price data. However, I'm currently struggling to append data to a list in LIFO order (last in, first out) in my for loop, which loops through a large set of data. So for example:
I have a list []
append to list item by item in for loop: list [1, 2, 3, 4, 5]
then, once I have reached the desired list length (in this case 5), I want to shift everything down whilst deleting '1'. For example, it could go to [2, 3, 4, 5, 1], then replace '1' with '6', resulting in [2, 3, 4, 5, 6]
highs_list = np.array([])
list_length = 50
for current in range(1, len(df.index)):
    previous = current - 1
    if len(highs_list) < list_length:
        np.append(highs_list, df.loc[current, 'high'])
    else:
        np.roll(highs_list, -1)
        highs_list[49] = df.loc[current, 'high']
If you insert 1, 2, 3, 4, 5 and then want to remove 1 to insert 6 then this seems to be a FIFO movement, since the First IN (1) is the First OUT.
Anyhow, standard Python lists allow for this by using append() and pop(0), but the in-memory shift of the elements has time complexity O(n).
A much more efficient tool for this task is collections.deque, which also provides a rotate method to allow exactly the [1,2,3,4,5] => [2,3,4,5,1] transformation.
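As a rough sketch of the deque approach: a deque created with maxlen discards its oldest item automatically once it is full, so no manual shift is needed. The price values and the window length of 5 below are made up for illustration; the question's df['high'] column would take their place.

```python
from collections import deque

# Made-up price highs standing in for df['high'] from the question.
highs = [1, 2, 3, 4, 5, 6, 3, 2]
window = deque(maxlen=5)   # keeps only the 5 most recent values

highest_per_step = []
for high in highs:
    window.append(high)                    # once full, the oldest value is dropped
    highest_per_step.append(max(window))   # highest high within the current window

print(list(window))        # [4, 5, 6, 3, 2]
print(highest_per_step)    # [1, 2, 3, 4, 5, 6, 6, 6]

# rotate(-1) performs the [1,2,3,4,5] => [2,3,4,5,1] shift on its own.
d = deque([1, 2, 3, 4, 5])
d.rotate(-1)
print(list(d))             # [2, 3, 4, 5, 1]
```

Note that max(window) still scans the whole window on every step, which is fine for a window of 50 values.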

How sorting deals with key and lambda functions in two given lists

I'm new to programming and learning about the sort function. While I did search and looked at a number of SO articles regarding sort/lambda and some even very similar to my situation, I am still having a hard time grasping how exactly this code works. I also read the docs (https://docs.python.org/3/howto/sorting.html) and the examples there make sense to me but I can't seem to carry this knowledge onto this example here. Would someone be kind enough to help me with the code below? I understand this can be a duplicate but I am only asking because I don't have the knowledge base to get what I've read from other similar posts. Please help if you can, thanks.
a = [13, 15, 81, 4]
b = [0, 1, 2, 3]
b.sort(key = lambda x:a[x])
b = [3, 0, 1, 2]
How does the order of list 'b' change from [0, 1, 2, 3] to [3, 0, 1, 2]? How does the list 'a' come into play?
It sorts b as if each element in b had the value of the corresponding value in a. It might help to see what happens if you sort tuples consisting of each value.
>>> sorted(zip(a, b))
[(4, 3), (13, 0), (15, 1), (81, 2)]
Notice the second element of each tuple: these are the values of b, arranged in the order of their first elements, which are the values of a.
Values of the list b matter
The values of b are of importance as they are used to index into a. The example you originally gave does not throw an error because every value of b can be used as an index to retrieve values from a.
In other words, if the values of b changed from [0, 1, 2, 3] to [1, 2, 3, 4], this would cause an IndexError as the last element, 4, would be pointing to a fifth element in a which does not exist.
Further explanation
b is sorted by the function below (a.k.a. your original lambda function).
def anonymous(x):
    return a[x]
What happens under the hood is that every value of b is used as an index of a to retrieve and compare the values of a.
a[0] = 13
a[1] = 15
a[2] = 81
a[3] = 4
Here, we sort the right-hand side of each equation in ascending order: [4, 13, 15, 81]. Then, we grab the corresponding values of b in the same order: [3, 0, 1, 2], which is the final sorted list b.
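The same steps can be traced in code. This sketch uses sorted() so that b itself is left untouched; the names keys, pairs and by_hand are introduced here purely for illustration.

```python
a = [13, 15, 81, 4]
b = [0, 1, 2, 3]

# The key function is called once per element of b.
keys = [a[x] for x in b]              # [13, 15, 81, 4]

# Pair each key with its element, order by key, then drop the keys.
pairs = sorted(zip(keys, b))          # [(4, 3), (13, 0), (15, 1), (81, 2)]
by_hand = [x for _, x in pairs]       # [3, 0, 1, 2]

# Same result as letting sorted() apply the key function itself.
assert by_hand == sorted(b, key=lambda x: a[x])
```

The manual version matches here because a contains no repeated values; with ties, sorting with key= keeps tied elements in their original order, while the tuple version falls back to comparing the elements of b.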

Restricting a Change to Only 1 List

I have nested lists (i.e. [[list1],[list2]]) and I want to make a change to only the first list.
My function is given as below:
function_name(data, list_number, change)
should return a change to only the list_number provided in the data
So my overall question is, how do I restrict this change to only the list_number given? If this is unclear please feel free to ask and clarify.
Ex:
Here the position that 'change' will replace is already known; in this case it replaces the second element of the first list (i.e. the value 2).
data = [[1,2,3],[4,5,6]]
function_name(data, 1, 6)
data = [[1,6,3],[4,5,6]]
I have no idea where to even begin as the index is 'unknown' (ie. given by the user when the function is called)
List items are referenced via their index and nested lists can work the same way.
If your list were:
nested = [['nest1_1', 'nest1_2'], ['nest2_1', 'nest2_2']]
You could change it in the following ways:
nested[0] = ['nesta_1', 'nesta_2']
nested[1][0] = 'second_1'
This would make your list now be:
[['nesta_1', 'nesta_2'], ['second_1', 'nest2_2']]
Try this code:
data = [[1,2,3],[4,5,6]]
def element_change(data, index_list, element_to_change, change):
    a = ''.join([str(i) for i in data[index_list]])
    data[index_list][a.find(str(element_to_change))] = change
    return data
print(element_change(data, 0, 2, 6))
Input:
[[1, 2, 3], [4, 5, 6]]
Output:
[[1, 6, 3], [4, 5, 6]]
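Simply put, it casts the list items to strings and joins them so that the find() method can locate the index. Note that this only gives the right index when every element is a single character long; list.index() does not have that limitation.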
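An alternative sketch that skips the string conversion and uses list.index() to locate the value, so it also works for multi-digit numbers. The function name replace_in_sublist and its parameter names are hypothetical.

```python
def replace_in_sublist(data, list_index, old_value, new_value):
    """Replace the first occurrence of old_value in data[list_index], in place."""
    target = data[list_index]
    position = target.index(old_value)   # raises ValueError if old_value is absent
    target[position] = new_value
    return data

data = [[1, 2, 3], [4, 5, 6]]
print(replace_in_sublist(data, 0, 2, 6))     # [[1, 6, 3], [4, 5, 6]]

# A multi-digit element no longer shifts the computed position.
wide = [[10, 2, 3], [4, 5, 6]]
print(replace_in_sublist(wide, 0, 2, 99))    # [[10, 99, 3], [4, 5, 6]]
```

Only the sublist selected by list_index is touched, which is what restricts the change to a single list.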

Most efficient way to add an item and at the same time remove one from a list with a fixed length

I'm parsing some data which can have duplicates. To get rid of them, I use a small list with the last five non-duplicate items and check if the current item is not in the list. I have a solution that works, but there should be a better way. Any ideas?
My current code to achieve this:
activities = []
index = 0
# Open file
# Loop lines (each line is an activity)
    # Parse line to activity object
    if activity not in activities:
        # session is part of SQLAlchemy but this isn't that important
        self.session.add(activity)
        # The part from here on is the one I want changed
        if len(activities) == 5:
            activities.pop(index)
        activities.insert(index, activity)
        if index == 4:
            index = 0
        else:
            index = index + 1
EDIT: The problem is not in removing the duplicates inside this list. This is just to check if the new activity is in one of the last added activities. I'm parsing A LOT of data and checking the new activity against all old ones would be a huge bottleneck. The data is sorted by date and can really have a duplicate just in the last few activities (so I'm checking the last 5). Getting the unique values is not the problem, I'm just asking for a solution that does the same thing as mine already does, but would be better.
collections.deque with a limited maxlen is efficient for the combined insert+delete operation:
from collections import deque
activities = deque(maxlen=5)
# if len(activities) == 5, the leftmost item is removed before the append
activities.append(activity)
However, the code in between may require some changes, because the data now shifts on each step, which changes the indices.
Or you may prefill activities with Nones (assuming None is never a valid activity) and then simply do
activities = [None] * 5
index = 0
# some code in-between
activities[index] = activity
if index == 4:
    index = 0
else:
    index = index + 1
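Putting the deque suggestion into a loop might look like the sketch below. The integers stand in for parsed activity objects, and the list named saved stands in for self.session.add(); both are assumptions made for illustration.

```python
from collections import deque

# Stand-in data: integers play the role of parsed activity objects.
parsed = [1, 2, 2, 3, 1, 4, 5, 6, 7, 1]

recent = deque(maxlen=5)   # the five most recently saved activities
saved = []                 # stands in for self.session.add(activity)

for activity in parsed:
    if activity not in recent:
        saved.append(activity)
        recent.append(activity)   # evicts the oldest entry when already full

print(saved)           # [1, 2, 3, 4, 5, 6, 7, 1]
print(list(recent))    # [4, 5, 6, 7, 1]
```

The final 1 is saved again because its earlier copy had already left the five-item window, which matches the behaviour the question asks for.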
The answer is to use a different data structure, one which is tailor-made for this purpose. Your approach misses a duplicate whenever its earlier copy is no longer among the five most recent elements.
Instead use a set.
Parse each activity into an object of a class with __hash__ and __eq__ methods, then simply add each new activity into the set as you parse them. This will leave you with a collection containing only the unique objects from your input.
Once you have finished parsing the input, you can convert the set into a list.
s = set()
while more_data_to_parse():
    s.add(parse_next_object())
activities = list(s)
For example:
>>> s = set()
>>> for i in [1, 2, 3, 2, 3, 4, 5, 6, 1, 6]:
...     s.add(i)
...
>>> activities=list(s)
>>> activities
[1, 2, 3, 4, 5, 6]
>>>
The resulting list won't necessarily be in the same order as the original input, because sets are unordered, but that can be resolved by simply sorting it.
You could use OrderedDict to do the filtering. It would preserve the original order so that result would be in order of first occurrence:
from collections import OrderedDict
items = [3, 5, 6, 2, 5, 6, 1, 7, 8, 2, 3, 6]
items = list(OrderedDict((x, True) for x in items).keys()) # [3, 5, 6, 2, 1, 7, 8]
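On Python 3.7 and later, a plain dict also preserves insertion order, so dict.fromkeys() gives the same first-occurrence ordering without any import. A short sketch, with unique_items as an illustrative name:

```python
items = [3, 5, 6, 2, 5, 6, 1, 7, 8, 2, 3, 6]

# Keys are unique and keep the order in which they were first inserted.
unique_items = list(dict.fromkeys(items))

print(unique_items)    # [3, 5, 6, 2, 1, 7, 8]
```

As with a set, the items must be hashable for this to work.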

Selecting unique random values from the third column of an array in python

I have a 41000x3 numpy array that I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates, others which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is I'd like to return, not only those values, but all three columns where, in the last column, there are the randomly chosen values that I get from numpy.random.sample.
EDIT: By unique values I mean I want to choose random values which appear only once. So if I had an array:
array = [[0, 6, 2],
         [5, 3, 9],
         [3, 7, 1],
         [5, 3, 2],
         [3, 1, 1],
         [5, 2, 8]]
And I wanted to choose 4 values of the third column, I want to get something like new_array_1 out:
new_array_1 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [5, 2, 8]]
But I don't want something like new_array_2, where two values in the 3rd column are the same:
new_array_2 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [3, 1, 1]]
I have the code to choose random values but without the criterion that they shouldn't be duplicates in the third column.
sample_size = 100
rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size = sample_size),:]
I'm trying to enforce this criterion by doing something like this
array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )
But I'm not sure if I'm on the right track. Any help would be greatly appreciated!
I can't think of a clever numpythonic way to do this that doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that's still the fastest way to go, but it never feels right.)
In pure Python, I'd do something like
import random

def draw_unique(vec, n):
    # group indices by value
    d = {}
    for i, x in enumerate(vec):
        d.setdefault(x, []).append(i)
    # pick n distinct values, then one random row index for each
    drawn = [random.choice(d[k]) for k in random.sample(list(d), n)]
    return drawn
which would give
>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
       [8, 8, 1],
       [5, 8, 3]])
I can think of some tricks with np.bincount and scipy.stats.rankdata, but they hurt my head, and there always winds up being one step at the end I can't see how to vectorize. If I'm not vectorizing the whole thing, I might as well use the above, which at least is simple.
I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is gigantic but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure python method would.
import numpy as np
# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)
# sort the list ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]
# identify rows that are duplicates (3rd column is non-increasing)
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0
# if duplicate_rows[N], then we want to remove row N and N+1
keep_mask = np.ones(len(a2), dtype=bool) # all True
keep_mask[:-1][duplicate_rows] = False # remove row N
keep_mask[1:][duplicate_rows] = False # remove row N + 1
# now actually slice the array
a3 = a2[keep_mask]
# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
import random
result = a3[random.sample(range(len(a3)), DESIRED_NUMBER_OF_ROWS)]
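A more compact sketch of the same idea, assuming "unique" means values that occur exactly once in the third column: np.unique with return_counts=True finds those values, and a NumPy Generator draws row positions without replacement. The function name sample_singletons, the seed, and the small example array are illustrative assumptions, not part of the original answer.

```python
import numpy as np

def sample_singletons(arr, n, rng):
    """Draw n rows whose third-column value occurs exactly once in arr."""
    col = arr[:, 2]
    values, counts = np.unique(col, return_counts=True)
    # Keep only the rows whose third-column value has a count of one.
    candidates = arr[np.isin(col, values[counts == 1])]
    # choice() raises ValueError if n exceeds the number of candidate rows.
    picks = rng.choice(len(candidates), size=n, replace=False)
    return candidates[picks]

arr = np.array([[0, 6, 2],
                [5, 3, 9],
                [3, 7, 1],
                [5, 3, 2],
                [3, 1, 1],
                [5, 2, 8]])

rng = np.random.default_rng(0)   # seeded only to make the sketch repeatable
result = sample_singletons(arr, 2, rng)
```

In this small array only the values 9 and 8 occur once in the third column, so result holds the rows [5, 3, 9] and [5, 2, 8] in some order.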
