Python: multidimensional pandas DataFrame - python

This is my first question.
I have many sets of data. Each of them should be presented in a DataFrame. I have tried to implement this by having a DataFrame as an item of a multidimensional tuple, e.g.:
data[0][1].Glucose.val
data[0][1].Glucose.time
I have predefined the tuple like this:
data = tuple([data_type for _ in range(3)] for _ in range(8))
Addressing this works fine, but if I try to fill the df with new values, all elements in the tuple are overwritten:
for condition in range(8):
    for index in range(3):
        loop_it = condition + row_mult * index
        exp_setting = expIDs[loop_it]
        tempval = pd.read_csv(f"raw_data/{exp_setting}_Glucose.csv", delimiter="\t")
        rundata[condition][index].DOT.val = tempval.val.values
        rundata[condition][index].DOT.time = tempval.t
What the hell am I doing wrong?
THANKS

Tuples are immutable, so you cannot assign a new object to one of their slots; you would have to rebuild the whole tuple. A list of DataFrames lets you replace individual items instead.
If your DataFrames all have the same shape, and all the values are numerical, you could also use just one multi-dimensional NumPy array for all the data, e.g.:
import numpy as np
data = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])
# replace the first item in the second row of the first frame with 9
data[0, 1, 0] = 9
print(data)
[[[1 2]
  [9 4]]

 [[5 6]
  [7 8]]]
By the way, pandas did have special data structures for 3- and 4-dimensional data (Panel and Panel4D) in earlier versions, but they were deprecated and later removed. You may be able to stack the data into a single two-dimensional DataFrame instead. For that, look into pandas' MultiIndex functionality.
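A minimal sketch of the MultiIndex idea, assuming each data set is a small DataFrame with `time` and `val` columns. The level names `condition` and `replicate` and the numbers are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical data: 2 conditions x 2 replicates, 3 time points each
frames = {}
for condition in range(2):
    for replicate in range(2):
        frames[(condition, replicate)] = pd.DataFrame(
            {"time": [0.0, 1.0, 2.0],
             "val": [condition + replicate + t for t in range(3)]}
        )

# Concatenating a dict of DataFrames turns the keys into outer index levels
stacked = pd.concat(frames, names=["condition", "replicate"])

# Select one data set by its (condition, replicate) key
subset = stacked.loc[(0, 1)]
print(subset)
```

Each data set stays addressable by its key, but everything lives in one two-dimensional DataFrame.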

As mentioned here: Multidimensional list of classes - overwriting issue
The issue was that I did not instantiate the class, so every element of the tuple referred to the same class object.
Wrong:
data = tuple([data_type for _ in range(3)] for _ in range(8))
Right:
data = tuple([data_type() for _ in range(3)] for _ in range(8))
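The difference can be reproduced with a small stand-in class. `DataType` below is hypothetical and only mimics the asker's `data_type`; the point is that assigning through the class object sets a class attribute that every slot sees:

```python
class DataType:
    """Hypothetical stand-in for the asker's data_type class."""
    def __init__(self):
        self.val = None

# Wrong: every slot holds the class object itself
shared = tuple([DataType for _ in range(3)] for _ in range(2))
shared[0][0].val = 1          # sets an attribute on the class
print(shared[1][2].val)       # the same class object, so the value shows up here too

# Right: every slot holds its own instance
separate = tuple([DataType() for _ in range(3)] for _ in range(2))
separate[0][0].val = 2        # only this instance changes
print(separate[1][2].val)     # still the value set in __init__
```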

Related

Copy xarray into larger xarray in python?

I have an xarray of shape [3,] that looks like this
data= [2,4,6]
and I'm trying to copy it into array so that it looks like this:
data= [[2,4,6],[2,4,6],[2,4,6]]
(ie the entire array is copied three times).
I've tried a few different methods but keep getting:
data= [2 2 2,4 4 4,6 6 6]
Anyone know how I should go about doing this? (Also, sorry if I wrote this not according to the stack overflow rules, this is my first question...)
The first two answers don't actually copy the original array/list. Rather, they both just reference the array three times inside a new list. If you change one of the values inside the original array or any of the "copies" inside the new list, all of the "copies" will change, because they are really the same structure referenced in multiple places.
If you want to create a list containing three unique copies of your original array (xarray or list), you can do this:
new_list = [data[:] for _ in range(3)]
or if you want a new xarray containing your original array:
new_array = xarray.DataArray([data[:] for _ in range(3)])
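A short sketch of the difference, using a plain list in place of the xarray object (an assumption made to keep the example dependency-free):

```python
data = [2, 4, 6]

# Three references to the same list object
aliased = [data] * 3
# Three independent shallow copies
copied = [data[:] for _ in range(3)]

# Changing the original shows which version is really a copy
data[0] = 99
print(aliased)
print(copied)
```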
I think the safest approach would be this:
import itertools
data = [2,4,6]
res = list(itertools.chain.from_iterable(itertools.repeat(x, 3) for x in [data]))
print(res)
Output: [[2, 4, 6], [2, 4, 6], [2, 4, 6]]

What Is the Logic Behind Advanced Indexing in Numpy?

When the following lines of code are run, both give the same result. Is the logic behind advanced indexing in NumPy literally zipping different iterables together? If so, I am also curious what data structure the result is converted into after zipping. I am using a tuple in my example, but it seems like there are other possibilities. Thanks in advance for the help!
import numpy as np
a = np.array([[1,2],[3,4],[5,6]])
print(a[[0,1],[1,1]])
# prints [2 4]
result = zip([0,1],[1,1])
print(a[tuple(result)])
# prints [2 4]
Lists and tuples are basically the same in that both hold items, but a list is mutable (i.e., you can change its elements) while a tuple is immutable.
For the indexing shown in the question you can use either, as long as they hold integer values. Note, however, that NumPy reads a top-level tuple as one index per axis and a top-level list as a single index array, so the two are not interchangeable in general.
One advantage of using a tuple for indexing is that it cannot be changed midway and mess up the data extraction (as shown in Example 1), if that is one of your requirements.
Example 1 (immutable):
import numpy as np
arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = tuple(np.random.randint(0, 2, 10).reshape((5, 2)))
for i in range(3):
    np.random.shuffle(idx)  # fails here: a tuple cannot be shuffled in place
    print(arr[idx])
Output of Example 1:
TypeError: 'tuple' object does not support item assignment
On the other hand, if you want more flexible indexing (as in Example 2), i.e., changing the indices during the run, tuples won't work for you.
Example 2 (mutable):
import numpy as np
arr = np.random.randint(0, 10, 6).reshape((2, 3))
idx = np.random.randint(0, 2, 10).reshape((5, 2))
for i in range(3):
    np.random.shuffle(idx)
    print(arr[idx])
Output of Example 2:
[[[0 6 8]
  [0 5 5]]

 [[0 5 5]
  [0 5 5]]

 [[0 6 8]
  [0 5 5]]...
So, whether to use one or another depends on the outcome you desire.
Cheers.
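To come back to the "zipping" intuition in the question, here is a small sketch of how `a[rows, cols]` pairs up its index lists, and of how a tuple index differs from a list index. The names `rows`, `cols`, `picked` and `zipped` are illustrative only:

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])
rows = [0, 1, 2]
cols = [1, 1, 0]

# Advanced indexing pairs rows[i] with cols[i]
picked = a[rows, cols]

# The same lookup written element by element with zip
zipped = np.array([a[r, c] for r, c in zip(rows, cols)])

print(picked)
print(zipped)

# A tuple is read as one index per axis, a list as a single index for axis 0
print(a[(0, 1)])   # the element in row 0, column 1
print(a[[0, 1]])   # rows 0 and 1
```

So the pairing behaves like `zip` over the index lists, but NumPy does not build a zipped structure that you index with; it broadcasts the index arrays against each other.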

Python - Select elements from matrix within range

I have a question regarding python and selecting elements within a range.
I have an n x m matrix with n rows and m columns, and a defined range for each column (so I have m min and max values).
Now I want to select those rows in which every value is within the range of its column.
Looking at the following example:
input = matrix([[1, 2], [3, 4],[5,6],[1,8]])
boundaries = matrix([[2,1],[8,5]])
#Note:
#col1min = 2
#col1max = 8
#col2min = 1
#col2max = 5
print(input)
desired_result = matrix([[3, 4]])
print(desired_result)
Here, 3 rows were discarded because they contained values beyond the boundaries.
While I was able to get values within one range for a given array, I did not manage to solve this problem efficiently.
Thank you for your help.
I believe there is a more elegant solution, but I came up with this:
def foo(data, boundaries):
    zipped_bounds = list(zip(*boundaries))
    output = []
    for item in data:
        for index, bound in enumerate(zipped_bounds):
            if not (bound[0] <= item[index] <= bound[1]):
                break
        else:
            output.append(item)
    return output
data = [[1, 2], [3, 4], [5, 6], [1, 8]]
boundaries = [[2, 1], [8, 5]]
foo(data, boundaries)
Output:
[[3, 4]]
I know that this does no checking and raises no exceptions if the sizes of the arrays do not match. I leave it to the OP to implement this.
Your example data syntax, matrix([[],..]), is not correct, so it needs to be restructured like this:
matrix = [[1, 2], [3, 4],[5,6],[1,8]]
bounds = [[2,1],[8,5]]
I'm not sure exactly what you mean by "efficient", but this solution is readable, computationally efficient, and modular:
# Test columns in row against column bounds or first bounds
def row_in_bounds(row, bounds):
    for ci, colVal in enumerate(row):
        bi = ci if len(bounds[0]) >= ci + 1 else 0
        if not bounds[1][bi] >= colVal >= bounds[0][bi]:
            return False
    return True
# Use a list comprehension to apply test to n rows
print([r for r in matrix if row_in_bounds(r, bounds)])
# prints [[3, 4]]
First we create a reusable test function for rows that accepts a list of bounds lists. Tuples are probably more appropriate, but I stuck with lists as per your specification.
Then we apply the test to the n rows of your matrix with a list comprehension. If a row has more columns than there are bounds, the first set of bounds is used for the extra columns.
Keeping the row iteration out of the row test function allows you to do things like get the min/max of the filtered elements as required. This way you will not need to define a new function for every manipulation of the data.
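If the data is already in NumPy, the same filter can be sketched with a boolean mask. This assumes, as in the question, that the first row of `boundaries` holds the per-column minima and the second row the maxima:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [1, 8]])
boundaries = np.array([[2, 1], [8, 5]])  # row 0: minima, row 1: maxima

lower, upper = boundaries
# True for rows where every column lies within its own [min, max] range
mask = ((data >= lower) & (data <= upper)).all(axis=1)
print(data[mask])
```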

Selecting unique random values from the third column of a an array in python

I have a 41000x3 numpy array that I call "sortedlist" in the function below. The third column has a bunch of values, some of which are duplicates, others which are not. I'd like to take a sample of unique values (no duplicates) from the third column, which is sortedlist[:,2]. I think I can do this easily with numpy.random.sample(sortedlist[:,2], sample_size). The problem is I'd like to return, not only those values, but all three columns where, in the last column, there are the randomly chosen values that I get from numpy.random.sample.
EDIT: By unique values I mean I want to choose random values which appear only once. So if I had an array:
array = [[0, 6, 2],
         [5, 3, 9],
         [3, 7, 1],
         [5, 3, 2],
         [3, 1, 1],
         [5, 2, 8]]
and I wanted to choose 4 values of the third column, I want to get something like new_array_1 out:
new_array_1 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [5, 2, 8]]
But I don't want something like new_array_2, where two values in the 3rd column are the same:
new_array_2 = [[5, 3, 9],
               [3, 7, 1],
               [5, 3, 2],
               [3, 1, 1]]
I have the code to choose random values, but without the criterion that there shouldn't be duplicates in the third column.
sample_size = 100
rand_sortedlist = sortedlist[np.random.randint(len(sortedlist), size=sample_size), :]
I'm trying to enforce this criterion by doing something like this
array_index = where( array[:,2] == sample(SelectionWeight, sample_size) )
But I'm not sure if I'm on the right track. Any help would be greatly appreciated!
I can't think of a clever numpythonic way to do this that doesn't involve multiple passes over the data. (Sometimes numpy is so much faster than pure Python that's still the fastest way to go, but it never feels right.)
In pure Python, I'd do something like
import random

def draw_unique(vec, n):
    # group indices by value
    d = {}
    for i, x in enumerate(vec):
        d.setdefault(x, []).append(i)
    # pick n distinct values, then one random index for each of them
    drawn = [random.choice(d[k]) for k in random.sample(list(d), n)]
    return drawn
which would give
>>> a = np.random.randint(0, 10, (41000, 3))
>>> drawn = draw_unique(a[:,2], 3)
>>> drawn
[4219, 6745, 25670]
>>> a[drawn]
array([[5, 6, 0],
       [8, 8, 1],
       [5, 8, 3]])
I can think of some tricks with np.bincount and scipy.stats.rankdata, but they hurt my head, and there always winds up being one step at the end I can't see how to vectorize. And if I'm not vectorizing the whole thing, I might as well use the above, which at least is simple.
I believe this will do what you want. Note that the running time will almost certainly be dominated by whatever method you use to generate your random numbers. (An exception is if the dataset is gigantic but you only need a small number of rows, in which case very few random numbers need to be drawn.) So I'm not sure this will run much faster than a pure python method would.
import random
import numpy as np

# arrayify your list of lists
# please don't use `array` as a variable name!
a = np.asarray(arry)
# sort the rows by the third column ... always the first step for efficiency
a2 = a[np.argsort(a[:, 2])]
# identify rows whose third-column value equals that of the next row
# Note this has length one less than a2
duplicate_rows = np.diff(a2[:, 2]) == 0
# if duplicate_rows[N], then we want to remove row N and N+1
keep_mask = np.ones(len(a2), dtype=bool)  # all True
keep_mask[:-1][duplicate_rows] = False  # remove row N
keep_mask[1:][duplicate_rows] = False  # remove row N + 1
# now actually slice the array
a3 = a2[keep_mask]
# select rows from a3 using your preferred random number generator
# I actually prefer `random` over numpy.random for sampling w/o replacement
result = a3[random.sample(range(len(a3)), DESIRED_NUMBER_OF_ROWS)]
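For the other reading of the question (a sample in which no third-column value repeats, as in `new_array_1`), one possible NumPy-only sketch is to shuffle the rows, keep the first row seen for each distinct value, and then draw from those. The variable names and the fixed seed are assumptions for illustration:

```python
import numpy as np

arr = np.array([[0, 6, 2],
                [5, 3, 9],
                [3, 7, 1],
                [5, 3, 2],
                [3, 1, 1],
                [5, 2, 8]])
sample_size = 4

rng = np.random.default_rng(seed=0)

# Shuffle the rows so the representative of each value is random
shuffled = arr[rng.permutation(len(arr))]

# Index of the first row for every distinct third-column value
_, first_rows = np.unique(shuffled[:, 2], return_index=True)

# Draw rows without replacement from those representatives
chosen = rng.choice(first_rows, size=sample_size, replace=False)
sample = shuffled[chosen]
print(sample)
```

This still makes several passes over the data, and `sample_size` must not exceed the number of distinct values in the third column.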

Two dimensional array in python

I want to know how to declare a two dimensional array in Python.
arr = [[]]
arr[0].append("aa1")
arr[0].append("aa2")
arr[1].append("bb1")
arr[1].append("bb2")
arr[1].append("bb3")
The first two assignments work fine. But when I try to do, arr[1].append("bb1"), I get the following error:
IndexError: list index out of range.
Am I doing anything silly in trying to declare the 2-D array?
Edit:
but I do not know the number of elements in the array (both rows and columns).
You do not "declare" arrays or anything else in python. You simply assign to a (new) variable. If you want a multidimensional array, simply add a new array as an array element.
arr = []
arr.append([])
arr[0].append('aa1')
arr[0].append('aa2')
or
arr = []
arr.append(['aa1', 'aa2'])
There aren't multidimensional arrays as such in Python. What you have is a list containing other lists.
>>> arr = [[]]
>>> len(arr)
1
What you have done is declare a list containing a single list. So arr[0] contains a list but arr[1] is not defined.
You can define a list containing two lists as follows:
arr = [[],[]]
Or to define a longer list you could use:
>>> arr = [[] for _ in range(5)]
>>> arr
[[], [], [], [], []]
What you shouldn't do is this:
arr = [[]] * 3
As this puts the same list in all three places in the container list:
>>> arr[0].append('test')
>>> arr
[['test'], ['test'], ['test']]
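A quick way to see the difference between the two constructions is to compare object identity with `is`. This is only a sketch; the names `shared` and `independent` are illustrative:

```python
# One inner list referenced three times
shared = [[]] * 3
# Three separate inner lists
independent = [[] for _ in range(3)]

print(shared[0] is shared[1])            # same object
print(independent[0] is independent[1])  # different objects

independent[0].append('test')
print(independent)                       # only the first inner list changed
```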
What you're using here are not arrays, but lists (of lists).
If you want multidimensional arrays in Python, you can use Numpy arrays. You'd need to know the shape in advance.
For example:
import numpy as np
arr = np.empty((3, 2), dtype=object)
arr[0, 1] = 'abc'
You are trying to append to the second element of the list, but it does not exist yet.
Create it first:
arr = [[]]
arr[0].append("aa1")
arr[0].append("aa2")
arr.append([])
arr[1].append("bb1")
arr[1].append("bb2")
arr[1].append("bb3")
We can create a multidimensional array dynamically as follows.
Create 2 variables to read x and y from standard input:
print("Enter the value of x: ")
x=int(input())
print("Enter the value of y: ")
y=int(input())
Create a list of lists with initial values filled with 0 (or anything else) using the following code:
z=[[0 for col in range(y)] for row in range(x)]
This creates x rows and y columns for your data.
Read data from standard input:
for i in range(x):
    for j in range(y):
        z[i][j]=input()
Display the result:
for i in range(x):
    for j in range(y):
        print(z[i][j],end=' ')
    print()
Another way to display the dynamically created array above is:
for row in z:
    print(row)
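The same construction without the interactive input, as a small sketch. `make_grid` is a hypothetical helper name, not something from the answer; the outer comprehension runs over rows so that `grid[row][col]` is the indexing order:

```python
def make_grid(rows, cols, fill=0):
    """Return a rows x cols list of lists, each row a separate list."""
    return [[fill for _ in range(cols)] for _ in range(rows)]

grid = make_grid(2, 3)
grid[1][2] = 7   # row 1, column 2

for row in grid:
    print(row)
```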
When constructing multi-dimensional lists in Python I usually use something similar to ThiefMaster's solution, but rather than appending items to index 0, then appending items to index 1, etc., I always use index -1 which is automatically the index of the last item in the array.
i.e.
arr = []
arr.append([])
arr[-1].append("aa1")
arr[-1].append("aa2")
arr.append([])
arr[-1].append("bb1")
arr[-1].append("bb2")
arr[-1].append("bb3")
will produce the 2D-array (actually a list of lists) you're after.
You can first append elements to the initialized array and then for convenience, you can convert it into a numpy array.
import numpy as np
a = [] # start with an empty list
a.append(['aa1']) # append elements
a.append(['aa2'])
a.append(['aa3'])
print(a)
a_np = np.asarray(a) # convert to numpy array
print(a_np)
a = [[] for index in range(n)]  # n empty rows
For competitive programming
1) Reading values into a 2D array
row = int(input())
main_list = []
for i in range(row):
    temp_list = list(map(int, input().split()))
    main_list.append(temp_list)
2) Displaying the 2D array
for i in range(row):
    for j in range(len(main_list[0])):
        print(main_list[i][j], end=" ")
    print()
The above method did not work for me in a for loop where I wanted to transfer data from a 2D array to a new array under an if condition. This method works:
a_2d_list = [[1, 2], [3, 4]]
a_2d_list.append([5, 6])
print(a_2d_list)
OUTPUT - [[1, 2], [3, 4], [5, 6]]
x=3#rows
y=3#columns
a=[]#create an empty list first
for i in range(x):
    a.append([0]*y)#append a row of y zeros to the original list
    for j in range(y):
        a[i][j]=input("Enter the value")
In my case I had to do this:
for index, user in enumerate(users):
    table_body.append([])
    table_body[index].append(user.user.id)
    table_body[index].append(user.user.username)
Output:
[[1, 'john'], [2, 'bill']]
