How can I remove the rows of my dataset using pandas? - python

Here's the dataset I'm dealing with (called depo_dataset):
Some entries in the columns starting from the second one (0.0) might be 0.000...; the goal is, for each of those columns, to generate a separate array of Energy with the 0.0 entries in that column and their associated Energy values removed. I'm trying to use a mask in pandas. Here's what I tried:
for column in depo_dataset.columns[1:]:
    e = depo_dataset['Energy'].copy()
    mask = depo_dataset[column] == 0
Then I don't know how to drop the 0 entry (assume there is one) and the corresponding element of e.
For instance, suppose depo_dataset['0.0'] is 0.4, 0.0, 0.4, 0.1, and depo_dataset['Energy'] is 0.82, 0.85, 0.87, 0.90. I hope to drop the 0.0 entry in depo_dataset['0.0'] and the corresponding 0.85 in depo_dataset['Energy'].
Thanks for the help!

You can just use .loc on the DataFrame to filter out rows. Here's a little example:
import pandas as pd

df = pd.DataFrame({
    'Energy': [0.82, 0.85, 0.87, 0.90],
    0.0: [0.4, 0.0, 0.4, 0.1],
    0.1: [0.0, 0.3, 0.4, 0.1]
})
energies = {}
for column in df.columns[1:]:
    energies[column] = df.loc[df[column] != 0, ['Energy', column]]
energies[0.0]

You can use .loc:
import pandas as pd

depo_dataset = pd.DataFrame({'Energy': [0.82, 0.85, 0.87, 0.90],
                             '0.0': [0.4, 0.0, 0.4, 0.1],
                             '0.1': [1, 2, 3, 4]})
dataset_no_zeroes = depo_dataset.loc[(depo_dataset.iloc[:, 1:] != 0).all(axis=1), :]
Explanation:
(depo_dataset.iloc[:, 1:] != 0)
builds a DataFrame of booleans from all columns beginning with the second one, indicating whether each cell is non-zero.
.all(axis=1)
reduces each row (axis=1) and returns True only if all values in that row are True, i.e. the row contains no zeroes in those columns.
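Note that the .all(axis=1) approach drops a row if any of the later columns is zero, whereas the question asks for one filtered Energy array per column. A minimal per-column sketch using the same boolean-mask idea (column names and data taken from the question):

```python
import pandas as pd

depo_dataset = pd.DataFrame({'Energy': [0.82, 0.85, 0.87, 0.90],
                             '0.0':    [0.4, 0.0, 0.4, 0.1],
                             '0.1':    [1, 2, 3, 4]})

energies = {}
for column in depo_dataset.columns[1:]:
    mask = depo_dataset[column] != 0          # keep rows where this column is non-zero
    energies[column] = depo_dataset.loc[mask, 'Energy'].to_numpy()

print(energies['0.0'])  # Energy values where column '0.0' is non-zero: 0.82, 0.87, 0.90
```

Each entry of `energies` is an independent Energy array, so zeroes in one column do not affect the others.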

Related

Selecting items on a matrix based on indexes given by an array

Consider this matrix:
[0.9, 0.45, 0.4, 0.35],
[0.4, 0.8, 0.3, 0.25],
[0.5, 0.45, 0.9, 0.35],
[0.2, 0.18, 0.8, 0.1],
[0.6, 0.45, 0.4, 0.9]
and this list:
[0,1,2,3,3]
I want to create a list that looks like the following:
[0.9, 0.8, 0.9, 0.1, 0.9]
To clarify, for each row, I want the element of the matrix whose column index is contained in the first array. How can I accomplish this?
Zip the two lists together as below
a=[[0.9, 0.45, 0.4, 0.35],[0.4, 0.8, 0.3, 0.25],[0.5, 0.45, 0.9, 0.35],[0.2, 0.18, 0.8, 0.1],[0.6, 0.45, 0.4, 0.9]]
b=[0,1,2,3,3]
[i[j] for i,j in zip(a,b)]
Result
[0.9, 0.8, 0.9, 0.1, 0.9]
This basically pairs up each row of the matrix with the corresponding index from your second list via zip(a, b).
Then for each pair (i, j) you pick the j-th element of row i.
If this is a numpy array, you can pass in two numpy arrays to access the desired indices:
import numpy as np
data = np.array([[0.9, 0.45, 0.4, 0.35],
[0.4, 0.8, 0.3, 0.25],
[0.5, 0.45, 0.9, 0.35],
[0.2, 0.18, 0.8, 0.1],
[0.6, 0.45, 0.4, 0.9]])
indices = np.array([0,1,2,3,3])
data[np.arange(data.shape[0]), indices]
This outputs:
[0.9 0.8 0.9 0.1 0.9]
In the list [0, 1, 2, 3, 3], the row is determined by the index of each element, and the value at that index is the column. This is a good case for enumerate:
matrix = [[ ... ], [ ... ], ...] # your matrix
selections = [0, 1, 2, 3, 3]
result = [matrix[i][j] for i, j in enumerate(selections)]
This will be much more efficient than looping through the entire matrix.
Loop through both arrays together using the zip function.
def create_array_from_matrix(matrix, indices):
    if len(matrix) != len(indices):
        return None
    res = []
    for row, index in zip(matrix, indices):
        res.append(row[index])
    return res

How to extract data point from two numpy arrays based on two conditions?

I have two numpy arrays, x and y. I want to extract the value of x that is closest to 1 and also has a y value greater than 0.96, and then get the index of that value.
x = [0.5, 0.8, 0.99, 0.8, 0.85, 0.9, 0.91, 1.01, 10, 20]
y = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.99, 0.99, 0.85]
In this case the x value would be 1.01 because it is closest to 1 and has a y value of 0.99.
Ideal result would be:
idx = 7
I know how to find the point nearest to 1 and how to get the index of it but I don't know how to add the second condition.
This code also works when there are multiple indexes satisfying the condition.
import numpy as np
x = [0.5, 0.8, 0.99, 0.8, 0.85, 0.9, 0.91, 1.01, 10, 20]
y = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.99, 0.99, 0.85]
# differences
first_check = np.abs(np.array(x) - 1)
# extracting index of the value that x is closest to 1
# (indexes in case there are two or more values that are closest to 1)
indexes = np.where(first_check == np.min(first_check))[0]
indexes = [index for index in indexes if y[index] > 0.96]
print(indexes)
OUTPUT:
[7]
You can use np.argsort(abs(x - 1)) to sort the indices according to the closest value to 1. Then, grab the first y index that satisfies y > 0.96 using np.where.
import numpy as np
x = np.array([0.5, 0.8, 0.99, 0.8, 0.85, 0.9, 0.91, 1.01, 10, 20])
y = np.array([0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.99, 0.99, 0.85])
closest_inds = np.argsort(abs(x - 1))
idx = closest_inds[np.where(y[closest_inds] > 0.96)][0]
This would give:
idx = 7
For short arrays (shorter than, say, 10k elements), the solution above can be relatively slow: NumPy still has no findfirst (see this long-awaited feature request), and each vectorized call pays fixed overhead even though only the first hit is needed.
So, in this case, the following loop is much faster and gives the same result:
for i in closest_inds:
    if y[i] > 0.96:
        idx = i
        break
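As a middle ground between np.where and a Python loop, np.argmax on a boolean array returns the position of its first True in compiled code, which can serve as a makeshift findfirst; a sketch, with the caveat that argmax returns 0 when no element satisfies the condition, so that case must be checked separately:

```python
import numpy as np

x = np.array([0.5, 0.8, 0.99, 0.8, 0.85, 0.9, 0.91, 1.01, 10, 20])
y = np.array([0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.99, 0.99, 0.85])

closest_inds = np.argsort(np.abs(x - 1))       # indices sorted by closeness to 1
hits = y[closest_inds] > 0.96                  # boolean array in that order
first_hit = np.argmax(hits)                    # position of the first True (0 if none!)
idx = closest_inds[first_hit]
print(idx)  # 7
```

It still scans the whole array, but without per-element Python overhead.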
This will work on multiple conditions and lists.
x = [0.5, 0.8, 0.99, 0.8, 0.85, 0.9, 0.91, 1.01, 10, 20]
y = [0.7, 0.75, 0.8, 0.85, 0.9, 0.95, 0.99, 0.99, 0.99, 0.85]
condition1 = 1.0
condition2 = 0.96
def convert(*args):
    """
    Returns a list of tuples generated from multiple lists and tuples
    """
    for x in args:
        if not isinstance(x, (list, tuple)):
            return []
    size = min(len(x) for x in args)
    result = []
    for i in range(size):
        result.append(tuple(x[i] for x in args))
    print(result)
    return result

result = convert(x, y)
closest = min([tpl for tpl in result if tpl[0] >= condition1 and tpl[1] > condition2],
              key=lambda x: x[1])
index = result.index(closest)
print(f'The index of the closest numbers of x-list to 1 and y-list to 0.96 is {index}')
Output
[(0.5, 0.7), (0.8, 0.75), (0.99, 0.8), (0.8, 0.85), (0.85, 0.9), (0.9, 0.95), (0.91, 0.99), (1.01, 0.99), (10, 0.99), (20, 0.85)]
The index of the closest numbers of x-list to 1 and y-list to 0.96 is 7

Return the biggest value less than one from a numpy vector

I have a numpy vector in python and I want to find the index of the max value of the vector with the condition that it is less than one. I have as an example the following:
temp_res = [0.9, 0.8, 0.7, 0.99, 1.2, 1.5, 0.1, 0.5, 0.1, 0.01, 0.12, 0.56, 0.89, 0.23, 0.56, 0.78]
temp_res = np.asarray(temp_res)
indices = np.where((temp_res == temp_res.max()) & (temp_res < 1))
However, what I tried always returns an empty result, since those two conditions cannot both be met. I want to return as the final result index = 3, which corresponds to 0.99, the biggest value that is less than 1. How can I do so?
You need to perform the max() function after filtering your array:
temp_res = np.asarray(temp_res)
temp_res[temp_res < 1].max()
Out[60]: 0.99
If you want to find all the indexes, here is a more general approach:
mask = temp_res < 1
indices = np.where(mask)
maximum = temp_res[mask].max()
max_indices = np.where(temp_res == maximum)
Example:
temp_res = [0.9, 0.8, 0.7, 1, 0.99, 0.99, 1.2, 1.5, 0.1, 0.5, 0.1, 0.01, 0.12, 0.56, 0.89, 0.23, 0.56, 0.78]
temp_res = np.asarray(temp_res)
mask = temp_res < 1
indices = np.where(mask)
maximum = temp_res[mask].max()
max_indices = np.where(temp_res == maximum)

In [72]: max_indices
Out[72]: (array([4, 5]),)
You can use:
np.where(temp_res == temp_res[temp_res < 1].max())[0]
Example:
In [49]: temp_res
Out[49]:
array([0.9 , 0.8 , 0.7 , 0.99, 1.2 , 1.5 , 0.1 , 0.5 , 0.1 , 0.01, 0.12,
0.56, 0.89, 0.23, 0.56, 0.78])
In [50]: np.where(temp_res == temp_res[temp_res < 1].max())[0]
...:
Out[50]: array([3])
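An alternative sketch that skips the intermediate maximum: replace out-of-range entries with -inf and take argmax directly (this returns the first qualifying index when the maximum is tied):

```python
import numpy as np

temp_res = np.asarray([0.9, 0.8, 0.7, 0.99, 1.2, 1.5, 0.1, 0.5,
                       0.1, 0.01, 0.12, 0.56, 0.89, 0.23, 0.56, 0.78])

# entries >= 1 become -inf, so they can never win the argmax
idx = np.where(temp_res < 1, temp_res, -np.inf).argmax()
print(idx, temp_res[idx])  # 3 0.99
```

This does a single pass over the masked copy instead of two separate comparisons.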

find closest set in N-dimensional numpy array

Let's assume I have a numpy array composed of the combinations of all this ranges:
A = numpy.arange(0.05, 7.05, 0.5)
B = numpy.arange(0.0,0.75, 0.15 )
C = numpy.arange(8.0, 12.0, 0.15)
D = numpy.arange(-2, 2, 0.15)
E = [0.5, 0.55, 0.6, 0.65, 0.7]
F = numpy.arange(0.1,2.1, 0.1)
G = [0.5, 1.0, 1.5]
The resulting array of all combinations, also called A, has 7 rows and 15,309,000 columns.
Now I have this group:
group = [a, b, c, d, e, f, g] = [1.22, 0.34, 9.45, -1.43, 0.52, 0.23, 0.64]
I am looking for a way to find the combination in A that is closest to this group, in the sense that the selected set (closest combination) is the one where every value is closest to the corresponding value of the group: a would be the closest value in A, b the closest value in B, etc.
For this particular case, the closest group would be:
closest_group = [1.05, 0.3, 9.5, -1.4, 0.5, 0.2, 0.5]
I have found different answers, for example with scipy.spatial.KDTree:
A[spatial.KDTree(A).query(set)[1]]
but unfortunately this does not converge, raising this error:
RecursionError: maximum recursion depth exceeded
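One possible approach, sketched under the question's own per-dimension notion of "closest": since each coordinate comes from an independent 1-D range, each component can be snapped to its nearest candidate separately, avoiding any search over the 15-million-combination product (ranges and group copied from the question):

```python
import numpy as np

ranges = [np.arange(0.05, 7.05, 0.5),
          np.arange(0.0, 0.75, 0.15),
          np.arange(8.0, 12.0, 0.15),
          np.arange(-2, 2, 0.15),
          np.array([0.5, 0.55, 0.6, 0.65, 0.7]),
          np.arange(0.1, 2.1, 0.1),
          np.array([0.5, 1.0, 1.5])]

group = [1.22, 0.34, 9.45, -1.43, 0.52, 0.23, 0.64]

# for each dimension, pick the candidate with the smallest absolute distance
closest_group = [r[np.abs(r - g).argmin()] for r, g in zip(ranges, group)]
print(closest_group)
```

For this input it reproduces the closest_group listed above (up to floating-point noise from arange). A KDTree would only be needed if "closest" meant a joint distance over all seven dimensions at once.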

Python, pandas: how to extract values from a symmetric, multi-index dataframe

I have a symmetric, multi-index dataframe from which I want to systematically extract data:
import pandas as pd
df_index = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], [1, 2, 3, 4]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 0.3, -0.4],
     [0.5, 1.0, 0.9, -0.8],
     [0.3, 0.9, 1.0, 0.1],
     [-0.4, -0.8, 0.1, 1.0]],
    index=df_index, columns=df_index)
I want a function extract_vals that can return all values related to elements in the same group, EXCEPT for the diagonal AND elements must not be double-counted. Here are two examples of the desired behavior (order does not matter):
A_vals = extract_vals("A", df) # [0.5, 0.3, -0.4, 0.9, -0.8]
B_vals = extract_vals("B", df) # [0.3, 0.9, 0.1, -0.4, -0.8]
My question is similar to this question on SO, but my situation is different because I am using a multi-index dataframe.
Finally, to make things more fun, please consider efficiency because I'll be running this many times on much bigger dataframes. Thanks very much!
EDIT:
Happy001's solution is awesome. I came up with a method myself based on the logic of extracting the elements where target is NOT in BOTH the rows and columns, and then extracting the lower triangle of those elements where target IS in BOTH the rows and columns. However, Happy001's solution is much faster.
First, I created a more complex dataframe to make sure both methods are generalizable:
import pandas as pd
import numpy as np
df_index = pd.MultiIndex.from_arrays(
    [["A", "B", "A", "B", "C", "C"], [1, 2, 3, 4, 5, 6]], names=["group", "id"])
df = pd.DataFrame(
    [[1.0, 0.5, 1.0, -0.4, 1.1, -0.6],
     [0.5, 1.0, 1.2, -0.8, -0.9, 0.4],
     [1.0, 1.2, 1.0, 0.1, 0.3, 1.3],
     [-0.4, -0.8, 0.1, 1.0, 0.5, -0.2],
     [1.1, -0.9, 0.3, 0.5, 1.0, 0.7],
     [-0.6, 0.4, 1.3, -0.2, 0.7, 1.0]],
    index=df_index, columns=df_index)
Next, I defined both versions of extract_vals (the first is my own):
def extract_vals(target, multi_index_level_name, df):
    # Extract entries where target is in the rows but NOT also in the columns
    target_in_rows_but_not_in_cols_vals = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) != target]
    # Extract entries where target is in the rows AND in the columns
    target_in_rows_and_cols_df = df.loc[
        df.index.get_level_values(multi_index_level_name) == target,
        df.columns.get_level_values(multi_index_level_name) == target]
    mask = np.triu(np.ones(target_in_rows_and_cols_df.shape), k=1).astype(bool)
    vals_with_nans = target_in_rows_and_cols_df.where(mask).values.flatten()
    target_in_rows_and_cols_vals = vals_with_nans[~np.isnan(vals_with_nans)]
    # Append both arrays of extracted values
    vals = np.append(target_in_rows_but_not_in_cols_vals, target_in_rows_and_cols_vals)
    return vals

def extract_vals2(target, multi_index_level_name, df):
    # Get indices for what you want to extract and then extract all at once
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
        df.index.get_level_values(multi_index_level_name)[i] == target or
        df.columns.get_level_values(multi_index_level_name)[j] == target)]
    return df.values[tuple(np.transpose(coord))]
I checked that both functions returned output as desired:
# Expected values
e_A_vals = np.sort([0.5, 1.0, -0.4, 1.1, -0.6, 1.2, 0.1, 0.3, 1.3])
e_B_vals = np.sort([0.5, 1.2, -0.8, -0.9, 0.4, -0.4, 0.1, 0.5, -0.2])
e_C_vals = np.sort([1.1, -0.9, 0.3, 0.5, 0.7, -0.6, 0.4, 1.3, -0.2])
# Sort because order doesn't matter
assert np.allclose(np.sort(extract_vals("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals("C", "group", df)), e_C_vals)
assert np.allclose(np.sort(extract_vals2("A", "group", df)), e_A_vals)
assert np.allclose(np.sort(extract_vals2("B", "group", df)), e_B_vals)
assert np.allclose(np.sort(extract_vals2("C", "group", df)), e_C_vals)
And finally, I checked speed:
## Test speed
import time

# Method 1
start1 = time.time()
for ii in range(10000):
    out = extract_vals("C", "group", df)
elapsed1 = time.time() - start1
print(elapsed1)  # 28.5 sec

# Method 2
start2 = time.time()
for ii in range(10000):
    out2 = extract_vals2("C", "group", df)
elapsed2 = time.time() - start2
print(elapsed2)  # 10.9 sec
I don't assume df has the same columns and index. (Of course they can be the same).
def extract_vals(group_label, df):
    coord = [[i, j] for i in range(len(df)) for j in range(len(df)) if i < j and (
        df.index.get_level_values('group')[i] == group_label or
        df.columns.get_level_values('group')[j] == group_label)]
    return df.values[tuple(np.transpose(coord))]

print(extract_vals('A', df))
print(extract_vals('B', df))
result:
[ 0.5 0.3 -0.4 0.9 -0.8]
[ 0.3 -0.4 0.9 -0.8 0.1]
is that what you want?
all elements above the diagonal:
In [139]: df.values[np.triu_indices(len(df), 1)]
Out[139]: array([ 0.5, 0.3, -0.4, 0.9, -0.8, 0.1])
A_vals:
In [140]: df.values[np.triu_indices(len(df), 1)][:-1]
Out[140]: array([ 0.5, 0.3, -0.4, 0.9, -0.8])
B_vals:
In [141]: df.values[np.triu_indices(len(df), 1)][1:]
Out[141]: array([ 0.3, -0.4, 0.9, -0.8, 0.1])
Source matrix:
In [142]: df.values
Out[142]:
array([[ 1. , 0.5, 0.3, -0.4],
[ 0.5, 1. , 0.9, -0.8],
[ 0.3, 0.9, 1. , 0.1],
[-0.4, -0.8, 0.1, 1. ]])
