Numpy: how to select full row based on some columns - python

import numpy as np

a = np.array([[1., 2., 3.],
              [3., 4., 2.],
              [8., 1., 3.]])
b = [8., 1.]
c = a[np.isclose(a[:, 0:2], b)]
print(c)
I want to select full rows in a based on a condition on only a few of the columns. My attempt is above.
It works if I also include the last column in the condition, but I don't care about the last column. How do I select full 3-column rows based on a condition on just 2 of them?

Compare with np.isclose using the sliced version of a, then look for all matches along each row, for which we can use np.all or np.logical_and.reduce. Finally, index into the input array for the output.
Hence, two solutions -
a[np.isclose(a[:, :2], b).all(axis=1)]
a[np.logical_and.reduce(np.isclose(a[:, :2], b), axis=1)]
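Putting the first solution together with the arrays from the question:
import numpy as np

a = np.array([[1., 2., 3.],
              [3., 4., 2.],
              [8., 1., 3.]])
b = [8., 1.]

c = a[np.isclose(a[:, :2], b).all(axis=1)]
print(c)  # [[8. 1. 3.]] -- the full row whose first two columns match b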

Related

min of all columns of the dataframe in a range

I want to find the column-wise min of a dataframe, restricted to only a few columns.
For example, consider a dataframe of size 10*100: I want the min over just the middle 5 columns (a 10*5 selection).
I know how to find the min using df.min(axis=0), but I don't know how to restrict it to those columns. Thanks for the help.
I'm using the pandas library.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
start = int(n_columns/2 - 2.5)
end = start + 5
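For instance, a rough self-contained sketch, assuming n_columns is taken from df.shape[1] and the columns are addressed by integer position:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 100))  # 10 rows, 100 columns

n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)  # 47
end = start + 5                   # 52

# Column-wise min over just the middle 5 columns -> a Series of length 5.
print(df.iloc[:, start:end].min(axis=0))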
Following pciunkiewicz's logic:
First you should select the columns you want. You can use either .loc[..] or .iloc[..].
With .loc you use the names of the columns. When it takes two arguments, the first one is the row index and the second is the columns.
df.loc[[rows], [columns]]  # The filter data should be inside the brackets.
df.loc[:, [columns]]       # This will consider all rows.
You can also use .iloc. In this case you locate the data with integers, so you don't need to know the names of the columns, only their positions.
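For example, a small illustration of both (made-up column names, selecting two columns either by name or by position):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=['a', 'b', 'c', 'd'])

# Select columns by name with .loc, or by integer position with .iloc,
# then take the column-wise min of just that selection.
print(df.loc[:, ['b', 'c']].min(axis=0))
print(df.iloc[:, 1:3].min(axis=0))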

Numpy where matching two specific columns

I have a six column matrix. I want to find the row(s) where BOTH columns match the query.
I've been trying to use numpy.where, but I can't specify it to match just two columns.
#Example of the array
x = np.array([[860259, 860328, 861277, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 865534, 865716],
              [860259, 860328, 861301, 861393, 871151, 871173]])
print(x)
#Match first column of interest
A = np.where(x[:,2] == 861301)
#Match second column of interest
B = np.where(x[:,3] == 861393)
#rows in both A and B
np.intersect1d(A, B)
#This approach works, but is not column specific for the intersect, leaving me with extra rows I don't want.
#This is the only way I can get Numpy to match the two columns, but
#when I query I will not actually know the values of columns 0,1,4,5.
#So this approach will not work.
#Specify what row should exactly look like
np.where(all([860259, 860328, 861277, 861393, 865534, 865716]))
#I want something like this:
#Where * could be any number. But I think that this approach may be
#inefficient. It would be best to just match column 2 and 3.
np.where(all([*, *, 861277, 861393, *, *]))
I'm looking for an efficient answer, because I am looking through a 150GB HDF5 file.
Thanks for your help!
If I understand you correctly, you can use slightly more advanced slicing, like this:
np.where(np.all(x[:,2:4] == [861277, 861393], axis=1))
This will give you only the rows where these two columns equal [861277, 861393].
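If you want the matching rows themselves rather than their indices, the same boolean mask can index x directly (continuing with x as defined in the question):
mask = np.all(x[:, 2:4] == [861277, 861393], axis=1)
print(x[mask])            # the full matching row(s)
print(np.where(mask)[0])  # their row indices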

Based on condition=True in a column, filling random values into a particular column

I need to work on a column and, based on a condition (if it is True), fill those entries with random numbers (not a constant string/number). I tried a for loop and it works, but is there a faster way, along the lines of np.select or np.where conditions?
I have written the for loop and it works:
The 'NUMBER' column has a few entries greater than 1000, and I need to replace each of them with a random float between 120 and 123, not the same value for every entry. I have used np.random.uniform and that works too.
for i in range(0, len(data['NUMBER'])):
    if data['NUMBER'][i] >= 1000:
        data['NUMBER'][i] = np.random.uniform(120, 123)
The output of this code fills each matching entry with a different random value between 120 and 123; after the replacement the entries are:
0      7.139093
1     12.592815
2     12.712103
3    120.305773
4     11.941386
5    122.548703
6      6.357255
...
But when using np.select or np.where as shown below (which should run faster), every entry satisfying the condition was replaced by the same single number. For example, instead of getting different values at indexes 3 and 5 as shown above, they all receive the same value between 120 and 123. Please guide me here.
data['NUMBER'] =np.where(data['NUMBER'] >= 1000,np.random.uniform(120,123), data['NUMBER'])
data['NUMBER'] = np.select([data['NUMBER'] >=1000],[np.random.uniform(120,123)], [data['NUMBER']])
np.random.uniform(120, 123) is a single random number:
In [1]: np.random.uniform(120, 123)
Out[1]: 120.51317994772921
Use the size parameter to make an array of random numbers:
In [2]: np.random.uniform(120, 123, size=5)
Out[2]:
array([122.22935075, 122.70963032, 121.97763459, 121.68375085,
121.13568039])
Passing this to np.where (as the second argument) allows np.where to select from this array when the condition is True:
data['NUMBER'] = np.where(data['NUMBER'] >= 1000,
np.random.uniform(120, 123, size=len(data)),
data['NUMBER'])
Use np.select when there is more than one condition. Since there is only one condition here, use np.where.
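A small self-contained illustration of the np.where version (the column name NUMBER is taken from the question; the data here is made up):
import numpy as np
import pandas as pd

data = pd.DataFrame({'NUMBER': [7.1, 12.5, 1500.0, 11.9, 2200.0]})

# One random draw per row; np.where picks from this array wherever the condition holds,
# so each replaced entry gets its own random value between 120 and 123.
data['NUMBER'] = np.where(data['NUMBER'] >= 1000,
                          np.random.uniform(120, 123, size=len(data)),
                          data['NUMBER'])
print(data)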

How to do a row-wise comparison of strings in a single matrix using NumPy

How do I do a row-wise comparison of strings within a single NumPy matrix, where row 1 is compared with all other rows, then row 2 with all other rows, and so on? The comparison goes column by column: the first row's first column is compared with the second row's first column, and likewise for the other columns. I should apply this technique across multiple columns, and wherever a match is found the variable score should be incremented by 1; if there is no match, or if a field is missing (like nan), the score should remain the same.
vector_col1 = np.array(data_list1)
for i in range(0, len(data_list1) - 1):
    skill_score = 0
    if ((data_list1[0] and data_list1[i+1]) == 'nan'):
        skill_score = 0
    if (data_list1[0] == data_list1[i+1]):
        skill_score = skill_score + 1
    vector_col1[i] = skill_score
print(vector_col1)
I expect the output to be 1 for a matched score, but the actual output is 0.
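A minimal vectorized sketch of one way to compute such pairwise match scores, assuming the data is a 2-D NumPy array of strings (the array name data and the example values below are made up) and that missing fields are stored as the literal string 'nan':
import numpy as np

# Hypothetical example: 3 rows, 3 string columns, 'nan' marks a missing field.
data = np.array([['a', 'b', 'c'],
                 ['a', 'x', 'c'],
                 ['nan', 'b', 'c']])

# Compare every pair of rows column by column via broadcasting.
equal = data[:, None, :] == data[None, :, :]                       # shape (rows, rows, cols)
valid = (data[:, None, :] != 'nan') & (data[None, :, :] != 'nan')  # mask out missing fields

# scores[i, j] = number of columns where rows i and j match;
# missing fields never add to the score.
scores = (equal & valid).sum(axis=2)
print(scores)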

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. All the columns contain NaN values (some columns contain more NaNs than others), and I want to drop the columns having at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code:
df1 = df.apply(lambda x: x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in column x by the length of x, and the < 0.8 part selects the columns containing less than 80% NaNs.
The problem is that when I run this code I only get the names of the columns together with the boolean True, but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, find which columns contain fewer than 80% NaNs. Second, keep only those columns in your DataFrame (and discard the rest).
To get a pandas Series indicating whether a column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(By the way, if your original attempt appended == True, you can drop that part; it only tests whether the boolean result equals True.)
This gives True for every column to keep, because .isnull() yields True (or 1) for a NaN and False (or 0) for a valid number, element by element. .sum(axis=0) then sums down each column, giving the number of NaNs per column, and the comparison checks whether that count is below 80% of the number of rows.
For the second task, you can use this to index your columns by using:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
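A small made-up example showing the first approach end to end:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                   'b': [np.nan, np.nan, np.nan, np.nan, 1],
                   'c': [1, np.nan, 3, 4, 5]})

filt = df.isnull().sum() / len(df) < 0.8   # keep columns with fewer than 80% NaNs
df1 = df.loc[:, filt]
print(df1.columns.tolist())  # ['a', 'c'] -- 'b' is 80% NaN and is dropped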
