merge columns in numpy matrix - python

I have a NumPy matrix like this one (it could have several columns, this is just an example):
array([[nan, nan],
       [nan, nan],
       ['value 1', nan],
       [nan, nan],
       [nan, nan],
       [nan, 'value 2']], dtype=object)
I need to merge all columns in this matrix, replacing nan values with the corresponding non-nan value (if one exists). Example output:
array([[nan],
       [nan],
       ['value 1'],
       [nan],
       [nan],
       ['value 2']], dtype=object)
Is there a way to achieve this with some built-in function in NumPy?
EDIT: if there is more than one non-nan value in a single row, I will take the first non-nan value.
Values could be strings, floats or ints.

Find the rows where the first column is nan. This works because nan!=nan:
rows = arr[:,0] != arr[:,0]
Update the first element of each chosen row with the second element:
arr[rows,0] = arr[rows,1]
Select the first column:
arr[:,[0]]
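Putting those steps together, a minimal runnable sketch of the two-column case, using the example array from the question:

import numpy as np

nan = np.nan
arr = np.array([[nan, nan],
                [nan, nan],
                ['value 1', nan],
                [nan, nan],
                [nan, nan],
                [nan, 'value 2']], dtype=object)

rows = arr[:, 0] != arr[:, 0]   # True only where the first column is nan
arr[rows, 0] = arr[rows, 1]     # fill those slots from the second column
merged = arr[:, [0]]            # keep the result as a single-column matrix
print(merged)

With more than two columns you would repeat the fill step once per extra column, which keeps the first non-nan value in each row, as the edit to the question requests.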

Related

Returning the index of the first NaN and number in every row of a matrix

I have a numpy array; for every row I'd like to know the position of the first NaN and whether there are numbers in the row after that NaN.
Let's say I have this matrix:
[[nan, nan, 0.0, 1.0, nan],
 [0.0, nan, nan, nan, 0.0]]
I'd like to return two vectors, each with a length equal to the number of matrix rows, where the first contains the index of the first NaN in each row and the second the position of the first number in each row.
nan_vector = [0, 1]
num_vector = [2, 0]
How do I achieve this?
Use np.isnan with np.argmax and np.argmin:
nan_vector = np.isnan(a).argmax(axis=1)
num_vector = np.isnan(a).argmin(axis=1)
Output:
>>> nan_vector
array([0, 1])
>>> num_vector
array([2, 0])
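A self-contained sketch of the same answer, with the example matrix from the question bound to the assumed name a:

import numpy as np

a = np.array([[np.nan, np.nan, 0.0, 1.0, np.nan],
              [0.0, np.nan, np.nan, np.nan, 0.0]])

mask = np.isnan(a)
nan_vector = mask.argmax(axis=1)   # index of the first True (first NaN) per row
num_vector = mask.argmin(axis=1)   # index of the first False (first number) per row
print(nan_vector, num_vector)      # [0 1] [2 0]

Note that argmax (and argmin) return 0 when a row contains no True (or no False) value, so a row with no NaN at all would also report index 0.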

Efficient and elegant way to fetch unique values from all columns - Big data

I have a dataframe with more than 600 columns. I have given a sample dataframe with a few columns here:
df_new = pd.DataFrame({'person_id': [1, 2, 3],
                       'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                       'hero_id': [2, 4, np.nan],
                       'date2': ['12/31/2017', np.nan, '10/06/2015'],
                       'heroine_id': [1, np.nan, 5],
                       'date3': ['12/31/2027', '11/25/2029', np.nan],
                       'bud_source_value': [1250000, 250000, np.nan],
                       'prod__source_value': [10000, 20000, np.nan]})
I would like to fetch the unique values from each column and output them in another dataframe.
These are the two approaches that I tried:
cols = df_new.columns.tolist()
unique_list = dict()
for c in cols:  # approach 1
    unique_list[c] = df_new[c].unique()
for c in cols:  # approach 2
    unique_list[c] = df_new[c].drop_duplicates()
Is there any way to do this in one go, without a loop? Please note that I expect the unique values from each column, not the unique rows of the dataframe.
As my data has over a million records and more than 600 columns, any suggestions/solutions to improve performance would be helpful.
You could try:
print({k:v.drop_duplicates().tolist() for k,v in df_new.items()})
Output:
{'bud_source_value': [1250000.0, 250000.0, nan], 'date2': ['12/31/2017', nan, '10/06/2015'], 'date3': ['12/31/2027', '11/25/2029', nan], 'hero_id': [2.0, 4.0, nan], 'heroine_id': [1.0, nan, 5.0], 'obs_date': ['12/31/2007', '11/25/2009', nan], 'person_id': [1, 2, 3], 'prod__source_value': [10000.0, 20000.0, nan]}
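If the result really has to land in another dataframe, as the question asks, one possible follow-up to the dict comprehension above (wrapping each column in a pd.Series so that shorter columns are padded with NaN) could look like this:

import pandas as pd

# df_new is assumed to be the sample dataframe defined in the question
uniques = {c: pd.Series(s.drop_duplicates().values) for c, s in df_new.items()}
df_uniques = pd.DataFrame(uniques)
print(df_uniques)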
Convert the dataframe into a numpy array and do the following:
df_new = np.array(df_new)
unique_list = np.unique(df_new,axis=1)
Numpy is much faster!

pandas simple pairwise occurrence

Pandas has a corr function to create a table of mutual correlation coefficients in the presence of sparse data. But how do I calculate the number of mutual occurrences in the data instead of the correlation coefficient?
i.e.
A = [NaN, NaN, 3]
B = [NaN, NaN, 8]
F(A,B) = 1
A = [1, NaN, NaN]
B = [NaN, NaN, 8]
F(A,B) = 0
I need pandas.DataFrame([A,B]).<function>() -> matrix of occurrences
In pandas, you may want to use dropna: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
You can do something like
co_occur = df.dropna(how = "any")
the_count = co_occur.shape[0] # number of remaining rows
This will drop all rows where there is any NaN (thereby leaving you only with rows that contain values for every variable) and then count the number of remaining rows.
Alternatively, you could do it with plain lists (as in your code above), assuming the lists are the same length and checking for NaN explicitly (note that nan itself is truthy, so a plain "if A[i] and B[i]" would not filter it out):
from math import isnan, nan

A = [nan, nan, 3]
B = [nan, nan, 8]
co_occur = len([i for i in range(len(A)) if not (isnan(A[i]) or isnan(B[i]))])
I am using numpy
sum(np.sum(~np.isnan(np.array([A,B])),0)==2)
Out[335]: 1
For your second case:
sum(np.sum(~np.isnan(np.array([A,B])),0)==2)
Out[337]: 0
With pandas
(df.A.notnull() & df.B.notnull()).sum()
Or
df.notnull().all(axis=1).sum()
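For the full matrix of pairwise counts that the question asks for (a corr-style table), one hedged sketch is to take the dot product of the notnull mask with itself; the column names A, B, C here are just illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 3],
                   'B': [np.nan, np.nan, 8],
                   'C': [1, np.nan, np.nan]})

mask = df.notnull().astype(int)
co_occurrence = mask.T.dot(mask)   # [i, j] = number of rows where both columns are non-NaN
print(co_occurrence)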

Numpy reading data from '.npy' file directly into arrays

This might be a silly question, but I can't seem to find an answer for it. I have a large array that I've previously saved using np.save, and now I'd like to load the data into a new file, creating a separate list from each column. The only issue is that some of the rows in my large array only have a single nan value, so the array looks something like this (as an extremely simplified example):
np.array([[5, 12, 3],
          [nan],
          [10, 13, 9],
          [nan],
          [nan]])
I can use a for loop to achieve what I want, but I was wondering if there was a better way than this:
import numpy as np
results = np.load('data.npy')
depth, upper, lower = [], [], []
for item in results:
    if len(item) > 1:
        depth.append(item[0])
        upper.append(item[1])
        lower.append(item[2])
    else:
        depth.append(np.nan)
        upper.append(np.nan)
        lower.append(np.nan)
My desired output would look like:
depth = [5,nan,10,nan,nan]
upper = [12,nan,13,nan,nan]
lower = [3,nan,9,nan,nan]
Thanks for your help! I realize I should have previously altered the code that creates the "data.npy" file, so that it has the same number of columns for each row, but that code already takes hours to run and I'd rather avoid that!
With varying-length subarrays, this is a dtype=object array. For most purposes it is the same as a list of these subarrays, so most actions will require iteration.
A variant on your approach would be a list comprehension:
In [61]: dd=[[nan,nan,nan] if len(i)==1 else i for i in d]
In [62]: dd
Out[62]: [[5, 12, 3], [nan, nan, nan], [10, 13, 9], [nan, nan, nan], [nan, nan, nan]]
Your three target arrays are then columns of:
In [63]: np.array(dd)
Out[63]:
array([[  5.,  12.,   3.],
       [ nan,  nan,  nan],
       [ 10.,  13.,   9.],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
Another approach is to make an array of the target shape filled with nan, and then copy over the non-nan values. But that too requires iteration to find the lengths of the subarrays.
In [65]: [len(i)>1 for i in d]
Out[65]: [True, False, True, False, False]
np.nan is a float, so a 2d array with nan will be dtype float.
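A sketch of that fill-and-copy approach, assuming d is the object array from the In [61] snippet above:

import numpy as np

nan = np.nan
d = np.array([[5, 12, 3], [nan], [10, 13, 9], [nan], [nan]], dtype=object)

out = np.full((len(d), 3), np.nan)           # float target, pre-filled with nan
keep = np.array([len(i) > 1 for i in d])     # rows that actually hold three values
out[keep] = np.array([i for i in d[keep]], dtype=float)
depth, upper, lower = out.T                  # each row of out.T is one output column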
A shorter way using pandas:
import numpy as np
import pandas as pd
data = np.array([[5,12,3], [np.nan], [10,13,9], [np.nan], [np.nan]], dtype=object)
df = pd.DataFrame.from_records(data.tolist())
df.columns = ['depth','upper','lower']
Output:
>>> df
   depth  upper  lower
0    5.0   12.0    3.0
1    NaN    NaN    NaN
2   10.0   13.0    9.0
3    NaN    NaN    NaN
4    NaN    NaN    NaN
You can now address each column to get your desired output:
>>> df.depth
0     5.0
1     NaN
2    10.0
3     NaN
4     NaN
If you need lists:
>>> df.depth.tolist()
[5.0, nan, 10.0, nan, nan]

Removing nan elements from matrix

I have a bunch of matrices eq1, eq2, etc. defined like this:
from numpy import meshgrid, sqrt, arange
# from numpy import isnan, logical_not
xs = arange(-7.25, 7.25, 0.01)
ys = arange(-5, 5, 0.01)
x, y = meshgrid(xs, ys)
eq1 = ((x/7.0)**2.0*sqrt(abs(abs(x)-3.0)/(abs(x)-3.0))+(y/3.0)**2.0*sqrt(abs(y+3.0/7.0*sqrt(33.0))/(y+3.0/7.0*sqrt(33.0)))-1.0)
eq2 = (abs(x/2.0)-((3.0*sqrt(33.0)-7.0)/112.0)*x**2.0-3.0+sqrt(1-(abs(abs(x)-2.0)-1.0)**2.0)-y)
where eq1, eq2, eq3, etc. are large square matrices. As you can see, there are many nan elements surrounding a 'block' of plot-able values. I want to remove all the nan values whilst keeping the shape of the block of the valid values in the matrix. Note that these 'blocks' can be located anywhere in the eq1, eq2 matrix.
I've looked at answers given in Removing nan values from an array and Removing NaN elements from a matrix, but these don't seem to be completely relevant to my case.
IIUC, you can use boolean indexing with np.isnan to keep the slices. There are probably slicker ways to do this, but starting from something like:
>>> eq = np.zeros((5,6)) + np.nan
>>> eq[2:4, 1:3].flat = [1,np.nan,3,4]
>>> eq
array([[ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan],
       [ nan,   1.,  nan,  nan,  nan,  nan],
       [ nan,   3.,   4.,  nan,  nan,  nan],
       [ nan,  nan,  nan,  nan,  nan,  nan]])
You could select the rows and columns with data using something like
>>> eq = eq[:,~np.isnan(eq).all(0)]
>>> eq = eq[~np.isnan(eq).all(1)]
>>> eq
array([[  1.,  nan],
       [  3.,   4.]])
Short and sweet,
eq1_c = eq1[~np.isnan(eq1)]
np.isnan returns a bool array that can be used to index your original array. Take its negation and you will get back the non-nan values.
One option is to manually iterate through the grid and check for NaN values. A NaN value is easy to spot because comparing it to itself results in False. You could use this to set all NaN values to 0.0, for example:
for x in range(len(eq1)):
    for y in range(len(eq1[x])):
        v = eq1[x][y]
        if v != v:
            eq1[x][y] = 0.0
