pandas simple pairwise occurrence - python

pandas has a corr function that builds a table of pairwise correlation coefficients even in the presence of sparse data. But how do I calculate the number of mutual occurrences in the data instead of the correlation coefficient?
i.e.
A = [NaN, NaN, 3]
B = [NaN, NaN, 8]
F(A,B) = 1
A = [1, NaN, NaN]
B = [NaN, NaN, 8]
F(A,B) = 0
I need pandas.DataFrame([A,B]).<function>() -> matrix of occurrences

In pandas, you may want to use dropna: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
You can do something like
co_occur = df.dropna(how = "any")
the_count = co_occur.shape[0] # number of remaining rows
This will drop all rows where there is any NaN (thereby leaving you only with rows that contain values for every variable) and then count the number of remaining rows.
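One caveat worth adding here (assuming df can hold more columns than the pair of interest): dropna over the whole frame drops a row if any column is NaN, so subset to the two columns first to keep unrelated NaNs from discarding rows:
co_occur = df[["A", "B"]].dropna(how="any")
the_count = co_occur.shape[0]  # rows where both A and B are present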
Alternatively, you could do it with plain lists (as in your code above), assuming the lists are the same length. Note that a truthiness test like A[i] and B[i] won't work here: NaN is truthy and a legitimate value of 0 is falsy, so test for NaN explicitly:
from math import isnan

nan = float("nan")
A = [nan, nan, 3]
B = [nan, nan, 8]
co_occur = sum(1 for a, b in zip(A, B) if not isnan(a) and not isnan(b))

I am using numpy:
import numpy as np

sum(np.sum(~np.isnan(np.array([A, B])), 0) == 2)
Out[335]: 1
This stacks A and B as two rows, counts the non-NaN entries in each column, and keeps the columns where both rows have a value. For your second case:
sum(np.sum(~np.isnan(np.array([A, B])), 0) == 2)
Out[337]: 0

With pandas:
(df.A.notnull() & df.B.notnull()).sum()
Or, counting rows where every column is non-null:
df.notnull().all(axis=1).sum()
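For the full matrix of pairwise occurrence counts that the question asks for (the counting analogue of df.corr()), a minimal sketch: multiply the non-null indicator matrix by its own transpose, which counts, for every pair of columns, the rows where both hold a value.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 3], 'B': [np.nan, np.nan, 8]})

present = df.notna().astype(int)      # 1 where a value exists, 0 where NaN
co_occurrence = present.T @ present   # columns-by-columns matrix of counts

print(co_occurrence)
#    A  B
# A  1  1
# B  1  1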

Related

Returning the index of the first NaN and number in every row of a matrix

I have a numpy array, and for every row I'd like to know the position of the first NaN and whether there are numbers in the row after that NaN.
Let's say I have this matrix:
[[nan, nan, 0.0, 1.0, nan],
 [0.0, nan, nan, nan, 0.0]]
I'd like to return two vectors, each as long as the matrix has rows, where the first contains the index of the first nan in each row and the second the position of the first number in each row:
nan_vector = [0, 1]
num_vector = [2, 0]
How do I achieve this?
Use np.isnan with np.argmax or np.argmin: on the boolean mask, argmax returns the index of the first True (the first NaN) and argmin the index of the first False (the first number):
nan_vector = np.isnan(a).argmax(axis=1)
num_vector = np.isnan(a).argmin(axis=1)
Output:
>>> nan_vector
array([0, 1])
>>> num_vector
array([2, 0])
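One caveat (an addition, not from the answer above): argmax and argmin return 0 for rows that contain no True or no False at all, so a row without any NaN looks the same as a row whose first NaN sits at index 0. A sketch that marks such rows with -1:
import numpy as np

a = np.array([[np.nan, np.nan, 0.0, 1.0, np.nan],
              [0.0, np.nan, np.nan, np.nan, 0.0],
              [1.0, 2.0, 3.0, 4.0, 5.0]])   # last row has no NaN

mask = np.isnan(a)
nan_vector = np.where(mask.any(axis=1), mask.argmax(axis=1), -1)
num_vector = np.where(~mask.all(axis=1), mask.argmin(axis=1), -1)

print(nan_vector)   # [ 0  1 -1]
print(num_vector)   # [ 2  0  0]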

merge columns in numpy matrix

I have a NumPy matrix like this one (it could have several columns, this is just an example):
array([[nan, nan],
       [nan, nan],
       ['value 1', nan],
       [nan, nan],
       [nan, nan],
       [nan, 'value 2']], dtype=object)
I need to merge all columns in this matrix, replacing nan values with the corresponding non-nan value (if one exists). Example output:
array([[nan],
       [nan],
       ['value 1'],
       [nan],
       [nan],
       ['value 2']], dtype=object)
Is there a way to achieve this with some built-in function in NumPy?
EDIT: if there is more than one non-nan value in a single row, I will take the first non-nan value.
Values could be string, float or int.
Find the rows where the first column is nan. This works because nan!=nan:
rows = arr[:,0] != arr[:,0]
Update the first element of each chosen row with the second element:
arr[rows,0] = arr[rows,1]
Select the first column:
arr[:,[0]]
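The steps above assume exactly two columns. Since the question says there could be several, here is a hedged sketch (sample data made up for illustration) that sweeps left to right and keeps the first non-nan value in each row, using the same nan != nan trick:
import numpy as np

arr = np.array([[np.nan, 'value 1', np.nan],
                [np.nan, np.nan, 'value 2'],
                [np.nan, np.nan, np.nan]], dtype=object)

merged = arr[:, 0].copy()
for j in range(1, arr.shape[1]):
    still_nan = merged != merged        # nan != nan flags the empty slots
    merged[still_nan] = arr[still_nan, j]

result = merged[:, None]                # back to a single-column 2-D array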

Efficient and elegant way to fetch unique values from all columns - Big data

I have a dataframe with more than 600 columns. Here is a sample dataframe with a few of them:
import numpy as np
import pandas as pd

df_new = pd.DataFrame({'person_id': [1, 2, 3],
                       'obs_date': ['12/31/2007', '11/25/2009', np.nan],
                       'hero_id': [2, 4, np.nan],
                       'date2': ['12/31/2017', np.nan, '10/06/2015'],
                       'heroine_id': [1, np.nan, 5],
                       'date3': ['12/31/2027', '11/25/2029', np.nan],
                       'bud_source_value': [1250000, 250000, np.nan],
                       'prod__source_value': [10000, 20000, np.nan]})
I would like to fetch the unique values from each column and output them in another dataframe.
These are the two approaches that I have tried:
cols = df_new.columns.tolist()
unique_list = dict()
for c in cols:  # approach 1
    unique_list[c] = df_new[c].unique()
for c in cols:  # approach 2
    unique_list[c] = df_new[c].drop_duplicates()
Is there any way to do this in one go, without the loop? Please note that I expect the unique values from each column, not the unique rows of the dataframe.
As my data has over a million records and more than 600 columns, any suggestions/solutions that improve performance would be helpful.
You could try:
print({k:v.drop_duplicates().tolist() for k,v in df_new.items()})
Output:
{'bud_source_value': [1250000.0, 250000.0, nan], 'date2': ['12/31/2017', nan, '10/06/2015'], 'date3': ['12/31/2027', '11/25/2029', nan], 'hero_id': [2.0, 4.0, nan], 'heroine_id': [1.0, nan, 5.0], 'obs_date': ['12/31/2007', '11/25/2009', nan], 'person_id': [1, 2, 3], 'prod__source_value': [10000.0, 20000.0, nan]}
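The asker wanted the result in another dataframe; a minimal sketch building one from that dict (the columns have different numbers of unique values, so from_dict with orient='index' pads the shorter ones with NaN):
import pandas as pd

uniques = {k: v.drop_duplicates().tolist() for k, v in df_new.items()}
result = pd.DataFrame.from_dict(uniques, orient='index').T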
Convert the dataframe into a numpy array and do the following:
df_new = np.array(df_new)
unique_list = np.unique(df_new, axis=1)
Numpy is much faster! Be aware, though, that np.unique with axis=1 returns the unique columns of the array, not the unique values within each column, and on a mixed-dtype object array like this sample it can raise a TypeError when comparing strings with floats.

Numpy reading data from '.npy' file directly into arrays

This might be a silly question, but I can't seem to find an answer for it. I have a large array that I previously saved using np.save, and now I'd like to load the data in a new script, creating a separate list from each column. The only issue is that some of the rows in my large array have only a single nan value, so the array looks something like this (as an extremely simplified example):
np.array([[5, 12, 3],
          [nan],
          [10, 13, 9],
          [nan],
          [nan]])
I can use a for loop to achieve what I want, but I was wondering if there was a better way than this:
import numpy as np

results = np.load('data.npy', allow_pickle=True)  # object array, so allow_pickle is required on NumPy >= 1.16.3
depth, upper, lower = [], [], []
for item in results:
    if len(item) > 1:
        depth.append(item[0])
        upper.append(item[1])
        lower.append(item[2])
    else:
        depth.append(np.nan)
        upper.append(np.nan)
        lower.append(np.nan)
My desired output would look like:
depth = [5,nan,10,nan,nan]
upper = [12,nan,13,nan,nan]
lower = [3,nan,9,nan,nan]
Thanks for your help! I realize I should have previously altered the code that creates the "data.npy" file, so that it has the same number of columns for each row, but that code already takes hours to run and I'd rather avoid that!
With varying-length subarrays, this is a dtype=object array. For most purposes it is the same as a list of those subarrays, so most operations will require iteration.
A variant on your approach would be a list comprehension:
In [61]: dd=[[nan,nan,nan] if len(i)==1 else i for i in d]
In [62]: dd
Out[62]: [[5, 12, 3], [nan, nan, nan], [10, 13, 9], [nan, nan, nan], [nan, nan, nan]]
Your three target arrays are then columns of:
In [63]: np.array(dd)
Out[63]:
array([[  5.,  12.,   3.],
       [ nan,  nan,  nan],
       [ 10.,  13.,   9.],
       [ nan,  nan,  nan],
       [ nan,  nan,  nan]])
Another approach is to make an array of that type filled with nan, and then copy over the non-nan values. But that too requires iteration to find the length of the subarrays.
In [65]: [len(i)>1 for i in d]
Out[65]: [True, False, True, False, False]
np.nan is a float, so a 2D array containing nan will have dtype float.
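A sketch of that preallocate-and-copy approach, under the same assumption that every row is either a full triple or a lone nan:
import numpy as np

d = np.array([[5, 12, 3], [np.nan], [10, 13, 9], [np.nan], [np.nan]],
             dtype=object)

out = np.full((len(d), 3), np.nan)         # nan-filled float array
mask = np.array([len(i) > 1 for i in d])   # rows that hold real data
out[mask] = [np.asarray(i, float) for i in d[mask]]

depth, upper, lower = out.T                # the three column arrays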
A shorter way using pandas:
import numpy as np
import pandas as pd
data = np.array([[5,12,3], [np.nan], [10,13,9], [np.nan], [np.nan]])
df = pd.DataFrame.from_records(data.tolist())
df.columns = ['depth','upper','lower']
Output:
>>> df
   depth  upper  lower
0    5.0   12.0    3.0
1    NaN    NaN    NaN
2   10.0   13.0    9.0
3    NaN    NaN    NaN
4    NaN    NaN    NaN
You can now address each column to get your desired output
>>> df.depth
0     5.0
1     NaN
2    10.0
3     NaN
4     NaN
If you need lists:
>>> df.depth.tolist()
[5.0, nan, 10.0, nan, nan]
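If you need arrays rather than lists, the same pattern works with to_numpy() (available since pandas 0.24):
>>> df.depth.to_numpy()
array([ 5., nan, 10., nan, nan])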

Adding two 2D NumPy arrays ignoring NaNs in them

What is the right way to add 2 numpy arrays a and b (both 2D) with numpy.nan as missing value?
a + b
or
numpy.ma.sum(a,b)
Since the inputs are 2D arrays, you can stack them along a third axis with np.dstack and then use np.nansum, which ignores NaNs while summing. In the NumPy version used for the sample run below, an all-NaN slice sums to NaN, so the output keeps NaN where both inputs are NaN; in NumPy versions after 1.9.0, np.nansum returns zero for all-NaN slices instead. Thus, the implementation would look something like this -
np.nansum(np.dstack((A,B)),2)
Sample run -
In [157]: A
Out[157]:
array([[ 0.77552455,  0.89241629,         nan,  0.61187474],
       [ 0.62777982,  0.80245533,         nan,  0.66320306],
       [ 0.41578442,  0.26144272,  0.90260667,         nan],
       [ 0.65122428,  0.3211213 ,  0.81634856,         nan],
       [ 0.52957704,  0.73460363,  0.16484994,  0.20701344]])
In [158]: B
Out[158]:
array([[ 0.55809925,  0.1339353 ,         nan,  0.35154039],
       [ 0.94484722,  0.23814073,  0.36048809,  0.20412318],
       [ 0.25191484,         nan,  0.43721322,  0.95810905],
       [ 0.69115038,  0.51490958,         nan,  0.44613473],
       [ 0.01709308,  0.81771896,  0.3229837 ,  0.64013882]])
In [159]: np.nansum(np.dstack((A,B)),2)
Out[159]:
array([[ 1.3336238 ,  1.02635159,         nan,  0.96341512],
       [ 1.57262704,  1.04059606,  0.36048809,  0.86732624],
       [ 0.66769925,  0.26144272,  1.33981989,  0.95810905],
       [ 1.34237466,  0.83603089,  0.81634856,  0.44613473],
       [ 0.54667013,  1.55232259,  0.48783363,  0.84715226]])
Just replace the NaNs with zeros in both arrays:
a[np.isnan(a)] = 0 # replace all nan in a with 0
b[np.isnan(b)] = 0 # replace all nan in b with 0
And then perform the addition:
a + b
This relies on the fact that 0 is the "identity element" for addition.
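Note that this modifies a and b in place. If you need to keep the originals, np.nan_to_num performs the same zero replacement on copies:
import numpy as np

result = np.nan_to_num(a) + np.nan_to_num(b)   # a and b are left untouched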
