Get row, column pairs that are not NaN from a DataFrame - python

I have a DataFrame that contains a lot of NaN values, and with around 100 columns and 100 rows it is difficult to get an overview of the (row, column) pairs that are not NaN.
What is the best way to get a list of the (row, column) pairs that are not NaN?
Let's assume for simplicity the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=[[np.nan, 300, np.nan, 200],
                        [400, 300, 100, 200],
                        [np.nan, np.nan, 100, np.nan],
                        [400, np.nan, 100, 200],
                        [400, 300, np.nan, 200]],
                  columns=["col1", "col2", "col3", "col4"])
df
I would like to get a list of all the (row, column) pairs that are not NaN.

You can stack the frame, which drops NaNs by default; the index of the result is then exactly the non-NaN (row, column) pairs:
>>> df.stack().index.tolist()
[(0, "col2"),
(0, "col4"),
(1, "col1"),
(1, "col2"),
(1, "col3"),
(1, "col4"),
(2, "col3"),
(3, "col1"),
(3, "col3"),
(3, "col4"),
(4, "col1"),
(4, "col2"),
(4, "col4")]

Related

Numpy Where and Pandas: How to aggregate groupby values?

How can I get an array that aggregates the grouped column into a single entity (list/array) while also returning NaNs for results that do not match the where clause condition?
# example
df1 = pd.DataFrame({'flag': [1, 1, 0, 0],
                    'part': ['a', 'b', np.nan, np.nan],
                    'id': [1, 1, 2, 3]})
# my try
np.where(df1['flag'] == 1, df1.groupby(['id'])['part'].agg(np.array), df1.groupby(['id'])['part'].agg(np.array))
# ValueError: operands could not be broadcast together with shapes (4,) (3,) (3,)
# expected
np.array((np.array(('a', 'b')), np.array(('a', 'b')), np.nan, np.nan), dtype=object)
The np.where attempt fails because the groupby result has one row per id (3 values) while flag has 4 rows, so the shapes cannot broadcast. Instead, drop the rows having NaN in the part column, then group the remaining rows by id and aggregate part into a list, and finally map the aggregated Series back onto the id column to get the result:
s = df1.dropna(subset=['part']).groupby('id')['part'].agg(list)
df1['id'].map(s).to_numpy()
array([list(['a', 'b']), list(['a', 'b']), nan, nan], dtype=object)
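If flag, rather than the missing part values, should decide where the NaNs go, you can additionally mask on flag. A small sketch, assuming the same df1 and s as above:
mapped = df1['id'].map(s)
np.where(df1['flag'].eq(1), mapped, np.nan)
# array([list(['a', 'b']), list(['a', 'b']), nan, nan], dtype=object)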

Creating a dictionary using a list and a dataframe

I have a list that is created using two columns of a dataframe. I need to create a dictionary where the keys are the elements of the list and the values are the elements of a column in the dataframe. Below is an example I just created; the dataframe I am actually using is large, and so is the list.
data = {'init': [1, 2, 1], 'term': [2, 3, 3], 'cost': [10, 20, 30]}
df = pd.DataFrame.from_dict(data)
link = [(1, 2), (1, 3), (2, 3)]
I need to create the following dictionary using the dataframe and list.
link_cost = {(1, 2): 10, (1, 3): 30, (2, 3): 20}
Could anyone help me with this? Any comments or instruction would be appreciated.
Let's try set_index + reindex then Series.to_dict:
d = df.set_index(['init', 'term'])['cost'].reindex(index=link).to_dict()
d:
{(1, 2): 10, (1, 3): 30, (2, 3): 20}
set_index with multiple columns will create a MultiIndex which can be indexed with tuples. Selecting a specific column and then reindexing will allow the list link to reorder/select specific values from the Series. Series.to_dict will create the dictionary output.
Setup used:
import pandas as pd
df = pd.DataFrame({
'init': [1, 2, 1],
'term': [2, 3, 3],
'cost': [10, 20, 30]
})
link = [(1, 2), (1, 3), (2, 3)]
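To see the intermediate objects, a small sketch using the same df and link:
s = df.set_index(['init', 'term'])['cost']
s.loc[(1, 3)]  # 30 -- the MultiIndex is keyed by (init, term) tuples
s.reindex(index=link).to_dict()  # {(1, 2): 10, (1, 3): 30, (2, 3): 20}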
Why are you even using pandas for this? You can build the dict directly by zipping each (init, term) pair with its cost:
link_cost = dict(zip(zip(data['init'], data['term']), data['cost']))
# or if you must use the dataframe it's the same
link_cost = dict(zip(zip(df['init'], df['term']), df['cost']))
{(1, 2): 10, (2, 3): 20, (1, 3): 30}
Note that zipping link itself against the costs would pair them positionally, which mis-assigns values here because link is not in the same order as the dataframe rows.
One approach is to use DataFrame.set_index, Index.isin and DataFrame.itertuples:
import pandas as pd
data = {'init': [1, 2, 1], 'term': [2, 3, 3], 'cost': [10, 20, 30]}
df = pd.DataFrame.from_dict(data)
link = [(1, 2), (2, 3), (1, 3)]
cols = ["init", "term"]
new = df.set_index(cols)
res = dict(new[new.index.isin(link)].itertuples(name=None))
print(res)
Output
{(1, 2): 10, (2, 3): 20, (1, 3): 30}

Select a pandas DataFrame column by indices given in another column

Given a dataframe such as
df = pd.DataFrame({1: [10, 20, 30, 40], 2: [50, 60, 70, 80], 3: [90, 100, 110, 120], "select": [2, 3, 1, 1]})
I can get a Series of values, one selected from each row at the column label given in its select column, like this:
df.apply(lambda r: r[r["select"]], axis=1)  # 50, 100, 30, 40
Is there a better way to do this that doesn't rely on apply?
Use lookup:
df.lookup(df.index, df['select'])
Output:
array([ 50, 100, 30, 40])
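Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On current versions, an equivalent can be written with plain NumPy indexing; a sketch, using the same df:
import numpy as np
rows = np.arange(len(df))
cols = df.columns.get_indexer(df['select'])  # positions of the labels in 'select'
df.to_numpy()[rows, cols]  # array([ 50, 100,  30,  40])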

Iterate over numpy array in a specific order based on values

I want to iterate over a numpy array, starting at the index of the highest value and working through to the lowest value.
import numpy as np  # imports numpy package

elevation_array = np.random.rand(5, 5)  # creates a random 5 x 5 array
print(elevation_array)  # prints the array out
ravel_array = np.ravel(elevation_array)
sorted_array_x = np.argsort(ravel_array)
sorted_array_y = np.argsort(sorted_array_x)
sorted_array = sorted_array_y.reshape(elevation_array.shape)
for index, rank in np.ndenumerate(sorted_array):
    print(index, rank)
I want it to print out:
index of the highest value
index of the next highest value
index of the next highest value etc
If you want numpy doing the heavy lifting, you can do something like this:
>>> a = np.random.rand(100, 100)
>>> sort_idx = np.argsort(a, axis=None)
>>> np.column_stack(np.unravel_index(sort_idx[::-1], a.shape))
array([[13, 62],
       [26, 77],
       [81,  4],
       ...,
       [83, 40],
       [17, 34],
       [54, 91]], dtype=int64)
You first get an index that sorts the whole flattened array, and then convert that flat index into pairs of coordinates with np.unravel_index. The call to np.column_stack simply joins the two coordinate arrays into a single one; it could be replaced by list(zip(*np.unravel_index(sort_idx[::-1], a.shape))) to get a list of tuples instead of an array.
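To actually walk the array from highest to lowest value, you can loop over those coordinate pairs (same a and sort_idx as above):
for i, j in np.column_stack(np.unravel_index(sort_idx[::-1], a.shape)):
    print((i, j), a[i, j])  # index first, then the value at that index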
Try this:
from operator import itemgetter
>>> a = np.array([[2, 7], [1, 4]])
>>> a
array([[2, 7],
       [1, 4]])
>>> sorted(np.ndenumerate(a), key=itemgetter(1), reverse=True)
[((0, 1), 7),
 ((1, 1), 4),
 ((0, 0), 2),
 ((1, 0), 1)]
You can iterate over this list if you so wish. Essentially I am telling the function sorted to order the elements of np.ndenumerate(a) according to the key itemgetter(1). itemgetter(1) extracts the second element (index 1), i.e. the value, from each of the tuples ((0, 1), 7), ((1, 1), 4), ... generated by np.ndenumerate(a).

Merging indexed arrays in Python

Suppose that I have two numpy arrays of the form
x = [[1, 2],
     [2, 4],
     [3, 6],
     [4, NaN],
     [5, 10]]

y = [[0, -5],
     [1, 0],
     [2, 5],
     [5, 20],
     [6, 25]]
Is there an efficient way to merge them such that I have
xmy = [[0, NaN, -5],
       [1, 2, 0],
       [2, 4, 5],
       [3, 6, NaN],
       [4, NaN, NaN],
       [5, 10, 20],
       [6, NaN, 25]]
I can implement a simple search to find the indices, but this is inelegant and potentially inefficient for many arrays and large dimensions. Any pointers are appreciated.
See numpy.lib.recfunctions.join_by
It only works on structured arrays or recarrays, so there are a couple of kinks.
First you need to be at least somewhat familiar with structured arrays. See here if you're not.
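As a quick refresher, a structured array gives each row named fields, so elements behave like small records. A minimal sketch:
import numpy as np
pt = np.array([(1, 2.0), (2, 4.0)], dtype=[('key', int), ('field', float)])
pt['key']    # array([1, 2])
pt['field']  # array([2., 4.])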
import numpy as np
import numpy.lib.recfunctions

# Define the starting arrays as structured arrays with two fields ('key' and 'field')
dtype = [('key', int), ('field', float)]
x = np.array([(1, 2),
              (2, 4),
              (3, 6),
              (4, np.nan),
              (5, 10)],
             dtype=dtype)
y = np.array([(0, -5),
              (1, 0),
              (2, 5),
              (5, 20),
              (6, 25)],
             dtype=dtype)

# You want an outer join, rather than the default inner join
# (all values are returned, not just ones with a common key)
join = np.lib.recfunctions.join_by('key', x, y, jointype='outer')

# Now we have a structured array with three fields: 'key', 'field1', and 'field2'
# (since 'field' was in both arrays, it renamed x['field'] to 'field1', and
# y['field'] to 'field2')

# This returns a masked array; if you want it filled with
# NaNs, do the following...
join.fill_value = np.nan
join = join.filled()

# Just displaying it... Keep in mind that as a structured array,
# it has one dimension, where each row contains the 3 fields
for row in join:
    print(row)
This outputs:
(0, nan, -5.0)
(1, 2.0, 0.0)
(2, 4.0, 5.0)
(3, 6.0, nan)
(4, nan, nan)
(5, 10.0, 20.0)
(6, nan, 25.0)
Hope that helps!
Edit1: Added example
Edit2: Really shouldn't join with floats... Changed 'key' field to an int.
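If pandas is an option, the same outer join is a one-liner on DataFrames. A sketch (not part of the original answer) that rebuilds x and y as frames:
import numpy as np
import pandas as pd
xf = pd.DataFrame({'key': [1, 2, 3, 4, 5], 'x': [2, 4, 6, np.nan, 10]})
yf = pd.DataFrame({'key': [0, 1, 2, 5, 6], 'y': [-5, 0, 5, 20, 25]})
xmy = xf.merge(yf, on='key', how='outer').sort_values('key').to_numpy()
# rows: [0, nan, -5], [1, 2, 0], [2, 4, 5], [3, 6, nan],
#       [4, nan, nan], [5, 10, 20], [6, nan, 25]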
