Largest (n) numbers with Index and Column name in Pandas DataFrame - python

I wish to find the largest 5 values in a DataFrame and store the index name and column name for each of these 5 values.
I am trying to use the nlargest() and idxmax() methods but failing to achieve what I want. My code is as follows:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
df = DataFrame({'a': [1, 10, 8, 11, -1],'b': [1.0, 2.0, 6, 3.0, 4.0],'c': [1.0, 2.0, 6, 3.0, 4.0]})
Can you kindly let me know how I can achieve this? Thank you.

Use stack and nlargest:
max_vals = df.stack().nlargest(5)
This will give you a Series with a multiindex, where the first level is the original DataFrame's index, and the second level is the column name for the given value. Here's what max_vals looks like:
3  a    11.0
1  a    10.0
2  a     8.0
   b     6.0
   c     6.0
To explicitly get the index and column names, use get_level_values on the index of max_vals:
max_idx = max_vals.index.get_level_values(0)
max_cols = max_vals.index.get_level_values(1)
The result of max_idx:
Int64Index([3, 1, 2, 2, 2], dtype='int64')
The result of max_cols:
Index(['a', 'a', 'a', 'b', 'c'], dtype='object')
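If you then want the five results as explicit (index, column, value) tuples, you can iterate over the stacked Series; a small sketch building on the above:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 10, 8, 11, -1],
                   'b': [1.0, 2.0, 6, 3.0, 4.0],
                   'c': [1.0, 2.0, 6, 3.0, 4.0]})

max_vals = df.stack().nlargest(5)

# Each key of the stacked Series is an (index, column) pair
top = [(idx, col, val) for (idx, col), val in max_vals.items()]
# top[0] is (3, 'a', 11.0)
```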


Python pandas: elegant division within dataframe

I'm new on Stack Overflow and have switched from R to Python. I'm trying to do something that's probably not too difficult, and while I can get it done by butchering, I'm wondering what the most Pythonic way is. I'm trying to divide certain values in a column (E where F == 'a') by values further down the same column (E where F == 'b'), using column D as a lookup:
import pandas as pd
df = pd.DataFrame({'D':[1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1], 'E':[10,20,30,40,50,100, 250, 250, 360, 567, 400],'F':['a', 'a', 'a', 'a', 'a', 'b','b', 'b', 'b', 'b', 'c']})
print(df)
out = pd.DataFrame({'D': [1, 2, 3, 4, 5], 'a/b': [0.1, 0.08, 0.12, 0.1111, 0.0881]})
print(out)
Can anyone help write this nicely?
I'm not entirely sure what you mean by "using column D as a lookup", since no such lookup is needed in the example you provided.
However, the quick and dirty way to achieve the output you provided is:
output = pd.DataFrame({'a/b': df[df['F'] == 'a']['E'].values / df[df['F'] == 'b']['E'].values})
output['D'] = df['D']
which makes output:
        a/b  D
0  0.100000  1
1  0.080000  2
2  0.120000  3
3  0.111111  4
4  0.088183  5
You can look this up with .loc on the pandas DataFrame, as df.loc[rows, columns], where rows and columns are boolean conditions:
import numpy as np
# get the unique values from column D, sorted so the order is deterministic
# (a plain set() does not preserve any particular order)
idx = sorted(set(df['D']))
# A is an array of values with 'F'=a
A = np.array([df.loc[(df['F']=='a') & (df['D']==i),'E'].values[0] for i in idx])
# B is an array of values with 'F'=b
B = np.array([df.loc[(df['F']=='b') & (df['D']==i),'E'].values[0] for i in idx])
# Now divide to build your new dataframe of divisions
out = pd.DataFrame(np.vstack([A/B,idx]).T, columns = ['a/b','D'])
Instead of using numpy.vstack, you can use:
out = pd.DataFrame([A/B, idx]).T
out.columns = ['a/b', 'D']
with the same result. I tried to do it in a single line (for no reason whatsoever)
Got it:
df = df.set_index('D')
out = df.loc[(df['F'] == 'a'), 'E'] / df.loc[(df['F'] == 'b'), 'E']
out = out.reset_index()
Thanks for your thoughts - I got inspired.
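For completeness, here's a merge-based sketch (my own variant, not from the answers above) that genuinely uses D as the lookup key, so it also works when the 'a' and 'b' rows are not stored in the same order:

```python
import pandas as pd

df = pd.DataFrame({'D': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1],
                   'E': [10, 20, 30, 40, 50, 100, 250, 250, 360, 567, 400],
                   'F': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'c']})

# Align the 'a' rows with the 'b' rows on D, then divide
merged = df[df['F'] == 'a'].merge(df[df['F'] == 'b'], on='D', suffixes=('_a', '_b'))
out = pd.DataFrame({'D': merged['D'], 'a/b': merged['E_a'] / merged['E_b']})
```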

Index a Pandas DataFrame/Series by Series of indices

I am trying to find a way to create a pandas Series which is based on values within another DataFrame. A simplified example would be:
df_idx = pd.DataFrame([0, 2, 2, 3, 1, 3])
df_lookup = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
where I wish to generate a new pandas series of values drawn from df_lookup based on the indices in df_idx, i.e.:
df_target = pd.DataFrame([10.0, 30.0, 30.0, 40.0, 20.0, 40.0])
Clearly, it is desirable to do this without looping for speed.
Any help greatly appreciated.
This is what reindex is for:
df_idx = pd.DataFrame([0, 2, 2, 3, 1, 3])
df_lookup = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
df_lookup.reindex(df_idx[0])
Output:
      0
0
0  10.0
2  30.0
2  30.0
3  40.0
1  20.0
3  40.0
This is precisely the use case for iloc:
import pandas as pd
df = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
idx_lst = pd.Series([0, 2, 2, 3, 1, 3])
res = df.iloc[idx_lst]
See here for more on indexing by position.
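A quick sketch checking that positional indexing reproduces the target from the question (reset_index(drop=True) is only used so the values line up against a fresh 0..5 index):

```python
import pandas as pd

df_lookup = pd.DataFrame([10.0, 20.0, 30.0, 40.0])
idx_lst = pd.Series([0, 2, 2, 3, 1, 3])

# iloc keeps the looked-up positions as the index, so drop it for comparison
res = df_lookup.iloc[idx_lst].reset_index(drop=True)
```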

How to Convert Pandas Dataframe to Single List

Suppose I have a dataframe:
   col1  col2  col3
0     1     5     2
1     7    13
2     9     1
3           7
How do I convert to a single list such as:
[1, 7, 9, 5, 13, 1, 7]
I have tried:
df.values.tolist()
However this returns a list of lists rather than a single list:
[[1.0, 5.0, 2.0], [7.0, 13.0, nan], [9.0, 1.0, nan], [nan, 7.0, nan]]
Note the dataframe will contain an unknown number of columns. The order of the values is not important so long as the list contains all values in the dataframe.
I imagine I could write a function to unpack the values, however I'm wondering if there is a simple built-in way of converting a dataframe to a series/list?
Following your current approach, you can flatten your array before converting it to a list. If you need to drop NaN values, you can do that after flattening as well:
import numpy as np

arr = df.to_numpy().flatten()
list(arr[~np.isnan(arr)])
Also, newer versions of pandas prefer to_numpy over values.
An alternate, perhaps cleaner, approach is to 'stack' your dataframe:
df.stack().tolist()
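To confirm that stack walks the frame row by row and (with its long-standing default dropna=True) skips the missing cells, a small sketch using the frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 7, 9, np.nan],
                   'col2': [5, 13, 1, 7],
                   'col3': [2, np.nan, np.nan, np.nan]})

# Row-major order: (0, col1), (0, col2), (0, col3), (1, col1), ...
flat = df.stack().tolist()
```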
You can use DataFrame.stack:
In [12]: df = pd.DataFrame({"col1":[np.nan,3,4,np.nan], "col2":['test',np.nan,45,3]})
In [13]: df.stack().tolist()
Out[13]: ['test', 3.0, 4.0, 45, 3]
For an ordered list (as per the problem statement), only if your data contains integer values:
First get all items in the data frame, then remove the NaNs from the list.
items = [item for sublist in [df[cols].tolist() for cols in df.columns] for item in sublist]
items = [int(x) for x in items if str(x) != 'nan']
For an unordered list, only if your data contains integer values:
items = [int(x) for x in sum(df.values.tolist(),[]) if str(x) != 'nan']

Is there a python function to fill missing data with consecutive value

I want to fill in these missing numbers in column b with the consecutive values 1 and 2.
This is what I have done:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})
df['b'].fillna({'b':[1,2]}, inplace=True)
but nothing happens.
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate but the above solutions are generic given an input list or array.
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc, you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
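A runnable check of the loc approach on the question's frame (the two nulls get 1 and 2, in order):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 7, 8, 4],
                   'b': [1, np.nan, 3, np.nan, 5]})

# The list is assigned positionally to the null slots, top to bottom
df.loc[df['b'].isnull(), 'b'] = [1, 2]
```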
You can't feed fillna a list of values, as stated here and in the documentation. Also, if you're selecting the column, no need to tell fillna which column to use. You could do:
df.fillna({'b':1}, inplace=True)
Or
df['b'].fillna(1, inplace=True)
By the way, inplace is on the way to deprecation in Pandas, the preferred way to do this is, for example
df = df.fillna({'b':1})
You can interpolate. Example:
s = pd.Series([0, 1, np.nan, 3])
s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64
If I understand the wording "consecutive values 1 and 2" correctly, the solution may be:
from itertools import islice, cycle
filler = [1, 2]
nans = df.b.isna()
df.loc[nans, 'b'] = list(islice(cycle(filler), sum(nans)))
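The cycling matters when there are more nulls than filler values; a small sketch on a made-up frame (not the question's) showing the [1, 2] filler wrapping around:

```python
import numpy as np
import pandas as pd
from itertools import islice, cycle

df = pd.DataFrame({'b': [np.nan, 3, np.nan, 5, np.nan]})

filler = [1, 2]
nans = df['b'].isna()
# cycle repeats the filler forever; islice takes exactly as many as needed
df.loc[nans, 'b'] = list(islice(cycle(filler), int(nans.sum())))
```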

Pandas div using index

I sometimes struggle a bit to understand pandas data structures, and it seems to be the case again. Basically, I've got:
a pivot table whose major axis is a serial number
a Series using the same index
I would like to divide each column of my pivot table by the corresponding value in the Series, using the index to match rows. I've tried plenty of combinations... without success so far :/
import pandas as pd
df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]], columns=['A', 'B', 'C', 'D'])
pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D', fill_value=0)
serie = pd.Series([5, 5, 5], index=['123', '678', '345'])
pt.div(serie, axis='index')
But I am only getting NaN. I guess it's because the column names don't match, but that's why I was using 'index' as the axis. Any idea what I am doing wrong?
Thanks
You say "using the same index", but they're not the same: pt has a multiindex, and serie only an index:
>>> pt.index
MultiIndex(levels=[[u'123', u'456'], [1, 2, 4]],
labels=[[0, 0, 1], [0, 2, 1]],
names=[u'A', u'B'])
And you haven't told the division that you want to align on the A part of the index. You can pass that information using level:
>>> pt.div(serie, level='A', axis='index')
C        1    3    5
A   B
123 1  0.6    0  0.0
    4  0.0    0  1.2
456 2  NaN  NaN  NaN

[3 rows x 3 columns]
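A self-contained sketch of the full flow on current pandas, where pivot_table takes index=/columns= keywords, confirming the level-based alignment:

```python
import pandas as pd

df = pd.DataFrame([['123', 1, 1, 3], ['456', 2, 3, 4], ['123', 4, 5, 6]],
                  columns=['A', 'B', 'C', 'D'])

pt = pd.pivot_table(df, index=['A', 'B'], columns='C', values='D', fill_value=0)

serie = pd.Series([5, 5, 5], index=['123', '678', '345'])

# level='A' aligns the Series against the A level of the MultiIndex;
# rows whose A value is missing from serie (here '456') come back as NaN
res = pt.div(serie, level='A', axis='index')
```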
