Find value smaller but closest to current value - python

I have a very large pandas dataframe that contains two columns, column A and column B. For each value in column A, I would like to find the largest value in column B that is less than the corresponding value in column A. Note that each value in column B can be mapped to many values in column A.
Here's an example with a smaller dataset. Let's say I have the following dataframe:
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
I would like to find some third column -- say c -- so that
c = [nil, 2, 5, 1, 2, 2].
Note that each entry in c is strictly less than the corresponding value in a.
Upon researching, I think that I want something similar to pandas.merge_asof, but I can't quite get the query correct. In particular, I'm struggling because I only have one dataframe and not two. Perhaps I can form a second dataframe from my current one to get what I need, but I can't quite get it right. Any help is appreciated.

Yes, this is doable with pandas.merge_asof. The explanation is in the code comments:
import pandas as pd
df = pd.DataFrame({'a' : [1, 5, 7, 2, 3, 4], 'b' : [5, 2, 7, 5, 1, 9]})
# merge_asof requires the keys to be sorted
adf = df[['a']].sort_values(by='a')
bdf = df[['b']].sort_values(by='b')
# your example wants 'strictly less' so we also add 'allow_exact_matches=False'
cdf_ordered = pd.merge_asof(adf, bdf, left_on='a', right_on='b', allow_exact_matches=False, direction='backward')
# rename the dataframe |a|b| -> |a|c|
cdf_ordered = cdf_ordered.rename(columns={'b': 'c'})
# since c is based on sorted a, we merge with original dataframe column a
new_df = pd.merge(df, cdf_ordered, on='a')
print(new_df)
"""
   a  b    c
0  1  5  NaN
1  5  2  2.0
2  7  7  5.0
3  2  5  1.0
4  3  1  2.0
5  4  9  2.0
"""

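One caveat with the merge-based answer above: if column a contained repeated values, the final pd.merge on a would emit extra rows for each repeat. A minimal sketch of an alternative that keeps the original row order without a second merge, assuming both columns are numeric and contain no NaN, uses numpy.searchsorted on the sorted b values. The names b_sorted and pos are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 5, 7, 2, 3, 4], 'b': [5, 2, 7, 5, 1, 9]})

# Sort b once so it can be binary-searched.
b_sorted = np.sort(df['b'].to_numpy())

# side='left' returns, for each a, how many b values are strictly smaller.
pos = np.searchsorted(b_sorted, df['a'].to_numpy(), side='left')

# pos == 0 means no b value is smaller than a, so the result is missing.
df['c'] = np.where(pos > 0, b_sorted[np.maximum(pos - 1, 0)], np.nan)

print(df)
```

Because a missing value has to be representable, c comes back as a float column, the same as in the merge_asof result.
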
Related

Pandas: Get the index of the first value greater than all subsequent values

Let's say I have the following pandas DataFrame:
index  A  B  C
0      2  1  4
1      1  2  3
2      4  3  2
3      3  4  1
I want to get the index of the row in each column where the value of the column at that row is greater than all subsequent rows. So in this example, my desired output would be
A  B  C
2  3  0
What is the most efficient way to do this?
In that case, I guess I would use:
df.idxmax()
Or to get it formatted to your desired output:
pd.DataFrame(df.idxmax()).T
df[::-1].idxmax(axis=0)
Explanation: DataFrame.idxmax returns the index of the first occurrence of the maximum (per its documentation). Reversing the row order first means the first occurrence it finds is the last occurrence in the original order, which is the row wanted here. The following code produces the desired result (as a pd.DataFrame):
df = pd.DataFrame(
[[2, 1, 4], [1, 2, 3], [4, 3, 2], [3, 4, 1]],
index=[0, 1, 2, 3], columns=['A', 'B', 'C']
)
pd.DataFrame(df[::-1].idxmax(axis=0)).T
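The reversal only changes the result when a column's maximum occurs more than once. A small sketch with made-up data (the tie in column A is an assumption added for illustration, not part of the question) shows the difference:

```python
import pandas as pd

# Column A has its maximum (4) at rows 0 and 2.
ties = pd.DataFrame({'A': [4, 1, 4, 3], 'B': [1, 2, 3, 4]})

# Plain idxmax reports the first occurrence of each maximum.
first_max = ties.idxmax()

# Reversing the rows first makes idxmax report the last occurrence instead.
last_max = ties[::-1].idxmax()

print(first_max['A'], last_max['A'])
```

Row 0 of column A is not greater than all subsequent rows (row 2 equals it), so only the reversed version gives the index the question asks for.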
"index of the first value greater than all subsequent rows" <-> "index of last occurrence of maximum value"

How to render a column name as a single cell in midst of multilevel columns in pandas?

I'm working with multilevel indexes in columns and need to send these tables, which I do with df.to_html(). The picture below shows where I am now; foo is the index, which I've converted to a column.
While converting it to a column, I want it to occupy both header cells so it looks nice. This is what I want to achieve.
The code looks like this.
df = pd.DataFrame([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],index=['M1','M2','M3'])
df.columns = pd.MultiIndex.from_product([['x', 'y'], ['a', 'b']])
ind = df.index
df.reset_index(drop=True,inplace=True)
df.insert(0,'foo',ind)
With the code you provided, foo is not set as the index of the dataframe.
Anyway, you could add this after your current code to correct the header of your dataframe before converting it to HTML:
df = df.rename(axis=1, level=0, mapper={"foo": ""}).rename(
axis=1, level=1, mapper={"": "foo"}
)
df.to_html(index=False)
This way, the html version of your dataframe renders the desired way:
     x     y
foo  a  b  a  b
M1   1  2  3  4
M2   1  2  3  4
M3   1  2  3  4
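As a possible alternative to renaming both levels, the header can also be assigned directly. This is only a sketch under the assumption that the index is moved into the frame with reset_index; the variable names out and html are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]],
    index=['M1', 'M2', 'M3'],
    columns=pd.MultiIndex.from_product([['x', 'y'], ['a', 'b']]),
)

# Move the index into the frame as the first column.
out = df.reset_index()

# Put 'foo' on the lower header level and leave the upper level empty.
out.columns = pd.MultiIndex.from_tuples([('', 'foo')] + list(df.columns))

html = out.to_html(index=False)
```

The first column then has an empty top-level label and foo on the second level, which is the same header layout the rename approach produces.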

How to get rows from only some columns based on columns value in Pandas?

I use pandas to read data from Excel. From those tables, I often need to find one or more values in a single row, based on the value in a column.
I've read a lot about pandas (docs and SO), and almost every time, the question is like « how to SELECT * FROM df WHERE value = smthing ».
But what I'd like to do is more like :
SELECT Col1, Col2
FROM df
WHERE Col3.value = smthing
And I can't find any answer.
For example :
>>> dataFrame
foo bar sm_else
0 0 3 6
1 1 4 7
2 2 5 8
I want to get foo value and sm_else value when bar == 4.
So :
foo sm_else
1 7
Result can be DataFrame or can be list or dict, I don't really care.
Thanks !
How can I achieve this ?
df.loc can help you out:
import pandas as pd
df = pd.DataFrame(data={'col1': [1, 2, 3],
                        'col2': [4, 5, 6],
                        'col3': [7, 8, 9]})
# pass the row mask and the list of columns to a single .loc call
print(df.loc[df['col2'] == 4, ['col1', 'col2']])
df.loc[df.bar == 4, ['foo', 'sm_else']]
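Applied to the frame from the question, the selection can be sketched as follows. The conversions to a list of dicts and to a plain list are included because the question accepts any of those result types; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'foo': [0, 1, 2], 'bar': [3, 4, 5], 'sm_else': [6, 7, 8]})

# Rows where bar == 4, restricted to the two wanted columns.
subset = df.loc[df['bar'] == 4, ['foo', 'sm_else']]

# The same result as a list of dicts (one dict per matching row).
records = subset.to_dict('records')

# The first matching row as a plain list.
values = subset.iloc[0].tolist()

print(subset)
```

Note that subset.iloc[0] assumes at least one row matches; with no match it raises an IndexError.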

A straightforward method to select columns by position

I am trying to understand how to use pandas methods to select columns and rows in a DataFrame. An example will clarify my issue:
dic = {'a': [1, 5, 2, 7], 'b': [6, 8, 4, 2], 'c': [5, 3, 2, 7]}
df = pd.DataFrame(dic, index = ['e', 'f', 'g', 'h'] )
then
df =
a b c
e 1 6 5
f 5 8 3
g 2 4 2
h 7 2 7
Now if I want to select column 'a' I just have to type
df['a']
while if I want to select row 'e' I have to use the ".loc" method
df.loc['e']
If I don't know the name of the row, but just its position (0 in this case), then I can use the "iloc" method
df.iloc[0]
What looks like it is missing is a method for calling columns by position and not by name, something that is the "equivalent for columns of the 'iloc' method for rows". The only way I can find to do this is
df[df.keys()[0]]
is there something like
df.ilocColumn[0]
?
You can pass : as the first argument of iloc. The first argument is the position of the rows to select and the second is the position of the columns, and : means all rows of the DataFrame:
print (df.iloc[:,0])
e 1
f 5
g 2
h 7
Name: a, dtype: int64
If you need the value at the first row and first column:
print (df.iloc[0,0])
1
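The ix indexer used to handle selecting the row by name and the column by position, but it was deprecated in pandas 0.20 and removed in pandas 1.0. In current versions, look up the column label by position and use loc:
print (df.loc['e', df.columns[0]])
1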

How to add column labels to a pandas DataFrame

I can't understand how to add column names to a pandas dataframe. An easy example will clarify my issue:
dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)
Now if I type df then I get
a b c
0 4 4 5
1 1 2 7
2 3 1 9
3 1 4 1
Say now that I generate another dataframe just by summing up the columns of the previous one
a = df.sum()
If I type 'a' then I get
a 9
b 11
c 22
That looks like a dataframe with an index and without a name on the only column. So I wrote
a.columns = ['column']
or
a.columns = ['index', 'column']
and in both cases Python was happy, since it didn't give me any error message. But still, if I type 'a' I can't see the column names anywhere. What's wrong here?
The method DataFrame.sum() does an aggregation and therefore returns a Series, not a DataFrame. A Series has no columns, only an index, so assigning to a.columns does not label anything in the output. If you want to create a DataFrame out of your sum, you can replace a = df.sum() with:
a = pd.DataFrame(df.sum(), columns=['whatever_name_you_want'])
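Two other ways to get a labelled result from the Series are sketched below. The column names total and column are placeholders chosen for illustration.

```python
import pandas as pd

dic = {'a': [4, 1, 3, 1], 'b': [4, 2, 1, 4], 'c': [5, 7, 9, 1]}
df = pd.DataFrame(dic)

totals = df.sum()  # a Series indexed by the original column names

# Turn the Series into a one-column DataFrame with a chosen column name.
named = totals.to_frame(name='total')

# Or move the index into a regular column as well, giving two named columns.
tidy = totals.rename_axis('column').reset_index(name='total')

print(named)
print(tidy)
```

The first form keeps a, b, c as the index; the second turns them into an ordinary column, which is closer to the ['index', 'column'] layout attempted in the question.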
