Find index of first row closest to value in pandas DataFrame - python

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user specified number (e.g. within 0.05 of desired number). The dataframe looks kinda like this:
ix  col1  col2  col3
0   nan   0.2   1.04
1   0.98  nan   1.5
2   1.7   1.03  1.91
3   1.02  1.42  0.97
Say I want the first row that is nearly equal to 1.0, I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other answers, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of NaNs. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?

Use DataFrame.sub to get the difference, abs to convert to absolute values, lt (<) to compare against the tolerance, and finally DataFrame.idxmax to get the index of the first True per column:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
For a more general solution that also works when the boolean mask has no True in some column (no value is within the tolerance), append a row filled with Trues and named NaN, so idxmax returns NaN for such columns:
print (df)
col1 col2 col3
ix
0 NaN 0.20 1.07
1 0.98 NaN 1.50
2 1.70 1.03 1.91
3 1.02 1.42 0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
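On pandas versions where DataFrame.append has been removed (2.0 and later), the same trick should work with pd.concat; a minimal sketch, assuming df is the frame printed above:
import numpy as np
import pandas as pd

# sketch: append the all-True fallback row via pd.concat instead of DataFrame.append
fallback = pd.DataFrame([[True] * len(df.columns)], columns=df.columns, index=[np.nan])
a = pd.concat([df.sub(1).abs().lt(0.05), fallback]).idxmax()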

Suppose you have some tolerance value tol for the nearly-equal threshold. You can create a masked dataframe of the values within the threshold and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())
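With the sample frame from the question, this loop would be expected to print col1 1, col2 2 and col3 0, matching the desired result; for a column with no value inside the tolerance, first_valid_index() returns None.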

Related

How to locate and replace values in dataframe based on some criteria

I would like to locate all places where the value in Col2 changes (for example, a change from A to C) and then modify the value in Col1 (in the row where the change happens, so for A -> C it is the value in the same row as C) by adding half of the difference between the current and previous value to the previous value (in this example 1 + (1.5 - 1)/2 = 1.25).
The output table is the result of replacing all such occurrences in the whole table.
How can I achieve that?
Col1  Col2
1     A
1.5   C
2.0   A
2.5   A
3.0   D
3.5   D
OUTPUT:
Col1  Col2
1     A
1.25  C
1.75  A
2.5   A
2.75  D
3.5   D
Use np.where together with a Series holding the values produced by your formula:
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
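A minimal end-to-end sketch of the same idea, rebuilding the question's data (assumes only numpy and pandas):
import numpy as np
import pandas as pd

df = pd.DataFrame({"Col1": [1, 1.5, 2.0, 2.5, 3.0, 3.5],
                   "Col2": ["A", "C", "A", "A", "D", "D"]})

# halfway point between the previous and current Col1 value
solution = df.Col1.shift() + ((df.Col1 - df.Col1.shift()) / 2)
# only replace rows where Col2 differs from the previous row
df['Col1'] = np.where(~df.Col2.eq(df.Col2.shift()), solution.fillna(df.Col1), df.Col1)
print(df)
#    Col1 Col2
# 0  1.00    A
# 1  1.25    C
# 2  1.75    A
# 3  2.50    A
# 4  2.75    D
# 5  3.50    D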

sum() on specific columns of dataframe

I cannot work out how to add a new row at the end. The last row needs to apply sum() to specific columns and divide two other columns, while a filter ensures that only specific rows are included in the sum.
df:
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.79
1 Cat2 2 -81.91 -15.30 -16.00 10.06
2 Cat3 3 -57.70 -18.62 0.00 0.00
I would like the output to be like so:
3 Total -123.60 -119.02 -26.91 100*(-119.02/-26.91)
col3,col4,col5 would have sum(), and col6 would be the above formula.
If [CategID]==2, then don't include in the TOTAL
I was able to get it almost as I wanted by using .query(), like so:
#tg is a list
df.loc['Total'] = df.query("CategID in @tg").sum()
But with the above I cannot compute 'col6' as 100*(col4.sum() / col5.sum()), because every column just gets sum().
Then I tried with a Series like so, but I don't understand how to apply the filter with .where():
s = pd.Series([df['col3'].sum(),
               df['col4'].sum(),
               df['col5'].sum(),
               100*(df['col4'].sum()/df['col5'].sum())],
              index=['col3','col4','col5','col6'])
df.loc['Total'] = s.where('tag1' in tg)
Using the above Series() works, until I add .where().
this gives the error:
ValueError: Array conditional must be same shape as self
So, can I accomplish this with the first method, using .query(), and just somehow modify one of the columns in the TOTAL row?
Otherwise, what am I doing wrong in the second method with .where()?
Thanks
IIUC, you can try:
s = df.mask(df['CategID'].eq(2)).drop("CategID",1).sum()
s.loc['col6'] = 100*(s['col4'] / s['col5'])
df.loc[len(df)] = s
df = df.fillna({'Categ':'Total',"CategID":''})
print(df)
Categ CategID col3 col4 col5 col6
0 Cat1 1 -65.90 -100.40 -26.91 23.790000
1 Cat2 2 -81.91 -15.30 -16.00 10.060000
2 Cat3 3 -57.70 -18.62 0.00 0.000000
3 Total -123.60 -119.02 -26.91 442.289112
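Note that on newer pandas releases the positional axis argument to drop may be rejected (it became keyword-only); if that applies to your version, the masking step above can be spelled with the columns keyword instead, a small sketch of the same line:
# same masking step, with explicit keyword arguments (sketch for newer pandas)
s = df.mask(df['CategID'].eq(2)).drop(columns="CategID").sum()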

Edit Distance between all the columns of a pandas dataframe

I am interested in calculating the edit distances across all the columns of a given pandas DataFrame. Let's say we have a 3*5 DataFrame - I want to output something like this with the distance scores - (column*column matrix)
col1 col2 col3 col4 col5
col1
col2
col3
col4
col5
I want each element of a column to be matched with every element of the other columns. Therefore, every col1*col2 cell is the sum of all the scores from the nested loop over col1 and col2.
I would highly appreciate any help in this regard. Thanks in advance.
INSPECTION_ID STRUCTURE_ID RELOCATE_FID HECO_ID HECO_ID_TAG_NOT_FOUND \
0 100 95308 NaN 18/29 0.0
1 101 95346 NaN Nov-29 0.0
2 102 50008606 NaN 25/29 0.0
3 103 95310 NaN Dec-29 0.0
4 104 95286 NaN 17/29 0.0
OSMOSE_POLE_ID ALTERNATE_ID STREET_NBR STREET_DIRECTIONAL STREET_NAME \
0 NaN NaN 1888 NaN KAIKUNANE
1 NaN NaN 1731 NaN MAKUAHINE
2 NaN NaN 1862 NaN MAKUAHINE
3 NaN NaN 1825 NaN KAIKUNANE
4 NaN NaN 1816 NaN KAIKUNANE
Likewise, I have a (191795, 58) dataset. My objective is to find the edit distance between each column of the dataset so as to understand the patterns between them, if any.
For instance, I want INSPECTION_ID 100 to be checked against all the values of column STRUCTURE_ID, and so on. I understand the need for an optimized iterator in this case. Kindly point me in some direction to solve this problem. Thanks in advance.
Very naive solution (assuming you already have an edit distance function), but it might just work for small datasets:
df = # your dataset

def edit_distance(s1, s2):
    # some code
    # return edit distance of s1, s2

df_distances = []
for i, row in df.iterrows():
    row_distances = []
    for item in row:
        for item2 in row:
            row_distances.append(edit_distance(item, item2))
    df_distances.append(row_distances)
I haven't tested this solution so there might be bugs but the general principle should work. If you don't have an edit distance function, you can use this implementation
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python or one of the many others freely available
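If the goal really is the column-by-column matrix sketched in the question, one way to aggregate the pairwise scores is to loop over column pairs and sum the distances of every value combination. A hedged sketch reusing the edit_distance function assumed above (only practical for small frames, since it does O(n^2) comparisons per column pair):
import itertools
import pandas as pd

def column_distance_matrix(df):
    # cell (c1, c2) = sum of edit distances between every value of c1
    # and every value of c2
    cols = df.columns
    result = pd.DataFrame(0.0, index=cols, columns=cols)
    for c1, c2 in itertools.product(cols, repeat=2):
        result.loc[c1, c2] = sum(edit_distance(str(a), str(b))
                                 for a in df[c1] for b in df[c2])
    return result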

Pandas Python - dividing a column by 100 (then rounding to 2 d.p.)

I have been manipulating some data frames, but unfortunately I have two percentage columns, one in the format '61.72' and the other '0.62'.
I want to just divide the column with the percentages in the '61.72' format by 100 and then round it to 2 d.p. so it is consistent with the rest of the data frame.
Is there an easy way of doing this?
My data frame has two columns, one called 'A' and the other 'B'; I want to format 'B'.
Many thanks!
You can use div with round:
df = pd.DataFrame({'A':[61.75, 10.25], 'B':[0.62, 0.45]})
print (df)
A B
0 61.75 0.62
1 10.25 0.45
df['A'] = df['A'].div(100).round(2)
#same as
#df['A'] = (df['A'] / 100).round(2)
print (df)
A B
0 0.62 0.62
1 0.10 0.45
This question has already been answered, but here is another solution, which is faster and more standard.
df = pd.DataFrame({'x':[10, 3.50], 'y':[30.1, 50.8]})
print (df)
>> x y
0 10.0 30.1
1 3.5 50.8
df = df.loc[:].div(100).round(2)
print (df)
>> x y
0 0.10 0.30
1 0.04 0.51
Why prefer this solution?
Well, this warning is answer enough: assigning through df['A'] can trigger "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead."
Moreover, check this for more background: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
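For completeness, a small sketch of the .loc-style assignment the warning suggests, assuming the same df as in the first answer:
# assign back through .loc to avoid chained-assignment issues
df.loc[:, 'A'] = df['A'].div(100).round(2)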

Python dataframe groupby multiple columns with conditional sum

I have a df which looks like that:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
I am grouping the df by col1 and col2, and for each member of each group I want to sum the target values of only the other group members whose now date is earlier (before) than the current member's previous date.
For example for:
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
I want to sum the target values of:
col1 col2 now previous target
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
to eventually have:
col1 col2 now previous target sum
A 1 1-1-2015 4-1-2014 0.2 1.8
Interesting problem; I've got something that I think may work, although it has slow time complexity: worst case O(n**3) and best case O(n**2).
Setup data
import pandas as pd
import numpy as np
import io
datastring = io.StringIO(
"""
col1 col2 now previous target
A 1 1-1-2015 4-1-2014 0.2
B 0 2-1-2015 2-5-2014 0.33
A 0 3-1-2013 3-9-2011 0.1
A 1 1-1-2014 4-9-2011 1.7
A 1 31-12-2014 4-9-2014 1.9
C 1 31-12-2014 4-9-2014 1.9
""")
# arguments for pandas.read_csv
kwargs = {
    "sep": "\s+",          # specifies that it's a space separated file
    "parse_dates": [2,3],  # parse "now" and "previous" as dates
}
# read the csv into a pandas dataframe
df = pd.read_csv(datastring, **kwargs)
Pseudo code for algorithm
For each row:
    For each *other* row:
        If "now" of *other* row comes before "previous" of row
        Then add *other* row's "target" to "sum" of row
Run the algorithm
First, set up a function f() that is to be applied over all the groups computed by df.groupby(["col1","col2"]). All that f() does is try to implement the pseudo code above.
def f(df):
    _sum = np.zeros(len(df))
    # represent the desired columns of the sub-dataframe as a numpy object
    data = df[["now","previous","target"]].values
    # loop through the rows in the sub-dataframe, df
    for i, outer_row in enumerate(data):
        # for each row, loop through all the rows again
        for j, inner_row in enumerate(data):
            # skip iteration if outer loop row is equal to the inner loop row
            if i == j: continue
            # get the dates from the rows
            outer_prev = outer_row[1]
            inner_now = inner_row[0]
            # if the "previous" datetime of the outer loop is greater than
            # the "now" datetime of the inner loop, then add "target"
            # to the cumulative sum
            if outer_prev > inner_now:
                _sum[i] += inner_row[2]
    # add a new column for this new "sum" that we calculated
    df["sum"] = _sum
    return df
Now just apply f() over the grouped data.
done = df.groupby(["col1","col2"]).apply(f)
Output
col1 col2 now previous target sum
0 A 1 2015-01-01 2014-04-01 0.20 1.7
1 B 0 2015-02-01 2014-02-05 0.33 0.0
2 A 0 2013-03-01 2011-03-09 0.10 0.0
3 A 1 2014-01-01 2011-04-09 1.70 0.0
4 A 1 2014-12-31 2014-04-09 1.90 1.7
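As a possible follow-up, the double loop inside f() can be replaced by a numpy broadcast comparison per group. A hedged sketch of the same logic (not benchmarked; assumes the df built above and that groupby.apply behaves as in the answer):
def f_vectorized(g):
    # mask[i, j] is True when row j's "now" is before row i's "previous"
    mask = g["now"].values[None, :] < g["previous"].values[:, None]
    np.fill_diagonal(mask, False)  # exclude the row itself
    return g.assign(sum=(mask * g["target"].values).sum(axis=1))

done = df.groupby(["col1", "col2"]).apply(f_vectorized)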
