Edit Distance between all the columns of a pandas dataframe - python

I am interested in calculating the edit distances across all the columns of a given pandas DataFrame. Let's say we have a 3*5 DataFrame; I want to output the distance scores as something like this (a column*column matrix):
col1 col2 col3 col4 col5
col1
col2
col3
col4
col5
I want each element of a column to be matched with every element of the other columns, so each col1*col2 cell is the sum of the edit distance scores over the nested loop of col1 and col2.
I would highly appreciate any help in this regard. Thanks in advance. Here are the first few rows of my actual dataset:
INSPECTION_ID STRUCTURE_ID RELOCATE_FID HECO_ID HECO_ID_TAG_NOT_FOUND \
0 100 95308 NaN 18/29 0.0
1 101 95346 NaN Nov-29 0.0
2 102 50008606 NaN 25/29 0.0
3 103 95310 NaN Dec-29 0.0
4 104 95286 NaN 17/29 0.0
OSMOSE_POLE_ID ALTERNATE_ID STREET_NBR STREET_DIRECTIONAL STREET_NAME \
0 NaN NaN 1888 NaN KAIKUNANE
1 NaN NaN 1731 NaN MAKUAHINE
2 NaN NaN 1862 NaN MAKUAHINE
3 NaN NaN 1825 NaN KAIKUNANE
4 NaN NaN 1816 NaN KAIKUNANE
Likewise, I have a (191795, 58) dataset. My objective is to find the edit distance between each pair of columns of the dataset, so as to understand the patterns between them, if any.
For instance, I want INSPECTION_ID 100 to be checked against all the values of column STRUCTURE_ID, and so on. I understand the need for an optimized iterator in this case. Kindly throw some direction my way to solve this problem. Thanks in advance.

Very naive solution (assuming you already have an edit distance function) but might just work for small datasets
df = ...  # your dataset

def edit_distance(s1, s2):
    # some code
    # return edit distance of s1, s2
    ...

df_distances = []
for i, row in df.iterrows():
    row_distances = []
    for item in row:
        for item2 in row:
            row_distances.append(edit_distance(item, item2))
    df_distances.append(row_distances)
I haven't tested this solution so there might be bugs, but the general principle should work. If you don't have an edit distance function, you can use this implementation:
https://en.wikibooks.org/wiki/Algorithm_Implementation/Strings/Levenshtein_distance#Python or one of the many others freely available.
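To get the column*column matrix the question asks for, a sketch along the following lines might work (assuming levenshtein is an edit distance function such as the Wikibooks implementation linked above; values are cast to str before comparison). Summing over every pair of values is quadratic per column pair, so for a (191795, 58) dataset you would almost certainly want to sample rows first:
import itertools
import pandas as pd

def column_distance_matrix(df, distance, sample=1000):
    # work on a row sample, since comparing all pairs is O(n^2) per column pair
    sub = df.sample(min(sample, len(df)), random_state=0).astype(str)
    cols = sub.columns
    out = pd.DataFrame(0.0, index=cols, columns=cols)
    for c1, c2 in itertools.combinations_with_replacement(cols, 2):
        total = sum(distance(a, b) for a in sub[c1] for b in sub[c2])
        out.loc[c1, c2] = out.loc[c2, c1] = total
    return out

# usage (hypothetical): scores = column_distance_matrix(df, levenshtein)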

Related

Pandas ffill on section of DataFrame

I am attempting to forward fill a filtered section of a DataFrame but it is not working the way I hoped.
I have a df that looks like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 NaN string
I want it to look like this:
Col Col2
0 1 NaN
1 NaN NaN
2 3 string
3 3 string
This is my current code:
filter = (df["col2"] == "string")
df.loc[filter, "col"].fillna(method="ffill", inplace=True)
But my code does not change the df at all. Any feedback is greatly appreciated
We can use boolean indexing to filter the section of Col where Col2 equals 'string', then forward fill and assign the values back to only that section:
m = df['Col2'].eq('string')
df.loc[m, 'Col'] = df.loc[m, 'Col'].ffill()
Col Col2
0 1.0 NaN
1 NaN NaN
2 3.0 string
3 3.0 string
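As for why the original attempt changed nothing: df.loc[filter, "col"].fillna(..., inplace=True) runs fillna on a temporary copy returned by .loc, so df itself is never modified; assigning the result back, as above, is what makes the change stick. A minimal contrast, using the mask m and column names from the example above:
# has no effect on df: fillna fills a temporary copy returned by .loc
df.loc[m, 'Col'].fillna(method='ffill', inplace=True)

# works: compute the filled values, then assign them back into df
df.loc[m, 'Col'] = df.loc[m, 'Col'].ffill()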
I am not sure I understand your question, but if you want to fill the NaN values (or any other values) you could use SimpleImputer:
from sklearn.impute import SimpleImputer
Then you can define an imputer that fills these missing values/NaN with a specific strategy. For example, if you want to fill them with the mean of the whole column, you can write it as follows:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
Or you can write it like this if the NaN is stored as a string:
imputer = SimpleImputer(missing_values="NaN", strategy='mean')
And if you want to fill it with a specific value you can do this:
imputer = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value="YOUR VALUE")
Then you can use it like this:
df[["Col"]] = imputer.fit_transform(df[["Col"]])

Pandas combine rows in groups to get rid of Nans

I want to do something similar to what pd.combine_first() does, but as a row-wise operation performed on a shared index, and to also add a new column in place of the old ones while keeping the original values of the shared column name.
In this case the 'ts' column is one that I want to replace with time_now.
import pandas as pd

time_now = "2022-08-05"
row1 = {'unique_id':5,'ts': '2022-08-02','id2':2,'val':300, 'ffo1':55, 'debt':200}
row2 = {'unique_id':5,'ts': '2022-08-03' ,'id2':2, 'esg':True,'gov_rt':90}
row3 = {'unique_id':5,'ts': '2022-08-04','id2':2, 'rank':5,'buy_or_sell':'sell'}
df = pd.DataFrame([row1,row2,row3])
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-02 2 300.0 55.0 200.0 NaN NaN NaN
1 5 2022-08-03 2 NaN NaN NaN True 90.0 NaN
2 5 2022-08-04 2 NaN NaN NaN NaN NaN 5.0
buy_or_sell
0 NaN
1 NaN
2 sell
My desired output is below, using the new timestamp, but keeping the old ones based on their group index.
rows = [{'unique_id':5, 'ts':time_now ,'id2':2,'val':300, 'ffo1':55, 'debt':200,'esg':True,'gov_rt':90,'rank':5,'buy_or_sell':'sell', 'ts_1':'2022-08-02','ts_2':'2022-08-03', 'ts_3':'2022-08-04'}]
output = pd.DataFrame(rows)
unique_id ts id2 val ffo1 debt esg gov_rt rank \
0 5 2022-08-05 2 300 55 200 True 90 5
buy_or_sell ts_1 ts_2 ts_3
0 sell 2022-08-02 2022-08-03 2022-08-04
The part below seems to work when run by itself. But I cannot get it to work inside of a function because of differences between index lengths.
df2 = df.set_index('ts').stack().reset_index()
rows = dict(zip(df2['level_1'], df2[0]))
ts = df2['ts'].unique().tolist()
for cnt, value in enumerate(ts):
    rows[f'ts_{cnt}'] = value
# build a single records-oriented row
df2 = pd.DataFrame([rows])
df2['time'] = time_now
df2
The problem was that I forgot to put the dictionary into a list to create a records-oriented dataframe. Additionally, when using a similar function, the index might need to be dropped when resetting it, as duplicated columns might be created otherwise.
I still wonder if there's a better way to do what I want, since it's kind of slow.
def func(df):
    df2 = df.set_index('ts').stack().reset_index()
    rows = dict(zip(df2['level_1'], df2[0]))
    ts = df2['ts'].unique().tolist()
    for cnt, value in enumerate(ts):
        rows[f'ts_{cnt}'] = value
    # collapse the group into a single records-oriented row
    df2 = pd.DataFrame([rows])
    df2['time'] = time_now
    return df2

# run this
df.groupby('unique_id').apply(func)
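Regarding the "better way": one possibly simpler alternative (a sketch, not benchmarked, and the ts_1/ts_2/... names are assumptions matching the desired output) is to rely on GroupBy.first(), which skips NaN, instead of stacking:
def combine_group(g):
    # first() skips NaN, so this collapses the group to its first non-null values
    out = g.drop(columns='ts').groupby('unique_id', as_index=False).first()
    out.insert(1, 'ts', time_now)
    # keep the original timestamps as extra columns ts_1, ts_2, ...
    for i, v in enumerate(g['ts'], start=1):
        out[f'ts_{i}'] = v
    return out

result = pd.concat([combine_group(g) for _, g in df.groupby('unique_id')],
                   ignore_index=True)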

Find index of first row closest to value in pandas DataFrame

So I have a dataframe containing multiple columns. For each column, I would like to get the index of the first row that is nearly equal to a user specified number (e.g. within 0.05 of desired number). The dataframe looks kinda like this:
ix col1 col2 col3
0 nan 0.2 1.04
1 0.98 nan 1.5
2 1.7 1.03 1.91
3 1.02 1.42 0.97
Say I want the first row that is nearly equal to 1.0, I would expect the result to be:
index 1 for col1 (not index 3 even though they are mathematically equally close to 1.0)
index 2 for col2
index 0 for col3 (not index 3 even though 0.97 is closer to 1 than 1.04)
I've tried an approach that makes use of argsort():
df.iloc[(df.col1-1.0).abs().argsort()[:1]]
This would, according to other topics, give me the index of the row in col1 with the value closest to 1.0. However, it returns only a dataframe full of nans. I would also imagine this method does not give the first value close to 1 it encounters per column, but rather the value that is closest to 1.
Can anyone help me with this?
Use DataFrame.sub for the difference, convert to absolute values with abs, compare with lt (<), and finally get the index of the first matching value with DataFrame.idxmax:
a = df.sub(1).abs().lt(0.05).idxmax()
print (a)
col1 1
col2 2
col3 0
dtype: int64
For a more general solution that also works when a column has no value within the tolerance (the boolean mask is all False), append a row filled with True values and named NaN, so idxmax falls back to NaN for that column:
print (df)
col1 col2 col3
ix
0 NaN 0.20 1.07
1 0.98 NaN 1.50
2 1.70 1.03 1.91
3 1.02 1.42 0.87
s = pd.Series([True] * len(df.columns), index=df.columns, name=np.nan)
a = df.sub(1).abs().lt(0.05).append(s).idxmax()
print (a)
col1 1.0
col2 2.0
col3 NaN
dtype: float64
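Note that DataFrame.append was removed in pandas 2.0; on newer versions the same NaN fallback can be built with pd.concat. A sketch, assuming the same df and tolerance as above:
import numpy as np
import pandas as pd

mask = df.sub(1).abs().lt(0.05)
# one extra all-True row labelled NaN acts as the fallback for idxmax
fallback = pd.DataFrame([[True] * len(df.columns)], columns=df.columns, index=[np.nan])
a = pd.concat([mask, fallback]).idxmax()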
Suppose you have some tolerance value tol for the "nearly match" threshold. You can create a mask dataframe for values within the threshold and use first_valid_index() on each column to get the index of the first match occurrence.
tol = 0.05
mask = df[(df - 1).abs() < tol]
for col in df:
    print(col, mask[col].first_valid_index())

Select rows of a pandas DataFrame with at most one null entry

I need to choose rows from a dataframe where columns col1 and col2 follow the condition that at least one of these columns is not null.
Right now, I am trying the below, but it doesn't work:
df=df.loc[(df['Cat1_L2'].isnull()) & (df['Cat2_L3'].isnull())==False]
Setup
(Modifying U8-Forward's data)
import numpy as np
import pandas as pd

df = pd.DataFrame({'Cat1_L2': [1, np.nan, 3, np.nan], 'Cat3_L3': [np.nan, 3, 4, np.nan]})
df
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
3 NaN NaN
Indexing with isna + sum
Fixing your code: ensure the number of True cases (corresponding to NaN in the columns) is less than 2.
df[df[['Cat1_L2', 'Cat3_L3']].isna().sum(axis=1) < 2]
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
dropna with thresh
df.dropna(subset=['Cat1_L2', 'Cat3_L3'], thresh=1)
Cat1_L2 Cat3_L3
0 1.0 NaN
1 NaN 3.0
2 3.0 4.0
One way is to loop over every row using itertuples(). Be aware that this is computationally expensive.
1. Create a list that checks your condition for each row using itertuples()
condition_list = []
for row in df.itertuples():
    # use pd.notna() here: NaN is not equal to None, so a != None check would not work
    if pd.notna(row.Cat1_L2) or pd.notna(row.Cat2_L3):
        condition_list.append(1)
    else:
        condition_list.append(0)
2. Convert list to pandas series
condition_series = pd.Series(condition_list)
3. Append series to original df
df['condition_column'] = condition_series.values
4. Filter df
df_new = df[df.condition_column == 1]
del df_new['condition_column']
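For completeness, the same result can usually be obtained without the explicit loop, using vectorized boolean indexing (a sketch equivalent to steps 1-4 above):
df_new = df[df[['Cat1_L2', 'Cat2_L3']].notna().any(axis=1)]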

Opposite of dropna() in pandas

I have a pandas DataFrame that I want to separate into observations for which there are no missing values and observations with missing values. I can use dropna() to get rows without missing values. Is there any analog to get rows with missing values?
#Example DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1': [1, np.nan, 3, 4, 5], 'col2': [6, 7, np.nan, 9, 10]})
#Get observations without missing values
df.dropna()
Check null by row and filter with boolean indexing:
df[df.isnull().any(axis=1)]
# col1 col2
#1 NaN 7.0
#2 3.0 NaN
~ = Opposite :-)
df.loc[~df.index.isin(df.dropna().index)]
Out[234]:
col1 col2
1 NaN 7.0
2 3.0 NaN
Or
df.loc[df.index.difference(df.dropna().index)]
Out[235]:
col1 col2
1 NaN 7.0
2 3.0 NaN
I use the following expression as the opposite of dropna. In this case, it keeps rows where the specified column is null; anything with a value in that column is not kept.
csv_df = csv_df.loc[~csv_df['Column_name'].notna(), :]
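Tying it back to the question, a minimal sketch that splits the example df into both parts at once:
mask = df.isnull().any(axis=1)
complete_rows = df[~mask]  # what df.dropna() keeps
missing_rows = df[mask]    # the "opposite of dropna"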
