I have two dataframes measuring two properties from an instrument, where the depths are offset by a certain dz. Note that the example below is extremely simplified.
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
How do I get the index of df2.depth_2 that is closest to the first element of df1.depth_1?
Using reindex with method nearest
df2.reset_index().set_index('depth_2').reindex(df1.depth_1,method = 'nearest')['index'].unique()
Out[265]: array([14], dtype=int64)
You can use the pandas merge_asof function (you will need to sort your data first if it isn't already sorted in your real data):
df1 = df1.sort_values(by='depth_1')
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1, df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
If you just wanted that for the first value in df1, you could do the join on the top row only:
df2 = df2.sort_values(by='depth_2')
pd.merge_asof(df1.head(1), df2.reset_index(), left_on="depth_1", right_on="depth_2", direction="nearest")
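With the example frames above, this single-row join should come out roughly as follows (index 14 is the position of the nearest depth_2; treat this as a sketch of the expected shape rather than exact formatting):
   depth_1  index   depth_2
0  0.93625     14  0.940964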
Get the absolute difference between all elements of df2 and the first element of df1, and then get the index of the minimum:
import pandas as pd
import numpy as np
def get_closest(df1, df2, idx):
    abs_diff = np.array([abs(df1['depth_1'][idx] - item) for item in df2['depth_2']])
    return abs_diff.argmin()
df1 = pd.DataFrame({'depth_1': [0.936250, 0.959990, 0.978864, 0.991288, 1.023876, 1.045801, 1.062768, 1.077090, 1.101248, 1.129754, 1.147458, 1.160193, 1.191206, 1.218595, 1.256964] })
df2 = pd.DataFrame({'depth_2': [0.620250, 0.643990, 0.662864, 0.675288, 0.707876, 0.729801, 0.746768, 0.761090, 0.785248, 0.813754, 0.831458, 0.844193, 0.875206, 0.902595, 0.940964 ] })
get_closest(df1,df2,0)
Output:
14
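A vectorized alternative along the same lines (a sketch, assuming df1 and df2 as defined above; idxmin returns the index label of the smallest absolute difference):
# absolute difference between every depth_2 and the first depth_1
nearest_idx = (df2['depth_2'] - df1['depth_1'].iloc[0]).abs().idxmin()
print(nearest_idx)  # 14 for the example data above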
I'm trying to write a function that will backfill columns in a dataframe adhering to a condition. The upfill should only be done within groups. I am however having a hard time getting the group object to ungroup. I have tried reset_index as in the example below, but that gets an AttributeError.
Accessing the original df through result.obj doesn't lead to the updated value because there is no inplace for the groupby bfill.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column].bfill(axis="rows", inplace=True)
    return df
Assigning to the dataframe column in the function doesn't work because the GroupBy object doesn't support item assignment.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    for column in df.obj.columns:
        if column.startswith("x"):
            df[column] = df[column].bfill()
    return df
The test I'm trying to get to pass:
def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    result.reset_index()
    assert result["x_value"].equals(Series([4, 4, None, 5, 5]))
You should use the 'transform' method on the grouped DataFrame, like this:
import pandas as pd
def test_upfill():
    df = pd.DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    result = df.groupby("group").transform(lambda x: x.bfill())
    assert result["x_value"].equals(pd.Series([4, 4, None, 5, 5]))

test_upfill()
Here you can find more information about the transform method on GroupBy objects.
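Because transform returns a result aligned with the original index, you can also write it straight back into the original frame if you want to keep the other columns (a minimal sketch, assuming the same test data):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": [1, 2, 2, 3, 3],
    "x_value": [4, 4, None, None, 5],
})
# backfill x_value within each group and assign it back in place
df["x_value"] = df.groupby("group")["x_value"].transform(lambda s: s.bfill())
print(df["x_value"].tolist())  # [4.0, 4.0, nan, 5.0, 5.0]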
Based on the accepted answer, this is the full solution I arrived at, although I have read elsewhere that there are issues with using the obj attribute.
def upfill(df: DataFrameGroupBy) -> DataFrameGroupBy:
    columns = [column for column in df.obj.columns if column.startswith("x")]
    df.obj[columns] = df[columns].transform(lambda x: x.bfill())
    return df

def test_upfill():
    df = DataFrame({
        "id": [1, 2, 3, 4, 5],
        "group": [1, 2, 2, 3, 3],
        "x_value": [4, 4, None, None, 5],
    })
    grouped_df = df.groupby("group")
    result = upfill(grouped_df)
    assert df["x_value"].equals(Series([4, 4, None, 5, 5]))
Assuming I have the following multiindex DF
import numpy as np
import pandas as pd
from random import randint  # used to build the content column below
input_id = np.array(['12345'])
docType = np.array(['pre','pub','app','dw'])
docId = np.array(['34455667'])
sec_type = np.array(['bib','abs','cl','de'])
sec_ids = np.array(['x-y','z-k'])
index = pd.MultiIndex.from_product([input_id,docType,docId,sec_type,sec_ids])
content= [str(randint(1,10))+ '##' + str(randint(1,10)) for i in range(len(index))]
df = pd.DataFrame(content, index=index, columns=['content'])
df.rename_axis(index=['input_id','docType','docId','secType','sec_ids'], inplace=True)
df
I know that I can query a multiindex DF as follows:
# querying a multiindex DF
idx = pd.IndexSlice
df.loc[idx[:,['pub','pre'],:,'de',:]]
Basically, with the help of pd.IndexSlice I can pass the values I want for each of the index levels. In the above case I want the resulting DF where the second index level is 'pub' OR 'pre' and the 4th one is 'de'.
I am looking for a way to pass a range of values to the query, something like the third index level being between 34567 and 45657. Assume those are integers.
pseudocode: df.loc[idx[:,['pub','pre'],XXXXX,'de',:]]
XXXX = ?
EDIT 1:
The docId index level is of text type; it's probably necessary to change it to int first.
Turns out query is very powerful:
df.query('docType in ["pub","pre"] and ("34455667" <= docId <= "3445568") and (secType=="de")')
Output:
content
input_id docType docId secType sec_ids
12345 pre 34455667 de x-y 2##9
z-k 6##1
pub 34455667 de x-y 6##5
z-k 9##8
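If you prefer to stay with pd.IndexSlice, a slice object can stand in for the XXXX placeholder, provided the index is sorted first. A sketch, assuming docId stays a string (so the bounds compare lexicographically, as in the query above); with that level converted to int you would pass integer bounds instead:
idx = pd.IndexSlice
df_sorted = df.sort_index()  # slicing a MultiIndex level requires a sorted (lexsorted) index
df_sorted.loc[idx[:, ['pub', 'pre'], '34455667':'3445568', 'de', :], :]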
I have two dataframes, and I would like to join df1 to df2, where df1 contains a url and df2 contains a list of urls.
The shape of df1 and df2 are different
Example:
df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
df2 = pd.DataFrame({'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})
I would like the dataframes to join on the condition that http://www.example.jp/pro/sanada16 in df1.url exists in df2.urls.
I thought about splitting the list of urls into separate columns, but the number of URLs per row in df2.urls is not fixed.
I tried to add the df1.url substring that matched the df2.urls to a new column, so that I could join on the new column, but I couldn't get it to work:
df2['match'] = df2['urls'].apply(lambda x: x if x in df1['url'])
expected output:
new_df = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16'], 'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/tokyo_kankan, http://www.example.jp/pro/briansawazakiphotography, http://www.example.jp/pro/r_masuda, http://www.example.jp/pro/sanada16, ......]']})
With postgresql I could do:
SELECT
b.url
,a.urls
FROM df2 a
join df1 b
on position(b.url in a.urls)>0
Here is one way, if I understand correctly. You can iterate over the patterns you want to search for, and then store the matches using df.at.
import pandas as pd
data_1 = pd.DataFrame(
{
'url': ['http://www.ex.jp', 'http://www.ex.com']
}
)
data_2 = pd.DataFrame(
{
'url': ['http://www.ex.jp/pro', 'http://www.ex.jp/pro/test', 'http://www.ex.com/path', 'http://www.ex.com/home']
}
)
result = pd.DataFrame(columns = ['pattern', 'matches'])
for i in range(data_1.shape[0]):
    result.loc[i, 'pattern'] = data_1.loc[i, 'url']
    result.at[i, 'matches'] = [j for j in data_2['url'] if data_1.loc[i, 'url'] in j]
print(result)
Gives:
pattern matches
0 http://www.ex.jp [http://www.ex.jp/pro, http://www.ex.jp/pro/test]
1 http://www.ex.com [http://www.ex.com/path, http://www.ex.com/home]
Kudos for updating your question as requested.
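For a more direct translation of the postgresql query, a cross join followed by a substring filter also works (a sketch with shortened, hypothetical frames; how='cross' needs pandas >= 1.2):
import pandas as pd

df1 = pd.DataFrame({'url': ['http://www.example.jp/pro/sanada16']})
df2 = pd.DataFrame({'urls': ['[https://www.example.jp/pro/minoya, http://www.example.jp/pro/sanada16]']})

# cross join, then keep the rows where df1.url occurs somewhere inside df2.urls
new_df = df1.merge(df2, how='cross')
new_df = new_df[new_df.apply(lambda row: row['url'] in row['urls'], axis=1)]
print(new_df)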
I have two datasets with different time formats, like this:
df1 = pd.DataFrame( {'A': [1499503900, 1512522054, 1412525061, 1502527681, 1512532303]})
df2 = pd.DataFrame( {'B' : ['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z' , '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z' , '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'] })
I need to find the nearest date for each entry in the first dataset. It doesn't matter how far away it is; I just need the nearest time. For example:
1499503900 for '2017-07-03T10:20:46.333Z'
1512522054 for '2017-12-15T12:26:01.347Z'
1412525061 for '2017-05-31T08:27:41.943Z'
1502527681 for '2017-08-10T08:48:01.347Z'
1512532303 for '2017-06-05T14:44:56.425Z'
Here are a couple of helpers.
This is for converting to a long-format (epoch) date:
import calendar
import datetime

def time1(date_text):
    date = datetime.datetime.strptime(date_text, "%Y-%m-%dT%H:%M:%S.%fZ")
    return calendar.timegm(date.utctimetuple())
x = '2017-12-15T12:26:01.347Z'
print(time1(x))
out: 1513340761
And this is for converting to ISO format:
import datetime as DT

def time_covert(time):
    seconds_since_epoch = time
    return DT.datetime.utcfromtimestamp(seconds_since_epoch).isoformat()
y = 1499503900
print(time_covert(y))
out = 2017-07-08T08:51:40
Any idea would be extremely useful.
Thank you all in advance!
Here is a quick start:
from datetime import datetime

import numpy as np
import pandas as pd

def time_covert(time):
    seconds_since_epoch = time
    return datetime.utcfromtimestamp(seconds_since_epoch)

# real time series; utc=True followed by tz_localize(None) keeps the index
# tz-naive so it is comparable with the naive datetimes from time_covert
df2['B'] = pd.to_datetime(df2['B'], utc=True).dt.tz_localize(None)
df2.index = df2['B']
del df2['B']

for a in df1['A']:
    print(time_covert(a))
    i = np.argmin(np.abs(df2.index.to_pydatetime() - time_covert(a)))
    print(df2.iloc[i])
I would like to approach this as an algorithmic question rather than a pandas-specific one. My approach is to sort the df2 series and, for each DateTime in df1, perform a binary search on the sorted df2 to get the insertion index. Then check the indexes just below and above the found index to get the desired output.
Here is the code for above procedure.
Use standard pandas DateTime for easy comparison
df1 = pd.DataFrame( {'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame({'B': pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z', '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z', '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z', '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z', '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'], utc=True).tz_localize(None)})  # keep B tz-naive so it is comparable with the naive A column
sort df2 according to dates, and get the position of insertion using binary search
df2 = df2.sort_values('B').reset_index(drop=True)
ind = df2['B'].searchsorted(df1['A'])
Now check for the minimum difference between the index just above and just below the position of the insertion
for index, row in df1.iterrows():
    i = ind[index]
    if i not in df2.index:
        print(df2.iloc[i-1]['B'])
    elif i-1 not in df2.index:
        print(df2.iloc[i]['B'])
    else:
        if abs(df2.iloc[i]['B'] - row['A']) > abs(df2.iloc[i-1]['B'] - row['A']):
            print(df2.iloc[i-1]['B'])
        else:
            print(df2.iloc[i]['B'])
The test outputs are these, for each value in df1 respectively. (Note: Please recheck your outputs given in the question, they do not correspond to the minimum difference)
2017-07-03 10:20:46.333000
2017-11-28 15:25:39.016000
2017-05-30 16:24:03.175000
2017-08-10 08:48:01.347000
2017-11-28 15:25:39.016000
The above procedure has a time complexity of O(N log N) for the sorting and O(log N) per lookup (where N = len(df2)). If df1 is large, this will be a fairly fast approach.
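For completeness, pandas' built-in merge_asof with direction='nearest' gives the same nearest-timestamp match in a single vectorized call (a sketch; utc=True followed by tz_localize(None) just keeps both key columns tz-naive so they are comparable):
import pandas as pd

df1 = pd.DataFrame({'A': pd.to_datetime([1499503900, 1512522054, 1412525061, 1502527681, 1512532303], unit='s')})
df2 = pd.DataFrame({'B': pd.to_datetime(['2017-12-15T11:47:58.119Z', '2017-05-31T08:27:41.943Z',
                                         '2017-06-05T14:44:56.425Z', '2017-05-30T16:24:03.175Z',
                                         '2017-07-03T10:20:46.333Z', '2017-06-16T10:13:31.535Z',
                                         '2017-12-15T12:26:01.347Z', '2017-06-15T16:00:41.017Z',
                                         '2017-11-28T15:25:39.016Z', '2017-08-10T08:48:01.347Z'],
                                        utc=True).tz_localize(None)})

# both sides must be sorted on their join keys
nearest = pd.merge_asof(df1.sort_values('A'), df2.sort_values('B'),
                        left_on='A', right_on='B', direction='nearest')
print(nearest)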
Take the following dataframe:
import pandas as pd
df = pd.DataFrame({'group_name': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'timestamp': [4, 6, 1000, 5, 8, 100],
                   'condition': [True, True, False, True, False, True]})
I want to add two columns:
The row's order within its group
rolling sum of the condition column within each group
I know I can do it with a custom apply, but I'm wondering if anyone has any fun ideas? (Also this is slow when there are many groups.) Here's one solution:
def range_within_group(input_df):
    df_to_return = input_df.copy()
    df_to_return = df_to_return.sort_values('timestamp')
    df_to_return['order_within_group'] = range(len(df_to_return))
    df_to_return['rolling_sum_of_condition'] = df_to_return.condition.cumsum()
    return df_to_return

df.groupby('group_name').apply(range_within_group).reset_index(drop=True)
GroupBy.cumcount does:
Number each item in each group from 0 to the length of that group - 1.
so simply:
>>> gr = df.sort_values('timestamp').groupby('group_name')
>>> df['order_within_group'] = gr.cumcount()
>>> df['rolling_sum_of_condition'] = gr['condition'].cumsum()
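With the example frame above, those two assignments should give roughly the following (shown as a sanity check; the cumcount and cumsum results align back to df by index):
>>> df
  group_name  timestamp  condition  order_within_group  rolling_sum_of_condition
0          A          4       True                    0                         1
1          A          6       True                    1                         2
2          A       1000      False                    2                         2
3          B          5       True                    0                         1
4          B          8      False                    1                         1
5          B        100       True                    2                         2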