I have the following DataFrame:
import pandas as pd

df1 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df1["DATE"] = ["2000-05-01", "2000-05-03", "2001-01-15", "2001-02-20", "2001-02-22"]
df1["QTY1"] = [10, 11, 12, 5, 4]
df1["QTY2"] = [100, 101, 102, 15, 14]
df1["ID"] = [1, 2, 3, 4, 5]
df1["CODE"] = ["A", "B", "C", "D", "E"]
df2 = pd.DataFrame(columns=["DATE", "QTY1", "QTY2", "ID", "CODE"])
df2["DATE"] = ["2000-05-02", "2000-05-04", "2001-01-12", "2001-03-28", "2001-08-21", "2005-07-01"]
df2["QTY1"] = [9, 101, 11, 5.1, 100, 10]
df2["QTY2"] = [99, 12, 1000, 6, 3, 1]
df2["ID"] = [1, 2, 3, 8, 5, 9]
df2["CODE"] = ["F", "G", "H", "I", "L", "M"]
df1:
DATE QTY1 QTY2 ID CODE
0 2000-05-01 10 100 1 A
1 2000-05-03 11 101 2 B
2 2001-01-15 12 102 3 C
3 2001-02-20 5 15 4 D
4 2001-02-22 4 14 5 E
df2:
DATE QTY1 QTY2 ID CODE
0 2000-05-02 9.0 99 1 F
1 2000-05-04 101.0 12 2 G
2 2001-01-12 11.0 1000 3 H
3 2001-03-28 5.1 6 8 I
4 2001-08-21 100.0 3 5 L
5 2005-07-01 10 1 9 M
My goal is to create a signature, with some tolerance, for each row and match the rows of the two DataFrames whose signatures fall within that tolerance.
The signature for each row is structured as follows:
DATE (with a tolerance of +/- 5 days)
QTY1 (with a tolerance of 10%)
QTY2 (with a tolerance of 10%)
ID (exact match).
So, for example, matching the two DataFrames above returns the following rows (the first row of each DataFrame), grouped by signature:
Signature1 2000-05-01 10 100 1 A
2000-05-02 9.0 99 1 F
All the other rows violate one or more tolerances.
Currently I do this with a classic for loop using iterrows(), checking all the fields, but for large DataFrames performance is quite poor.
I was wondering if there is a more pandas-like approach that could speed it up.
Thanks
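For reference, a minimal sketch of one vectorized approach (my assumptions: ID must match exactly, and the 10% tolerances are measured relative to the df1 quantities): merge the two frames on ID to build candidate pairs, then filter the pairs with the date and quantity tolerances.

import pandas as pd

# Candidate pairs: rows sharing the same ID (the exact-match part of the signature).
df1["DATE"] = pd.to_datetime(df1["DATE"])
df2["DATE"] = pd.to_datetime(df2["DATE"])
pairs = df1.merge(df2, on="ID", suffixes=("_1", "_2"))

# Keep only the pairs within the date and quantity tolerances.
ok = (
    ((pairs["DATE_1"] - pairs["DATE_2"]).abs() <= pd.Timedelta(days=5))
    & ((pairs["QTY1_1"] - pairs["QTY1_2"]).abs() <= 0.10 * pairs["QTY1_1"])
    & ((pairs["QTY2_1"] - pairs["QTY2_2"]).abs() <= 0.10 * pairs["QTY2_1"])
)
matches = pairs[ok]
print(matches)

On the sample data this keeps only the ID 1 pair (rows A and F); the merge replaces the row-by-row iterrows() scan with a single vectorized filter.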
Related
I need to drop the rows which do not match a criterion of equal nunique values.
Every value in the "lot" column is associated with two values in the "shipment" column. For each "lot", the number of unique "cargotype" values may or may not differ between its shipments. I want to drop the rows of any "lot" whose shipments have unequal counts of unique "cargotype" values. Columns col4-col6 are irrelevant for the subsetting and just need to be returned.
Code to recreate df
import pandas as pd

df = pd.DataFrame({"lot": ["dfg", "dfg", "dfg", "dfg", "ghj", "ghj", "ghj", "abc", "abc", "abc", "abc"],
                   "shipment": ["a", "b", "a", "b", "c", "d", "d", "e", "f", "e", "e"],
                   "cargotype": ["adam", "chris", "bob", "tom", "chris", "hanna", "chris", "charlie", "king", "su", "min"],
                   "col4": [777, 775, 767, 715, 772, 712, 712, 123, 122, 121, 120],
                   "col5": [13, 12, 13, 12, 14, 12, 12, 15, 16, 17, 18],
                   "col6": [4, 3, 4, 3, 5, 8, 8, 7, 7, 0, 0]})
df
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
4 ghj c chris 772 14 5
5 ghj d hanna 712 12 8
6 ghj d chris 712 12 8
7 abc e charlie 123 15 7
8 abc f king 122 16 7
9 abc e su 121 17 0
10 abc e min 120 18 0
To check uniqueness in the "cargotype" column, I use:
pd.DataFrame((df.groupby(["lot","shipment"])["cargotype"].nunique()))
lot shipment cargotype
abc e 3
f 1
dfg a 2
b 2
ghj c 1
d 2
Answer df should be:
finaldf
lot shipment cargotype col4 col5 col6
0 dfg a adam 777 13 4
1 dfg b chris 775 12 3
2 dfg a bob 767 13 4
3 dfg b tom 715 12 3
Only lot "dfg" remains, because its two shipments have equal counts of unique "cargotype" values.
Thank you!
Don't ask me how, but this creates your desired outcome
import numpy as np
import pandas as pd

def squeeze_nan(x):
    # Shift the non-NaN values of a row to the left-most columns, keeping the original column labels.
    original_columns = x.index.tolist()
    squeezed = x.dropna()
    squeezed.index = [original_columns[n] for n in range(squeezed.count())]
    return squeezed.reindex(original_columns, fill_value=np.nan)

# Unique-cargotype counts per (lot, shipment), one column per shipment, squeezed to the first columns.
df1 = df.groupby(["lot", "shipment"])["cargotype"].nunique().unstack().apply(squeeze_nan, axis=1).dropna(how='all', axis=1)
# Keep only the lots whose two counts (now in columns 'a' and 'b') are equal.
lot = df1[df1['a'] == df1['b']].index
print(df[df['lot'].isin(lot)])
Caveat: not sure if this will work when a lot has more than 2 types of shipment values
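For what it's worth, a shorter alternative sketch (my own, not battle-tested) using groupby.filter on the counts; it also covers lots with more than two shipments by requiring all of a lot's per-shipment counts to be equal:

counts = df.groupby(["lot", "shipment"])["cargotype"].nunique()
# keep a lot only if every one of its shipments has the same number of unique cargotypes
good_lots = counts.groupby(level="lot").filter(lambda s: s.nunique() == 1).index.get_level_values("lot")
finaldf = df[df["lot"].isin(good_lots)]
print(finaldf)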
Env: Python 3.9.6, Pandas 1.3.5
I have a DataFrame and a Series like below
df = pd.DataFrame({"C1" : ["A", "B", "C", "D"]})
sr = pd.Series(data = [1, 2, 3, 4, 5],
index = ["A", "A", "B", "C", "D"])
"""
[DataFrame]
C1
0 A
1 B
2 C
3 D
[Series]
A 1
A 2
B 3
C 4
D 5
"""
What I tried,
df["C2"] = df["C1"].map(sr)
But an InvalidIndexError occurred, because the Series has duplicate keys ("A").
pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
Is there any method to build a DataFrame like one of the following?
C1 C2
0 A 1
1 A 2
2 B 3
3 C 4
4 D 5
or
C1 C2
0 A 1
1 B 3
2 C 4
3 D 5
4 A 2
Row indices do not matter.
The question was heavily edited and now has a very different meaning.
You want a simple merge:
df.merge(sr.rename('C2'),
left_on='C1', right_index=True)
Output:
C1 C2
0 A 1
0 A 2
1 B 3
2 C 4
3 D 5
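Note that the merge keeps df's original index labels (hence the duplicated 0 above); if a clean RangeIndex is preferred, reset_index(drop=True) can be chained on:

df.merge(sr.rename('C2'), left_on='C1', right_index=True).reset_index(drop=True)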
old answer
First, I can't reproduce your issue (tested with 3M rows on pandas 1.3.5).
Second, why do you use slicing and not map? map has the advantage of always producing the correct number of rows (NaN if a key is absent):
Example:
import numpy as np
import pandas as pd

sr = pd.Series({10: "A", 13: "B", 16: "C", 18: "D"})
df = pd.DataFrame({"C1": np.random.randint(10, 20, size=3000000)})
df['C2'] = df['C1'].map(sr)
print(df.head())
output:
C1 C2
0 10 A
1 18 D
2 10 A
3 13 B
4 15 NaN
I have the DataFrame df:
a b c
0 7 5 [[-4, 7], [-5, 6]]
1 13 5 [[-9, 4], [-3, 7]]
I want to flatten the column of list-of-lists cells (column 'c') into a separate DataFrame such that:
each inner list becomes its own row, and
the elements of each inner list are split into new columns.
I manage to obtain the desired result below (I understand there has been an int-to-float conversion, but that does not bother me):
a b d e
0 7 5 -4.0 7.0
1 7 5 -5.0 6.0
2 13 5 -9.0 4.0
3 13 5 -3.0 7.0
However, I believe the way I do it is not ideal: first, it uses a lot of code, and second, it relies on iterrows().
Below is my code:
old_cols = list(df)
old_cols.remove('c')
new_cols = ['d', 'e']
all_cols = old_cols + new_cols
df_flat = pd.DataFrame(columns=all_cols)
for idx, row in df.iterrows():
    data = row['c']
    for entry in data:
        temp_series = pd.Series(index=new_cols)
        temp_series['d'] = entry[0]
        temp_series['e'] = entry[1]
        new_row = pd.concat([row[old_cols], temp_series])
        df_flat = df_flat.append(new_row, ignore_index=True)
Using groupby + apply with pd.DataFrame:
df = df.groupby(['a','b'])\
.apply(lambda x: pd.DataFrame(x['c'].tolist()[0], columns=['c','d']))\
.reset_index([0,1]).reset_index(drop=True)
print(df)
a b c d
0 7 5 -4 7
1 7 5 -5 6
2 13 5 -9 4
3 13 5 -3 7
Explanation:
Each value in column c is a list of lists. To unpack them into separate columns we take x['c'].tolist(); this is wrapped in an extra pair of brackets ([[[values], [values]]]) which is not useful, so x['c'].tolist()[0] gives [[values], [values]]. That is used as the data for pd.DataFrame with columns ['c', 'd'], and finally reset_index is called on levels [0, 1], which are the group keys 'a' and 'b'.
print(pd.DataFrame([[-4, 7], [-5, 6]],columns=['c','d']))
c d
0 -4 7
1 -5 6
print(df.groupby(['a','b'])\
.apply(lambda x: pd.DataFrame(x['c'].tolist()[0], columns=['c','d'])))
c d
a b
7 5 0 -4 7
1 -5 6
13 5 0 -9 4
1 -3 7
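For completeness, on a recent pandas (explode was added in 0.25) a shorter sketch of the same flattening is possible, kept close to the question's desired d/e column names:

out = df.explode('c').reset_index(drop=True)  # one row per inner list
out[['d', 'e']] = pd.DataFrame(out.pop('c').tolist(), index=out.index)  # split each inner list into two columns
print(out)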
This question already has answers here:
How to shift a column in Pandas DataFrame
(9 answers)
Closed 4 years ago.
I have a dataframe like this,
d = {'ID': ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
'value': [23, 23, 52, 52, 36, 36, 46, 46, 9, 9, 110, 110]}
df = pd.DataFrame(data=d)
ID value
0 A 23
1 A 23
2 B 52
3 B 52
4 C 36
5 C 36
6 D 46
7 D 46
8 E 9
9 E 9
10 F 110
11 F 110
Basically, I have replicated the original data set (n rows), so each ID appears twice. The dataframe I want to get looks like this:
ID value
0 A 23
1 B 23
2 B 52
3 C 52
4 C 36
5 D 36
6 D 46
7 E 46
8 E 9
9 F 9
Move the value column one position down while keeping all the row pairings. So I lose the first row of A, the last row of F, and the two 110 values; finally, I have 2n-2 rows.
I think you need:
df = df.set_index('ID').shift().iloc[1:-1].reset_index()
print (df)
ID value
0 A 23.0
1 B 23.0
2 B 52.0
3 C 52.0
4 C 36.0
5 D 36.0
6 D 46.0
7 E 46.0
8 E 9.0
9 F 9.0
I think this will solve your problem
import pandas as pd

d = {'ID': ["A", "A", "B", "B", "C", "C", "D", "D", "E", "E", "F", "F"],
     'value': [23, 23, 52, 52, 36, 36, 46, 46, 9, 9, 110, 110]}
df = pd.DataFrame(data=d)
df['value'] = df['value'].shift(1)
df2 = df[1:11]  # 11 is the number of rows minus 1; adjust it to your dataframe
print(df2)
So if you just want to shift ID by -1 and exclude the last 2 rows:
df['ID'] = df['ID'].shift(-1)
result = df[:-2]
I have a sample table like this:
Dataframe: df
Col1 Col2 Col3 Col4
A 1 10 i
A 1 11 k
A 1 12 a
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
B 2 21 w
B 2 25 e
B 2 36 q
C 1 23 a
C 1 24 b
I'm trying to get all the rows of the (Col1, Col2) group that has the smaller number of records within each Col1, skipping the Col1 values that have only one (Col1, Col2) group (in this example Col1 = 'C'). So the output would be as follows:
A 2 10 w
A 2 11 e
B 1 15 s
B 1 16 d
since group (A,2) has 2 records compared to group (A,1) which has 3 records.
I tried to approach this issue from different angles but just can't seem to get the result I need. I am able to find the groups I want using a combination of groupby, filter and agg, but how do I then use that as a selection filter on df? After spending a lot of time on this, I wasn't even sure the approach was correct, as it looked overly complicated. I am sure there is an elegant solution, but I just can't see it.
Any advice on how to approach this would be greatly appreciated.
I had this to get the groups for which I wanted the rows displayed:
groups = df.groupby(["Col1", "Col2"])["Col2"].agg({'no': 'count'})
filteredGroups = groups.groupby(level=0).filter(lambda group: group.size > 1)
print(filteredGroups.groupby(level=0).agg('idxmin'))
The second line was to account for groups that may have only one record, since I don't want to consider those. Honestly, I tried so many variations and approaches that ultimately did not give me the result I wanted. I see that none of the answers are one-liners, so at least I don't feel like I was overthinking the problem.
df['sz'] = df.groupby(['Col1','Col2'])['Col3'].transform("size")
df['rnk'] = df.groupby('Col1')['sz'].rank(method='min')
df['rnk_rev'] = df.groupby('Col1')['sz'].rank(method='min',ascending=False)
df.loc[ (df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0) ]
Col1 Col2 Col3 Col4 sz rnk rnk_rev
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
Edit: changed "count" to "size" (as in #Marco Spinaci's answer) which doesn't matter in this example but might if there were missing values.
And for clarity, here's what the df looks like before selecting the matching rows.
Col1 Col2 Col3 Col4 sz rnk rnk_rev
0 A 1 10 i 3 3.0 1.0
1 A 1 11 k 3 3.0 1.0
2 A 1 12 a 3 3.0 1.0
3 A 2 10 w 2 1.0 4.0
4 A 2 11 e 2 1.0 4.0
5 B 1 15 s 2 1.0 4.0
6 B 1 16 d 2 1.0 4.0
7 B 2 21 w 3 3.0 1.0
8 B 2 25 e 3 3.0 1.0
9 B 2 36 q 3 3.0 1.0
10 C 1 23 a 2 1.0 1.0
11 C 1 24 b 2 1.0 1.0
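If the helper columns are not wanted in the final selection, they can be dropped afterwards, for example:

result = df.loc[(df['rnk'] == 1.0) & (df['rnk_rev'] != 1.0)].drop(columns=['sz', 'rnk', 'rnk_rev'])
print(result)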
Definitely not a nice answer, but it should work:
tmp = df.groupby(['Col1', 'Col2']).size()
df['occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]][df.Col2[i]])
df['min_occurrencies'] = pd.Series(df.index).apply(lambda i: tmp[df.Col1[i]].min())
df[df.occurrencies == df.min_occurrencies]
But there must be a more clever way to use groupby than creating an auxiliary data frame...
The following solution is based on the groupby.apply methodology. Simpler methods are available by creating a data Series, as in JohnE's method, which I would say is superior.
The solution works by grouping the dataframe at the Col1 level and then passing apply a function that further groups the data by Col2. Each sub-group is then assessed to find the smallest one. Note that ties in size are resolved by whichever group is evaluated first, which may not be desirable.
#create data
import pandas as pd
df = pd.DataFrame({
"Col1" : ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"],
"Col2" : [1, 1, 1, 2, 2, 1, 1, 2, 2, 2],
"Col3" : [10, 11, 12, 10, 11, 15, 16, 21, 25, 36],
"Col4" : ["i", "k", "a", "w", "e", "s", "d", "w", "e", "q"]
})
Grouped = df.groupby("Col1")

def transFunc(x):
    smallest = [None, None]
    sub_groups = x.groupby("Col2")
    for group, data in sub_groups:
        if not smallest[1] or len(data) < smallest[1]:
            smallest[0] = group
            smallest[1] = len(data)
    return sub_groups.get_group(smallest[0])

Grouped.apply(transFunc).reset_index(drop=True)
Edit to assign the result
result = Grouped.apply(transFunc).reset_index(drop = True)
print(result)
I would like to add a shorter yet readable version of JohnE's solution
df['sz'] = df.groupby(['Col1', 'Col2'])['Col3'].transform("size")
g = df.groupby('Col1')['sz']
df[(g.rank(method='min') == 1) & (g.rank(method='min', ascending=False) != 1)]