Find row with closest value - python

I have a very large pandas dataframe with two columns, A and B. For each row containing values a and b in columns A and B respectively, I'd like to find another row with values a' and b' so that the absolute difference between a and a' is as small as possible. I would like to create two new columns: a column containing the "distance" between the two rows (i.e., abs(a - a')), and a column containing b'.
Here are a couple of examples. Let's say we have the following dataframe:
df = pd.DataFrame({'A' : [1, 5, 7, 2, 3, 4], 'B' : [5, 2, 7, 5, 1, 9]})
The first row has (a, b) = (1, 5). The two new columns for
this row would contain the values 1 and 5. Why? Because the closest value to a = 1 is a' = 2, which occurs in the fourth row. The value of b' in that row is 5.
The second row has (a, b) = (5, 2). The two new columns for this row would contain the values 1 and 9. The closest value to a = 5 is a' = 4, which occurs in the last row. The corresponding value of b' in that row is 9.
If the value of a' that minimizes (a - a') isn't unique, ties can be broken arbitrarily (or you can keep all entries).
I believe I need to use the pandas.merge_asof function, which allows for approximate joining. I also think that I need to set merge_asof's direction keyword argument to 'nearest', which selects the value closest (in absolute distance) to the left dataframe's key.
I've read the entire documentation (with examples) for pandas.merge_asof, but forming the correct query is a little bit tricky for me.

Use merge_asof with the allow_exact_matches=False and direction='nearest' parameters; last, subtract the A column from A1 and take the absolute value to get the distance:
df1 = df.sort_values('A')
df = pd.merge_asof(df1,
                   df1.rename(columns={'A': 'A1', 'B': 'B1'}),
                   left_on='A',
                   right_on='A1',
                   allow_exact_matches=False,
                   direction='nearest')
df['A1'] = df['A1'].sub(df['A']).abs()
print (df)
   A  B  A1  B1
0  1  5   1   5
1  2  5   1   5
2  3  1   1   5
3  4  9   1   1
4  5  2   1   9
5  7  7   2   2
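If the original row order matters, a small extension of the above keeps the original index around, restores it at the end, and renames the result columns. This is only a sketch; the names dist and B_nearest are illustrative, not part of the original answer:

import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 2, 3, 4], 'B': [5, 2, 7, 5, 1, 9]})

# keep the original index as a column so the row order can be restored later
df1 = df.sort_values('A').reset_index()
out = pd.merge_asof(df1,
                    df1.rename(columns={'A': 'A1', 'B': 'B1'})[['A1', 'B1']],
                    left_on='A',
                    right_on='A1',
                    allow_exact_matches=False,
                    direction='nearest')
out['dist'] = out['A1'].sub(out['A']).abs()    # |a - a'|
out = (out.drop(columns='A1')
          .rename(columns={'B1': 'B_nearest'})
          .set_index('index')                  # back to the original row order
          .sort_index())
print(out)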

Related

Using series inside indexes of dataframe

I have a dataframe which consists of five columns and five rows:
Pasquil_gifford_stability_table = pd.DataFrame({"1": ['A', 'B', 'B', 'C', 'C'],
                                                "2": ['A', 'B', 'C', 'D', 'D'],
                                                "3": ['B', 'C', 'C', 'D', 'D'],
                                                "4": ['D', 'E', 'D', 'D', 'D'],
                                                "5": ['D', 'F', 'E', 'D', 'D']})
When I want to take an element from the second column and second row, I do it like this:
Pasquil_gifford_stability_table.loc[2][2]
'C'
When I want to take an element from a different row and column, I do it the same way:
Pasquil_gifford_stability_table.loc[1][3]
'E'
When I try to do it in arrays, I get an error:
Pasquil_gifford_stability_table.loc[[2,2]],[[1,3]]
(   1  2  3  4  5
 2  B  C  C  D  E
 2  B  C  C  D  E,  [[1, 3]])
But As the result I should get
['C','E']
How should I solve that problem?
You want lookup:
df.lookup([2, 2], [1, 3])
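Note that DataFrame.lookup was deprecated in pandas 1.2 and later removed, so on recent versions you may need an alternative. A minimal sketch using label-to-position translation and NumPy fancy indexing (the string column labels '3' and '4' are assumptions chosen to match the positions the question selects):

import numpy as np

df = Pasquil_gifford_stability_table
rows = df.index.get_indexer([2, 1])        # row labels -> positions
cols = df.columns.get_indexer(['3', '4'])  # column labels -> positions
print(df.to_numpy()[rows, cols])           # array(['C', 'E'], dtype=object)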

Is there a way to loop through a python data frame, compare column value (nested list) and update another column conditionally?

I have a python data frame as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 3
My goal is to loop through the data frame and compare column B; if the column B values are the same, then update column C to the same number, as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 1
I tried with the code below:
for i, j in df.iterrows():
    if len(df['B'][i]) == len(df['B'][j]) & collections.Counter(df['B'][i]) == collections.Counter(df['B'][j]):
        df['C'][j] == df['C'][i]
    else:
        df['C'][j] == df['C'][j]
I got the error message: unhashable type: 'list'.
Anyone knows what cause this error and better way to do this? Thank you for your help!
Because lists are not hashable, convert the lists to sorted tuples and get the first value per group with GroupBy.transform using 'first':
df['C'] = df.groupby(df.B.apply(lambda x: tuple(sorted(x)))).C.transform('first')
print (df)
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
Detail:
print (df.B.apply(lambda x: tuple(sorted(x))))
0 (3, 4, 9)
1 (4, 8)
2 (3, 4, 9)
Name: B, dtype: object
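For reference, here is the same approach as a self-contained snippet, reconstructing the dataframe from the question (only an illustrative reconstruction):

import pandas as pd

df = pd.DataFrame({'A': [2, 6, 3],
                   'B': [[4, 3, 9], [4, 8], [3, 9, 4]],
                   'C': [1, 2, 3]})

# hashable, order-insensitive key for each list
key = df['B'].apply(lambda x: tuple(sorted(x)))
df['C'] = df.groupby(key)['C'].transform('first')
print(df)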
Not quite sure about the efficiency of the code, but it gets the job done:
uniqueRows = {}

for index, row in df.iterrows():
    duplicateFound = False
    for c_value, uniqueRow in uniqueRows.items():
        if duplicateFound:
            continue
        if len(row['B']) == len(uniqueRow):
            if len(list(set(row['B']) - set(uniqueRow))) == 0:
                print(c_value)
                df.at[index, 'C'] = c_value
                duplicateFound = True
    if not duplicateFound:
        uniqueRows[row['C']] = row['B']

print(df)
print(uniqueRows)
This code first loops over your dataframe. It has a duplicateFound boolean for each row that will be used later.
It then loops over the uniqueRows dict and first checks whether a duplicate has already been found; in that case it continues and skips the remaining comparisons, because they are no longer needed.
Afterwards it compares the lengths of the lists to skip some comparisons, and if they are equal it takes the set difference, which returns a list of the differing elements (an empty list when there are none).
So if that list is empty, it sets the value of the C column at this position using the DataFrame.at accessor (which should be used when assigning while iterating over a dataframe). It also sets the duplicateFound variable to True to prevent further comparisons. If no duplicate was found, duplicateFound is still False, which triggers adding the row to the uniqueRows dict at the end of the loop before moving on to the next row.
In case you have any comments or improvements to my code feel free to discuss and hope this code helps you with your project!
Create a temporary column by applying sorted to each entry in the B column; group by the temporary column to get your matches and get rid of the temporary column.
df1['B_temp'] = df1.B.apply(lambda x: ''.join(str(i) for i in sorted(x)))
df1['C'] = df1.groupby('B_temp').C.transform('min')
df1 = df1.drop('B_temp', axis = 1)
df1
A B C
0 2 [4, 3, 9] 1
1 6 [4, 8] 2
2 3 [3, 9, 4] 1
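One caveat with joining into a string: lists like [1, 2, 3] and [123] would both produce the key '123'. If that could happen with your data, a tuple key avoids the ambiguity (same idea, sketched):

df1['B_temp'] = df1.B.apply(lambda x: tuple(sorted(x)))   # hashable and unambiguous
df1['C'] = df1.groupby('B_temp').C.transform('min')
df1 = df1.drop('B_temp', axis=1)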

Ambiguity in Pandas Dataframe / Numpy Array "axis" definition

I've been very confused about how python axes are defined, and whether they refer to a DataFrame's rows or columns. Consider the code below:
>>> df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]], columns=["col1", "col2", "col3", "col4"])
>>> df
col1 col2 col3 col4
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
So if we call df.mean(axis=1), we'll get a mean across the rows:
>>> df.mean(axis=1)
0 1
1 2
2 3
However, if we call df.drop(name, axis=1), we actually drop a column, not a row:
>>> df.drop("col4", axis=1)
col1 col2 col3
0 1 1 1
1 2 2 2
2 3 3 3
Can someone help me understand what is meant by an "axis" in pandas/numpy/scipy?
As a side note, DataFrame.mean just might be defined wrong. It says in the documentation for DataFrame.mean that axis=1 is supposed to mean a mean over the columns, not the rows...
It's perhaps simplest to remember it as 0=down and 1=across.
This means:
Use axis=0 to apply a method down each column, or to the row labels (the index).
Use axis=1 to apply a method across each row, or to the column labels.
It's also useful to remember that Pandas follows NumPy's use of the word axis. The usage is explained in NumPy's glossary of terms:
Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). [my emphasis]
So the method in the question, df.mean(axis=1), seems to be correctly defined. It takes the mean of entries horizontally across columns, that is, along each individual row. On the other hand, df.mean(axis=0) would be an operation acting vertically downwards across rows.
Similarly, df.drop(name, axis=1) refers to an action on column labels, because they intuitively go across the horizontal axis. Specifying axis=0 would make the method act on rows instead.
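A quick way to see both behaviours side by side on the question's DataFrame (only illustrating the rules above):

import pandas as pd

df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
                  columns=["col1", "col2", "col3", "col4"])

df.mean(axis=0)            # down each column  -> one value per column label
df.mean(axis=1)            # across each row   -> one value per row label
df.drop("col4", axis=1)    # acts on column labels -> removes a column
df.drop(1, axis=0)         # acts on row labels    -> removes a row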
There are already proper answers, but here is another example with more than 2 dimensions.
The axis parameter indicates the axis to be changed.
For example, consider a structure with dimensions a x b x c.
df.mean(axis=1) returns a result with dimensions a x 1 x c.
df.drop("col4", axis=1) returns a result with dimensions a x (b-1) x c.
Here, axis=1 means the second axis, which is b, so the b value is what changes in these examples.
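In NumPy terms (a pandas DataFrame is only 2-D, so a 3-D array is the closer analogy), the reduced axis actually disappears by default and is only kept as length 1 when keepdims=True is passed. A small sketch:

import numpy as np

arr = np.random.rand(2, 3, 4)             # shape a x b x c
arr.mean(axis=1).shape                    # (2, 4): axis 1 is reduced away
arr.mean(axis=1, keepdims=True).shape     # (2, 1, 4): kept as length 1
np.delete(arr, 0, axis=1).shape           # (2, 2, 4): one slice removed along axis 1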
Another way to explain:
// Not realistic but ideal for understanding the axis parameter
df = pd.DataFrame([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]],
                  columns=["idx1", "idx2", "idx3", "idx4"],
                  index=["idx1", "idx2", "idx3"])
---------------------------------------1
|       idx1  idx2  idx3  idx4
| idx1     1     1     1     1
| idx2     2     2     2     2
| idx3     3     3     3     3
0
About df.drop (axis means the position)
A: I wanna remove idx3.
B: **Which one**? // typing while waiting response: df.drop("idx3",
A: The one which is on axis 1
B: OK then it is >> df.drop("idx3", axis=1)
// Result
---------------------------------------1
|       idx1  idx2  idx4
| idx1     1     1     1
| idx2     2     2     2
| idx3     3     3     3
0
About df.apply (axis means direction)
A: I wanna apply sum.
B: Which direction? // typing while waiting response: df.apply(lambda x: x.sum(),
A: The one which is parallel to axis 0
B: OK then it is >> df.apply(lambda x: x.sum(), axis=0)
// Result
idx1 6
idx2 6
idx3 6
idx4 6
It should be more widely known that the string aliases 'index' and 'columns' can be used in place of the integers 0/1. The aliases are much more explicit and help me remember how the calculations take place. Another alias for 'index' is 'rows'.
When axis='index' is used, then the calculations happen down the columns, which is confusing. But, I remember it as getting a result that is the same size as another row.
Let's get some data on the screen to see what I am talking about:
df = pd.DataFrame(np.random.rand(10, 4), columns=list('abcd'))
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
When we want to take the mean of all the columns, we use axis='index' to get the following:
df.mean(axis='index')
a 0.562664
b 0.478956
c 0.410046
d 0.546366
dtype: float64
The same result would be gotten by:
df.mean() # default is axis=0
df.mean(axis=0)
df.mean(axis='rows')
To use an operation that runs left to right across the rows, use axis='columns'. I remember it by thinking that an additional column may be added to my DataFrame:
df.mean(axis='columns')
0 0.499784
1 0.506596
2 0.478461
3 0.448741
4 0.590839
5 0.595642
6 0.512294
7 0.427054
8 0.654669
9 0.281000
dtype: float64
The same result would be gotten by:
df.mean(axis=1)
Add a new row with axis=0/index/rows
Let's use these results to add additional rows or columns to complete the explanation. So, whenever using axis=0/index/rows, it's like getting a new row of the DataFrame. Let's add a row:
df.append(df.mean(axis='rows'), ignore_index=True)
a b c d
0 0.990730 0.567822 0.318174 0.122410
1 0.144962 0.718574 0.580569 0.582278
2 0.477151 0.907692 0.186276 0.342724
3 0.561043 0.122771 0.206819 0.904330
4 0.427413 0.186807 0.870504 0.878632
5 0.795392 0.658958 0.666026 0.262191
6 0.831404 0.011082 0.299811 0.906880
7 0.749729 0.564900 0.181627 0.211961
8 0.528308 0.394107 0.734904 0.961356
9 0.120508 0.656848 0.055749 0.290897
10 0.562664 0.478956 0.410046 0.546366
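Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current versions the same row can be added with pd.concat instead (a sketch; row_means and df_with_total are just illustrative names):

row_means = df.mean(axis='rows').to_frame().T          # one-row DataFrame of column means
df_with_total = pd.concat([df, row_means], ignore_index=True)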
Add a new column with axis=1/columns
Similarly, when axis=1/columns it will create data that can be easily made into its own column:
df.assign(e=df.mean(axis='columns'))
a b c d e
0 0.990730 0.567822 0.318174 0.122410 0.499784
1 0.144962 0.718574 0.580569 0.582278 0.506596
2 0.477151 0.907692 0.186276 0.342724 0.478461
3 0.561043 0.122771 0.206819 0.904330 0.448741
4 0.427413 0.186807 0.870504 0.878632 0.590839
5 0.795392 0.658958 0.666026 0.262191 0.595642
6 0.831404 0.011082 0.299811 0.906880 0.512294
7 0.749729 0.564900 0.181627 0.211961 0.427054
8 0.528308 0.394107 0.734904 0.961356 0.654669
9 0.120508 0.656848 0.055749 0.290897 0.281000
It appears that you can see all the aliases with the following private variables:
df._AXIS_ALIASES
{'rows': 0}
df._AXIS_NUMBERS
{'columns': 1, 'index': 0}
df._AXIS_NAMES
{0: 'index', 1: 'columns'}
When axis='rows' or axis=0, it means access elements in the direction of the rows, up to down. If applying sum along axis=0, it will give us totals of each column.
When axis='columns' or axis=1, it means access elements in the direction of the columns, left to right. If applying sum along axis=1, we will get totals of each row.
Still confusing! But the above makes it a bit easier for me.
I remember it by the change of dimension: if axis=0 the rows change and the columns stay unchanged, and if axis=1 the columns change and the rows stay unchanged.

Keeping the N first occurrences of

The following code will (of course) keep only the first occurrence of 'Item1' in rows sorted by 'Date'. Any suggestions as to how I could get it to keep, say the first 5 occurrences?
## Sort the dataframe by Date and keep only the earliest appearance of 'Item1'
## drop_duplicates considers the column 'Item1' and keeps only the first occurrence
coocdates = data.sort('Date').drop_duplicates(cols=['Item1'])
You want to use head, either on the dataframe itself or on the groupby:
In [11]: df = pd.DataFrame([[1, 2], [1, 4], [1, 6], [2, 8]], columns=['A', 'B'])
In [12]: df
Out[12]:
A B
0 1 2
1 1 4
2 1 6
3 2 8
In [13]: df.head(2) # the first two rows
Out[13]:
A B
0 1 2
1 1 4
In [14]: df.groupby('A').head(2) # the first two rows in each group
Out[14]:
A B
0 1 2
1 1 4
3 2 8
Note: the behaviour of groupby's head was changed in 0.14 (it didn't act like a filter, but modified the index), so you will have to reset the index if using an earlier version.
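Applied to the question's frame (using the column names from the question and the current sort_values API, which replaced the old sort method), that would be something like:

# first 5 occurrences of each Item1, ordered by Date
coocdates = data.sort_values('Date').groupby('Item1').head(5)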
Use groupby() and nth():
According to Pandas docs, nth()
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
Therefore all you need is:
df.groupby('Date').nth([0, 1, 2, 3, 4]).reset_index(drop=False)

Only allow one to one mapping between two columns in pandas dataframe

I have a two-column dataframe df where each row is distinct. One element in one column can map to one or more elements in the other column. I want to filter OUT those elements, so that in the final dataframe an element in one column maps only to a unique element in the other column.
What I am doing is to group by one column and count the duplicates, then remove rows with counts greater than 1, and do the same for the other column. I am wondering if there is a better, simpler way.
Thanks
edit1: I just realized my solution is INCORRECT: removing multi-mapping elements in column A reduces the number of mappings in column B. Consider the following example:
A B
1 4
1 3
2 4
1 maps to 3 and 4, so the first two rows should be removed, and 4 maps to 1 and 2, so the last row should go as well. The final table should be empty. However, my solution will keep the last row.
Can anyone provide a fast and simple solution? Thanks.
Well, you could do something like the following:
>>> df
A B
0 1 4
1 1 3
2 2 4
3 3 5
You only want to keep a row if no other row has its value of 'A' and no other row has its value of 'B'. Only row three meets those conditions in this example:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Aone.merge(Bone,on=['A','B'],how='inner')
A B
0 3 5
Explanation:
>>> Aone = df.groupby('A').filter(lambda x: len(x) == 1)
>>> Aone
A B
2 2 4
3 3 5
The above grabs the rows that may be allowed based on looking at column 'A' alone.
>>> Bone = df.groupby('B').filter(lambda x: len(x) == 1)
>>> Bone
A B
1 1 3
3 3 5
The above grabs the rows that may be allowed based on looking at column 'B' alone. Taking the inner merge (the intersection) then leaves you with only the rows that meet both conditions:
>>> Aone.merge(Bone,on=['A','B'],how='inner')
Note, you could also do a similar thing using groupby/transform, but transform tends to be slowish, so I didn't include it as an alternative.
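For completeness, that groupby/transform variant could look like this (a sketch, not timed):

# keep only rows whose A value and B value are each unique in their column
mask = (df.groupby('A')['A'].transform('size').eq(1) &
        df.groupby('B')['B'].transform('size').eq(1))
result = df[mask]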
