I have a dataframe which consists of five columns and five rows:
Pasquil_gifford_stability_table = pd.DataFrame({"1": ['A', 'B', 'B', 'C', 'C'],
                                                "2": ['A', 'B', 'C', 'D', 'D'],
                                                "3": ['B', 'C', 'C', 'D', 'D'],
                                                "4": ['D', 'E', 'D', 'D', 'D'],
                                                "5": ['D', 'F', 'E', 'D', 'D']})
When I want to take the element from the second row and second column, I do it like this:
Pasquil_gifford_stability_table.loc[2][2]
'C'
When I want to take the element from the third column and first row, I can also do it:
Pasquil_gifford_stability_table.loc[1][3]
'E'
When I try to do the same with arrays of indices, I get the wrong result:
Pasquil_gifford_stability_table.loc[[2,2]],[[1,3]]
(   1  2  3  4  5
 2  B  C  C  D  E
 2  B  C  C  D  E,  [[1, 3]])
But as the result I should get:
['C','E']
How should I solve that problem?
You want lookup:
df.lookup([2, 2], [1, 3])
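Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On a modern version, a minimal sketch of an equivalent (the helper name is mine) translates labels to positions and uses NumPy fancy indexing:

import numpy as np

def lookup(df, row_labels, col_labels):
    # map labels to integer positions (get_indexer returns -1 for missing labels)
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    # pair the positions element-wise on the underlying array
    return df.to_numpy()[rows, cols]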
I have a very large pandas dataframe with two columns, A and B. For each row containing values a and b in columns A and B respectively, I'd like to find another row with values a' and b' so that the absolute difference between a and a' is as small as possible. I would like to create two new columns: a column containing the "distance" between the two rows (i.e., abs(a - a')), and a column containing b'.
Here are a couple of examples. Let's say we have the following dataframe:
df = pd.DataFrame({'A' : [1, 5, 7, 2, 3, 4], 'B' : [5, 2, 7, 5, 1, 9]})
The first row has (a, b) = (1, 5). The two new columns for
this row would contain the values 1 and 5. Why? Because the closest value to a = 1 is a' = 2, which occurs in the fourth row. The value of b' in that row is 5.
The second row has (a, b) = (5, 2). The two new columns for this row would contain the values 1 and 9. The closest value to a = 5 is a' = 4, which occurs in the last row. The corresponding value of b' in that row is 9.
If the value of a' that minimizes (a - a') isn't unique, ties can be broken arbitrarily (or you can keep all entries).
I believe I need to use the pandas.merge_asof function, which allows for approximate joining. I also think that I need to set the merge_asof function's direction keyword argument to 'nearest', which selects the match closest (in absolute distance) to the left dataframe's key.
I've read the entire documentation (with examples) for pandas.merge_asof, but forming the correct query is a little bit tricky for me.
Use merge_asof with the allow_exact_matches=False and direction='nearest' parameters; last, subtract the A column from A1 and take the absolute value:
df1 = df.sort_values('A')                       # merge_asof requires sorted keys
df = pd.merge_asof(df1,
                   df1.rename(columns={'A': 'A1', 'B': 'B1'}),
                   left_on='A',
                   right_on='A1',
                   allow_exact_matches=False,   # do not match a row to itself
                   direction='nearest')         # closest key in absolute distance
df['A1'] = df['A1'].sub(df['A']).abs()          # turn A1 into the distance |a - a'|
print (df)
   A  B  A1  B1
0  1  5   1   5
1  2  5   1   5
2  3  1   1   5
3  4  9   1   1
4  5  2   1   9
5  7  7   2   2
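For a frame this small you can sanity-check the result with a brute-force NumPy sketch (the dist and B1 column names are just illustrative, and masking only the diagonal mirrors allow_exact_matches=False only because the values in A are unique here):

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 7, 2, 3, 4], 'B': [5, 2, 7, 5, 1, 9]})
a = df['A'].to_numpy()

# pairwise absolute distances; mask the diagonal so a row cannot match itself
dist = np.abs(a[:, None] - a[None, :]).astype(float)
np.fill_diagonal(dist, np.inf)

nearest = dist.argmin(axis=1)                  # position of the closest other row
df['dist'] = dist[np.arange(len(a)), nearest]  # |a - a'|
df['B1'] = df['B'].to_numpy()[nearest]         # b' from the matched row
print(df)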
Given any df with only 3 columns and n rows, I'm trying to split it horizontally, in a loop, at the position where the value in a column is max.
Something close to what np.array_split() does, but not necessarily into equal sizes. The cut would have to be at the row holding the value determined by the max rule at that moment in the loop. I imagine the over- or under-cutting bit is not necessarily the harder part.
An example (sorry, it's my first time actually asking a question; formatting code here is still unknown to me):
df = pd.DataFrame({'a': [3,1,5,5,4,4], 'b': [1,7,1,2,5,5], 'c': [2,4,1,3,2,2]})
This df, with the max value condition applied to column b (7), would be cut into a 2-row df and another with 4 rows.
Perhaps this might help you. Assume our n by 3 dataframe is as follows:
df = pd.DataFrame({'a': [1,2,3,4], 'b': [4,3,2,1], 'c': [2,4,1,3]})
>>> df
   a  b  c
0  1  4  2
1  2  3  4
2  3  2  4
3  4  1  3
We can create a list of rows where max values occur for each column.
rows = [df[df[c] == df[c].max()] for c in df.columns]
>>> rows[0]
   a  b  c
3  4  1  3
>>> rows[2]
   a  b  c
1  2  3  4
2  3  2  4
This can also be written as a list of indexes if preferred.
indexes = [i.index for i in rows]
>>> indexes
[Int64Index([3], dtype='int64'), Int64Index([0], dtype='int64'), Int64Index([1, 2], dtype='int64')]
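To actually perform the horizontal cut, here is a minimal sketch under the question's column-b rule (I assume the split goes just after the row holding the max, which yields the 2-row and 4-row frames described):

import pandas as pd

df = pd.DataFrame({'a': [3, 1, 5, 5, 4, 4],
                   'b': [1, 7, 1, 2, 5, 5],
                   'c': [2, 4, 1, 3, 2, 2]})

cut = df['b'].to_numpy().argmax() + 1        # position just after the max of b
top, bottom = df.iloc[:cut], df.iloc[cut:]   # the 2-row and 4-row pieces

# repeated in a loop, always splitting the remainder at its current max:
pieces = []
rest = df
while len(rest):
    cut = rest['b'].to_numpy().argmax() + 1
    pieces.append(rest.iloc[:cut])
    rest = rest.iloc[cut:]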
I have a pandas data frame as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 3
My goal is to loop through the data frame and compare the values in column B; if two rows' B lists contain the same elements (regardless of order), update column C to the same number, as below:
A B C
2 [4,3,9] 1
6 [4,8] 2
3 [3,9,4] 1
I tried with the code below:
for i, j in df.iterrows():
    if len(df['B'][i]) == len(df['B'][j]) and collections.Counter(df['B'][i]) == collections.Counter(df['B'][j]):
        df['C'][j] == df['C'][i]
    else:
        df['C'][j] == df['C'][j]
I got the error message: unhashable type: 'list'.
Does anyone know what causes this error, and a better way to do this? Thank you for your help!
Because lists are not hashable, convert the lists to sorted tuples and get the first value per group with GroupBy.transform and GroupBy.first:
df['C'] = df.groupby(df.B.apply(lambda x: tuple(sorted(x)))).C.transform('first')
print (df)
   A          B  C
0  2  [4, 3, 9]  1
1  6     [4, 8]  2
2  3  [3, 9, 4]  1
Detail:
print (df.B.apply(lambda x: tuple(sorted(x))))
0 (3, 4, 9)
1 (4, 8)
2 (3, 4, 9)
Name: B, dtype: object
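As a side note, a frozenset key (df.B.map(frozenset)) would also group these rows, but it ignores duplicates inside the lists, so the sorted tuple is safer. And if you prefer fresh sequential group ids over reusing the first C value, a small sketch with pd.factorize:

key = df.B.apply(lambda x: tuple(sorted(x)))
df['C'] = pd.factorize(key)[0] + 1   # gives 1, 2, 1 for this data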
Not quite sure about the efficiency of the code, but it gets the job done:
uniqueRows = {}

for index, row in df.iterrows():
    duplicateFound = False
    for c_value, uniqueRow in uniqueRows.items():
        if duplicateFound:
            continue
        if len(row['B']) == len(uniqueRow):
            if len(list(set(row['B']) - set(uniqueRow))) == 0:
                print(c_value)
                df.at[index, 'C'] = c_value
                duplicateFound = True
    if not duplicateFound:
        uniqueRows[row['C']] = row['B']

print(df)
print(uniqueRows)
This code first loops over your dataframe. It keeps a duplicateFound boolean for each row that is used later.
It then loops over the uniqueRows dict and first checks whether a duplicate has already been found; in that case it continues and skips the remaining comparisons, because they are no longer needed.
Afterwards it compares the lengths of the two lists to skip some comparisons, and when they are equal it takes the set difference, which yields an empty list when the two lists contain the same elements.
So if that list is empty, it sets the value of the C column at this position using the dataframe's at function (which should be used when assigning values while iterating over a dataframe). It sets the duplicateFound variable to True to prevent further comparisons. In case no duplicate was found, duplicateFound will still be False and will trigger the addition to the uniqueRows dict at the end of the for loop, before going to the next row.
In case you have any comments or improvements to my code, feel free to discuss, and I hope this code helps you with your project!
Create a temporary column by applying sorted to each entry in the B column; group by the temporary column to get your matches and get rid of the temporary column.
df1['B_temp'] = df1.B.apply(lambda x: ''.join(map(str, sorted(x))))  # map to str since B holds integers
df1['C'] = df1.groupby('B_temp').C.transform('min')
df1 = df1.drop('B_temp', axis = 1)
df1
   A          B  C
0  2  [4, 3, 9]  1
1  6     [4, 8]  2
2  3  [3, 9, 4]  1
I have the dataframe
df = A B B A B
     B B B B A
     A A A B B
     A A B A A
And I want to get a vector with the element that appears the most, per row.
So here I will get [B, B, A, A].
What is the best way to do it, in Python 2?
Let us use mode:
df.T.mode()
   0  1  2  3
0  B  B  A  A
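If you want it as a plain list, take the first mode row; note that when a row has ties, mode returns the tied values sorted and this picks the smallest:

v = df.T.mode().iloc[0].tolist()
# ['B', 'B', 'A', 'A']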
You can get your vector v of the most frequently appearing values with
v = [row.value_counts().idxmax() for _, row in df.iterrows()]
Be careful when you have multiple elements that occur the most.
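If you need to keep all tied values per row instead of just one, a small sketch:

ties = [row.mode().tolist() for _, row in df.iterrows()]  # one list of modes per row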
I am an R user who is currently learning Python and I am trying to replicate a method of selecting columns used in R into Python.
In R, I could select multiple columns like so:
df[,c(2,4:10)]
In Python, I know how iloc works, but I couldn't mix a single column number with a consecutive range of them.
This wouldn't work
df.iloc[:,[1,3:10]]
So, I'll have to drop the second column like so:
df.iloc[:,1:10].drop(df.iloc[:,1:10].columns[1] , axis=1)
Is there a more efficient way of replicating the method from R in Python?
You can use np.r_, which accepts mixed slice notation and scalar indices and concatenates them into a 1-d array:
import numpy as np
df.iloc[:,np.r_[1, 3:10]]
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]])
df
#    0  1  2  3  4  5
# 0  1  2  3  4  5  6
df.iloc[:, np.r_[1, 3:6]]
#    1  3  4  5
# 0  2  4  5  6
As np.r_ produces:
np.r_[1, 3:6]
# array([1, 3, 4, 5])
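np.r_ works with integer positions only; if part of your selection starts from column names, you can translate them first with get_indexer (a sketch, assuming the frame has an 'A' column):

pos = df.columns.get_indexer(['A'])   # name -> integer position
df.iloc[:, np.r_[pos, 3:6]]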
Assuming one wants to select multiple columns of a DataFrame by their name, considering the Dataframe df
df = pandas.DataFrame({'A': ['X', 'Y'],
                       'B': 1,
                       'C': [2, 3]})
Considering one wants the columns A and C, simply use
df[['A', 'C']]
   A  C
0  X  2
1  Y  3
Note that if one wants to use the result later on, one should assign it to a variable.
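For example (the name subset is just illustrative):

subset = df[['A', 'C']]           # a new DataFrame holding those two columns
subset = df[['A', 'C']].copy()    # use copy() if you plan to modify it independently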