More effective way to use pandas get_loc? - python

Task: search a multi-column dataframe for a value (all values are unique) and return the index of that row.
Currently I am using get_loc, but it only seems to allow passing a single column at a time, resulting in a rather ineffective set of try/except statements. Although it works, is anyone aware of a more effective way to do this?
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)), columns=list('ABCD'))
try:
    unique_index = pd.Index(df['A'])
    print(unique_index.get_loc(20))
except KeyError:
    try:
        unique_index = pd.Index(df['B'])
        print(unique_index.get_loc(20))
    except KeyError:
        unique_index = pd.Index(df['C'])
        print(unique_index.get_loc(20))
Loops don't seem to work because of the KeyError that is raised if a column doesn't contain the value. I've looked at functions such as .contains or .isin, but it's the location index that I'm interested in.

You could use np.where, which returns a tuple of row and column indices where your value is present. You can then select just the row from this.
df = pd.DataFrame(np.random.randint(0, 100, size=(4, 4)), columns=list('ABCD'))
indices = np.where(df.values == 20)
rows = indices[0]
if len(rows) != 0:
    print(rows[0])
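If the DataFrame has a non-default index and you want the index label rather than the position, the positional result can be mapped back through df.index (a small follow-up, not part of the original answer):
print(df.index[rows[0]])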

Consider this example instead, using np.random.seed so the data is reproducible:
np.random.seed([3, 1415])
df = pd.DataFrame(
    np.random.randint(200, size=(4, 4)),
    columns=list('ABCD'))
df
A B C D
0 11 98 123 90
1 143 126 55 141
2 139 141 154 115
3 63 104 128 120
We can find where values equal what you're looking for using np.where and slicing. Notice that I used a value of 55 because that's what was in the data generated by the seed I chose. This will work just fine for 20 if it is in your data set. In fact, it'll work if the value appears more than once.
i, j = np.where(df.values == 55)
list(zip(df.index[i], df.columns[j]))
[(1, 'C')]

Use vectorized operations and boolean indexing:
df[(df==20).any(axis=1)].index
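If you need a single row label rather than an Index object, you could take the first element (a small follow-up; it assumes 20 actually occurs somewhere in the frame, otherwise the index is empty):
df[(df == 20).any(axis=1)].index[0]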

Another way
df[df.eq(20)].stack()
Out[1220]:
1 C 20.0
dtype: float64
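To pull just the row labels out of the stacked result (a small follow-up sketch, again assuming the value is present somewhere):
df[df.eq(20)].stack().index.get_level_values(0).tolist()
# e.g. [1] for the frame shown above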

Since other posters used np.where() I'll give another option using any().
df.loc[df.isin([20]).any(axis=1)].index
Since df.isin([20]) returns True wherever the value is found, you can use any(axis=1) inside df.loc[...] to filter to the rows where it occurs.
So here is an example with my df:
A B C D
0 82 7 48 90
1 68 18 90 14 #< ---- notice the 18 here
2 18 34 72 24 #< ---- notice the 18 here
3 69 73 40 86
df.isin([18])
A B C D
0 False False False False
1 False True False False #<- ---- notice the TRUE value
2 True False False False #<- ---- notice the TRUE value
3 False False False False
print(df.loc[df.isin([18]).any(axis=1)].index.tolist())
#output is a list
[1, 2]

Related

Pandas: convert a b-tree of Series objects, (with name set), into a single DataFrame with multi-index

I've created a 4-level b-tree-like structure where each leaf is a Pandas Series and each level is a Series with 2 values indexed by True and False. Each level's Series is named according to the level. The result is not a very useful object, but it is convenient to create.
The code below shows how to create a similar (but simpler) object, that has the same essential properties.
What I really want is a MultiIndex dataframe, where each level of index inherits its name from the same name of that level Series.
import random
import pandas as pd

def sertree(names):
    if len(names) <= 1:
        ga = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        gb = pd.Series([random.randint(0, 100) for x in range(5)], name='last')
        return pd.Series([ga, gb], index=[True, False], name=names[0])
    else:
        xa = sertree(names[1:])
        xb = sertree(names[1:])
        return pd.Series([xa, xb], index=[True, False], name=names[0])

pp = sertree(['top', 'next', 'end'])

n = 4
while True:
    print(f"{'':>{n}s}{pp.name}")
    n += 4
    if len(pp) > 2:
        break
    pp = pp[True]
    top
        next
            end
                last
What I want is something like this...
top=[True,False]
nxt=[True,False]
end=[True,False]
last=range(5)
midx = pd.MultiIndex.from_product([top,nxt,end,last],names=['top','next','end','last']) ;
midf = pd.DataFrame([random.randint(0,100) for x in range(len(midx))], index=midx, columns=['name'])
In [593]: midf.head(12)
Out[593]:
name
top next end last
True True True 0 99
1 74
2 16
3 61
4 3
False 0 44
1 46
2 59
3 14
4 82
False True 0 98
1 93
Any ideas how to transform my 'pp' abomination into a nice multi-index DataFrame with a Pandas method I'm failing to see? Essential is to maintain each Series' name as the MultiIndex name at the corresponding level.
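Not a verified answer, just a minimal sketch of one possible approach: walk the nested Series recursively, collect the index tuples and leaf values, and take the MultiIndex level names from each Series' name along one branch. The helper nested_to_frame below is hypothetical and assumes the structure produced by sertree above (it should be fed a freshly built tree, since the demo loop reassigns pp).
import pandas as pd

def nested_to_frame(ser, value_col='name'):
    rows = []

    # Collect level names by descending one branch of the tree.
    names = []
    node = ser
    while isinstance(node, pd.Series):
        names.append(node.name)
        if isinstance(node.iloc[0], pd.Series):
            node = node.iloc[0]
        else:
            break

    # Depth-first walk collecting (index-tuple, leaf-value) pairs.
    def walk(s, keys):
        if isinstance(s.iloc[0], pd.Series):
            for key, child in s.items():
                walk(child, keys + (key,))
        else:
            for key, val in s.items():
                rows.append((keys + (key,), val))

    walk(ser, ())
    idx = pd.MultiIndex.from_tuples([k for k, _ in rows], names=names)
    return pd.DataFrame([v for _, v in rows], index=idx, columns=[value_col])

midf = nested_to_frame(sertree(['top', 'next', 'end']))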

Comparing pandas DataFrames where column values are lists

I have some chemical data that I'm trying to process using Pandas. I have two dataframes:
C_atoms_all.head()
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
1 218 2 C C [4.8792, 11.9845, 14.6312]
2 219 3 C C [4.8373, 10.7563, 13.9466]
3 220 4 C C [4.7366, 10.9327, 12.5408]
4 6573 5 C C [1.9482, -3.8747, 19.6319]
C_atoms_a.head()
id_a index_a label_a species_a position
0 55 1 C C [6.609, 6.6024, 19.3302]
1 56 2 C C [4.8792, 11.9844, 14.6313]
2 57 3 C C [4.8372, 10.7565, 13.9467]
3 58 4 C C [4.7367, 10.9326, 12.5409]
4 59 5 C C [5.1528, 15.5976, 14.1249]
What I want to do is get a mapping of all of the id_all values to the id_a values where their positions match. You can see that for C_atoms_all.iloc[0] (id_all 217) and C_atoms_a.iloc[0] (id_a 55) the position values match (within a small fudge factor), which I should also account for in the query.
The problem I'm having is that I can't merge or filter on the position columns because lists aren't hashable in Python.
I'd ideally like to return a dataframe that looks like so:
id_all id_a position
217 55 [6.609, 6.6024, 19.3301]
... ... ...
for every row where the position values match.
You can do it like below:
I named your C_atoms_all as df_all and C_atoms_a as df_a:
# First we try to extract different values in "position" columns for both dataframes.
df_all["val0"] = df_all["position"].str[0]
df_all["val1"] = df_all["position"].str[1]
df_all["val2"] = df_all["position"].str[2]
df_a["val0"] = df_a["position"].str[0]
df_a["val1"] = df_a["position"].str[1]
df_a["val2"] = df_a["position"].str[2]
# Then, because the position values only match within a small fudge factor,
# we round them to three decimals.
df_all.loc[:, ["val0", "val1", "val2"]] = df_all[["val0", "val1", "val2"]].round(3)
df_a.loc[:, ["val0", "val1", "val2"]] = df_a[["val0", "val1", "val2"]].round(3)
# We use loc to modify the original dataframe, instead of a copy of it.
# Then we use merge on three extracted values from position column
df = df_all.merge(df_a, on=["val0", "val1", "val2"], left_index=False, right_index=False,
suffixes=(None, "_y"))
# Finally we just keep the desired columns
df = df[["id_all", "id_a", "position"]]
print(df)
id_all id_a position
0 217 55 [6.609, 6.6024, 19.3301]
1 218 56 [4.8792, 11.9845, 14.6312]
2 219 57 [4.8373, 10.7563, 13.9466]
3 220 58 [4.7366, 10.9327, 12.5408]
This isn't pretty, but it might work for you
def do(x, df_a):
    try:
        return next((df_a.iloc[i]['id_a'] for i in df_a.index if df_a.iloc[i]['position'] == x))
    except StopIteration:
        return np.NAN

match = pd.DataFrame(C_atoms_all[['id_all', 'position']])
match['id_a'] = C_atoms_all['position'].apply(do, args=(C_atoms_a,))
You can create a new column in both datasets that contains the hash of the position column and then merge both datasets by that new column.
# Custom hash function
def hash_position(position):
    return hash(tuple(position))

# Create the hash column "hashed_position"
C_atoms_all['hashed_position'] = C_atoms_all['position'].apply(hash_position)
C_atoms_a['hashed_position'] = C_atoms_a['position'].apply(hash_position)

# Merge datasets
C_atoms_a.merge(C_atoms_all, how='inner', on='hashed_position')
# ... keep the columns you need
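One caveat: the hash only matches coordinate lists that are exactly equal, so the small fudge factor in the data is not absorbed. A possible tweak (hypothetical, not part of the original answer) is to round to a common precision before hashing, with the caveat that values straddling a rounding boundary can still hash differently:
def hash_position_rounded(position, ndigits=3):
    # Hypothetical variant: round before hashing so near-equal coordinates collide.
    return hash(tuple(round(p, ndigits) for p in position))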
Your question is not entirely clear, though it seems an interesting one. For that reason I have reproduced your data in a more useful format, in case someone can help more than I can.
Data
C_atoms_all = pd.DataFrame({
    'id_all': [217, 218, 219, 220, 6573],
    'index_all': [1, 2, 3, 4, 5],
    'label_all': ['C', 'C', 'C', 'C', 'C'],
    'species_all': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3301], [4.8792, 11.9845, 14.6312], [4.8373, 10.7563, 13.9466], [4.7366, 10.9327, 12.5408], [1.9482, -3.8747, 19.6319]]})

C_atoms_a = pd.DataFrame({
    'id_a': [55, 56, 57, 58, 59],
    'index_a': [1, 2, 3, 4, 5],
    'label_a': ['C', 'C', 'C', 'C', 'C'],
    'species_a': ['C', 'C', 'C', 'C', 'C'],
    'position': [[6.609, 6.6024, 19.3302], [4.8792, 11.9844, 14.6313], [4.8372, 10.7565, 13.9467], [4.7367, 10.9326, 12.5409], [5.1528, 15.5976, 14.1249]]})
Solution
# New dataframe bringing together the two position columns
df3 = C_atoms_all.set_index('index_all').join(C_atoms_a.set_index('index_a').loc[:, 'position'].to_frame(), rsuffix='_r').reset_index()
# Create a temp column holding the coordinate-wise (rounded) differences between the two position columns
df3['temp'] = df3.filter(regex='^position').apply(lambda x: np.round(np.array(x[0]) - np.array(x[1]), 4), axis=1)
# Treat rows as matching when more than one of the rounded differences is exactly 0.0
C_atoms_all[df3.explode('temp').groupby(level=0)['temp'].apply(lambda x: x.eq(0).sum()).gt(1)]
id_all index_all label_all species_all position
0 217 1 C C [6.609, 6.6024, 19.3301]
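For completeness, here is a sketch of a different route that handles the tolerance explicitly with NumPy broadcasting and np.isclose. This is my own sketch, not one of the answers above; it assumes the toy C_atoms_all / C_atoms_a frames defined in the previous answer and an absolute tolerance of 1e-3.
import numpy as np
import pandas as pd

# Stack the list-valued 'position' columns into plain (n, 3) float arrays.
pos_all = np.vstack(C_atoms_all['position'].to_numpy())
pos_a = np.vstack(C_atoms_a['position'].to_numpy())

# Pairwise comparison: rows i and j match when all three coordinates agree within atol.
matches = np.isclose(pos_all[:, None, :], pos_a[None, :, :], atol=1e-3).all(axis=2)
i, j = np.where(matches)

mapping = pd.DataFrame({
    'id_all': C_atoms_all['id_all'].to_numpy()[i],
    'id_a': C_atoms_a['id_a'].to_numpy()[j],
    'position': C_atoms_all['position'].to_numpy()[i],
})
print(mapping)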

executing a validation (expression) on a dataframe

I have a dataframe df with 2 columns
Sub_marks Total_marks
40 90
60 80
100 90
0 0
I need to find all the rows that fail the criterion Sub_marks <= Total_marks.
Currently I am using a sympy function as below:
import sympy as sy

def fn_validate(formula, **args):
    exp = sy.sympify(formula)
    exp = exp.subs(args)
    return exp
I am calling the above function using the apply method as below:
df['val_check'] = df.apply(lambda row: fn_validate('X<=Y', X=row['Sub_marks'], Y=row['Total_marks']), axis=1)
I am expecting a val_check column with the True/False result of the expression validation, but for the rows with 0 values I am getting an error:
Invalid NaN Comparison
I can't remove these values from the dataframe.
Please let me know if there is any other way to do this expression validation.
You can try this:
df['val_check'] = df.Sub_marks <= df.Total_marks
df
Sub_marks Total_marks val_check
0 40 90 True
1 60 80 True
2 100 90 False
3 0 0 True
You can also compare the columns directly; the result is a boolean Series:
condition = df['Sub_marks'] <= df['Total_marks']
print(condition.tolist())
Output:
[True, True, False, True]
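To answer the question as asked, i.e. to list the rows that fail the criterion, the comparison can simply be flipped (a small follow-up using the same df):
failing = df[df['Sub_marks'] > df['Total_marks']]
print(failing.index.tolist())
# [2]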

Pandas groupby expanding optimization of syntax

I am using the data from the example shown here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html. Go to the subheading: New syntax to window and resample operations
At the command prompt, the new syntax works as shown in the pandas documentation. But I want to add a new column with the expanded data to the existing dataframe, as would be done in a saved program.
Before a syntax upgrade to the groupby expanding code, I was able to use the following single line code:
df = pd.DataFrame({'A': [1] * 10 + [5] * 10, 'B': np.arange(20)})
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
This gives the expected results, but also gives an 'expanding_sum is deprecated' message. Expected results are:
A B Sum of B
0 1 0 0
1 1 1 1
2 1 2 3
3 1 3 6
4 1 4 10
5 1 5 15
6 1 6 21
7 1 7 28
8 1 8 36
9 1 9 45
10 5 10 10
11 5 11 21
12 5 12 33
13 5 13 46
14 5 14 60
15 5 15 75
16 5 16 91
17 5 17 108
18 5 18 126
19 5 19 145
I want to use the new syntax to replace the deprecated syntax. If I try the new syntax, I get the error message:
df['Sum of B'] = df.groupby('A').expanding().B.sum()
TypeError: incompatible index of inserted column with frame index
I did some searching on here, and saw something that might have helped, but it gave me a different message:
df['Sum of B'] = df.groupby('A').expanding().B.sum().reset_index(level = 0)
ValueError: Wrong number of items passed 2, placement implies 1
The only way I can get it to work is to assign the result to a temporary df, then merge the temporary df into the original df:
temp_df = df.groupby('A').expanding().B.sum().reset_index(level = 0).rename(columns = {'B' : 'Sum of B'})
new_df = pd.merge(df, temp_df, on = 'A', left_index = True, right_index = True)
print (new_df)
This code gives the expected results as shown above.
I've tried different variations using transform as well, but have not been able to come up with a one-line version as I had before the deprecation. Is there a single-line syntax that will work? Thanks.
It seems you need a cumsum:
df.groupby('A')['B'].cumsum()
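Assigned back to the frame from the question, this gives the expected 'Sum of B' column in a single line:
df['Sum of B'] = df.groupby('A')['B'].cumsum()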
TL;DR
df['Sum of B'] = df.groupby('A')['B'].transform(lambda x: x.expanding().sum())
Explanation
We start from the offending line:
df.groupby('A')['B'].transform(lambda x: pd.expanding_sum(x))
Let's read carefully the warning you mentioned:
FutureWarning: pd.expanding_sum is deprecated for Series and will be
removed in a future version, replace with
Series.expanding(min_periods=1).sum()
After reading Pandas 0.17.0: pandas.expanding_sum it becomes clear that the Series the warning is talking about is the first parameter of pd.expanding_sum, i.e. in our case x.
Now we apply the code transformation suggested in the warning. So pd.expanding_sum(x) becomes x.expanding(min_periods=1).sum().
According to Pandas 0.22.0: pandas.Series.expanding min_periods has a default value of 1 so in your case it can be omitted altogether, hence the final result.

pandas dataframe groupby like mysql, yet into new column

df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
df
A mask values
0 11 0 10
1 11 0 15
2 22 0 20
3 22 1 25
Now how can I group by A, keep the column names intact, and yet put a custom function into Z:
def calculate_df_stats(dfs):
    mask_ = list(dfs['mask'])
    mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
    return mean

df['Z'] = df.groupby('A').agg(calculate_df_stats)  # does not work
and generate:
A mask values Z
0 11 0 10 12.5
1 22 0 20 25
Whatever I do, it only replaces the values column with the masked mean.
Also, can your solution be applied to a function of two columns, returning the result in a new column?
Thanks!
Edit:
To clarify more: let's say I have such a table in Mysql:
SELECT * FROM `Reader_datapoint` WHERE `wavelength` = '560'
LIMIT 200;
which gives me such result:
http://pastebin.com/qXiaWcJq
If I run now this:
SELECT *, avg(action_value) FROM `Reader_datapoint` WHERE `wavelength` = '560'
group by `reader_plate_ID`;
I get:
datapoint_ID plate_ID coordinate_x coordinate_y res_value wavelength ignore avg(action_value)
193 1 0 0 2.1783 560 NULL 2.090027083333334
481 2 0 0 1.7544 560 NULL 1.4695583333333333
769 3 0 0 2.0161 560 NULL 1.6637885416666673
How can I replicate this behaviour in Pandas? Note that all the column names stay the same, the first value is taken, and the new column is added.
If you want the original columns in your result, you can first calculate the grouped and aggregated dataframe (you will have to aggregate your original columns in some way; I took the first occurring value as an example):
>>> df = pd.DataFrame({'A':[11,11,22,22],'mask':[0,0,0,1],'values':np.arange(10,30,5)})
>>>
>>> grouped = df.groupby("A")
>>>
>>> result = grouped.agg('first')
>>> result
mask values
A
11 0 10
22 0 20
and then add a column 'Z' to that result by applying your function on the groupby result 'grouped':
>>> def calculate_df_stats(dfs):
...     mask_ = list(dfs['mask'])
...     mean = np.ma.array(list(dfs['values']), mask=mask_).mean()
...     return mean
...
>>> result['Z'] = grouped.apply(calculate_df_stats)
>>>
>>> result
mask values Z
A
11 0 10 12.5
22 0 20 20.0
In your function definition you can use as many columns as you like (just refer to them by name) to compute the result.
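For example, here is a hypothetical function combining the two columns (my own sketch, continuing the session above, where grouped and result are already defined):
def masked_sum(dfs):
    # Sum 'values' only where 'mask' is 0, per group.
    return dfs.loc[dfs['mask'] == 0, 'values'].sum()

result['masked_sum'] = grouped.apply(masked_sum)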

Categories

Resources