This is a generalized function I want to use to check if each row of a dataframe follows a specific trend in column values.
def follows_trend(row):
    trend = None
    if row[("col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3")]:
        trend = True
    else:
        trend = False
    return trend
I'll apply it like this
df_trend = df.apply(follows_trend, axis=1)
When I do, it returns all True when there are clearly some rows that should return False. I'm not sure if there is something wrong with the inequality I used or the function itself.
The compound comparison doesn't "reach inside" the indexing brackets. "col_5" < "col_6" < "col_4" < "col_1" < "col_2" < "col_3" is evaluated first, as a chained comparison of plain strings, and produces False because the names aren't in ascending lexicographic order ("col_6" < "col_4" already fails). So your if statement is actually if row[False]:. You need to do this:
if row["col_5"] < row["col_6"] < row["col_4"] < row["col_1"] < row["col_2"] < row["col_3"]:
If you have a lot of these expressions, you should probably extract this into a method that takes the row and a list of the column names, and compares them in a loop (a sketch of such a helper follows the next snippet). If you only have one, but want a somewhat nicer-looking version, try this:
a, b, c, d, e, f = (row[col] for col in ("col_5", "col_6", "col_4", "col_1", "col_2", "col_3"))
if a < b < c < d < e < f:
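If you do extract a helper, here is a minimal sketch (the order tuple and the cols keyword are illustrative names, not from the question):

def follows_trend(row, cols):
    # True when the row's values strictly increase across cols, in the given order
    return all(row[a] < row[b] for a, b in zip(cols, cols[1:]))

order = ("col_5", "col_6", "col_4", "col_1", "col_2", "col_3")
df_trend = df.apply(follows_trend, axis=1, cols=order)  # extra kwargs are forwarded to the function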
Alternatively, you can reorder the column names, use diff to take the differences between adjacent columns within each row, and compare the result with 0:
(df[["col_5", "col_6", "col_4", "col_1", "col_2", "col_3"]]
.diff(axis=1).drop('col_5', 1).gt(0).all(1))
Example:
import pandas as pd
df = pd.DataFrame({"A": [1,2], "B": [3,1], "C": [4,2]})
df
# A B C
#0 1 3 4
#1 2 1 2
df.diff(axis=1).drop(columns='A').gt(0).all(axis=1)
#0 True
#1 False
#dtype: bool
You could use query for this. See the example below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])
print(df)
print(df.query('col2 > col3 > col1'))  # query accepts a string with multiple comparisons
results in
col1 col2 col3
0 -0.788909 1.591521 1.709402
1 -1.563310 1.188993 2.295683
2 -1.572323 -0.600015 -1.518411
3 1.786051 0.303291 -0.344720
4 0.756029 -0.393941 1.059874
col1 col2 col3
2 -1.572323 -0.600015 -1.518411
Generate an example dataframe
import random
import string

import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=[random.choice(string.ascii_uppercase) for i in range(5)],
    data=np.random.rand(10, 5))
df
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
1 0.933778 0.393021 0.547383 0.469255 0.053089
2 0.994518 0.156547 0.917894 0.070152 0.201373
3 0.077694 0.685540 0.865004 0.830740 0.605135
4 0.760294 0.838441 0.905885 0.146982 0.157439
5 0.116676 0.340967 0.400340 0.293894 0.220995
6 0.632182 0.663218 0.479900 0.931314 0.003180
7 0.726736 0.276703 0.057806 0.624106 0.719631
8 0.677492 0.200079 0.374410 0.962232 0.915361
9 0.061653 0.984166 0.959516 0.261374 0.361677
Now I want to filter a dataframe using the values in the first column, but since I make heavy use of chaining (e.g. df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).pipe(func)) I need a much more compact notation for the operation. Normally you'd do something like
df[df.iloc[:, 0] < 0.5]
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
but the awkwardly redundant syntax is horrible for chaining. I want to replace it with a .query(), and normally you'd use the column name like df.query('V < 0.5'), but here I want to be able to query the table by column index number instead of by name. So in the example, I've deliberately randomized the column names. I also cannot use the table name in the query like df.query('@df[0] < 0.5'), since in a long chain the intermediate result has no name.
I'm hoping there is some syntax such as df.query('_[0] < 0.05') where I can refer to the source table as some symbol _.
You can use f-string notation in df.query:
df.query(f'{df.columns[0]} < .5')
Output:
J M O R N
3 0.114554 0.131948 0.650307 0.672486 0.688872
4 0.272368 0.745900 0.544068 0.504299 0.434122
6 0.418988 0.023691 0.450398 0.488476 0.787383
7 0.040440 0.220282 0.263902 0.660016 0.955950
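One caveat: this interpolates the raw column name into the query string, so it breaks on names containing spaces or operators. Since pandas 0.25 you can backtick-quote the name to guard against that:
df.query(f'`{df.columns[0]}` < .5')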
Update: using the "walrus" operator in Python 3.8+
Let's try this:
((dfout := df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).to_frame(name='values'))
.query(f'{dfout.columns[0]} > -2'))
output:
values
N -1.356779
O -1.202353
M -1.591623
T -1.557801
You can use a lambda function in loc, which is passed the calling dataframe. You can then use iloc for your positional indexing. So you could do:
df.loc[lambda x: x.iloc[:, 0] > 0.5]
This should work in a method chain.
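For instance, a minimal sketch of how it reads mid-chain (the assign step is only a stand-in for whatever unnamed intermediate precedes the filter):

(df
 .assign(row_sum=lambda x: x.sum(axis=1))  # some chained transformation
 .loc[lambda x: x.iloc[:, 0] < 0.5])       # positional filter, no table name needed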
For a single column with index:
df.query(f"{df.columns[0]}<0.5")
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
For multiple columns with index:
idx = [0,1]
col = df.columns[np.r_[idx]]
val = 0.5
query = ' and '.join([f"{i} < {val}" for i in col])
# V < 0.5 and O < 0.5
print(df.query(query))
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
5 0.116676 0.340967 0.400340 0.293894 0.220995
I have a function which calls another one. The objective is to extract a substring based on the position of the nth occurrence of a character, by calling the function get_substr.
def find_nth(string, char, n):
    start = string.find(char)
    while start >= 0 and n > 1:
        start = string.find(char, start + len(char))
        n -= 1
    return start

def get_substr(string, char, n):
    if n == 1:
        return string[0:find_nth(string, char, n)]
    else:
        return string[find_nth(string, char, n - 1) + len(char):find_nth(string, char, n)]
The function works.
Now I want to apply it on a dataframe by doing this.
df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'],'-',1))
I get an error:
KeyError: 'EQ'
I don't understand it as df_g['EQ'] exists.
Can you help me?
Thanks
You forgot about axis=1; without it, the function is applied to each column rather than each row. Consider this simple example:
import pandas as pd
df = pd.DataFrame({'A':[1,2],'B':[3,4]})
df['Z'] = df.apply(lambda x: x['A']*100, axis=1)
print(df)
output
A B Z
0 1 3 100
1 2 4 200
As a side note, if you are working with values from a single column you might use pandas.Series.apply rather than pandas.DataFrame.apply; in the example above that would mean
df['Z'] = df['A'].apply(lambda x: x*100)
in place of
df['Z'] = df.apply(lambda x: x['A']*100, axis=1)
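Applied back to the question's code, either of these should work (the second variant assumes you only need the 'EQ' column):

df_g['F'] = df_g.apply(lambda x: get_substr(x['EQ'], '-', 1), axis=1)  # x is now a row, so x['EQ'] resolves
df_g['F'] = df_g['EQ'].apply(lambda s: get_substr(s, '-', 1))  # or apply over the single column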
I have a dataframe like below
df
           A          B          C
0          0          1  TRANSIT_1
1  TRANSIT_3       None       None
2          0  TRANSIT_5       None
And I want to change it to below:
Resulting DF
           A          B          C          D
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
So I tried to use str.contains, and once I had the series of True/False values, I planned to put it through an eval function to somehow get the table I want.
Code I tried:
series_index = pd.DataFrame()
series_index = df.columns.str.contains("^TRANSIT_", case=True, regex=True)
print(type(series_index))
series_index.index[series_index].tolist()
I thought to use eval function to write it to separate column,like
df = eval(df[result]=the index) # I dont know, But eval function does evaluation and puts it in a separate column
I couldn't find a simple one-liner, but this works:
mask = df1.applymap(lambda x: 'TRA' in x if isinstance(x, str) else False)
idx = list(df1[df1.where(mask).notnull()].stack().index)

a, b = [], []
for sublist in idx:
    a.append(sublist[0])
    b.append(sublist[1])

df1['ans'] = df1.lookup(a, b)
Output
           A          B          C        ans
0          0          1  TRANSIT_1  TRANSIT_1
1  TRANSIT_3       None       None  TRANSIT_3
2          0  TRANSIT_5       None  TRANSIT_5
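Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. A sketch of an equivalent that avoids it, keeping only the matching cells and taking the first non-null value per row:

mask = df1.apply(lambda col: col.map(lambda x: isinstance(x, str) and 'TRA' in x))
df1['ans'] = df1.where(mask).bfill(axis=1).iloc[:, 0]  # back-fill pulls each row's match into the first column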
I'm having trouble shortening my code with a lambda, if that's possible. bp is the name of my data.
My data looks like this:
user  label
1     b
2     b
3     c
I expect to have
user  label  Y
1     b      1
2     b      1
3     c      0
Here is my code:
counts = bp['Label'].value_counts()

def score_to_numeric(x):
    if counts['b'] > counts['s']:
        if x == 'b':
            return 1
        else:
            return 0
    else:
        if x == 'b':
            return 0
        else:
            return 1
bp['Y'] = bp['Label'].apply(score_to_numeric) # apply above function to convert data
It is a function converting the categorical data 'b' or 's' in the column named 'Label' into numeric data, 0 or 1. The line counts = bp['Label'].value_counts() counts the number of 'b' and 's' values in column 'Label'. Then, in score_to_numeric, if the count of 'b' is greater than that of 's', the 'b' rows get the value 1 in a new column called 'Y', and vice versa.
I would like to shorten my code into 3-4 lines at most. I think perhaps using a lambda statement will do this, but I'm not familiar enough with lambdas.
Since True and False evaluate to 1 and 0, respectively, you can simply return the Boolean expression, converted to integer.
def score_to_numeric(x):
    return int((counts['b'] > counts['s']) == (x == 'b'))
It returns 1 iff both expressions have the same Boolean value.
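As a quick sanity check of that identity:

int(True == True)    # 1: 'b' is the majority label and x == 'b'
int(True == False)   # 0: 'b' is the majority label but x != 'b'
int(False == False)  # 1: 'b' is the minority label and x != 'b'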
I don't think you need to use the apply method. Something simple like this should work:
value_counts = bp.Label.value_counts()
bp.loc[bp.Label == 'b', 'Label'] = 1 if value_counts['b'] > value_counts['s'] else 0
bp.loc[bp.Label == 's', 'Label'] = 1 if value_counts['s'] > value_counts['b'] else 0
Note the .loc indexing; chained assignment like bp.Label[bp.Label == 'b'] = 1 would trigger SettingWithCopyWarning and may not modify bp at all.
You could do the following
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
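The same idea can also be written without apply at all; a sketch, assuming only 'b' and 's' occur in the column:

counts = bp['Label'].value_counts()
bp['Y'] = ((bp['Label'] == 'b') == (counts['b'] > counts['s'])).astype(int)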
I am using the pandas library within Python, and I am trying to make every value in a text column the same length. I am trying to do this by adding a specific character (normally whitespace; in this example I will use "_") to each value until it reaches the maximum length in that column.
For example:
Col1_Before
A
B
A1R
B2
AABB4
Col1_After
A____
B____
A1R__
B2___
AABB4
So far I have got this far (using the above table as the example). It is the next part, the part that actually does the padding, that I am stuck on.
df['Col1_Max'] = df.Col1.map(lambda x: len(x)).max()
df['Col1_Len'] = df.Col1.map(lambda x: len(x))
df['Difference_Len'] = df['Col1_Max'] - df['Col1_Len']
I may have not explained myself well as I am still learning. If this is confusing let me know and I will clarify.
consider the pd.Series s
s = pd.Series(['A', 'B', 'A1R', 'B2', 'AABB4'])
solution
use str.ljust
m = s.str.len().max()
s.str.ljust(m, '_')
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4
dtype: object
for your case
m = df.Col1.str.len().max()
df.Col1 = df.Col1.str.ljust(m, '_')
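For reference (my addition, not part of the original answer): str.ljust is equivalent to the more general str.pad with side='right':

s.str.pad(m, side='right', fillchar='_')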
It isn't the most pandas-like solution, but you can try the following:
col = np.array(["A", "B", "A1R", "B2", "AABB4"])
data = pd.DataFrame(col, columns=["Before"])
Now compute the maximum length, the list of individual lengths, and the differences:
max_ = data.Before.map(lambda x: len(x)).max()
lengths_ = data.Before.map(lambda x: len(x))
diffs_ = max_ - lengths_
Create a new column called After adding the underscores, or any other character:
data["After"] = data["Before"] + ["_"*i for i in diffs_]
All this gives:
  Before  After
0      A  A____
1      B  B____
2    A1R  A1R__
3     B2  B2___
4  AABB4  AABB4
Without creating extra columns:
In [63]: data
Out[63]:
Col1
0 A
1 B
2 A1R
3 B2
4 AABB4
In [64]: max_length = data.Col1.map(len).max()
In [65]: data.Col1 = data.Col1.apply(lambda x: x + '_'*(max_length - len(x)))
In [66]: data
Out[66]:
Col1
0 A____
1 B____
2 A1R__
3 B2___
4 AABB4