I don't know if "standardize" is the right word for a categorical string variable, but basically I want to create a function that sets every observation F or f in the column below to 0, and every M or m to 1:
> df['gender']
gender
f
F
f
M
M
m
I tried this:
def padroniza_genero(x):
    if(x == 'f' or x == 'F'):
        replace(['f', 'F'], 0)
    else:
        replace(1)
df1['gender'] = df1['gender'].apply(padroniza_genero)
But I got an error:
NameError: name 'replace' is not defined
Any ideas? Thanks!
There is no replace function defined in your code, hence the NameError.
Back to your goal: use a vectorized function.
Convert to lowercase and map f -> 0, m -> 1:
df['gender_num'] = df['gender'].str.lower().map({'f': 0, 'm': 1})
Or use a comparison (not equal to f) and conversion from boolean to integer:
df['gender_num'] = df['gender'].str.lower().ne('f').astype(int)
output:
gender gender_num
0 f 0
1 F 0
2 f 0
3 M 1
4 M 1
5 m 1
Generalization
You can generalize to any number of categories using pandas.factorize. Advantage: you will get a real Categorical type.
NB: the numeric codes are assigned depending on which value comes first, or in lexicographic order if sort=True:
s, key = pd.factorize(df['gender'].str.lower(), sort=True)
df['gender_num'] = s
key = dict(enumerate(key))
# {0: 'f', 1: 'm'}
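If what you actually want is a Categorical dtype column, here is a minimal alternative sketch (not part of the factorize approach above; the codes follow lexicographic category order):
df['gender_cat'] = df['gender'].str.lower().astype('category')
df['gender_num'] = df['gender_cat'].cat.codes  # f -> 0, m -> 1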
Generate an example dataframe
import random
import string
import numpy as np
import pandas as pd

df = pd.DataFrame(
    columns=[random.choice(string.ascii_uppercase) for i in range(5)],
    data=np.random.rand(10, 5))
df
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
1 0.933778 0.393021 0.547383 0.469255 0.053089
2 0.994518 0.156547 0.917894 0.070152 0.201373
3 0.077694 0.685540 0.865004 0.830740 0.605135
4 0.760294 0.838441 0.905885 0.146982 0.157439
5 0.116676 0.340967 0.400340 0.293894 0.220995
6 0.632182 0.663218 0.479900 0.931314 0.003180
7 0.726736 0.276703 0.057806 0.624106 0.719631
8 0.677492 0.200079 0.374410 0.962232 0.915361
9 0.061653 0.984166 0.959516 0.261374 0.361677
Now I want to filter a dataframe using the values in the first column, but since I make heavy use of chaining (e.g. df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).pipe(func)) I need a much more compact notation for the operation. Normally you'd do something like
df[df.iloc[:, 0] < 0.5]
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
but the awkwardly redundant syntax is horrible for chaining. I want to replace it with a .query(), and normally you'd use the column name like df.query('V < 0.5'), but here I want to be able to query the table by column index number instead of by name. That's why, in the example, I've deliberately randomized the column names. I also cannot use the table name in the query, as in df.query('@df[0] < 0.5'), since in a long chain the intermediate result has no name.
I'm hoping there is some syntax such as df.query('_[0] < 0.05') where I can refer to the source table as some symbol _.
You can use f-string notation in df.query:
df.query(f'{df.columns[0]} < .5')
Output:
J M O R N
3 0.114554 0.131948 0.650307 0.672486 0.688872
4 0.272368 0.745900 0.544068 0.504299 0.434122
6 0.418988 0.023691 0.450398 0.488476 0.787383
7 0.040440 0.220282 0.263902 0.660016 0.955950
Update: using the "walrus" operator in Python 3.8+
Let's try this:
((dfout := df.T.replace(0, np.nan).pipe(np.log2).mean(axis=1).fillna(0).to_frame(name='values'))
.query(f'{dfout.columns[0]} > -2'))
output:
values
N -1.356779
O -1.202353
M -1.591623
T -1.557801
You can use a lambda function in loc, which is passed the DataFrame. You can then use iloc for positional indexing. So you could do:
df.loc[lambda x: x.iloc[:, 0] > 0.5]
This should work in a method chain.
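For instance, in a longer chain (a hedged sketch; the preceding steps are only placeholders, and the intermediate frame never needs a name):
out = (df.replace(0, np.nan)                   # some earlier chained steps
         .loc[lambda x: x.iloc[:, 0] < 0.5])   # positional filter on the first column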
For a single column with index:
df.query(f"{df.columns[0]}<0.5")
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
3 0.077694 0.685540 0.865004 0.830740 0.605135
5 0.116676 0.340967 0.400340 0.293894 0.220995
9 0.061653 0.984166 0.959516 0.261374 0.361677
For multiple columns with index:
idx = [0,1]
col = df.columns[np.r_[idx]]
val = 0.5
query = ' and '.join([f"{i} < {val}" for i in col])
# V < 0.5 and O < 0.5
print(df.query(query))
V O C X E
0 0.060255 0.341051 0.288854 0.740567 0.236282
5 0.116676 0.340967 0.400340 0.293894 0.220995
This is something I wish to ask on top of the question that is already discussed:
Assigning values to variables in a list using a loop
The link above says it is not recommended to perform assignments inside a for loop, because doing so only rebinds the loop variable to new values and doesn't change the original variables.
For example, if I wish to add 1 to each variable:
p = 1
q = 2
r = 3
a = [p,q,r]
for i in a:
    i += 1
print(p, q, r)
# returns 1 2 3
So this doesn't change p, q, and r, since the values incremented by one are assigned to 'i'. I asked some people about a possible solution to this issue, and one suggested that I try the enumerate function. So I did:
p = 1
q = 2
r = 3
a = [p,q,r]
for idx, val in enumerate(a):
    a[idx] = val + 1
print(p, q, r)
# returns 1 2 3
So it still doesn't work. Printing the list 'a' does show the values incremented by 1, but it still doesn't change the values assigned to p, q, and r, which is what I want.
So the only solution I currently have is to do the assignment for each variable manually:
p = 1
q = 2
r = 3
p += 1
q += 1
r += 1
print(p, q, r)
# returns 2 3 4
However, in a hypothetical setting where there are more variables involved than 3, such as 50 variables where I wish to add 1 to each value assigned, doing it manually is going to be very demanding.
So my question is, is there a solution to this without doing it manually? How do we accomplish this using the for loop? Or, if the for loop doesn't work for this case, is there another way?
You can maintain a dictionary, where the keys are strings (the names of the variables that you originally had) and the values are the integers that they're assigned to.
Then, you can do the following:
data = {
"p": 1,
"q": 2,
"r": 3
}
for item in data:
    data[item] += 1
print(data)
This outputs:
{'p': 2, 'q': 3, 'r': 4}
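The same update can also be written as a dict comprehension (a small equivalent sketch):
data = {name: value + 1 for name, value in data.items()}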
Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional, which means that if an invalid value is entered by the user for a certain condition, that condition is skipped completely, defaulting to the original DataFrame (without that specific condition) in return.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (with an increasing number of input parameters), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One':['a','a','a','b'],
'Two':['x','y','y','y'],
'Three':['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are the values that don't belong to the respective column. So, for column 'One', all values other than 'a' and 'b' are invalid. If the user inputs 'a', then I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied, and the original dataframe df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One'] == inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two'] == inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three'] == inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With few invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
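For readability, the same one-liner can be unpacked into a small helper (a sketch; the name valid_filtering simply mirrors the function above):
def valid_filtering(df, inp):
    eq = df.eq(inp)               # True where a cell equals the input value for its column
    keep = eq | ~eq.any(axis=0)   # accept a whole column when its input matches nothing (invalid filter)
    return df.loc[keep.all(axis=1)]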
Here is one way: create a Boolean dataframe by checking each value of inp against its column. Then use any along the rows to get the columns with at least one True, and, once those columns are selected, use all along the columns to keep the rows that match.
def valid_filtering(df, inp):
    # check where inp values are the same as in df (inp is repeated for every row)
    m = (df == pd.DataFrame(data=[inp] * len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True among the wanted columns
    rows = m[cols].all(axis=1)
    # return df with the selected rows
    return df.loc[rows]
Note that if you don't have the same number of filters as columns in your original df, you can use a dictionary instead; it works too, as in the example below, where the column Three is ignored because it is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key is that if a column returns all False (~b.any(), i.e. an invalid filter), then return True for that column so that all of its values are accepted:
mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with few invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no matching rows)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns)
              if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
Related: applying lambda row on multiple columns pandas
I am having trouble shortening my code, with a lambda if possible. bp is the name of my DataFrame.
My data looks like this:
user label
1 b
2 b
3 c
I expect to have
user label Y
1 b 1
2 b 1
3 c 0
Here is my code:
counts = bp['Label'].value_counts()
def score_to_numeric(x):
    if counts['b'] > counts['s']:
        if x == 'b':
            return 1
        else:
            return 0
    else:
        if x == 'b':
            return 0
        else:
            return 1
bp['Y'] = bp['Label'].apply(score_to_numeric) # apply above function to convert data
It is a function converting a categorical data 'b' or 's' in column named 'Label' into numeric data: 0 or 1. The line counts = bp['Label'].value_counts() counts the number of 'b' or 's' in column 'Label'. Then, in score_to_numeric, if the count of 'b' is more than 's', then give value 1 to b in a new column called 'Y', and vice versa.
I would like to shorten my code into 3-4 lines at most. I think perhaps using a lambda statement will do this, but I'm not familiar enough with lambdas.
Since True and False evaluate to 1 and 0, respectively, you can simply return the Boolean expression, converted to integer.
def score_to_numeric(x):
    return int((counts['b'] > counts['s']) == (x == 'b'))
It returns 1 iff both expressions have the same Boolean value.
I don't think you need to use the apply method. Something simple like this should work:
value_counts = bp.Label.value_counts()
bp.loc[bp.Label == 'b', 'Label'] = 1 if value_counts['b'] > value_counts['s'] else 0
bp.loc[bp.Label == 's', 'Label'] = 1 if value_counts['s'] > value_counts['b'] else 0
You could do the following
counts = bp['Label'].value_counts()
t = 1 if counts['b'] > counts['s'] else 0
bp['Y'] = bp['Label'].apply(lambda x: t if x == 'b' else 1 - t)
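A fully vectorized variant of the same idea (a sketch, assuming the Label column only contains 'b' and 's'):
counts = bp['Label'].value_counts()
bp['Y'] = ((bp['Label'] == 'b') == (counts['b'] > counts['s'])).astype(int)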
I have a number of records in a text file that represent days of a 'month' 1-30 and whether a shop is open or closed. The letters represent the shop.
A 00000000001000000000000000000
B 11000000000000000000000000000
C 00000000000000000000000000000
D 00000000000000000000000000000
E 00000000000000000000000000000
F 00000000000000000000000000000
G 00000000000000000000000000000
H 00000000000000000000000000000
I 11101111110111111011111101111
J 11111111111111111111111111111
K 00110000011000001100000110000
L 00010000001000000100000010000
M 00100000010000001000000100000
N 00000000000000000000000000000
O 11011111101111110111111011111
I want to store the 1's and 0's as-is in an array (I'm thinking numpy, but if there is another way (string, bitstring) I'd be happy with that). Then I want to be able to slice one day, i.e. a column, and get the record keys back in a set.
e.g.
A 1
B 0
C 0
D 0
E 0
F 0
G 0
H 0
I 0
J 1
K 1
L 1
M 0
N 0
O 1
day10 = {A,J,K,L,O}
I also need this to be as performant as absolutely possible.
Simplest solution I've come up with:
shops = {}
with open('input.txt', 'r') as f:
    for line in f:
        name, month = line.strip().split()
        shops[name] = [d == '1' for d in month]

dayIndex = 14
result = [s for s, v in shops.items() if v[dayIndex]]
print("Shops opened at", dayIndex, ":", result)
A numpy solution:
stores, isopen = np.genfromtxt('input.txt', dtype="U30", unpack=True)
isopen = np.array([list(s) for s in isopen]) == '1'
Then,
>>> stores[isopen[:,10]]
array(['A', 'J', 'K', 'L', 'O'], dtype='<U30')
with open("datafile") as fin:
D = {i[0]:int(i[:1:-1], 2) for i in fin}
days = [{k for k in D if D[k] & 1<<i} for i in range(31)]
Just keep the days variable between queries
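For example, a lookup then looks like this (a sketch; the day number is the 0-based column index used in the question's example):
day10 = days[10]
# {'A', 'J', 'K', 'L', 'O'}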
First, I would hesitate to write the amount of code needed to make things work with, for example, bitarray.
Second, I already upvoted BartoszKP's answer as it looks like a reasonable approach.
Last, I would use pandas instead of numpy for such a task, as for most operations it will use the underlying numpy functions and will be reasonably fast.
If data contains your records as a string, converting to a DataFrame can be done with
>>> df = pd.DataFrame([[x] + list(map(int, y))
...                    for x, y in [l.split() for l in data.splitlines()]])
>>> df.columns = ['Shop'] + list(map(str, range(1, 30)))
and lookups are done with
>>> df[df['3']==1]['Shop']
8 I
9 J
10 K
12 M
Name: Shop, dtype: object
Use a multilayered dictionary:
all_shops = {'shopA': {1: True, 2: False, 3: True, ...},
             ...}
Then your query is translated to
def query(shop_name, day):
    return all_shops[shop_name][day]
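A hedged sketch of how such a structure could be built from the file (the filename and 1-based day numbering are assumptions):
all_shops = {}
with open('input.txt') as f:
    for line in f:
        shop, flags = line.split()
        all_shops[shop] = {day: flag == '1' for day, flag in enumerate(flags, start=1)}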
with open("datafile") as f:
for line in f:
shop, _days = line.split()
for i,d in enumerate(_days):
if d == '1':
days[i].add(shop)
Simpler, faster, and it answers the question.