Match to re-code letters and numbers in python (pandas)

I have a variable that is mixed with letters and numbers. The letters range from A:Z and the numbers range from 2:8. I want to re-code this variable so that it is all numeric with the letters A:Z now becoming numbers 1:26 and the numbers 2:8 becoming numbers 27:33.
For example, I would like this variable:
Var1 = c('A',2,3,8,'C','W',6,'T')
To become this:
Var1 = c(1,27,28,33,3,23,31,20)
In R I can do this using 'match' like this:
Var1 = as.numeric(match(Var1, c(LETTERS, 2:8)))
How can I do this using python? Pandas?
Thank you

Make a dictionary and map the values:
import string
import numpy as np
dct = dict(zip(list(string.ascii_uppercase) + list(np.arange(2, 9)), np.arange(1, 34)))
# If they are strings of numbers, not integers use:
#dct = dict(zip(list(string.ascii_uppercase) + ['2', '3', '4', '5', '6', '7', '8'], np.arange(1, 34)))
df.col_name = df.col_name.map(dct)
An example:
import pandas as pd
df = pd.DataFrame({'col': [2, 4, 6, 3, 5, 'A', 'B', 'D', 'F', 'Z', 'X']})
df.col.map(dct)
Outputs:
0 27
1 29
2 31
3 28
4 30
5 1
6 2
7 4
8 6
9 26
10 24
Name: col, dtype: int64
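For readers coming from R, the dictionary above plays the role of the lookup table that match searches. A minimal pure-Python sketch of the same idea follows. The helper name match_positions is made up for illustration, and comparing values by their string form is an assumption so that 2 and '2' are treated alike.

```python
import string


def match_positions(values, table):
    """Return the 1-based position of each value in table, or None if absent."""
    # Hypothetical helper: values are compared by their string form,
    # so the integer 2 and the string '2' map to the same position.
    lookup = {str(item): pos for pos, item in enumerate(table, start=1)}
    return [lookup.get(str(v)) for v in values]


# Same table as in the R call: LETTERS followed by 2:8.
table = list(string.ascii_uppercase) + list(range(2, 9))
var1 = ['A', 2, 3, 8, 'C', 'W', 6, 'T']

recoded = match_positions(var1, table)
print(recoded)  # [1, 27, 28, 33, 3, 23, 31, 20]
```

Values that are not in the table come back as None here, whereas R's match returns NA.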

I think this could help you:
Replacing letters with numbers with its position in alphabet
Then you just need to apply it to your df column:
dt.Var1.apply(alphabet_position)
You can also try this:
for i in range(len(var1)):
    if isinstance(var1[i], int):
        var1[i] = var1[i] + 25  # digits 2..8 become 27..33
    else:
        var1[i] = ord(var1[i].lower()) - 96  # ord('a') is 97, so letters become 1..26
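The answer above uses alphabet_position without defining it (it comes from the linked question). One possible sketch of such a function for this particular recoding is given below. The body is an assumption, not the code from the linked answer, and raising ValueError for unexpected input is a design choice made here.

```python
def alphabet_position(value):
    """Recode 'A'..'Z' to 1..26 and the digits 2..8 to 27..33."""
    # Hypothetical implementation: the original answer only names this function.
    text = str(value).strip().upper()
    if len(text) == 1 and 'A' <= text <= 'Z':
        return ord(text) - ord('A') + 1
    if text.isdigit() and 2 <= int(text) <= 8:
        return int(text) + 25
    raise ValueError(f"cannot recode {value!r}")


var1 = ['A', 2, 3, 8, 'C', 'W', 6, 'T']
positions = [alphabet_position(v) for v in var1]
print(positions)  # [1, 27, 28, 33, 3, 23, 31, 20]
```

Because it accepts both 2 and '2', the same function can be passed to Series.apply whether the digits are stored as integers or as strings.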

Related

Is there a way to populate a column based on whether the values of another column fall within a range of numbers in python?

I'm working on a data table and I need to create a new column based on which of the classes the value of another column falls in.
This is the original table:
ID sequence
AJ8 2
FT7 3
JU4 5
ER2 3
LI5 2
FR2 7
WS1 8
UG4 9
The ranges are 2, 3, 4, 6: first; 1,5,0: second; and 7,8, 9: third.
I created the variables
first = ['2', '3', '4', '6']
second = ['1', '5', '0']
third = ['7', '8', '9']
I want to get the following table
ID sequence code
AJ8 2 FIRST
FT7 3 FIRST
JU4 5 SECOND
ER2 3 FIRST
LI5 2 FIRST
FR2 7 THIRD
WS1 8 THIRD
UG4 9 THIRD
How do I do this?

I would create a function that conditionally returns the value you want.
import pandas as pd
keys = ['AJ8', 'FT7', 'JU4', 'ER2', 'LI5', 'FR2', 'WS1', 'UG4']
values = [2, 3, 5, 3, 2, 7, 8, 9]
df = pd.DataFrame(list(zip(keys, values)), columns =['key', 'value'])
def get_new_column(df):
    # With axis=1, apply passes one row at a time, so df here is a single row
    if df['value'] in [2, 3, 4, 6]:
        return 'first'
    elif df['value'] in [1, 5, 0]:
        return 'second'
    elif df['value'] in [7, 8, 9]:
        return 'third'
    else:
        return ''
df['new'] = df.apply(get_new_column, axis=1)
print(df)
Output:
key value new
0 AJ8 2 first
1 FT7 3 first
2 JU4 5 second
3 ER2 3 first
4 LI5 2 first
5 FR2 7 third
6 WS1 8 third
7 UG4 9 third
Here are more examples.
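As an alternative to a row-wise apply, the three groups can be turned into a single lookup dictionary and applied with Series.map. This is only a sketch: the column names follow the question's table, the values are assumed to be stored as integers, and the names groups and value_to_label are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['AJ8', 'FT7', 'JU4', 'ER2', 'LI5', 'FR2', 'WS1', 'UG4'],
    'sequence': [2, 3, 5, 3, 2, 7, 8, 9],
})

# Label -> values, taken from the question (stored here as integers).
groups = {'FIRST': [2, 3, 4, 6], 'SECOND': [1, 5, 0], 'THIRD': [7, 8, 9]}

# Invert to value -> label so that Series.map can look each value up.
value_to_label = {v: label for label, values in groups.items() for v in values}

df['code'] = df['sequence'].map(value_to_label)
print(df)
```

Values that belong to none of the groups end up as NaN rather than an empty string.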

Add labels to Categorical Data in Dataframe

I am trying to convert survey data on the marital status which look as follows:
df['d11104'].value_counts()
[1] Married 1 250507
[2] Single 2 99131
[4] Divorced 4 32817
[3] Widowed 3 24839
[5] Separated 5 8098
[-1] keine Angabe 2571
Name: d11104, dtype: int64
So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding
df['marstat'].value_counts()
1 250507
2 99131
4 32817
3 24839
5 8098
0 2571
Name: marstat, dtype: int64
Now, I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I would like to identify people by the condition df['marstat'] == 1, while at the same time having the labels ['Married','Single','Divorced','Widowed'] attached to this variable. How can this be done?
EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:
df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})

You can convert your result to a dataframe and include both the category code and name in the output.
A dictionary of category mapping can be extracted via enumerating the categories. Minimal example below.
import pandas as pd
df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
'S', 'S', 'M', 'W']}, dtype='category')
print(df.A.cat.categories)
# Index(['D', 'M', 'S', 'W'], dtype='object')
res = df.A.cat.codes.value_counts().to_frame('count')
cat_map = dict(enumerate(df.A.cat.categories))
res['A'] = res.index.map(cat_map.get)
print(res)
# count A
# 1 5 M
# 2 4 S
# 3 2 W
# 0 1 D
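For example, you can select the "M" row of res with either res['A'] == 'M' or res.index == 1.
A more straightforward solution is to use value_counts and then add an extra column for codes:
# rename_axis('index') fixes the column name, which otherwise differs between pandas versions
res = df.A.value_counts().rename_axis('index').to_frame('count').reset_index()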
res['code'] = res['index'].cat.codes
index count code
0 M 5 1
1 S 4 2
2 W 2 3
3 D 1 0
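If the goal is a single column that carries both the labels and recoverable numeric values, one option is to build the categorical directly from the codes. The sketch below uses a made-up sample in place of the survey column and assumes the raw codes run from 1 to 5 with no missing values. The label order follows the mapping given in the question's edit.

```python
import pandas as pd

# Made-up sample of raw survey codes (1..5), standing in for the real column.
raw = pd.Series([1, 2, 4, 3, 5, 1, 2])
labels = ['Married', 'Single', 'Widowed', 'Divorced', 'Separated']

# from_codes expects 0-based codes, so shift the 1-based survey codes down by one.
marstat = pd.Series(
    pd.Categorical.from_codes((raw - 1).to_numpy(), categories=labels)
)

print(marstat.tolist())
recovered = (marstat.cat.codes + 1).tolist()  # back to the original 1..5 values
married_mask = marstat == 'Married'           # selects the same rows as raw == 1
```

Filtering can then use either the label (marstat == 'Married') or the code (marstat.cat.codes == 0).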

Iterate in a dataframe with strings

I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer for the participant is at each position.
I want a column in my dataframe saying yes or no, depending on whether the letter 2 positions earlier was the same letter.
Here is my code :
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error :
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep letters.
When I run it with numbers, I get no errors, but my new column response only contains 'no', even though I know it should be 'yes' 12 times.

It seems like you want to perform a comparison on the column and the same column shifted by two elements. Use shift + np.where -
import numpy as np
df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
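For comparison, here is what the loop from the question was probably meant to do, written in plain Python. The function name two_back_responses is invented for this sketch. The key difference from the original loop is that it iterates over positions, so it can look two items back by index instead of subtracting 2 from the letter itself.

```python
def two_back_responses(letters):
    """Return 'yes' where a letter equals the one two positions earlier, else 'no'."""
    responses = []
    for i, letter in enumerate(letters):
        # The first two items have nothing two steps back, so they are always 'no'.
        if i >= 2 and letter == letters[i - 2]:
            responses.append('yes')
        else:
            responses.append('no')
    return responses


# First ten letters from the question's df.head(10).
sample = ['F', 'I', 'D', 'I', 'H', 'C', 'L', 'G', 'D', 'L']
result = two_back_responses(sample)
print(result)
```

On a dataframe this could be used as df['response'] = two_back_responses(df['letters'].tolist()), although the shift-based version above avoids the explicit loop.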

Comparing two numbers lists with each other in Python

I have a data frame (possibly a list):
A = ['01', '20', '02', '25', '26']
B = ['10', '13', '14', '64', '32']
I would like to compare list A with list B in the following way:
As you can see, strings of numbers in the left column with strings in the right column are compared. Combined are strings that have the same boundary digit, one of which is removed during merging (or after). Why was the string '010' removed? Because each digit can occur only once.

You can perform a couple of string slicing operations and then merge on the common digit.
a
A
0 01
1 20
2 02
3 25
4 26
b
B
0 10
1 13
2 14
3 64
4 32
a['x'] = a.A.str[-1]
b['x'] = b.B.str[0]
b['B'] = b.B.str[1:]
m = a.merge(b)
You could also do this in a single line with assign, without disrupting the original dataframes:
m = a.assign(x=a.A.str[-1]).merge(b.assign(x=b.B.str[0], B=b.B.str[1:]))
For uniques, you'll need to convert to set and check its length.
v = (m['A'] + m['B'])
v.str.len() == v.apply(set).str.len()
0 False
1 True
2 True
3 True
dtype: bool
v[v.str.len() == v.apply(set).str.len()].tolist()
['013', '014', '264']

Something you should be aware of is that if you write the values without quotes you are passing integers, not strings. In Python 2, A = [01, 20, 02, 25, 26] is the same as A = [1, 20, 2, 25, 26] (in Python 3, integer literals with a leading zero such as 01 are a syntax error). If you know you will only ever work with integers <= 99, this won't be an issue. Otherwise, you should use strings instead of integers, like A = ['01', '20', '02', '25', '26']. So the first thing to do is convert the lists to lists of strings. If you know all of the integers will be <= 99, you can do so like this:
A = ['%02d' % i for i in A]
B = ['%02d' % i for i in B]
(you could also name these something different if you want to preserve the integer lists). Then here would be the solution:
final = []
for i in A:
    for j in B:
        # join when the last digit of i equals the first digit of j
        if i[-1] == j[0]:
            final.append(i + j[1:])
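The loop above keeps '010', which the question wants removed because each digit may occur only once. A small sketch that adds that check is shown below. The function name join_on_shared_digit is made up for illustration, and the uniqueness test uses the same set-length idea as the pandas answer.

```python
def join_on_shared_digit(left, right):
    """Join strings whose boundary characters match, keeping unique-digit results."""
    joined = []
    for a in left:
        for b in right:
            # Join when the last character of a equals the first character of b.
            if a[-1] == b[0]:
                candidate = a + b[1:]
                # Keep only results in which every character occurs once.
                if len(set(candidate)) == len(candidate):
                    joined.append(candidate)
    return joined


A = ['01', '20', '02', '25', '26']
B = ['10', '13', '14', '64', '32']
final = join_on_shared_digit(A, B)
print(final)  # ['013', '014', '264']
```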

Extract non-empty values from the regex array output in python

I have a column of type numpy.ndarray which looks like:
col
['','','5','']
['','8']
['6','','']
['7']
[]
['5']
I want the output like this:
col
5
8
6
7
0
5
How can I do this in Python? Any help is highly appreciated.

To convert the data to numeric values you could use:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = pd.to_numeric(df['col'].str.join('')).fillna(0).astype(int)
print(df)
yields
col
0 5
1 8
2 6
3 7
4 0
5 5
To convert the data to strings use:
df['col'] = df['col'].str.join('').replace('', '0')
The result looks the same, but the dtype of the column is object since the values are strings.
If there is more than one number in some rows and you wish to pick the largest,
then you'll have to loop through each item in each row, convert each string to
a numeric value and take the max:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5','6'], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = [max([int(xi) if xi else 0 for xi in x] or [0]) for x in df['col']]
print(df)
yields
col
0 6 # <-- note ['','','5','6'] was converted to 6
1 8
2 6
3 7
4 0
5 5
For versions of pandas prior to 0.17, you could use df.convert_objects instead:
import numpy as np
import pandas as pd
data = list(map(np.array, [ ['','','5',''], ['','8'], ['6','',''], ['7'], [], ['5']]))
df = pd.DataFrame({'col': data})
df['col'] = df['col'].str.join('').replace('', '0')
df = df.convert_objects(convert_numeric=True)

import numpy as np
x = np.array([['', '', '5', ''], ['', '8'], ['6', '', ''], ['7'], [], ['5']],
             dtype=object)
In [20]: for a in x:
   ....:     if len(a) == 0:
   ....:         print(0)
   ....:     else:
   ....:         for b in a:
   ....:             if b:
   ....:                 print(b)
   ....:
5
8
6
7
0
5

I'll leave you with this :
>>> l = ['', '5', '', '']
>>> l = [x for x in l if len(x) != 0]
>>> l
['5']
You can do the same thing using lambda and filter (in Python 3, wrap filter in list() to get a list back):
>>> l = ['', '1', '']
>>> l = list(filter(lambda x: len(x) != 0, l))
>>> l
['1']
The next step would be iterating through the rows of the array and implementing one of these two ideas.
Someone shows how this is done here: Iterating over Numpy matrix rows to apply a function each?
edit: maybe this is down-voted, but I made it on purpose to not give the final code.
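For readers who want the remaining step spelled out, the filtering idea can be applied row by row roughly as follows. This is a sketch in plain Python: the helper name first_non_empty is invented here, rows are assumed to contain only digit strings or empty strings, and 0 is used as the default for empty rows as in the question's expected output.

```python
def first_non_empty(row, default=0):
    """Return the first non-empty string in row as an int, or default if none."""
    for item in row:
        if item:  # empty strings are falsy and are skipped
            return int(item)
    return default


rows = [['', '', '5', ''], ['', '8'], ['6', '', ''], ['7'], [], ['5']]
values = [first_non_empty(r) for r in rows]
print(values)  # [5, 8, 6, 7, 0, 5]
```

With a dataframe, the same helper could be passed to df['col'].apply(first_non_empty).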
