I have a dataframe that looks like below:
|userid|rank2017|rank2018|
|212 |'H' |'H' |
|322 |'L' |'H' |
|311 |'H' |'L' |
I want to create a new column called progress in the dataframe above that will output 1 if rank2017 is equal to rank2018, 2 if rank2017 is 'H' and rank2018 is 'L', and 3 otherwise. Can anybody help me do this in Python?
Here is one way. You do not need to use nested if statements.
import pandas as pd
df = pd.DataFrame({'userid': [212, 322, 311],
                   'rank2017': ['H', 'L', 'H'],
                   'rank2018': ['H', 'H', 'L']})
df['progress'] = 3
df.loc[(df['rank2017'] == 'H') & (df['rank2018'] == 'L'), 'progress'] = 2
df.loc[df['rank2017'] == df['rank2018'], 'progress'] = 1
#    userid rank2017 rank2018  progress
# 0     212        H        H         1
# 1     322        L        H         3
# 2     311        H        L         2
Here is a way using np.select:
import numpy as np
# Set your conditions:
conds = [(df['rank2017'] == df['rank2018']),
         (df['rank2017'] == 'H') & (df['rank2018'] == 'L')]
# Set the value for each condition
choices = [1, 2]
# Use np.select with a default of 3 (your "else" value)
df['progress'] = np.select(conds, choices, default=3)
Returns:
>>> df
userid rank2017 rank2018 progress
0 212 H H 1
1 322 L H 3
2 311 H L 2
If the value in column 'y' is K, multiply the column 'x' value by 1e3. If column 'y' is M, multiply the column 'x' value by 1e6. The code below multiplies all the values by 1e3:
value_list = []
for i in list(result['x'].values):
    if np.where(result['y'] == 'K'):
        value_list.append(float(i)*1e3)
    elif np.where(result['y'] == 'M'):
        value_list.append(float(i)*1e6)
    else:
        value_list.append(np.nan)
df['Value_numeric'] = value_list
df.head().Value_numeric
This case is simple enough that it's not necessary to use a loop or a custom function; one can use a simple assignment:
import pandas as pd
import numpy as np
d = {'x': [750, 5, 4, 240, 220], 'y': ['K', 'M', 'M', 'K', 'K']}
df = pd.DataFrame(data=d)
# here is the main operation:
df['value_numeric'] = np.where(df['y']=='K', df['x'] * 1e3, df['x'] * 1e6)
print(df)
output
x y value_numeric
0 750 K 750000.0
1 5 M 5000000.0
2 4 M 4000000.0
3 240 K 240000.0
4 220 K 220000.0
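Note that np.where with a single condition treats every row that is not 'K' as 'M'. If you also want the NaN fallback from the question's else branch, one option is to map each unit to a multiplier. This is only a sketch: the 'G' row and the name factors are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the 'G' row is added only to show the fallback case.
df = pd.DataFrame({'x': [750, 5, 240, 3], 'y': ['K', 'M', 'K', 'G']})

# Multiplier per unit. Units missing from this dict are mapped to NaN.
factors = {'K': 1e3, 'M': 1e6}
df['value_numeric'] = df['x'] * df['y'].map(factors)
print(df)
```

Adding another unit then only means adding one entry to the dict.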
You can do something like this:
df = pd.DataFrame([[1, 'a'], [2, 'b'], [3, 'c']], columns=['A', 'B'])
def calc(x):
    if x['B'] == 'a':
        return x['A'] * 10
    if x['B'] == 'b':
        return x['A'] * 20
    if x['B'] == 'c':
        return x['A'] * 30
df['calculate'] = df.apply(calc, axis=1)
print(df)
# A B calculate
#0 1 a 10
#1 2 b 40
#2 3 c 90
You can adjust your calculations as needed based on the condition.
I am trying to convert survey data on the marital status which look as follows:
df['d11104'].value_counts()
[1] Married 1 250507
[2] Single 2 99131
[4] Divorced 4 32817
[3] Widowed 3 24839
[5] Separated 5 8098
[-1] keine Angabe 2571
Name: d11104, dtype: int64
So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding
df['marstat'].value_counts()
1 250507
2 99131
4 32817
3 24839
5 8098
0 2571
Name: marstat, dtype: int64
Now, I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I would like to identify people by the condition df['marstat'] == 1 while at the same time having the labels ['Married','Single','Divorced','Widowed'] attached to this variable. How can this be done?
EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:
df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})
You can convert your result to a dataframe and include both the category code and name in the output.
A dictionary of category mapping can be extracted via enumerating the categories. Minimal example below.
import pandas as pd
df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
                         'S', 'S', 'M', 'W']}, dtype='category')
print(df.A.cat.categories)
# Index(['D', 'M', 'S', 'W'], dtype='object')
res = df.A.cat.codes.value_counts().to_frame('count')
cat_map = dict(enumerate(df.A.cat.categories))
res['A'] = res.index.map(cat_map.get)
print(res)
# count A
# 1 5 M
# 2 4 S
# 3 2 W
# 0 1 D
For example, you can select "M" in the result with either res['A'] == 'M' or res.index == 1.
A more straightforward solution is to use value_counts and then add an extra column for the codes:
res = df.A.value_counts().to_frame('count').reset_index()
res['code'] = res['index'].cat.codes
index count code
0 M 5 1
1 S 4 2
2 W 2 3
3 D 1 0
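The name of the column that reset_index produces here can differ between pandas versions, so treat 'index' as something to check against your installation. A minimal sketch that avoids relying on it reads the codes straight from the CategoricalIndex returned by value_counts. The names counts, label and code are my own choices, not part of the answers above.

```python
import pandas as pd

df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
                         'S', 'S', 'M', 'W']}, dtype='category')

# For a categorical column, value_counts returns a CategoricalIndex,
# which exposes the integer code of each label via .codes.
counts = df['A'].value_counts()
res = pd.DataFrame({'label': list(counts.index),
                    'count': counts.to_numpy(),
                    'code': counts.index.codes})
print(res)
```

Each row of res then carries the label, its count and its integer code together.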
I have a pandas dataframe that I am trying to sort by the column 'dist'. The sorted dataframe should start with E or F, as shown below. I use sort_values, but it is not working for me. The function computes distances from the 'Start' location to a list of locations ['C', 'B', 'D', 'E', 'A', 'F'] and is then supposed to sort the dataframe in ascending order by the 'dist' column.
Could someone advise me why the sorting is not working?
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
loc_list
Out[194]: ['C', 'B', 'D', 'E', 'A', 'F']
def closest_locations(from_loc_point, to_loc_list):
    lresults=list()
    for list_index in range(len(to_loc_list)):
        dist= hypot(locations[from_loc_point[0]][0] -locations[to_loc_list[list_index]][0],locations[from_loc_point[0]][1] -locations[to_loc_list[list_index]][1]) # cumsum distante
        lista_dist = [from_loc_point[0],to_loc_list[list_index],dist]
        lresults.append(lista_dist[:])
    RESULTS = pd.DataFrame(np.array(lresults))
    RESULTS.columns = ['from','to','dist']
    RESULTS.sort_values(['dist'],ascending=[True],inplace=True)
    RESULTS.index = range(len(RESULTS))
    return RESULTS
closest_locations(['Start'], loc_list)
Out[189]:
from to dist
0 Start D 10.19803902718557
1 Start A 10.19803902718557
2 Start C 15.132745950421555
3 Start B 15.132745950421555
4 Start E 6.08276253029822
5 Start F 6.08276253029822
closest_two_loc.dtypes
Out[247]:
from object
to object
dist object
dtype: object
Is this what you want?
import numpy as np
import pandas as pd
locations = {'Start':(20,5),'A':(10,3), 'B':(5,3), 'C':(5, 7), 'D':(10,7),'E':(14,4),'F':(14,6)}
df = pd.DataFrame.from_dict(locations, orient='index').rename(columns={0:'x', 1:'y'})
df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc['Start', 'x'])**2 + (row['y'] - df.loc['Start', 'y'])**2), axis=1)
df.drop(['Start']).sort_values(by='dist')
x y dist
E 14 4 6.082763
F 14 6 6.082763
A 10 3 10.198039
D 10 7 10.198039
C 5 7 15.132746
B 5 3 15.132746
or if you want to wrap it in a function
def dist_from(df, col):
    df = df.copy()  # avoid modifying the caller's dataframe
    df['dist'] = df.apply(lambda row: np.sqrt((row['x'] - df.loc[col, 'x'])**2 + (row['y'] - df.loc[col, 'y'])**2), axis=1)
    df['from'] = col
    df = df.drop([col]).sort_values(by='dist')
    df.index.name = 'to'
    return df.reset_index().loc[:, ['from', 'to', 'dist']]
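The row-wise apply can also be replaced by a single vectorized call. The following is a sketch of that alternative using np.hypot, not the code from the answer above. The names start and nearest are my own, and kind='stable' is only there so that ties keep their original order.

```python
import numpy as np
import pandas as pd

locations = {'Start': (20, 5), 'A': (10, 3), 'B': (5, 3), 'C': (5, 7),
             'D': (10, 7), 'E': (14, 4), 'F': (14, 6)}
df = pd.DataFrame.from_dict(locations, orient='index', columns=['x', 'y'])

# Euclidean distance of every row to the 'Start' row, computed in one call.
start = df.loc['Start']
df['dist'] = np.hypot(df['x'] - start['x'], df['y'] - start['y'])

nearest = df.drop('Start').sort_values('dist', kind='stable')
print(nearest)
```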
You need to convert values in "dist" column to float:
df = closest_locations(['Start'], loc_list)
df.dist = list(map(lambda x: float(x), df.dist)) # convert each value to float
print(df.sort_values('dist')) # now it will sort properly
Output:
from to dist
4 Start E 6.082763
5 Start F 6.082763
0 Start D 10.198039
1 Start A 10.198039
2 Start C 15.132746
3 Start B 15.132746
Edit: As mentioned by @jezrael in the comments, the following is a more direct method:
df.dist = df.dist.astype(float)
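As for why the column ends up as object in the first place: np.array has to pick one dtype for the whole array, so rows that mix strings and floats are stored entirely as strings, and sort_values then compares text. Here is a small sketch with made-up, rounded values that shows the difference. The variable names are mine.

```python
import numpy as np
import pandas as pd

# Illustrative rows with rounded distances.
rows = [['Start', 'D', 10.198], ['Start', 'E', 6.083], ['Start', 'C', 15.133]]

# Going through np.array turns every cell, including the floats, into a string.
as_array = pd.DataFrame(np.array(rows), columns=['from', 'to', 'dist'])
print(as_array.sort_values('dist')['to'].tolist())  # ['D', 'C', 'E'] (text order)

# Passing the list of rows directly lets pandas infer a dtype per column.
direct = pd.DataFrame(rows, columns=['from', 'to', 'dist'])
print(direct.sort_values('dist')['to'].tolist())    # ['E', 'D', 'C'] (numeric order)
```

So building the dataframe from the list of rows directly should avoid the need for the conversion at all.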
I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer is for the participant.
I want a column in my dataframe saying yes or no depending on whether the letter two positions earlier was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd
num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]
my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)
# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100
for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)
df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
letters
0 F
1 I
2 D
3 I
4 H
5 C
6 L
7 G
8 D
9 L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters, I can get past this error, but I would prefer to keep the letters.
After that, if I run it with numbers, I get no errors, but my new column response contains only 'no'. But I know that 12 times it should be 'yes'.
It seems like you want to compare the column with the same column shifted by two elements. Use shift + np.where (this needs import numpy as np):
df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
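To see what shift(2) is doing, here is a tiny sketch on a made-up six-letter series. The names two_back and demo are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up short sequence, not the random list from the question.
letters = pd.Series(['F', 'I', 'D', 'I', 'D', 'C'])

# shift(2) moves every value down two rows; the first two rows become NaN.
two_back = letters.shift(2)

demo = pd.DataFrame({'letters': letters, 'two_back': two_back})
# eq() is False wherever the shifted value is NaN, so those rows get 'no'.
demo['response'] = np.where(letters.eq(two_back), 'yes', 'no')
print(demo)
```

Rows 3 and 4 come out as 'yes' because 'I' and 'D' each repeat the letter two rows above.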
I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long.
I need two dataframes with classes 0 & 1 for the first data set and 3 & 5 for the second.
I can get 0 & 1 together easily enough:
mnist_01 = mnist.loc[mnist['class']<= 1]
However, I am not sure how to get classes 3 & 5... so what I would like to be able to do is:
mnist_35 = mnist.loc[mnist['class'] == (3 or 5)]
...rather than doing:
mnist_3 = mnist.loc[mnist['class'] == 3]
mnist_5 = mnist.loc[mnist['class'] == 5]
mnist_35 = pd.concat([mnist_3,mnist_5],axis=0)
You can use isin, which checks each value for membership in the collection you pass (a set works here):
mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})
>>> mnist.loc[mnist['class'].isin({3, 5})]
class val
3 3 d
5 5 f
>>> mnist.loc[mnist['class'].isin({0, 1})]
class val
0 0 a
1 1 b
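As for why the == (3 or 5) attempt from the question does not work: Python evaluates 3 or 5 first, and that expression is simply 3, so only class 3 is selected. A short sketch of this, together with the explicit | form that is equivalent to isin here (the names only_3 and either are mine):

```python
import pandas as pd

mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})

print(3 or 5)  # 3, because `or` returns its first truthy operand

# Same as comparing with 3 only.
only_3 = mnist.loc[mnist['class'] == (3 or 5)]

# Element-wise OR of two boolean masks; the parentheses are required.
either = mnist.loc[(mnist['class'] == 3) | (mnist['class'] == 5)]

print(only_3['val'].tolist())
print(either['val'].tolist())
```

For more than two or three values, isin is usually easier to read than a chain of | comparisons.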