Add labels to categorical data in a DataFrame - Python

I am trying to convert survey data on marital status, which looks as follows:
df['d11104'].value_counts()
[1] Married           250507
[2] Single             99131
[4] Divorced           32817
[3] Widowed            24839
[5] Separated           8098
[-1] keine Angabe       2571
Name: d11104, dtype: int64
So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding
df['marstat'].value_counts()
1    250507
2     99131
4     32817
3     24839
5      8098
0      2571
Name: marstat, dtype: int64
Now, I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I can still identify people by the condition df['marstat'] == 1, while at the same time having the labels ['Married','Single','Divorced','Widowed'] attached to this variable. How can this be done?
EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:
df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})

You can convert your result to a dataframe and include both the category code and name in the output.
A dictionary mapping codes to categories can be extracted by enumerating the categories. A minimal example is below.
import pandas as pd

df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
                         'S', 'S', 'M', 'W']}, dtype='category')

print(df.A.cat.categories)
# Index(['D', 'M', 'S', 'W'], dtype='object')

res = df.A.cat.codes.value_counts().to_frame('count')
cat_map = dict(enumerate(df.A.cat.categories))
res['A'] = res.index.map(cat_map.get)

print(res)
#    count  A
# 1      5  M
# 2      4  S
# 3      2  W
# 0      1  D
For example, you can access the "M" rows via either df['A'] == 'M' or df.A.cat.codes == 1.
A more straightforward solution is to apply value_counts directly and then add an extra column for the codes:
res = df.A.value_counts().to_frame('count').reset_index()
res['code'] = res['index'].cat.codes

  index  count  code
0     M      5     1
1     S      4     2
2     W      2     3
3     D      1     0
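If you want both views on a single variable, here is a minimal sketch of the idea from the question's edit (assumption: codes 1-5 map to the labels listed there, and marstat stands in for the survey column):

import pandas as pd

marstat = pd.Series([1, 2, 4, 3, 5, 1], name='marstat')
labels = {1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'}

# numeric filtering keeps working on the code column ...
married = marstat[marstat == 1]
# ... while a parallel labelled column carries the readable names
marstat_lb = marstat.map(labels).astype('category')
print(marstat_lb.value_counts())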

Related

Efficient way to relabel values in Pandas DataFrame column based on shared value in adjacent column

I have a dataframe like this:

  r_id c_id
0    x    1
1    y    1
2    z    2
3    u    3
4    v    3
5    w    4
6    x    4
which you can reproduce like this:
import pandas as pd
r1 = ['x', 'y', 'z', 'u', 'v', 'w', 'x']
r2 = ['1', '1', '2', '3', '3', '4', '4']
df = pd.DataFrame([r1,r2]).T
df.columns = ['r_id', 'c_id']
Where a row has a duplicate r_id, I want to relabel all cases of that c_id with the first c_id value that was given for the duplicate r_id.
(Edit: maybe this is somewhat subtle, but I therefore want to relabel 'w's c_id as '1', as well as that belonging to the second case of 'x'. The duplication of 'x' shows me that all instances where c_id == '1' and c_id == '4' should have the same label.)
For a small dataframe, this works:
from collections import defaultdict
import networkx as nx

g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
translator = {n: sorted(sg.nodes)[0]
              for sg in subgraphs
              for n in sg.nodes if n in df.c_id.values}
df['simplified'] = df.c_id.apply(lambda x: translator[x])
so that I get this:

  r_id c_id simplified
0    x    1          1
1    y    1          1
2    z    2          2
3    u    3          3
4    v    3          3
5    w    4          1
6    x    4          1
But I'm trying to do this for a table with 2.5 million rows and my computer is struggling... There must be a more efficient way to do something like this.
Okay, this seems to work if I optimize my initial answer in two ways: use the memory id() as a unique label for a connected set (or rather a subgraph, since I'm using networkx to find these), and don't check any condition while generating the dictionary, instead using .get() so that lookups pass gracefully over values that have no key:
def simplify(original_df):
    df = original_df.copy()
    g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
    subgraphs = [g.subgraph(c) for c in nx.connected_components(g)]
    translator = {n: id(sg) for sg in subgraphs for n in sg.nodes}
    df['simplified'] = df.c_id.apply(lambda x: translator.get(x, x))
    return df
This manages to do what I want for 2,840,759 rows in 14.49 seconds on my laptop, which will do fine.
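If you do not care what the component labels look like, a further variant (a sketch under the same networkx assumptions; simplify_by_component is a hypothetical name) skips building subgraph objects entirely and labels each connected component by its enumeration index:

import networkx as nx
import pandas as pd

def simplify_by_component(original_df):
    # Label every node with the index of its connected component;
    # any unique per-component value works, just like id(sg) above.
    df = original_df.copy()
    g = nx.from_pandas_edgelist(df, 'r_id', 'c_id')
    translator = {n: i
                  for i, comp in enumerate(nx.connected_components(g))
                  for n in comp}
    df['simplified'] = df.c_id.map(translator).fillna(df.c_id)
    return df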

Does replace work row-wise, and will it overwrite a value with the dict twice?

Assuming I have the following data set:
import pandas as pd

lst = ['u', 'v', 'w', 'x', 'y']
lst_rev = list(reversed(lst))
dct = dict(zip(lst, lst_rev))
df = pd.DataFrame({'A': ['a', 'b', 'a', 'c', 'a'],
                   'B': lst},
                  dtype='category')
Now I want to replace the values of column B in df using dct.
I know I can do
df.B.map(dct).fillna(df.B)
to get the expected output, but when I tested with replace (which seemed more straightforward to me), it failed.
The output is shown below:
df.B.replace(dct)
Out[132]:
0 u
1 v
2 w
3 v
4 u
Name: B, dtype: object
Which is different from the
df.B.map(dct).fillna(df.B)
Out[133]:
0 y
1 x
2 w
3 v
4 u
Name: B, dtype: object
I can guess at the reason why this happens, but why exactly?
0 u --> change to y then change to u
1 v --> change to x then change to v
2 w
3 v
4 u
Appreciate your help.
It's because replace keeps applying the dictionary:
df.B.replace({'u': 'v', 'v': 'w', 'w': 'x', 'x': 'y', 'y': 'Hello'})
0 Hello
1 Hello
2 Hello
3 Hello
4 Hello
Name: B, dtype: object
With the given dct 'u' -> 'y' then 'y' -> 'u'.
This behavior is not intended, and was recognized as a bug.
This is the Github issue that first identified the behavior, and it was added as a milestone for pandas 0.24.0. I can confirm the replacement works as expected in the current version on Github.
Here is the PR containing the fix.
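Until you are on a version with the fix, map plus fillna is the safe route, because map performs exactly one lookup per element and therefore cannot chain substitutions. A minimal sketch with the data from the question:

import pandas as pd

lst = ['u', 'v', 'w', 'x', 'y']
dct = dict(zip(lst, reversed(lst)))
s = pd.Series(lst)

# one lookup per element: 'u' -> 'y', and the result is never re-applied
print(s.map(dct).fillna(s).tolist())
# ['y', 'x', 'w', 'v', 'u']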

Nested if conditions to create a new column in pandas dataframe

I have a dataframe that looks like below:

| userid | rank2017 | rank2018 |
| 212    | 'H'      | 'H'      |
| 322    | 'L'      | 'H'      |
| 311    | 'H'      | 'L'      |
I want to create a new column called progress in the dataframe above that will output 1 if rank2017 equals rank2018, 2 if rank2017 is 'H' and rank2018 is 'L', and 3 otherwise. Can anybody help me execute this in Python?
Here is one way. You do not need nested if statements.
df = pd.DataFrame({'user': [212, 322, 311],
                   'rank2017': ['H', 'L', 'H'],
                   'rank2018': ['H', 'H', 'L']})

df['progress'] = 3
df.loc[(df['rank2017'] == 'H') & (df['rank2018'] == 'L'), 'progress'] = 2
df.loc[df['rank2017'] == df['rank2018'], 'progress'] = 1

#   rank2017 rank2018  user  progress
# 0        H        H   212         1
# 1        L        H   322         3
# 2        H        L   311         2
Here is a way using np.select:
import numpy as np

# Set your conditions:
conds = [df['rank2017'] == df['rank2018'],
         (df['rank2017'] == 'H') & (df['rank2018'] == 'L')]
# Set the values for each condition
choices = [1, 2]
# Use np.select with a default of 3 (your "else" value)
df['progress'] = np.select(conds, choices, default=3)
Returns:
>>> df
   userid rank2017 rank2018  progress
0     212        H        H         1
1     322        L        H         3
2     311        H        L         2
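Note that np.select checks the conditions in order and takes the choice for the first condition that holds, so the equality check is listed first; rows matching neither condition fall back to the default of 3.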

Iterate in a dataframe with strings

I'm trying to create a cognitive task called the 2-back test.
I created a semi-random list with certain conditions, and now I want to know what the correct answer for the participant should be.
I want a column in my dataframe saying whether, yes or no, the letter 2 letters before was the same letter.
Here is my code:
from random import choice, shuffle
import pandas as pd

num = 60
letters = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L']
# letters_1 = [1, 2, 3, 4, 5, 6]

my_list = [choice(letters), choice(letters)]
probab = list(range(num - 2))
shuffle(probab)

# We want 20% of the letters to repeat the letter 2 letters back
pourc = 20
repeatnum = num * pourc // 100

for i in probab:
    ch = prev = my_list[-2]
    if i >= repeatnum:
        while ch == prev:
            ch = choice(letters)
    my_list.append(ch)

df = pd.DataFrame(my_list, columns=["letters"])
df.head(10)
  letters
0       F
1       I
2       D
3       I
4       H
5       C
6       L
7       G
8       D
9       L
# Create a list to store the data
response = []
# For each row in the column,
for i in df['letters']:
    # if more than a value,
    if i == [i - 2]:
        response.append('yes')
    else:
        response.append('no')
# Create a column from the list
df['response'] = response
First error:
if i == [i - 2]:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I use numbers instead of letters I can get past this error, but I would prefer to keep letters.
And if I run it with numbers, I get no errors, but my new response column only contains 'no'. But I know that 12 times it should be 'yes'.
It seems like you want to compare the column against the same column shifted by two elements. Use shift + np.where:
import numpy as np

df['response'] = np.where(df.letters.eq(df.letters.shift(2)), 'yes', 'no')
df.head(10)
letters response
0 F no
1 I no
2 D no
3 I yes
4 H no
5 C no
6 L no
7 G no
8 D no
9 L no
But I know that 12 times it should be 'yes'.
df.response.eq('yes').sum()
12
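If you'd rather avoid the explicit numpy dependency, the same shift-and-compare idea works in pandas alone; a minimal sketch, assuming the letters column built above:

# compare each letter with the one two rows earlier, then map the
# boolean result onto the two answer strings (NaN shifts compare False)
df['response'] = df.letters.eq(df.letters.shift(2)).map({True: 'yes', False: 'no'})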

Automatically rename columns to ensure they are unique

I fetch a spreadsheet into a Python DataFrame named df.
Let's use a sample:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
df.columns = ['a', 'a']
a a
0 0.973858 0.036459
1 0.835112 0.947461
2 0.520322 0.593110
3 0.480624 0.047711
4 0.643448 0.104433
5 0.961639 0.840359
6 0.848124 0.437380
7 0.579651 0.257770
8 0.919173 0.785614
9 0.505613 0.362737
When I run df.columns.is_unique I get False.
I would like to automatically rename the duplicate column 'a' to 'a_2' (or something like that).
I don't expect a solution like df.columns = ['a', 'a_2'];
I'm looking for a solution that is usable for several columns!
You can uniquify the columns manually:
df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()
    for item in df_columns:
        fudge = 1
        newitem = item
        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)
        yield newitem
        seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
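To apply it to a frame, assign the generated names back to the columns; a small usage sketch, assuming the df with duplicate 'a' headers from the question:

df.columns = list(uniquify(df.columns))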
I fetch a spreadsheet into a Python DataFrame named df... I would like
to automatically rename [duplicate] column [names].
Pandas does that automatically for you without you having to do anything...
test.xls:
import pandas as pd

df = pd.read_excel(
    "./test.xls",
    "Sheet1",
    header=0,
    index_col=0,
)
print(df)
--output:--
        a    b   c  b.1  a.1  a.2
index
0      10  100 -10 -100   10   21
1      20  200 -20 -200   11   22
2      30  300 -30 -300   12   23
3      40  400 -40 -400   13   24
4      50  500 -50 -500   14   25
5      60  600 -60 -600   15   26
print(df.columns.is_unique)
--output:--
True
If for some reason you are given a DataFrame with duplicate columns, you can do this:
import pandas as pd
import numpy as np
from collections import defaultdict

df = pd.DataFrame(
    {
        'k': np.random.rand(10),
        'l': np.random.rand(10),
        'm': np.random.rand(10),
        'n': np.random.rand(10),
        'o': np.random.rand(10),
        'p': np.random.rand(10),
    }
)
print(df)
--output:--
k l m n o p
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
df.columns = ['a', 'b', 'c', 'b', 'a', 'a']
print(df)
--output:--
a b c b a a
0 0.566150 0.025225 0.744377 0.222350 0.800402 0.449897
1 0.701286 0.182459 0.661226 0.991143 0.793382 0.980042
2 0.383213 0.977222 0.404271 0.050061 0.839817 0.779233
3 0.428601 0.303425 0.144961 0.313716 0.244979 0.487191
4 0.187289 0.537962 0.669240 0.096126 0.242258 0.645199
5 0.508956 0.904390 0.838986 0.315681 0.359415 0.830092
6 0.007256 0.136114 0.775670 0.665000 0.840027 0.991058
7 0.719344 0.072410 0.378754 0.527760 0.205777 0.870234
8 0.255007 0.098893 0.079230 0.225225 0.490689 0.554835
9 0.481340 0.300319 0.649762 0.460897 0.488406 0.166047
print(df.columns.is_unique)
--output:--
False
name_counts = defaultdict(int)
new_col_names = []
for name in df.columns:
    new_count = name_counts[name] + 1
    new_col_names.append("{}{}".format(name, new_count))
    name_counts[name] = new_count

print(new_col_names)
--output:--
['a1', 'b1', 'c1', 'b2', 'a2', 'a3']
df.columns = new_col_names
print(df)
--output:--
a1 b1 c1 b2 a2 a3
0 0.264598 0.321378 0.466370 0.986725 0.580326 0.671168
1 0.938810 0.179999 0.403530 0.675112 0.279931 0.011046
2 0.935888 0.167405 0.733762 0.806580 0.392198 0.180401
3 0.218825 0.295763 0.174213 0.457533 0.234081 0.555525
4 0.891890 0.196245 0.425918 0.786676 0.791679 0.119826
5 0.721305 0.496182 0.236912 0.562977 0.249758 0.352434
6 0.433437 0.501975 0.088516 0.303067 0.916619 0.717283
7 0.026491 0.412164 0.787552 0.142190 0.665488 0.488059
8 0.729960 0.037055 0.546328 0.683137 0.134247 0.444709
9 0.391209 0.765251 0.507668 0.299963 0.348190 0.731980
print(df.columns.is_unique)
--output:--
True
In case anyone needs this in Scala:
def renameDup(header: String): String = {
  val trimmedList: List[String] = header.split(",").toList
  var fudge = 0
  var newitem = ""
  var seen = List[String]()
  for (item <- trimmedList) {
    fudge = 1
    newitem = item
    for (newitem2 <- seen) {
      if (newitem2 == newitem) {
        fudge += 1
        newitem = item + "_" + fudge
      }
    }
    seen = seen :+ newitem
  }
  seen.mkString(",")
}
Here's a solution that uses pandas all the way through.
import pandas as pd

# create a data frame with duplicate column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.rename({'a': 'col', 'b': 'col'}, axis=1, inplace=True)
df
---output---
col col
0 1 4
1 2 5
2 3 6
# make a new data frame of column headers and number sequentially
dfcolumns = pd.DataFrame({'name': df.columns})
dfcolumns['counter'] = dfcolumns.groupby('name').cumcount().apply(str)
# remove counter for first case (optional) and combine suffixes
dfcolumns.loc[dfcolumns.counter=='0', 'counter'] = ''
df.columns = dfcolumns['name'] + dfcolumns['counter']
df
---output---
col col1
0 1 4
1 2 5
2 3 6
I ran into this problem when loading DataFrames from Oracle tables. 7stud is right that pd.read_excel() automatically designates duplicated columns with a *.1, but not all of the read functions do this. One workaround is to save the DataFrame to a csv (or excel) file and then reload it to re-designate the duplicated columns.
data = pd.read_sql(SQL, connection)
data.to_csv(r'C:\temp\temp.csv')
data = pd.read_csv(r'C:\temp\temp.csv')
