Given this data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x', '10D aaa', '1 D', '10 D aa', 7]
                   })
df
ID value
0 a None
1 b NaN
2 c 6D
3 d 7
4 e 10D
5 f NONE
6 g x
7 h 10D aaa
8 i 1 D
9 j 10 D aa
10 k 7
I'd like to extract the numbers where present, and return 0 otherwise, for any of the messy situations shown above.
The desired result is:
ID value
0 a 0
1 b 0
2 c 6
3 d 7
4 e 10
5 f 0
6 g 0
7 h 10
8 i 1
9 j 10
10 k 7
Thanks in advance!
Here is my approach using re.findall and apply:
import re
df['value'].apply(lambda x: 0 if not re.findall(r'\d+', str(x)) else re.findall(r'\d+', str(x))[0])
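Assigned back and converted to int, the same idea becomes the following (a small completion of the attempt above; the commented str.extract line is a vectorized alternative I'm adding, not part of the original attempt):
import re
import pandas as pd

df['value'] = df['value'].apply(
    lambda x: int(re.findall(r'\d+', str(x))[0]) if re.findall(r'\d+', str(x)) else 0)

# vectorized equivalent using str.extract
# df['value'] = df['value'].astype(str).str.extract(r'(\d+)', expand=False).fillna('0').astype(int)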
Try the following using Series.str.replace and fillna:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': ['a','b','c','d','e','f','g','h','i','j','k'],
                   'value': ['None', np.nan, '6D', '7', '10D', 'NONE', 'x', '10D aaa', '1 D', '10 D aa', 7]
                   })
df['value'] = df['value'].fillna(0).astype(str)
df['value'] = df['value'].str.replace(r'\D+', '', regex=True).replace('', '0').astype(int)
Alternatively, you can apply a function to the dataframe via applymap(), following the EAFP principle and catching the relevant exceptions while extracting the digits:
import re

def get_number(item):
    try:
        return int(re.search(r"\d+", str(item)).group(0))
    except (AttributeError, ValueError, IndexError):
        return 0

print(df.applymap(get_number))
Prints:
ID value
0 0 0
1 0 0
2 0 6
3 0 7
4 0 10
5 0 0
6 0 0
7 0 10
8 0 1
9 0 10
10 0 7
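Note that applymap runs get_number over every column, which is why the ID letters also become 0 in the output above. If only the value column should be converted while keeping ID intact, the same function can be mapped over that column alone (a minimal sketch):
df['value'] = df['value'].map(get_number)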
It's easier to explain starting from a simple example df:
df1:
A B C D
0 a 6 1 b/5/4
1 a 6 1 a/1/6
2 c 9 3 9/c/3
There are four columns in df1 (A, B, C, D). The task is to find out how many of column D's strings appear in columns A, B and C (3 columns). Here is the expected output with more explanation:
df2(expect output):
A B C D E (New column)
0 a 6 1 b/5/4 0 <-- Found 0 of column D's strings in columns A, B, C
1 a 6 1 a/1/6 3 <-- Found a, 1 & 6, so it should return 3
2 c 9 3 9/c/3 3 <-- Found all strings (3 in total)
Does anyone have a good idea for this? Thanks!
You can use a list comprehension with set operations:
df['E'] = [len(set(l).intersection(s.split('/'))) for l, s in
zip(df.drop(columns='D').astype(str).to_numpy().tolist(),
df['D'])]
Output:
A B C D E
0 a 6 1 b/5/4 0
1 a 6 1 a/1/6 3
2 c 9 3 9/c/3 3
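The astype(str) above matters because B and C hold integers while the pieces of D are strings. A row-wise apply doing the same thing, arguably easier to read but slower (a sketch, not part of the original answer; the column names A, B, C are taken from the example):
df['E'] = df.apply(lambda row: len(set(row[['A', 'B', 'C']].astype(str)) & set(row['D'].split('/'))), axis=1)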
import pandas as pd

dt = {'A': ['a','a','c'], 'B': [6,6,9], 'C': [1,1,3], 'D': ['b/5/4', 'a/1/6', 'c/9/3']}
E = []
nu_data = pd.DataFrame(data=dt)
for itxid, itx in enumerate(nu_data['D']):
    match = 0
    str_list = itx.split('/')
    for keyid, keys in enumerate(dt):
        if keyid < len(dt) - 1:  # skip column D itself
            for seg_str in str_list:
                if str(dt[keys][itxid]) == seg_str:
                    match += 1
    E.append(match)
nu_data['E'] = E
print(nu_data)
Consider the following code to create a dummy dataset
import numpy as np
from scipy.stats import norm
import pandas as pd
np.random.seed(10)
n=3
space= norm(20, 5).rvs(n)
time= norm(10,2).rvs(n)
values = np.kron(space, time).reshape(n,n) + norm(1,1).rvs([n,n])
### Output
array([[267.39784458, 300.81493866, 229.19163206],
[236.1940266 , 266.49469945, 204.01294305],
[122.55912977, 140.00957047, 106.28339745]])
I can put these data in a pandas dataframe using
space_names = ['A','B','C']
time_names = [2000,2001,2002]
df = pd.DataFrame(values, index=space_names, columns=time_names)
df
### Output
2000 2001 2002
A 267.397845 300.814939 229.191632
B 236.194027 266.494699 204.012943
C 122.559130 140.009570 106.283397
This is considered a wide dataset, where each observation lies in a table with 2 variables that act as coordinates to identify it.
To make it a long (tidy) dataset we can use the .stack method of the pandas dataframe:
df.columns.name = 'time'
df.index.name = 'space'
df.stack().rename('value').reset_index()
### Output
space time value
0 A 2000 267.397845
1 A 2001 300.814939
2 A 2002 229.191632
3 B 2000 236.194027
4 B 2001 266.494699
5 B 2002 204.012943
6 C 2000 122.559130
7 C 2001 140.009570
8 C 2002 106.283397
My question is: how do I do exactly this thing but for a 3-dimensional dataset?
Let's imagine I have 2 observations for each space-time couple:
s = 3
t = 4
r = 2
space_mus = norm(20, 5).rvs(s)
time_mus = norm(10,2).rvs(t)
values = np.kron(space_mus, time_mus)
values = values.repeat(r).reshape(s,t,r) + norm(0,1).rvs([s,t,r])
values
### Output
array([[[286.50322099, 288.51266345],
[176.64303485, 175.38175877],
[136.01675917, 134.44328617]],
[[187.07608546, 185.4068411 ],
[112.86398438, 111.983463 ],
[ 85.99035255, 86.67236986]],
[[267.66833894, 269.45295404],
[162.30044715, 162.50564386],
[124.6374401 , 126.2315447 ]]])
How can I obtain the same structure for the dataframe as above?
Ugly solution
Personally I don't like this solution, and I think one could do it in a more elegant and Pythonic way, but it might still be useful for someone else, so I will post it.
labels = ['{}{}{}'.format(i,j,k) for i in range(s) for j in range(t) for k in range(r)] #space, time, repetition
def flatten3d(k):
    return [i for l in k for s in l for i in s]
value_series = pd.Series(flatten3d(values)).rename('y')
split_labels= [[i for i in l] for l in labels]
df = pd.DataFrame(split_labels, columns=['s','t','r'])
pd.concat([df, value_series], axis=1)
### Output
s t r y
0 0 0 0 266.2408815208753
1 0 0 1 266.13662442609433
2 0 1 0 299.53178992512954
3 0 1 1 300.13941632567605
4 0 2 0 229.39037800681405
5 0 2 1 227.22227496248507
6 0 3 0 281.76357915411995
7 0 3 1 280.9639352062619
8 1 0 0 235.8137644198259
9 1 0 1 234.23202459516452
10 1 1 0 265.19681013560034
11 1 1 1 266.5462102589883
12 1 2 0 200.730100791878
13 1 2 1 199.83217739700535
14 1 3 0 246.54018839875374
15 1 3 1 248.5496308586532
16 2 0 0 124.90916276929234
17 2 0 1 123.64788669199066
18 2 1 0 139.65391860786775
19 2 1 1 138.08044561039517
20 2 2 0 106.45276370157518
21 2 2 1 104.78351933651582
22 2 3 0 129.86043618610572
23 2 3 1 128.97991481257253
This does not use stack, but maybe it is acceptable for your problem:
import numpy as np
import pandas as pd

values = np.arange(18).reshape(3, 3, 2)  # Your values here
space_names = ['A', 'B', 'C']
time_names = [2000, 2001]  # the product of the level sizes must equal values.size
index = pd.MultiIndex.from_product([space_names, space_names, time_names],
                                   names=["space1", "space2", "time"])
df = pd.DataFrame({"value": values.ravel()}, index=index).reset_index()
# df:
# space1 space2 time value
# 0 A A 2000 0
# 1 A A 2001 1
# 2 A B 2000 2
# 3 A B 2001 3
# 4 A C 2000 4
# 5 A C 2001 5
# 6 B A 2000 6
# 7 B A 2001 7
# 8 B B 2000 8
# 9 B B 2001 9
# 10 B C 2000 10
# 11 B C 2001 11
# 12 C A 2000 12
# 13 C A 2001 13
# 14 C B 2000 14
# 15 C B 2001 15
# 16 C C 2000 16
# 17 C C 2001 17
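For the question's own (space, time, repetition) cube, the same idea might look like this (a sketch; the label lists and the random stand-in data are assumptions, and the level sizes just have to match the array's shape):
import numpy as np
import pandas as pd
from scipy.stats import norm

s, t, r = 3, 4, 2
space_names = ['A', 'B', 'C']
time_names = [2000, 2001, 2002, 2003]
rep_names = [0, 1]

values = norm(200, 50).rvs([s, t, r])  # stand-in for the question's data

index = pd.MultiIndex.from_product([space_names, time_names, rep_names],
                                   names=['space', 'time', 'repetition'])
long_df = pd.DataFrame({'value': values.ravel()}, index=index).reset_index()
print(long_df.head())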
pd.get_dummies allows you to convert a categorical variable into dummy variables. Besides the fact that it's trivial to reconstruct the categorical variable, is there a preferred/quick way to do it?
It's been a few years, so this may well not have been in the pandas toolkit back when this question was originally asked, but this approach seems a little easier to me. idxmax will return the index corresponding to the largest element (i.e. the one with a 1). We do axis=1 because we want the column name where the 1 occurs.
EDIT: I didn't bother making it categorical instead of just a string, but you can do that the same way as @Jeff did by wrapping it with pd.Categorical (and pd.Series, if desired).
In [1]: import pandas as pd
In [2]: s = pd.Series(['a', 'b', 'a', 'c'])
In [3]: s
Out[3]:
0 a
1 b
2 a
3 c
dtype: object
In [4]: dummies = pd.get_dummies(s)
In [5]: dummies
Out[5]:
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
In [6]: s2 = dummies.idxmax(axis=1)
In [7]: s2
Out[7]:
0 a
1 b
2 a
3 c
dtype: object
In [8]: (s2 == s).all()
Out[8]: True
EDIT in response to @piRSquared's comment:
This solution does indeed assume there's one 1 per row. I think this is usually the format one has. pd.get_dummies can return rows that are all 0 if you have drop_first=True or if there are NaN values and dummy_na=False (default) (any cases I'm missing?). A row of all zeros will be treated as if it was an instance of the variable named in the first column (e.g. a in the example above).
If drop_first=True, you have no way to know from the dummies dataframe alone what the name of the "first" variable was, so that operation isn't invertible unless you keep extra information around; I'd recommend leaving drop_first=False (default).
Since dummy_na=False is the default, this could certainly cause problems. Please set dummy_na=True when you call pd.get_dummies if you want to use this solution to invert the "dummification" and your data contains any NaNs. Setting dummy_na=True will always add a "nan" column, even if that column is all 0s, so you probably don't want to set this unless you actually have NaNs. A nice approach might be to set dummies = pd.get_dummies(series, dummy_na=series.isnull().any()). What's also nice is that the idxmax solution will correctly regenerate your NaNs (not just a string that says "nan").
It's also worth mentioning that setting drop_first=True and dummy_na=False means that NaNs become indistinguishable from an instance of the first variable, so this should be strongly discouraged if your dataset may contain any NaN values.
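A minimal sketch of that suggestion (the sample series below is assumed for illustration):
import numpy as np
import pandas as pd

s = pd.Series(['a', 'b', np.nan, 'c'])

# only ask for a NaN column when the data actually contains NaNs
dummies = pd.get_dummies(s, dummy_na=s.isnull().any())

recovered = dummies.idxmax(axis=1)  # the NaN row maps back to a real NaN, not the string "nan"
print(recovered.isna().equals(s.isna()))  # True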
In [46]: s = Series(list('aaabbbccddefgh')).astype('category')
In [47]: s
Out[47]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
In [48]: df = pd.get_dummies(s)
In [49]: df
Out[49]:
a b c d e f g h
0 1 0 0 0 0 0 0 0
1 1 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0
3 0 1 0 0 0 0 0 0
4 0 1 0 0 0 0 0 0
5 0 1 0 0 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 1 0 0 0 0 0
8 0 0 0 1 0 0 0 0
9 0 0 0 1 0 0 0 0
10 0 0 0 0 1 0 0 0
11 0 0 0 0 0 1 0 0
12 0 0 0 0 0 0 1 0
13 0 0 0 0 0 0 0 1
In [50]: x = df.stack()
# I don't think you actually need to specify ALL of the categories here, as by definition
# they are in the dummy matrix to start (and hence the column index)
In [51]: Series(pd.Categorical(x[x!=0].index.get_level_values(1)))
Out[51]:
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
Name: level_1, dtype: category
Categories (8, object): [a < b < c < d < e < f < g < h]
So I think we need a function to 'do' this, as it seems to be a natural operation. Maybe get_categories(), see here.
This is quite a late answer, but since you ask for a quick way to do it, I assume you're looking for the most performant strategy. On a large dataframe (for instance 10000 rows), you can get a very significant speed boost by using np.where instead of idxmax or get_level_values, and get the same result. The idea is to index the column names where the dummy dataframe is not 0:
Method:
Using the same sample data as @Nathan:
>>> dummies
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
s2 = pd.Series(dummies.columns[np.where(dummies!=0)[1]])
>>> s2
0 a
1 b
2 a
3 c
dtype: object
Benchmark:
On a small dummy dataframe, you won't see much difference in performance. However, testing different strategies to solve this problem on a large series:
s = pd.Series(np.random.choice(['a','b','c'], 10000))
dummies = pd.get_dummies(s)
def np_method(dummies=dummies):
    return pd.Series(dummies.columns[np.where(dummies!=0)[1]])

def idx_max_method(dummies=dummies):
    return dummies.idxmax(axis=1)

def get_level_values_method(dummies=dummies):
    x = dummies.stack()
    return pd.Series(pd.Categorical(x[x!=0].index.get_level_values(1)))

def dot_method(dummies=dummies):
    return dummies.dot(dummies.columns)
import timeit
# Time each method, 1000 iterations each:
>>> timeit.timeit(np_method, number=1000)
1.0491090340074152
>>> timeit.timeit(idx_max_method, number=1000)
12.119140846014488
>>> timeit.timeit(get_level_values_method, number=1000)
4.109266621991992
>>> timeit.timeit(dot_method, number=1000)
1.6741622970002936
The np.where method is about 4 times faster than the get_level_values method and 11.5 times faster than the idxmax method! It also beats (but only by a little) the .dot() method outlined in this answer to a similar question.
They all return the same result:
>>> (get_level_values_method() == np_method()).all()
True
>>> (idx_max_method() == np_method()).all()
True
Setup
Using @Jeff's setup
s = Series(list('aaabbbccddefgh')).astype('category')
df = pd.get_dummies(s)
If columns are strings
and there is only one 1 per row
df.dot(df.columns)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
numpy.where
Again! Assuming only one 1 per row
i, j = np.where(df)
pd.Series(df.columns[j], i)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: category
Categories (8, object): [a, b, c, d, e, f, g, h]
numpy.where
Not assuming one 1 per row
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j])))
0 0 a
1 0 a
2 0 a
3 1 b
4 1 b
5 1 b
6 2 c
7 2 c
8 3 d
9 3 d
10 4 e
11 5 f
12 6 g
13 7 h
dtype: object
numpy.where
Where we don't assume one 1 per row and we drop the index
i, j = np.where(df)
pd.Series(dict(zip(zip(i, j), df.columns[j]))).reset_index(-1, drop=True)
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 c
8 d
9 d
10 e
11 f
12 g
13 h
dtype: object
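One more variant worth noting: if some rows contain no 1 at all (for example after drop_first=True), the versions that assume one 1 per row simply drop those rows from the result. A reindex against the original positions brings them back as NaN (a small sketch reusing df from the Setup above, and assuming at most one 1 per row):
i, j = np.where(df)
pd.Series(df.columns[j], i).reindex(range(len(df)))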
Another option is using the function from_dummies from pandas version 1.5.0. Here is a reproducible example:
import pandas as pd
s = pd.Series(['a', 'b', 'a', 'c'])
df = pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
Using from_dummies:
pd.from_dummies(df)
0 a
1 b
2 a
3 c
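A couple of usage notes (assuming pandas >= 1.5; the variable names below are illustrative): from_dummies returns a DataFrame, so take its single column if you want a Series back, and all-zero rows (e.g. produced by drop_first=True) need a default_category to be invertible:
import pandas as pd

s = pd.Series(['a', 'b', 'a', 'c'])
dummies = pd.get_dummies(s)

recovered = pd.from_dummies(dummies).iloc[:, 0]  # back to a Series
print((recovered == s).all())  # True

# with drop_first=True, the dropped level must be supplied explicitly
dropped = pd.get_dummies(s, drop_first=True)
print(pd.from_dummies(dropped, default_category='a'))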
Converting dat["classification"] to one-hot encodings and back:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
dat["labels"] = le.fit_transform(dat["classification"])  # dat is the original DataFrame
Y = pd.get_dummies(dat["labels"])

tru = []
for i in range(0, len(Y)):
    tru.append(np.argmax(Y.iloc[i]))
tru = le.inverse_transform(tru)

## Identity check
(tru == dat["classification"]).value_counts()
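The loop can also be collapsed into a single vectorized call (a sketch under the same assumption that dat holds the original column; it relies on get_dummies sorting the integer labels, so column position equals label):
tru = le.inverse_transform(Y.to_numpy().argmax(axis=1))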
If you're categorizing the rows in your dataframe based on some row-wise mutually exclusive boolean conditions (these are the "dummy" variables) which don't form a partition (i.e. some rows are all 0 because of, for example, some missing data), it may be better to initialize a pd.Categorical full of np.nan and then explicitly set the category of each subset. An example follows.
0. Data setup:
np.random.seed(42)
student_names = list('abcdefghi')
marks = np.random.randint(0, 100, len(student_names)).astype(float)
passes = marks >= 50
marks[[1, 5]] = np.nan # artificially introduce NAs
students = pd.DataFrame({'mark': marks, 'pass': passes}, index=student_names)
>>> students
mark pass
a 51.0 True
b NaN True
c 14.0 False
d 71.0 True
e 60.0 True
f NaN False
g 82.0 True
h 86.0 True
i 74.0 True
1. Compute the value of the relevant boolean conditions:
failed = ~students['pass']
barely_passed = students['pass'] & (students['mark'] < 60)
well_passed = students['pass'] & (students['mark'] >= 60)
>>> pd.DataFrame({'b': barely_passed, 'f': failed, 'p': well_passed}).astype(int)
b f p
a 1 0 0
b 0 0 0
c 0 1 0
d 0 0 1
e 0 0 1
f 0 1 0
g 0 0 1
h 0 0 1
i 0 0 1
As you can see row b has False for all three categories (since the mark is NaN and pass is True).
2. Generate the categorical series:
cat = pd.Series(
    pd.Categorical([np.nan] * len(students), categories=["failed", "barely passed", "well passed"]),
    index=students.index
)
cat[failed] = "failed"
cat[barely_passed] = "barely passed"
cat[well_passed] = "well passed"
>>> cat
a barely passed
b NaN
c failed
d well passed
e well passed
f failed
g well passed
h well passed
i well passed
As you can see, a NaN was kept where none of the categories applied.
This approach is as performant as using np.where but allows for the flexibility of possible NaNs.
Currently, I'm using:
csvdata.update(data, overwrite=True)
How can I make it update and overwrite a specific column but not another? It's a small but simple question: is there a simple answer?
Rather than update with the entire DataFrame, just update with the sub-DataFrame of the columns you are interested in. For example:
In [11]: df1
Out[11]:
A B
0 1 99
1 3 99
2 5 6
In [12]: df2
Out[12]:
A B
0 a 2
1 b 4
2 c 6
In [13]: df1.update(df2[['B']]) # subset of cols = ['B']
In [14]: df1
Out[14]:
A B
0 1 2
1 3 4
2 5 6
If you want to do it for a single column:
import pandas
import numpy
csvdata = pandas.DataFrame({"a":range(12), "b":range(12)})
other = pandas.Series(list("abcdefghijk")+[numpy.nan])
csvdata["a"].update(other)
print(csvdata)
a b
0 a 0
1 b 1
2 c 2
3 d 3
4 e 4
5 f 5
6 g 6
7 h 7
8 i 8
9 j 9
10 k 10
11 11 11
or, as long as the column names match, you can do this:
other = pandas.DataFrame({"a":list("abcdefghijk")+[numpy.nan], "b":list("abcdefghijk")+[numpy.nan]})
csvdata.update(other["a"])
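For completeness, a minimal sketch (column names assumed) showing that update aligns on both the index and the column labels, so passing only the columns you care about controls exactly what gets overwritten:
import pandas as pd

csvdata = pd.DataFrame({"a": range(5), "b": range(5)})
data = pd.DataFrame({"a": list("vwxyz"), "b": [10, 20, 30, 40, 50]})

csvdata.update(data[["a"]])  # only column "a" is overwritten; "b" keeps its original values
print(csvdata)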