Pandas - how to replace specific values in a Series?


I have a dataframe with a column called product_type such as:
df1.product_type.unique()
>> ["prod_1", "prod_2", "prod_3"]
df1.product_type.dtype
>> dtype('O')
I am looking for the most efficient way to replace these values with the numbers [1, 2, 3].
Thanks

Use factorize to encode a new column:
In [2]:
df = pd.DataFrame({'a':list('abcdbcbccc')})
df
Out[2]:
a
0 a
1 b
2 c
3 d
4 b
5 c
6 b
7 c
8 c
9 c
In [5]:
df['code'] = df['a'].factorize()[0] + 1
df
Out[5]:
a code
0 a 1
1 b 2
2 c 3
3 d 4
4 b 2
5 c 3
6 b 2
7 c 3
8 c 3
9 c 3
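So in your case this should work:
df1['product_type'] = df1['product_type'].factorize()[0] + 1
Note that factorize numbers the values in order of first appearance, so prod_1 becomes 1 only if it appears first in the column.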
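If the numbering should follow the sorted labels rather than the order of appearance, a minimal sketch along these lines may help. The sample data and the product_code column name are made up for illustration; pd.factorize with sort=True and the returned uniques are standard pandas.

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's df1.
df1 = pd.DataFrame({"product_type": ["prod_2", "prod_1", "prod_3", "prod_1"]})

# factorize returns (codes, uniques). Codes follow the order of first
# appearance unless sort=True is passed, in which case they follow sorted order.
codes, uniques = pd.factorize(df1["product_type"], sort=True)
df1["product_code"] = codes + 1  # shift from 0-based to 1-based

# uniques maps each 0-based code back to its original label.
decoded = uniques.take(codes)
```

Keeping uniques around is what makes the encoding reversible later on.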

Cast the column to a category, then take the codes (they start at 0, so add 1 if you need them to start at 1):
df1 = pd.DataFrame({'product_type': ['prod_1'] * 3 + ['prod_2'] * 3 + ['prod_3'] * 3})
df1['product_type_code'] = df1.product_type.astype('category').cat.codes
>>> df1
product_type product_type_code
0 prod_1 0
1 prod_1 0
2 prod_1 0
3 prod_2 1
4 prod_2 1
5 prod_2 1
6 prod_3 2
7 prod_3 2
8 prod_3 2
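Since the question asks for [1, 2, 3] rather than [0, 1, 2], here is a short sketch of two ways to get there. The column names code_from_category and code_from_map and the mapping dictionary are assumptions for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({"product_type": ["prod_1", "prod_3", "prod_2", "prod_3"]})

# Category codes start at 0 and follow the sorted category order,
# so add 1 to get the 1-based numbering from the question.
df1["code_from_category"] = df1["product_type"].astype("category").cat.codes + 1

# An explicit dictionary gives full control over which label gets which number.
mapping = {"prod_1": 1, "prod_2": 2, "prod_3": 3}  # hypothetical mapping
df1["code_from_map"] = df1["product_type"].map(mapping)
```

The dictionary version is the safer choice when the numbers carry meaning, because a label missing from the mapping shows up as NaN instead of silently receiving a code.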

Related

Pandas split and append

I'm new to working with pandas, I don't know how to solve the following problem.
I have the following dataframe:
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
and I have to turn into the following:
a 1
b 2
c 3
d 4
e 5
f 6
g 7
h 8
i 9

Try this:
data = {
0: pd.concat(df[c] for c in df.columns[0::2]).reset_index(drop=True),
1: pd.concat(df[c] for c in df.columns[1::2]).reset_index(drop=True),
}
df = pd.DataFrame(data)
Output:
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
Explanation
First, we select every even column and group them together:
>>> df
0 1 2 3 4 5
0 a 1 d 4 g 7
1 b 2 e 5 h 8
2 c 3 f 6 i 9
>>> df.columns
Index(['0', '1', '2', '3', '4', '5'], dtype='object')
>>> even_col_names = df.columns[0::2] # slice syntax is start:stop:step: start at item 0, no stop (run to the end), take every 2nd item
>>> even_col_names
Index(['0', '2', '4'], dtype='object')
>>> even_cols = df[even_col_names]
>>> even_cols
0 2 4
0 a d g
1 b e h
2 c f i
Then, we select every odd column and group them together:
>>> odd_col_names = df.columns[1::2] # start with the 1st item, select every 2 items
>>> odd_col_names
Index(['1', '3', '5'], dtype='object')
>>> odd_cols = df[odd_col_names]
>>> odd_cols
1 3 5
0 1 4 7
1 2 5 8
2 3 6 9
Then, we concatenate the even columns into a single column:
>>> even_cols_list = [df[c] for c in even_col_names]
>>> even_cols_list
[0 a
1 b
2 c
Name: 0, dtype: object,
0 d
1 e
2 f
Name: 2, dtype: object,
0 g
1 h
2 i
Name: 4, dtype: object]
>>> even_col = pd.concat(even_cols_list).reset_index(drop=True)
>>> even_col
0 a
1 b
2 c
3 d
4 e
5 f
6 g
7 h
8 i
dtype: object
Then we concatenate the odd columns into a single column:
>>> odd_cols_list = [df[c] for c in odd_col_names]
>>> odd_cols_list
[0 1
1 2
2 3
Name: 1, dtype: int64,
0 4
1 5
2 6
Name: 3, dtype: int64,
0 7
1 8
2 9
Name: 5, dtype: int64]
>>> odd_col = pd.concat(odd_cols_list).reset_index(drop=True)
>>> odd_col
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
dtype: int64
Finally, we create a new dataframe with these two columns:
>>> df = pd.DataFrame({0: even_col, 1: odd_col})
>>> df
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
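The same idea can be wrapped in a small function for any group width. This is only a sketch: stack_column_groups is a hypothetical helper name, and it assumes the columns come in equal-sized consecutive groups.

```python
import pandas as pd

# Same layout as the question: three (letter, number) column pairs side by side.
df = pd.DataFrame([["a", 1, "d", 4, "g", 7],
                   ["b", 2, "e", 5, "h", 8],
                   ["c", 3, "f", 6, "i", 9]])

def stack_column_groups(frame, width):
    """Stack consecutive groups of `width` columns on top of each other."""
    blocks = []
    for start in range(0, frame.shape[1], width):
        block = frame.iloc[:, start:start + width]
        # Give every block the same column labels so concat lines them up.
        block = block.set_axis(list(range(block.shape[1])), axis=1)
        blocks.append(block)
    return pd.concat(blocks, ignore_index=True)

result = stack_column_groups(df, 2)
```

Because each block is concatenated as a whole, the letters and numbers keep their own dtypes and no sorting is needed afterwards.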

Convert the data to a NumPy array, reshape it to two columns, and build a new DataFrame, converting the number column back to integers:
import numpy as np
arr = df.to_numpy()
arr = np.reshape(arr, (-1, 2)) # row-major: each new row is one (letter, number) pair
out = pd.DataFrame(arr)
out[1] = pd.to_numeric(out[1]) # the reshaped array has object dtype
out.sort_values(1, ignore_index=True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
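To see why the reshape approach needs the final sort, here is a small sketch that looks at the intermediate order. The variable names are illustrative; the input is the DataFrame from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", 1, "d", 4, "g", 7],
                   ["b", 2, "e", 5, "h", 8],
                   ["c", 3, "f", 6, "i", 9]])

# A row-major reshape keeps each (letter, number) pair together, but it
# walks the original rows left to right: a, d, g, b, e, h, ...
pairs = df.to_numpy().reshape(-1, 2)
letters_in_reshape_order = [pair[0] for pair in pairs]

out = pd.DataFrame(pairs)
out[1] = pd.to_numeric(out[1])  # object dtype back to integers
out = out.sort_values(1, ignore_index=True)
```

The sort only restores the wanted order because the numbers happen to be increasing here; with arbitrary data, stacking the column groups directly is the more robust route.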
Another option would be to individually stack the numbers and strings, before recombining into a single dataframe:
numbers = df.select_dtypes('number').stack().array
strings = df.select_dtypes('object').stack().array
out = pd.concat([pd.Series(strings), pd.Series(numbers)], axis = 1)
out.sort_values(1, ignore_index = True)
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9
One more option, which takes advantage of the pattern in the column names, is pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index=None,
names_to=['0','1'],
names_pattern= ['0|2|4', '1|3|5'])
0 1
0 a 1
1 b 2
2 c 3
3 d 4
4 e 5
5 f 6
6 g 7
7 h 8
8 i 9

While Loop Alternative in Python

I am working on a huge dataframe and trying to create a new column based on a condition on another column. Right now I use a big while loop and the calculation takes too much time. Is there an easier way to do it, with a lambda for example?
def promo(dataframe, a):
    i = 0
    while i < len(dataframe) - 1:
        i = i + 1
        if dataframe.iloc[i-1, 5] >= a:
            dataframe.iloc[i-1, 6] = 1
        else:
            dataframe.iloc[i-1, 6] = 0
    return dataframe

Don't use loops in pandas; they are slow compared to a vectorized solution. Compare the whole column at once and convert the resulting boolean mask to integers with astype, which turns True and False into 1 and 0:
dataframe = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':list('aaabbb'),
'F':[5,3,6,9,2,4],
'G':[5,3,6,9,2,4]
})
a = 5
dataframe['new'] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
A B C D E F G new
0 a 4 7 1 a 5 5 1
1 b 5 8 3 a 3 3 0
2 c 4 9 5 a 6 6 1
3 d 5 4 7 b 9 9 1
4 e 5 2 1 b 2 2 0
5 f 4 3 0 b 4 4 0
If you want to overwrite the 7th column:
a = 5
dataframe.iloc[:,6] = (dataframe.iloc[:,5] >= a).astype(int)
print (dataframe)
A B C D E F G
0 a 4 7 1 a 5 1
1 b 5 8 3 a 3 0
2 c 4 9 5 a 6 1
3 d 5 4 7 b 9 1
4 e 5 2 1 b 2 0
5 f 4 3 0 b 4 0
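Selecting by column name instead of position is usually easier to read. Below is a minimal sketch of that, with np.where as an alternative; the column names sales, promo and promo_label and the threshold are assumptions, not taken from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical column names: "sales" is the column being tested,
# "promo" is the flag column the loop was filling in.
dataframe = pd.DataFrame({"sales": [5, 3, 6, 9, 2, 4]})
threshold = 5

# Compare the whole column at once, then cast True/False to 1/0.
dataframe["promo"] = (dataframe["sales"] >= threshold).astype(int)

# np.where does the same job and allows values other than 1 and 0.
dataframe["promo_label"] = np.where(dataframe["sales"] >= threshold, "yes", "no")
```

Unlike the loop in the question, both versions also process the last row.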

Pandas DataFrame drop tuple or list of columns

The drop method of a pandas.DataFrame accepts lists of column names but not tuples, even though the documentation says that "list-like" arguments are acceptable. Am I reading the documentation incorrectly? I would expect my MWE to work.
MWE
import pandas as pd
df = pd.DataFrame({k: range(5) for k in list('abcd')})
df.drop(['a', 'c'], axis=1) # Works
df.drop(('a', 'c'), axis=1) # Errors
Versions - Using Python 2.7.12, Pandas 0.20.3.

The problem is that pandas interprets a tuple as a single MultiIndex label:
import numpy as np
import pandas as pd
np.random.seed(345)
mux = pd.MultiIndex.from_arrays([list('abcde'), list('cdefg')])
df = pd.DataFrame(np.random.randint(10, size=(4,5)), columns=mux)
print (df)
a b c d e
c d e f g
0 8 0 3 9 8
1 4 3 4 1 7
2 4 0 9 6 3
3 8 0 3 1 5
df = df.drop(('a', 'c'), axis=1)
print (df)
b c d e
d e f g
0 0 3 9 8
1 3 4 1 7
2 0 9 6 3
3 0 3 1 5
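This is the same as selecting a column with a tuple (shown here on the original DataFrame, before the drop above):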
df = df[('a', 'c')]
print (df)
0 8
1 4
2 4
3 8
Name: (a, c), dtype: int32

Pandas treats tuples as multi-index values, so try this instead:
In [330]: df.drop(list(('a', 'c')), axis=1)
Out[330]:
b d
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
Here is an example of deleting rows (axis=0, the default) from a DataFrame with a MultiIndex:
In [342]: x = df.set_index(np.arange(len(df), 0, -1), append=True)
In [343]: x
Out[343]:
a b c d
0 5 0 0 0 0
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [344]: x.drop((0,5))
Out[344]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
4 1 4 4 4 4
In [345]: x.drop([(0,5), (4,1)])
Out[345]:
a b c d
1 4 1 1 1 1
2 3 2 2 2 2
3 2 3 3 3 3
So when you pass a tuple, pandas treats it as a single MultiIndex label.
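If the labels may arrive as a tuple, a set or any other iterable, one way to sidestep the ambiguity is to normalise them to a list first. A minimal sketch, where drop_columns is a hypothetical helper and not part of pandas:

```python
import pandas as pd

df = pd.DataFrame({k: range(5) for k in "abcd"})

def drop_columns(frame, labels):
    """Drop several columns given as any iterable of labels."""
    # Converting to a list stops pandas from reading a tuple
    # as one MultiIndex label.
    return frame.drop(list(labels), axis=1)

remaining = drop_columns(df, ("a", "c"))
```

This only makes sense for flat column indexes; with MultiIndex columns a tuple really is a single label and should be passed as such.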

I used del to remove a column whose label is a tuple:
del df3[('val1', 'val2')]
and the column was deleted.

rename index using index and name column

I have the dataframe df
import numpy as np
import pandas as pd
b=np.array([0,1,2,2,0,1,2,2,3,4,4,4,5,6,0,1,0,0]).reshape(-1,1)
c=np.array(['a','a','a','a','b','b','b','b','b','b','b','b','b','b','c','c','d','e']).reshape(-1,1)
df = pd.DataFrame(np.hstack([b,c]),columns=['Start','File'])
df
Out[22]:
Start File
0 0 a
1 1 a
2 2 a
3 2 a
4 0 b
5 1 b
6 2 b
7 2 b
8 3 b
9 4 b
10 4 b
11 4 b
12 5 b
13 6 b
14 0 c
15 1 c
16 0 d
17 0 e
I would like to rename the index using the pattern index_File,
so that the index labels are 0_a, 1_a, ..., 17_e.

Use set_index, either reassigning the result or passing inplace=True:
df.set_index(df.File.radd(df.index.astype(str) + '_'))
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
At the expense of a few more characters, we can speed this up and avoid the unnecessary index name:
df.set_index(df.File.values.__radd__(df.index.astype(str) + '_'))
Start File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e

You can assign to the index directly: first convert the default index to str with astype, then concatenate the strings as usual:
In[41]:
df.index = df.index.astype(str) + '_' + df['File']
df
Out[41]:
Start File
File
0_a 0 a
1_a 1 a
2_a 2 a
3_a 2 a
4_b 0 b
5_b 1 b
6_b 2 b
7_b 2 b
8_b 3 b
9_b 4 b
10_b 4 b
11_b 4 b
12_b 5 b
13_b 6 b
14_c 0 c
15_c 1 c
16_d 0 d
17_e 0 e
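The outputs above still show File as the index name. Here is a short sketch of building the labels and then clearing that leftover name, using a shortened, made-up version of the data:

```python
import pandas as pd

df = pd.DataFrame({"Start": [0, 1, 0], "File": ["a", "a", "b"]})

# Build "<position>_<File>" labels from the existing index and the File column.
new_labels = df.index.astype(str) + "_" + df["File"]

# new_labels is a Series named "File", so the new index inherits that name;
# setting the name to None removes it.
df = df.set_index(new_labels)
df.index.name = None
```

The File column itself stays in place, since the Series passed to set_index is a new object rather than a column being moved into the index.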

Add multiple columns to DataFrame and set them equal to an existing column

I want to add multiple columns to a pandas DataFrame and set them equal to an existing column. Is there a simple way of doing this? In R I would do:
df <- data.frame(a=1:5)
df[c('b','c')] <- df$a
df
a b c
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
5 5 5 5
In pandas this results in KeyError: "['b' 'c'] not in index":
df = pd.DataFrame({'a': np.arange(1,6)})
df[['b','c']] = df.a

You can use the .assign() method:
In [31]: df.assign(b=df['a'], c=df['a'])
Out[31]:
a b c
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 5 5 5
Or a slightly more creative approach:
In [41]: cols = list('bcdefg')
In [42]: df.assign(**{col:df['a'] for col in cols})
Out[42]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
Another solution:
In [60]: pd.DataFrame(np.repeat(df.values, len(cols)+1, axis=1), columns=['a']+cols)
Out[60]:
a b c d e f g
0 1 1 1 1 1 1 1
1 2 2 2 2 2 2 2
2 3 3 3 3 3 3 3
3 4 4 4 4 4 4 4
4 5 5 5 5 5 5 5
NOTE: as @Cpt_Jauchefuerst mentioned in the comments, in older versions DataFrame.assign(z=1, a=1) adds the columns in alphabetical order, i.e. a is added to the existing columns first and then z. With Python 3.6+ and pandas 0.23+, the keyword order is preserved.

A pd.concat approach
df = pd.DataFrame(dict(a=range(5)))
pd.concat([df.a] * 5, axis=1, keys=list('abcde'))
a b c d e
0 0 0 0 0 0
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
4 4 4 4 4 4

Turns out you can use a loop to do this:
for i in ['b','c']: df[i] = df.a

You can set them individually if you're only dealing with a few columns:
df['b'] = df['a']
df['c'] = df['a']
or you can use a loop as you discovered.
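For a longer list of names, the assign and loop ideas above can be combined into one line. A minimal sketch, assuming the new column names are held in a list called new_cols (a made-up name):

```python
import pandas as pd

df = pd.DataFrame({"a": range(1, 6)})
new_cols = ["b", "c"]

# dict.fromkeys builds {"b": df["a"], "c": df["a"]};
# assign returns a new DataFrame with those columns appended.
df = df.assign(**dict.fromkeys(new_cols, df["a"]))

# Each new column behaves as its own copy of the data,
# so editing one leaves "a" and the other copies untouched.
df.loc[0, "b"] = 99
```

This works for any column names that are valid Python keyword arguments; for other names, the plain loop is the simpler option.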
