changing values in a column of a data frame efficiently [duplicate] - python

I'm fighting with pandas and for now I'm losing. I have a source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add a new column to this data frame with the first digit from the values in column 'First':
a) change the number from column 'First' to a string
b) extract the first character from the newly created string
c) save the result from b) as a new column in the data frame
I don't know how to apply this to a pandas data frame object. I would be grateful for any help with that.
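A literal translation of steps a)-c) is an apply over the column (a minimal, non-vectorised sketch; the answers below show a faster way):
df['new_col'] = df['First'].apply(lambda v: str(v)[0])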

Cast the dtype of the column to str and you can then perform vectorised slicing using the .str accessor:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
If you need to, you can cast the dtype back again by calling astype(int) on the column.
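For example, a quick sketch of that round trip (the cast back to int assumes the column has no NaNs):
df['new_col'] = df['First'].astype(str).str[0].astype(int)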

.str.get
This is the simplest of the string methods to specify.
# Setup
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read: object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
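A quick sketch of that NaN propagation:
s = pd.Series(['xyz', np.nan, 'foobar'])
s.str[0]
0      x
1    NaN
2      f
dtype: object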
For numeric columns, an .astype(str) conversion is required beforehand, as shown in @EdChum's answer.
# Note that this won't work well if the data has NaNs:
# str(np.nan) is 'nan', so you'd get a lowercase 'n' back
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are roughly 3-4x faster.

Related

Remap values in pandas column with a dict, None if KeyError

I'd like to modify col1 of the following dataframe df:
col1 col2
0 Black 7
1 Death 2
2 Hardcore 6
3 Grindcore 1
4 Deathcore 4
...
I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'} to get the following dataframe:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
...
I know I can use df.map or df.replace, for example like this:
df.replace({"col1":cat_dic})
but I want keys missing from the dictionary to map to None, and with the previous line I got this result instead:
col1 col2
0 B 7
1 D 2
2 H 6
3 Grindcore 1
4 Deathcore 4
...
Given that Grindcore and Deathcore are not the only two values in col1 that I want set to None, have you got any idea how to do it?
Use dict.get:
df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
#default value is None
df['col1'] = df['col1'].map(cat_dic.get)
print (df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Performance comparison on 50k rows:
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use Series.apply first:
df.col1 = df.col1.apply(lambda x: None if x not in cat_dic else x)
Then, you can safely use df.replace:
df.replace({"col1": cat_dic})
This can be done in one line:
df['col1'] = df.col1.apply(lambda x: None if x not in cat_dic else cat_dic[x])
Output is:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Here is an easy one-liner which gives the expected output.
df['col1'] = df['col1'].map(cat_dic).replace({np.nan: None})
Output:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Series.map already maps mismatched keys to NaN:
print(df['col1'].map(cat_dic))
0 B
1 D
2 H
3 NaN
4 NaN
Name: col1, dtype: object
Anyway, you can update your cat_dic with the missing keys from the col1 column:
cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)
print(cat_dic)
{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
In [6]: df.col1.map(cat_dic.get)
Out[6]:
0 B
1 D
2 H
3 None
4 None
dtype: object
You could also use apply; both work. When working on a Series, map tends to be faster, I think.
Explanation:
You can get a default value for missing keys by using dict.get instead of the [] operator. By default, this default value is None, so simply passing the dict.get method to apply/map just works.
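A quick sketch of that default behaviour in plain Python:
cat_dic = {'Black': 'B', 'Death': 'D', 'Hardcore': 'H'}
cat_dic.get('Black')           # 'B'
cat_dic.get('Grindcore')       # None -- the default for a missing key
cat_dic.get('Grindcore', 'X')  # a custom default can also be supplied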

What is the fastest way to perform a replace on a column of a Pandas DataFrame based on the index of a separate Series?

Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted @coldspeed's answer. I couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
    df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
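For context: map with a Series argument looks each value up in that Series' index, which is why passing labels directly works here. A minimal sketch:
labels = pd.Series(list('ABCDE'))
pd.Series([0, 2, 4]).map(labels)
0    A
1    C
2    E
dtype: object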

Select rows of a dataframe based on another dataframe in Python

I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                    'B': 'one one two three two two one three'.split(),
                    'C': np.arange(8), 'D': np.arange(8) * 2})
print(df1)
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
I want to select rows in df1 using df2, as follows:
df2 = pd.DataFrame({'A': 'foo bar'.split(),
                    'B': 'one two'.split()})
print(df2)
A B
0 foo one
1 bar two
Here is what I have tried in Python, but I just wonder if there is another method. Thanks.
df = df1.merge(df2, on=['A','B'])
print(df)
This is the expected output:
A B C D
0 foo one 0 0
1 bar two 5 10
2 foo one 6 12
Simplest is to use merge with an inner join.
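A minimal sketch (how='inner' is the default, so spelling it out is optional):
df = df1.merge(df2, on=['A', 'B'], how='inner')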
Another solution with filtering:
arr = [np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]
df = df1[np.array(arr).any(axis=0)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Or create MultiIndex and filter with Index.isin:
df = df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Method #4. .apply + key function:
>>> key = lambda row: (row.A, row.B)
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Method #5. .join:
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
A B C D
0 foo one 0 0
6 foo one 6 12
5 bar two 5 10
Methods already mentioned:
.merge by @ahbon
Filtering with .to_dict('records') by @jezrael (fastest)
Index.isin by @jezrael
Performance comparison (fastest to slowest):
>>> %%timeit
>>> df1[np.array([np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]).any(axis=0)]
1.62 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> key = lambda row: (row.A, row.B)
>>> %%timeit
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
2.96 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.merge(df2, on=['A','B'])
3.15 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
3.97 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
6.55 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The .merge method performs an inner join by default.
# The resulting dataframe will only have rows where the
# merge column values exist in both dataframes (here,
# df_only_english and train_orders are the answerer's own frames).
x = df_only_english.merge(train_orders)
x
(wide output truncated; the result has columns Unnamed: 0, language, score, id, iso_language_name, is_en and cell_order)

Replacing values in a dataframe from another dataframe

So I am working with a dataset with two data frames.
The Data Frames look like this:
df1:
Item_ID Item_Name
0 A
1 B
2 C
df2:
Item_slot_1 Item_slot_2 Item_Slot_3
2 2 1
1 2 0
0 1 1
The values in df2 represent the Item_ID values from df1. How can I replace the values in df2, mapping each Item_ID to the actual item name, so that df2 looks like:
Item_slot_1 Item_slot_2 Item_Slot_3
C C B
B C A
A B B
The data set in reality is much larger and has many more IDs and names than just A, B and C.
Create a dictionary with zip and pass it to applymap, or to replace, or to apply with map:
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
# if a value does not exist in df1['Item_ID'], you get None in df2
df2 = df2.applymap(s.get)
Or:
# if a value does not exist in df1['Item_ID'], the original value is kept in df2
df2 = df2.replace(s)
Or:
# if a value does not exist in df1['Item_ID'], you get NaN in df2
df2 = df2.apply(lambda x: x.map(s))
print (df2)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
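To make the difference between the three concrete, here is a small sketch probing a hypothetical ID 3 that has no name in df1:
s = {0: 'A', 1: 'B', 2: 'C'}
probe = pd.DataFrame({'Item_slot_1': [2, 3]})
probe.applymap(s.get)              # 3 -> None
probe.replace(s)                   # 3 -> 3 (left unchanged)
probe.apply(lambda x: x.map(s))    # 3 -> NaN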
EDIT:
You can specify the columns to process by name:
cols = ['Item_slot_1','Item_slot_2','Item_Slot_3']
df2[cols] = df2[cols].applymap(s.get)
df2[cols] = df2[cols].replace(s)
df2[cols] = df2[cols].apply(lambda x: x.map(s))
You can improve the speed of dictionary mapping with numpy. If your items are numbered 0-N, this is trivial; if they are not, it gets a bit trickier, but is still easily doable.
If the items in df1 are numbered 0-N, use basic indexing:
a = df1['Item_Name'].values
b = df2.values
pd.DataFrame(a[b], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
If they are not numbered 0-N, here is a more general approach:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C B
1 B C A
2 A B B
Replacing only a subset of columns from df2 is also simple; let's demonstrate by replacing just the first two columns of df2:
x = df1['Item_ID'].values
y = df1['Item_Name'].values
cols = ['Item_slot_1', 'Item_slot_2']
z = df2[cols].values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
df2[cols] = m[z]
Item_slot_1 Item_slot_2 Item_Slot_3
0 C C 1
1 B C 0
2 A B 1
This type of indexing nets a hefty performance gain over apply and replace:
import string
df1 = pd.DataFrame({'Item_ID': np.arange(26), 'Item_Name': list(string.ascii_uppercase)})
df2 = pd.DataFrame(np.random.randint(1, 26, (10000, 100)))
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.applymap(s.get)
158 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.replace(s)
750 ms ± 34.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = dict(zip(df1['Item_ID'], df1['Item_Name']))
df2.apply(lambda x: x.map(s))
93.1 ms ± 4.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
x = df1['Item_ID'].values
y = df1['Item_Name'].values
z = df2.values
m = np.arange(x.max() + 1, dtype=object)
m[x] = y
pd.DataFrame(m[z], columns=df2.columns)
30.4 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas Dataframe: Replacing NaN with row average

I am trying to learn pandas but I have been puzzled by the following. I want to replace NaNs in a DataFrame with the row average. Hence something like df.fillna(df.mean(axis=1)) should work, but for some reason it fails for me. Am I missing anything? Is there something wrong with what I'm doing? Is it because it's not implemented?
import pandas as pd
import numpy as np

pd.__version__
Out[44]:
'0.15.2'
In [45]:
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
df
Out[45]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
In [46]:
df.fillna(df.mean(axis=1))
Out[46]:
c1 c2 c3
0 1 4 7
1 2 5 NaN
2 3 6 9
However, something like this seems to work fine:
df.fillna(df.mean(axis=0))
Out[47]:
c1 c2 c3
0 1 4 7
1 2 5 8
2 3 6 9
As commented, the axis argument to fillna is NotImplemented:
df.fillna(df.mean(axis=1), axis=1)
Note: the axis argument would be critical here, as you don't want to fill the nth column with the nth row average.
For now you'll need to iterate through:
m = df.mean(axis=1)
for i, col in enumerate(df):
    # using i allows for duplicate columns
    # inplace *may* not always work here, so IMO the next line is preferred
    # df.iloc[:, i].fillna(m, inplace=True)
    df.iloc[:, i] = df.iloc[:, i].fillna(m)
print(df)
c1 c2 c3
0 1 4 7.0
1 2 5 3.5
2 3 6 9.0
An alternative is to fillna the transpose and then transpose, which may be more efficient...
df.T.fillna(df.mean(axis=1)).T
As an alternative, you could also use an apply with a lambda expression like this:
df.apply(lambda row: row.fillna(row.mean()), axis=1)
which also yields
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
For an efficient solution, use DataFrame.where:
We could use where on axis=0:
df.where(df.notna(), df.mean(axis=1), axis=0)
or mask on axis=0:
df.mask(df.isna(), df.mean(axis=1), axis=0)
By using axis=0, we can fill in the missing values in each column with the row averages.
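A quick check on the question's small frame (a sketch; exact integer/float dtypes in the output can vary by pandas version):
df.where(df.notna(), df.mean(axis=1), axis=0)
   c1  c2   c3
0   1   4  7.0
1   2   5  3.5
2   3   6  9.0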
These methods perform very similarly (where does slightly better on large DataFrames of shape (300_000, 20)), are ~35-50% faster than the numpy methods posted here, and are ~110x faster than the double transpose method.
Some benchmarks, using the creator helper defined below to build a (300_000, 20) frame with NaNs:
df = creator()
>>> %timeit df.where(df.notna(), df.mean(axis=1), axis=0)
542 ms ± 3.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.mask(df.isna(), df.mean(axis=1), axis=0)
555 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.fillna(0) + df.isna().values * df.mean(axis=1).values.reshape(-1,1)
751 ms ± 22 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape), columns=df.columns, index=df.index); df.update(fill, overwrite=False)
848 ms ± 22.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit df.apply(lambda row: row.fillna(row.mean()), axis=1)
1min 4s ± 5.32 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit df.T.fillna(df.mean(axis=1)).T
1min 5s ± 2.4 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
def creator():
    A = np.random.rand(300_000, 20)
    A.ravel()[np.random.choice(A.size, 300_000, replace=False)] = np.nan
    return pd.DataFrame(A)
I'll propose an alternative that involves casting into numpy arrays. Performance-wise, I think this is more efficient and probably scales better than the other proposed solutions so far.
The idea is to use an indicator matrix (df.isna().values, which is 1 where the element is N/A and 0 otherwise) and broadcast-multiply it with the row averages.
We thus end up with a matrix of exactly the same shape as the original df, which contains the row-average value wherever the original element was N/A, and 0 otherwise.
We add this matrix to the original df, making sure to fillna with 0 first so that, in effect, we have filled the N/A's with the respective row averages.
# setup code
df = pd.DataFrame()
df['c1'] = [1, 2, 3]
df['c2'] = [4, 5, 6]
df['c3'] = [7, np.nan, 9]
# fillna row-wise
row_avgs = df.mean(axis=1).values.reshape(-1,1)
df = df.fillna(0) + df.isna().values * row_avgs
df
giving
c1 c2 c3
0 1.0 4.0 7.0
1 2.0 5.0 3.5
2 3.0 6.0 9.0
You can broadcast the mean to a DataFrame with the same index as the original and then use update with overwrite=False to get the behavior of .fillna. Unlike .fillna, update allows for filling when the indices have duplicated labels. It should be faster than the looping .fillna for fewer than 50,000 rows or so.
fill = pd.DataFrame(np.broadcast_to(df.mean(1).to_numpy()[:, None], df.shape),
                    columns=df.columns,
                    index=df.index)
df.update(fill, overwrite=False)
print(df)
(the output below comes from a frame with duplicated column and index labels, illustrating the point about duplicates above)
     1    1    1
0  1.0  4.0  7.0
0  2.0  5.0  3.5
0  3.0  6.0  9.0
I just had the same problem. I found this workaround to work:
df.transpose().fillna(df.mean(axis=1)).transpose()
I'm not sure though about the efficiency of this solution.
