I'd like to modify the col1 of the following dataframe df:
col1 col2
0 Black 7
1 Death 2
2 Hardcore 6
3 Grindcore 1
4 Deathcore 4
...
I want to use a dict named cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'} to get the following dataframe:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
...
I know I can use df.map or df.replace, for example like this:
df.replace({"col1":cat_dic})
but I want the KeyErrors of the dictionnary to return None, and with the previous line, I got this result instead:
col1 col2
0 B 7
1 D 2
2 H 6
3 Grindcore 1
4 Deathcore 4
...
Given that Grindcore and Deathcore are not the only 2 values in col1 that I want to be set to None, have you got any idea on how to do it ?
Use dict.get:
df['col1'] = df['col1'].map(lambda x: cat_dic.get(x, None))
#default value is None
df['col1'] = df['col1'].map(cat_dic.get)
print (df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Performance comparison in 50k rows:
df = pd.concat([df] * 10000, ignore_index=True)
cat_dic={'Black':'B', 'Death':'D', 'Hardcore':'H'}
In [93]: %timeit df['col1'].map(cat_dic.get)
3.22 ms ± 16.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [94]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
15 ms ± 293 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [95]: %timeit df['col1'].replace(dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic))
12.3 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [96]: %timeit df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
13.8 ms ± 837 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [97]: %timeit df['col1'].map(cat_dic).replace(dict({np.nan: None}))
9.97 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
You may use pd.apply first
df.col1 = df.col1.apply(lambda x: None if x not in cat_dic.keys() else x)
Then, you can safely use pd.replace
df.replace({"col1":cat_dic})
This can be done in One line:
df1['col1'] = df1.col1.apply(lambda x: None if x not in cat_dic.keys() else cat_dic[x])
Output is:
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Here is a one liner easy solution which gives us the expected output.
df['col1'] = df['col1'].map(cat_dic).replace(dict({np.nan: None}))
Output :
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
Series.map already maps NaN to the mismatched key
$ print(df['col1'].map(cat_dic))
0 B
1 D
2 H
3 NaN
4 NaN
Name: col1, dtype: object
Anyway, you can update your cat_dic with missing keys from col1 column
cat_dic = dict(dict.fromkeys(df['col1'].unique(), None), **cat_dic)
df['col1'] = df['col1'].replace(cat_dic)
print(cat_dic)
{'Black': 'B', 'Death': 'D', 'Hardcore': 'H', 'Grindcore': None, 'Deathcore': None}
print(df)
col1 col2
0 B 7
1 D 2
2 H 6
3 None 1
4 None 4
In [6]: df.col1.map(cat_dic.get)
Out[6]:
0 B
1 D
2 H
3 None
4 None
dtype: object
You could also use apply, both work. When working on a Series, map is faster I think.
Explanation:
You can get a default value for missing keys by using dict.get instead using the [..]-operator. By default, this default value is None. So simply passing the dict.get method to apply/map just works.
Related
I'm fighting with pandas and for now I'm loosing. I have source table similar to this:
import pandas as pd
a=pd.Series([123,22,32,453,45,453,56])
b=pd.Series([234,4353,355,453,345,453,56])
df=pd.concat([a, b], axis=1)
df.columns=['First', 'Second']
I would like to add new column to this data frame with first digit from values in column 'First':
a) change number to string from column 'First'
b) extracting first character from newly created string
c) Results from b save as new column in data frame
I don't know how to apply this to the pandas data frame object. I would be grateful for helping me with that.
Cast the dtype of the col to str and you can perform vectorised slicing calling str:
In [29]:
df['new_col'] = df['First'].astype(str).str[0]
df
Out[29]:
First Second new_col
0 123 234 1
1 22 4353 2
2 32 355 3
3 453 453 4
4 45 345 4
5 453 453 4
6 56 56 5
if you need to you can cast the dtype back again calling astype(int) on the column
.str.get
This is the simplest to specify string methods
# Setup
df = pd.DataFrame({'A': ['xyz', 'abc', 'foobar'], 'B': [123, 456, 789]})
df
A B
0 xyz 123
1 abc 456
2 foobar 789
df.dtypes
A object
B int64
dtype: object
For string (read:object) type columns, use
df['C'] = df['A'].str[0]
# Similar to,
df['C'] = df['A'].str.get(0)
.str handles NaNs by returning NaN as the output.
For non-numeric columns, an .astype conversion is required beforehand, as shown in #Ed Chum's answer.
# Note that this won't work well if the data has NaNs.
# It'll return lowercase "n"
df['D'] = df['B'].astype(str).str[0]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
List Comprehension and Indexing
There is enough evidence to suggest a simple list comprehension will work well here and probably be faster.
# For string columns
df['C'] = [x[0] for x in df['A']]
# For numeric columns
df['D'] = [str(x)[0] for x in df['B']]
df
A B C D
0 xyz 123 x 1
1 abc 456 a 4
2 foobar 789 f 7
If your data has NaNs, then you will need to handle this appropriately with an if/else in the list comprehension,
df2 = pd.DataFrame({'A': ['xyz', np.nan, 'foobar'], 'B': [123, 456, np.nan]})
df2
A B
0 xyz 123.0
1 NaN 456.0
2 foobar NaN
# For string columns
df2['C'] = [x[0] if isinstance(x, str) else np.nan for x in df2['A']]
# For numeric columns
df2['D'] = [str(x)[0] if pd.notna(x) else np.nan for x in df2['B']]
A B C D
0 xyz 123.0 x 1
1 NaN 456.0 NaN 4
2 foobar NaN f NaN
Let's do some timeit tests on some larger data.
df_ = df.copy()
df = pd.concat([df_] * 5000, ignore_index=True)
%timeit df.assign(C=df['A'].str[0])
%timeit df.assign(D=df['B'].astype(str).str[0])
%timeit df.assign(C=[x[0] for x in df['A']])
%timeit df.assign(D=[str(x)[0] for x in df['B']])
12 ms ± 253 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
27.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
3.77 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
7.84 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
List comprehensions are 4x faster.
Sorry if I've been googling the wrong keywords, but I haven't been able to find an efficient way to replace all instances of an integer in a DataFrame column with its corresponding indexed value from a secondary Series.
I'm working with the output of a third party program that strips the row and column labels from an input matrix and replaces them with their corresponding indices. I'd like to restore the true labels from the indices.
I have a dummy example of the dataframe and series in question:
In [6]: df
Out[6]:
idxA idxB var2
0 0 1 2.0
1 0 2 3.0
2 2 4 2.0
3 2 1 1.0
In [8]: labels
Out[8]:
0 A
1 B
2 C
3 D
4 E
Name: label, dtype: object
Currently, I'm converting the series to a dictionary and using replace:
label_dict = labels.to_dict()
df['idxA'] = df.idxA.replace(label_dict)
df['idxB'] = df.idxB.replace(label_dict)
which does give me the expected result:
In [12]: df
Out[12]:
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
However, this is very slow for my full dataset (approximately 3.8 million rows in the table, and 19,000 labels). Is there a more efficient way to approach this?
Thanks!
EDIT: I accepted #coldspeed's answer. Couldn't paste a code block in the comment reply to his answer, but his solution sped up the dummy code by about an order of magnitude:
In [10]: %timeit df.idxA.replace(label_dict)
4.41 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: %timeit df.idxA.map(labels)
435 µs ± 3.93 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can call map for each column using apply:
df.loc[:, 'idxA':'idxB'] = df.loc[:, 'idxA':'idxB'].apply(lambda x: x.map(labels))
df
idxA idxB var2
0 A B 2.0
1 A C 3.0
2 C E 2.0
3 C B 1.0
This is effectively iterating over every column (but the map operation for a single column is vectorized, so it is fast). It might just be faster to do
cols_of_interest = ['idxA', 'idxB', ...]
for c in cols_of_interest: df[c] = df[c].map(labels)
map is faster than replace, depending on the number of columns to replace. Your mileage may vary.
df_ = df.copy()
df = pd.concat([df_] * 10000, ignore_index=True)
%timeit df.loc[:, 'idxA':'idxB'].replace(labels)
%%timeit
for c in ['idxA', 'idxB']:
df[c].map(labels)
6.55 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 70 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I have a the following pandas table
df:
EVNT_ID col1 col2 col3 col4
123454 1 Nan 4 5
628392 Nan 3 Nan 7
293899 2 Nan Nan 6
127820 9 11 12 19
Now I am trying to concat all the columns except the first column and I want my data frame to look in the following way
new_df:
EVNT_ID col1 col2 col3 col4 new_col
123454 1 Nan 4 5 1|4|5
628392 Nan 3 Nan 7 3|7
293899 2 Nan Nan 6 2|6
127820 9 11 12 19 9|11|12|19
I am using the following code
df['new_column'] = df[~df.EVNT_ID].apply(lambda x: '|'.join(x.dropna().astype(str).values), axis=1)
but it is giving me the following error
ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I would really appreciate if any one can give me where I am wrong. I'd really appreciate that.
Try the following code:
df['new_col'] = df.iloc[:, 1:].apply(lambda x:
'|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
Initially I thought about x.dropna() instead of x if str(el) != 'nan',
but %timeit showed that dropna() works much slower.
You can do this with filter and agg:
df.filter(like='col').agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
Or,
df.drop('EVNT_ID', 1).agg(
lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
0 1|4|5
1 3|7
2 2|6
3 9|11|12|19
dtype: object
If performance is important, you can use a list comprehension:
joined = [
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
joined
# ['1|4|5', '3|7', '2|6', '9|11|12|19']
df.assign(new_col=joined)
EVNT_ID col1 col2 col3 col4 new_col
0 123454 1.0 NaN 4.0 5 1|4|5
1 628392 NaN 3.0 NaN 7 3|7
2 293899 2.0 NaN NaN 6 2|6
3 127820 9.0 11.0 12.0 19 9|11|12|19
If you can forgive the overhead of assignment to a DataFrame, here's timings for the two fastest solutions here.
df = pd.concat([df] * 1000, ignore_index=True)
# In this post.
%%timeit
[
'|'.join([str(int(x)) for x in r if pd.notna(x)])
for r in df.iloc[:,1:].values.tolist()
]
# RafaelC's answer.
%%timeit
[
'|'.join([k for k in a if k])
for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values.tolist())
]
31.9 ms ± 800 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
23.7 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Although note the answers aren't identical because #RafaelC's code produces floats: ['1.0|2.0|9.0', '3.0|11.0', ...]. If this is fine, then great. Otherwise you'll need to convert to int which adds more overhead.
Using list comprehension and zip
>>> [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
Timing seems alright
df = pd.concat([df]*1000)
%timeit [['|'.join([k for k in a if k])] for a in zip(*df.fillna('').astype(str).iloc[:, 1:].values)]
10.8 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.filter(like='col').agg(lambda x: x.dropna().astype(int).astype(str).str.cat(sep='|'), axis=1)
1.68 s ± 91.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.iloc[:, 1:].apply(lambda x: '|'.join(str(el) for el in x if str(el) != 'nan'), axis=1)
87.8 ms ± 5.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.assign(new_col=['|'.join([str(int(x)) for x in r if ~np.isnan(x)]) for r in df.iloc[:,1:].values])
45.1 ms ± 1.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import time
import timeit
from pandas import DataFrame
import numpy as np
import pandas as pd
from datetime import datetime
df = pd.DataFrame({
'date' : ['05/9/2023', '07/10/2023', '08/11/2023', '06/12/2023'],
'A' : [1, np.nan,4, 7],
'B' : [2, np.nan, 5, 8],
'C' : [3, 6, 9, np.nan]
}).set_index('date')
print(df)
print('.........')
start_time = datetime.now()
df['ColumnA'] = df[df.columns].agg(
lambda x: ','.join(x.dropna().astype(str)),
axis=1
)
print(df['ColumnA'])
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
"""
A B C
date
05/9/2023 1.0 2.0 3.0
07/10/2023 NaN NaN 6.0
08/11/2023 4.0 5.0 9.0
06/12/2023 7.0 8.0 NaN
...........................
OUTPUT:
date
05/9/2023 1.0,2.0,3.0
07/10/2023 6.0
08/11/2023 4.0,5.0,9.0
06/12/2023 7.0,8.0
Name: ColumnA, dtype: object
Duration: 0:00:00.002998
"""
I have the following dataframe:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
'B': 'one one two three two two one three'.split(),
'C': np.arange(8), 'D': np.arange(8) * 2})
print(df1)
A B C D
0 foo one 0 0
1 bar one 1 2
2 foo two 2 4
3 bar three 3 6
4 foo two 4 8
5 bar two 5 10
6 foo one 6 12
7 foo three 7 14
I hope to select rows in df1 by the df2 as follows:
df2 = pd.DataFrame({'A': 'foo bar'.split(),
'B': 'one two'.split()
})
print(df2)
A B
0 foo one
1 bar two
Here is what I have tried in Python, but I just wonder if there is another method. Thanks.
df = df1.merge(df2, on=['A','B'])
print(df)
This is the output expected.
A B C D
0 foo one 0 0
1 bar two 5 10
2 foo one 6 12
Using pandas to select rows using two different columns from dataframe?
Select Columns of a DataFrame based on another DataFrame
Simpliest is use merge with inner join.
Another solution with filtering:
arr = [np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('r')]
df = df1[np.array(arr).any(axis=0)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Or create MultiIndex and filter with Index.isin:
df = df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
print(df)
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Method #4. .apply + key function:
>>> key = lambda row: (row.A, row.B)
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
A B C D
0 foo one 0 0
5 bar two 5 10
6 foo one 6 12
Method #5. .join:
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
A B C D
0 foo one 0 0
6 foo one 6 12
5 bar two 5 10
Methods already mentioned:
.merge by #ahbon
Filtering with .to_dict('records') by #jezrael (fastest)
Index.isin by #jezrael
Performance comparison (fastest to slowest):
>>> %%timeit
>>> df1[np.array([np.array([df1[k] == v for k, v in x.items()]).all(axis=0) for x in df2.to_dict('records')]).any(axis=0)]
1.62 ms ± 15.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> key = lambda row: (row.A, row.B)
>>> %%timeit
>>> df1[df1.apply(key, axis=1).isin(df2.apply(key, axis=1))]
2.96 ms ± 408 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.merge(df2, on=['A','B'])
3.15 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1.join(df2.set_index(['A', 'B']), on=['A', 'B'], how='right')
3.97 ms ± 341 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
>>> %%timeit
>>> df1[df1.set_index(['A','B']).index.isin(df2.set_index(['A','B']).index)]
6.55 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# The .merge method performs an inner join by default.
# The resulting dataframe will only have rows where the
# merge column value exists in both dataframes
x = df_only_english.merge(train_orders.assign(id=train_orders.id))
x
Unnamed: 0 language score id iso_language_name is_en cell_order
0 0 en 0.999998 00015c83e2717b English English 2e94bd7a 3e99dee9 b5e286ea da4f7550 c417225b 51e3cd89 2600b4eb 75b65993 cf195f8b 25699d02 72b3201a f2c750d3 de148b56...
1 1 en 0.999999 0001bdd4021779 English English 3fdc37be 073782ca 8ea7263c 80543cd8 38310c80 073e27e5 015d52a4 ad7679ef 7fde4f04 07c52510 0a1a7a39 0bcd3fef 58bf360b
2 2 en 0.999996 0001daf4c2c76d English English 97266564 a898e555 86605076 76cc2642 ef279279 df6c939f 2476da96 00f87d0a ae93e8e6 58aadb1d d20b0094 986fd4f1 b4ff1015...
I have a dataframe like the following:
Timestamp Value
11/23/2017 7
11/24/2017 3
11/25/2017 5
11/26/2017 7
11/27/2017 7
11/28/2017 7
I want to write something which returns the first instance of the last value, 7, looking upward, and stops when the value changes to something else. So the answer to the sample dataframe should be 11/26/2017.
I tried to drop_duplicatesbut that returns the first row with timestamp 11/23/2017.
Thanks.
Create helper Series for get unique consecutive values of column Value, get index of max value byidxmax and last select value by loc:
print (df)
Timestamp Value
0 11/23/2017 7
1 11/24/2017 3
2 11/25/2017 5
3 11/26/2017 7
4 11/27/2017 7
5 11/28/2017 7
a = df['Value'].ne(df['Value'].shift()).cumsum()
b = df.loc[a.idxmax(), 'Timestamp']
print (b)
11/26/2017
Detail:
print (a)
0 1
1 2
2 3
3 4
4 4
5 4
Name: Value, dtype: int32
If first column is index solution is simplier, because need index value by max of Series:
print (df)
Value
Timestamp
11/23/2017 7
11/24/2017 3
11/25/2017 5
11/26/2017 7
11/27/2017 7
11/28/2017 7
b = df['Value'].ne(df['Value'].shift()).cumsum().idxmax()
print (b)
11/26/2017
In [173]: df.iat[df.loc[::-1, 'Value'].diff().fillna(0).ne(0).idxmax()+1,
df.columns.get_loc('Timestamp')]
Out[173]: '11/26/2017'
Timing for 600.000 rows DF:
In [201]: df = pd.concat([df] * 10**5, ignore_index=True)
In [202]: %%timeit
...: df['Value'].ne(df['Value'].shift()).cumsum().idxmax()
...:
15.3 ms ± 646 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [203]: %%timeit
...: df.iat[df.loc[::-1, 'Value'].diff().fillna(0).ne(0).idxmax()+1,
...: df.columns.get_loc('Timestamp')]
...:
11.6 ms ± 237 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)