pandas replace multiple values one column - python

In a column risklevels I want to replace Small with 1, Medium with 5 and High with 15.
I tried:
dfm.replace({'risk': {'Small': '1'}},
            {'risk': {'Medium': '5'}},
            {'risk': {'High': '15'}})
But only the Medium values were replaced.
What is wrong?

Your replace format is off; replace expects a single nested dict mapping the column to its {old: new} values:
In [21]: df = pd.DataFrame({'a':['Small', 'Medium', 'High']})
In [22]: df
Out[22]:
a
0 Small
1 Medium
2 High
[3 rows x 1 columns]
In [23]: df.replace({'a' : { 'Medium' : 2, 'Small' : 1, 'High' : 3 }})
Out[23]:
a
0 1
1 2
2 3
[3 rows x 1 columns]
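Applied directly to the question (a small sketch, assuming the column is called risklevels as in the title), a single nested dict handles all three values in one call:
import pandas as pd

dfm = pd.DataFrame({'risklevels': ['Small', 'Medium', 'High']})
# One nested dict: {column_name: {old_value: new_value, ...}}
dfm = dfm.replace({'risklevels': {'Small': 1, 'Medium': 5, 'High': 15}})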

In [123]: import pandas as pd
In [124]: state_df = pd.DataFrame({'state':['Small', 'Medium', 'High', 'Small', 'High']})
In [125]: state_df
Out[125]:
state
0 Small
1 Medium
2 High
3 Small
4 High
In [126]: replace_values = {'Small' : 1, 'Medium' : 2, 'High' : 3 }
In [127]: state_df = state_df.replace({"state": replace_values})
In [128]: state_df
Out[128]:
state
0 1
1 2
2 3
3 1
4 3

You could define a dict and call map
In [256]:
df = pd.DataFrame({'a':['Small', 'Medium', 'High']})
df
Out[256]:
a
0 Small
1 Medium
2 High
[3 rows x 1 columns]
In [258]:
vals_to_replace = {'Small':'1', 'Medium':'5', 'High':'15'}
df['a'] = df['a'].map(vals_to_replace)
df
Out[258]:
a
0 1
1 5
2 15
[3 rows x 1 columns]
Alternatively, if you know the positional values, you can overwrite them directly with update:
In [279]:
val1 = [1, 5, 15]
df['a'].update(pd.Series(val1))
df
Out[279]:
a
0 1
1 5
2 15
[3 rows x 1 columns]
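One caveat worth keeping in mind (a general note, not part of the answer above): unlike replace, map returns NaN for any value that is not a key of the dict, so unmapped categories are silently lost:
df = pd.DataFrame({'a': ['Small', 'Medium', 'High', 'Unknown']})
df['a'] = df['a'].map({'Small': 1, 'Medium': 5, 'High': 15})
# 'Unknown' is not in the dict, so it becomes NaN rather than staying as-is.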

Looks like OP may have been looking for a one-liner to solve this through consecutive calls to .str.replace:
dfm.column = dfm.column.str.replace('Small', '1') \
                       .str.replace('Medium', '5') \
                       .str.replace('High', '15')
OP, you were close: you just needed to chain .str.replace calls instead of separating replacements with commas, and the dictionary format with the column name ('risk') isn't necessary here. Just pass the pattern to match and the replacement value as arguments to str.replace.
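Note that .str.replace produces strings ('1', '5', '15'), not numbers; if numeric values are wanted, an astype call at the end converts them (a sketch, assuming the column is named risk):
dfm['risk'] = (dfm['risk']
               .str.replace('Small', '1')
               .str.replace('Medium', '5')
               .str.replace('High', '15')
               .astype(int))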

I had to turn on the "regex" flag to make it work:
df.replace({'a' : {'Medium':2, 'Small':1, 'High':3 }}, regex=True)

Use str.replace to replace each string (Small, Medium, High) with the new string (1, 5, 15).
If dfm is the dataframe name and column is the column name:
dfm.column = dfm.column.str.replace('Small', '1')
dfm.column = dfm.column.str.replace('Medium', '5')
dfm.column = dfm.column.str.replace('High', '15')

Use Series.replace with lists of before and after values for greater ease:
df.risklevels = df.risklevels.replace(['Small', 'Medium', 'High'], [1, 5, 15])
See the documentation for Series.replace.

Related

Python Pandas make calculation in single cell

I have a TYPE column and a VOLUME column.
What I'm looking to do is first check whether the TYPE column == 'var1', and if so, make a calculation on the VOLUME column.
So far I have something like this:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
data.loc[data['TYPE'] == 'var2', ['VOLUME']] * 4
This seems to set the entire column that meets the condition to the last variable, so I end up with just two values.
Out:
4
4
4
4
8
8
8
Another option:
data['VOLUME'] = data.loc[data['TYPE'] == 'var1', ['VOLUME']] * 2
This works for the first condition but shows NaN for the second condition.
Then when I run:
data['VOLUME'] = data.loc[data['TYPE'] == 'var2', ['VOLUME']] * 4
The whole column shows as NaN.
Consider a simple example which demonstrates what is happening.
df = pd.DataFrame({'A': [1, 2, 3]})
df
A
0 1
1 2
2 3
Now, only values below 2 in column "A" are to be modified. So, try something like
df.loc[df.A < 2, 'A'] * 2
0 2
Name: A, dtype: int64
This series only has 1 row at index 0. If you try assigning this back, the implicit assumption is that the other index values are to be reset to NaN.
df.assign(A=df.loc[df.A < 2, 'A'] * 2)
A
0 2.0
1 NaN
2 NaN
What we want to do is to modify only the rows we're interested in. This is best done with the in-place modification arithmetic operator *=:
df.loc[df.A < 2, 'A'] *= 2
In your case, it is
data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2
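Applying the same pattern to both of the question's conditions leaves every other row untouched:
# Multiply VOLUME in place, only where TYPE matches each condition.
data.loc[data['TYPE'] == 'var1', 'VOLUME'] *= 2
data.loc[data['TYPE'] == 'var2', 'VOLUME'] *= 4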
You are really close. The problem is in how you are storing the result. This should work:
data.loc[data['TYPE'] == 'var1', ['VOLUME']] = data['VOLUME'] * 2
You can use *= with loc:
In [11]: df = pd.DataFrame([[1], [2]], columns=["A"])
In [12]: df
Out[12]:
A
0 1
1 2
In [13]: df.loc[df.A == 1, "A"] *= 3
In [14]: df
Out[14]:
A
0 3
1 2
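If there are many TYPE values, each with its own multiplier, mapping a factor per type avoids repeating the loc pattern (a sketch with the question's column names; types without an entry keep a factor of 1):
factors = {'var1': 2, 'var2': 4}
data['VOLUME'] = data['VOLUME'] * data['TYPE'].map(factors).fillna(1)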

How to change column order (sort) in pandastic way?

I have a dataframe whose header is a list of 'string-integers':
import pandas as pd
d = {'1': [1, 2], '7': [3, 4], '3': [3, 4], '5': [2, 7]}
df = pd.DataFrame(data=d)
1 3 5 7
0 1 3 2 3
1 2 4 7 4
This code changes the column order (sorts it):
cols = df.columns.tolist()
cols = [int(x) for x in cols]
cols.sort()
cols = [str(x) for x in cols]
df = df[cols]
1 3 5 7
0 1 3 2 3
1 2 4 7 4
I'm not happy with this solution. Of course, I can hide it in a function, but probably a more elegant approach exists.
There are several options depending on what you require.
Option 1
You can use sort_values on the column labels and reindex by them, sorting as strings:
df = df[df.columns.sort_values()]
Note this means that "10" will appear before "2".
Option 2
If you wish to convert the labels to integers and then sort:
df.columns = df.columns.astype(int)
df = df.sort_index(axis=1)
Option 3
If you want to keep the labels as strings, but order the columns by integer value:
df = df[df.columns.astype(int).sort_values().astype(str)]
A pure Python approach is also possible:
df = df[sorted(df, key=int)]
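On pandas 1.1+, sort_index also accepts a key callable, which keeps this a one-liner without changing the dtype of the labels (a sketch, assuming every label parses as an integer):
import pandas as pd

d = {'1': [1, 2], '7': [3, 4], '3': [3, 4], '5': [2, 7]}
df = pd.DataFrame(data=d)
# Sort the columns by the integer value of their string labels.
df = df.sort_index(axis=1, key=lambda idx: idx.astype(int))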

pandas - Apply mean to a specific row in grouped dataframe [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in "NaN" with mean value in each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to #DSM's solution, but avoids the need to define an anonymous lambda function.
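To see why this works, note that the intermediate transform('mean') result is already aligned to the original index, so fillna can consume it directly (shown here on the question's dataframe):
df.groupby('name')['value'].transform('mean')
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0
# 6    3.0
# 7    3.0
# 8    3.0
# Name: value, dtype: float64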
#DSM has IMO the right answer, but I'd like to share my generalization and optimization: multiple columns to group by and multiple value columns:
df = pd.DataFrame(
    {
        'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
        'name': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C'],
        'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
        'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
    }
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could add the selection at the end instead, but then the transformation would run on all columns only to throw away all but one measure column at the end. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to.
Performance test by increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed proportionally to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value'] = df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way:
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured, highly ranked answer only works for a pandas DataFrame with just two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above concerning the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I only performed on a subset since it was running too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up exponentially (which makes sense, as I was computing the median).
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating rows, from a timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity of apply/lambda based solutions made me doubt my instinct). And that is indeed 2.5 times faster than the most upvoted solutions.
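A rough way to reproduce such a comparison on synthetic data (a sketch only; the exact ratio depends on data size and pandas version):
import time
import numpy as np
import pandas as pd

# Hypothetical synthetic data, larger than the question's frame.
n = 200_000
big = pd.DataFrame({
    'name': np.random.choice(list('ABCDEFGHIJ'), n),
    'value': np.where(np.random.rand(n) < 0.2, np.nan, np.random.rand(n)),
})

t = time.time()
big.groupby('name')['value'].apply(lambda x: x.fillna(x.mean()))
print('apply/lambda:', time.time() - t)

t = time.time()
big['value'].fillna(big.groupby('name')['value'].transform('mean'))
print('transform:   ', time.time() - t)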
To fill all the numeric null values with the mean grouped by "name"
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby(['name'], as_index=False).mean(), inplace=True)
You can also call .apply(lambda x: x.fillna(x.mean())) on your dataframe (or grouped dataframe) directly.

Get pandas groupby object to ignore missing dataframes

I'm using pandas to read an Excel file and convert the spreadsheet to a dataframe. Then I apply groupby and store the individual groups in variables using get_group for later computation.
My issue is that the input file isn't always the same size; sometimes the groupby will result in 10 dfs, sometimes 25, etc. How do I get my program to ignore it if a df is missing from the initial data?
df = pd.read_excel(filepath, 0, skiprows=3, parse_cols='A,B,C,E,F,G',
                   names=['Result', 'Trial', 'Well', 'Distance', 'Speed', 'Time'])
df = df.replace({'-': 0}, regex=True) #replaces '-' values with 0
df = df['Trial'].unique()
gb = df.groupby('Trial') #groups by column Trial
trial_1 = gb.get_group('Trial 1')
trial_2 = gb.get_group('Trial 2')
trial_3 = gb.get_group('Trial 3')
trial_4 = gb.get_group('Trial 4')
trial_5 = gb.get_group('Trial 5')
Say my initial data only has 3 trials; how would I get it to ignore trials 4 and 5 later? My code runs when all trials are present but fails when some are missing :( It sounds very much like an if statement would be needed, but my tired brain has no idea where...
Thanks in advance!
After grouping you can get the groups using the .groups attribute; this returns a dict keyed by group name, so you can iterate over the dict keys dynamically and don't need to hard-code the number of groups:
In [22]:
df = pd.DataFrame({'grp':list('aabbbc'), 'val':np.arange(6)})
df
Out[22]:
grp val
0 a 0
1 a 1
2 b 2
3 b 3
4 b 4
5 c 5
In [23]:
gp = df.groupby('grp')
gp.groups
Out[23]:
{'a': Int64Index([0, 1], dtype='int64'),
'b': Int64Index([2, 3, 4], dtype='int64'),
'c': Int64Index([5], dtype='int64')}
In [25]:
for g in gp.groups.keys():
    print(gp.get_group(g))
grp val
0 a 0
1 a 1
grp val
2 b 2
3 b 3
4 b 4
grp val
5 c 5
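Building on that, storing the groups in a dict means trials that are absent from the file are simply never referenced (a sketch, assuming the OP's column and group names):
trials = {name: group for name, group in df.groupby('Trial')}
trial_1 = trials.get('Trial 1')   # None if 'Trial 1' is missing from the data
if trial_1 is not None:
    # ... later computation on trial_1 ...
    pass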

How to drop multiple columns on pandas data frame

Hello guys, I have been trying to drop 2 columns of an Excel dataframe in pandas, using a drop command like this:
energy = energy.drop(energy.columns[[0, 1]], axis=1)
However, I could not get the columns to disappear from view, and I now suspect that the columns I am supposed to delete come in as a multi-level index on my machine. Finally, I tried to drop one of the levels like this:
energy.index = energy.index.droplevel(2)
But I still can't work out how I should get rid of these columns.
I have attached a screenshot of my work.
Instead of dropping the columns, you could subset your data frame like so:
In [3]: mydf = pd.DataFrame({"A":[1,2,3,4],"B":[4,3,2,1], "C":[3,4,5,3],"D":[6,4,3,2]})
In [4]: mydf
Out[4]:
A B C D
0 1 4 3 6
1 2 3 4 4
2 3 2 5 3
3 4 1 3 2
In [5]: mydf[mydf.columns[2:]]
Out[5]:
C D
0 3 6
1 4 4
2 5 3
3 3 2
This will work if you're trying to remove the first 2 columns, for example. It works by creating a list from df.columns, which you then subset and apply to your dataframe. You would then likely want to assign the new dataframe to a variable.
If the columns that you want to drop are nonadjacent, you can loop through a list of columns to drop (a one-call alternative follows the example):
In [7]: mydf1 = mydf.copy()
In [8]: for col in ["A", "D"]:
   ...:     mydf1 = mydf1.drop(col, axis=1)
In [9]: mydf1
Out[9]:
B C
0 4 3
1 3 4
2 2 5
3 1 3
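The loop isn't strictly necessary, since drop also accepts a list of labels (and, on pandas 0.21+, a columns= keyword); a one-call sketch on the same mydf:
mydf1 = mydf.drop(["A", "D"], axis=1)
# or, equivalently, on pandas 0.21+:
mydf1 = mydf.drop(columns=["A", "D"])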
Try simply renaming the columns
Say you have
In: df.columns
Out: MultiIndex(levels=[['BURGLARY', 'GRAND LARCENY', 'GRAND LARCENY OF MOTOR
VEHICLE', 'TMAX', 'TMIN'], ['count', 'mean']],
labels=[[0, 1, 2, 3, 4], [0, 0, 0, 1, 1]])
Then
In: df.columns = ['Burglary', 'Grand Larceny', 'Grand Larceny on Motor Vehicle',
'TMAX', 'TMIN']
And voila
In: df.columns
Out: Index(['Burglary', 'Grand Larceny', 'Grand Larceny on Motor Vehicle',
      'TMAX', 'TMIN'],
     dtype='object')
If you really want to remove columns you can use del:
>>> df = pd.DataFrame({'A':range(3),'B':list('abc'), 'C':range(3,6), 'D':list('gde')})
>>> for x in ['A', 'B']:
...     del df[x]
...
>>> df
C D
0 3 g
1 4 d
2 5 e
This might help
energy.drop(energy.columns[[0, 1]], axis=1, inplace=True)
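If you would rather drop by name than by position, drop also takes errors='ignore' so it won't fail when a column is already gone (a sketch with hypothetical column names):
energy = energy.drop(columns=['Unnamed: 0', 'Unnamed: 1'], errors='ignore')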
