I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I get back an exact copy of the previous results: nothing is replaced.
I would ideally like to get some output which resembles the following loop element-wise.
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is that using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means "select rows where the index label is 'female'", of which there may not be any in your DataFrame.
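A minimal sketch of what that second indexer actually does, using a hypothetical string index:

import pandas as pd

w = pd.DataFrame({'female': ['female', 'male']}, index=['a', 'b'])

w['female']['a']          # 'female' -- the second [] looks up an *index label*
# w['female']['female']   # KeyError: no row is labeled 'female'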
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [0, 1], inplace=True)
This should also work, though note that it relies on chained assignment, which raises SettingWithCopyWarning and may fail silently in recent pandas:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This very compact variant relies on the same chained assignment, so the same caveat applies:
w['female'][w['female'] == 'female'] = 1
w['female'][w['female'] == 'male'] = 0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
Note that the order of these two lines matters: 'male' is a substring of 'female', so the 'female' values must be replaced first (or the patterns anchored, e.g. regex='^male$').
You can also use apply with .get, i.e.:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should be used only when every possible value in the column is defined in the dictionary; otherwise, values not in the dictionary are mapped to None.
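A quick illustration of that caveat with a hypothetical extra value:

import pandas as pd

s = pd.Series(['female', 'male', 'other'])
print(s.apply({'male': 0, 'female': 1}.get))
# 0       1
# 1       0
# 2    None   <- 'other' is not in the dictionary, so dict.get returns None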
Using Series.map with Series.fillna
If your column contains more strings than just female and male, Series.map alone falls short, since it returns NaN for the other values.
That's why we have to chain it with fillna.
Example of why .map alone fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'], drop_first=True)
This gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The new column is automatically named as the string that you replaced.
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables as are needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column; instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first=True)], axis=1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
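For illustration, a sketch of the three-value case with hypothetical data (column names come from whatever values occur):

import pandas as pd

w = pd.DataFrame({'female': ['male', 'female', 'neutral']})
print(pd.get_dummies(w['female'], drop_first=True))
# Columns are 'male' and 'neutral'; the alphabetically first category,
# 'female', is dropped and encoded implicitly as (0, 0).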
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code replaces 'female' with 1 and 'male' with 0, but only in the column 'female'.
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
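A minimal sketch of factorize (note it numbers labels in order of first appearance, which may not be the coding you want):

import pandas as pd

codes, uniques = pd.factorize(pd.Series(['male', 'female', 'male']))
print(codes)    # [0 1 0]
print(uniques)  # Index(['male', 'female'], dtype='object')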
w.female = np.where(w.female=='female', 1, 0)
This is for anyone looking for a NumPy solution. It is useful for replacing values based on a condition; both the if and else branches are inherent in np.where(). The solutions that use df.replace() may not be feasible if the column includes many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession; they must be chained because neither of them implements an else condition on its own.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace accepts a dictionary mapping old values to new ones, so you can perform whatever substitutions you need.
It's worth pointing out what type of object you get back from each of the methods suggested above: a Series or a DataFrame.
Selecting a column with w.female or w['female'] returns a Series, while double brackets (w[['female']]) return a one-column DataFrame.
Both Series and DataFrame have a .replace method, but .map is defined only on Series, so which of the methods above are available depends on how you selected the data.
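A quick check of the types involved:

import pandas as pd

w = pd.DataFrame({'female': ['female', 'male']})
print(type(w['female']))    # <class 'pandas.core.series.Series'>
print(type(w.female))       # <class 'pandas.core.series.Series'>
print(type(w[['female']]))  # <class 'pandas.core.frame.DataFrame'>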
To answer the question more generically so it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that help feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:

    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct values in a column. If no dictionary is provided for the exact name changes,
        it will default to <column_name>_<count>. Ex. female_1, female_2, etc.

        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
            Ex. {1234: "User A"} would change all occurrences of 1234 to the string "User A"
            and leave the other values as they were. By default, this is an empty dictionary.
        :return: The same column with the replaced values.
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column

    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to
        replace it. The returned dictionary can then be passed to rename_values_in_column to rename all
        the distinct values in a column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, enter the values you would like to keep in except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_3"}

        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 0, "male":1}
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes, you can use the equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64
Related
I'm trying to agg() a df while subsetting on one of the columns:
indi = pd.DataFrame({"PONDERA":[1,2,3,4], "ESTADO": [1,1,2,2]})
empleo = indi.agg(ocupados = (indi.PONDERA[indi["ESTADO"]==1], sum) )
but I'm getting the error: 'Series' objects are mutable, thus they cannot be hashed
I want to sum the values of "PONDERA" only when "ESTADO" == 1.
Expected output:
ocupados
0 3
I'm trying to imitate the R function summarise(), so I want to do it in one step and aggregate some other columns too.
In R would be something like:
empleo <- indi %>%
summarise(poblacion = sum(PONDERA),
ocupados = sum(PONDERA[ESTADO == 1]))
Is this even the correct approach?
Thank you all in advance.
Generally, agg takes a function (or the name of one) as its argument, not a Series itself. In your case, though, it's more beneficial to separate the filtering and the summation.
One of the options would be the following:
empleo = indi.query("ESTADO == 1")[["PONDERA"]].sum()
(Use single square brackets to output a single number instead of a pd.Series.)
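To make the bracket distinction concrete:

import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

print(indi.query("ESTADO == 1")["PONDERA"].sum())    # 3, a plain number
print(indi.query("ESTADO == 1")[["PONDERA"]].sum())  # a one-entry Series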
Another option would be to use loc to filter the dataframe to rows where ESTADO equals 1, and sum the values of the column PONDERA:
indi.loc[indi.ESTADO==1, ['PONDERA']].sum()
Thanks to @Henry's input.
A bit fancy, but the output is exactly the format you want, and the syntax is similar to what you tried:
Use DataFrameGroupBy.agg() instead of DataFrame.agg():
empleo = (indi.loc[indi['ESTADO']==1]
              .groupby('ESTADO')
              .agg(ocupados=('PONDERA', sum))
              .reset_index(drop=True))
Result:
print(empleo) gives:
ocupados
0 3
Here are two different ways you can get the scalar value 3.
option1 = indi.loc[indi['ESTADO'].eq(1),'PONDERA'].sum()
option2 = indi['PONDERA'].where(indi['ESTADO'].eq(1)).sum()
However, your expected output shows this value in a dataframe. To do this, you can create a new dataframe with the desired column name "ocupados".
outputdf = pd.DataFrame({'ocupados':[option1]})
Based on the comment you provided, is this what you are looking for?
(indi.agg(poblacion=("PONDERA", 'sum'),
          ocupados=('PONDERA', lambda x: x.where(indi['ESTADO'].eq(1)).sum())))
Apologies if something similar has been asked before, I searched around but couldn't figure out a solution.
My dataset looks like such
data1 = {'Group':['Winner','Winner','Winner','Winner','Loser','Loser'],
'Study': ['Read','Read','Notes','Cheat','Read','Read'],
'Score': [1,.90,.80,.70,1,.90]}
df1 = pd.DataFrame(data=data1)
This dataframe spans dozens of rows and has a set of numeric columns and a set of string columns.
I would like to condense this into one row, where each entry is the mean or the mode of its column: if the column is numeric, take the mean; otherwise, take the mode. In my actual use case, the order of numeric and object columns is random, so I hope to use a loop that checks each column to decide which action to take.
I tried this, but it didn't work; it seems to be taking the entire Series as the mode.
for i in df1:
    if df1[i].dtype == 'float64':
        df1[i] = df1[i].mean()
Any help is appreciated, thank you!
You can use describe with include='all', which calculates statistics depending on the dtype: it determines the top (mode) for object columns and the mean for numeric columns. Then combine the two rows.
s = df1.describe(include='all')
s = s.loc['top'].combine_first(s.loc['mean'])
#Group Winner
#Study Read
#Score 0.883333
#Name: top, dtype: object
np.number and select_dtypes (using pd.concat, since Series.append was removed in pandas 2.0):
s = df1.select_dtypes(np.number).mean()
pd.concat([df1.drop(s.index, axis=1).mode().iloc[0], s])
Group Winner
Study Read
Score 0.883333
dtype: object
Variant
g = df1.dtypes.map(lambda x: np.issubdtype(x, np.number))
d = {k: d for k, d in df1.groupby(g, axis=1)}
pd.concat([d[False].mode().iloc[0], d[True].mean()])
Group Winner
Study Read
Score 0.883333
dtype: object
Here is a slight variation on your solution that gets the job done:
res = {}
for col_name, col_type in zip(df1.columns, df1.dtypes):
    if pd.api.types.is_numeric_dtype(col_type):
        res[col_name] = df1[col_name].mean()
    else:
        res[col_name] = df1[col_name].mode()[0]

pd.DataFrame(res, index=[0])
returns
Group Study Score
0 Winner Read 0.883333
Note that there could be multiple modes in a Series; this solution picks the first one.
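For instance:

import pandas as pd

s = pd.Series(['Read', 'Read', 'Notes', 'Notes'])
print(s.mode())     # a two-row Series: both values tie
# 0    Notes
# 1     Read
print(s.mode()[0])  # 'Notes' -- only the first mode is kept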
I have a bunch of dataframes with one categorical column defining Sex (M/F). I want to assign the integer 1 to Male and 2 to Female. I have the following code, which category-codes them to 0 and 1 instead:
df4["Sex"] = df4["Sex"].astype('category')
df4.dtypes
df4["Sex_cat"] = df4["Sex"].cat.codes
df4.head()
But I need specifically for M to be 1 and F to be 2. Is there a simple way to assign specific integers to categories?
IIUC:
df4['Sex'] = df4['Sex'].map({'M':1,'F':2})
And now:
print(df4)
would give the desired result.
If you need to impose a specific ordering, you can use pd.Categorical:
c = pd.Categorical(df["Sex"], categories=['M','F'], ordered=True)
This ensures "M" is given the smallest value, "F" the next, and so on. You can then just access codes and add 1.
df['Sex_cat'] = c.codes + 1
It is better to use pd.Categorical than astype('category') if you want finer control over what categories are assigned what codes.
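That said, if you prefer the astype route, an explicit CategoricalDtype gives the same control. A sketch with hypothetical sample data:

import pandas as pd

df4 = pd.DataFrame({'Sex': ['M', 'F', 'F', 'M']})  # hypothetical sample
sex_dtype = pd.CategoricalDtype(categories=['M', 'F'], ordered=True)
df4['Sex_cat'] = df4['Sex'].astype(sex_dtype).cat.codes + 1  # M -> 1, F -> 2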
You can also use a lambda with apply (note this maps anything that isn't 'M', including missing values, to 2):
df4['Sex'] = df4['Sex'].apply(lambda x: 1 if x == 'M' else 2)
This is data collected from a survey where there was a radio button to select from 1 of 5 choices. What is stored in the column is a simple 1 as a flag to say it was selected.
I want to end up with a single column with the column headers as the values. Someone suggested using idxmax on my dataframe, but when I looked at the docs I couldn't really figure out how to apply it. It does look like it would be useful for this, though...
I have a dataframe:
old = pd.DataFrame({'a FINSEC_SA' : [1,'NaN','NaN','NaN','NaN',1,'NaN'],
'b FINSEC_A' : ['NaN',1,'NaN','NaN','NaN','NaN','NaN'],
'c FINSEC_NO' : ['NaN','NaN',1,'NaN','NaN','NaN','NaN'],
'd FINSEC_D' : ['NaN','NaN','NaN',1,'NaN','NaN',1],
'e FINSEC_SD' : ['NaN','NaN','NaN','NaN',1,'NaN','NaN']})
I would like to end up with a dataframe like this:
new = pd.DataFrame({'Financial Security':['a FINSEC_SA','b FINSEC_A',
'c FINSEC_NO','d FINSEC_D','e FINSEC_SD','a FINSEC_SA','d FINSEC_D']})
I only have about 65k rows of data so performance is not top of list for me. I am most interested in learning a good way to do this - that is hopefully fairly simple. It would be really nice if the idxmax does this fairly easily.
idxmax can only work with numerics. So first, we need to convert 'NaN' (a string) to np.nan (a numeric value). Then we can convert each column into a numerical series:
old = old.replace('NaN', np.nan)
old = old.apply(pd.to_numeric)
Alternatively, you can do this in one line with:
old = old.apply(pd.to_numeric, errors='coerce')
Finally, we can run idxmax. All you have to do is specify the axis: axis=1 returns, for each row, the label of the column holding the 1 (the highest value); axis=0 does the same per column.
new = old.idxmax(axis=1)
You can run the code in one line (if you don't need a copy of old after this):
new = old.apply(pd.to_numeric, errors='coerce').idxmax(axis=1)
You can directly use idxmax followed by reset_index to achieve this.
df = old.idxmax(axis=1).reset_index().drop('index', axis=1).rename(columns={0:'Financial'})
print(df)
Financial
0 a FINSEC_SA
1 b FINSEC_A
2 c FINSEC_NO
3 d FINSEC_D
4 e FINSEC_SD
5 a FINSEC_SA
6 d FINSEC_D
Explanation:
1. idxmax(axis=1) selects, for each row, the label of the column with the maximum value.
2. reset_index turns the resulting Series into a DataFrame, and drop removes the leftover 'index' column.
3. Finally, rename gives the column the required name.
In the code below, I created a function to check for NaN separately, as I think with real data you will have np.nan rather than the string 'NaN'. You can modify the check accordingly.
def isNaN(num):
    return num == 'NaN'

def getval(x):
    if not isNaN(x['a FINSEC_SA']): return 'a FINSEC_SA'
    if not isNaN(x['b FINSEC_A']): return 'b FINSEC_A'
    if not isNaN(x['c FINSEC_NO']): return 'c FINSEC_NO'
    if not isNaN(x['d FINSEC_D']): return 'd FINSEC_D'
    if not isNaN(x['e FINSEC_SD']): return 'e FINSEC_SD'

old.apply(getval, axis=1)
This is a readable but not an efficient answer. The melt functionality can be used to get the same answer in a much more efficient manner:
old['id'] = old.index
new = pd.melt(old, id_vars= 'id', var_name = 'Financial')
new = new[new['value'] != 'NaN'].drop('value', axis=1).sort_index(axis=0)
Apologies for the messy title; the problem is as follows:
I have some data frame of the form:
df1 =
Entries
0 "A Level"
1 "GCSE"
2 "BSC"
I also have a data frame of the form:
df2 =
Secondary Undergrad
0 "A Level" "BSC"
1 "GCSE" "BA"
2 "AS Level" "MSc"
I have a function which searches each entry in df1, looking for the words in each column of df2. The words that match, are saved (Words_Present):
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    return Attribute
I apply this function over all entries in df1, and all columns in df2, using the following iteration:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
This yields an output df which looks something like:
df1 =
Entries Secondary undergrad
0 "A Level" 1 0
1 "GCSE" 1 0
2 "AS Level" 1 0
I want to amend the word_search function to output the Words_Present list as well as the Attribute, and put these into new columns, so that my eventual df1 looks like:
Desired dataframe:
Entries Secondary Words Found undergrad Words Found
0 "A Level" 1 "A Level" 0
1 "GCSE" 1 "GCSE" 0
2 "AS Level" 1 "AS Level" 0
If I do:
def word_search(df, group, words):
    Yes, No = 0, 0
    Words_Present = []
    for i in words:
        match_object = re.search(i, df)
        if match_object:
            Words_Present.append(i)
            Yes = 1
        else:
            No = 0
    if Yes == 1:
        Attribute = 1
    if Yes == 0:
        Attribute = 0
    return Attribute, Words_Present
My function therefore now has multiple outputs. So applying the following:
for i in df2:
    terms = df2[i].values.tolist()
    df1[i] = df1['course'][0:1].apply(lambda x: word_search(x, i, terms))
My Output Looks like this:
Entries Secondary undergrad
0 "A Level" [1,"A Level"] 0
1 "GCSE" [1, "GCSE"] 0
2 "AS Level" [1, "AS Level"] 0
The output of pd.apply() is always a pandas series, so it just shoves everything into the single cell of df[i] where i = secondary.
Is it possible to split the output of .apply into two separate columns, as shown in the desired dataframe?
I have consulted many questions, but none seem to deal directly with yielding multiple columns when the function contained within the apply statement has multiple outputs:
Applying function with multiple arguments to create a new pandas column
Create multiple columns in Pandas Dataframe from one function
Apply pandas function to column to create multiple new columns?
For example, I have also tried:
for i in df2:
    terms = df2[i].values.tolist()
    [df1[i], df1[i] + "Present"] = pd.concat([df1['course'][0:1].apply(lambda x: word_search(x, i, terms))])
but this simply yields errors such as:
raise ValueError('Length of values does not match length of ' 'index')
Is there a way to use apply, but still extract the extra information directly into multiple columns?
Many thanks, apologies for the length.
The direct answer to your question is yes: use the apply method of the DataFrame object, so you'd be doing df1.apply().
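For instance, a minimal sketch of that direct route, assuming the two-output word_search from your edit and the 'Entries' column: wrapping the returned tuple in pd.Series makes apply produce one column per element.

for i in df2:
    terms = df2[i].values.tolist()
    cols = [i, i + ' Words Found']
    # Each tuple (Attribute, Words_Present) expands into the two named columns
    df1[cols] = df1['Entries'].apply(
        lambda x: pd.Series(word_search(x, i, terms), index=cols))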
However, for this problem, and anything in pandas in general, try to vectorise rather than iterate through rows -- it's faster and cleaner.
It looks like you are trying to classify Entries into Secondary or Undergrad, and save the keyword used to make the match. If you assume that each element of Entries has no more than one keyword match (i.e. you won't run into 'GCSE A Level'), you can do the following:
df = df1.copy()
df['secondary_words_found'] = df.Entries.str.extract('(A Level|GCSE|AS Level)')
df['undergrad_words_found'] = df.Entries.str.extract('(BSC|BA|MSc)')
df['secondary'] = df.secondary_words_found.notnull() * 1
df['undergrad'] = df.undergrad_words_found.notnull() * 1
EDIT:
In response to your issue with having many more categories and keywords, you can continue in the spirit of this solution by using an appropriate for loop and doing '(' + '|'.join(df2['Undergrad'].values) + ')' inside the extract method.
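That suggestion might look like this sketch, continuing from the df and df2 above (re.escape guards any regex metacharacters in the keywords):

import re

for col in df2.columns:  # 'Secondary', 'Undergrad'
    pattern = '(' + '|'.join(re.escape(w) for w in df2[col]) + ')'
    found_col = col.lower() + '_words_found'
    df[found_col] = df.Entries.str.extract(pattern, expand=False)
    df[col.lower()] = df[found_col].notnull() * 1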
However, if you have exact matches, you can do everything by a combination of pivots and joins:
keywords = df2.stack().to_frame('Entries').reset_index().drop('level_0', axis = 1).rename(columns={'level_1':'category'})
df = df1.merge(keywords, how = 'left')
for colname in df['category'].dropna().unique():
    df[colname] = (df.category == colname) * 1  # Your indicator variable
    df.loc[df.category == colname, colname + '_words_found'] = df.loc[df.category == colname, 'Entries']
The first line 'pivots' your table of keywords into a 2-column dataframe of keywords and categories. Your keyword column must be the same as the column in df1; in SQL, this would be called the foreign key that you are going to join these tables on.
Also, you generally want to avoid having duplicate indexes or columns, which in your case, was Words Found in the desired dataframe!
For the sake of completeness, if you insisted on using the apply method, you would iterate over each row of the DataFrame; your code would look something like this:
secondary_words = df2.Secondary.values
undergrad_words = df2.Undergrad.values

def classify_row(s):
    if s.Entries in secondary_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 1, 'secondary_words_found': s.Entries,
                          'Undergrad': 0, 'undergrad_words_found': ''})
    elif s.Entries in undergrad_words:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '',
                          'Undergrad': 1, 'undergrad_words_found': s.Entries})
    else:
        return pd.Series({'Entries': s.Entries, 'Secondary': 0, 'secondary_words_found': '',
                          'Undergrad': 0, 'undergrad_words_found': ''})
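You would then presumably apply it row-wise:

result = df1.apply(classify_row, axis=1)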
This second version will only work in the cases you want it to if the element in Entries is exactly the same as its corresponding element in df2. You certainly don't want to do this, as it's messier, and will be noticeably slower if you have a lot of data to work with.