Pandas: filling placeholders in string column - python

I am working with a pandas DataFrame looking as follows:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['There are # people', '3', np.nan],
     ['# out of # people are there', 'Five', 'eight'],
     ['Only # are here', '2', np.nan],
     ['The rest is at home', np.nan, np.nan]])
resulting in:
                             0     1      2
0           There are # people     3    NaN
1  # out of # people are there  Five  eight
2              Only # are here     2    NaN
3          The rest is at home   NaN    NaN
I would like to replace the # placeholders with the varying strings in columns 1 and 2, resulting in:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
How could I achieve this?

Using string format
df = df.replace({'#': '%s', np.nan: 'NaN'}, regex=True)
l = []
for x, y in df.iterrows():
    if y[2] == 'NaN' and y[1] == 'NaN':
        l.append(y[0])
    elif y[2] == 'NaN':
        l.append(y[0] % (y[1]))
    else:
        l.append(y[0] % (y[1], y[2]))
l
Out[339]:
['There are 3 people',
'Five out of eight people are there',
'Only 2 are here',
'The rest is at home']
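If you want the filled strings back on the frame rather than in a plain list, you can attach them as a new column (the 'text' name is just an illustration):
df['text'] = l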

A more concise way to do it. Note that an x != np.NaN check is always True (NaN compares unequal to everything), so pd.notna is used for the missing-value test.
cols = df.columns
df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[1]]), 1) if pd.notna(x[cols[1]]) else x[cols[0]], axis=1)
print(df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[2]]), 1) if pd.notna(x[cols[2]]) else x[cols[0]], axis=1))
Out[12]:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
Name: 0, dtype: object
If you need to do this for even more columns:
cols = df.columns
for i in range(1, len(cols)):
    df[cols[0]] = df.apply(lambda x: x[cols[0]].replace('#', str(x[cols[i]]), 1) if pd.notna(x[cols[i]]) else x[cols[0]], axis=1)
print(df[cols[0]])

A generic replace function in case you have more values to add:
It replaces successive occurrences of a given placeholder character in a string with the values from a list (just two in your case, but it can handle more).
def replace_hastag(text, values, replace_char='#'):
    for v in values:
        if pd.isna(v):
            return text
        else:
            text = text.replace(replace_char, str(v), 1)
    return text

df['text'] = df.apply(lambda r: replace_hastag(r[0], values=[r[1], r[2]]), axis=1)
Result
In [79]: df.text
Out[79]:
0 There are 3 people
1 Five out of eight people are there
2 Only 2 are here
3 The rest is at home
Name: text, dtype: object
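A quick standalone check of the helper, on a made-up string not taken from the question, shows that it consumes values left to right and stops at the first missing one:
replace_hastag('# of # cats ate #', ['Two', 'three', np.nan])
# 'Two of three cats ate #'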

Related

Creating a new dataframe column with the number of overlapping words between dataframe and list

I'm having some trouble fixing the following problem:
I have a dataframe with tokenised text on every row that looks (something) like the following
index feelings
1 [happy, happy, sad]
2 [neutral, sad, mad]
3 [neutral, neutral, happy]
and lists of words lst1=[happy, fantastic], lst2=[mad, sad], lst3=[neutral]. I want to check, for every row in my dataframe, how many occurrences of the words in each list there are. So the output would look something like this:
index feelings occlst1 occlst2 occlst3
1 [happy, happy, sad] 2 1 0
2 [neutral, sad, mad] 0 2 1
3 [neutral, neutral, happy] 1 0 2
So, I want to make a new column and compare the dataframe cells to the list.
Use collections.Counter
Setup:
import pandas as pd
from collections import Counter # Load 'Counter'
df = pd.DataFrame({'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']]})
lst1 = ['happy', 'fantastic']
lst2 = ['mad', 'sad']
lst3 = ['neutral']
# Create an intermediate dict
occ = {'occlst1': lst1, 'occlst2': lst2, 'occlst3': lst3}
Update: as suggested by @mozway
def count_occ(sr):
    return {col: sum(v for k, v in Counter(sr).items() if k in lst)
            for col, lst in occ.items()}

df = pd.concat([df, df['feelings'].apply(count_occ).apply(pd.Series)], axis=1)
Note: I didn't use any columns other than feelings, for readability. However, the concat restores all columns from df.
Output:
>>> df
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2
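For illustration, calling the helper on a single row's list shows the intermediate dict that .apply(pd.Series) then expands into columns:
count_occ(['happy', 'happy', 'sad'])
# {'occlst1': 2, 'occlst2': 1, 'occlst3': 0}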
You could build a reference Series to match each feeling with its list id, then explode + merge + pivot_table:
ref = pd.Series({e: 'occlist_%s' % (i+1) for i,l in enumerate([lst1, lst2, lst3]) for e in l}, name='cols')
## ref:
# happy        occlist_1
# fantastic    occlist_1
# mad          occlist_2
# sad          occlist_2
# neutral      occlist_3
# Name: cols, dtype: object
df.merge((df.explode('feelings')   # lists to single rows
            # create a new column with the list id
            .merge(ref, left_on='feelings', right_index=True)
            # reshape back to 1 row per original index
            .pivot_table(index='index', columns='cols', values='feelings',
                         aggfunc='count', fill_value=0)
          ),
         left_on='index', right_index=True   # merge with the original df
         )
NB: I assumed here that index is a column; if it is an actual index, you need to add a df.reset_index() step.
output:
index feelings occlist_1 occlist_2 occlist_3
0 1 [happy, happy, sad] 2 1 0
1 2 [neutral, sad, mad] 0 2 1
2 3 [neutral, neutral, happy] 1 0 2
input:
df = pd.DataFrame({'index': [1, 2, 3],
'feelings': [['happy', 'happy', 'sad'],
['neutral', 'sad', 'mad'],
['neutral', 'neutral', 'happy']
]})
lst1=['happy', 'fantastic']
lst2=['mad', 'sad']
lst3=['neutral']
You can also use:
my_lists = [lst1, lst2, lst3]
occ = pd.DataFrame.from_records(
    df['feelings'].apply(lambda x: [pd.Series(x).isin(l).sum() for l in my_lists]).values,
    columns=['occlst1', 'occlst2', 'occlst3'])
df_occ = df.join(occ)
Output:
feelings occlst1 occlst2 occlst3
0 [happy, happy, sad] 2 1 0
1 [neutral, sad, mad] 0 2 1
2 [neutral, neutral, happy] 1 0 2

Create new dataframe column from the values of 2 other columns

I have 2 columns in my data frame. In any given row, at least one of the columns holds a string; the other column may hold None or another string.
I want to create a third column that takes the string value when one of the columns is None, and the concatenation of the two when both are strings.
How can I do this?
column1 column2 column3
0 hello None hello
1 None goodbye goodbye
2 hello goodbye hello, goodbye
Series.str.cat
Use na_rep='' so joins with missing values do not result in NaN for the entire row. Then strip any excess separators that were joined due to missing data (assuming separator characters also don't start or end any of your words).
import pandas as pd
df = pd.DataFrame({'column1': ['hello', None, 'hello'],
'column2': [None, 'goodbye', 'goodbye']})
sep = ', '
df['column3'] = (df['column1'].str.cat(df['column2'], sep=sep, na_rep='')
.str.strip(sep))
print(df)
column1 column2 column3
0 hello None hello
1 None goodbye goodbye
2 hello goodbye hello, goodbye
With many columns, where there might be streaks of missing data in the middle, the above doesn't work to remove the excess separators. Instead you could use a slow lambda along the rows. We join all values after dropping the nulls:
df['column3'] = df.apply(lambda row: ', '.join(row.dropna()), axis=1)
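As a hedged sketch of that many-column case (the column names and values here are made up for illustration), a gap in the middle of a row would leave a stray separator with str.cat + strip, while the row-wise dropna join handles it:
df3 = pd.DataFrame({'column1': ['hello', None, 'hello'],
                    'column2': [None, 'goodbye', None],
                    'column3': ['world', 'world', 'world']})
# str.cat with na_rep='' would give 'hello, , world' for row 0; dropping the
# nulls per row before joining gives the expected result:
df3['joined'] = df3.apply(lambda row: ', '.join(row.dropna()), axis=1)
print(df3['joined'])
# 0      hello, world
# 1    goodbye, world
# 2      hello, world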
Solution
You could replace all the NaNs with an empty string and then concatenate the columns (A and B of the dummy data below) to create column C.
df2 = df.fillna('')
df['C'] = df2.A.str.strip() + df2.B.str.strip()  # del df2 afterwards if you like
print(df)
Output:
A B C=A+B
0 1 3 13
1 2 None 2
2 dog dog dogdog
3 None None
4 snake 20 snake20
5 cat None cat
Dummy Data
d = {
'A': ['1', '2', 'dog', None, 'snake', 'cat'],
'B': ['3', None, 'dog', None, '20', None]
}
df = pd.DataFrame(d)
print(df)
Output:
A B
0 1 3
1 2 None
2 dog dog
3 None None
4 snake 20
5 cat None

Renaming multiple pandas columns with regular expressions

I am trying to tidy up a csv I was given where the columns are not very developer-friendly right now. I would like to use regular expressions to find multiple patterns in the column names and apply multiple replacements. For example, given df1 with leading/trailing spaces, whitespace throughout the headers, parentheses (), and <, I would like to remove the leading/trailing spaces and parentheses, replace the whitespace with _, and replace the < with LESS_THAN.
For example, turning df1 into df2:
df1 = pd.DataFrame({' APPLES AND LEMONS': [1,2], ' ORANGES ([POUNDS]) ': [2,1], ' BANANAS < 5 ': [8,9]})
APPLES AND LEMONS ORANGES (POUNDS) BANANAS < 5
0 1 2 8
1 2 1 9
df2 = pd.DataFrame({'APPLES_AND_LEMONS': [1,2], 'ORANGES_POUNDS': [2,1], 'BANANAS_LESS_THAN_5': [8,9]})
APPLES_AND_LEMONS ORANGES_POUNDS BANANAS_LESS_THAN_5
0 1 2 8
1 2 1 9
My current implementation is by just chaining a bunch of str.replaces. Is there a better way to do this? I was thinking that regular expressions could be especially useful because there are hundreds of columns and I'm sure that there will be a few more headaches that I have yet to find.
df1.columns = df1.columns.str.strip()
df1.columns = df1.columns.str.replace(' ', '_').str.replace('<', 'LESS_THAN').str.replace('(', '').str.replace(')', '')
Thanks to the link Alollz gave me, I was able to get a solution that is much easier to maintain than continuously chaining str.replace:
def clean_column_names(df):
    df.columns = df.columns.str.strip()
    replace_dict = {' ': '_', '<': 'LESS_THAN', '(': '', ')': ''}
    for i, j in replace_dict.items():
        new_columns = [column.replace(i, j) for column in df.columns]
        df.columns = new_columns
    return df
clean_column_names(df1)
APPLES_AND_LEMONS ORANGES_POUNDS BANANAS_LESS_THAN_5
0 1 2 8
1 2 1 9
Not sure if this is better for you.
import re

old_cols = list(df1.columns.values)
remove = re.compile(r'^\s+|\s+$|[\(\)\[\]]')
wspace = re.compile(r'\s+')
less = re.compile(r'<')
great = re.compile(r'>')
new_cols = []
for i in old_cols:
    i = re.sub(remove, "", i)
    i = re.sub(wspace, "_", i)
    i = re.sub(less, "LESS_THAN", i)
    i = re.sub(great, "GREATER_THAN", i)
    new_cols.append(i)
df1.columns = new_cols
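If you prefer to stay on the pandas Index, a minimal sketch of the same idea applied to the original df1 from the question, driving regex-enabled str.replace calls from one ordered rule list (the rules and their order here are just an illustration):
rules = [
    (r'^\s+|\s+$', ''),   # trim leading/trailing whitespace
    (r'[()\[\]]', ''),    # drop parentheses and square brackets
    (r'<', 'LESS_THAN'),
    (r'>', 'GREATER_THAN'),
    (r'\s+', '_'),        # remaining whitespace -> underscore
]
cols = df1.columns
for pattern, repl in rules:
    cols = cols.str.replace(pattern, repl, regex=True)
df1.columns = cols
print(df1.columns.tolist())
# ['APPLES_AND_LEMONS', 'ORANGES_POUNDS', 'BANANAS_LESS_THAN_5']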

Pandas: Create a column using first 2 letters from a text column

How can I create a column using the first 2 letters from other columns, but not including NaN? E.g. I have 3 columns:
a=pd.Series(['Eyes', 'Ear', 'Hair', 'Skin'])
b=pd.Series(['Hair', 'Liver', 'Eyes', 'NaN'])
c=pd.Series(['NaN', 'Skin', 'NaN', 'NaN'])
df=pd.concat([a, b, c], axis=1)
df.columns=['First', 'Second', 'Third']
Now I want to create a fourth column that combines the first 2 letters from 'First', 'Second' and 'Third' after sorting (so that Ear comes before Hair irrespective of the column), skipping NaN values.
The final output for the fourth column would look something like:
Fourth = pd.Series(['EyHa', 'EaLiSk', 'EyHa', 'Sk'])
If NaN is np.nan (a real missing value):
a=pd.Series(['Eyes', 'Ear', 'Hair', 'Skin'])
b=pd.Series(['Hair', 'Liver', 'Eyes', np.nan])
c=pd.Series([np.nan, 'Skin', np.nan, np.nan])
df=pd.concat([a, b, c], axis=1)
df.columns=['First', 'Second', 'Third']
df['new'] = df.apply(lambda x: ''.join(sorted([y[:2] for y in x if pd.notnull(y)])), axis=1)
Another solution:
df['new'] = [''.join([y[:2] for y in x]) for x in np.sort(df.fillna('').values, axis=1)]
#alternative
#df['new'] = [''.join(sorted([y[:2] for y in x if pd.notnull(y)])) for x in df.values]
print (df)
First Second Third new
0 Eyes Hair NaN EyHa
1 Ear Liver Skin EaLiSk
2 Hair Eyes NaN EyHa
3 Skin NaN NaN Sk
If NaN is a string:
df['new'] = df.apply(lambda x: ''.join(sorted([y[:2] for y in x if y != 'NaN'])), axis=1)
df['new'] = [''.join(sorted([y[:2] for y in x if y != 'NaN'])) for x in df.values]

pandas - Apply mean to a specific row in grouped dataframe [duplicate]

This should be straightforward, but the closest thing I've found is this post:
pandas: Filling missing values within a group, and I still can't solve my problem....
Suppose I have the following dataframe
df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3], 'name': ['A','A', 'B','B','B','B', 'C','C','C']})
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
and I'd like to fill in the NaN values with the mean value of each "name" group, i.e.
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
I'm not sure where to go after:
grouped = df.groupby('name').mean()
Thanks a bunch.
One way would be to use transform:
>>> df
name value
0 A 1
1 A NaN
2 B NaN
3 B 2
4 B 3
5 B 1
6 C 3
7 C NaN
8 C 3
>>> df["value"] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
>>> df
name value
0 A 1
1 A 1
2 B 2
3 B 2
4 B 3
5 B 1
6 C 3
7 C 3
8 C 3
fillna + groupby + transform + mean
This seems intuitive:
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
The groupby + transform syntax maps the groupwise mean to the index of the original dataframe. This is roughly equivalent to @DSM's solution, but avoids the need to define an anonymous lambda function.
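To see why this works: the intermediate result is a Series of group means aligned to the original index (shown here for the question's df), so it can be passed straight to fillna:
df.groupby('name')['value'].transform('mean')
# 0    1.0
# 1    1.0
# 2    2.0
# 3    2.0
# 4    2.0
# 5    2.0
# 6    3.0
# 7    3.0
# 8    3.0
# Name: value, dtype: float64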
@DSM has IMO the right answer, but I'd like to share my generalization and optimization of the question: multiple group-by columns and multiple value columns:
df = pd.DataFrame(
{
'category': ['X', 'X', 'X', 'X', 'X', 'X', 'Y', 'Y', 'Y'],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],
'other_value': [10, np.nan, np.nan, 20, 30, 10, 30, np.nan, 30],
'value': [1, np.nan, np.nan, 2, 3, 1, 3, np.nan, 3],
}
)
... gives ...
category name other_value value
0 X A 10.0 1.0
1 X A NaN NaN
2 X B NaN NaN
3 X B 20.0 2.0
4 X B 30.0 3.0
5 X B 10.0 1.0
6 Y C 30.0 3.0
7 Y C NaN NaN
8 Y C 30.0 3.0
In this generalized case we would like to group by category and name, and impute only on value.
This can be solved as follows:
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
Notice the column list in the group-by clause, and that we select the value column right after the group-by. This makes the transformation run only on that particular column. You could select the column at the end instead, but then you would run the transform for all columns only to throw away all but one measure column afterwards. A standard SQL query planner might have been able to optimize this, but pandas (0.19.2) doesn't seem to do it.
Performance test, increasing the dataset by doing ...
big_df = None
for _ in range(10000):
    if big_df is None:
        big_df = df.copy()
    else:
        big_df = pd.concat([big_df, df])
df = big_df
... confirms that this increases the speed in proportion to how many columns you don't have to impute:
import pandas as pd
from datetime import datetime
def generate_data():
    ...
t = datetime.now()
df = generate_data()
df['value'] = df.groupby(['category', 'name'])['value']\
.transform(lambda x: x.fillna(x.mean()))
print(datetime.now()-t)
# 0:00:00.016012
t = datetime.now()
df = generate_data()
df["value"] = df.groupby(['category', 'name'])\
.transform(lambda x: x.fillna(x.mean()))['value']
print(datetime.now()-t)
# 0:00:00.030022
On a final note, you can generalize even further if you want to impute more than one column, but not all:
df[['value', 'other_value']] = df.groupby(['category', 'name'])[['value', 'other_value']]\
    .transform(lambda x: x.fillna(x.mean()))
Shortcut:
Groupby + Apply + Lambda + Fillna + Mean
>>> df['value1']=df.groupby('name')['value'].apply(lambda x:x.fillna(x.mean()))
>>> df.isnull().sum().sum()
0
This solution still works if you want to group by multiple columns to replace missing values.
>>> df = pd.DataFrame({'value': [1, np.nan, np.nan, 2, 3, np.nan,np.nan, 4, 3],
'name': ['A','A', 'B','B','B','B', 'C','C','C'],'class':list('ppqqrrsss')})
>>> df['value']=df.groupby(['name','class'])['value'].apply(lambda x:x.fillna(x.mean()))
>>> df
value name class
0 1.0 A p
1 1.0 A p
2 2.0 B q
3 2.0 B q
4 3.0 B r
5 3.0 B r
6 3.5 C s
7 4.0 C s
8 3.0 C s
I'd do it this way:
df.loc[df.value.isnull(), 'value'] = df.groupby('name').value.transform('mean')
The featured, highly ranked answer only works for a pandas DataFrame with just two columns. If you have more columns, use instead:
df['Crude_Birth_rate'] = df.groupby("continent").Crude_Birth_rate.transform(
lambda x: x.fillna(x.mean()))
To summarize the above concerning the efficiency of the possible solutions:
I have a dataset with 97,906 rows and 48 columns.
I want to fill 4 columns with the median of each group.
The column I want to group by has 26,200 groups.
The first solution
start = time.time()
x = df_merged[continuous_variables].fillna(df_merged.groupby('domain_userid')[continuous_variables].transform('median'))
print(time.time() - start)
0.10429811477661133 seconds
The second solution
start = time.time()
for col in continuous_variables:
    df_merged.loc[df_merged[col].isnull(), col] = df_merged.groupby('domain_userid')[col].transform('median')
print(time.time() - start)
0.5098445415496826 seconds
The next solution I ran only on a subset (the first 10,000 rows), since it was taking too long.
start = time.time()
for col in continuous_variables:
    x = df_merged.head(10000).groupby('domain_userid')[col].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
11.685635566711426 seconds
The following solution follows the same logic as above.
start = time.time()
x = df_merged.head(10000).groupby('domain_userid')[continuous_variables].transform(lambda x: x.fillna(x.median()))
print(time.time() - start)
42.630549907684326 seconds
So it's quite important to choose the right method.
Bear in mind that I noticed that once a column was not numeric, the times went up sharply (which makes sense, as I was computing the median).
def groupMeanValue(group):
    group['value'] = group['value'].fillna(group['value'].mean())
    return group

dft = df.groupby("name").transform(groupMeanValue)
I know this is an old question, but I am quite surprised by the unanimity of apply/lambda answers here.
Generally speaking, that is the second worst thing to do after iterating over rows, from a timing point of view.
What I would do here is
df.loc[df['value'].isna(), 'value'] = df.groupby('name')['value'].transform('mean')
Or using fillna
df['value'] = df['value'].fillna(df.groupby('name')['value'].transform('mean'))
I've checked with timeit (because, again, the unanimity of apply/lambda based solutions made me doubt my instinct), and this is indeed about 2.5 times faster than the most upvoted solutions.
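For reference, a minimal timing sketch along those lines (numbers will vary with data size, hardware, and pandas version):
import timeit
t_transform = timeit.timeit(
    lambda: df['value'].fillna(df.groupby('name')['value'].transform('mean')),
    number=1000)
t_apply = timeit.timeit(
    lambda: df.groupby('name')['value'].apply(lambda x: x.fillna(x.mean())),
    number=1000)
print(t_transform, t_apply)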
To fill all the numeric null values with the mean grouped by "name":
num_cols = df.select_dtypes(exclude='object').columns
df[num_cols] = df.groupby("name").transform(lambda x: x.fillna(x.mean()))
df.fillna(df.groupby('name').transform('mean'), inplace=True)
You can also use df.apply(lambda x: x.fillna(x.mean())), but note that this fills with each column's overall mean rather than the per-group mean.
