Replace Multiple Values of columns

Replace Multiple Values of columns - python

I have a data frame with different columns name (asset.new, few, value.issue, etc). And I want to change some characters or symbols in the name of columns. I can do it in this form:
df.columns = df.columns.str.replace('.', '_')
df.columns = df.columns.str.replace('few', 'LOW')
df.columns = df.columns.str.replace('value', 'PRICE')
....
But I think it should have a better and shorter way.

You can create a dictionary with the actual character as a key and the replacement as a value and then you iterate through your dictionary:
df = pd.DataFrame({'asset.new':[1,2,3],
'few':[4,5,6],
'value.issue':[7,8,9]})
replaceDict = { '.':'_', 'few':'LOW', 'value':'PRICE'}
for k, v in replaceDict.items():
df.columns = [c.replace(k, v) for c in list(df.columns)]
print(df)
output:
asset_new LOW PRICE_issue
1 4 7
2 5 8
3 6 9
or:
df.columns = df.columns.to_series().replace(["\.","few","value"],['_','LOW','PRICE'],regex=True)
Produces the same output.

Use Series.replace with dictionary - also necessary escape . because special regex character:
d = { '\.':'_', 'few':'LOW', 'value':'PRICE'}
df.columns = df.columns.to_series().replace(d, regex=True)
More general solution with re.esape:
import re
d = { '.':'_', 'few':'LOW', 'value':'PRICE'}
d1 = {re.escepe(k): v for k, v in d.items()}
df.columns = df.columns.to_series().replace(d1, regex=True)

Related

pandas: replace column value with keys and values in a dictionary of list values

I have a dataframe and a dictionary as follows (but much bigger),
import pandas as pd
df = pd.DataFrame({'text': ['can you open the door?','shall you write the address?']})
dic = {'Should': ['can','could'], 'Could': ['shall'], 'Would': ['will']}
I would like to replace the words in the text column if they can be found in dic list of values, so i did the following and it works for the lists that have one value but not for the other list,
for key, val in dic.items():
if df['text'].str.lower().str.split().map(lambda x: x[0]).str.contains('|'.join(val)).any():
df['text'] = df['text'].str.replace('|'.join(val), key, regex=False)
print(df)
my desired output would be,
text
0 Should you open the door?
1 Could you write the address?

The best is to change the logic and try to minimize the pandas steps.
You can craft a dictionary that will directly contain your ideal output:
dic2 = {v:k for k,l in dic.items() for v in l}
# {'can': 'Should', 'could': 'Should', 'shall': 'Could', 'will': 'Would'}
# or if not yet formatted:
# dic2 = {v.lower():k.capitalize() for k,l in dic.items() for v in l}
import re
regex = '|'.join(map(re.escape, dic2))
df['text'] = df['text'].str.replace(f'\b({regex})\b',
lambda m: dic2.get(m.group()),
case=False, # only if case doesn't matter
regex=True)
output (as text2 column for clarity):
text text2
0 can you open the door? Should you open the door?
1 shall you write the address? Could you write the address?

You can use lowercase in flatten dictionary to d for keys and values, then replace values with words boundaries and last use Series.str.capitalize:
d = {x.lower(): k.lower() for k, v in dic.items() for x in v}
regex = '|'.join(r"\b{}\b".format(x) for x in d.keys())
df['text'] = (df['text'].str.lower()
.str.replace(regex, lambda x: d[x.group()], regex=True)
.str.capitalize())
print(df)
text
0 Should you open the door?
1 Could you write the address?

How to select column if string is in column name

so I have a dict of dataframes with many columns. I want to selected all the columns that have the string 'important' in them.
So some of the frames may have important_0 or important_9_0 as their column name. How can I select them and put them into their own new dictionary with all the values each columns contains.

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'important_c'])
selected_cols = [c for c in df.columns if c.startswith('important_')]
print(selected_cols)
# ['important_c']
dict_df = { x: pd.DataFrame(columns=['a', 'b', 'important_c']) for x in range(3) }
new_dict = { x: dict_df[x][[c for c in dict_df[x].columns if c.startswith('important_')]] for x in dict_df }

important_columns = [x for x in df.columns if 'important' in x]
#changing your dataframe by remaining columns that you need
df = df[important_columns]

How to access elements of Collections Counter stored as column in dataframe to be used in CountVectorizer

One of the columns in the dataframe is in the following format
Row 1 :
Counter({'First': 3, 'record': 2})
Row 2 :
Counter({'Second': 2, 'record': 1}).
I want to create a new column which has the following value:
Row 1 :
First First First record record
Row 2 :
Second Second record

I was able to solve the question myself by the following code. It is very much related to regex.
def transform_word_count(text):
words = re.findall(r'\'(.+?)\'',text)
n = re.findall(r"[0-9]",text)
result = []
for i in range(len(words)):
for j in range(int(n[i])):
result.append(words[i])
return result
df['new'] = df.apply(lambda row: transform_word_count(row['old']), axis=1)

Use apply with iter values of counter and join with space - first repeated values and then together:
import ast
#convert values to dictionaries
df['col'] = df['col'].str.extract('\((.+)\)', expand=False).apply(ast.literal_eval)
df['new'] = df['col'].apply(lambda x: ' '.join(' '.join([k] * v) for k, v in x.items()))
print (df)
col new
0 {'First': 3, 'record': 2} First First First record record
1 {'Second': 2, 'record': 1} Second Second record
Or list comprehension:
df['new'] = [' '.join(' '.join([k] * v) for k, v in x.items()) for x in df['col']]

How to remove one dictionary from dataframe

I have the following dataframe:
And I made dictionaries from each unique appId as you see below:
with this command:
dfs = dict(tuple(timeseries.groupby('appId')))
After that I want to remove all dictionaries which have less than 30 rows from my dataframe. I removed those dictionaries from my dictionaries(dfs) and then I tried this code:
pd.concat([dfs]).drop_duplicates(keep=False)
but it doesn't work.

I believe you need transform size and then filter by boolean indexing:
df = pd.concat([dfs])
df = df[df.groupby('appId')['appId'].transform('size') >= 30]
#alternative 1
#df = df[df.groupby('appId')['appId'].transform('size').ge(30)]
#alternative 2 (slowier in large data)
#df = df.groupby('appId').filter(lambda x: len(x) >= 30)
Another approach is filter dictionary:
dfs = {k: v for k, v in dfs.items() if len(v) >= 30}
EDIT:
timeseries = timeseries[timeseries.groupby('appId')['appId'].transform('size') >= 30]
dfs = dict(tuple(timeseries.groupby('appId')))

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When the contents of a column/row is '1', '2' or '3', I would like to replace its contents with the corresponding item from the column indicated. I.e., in the first row, column v3 has value "1" so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
for j in range(3):
df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (i+1)]= \
df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (j+1)]
Is there a less cumbersome way to do this?

df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
v1 v2 v3
0 a b a
1 c c d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in a that row. To limit the rows that you can replace from, you can define a list of acceptable column names. For example, lets say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
ORIGINAL (INCORRECT) ANSWER BELOW
note that the answer below only applies when the values to replace are on a diagonal (which is the case in the example but that was not the question asked ... my bad)
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.

I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
for column_name, column in df1.iteritems():
for k, v in mapping.items():
df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.iteritems():
hits = column.isin(mapping.keys())
for idx, item in column[hits].iteritems():
df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
v1 v2 v3
0 a b a
1 c c d
If you've chosen a way, you could reduce the 2 nested for-loops to 1 for-loop with itertools.product

I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
for col in columns:
x = getattr(row, col)
try:
x = int(x)
if int(x) in col_num_dict.keys():
setattr(row, col, getattr(row, col_num_dict[int(x)]))
except ValueError:
pass
return row
df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.

you can modify the data before convert to df
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx,line in enumerate(data):
... for item in line:
... try:
... int(line[item ])
... data[idx][item ] = data[idx][mapping[line[item ]]]
... except Exception:
... pass
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Replace Multiple Values of columns - python

Related

pandas: replace column value with keys and values in a dictionary of list values

How to select column if string is in column name

How to access elements of Collections Counter stored as column in dataframe to be used in CountVectorizer

How to remove one dictionary from dataframe

How to replace elements of a DataFrame from other indicated columns

Categories

Resources