I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
                   {'v1':'2', 'v2':'c', 'v3':'d'}])
or
  v1 v2 v3
0  a  b  1
1  2  c  d
When the content of a cell is '1', '2' or '3', I would like to replace it with the corresponding item from the indicated column. I.e., in the first row, column v3 has value "1", so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
  v1 v2 v3
0  a  b  a
1  c  c  d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (i+1)] = \
            df.loc[df['v%d' % (i+1)] == ('%d' % (j+1)), 'v%d' % (j+1)]
Is there a less cumbersome way to do this?
df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
  v1 v2 v3
0  a  b  a
1  c  c  d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in that row. To limit the columns that replacements can come from, you can define a list of acceptable column names. For example, let's say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
  v1 v2 v3
0  a  b  a
1  2  c  d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
ORIGINAL (INCORRECT) ANSWER BELOW
Note that the answer below only applies when the values to replace lie on the diagonal (which is the case in the example, but that was not the question asked ... my bad)
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace; these will be the digits from 1 up to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select the values each should be replaced with; these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
  v1 v2 v3
0  a  b  a
1  c  c  d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.
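For example:

df.replace(to_replace, replace_with, inplace=True)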
I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
for column_name, column in df1.items():
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
  v1 v2 v3
0  a  b  a
1  c  c  d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
for column_name, column in df2.items():
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
  v1 v2 v3
0  a  b  a
1  c  c  d
If you've chosen a way, you could reduce the two nested for-loops to one for-loop with itertools.product, as sketched below.
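A minimal sketch of that flattening, applied to the first option (this is an addition, assuming the df and mapping defined above; note that product materializes all column Series up front, which matches the example but is worth rechecking if replacement values can themselves be mapping keys):

from itertools import product

df3 = df.copy()
for (column_name, column), (k, v) in product(df3.items(), mapping.items()):
    # same body as option 1, with the two loops flattened into one
    df3.loc[column == k, column_name] = df3.loc[column == k, v]
df3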
I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
                   {'v1':'2', 'v2':'c', 'v3':'d'}])

def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)
            if x in col_num_dict:
                setattr(row, col, getattr(row, col_num_dict[x]))
        except ValueError:
            pass
    return row
df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.
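As a side note, if the cells are all strings, as in the example, a shorter sketch of the same idea with plain label indexing avoids the get/set attribute calls (my own variant, not the code above; it assumes string keys in the mapping):

def replace_col_simple(row, col_num_dict={'1': 'v1', '2': 'v2', '3': 'v3'}):
    # row is a Series; index directly by label instead of getattr/setattr
    for col in row.index:
        if row[col] in col_num_dict:
            row[col] = row[col_num_dict[row[col]]]
    return row

df = df.apply(replace_col_simple, axis=1)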
You can modify the data before converting it to a DataFrame:
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]
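To finish, rebuild the DataFrame from the modified records (assuming pandas is imported as pd):

df = pd.DataFrame(data)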
Related
I have a data frame whose columns have different names (asset.new, few, value.issue, etc.), and I want to change some characters or symbols in the column names. I can do it in this form:
df.columns = df.columns.str.replace('.', '_')
df.columns = df.columns.str.replace('few', 'LOW')
df.columns = df.columns.str.replace('value', 'PRICE')
....
But I think it should have a better and shorter way.
You can create a dictionary with the actual character as a key and the replacement as a value, and then iterate through the dictionary:
df = pd.DataFrame({'asset.new':[1,2,3],
                   'few':[4,5,6],
                   'value.issue':[7,8,9]})

replaceDict = {'.':'_', 'few':'LOW', 'value':'PRICE'}

for k, v in replaceDict.items():
    df.columns = [c.replace(k, v) for c in list(df.columns)]
print(df)
output:
   asset_new  LOW  PRICE_issue
0          1    4            7
1          2    5            8
2          3    6            9
or:
df.columns = df.columns.to_series().replace([r"\.", "few", "value"], ['_', 'LOW', 'PRICE'], regex=True)
Produces the same output.
Use Series.replace with a dictionary - it is also necessary to escape . because it is a special regex character:

d = {r'\.': '_', 'few': 'LOW', 'value': 'PRICE'}
df.columns = df.columns.to_series().replace(d, regex=True)
More general solution with re.escape:

import re

d = {'.':'_', 'few':'LOW', 'value':'PRICE'}
d1 = {re.escape(k): v for k, v in d.items()}
df.columns = df.columns.to_series().replace(d1, regex=True)
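For the example frame above, either variant should leave the columns as:

print(df.columns.tolist())
['asset_new', 'LOW', 'PRICE_issue']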
Say I have the following dictionary
dict = {'a': ['tool', 'device'], 'b': ['food', 'beverage']}
and I have a dataframe with a column whose first two row values are
'tools',
'foods'
and I want to create a new column where the first value is a and the second is b.
What would be the best way to do this?
First, don't use the variable name dict, because it shadows the Python builtin. Then swap the dictionary's keys and values to build a new mapping, get the matching values from the column with Series.str.findall using the new keys, and use Series.map with that mapping to create the new column:
d = {'a': ['tool', 'device'], 'b': ['food', 'beverage']}
df = pd.DataFrame({'col':['tools','foods']})
d1 = {k: oldk for oldk, oldv in d.items() for k in oldv}
print (d1)
{'tool': 'a', 'device': 'a', 'food': 'b', 'beverage': 'b'}
df['new'] = df['col'].str.findall('|'.join(d1.keys())).str[0].map(d1)
print (df)
     col new
0  tools   a
1  foods   b
Or:
df['new'] = df['col'].str.extract('({})'.format('|'.join(d1.keys())), expand=False).map(d1)
One of the columns in the dataframe is in the following format
Row 1 :
Counter({'First': 3, 'record': 2})
Row 2 :
Counter({'Second': 2, 'record': 1}).
I want to create a new column which has the following value:
Row 1 :
First First First record record
Row 2 :
Second Second record
I was able to solve this myself with the following code, which relies heavily on regex:
import re

def transform_word_count(text):
    words = re.findall(r'\'(.+?)\'', text)
    n = re.findall(r"[0-9]", text)
    result = []
    for i in range(len(words)):
        for j in range(int(n[i])):
            result.append(words[i])
    return result
df['new'] = df.apply(lambda row: transform_word_count(row['old']), axis=1)
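Note that transform_word_count returns a list, so df['new'] will hold lists. If a single space-separated string is wanted, as in the expected output above, join the result (a small optional tweak):

df['new'] = df['old'].apply(lambda t: ' '.join(transform_word_count(t)))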
Use apply to iterate over the values of each counter and join with spaces - first the repeated values, then all of them together:
import ast
#convert values to dictionaries
df['col'] = df['col'].str.extract(r'\((.+)\)', expand=False).apply(ast.literal_eval)
df['new'] = df['col'].apply(lambda x: ' '.join(' '.join([k] * v) for k, v in x.items()))
print (df)
col new
0 {'First': 3, 'record': 2} First First First record record
1 {'Second': 2, 'record': 1} Second Second record
Or list comprehension:
df['new'] = [' '.join(' '.join([k] * v) for k, v in x.items()) for x in df['col']]
In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation that was attached as an image; judging from the code below it is roughly INT = (1/(a*b)) * ∫ I(x)·x dx, where:
I is equal to the values in the data frame, and x is the int_range values, so in this case from 350 to 355 with dx=1.
a and b are optional constants
I need to get an output dataframe with a result per row.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate

dict_INT = {}
for index, row in df1.iterrows():
    func = df1.loc[index] * df1.loc[index].index.astype('float')
    x = df1.loc[index].index.astype('float')
    dict_INT[index] = integrate.trapz(func, x)

df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out/(a*b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06
I solved this by first converting the dataframe to a dict, then performing your equation on each item per row, and then writing the values to a dict of lists using collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict

df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
                    '351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
                    '352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
                    '353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
                    '354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
                    '355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
                   index=[1, 2, 3, 4, 5])

int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1

df_dict = df1.to_dict()  # convert df to dict for easier operations
d = defaultdict(list)    # empty dict of lists, to collect values per column

integrated_list = []
for k, v in df_dict.items():  # k is a column header, v is a dict of {index: value}
    for x, y in v.items():    # x is the row index, y is the cell value
        integrated_list.append((k, float(k) * float(y) * float(dx) / (a * b)))  # store tuples of (column, calc)

for x, y in integrated_list:  # group calcs under their column header
    d[x].append(y)

d = {k: tuple(v) for k, v in d.items()}    # unpack to multiple values
integrated_df = pd.DataFrame.from_dict(d)  # to df
integrated_df['Sum'] = integrated_df.sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06
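For comparison, a vectorized sketch of the same per-cell calculation (my addition, reusing the df1, a, b and dx defined above; unlike the loop version it keeps the original 1-5 index):

weights = df1.columns.astype(float).to_numpy()  # numeric column headers
integrated_df2 = df1 * weights * dx / (a * b)   # per-cell k*y*dx/(a*b)
integrated_df2['Sum'] = integrated_df2.sum(axis=1)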
I have a PySpark DataFrame
Col1 Col2 Col3
0.1 0.2 0.3
I want to get the column names where at least one row meets a condition, for example where a value is bigger than 0.1.
My expected result in this case should be:
[Col2, Col3]
I cannot provide any code because truly I don't know how to do this.
Just count items satisfying the predicate (internal select) and process the results:
from pyspark.sql.functions import col, count, when

[c for c, v in df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
]).first().asDict().items() if v]
Step by step:
Aggregate (DataFrame -> DataFrame):
df = sc.parallelize([(0.1, 0.2, 0.3)]).toDF()

counts = df.select([
    count(when(col(c) > 0.1, 1)).alias(c) for c in df.columns
])
DataFrame[_1: bigint, _2: bigint, _3: bigint]
collect the first Row:
a_row = counts.first()
Row(_1=0, _2=1, _3=1)
Convert to Python dict:
a_dict = a_row.asDict()
{'_1': 0, '_2': 1, '_3': 1}
And iterate over its items, keeping the key when the value is truthy:
[c for c, v in a_dict.items() if v]
or explicitly checking count:
[c for c, v in a_dict.items() if v > 0]
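For the example frame, whose columns were auto-named _1 to _3 by toDF, either form evaluates to:

['_2', '_3']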