pandas dataframe apply a function depending on index/column name

multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}
df = pd.util.testing.makeDataFrame()  # a random df with columns A, B, C, D
f = lambda x, col: multipliers[col] * x
Is there a non-loop way in pandas to apply f to each column, something like df.apply(f, axis=0, ?)? What I can achieve with a loop is:
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
(The real f is more complex than the example above; please treat f as a black-box function of two arguments, where arg1 is the data and arg2 is the column name.)

Use a lambda function and pass the column name with x.name (apply hands each column to the function as a Series, and the Series name is the column label):
import numpy as np
import pandas as pd

np.random.seed(2022)
multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}
df = pd.util.testing.makeDataFrame()  # a random df with columns A, B, C, D
f = lambda x, col: multipliers[col] * x

# loop version from the question, for comparison
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
print(df2.head())
                    A          B          C          D
9CTWXXW3ys   2.308860   6.375789   5.362095 -23.354181
yq1PHBltEO   2.876024   1.950080  15.772909 -13.776645
lWtMioDq6A -11.206739  17.691500 -12.175996  25.957264
lEHcq1pxLr  -6.510434  -6.004475  14.084401  13.999673
xvL04Y66tm  -3.827731  -3.104207  -4.111277   1.440596

# same result without the explicit loop
df2 = df.apply(lambda x: f(x, x.name))
print(df2.head())
                    A          B          C          D
9CTWXXW3ys   2.308860   6.375789   5.362095 -23.354181
yq1PHBltEO   2.876024   1.950080  15.772909 -13.776645
lWtMioDq6A -11.206739  17.691500 -12.175996  25.957264
lEHcq1pxLr  -6.510434  -6.004475  14.084401  13.999673
xvL04Y66tm  -3.827731  -3.104207  -4.111277   1.440596
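
As a minimal sketch of the same idea on a frame you can reproduce exactly: the small DataFrame below is a hypothetical stand-in for the random one above (makeDataFrame is deprecated in newer pandas releases), and f is written as a named function purely for readability. With axis=1 the same trick gives the row's index label instead of the column name.

```python
import pandas as pd

# Hypothetical small frame standing in for the random one in the answer.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0],
                   'C': [5.0, 6.0], 'D': [7.0, 8.0]})
multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}

def f(data, col):
    """Black-box function of the column data and the column name."""
    return multipliers[col] * data

# axis=0 (the default): the lambda receives each column as a Series,
# and Series.name holds the column label.
by_column = df.apply(lambda s: f(s, s.name))

# Loop version from the question, used here only as a cross-check.
by_loop = df.copy()
for c in df.columns:
    by_loop[c] = f(df[c], c)

# axis=1: each row arrives as a Series whose name is the index label.
row_labels = df.apply(lambda r: r.name, axis=1)

print(by_column)
print(row_labels.tolist())
```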

You can convert your dictionary to a Series and turn your function into a vectorized operation; the multiplication aligns the Series index with the column names. For example:
df * pd.Series(multipliers)
You can also use the transform method, which accepts a dict mapping column names to functions:
def func(var):
    # build and return the function to apply to one column
    return lambda x: x * var

df.transform({k: func(v) for k, v in multipliers.items()})
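
A small sketch of both variants side by side, assuming a hypothetical integer frame with the same column names. It only illustrates that the two calls produce the same values; it says nothing about which is faster.

```python
import pandas as pd

# Hypothetical data with the column names used in the question.
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})
multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}

# DataFrame * Series aligns the Series index with the column labels,
# so each column is scaled by its own multiplier.
scaled = df * pd.Series(multipliers)

# transform with a dict maps each column label to its own function.
# The default argument m=m binds the current multiplier to each lambda.
transformed = df.transform(
    {k: (lambda x, m=m: x * m) for k, m in multipliers.items()}
)

print(scaled)
print(transformed)
```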

Related

Get dict keys using pandas apply

I want to look up values in a dict that looks like this:
pair_devices_count = {
    ('tWAAAA.jg', 'ttNggB.jg'): 1,
    ('tWAAAM.jg', 'ttWVsM.jg'): 2,
    ('tWAAAN.CV', 'ttNggB.AS'): 1,
    ('tWAAAN.CV', 'ttNggB.CV'): 2,
    ('tWAAAN.CV', 'ttNggB.QG'): 1}
(The keys are pairs of domains.)
But when I use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error, because pandas Series are not hashable.
How can I use the dict values to generate the column train['pair_devices_count']?

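Without axis=1, apply passes each whole column to the function rather than each row, so the lookup key is a Series instead of a pair of values. Pass axis=1 and build the key from the row's fields:
train_data.apply(lambda x: pair_devices_count[(x.domain, x.target_domain)], axis=1)
If some pairs may be missing from the dict, use pair_devices_count.get((x.domain, x.target_domain), 0) instead to avoid a KeyError.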

The error occurs because pandas Series are not hashable.
Convert the pd.Series to a tuple before calling .get. Consider the following simple example:
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
# axis=1 passes one row at a time; tuple(x) makes it usable as a dict key
df['d'] = df[['X','Y']].apply(lambda x:d.get(tuple(x)),axis=1)
print(df)
output
   X  Y  Z  d
0  A  C  X  3
1  A  B  Y  2
2  A  A  Z  1
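
As an alternative sketch that avoids apply altogether, the key tuples can be built by zipping the two columns. The third row here is a hypothetical pair that is not in the dict, included to show the default value from dict.get.

```python
import pandas as pd

d = {('A', 'A'): 1, ('A', 'B'): 2, ('A', 'C'): 3}
df = pd.DataFrame({'X': ['A', 'A', 'B'], 'Y': ['C', 'B', 'A']})

# zip over the two columns yields plain (hashable) tuples, one per row;
# dict.get supplies 0 for pairs that are not in the dict.
df['d'] = [d.get(pair, 0) for pair in zip(df['X'], df['Y'])]

print(df['d'].tolist())
```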

Zipping List of Pandas DataFrames Yields Unexpected Results

Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]

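The zip function returns an iterator of tuples, pairing up the items of its arguments:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
# use the tuple() function to display a readable version of the result:
print(tuple(x))
In zip(*[a, b]) the list is unpacked, so the call is the same as zip(a, b) on the two DataFrames. Iterating over a DataFrame yields its column labels, not its values, which is why the result pairs 'col1' with 'col3' and 'col2' with 'col4'.
To pair up values instead, zip the columns themselves. Any column of a can be combined with any column of b, e.g.:
d = list(zip(a['col1'],b['col4']))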
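
The following sketch reuses the two frames from the question to show the difference between zipping the DataFrames and zipping their columns. The variable names labels, label_pairs and value_pairs are only illustrative.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
b = pd.DataFrame({"col3": [1, 2, 3], "col4": [2, 3, 4]})

# Iterating over a DataFrame walks its column labels.
labels = list(a)

# zip(*[a, b]) is zip(a, b), so it pairs labels, not values.
label_pairs = list(zip(*[a, b]))

# Zipping two columns pairs their values row by row.
value_pairs = list(zip(a["col1"], b["col4"]))

print(labels)
print(label_pairs)
print(value_pairs)
```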

How to select column if string is in column name

I have a dict of DataFrames with many columns. I want to select all the columns that have the string 'important' in their name.
Some of the frames may have important_0 or important_9_0 as a column name. How can I select these columns and put them into their own new dictionary, with all the values each column contains?

import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'important_c'])
selected_cols = [c for c in df.columns if c.startswith('important_')]
print(selected_cols)
# ['important_c']
dict_df = { x: pd.DataFrame(columns=['a', 'b', 'important_c']) for x in range(3) }
new_dict = { x: dict_df[x][[c for c in dict_df[x].columns if c.startswith('important_')]] for x in dict_df }

important_columns = [x for x in df.columns if 'important' in x]
# reduce the DataFrame to the columns you need
df = df[important_columns]
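
Another possible route is DataFrame.filter with its like parameter, which keeps the labels containing a substring. The sketch below assumes a hypothetical dict of three small frames using the column names mentioned in the question.

```python
import pandas as pd

# Hypothetical dict of frames with the column names from the question.
dict_df = {
    k: pd.DataFrame({'a': [1], 'important_0': [2], 'important_9_0': [3]})
    for k in range(3)
}

# filter(like=...) keeps the columns whose label contains the substring.
new_dict = {k: frame.filter(like='important') for k, frame in dict_df.items()}

print(list(new_dict[0].columns))
```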

Optimising quartiling of columns of panda dataframe?

I have multiple columns in a data frame that have numerical data. I want to quartile each column, changing each value to either q1, q2, q3 or q4.
I currently loop through each column and change them using the pandas qcut function:
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, ['q1','q2','q3','q4'])
This is very slow! Is there a faster way to do this?

I played around with the following example a little. It looks like converting from string to float accounts for most of the extra time, although no working example was provided, so the original dtype can't be known. df[column].astype(copy=...) appears to perform about the same whether it copies or not. There is not much else to go after.
import pandas as pd
import numpy as np
import random
import time

random.seed(2)
indexes = [i for i in range(1,10000) for _ in range(10)]
# string-typed columns, so astype('float') has real conversion work to do
df = pd.DataFrame({'A': indexes, 'B': [str(random.randint(1,99)) for e in indexes], 'C':[str(random.randint(1,99)) for e in indexes], 'D':[str(random.randint(1,99)) for e in indexes]})
# integer-typed alternative, for comparison
#df = pd.DataFrame({'A': indexes, 'B': [random.randint(1,99) for e in indexes], 'C':[random.randint(1,99) for e in indexes], 'D':[random.randint(1,99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes, 'B': [random.randint(1,99) for e in indexes], 'C':[random.randint(1,99) for e in indexes], 'D':[random.randint(1,99) for e in indexes]})

def qcut(copy, x):
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, ['q1','q2','q3','q4'])
        df_result["col %d %d"%(x, i)] = s.values

# time.perf_counter() replaces time.clock(), which was removed in Python 3.8
times = []
for x in range(0,10):
    a = time.perf_counter()
    qcut(True, x)
    b = time.perf_counter()
    times.append(b-a)
print(np.mean(times))

times = []  # reset so the second mean covers only the copy=False runs
for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b-a)
print(np.mean(times))
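
If the string-to-float conversion is indeed the expensive part, one thing to try is converting the whole frame once and then binning each column. This is only a sketch on hypothetical random data stored as strings; whether it is actually faster depends on the real data and should be timed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: numeric values stored as strings.
df = pd.DataFrame(rng.random((1000, 3)), columns=['B', 'C', 'D']).astype(str)

labels = ['q1', 'q2', 'q3', 'q4']

# Convert the whole frame in one call, then bin each column.
numeric = df.astype(float)
quartiled = numeric.apply(lambda s: pd.qcut(s, 4, labels=labels))

print(quartiled.head())
```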

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
                   {'v1':'2', 'v2':'c', 'v3':'d'}])
or
  v1 v2 v3
0  a  b  1
1  2  c  d
When the content of a cell is '1', '2' or '3', I would like to replace it with the item from the indicated column in the same row. I.e., in the first row, column v3 has the value "1", so I would like to replace it with that row's value in column v1. Doing this for both rows, I should get:
  v1 v2 v3
0  a  b  a
1  c  c  d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (i+1)]= \
            df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (j+1)]
Is there a less cumbersome way to do this?

df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in the column named 'v'+v if that column exists; otherwise it leaves the value unchanged. (In pandas 0.23 and later, a list returned from apply yields a Series of lists by default; pass result_type='broadcast' to get a DataFrame with the original columns back.)
output:
  v1 v2 v3
0  a  b  a
1  c  c  d
Note that this does not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column in that row. To limit the columns that you can replace from, you can define a list of acceptable column names. For example, let's say you only want to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
  v1 v2 v3
0  a  b  a
1  2  c  d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your DataFrame. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
ORIGINAL (INCORRECT) ANSWER BELOW
Note that the answer below only works when the values to replace lie on the diagonal (which is the case in the example, but that was not the question asked ... my bad).
You can do this with pandas' replace method and numpy's diag function.
First select the values to replace; these are the digits from 1 to the length of your DataFrame, as strings:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select the values that each should be replaced with; these are the diagonal of your DataFrame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
  v1 v2 v3
0  a  b  a
1  c  c  d
And of course, if you want the whole thing as a one-liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword argument to replace if you want to do the replacement in place.

I see 2 options.
Loop over the columns and then over the mapping:
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
# items() replaces iteritems(), which was removed in pandas 2.0
for column_name, column in df1.items():
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
  v1 v2 v3
0  a  b  a
1  c  c  d
Loop over the columns, then loop over all the 'hits':
df2 = df.copy()
for column_name, column in df2.items():
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
  v1 v2 v3
0  a  b  a
1  c  c  d
Whichever way you choose, you could reduce the 2 nested for-loops to 1 for-loop with itertools.product.
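
A rough sketch of that itertools.product suggestion, applied to the first option. The names result, key, source and hits are only illustrative, and the data is the two-row frame from the question.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}

result = df.copy()
# product yields every (column, (key, source column)) combination,
# replacing the two nested loops with a single one.
for column_name, (key, source) in product(df.columns, mapping.items()):
    hits = result[column_name] == key
    result.loc[hits, column_name] = result.loc[hits, source]

print(result)
```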

I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
                   {'v1':'2', 'v2':'c', 'v3':'d'}])

def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)  # non-numeric cells raise ValueError and are skipped
            if x in col_num_dict:
                setattr(row, col, getattr(row, col_num_dict[x]))
        except ValueError:
            pass
    return row

df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col to every row. The attributes of the row object that correspond to its columns get replaced with the right value from the same row. It looks a bit complicated because of the multiple getattr/setattr calls, but it does exactly what is needed without too much overhead.

You can modify the data before converting it to a DataFrame:
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])  # only numeric strings get replaced
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass
print(data)
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]
