pandas dataframe apply a function depending on index/column name - python

multipliers = {'A' : 5, 'B' : 10, 'C' : 15, 'D' : 20}
df = pd.util.testing.makeDataFrame() # a random df with columns A,B,C,D
f = lambda x, col: multipliers[col] * x
Is there a non-loop pandas way to apply f to each column, something like df.apply(f, axis=0, ?)? What I can achieve with a loop is:
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
(The real f is more complex than the above example; please treat f as a black-box function of two variables, where arg1 is the data and arg2 is the column name.)

Use a lambda function and pass the column name via x.name:
np.random.seed(2022)
multipliers = {'A' : 5, 'B' : 10, 'C' : 15, 'D' : 20}
df = pd.util.testing.makeDataFrame() # a random df with columns A,B,C,D
f = lambda x, col: multipliers[col] * x
df2 = df.copy()
for c in df.columns:
    df2[c] = f(df[c], c)
print(df2.head())
                    A          B          C          D
9CTWXXW3ys   2.308860   6.375789   5.362095 -23.354181
yq1PHBltEO   2.876024   1.950080  15.772909 -13.776645
lWtMioDq6A -11.206739  17.691500 -12.175996  25.957264
lEHcq1pxLr  -6.510434  -6.004475  14.084401  13.999673
xvL04Y66tm  -3.827731  -3.104207  -4.111277   1.440596
df2 = df.apply(lambda x: f(x, x.name))
print(df2.head())
                    A          B          C          D
9CTWXXW3ys   2.308860   6.375789   5.362095 -23.354181
yq1PHBltEO   2.876024   1.950080  15.772909 -13.776645
lWtMioDq6A -11.206739  17.691500 -12.175996  25.957264
lEHcq1pxLr  -6.510434  -6.004475  14.084401  13.999673
xvL04Y66tm  -3.827731  -3.104207  -4.111277   1.440596
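Since pd.util.testing.makeDataFrame is not available in newer pandas releases, here is a minimal self-contained sketch of the same idea. The random frame built with NumPy is only a stand-in (an assumption, not the original data); the point is that apply hands each column to the lambda as a Series whose .name is the column label.

```python
import numpy as np
import pandas as pd

multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}

# Stand-in for the random test frame (assumption: any numeric frame
# with columns A, B, C, D behaves the same way here).
rng = np.random.default_rng(2022)
df = pd.DataFrame(rng.standard_normal((5, 4)), columns=list('ABCD'))

def f(x, col):
    # Black-box function of (column data, column name).
    return multipliers[col] * x

# Loop version from the question.
looped = df.copy()
for c in df.columns:
    looped[c] = f(df[c], c)

# apply passes each column as a Series; x.name is the column label.
applied = df.apply(lambda x: f(x, x.name))

# Both approaches should give the same frame.
pd.testing.assert_frame_equal(applied, looped)
```

With axis=1 the same trick works per row, where x.name is the index label instead.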

You can convert your dictionary to a Series and turn your function into a vectorized operation. For example:
df * pd.Series(multipliers)
You can also use the transform method, which accepts a dict of functions:
def func(var):
    # return your function
    return lambda x: x * var
df.transform({k: func(v) for k, v in multipliers.items()})
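To make the Series version concrete, here is a small sketch with made-up numbers (the frame is hypothetical). Multiplying a DataFrame by a Series aligns the Series index with the column labels, so each column gets its own multiplier; this only helps when f really is a simple arithmetic operation.

```python
import pandas as pd

multipliers = {'A': 5, 'B': 10, 'C': 15, 'D': 20}

# Hypothetical small frame so the result is easy to check by eye.
df = pd.DataFrame({'A': [1.0, 2.0], 'B': [1.0, 2.0],
                   'C': [1.0, 2.0], 'D': [1.0, 2.0]})

# The Series index (A, B, C, D) is matched against the column labels.
scaled = df * pd.Series(multipliers)

# Equivalent explicit form; axis='columns' states the alignment.
scaled_explicit = df.mul(pd.Series(multipliers), axis='columns')

print(scaled)
```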

Related

Get dict keys using pandas apply

I want to get values from a dict that looks like
pair_devices_count =
{('tWAAAA.jg', 'ttNggB.jg'): 1,
('tWAAAM.jg', 'ttWVsM.jg'): 2,
('tWAAAN.CV', 'ttNggB.AS'): 1,
('tWAAAN.CV', 'ttNggB.CV'): 2,
('tWAAAN.CV', 'ttNggB.QG'): 1}
(The keys are pairs of domains.)
But when I use
train_data[['domain', 'target_domain']].apply(lambda x: pair_devices_count.get((x), 0))
it raises an error, because pandas Series are not hashable.
How can I look up the dict values to generate the column
train['pair_devices_count']?
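By default apply passes one whole column at a time, so the lookup never sees a (domain, target_domain) pair. Apply row by row with axis=1 instead. You can try this: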
train_data.apply(lambda x: pair_devices_count[(x.domain, x.target_domain)], axis=1)
Regarding "pandas series are not hashable": convert the pd.Series to a tuple before calling .get. Consider the following simple example:
import pandas as pd
d = {('A','A'):1,('A','B'):2,('A','C'):3}
df = pd.DataFrame({'X':['A','A','A'],'Y':['C','B','A'],'Z':['X','Y','Z']})
df['d'] = df[['X','Y']].apply(lambda x:d.get(tuple(x)),axis=1)
print(df)
output
X Y Z d
0 A C X 3
1 A B Y 2
2 A A Z 1
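If the row-wise apply turns out to be slow on a large frame, one possible alternative is to zip the two columns and look up each pair in a list comprehension. This is only a sketch on a toy frame (the data is made up); the default of 0 mirrors the .get(..., 0) in the question.

```python
import pandas as pd

d = {('A', 'A'): 1, ('A', 'B'): 2, ('A', 'C'): 3}
df = pd.DataFrame({'X': ['A', 'A', 'B'], 'Y': ['C', 'B', 'A']})

# zip over two columns yields plain (x, y) tuples, which are hashable,
# so they can be used directly as dict keys. Missing pairs fall back to 0.
df['d'] = [d.get(pair, 0) for pair in zip(df['X'], df['Y'])]

print(df)
```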

Zipping List of Pandas DataFrames Yields Unexpected Results

Can somebody explain the following code?
import pandas as pd
a = pd.DataFrame({"col1": [1,2,3], "col2": [2,3,4]})
b = pd.DataFrame({"col3": [1,2,3], "col4": [2,3,4]})
list(zip(*[a,b]))
Output:
[('col1', 'col3'), ('col2', 'col4')]
The zip function returns an iterator of tuples:
a = ("John", "Charles", "Mike")
b = ("Jenny", "Christy", "Monica", "Vicky")
x = zip(a, b)
#use the tuple() function to display a readable version of the result:
print(tuple(x))
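Iterating over a DataFrame yields its column labels, so zip(*[a, b]) is the same as zip(a, b) and pairs the column names of a with those of b.
To pair values instead, zip specific columns. Any column of a can be combined with any column of b,
e.g.: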
d = list(zip(a['col1'],b['col4']))
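A short sketch that shows where the column names come from, using the two frames from the question. The itertuples call at the end is just one possible way to pair up rows rather than labels.

```python
import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
b = pd.DataFrame({"col3": [1, 2, 3], "col4": [2, 3, 4]})

# Iterating a DataFrame gives its column labels, not its rows.
labels_a = list(a)

# So zip(*[a, b]) == zip(a, b) pairs labels position by position.
label_pairs = list(zip(*[a, b]))

# To pair rows instead, iterate over row tuples explicitly.
row_pairs = list(zip(a.itertuples(index=False, name=None),
                     b.itertuples(index=False, name=None)))

print(labels_a)
print(label_pairs)
print(row_pairs)
```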

How to select column if string is in column name

I have a dict of dataframes with many columns. I want to select all the columns that have the string 'important' in their name.
Some of the frames may have important_0 or important_9_0 as a column name. How can I select these columns and put them into a new dictionary, together with all the values each column contains?
import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'important_c'])
selected_cols = [c for c in df.columns if c.startswith('important_')]
print(selected_cols)
# ['important_c']
dict_df = { x: pd.DataFrame(columns=['a', 'b', 'important_c']) for x in range(3) }
new_dict = { x: dict_df[x][[c for c in dict_df[x].columns if c.startswith('important_')]] for x in dict_df }
important_columns = [x for x in df.columns if 'important' in x]
# keep only the columns you need
df = df[important_columns]
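Another option that may be worth a look is DataFrame.filter, which selects columns whose label contains a substring. A minimal sketch, assuming a dict of frames shaped roughly like the ones described in the question (the frames and values here are made up):

```python
import pandas as pd

# Hypothetical dict of frames with differently named "important" columns.
dict_df = {
    0: pd.DataFrame({'a': [1], 'important_0': [2]}),
    1: pd.DataFrame({'b': [3], 'important_9_0': [4], 'c': [5]}),
}

# filter(like=...) keeps the columns whose label contains the substring.
new_dict = {key: frame.filter(like='important')
            for key, frame in dict_df.items()}

for key, frame in new_dict.items():
    print(key, list(frame.columns))
```

Use filter(regex='^important_') instead if the name has to start with the prefix.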

Optimising quartiling of columns of panda dataframe?

I have multiple columns in a data frame that have numerical data. I want to quartile each column, changing each value to either q1, q2, q3 or q4.
I currently loop through each column and change them using the pandas qcut function:
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, ['q1','q2','q3','q4'])
This is very slow! Is there a faster way to do this?
I played around with the following example a little. It looks like converting from string to float is what increases the time, though since no working example was provided, the original dtype can't be known. df[column].astype(copy=...) appears to perform about the same whether copying or not. There is not much else to go after.
import pandas as pd
import numpy as np
import random
import time
random.seed(2)
indexes = [i for i in range(1,10000) for _ in range(10)]
df = pd.DataFrame({'A': indexes, 'B': [str(random.randint(1,99)) for e in indexes], 'C':[str(random.randint(1,99)) for e in indexes], 'D':[str(random.randint(1,99)) for e in indexes]})
#df = pd.DataFrame({'A': indexes, 'B': [random.randint(1,99) for e in indexes], 'C':[random.randint(1,99) for e in indexes], 'D':[random.randint(1,99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes, 'B': [random.randint(1,99) for e in indexes], 'C':[random.randint(1,99) for e in indexes], 'D':[random.randint(1,99) for e in indexes]})
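def qcut(copy, x):
    # Quartile every column of df and store the labels in df_result.
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, labels=['q1','q2','q3','q4'])
        df_result["col %d %d"%(x, i)] = s.values
# Time ten runs with copy=True.
times = []
for x in range(0,10):
    a = time.perf_counter()  # time.clock() was removed in Python 3.8
    qcut(True, x)
    b = time.perf_counter()
    times.append(b-a)
print(np.mean(times))
# Time ten runs with copy=False, starting from a fresh list.
times = []
for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b-a)
print(np.mean(times))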
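If the string-to-float conversion really is the expensive part, one thing to try is converting the whole frame once and only then quartiling each column. The sketch below uses made-up data (random floats stored as strings, which is an assumption about the original frame) and makes no claim about how much time it saves; measure it on your own data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Hypothetical data: numeric values stored as strings.
df = pd.DataFrame(rng.random((1000, 3)).astype(str), columns=['B', 'C', 'D'])

labels = ['q1', 'q2', 'q3', 'q4']

# Convert every column to float in a single call ...
numeric = df.astype('float')

# ... then quartile column by column; qcut works on one Series at a time.
quartiles = numeric.apply(lambda s: pd.qcut(s, 4, labels=labels))

print(quartiles.head())
```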

How to replace elements of a DataFrame from other indicated columns

I have a DataFrame like:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
or
v1 v2 v3
0 a b 1
1 2 c d
When the contents of a column/row is '1', '2' or '3', I would like to replace its contents with the corresponding item from the column indicated. I.e., in the first row, column v3 has value "1" so I would like to replace it with the value of the first element in column v1. Doing this for both rows, I should get:
v1 v2 v3
0 a b a
1 c c d
I can do this with the following code:
for i in range(3):
    for j in range(3):
        df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (i+1)]= \
            df.loc[df['v%d' % (i+1)]==('%d' % (j+1)),'v%d' % (j+1)]
Is there a less cumbersome way to do this?
df.apply(lambda row: [row['v'+v] if 'v'+v in row else v for v in row], 1)
This iterates over each row and replaces any value v with the value in column named 'v'+v if that column exists, otherwise it does not change the value.
output:
v1 v2 v3
0 a b a
1 c c d
Note that this will not limit the replacements to digits only. For example, if you have a column named 'va', it will replace all cells that contain 'a' with the value in the 'va' column of that row. To limit the columns that you can replace from, you can define a list of acceptable column names. For example, let's say you only wanted to make replacements from column v1:
acceptable_columns = ['v1']
df.apply(lambda row: [row['v'+v] if 'v'+v in acceptable_columns else v for v in row], 1)
output:
v1 v2 v3
0 a b a
1 2 c d
EDIT
It was pointed out that the answer above throws an error if you have non-string types in your dataframe. You can avoid this by explicitly converting each cell value to a string:
df.apply(lambda row: [row['v'+str(v)] if 'v'+str(v) in row else v for v in row], 1)
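One caveat worth checking on your pandas version: as far as I know, since pandas 0.23 a list returned from apply(..., axis=1) comes back as a Series of lists rather than a DataFrame. A minimal sketch of how to get a frame back, using result_type='expand' and restoring the column labels (the helper name resolve is made up for this example):

```python
import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])

def resolve(row):
    # Replace a cell value v by row['v' + v] when such a column exists.
    return [row['v' + str(v)] if 'v' + str(v) in row else v for v in row]

# 'expand' turns each returned list into columns 0, 1, 2, ...
out = df.apply(resolve, axis=1, result_type='expand')

# Put the original column labels back.
out.columns = df.columns

print(out)
```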
ORIGINAL (INCORRECT) ANSWER BELOW
Note that the answer below only applies when the values to replace are on a diagonal (which is the case in the example, but that was not the question asked ... my bad).
You can do this with pandas' replace method and numpy's diag method:
First select the values to replace, these will be the digits 1 to the length of your dataframe:
to_replace = [str(i) for i in range(1,len(df)+1)]
Then select values that each should be replaced with, these will be the diagonal of your data frame:
import numpy as np
replace_with = np.diag(df)
Now you can do the actual replacement:
df.replace(to_replace, replace_with)
which gives:
v1 v2 v3
0 a b a
1 c c d
And of course if you want the whole thing as a one liner:
df.replace([str(i) for i in range(1,len(df)+1)], np.diag(df))
Add the inplace=True keyword arg to replace if you want to do the replacement in place.
I see 2 options.
Loop over the columns and then over the mapping
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
df1 = df.copy()
# iteritems() was removed in pandas 2.0; items() is the replacement
for column_name, column in df1.items():
    for k, v in mapping.items():
        df1.loc[column == k, column_name] = df1.loc[column == k, v]
df1
v1 v2 v3
0 a b a
1 c c d
Loop over the columns, then loop over all the 'hits'
df2 = df.copy()
# iteritems() was removed in pandas 2.0; items() is the replacement
for column_name, column in df2.items():
    hits = column.isin(mapping.keys())
    for idx, item in column[hits].items():
        df2.loc[idx, column_name] = df2.loc[idx, mapping[item]]
df2
v1 v2 v3
0 a b a
1 c c d
Whichever option you choose, you can reduce the two nested for loops to a single for loop with itertools.product.
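For the first option, the itertools.product version could look roughly like this. It is a sketch on the example frame from the question; the mask is recomputed from the frame on every step, which is a small deviation from the loop above.

```python
from itertools import product

import pandas as pd

df = pd.DataFrame([{'v1': 'a', 'v2': 'b', 'v3': '1'},
                   {'v1': '2', 'v2': 'c', 'v3': 'd'}])
mapping = {'1': 'v1', '2': 'v2', '3': 'v3'}

df1 = df.copy()

# product yields every (column name, (key, source column)) combination,
# replacing the two nested loops with a single one.
for column_name, (k, v) in product(df1.columns, mapping.items()):
    mask = df1[column_name] == k
    df1.loc[mask, column_name] = df1.loc[mask, v]

print(df1)
```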
I made this:
df = pd.DataFrame([{'v1':'a', 'v2':'b', 'v3':'1'},
{'v1':'2', 'v2':'c', 'v3':'d'}])
def replace_col(row, columns, col_num_dict={1: 'v1', 2: 'v2', 3: 'v3'}):
    for col in columns:
        x = getattr(row, col)
        try:
            x = int(x)
            if int(x) in col_num_dict.keys():
                setattr(row, col, getattr(row, col_num_dict[int(x)]))
        except ValueError:
            pass
    return row
df = df.apply(replace_col, axis=1, args=(df.columns,))
It applies the function replace_col on every row. The row object's attributes which correspond to its columns get replaced with the right value from the same row. It looks a bit complicated due to the multiple set/get attribute functions, but it does exactly what is needed without too much overhead.
You can modify the data before converting it to a DataFrame:
data = [{'v1':'a', 'v2':'b', 'v3':'1'},{'v1':'2', 'v2':'c', 'v3':'d'}]
mapping = {'1': 'v1', '3': 'v3', '2': 'v2'}
for idx, line in enumerate(data):
    for item in line:
        try:
            int(line[item])
            data[idx][item] = data[idx][mapping[line[item]]]
        except Exception:
            pass
After the loop, data is:
[{'v1': 'a', 'v2': 'b', 'v3': 'a'}, {'v1': 'c', 'v2': 'c', 'v3': 'd'}]
