I have several lists that are generated from a get_topic() function. That is,
list1 = get_topic(1)
list2 = get_topic(2)
and dozens of other lists.
# The list contains something like
[('A', 0.1),('B', 0.2),('C',0.3)]
I am trying to write a loop so that all different lists can be saved to different columns in a dataframe. The code I tried was:
for i in range(1, number):  # number is the total number of lists + 1
    df_02 = pd.DataFrame(get_topic(i))
This only returns list1, and none of the other lists. The result that I would like to get is something like:
List 1  Number 1  List 2  Number 2
A       0.1       D       0.03
B       0.2       E       0.04
C       0.3       F       0.05
Could anyone help me to correct the loop? Thank you.
df = pd.DataFrame()
for i in range(1, number):
    df[f'List {i}'], df[f'Number {i}'] = zip(*get_topic(i))
You are creating a new DataFrame at every iteration.
This will create a structure similar to what you want:
df = pd.DataFrame([get_topic(i) for i in range(1, number)])
df = df.apply(pd.Series.explode).reset_index(drop=True)
df = df.transpose()
Result:
0 1 2 3 4 5
0 A 0.1 D 0.1 G 0.1
1 B 0.2 E 0.2 H 0.2
2 C 0.3 F 0.3 I 0.3
One-liner version:
df = pd.DataFrame([get_topic(i) for i in range(1, number)]).apply(pd.Series.explode).reset_index(drop=True).transpose()
I reconstruct a hypothetical get_topic() function that simply fetches a list from a list of lists.
The idea is to use pd.concat() in order to concatenate dataframes at each iteration.
import pandas as pd
topics = [
[('A', 0.1), ('B', 0.2), ('C', 0.3)],
[('D', 0.3), ('E', 0.4), ('F', 0.5)]
]
number = len(topics)
def get_topic(index) -> list:
    return topics[index]
if __name__ == '__main__':
    df = pd.DataFrame()
    for i in range(0, number):  # number is the total number of lists
        curr_topic = get_topic(i)
        curr_columns = ['List ' + str(i+1), 'Number ' + str(i+1)]
        # Append the two new columns to the right of the existing frame
        df = pd.concat([df, pd.DataFrame(data=curr_topic, columns=curr_columns)], axis=1)
    print(df)
Output will be:
List 1 Number 1 List 2 Number 2
0 A 0.1 D 0.3
1 B 0.2 E 0.4
2 C 0.3 F 0.5
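As a variation on the pd.concat() idea above, here is a minimal sketch that builds one small frame per list and concatenates them once at the end, rather than growing df inside the loop. The topics dictionary and this get_topic() are hypothetical stand-ins, since the real function is not shown in the question.

```python
import pandas as pd

# Hypothetical stand-in for the asker's get_topic(); the real one is not shown.
topics = {
    1: [('A', 0.1), ('B', 0.2), ('C', 0.3)],
    2: [('D', 0.03), ('E', 0.04), ('F', 0.05)],
}

def get_topic(i):
    return topics[i]

number = len(topics) + 1  # total number of lists + 1, as in the question

# Build one two-column frame per topic, then concatenate side by side once.
frames = [
    pd.DataFrame(get_topic(i), columns=[f'List {i}', f'Number {i}'])
    for i in range(1, number)
]
wide = pd.concat(frames, axis=1)
print(wide)
```

If the lists have different lengths, concat with axis=1 aligns on the row index and pads the shorter ones with NaN.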
I have a dataframe with 3 columns: a_id, b, c (with a_id as a unique key) and I would like to assign a score for each row based on the number in b and c columns. I have created the following:
def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1
def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1
Normally, I would use something like this:
df['b_score'] = df.apply(b_score_function, axis = 1)
df['c_score'] = df.apply(c_score_function, axis = 1)
However, the above approach will be too long if I have multiple columns. I would like to know how I can create a loop over the selected columns. I have tried the following:
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(f'{col}_score_function', axis = 1)
but it returned the following error:
'b_score_function' is not a valid function for 'DataFrame' object
Can anyone please point out what I did wrong?
Also, if anyone can suggest how to make this reusable, that would be appreciated.
Thank you.
IIUC, this should work for you:
df = pd.DataFrame({'a_id': range(5), 'b': [0.0, 0.25, 0.5, 2.0, 2.5], 'c': [0.0, 0.25, 0.5, 1.0, 1.5]})
def b_score_function(df):
    if df['b'] <= 0:
        return 0
    elif df['b'] <= 2:
        return 0.25
    else:
        return 1
def c_score_function(df):
    if df['c'] <= 0:
        return 0
    elif df['c'] <= 1:
        return 0.5
    else:
        return 1
ds_cols = df.columns.difference(['a_id']).to_list()
for col in ds_cols:
    df[f'{col}_score'] = df.apply(eval(f'{col}_score_function'), axis = 1)
print(df)
Result:
a_id b c b_score c_score
0 0 0.00 0.00 0.00 0.0
1 1 0.25 0.25 0.25 0.5
2 2 0.50 0.50 0.25 0.5
3 3 2.00 1.00 0.25 0.5
4 4 2.50 1.50 1.00 1.0
For a vectorized approach in a single pass, you can use dictionaries to hold the threshold and replacement values, then numpy.select (with numpy imported as np):
# example input
df = pd.DataFrame({'b': [-1, 2, 5],
'c': [5, -1, 1]})
# dictionaries (one key:value per column)
thresh = {'b': 2, 'c': 1}
repl = {'b': 0.25, 'c': 0.5}
out = pd.DataFrame(
    np.select([df.le(0), df.le(thresh)],
              [0, pd.Series(repl)],
              1),
    columns=list(thresh),
    index=df.index
).add_suffix('_score')
output:
b_score c_score
0 0.00 1.0
1 0.25 0.0
2 1.00 0.5
The problem with your attempt is that pandas cannot access your functions from strings with the same name. For example, you need to pass df.apply(b_score_function, axis=1), and not df.apply("b_score_function", axis=1) (note the double quotes).
My first thought would be to link the column names to functions with a dictionary:
funcs = {'b' : b_score_function,
'c' : c_score_function}
for col in ds_cols:
    foo = funcs[col]
    df[f'{col}_score'] = df.apply(foo, axis = 1)
Typing out the dictionary funcs may be tedious or infeasible depending on how many columns/functions you have. If that is the case, you may have to find additional ways to automate the creation and access of your column-specific functions.
One somewhat automatic way is to use locals() or globals() - these will return dictionaries which have the functions you defined (as well as other things):
for col in ds_cols:
    key = f"{col}_score_function"
    foo = locals()[key]
    df[f'{col}_score'] = df.apply(foo, axis=1)
This code is dependent on the fact that the function for column "X" is called X_score_function(), but that seems to be met in your example. It also requires that every column in ds_cols will have a corresponding entry in locals().
Somewhat confusingly there are some functions which you can access by passing a string to apply, but these are only the ones that are shortcuts for numpy functions, like df.apply('sum') or df.apply('mean'). Documentation for this appears to be absent. Generally you would want to do df.sum() rather than df.apply('sum'), but sometimes being able to access the method by the string is convenient.
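On the "reusable" part of the question, one option is a small function factory, so that each column needs only its two parameters rather than its own hand-written function. This is a sketch only: make_score_function and the rules dictionary are names I made up for illustration, and it assumes every column follows the same three-band pattern (0 at or below zero, a middle score up to a threshold, 1 above it).

```python
import pandas as pd

def make_score_function(col, mid_threshold, mid_score):
    """Return a row-wise scorer for `col` (hypothetical helper)."""
    def score(row):
        if row[col] <= 0:
            return 0
        if row[col] <= mid_threshold:
            return mid_score
        return 1
    return score

# One entry per column: (upper bound of the middle band, score for that band).
rules = {'b': (2, 0.25), 'c': (1, 0.5)}

df = pd.DataFrame({'a_id': range(5),
                   'b': [0.0, 0.25, 0.5, 2.0, 2.5],
                   'c': [0.0, 0.25, 0.5, 1.0, 1.5]})

for col, (threshold, mid) in rules.items():
    df[f'{col}_score'] = df.apply(make_score_function(col, threshold, mid), axis=1)
print(df)
```

Adding a column then means adding one entry to rules, with no eval() or locals() lookup involved.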
I have a data frame like this:
mydf = {'p1':[0.1, 0.2, 0.3], 'p2':[0.2, 0.1,0.3], 'p3':[0.1,0.9, 0.01], 'p4':[0.11, 0.2, 0.4], 'p5':[0.3, 0.1,0.5],
'w1':['cancel','hello', 'hi'], 'w2':['good','bad','ugly'], 'w3':['thanks','CUSTOM_MASK','great'],
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible'],'w5':['CUSTOM_MASK','CUSTOM_MASK','job']}
df = pd.DataFrame(mydf)
What I need to do is sum the values in columns p1, p2, p3, p4, p5 for each row, skipping any value whose corresponding entry in w1, w2, w3, w4, w5 is CUSTOM_MASK or CUSTOM_UNKNOWN.
So the result would be to add a column to the data frame like this: (0.1+0.2+0.1=0.4 is for the first row).
top_p
0.4
0.3
1.51
So my question is: is there a pandas way to do this?
What I have done so far is loop through the rows and then the columns, check for those values (CUSTOM_MASK, CUSTOM_UNKNOWN), and sum the remaining values.
You can use mask. The idea is to create a boolean mask from the w columns, use it to blank out the corresponding entries in the p columns, and sum what remains:
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Before summing, the output of mask looks like:
p1 p2 p3 p4 p5
0 0.1 0.2 0.10 NaN NaN
1 0.2 0.1 NaN NaN NaN
2 0.3 0.3 0.01 0.4 0.5
Here's a way to do this using np.dot():
pCols, wCols = ['p'+str(i + 1) for i in range(5)], ['w'+str(i + 1) for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
We first prepare the two sets of column names p1,...,p5 and w1,...,w5.
Then we use apply() to take the dot product of the values in the pN columns with the filtering criteria based on the wN columns (namely include only contributions from pN column values whose corresponding wN column value is not in the list of excluded strings).
Output:
p1 p2 p3 p4 p5 w1 w2 w3 w4 w5 top_p
0 0.1 0.2 0.10 0.11 0.3 cancel good thanks CUSTOM_MASK CUSTOM_MASK 0.40
1 0.2 0.1 0.90 0.20 0.1 hello bad CUSTOM_MASK CUSTOM_UNKNOWN CUSTOM_MASK 0.30
2 0.3 0.3 0.01 0.40 0.5 hi ugly great trible job 1.51
Alternatively, element-wise multiplication and sum across columns can be used like this:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
Here, we needed to rename the columns of one of the 5-column DataFrames to ensure that * (DataFrame.multiply()) can do the element-wise multiplication.
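To illustrate why the rename matters, here is a tiny sketch with made-up two-column frames (p, keep and the values are illustrative only). pandas aligns DataFrame operands on their column labels before multiplying, so frames with disjoint labels produce only NaN.

```python
import pandas as pd

p = pd.DataFrame({'p1': [0.1, 0.2], 'p2': [0.3, 0.4]})
keep = pd.DataFrame({'w1': [True, False], 'w2': [True, True]})

# Labels differ, so pandas aligns on the union of columns and every cell is NaN.
misaligned = p * keep

# After renaming w1 -> p1 and w2 -> p2 the labels match and the
# multiplication is element-wise: False zeroes out the value, True keeps it.
aligned = p * keep.rename(columns={'w1': 'p1', 'w2': 'p2'})

print(misaligned)
print(aligned.sum(axis=1))
```

Converting both sides with to_numpy() sidesteps alignment entirely, which is what method #4 in the timings relies on.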
UPDATE: Here are a few timing comparisons on various possible methods for solving this question:
#1. Pandas mask and sum (see answer by @enke):
df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
#2. Pandas apply with Numpy dot solution:
pCols, wCols = ['p'+str(i + 1) for i in range(5)], ['w'+str(i + 1)for i in range(5)]
df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
#3. Pandas element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
#4. Numpy element-wise multiply and sum:
pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
Timing results:
Timeit results for df with 30000 rows:
method_1 ran in 0.008165133331203833 seconds using 3 iterations
method_2 ran in 13.408894366662329 seconds using 3 iterations
method_3 ran in 0.007688766665523872 seconds using 3 iterations
method_4 ran in 0.006326200003968552 seconds using 3 iterations
Time performance results:
Method #4 (numpy multiply/sum) is about 20% faster than the runners-up.
Methods #1 and #3 (pandas mask/sum vs multiply/sum) are neck-and-neck in second place.
Method #2 (pandas apply/numpy dot) is frightfully slow.
Here's the timeit() test code in case it's of interest:
import pandas as pd
import numpy as np
nListReps = 10000
df = pd.DataFrame({'p1':[0.1, 0.2, 0.3]*nListReps, 'p2':[0.2, 0.1,0.3]*nListReps, 'p3':[0.1,0.9, 0.01]*nListReps, 'p4':[0.11, 0.2, 0.4]*nListReps, 'p5':[0.3, 0.1,0.5]*nListReps,
'w1':['cancel','hello', 'hi']*nListReps, 'w2':['good','bad','ugly']*nListReps, 'w3':['thanks','CUSTOM_MASK','great']*nListReps,
'w4':['CUSTOM_MASK','CUSTOM_UNKNOWN', 'trible']*nListReps,'w5':['CUSTOM_MASK','CUSTOM_MASK','job']*nListReps})
from timeit import timeit
def foo_1(df):
    df['top_p'] = df.filter(like='p').mask(df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df
def foo_2(df):
    pCols, wCols = ['p'+str(i + 1) for i in range(5)], ['w'+str(i + 1) for i in range(5)]
    df['top_p'] = df.apply(lambda x: np.dot(x[pCols], ~(x[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']))), axis=1)
    return df
def foo_3(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    colMap = {wCols[i] : pCols[i] for i in range(len(pCols))}
    df['top_p'] = (df[pCols] * ~df[wCols].rename(columns=colMap).isin(['CUSTOM_MASK','CUSTOM_UNKNOWN'])).sum(axis=1)
    return df
def foo_4(df):
    pCols, wCols = [[c for c in df.columns if c[0] == char] for char in 'pw']
    df['top_p'] = (df[pCols].to_numpy() * ~df[wCols].isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
    return df
n = 3
print(f'Timeit results for df with {len(df.index)} rows:')
for foo in ['foo_'+str(i + 1) for i in range(4)]:
    t = timeit(f"{foo}(df.copy())", setup=f"from __main__ import df, {foo}", number=n) / n
    print(f'{foo} ran in {t} seconds using {n} iterations')
Conclusion:
The absolute fastest of these four approaches seems to be Numpy element-wise multiply and sum. However, @enke's Pandas mask and sum is pretty close in performance and is arguably the most aesthetically pleasing of the four candidates.
Perhaps this hybrid of the two (which runs about as fast as #4 above) is worth considering:
df['top_p'] = (df.filter(like='p').to_numpy() * ~df.filter(like='w').isin(['CUSTOM_MASK','CUSTOM_UNKNOWN']).to_numpy()).sum(axis=1)
I have a dataframe with several columns from which I want to extract one row for each "family" of individuals: the row that has the most frequent number ("No"). I have tested this with a for loop that seems to work, but being a newbie I wanted to know if there is a shorter/smarter way of doing it.
Here is a short example code:
import pandas as pd
ind = [ ('A', 'a', 0.1 , 9) ,
('B', 'b', 0.6 , 10) ,
('C', 'b', 0.4 , 10) ,
('D', 'b', 0.2, 7) ,
('E', 'a', 0.9 , 6) ,
('F', 'b', 0.7 , 11)
]
df = pd.DataFrame(ind, columns = ['Name' , 'Family', 'Prob', 'No'])
rows = []
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()  # most frequent "No" in this family
    idx = g['No'] == v
    rows.append(g[idx].iloc[0])  # first row carrying that value
res = pd.DataFrame(rows)  # DataFrame.append was removed in pandas 2.0
print(res)
I have looked at several examples that do some of it, like this, but with that I can only get the "Family" and "No" columns and not the whole row...
Here is an alternative using groupby + mode together with duplicated:
c = df['No'].eq(df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]))
c1 = df[['Family','No']].duplicated()
output = df[c & ~c1]
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
Use GroupBy.transform with the first mode of each group, filter the rows that match it, and finally remove duplicates with DataFrame.drop_duplicates:
df1 = (df[df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]).eq(df['No'])]
.drop_duplicates(['Family','No']))
print (df1)
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
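To make the transform + mode step easier to follow, here is a sketch that prints the intermediate series before filtering. The names family_mode and result are mine; the data is the example from the question.

```python
import pandas as pd

df = pd.DataFrame(
    [('A', 'a', 0.1, 9), ('B', 'b', 0.6, 10), ('C', 'b', 0.4, 10),
     ('D', 'b', 0.2, 7), ('E', 'a', 0.9, 6), ('F', 'b', 0.7, 11)],
    columns=['Name', 'Family', 'Prob', 'No'],
)

# Most frequent "No" per family, broadcast back to every row of that family.
# mode() returns tied values in ascending order, so .iat[0] picks the smallest.
family_mode = df.groupby('Family')['No'].transform(lambda s: s.mode().iat[0])
print(family_mode.tolist())

# Keep rows whose "No" equals the family mode, then the first such row per family.
result = df[df['No'].eq(family_mode)].drop_duplicates('Family')
print(result)
```

Note the tie in family a (9 and 6 each appear once): mode() resolves it to 6, which is why row E is returned rather than row A.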
I want to create a loop that loads all the iterations of two variables into a dataframe in separate columns. I want variable "a" to hold values between 0 and 1 in 0.1 increments, and the same for variable "b". In other words there should be 100 iterations when complete, starting with 0 & 0, and ending with 1 & 1.
I've tried the following code
data = [['Decile_1', 10], ['Decile_2', 15], ['Decile_3', 14]]
staging_table = pd.DataFrame(data, columns = ['Decile', 'Volume'])
profile_table = pd.DataFrame(columns = ['Decile', 'Volume'])
a = 0
b = 0
finished = False
while not finished:
    if b != 1:
        if a != 1:
            a = a + 0.1
            staging_table['CAM1_Modifier'] = a
            staging_table['CAM2_Modifier'] = b
            profile_table = profile_table.append(staging_table)
        else:
            b = b + 0.1
    else:
        finished = True
profile_table
You can use itertools.product to get all the combinations:
import itertools
import pandas as pd
x = [i / 10 for i in range(11)]
df = pd.DataFrame(
list(itertools.product(x, x)),
columns=["a", "b"]
)
# a b
# 0 0.0 0.0
# 1 0.0 0.1
# 2 0.0 0.2
# ... ... ...
# 118 1.0 0.8
# 119 1.0 0.9
# 120 1.0 1.0
#
# [121 rows x 2 columns]
itertools is your friend.
from itertools import product
for a, b in product(map(lambda x: x / 10, range(11)),
                    map(lambda x: x / 10, range(11))):
    ...
range(11) gives us the integers from 0 to 10 (regrettably, range does not accept floats). Then we divide those values by 10 to get your range from 0 to 1. Then we take the Cartesian product of that iterable with itself to get every combination.
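Putting the product idea together with the staging table from the question, a possible sketch is below. It assumes pandas 1.2 or later for merge(how='cross'), and the variable names steps and modifiers are mine. Building the steps from integers also avoids the trap in the original while loop, where 0.1 added ten times does not compare equal to 1 in floating point.

```python
from itertools import product
import pandas as pd

staging_table = pd.DataFrame(
    [['Decile_1', 10], ['Decile_2', 15], ['Decile_3', 14]],
    columns=['Decile', 'Volume'],
)

# 0.0, 0.1, ..., 1.0 built from integers, so no floating-point drift accumulates.
steps = [i / 10 for i in range(11)]
modifiers = pd.DataFrame(list(product(steps, steps)),
                         columns=['CAM1_Modifier', 'CAM2_Modifier'])

# Cross join: every staging row is paired with every (a, b) combination.
profile_table = staging_table.merge(modifiers, how='cross')
print(profile_table.shape)
```

With 11 values per modifier there are 121 combinations, so the three staging rows expand to 363 rows.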
I have a dataframe:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to iterate through each row, multiply each df element by the matching list element, select the top 2 scores, and build a new list holding the corresponding column names.
For example, in the first row we have:
1*0.91=0.91, 0*5=0, 1*2=2
so the top 2 columns are a and c, and we append them to a new list.
Second row:
2*0.91=1.82, 3*5=15, 1*2=2
so list=[a,c,b]
and so on...
Third row:
4*0.91=3.64, 5*5=25, 1*2=2
so the list remains unchanged: [a,c,b]
So the final output is [a,c,b].
If I understand you correctly, I think the previous answers are incomplete, so here is a solution. It involves numpy (imported as np), which I hope is acceptable.
Create the dataframe and the weights:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = {a: b for a, b in n}
weights = [d[i] for i in df.columns]
Then we create a table with the weights multiplied in:
df = df*weights
This yields:
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Then we can get top two indices for this in numpy:
b = np.argsort(df.values,axis=1)
b = b[:,-2:]
This yields:
array([[0, 2],
[2, 1],
[0, 1]], dtype=int64)
Finally we can calculate the order of appearance and give back column names:
c =b.reshape(-1)
_, idx = np.unique(c, return_index=True)
d = c[np.sort(idx)]
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']
Try this :
dict1 = {'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]} # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k : [j*v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this :
a b c
0 0.91 0 2
1 1.82 15 2
2 3.64 25 2
"""
IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m= df.mul(pd.DataFrame(a).set_index(0)[1])
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Applying rank on each row and taking the sum, then sorting and taking the index, gives your desired output.
m.rank(axis=1,method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
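For comparison, here is a plain-loop sketch that follows the question's description step by step: take the top 2 weighted columns of each row and append any name not yet seen. The question does not say in which order the two names within a single row should be appended, so this assumes column order; selected and top2 are names I picked.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
weights = pd.Series(dict([('a', 0.91), ('b', 5), ('c', 2)]))

weighted = df.mul(weights)  # aligns the weights with the column labels

selected = []
for _, row in weighted.iterrows():
    top2 = set(row.nlargest(2).index)
    # Assumption: within a row, names are appended in column order.
    for col in weighted.columns:
        if col in top2 and col not in selected:
            selected.append(col)
print(selected)
```

This is slower than the argsort version on large frames, but it mirrors the row-by-row walk in the question directly.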