I have a dataframe with several columns from which I want to extract one row per "family" of individuals: the row with that family's most frequent number ("No"). I have tested this with a for-loop that seems to work, but being a newbie I wanted to know if there is a shorter/smarter way of doing it.
Here is a short example code:
import pandas as pd
ind = [('A', 'a', 0.1, 9),
       ('B', 'b', 0.6, 10),
       ('C', 'b', 0.4, 10),
       ('D', 'b', 0.2, 7),
       ('E', 'a', 0.9, 6),
       ('F', 'b', 0.7, 11)]
df = pd.DataFrame(ind, columns=['Name', 'Family', 'Prob', 'No'])
res = pd.DataFrame(columns=df.columns)
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()  # most frequent 'No' in this family
    idx = g['No'] == v
    si = g[idx].iloc[0]                  # first row carrying that value
    res = res.append(si)
print(res)
I have looked at several examples that do some of this, but with those I can only get the "Family" and "No" columns, not the whole row...
Here is an alternative using duplicated and GroupBy.transform with mode:
c = df['No'].eq(df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]))  # rows holding their family's modal 'No'
c1 = df[['Family','No']].duplicated()  # True for later repeats of the same (Family, No) pair
output = df[c & ~c1]
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
Use GroupBy.transform with the first mode, then filter, and finally remove duplicates with DataFrame.drop_duplicates:
df1 = (df[df.groupby('Family')['No'].transform(lambda x: x.mode().iat[0]).eq(df['No'])]
         .drop_duplicates(['Family','No']))
print(df1)
Name Family Prob No
1 B b 0.6 10
4 E a 0.9 6
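A side note on the loop in the question: DataFrame.append was deprecated and then removed in pandas 2.0, so on current versions the per-family rows can be collected in a list and concatenated once at the end. A minimal sketch of the same loop:

rows = []
for name, g in df.groupby('Family'):
    v = g['No'].value_counts().idxmax()   # most frequent 'No' in this family
    rows.append(g[g['No'] == v].iloc[0])  # first row carrying that value
res = pd.concat(rows, axis=1).T           # each element is a Series, so concat along columns and transpose
print(res)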
I have several lists that are generated from a get_topic() function. That is,
list1 = get_topic(1)
list2 = get_topic(2)
and dozens of other lists.
# The list contains something like
[('A', 0.1),('B', 0.2),('C',0.3)]
I am trying to write a loop so that all different lists can be saved to different columns in a dataframe. The code I tried was:
for i in range(1, number):  # number is the total number of lists + 1
    df_02 = pd.DataFrame(get_topic(i))
This only returns list1, but none of the other lists. The result that I would like to get is something like:
List 1  Number 1  List 2  Number 2
A       0.1       D       0.03
B       0.2       E       0.04
C       0.3       F       0.05
Could anyone help me to correct the loop? Thank you.
df = pd.DataFrame()
for i in range(1, number):
    df[f'List {i}'], df[f'Number {i}'] = zip(*get_topic(i))
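For reference, zip(*pairs) transposes a list of pairs into two tuples, which is exactly what feeds the two columns above:

pairs = [('A', 0.1), ('B', 0.2), ('C', 0.3)]
letters, numbers = zip(*pairs)
print(letters)  # ('A', 'B', 'C')
print(numbers)  # (0.1, 0.2, 0.3)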
You are creating a new DataFrame at every iteration.
This will create a structure similar to what you want:
df = pd.DataFrame([get_topic(i) for i in range(1, number)])
df = df.apply(pd.Series.explode).reset_index(drop=True)
df = df.transpose()
Result:
0 1 2 3 4 5
0 A 0.1 D 0.1 G 0.1
1 B 0.2 E 0.2 H 0.2
2 C 0.3 F 0.3 I 0.3
One-liner version:
df = pd.DataFrame([get_topic(i) for i in range(1, number)]).apply(pd.Series.explode).reset_index(drop=True).transpose()
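If the 'List i' / 'Number i' headers from the question are wanted, the integer columns can be renamed afterwards (my addition, assuming all topics have the same length):

df.columns = [f'{label} {i}' for i in range(1, number) for label in ('List', 'Number')]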
I reconstruct a hypothetical get_topic() function that simply fetches a list from a list of lists.
The idea is to use pd.concat() in order to concatenate dataframes at each iteration.
import pandas as pd

topics = [
    [('A', 0.1), ('B', 0.2), ('C', 0.3)],
    [('D', 0.3), ('E', 0.4), ('F', 0.5)]
]
number = len(topics)

def get_topic(index) -> list:
    return topics[index]

if __name__ == '__main__':
    df = pd.DataFrame()
    for i in range(0, number):  # number is the total number of lists
        curr_topic = get_topic(i)
        curr_columns = ['List ' + str(i+1), 'Number ' + str(i+1)]
        df = pd.concat([df, pd.DataFrame(data=curr_topic, columns=curr_columns)], axis=1)
    print(df)
Output will be:
List 1 Number 1 List 2 Number 2
0 A 0.1 D 0.3
1 B 0.2 E 0.4
2 C 0.3 F 0.5
Here is some example data:
data = {'Company': ['A', 'B', 'C', 'D', 'E', 'F'],
'Value': [18700, 26000, 44500, 32250, 15200, 36000],
'Change': [0.012, -0.025, -0.055, 0.06, 0.035, -0.034]
}
df = pd.DataFrame(data, columns = ['Company', 'Value', 'Change'])
df
Company Value Change
0 A 18700 0.012
1 B 26000 -0.025
2 C 44500 -0.055
3 D 32250 0.060
4 E 15200 0.035
5 F 36000 -0.034
I would like to create a new column called 'New Value'. The logic for this column is something along the lines of the following for each row:
if Change > 0, then Value + (Value * Change)
if Change < 0, then Value - (Value * (abs(Change)) )
I attempted to create a list with the following loop and add it to df as a new column, but far more values than expected were returned; I expected only six (one per row of df).
lst = []
for x in df['Change']:
    for y in df['Value']:
        if x > 0:
            lst.append(y + (y*x))
        elif x < 0:
            lst.append(y - (y*(abs(x))))
print(lst)
It would be great if someone could point out where I've gone wrong, or suggest an alternate method :)
Your two conditions are actually identical, so this is all you need to do:
df['New Value'] = df['Value'] + df['Value'] * df['Change']
Output:
>>> df
Company Value Change New Value
0 A 18700 0.012 18924.4
1 B 26000 -0.025 25350.0
2 C 44500 -0.055 42052.5
3 D 32250 0.060 34185.0
4 E 15200 0.035 15732.0
5 F 36000 -0.034 34776.0
Or, slightly more concisely:
df['New Value'] = df['Value'] * df['Change'].add(1)
Or
df['New Value'] = df['Value'].mul(df['Change'].add(1))
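As an aside, the nested loop in the question produced len(df)**2 values because every Change was paired with every Value. And if the two branches genuinely differed, numpy.where would apply the row-wise condition without any Python loop; here is a sketch spelling out the question's two rules (which, as noted, coincide):

import numpy as np

df['New Value'] = np.where(df['Change'] > 0,
                           df['Value'] + df['Value'] * df['Change'],        # Change > 0
                           df['Value'] - df['Value'] * df['Change'].abs())  # Change <= 0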
I have a dataframe:
df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to iterate through each row, multiply each df element by the matching list element, select the top 2 scores in that row, and append their column names (in order of first appearance) to a new list.
For example, in the first row we have
1*0.91=0.91, 0*5=0, 1*2=2
therefore the top 2 columns are a and c, so we append them to the new list.
Second row:
2*0.91=1.82, 3*5=15, 1*2=2
therefore the list becomes [a, c, b].
And so on...
Third row:
4*0.91=3.64, 5*5=25, 1*2=2
so the list remains unchanged: [a, c, b].
So the final output is [a, c, b].
If I understand you correctly, I think the previous answers are incomplete, so here is a solution. It uses numpy, which I hope is acceptable.
Create the weights:
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = {a: b for a, b in n}             # or simply dict(n)
weights = [d[i] for i in df.columns]
Then we create a table with weights multiplied in:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
df = df*weights
This yields:
      a     b    c
0  0.91   0.0  2.0
1  1.82  15.0  2.0
2  3.64  25.0  2.0
Then we can get the top-two column indices per row with numpy:
import numpy as np

b = np.argsort(df.values, axis=1)  # ascending sort order of each row
b = b[:, -2:]                      # positions of the two largest values per row
This yields:
array([[0, 2],
       [2, 1],
       [0, 1]], dtype=int64)
Finally we can calculate the order of appearance and give back column names:
c = b.reshape(-1)                        # flatten row by row
_, idx = np.unique(c, return_index=True)
d = c[np.sort(idx)]                      # unique indices in order of first appearance
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']
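The deduplication step can also be written with dict.fromkeys, which preserves first-appearance order directly; a small variation of the same idea:

order = list(dict.fromkeys(b.reshape(-1)))  # [0, 2, 1]
print(list(df.columns[order]))              # ['a', 'c', 'b']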
Try this:
dict1 = {'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]} # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k : [j*v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this:
      a   b  c
0  0.91   0  2
1  1.82  15  2
2  3.64  25  2
"""
IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m = df.mul(pd.DataFrame(a).set_index(0)[1])
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Applying rank on each row and taking the sum, then sorting and taking the index, gives your desired output.
m.rank(axis=1,method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
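A small variation (my addition, not from the original answer): the weights can also be aligned with a plain Series built from the list of tuples, since mul matches the Series index against the columns:

m = df.mul(pd.Series(dict(a)))  # dict(a) maps column name -> weight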
This code:
import numpy as np
import pandas as pd
df = pd.DataFrame([['stop' , '1'], ['a1' , '2'], ['a1' , '3'], ['stop' , '4'], ['a2' , '5'], ['wildcard' , '6']] , columns=['a' , 'b'])
print(df)
prints:
a b
0 stop 1
1 a1 2
2 a1 3
3 stop 4
4 a2 5
5 wildcard 6
I'm attempting to create a new dataframe where, each time stop is encountered, a new row is created whose 'a' value is 'stop' and whose 'b' value is the list of (a, b) tuples built from the rows that follow, up to the next stopping row. So for the df above, the transformed df_post structure is:
df_post = pd.DataFrame([['stop', [('a1', '2'), ('a1', '3')]], ['stop', [('a2', '5')]]], columns=['a', 'b'])
print(df_post)
a b
0 stop [(a1, 2), (a1, 3)]
1 stop [(a2, 5)]
wildcard is also a stopping condition: if it is encountered, a new row is inserted into df_post as before.
Here is what I have so far:
df['stop_loc'] = ((df['a'] == 'stop') | (df['a'] == 'wildcard')).cumsum()
df_new = (df[(df['a'] != 'stop') & (df['stop_loc'] != df['stop_loc'].max())]
          .groupby('stop_loc')
          .apply(lambda x: list(zip(x.a, x.b))))
df_new
which renders:
stop_loc
1 [(a1, 2), (a1, 3)]
2 [(a2, 5)]
dtype: object
The 'stop' value is not inserted as a row. How can I modify this so that the dataframe produced is
a b
0 stop [(a1, 2), (a1, 3)]
1 stop [(a2, 5)]
instead of:
stop_loc
1 [(a1, 2), (a1, 3)]
2 [(a2, 5)]
dtype: object
You are filtering out the stop rows with df['a'] != 'stop'. Here is an alternative:
# df['stop_loc'] = ( (df['a'] == 'stop') | (df['a'] == 'wildcard') ).cumsum()
df['stop_loc'] = df['a'].isin(['stop', 'wildcard']).cumsum()
def zip_entries(x):
    return list(x.a)[0], list(zip(x.a[1:], x.b[1:]))

df_new = (df[(df['stop_loc'] != df['stop_loc'].max())]
          .groupby('stop_loc')
          .apply(zip_entries)
          .apply(pd.Series))
print(df_new)
#              0                   1
# stop_loc
# 1         stop  [(a1, 2), (a1, 3)]
# 2         stop           [(a2, 5)]
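To match the requested layout exactly (columns a and b with a default integer index), the result can be relabelled afterwards:

df_new.columns = ['a', 'b']
df_new = df_new.reset_index(drop=True)
print(df_new)
#       a                   b
# 0  stop  [(a1, 2), (a1, 3)]
# 1  stop           [(a2, 5)]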
I have a Pandas data frame with some categorical variables, something like this:
>>df
'a', 'x'
'a', 'y'
Now, I want to return a matrix with the conditional probabilities of each level appearing with every other level. For the data frame above, it would look like:
[[1, 0.5, 0.5],
 [1, 1, 0],
 [1, 0, 1]]
The three entries correspond to the levels 'a', 'x' and 'y'.
This is because conditional on the first column being 'a', the probabilities of 'x' and 'y' appearing are 0.5 each and so on.
I have some code that does this (below). However, the problem is that it is excruciatingly slow. So slow that the application I want to use it in times out. Does anyone have any tips to make it faster?
import numpy as np
import pandas as pd

df = pd.read_csv('pathToData.csv')
df = df.fillna("null")
cols = 0
col_levels = []
columns = {}
num = 0
# map every column level to a row/column index in the result matrix
for i in df.columns:
    cols += len(set(df[i]))
    col_levels.append(np.sort(list(set(df[i]))))
    for j in np.sort(list(set(df[i]))):
        columns[i + '_' + str(j)] = num
        num += 1
res = np.eye(cols)
for i in range(len(df.columns)):
    for j in range(len(df.columns)):
        if i != j:
            row_feature = df.columns[i]
            col_feature = df.columns[j]
            rowLevels = col_levels[i]
            colLevels = col_levels[j]
            for ii in rowLevels:
                for jj in colLevels:
                    frst = (df[row_feature] == ii) * 1
                    scnd = (df[col_feature] == jj) * 1
                    prob = sum(frst * scnd) / (sum(frst) + 1e-9)
                    frst_ind = columns[row_feature + '_' + ii]
                    scnd_ind = columns[col_feature + '_' + jj]
                    res[frst_ind, scnd_ind] = prob
EDIT: Here is a bigger example:
>>df
'a', 'x', 'l'
'a', 'y', 'l'
'b', 'x', 'l'
The distinct categories here are 'a', 'b', 'x', 'y' and 'l'. Since there are 5 categories, the output matrix should be 5x5. The entry in the first row, first column is how often 'a' appears conditional on 'a'. This is, of course, 1 (as are all the diagonal entries). The first row, second column is the probability of 'b' conditional on 'a'. Since 'a' and 'b' are from the same column, this is zero. The first row, third column is the probability of 'x' conditional on 'a'. We see that 'a' appears twice but only once with 'x', so this probability is 0.5. And so on.
The way I approach the problem is to first calculate all unique levels in the dataset, then loop through the cartesian product of those levels. At each step, filter the dataset to create a subset where the condition is True, then count the number of rows in the subset where the event has happened. Below is my code.
import pandas as pd
from itertools import product
from collections import defaultdict

df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l']
})

levels = df.stack().unique()
res = defaultdict(dict)
for event, cond in product(levels, levels):
    # create a subset of rows with at least one element equal to cond
    conditional_set = df[(df == cond).any(axis=1)]
    conditional_set_size = len(conditional_set)
    # count the rows in the subset where at least one element equals event
    conditional_event_count = (conditional_set == event).any(axis=1).sum()
    res[event][cond] = conditional_event_count / conditional_set_size

result_df = pd.DataFrame(res)
print(result_df)
# OUTPUT
# a b l x y
# a 1.000000 0.000000 1.0 0.500000 0.500000
# b 0.000000 1.000000 1.0 1.000000 0.000000
# l 0.666667 0.333333 1.0 0.666667 0.333333
# x 0.500000 0.500000 1.0 1.000000 0.000000
# y 1.000000 0.000000 1.0 0.000000 1.000000
I am sure there are other faster methods, but it is the first thing that comes to my mind.
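For larger data, here is a vectorized sketch (my assumption of what a faster method could look like, not benchmarked): one-hot encode every row, take a matrix product to get co-occurrence counts, then divide each column by that level's row count.

import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b'],
    'col2': ['x', 'y', 'x'],
    'col3': ['l', 'l', 'l']
})

# indicator matrix: one row per original row, one column per level
ind = pd.get_dummies(df.stack()).groupby(level=0).max().astype(int)
co = ind.T @ ind                         # co.loc[i, j] = rows containing both levels i and j
probs = co.div(ind.sum(axis=0), axis=1)  # column j becomes P(level i | level j)
print(probs.T)                           # transposed to match the orientation of result_df above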