Adding two numeric pandas columns with different lengths based on condition - python

I am writing a piece of simulation software in Python using pandas. Here is my problem:
Imagine you have two pandas dataframes dfA and dfB with numeric columns A and B respectively.
Both dataframes have a different number of rows, denoted by n and m.
Let's assume that n > m.
Moreover, dfA includes a binary column C, which contains m ones and zeros elsewhere.
Assume both dfA and dfB are sorted.
My question is: going in order, I want to add the values in B to the values in column A wherever column C == 0.
In the example n = 6, m = 3.
Example data:
import pandas as pd

dataA = {'A': [7, 7, 7, 7, 7, 7],
         'C': [1, 0, 1, 0, 0, 1]}
dfA = pd.DataFrame(dataA)
dfB = pd.DataFrame([3, 5, 4], columns=['B'])
Example pseudocode (this does NOT work):
if dfA['C'] == 1:
    dfD['D'] = dfA['A']
else:
    dfD['D'] = dfA['A'] + dfB['B']
Expected result:
dfD['D']
[7,10,7,12,11,7]
I can only think of obscure for loops with index counters for each of the three vectors, but I am sure that there is a faster way by writing a function and using apply. But maybe there is something completely different that I am missing.
NOTE: In the real problem the rows are not single values but row vectors of equal length. Moreover, in the real problem it is not simple addition but a weighted average over the two row vectors.

You can use:
m = dfA['C'].eq(1)
dfA['C'] = dfA['A'].where(m, dfA['A'] + dfB['B'].set_axis(dfA.index[~m]))
Or:
dfA.loc[m, 'C'] = dfA.loc[m, 'A']
dfA.loc[~m, 'C'] = dfA.loc[~m, 'A'] + dfB['B'].values
Output:
A C
0 7 7
1 7 10
2 7 7
3 7 12
4 7 11
5 7 7
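Regarding the note at the end of the question (row vectors instead of single values, and a weighted average instead of a sum), the same boolean-mask idea extends naturally. Below is a minimal sketch in which the extra value columns, the weight w, and the blending rule are purely illustrative assumptions, not part of the original example:
import pandas as pd

w = 0.7                                   # hypothetical blending weight
value_cols = ['A1', 'A2']                 # hypothetical "row vector" columns
dfA = pd.DataFrame({'A1': [7, 7, 7, 7, 7, 7],
                    'A2': [1, 1, 1, 1, 1, 1],
                    'C':  [1, 0, 1, 0, 0, 1]})
dfB = pd.DataFrame({'B1': [3, 5, 4], 'B2': [2, 2, 2]})

m = dfA['C'].eq(1)                        # rows that stay untouched
dfD = dfA[value_cols].copy()
# blend the C == 0 rows of dfA with the rows of dfB, in order
dfD.loc[~m] = w * dfA.loc[~m, value_cols].to_numpy() + (1 - w) * dfB.to_numpy()
print(dfD)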

The other answer is pretty clever. I am just showing a different way in case you would like to do it using loops:
# Create an empty df
dfD = pd.DataFrame()
# Loop over dfA, keeping a separate position counter k for dfB
k = 0
for i in range(len(dfA)):
    if dfA.loc[i, "C"] == 1:
        dfD.loc[i, "D"] = dfA.loc[i, "A"]
    else:
        dfD.loc[i, "D"] = dfA.loc[i, "A"] + dfB.loc[k, "B"]
        k = k + 1
# Show results
dfD
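For a quick sanity check, the loop can be run on the example data from the question; depending on the pandas version the column may come back as floats because the frame is grown cell by cell:
print(dfD['D'].tolist())   # should correspond to [7, 10, 7, 12, 11, 7]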

Related

Pandas dataframe, merge by intersection of spans?

I would like to merge two dataframes based on overlap of spans (indicated by pairs (s, e), where s is the start of a span and e is its end), and while I have some pretty bad code for doing it, I would like to know if there is a good way to implement it. Here is an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'s': [0, 10, 20, 33, 424, 5345],
                    'e': [3, 17, 30, 39, 1000, 10987],
                    'data1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'s': [1, 45, 0],
                    'e': [50, 46, 90],
                    'data2': [1, 2, 3]})

def overlap(a1, a2, b1, b2):
    if type(b1) == list or type(b1) == np.ndarray:
        assert len(b1) == len(b2)
        return np.asarray([overlap(a1, a2, b1[k], b2[k]) for k in range(len(b1))])
    else:
        return max((a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1, 0)

overlaps = [overlap(df1['s'].iloc[i], df1['e'].iloc[i], df2['s'].values, df2['e'].values) > 0
            for i in range(len(df1))]
df1['data2'] = [df2['data2'][o].tolist() for o in overlaps]
Output is:
s e data1 data2
0 0 3 1 [1, 3]
1 10 17 2 [1, 3]
2 20 30 3 [1, 3]
3 33 39 4 [1, 3]
4 424 1000 5 []
5 5345 10987 6 []
Edit: also, in my particular case I am guaranteed that the df1 spans are non-overlapping and sequential (i.e. s[i] > s[i-1], e[i] > s[i], e[i] < s[i+1]).
Edit 2: you can generate an arbitrary amount of almost-valid fake data (here there is no guarantee that the spans in the first df are non-overlapping):
N = int(1e3)
sdf1 = np.random.randint(0, high=10*N, size=(N,))
sdf1.sort()
edf1 = sdf1 + np.random.randint(1, high=10, size=(N,))
data1 = range(N)
sdf2 = np.random.randint(0, high=10*N, size=(N,))
edf2 = sdf2 + np.random.randint(1, high=10, size=(N,))
data2 = range(N)
df1 = pd.DataFrame({'s': sdf1, 'e': edf1, 'data1': data1})
df2 = pd.DataFrame({'s': sdf2, 'e': edf2, 'data2': data2})
When it comes to pandas dataframes, you should generally avoid explicit for loops over rows/columns and use apply, transform, or other pandas functions instead. For example, to get the overlaps you can do:
def has_overlap(a1, a2, b1, b2):
    '''Return True if the spans overlap, otherwise return False.'''
    return (a2 - a1) + (b2 - b1) + min(a1, b1) - max(b2, a2) + 1 > 0

def find_overlap(row1):
    '''Return the data2 values of the df2 rows whose span overlaps the given df1 row, as a list.'''
    df2['has_overlap'] = df2.apply(lambda row2: has_overlap(row1.s, row1.e, row2.s, row2.e), axis=1)
    return list(df2['data2'].loc[df2['has_overlap']])

df1['data2'] = df1.apply(lambda row: find_overlap(row), axis=1)
print('df1: {}'.format(df1))
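That said, the nested apply still does O(n*m) Python-level work per call. A hedged alternative, assuming a pandas version that provides IntervalIndex and treating the endpoints as inclusive (to match the +1 in the overlap formula), would be:
# build closed intervals for df2 once, then test each df1 span against all of them
iv2 = pd.IntervalIndex.from_arrays(df2['s'], df2['e'], closed='both')
df1['data2'] = [df2['data2'][iv2.overlaps(pd.Interval(s, e, closed='both'))].tolist()
                for s, e in zip(df1['s'], df1['e'])]
If the guarantee from the first edit holds (df1 spans sorted and non-overlapping), each df2 span can only overlap one contiguous block of df1 rows, so two np.searchsorted calls per df2 row would locate that block in O(log n) instead of scanning everything.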

fill in entire dataframe cell by cell based on index AND column names?

I have a dataframe where the row indices and column headings should determine the content of each cell. I'm working with a much larger version of the following df:
df = pd.DataFrame(index = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'],
columns = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
Specifically, I want to apply the custom function edit_distance() or equivalent (see here for function code) which calculates a difference score between two strings. The two inputs are the row and column names. The following works but is extremely slow:
for seq in df.index:
    for seq2 in df.columns:
        df.loc[seq, seq2] = edit_distance(seq, seq2)
This produces the result I want:
              ae  azde  afgle  arlde  afghijklbcmde
afghijklde     8     7      5      6              3
afghijklmde    9     8      6      7              2
ade            1     1      3      2             10
afghilmde      7     6      4      5              4
amde           2     1      3      2              9
What is a better way to do this, perhaps using applymap()? Everything I've tried with applymap(), apply, or df.iterrows() has returned errors of the kind AttributeError: 'float' object has no attribute 'index'. Thanks.
Turns out there's an even better way to do this. onepan's dictionary comprehension answer below is good but returns the df index and columns in random order. Using a nested .apply() accomplishes the same thing at about the same speed and doesn't change the row/column order. The key is not to get hung up on naming the df's rows and columns first and filling in the values second. Instead, do it the other way around, initially treating the future index and columns as standalone pandas Series.
series_rows = pd.Series(['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde'])
series_cols = pd.Series(['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde'])
df = pd.DataFrame(series_rows.apply(lambda x: series_cols.apply(lambda y: edit_distance(x, y))))
df.index = series_rows
df.columns = series_cols
You could use comprehensions, which speeds it up roughly 4.5x on my PC:
first = ['afghijklde', 'afghijklmde', 'ade', 'afghilmde', 'amde']
second = ['ae', 'azde', 'afgle', 'arlde', 'afghijklbcmde']
pd.DataFrame.from_dict({f: {s: edit_distance(f, s) for s in second} for f in first}, orient='index')
# output
#                ae  azde  afgle  arlde  afghijklbcmde
# ade             1     1      3      2             10
# afghijklde      8     7      5      6              3
# afghijklmde     9     8      6      7              2
# afghilmde       7     6      4      5              4
# amde            2     1      3      2              9
# this matches edit_distance('ae', 'afghijklde') == 8, e.g.
Note that I used this code for edit_distance (the first response in your link):
def edit_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]
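As a quick, purely illustrative sanity check, this helper reproduces the value quoted in the question's table:
assert edit_distance('ae', 'afghijklde') == 8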

Pandas assign label based on index value

I have a dataframe with an index and multiple columns. Secondly, I have a few lists containing index values sampled according to certain criteria. Now I want to create columns with labels based on whether or not the index of a certain row is present in a specified list.
Now there are two situations where I am using it:
1) To create a column and give labels based on one list:
df['1_name'] = df.index.map(lambda ix: 'A' if ix in idx_1_model else 'B')
2) To create a column and give labels based on multiple lists:
def assignLabelsToSplit(ix_, random_m, random_y, model_m, model_y):
    if (ix_ in random_m) or (ix_ in model_m):
        return 'A'
    if (ix_ in random_y) or (ix_ in model_y):
        return 'B'
    else:
        return 'not_assigned'

df['2_name'] = df.index.map(lambda ix: assignLabelsToSplit(ix, idx_2_random_m, idx_2_random_y, idx_2_model_m, idx_2_model_y))
This is working, but it is quite slow. Each call takes about 3 minutes, and considering I have to execute the functions multiple times, it needs to be faster.
Thank you for any suggestions.
I think you need a double numpy.where with Index.isin:
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
Sample:
np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(10,1)), columns=['A'])
#print (df)
random_m = [0,1]
random_y = [2,3]
model_m = [7,4]
model_y = [5,6]
print (type(random_m))
<class 'list'>
print (random_m + model_m)
[0, 1, 7, 4]
print (random_y + model_y)
[2, 3, 5, 6]
df['2_name'] = np.where(df.index.isin(random_m + model_m), 'A',
                        np.where(df.index.isin(random_y + model_y), 'B', 'not_assigned'))
print (df)
A 2_name
0 8 A
1 8 A
2 3 B
3 7 B
4 7 A
5 0 B
6 4 B
7 2 A
8 5 not_assigned
9 2 not_assigned
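Two hedged follow-ups on the same idea: the first situation from the question (a single list, here idx_1_model as named in the question) reduces to one isin call, and np.select expresses the two-condition case without nesting the np.where calls:
# situation 1: a single membership test (assumes the question's idx_1_model list)
df['1_name'] = np.where(df.index.isin(idx_1_model), 'A', 'B')

# situation 2: np.select avoids nesting
conditions = [df.index.isin(random_m + model_m),
              df.index.isin(random_y + model_y)]
df['2_name'] = np.select(conditions, ['A', 'B'], default='not_assigned')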

How to copy unique keys and values from another dictionary in Python

I have a dataframe df with transactions where the values in the column Col can be repeated. I use a Counter, dictionary1, to count the frequency of each Col value; then I would like to run a for loop on a subset of the data and obtain a value pit. I want to create a new dictionary dict1 where the key is the key from dictionary1 and the value is the value of pit. This is the code I have so far:
from collections import Counter, defaultdict

dictionary1 = Counter(df['Col'])
dict1 = defaultdict(int)
for i in range(len(dictionary1)):
    temp = df[df['Col'] == list(dictionary1.keys())[i]]
    b = temp['IsBuy'].sum()
    n = temp['IsBuy'].count()
    pit = b / n
    dict1[list(dictionary1.keys())[i]] = pit
My question is: how can I assign the key and value for dict1 based on the key of dictionary1 and the value obtained from the calculation of pit? In other words, what is the correct way to write the last line of code in the above script?
Thank you.
Since you're using pandas, I should point out that the problem you're facing is common enough that there's a built-in way to do it. We call collecting "similar" data into groups and then performing operations on them a groupby operation. It's probably worthwhile reading the tutorial section on the groupby split-apply-combine idiom -- there are lots of neat things you can do!
The pandorable way to compute the pit values would be something like
df.groupby("Col")["IsBuy"].mean()
For example:
>>> # make dummy data
>>> N = 10**4
>>> df = pd.DataFrame({"Col": np.random.randint(1, 10, N), "IsBuy": np.random.choice([True, False], N)})
>>> df.head()
Col IsBuy
0 3 False
1 6 True
2 6 True
3 1 True
4 5 True
>>> df.groupby("Col")["IsBuy"].mean()
Col
1 0.511709
2 0.495697
3 0.489796
4 0.510658
5 0.507491
6 0.513183
7 0.522936
8 0.488688
9 0.490498
Name: IsBuy, dtype: float64
which you could turn into a dictionary from a Series if you insisted:
>>> df.groupby("Col")["IsBuy"].mean().to_dict()
{1: 0.51170858629661753, 2: 0.49569707401032703, 3: 0.48979591836734693, 4: 0.51065801668211308, 5: 0.50749063670411987, 6: 0.51318267419962338, 7: 0.52293577981651373, 8: 0.48868778280542985, 9: 0.49049773755656106}
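If the intermediate quantities from the question (b, the sum, and n, the count) are also needed, a single aggregation returns them alongside the mean; a minimal sketch on the same dummy df:
stats = df.groupby("Col")["IsBuy"].agg(["sum", "count", "mean"])
dict1 = stats["mean"].to_dict()   # same dictionary as above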

Pandas dataframe: return row AND column of maximum value(s)

I have a dataframe in which all values are of the same variety (e.g. a correlation matrix -- but where we expect a unique maximum). I'd like to return the row and the column of the maximum of this matrix.
I can get the max across rows or columns by changing the first argument of
df.idxmax()
however I haven't found a suitable way to return the row/column index of the max of the whole dataframe.
For example, I can do this in numpy:
>>>npa = np.array([[1,2,3],[4,9,5],[6,7,8]])
>>>np.where(npa == np.amax(npa))
(array([1]), array([1]))
But when I try something similar in pandas:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>df.where(df == df.max().max())
a b c
d NaN NaN NaN
e NaN 9 NaN
f NaN NaN NaN
At a second level, what I actually want to do is to return the rows and columns of the top n values, e.g. as a Series.
E.g. for the above I'd like a function which does:
>>>topn(df,3)
b e
c f
b f
dtype: object
>>>type(topn(df,3))
pandas.core.series.Series
or even just
>>>topn(df,3)
(['b','c','b'],['e','f','f'])
a la numpy.where()
I figured out the first part:
npa = df.values
cols, indx = np.where(npa == np.amax(npa))
([df.columns[c] for c in cols], [df.index[c] for c in indx])
Now I need a way to get the top n. One naive idea is to copy the array, and iteratively replace the top values with NaN grabbing index as you go. Seems inefficient. Is there a better way to get the top n values of a numpy array? Fortunately, as shown here there is, through argpartition, but we have to use flattened indexing.
def topn(df, n):
    npa = df.values
    topn_ind = np.argpartition(npa, -n, None)[-n:]             # flattened indices, unsorted
    topn_ind = topn_ind[np.argsort(npa.flat[topn_ind])][::-1]  # argsort in descending order
    cols, indx = np.unravel_index(topn_ind, npa.shape, 'F')    # unflatten, using column-major ordering
    return ([df.columns[c] for c in cols], [df.index[i] for i in indx])
Trying this on the example:
>>>df = pd.DataFrame([[1,2,3],[4,9,5],[6,7,8]],columns=list('abc'),index=list('def'))
>>>topn(df,3)
(['b', 'c', 'b'], ['e', 'f', 'f'])
As desired. Mind you the sorting was not originally asked for, but provides little overhead if n is not large.
What you want to use is stack:
df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]], columns=list('abc'), index=list('def'))
df = df.stack()
df = df.sort_values(ascending=False)
df.head(4)
e b 9
f c 8
b 7
a 6
dtype: int64
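A slightly more direct route on recent pandas versions is Series.nlargest, which returns the largest values together with their (row, column) labels; a minimal sketch on the question's data:
df = pd.DataFrame([[1, 2, 3], [4, 9, 5], [6, 7, 8]], columns=list('abc'), index=list('def'))
print(df.stack().nlargest(4))   # same four (row, column) pairs as above, largest first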
I guess for what you are trying to do a DataFrame might not be the best choice, since the idea of the columns in the DataFrame is to hold independent data.
>>> def topn(df, n):
        # pull the data out of the DataFrame
        # and flatten it to an array (column-major order)
        vals = df.values.flatten(order='F')
        # next we sort the array and store the sort mask
        p = np.argsort(vals)
        # create two arrays with the column names and indexes
        # in the same order as vals
        cols = np.array([[col]*len(df.index) for col in df.columns]).flatten()
        idxs = np.array([list(df.index) for idx in df.index]).flatten()
        # sort and return cols and idxs
        return cols[p][:-(n+1):-1], idxs[p][:-(n+1):-1]
>>> topn(df,3)
(array(['b', 'c', 'b'],
dtype='|S1'),
array(['e', 'f', 'f'],
dtype='|S1'))
>>> %timeit(topn(df,3))
10000 loops, best of 3: 29.9 µs per loop
watsonic's solution takes slightly less:
%timeit(topn(df,3))
10000 loops, best of 3: 24.6 µs per loop
but both are way faster than stack:
def topStack(df, n):
    df = df.stack()
    df = df.sort_values(ascending=False)
    return df.head(n)
%timeit(topStack(df,3))
1000 loops, best of 3: 1.91 ms per loop
