I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically what I am trying to achieve is getting the number of times that the price has increased/decreased consecutively. I need this to be rather quick because the dataframes can be quite big.
For example the result should look like :
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
resetting the count
As I realize your example is ambiguous, here is an additional method in case your want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
data = {
'input': [1,2,3,2,1]
}
df = pd.DataFrame(data)
diffs = df['input'].diff()
df['a'] = (df['input'] > df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] > df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] > df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
df['b'] = (df['input'] < df['input'].shift(periods=1, axis=0)).cumsum()-(df['input'] < df['input'].shift(periods=1, axis=0)).astype(int).cumsum() \
.where(~(df['input'] < df['input'].shift(periods=1, axis=0))) \
.ffill().fillna(0).astype(int)
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Coding this manually using numpy might look like this
import numpy as np
input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input))
decrease = np.zeros(len(input))
for i in range(1, len(input)):
if input[i] > input[i-1]:
increase[i] = increase[i-1] + 1
decrease[i] = 0
elif input[i] < input[i-1]:
increase[i] = 0
decrease[i] = decrease[i-1] + 1
else:
increase[i] = 0
decrease[i] = 0
increase # array([0, 1, 2, 0, 0], dtype=int32)
decrease # array([0, 0, 0, 1, 2], dtype=int32)
I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random as random
import pandas as pd
p = 0.5
df = pd.DataFrame()
start = []
for i in range(5)):
if random.random() < p:
start.append("0")
else:
start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times and if the value is 0, change it to 1 with probability p so the results become the new final column. (I am simulating on-off every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this. Append col with name 1, 2, ...
# continue from question code ...
# colname is 1, 2, ...
for col in range(1, 5):
tmp = []
for i in range(5):
# check final col
if df.iloc[i,col-1:col][0] == "0":
if random.random() < p:
tmp.append("0")
else:
tmp.append("1")
else: # == 1
tmp.append("1")
# append new col
df[str(col)] = tmp
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
The title of this question is a little confusing to write out succinctly.
I have pandas df that contains integers and a relevant key Column. When a value is in the key Column is present I want to return the most recent increase in integers from the other Columns.
For the df below, the key Column is [Area]. When X is in [Area], I want to find the most recent increase is integers from Columns ['ST_A','PG_A','ST_B','PG_B'].
import pandas as pd
d = ({
'ST_A' : [0,0,0,0,0,1,1,1,1],
'PG_A' : [0,0,0,1,1,1,2,2,2],
'ST_B' : [0,1,1,1,1,1,1,1,1],
'PG_B' : [0,0,0,0,0,0,0,1,1],
'Area' : ['','','X','','X','','','','X'],
})
df = pd.DataFrame(data = d)
Output:
ST_A PG_A ST_B PG_B Area
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0 X
3 0 1 1 0
4 0 1 1 0 X
5 1 1 1 0
6 1 2 1 0
7 1 2 1 1
8 1 2 1 1 X
I tried to use df = df.loc[(df['Area'] == 'X')] but this returns the rows where X is situated. I need something that uses X to return the most recent row where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B'].
I have also tried:
cols = ['ST_A','PG_A','ST_B','PG_B']
df[cols] = df[cols].diff()
df = df.fillna(0.)
df = df.loc[(df[cols] == 1).any(axis=1)]
This returns all rows where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B']. Not the most recent increase before X in ['Area'].
Intended Output:
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
Does this question make sense or do I need to simplify it?
I believe you can use NumPy here via np.searchsorted:
import numpy as np
increases = np.where(df.iloc[:, :-1].diff().gt(0).max(1))[0]
marks = np.where(df['Area'].eq('X'))[0]
idx = increases[np.searchsorted(increases, marks) - 1]
res = df.iloc[idx]
print(res)
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
Not efficient tho, but works, so big chunk of code which is kinda slow:
indexes=np.where(df['Area']=='X')[0].tolist()
indexes2=list(map((1).__add__,np.where(df[df.columns[:-1]].sum(axis=1) < df[df.columns[:-1]].shift(-1).sum(axis=1).sort_index())[0].tolist()))
l=[]
for i in indexes:
if min(indexes2,key=lambda x: abs(x-i)) in l:
l.append(min(indexes2,key=lambda x: abs(x-i))-2)
else:
l.append(min(indexes2,key=lambda x: abs(x-i)))
print(df.iloc[l].sort_index())
Output:
Area PG_A PG_B ST_A ST_B
1 0 0 0 1
3 1 0 0 1
7 2 1 1 1
I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.
The dataframes are one-hot encoded, so they look like this:
col_1, col_2, col_3, ...
0 1 0
1 0 0
0 0 1
...
I could just iterate through each of the columns using a for loop, but in python that is computationally expensive, and I'm hoping there's an easier way.
One of the dataframes has 500 columns, the other has 100 columns.
This is the fastest version that I've been able to write so far:
interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)
I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?
EDIT: example
df_1:
1col_1, 1col_2, 1col_3
0 1 0
1 0 0
0 0 1
df_2:
2col_1, 2col_2
0 1
1 0
0 0
interact_pd:
1col_1_2col_1, 1col_2_2col_1,1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2,1col_3_2col_2
0 0 0 0 1 0
1 0 0 0 0 0
0 0 0 0 0 0
# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
names=[df_1.columns.name, df_2.columns.name])
# df_1.values[:, pidx[0]] slices df_1 values for every combination
# like wise with df_2.values[:, pidx[1]]
# finally, I marry up the product of arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
columns=lcol)
Timing
code
from string import ascii_letters
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))
def pir1(df_1, df_2):
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
names=[df_1.columns.name, df_2.columns.name])
return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
columns=lcol)
def Test2(DA,DB):
MA = DA.as_matrix()
MB = DB.as_matrix()
MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
Col = []
for i in range(len(MB[0])):
for j in range(len(MA[0])):
MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
return pd.DataFrame(MM,dtype=int,columns=Col)
results
You can multiply along the index axis your first df with each column of the second df, this is the fastest method for big datasets (see below):
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.iteritems()], axis=1)
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
1col_1_2col_1 1col_2_2col_1 1col_3_2col_1 1col_1_2col_2 \
0 0 0 0 0
1 1 0 0 0
2 0 0 0 0
1col_2_2col_2 1col_3_2col_2
0 1 0
1 0 0
2 0 0
--> See benchmark for comparisons with other answers to choose the best option for your dataset.
Benchmark
Functions:
def Test2(DA,DB):
MA = DA.as_matrix()
MB = DB.as_matrix()
MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
Col = []
for i in range(len(MB[0])):
for j in range(len(MA[0])):
MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
return pd.DataFrame(MM,dtype=int,columns=Col)
def Test3(df_1, df_2):
df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.iteritems()], axis=1)
df.columns = ["_".join([i,j]) for j in df_2.columns for i in df_1.columns]
return df
def Test4(df_1,df_2):
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
names=[df_1.columns.name, df_2.columns.name])
return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
columns=lcol)
def jeanrjc_imp(df_1, df_2):
df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.iteritems()], axis=1, keys=df_2.columns)
return df
Code:
Sorry, ugly code, the plot at the end matters :
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_"+str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_"+str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
resa[r] = []
resb[r] = []
resc[r] = []
for i in [5, 10, 30, 50, 150, 200]:
a = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :10])
b = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :50])
c = %timeit -o f(df_1.iloc[:,:i], df_2.iloc[:, :200])
resa[r].append(a.best)
resb[r].append(b.best)
resc[r].append(c.best)
X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16,5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
for i in r:
a.plot(X, r[i], label=i)
a.set_xlabel("df_1 columns #")
a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()
With T3b <=> jeanrjc_imp. Which is a bit faster that Test3.
Conclusion:
Depending on your dataset size, pick the right function, between Test4 and Test3(b). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!
HTH
You can use numpy.
Consider this example code, I did modify the variable names, but Test1() is essentially your code. I didn't bother create the correct column names in that function though:
import pandas as pd
import numpy as np
A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]
DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T
def Test1(DA,DB):
E = pd.DataFrame(index=DA.index)
DAC = [column for column in DA]
for column in DB:
C = DA[DAC].multiply(DB[column], axis="index")
E = E.join(C, lsuffix='_' + str(column))
return E
def Test2(DA,DB):
MA = DA.as_matrix()
MB = DB.as_matrix()
MM = np.zeros((len(MA),len(MA[0])*len(MB[0])))
Col = []
for i in range(len(MB[0])):
for j in range(len(MA[0])):
MM[:,i*len(MA[0])+j] = MA[:,j]*MB[:,i]
Col.append('1col_'+str(i+1)+'_2col_'+str(j+1))
return pd.DataFrame(MM,dtype=int,columns=Col)
print Test1(DA,DB)
print Test2(DA,DB)
Output:
0_1 1_1 2_1 0 1 2 0_3 1_3 2_3 0 1 2 0 1 2
0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
2 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
1col_1_2col_1 1col_1_2col_2 1col_1_2col_3 1col_2_2col_1 1col_2_2col_2 \
0 0 0 0 1 0
1 0 0 0 0 0
2 1 1 0 1 1
3 0 0 0 0 0
1col_2_2col_3 1col_3_2col_1 1col_3_2col_2 1col_3_2col_3 1col_4_2col_1 \
0 0 1 0 0 1
1 0 0 1 1 0
2 0 0 0 0 0
3 0 0 0 0 1
1col_4_2col_2 1col_4_2col_3 1col_5_2col_1 1col_5_2col_2 1col_5_2col_3
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 1 0 0 0
Performance of your function:
%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop
Performance of my function:
%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop
It's not beautiful, but it's efficient.