Python - Alternate solutions to iterrows

I have written the following code to create a dataframe and add new rows and columns based on certain conditions. Unfortunately, it takes a lot of time to execute.
Are there any alternate ways to do this?
Any inputs are highly appreciated.
dfCircuito = None
for index, row in dadosCircuito.iterrows():
    for mes in range(1, 13):
        for nue in range(1, 5):
            for origem in range(1, 3):
                for suprimento in range(1, 3):
                    for tipo in range(1, 3):
                        df = pd.DataFrame(dadosCircuito.iloc[[index]])
                        df['MES'] = mes
                        if nue == 1:
                            df['NUE'] = 'N'
                        elif nue == 2:
                            df['NUE'] = 'C'
                        elif nue == 3:
                            df['NUE'] = 'F'
                        else:
                            df['NUE'] = 'D'
                        if origem == 1:
                            df['Origem'] = 'DISTRIBUICAO'
                        else:
                            df['Origem'] = 'SUBTRANSMISSAO'
                        if suprimento == 1:
                            df['Suprimento'] = 'INTERNO'
                        else:
                            df['Suprimento'] = 'EXTERNO'
                        if tipo == 1:
                            df['TipoOcorrencia'] = 'EMERGENCIAL'
                        else:
                            df['TipoOcorrencia'] = 'PROGRAMADA'
                        dfCircuito = pd.concat([dfCircuito, df], axis=0)

If I understand you correctly, you are trying to add a number of new rows for each row of dadosCircuito. The extra rows are the combinations of mes=1...12; nue=N,C,F,D; ...
You can create a dataframe containing the Cartesian product of these attribute values, then join it back to dadosCircuito:
mes = range(1,13)
nues = list('NCFD')
origems = ['DISTRIBUICAO', 'SUBTRANSMISSAO']
suprimentos = ['INTERNO', 'EXTERNO']
tipos = ['EMERGENCIAL', 'PROGRAMADA']
# Make sure dadosCircuito.index is unique. If not, call a reset_index
# dadosCircuito = dadosCircuito.reset_index()
df = pd.MultiIndex.from_product([dadosCircuito.index, mes, nues, origems, suprimentos, tipos], names=['index', 'MES', 'NUE', 'Origem', 'Suprimento', 'TipoOcorrencia']) \
.to_frame(index=False) \
.set_index('index')
dfCircuito = dadosCircuito.join(df)
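If you are on pandas 1.2 or newer, a cross merge is another way to get the same result without touching the index; a minimal sketch reusing the lists above (the how='cross' argument is the only assumption about your pandas version):
# build one frame with all 12 * 4 * 2 * 2 * 2 attribute combinations
combos = pd.MultiIndex.from_product(
    [mes, nues, origems, suprimentos, tipos],
    names=['MES', 'NUE', 'Origem', 'Suprimento', 'TipoOcorrencia']
).to_frame(index=False)
# pair every row of dadosCircuito with every combination
dfCircuito = dadosCircuito.merge(combos, how='cross')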

Related

How to make a correlation of one column to many columns and return a list?

I would like to create a correlation function between a column and the others, passing in the dataframe with all columns, correlating with a specific column and returning a list of metrics and correlations. I am doing it like this:
correlations = df.corr().unstack().sort_values(ascending=True)
correlations = pd.DataFrame(correlations).reset_index()
correlations.columns = ['corr_matrix', 'dfbase', 'correlation']
correlations.query("corr_matrix == 'venda por m2' & dfbase != 'venda por m2'")
But I would like to know a way to do this with a function.
Something like this should do
def get_nonself_correlation(df, self_name):
    temp = df.corr()
    temp = temp.loc[temp.index != self_name, temp.columns == self_name]
    temp = temp.unstack().reset_index()
    temp.columns = ['corr_matrix', 'dfbase', 'correlation']
    return temp
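A quick usage sketch, reusing the 'venda por m2' column from the question:
correlations = get_nonself_correlation(df, 'venda por m2')
print(correlations.sort_values('correlation', ascending=True))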

Delete repeating rows in a DataFrame based on a condition pandas

I'm trying to delete repeating rows in a data frame based on the following condition:
If the value of the column pagePath is the same as in the previous row and the SessionId is the same, I need this row deleted. If the SessionId is different, then the repeating pagePath shouldn't be deleted. This is what I tried:
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)
i = 0
for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            while data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                data = data.drop(i - 1)
                data = data.reset_index(drop=True)
        except KeyError:
            continue
As you can see, I'm getting the KeyError exception, though I don't think that's a problem, as the code does what it should on a data frame with 1,000 rows. The only problem is that it doesn't work with a larger dataset of 6.5 million rows: it either never finishes, or I get SIGKILL. I am well aware that I shouldn't use a for loop over datasets, but I couldn't find a better solution and would be thankful if you could help me improve my code.
Group by SessionId and pagePath and find the cumulative count of each pair's occurrences; then take the difference of consecutive elements with np.ediff1d and assign it to df['cumcount']. Since we want to filter out consecutive duplicates, we keep only the rows where df['cumcount'] != 1:
cols = df.columns
df['cumcount'] = np.concatenate(([0], np.ediff1d(df.groupby(['SessionId','pagePath']).cumcount())))
out = df.loc[df['cumcount']!=1, cols]
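An alternative, not from the original answer, is to compare each row with the previous one via shift(); it keeps the first row of every consecutive run and assumes the frame is already sorted by SessionId and Datum as above:
# True where the row repeats the previous row's session and page
same_session = df['SessionId'].eq(df['SessionId'].shift())
same_page = df['pagePath'].eq(df['pagePath'].shift())
out = df[~(same_session & same_page)]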
Anyway, as per usual I had to solve this one myself; it wouldn't have been possible without #np8's comment though. For anybody who might be interested:
locations = []
data = data.sort_values(['SessionId', 'Datum'], ascending=True, ignore_index=True)
i = 0
for i, _ in data.iterrows():  # i = index, _ = row
    if i != 0:
        try:
            if data.SessionId[i] == data.SessionId[i - 1] and data.pagePath[i] == data.pagePath[i - 1]:
                locations.append(i)
        except KeyError as e:
            print(e)
            continue
data_cleaned = data.drop(index=locations)
This took 470 seconds on the 6.5 million row DataFrame, which is okay considering the code wasn't finishing at all before.

How to create new columns of last 5 sale price off in dataframe

I have a pandas data frame of sneaker sales, which looks like this:
I added columns last1, ..., last5, indicating the last 5 sale prices of each sneaker, and set them all to None. I'm trying to update the values of these new columns using the 'Sale Price' column. This is my attempt to do so:
for index, row in df.iterrows():
    if index == 0:
        continue
    for i in range(index - 1, -1, -1):
        if df['Sneaker Name'][index] == df['Sneaker Name'][i]:
            df['last5'][index] = df['last4'][i]
            df['last4'][index] = df['last3'][i]
            df['last3'][index] = df['last2'][i]
            df['last2'][index] = df['last1'][i]
            df['last1'][index] = df['Sale Price'][i]
            continue
    if index == 100:
        break
When I ran this, I got a warning,
A value is trying to be set on a copy of a slice from a DataFrame
and the result is also wrong.
Does anyone know what I did wrong?
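That warning typically comes from chained indexing such as df['last5'][index] = ...; assigning through a single .loc call avoids it. A minimal sketch of the pattern:
# one .loc assignment instead of df['last5'][index] = df['last4'][i]
df.loc[index, 'last5'] = df.loc[i, 'last4']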
Also, this is the expected output,
Use this instead of the for loop, if your rows are sorted:
df['last1'] = df['Sale Price'].shift(1)
df['last2'] = df['last1'].shift(1)
df['last3'] = df['last2'].shift(1)
df['last4'] = df['last3'].shift(1)
df['last5'] = df['last4'].shift(1)
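If the previous prices should only come from rows of the same sneaker, as the inner loop in the question suggests, a grouped shift is one option; this is a sketch assuming the frame is sorted chronologically and the column names match the question:
# shift within each sneaker so prices never leak across models
g = df.groupby('Sneaker Name')['Sale Price']
for n in range(1, 6):
    df[f'last{n}'] = g.shift(n)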

Counting the repeated values in one column based on another column

Using pandas, I am dealing with the following CSV data:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- counts of nowin = 0
Column1_name -- t -- count of wins = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea, get dataframe row count based on conditions, I was thinking of doing something like this:
print(df[df.target == 'won'].count())
However, this always returns the same number of 'won' rows based only on the last column, without taking into consideration whether the first column is an 'f' or a 't'. In other words, I was hoping to use something from pandas that would reproduce the idea of a "group by" from SQL, grouping on, for example, the 1st and last columns.
Should I keep pursuing this idea, or should I simply start using for loops?
If you need, the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url, names=[
    'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
    'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
    'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
            'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
            'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won=0
num_of_nowin=0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically, we go over each feature column and make a crosstab with the target column.
Hope this helps! :)
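For a single feature the same idea can also be spelled out directly, either as a crosstab or as a SQL-style group by; a small sketch using the first column:
print(pd.crosstab(df['bkblk'], df['target']))   # counts per (value, target) pair
print(df.groupby(['bkblk', 'target']).size())   # the equivalent "group by" view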

pandas: setting last N rows of multi-index to NaN for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer (How to speed up Pandas multilevel dataframe shift by group?) I can confirm that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry of each group in the multi-index to NaN, so that I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
But I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use similar code to set the last N entries of each group to NaN, but obviously I am missing some important indexing knowledge, as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
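One way to expand those group boundaries into the positions of the last N rows of each group (a sketch against the test frame built below, assuming every group has at least N rows):
N = 2
ends = df.groupby(level=0).size().cumsum().values
# the last N positional indices of every group: end-N, ..., end-1
last_n_pos = np.concatenate([np.arange(end - N, end) for end in ends])
df.iloc[last_n_pos] = np.nan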
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)
df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)
# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
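On recent pandas versions the manual NaN trick may not be needed at all, because GroupBy.shift is implemented natively and is usually fast; a minimal sketch of the same calculation:
# shift within each level-0 group; no temporary NaN masking required
df['tmpShift'] = df.groupby(level=0)['colB'].shift(shiftBy)
df['newColumn'] = df['tmpShift'] / df['colA']
df = df.drop(columns='tmpShift')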
