Capitalize random rows in Pandas DataFrame - python

I'm making a reverse denoising autoencoder and I have a dataset, but it's all lowercased. I want the source entry capitalized in 80% of the rows and only 60% of the target entries capitalized. I wrote this:
import pandas as pd
import torch

df = pd.read_csv('Data/fb_moe.csv')

for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())
    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()
    sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())
    if sample_1 == 1:
        df.iloc[i].x = str(df.iloc[i].x).capitalize()

df.to_csv('Data/fb_moe2.csv')
But this is pretty slow because my CSV has about 8 million rows. Is there a faster way to do this?
Part of the DataFrame:
x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun

Try using some boolean masks and the vectorized string functions; pandas does not behave quickly in for loops:
import numpy as np

n = len(df)

# rows whose 'x' entry gets capitalized (~80%)
source = np.random.binomial(1, p=.8, size=n) == 1

# of those, keep ~60% for the 'y' entry
target = source.copy()
total_source_true = np.sum(source)
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1

df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()
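If the nesting of the 60% inside the 80% above isn't what you're after and the two columns should be decided independently, a simpler sketch with plain uniform draws (same x/y column names as your sample) would be:

import numpy as np
import pandas as pd

df = pd.read_csv('Data/fb_moe.csv')

# one uniform number per row; comparing against the probability
# gives a boolean mask that is True ~80% (or ~60%) of the time
cap_x = np.random.rand(len(df)) < 0.8
cap_y = np.random.rand(len(df)) < 0.6

df.loc[cap_x, 'x'] = df.loc[cap_x, 'x'].str.capitalize()
df.loc[cap_y, 'y'] = df.loc[cap_y, 'y'].str.capitalize()

# index=False avoids writing the extra index column to the output file
df.to_csv('Data/fb_moe2.csv', index=False)

Either way, the whole 8-million-row column is processed in vectorized calls instead of one Python iteration per row.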

Related

Nested loops for dataframes in Python

I'm using nested loops to add new columns with dynamic names based on the dataset's columns (col) and the set of columns left after dropping that one (which I called interact_col). It works well for small datasets, but it becomes very slow for datasets with a very large number of features. Any tips to simplify the process and make it faster?
import numpy as np
import pandas as pd

X = pd.read_csv('water_potability.csv')
X = X.drop(columns='Unnamed: 0')
X_columns = np.array(X.columns)
fi_df = X.copy()
done_list = []

for col in X_columns:
    interact_col = X.drop(columns=col).columns
    for int_col in interact_col:
        fi_df['({})_minus_({})'.format(col, int_col)] = X[col] - X[int_col]
        fi_df['({})_div_({})'.format(col, int_col)] = X[col] / X[int_col]
        if int_col not in done_list:
            fi_df['({})_add_({})'.format(col, int_col)] = X[col] + X[int_col]
            fi_df['({})_multi_({})'.format(col, int_col)] = X[col] * X[int_col]
    done_list.append(col)
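A hedged sketch of one common speed-up, reusing the names above: the per-pair arithmetic is already vectorized, so a likely bottleneck is inserting thousands of columns into fi_df one at a time, which fragments the frame; collecting the new columns in a dict and concatenating once avoids that.

import pandas as pd

new_cols = {}
done = set()
for col in X_columns:
    for int_col in X_columns:
        if int_col == col:
            continue
        new_cols['({})_minus_({})'.format(col, int_col)] = X[col] - X[int_col]
        new_cols['({})_div_({})'.format(col, int_col)] = X[col] / X[int_col]
        if int_col not in done:
            new_cols['({})_add_({})'.format(col, int_col)] = X[col] + X[int_col]
            new_cols['({})_multi_({})'.format(col, int_col)] = X[col] * X[int_col]
    done.add(col)

# build the result in one shot instead of growing fi_df column by column
fi_df = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)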

How can I select data from columns at multiples of 3 in a CSV file?

I would like to plot a number of columns for 2 different scenarios, based on the column index in my dataset, preferably via pandas.DataFrame:
1st scenario: column indexes [2, 5, 8, ..., n+2]
2nd scenario: the last 480 columns, or column indexes [961-1439]
I've tried to play with the column indexes as follows:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dft = pd.read_csv("D:\Test.csv", header=None)
dft.head()

id_set = dft[dft.index % 2 == 0].astype('int').values
A = dft[dft.index % 2 == 1].values
B = dft[dft.index % 2 == 2].values
C = dft[dft.index % 2 == 3].values
data = {'A': A[:,0], 'B': B[:,0], 'C': C[:,0]}
df = pd.DataFrame(data, columns=['A','B','C'], index=id_set[:,0])

#1st scenario
j = 0
index = []
for i in range(1439):
    if j == 2:
        j = 0
        continue
    else:
        index.append(i)
        j += 1
print(index)

#2nd scenario
last_480 = df[0:480][::-1]
I've found this post1 and post2, but they didn't cover my case.
I would appreciate it if someone could help me.
1st scenario:
df.iloc[:, 2::3]
The slicing here means all rows, and columns starting at position 2 (the third column), taking every 3rd column after that.
2nd scenario:
df.iloc[:, :961:-1]
The slicing here means all rows, and columns from the end of the list back down to (but not including) position 961, in reverse order.
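For instance, on a small throwaway frame (purely illustrative, not your data) the two slice patterns behave like this:

import pandas as pd

demo = pd.DataFrame([range(10)], columns=list('abcdefghij'))

print(demo.iloc[:, 2::3])    # positions 2, 5, 8 -> columns c, f, i
print(demo.iloc[:, :6:-1])   # from the end back to (not including) position 6 -> j, i, h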
EDIT:
import matplotlib.pyplot as plt
import seaborn as sns
scenario1 = df.iloc[:, 2::3].copy()
sns.lineplot(data=scenario1.T)
You can save the copy of the slice to another variable; then, since you want to graph row-wise, you need to take the transpose of the sliced matrix (this will turn your rows into columns).

How to create a new dataframe from filtered data

I would like to find the rows which meet the condition RSI < 25.
However, the result is generated as a single data frame. Is it possible to create a separate dataframe for each such row?
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='TSLA'
ck_df = wb.DataReader(stock,data_source='yahoo',start='2015-01-01')
rsi_period = 14
chg = ck_df['Close'].diff(1)
gain = chg.mask(chg<0,0)
ck_df['Gain'] = gain
loss = chg.mask(chg>0,0)
ck_df['Loss'] = loss
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
ck_df['Avg Gain'] = avg_gain
ck_df['Avg Loss'] = avg_loss
rs = abs(avg_gain/avg_loss)
rsi = 100-(100/(1+rs))
ck_df['RSI'] = rsi
RSIFactor = ck_df['RSI'] <25
ck_df[RSIFactor]
If you want to know at what index the RSI < 25 then just use:
ck_df[ck_df['RSI'] <25].index
The result will also be a dataframe. If you insist on making a new one then:
new_df = ck_df[ck_df['RSI'] <25].copy()
To split the rows found by #Omkar's solution into separate dataframes, you might use this function taken from here: Pandas: split dataframe into multiple dataframes by number of rows:
def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        dfs.append(df.iloc[start:count])
    return dfs
With this you get a list of dataframes.
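Since the goal is one dataframe per qualifying row, passing n=1 should give exactly that; a small usage sketch reusing the names above:

filtered = ck_df[ck_df['RSI'] < 25].copy()

# one single-row dataframe per matching row
single_row_dfs = split_dataframe_to_chunks(filtered, 1)
print(len(single_row_dfs))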

Iterating over multiple pandas dataframes is slow

I'm trying to find the number of similar words in every row of DataFrame 1 for every single row of words in DataFrame 2.
Based on the similarities I want to create a new data frame where the columns are the N rows of DataFrame 2 and the values are the similarity counts.
My current code is working, but it runs very slowly. I'm not sure how to optimize it...
df = pd.DataFrame([])

for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        similar_n = len(list(terms_2.intersection(terms_1)))
        save[data2['code'].iloc[y]] = similar_n
    df = df.append(pd.DataFrame([save]))
Update: new code (still running slow)
def get_sim(x, terms):
    similar_n = len(list(x.intersection(terms)))
    return similar_n

for index in icd10_terms.itertuples():
    code, terms = index[1], index[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
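One likely culprit in the first version is df.append inside the loop, which copies the whole frame on every iteration; a hedged sketch that collects plain dicts and builds the frame once (reusing the names from the snippets above, and assuming the term columns hold sets, as .intersection implies) would be:

import pandas as pd

rows = []
for x in range(len(data)):
    terms_1 = data['text_tokenized'].iloc[x]
    row = {'code': data['code'].iloc[x]}
    for y in range(len(data2)):
        # the set intersection is cheap; rebuilding df each iteration was the expensive part
        row[data2['code'].iloc[y]] = len(data2['terms'].iloc[y].intersection(terms_1))
    rows.append(row)

# build the result once at the end
df = pd.DataFrame(rows)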

pandas: setting last N rows of multi-index to NaN for speeding up groupby with shift

I am trying to speed up my groupby.apply + shift, and thanks to this previous question and answer: How to speed up Pandas multilevel dataframe shift by group? I can prove that it does indeed speed things up when you have many groups.
From that question I now have the following code to set the first entry in each multi-index group to NaN, and now I can do my shift globally rather than per group.
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan
but I want to look forward, not backwards, and need to do calculations across N rows. So I am trying to use some similar code to set the last N entries to NaN, but obviously I am missing some important indexing knowledge as I just can't figure it out.
I figure I want to convert this so that every entry is a range rather than a single integer. How would I do that?
# the start of each group, ignoring the first entry
df.groupby(level=0).size().cumsum()[1:]
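For illustration, one way those cumulative sizes could be expanded into the positional ranges covering the last N rows of every group (a sketch only; N and the 'tmpShift' column name are placeholders):

import numpy as np

sizes = df.groupby(level=0).size().values   # rows per group, in group order
ends = sizes.cumsum()                       # exclusive end position of each group
starts = ends - sizes                       # start position of each group
N = 2                                       # how many trailing rows to blank out

# positional indices of the last N rows of every group (clipped for groups shorter than N)
pos = np.concatenate([np.arange(max(e - N, s), e) for s, e in zip(starts, ends)])
df.iloc[pos, df.columns.get_loc('tmpShift')] = np.nan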
Test setup (for backwards shift) if you want to try it:
length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

df['tmpShift'] = df['colB'].shift(1)
df.iloc[df.groupby(level=0).size().cumsum()[:-1]] = np.nan

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
Thanks!
I ended up doing it using a groupby apply as follows (and coded to work forwards or backwards):
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

df = df.groupby(level=0).apply(replace_tail, 'tmpShift', 2, np.nan)
So the final code is:
def replace_tail(grp, col, N, value):
    if N > 0:
        grp[col][:N] = value
    else:
        grp[col][N:] = value
    return grp

length = 5
groups = 3
rng1 = pd.date_range('1/1/1990', periods=length, freq='D')
frames = []
for x in xrange(0, groups):
    tmpdf = pd.DataFrame({'date': rng1,
                          'category': int(10000000 * abs(np.random.randn())),
                          'colA': np.random.randn(length),
                          'colB': np.random.randn(length)})
    frames.append(tmpdf)
df = pd.concat(frames)

df.sort(columns=['category', 'date'], inplace=True)
df.set_index(['category', 'date'], inplace=True, drop=True)

shiftBy = -1
df['tmpShift'] = df['colB'].shift(shiftBy)
df = df.groupby(level=0).apply(replace_tail, 'tmpShift', shiftBy, np.nan)

# Yay this is so much faster.
df['newColumn'] = df['tmpShift'] / df['colA']
df.drop('tmpShift', 1, inplace=True)
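As an alternative sketch that skips groupby.apply entirely (not from the answer above; it would replace the replace_tail call on the freshly shifted column): a per-group cumulative count from the end gives a boolean mask for the last N rows of each group.

import numpy as np

N = abs(shiftBy)

# 0 = last row of its group, 1 = second to last, ...
from_end = df.groupby(level=0).cumcount(ascending=False)
df.loc[from_end < N, 'tmpShift'] = np.nan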
