How to create new dataframes from filtered data - python

I would like to find the rows that meet the condition RSI < 25.
However, the result comes back as a single dataframe. Is it possible to create a separate dataframe for each matching row?
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb

stock = 'TSLA'
ck_df = wb.DataReader(stock, data_source='yahoo', start='2015-01-01')

rsi_period = 14

# Daily price change, split into gains and losses
chg = ck_df['Close'].diff(1)
gain = chg.mask(chg < 0, 0)
ck_df['Gain'] = gain
loss = chg.mask(chg > 0, 0)
ck_df['Loss'] = loss

# Exponentially weighted averages of gains and losses
avg_gain = gain.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
avg_loss = loss.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
ck_df['Avg Gain'] = avg_gain
ck_df['Avg Loss'] = avg_loss

# Relative strength and RSI
rs = abs(avg_gain / avg_loss)
rsi = 100 - (100 / (1 + rs))
ck_df['RSI'] = rsi

# Boolean mask selecting the rows where RSI < 25
RSIFactor = ck_df['RSI'] < 25
ck_df[RSIFactor]

If you want to know at which index values RSI < 25, just use:
ck_df[ck_df['RSI'] < 25].index
The filtered selection is itself a dataframe. If you insist on making a new one, then:
new_df = ck_df[ck_df['RSI'] < 25].copy()

To split the rows found by @Omkar's solution into separate dataframes you might use this function, taken from Pandas: split dataframe into multiple dataframes by number of rows:
def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        dfs.append(df.iloc[start:count])
    return dfs
With this you get a list of dataframes.
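Passing n=1 splits the filtered frame into one single-row dataframe per match, which is what the original question asks for; a minimal sketch reusing new_df from the answer above:

# One single-row dataframe per row where RSI < 25
single_row_dfs = split_dataframe_to_chunks(new_df, 1)
for small_df in single_row_dfs:
    print(small_df)  # each element is a 1-row DataFrame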

Related

Creating Cartesian Product DataFrame without maxing Memory

I have several dataframes, from which I'm creating a cartesian product (on purpose!).
After this, I'm exporting the result to disk.
I believe the size of the resulting dataframe could exceed my memory footprint, so I'm wondering whether there is a way to chunk this so that the whole dataframe doesn't need to be in memory at the same time.
Example Code:
import pandas as pd
def create_list_from_range(r1, r2):
    # Return the inclusive integer range [r1, r2] as a list
    if r1 == r2:
        return [r1]
    else:
        res = []
        while r1 < r2 + 1:
            res.append(r1)
            r1 += 1
        return res
# make a list of options
color_opt = ['red','blue','green','orange']
dow_opt = create_list_from_range(1,7)
hod_opt = create_list_from_range(0,23)
# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})
# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy'] = 1
df_day['dummy'] = 1
df_hour['dummy'] = 1
# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)
You could try building up individual rows using itertools.product. In your example, you could do this as follows:
from itertools import product
prod = product(color_opt, dow_opt, hod_opt)
You can then pull a batch of rows at a time from the generator and append them to an existing CSV file using
df.to_csv("file", mode="a")
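A minimal sketch of that idea, assuming the color_opt, dow_opt and hod_opt lists from the question and an arbitrary chunk size of 1000 rows:

from itertools import product, islice
import pandas as pd

prod = product(color_opt, dow_opt, hod_opt)
chunk_size = 1000  # hypothetical batch size; tune to your memory budget
first = True

while True:
    # Materialise at most chunk_size rows of the cartesian product
    chunk = list(islice(prod, chunk_size))
    if not chunk:
        break
    df = pd.DataFrame(chunk, columns=['color', 'day_of_week', 'hour_of_day'])
    # Write the header only with the first chunk, then append
    df.to_csv('FINAL_OUTPUT.csv', mode='w' if first else 'a',
              header=first, index=False)
    first = False

Only one chunk of the product is ever held in memory at a time, so the full cartesian product never has to fit in RAM.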

Capitalize random rows in a Pandas Dataframe

I'm making a reverse denoising autoencoder and I have a dataset, but it's all lowercased. I want the source entry to be capitalized in 80% of the rows and the target entry to be capitalized in only 60% of the rows. I wrote this:
import pandas as pd
import torch
df = pd.read_csv('Data/fb_moe.csv')
for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())
    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()
    sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())
    if sample_1 == 1:
        df.iloc[i].x = str(df.iloc[i].x).capitalize()
df.to_csv('Data/fb_moe2.csv')
But this is pretty slow because my CSV has about 8 million rows. Is there a faster way to do this?
Part of the Dataframe
x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun
Try using boolean masks and vectorized string operations; pandas does not perform well in explicit for loops:
import numpy as np

n = len(df)
# 80% of rows get a capitalized 'x'
source = np.random.binomial(1, p=.8, size=n) == 1
target = source.copy()
total_source_true = np.sum(source)
# Of those rows, 60% also get a capitalized 'y'
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1
df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()
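As a quick sanity check (assuming string columns x and y as in the sample above), the achieved fractions can be inspected afterwards; note that with this masking the 60% draw is made only within the rows already selected by source:

print(df['x'].str[0].str.isupper().mean())  # roughly 0.8
print(df['y'].str[0].str.isupper().mean())  # roughly 0.48 = 0.8 * 0.6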

Iterating over multiple pandas dataframes is slow

I'm trying to find, for every row in Dataframe 1, the number of words it shares with each row in Dataframe 2.
Based on these similarities I want to create a new dataframe where the columns correspond to the N rows of Dataframe 2 and the values are the similarity counts.
My current code works, but it runs very slowly and I'm not sure how to optimize it...
df = pd.DataFrame([])
for x in range(10000):
    save = {}
    terms_1 = data['text_tokenized'].iloc[x]
    save['code'] = data['code'].iloc[x]
    for y in range(3000):
        terms_2 = data2['terms'].iloc[y]
        # Number of tokens the two rows have in common
        similar_n = len(list(terms_2.intersection(terms_1)))
        save[data2['code'].iloc[y]] = similar_n
    df = df.append(pd.DataFrame([save]))
Update: new code (still running slowly)
def get_sim(x, terms):
    similar_n = len(list(x.intersection(terms)))
    return similar_n

for index in icd10_terms.itertuples():
    code, terms = index[1], index[2]
    data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))
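As a hedged sketch of one common speed-up: appending to a dataframe inside the loop copies the whole frame on every iteration, so collecting plain dicts in a list and building the dataframe once at the end is usually much faster, as is pulling the columns out of pandas into Python lists before looping. The names data, data2, text_tokenized, terms and code are taken from the question:

# Pull the columns into plain Python lists once, outside the loops
terms_1_list = data['text_tokenized'].tolist()
codes_1 = data['code'].tolist()
terms_2_list = data2['terms'].tolist()
codes_2 = data2['code'].tolist()

rows = []
for terms_1, code_1 in zip(terms_1_list, codes_1):
    save = {'code': code_1}
    for terms_2, code_2 in zip(terms_2_list, codes_2):
        # Count of shared tokens between the two rows
        save[code_2] = len(terms_2.intersection(terms_1))
    rows.append(save)

# Build the result dataframe in one go instead of appending row by row
df = pd.DataFrame(rows)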

Optimizing pairwise mutual information score

I am trying to compute the adjusted mutual information score between all pairs of columns of a pandas dataframe:
from sklearn.metrics.cluster import adjusted_mutual_info_score
from itertools import combinations

current_valid_columns = list(train.columns.difference(["ID"]))
MI_scores = pd.DataFrame(columns=["features_pair", "adjusted_mutual_information"])
current_index = 0
for columns_pair in combinations(current_valid_columns, 2):
    row = pd.Series([str(columns_pair),
                     adjusted_mutual_info_score(train[columns_pair[0]], train[columns_pair[1]])])
    MI_scores.loc[current_index] = row.values
    current_index += 1

MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)
This works, but it's very slow on a dataframe with a large number of columns. How can I optimize it?
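As a hedged sketch of one possible improvement: assigning rows one at a time with MI_scores.loc[current_index] grows the dataframe repeatedly, so collecting plain tuples in a list and building the dataframe once is usually faster. train and current_valid_columns are taken from the question:

from itertools import combinations
from sklearn.metrics.cluster import adjusted_mutual_info_score
import pandas as pd

records = []
for col_a, col_b in combinations(current_valid_columns, 2):
    score = adjusted_mutual_info_score(train[col_a], train[col_b])
    records.append((str((col_a, col_b)), score))

# Build the dataframe in one step instead of appending with .loc in a loop
MI_scores = pd.DataFrame(records, columns=["features_pair", "adjusted_mutual_information"])
MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)

If the score computation itself dominates the runtime, distributing the column pairs over processes with joblib.Parallel is a further option.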

How can I count a specific value in group_by in pandas?

I have a dataframe and I use groupby to group it by Season. One of the columns of the original df is named Check and consists of True and False. My aim is to count the True values for each group and put the counts in a new dataframe.
import pandas as pd
df = ....
df['Check'] = df['Actual'] == df['Prediction']
grouped_per_year = df.groupby('Season')
df_2= pd.DataFrame()
df_2['Seasons'] = total_matches_per_year.keys()
df_2['Successes'] = ''
df_2['Total_Matches'] = list(grouped_per_year.size())
df_2['SR'] = df_2['Successes'] / df_2['Total_Matches']
df_2['Money_In'] = list(grouped_per_year['Money_In'].apply(sum))
df_2['Profit (%)'] = (df_profit['Money_In'] - df_profit['Total_Matches']) / df_profit['Total_Matches'] * 100.
I have tried:
successes_per_year = grouped_per_year['Pred_Check'].value_counts()
but I don't know how to get only the True count.
For counting True values, you can also use sum (since True counts as 1 and False as 0 in numerical operations):
grouped_per_year['Pred_Check'].sum()
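To plug the counts back into the question's df_2, a minimal sketch that assumes the Check column defined earlier (note the question uses both the names Check and Pred_Check for this column):

grouped_per_year = df.groupby('Season')

df_2 = pd.DataFrame()
df_2['Seasons'] = list(grouped_per_year.groups.keys())
# Summing the boolean column counts its True values per group
df_2['Successes'] = list(grouped_per_year['Check'].sum())
df_2['Total_Matches'] = list(grouped_per_year.size())
df_2['SR'] = df_2['Successes'] / df_2['Total_Matches']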
