I am trying to compute the adjusted mutual information score between all pairs of columns of a pandas dataframe:
import pandas as pd
from sklearn.metrics.cluster import adjusted_mutual_info_score
from itertools import combinations
current_valid_columns = list(train.columns.difference(["ID"]))
MI_scores = pd.DataFrame(columns=["features_pair","adjusted_mutual_information"])
current_index = 0
for columns_pair in combinations(current_valid_columns, 2):
    row = pd.Series([str(columns_pair), adjusted_mutual_info_score(train[columns_pair[0]], train[columns_pair[1]])])
    MI_scores.loc[current_index] = row.values
    current_index += 1
MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)
This works, but it's very slow on a dataframe with a large number of columns. How can I optimize it?
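One way to speed this up is to stop growing MI_scores one row at a time with .loc and to run the independent pairwise AMI computations in parallel. A minimal sketch, reusing your train dataframe and relying on joblib (installed alongside scikit-learn):
from itertools import combinations

import pandas as pd
from joblib import Parallel, delayed
from sklearn.metrics.cluster import adjusted_mutual_info_score

current_valid_columns = list(train.columns.difference(["ID"]))
pairs = list(combinations(current_valid_columns, 2))

# Each AMI score is independent of the others, so compute them in parallel.
scores = Parallel(n_jobs=-1)(
    delayed(adjusted_mutual_info_score)(train[a], train[b]) for a, b in pairs
)

# Build the result frame once instead of appending one row at a time.
MI_scores = pd.DataFrame({
    "features_pair": [str(p) for p in pairs],
    "adjusted_mutual_information": scores,
})
MI_scores.to_csv("adjusted_mutual_information_score.csv", sep="|", index=False)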
Related
I'm using nested loops to add new columns with dynamic names, built from each dataset column (col) and the remaining columns after dropping that one (which I called interact_col). It works well for small datasets, but it becomes very slow on datasets with a very large number of features. Any tips to simplify the process and make it faster?
import numpy as np
import pandas as pd
X = pd.read_csv('water_potability.csv')
X = X.drop(columns='Unnamed: 0')
X_columns = np.array(X.columns)
fi_df = X.copy()
done_list = []
for col in X_columns:
    interact_col = X.drop(columns=col).columns
    for int_col in interact_col:
        fi_df['({})_minus_({})'.format(col, int_col)] = X[col] - X[int_col]
        fi_df['({})_div_({})'.format(col, int_col)] = X[col] / X[int_col]
        if int_col not in done_list:
            fi_df['({})_add_({})'.format(col, int_col)] = X[col] + X[int_col]
            fi_df['({})_multi_({})'.format(col, int_col)] = X[col] * X[int_col]
    done_list.append(col)
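A minimal sketch of one way to make this faster, reusing X and X_columns from above: collect every new column in a plain dict of Series and concatenate once at the end, instead of inserting thousands of columns into fi_df one by one (the column names and the symmetric add/multiply handling stay the same):
import pandas as pd

new_cols = {}
for i, col in enumerate(X_columns):
    for j, int_col in enumerate(X_columns):
        if int_col == col:
            continue
        new_cols['({})_minus_({})'.format(col, int_col)] = X[col] - X[int_col]
        new_cols['({})_div_({})'.format(col, int_col)] = X[col] / X[int_col]
        if j > i:  # add/multiply are symmetric, so create them only once per pair
            new_cols['({})_add_({})'.format(col, int_col)] = X[col] + X[int_col]
            new_cols['({})_multi_({})'.format(col, int_col)] = X[col] * X[int_col]

# Concatenate once at the end; repeated single-column insertion fragments the frame.
fi_df = pd.concat([X, pd.DataFrame(new_cols, index=X.index)], axis=1)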
I have several dataframes, from which I'm creating a cartesian product (on purpose!)
After this, I'm exporting the result to disk.
I believe the size of the resulting dataframe could exceed my memory footprint, so I'm wondering whether there is a way to chunk this so that the whole dataframe doesn't need to be in memory at the same time?
Example Code:
import pandas as pd
def create_list_from_range(r1, r2):
    if r1 == r2:
        return r1
    else:
        res = []
        while r1 < r2 + 1:
            res.append(r1)
            r1 += 1
        return res
# make a list of options
color_opt = ['red','blue','green','orange']
dow_opt = create_list_from_range(1,7)
hod_opt = create_list_from_range(0,23)
# turn each list into a dataframe
df_color = pd.DataFrame({'color': color_opt})
df_day = pd.DataFrame({'day_of_week': dow_opt})
df_hour = pd.DataFrame({'hour_of_day': hod_opt})
# add a dummy column to everything so I can easily do a cartesian product
df_color['dummy']=1
df_day['dummy']=1
df_hour['dummy']=1
# now cartesian product... cascading
merge1 = pd.merge(df_day, df_hour, on='dummy')
FINAL = pd.merge(merge1, df_color, on='dummy')
FINAL.to_csv('FINAL_OUTPUT.csv', index=False)
You could try building up individual rows using itertools.product. In your example, you could do this as follows:
from itertools import product
prod = product(color_opt, dow_opt, hod_opt)
You can then take a batch of rows at a time and append them to an existing csv file using
df.to_csv("file", mode="a")
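A minimal sketch of that approach; chunk_size is an arbitrary value to tune, and itertools.islice pulls a fixed number of rows from the lazy product so only one chunk is in memory at a time:
from itertools import islice, product

import pandas as pd

prod = product(dow_opt, hod_opt, color_opt)   # lazy: nothing is materialized yet
columns = ["day_of_week", "hour_of_day", "color"]

chunk_size = 10_000   # tune to whatever comfortably fits in memory
first = True
while True:
    rows = list(islice(prod, chunk_size))     # take at most chunk_size rows
    if not rows:
        break
    chunk = pd.DataFrame(rows, columns=columns)
    # Write the header only for the first chunk, then keep appending.
    chunk.to_csv("FINAL_OUTPUT.csv", mode="w" if first else "a",
                 header=first, index=False)
    first = False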
I'm making a reverse denoising autoencoder and I have a dataset, but it's all lowercased. I want 80% of the source entries to be capitalized and only 60% of the target entries to be capitalized. I wrote this:
import pandas as pd
import torch
df = pd.read_csv('Data/fb_moe.csv')
for i in range(len(df)):
    sample = int(torch.distributions.Bernoulli(torch.FloatTensor([.8])).sample())
    if sample == 1:
        df.iloc[i].y = str(df.iloc[i].y).capitalize()
    sample_1 = int(torch.distributions.Bernoulli(torch.FloatTensor([.6])).sample())
    if sample_1 == 1:
        df.iloc[i].x = str(df.iloc[i].x).capitalize()
df.to_csv('Data/fb_moe2.csv')
But this is pretty slow because my csv has about 8 million rows. Is there a faster way to do this?
Part of the Dataframe
x,y
jon,jun
an,jun
ju,jun
jin,jun
nun,jun
un,jun
jon,jun
jin,jun
nen,jun
ju,jun
jn,jun
jul,jun
jen,jun
hun,jun
ju,jun
hun,jun
hun,jun
jon,jun
jin,jun
un,jun
eun,jun
jhn,jun
Try using boolean masks and vectorized string operations instead; pandas does not perform well in row-by-row for loops.
import numpy as np

n = len(df)
# One Bernoulli(0.8) draw per row, in a single vectorized call
source = np.random.binomial(1, p=.8, size=n) == 1
target = source.copy()
total_source_true = np.sum(source)
# Among the rows selected above, keep roughly 60% for the second mask
target[source] = np.random.binomial(1, p=.6, size=total_source_true) == 1
df.loc[source, 'x'] = df.loc[source, 'x'].str.capitalize()
df.loc[target, 'y'] = df.loc[target, 'y'].str.capitalize()
I would like to find out the rows which meet the condition RSI < 25.
However, the result is returned as a single dataframe. Is it possible to create a separate dataframe for each matching row?
Thanks.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='TSLA'
ck_df = wb.DataReader(stock,data_source='yahoo',start='2015-01-01')
rsi_period = 14
chg = ck_df['Close'].diff(1)
gain = chg.mask(chg<0,0)
ck_df['Gain'] = gain
loss = chg.mask(chg>0,0)
ck_df['Loss'] = loss
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
ck_df['Avg Gain'] = avg_gain
ck_df['Avg Loss'] = avg_loss
rs = abs(avg_gain/avg_loss)
rsi = 100-(100/(1+rs))
ck_df['RSI'] = rsi
RSIFactor = ck_df['RSI'] <25
ck_df[RSIFactor]
If you want to know at what index the RSI < 25 then just use:
ck_df[ck_df['RSI'] <25].index
The result will also be a dataframe. If you insist on making a new one then:
new_df = ck_df[ck_df['RSI'] <25].copy()
To split the rows found by @Omkar's solution into separate dataframes, you might use this function taken from here: Pandas: split dataframe into multiple dataframes by number of rows:
def split_dataframe_to_chunks(df, n):
    df_len = len(df)
    count = 0
    dfs = []
    while True:
        if count > df_len - 1:
            break
        start = count
        count += n
        dfs.append(df.iloc[start:count])
    return dfs
With this you get a list of dataframes.
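For example, to get one dataframe per matching row (the n=1 case the question asks about), you might call it like this; the name low_rsi_frames is just chosen here:
# Each element of the list is a single-row dataframe where RSI < 25.
low_rsi_frames = split_dataframe_to_chunks(ck_df[ck_df['RSI'] < 25], 1)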
I'm working with a csv file in Python using Pandas.
I'm having trouble figuring out how to achieve the following goal.
What I need to achieve is to group entries using a similarity function.
For example, each group X should contain all entries where each pair in the group differs by at most Y on a certain attribute-column value.
Given this example of CSV:
<pre>
name;sex;city;age
john;male;newyork;20
jack;male;newyork;21
mary;female;losangeles;45
maryanne;female;losangeles;48
eric;male;san francisco;29
jenny;female;boston2;30
mattia;na;BostonDynamics;50
</pre>
and considering the age column with a difference of at most 3 on its value, I would get the following groups:
A = {john;male;newyork;20
     jack;male;newyork;21}
B = {eric;male;san francisco;29
     jenny;female;boston2;30}
C = {mary;female;losangeles;45
     maryanne;female;losangeles;48}
D = {maryanne;female;losangeles;48
     mattia;na;BostonDynamics;50}
This is actually my work-around, but I hope there is something more pythonic.
import pandas as pandas
import numpy as numpy
def main():
    csv_path = "../resources/dataset_string.csv"
    csv_data_frame = pandas.read_csv(csv_path, delimiter=";")
    print("\nOriginal Values:")
    print(csv_data_frame)
    sorted_df = csv_data_frame.sort_values(by=["age", "name"], kind="mergesort")
    print("\nSorted Values by AGE & NAME:")
    print(sorted_df)
    min_age = int(numpy.min(sorted_df["age"]))
    print("\nMin_Age:", min_age)
    max_age = int(numpy.max(sorted_df["age"]))
    print("\nMax_Age:", max_age)
    threshold = 3
    bins = numpy.arange(min_age, max_age, threshold)
    print("Bins:", bins)
    ind = numpy.digitize(sorted_df["age"], bins)
    print(ind)
    print("\n\nClustering by hand:\n")
    current_min = min_age
    for cluster in range(min_age, max_age, threshold):
        next_min = current_min + threshold
        print("<Cluster({})>".format(cluster))
        print(sorted_df[(current_min <= sorted_df["age"]) & (sorted_df["age"] <= next_min)])
        print("</Cluster({})>\n".format(cluster + threshold))
        current_min = next_min

if __name__ == "__main__":
    main()
On one attribute this is simple:
1. Sort.
2. Linearly scan the data, and whenever the threshold is violated, begin a new group.
While this won't be optimal, it should be better than what you already have, at a lower cost (a sketch of this scan is given below).
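A minimal sketch of that sort-and-scan idea on the age column, reusing csv_data_frame from the question; the helper name scan_groups is just made up here:
def scan_groups(df, column, threshold):
    """Sort by `column`, then scan once, starting a new group whenever the
    current value exceeds the first value of the current group by more than
    `threshold`."""
    ordered = df.sort_values(column)
    groups, members, start = [], [], None
    for idx, value in ordered[column].items():
        if start is None or value - start > threshold:
            if members:                 # close the previous group
                groups.append(ordered.loc[members])
            members, start = [], value  # open a new group starting at this value
        members.append(idx)
    if members:                         # keep the last group
        groups.append(ordered.loc[members])
    return groups

# e.g. groups = scan_groups(csv_data_frame, "age", 3)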
However, in the multivariate case, finding the optimal groups is supposedly NP-hard, so finding the optimal grouping would require a brute-force search in exponential time. You will therefore need to approximate this, either with AGNES (in O(n³)) or with CLINK (usually worse quality, but O(n²)).
As this is fairly expensive, it will not be a simple operation on your data frame.
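As a hedged illustration of the hierarchical route, here is a sketch using SciPy's agglomerative clustering with complete linkage (the same family as CLINK, though not that exact algorithm), reusing csv_data_frame from the question and assuming SciPy is available:
from scipy.cluster.hierarchy import fcluster, linkage

# Complete linkage merges clusters by their maximum pairwise distance, so
# cutting the tree at t=3 keeps every within-group age difference at 3 or less.
ages = csv_data_frame[["age"]].to_numpy(dtype=float)
Z = linkage(ages, method="complete")
labels = fcluster(Z, t=3, criterion="distance")
groups = [group for _, group in csv_data_frame.groupby(labels)]
The same call also covers the multivariate case: pass several numeric columns into the observation matrix instead of just age.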