Set up a column based on another column and outside list in a Pandas Dataframe - python

I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is titled cluster_centre) and puts that array into the dataframe. It would take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0,1,2,3 or 4). So for instance, if the sentence in that row was given a label of 2 i.e. the 'labels' column value for that row would be 2, then the value of the 'cluster_centres' column in the df at that row would be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I did some new stuff and tried this:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
iforval = cluster_centre[row['labels']]
All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does indeed return 29 correct arrays from the cluster_centre list, which matches the 29 rows present in the dataframe. Now I just need to put them into the new column of the dataframe, but .at[] didn't work, not sure if I am using it correctly.
EDIT/UPDATE: Ok I found a sort of solution, don't know why I didn't realise this before, I just created a list beforehand and made that into the new column, ended up being much simpler.
cluster_centres_list = [cluster_centres[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())

Related

Make Pandas Dataframe with lists of different length

Im trying to backtest a trading strategy and therefore I have split my data with TimeSeriesSplit. This creates 5 different pandas series for all columns with different lengths. I've managed to hardcode the first list with length = 16 into a pandas dataframe, but would like to make individual dataframes for all different lengths in my list.
The code is the following:
from sklearn.model_selection import TimeSeriesSplit
import yfinance as yf
data = yf.download("NVDA",start="2017-01-01", end="2017-04-30")
data
tss = TimeSeriesSplit(n_splits = 5)
column = []
for columns in data:
column.append(data[columns])
train_data = []
test_data = []
for i in column:
for train_index, test_index in tss.split(i):
train = i[train_index]
test = i[test_index]
train_data.append(train)
test_data.append(test)
# copied the list
testing = train_data.copy()
first_df = []
for i in testing:
#print(len(i))
if len(i) == 16:
first_df.append(i)
# this works to get the dataframe back with different lengths
new_df = pd.DataFrame(first_df)
test = new_df.transpose()
test
The output from "test" is correct for the first split of training data, but how can I do this with all 5 different lengths at once?

pandas: while loop to simultaneously advance through multiple lists and call functions

I want my code to:
read data from a CSV and make a dataframe: "source_df"
see if the dataframe contains any columns specified in a list:
"possible_columns"
call a unique function to replace the values in each column whose header is found in the "possible_columns" the list, then insert the modified values in a new dataframe: "destination_df"
Here it is:
import pandas as pd
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
#creates destination_df
blanklist = []
destination_df = pd.DataFrame(blanklist)
#create the column header lists for comparison in the while loop
columns = source_df.head(0)
possible_columns = ['yes/no','true/false']
#establish the functions list and define the functions to replace column values
fix_functions_list = ['yes_no_fix()','true_false_fix()']
def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
'''use the counter to call a unique function from the function list to replace the values in each column whose header is found in the "possible_columns" the list, insert the modified values in "destination_df, then advance the counter'''
counter = 0
while counter < len(possible_columns):
if possible_columns[counter] in columns:
destination_df.insert(counter, possible_columns[counter], source_df[possible_columns[counter]])
fix_functions_list[counter]
counter = counter + 1
#see if it works
print(destination_df.head(10))
When I print(destination_df), I see the unmodified column values from source_df. When I call the functions independently they work, which makes me think something is going wrong in my while loop.
Your issue is that you are trying to call a function that is stored in a list as a string.
fix_functions_list[cnt]
This will not actually run the function just access the string value.
I would try and find another way to run these functions.
def yes_no_fix():
destination_df['yes/no'] = destination_df['yes/no fixed'].replace("No","0").replace("Yes","1")
def true_false_fix():
destination_df['true/false'] = destination_df['true/false fixed'].replace('False', '1').replace('True', '0')
fix_functions_list = {0:yes_no_fix,1:true_false_fix}
and change the function calling to like below
fix_functions_list[counter]()
#creates source_df
file = "yes-no-true-false.csv"
data = pd.read_csv(file)
source_df = pd.DataFrame(data)
possible_columns = ['yes/no','true/false']
mapping_dict={'yes/no':{"No":"0","Yes":"1"} ,'true/false': {'False':'1','True': '0'}
old_columns=[if column not in possible_columns for column in source_df.columns]
existed_columns=[if column in possible_columns for column in source_df.columns]
new_df=source_df[existed_columns]
for column in new_df.columns:
new_df[column].map(mapping_dict[column])
new_df[old_columns]=source_df[old_columns]

using pandas to find the string from a column

I am a very beginner in programming and trying to learn to code. so please bear with my bad coding. I am using pandas to find a string from a column (Combinations column in the below code ) in the data frame and print the entire row containing the string . Find the code below. Basically I need to find all the instances where the string occurs , and print the entire row .find my code below . I am not able to figure out how to find that particular instance of the column and print it .
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations= data['Col1']+data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
if data['combinations'].str.contains(i).any():
print(i+ 'data occurs in row' )
# I need to print the row containing the string here
else:
print(i +'is occuring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[~count['col6'].isin(list_of_combinations)== False]
My result

How to create a pandas dataframe inside a function

I have written a function to split the sentences into words and i need to create features out of them. I am encountering following issues
1. when i use a list to hold all values, when i retrieve them,all the features are created as a single column, whereas i need each of them as single column
2. getting zero division error even though I am using if condition to check if the count is zero
def token_features(q1,q2):
stats = [0.0]*10
q1_tokens = q1.split()
q2_tokens = q2.split()
q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
common_word_count = len(set(q1_words) & set(q2_words))
s1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
s2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
stop_word_count = len(s1_stops & s2_stops)
common_token_count = len(set(q1_tokens) & set(q2_tokens))
if (common_word_count==0 or stop_word_count==0 or common_token_count==0):
pass
stats[0] = common_word_count/min(len(q1_words),len(q2_words))
stats[1] = common_word_count/max(len(q1_words),len(q2_words))
stats[2] = stop_word_count/min(len(s1_stops),len(s2_stops))
stats[3] = stop_word_count/max(len(s1_stops),len(s2_stops))
stats[4] = common_token_count/min(len(q1_tokens),len(q2_tokens))
stats[5] = common_token_count/max(len(q1_tokens),len(q2_tokens))
stats[6] = int(q1_tokens[-1] == q2_tokens[-1])
stats[7] = int(q1_tokens[0] == q2_tokens[0])
stats[8] = abs(len(q1_tokens) - len(q2_tokens))
stats[9] = (len(q1_tokens) + len(q2_tokens))/2
return stats
what i need is
1. how to create a dataframe inside the function and add it to another dataframe. i.e, I have a Dataframe named out, which has columns a,b. now i need to create a dataframe inside the function and add those columns of the dataframe to "out" dataframe.
If i understood correctly you are trying to do this:
import pandas as pd
data = {'col_1': ['Yes','3'], 'col_2': ["2.0","Yes"]}
base_df=pd.DataFrame.from_dict(data)
def appendToDf(base_df):
data = {'new1': ['Yes','3'], 'new2': ["2.0","Yes"]}
dfNew=pd.DataFrame.from_dict(data)
out_df=pd.concat([base_df,dfNew], axis=1)
return out_df
appendToDf(base_df)

Iterating over multiple pandas dataframe is slow

I'm trying to find the number of similar words for all rows in Dataframe1 for every single row with words in Dataframe 2.
Based on the similarities I want to create a new data frame with where columns = N rows of dataframe2
values = similarity.
My current code is working, but it runs very slow. I'm not sure how to optimize it...
df = pd.DataFrame([])
for x in range(10000):
save = {}
terms_1 = data['text_tokenized'].iloc[x]
save['code'] = data['code'].iloc[x]
for y in range(3000):
terms_2 = data2['terms'].iloc[y]
similar_n = len(list(terms_2.intersection(terms_1)))
save[data2['code'].iloc[y]] = similar_n
df = df.append(pd.DataFrame([save]))
Update: new code (still running slow)
def get_sim(x, terms):
similar_n = len(list(x.intersection(terms)))
return similar_n
for index in icd10_terms.itertuples():
code,terms = index[1],index[2]
data[code] = data['text_tokenized'].apply(get_sim, args=(terms,))

Categories

Resources