Iterate over Dataframes name - python

I want to print the name of the DataFrame in a for loop, but I can't get it right.
When I iterate over the datasets list I get the dataset itself. If I try str(d), for example, I get the whole dataset as a string. d.name() doesn't work either.
What can I do to print just the name of the DataFrame as a string?
Thanks in advance!
PS: I get this error: "AttributeError: 'DataFrame' object has no attribute 'name'"
# Define lists
datasets = [train_data, test_data]
features = ['Age', 'Fare']
# Create function
fig, outliers = plt.subplots(figsize=(20,10), ncols=4)
row, col = 0, 0
for f in features:
    for d in datasets:
        sns.boxplot(x=d[f], orient='v', color=pal_titanic[3], ax=outliers[col])
        outliers[col].set_title(f + 'in' + d)
        col = col + 1
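A DataFrame does not carry the name of the variable it is bound to, so one common workaround is to keep the names next to the frames, for example in a dict. The following is only a minimal sketch: the two small frames are made-up stand-ins for train_data and test_data, and the titles are collected in a list instead of being passed to set_title so that the example runs without any plotting library.

```python
import pandas as pd

# Made-up stand-ins for the real train_data / test_data frames.
train_data = pd.DataFrame({'Age': [22, 38], 'Fare': [7.25, 71.28]})
test_data = pd.DataFrame({'Age': [34, 47], 'Fare': [7.83, 7.00]})

# Keep each frame together with the label that should appear in the title.
datasets = {'train_data': train_data, 'test_data': test_data}
features = ['Age', 'Fare']

titles = []
for f in features:
    for name, d in datasets.items():
        # d is the DataFrame (usable as d[f]); name is a plain string.
        titles.append(f + ' in ' + name)

print(titles)
```

In the original loop the same string could be passed to outliers[col].set_title(...).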

Related

'str' object has no attribute 'values'

I am stuck. I have looked up other solutions to this, but I don't quite understand them. In my code I have a giant matrix in a CSV file, and I want to iterate over the data in my 4th column only. It is called 'MovementTime'. I thought that by calling it the way shown below I could iterate over my data and therefore sort it. I am getting the error
'str' object has no attribute 'values'
Can someone explain to me why I'm getting this error?
Thank you!
bigdata = pd.read_csv(r'Assetslog_912021_11.csv')
data = pd.DataFrame(bigdata)
#create a function to analyze data
def analytics(data):
    data.columns = ['Time', 'Fixed Delta', 'Movement Time', 'MovementNumber', 'Rest Flag', 'DistortionDigit', 'RobotForceX','RobotForceY','RobotForceZ', 'PrevPositionX','PrevPositionY','PrevPositionZ', 'TargetPosZ', 'TargetPosY', 'TargetPosZ', 'PlayerPosX', 'PlayerPosX', 'PlayerPosY', 'PlayerPosZ', 'RobotVelX','RobotVelY','RobotVelZ', 'LocalPosX', 'LocalPosY', 'LocalPosZ', 'PerpError', 'ExtError']
    i = np.iterable(data.columns)
    for i in set(data['MovementNumber']):
        print("Plot for Movement Number " + str(i))
        data2 = data.loc[['MovementNumber'] == i]
    ax = plt.axes(projection = '3d')
    xdata = data2['PlayerPosX'].values
    ydata = data2['PlayerPosY'].values
    zdata = data2['PlayerPosZ'].values
    plot1 =ax.scatter3D(xdata,ydata,zdata, c=zdata)
    plt.show(plot1)
This line is not right:
data2 = data.loc[['MovementNumber'] == i]
That's going to compare a list containing a string to an integer, which will always be false. I believe you want
data2 = data[data['MovementNumber'] == i]
That assigns to data2 all the rows where MovementNumber is i.
And, by the way, your indentation is wrong. I assume you want one plot per movement number, so all the lines starting with ax = ... need to be indented, so they are inside the loop.
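Putting both points together, the filtering step could look roughly like the sketch below. The three-row frame is invented for illustration, and the plotting calls are replaced by a simple count per movement number (the points_per_movement dict is a hypothetical stand-in) so that the example runs on its own.

```python
import pandas as pd

# Invented sample data; the real frame comes from the CSV file.
data = pd.DataFrame({
    'MovementNumber': [1, 1, 2],
    'PlayerPosX': [0.0, 1.0, 2.0],
    'PlayerPosY': [0.5, 1.5, 2.5],
    'PlayerPosZ': [0.1, 0.2, 0.3],
})

points_per_movement = {}
for i in sorted(set(data['MovementNumber'])):
    # The boolean mask keeps only the rows whose MovementNumber equals i.
    data2 = data[data['MovementNumber'] == i]
    xdata = data2['PlayerPosX'].values
    # Everything that uses data2 stays inside the loop: one result per movement.
    points_per_movement[int(i)] = len(xdata)

print(points_per_movement)
```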

Set up a column based on another column and outside list in a Pandas Dataframe

I am trying to create a new column in a Pandas dataframe which takes only one array from a list of 5 arrays (the list is titled cluster_centre) and puts that array into the dataframe. It would take the array at the index that matches the value in the 'labels' column of the same dataframe (which has values of 0,1,2,3 or 4). So for instance, if the sentence in that row was given a label of 2 i.e. the 'labels' column value for that row would be 2, then the value of the 'cluster_centres' column in the df at that row would be cluster_centre[2]. How can I do this? The code I have attempted is pasted below:
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import pandas as pd
with open('JWN_Nordstrom_MDNA_overview_2017.txt', 'r') as file:
    initial_corpus = file.read()
corpus = initial_corpus.split('. ')
# Extract sentence embeddings
embedder = SentenceTransformer('bert-base-wikipedia-sections-mean-tokens')
corpus_embeddings = embedder.encode(corpus)
# Perform KMeans clustering
num_clusters = 5
clustering_model = KMeans(n_clusters=num_clusters)
clustering_model.fit(corpus_embeddings)
cluster_assignment = clustering_model.labels_
cluster_centre = clustering_model.cluster_centers_
# Create dataframe
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
# The line below creates a ValueError
All_data_df['cluster_centres'] = cluster_centre[All_data_df['labels']]
print(All_data_df.head())
I get this error: ValueError: Wrong number of items passed 768, placement implies 1
UPDATE: I did some new stuff and tried this:
All_data_df = pd.DataFrame()
All_data_df['sentences'] = corpus
All_data_df['embeddings'] = corpus_embeddings
All_data_df['labels'] = cluster_assignment
#All_data_df['cluster_centres'] = 0
for index, row in All_data_df.iterrows():
    iforval = cluster_centre[row['labels']]
    All_data_df.at[index, 'cluster_centres'] = iforval
print(All_data_df.head())
But get a new error: ValueError: Must have equal len keys and value when setting with an iterable. I printed iforval inside the loop and it does indeed return 29 correct arrays from the cluster_centre list, which matches the 29 rows present in the dataframe. Now I just need to put them into the new column of the dataframe, but .at[] didn't work, not sure if I am using it correctly.
EDIT/UPDATE: Ok I found a sort of solution, don't know why I didn't realise this before, I just created a list beforehand and made that into the new column, ended up being much simpler.
cluster_centres_list = [cluster_centre[label] for label in cluster_assignment]
all_data_df = pd.DataFrame()
all_data_df['sentences'] = corpus
all_data_df['embeddings'] = corpus_embeddings
all_data_df['labels'] = cluster_assignment
all_data_df['cluster_centres'] = cluster_centres_list
print(all_data_df.head())
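The same idea can be tried without the embedding model. This is only a sketch under assumptions: the 2-dimensional centres and the label array below are made up and stand in for clustering_model.cluster_centers_ and clustering_model.labels_.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the KMeans outputs (3 clusters, 2 dimensions).
cluster_centre = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
cluster_assignment = np.array([2, 0, 1, 2])

all_data_df = pd.DataFrame({'labels': cluster_assignment})

# A list with one array per row becomes an object column,
# so each cell holds the whole centre vector for that row's label.
all_data_df['cluster_centres'] = [cluster_centre[label] for label in cluster_assignment]

print(all_data_df)
```

Passing the 2-D array cluster_centre[labels] directly would instead be interpreted as many columns, which is consistent with the "Wrong number of items passed" error quoted above.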

How do I apply CountVectorizer to each row in a dataframe?

I have a dataframe say df which has 3 columns. Column A and B are some strings. Column C is a numeric variable.
I want to convert this to a feature matrix by passing it to a CountVectorizer.
I define my CountVectorizer as:
cv = CountVectorizer(input='content', encoding='iso-8859-1',
                     decode_error='ignore', analyzer='word',
                     ngram_range=(1), tokenizer=my_tokenizer, stop_words='english',
                     binary=True)
Next I pass the entire dataframe to cv.fit_transform(df) which doesn't work.
I get this error:
cannot unpack non-iterable int object
Next I convert each row of the dataframe to a single string:
sample = pdt_items["A"] + "," + pdt_items["C"].astype(str) + "," + pdt_items["B"]
Then I apply
cv_m = sample.apply(lambda row: cv.fit_transform(row))
I still get error:
ValueError: Iterable over raw text documents expected, string object received.
Please let me know where I am going wrong, or whether I need to take some other approach.
Try this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
A = ['very good day', 'a random thought', 'maybe like this']
B = ['so fast and slow', 'the meaning of this', 'here you go']
C = [1, 2, 3]
pdt_items = pd.DataFrame({'A':A,'B':B,'C':C})
cv = CountVectorizer()
# use pd.DataFrame here to avoid your error and add your column name
sample = pd.DataFrame(pdt_items['A']+','+pdt_items['B']+','+pdt_items['C'].astype('str'), columns=['Output'])
vectorized = cv.fit_transform(sample['Output'])
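The first error in the question, "cannot unpack non-iterable int object", is consistent with ngram_range=(1): parentheses alone do not create a tuple in Python, so (1) is just the integer 1, whereas CountVectorizer documents ngram_range as a (min_n, max_n) tuple. The sketch below illustrates the difference in plain Python; unpack_ngram_range is a hypothetical helper that only mimics the unpacking step and is not scikit-learn code.

```python
ngram_as_written = (1)    # parentheses only group: this is the int 1
ngram_as_tuple = (1, 1)   # unigrams only, written as a real tuple


def unpack_ngram_range(ngram_range):
    """Hypothetical helper: unpack a (min_n, max_n) pair."""
    min_n, max_n = ngram_range
    return min_n, max_n


try:
    unpack_ngram_range(ngram_as_written)
    error_message = None
except TypeError as exc:
    # An int cannot be unpacked into two names.
    error_message = str(exc)

print(type(ngram_as_written).__name__, error_message)
print(unpack_ngram_range(ngram_as_tuple))
```

The answer above sidesteps the problem by calling CountVectorizer() with its defaults.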
With the help of @QuantStats's comment, I applied the cv on each row of the dataframe as follows:
row_input = df['column_name'].tolist()
kwds = []
for i in range(len(row_input)):
    cell_input = [row_input[i]]
    full_set = row_keywords(cell_input, 1,1)
    candidates = [x for x in full_set if x[1]> 1] # keep only keywords with a frequency above 1
    kwds.append(candidates)
kwds_col = pd.Series(kwds)
df['Keywords'] = kwds_col
("row_keywords" is a function for CountVectorizer.)

Pandas: how to add a dataframe inside a cell of another dataframe?

I have an empty dataframe like the following:
simReal2013 = pd.DataFrame(index = np.arange(0,1,1))
Then I read as dataframes some .csv files.
stat = np.arange(0,5)
xv = [0.005, 0.01, 0.05]
br = [0.001,0.005]
for i in xv:
    for j in br:
        I = 0
        for s in stat:
            string = 'results/2013/real/run_%d_%f_%f_15.0_10.0_T0_RealNet.csv'%(s,i,j)
            sim = pd.read_csv(string, sep=' ')
            I += np.array(sim.I)
        sim.I = I / 5
        col = '%f_%f'%(i,j)
        simReal2013.insert(0, col, sim)
I would like to add the dataframe that I read in a cell of simReal2013. In doing so I get the following error:
ValueError: Wrong number of items passed 9, placement implies 1
Yes, putting a dataframe inside of a dataframe is probably not the way you want to go, but if you must, this is one way to do it:
df_in=pd.DataFrame([[1,2,3]]*2)
d={}
d['a']=df_in
df_out=pd.DataFrame([d])
type(df_out.loc[0,"a"])
>>> pandas.core.frame.DataFrame
Maybe a dictionary of dataframes would suffice for your use.
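A dictionary of dataframes could be built with the same key format as the column names in the question. This is a minimal sketch, assuming a hypothetical load_average helper that returns a small made-up frame in place of reading and averaging the CSV files.

```python
import numpy as np
import pandas as pd

xv = [0.005, 0.01, 0.05]
br = [0.001, 0.005]


def load_average(i, j):
    """Hypothetical stand-in for reading the five runs and averaging column I."""
    return pd.DataFrame({'t': np.arange(3), 'I': np.full(3, i + j)})


# One entry per parameter pair, keyed like the column names in the question.
sim_real_2013 = {}
for i in xv:
    for j in br:
        key = '%f_%f' % (i, j)
        sim_real_2013[key] = load_average(i, j)

print(sorted(sim_real_2013))
```

Each frame keeps its own shape and is looked up by key, for example sim_real_2013['0.005000_0.001000'].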

Text To Column Function

I am trying to write my own function in Python 3.5, but not having much luck.
I have a data frame that is 17 columns, 1,200 rows (tiny)
One of the columns is called "placement". Within this column, I have text contained in each row. The naming convention is as follows:
Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_
The following code works perfectly and does exactly what I need it to do; I just don't want to repeat it for every data set I have:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
df_detailed = df.join(df_detailed)
new_columns = *["Then i rename the columns labelled 0,1,2 etc"]*
df_detailed.columns = new_columns
df_detailed.head()
What I'm trying to do is build a function that takes any column with _ as the delimiter and splits it across new columns.
I have tried the following (but unfortunately defining my own functions is something I'm horrible at):
def text_to_column(df):
    df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
    headings = df_detailed.columns
    headings.replace(" ", "_")
    df_detailed = df.join(df_detailed)
    df_detailed.columns = headings
    return (df)
and I get the following error "AttributeError: 'RangeIndex' object has no attribute 'replace'"
The end goal here is to write a function where I can pass the column name into the function, it separates the values contained within the column into new columns and then joins this back to my original Data Frame.
If I'm being ridiculous, please let me know. If someone can help me, it would be greatly appreciated.
Thanks,
Adrian
You need the rename function to replace column names:
headings = df_detailed.columns
headings.replace(" ", "_")
change to:
df_detailed = df_detailed.rename(columns=lambda x: str(x).replace(" ", "_"))
Or convert the columns with to_series, because replace does not work on an Index (the column names):
headings.replace(" ", "_")
change to:
headings = headings.to_series().replace(" ", "_")
Also:
df_detailed = df['Placement'].str[0:-1].str.split('_', expand=True).astype(str)
can be changed to:
df_detailed = df['Placement'].str.rstrip('_').str.split('_', expand=True).astype(str)
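The two forms are not fully interchangeable: str[0:-1] always drops the last character, while str.rstrip('_') only removes trailing underscores. A small sketch with invented values shows the difference when a value has no trailing underscore:

```python
import pandas as pd

# Invented values: the second one has no trailing underscore.
s = pd.Series(['a_b_c_', 'a_b_c'])

sliced = s.str[0:-1].tolist()          # drops the last character unconditionally
stripped = s.str.rstrip('_').tolist()  # removes trailing underscores only

print(sliced)
print(stripped)
```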
EDIT:
Sample:
df = pd.DataFrame({'a': [1, 2], 'Placement': ['Campaign_Publisher_Site_AdType_AdSize_Device_Audience_Tactic_', 'a_b_c_d_f_g_h_i_']})
print (df)
                                           Placement  a
0  Campaign_Publisher_Site_AdType_AdSize_Device_A...  1
1                                   a_b_c_d_f_g_h_i_  2
#input is DataFrame and column name
def text_to_column(df, col):
    df_detailed = df[col].str.rstrip('_').str.split('_', expand=True).astype(str)
    #replace columns names if necessary
    df_detailed.columns = df_detailed.columns.to_series().replace(" ", "_")
    #remove column and join new df
    df_detailed = df.drop(col, axis=1).join(df_detailed)
    return df_detailed
df = text_to_column(df, 'Placement')
print (df)
   a         0          1     2       3       4       5         6       7
0  1  Campaign  Publisher  Site  AdType  AdSize  Device  Audience  Tactic
1  2         a          b     c       d       f       g         h       i
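Since the question also renames the numbered columns afterwards, the function could accept the new names as well. The variant below is only a sketch: the new_names and sep parameters are assumptions added for illustration and are not part of the answer above, and the sample frame is shortened to three parts.

```python
import pandas as pd


def text_to_column(df, col, new_names=None, sep='_'):
    """Split df[col] on sep into new columns and join them back to df.

    new_names (hypothetical parameter): optional labels for the new columns;
    its length must match the number of parts produced by the split.
    """
    parts = df[col].str.rstrip(sep).str.split(sep, expand=True)
    if new_names is not None:
        parts.columns = new_names
    # Drop the original text column and attach the split parts.
    return df.drop(columns=col).join(parts)


df = pd.DataFrame({'a': [1, 2],
                   'Placement': ['Camp_Pub_Site_', 'x_y_z_']})
result = text_to_column(df, 'Placement',
                        new_names=['Campaign', 'Publisher', 'Site'])
print(result)
```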
