I have a dataframe and I'm trying to encode all of its categorical values. The following is the code I wrote to encode all categorical columns in one go:
for col in data.select_dtypes('object').columns:
    data[col] = data[col].astype('category').cat.codes
but this only works sometimes, and often it throws the following error saying the DataFrame has no attribute cat:
AttributeError: 'DataFrame' object has no attribute 'cat'
Now I'm not able to understand why it works sometimes and fails other times. Also, I haven't applied the cat accessor to the whole dataframe, only to one column (a Series) on each pass through the loop.
Does anyone know what's going wrong here?
The problem is duplicated column names: if you select one column by a duplicated label, you get back all the columns with that label (a DataFrame, not a Series).
for col in data.select_dtypes('object').columns:
    print(col)
    # check what the selection returns - a Series, or a DataFrame if the label is duplicated
    print(data[col])
    data[col] = data[col].astype('category').cat.codes
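A quick way to confirm this is to look for duplicated labels before encoding (a minimal sketch; data is the dataframe from the question):
# show any column labels that occur more than once
print(data.columns[data.columns.duplicated()])
# one option: keep only the first occurrence of each label, then encode as before
data = data.loc[:, ~data.columns.duplicated()]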
def store_press_release_links(all_sublinks_df, column_names):
    all_press_release_links_df = pd.DataFrame(columns=column_names)
    for i in range(len(all_sublinks_df)):
        if (len(all_sublinks_df.loc[i, 'sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES')) > 0):
            all_press_release_links_df.loc[i, 'link'] = all_sublinks_df[i, 'link']
            all_press_release_links_df.loc[i, 'sub_link'] = all_sublinks_df[i, 'sublink']
        else:
            continue
    all_press_release_links_df = all_press_release_links_df.drop_duplicates()
    all_press_release_links_df.reset_index(drop=True, inplace=True)
    return all_press_release_links_df
store_press_release_links() is a function that accepts a dataframe, all_sublinks_DF, which has two columns: 1. link, 2. sub_link.
The contents of both these columns are link names.
I want to look through all the link names present in the sub_link column of the all_sublinks_DF Dataframe one by one and check if the link has the keywords ' press|releases|Press|Releases|PRESS|RELEASES' in it.
If it does, then I want to store that entire row of the all_sublinks_DF Dataframe to a new dataframe all_press_release_links_df.
But when I run this function it gives the error : AttributeError: 'str' object has no attribute 'str'
Where am I going wrong?
I think what you want to do, instead of the loop, is just:
all_press_release_links_df = all_sublinks_df[all_sublinks_df['sub_link'].str.contains('press|releases|Press|Releases|PRESS|RELEASES')]
There are many things wrong here. There are almost no situations where you want to loop through a pandas dataframe, and equally almost none where you want to build a dataframe row by row.
Here, the entire operation can be done in a single expression (split over two lines for readability), like so:
def store_press_release_links(all_sublinks_df):
    check_str = 'press|releases|Press|Releases|PRESS|RELEASES'
    return all_sublinks_df[all_sublinks_df.sub_link.str.contains(check_str)]
The reason you got that error message is that you were selecting individual cells of the dataframe, which are strings, not pandas.Series objects. The str accessor is a pd.Series property, so calling .str on a plain string raises AttributeError: 'str' object has no attribute 'str'.
Also note that the column_names parameter is no longer needed.
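As a side note, the alternation of upper- and lower-case spellings can be collapsed with pandas' case-insensitive flag (a sketch, assuming only those two keywords matter):
def store_press_release_links(all_sublinks_df):
    # case=False makes the match case-insensitive, covering Press, PRESS, etc.
    return all_sublinks_df[all_sublinks_df.sub_link.str.contains('press|releases', case=False)]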
I have a df with one column (SKUID) from which I want to remove all non-numerical characters. Here is a sample of the column:
Essentially I want to remove the underscore and the letter from each row. I have tried the following code:
sku_data.split('_', 1)[0]
This gives me an error of 'DataFrame' object has no attribute 'split'. Where am I going wrong?
This should do for number extraction:
sku_data['SKUID'] = sku_data['SKUID'].str.extract(r'(\d+)', expand=False)
Note: don't forget the .str accessor when you want to perform string operations on a DataFrame column (expand=False makes extract return a Series rather than a DataFrame).
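For example, on some made-up values (the original sample isn't shown, so the SKUs below are assumptions):
import pandas as pd

sku_data = pd.DataFrame({'SKUID': ['1234_A', '5678_B']})
# keep only the first run of digits in each value
sku_data['SKUID'] = sku_data['SKUID'].str.extract(r'(\d+)', expand=False)
print(sku_data)  # SKUID is now '1234', '5678'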
Unfortunately I am unable to produce a reproducible example, but here's the issue I'm running into: with one dataframe, I am able to loop through the columns and save the count of unique values per column. With another dataframe, which has the exact same columns and data (the only difference being that the second dataframe is all object dtypes, while the first has some ints and floats), I run into an unhashable type: 'dict' error.
This works:
for col in olddf.columns:
    unique = len(olddf[col].unique())
    print(col, unique)
I get an unhashable type: 'dict' error with this:
for col in orig_results.columns:
    unique = len(orig_results[col].unique())
Like I mentioned, I'm unfortunately unable to come up with a sample dataset to replicate this. Does anyone have a general idea of what might be happening? Thanks!
Turns out it was the location column throwing the error, which contains dictionaries as values: {'latitude': '40.7388739110531', 'longitude': '40.738873911'}. Since dictionaries are unhashable, unique() cannot hash the values to count them.
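One workaround (a sketch) is to serialize the values to strings before counting, which makes dicts hashable:
for col in orig_results.columns:
    # astype(str) turns each dict into its string form so nunique() can hash it
    unique = orig_results[col].astype(str).nunique()
    print(col, unique)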
I am trying to use a dataframe to regroup different kinds of data.
I have a dataframe with 3 columns:
one that I define as the index (used in a groupby command)
one that holds a parameter, say 'valeur1', for which I want the mean over rows sharing the same index (used a mean command after the groupby)
the last column contains strings. There is only 1 string per index, but some cells might contain NaN.
In the end I am trying to get a dataframe with the mean of the parameter per index, as well as the string that goes with each index (NaN in the string column is not important). Here is a picture with an example of what I am trying to get: illustration. The main issue is that DataFrame.mean does not work with strings.
The code I have used so far is pretty basic:
dataRaw = pd.read_csv('file.csv', sep=';', encoding='latin-1')
data = dataRaw.groupby(index)
databis = data.mean()
Any suggestion would be greatly appreciated.
Thanks !
I think you need to group by multiple columns:
databis = dataRaw.groupby(['index', 'String']).mean()
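If the goal is instead a single row per index holding both the mean of the numeric column and the accompanying string, an aggregation dict is another option (a sketch; the column names 'index', 'valeur1' and 'String' are assumed from the question, and 'first' skips NaN within each group):
databis = dataRaw.groupby('index').agg({'valeur1': 'mean', 'String': 'first'})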
I am trying to filter rows by excluding the value './.' in the column sample.single. All the data types are objects. I have attempted the options below; I suspect the special characters are breaking the second attempt. The dataframe comprises 195 columns.
My gtdata columns:
Index(['sample.single', 'sample2.single', 'sample3.single'], dtype='object')
Please advise, thank you!
gtdata = gtdata[('sample.single')!= ('./.') ]
I receive a key error: KeyError: True
When I try:
gtdata = gtdata[gtdata.sample.single != ('./.') ]
I receive an attribute error:
AttributeError: 'DataFrame' object has no attribute 'single'
Not 100% sure what you're trying to achieve, but assuming you have cells containing the "./." string which you want to filter out, the below is one way to do it:
import pandas as pd

# generate some sample data
gtdata = pd.DataFrame({'sample.single': ['blah1', 'blah2', 'blah3./.'],
                       'sample2.single': ['blah1', 'blah2', 'blah3'],
                       'sample3.single': ['blah1', 'blah2', 'blah3']})

# filter out all cells in column sample.single containing ./.
# regex=False treats the pattern literally; by default '.' is a regex wildcard
gtdata = gtdata[~gtdata['sample.single'].str.contains("./.", regex=False)]
When subsetting in pandas you should be passing a boolean vector with the same length as the DataFrame.
The problem with your first approach is that ('sample.single') != ('./.') compares two plain strings, not any column of the DataFrame, and evaluates to a single boolean value, which is why gtdata[True] raises KeyError: True.
The problem with your second approach is that gtdata.sample.single doesn't make sense in pandas syntax. To get the sample.single column you have to refer to it as gtdata['sample.single']. If your column name did not contain a ".", you could use the attribute shorthand you were trying, e.g. gtdata.sample_single.
I recommend reviewing the documentation for subsetting Pandas DataFrames.
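If the goal is to drop rows whose cell is exactly './.' (rather than merely containing it), a plain comparison against the column yields the boolean vector needed (a sketch):
# keep only rows where sample.single is not the literal './.' string
gtdata = gtdata[gtdata['sample.single'] != './.']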