I have a pandas DataFrame named 'dataset' that contains a column named 'class'.
When I execute the following line, I get SyntaxError: invalid syntax:
print("Unique values in the Class column:", dataset.class.unique())
It works for other column names, but not for 'class'.
How can I use a keyword as a column name in pandas?
class is a keyword in Python. A rule of thumb: whenever a column name cannot be used as a valid variable name in Python, you must use bracket notation to access it: dataset['class'].unique().
There are, of course, exceptions here, but they work against you. For example, min and max are valid variable names in Python (even though they shadow builtins). In pandas, however, you cannot refer to a column with such a name using the Attribute Access notation, because the name resolves to the method instead. There are more such exceptions; they're enumerated in the documentation.
A good place to begin further reading is the documentation on Attribute Access, specifically the red Warning box, which I'm adding here for posterity:
You can use this access only if the index element is a valid Python
identifier, e.g. s.1 is not allowed. See here for an explanation of
valid identifiers.
The attribute will not be available if it conflicts with an existing
method name, e.g. s.min is not allowed, but s['min'] is possible.
Similarly, the attribute will not be available if it conflicts with
any of the following list: index, major_axis, minor_axis, items.
In any of these cases, standard indexing will still work, e.g. s['1'],
s['min'], and s['index'] will access the corresponding element or
column.
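The rule of thumb above can be sketched as follows. The data here is made up for illustration; only the column name 'class' and the variable name dataset come from the question.

```python
import keyword

import pandas as pd

# Hypothetical data; only the column name 'class' comes from the question.
dataset = pd.DataFrame({"class": ["a", "b", "a"], "min": [1, 2, 3]})

# 'class' is a reserved word, so `dataset.class` cannot even be parsed.
is_reserved = keyword.iskeyword("class")

# Bracket notation works for any column label.
unique_classes = dataset["class"].unique()

# `dataset.min` resolves to the DataFrame.min method, not to the 'min' column.
min_is_method = callable(dataset.min)
min_column = dataset["min"]

print("Unique values in the class column:", unique_classes)
```

keyword.iskeyword is a quick way to check whether a column name will trip the parser before you reach for dot notation.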
class is a reserved word in Python.
Use bracket notation instead: dataset['class'].unique()
I am trying to fine-tune Tapas following the instructions here: https://huggingface.co/transformers/v4.3.0/model_doc/tapas.html#usage-fine-tuning , Weak supervision for aggregation (WTQ) using the https://www.microsoft.com/en-us/download/details.aspx?id=54253 , which follow the required format of dataset in the SQA format, tsv files with most of the named columns. But, there is no float_answer column. And as mentioned,
float_answer: the float answer to the question, if there is one (np.nan if there isn’t). Only required in case of weak supervision for aggregation (such as WTQ and WikiSQL)
Since I am using WTQ, I need the float_answer column. I tried populating float_answer based on answer_text as suggested here, using https://github.com/google-research/tapas/blob/master/tapas/utils/interaction_utils_parser.py 's parse_question(table, question, mode) function. However, I am getting errors.
I copied everything from here and put these args:
.
But, I get this error: TypeError: Parameter to CopyFrom() must be instance of same class: expected language.tapas.Question got str.
1) Can you please help me understand which args I should use, or how else I can populate float_answer?
I am using table_csv and the question, answer to which is in the table given:
2) We have also tried simply adding a float_answer column with all values set to np.nan. That crashed, too.
Is there a tutorial for WTQ fine-tuning? Thanks!
I'm not quite sure how to phrase this question, so let me illustrate with an example.
Let's say you have a Pandas dataframe called store_df with a column called STORE_NUMBER. There are two ways to access a given column in a Pandas dataframe:
store_df['STORE_NUMBER']
and
store_df.STORE_NUMBER
Now let's say that you have a variable called column_name which contains the name of a column in store_df as a string. If you run
store_df[column_name]
All is well. But if you try to run
store_df.column_name
Python throws an AttributeError because it is looking for a literal column named "column_name" which doesn't exist in our hypothetical dataframe.
My question is: is there a way to look up columns dynamically using the second syntax (dot notation)? Not because there is anything wrong with the first syntax (bracket notation), but because I am curious whether some advanced feature of Python lets you substitute the value of a variable for an attribute name (in this case, a column of the dataframe). I know there is the exec function, but I was wondering if there was a more elegant solution. I tried
store_df.{column_name}
but received a SyntaxError.
Would getattr(df, 'column_name_as_str') be the kind of thing you're looking for, perhaps?
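A minimal sketch of the getattr suggestion, assuming a small made-up store_df (the values are hypothetical; the names store_df, STORE_NUMBER and column_name come from the question):

```python
import pandas as pd

# Hypothetical data for illustration.
store_df = pd.DataFrame({"STORE_NUMBER": [101, 102, 103]})
column_name = "STORE_NUMBER"

# Bracket notation with a variable holding the column name.
by_bracket = store_df[column_name]

# getattr looks up the attribute whose name is the *value* of column_name,
# which is what store_df.STORE_NUMBER does with a literal name.
by_getattr = getattr(store_df, column_name)

same = by_bracket.equals(by_getattr)
print(same)
```

Note that getattr inherits the limits of dot notation: it only reaches columns whose names are valid identifiers and do not collide with existing DataFrame attributes, so bracket notation remains the more general choice.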
I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series
usually used when selecting specific rows. Like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or
we'll consistently get "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. In the first dataframe (dataframes[0]), I have a column named 'cnj' with string values, some of which have a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? I can see that the object is a Series, but I expected the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses .str. :
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed to retrieve the last 5 characters, and that in turn returns another new Series. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
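To make the difference concrete, here is a small sketch with made-up data (the first value is the one from the question; the second is hypothetical). It reproduces the AttributeError and then applies the two-step .str chain:

```python
import pandas as pd

# The first value is taken from the question; the second is invented.
data = pd.DataFrame({"Name": ["0100758-73.2019.5.01.0064 ", "  abc12345"]})

# Calling strip() directly on the Series fails: the method lives on .str.
try:
    data["Name"].strip()
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Each .str call returns a new Series, so the calls can be chained.
stripped = data["Name"].str.strip()
last_five = stripped.str[-5:]

print(error_message)
print(last_five.tolist())
```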
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
As I understand your question, you want the strings in a Series without the spaces on the right. Use str.rstrip(), which is available through the .str accessor on Series objects. A DataFrame has no .str accessor, so apply it one column at a time.
Note: strip() is a method of Python strings, not of a Series, so the error you are getting is expected.
Refer to link , and try implementing str.rstrip() provided by pandas.
For str.strip() you can refer to this link, it works for me.
In your case, assuming the dataframe column to be s, you can use the below code:
df[s].str.strip()
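A short sketch of the difference between str.rstrip() and str.strip() on a column, using invented values. It also checks that the .str accessor lives on the Series and not on the DataFrame:

```python
import pandas as pd

# Hypothetical values with leading and trailing spaces.
df = pd.DataFrame({"cnj": ["  a ", "b  "]})

# rstrip removes trailing whitespace only; strip removes both sides.
right_stripped = df["cnj"].str.rstrip()
both_stripped = df["cnj"].str.strip()

# The accessor exists on the column (a Series), not on the DataFrame itself.
frame_has_str = hasattr(df, "str")
series_has_str = hasattr(df["cnj"], "str")

print(right_stripped.tolist(), both_stripped.tolist())
```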
I'm working with the wisconsin breast cancer dataset found here. Feature engineering is important in machine learning so a teacher of mine recommended the MeanEncoder part of a library found here. The dataframe looks like the following:
I specifically changed the diagnosis feature/column to category because one of the errors suggested that might have been the issue, but apparently it was not, as the problem is still there.
I want to mean encode the target feature/column using MeanEncode found in the library linked above. Here's my function to attempt to do so:
def MeanEncoding(self):
    # Get the columns besides the target variable at the front, which is diagnosis, as recommended by teacher.
    cols = self.m_df.iloc[:, 1:].columns.to_list()
    # Save specifically the target variable too.
    target = self.m_df.iloc[:, 0]
    # Now get the object ready.
    encoder = MeanEncoder(variables=cols)
    print('---Fitting---')
    encoder.fit(self.m_df.drop('diagnosis', axis=1), target)
In this code:
1. m_df - just the dataframe, hence the "df"
2. I drop the diagnosis column/feature in the first argument of encoder.fit, since it's provided in the 2nd argument of the same function. But it makes no difference, because I still get the error: "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer"
Now with #2, I'm thinking, "No way, I have to transform the numeric features which are 'radius_mean', 'texture_mean', etc into category or object? That makes 0 sense". But I google this error of course and it brings me to this SO thread. This individual is having similar concerns like me except with a different function. The suggestion for him was "Just change the dtype of grade column to object before using imputer", so I change the types as well to object with the following code:
for i in range(1, len(self.m_df.columns)):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
Doesn't make sense to me because it's converting the types of genuine numeric columns/features. I get this error which is KIND of expected:
pandas.core.base.DataError: No numeric types to aggregate
Now I'm thinking it just wants a few numeric types, so I slightly alter the code:
for i in range(1, len(self.m_df.columns) - 2):
    columnName = self.m_df.columns[i]
    self.m_df[columnName] = self.m_df[columnName].astype('object')
Which literally just leaves the last 2 columns as float64 types and therefore all others are type object (besides the diagnosis column which is category but I doubt that matters). Now some numeric types ARE present. Yet I still get the error again
TypeError: Some of the variables are not categorical. Please cast
them as object or category before calling this transformer
I am clearly missing something but not sure what. No matter how I alter the types to satisfy the function, it's wrong.
The MeanEncoder from Feature-engine, as well as all other Feature-engine encoders, work only on variables cast as object or category by default.
So the variables captured in the list cols in this line of code: cols = self.m_df.iloc[:, 1:].columns.to_list() should only contain categorical variables (object or category).
When you set up the encoder here: encoder = MeanEncoder(variables=cols), in variables, you indicate the variables to encode. If you pass cols, it means you want to encode all the variables within the cols list. So you need to ensure that all of them are of type category or object.
If you get the error "TypeError: Some of the variables are not categorical. Please cast them as object or category before calling this transformer", it means that some of the variables in cols are not of type object or category.
If you want to encode numerical variables, there are 2 options: 1) recast the variables you want to encode as object. 2) set the parameter ignore_format=True as per the transformer's documentation. That should solve your problem.
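As a rough sketch of the check described above, the following uses pandas only (it does not call MeanEncoder) to list which columns would trigger the error and to show option 1, the recast to object. The three-row frame is invented; only the column names come from the question.

```python
import pandas as pd
from pandas.api.types import is_object_dtype

# Hypothetical miniature of the dataset described in the question.
m_df = pd.DataFrame({
    "diagnosis": pd.Categorical(["M", "B", "M"]),
    "radius_mean": [17.99, 13.54, 20.57],
    "texture_mean": [10.38, 14.36, 17.77],
})

# Same column selection as in the question: everything after the target.
feature_cols = m_df.columns[1:].to_list()


def is_categorical(series):
    """True if the column is cast as object or category."""
    return is_object_dtype(series) or isinstance(series.dtype, pd.CategoricalDtype)


# Columns that are neither object nor category are the ones the error refers to.
not_categorical = [c for c in feature_cols if not is_categorical(m_df[c])]

# Option 1 from the answer: recast the variables to encode as object.
recast = m_df[feature_cols].astype("object")
all_object = all(is_object_dtype(recast[c]) for c in recast.columns)

print(not_categorical, all_object)
```

Running a check like this before building the encoder makes it clear which entries of cols need recasting, or whether the ignore_format=True route mentioned above is the better fit.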
I am trying to rename columns after I read in a csv file where I want to replace a portion of the column name.
Here is my code, where I want to remove:
'yor_yogurt_march_poggen_output_with_shrink.'
I get a user warning. I am using the pandas library if that matters.
df.dolumns = df.columns.str.replace('yor_yogurt_march_poggen_output_with_shrink.', '')
The answer is in the title of your question: Pandas doesn't allow columns
to be created via a new attribute name.
If your df does not have a column with the given name,
you can not refer to this name using attribute notation
(in this case df.dolumns).
You have to specify the new column name as an index, i.e.:
df['dolumns'] = ...
Another detail: after = you have df.columns, which holds the
column names existing so far.
As I understand it, you want to perform the same replace in each column name (deleting the mentioned fragment).
So maybe the resulting column names should be assigned
to df.columns (starting with c, not d)?
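If the intent was indeed to assign back to df.columns, a minimal sketch could look like this. The column suffixes sales and units are made up; the prefix is the one from the question. Passing regex=False is an addition here, so that the trailing '.' is matched literally and not as a regular-expression wildcard.

```python
import pandas as pd

prefix = "yor_yogurt_march_poggen_output_with_shrink."

# Hypothetical columns carrying the prefix from the question.
df = pd.DataFrame({prefix + "sales": [1], prefix + "units": [2]})

# Assign to df.columns (with a c); regex=False treats the prefix as plain text.
df.columns = df.columns.str.replace(prefix, "", regex=False)

print(list(df.columns))
```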
Double-check that df.dolumns is only a typo in your post and not in your code. If it is in your code, pandas thinks you're trying to set a new attribute .dolumns on your df.