Taking part of string in python error - python

I got the error Can only use .str accessor with string values (i.e. inferred_type is 'string', 'unicode' or 'mixed')
for this code:
newestdata = newestdf.assign(
idobject=newestdf.index.str.split('/').str[1].str.replace("-", "").str.extract('(\d+)', expand=False).astype(int))
I use it to take a part of an index value like this:
OOOO-ASAS/INTEL-64646/OOOO-15445/PPPO-9
The error occurs in one Python script, but the same code works fine in another. Do you have any idea what the problem is?

The problem is that the index holds mixed data: some numeric values alongside strings.
Cast the index to string as the first step:
newestdata = newestdf.assign(
idobject=newestdf.index.astype(str).str.split('/').str[1].str.replace("-", "").str.extract('(\d+)', expand=False).astype(int))
^^^^^^^^^^
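To make the cast concrete, here is a minimal sketch. The helper name extract_id and the second sample index value are made up for illustration; only the first sample string comes from the question. It shows that a purely numeric index has no usable .str accessor, and that casting with astype(str) first lets the same chain run.

```python
import pandas as pd


def extract_id(index):
    """Return the digits of the second '/'-separated part of each label as int.

    Hypothetical helper: the name and signature are assumptions, not pandas API.
    """
    as_text = index.astype(str)  # guarantees the .str accessor is available
    return (
        as_text.str.split("/").str[1]          # e.g. 'INTEL-64646'
        .str.replace("-", "", regex=False)     # e.g. 'INTEL64646'
        .str.extract(r"(\d+)", expand=False)   # e.g. '64646'
        .astype(int)
    )


# A numeric index does not support .str at all.
try:
    pd.Index([1, 2]).str
    numeric_index_has_str = True
except AttributeError:
    numeric_index_has_str = False

# Made-up sample labels shaped like the one in the question.
idx = pd.Index([
    "OOOO-ASAS/INTEL-64646/OOOO-15445/PPPO-9",
    "OOOO-ASAS/INTEL-123/OOOO-1/PPPO-2",
])
ids = extract_id(idx)
```

Note that labels without a second '/'-separated part would produce missing values, and the final astype(int) would then fail, so real data may need an extra cleaning step.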

Related

.strip() with in-place solution not working

I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series
usually used when selecting specific rows. Like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or
we'll consistently get "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. In the first dataframe (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It seemed to me it shouldn't work because it's a Series, but it should give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses .str. :
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object itself has no string or date manipulation methods. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed again to retrieve the last 5 characters, and that result is yet another new Series. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
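A small sketch of the difference between calling strip on the Series and going through the .str accessor. The first sample value is the one from the question; the second is made up for illustration.

```python
import pandas as pd

# One value from the question plus one invented value with a leading space.
df0 = pd.DataFrame({"cnj": ["0100758-73.2019.5.01.0064 ", " 0000001-11.2020.5.01.0001"]})

# A Series has no strip method of its own; only the .str accessor exposes it.
series_has_strip = hasattr(df0["cnj"], "strip")

# .str.strip() returns a new Series, so assign it back to the column.
df0["cnj"] = df0["cnj"].str.strip()
```

Because there is no inplace option, the assignment back to the column is what actually stores the stripped values.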
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
As I understand your question, you need the strings in a Series without the trailing spaces. Use str.rstrip(), which is available through the .str accessor on Series (and Index) objects; a DataFrame has no .str accessor, so apply it column by column.
Note: strip() is a method of plain Python strings, not of a Series, so the error you are getting is expected.
Refer to the pandas documentation for Series.str.rstrip and try it.
For Series.str.strip() you can refer to the documentation as well; it works for me.
In your case, assuming the dataframe column to be s, you can use the below code:
df[s].str.strip()

Getting an error when converting to float to get top 10 largest values

I am trying to use the nlargest function to return the top 10 values, using the code below:
df['roi'].astype(float).nlargest(3, 'roi')
But get an error of
ValueError: keep must be either "first", "last" or "all"
The roi column has dtype object, which is why I use astype(float), but I am still getting an error.
When I try keep='all', keep='first' or keep='last' in the nlargest call, I get TypeError: nlargest() got multiple values for argument 'keep'
Thanks!
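Passing a column name as the second argument only works for DataFrame.nlargest, so to use the method that way, change your code to:
df.astype(float).nlargest(3, 'roi')
Note that this converts every column to float. If you select the column by its key first, as in a dictionary, you are working with a pandas.Series. Series.nlargest takes no column argument, so 'roi' is interpreted as keep, which explains both errors. The correct syntax is then: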
df['roi'].astype(float).nlargest(3)
See the pandas documentation for DataFrame.nlargest and Series.nlargest.
For a one-liner you'll need to convert "roi" to a float type first, and then perform nlargest:
Passing a dictionary to .astype allows us to return the entire DataFrame making selective changes to specific columns' dtypes, and then we can perform .nlargest on that returned DataFrame (instead of just having a Series).
df.astype({"roi": float}).nlargest(3, columns="roi")
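For comparison, a short sketch of both variants on invented data (the name column and all values are assumptions; only the roi column name comes from the question). The Series form returns just the values, while the DataFrame form keeps the other columns.

```python
import pandas as pd

# Made-up data: roi is stored as strings, i.e. dtype object.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e"],
    "roi": ["1.5", "10.2", "3.0", "7.7", "0.4"],
})

# Series.nlargest takes only n (and keep); there is no column argument.
top_series = df["roi"].astype(float).nlargest(3)

# DataFrame.nlargest needs the column name; only roi is converted here.
top_frame = df.astype({"roi": float}).nlargest(3, columns="roi")
```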

replace a string in entire dataframe from excel with value

I have this kind of data from excel
dminerals=pd.read_excel(datafile)
print(dminerals.head(5))
Then I replace the 'Tr' and NaN value using for loop with this script
for key, value in dminerals.iteritems():
    dminerals[key] = dminerals[key].replace(to_replace='Tr', value=int(1))
    dminerals[key] = dminerals[key].replace(to_replace=np.nan, value=int(0))
Then I print it again, along with the dtype of one column. The replacement seems to work, but the column still shows the object data type.
print(dminerals.head(5))
print(dminerals['C'].dtypes)
I tried using .astype to change one of the columns, ['C'], to integer, but the result is a ValueError:
dminerals['C'].astype(int)
ValueError: invalid literal for int() with base 10: 'tr'
I thought I had already changed the 'Tr' values in the dataframe into integers. Is there anything I missed in the process above? Please help, thank you in advance!
You are replacing Tr with 1, but there is also a lowercase tr that is not being replaced (this is what your ValueError is saying). Remember that Python is case sensitive. Also, looping over the columns is inefficient; you might want to try the following line of code instead:
dminerals = dminerals.replace({'Tr':1,'tr':1}).fillna(0)
I'm using fillna(), which is a better way to fill the null values with a specified value (0 in this case) than using replace.
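A minimal sketch of the whole clean-up on invented data (the column names and values are assumptions standing in for the Excel file). After both spellings are replaced and the missing values filled, the conversion to int goes through.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the spreadsheet: numbers, both spellings of 'Tr', and NaN.
dminerals = pd.DataFrame({
    "C": [5, "Tr", np.nan, "tr"],
    "Fe": ["Tr", 2, 3, np.nan],
})

# Replace both spellings in one pass, fill missing values, then convert.
cleaned = dminerals.replace({"Tr": 1, "tr": 1}).fillna(0).astype(int)
```

Some pandas versions may emit a warning about downcasting during replace or fillna on object columns; the explicit astype(int) at the end makes the intended dtype unambiguous.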

Panda Get_Value throwing error : '[xxxx]' is an invalid key

I am trying to use DataFrame.Get_Value(Index,ColumnName) to get the value of a column, and it keeps throwing the following error:
"'[10004]' is an invalid key" where 10004 is the index value.
This is how the DataFrame looks:
I have successfully used get_value before. I don't know what's wrong with this dataframe.
First, pandas.DataFrame.get_value is deprecated (and the method name is get_value, not Get_Value); it was removed entirely in pandas 1.0. Use .loc or .at instead:
df.loc[10004, 'Column_Name']
# Or:
df.at[10004, 'Column_Name']
Your issue might be that 10004 is stored in the index as a string instead of an integer. Try putting the index value in quotes (df.loc['10004', 'Column_Name']). You can check this easily with df.index.dtype: if it returns dtype('O'), the index holds Python objects such as strings rather than integers.
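The string-versus-integer mismatch can be reproduced with a small sketch. The data and the lookup helper are hypothetical; only the Column_Name label and the 10004 key come from the thread.

```python
import pandas as pd

# Made-up frame whose index labels are strings that look like integers.
df = pd.DataFrame({"Column_Name": ["a", "b"]}, index=["10004", "10005"])

# Looking up the integer 10004 fails because the labels are strings.
try:
    df.loc[10004, "Column_Name"]
    int_key_found = True
except KeyError:
    int_key_found = False


def lookup(frame, key, column):
    """Hypothetical helper: match the key type to the index before using .at."""
    if not pd.api.types.is_integer_dtype(frame.index):
        key = str(key)  # assume a non-integer index stores labels as text
    return frame.at[key, column]


value = lookup(df, 10004, "Column_Name")
```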

Python 3.6 - getting error 'an integer is required (got type str)' while converting some strings to time

I wrote a function that takes in a row that has some raw numeric data in one of the columns and converts it to minutes (it's a CSV file). All the columns have been kept as strings, since strings are easier to wrangle if I need to. However, when converting some data into time format (again as strings), I get the error described in the title. My function looks like the following:
def duration_in_mins(row, city):
    duration = [] # Make an empty list
    form = '%M.%S'
    if city == 'X':
        city_file = open('X.csv')
        city_csv = csv.reader(city_file)
        for row in city_csv:
            duration.append(row[1]) # The column that contains the strings
            for i in duration:
                datetime.datetime.fromtimestamp(i).strftime(form, 'minutes') # Conversion
    else:
        return 'Error'
    return duration
The line datetime.datetime.fromtimestamp(i).strftime(form, 'minutes') is where I'm getting the error according to traceback when I run duration_in_mins(). Is the traceback telling me to convert the string into numeric first and then to time? Can datetime not convert strings directly into time?
duration_in_mins(5, 'X')
line 39, in duration_in_mins
datetime.datetime.fromtimestamp(i).strptime(form, 'minutes')
TypeError: an integer is required (got type str)
As you say, datetime.fromtimestamp() requires a number but you are giving it a string. Python has a philosophy "In the face of ambiguity, refuse the temptation to guess" so it will throw an exception if you give a function an object of the wrong type.
Other problems with the code:
The strftime() method will only take one argument but you're
passing it a second one.
Also, did you mean to nest those loops? Every time you append another
value to duration you convert all of the values in the list to times, and
then if the conversion did work you are just throwing away the result
of the conversion and not saving it anywhere.
First, there is an inconsistency between your code and the reported error:
datetime.datetime.fromtimestamp(i).strftime(form, 'minutes') # Conversion
and
line 39, in duration_in_mins
datetime.datetime.fromtimestamp(i).strptime(form, 'minutes')
TypeError: an integer is required (got type str)
Please note that strptime is different from strftime.
Then, you should split the failing line to see exactly which function raised the exception:
x = datetime.datetime.fromtimestamp(i)
x.strftime(form, 'minutes') # Conversion
You will probably see that the error comes from fromtimestamp(ts), which needs a numeric timestamp argument instead of a str.
Convert this argument to int at some point and everything should be OK.
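Putting both answers together, a minimal sketch could look like the following. It assumes the raw strings are numbers of seconds, which the question does not state, and the function name format_durations is made up. The string is converted to a number first, strftime gets a single argument, and the result is stored instead of discarded.

```python
import datetime


def format_durations(raw_values, form="%M.%S"):
    """Convert numeric strings (assumed to be seconds) to 'minutes.seconds' text."""
    formatted = []
    for raw in raw_values:
        seconds = float(raw)  # fromtimestamp needs a number, not a str
        # UTC is used so the result does not depend on the local time zone.
        moment = datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)
        formatted.append(moment.strftime(form))  # strftime takes one argument
    return formatted


# Passing the string directly reproduces the TypeError from the question.
try:
    datetime.datetime.fromtimestamp("125")
    string_accepted = True
except TypeError:
    string_accepted = False

result = format_durations(["125", "61", "3599"])
```

With the "%M.%S" format, values of an hour or more wrap around, so durations that long would need a different approach such as datetime.timedelta.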
