.strip() with in-place solution not working - python

I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series are
usually used when selecting specific rows, like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume the string is 18 characters long, and
we'll consistently get the last 5 characters!
So, I have a list of DataFrames called 'dataframes'. In the first dataframe (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? I see now that it fails because it's a Series, but shouldn't it give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:])?

Use
dataframes[0]['cnj'] = dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0 = dataframes[0]
df0['cnj'] = df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object itself has no string or date manipulation methods. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed to retrieve the last 5 characters, which yields a new Series again. The expression is equivalent to:
temp_series = data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
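As a minimal sketch (with made-up values) of why the accessor matters:
import pandas as pd

data = pd.DataFrame({'Name': ['  Alice 12345  ', ' Bob 67890 ']})
# data['Name'].strip() would raise AttributeError: strip is a str method,
# not a Series method. The .str accessor applies it element-wise instead:
data['Name'] = data['Name'].str.strip().str[-5:]
print(data['Name'].tolist())  # ['12345', '67890']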

You could also just apply a transformation function to the column values, like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])

If what you need is a Series or DataFrame of strings without trailing spaces -- at least that's my understanding of your question -- use str.rstrip(), which works on both Series and DataFrame objects.
Note: strip() on its own only exists for plain string datatypes, so the error you are getting is expected.
Refer to this link, and try implementing the str.rstrip() provided by pandas.
For str.strip() you can refer to this link; it works for me.
In your case, assuming the column name is stored in s, you can use the code below:
df[s].str.strip()
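For instance, on the asker's value (a small illustrative sketch), rstrip() trims only the trailing side, while strip() trims both ends:
import pandas as pd

s = pd.Series([' 0100758-73.2019.5.01.0064 '])
print(repr(s.str.rstrip().iloc[0]))  # ' 0100758-73.2019.5.01.0064' -- leading space kept
print(repr(s.str.strip().iloc[0]))   # '0100758-73.2019.5.01.0064'  -- both ends trimmed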

Related

How to use .translate() correctly to remove characters that are neither alphabetical nor numerical?

I want to remove symbols (most of them, but not all) from my data column 'Review'.
A little background on my code:
from pandas.core.frame import DataFrame
# convert to lower case
data['Review'] = data['Review'].str.lower()
# remove trailing white spaces
data['Review'] = data['Review'].str.strip()
This is what I did, based on what I read on the internet (I'm still at the beginner level of NLP, so don't be surprised to find more than one mistake -- I just want to know what they are):
import string
sep = '|'
punctuation_chars = '"#$%&\\()*+,-./:;<=>?#[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ''))
data['Review'] = sep.join(df[df(data['Review']).tolist()]).translate(mapping_table).split(sep)
However, I get the following error:
AttributeError: 'DataFrame' object has no attribute 'tolist'
How could I solve it? I want to use .translate() because I read it's more efficient than other methods.
The AttributeError is caused because DataFrame.tolist() doesn't exist. It looks like the code assumes that df(data['Review']) is a Series, but it is actually a DataFrame.
df = DataFrame(data['Review'])
translated_reviews = sep.join(df['Review'].tolist()).translate(mapping_table).split(sep)
It's unclear whether data is a DataFrame. If it is, just use it in the join() without calling tolist() or instantiating a new DataFrame.
translated_reviews = sep.join(data['Review']).translate(mapping_table).split(sep)
Your problem was that you were trying to create a DataFrame object from a column of your data DataFrame and then convert that to a list -- the df[df(data['Review']).tolist()] part. You can either use
df.values.tolist(), which converts the whole DataFrame df to a list, or, if you just want to convert a single column, use data['Review'].tolist()
So in your situation the final line of your code would be switched to
data['Review'] = sep.join(data['Review'].tolist()).translate(mapping_table).split(sep)
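As an aside, pandas also exposes translate through the .str accessor, which avoids the join/split round-trip altogether. A small sketch, assuming data['Review'] holds plain strings (the sample reviews are made up):
import pandas as pd

data = pd.DataFrame({'Review': ['good; product!!', 'bad, (very) bad...']})
punctuation_chars = '"#$%&\\()*+,-./:;<=>?#[\\]^_`{}~'
mapping_table = str.maketrans(dict.fromkeys(punctuation_chars, ''))
# Series.str.translate applies the character mapping element-wise:
data['Review'] = data['Review'].str.translate(mapping_table)
print(data['Review'].tolist())  # ['good product!!', 'bad very bad']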

Can I force Python to return only in String-format when I concatenate two series of strings?

I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contains string values and missing values. Mostly it worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (interpreted as strings) like 82, but some of the values in column2 are a composition of letters and numbers starting with an E, like E52 or E83. When 82 and E83 are concatenated, the result I want is 82E83. Unfortunately, the result is 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn'] = (df['column1'] + df['column2']).astype(str)
or
df['newcolumn'] = (df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a solution for strictly keeping the string format?
Edit: Thanks for your comments. As I tried to reproduce the problem myself with a very short dataframe, the problem didn't occur there either. Finally I realized that it was only a problem with Excel automatically interpreting the cells as (wrong) numbers in the CSV output. I didn't notice it before, because another dataframe coming from a CSV file, which I merged with this dataframe on the concatenated strings, had also already been "destroyed" the same way by Excel. So the merge didn't work properly and I thought the concatenation in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
Type conversion is not required in this case. You can simply use
df["newcolumn"] = df.apply(lambda x: f"{x['column1']}{x['column2']}", axis=1)

Pandas string.contains doesn't work if searched string contains the substring at the beginning of the string

I'm using str.contains to search for rows where the column contains a particular string as a substring
df[df['col_name'].str.contains('find_this')]
This returns all the rows where 'find_this' is somewhere within the string. However, in the rare but important case where the string in df['col_name'] STARTS with 'find_this', this row is not returned by the above query.
str.contains() returns False where it should return True.
Any help would be greatly appreciated, thanks!
EDIT
I've added some example data as requested.
[Image of dataframe]
I want to update the 'Eqvnt_id' column, so for example, the rows where column 'Course_ID' contains AAS 102 all have the same 'Eqvnt_id' value.
To do this I need to be able to search the strings in 'Course_ID' for 'AAS 102' in order to locate the appropriate rows. However, when I do this:
df[df['Course_ID'].str.contains('AAS 102')]
The row that has 'AAS 102 (ENGL 102, JST 102, REL 102)' does not appear in the query!
The datatypes are all objects. I've tried mapping and casting them to string type, but it has had no effect on the query.
The data from the image can be found at https://github.com/isaachowen/stackoverflowquestionfiles
TLDR: Experiment with pandas.Series.str.normalize(), trying different Unicode forms until the issue is solved. 'NFKC' worked for me.
The problem had to do with the format of the data in the column that I was doing the...
df['column'].str.contains('substring')
...operation on. Using the pandas.Series.str.normalize() function works. Link here. Sometimes, under circumstances that I can't deliberately recreate, the strings would have '\xa0' and '\n' appended to them at the beginning or the end. This post helps address how to deal with that problem. Following it, I looped through every string column and changed the Unicode form until I found one that worked: 'NFKC'.
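A small sketch of that loop, with made-up course strings containing a non-breaking space:
import pandas as pd

df = pd.DataFrame({'Course_ID': ['AAS\xa0102 (ENGL 102, JST 102, REL 102)', 'AAS 102']})
print(df['Course_ID'].str.contains('AAS 102').tolist())  # [False, True] -- '\xa0' is not ' '
for col in df.select_dtypes(include='object'):
    df[col] = df[col].str.normalize('NFKC')  # NFKC folds '\xa0' into a plain space
print(df['Course_ID'].str.contains('AAS 102').tolist())  # [True, True]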
You can use pandas.Series.str.find() instead -- it returns the index where the substring is found; if it's at the start, the returned index is 0. If the substring is not found, it returns -1.
df[df['col_name'].str.find('find_this') != -1]
Let me know if this helps!
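If the goal is specifically a match at the start of the string, there is also a more direct accessor method, Series.str.startswith():
df[df['col_name'].str.startswith('find_this')]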

Pandas DataFrame replace does not work with inplace=True

In one column of my data frame I have version numbers like 6.3.5, 1.8, 5.10.0 saved as objects and thus likely as strings. I want to replace the dots with nothing so I get 635, 18, 5100. My code idea was this:
for row in dataset.ver:
    row.replace(".", "", inplace=True)
The thing is, it works if I don't set inplace to True, but we want to overwrite it and save it.
You're iterating through the elements within the DataFrame, in which case each element is presumably of type str (or is being coerced to str when you replace). Python's str.replace doesn't have an inplace=... argument.
You should be doing this instead:
dataset['ver'] = dataset['ver'].str.replace('.', '')
Sander van den Oord in the comments is quite correct to point out:
dataset['ver'].replace("[.]","", inplace=True, regex=True)
This is the way we do operations on a column in pandas, because in general pandas tries to optimize over for loops. The pandas developers consider for loops among the least desirable patterns for row-wise operations in Python (see here).
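One version caveat worth noting here: the default of the regex argument to Series.str.replace() changed over time (older pandas treated the pattern as a regular expression by default; pandas 2.0 defaults to a literal string), so it is safest to be explicit either way:
dataset['ver'] = dataset['ver'].str.replace('.', '', regex=False)   # literal dot
dataset['ver'] = dataset['ver'].str.replace('[.]', '', regex=True)  # dot escaped in a character class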

Filtering a dataset on values not in another dataset

I am looking to filter a dataset based off of whether a certain ID does not appear in a different dataframe.
While I'm not super attached to the way in which I've decided to do this if there's a better way that I'm not familiar with, I want to apply a Boolean function to my dataset, put the results in a new column, and then filter the entire dataset off of that True/False result.
My main dataframe is df, and my other dataframe with the ID's in it is called ID:
def groups():
    if df['owner_id'] not in ID['owner_id']:
        return True
    return False
This ends up being accepted (no syntax problems), so I then go to apply it to my dataframe, which fails:
df['ID Groups?'] = df.apply(lambda row: groups(), axis=1)
Result:
TypeError: ("'Series' objects are mutable, thus they cannot be hashed", 'occurred at index 0')
It seems that somewhere the data I'm trying to use (the IDs are both letters and numbers, so strings) is incorrectly formatted.
I have two questions:
Is my proposed method the best way of going about this?
How can I fix the error that I'm seeing?
My apologies if it's something super obvious, I have very limited exposure to Python and coding as a whole, but I wasn't able to find anywhere where this type of question had already been addressed.
Expression to keep only those rows in df whose owner_id does not appear in ID:
df = df[~df['owner_id'].isin(ID['owner_id'])]
A lambda expression is going to be way slower than this.
isin is the pandas way; not in is the Python collections way.
The reason you are getting this error is that df['owner_id'] not in ID['owner_id'] hashes the left-hand side to figure out whether it is present in the right-hand side. df['owner_id'] is of type Series and is not hashable, as reported. Luckily, hashing is not needed here.
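A toy example (made-up IDs) tying it together, including the negation with ~ that the question actually asks for:
import pandas as pd

df = pd.DataFrame({'owner_id': ['A1', 'B2', 'C3']})
ID = pd.DataFrame({'owner_id': ['B2']})

mask = df['owner_id'].isin(ID['owner_id'])  # element-wise membership test
print(df[~mask])                            # keeps A1 and C3: ids not present in ID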
