One column of my data frame holds version numbers like 6.3.5, 1.8, 5.10.0, stored with dtype object and thus probably as strings. I want to replace the dots with nothing so I get 635, 18, 5100. My code idea was this:
for row in dataset.ver:
    row.replace(".","",inplace=True)
The thing is, it works if I don't set inplace to True, but I want to overwrite the column and save the result.
You're iterating over the elements of the column, which I'm assuming are of type str (or are coerced to str when you call replace). Python's str.replace has no inplace argument.
You should be doing this instead (regex=False makes the dot a literal character rather than a regex wildcard):
dataset['ver'] = dataset['ver'].str.replace('.', '', regex=False)
Sander van den Oord in the comments is quite correct to point out:
dataset['ver'].replace("[.]","", inplace=True, regex=True)
This is the way we do operations on a column in Pandas because, in general, Pandas favors vectorized operations over for loops. The Pandas developers consider for loops among the least desirable patterns for row-wise operations in Python (see here).
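As a quick check of the vectorized approach, here is a minimal sketch using the three version strings from the question. The frame is rebuilt by hand for illustration, and regex=False is passed so the dot is treated as a literal character rather than a regex wildcard.

```python
import pandas as pd

# Hand-built stand-in for the asker's frame, using the values from the question.
dataset = pd.DataFrame({"ver": ["6.3.5", "1.8", "5.10.0"]})

# Vectorized replacement on the whole column; the result is assigned back
# because Series.str.replace returns a new Series.
dataset["ver"] = dataset["ver"].str.replace(".", "", regex=False)

result = dataset["ver"].tolist()
print(result)  # ['635', '18', '5100']
```

The values are still strings after this step; converting them to integers would be a separate astype call if that is what you need.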
Related
I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for
the strip function, as seen in the
documentation
for str.strip.
To add to that: I've found the str functions for pandas Series
usually used when selecting specific rows. Like
df[df['Name'].str.contains('69')]. I'd say this is a possible reason
that it doesn't have an inplace parameter -- it's not meant to be
completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative
indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and/or
we'll consistently get "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. In the first DataFrame (dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? I can see why it might fail because it's a Series, but shouldn't it give the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:])?
Use
dataframes[0]['cnj']=dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0=dataframes[0]
df0['cnj']=df0['cnj'].str.strip()
The code in the solution you posted uses .str. :
data['Name'] = data['Name'].str.strip().str[-5:]
The Pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series, not a string. That's why a second .str is needed before [-5:] to retrieve the last 5 characters of each value. That result is again a new Series. The expression is equivalent to:
temp_series=data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
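To make the difference concrete, here is a minimal sketch. The first value is the one from the question; the second is a made-up value added only so the Series has more than one row.

```python
import pandas as pd

# First value is from the question; the second is hypothetical.
cnj = pd.Series(["0100758-73.2019.5.01.0064 ", "0100759-58.2019.5.01.0064"])

# Calling strip directly on the Series fails, because strip is a method
# of Python strings, not of Series.
try:
    cnj.strip()
except AttributeError as exc:
    error_message = str(exc)

# Going through the .str accessor applies strip to every element and
# returns a new Series, which can be assigned back to the column.
cleaned = cnj.str.strip()

print(error_message)
print(cleaned.tolist())
```

The same pattern applies to the other string methods: anything you would call on a single str is reached through .str on the Series.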
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
As I understand your question, you want the strings in a Series without the trailing spaces. Use Series.str.rstrip(), which works on a Series; a DataFrame has no .str accessor, so apply it one column at a time.
Note: strip() is a method of Python strings, not of a Series, so the error you are getting is expected.
Refer to the link, and try the str.rstrip() method provided by pandas.
For str.strip() you can refer to this link, it works for me.
In your case, assuming the column name is stored in s, you can use the code below:
df[s] = df[s].str.strip()
I want to concatenate two columns in pandas containing mostly string values and some missing values. The result should be a new column which again contains string values and missing values. Mostly it worked fine with this:
df['newcolumn']=df['column1']+df['column2']
Most of the values in column1 are numbers (stored as strings) like 82. But some of the values in column2 are a combination of letters and numbers starting with an E, like E52 or E83. When 82 and E83 are concatenated, the result I want is 82E83. Unfortunately the result is then 8,2E+84. I guess Python implicitly interpreted this as a number in scientific notation.
I already tried different ways of concatenating and forcing string format, but the result is always the same:
df['newcolumn']=(df['column1']+df['column2']).astype(str)
or
df['newcolumn']=(df['column1'].str.cat(df['column2'])).astype(str)
It seems Python first creates a float, producing this unwanted format, and then changes the type to string, keeping results like 8,2E+84. Is there a way to strictly keep the string format?
Edit: Thanks for your comments. When I tried to reproduce the problem myself with a very short dataframe, the problem did not occur. Finally I realized that it was only Excel automatically interpreting the cells as (wrong) numbers in the CSV output. I didn't notice before, because another dataframe coming from a CSV file, which I merged with this dataframe on the concatenated strings, had already been "destroyed" the same way by Excel. So the merge didn't work properly and I thought the concatenation in Python was the problem. I used to view the dataframe with Excel because it is really big. In the future I will be more careful with this. My apologies for misplacing the problem!
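A small reproduction along the lines of the edit above, with made-up values, suggests that pandas keeps the concatenated values as strings and that the CSV text itself still contains 82E83. The column names follow the question; the data and the in-memory buffer are assumptions for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Made-up sample: string values plus one missing value.
df = pd.DataFrame(
    {"column1": ["82", "17", np.nan], "column2": ["E83", "4", "E52"]},
    dtype=object,
)

# Plain concatenation of two string columns; a missing value on either
# side gives a missing value in the result.
df["newcolumn"] = df["column1"] + df["column2"]
first = df["newcolumn"].iloc[0]

# Write to an in-memory CSV to inspect the raw text that a spreadsheet
# program would later read and possibly reinterpret.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(repr(first))
print(csv_text)
```

If the raw CSV text looks right, the scientific notation is being introduced by the program used to open the file, not by pandas.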
A separate astype call is not required in this case. You can simply format both values as strings row by row:
df["newcolumn"] = df.apply(lambda x: f"{x['column1']}{x['column2']}", axis=1)
I have a column that has the following data
column
------
1+1
2+3
4+5
How do I get pandas to sum these values so that the output is 2, 5, 9 instead of the above?
Many thanks
Your column obviously contains strings, so you must somehow evaluate them. Use the pd.eval function, e.g.
frame['column'].apply(pd.eval)
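If performance matters, consider an alternative such as ast.literal_eval, but note that since Python 3.7 ast.literal_eval no longer evaluates arithmetic such as 1+1, so it only helps on older versions. Thanks to @Serge Ballesta for mentioning it.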
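If every value has the simple form a+b with integer operands, another option that avoids evaluating strings at all is to split on the plus sign and sum the parts. This is an alternative to the pd.eval approach, not what the answer proposes, and the frame below is rebuilt from the question's sample values.

```python
import pandas as pd

# Sample values from the question.
frame = pd.DataFrame({"column": ["1+1", "2+3", "4+5"]})

# Split each string on the literal "+" into separate columns,
# convert the pieces to integers, and add them up row by row.
parts = frame["column"].str.split("+", expand=True).astype(int)
totals = parts.sum(axis=1)

print(totals.tolist())  # [2, 5, 9]
```

This only covers sums of integers; anything more general (other operators, parentheses) would need real expression evaluation.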
I'm a veteran of Pandas DataFrame objects, but I'm struggling to find a clean, convenient method for altering the values in a Dask DataFrame column. For a specific example, I'm trying to multiply positive values in a numpy.float column by -1, thereby making them negative. Here is my current method (I'm trying to change the last column in the DataFrame):
cols = df.columns
df[[cols[-1]]] = df[[cols[-1]]]*-1
This seems to work only if the column has a string header; otherwise it adds a new column, using the index number as a string-type column name. Is there something akin to the Pandas approach of, say, df.iloc[:, -1] = df.iloc[:, -1]*-1 that I can use with a Dask dataframe?
Edit: I'm also trying to implement: df = df.applymap(lambda x: x*-1). This, of course, applies the function to the entire dataframe, but is there a way to apply a function over just one column? Thank you.
first question
If something works for string columns and not for numeric-named columns then that is probably a bug. I recommend raising an issue at https://github.com/dask/dask/issues/new
second question
but is there a way to apply a function over just one column?
You can't directly apply a single Python function over a dask dataframe that is stored in many pieces. However, methods like .map_partitions or .reduction may help you achieve the same result with some cleverness.
In the future, we recommend asking separate questions separately on Stack Overflow.
Can anyone help me skip missing values in my world field? I thought na_action='ignore' would help, but it doesn't in my case.
df['world'] = df['world'].map(lambda x: x.rstrip('L.locoMoco'),na_action='ignore')
Thanks
If world is an object column, call str.rstrip directly.
df['world'] = df['world'].str.rstrip('L.locoMoco')
If the column has object dtype, NaNs are preserved. However, if it contains numeric values, they are coerced to NaN, so if this is not the intended behaviour, I'd suggest either
Coercing those values to string (to preserve them), or
Using slower alternatives like a for loop or apply.
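A minimal sketch of both behaviours, using a made-up three-element column (one string, one missing value, one number). The where-based conversion is just one possible way to turn the non-missing values into strings while leaving NaN alone; it is an assumption for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Made-up sample: a string, a missing value, and a numeric value.
world = pd.Series(["earthL.", np.nan, 42], dtype=object)

# Calling the accessor directly: the NaN stays NaN, and the numeric
# value also becomes NaN because it is not a string.
stripped = world.str.rstrip("L.locoMoco")

# Converting only the non-missing values to str first preserves the number.
as_text = world.where(world.isna(), world.astype(str))
stripped_keep_numbers = as_text.str.rstrip("L.locoMoco")

print(stripped.tolist())
print(stripped_keep_numbers.tolist())
```

Keep in mind that rstrip treats its argument as a set of characters to remove from the right, not as a literal suffix.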