Any help on how I can skip missing values in my world field? I thought na_action='ignore' would help, but it doesn't in my case.
df['world'] = df['world'].map(lambda x: x.rstrip('L.locoMoco'), na_action='ignore')
Thanks
If world is an object column, call str.rstrip directly.
df['world'] = df['world'].str.rstrip('L.locoMoco')
If the column contains only strings, NaNs are preserved. However, any numeric values are coerced to NaN, so if that is not the intended behaviour, I'd suggest either
Coercing those values to string (to preserve them), or
Using slower alternatives like a for loop or apply -- see the sketch below.
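For example, a minimal sketch of both options; the toy values are my own, only the column name world and the strip characters come from the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'world': ['worldL.locoMoco', np.nan, 42]})

# Option 1: coerce to string first -- note NaN becomes the literal string
# 'nan' and 42 becomes '42', so nothing is lost to coercion
stripped = df['world'].astype(str).str.rstrip('L.locoMoco')

# Option 2: apply with a type check, leaving NaN and 42 exactly as they are
stripped = df['world'].apply(lambda x: x.rstrip('L.locoMoco') if isinstance(x, str) else x)

print(stripped.tolist())  # ['world', nan, 42]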
I'm trying to find a solution for stripping blank spaces from some strings in my DataFrame. I found this solution, where someone said this:
I agree with the other answers that there's no inplace parameter for the strip function, as seen in the documentation for str.strip.
To add to that: I've found the str functions for pandas Series usually used when selecting specific rows, like df[df['Name'].str.contains('69')]. I'd say this is a possible reason that it doesn't have an inplace parameter -- it's not meant to be completely "stand-alone" like rename or drop.
Also to add! I think a more pythonic solution is to use negative indices instead:
data['Name'] = data['Name'].str.strip().str[-5:]
This way, we don't have to assume that there are 18 characters, and we'll consistently get the "last 5 characters" instead!
So, I have a list of DataFrames called 'dataframes'. In the first DataFrame (which is dataframes[0]), I have a column named 'cnj' with string values, some of them with a blank space at the end. For example:
Input:
dataframes[0]['cnj'][9]
Output:
'0100758-73.2019.5.01.0064 '
So, following the comment above, I did this:
Input:
dataframes[0]['cnj'] = dataframes[0]['cnj'].strip()
Then I get the following error:
AttributeError: 'Series' object has no attribute 'strip'
Since the solution given in the other topic worked, what am I doing wrong to get this error? It shouldn't fail just because it's a Series; it should get the same result as the one mentioned above (data['Name'] = data['Name'].str.strip().str[-5:]), right?
Use
dataframes[0]['cnj'] = dataframes[0]['cnj'].str.strip()
or better yet, store the dataframe in a variable first:
df0 = dataframes[0]
df0['cnj'] = df0['cnj'].str.strip()
The code in the solution you posted uses the .str accessor:
data['Name'] = data['Name'].str.strip().str[-5:]
The pandas Series object has no string or date manipulation methods of its own. These are exposed through the Series.str and Series.dt accessor objects.
The result of Series.str.strip() is a new Series. That's why .str[-5:] is needed to retrieve the last 5 characters; that result is a new Series again. The expression is equivalent to:
temp_series = data['Name'].str.strip()
data['Name'] = temp_series.str[-5:]
You could just apply a transformation function on the column values like this.
data["Name"] = data["Name"].apply(lambda x: str(x).strip()[-5:])
What you need, if I understand your question correctly, is the strings without the trailing spaces, whether they live in a Series or a DataFrame. Use str.rstrip(), which works on string Series (and therefore on each column of a DataFrame).
Note: strip() on its own is a method of plain Python strings, not of a Series, so the error you are getting is expected.
Refer to the link and try the str.rstrip() provided by pandas.
For str.strip() you can refer to this link; it works for me.
In your case, assuming the column name is stored in s, you can use the code below:
df[s].str.strip()
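For instance, a quick sketch on a throwaway Series; the first value is copied from the question above, the second is my own:
import pandas as pd

s = pd.Series(['0100758-73.2019.5.01.0064 ', 'already-clean'])
print(s.str.rstrip().tolist())
# ['0100758-73.2019.5.01.0064', 'already-clean']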
I am reading data from a PostgreSQL DB into a pandas DataFrame. In one of the columns all values are integers, but some are missing. While reading, the DataFrame attaches a trailing .0 to all the values in the column.
e.g. Original Data
SUBJID
1031456
1031457
1031458
What I am getting in the DataFrame column is this:
df['SUBJID'].head()
1031456.0
1031457.0
1031458.0
I know I can remove it, but there are multiple columns and I never know which column will have this problem. So while reading itself I want to ensure that everything is read as a string, without those trailing zeros.
I have already tried df = pd.read_sql('q', dtype=str), but it's not giving the desired output.
Please let me know the solution.
Adding another answer because this is different from the other one.
This happens because your dataset contains empty cells, and since the int type doesn't support NA/NaN, the column gets cast to float.
One solution would be to fill the NA/NaN with 0 and then set the type to int, like so:
columns = ['SUBJID'] # you can list the columns you want, or you can run it on the whole dataframe if you want to.
df[columns] = df[columns].fillna(0).astype(int)
# then you can convert to string after converting to int if you need to do so
Another would be to have the sql query do the filling for you (which is a bit tedious to write if you ask me).
Note that pandas.read_sql doesn't have a dtype argument anyway (at least not in older versions).
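As a side note (my own addition, not from the answer above): if your pandas version is 0.24 or later, the nullable Int64 dtype keeps missing values without falling back to float, so no .0 suffix appears in the first place:
import numpy as np
import pandas as pd

df = pd.DataFrame({'SUBJID': [1031456.0, 1031457.0, np.nan]})

# 'Int64' (capital I) supports pd.NA, unlike the plain numpy int64
df['SUBJID'] = df['SUBJID'].astype('Int64')
print(df['SUBJID'].astype(str).tolist())
# ['1031456', '1031457', '<NA>']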
Try setting the dtype of the column to int, then to str (note that the plain int cast will fail while NaNs are still present, so fill or drop them first):
df['SUBJID'] = df['SUBJID'].astype('int32')
df['SUBJID'] = df['SUBJID'].astype(str)
if you want to manually fix the strings, then you can do
df['SUBJID'] = df['SUBJID'].apply(lambda x: x.split(".")[0])
This should strip out the "." and everything after it, but make sure you don't use it on columns that contain a "." that you need.
In one column of my data frame I have version numbers like 6.3.5, 1.8, and 5.10.0, saved as objects and thus likely as strings. I want to replace the dots with nothing so I get 635, 18, 5100. My code idea was this:
for row in dataset.ver:
    row.replace(".", "", inplace=True)
The thing is, it works if I don't set inplace to True, but we want to overwrite the value and save it.
You're iterating through the elements within the DataFrame, in which case I'm assuming each is of type str (or being coerced to str when you replace). str.replace doesn't have an inplace=... argument.
You should be doing this instead:
dataset['ver'] = dataset['ver'].str.replace('.', '', regex=False)
Sander van den Oord in the comments is quite correct to point out:
dataset['ver'].replace("[.]","", inplace=True, regex=True)
This is the way we do operations on a column in pandas, because in general pandas tries to optimize over for loops. The pandas developers consider for loops among the least desirable patterns for row-wise operations in Python (see here).
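To see it on a tiny example (toy frame; the regex=False flag makes the dot literal, which matters in older pandas where str.replace treated the pattern as a regex by default):
import pandas as pd

dataset = pd.DataFrame({'ver': ['6.3.5', '1.8', '5.10.0']})
dataset['ver'] = dataset['ver'].str.replace('.', '', regex=False)
print(dataset['ver'].tolist())  # ['635', '18', '5100']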
I have a dataframe, for which I need to convert columns to floats and ints, that has bad rows, i.e., values in a column that should be a float or an integer are instead string values.
If I use df.bad.astype(float), I get an error; this is expected.
If I use pd.to_numeric(df.bad, errors='coerce'), bad values are replaced with np.NaN, also according to spec and reasonable.
There is also errors='ignore', another option that ignores the errors and leaves the erroring values alone.
But actually, I want to not ignore the errors, but drop the rows with bad values. How can I do this?
I can ignore the errors and do some type checking, but that's not an ideal solution, and there might be something more idiomatic to do this.
Example
test = pd.DataFrame(["3", "4", "problem"], columns=["bad"])
test.bad.astype(float)  # ValueError: could not convert string to float: 'problem'
I want something like this:
pd.to_numeric(df.bad, errors='drop')
And this returns a dataframe with only the 2 good rows.
Since the bad values are replaced with np.NaN, wouldn't it simply be df.dropna() to get rid of the bad rows now?
EDIT:
Since you need to keep the initial NaNs, maybe you could use df.fillna() prior to using pd.to_numeric; one concrete variant is sketched below.
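A minimal sketch (my own variant: instead of filling, it uses a mask to remember which rows were NaN to begin with; only the column name bad comes from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'bad': ['3', '4', 'problem', np.nan]})

# remember the rows that were already missing before the conversion
already_nan = df['bad'].isna()

# coerce, then drop only the rows that *became* NaN during the conversion
df['bad'] = pd.to_numeric(df['bad'], errors='coerce')
df = df[df['bad'].notna() | already_nan]
print(df)
#    bad
# 0  3.0
# 1  4.0
# 3  NaN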
I have missing values in a column of a series, so the command dataframe.colname.astype("int64") yields an error.
Any workarounds?
The datatype or dtype of a pd.Series has very little impact on the actual way it is used.
You can have a pd.Series with integers, and set the dtype to be object. You can still do the same things with the pd.Series.
However, if you manually set dtypes of pd.Series, pandas will start to cast the entries inside the pd.Series. In my experience, this only leads to confusion.
Do not try to use dtypes as field types in relational databases. They are not the same thing.
If you want to have integers and NaNs/Nones mixed in a pd.Series, just set the dtype to object.
Setting the dtype to float will let you have float representations of ints and NaNs mixed. But remember that floats are prone to be inexact in their representation.
One common pitfall with dtypes which I should mention is the pd.merge operation, which will silently refuse to join when the keys used have different dtypes, for example int vs object, even if the object column only contains ints.
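A small sketch of that pitfall; the toy frames are my own:
import pandas as pd

left = pd.DataFrame({'key': [1, 2]})                      # int64 keys
right = pd.DataFrame({'key': ['1', '2'], 'v': [10, 20]})  # object (str) keys

# Merging int64 keys against object keys finds no matches (newer pandas
# versions raise a ValueError outright), so align the dtypes explicitly:
merged = left.merge(right.assign(key=right['key'].astype('int64')), on='key')
print(merged)
#    key   v
# 0    1  10
# 1    2  20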
Other workarounds
You can use the Series.fillna method to fill your NaN values with something unlikely, such as 0 or -1.
Copy the NaNs to a new column df['was_nan'] = pd.isnull(df['floatcol']), then use the Series.fillna method. This way you do not lose any information (see the sketch after this list).
When calling the Series.astype() method, give it the keyword argument errors='ignore' (raise_on_error=False in older pandas), and just keep the current dtype if it fails, because dtypes do not matter that much.
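A minimal sketch of the second workaround, with a toy column named as in the bullet above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'floatcol': [1.0, 2.0, np.nan]})

# remember which rows were missing, then fill with an unlikely value and cast
df['was_nan'] = pd.isnull(df['floatcol'])
df['floatcol'] = df['floatcol'].fillna(-1).astype(int)
print(df)
#    floatcol  was_nan
# 0         1    False
# 1         2    False
# 2        -1     True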
TLDR;
Don't focus on having the 'right dtype', dtypes are strange. Focus on what you want the column to actually do. dtype=object is fine.