Cleaning column value when starting with - python

I have the following df:
df = pd.DataFrame(np.array([[.1, 2, 3], [.4, 5, 6], [7, 8, 9]]),
columns=['col1', 'b', 'c'])
out:
col1 b c
0 0.1 2.0 3.0
1 0.4 5.0 6.0
2 7.0 8.0 9.0
When a value begins with a point ('.'), I want to remove that point, but only when it is the leading character.
I've tried the following:
s = df['col1']
df['col1'] = s.mask(df['col1'].str.startswith('.',na=False),s.str.replace(".",""))
desired output:
col1 b c
0 1 2.0 3.0
1 4 5.0 6.0
2 7.0 8.0 9.0
However this does not work. Please help!

Since the values are numeric, you can multiply by 10 and keep the original values wherever they are already 1 or greater:
df.mul(10).mask(df.ge(1),df)
#df['col1'] = df['col1'].mul(10).mask(df['col1'].ge(1),df['col1']) for 1 column
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0

Use boolean masking. First create a mask:
mask=df['col1'].astype(str).str.startswith('0.')
Finally make use of that mask:
df.loc[mask,'col1']=df.loc[mask,'col1'].astype(str).str.lstrip('0.').astype(float)
Now if you print df you will get your desired output:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0
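One caveat with str.lstrip('0.'): it strips any leading run of the characters '0' and '.', not the literal prefix "0.". The sketch below is only an illustration; it assumes an extra 0.0 value that is not in the question's data, and it uses an anchored regular expression to remove exactly one leading "0." instead.

```python
import pandas as pd

# Sample column; the trailing 0.0 is an assumed edge case, not from the question.
s = pd.Series([0.1, 0.4, 7.0, 0.0])
as_text = s.astype(str)  # '0.1', '0.4', '7.0', '0.0'

# lstrip('0.') treats its argument as a set of characters, so "0.0" becomes ""
# and the later cast to float would fail.
stripped_edge_case = "0.0".lstrip("0.")

# Remove one literal leading "0." only, then cast back to float.
cleaned = as_text.str.replace(r"^0\.", "", regex=True).astype(float)
print(cleaned.tolist())
```

Values that do not start with "0." are left untouched by the pattern, so no separate mask is needed in this variant.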

Via NumPy's np.where(), applied to the single column:
df['col1'] = np.where(df['col1'] < 1, df['col1'] * 10, df['col1'])
df contents:
col1 b c
0 1.0 2.0 3.0
1 4.0 5.0 6.0
2 7.0 8.0 9.0

Related

How do I interpolate filling in nans with the minimum value on either side?

I have a dataframe of the form:
df = {'col_1': [5,4,np.nan,np.nan,1,0,1,2,np.nan,np.nan,5],
'col_2': [5,4,3,2,np.nan,np.nan,np.nan,np.nan,3,4,5]}
df = pd.DataFrame(df)
I want to "interpolate" but taking the min value on either side for desired result of:
df_desired = {'col_1': [5,4,1,1,1,0,1,2,2,2,5],
'col_2': [5,4,3,2,2,2,2,2,3,4,5]}
df_desired = pd.DataFrame(df_desired)
Does anyone know a good way of doing this? Thanks!

Here is one way: take np.minimum of the forward fill and the backward fill:
out = np.minimum(df.ffill(),df.bfill())
print(out)
col_1 col_2
0 5.0 5.0
1 4.0 4.0
2 1.0 3.0
3 1.0 2.0
4 1.0 2.0
5 0.0 2.0
6 1.0 2.0
7 2.0 2.0
8 2.0 3.0
9 2.0 4.0
10 5.0 5.0
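A quick check of this approach against the desired result from the question can be sketched as follows (variable names follow the question; the `matches` flag is just an illustrative name):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col_1': [5, 4, np.nan, np.nan, 1, 0, 1, 2, np.nan, np.nan, 5],
    'col_2': [5, 4, 3, 2, np.nan, np.nan, np.nan, np.nan, 3, 4, 5]})
df_desired = pd.DataFrame({
    'col_1': [5, 4, 1, 1, 1, 0, 1, 2, 2, 2, 5],
    'col_2': [5, 4, 3, 2, 2, 2, 2, 2, 3, 4, 5]})

# ffill carries the last valid value forward, bfill carries the next valid
# value backward; the element-wise minimum picks the smaller neighbour.
out = np.minimum(df.ffill(), df.bfill())

# Compare values only (out is float, df_desired is int).
matches = bool((out == df_desired).all().all())
print(matches)
```

Note that a NaN run at the very start or end of a column has only one valid neighbour, so it would stay NaN with this approach.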

Pandas: Fillna with local average if a condition is met

Let's say I have data like this:
df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4], 'col2':[1,3,np.nan,np.nan,5,np.nan,4]})
print(df)
col1 col2
0 5.0 1.0
1 NaN 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 NaN NaN
6 4.0 4.0
How can I use fillna() to replace NaN values with the average of the prior and the succeeding value if both of them are not NaN ?
The result would look like this:
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
Also, is there a way of calculating the average from the previous n and succeeding n values (if they are all not NaN) ?

We can shift the dataframe forward and backward, add the two shifted frames, divide by two, and pass the result to fillna. The sum is NaN whenever either neighbour is NaN, so those cells stay unfilled:
s1, s2 = df.shift(), df.shift(-1)
df = df.fillna((s1 + s2) / 2)
col1 col2
0 5.0 1.0
1 3.5 3.0
2 2.0 NaN
3 2.0 NaN
4 5.0 5.0
5 4.5 4.5
6 4.0 4.0
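For the follow-up about the previous n and succeeding n values, the same shifting idea can probably be generalized. The sketch below is one possible reading of that requirement; the function name `fill_with_window_mean` is hypothetical, not a pandas API.

```python
import numpy as np
import pandas as pd

def fill_with_window_mean(df: pd.DataFrame, n: int = 1) -> pd.DataFrame:
    """Fill NaNs with the mean of the n previous and n next values.

    Hypothetical helper: a cell is filled only if all 2*n neighbours are
    non-NaN, because adding frames with `+` propagates NaN.
    """
    total = 0
    for k in range(1, n + 1):
        total = total + df.shift(k) + df.shift(-k)
    return df.fillna(total / (2 * n))

df = pd.DataFrame({'col1': [5, np.nan, 2, 2, 5, np.nan, 4],
                   'col2': [1, 3, np.nan, np.nan, 5, np.nan, 4]})
filled_n1 = fill_with_window_mean(df, n=1)

# n=2 on a small assumed example: (2 + 4 + 1 + 5) / 4
small = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0, 5.0]})
filled_n2 = fill_with_window_mean(small, n=2)
print(filled_n1)
print(filled_n2)
```

With n=1 this reduces to the shift-and-average approach above.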

Negating column values and adding particular values in only some columns in a Pandas Dataframe

Taking a Pandas dataframe df I would like to be able to both take away the value in the particular column for all rows/entries and also add another value. This value to be added is a fixed additive for each of the columns.
I believe I could reproduce df, say dfcopy=df, set all cell values in dfcopy to the particular numbers and then subtract df from dfcopy but am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So for example of how this should look:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then negating only those values in columns (0,3,4) and then adding 10 (for example) we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.

You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
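The question mentions a fixed additive for each column. If the constants differ per column, one option is to keep them in a Series keyed by column label and let pandas align on columns. This is a sketch only; the `offsets` values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 1.0],
                   'B': [3.0, 1.0, 1.0],
                   'C': [1.0, 8.0, 1.0],
                   'D': [2.0, 5.0, 1.0],
                   'E': [7.0, 3.0, 6.0]})

# Hypothetical per-column additive constants (not from the question).
offsets = pd.Series({'A': 10, 'D': 20, 'E': 30})
cols = list(offsets.index)

# Negate the selected columns, then add each column's own constant;
# axis='columns' matches the Series index against the column labels.
df[cols] = df[cols].mul(-1).add(offsets, axis='columns')
print(df)
```

Selecting by label instead of position also keeps the code valid if the column order changes.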

pandas is very intuitive in letting you perform these operations.
negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
or change to constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.

Combine 2 series pandas - overwriting the NANs [duplicate]

I'm looking for a method that behaves similarly to coalesce in T-SQL. I have 2 columns (column A and B) that are sparsely populated in a pandas dataframe. I'd like to create a new column using the following rules:
If the value in column A is not null, use that value for the new column C
If the value in column A is null, use the value in column B for the new column C
Like I mentioned, this can be accomplished in MS SQL Server via the coalesce function. I haven't found a good pythonic method for this; does one exist?

use combine_first():
In [16]: df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
In [17]: df.loc[::2, 'a'] = np.nan
In [18]: df
Out[18]:
a b
0 NaN 0
1 5.0 5
2 NaN 8
3 2.0 8
4 NaN 3
5 9.0 4
6 NaN 7
7 2.0 0
8 NaN 6
9 2.0 5
In [19]: df['c'] = df.a.combine_first(df.b)
In [20]: df
Out[20]:
a b c
0 NaN 0 0.0
1 5.0 5 5.0
2 NaN 8 8.0
3 2.0 8 2.0
4 NaN 3 3.0
5 9.0 4 9.0
6 NaN 7 7.0
7 2.0 0 2.0
8 NaN 6 6.0
9 2.0 5 2.0

Coalesce for multiple columns with DataFrame.bfill
All these methods work for two columns and are fine with maybe three columns, but they all require method chaining if you have n columns when n > 2:
example dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[np.nan, 2, 4, 5, np.nan],
'col2':[np.nan, 5, 1, 0, np.nan],
'col3':[2, np.nan, 9, 1, np.nan],
'col4':[np.nan, 10, 11, 4, 8]})
print(df)
col1 col2 col3 col4
0 NaN NaN 2.0 NaN
1 2.0 5.0 NaN 10.0
2 4.0 1.0 9.0 11.0
3 5.0 0.0 1.0 4.0
4 NaN NaN NaN 8.0
Using DataFrame.bfill over the columns axis (axis=1), we can take the first non-null value per row in a way that generalizes to any number of columns.
This also works for string columns.
df['coalesce'] = df.bfill(axis=1).iloc[:, 0]
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0
With Series.combine_first (the accepted answer), this gets cumbersome and eventually impractical as the number of columns grows:
df['coalesce'] = (
df['col1'].combine_first(df['col2'])
.combine_first(df['col3'])
.combine_first(df['col4'])
)
col1 col2 col3 col4 coalesce
0 NaN NaN 2.0 NaN 2.0
1 2.0 5.0 NaN 10.0 2.0
2 4.0 1.0 9.0 11.0 4.0
3 5.0 0.0 1.0 4.0 5.0
4 NaN NaN NaN 8.0 8.0

Try this as well; it is easier to remember:
df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
This is slightly faster: df['c'] = np.where(df["a"].isnull() == True, df["b"], df["a"] )
%timeit df['d'] = df.a.combine_first(df.b)
1000 loops, best of 3: 472 µs per loop
%timeit df['c'] = np.where(df["a"].isnull(), df["b"], df["a"] )
1000 loops, best of 3: 291 µs per loop

combine_first is the most straightforward option. Below I outline a few alternatives, some of which apply only to particular cases.
Case #1: Non-mutually Exclusive NaNs
Not all rows have NaNs, and these NaNs are not mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, 7.0, np.nan],
'b': [5.0, 3.0, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 5.0
1 2.0 3.0
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 7.0 6.0
6 NaN 7.0
Let's combine first on a.
Series.mask
df['a'].mask(pd.isnull, df['b'])
# df['a'].mask(df['a'].isnull(), df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
Series.where
df['a'].where(pd.notnull, df['b'])
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 7.0
6 7.0
Name: a, dtype: float64
You can use similar syntax using np.where.
Alternatively, to combine first on b, switch the conditions around.
Case #2: Mutually Exclusive Positioned NaNs
All rows have NaNs which are mutually exclusive between columns.
df = pd.DataFrame({
'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})
df
a b
0 1.0 NaN
1 2.0 NaN
2 3.0 NaN
3 NaN 4.0
4 5.0 NaN
5 NaN 6.0
6 NaN 7.0
Series.update
This method works in-place, modifying the original DataFrame. This is an efficient option for this use case.
df['b'].update(df['a'])
# Or, to update "a" in-place,
# df['a'].update(df['b'])
df
a b
0 1.0 1.0
1 2.0 2.0
2 3.0 3.0
3 NaN 4.0
4 5.0 5.0
5 NaN 6.0
6 NaN 7.0
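A caveat, stated as my understanding rather than a guarantee: df['b'] returns an intermediate object, so with pandas' Copy-on-Write mode active (opt-in in pandas 2.x and, as far as I know, the default from pandas 3.0) the in-place update may not reach df. A plain column assignment gives the same result and does not depend on that mode; here is a minimal sketch using combine_first:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, np.nan, 5.0, np.nan, np.nan],
    'b': [np.nan, np.nan, np.nan, 4.0, np.nan, 6.0, 7.0]})

# Same outcome as df['b'].update(df['a']): non-null values of "a" win,
# remaining gaps keep the value from "b". Written as an assignment so it
# does not rely on modifying df through an intermediate Series.
df['b'] = df['a'].combine_first(df['b'])
print(df)
```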
Series.add
df['a'].add(df['b'], fill_value=0)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64
DataFrame.fillna + DataFrame.sum
df.fillna(0).sum(1)
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
dtype: float64

I encountered this problem but wanted to coalesce multiple columns, picking the first non-null value from several columns. I found the following helpful:
Build dummy data
import pandas as pd
df = pd.DataFrame({'a1': [None, 2, 3, None],
'a2': [2, None, 4, None],
'a3': [4, 5, None, None],
'a4': [None, None, None, None],
'b1': [9, 9, 9, 999]})
df
a1 a2 a3 a4 b1
0 NaN 2.0 4.0 None 9
1 2.0 NaN 5.0 None 9
2 3.0 4.0 NaN None 9
3 NaN NaN NaN None 999
Coalesce a1, a2, a3 into a new column A
def get_first_non_null(dfrow, columns_to_search):
    for c in columns_to_search:
        if pd.notnull(dfrow[c]):
            return dfrow[c]
    return None
# sample usage:
cols_to_search = ['a1', 'a2', 'a3']
df['A'] = df.apply(lambda x: get_first_non_null(x, cols_to_search), axis=1)
print(df)
a1 a2 a3 a4 b1 A
0 NaN 2.0 4.0 None 9 2.0
1 2.0 NaN 5.0 None 9 2.0
2 3.0 4.0 NaN None 9 3.0
3 NaN NaN NaN None 999 NaN
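The row-wise apply can likely be replaced by the vectorized bfill idea from the earlier answer, restricted to the columns of interest. A minimal sketch on the same dummy data:

```python
import pandas as pd

df = pd.DataFrame({'a1': [None, 2, 3, None],
                   'a2': [2, None, 4, None],
                   'a3': [4, 5, None, None],
                   'a4': [None, None, None, None],
                   'b1': [9, 9, 9, 999]})

cols_to_search = ['a1', 'a2', 'a3']

# Back-fill across the selected columns so that the first column of the
# result holds the first non-null value in each row (NaN if none exists).
df['A'] = df[cols_to_search].bfill(axis=1).iloc[:, 0]
print(df)
```

The order of cols_to_search sets the priority, just as in the loop version.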

I'm thinking a solution like this,
def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s
because given a DataFrame with columns with ['a', 'b', 'c'], you can use it like a SQL coalesce,
df['d'] = coalesce(df.a, df.b, df.c)
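A self-contained sketch of how such a helper behaves, using made-up sample data (the values in columns a, b and c are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def coalesce(s: pd.Series, *series: pd.Series) -> pd.Series:
    """Return the first non-null value per row, like SQL COALESCE."""
    for other in series:
        # Where s is still null, take the value from the next candidate.
        s = s.mask(pd.isnull, other)
    return s

# Assumed sample data, not from the original post.
df = pd.DataFrame({'a': [1.0, np.nan, np.nan],
                   'b': [np.nan, 2.0, np.nan],
                   'c': [9.0, 9.0, 3.0]})
df['d'] = coalesce(df.a, df.b, df.c)
print(df)
```

Earlier arguments take priority, so column c is used only where both a and b are null.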

For a more general case, where there are no NaNs but you want the same behavior:
Merge 'left', but override 'right' values where possible

Good code, but you have a typo for Python 3; the correct one looks like this:
    """coalesce the column information like a SQL coalesce."""
    for other in series:
        s = s.mask(pd.isnull, other)
    return s

Consider using DuckDB for efficient SQL on Pandas. It's performant, simple, and feature-packed. https://duckdb.org/2021/05/14/sql-on-pandas.html
Sample Dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,np.nan, 3, 4, 5],
'B':[np.nan, 2, 3, 4, np.nan]})
Coalesce using DuckDB:
import duckdb
out_df = duckdb.query("""SELECT A,B,coalesce(A,B) as C from df""").to_df()
print(out_df)
Output:
A B c
0 1.0 NaN 1.0
1 NaN 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 NaN 5.0

What is the difference between combine_first and fillna?

These two functions seem equivalent to me. You can see that they accomplish the same goal in the code below, as columns c and d are equal. So when should I use one over the other?
Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=list('ab'))
df.loc[::2, 'a'] = np.nan
Returns:
a b
0 NaN 4
1 2.0 6
2 NaN 8
3 0.0 4
4 NaN 4
5 0.0 8
6 NaN 7
7 2.0 2
8 NaN 9
9 7.0 2
This is my starting point. Now I will add two columns, one using combine_first and one using fillna, and they will produce the same result:
df['c'] = df.a.combine_first(df.b)
df['d'] = df['a'].fillna(df['b'])
Returns:
a b c d
0 NaN 4 4.0 4.0
1 8.0 7 8.0 8.0
2 NaN 2 2.0 2.0
3 3.0 0 3.0 3.0
4 NaN 0 0.0 0.0
5 2.0 4 2.0 2.0
6 NaN 0 0.0 0.0
7 2.0 6 2.0 2.0
8 NaN 4 4.0 4.0
9 4.0 6 4.0 4.0
Credit to this question for the data set: Combine Pandas data frame column values into new column

combine_first is intended to be used when there are non-overlapping indices. It will effectively fill in nulls as well as supply values for indices and columns that didn't exist in the first.
dfa = pd.DataFrame([[1, 2, 3], [4, np.nan, 5]], ['a', 'b'], ['w', 'x', 'y'])
w x y
a 1.0 2.0 3.0
b 4.0 NaN 5.0
dfb = pd.DataFrame([[1, 2, 3], [3, 4, 5]], ['b', 'c'], ['x', 'y', 'z'])
x y z
b 1.0 2.0 3.0
c 3.0 4.0 5.0
dfa.combine_first(dfb)
w x y z
a 1.0 2.0 3.0 NaN
b 4.0 1.0 5.0 3.0 # 1.0 filled from `dfb`; 5.0 was in `dfa`; 3.0 new column
c NaN 3.0 4.0 5.0 # whole new index
Notice that all indices and columns are included in the results
Now if we fillna
dfa.fillna(dfb)
w x y
a 1 2.0 3
b 4 1.0 5 # 1.0 filled in from `dfb`
Notice no new columns or indices from dfb are included. We only filled in the null value where dfa shared index and column information.
In your case, you use fillna and combine_first on one column with the same index. These translate to effectively the same thing.
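The difference can be checked directly by comparing shapes. A minimal sketch reusing dfa and dfb from above; the two small Series at the end are assumed data for the single-column case:

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame([[1, 2, 3], [4, np.nan, 5]],
                   index=['a', 'b'], columns=['w', 'x', 'y'])
dfb = pd.DataFrame([[1, 2, 3], [3, 4, 5]],
                   index=['b', 'c'], columns=['x', 'y', 'z'])

combined = dfa.combine_first(dfb)  # union of both indexes and columns
filled = dfa.fillna(dfb)           # keeps dfa's index and columns only

print(combined.shape, filled.shape)

# Two Series on the same index: both methods give the same values.
a = pd.Series([np.nan, 2.0, np.nan])
b = pd.Series([4.0, 6.0, 8.0])
same = a.combine_first(b).equals(a.fillna(b))
print(same)
```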
