Fill NAs in a Column with samples from itself - python

Simple toy dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'mycol': ['foo', 'bar', 'hello', 'there', np.nan, np.nan, np.nan, 'foo'],
                   'mycol2': 'this is here to make it a DF'.split()})
print(df)
   mycol mycol2
0    foo   this
1    bar     is
2  hello   here
3  there     to
4    NaN   make
5    NaN     it
6    NaN      a
7    foo     DF
I'm trying to fill the NaNs in mycol with random samples drawn from the column itself, i.e. I want each NaN replaced with one of foo, bar, hello, etc.
# fill NA values with n samples (n= number of NAs) from df['mycol']
df['mycol'].fillna(df['mycol'].sample(n=df.isna().sum(), random_state=1,replace=True).values)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
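The cause of the ValueError: df.isna().sum() returns a per-column Series of NaN counts, not a scalar, while sample() needs a scalar n. Counting only the target column (as the answer below does) fixes that part:
# n must be a scalar; df.isna().sum() is a Series of per-column counts
n_missing = df['mycol'].isna().sum()
samples = df['mycol'].dropna().sample(n=n_missing, random_state=1, replace=True)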
# fill NA values with n samples, n=1. Dropna from df['mycol'] before sampling:
df['mycol'] = df['mycol'].fillna(df['mycol'].dropna().sample(n=1, random_state=1,replace=True)).values
# nothing happens
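Nothing happens because of index alignment: fillna aligns the passed Series on its index, and a sample drawn after dropna() can only carry labels of non-NaN rows, so it never lines up with a missing slot. A quick check on the toy frame above:
one = df['mycol'].dropna().sample(n=1, random_state=1)
print(one)                      # its label comes from a non-NaN row, never 4, 5 or 6
print(df['mycol'].fillna(one))  # so rows 4-6 stay NaN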
Expected output: NAs filled with random samples from mycol:
   mycol mycol2
0    foo   this
1    bar     is
2  hello   here
3  there     to
4    foo   make
5    foo     it
6  hello      a
7    foo     DF
Edit for answer: jezrael's answer below sorted it; I had a problem with my indexes. (Note this variant redraws the whole column; the fillna version in the answer touches only the NaNs.)
df['mycol'] = (df['mycol']
               .dropna()
               .sample(n=len(df), replace=True)
               .reset_index(drop=True))

Interesting problem.
What works for me: set the values with loc, converting the sampled values to a numpy array so the assignment is positional rather than index-aligned:
a = df['mycol'].dropna().sample(n=df['mycol'].isna().sum(), random_state=1, replace=True)
print(a)
3    there
7      foo
0      foo
Name: mycol, dtype: object
# pandas 0.24+
df.loc[df['mycol'].isna(), 'mycol'] = a.to_numpy()
# pandas below 0.24
# df.loc[df['mycol'].isna(), 'mycol'] = a.values
print(df)
   mycol mycol2
0    foo   this
1    bar     is
2  hello   here
3  there     to
4  there   make
5    foo     it
6    foo      a
7    foo     DF
Your solution should work if the Series has the same length and index as the original DataFrame:
s = df['mycol'].dropna().sample(n=len(df), random_state=1, replace=True)
s.index = df.index
print(s)
0    there
1      foo
2      foo
3      bar
4    there
5      foo
6      foo
7      bar
Name: mycol, dtype: object
df['mycol'] = df['mycol'].fillna(s)
print(df)
   mycol mycol2
0    foo   this
1    bar     is
2  hello   here
3  there     to
4  there   make
5    foo     it
6    foo      a
7    foo     DF

You can do a forward or backward fill:
# backward fill
df['mycol'] = df['mycol'].fillna(method='bfill')
# forward fill
df['mycol'] = df['mycol'].fillna(method='ffill')
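Note that fillna(method=...) has since been deprecated (pandas 2.1) in favor of the dedicated methods, so on newer pandas the equivalent is:
# backward fill
df['mycol'] = df['mycol'].bfill()
# forward fill
df['mycol'] = df['mycol'].ffill()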


How to optimize turning a group of wide pandas columns into two long pandas columns

I have a process that takes a dataframe and turns a set of wide pandas columns into two long pandas columns, like so:
original wide:
wide = pd.DataFrame(
    {
        'id': ['foo'],
        'a': [1],
        'b': [2],
        'c': [3],
        'x': [4],
        'y': [5],
        'z': [6]
    }
)
wide
    id  a  b  c  x  y  z
0  foo  1  2  3  4  5  6
desired long:
lon = pd.DataFrame(
    {
        'id': ['foo', 'foo', 'foo', 'foo', 'foo', 'foo'],
        'type': ['a', 'b', 'c', 'x', 'y', 'z'],
        'val': [1, 2, 3, 4, 5, 6]
    }
)
lon
    id type  val
0  foo    a    1
1  foo    b    2
2  foo    c    3
3  foo    x    4
4  foo    y    5
5  foo    z    6
I found a way to do this by chaining the following pandas operations:
(wide
 .set_index('id')
 .T
 .unstack()
 .reset_index()
 .rename(columns={'level_1': 'type', 0: 'val'})
)
    id type  val
0  foo    a    1
1  foo    b    2
2  foo    c    3
3  foo    x    4
4  foo    y    5
5  foo    z    6
But this seems to pose problems when I scale up my data. Is there an alternative to what I already have that is faster or more computationally efficient?
I think you are looking for the pandas melt function.
Assuming your original dataframe is called wide, then:
df = pd.melt(wide, id_vars="id")
df.columns = ['id', 'type', 'val']
print(df)
Output:
    id type  val
0  foo    a    1
1  foo    b    2
2  foo    c    3
3  foo    x    4
4  foo    y    5
5  foo    z    6
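As a small refinement, melt can name the output columns directly through its var_name and value_name parameters, which avoids renaming the columns afterwards:
df = pd.melt(wide, id_vars='id', var_name='type', value_name='val')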

Replacing null value in Python with next available value by group

df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': [None, None, 'A', None, 'B', None]
})
I would like to replace each missing value with the next non-missing value within its group. The desired result is:
df = pd.DataFrame({
    'group': [1, 1, 1, 2, 2, 2],
    'value': ['A', 'A', 'A', 'B', 'B', None]
})
You can try this:
df['value'] = df.groupby(by=['group'])['value'].backfill()
print(df)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
The easiest way, as Erfan mentions, is the backfill method DataFrameGroupBy.bfill.
Solution 1)
>>> df['value'] = df.groupby('group')['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 2)
DataFrameGroupBy.bfill with the limit parameter works well here too.
The section "Limit the amount of filling" in the pandas documentation is worth reading: if we only want consecutive gaps filled up to a certain number of data points, we can use the limit keyword.
>>> df['value'] = df.groupby('group')['value'].bfill(limit=2)
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 3)
With groupby() we can also use fillna() with method='bfill' and the limit parameter:
>>> df.groupby('group').fillna(method='bfill', limit=2)
  value
0     A
1     A
2     A
3     B
4     B
5  None
Solution 4)
Another way is to use the DataFrame.transform function to fill the value column after grouping, calling bfill inside the lambda.
>>> df['value'] = df.groupby('group')['value'].transform(lambda v: v.bfill())
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
Solution 5)
You can use DataFrame.set_index to move the group column into the index, do a simple bfill() via groupby() on that index level, then reset_index to restore the original shape.
>>> df.set_index('group', append=True).groupby(level=1).bfill().reset_index(level=1)
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2   NaN
Solution 6)
If you strictly don't want groupby(), the below is the easiest, though see the caveat after the output:
>>> df['value'] = df['value'].bfill()
>>> df
   group value
0      1     A
1      1     A
2      1     A
3      2     B
4      2     B
5      2  None
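The caveat: without grouping, bfill can pull a value across a group boundary. It happens to be harmless for this data, but a slightly different frame shows the difference:
df2 = pd.DataFrame({'group': [1, 1, 2, 2], 'value': ['A', None, 'B', None]})
print(df2['value'].bfill())                   # row 1 becomes 'B', leaked in from group 2
print(df2.groupby('group')['value'].bfill())  # row 1 correctly stays NaN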

Assign pandas.DataFrame column To Series with Default

Suppose I have a DataFrame
df = pandas.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['foo', 'bar'])
     a  b
foo  1  3
bar  2  4
And I want to add a column based on another Series:
s = pandas.Series({'foo': 10, 'baz': 20})
foo    10
baz    20
dtype: int64
How do I assign the Series to a column of the DataFrame and provide a default value if the DataFrame index value is not in the Series index?
I'm looking for something of the form:
df['c'] = s.withDefault(42)
Which would result in the following Dataframe:
     a  b   c
foo  1  3  10
bar  2  4  42
# Note: bar got value 42 because it's not in s
Thank you in advance for your consideration and response.
Using map with get
get has an argument that you can use to specify the default value.
df.assign(c=df.index.map(lambda x: s.get(x, 42)))
     a  b   c
foo  1  3  10
bar  2  4  42
Use reindex with fill_value
df.assign(c=s.reindex(df.index, fill_value=42))
     a  b   c
foo  1  3  10
bar  2  4  42
You can join df with a DataFrame built from s, then fill the resulting NaNs with the default value (42 in your case).
df['c'] = df.join(pandas.DataFrame(s, columns=['c']))['c'].fillna(42).astype(int)
Output:
     a  b   c
foo  1  3  10
bar  2  4  42

Creating column on filtered pandas DataFrame

From an initial DataFrame loaded from a csv file,
df = pd.read_csv("file.csv",sep=";")
I get a filtered copy with
df_filtered = df[df["filter_col_name"]== value]
However, when creating a new column using the diff() method,
df_filtered["diff"] = df_filtered["feature"].diff()
I get the following warning:
/usr/local/bin/ipython3:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
#!/usr/bin/python3
I also notice that the processing time is very long. Surprisingly (at least to me...), doing the same thing on the non-filtered DataFrame runs fine.
How should I proceed to create a "diff" column on the filtered data?
You need copy():
If you modify values in df_filtered later, the modifications will not propagate back to the original data (df), and pandas warns you about exactly that.
# need to process the sliced df and return only the slice
df_filtered = df[df["filter_col_name"] == value].copy()
Or:
# need to process the sliced df but return the full df
df.loc[df["filter_col_name"] == value, 'feature'] = df.loc[df["filter_col_name"] == value, 'feature'].diff()
Sample:
df = pd.DataFrame({'filter_col_name': [1, 1, 3],
                   'feature': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 5],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print(df)
   C  D  E  F  feature  filter_col_name
0  7  1  5  7        4                1
1  8  3  3  4        5                1
2  9  5  6  3        6                3
value = 1
df_filtered = df[df["filter_col_name"]== value].copy()
df_filtered["diff"] = df_filtered["feature"].diff()
print(df_filtered)
   C  D  E  F  feature  filter_col_name  diff
0  7  1  5  7        4                1   NaN
1  8  3  3  4        5                1   1.0
value = 1
df.loc[df["filter_col_name"]== value, 'feature'] =
df.loc[df["filter_col_name"]== value , 'feature'].diff()
print(df)
   C  D  E  F  feature  filter_col_name
0  7  1  5  7      NaN                1
1  8  3  3  4      1.0                1
2  9  5  6  3      6.0                3
Try using
df_filtered.loc[:, "diff"] = df_filtered["feature"].diff()
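A sketch of another way to sidestep the warning: build the filtered frame and the new column in one chain with assign, which always returns a fresh copy (same logic as the copy() approach above):
df_filtered = (df.loc[df["filter_col_name"] == value]
                 .assign(diff=lambda d: d["feature"].diff()))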

Pandas: Create new dataframe that averages duplicates from another dataframe

Say I have a dataframe my_df with duplicate columns, e.g.
foo  bar  foo  hello
  0    1    1      5
  1    1    2      5
  2    1    3      5
I would like to create another dataframe that averages the duplicates:
foo  bar  hello
0.5    1      5
1.5    1      5
2.5    1      5
How can I do this in Pandas?
So far I have managed to identify the duplicates:
my_columns = my_df.columns
my_duplicates = [x for x, y in collections.Counter(my_columns).items() if y > 1]
But I don't know how to ask pandas to average them.
You can groupby the column index and take the mean:
In [11]: df.groupby(level=0, axis=1).mean()
Out[11]:
   bar  foo  hello
0    1  0.5      5
1    1  1.5      5
2    1  2.5      5
A somewhat trickier example is if there is a non-numeric column:
In [21]: df
Out[21]:
   foo  bar  foo hello
0    0    1    1     a
1    1    1    2     a
2    2    1    3     a
The above will raise DataError: No numeric types to aggregate. It's definitely not going to win any prizes for efficiency, but here's a generic method for this case:
In [22]: dupes = df.columns.get_duplicates()
In [23]: dupes
Out[23]: ['foo']
In [24]: pd.DataFrame({d: df[d] for d in df.columns if d not in dupes})
Out[24]:
   bar hello
0    1     a
1    1     a
2    1     a
In [25]: pd.concat(df.xs(d, axis=1) for d in dupes).groupby(level=0, axis=1).mean()
Out[25]:
   foo
0  0.5
1  1.5
2  2.5
In [26]: pd.concat([Out[24], Out[25]], axis=1)
Out[26]:
   foo  bar hello
0  0.5    1     a
1  1.5    1     a
2  2.5    1     a
I think the thing to take away is avoid column duplicates... or perhaps that I don't know what I'm doing.
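For the all-numeric first example, a version that avoids axis=1 grouping (deprecated in newer pandas) is to transpose so the duplicate column names become duplicate index labels, group on those, and transpose back:
# assumes every duplicated column is numeric, as in the first example
df.T.groupby(level=0).mean().T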
