I have question similar to a previous post. I want to replace missing values in A with B if B is not-missing. I've used a toy dataset.
#Create sample dataset
import pandas as pd
import numpy as np
np.random.seed(12345)
df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))
df
df[df < 0] = 'NaN'
print(df)
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
#Replace NaN in A with B if B is not NaN
df['A'] = np.where(pd.isnull(df['A']) & pd.notnull(df['B']) == 0, df['B']*1, df['A'])
print(df)
obs A B
0 0.478943 0.478943
1 NaN NaN
2 1.39341 1.39341
3 0.281746 0.281746
4 1.24643 1.24643
5 NaN NaN
6 0.228913 0.228913
7 0.886429 0.886429
8 NaN NaN
9 NaN NaN
This code does the job. But why do I need pd.notnull(df['B']) == 0? If I write:
pd.notnull(df['B'])
instead, the code does not work correctly. The output from that is:
Obs. A B
0 NaN 0.478943
1 NaN NaN
2 1.96578 1.39341
3 0.0929079 0.281746
4 0.769023 1.24643
5 1.00719 NaN
6 0.274992 0.228913
7 1.35292 0.886429
8 NaN NaN
9 1.66903 NaN
I'm trying to understand the flaw in my logic. Any other simple intuitive code will be appreciated.
I basically need to do this simple operation for a very large dataset (100m obs+) so looking for a fast way (in terms of computer processing time) to do it. Thanks in advance.
Replace 'NaN' with np.nan and apply fillna on column A using column B
df = df.replace('NaN', np.nan)
df.A.fillna(df.B, inplace=True)
Output:
A B
0 0.478943 0.478943
1 NaN NaN
2 1.965781 1.393406
3 0.092908 0.281746
4 0.769023 1.246435
5 1.007189 NaN
6 0.274992 0.228913
7 1.352917 0.886429
8 NaN NaN
9 1.669025 NaN
Related
Consider a simple dataframe:
import numpy as np
import pandas as pd
x = pd.DataFrame(np.arange(10).reshape(5,2))
print(x)
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
I would like to create a hierarchically indexed dataframe of the form:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
where the 'a' columns correspond to the original dataframe columns and the 'b' columns are blank (or nan).
I can certainly create a hierarchically indexed dataframe with all NaNs and loop over the columns of the original dataframe, writing them into
the new dataframe. Is there something more compact than that?
you can do with MultiIndex.from_product
extra_level = ['a', 'b']
new_cols = pd.MultiIndex.from_product([x.columns, extra_level])
x.columns = new_cols[::len(x.columns)] # select all the first element of extra_level
x = x.reindex(columns=new_cols)
print(x)
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
Very much like #Ben.T I am using MultiIndex.from_product:
x.assign(l='a')
.set_index('l', append=True)
.unstack()
.reindex(pd.MultiIndex.from_product([x.columns.tolist(), ['a','b']]), axis=1)
Output:
0 1
a b a b
0 0 NaN 1 NaN
1 2 NaN 3 NaN
2 4 NaN 5 NaN
3 6 NaN 7 NaN
4 8 NaN 9 NaN
I have a dataframe like,
pri_col col1 col2 Date
r1 3 4 2020-09-10
r1 4 1 2020-09-11
r1 2 7 2020-09-12
r1 6 4 2020-09-13
Note: There are many more unique values in 'pri_col' column. This is just a sample here. So I'm giving single value. Also, for a single value of 'pri_col' the value of 'Date' will be unique always.
I need the dataframe like,
pri_col col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
r1 3 4 2 6 4 1 7 4
According to a previous solution, I have tried this solution:
df = (df.reset_index()
.melt(id_vars=['index','pri_col','Date'],
var_name='cols',
value_name='val')
.pivot(index=['index','pri_col'],
columns=['cols','Date'],
values='val'))
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(level=1).rename_axis(None)
print (df)
But this is the resulting dataframe:
pri_col col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
r1 3 NaN NaN NaN 4 NaN NaN NaN
r1 NaN 4 NaN NaN NaN 1 NaN NaN
r1 NaN NaN 2 NaN NaN NaN 7 NaN
r1 NaN NaN NaN 6 NaN NaN NaN 4
How do I solve the issue?
Also, I had asked a question recently that may sound similar.
IIUC, use pandas.DataFrame.set_index with unstack:
new_df = df.set_index(['pri_col', 'Date']).unstack()
new_df.columns = ["%s_%s" % (i, j) for i, j in new_df.columns]
print(new_df)
Output:
col1_2020-09-10 col1_2020-09-11 col1_2020-09-12 col1_2020-09-13 \
pri_col
r1 3 4 2 6
col2_2020-09-10 col2_2020-09-11 col2_2020-09-12 col2_2020-09-13
pri_col
r1 4 1 7 4
I have a DataFrame df that looks something like this:
df
a b c
0 0.557894 -0.196294 -0.020490
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
5 -0.337374 NaN -0.771888
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
9 -2.345448 2.443669 -1.409422
I want to select the rows that have a value over some value, which I would normally do using:
new_df = df[df['c'] >= .5]
but that will return:
a b c
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
5 -0.337374 NaN 0.771888
8 0.737413 NaN 0.679575
I want to get those rows, but also keep the rows that have nan values in column 'c'. I haven't been able to find a question asking the same thing, they usually ask for one or the other, but not both. I can hard code the rows that I want to drop since I know the specific values, but I was wondering if there is a better solution. The end result should look something like this:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
Only dropping rows 0,5 and 9 since they are less than .5 in columns 'c'
You should use the | (or) operator.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.557894,1.138774,np.nan,-0.069319,1.040089,-0.337374,-1.813278,np.nan,0.737413,-2.345448],
'b': [-0.196294,-0.699224,2.384483,np.nan,-0.271777,np.nan,-1.564666,np.nan,np.nan,2.443669],
'c': [-0.020490,np.nan,0.554292,1.162941,np.nan,-0.771888,np.nan,np.nan,0.679575,-1.409422]})
df = df[(df['c'] >= .5) | (df['c'].isnull())]
print(df)
Output:
a b c
1 1.138774 -0.699224 NaN
2 NaN 2.384483 0.554292
3 -0.069319 NaN 1.162941
4 1.040089 -0.271777 NaN
6 -1.813278 -1.564666 NaN
7 NaN NaN NaN
8 0.737413 NaN 0.679575
You should be able to do this by
new_df = df[df['c'] >=5 or df['c'] == 'NaN']
I'm pretty new to Pandas and programming in general but I've always been able to find the answer to any problem through google until now. Sorry about the not terribly descriptive question, hopefully someone can come up with something clearer.
I'm trying to group data together, perform functions on that data, update a column and then use the data from that column on the next group of data.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random(9),columns=['A'])
df['B'] = [1,1,1,2,2,3,3,3,3]
df['C'] = np.nan
df['D'] = np.nan
df.loc[0:2,'C'] = 500
Giving me
A B C D
0 0.825828 1 500.0 NaN
1 0.218618 1 500.0 NaN
2 0.902476 1 500.0 NaN
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
The 500 in column C is the initial condition. I want to group the data by column B and perform the following function on the first group
def function1(row):
return row['A']*row['C']/6
giving me
A B C D
0 0.825828 1 500.0 68.818971
1 0.218618 1 500.0 18.218145
2 0.902476 1 500.0 75.206313
3 0.452525 2 NaN NaN
4 0.513505 2 NaN NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then want to sum the first three values in D and add them to the last value in C and making this value the group 2 value
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 NaN
4 0.513505 2 662.243429 NaN
5 0.089975 3 NaN NaN
6 0.282479 3 NaN NaN
7 0.774286 3 NaN NaN
8 0.408501 3 NaN NaN
I then perform function1 on group 2 and repeat until I end up with this
A B C D
0 0.825828 1 500.000000 68.818971
1 0.218618 1 500.000000 18.218145
2 0.902476 1 500.000000 75.206313
3 0.452525 2 662.243429 49.946896
4 0.513505 2 662.243429 56.677505
5 0.089975 3 768.867830 11.529874
6 0.282479 3 768.867830 36.198113
7 0.774286 3 768.867830 99.220591
8 0.408501 3 768.867830 52.347246
The dataframe will consist of hundreds of rows. I've been trying various groupby, apply combinations but I'm completely stumped.
Thanks
Here is a solution:
df['D'] = df['A'] * df['C']/6
for i in df['B'].unique()[1:]:
df.loc[df['B']==i, 'C'] = df['D'].sum()
df.loc[df['B']==i, 'D'] = df['A'] * df['C']/6
You can use numpy.unique() for the selction. In your code this might look somehow like this:
import numpy as np
import math
unique, indices, counts = np.unique(df['B'], return_index=True, return_counts=True)
for i in range(len(indices)):
for j in range(len(counts)):
row = df[indices[i]+j]
if math.isnan(row['C']):
row['C'] = df.loc[indices[i-1], 'D']
# then call your function
function1(row)
Suppose I have a dataframe:
a b c
0 1 2 NaN
1 2 NaN 4
3 Nan 4 NaN
I want to check for NaN in only some particular column's and want the resulting dataframe as:
a b c
0 1 2 NaN
3 Nan 4 NaN
Here I want to check for NaN in only Column 'a' and Column 'c'.
How this can be done?
You could do that with isnull and any methods:
In [264]: df
Out[264]:
a b c
0 1 2 NaN
1 2 NaN 4
2 NaN 4 NaN
In [265]: df[df.isnull().any(axis=1)]
Out[265]:
a b c
0 1 2 NaN
2 NaN 4 NaN
Note: if you just want clear rows without any NaN you could use dropna method
EDIT
If you want to subset your dataframe you could use mask with your columns and apply it to the whole dataframe:
df_subset = df[['a', 'c']]
In [282]: df[df_subset.isnull().any(axis=1)]
Out[282]:
a b c
0 1 2 NaN
2 NaN 4 NaN