pandas replace null values for a subset of columns - python

I have a data frame with many columns, say:
df:
  name  salary  age title
  John     100   35   eng
  Bill     200  NaN   adm
  Lena     NaN   28   NaN
  Jane     120   45   eng
I want to replace the null values in salary and age, but not in the other columns. I know I can do something like this:
u = df[['salary', 'age']]
df[['salary', 'age']] = u.fillna(-1)
But this seems clumsy, as it involves copying. Is there a more efficient way to do this?

According to the pandas documentation (as of 0.23.3), you can pass a per-column dict of fill values:
values = {'salary': -1, 'age': -1}
df.fillna(value=values, inplace=True)
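For reference, a minimal self-contained sketch of that dict-based fillna applied to the sample frame from the question (the frame is rebuilt here from the table above, so the column values are assumptions taken from it):
import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['John', 'Bill', 'Lena', 'Jane'],
                   'salary': [100, 200, np.nan, 120],
                   'age': [35, np.nan, 28, 45],
                   'title': ['eng', 'adm', np.nan, 'eng']})

# Fill only the listed columns; 'title' keeps its NaN
values = {'salary': -1, 'age': -1}
df.fillna(value=values, inplace=True)
print(df)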

Try this:
subset = ['salary', 'age']
df.loc[:, subset] = df.loc[:, subset].fillna(-1)

It is not so beautiful, but it works:
df.salary.fillna(-1, inplace=True)
df.age.fillna(-1, inplace=True)
df
>>>    name  salary   age title
0      John   100.0  35.0   eng
1      Bill   200.0  -1.0   adm
2      Lena    -1.0  28.0   NaN
3      Jane   120.0  45.0   eng

I was hoping fillna() had a subset parameter like drop() does; maybe I should file a feature request with pandas. In the meantime, this is the cleanest version in my opinion:
df[["salary", "age"]] = df[["salary", "age"]].fillna(-1)

You can do:
df = df.assign(
    salary=df.salary.fillna(-1),
    age=df.age.fillna(-1),
)
if you want to chain it with other operations.
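For example, a sketch of such a chain (the final .query step is purely illustrative and not part of the original question):
result = (
    df
    .assign(salary=lambda d: d.salary.fillna(-1),
            age=lambda d: d.age.fillna(-1))
    .query("salary >= 0")   # illustrative follow-up operation
)
Using lambdas instead of df.salary directly makes each step operate on the intermediate result of the chain rather than on the original df.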

Related

Pandas: Combine rows having the same date but different times into a single row for that date (consolidate partial data from different times for the same identity)

I have a sample dataframe as given below.
import pandas as pd
import numpy as np
NaN = np.nan
data = {'ID': ['A', 'A', 'A', 'B', 'B', 'B'],
        'Date': ['2021-09-20 04:34:57', '2021-09-20 04:37:25', '2021-09-20 04:38:26',
                 '2021-09-01 00:12:29', '2021-09-01 11:20:58', '2021-09-02 09:20:58'],
        'Name': ['xx', NaN, NaN, 'yy', NaN, NaN],
        'Height': [174, NaN, NaN, 160, NaN, NaN],
        'Weight': [74, NaN, NaN, 58, NaN, NaN],
        'Gender': [NaN, 'Male', NaN, NaN, 'Female', NaN],
        'Interests': [NaN, NaN, 'Hiking,Sports', NaN, NaN, 'Singing']}
df1 = pd.DataFrame(data)
df1
I want to combine the data present on the same date into a single row. The 'Date' column is in timestamp format.
The final output should look like the image shown below.
Any help is greatly appreciated. Thanks.
New Solution
The old solution was based on the initial version of the question, where empty strings (rather than NaN) were used for undefined values and all columns were of string type. With the updated question using NaN for undefined values (and, after a further update, mixed numeric and string column types), the solution can be simplified as follows:
You can use .groupby() + GroupBy.last() to group by ID and date (without time) and then aggregate the NaN and non-NaN elements, keeping the latest (assuming column Date is in chronological order) non-NaN value per column for each ID, as follows:
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
# Sort `df1` with ['ID', 'Date'] order if not already in this order
#df1 = df1.sort_values(['ID', 'Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .last()
             .reset_index()
          ).replace([None], [np.nan])
Result:
print(df_out)
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female NaN
2 B 2021-09-02 NaN NaN NaN NaN Singing
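For completeness, a sketch of the symmetric variant: GroupBy.first() keeps the earliest non-NaN value per column instead of the latest (same grouping as above assumed):
df_first = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
               .first()
               .reset_index())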
Old Solution
You can use .groupby() + .agg() to group by ID and date and then aggregate the NaN and non-NaN elements, as follows:
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: ''.join(x.dropna().astype(str)))
             .reset_index()
          ).replace('', np.nan)
Result:
print(df_out)
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female NaN
2 B 2021-09-02 NaN NaN NaN NaN Singing
As your original question had all columns of string type, the above code works fine and gives a result with all columns as string type. However, your edited question has data of both numeric and string types. In order to retain the original data types, we can modify the code as follows:
# Convert `Date` to datetime format
df1['Date'] = pd.to_datetime(df1['Date'])
df_out = (df1.groupby(['ID', pd.Grouper(key='Date', freq='D')])
             .agg(lambda x: np.nan if len(w := x.dropna().reset_index(drop=True)) == 0 else w)
             .reset_index()
          )
Result:
print(df_out)
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174.0 74.0 Male Hiking,Sports
1 B 2021-09-01 yy 160.0 58.0 Female NaN
2 B 2021-09-02 NaN NaN NaN NaN Singing
print(df_out.dtypes)
ID object
Date datetime64[ns]
Name object
Height float64 <==== retained as numeric dtype
Weight float64 <==== retained as numeric dtype
Gender object
Interests object
dtype: object
Start first by converting to datetime and flooring:
In [3]: df["Date"] = pd.to_datetime(df["Date"]).dt.floor('D')
In [4]: df
Out[4]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174cm 74kg
1 A 2021-09-20 Male
2 A 2021-09-20 Hiking,Sports
3 B 2021-09-01 yy 160cm 58kg
4 B 2021-09-01 Female
5 B 2021-09-02 Singing
Now using groupby and sum:
In [5]: df.groupby(["ID", "Date"]).sum().reset_index()
Out[5]:
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174cm 74kg Male Hiking,Sports
1 B 2021-09-01 yy 160cm 58kg Female
2 B 2021-09-02 Singing
If your data is correctly ordered, as in your sample, you can merge it as below:
>>> df1.groupby(['ID', pd.Grouper(key='Date', freq='D')]) \
...     .sum().reset_index()
ID Date Name Height Weight Gender Interests
0 A 2021-09-20 xx 174cm 74kg Male Hiking,Sports
1 B 2021-09-01 yy 160cm 58kg Female
2 B 2021-09-02 Singing

An alert when trying to change a value in a column in pandas

I have this dataset (the Titanic dataset):
import pandas as pd
import numpy as np
url = 'https://raw.githubusercontent.com/chrisalbon/simulated_datasets/master/titanic.csv'
df = pd.read_csv(url)
And I want to change, in the column 'Sex', all the values 'male' to NaN. This is the code:
df['Sex'] = df['Sex'].replace('male',np.nan)
df.head(3)
Name PClass Age Sex Survived SexCode
0 Allen, Miss Elisabeth Walton 1st 29.0 female 1 1
1 Allison, Miss Helen Loraine 1st 2.0 female 0 1
2 Allison, Mr Hudson Joshua... 1st 30.0 NaN 0 0
Now I want to roll back and change the NaN values back to 'male'. I tried this:
df['Sex'][df['Sex'].isnull()]='male'
df
But I receive the message: "A value is trying to be set on a copy of a slice from a DataFrame".
The change was made, but perhaps my logic is bad. Could you suggest a better way to code this?
The recommendation from pandas is to do the assignment with .loc as below, which gets rid of the warning.
df.loc[df['Sex'].isnull(),'Sex']='male'
df.head()
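If the goal is simply to undo the earlier replace, an equivalent warning-free alternative is fillna, assuming the only NaN values in the column are the ones that replace introduced:
df['Sex'] = df['Sex'].fillna('male')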

How to replace NaN values in column A with an average value that is related to column B?

I am working on the famous Titanic dataset.
I am trying to fill the NaN values where X.Age.isna() with Avg_Age_byTitle, which I have calculated using X.groupby('Name').mean()['Age']:
Avg_Age_byTitle =
Name
Capt 70.000000
Col 58.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.898148
Ms 28.000000
Rev 43.166667
Sir 49.000000
the Countess 33.000000
Name: Age, dtype: float64
I tried X.Age[Avg_Age_byTitle[X.Name[X.Age.isna()]]], which returns a Series with Age as the index and NaN as the values. What am I doing wrong?
IIUC you need:
df['Age'] = df.groupby('Pclass')['Age'].apply(lambda x: x.fillna(x.mean())).round(1)
This fills the NaN values in Age based on the average within each Pclass group.
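An equivalent formulation uses transform instead of apply, which keeps the result aligned with the original index (a sketch of the same Pclass-based fill):
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('mean')).round(1)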
Given that X and Avg_Age_byTitle both have Name as index, you can try:
X[['Age']] = X[['Age']].fillna(Avg_Age_byTitle)
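If Name is an ordinary column of titles rather than the index, a sketch using map achieves the same alignment:
X['Age'] = X['Age'].fillna(X['Name'].map(Avg_Age_byTitle))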
Thanks All.
Solution:
X.Age = X.groupby(['Name']).Age.apply(lambda X : X.fillna(X.mean()))

Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

So... I have a DataFrame that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
You can reproduce the same DataFrame with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
I want to calculate the STOCK difference between days for every ITEM-STORE pair. Iterating over the groups in a groupby object and using .diff() gives something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
I've managed to do so with the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0)
    lg.append(aux)
df = pd.concat(lg)
But with a large DataFrame it becomes impractical. Is there a faster, more pythonic way to do this task?
I've tried to improve your groupby code, so this should be a lot faster.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
Some pointers/ideas here:
Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be vectorized
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
This can also be re-written as
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
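Put together with the reproducible sample from the question, a quick end-to-end check of the vectorised version might look like this (a sketch; the printed values depend on the random STOCK column):
import itertools as it
import numpy as np
import pandas as pd

lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)),
                  columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0, 5, size=len(df))

# Day-to-day stock change per (ITEM, STORE) pair; the first day's NaN becomes 0
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
print(df.sort_values(['ITEM', 'STORE', 'DATE']).head(8))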

Expand pandas dataframe based on range in a column

I have a pandas dataframe like this:
Name SICs
Agric 0100-0199
Agric 0910-0919
Agric 2048-2048
Food 2000-2009
Food 2010-2019
Soda 2097-2097
The SICs column gives a range of integer values that match the Name given in the first column (although they're stored as a string).
I need to expand this DataFrame so that it has one row for each integer in the range:
Agric 100
Agric 101
Agric 102
...
Agric 199
Agric 910
Agric 911
...
Agric 919
Agric 2048
Food 2000
...
Is there a particularly good way to do this? I was going to do something like this
ranges = {i:r.split('-') for i, r in enumerate(inds['SICs'])}
ranges_expanded = {}
for r in ranges:
    ranges_expanded[r] = range(int(ranges[r][0]), int(ranges[r][1]) + 1)
but I wonder if there's a better way or perhaps a pandas feature to do this. (Also, I'm not sure this will work, as I don't yet see how to read the ranges_expanded dictionary into a DataFrame.)
Quick and dirty but I think this gets you to what you need:
from io import StringIO
import pandas as pd
players=StringIO(u"""Name,SICs
Agric,0100-0199
Agric,0210-0211
Food,2048-2048
Soda,1198-1200""")
df = pd.read_csv(players, sep=",")  # DataFrame.from_csv is deprecated; read_csv reads the same data without setting an index
df2 = pd.DataFrame(columns=('Name', 'SIC'))
count = 0
for idx, r in df.iterrows():
    data = r['SICs'].split("-")
    for i in range(int(data[0]), int(data[1]) + 1):
        df2.loc[count] = (r['Name'], i)
        count += 1
The neatest way I found (building on Andy Hayden's answer):
# Extract the range min and max
df = df.set_index("Name")
df = df['SICs'].str.extract(r"(\d+)-(\d+)")
df.columns = ['min', 'max']
df = df.astype('int')
# Enumerate each range into a wide table
enumerated_ranges = [np.arange(row['min'], row['max'] + 1) for _, row in df.iterrows()]
df = pd.DataFrame.from_records(data=enumerated_ranges, index=df.index)
# Convert from wide to long table
df = df.stack().reset_index(1, drop=True)
It is, however, slow due to the for loop. A vectorised solution would be great, but I can't find one.
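One way to push more of the work into pandas is explode (a sketch, assuming pandas >= 0.25 and starting again from the original two-column frame with Name and SICs; the range construction is still a list comprehension, but the row expansion itself is handled by pandas):
bounds = df['SICs'].str.extract(r'(\d+)-(\d+)').astype(int)
df['SIC'] = [np.arange(lo, hi + 1) for lo, hi in zip(bounds[0], bounds[1])]
out = df.explode('SIC')[['Name', 'SIC']].reset_index(drop=True)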
You can use str.extract to get strings from a regular expression:
In [11]: df
Out[11]:
Name SICs
0 Agri 0100-0199
1 Agri 0910-0919
2 Food 2000-2009
First take out the name as that's the thing we want to keep:
In [12]: df1 = df.set_index("Name")
In [13]: df1
Out[13]:
SICs
Name
Agri 0100-0199
Agri 0910-0919
Food 2000-2009
In [14]: df1['SICs'].str.extract("(\d+)-(\d+)")
Out[14]:
0 1
Name
Agri 0100 0199
Agri 0910 0919
Food 2000 2009
Then flatten this with stack (which adds a MultiIndex):
In [15]: df1['SICs'].str.extract("(\d+)-(\d+)").stack()
Out[15]:
Name
Agri 0 0100
1 0199
0 0910
1 0919
Food 0 2000
1 2009
dtype: object
If you must you can remove the 0-1 level of the MultiIndex:
In [16]: df1['SICs'].str.extract("(\d+)-(\d+)").stack().reset_index(1, drop=True)
Out[16]:
Name
Agri 0100
Agri 0199
Agri 0910
Agri 0919
Food 2000
Food 2009
dtype: object
