Fill nulls in columns with non-null values from other columns - python

Given a dataframe with similar columns that contain null values, how can I dynamically fill the nulls in one column with non-null values from the other columns in the same row, without explicitly naming those other columns? For example, select the first column, category1, and fill its null rows with values from the other columns.
import pandas as pd
data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016,2017, 2018, 2019],
'category1': [None, 21, None, 10, None, 30, 31,45, 23, 56],
'category2': [10, 21, 20, 10, None, 30, None,45, 23, 56],
'category3': [10, 21, 20, 10, None, 30, 31,45, 23, 56],}
df = pd.DataFrame(data)
df = df.set_index('year')
df
category1 category2 category3
year
2010 NaN 10 10
2011 21 21 21
2012 NaN 20 20
2013 10 10 10
2014 NaN NaN NaN
2015 30 30 30
2016 31 NaN 31
2017 45 45 45
2018 23 23 23
2019 56 56 56
After filling category1:
category1 category2 category3
year
2010 10 10 10
2011 21 21 21
2012 20 20 20
2013 10 10 10
2014 NaN NaN NaN
2015 30 30 30
2016 31 NaN 31
2017 45 45 45
2018 23 23 23
2019 56 56 56

IIUC you can do it this way:
In [369]: df['category1'] = df['category1'].fillna(df['category2'])
In [370]: df
Out[370]:
category1 category2 category3
year
2010 10.0 10.0 10.0
2011 21.0 21.0 21.0
2012 20.0 20.0 20.0
2013 10.0 10.0 10.0
2014 NaN NaN NaN
2015 30.0 30.0 30.0
2016 31.0 NaN 31.0
2017 45.0 45.0 45.0
2018 23.0 23.0 23.0
2019 56.0 56.0 56.0

You can use first_valid_index with condition if all values are NaN:
def f(x):
    # first_valid_index() returns None when every value in the row is NaN
    if x.first_valid_index() is None:
        return None
    else:
        return x[x.first_valid_index()]
df['a'] = df.apply(f, axis=1)
print (df)
category1 category2 category3 a
year
2010 NaN 10.0 10.0 10.0
2011 21.0 21.0 21.0 21.0
2012 NaN 20.0 20.0 20.0
2013 10.0 10.0 10.0 10.0
2014 NaN NaN NaN NaN
2015 30.0 30.0 30.0 30.0
2016 31.0 NaN 31.0 31.0
2017 45.0 45.0 45.0 45.0
2018 23.0 23.0 23.0 23.0
2019 56.0 56.0 56.0 56.0
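If a Python-level function per row is too slow, the same "first non-null value in the row" idea can probably be expressed with a row-wise backfill. This is only a minimal sketch on the question's data; it assumes column order decides which value wins, and the name first_non_null is made up for illustration.

```python
import pandas as pd

data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019],
        'category1': [None, 21, None, 10, None, 30, 31, 45, 23, 56],
        'category2': [10, 21, 20, 10, None, 30, None, 45, 23, 56],
        'category3': [10, 21, 20, 10, None, 30, 31, 45, 23, 56]}
df = pd.DataFrame(data).set_index('year')

# Backfill along each row, so the first column ends up holding the
# first non-null value found in that row (NaN if the whole row is empty).
first_non_null = df.bfill(axis=1).iloc[:, 0]

# Fill only the gaps in category1; no other column is named explicitly.
df['category1'] = df['category1'].fillna(first_non_null)
print(df)
```

Row 2014 stays NaN because no column has a value for that year.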

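Try this (it fills with the row-wise median, which matches the other columns' value only when those columns agree):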
df['category1']= df['category1'].fillna(df.median(axis=1))

Related

Substituting values with conditions

I have a dataframe like this one below
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.5 10 18 10.00 10.00
St.6 40 4 25.00 30.00
St.7 10 13 8.00 15.00
St.8 46 11 31.00 40.00
St.9 28 9 10.00 10.00
St.10 14 22 25.00 30.00
St.11 5 40 8.00 15.00
St.12 11 10 31.00 40.00
...
St.89 61 35 10.00 10.00
St.90 23 29 25.00 30.00
St.91 35 12 8.00 15.00
St.92 31 7 31.00 40.00
I want to change the station codes by matching the coordinates, substituting the codes by repeating the first 4 codes, to obtain this:
Air Station Code Humidity Temperature Latitude Longitude
St.1 20 10 10.00 10.00
St.2 4 15 25.00 30.00
St.3 16 21 8.00 15.00
St.4 38 8 31.00 40.00
St.1 10 18 10.00 10.00
St.2 40 4 25.00 30.00
St.3 10 13 8.00 15.00
St.4 46 11 31.00 40.00
St.1 28 9 10.00 10.00
St.2 14 22 25.00 30.00
St.3 5 40 8.00 15.00
St.4 11 10 31.00 40.00
...
St.1 61 35 10.00 10.00
St.2 23 29 25.00 30.00
St.3 35 12 8.00 15.00
St.4 31 7 31.00 40.00
Is there some way to implement an "if/else" substitution on the whole dataframe without going manually over every observation in python?
df['Air Station Code'] = 'St.' + pd.Series(df[['Latitude','Longitude']].astype(str).agg(sum, axis=1).factorize()[0] + 1).astype(str)
df
Out[77]:
Air Station Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
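Concatenating the two coordinates as strings can in principle give the same key for different pairs (for example "1.0" + "10.0" and "1.01" + "0.0"). A possible alternative, sketched here on a shortened copy of the data, is to number the coordinate pairs directly with groupby(...).ngroup(). The variable name group_ids is an assumption for illustration, and like the one-liner above this renumbers the stations in order of first appearance rather than looking up the original codes.

```python
import pandas as pd

df = pd.DataFrame({
    "Air Station Code": ["St.1", "St.2", "St.3", "St.4", "St.5", "St.6", "St.7", "St.8"],
    "Latitude": [10.0, 25.0, 8.0, 31.0, 10.0, 25.0, 8.0, 31.0],
    "Longitude": [10.0, 30.0, 15.0, 40.0, 10.0, 30.0, 15.0, 40.0],
})

# sort=False numbers each (Latitude, Longitude) pair in order of first appearance;
# ngroup() starts at 0, so add 1 to get St.1, St.2, ...
group_ids = df.groupby(["Latitude", "Longitude"], sort=False).ngroup() + 1
df["Air Station Code"] = "St." + group_ids.astype(str)
print(df)
```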
There may be a better way to do this, but this solves the problem. I create a second dataframe without the duplicates, keeping the first occurrence of each lat/long pair. I make lat/long its index and drop the other columns. I then join it to the original dataframe on lat/long, which adds a new column holding the looked-up station code, and finally overwrite the original station code with that column.
import pandas as pd
data = [
["St.1", 20, 10, 10.00, 10.00],
["St.2", 4, 15, 25.00, 30.00],
["St.3", 16, 21, 8.00, 15.00],
["St.4", 38, 8, 31.00, 40.00],
["St.5", 10, 18, 10.00, 10.00],
["St.6", 40, 4, 25.00, 30.00],
["St.7", 10, 13, 8.00, 15.00],
["St.8", 46, 11, 31.00, 40.00],
["St.9", 28, 9, 10.00, 10.00],
["St.10", 14, 22, 25.00, 30.00],
["St.11", 5, 40, 8.00, 15.00],
["St.12", 11, 10, 31.00, 40.00],
["St.89", 61, 35, 10.00, 10.00],
["St.90", 23, 29, 25.00, 30.00],
["St.91", 35, 12, 8.00, 15.00],
["St.92", 31, 7, 31.00, 40.00],
]
column = "Air_Station_Code Humidity Temperature Latitude Longitude".split()
df = pd.DataFrame(data,columns=column)
print(df)
df1 = df.drop_duplicates(['Latitude','Longitude'])
df1 = df1[['Air_Station_Code','Latitude','Longitude']]
df1.set_index(['Latitude','Longitude'], inplace=True)
print(df1)
df2 = df.join( df1, on=['Latitude','Longitude'], rsuffix='R' )
print(df2)
df['Air_Station_Code'] = df2['Air_Station_CodeR']
print(df)
Output:
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.5 10 18 10.0 10.0
5 St.6 40 4 25.0 30.0
6 St.7 10 13 8.0 15.0
7 St.8 46 11 31.0 40.0
8 St.9 28 9 10.0 10.0
9 St.10 14 22 25.0 30.0
10 St.11 5 40 8.0 15.0
11 St.12 11 10 31.0 40.0
12 St.89 61 35 10.0 10.0
13 St.90 23 29 25.0 30.0
14 St.91 35 12 8.0 15.0
15 St.92 31 7 31.0 40.0
Air_Station_Code
Latitude Longitude
10.0 10.0 St.1
25.0 30.0 St.2
8.0 15.0 St.3
31.0 40.0 St.4
Air_Station_Code Humidity ... Longitude Air_Station_CodeR
0 St.1 20 ... 10.0 St.1
1 St.2 4 ... 30.0 St.2
2 St.3 16 ... 15.0 St.3
3 St.4 38 ... 40.0 St.4
4 St.5 10 ... 10.0 St.1
5 St.6 40 ... 30.0 St.2
6 St.7 10 ... 15.0 St.3
7 St.8 46 ... 40.0 St.4
8 St.9 28 ... 10.0 St.1
9 St.10 14 ... 30.0 St.2
10 St.11 5 ... 15.0 St.3
11 St.12 11 ... 40.0 St.4
12 St.89 61 ... 10.0 St.1
13 St.90 23 ... 30.0 St.2
14 St.91 35 ... 15.0 St.3
15 St.92 31 ... 40.0 St.4
[16 rows x 6 columns]
Air_Station_Code Humidity Temperature Latitude Longitude
0 St.1 20 10 10.0 10.0
1 St.2 4 15 25.0 30.0
2 St.3 16 21 8.0 15.0
3 St.4 38 8 31.0 40.0
4 St.1 10 18 10.0 10.0
5 St.2 40 4 25.0 30.0
6 St.3 10 13 8.0 15.0
7 St.4 46 11 31.0 40.0
8 St.1 28 9 10.0 10.0
9 St.2 14 22 25.0 30.0
10 St.3 5 40 8.0 15.0
11 St.4 11 10 31.0 40.0
12 St.1 61 35 10.0 10.0
13 St.2 23 29 25.0 30.0
14 St.3 35 12 8.0 15.0
15 St.4 31 7 31.0 40.0
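The drop_duplicates/join lookup above can likely be shortened with groupby(...).transform("first"), which broadcasts the first station code of each coordinate pair back to every row of that pair. A minimal sketch on a shortened copy of the data, assuming the first occurrence of each pair carries the code to keep:

```python
import pandas as pd

df = pd.DataFrame({
    "Air_Station_Code": ["St.1", "St.2", "St.3", "St.4", "St.5", "St.6", "St.7", "St.8"],
    "Latitude": [10.0, 25.0, 8.0, 31.0, 10.0, 25.0, 8.0, 31.0],
    "Longitude": [10.0, 30.0, 15.0, 40.0, 10.0, 30.0, 15.0, 40.0],
})

# For every row, take the first code seen for its (Latitude, Longitude) pair.
df["Air_Station_Code"] = (
    df.groupby(["Latitude", "Longitude"])["Air_Station_Code"].transform("first")
)
print(df)
```

Unlike the factorize approach, this reuses the existing codes, so it also works when the first four codes are not literally St.1 to St.4.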

Filling of NaN values with the average of Quantity corresponding to a particular year

Year Week_Number DC_Zip Asin_code
1 2016 1 84105 NaN
2 2016 1 85034 NaN
3 2016 1 93711 NaN
4 2016 1 98433 NaN
5 2016 2 12206 21.0
6 2016 2 29306 10.0
7 2016 2 33426 11.0
8 2016 2 37206 1.0
9 2017 1 12206 266.0
10 2017 1 29306 81.0
11 2017 1 33426 NaN
12 2017 1 37206 NaN
13 2017 1 45216 99.0
14 2017 1 60160 100.0
15 2017 1 76110 76.0
16 2018 1 12206 562.0
17 2018 1 29306 184.0
18 2018 1 33426 NaN
19 2018 1 37206 NaN
20 2018 1 45216 187.0
21 2018 1 60160 192.0
22 2018 1 76110 202.0
23 2019 1 12206 511.0
24 2019 1 29306 NaN
25 2019 1 33426 224.0
26 2019 1 37206 78.0
27 2019 1 45216 160.0
28 2019 1 60160 NaN
29 2019 1 76110 221.0
30 2020 6 93711 NaN
31 2020 6 98433 NaN
32 2020 7 12206 74.0
33 2020 7 29306 22.0
34 2020 7 33426 32.0
35 2020 7 37206 10.0
36 2020 7 45216 34.0
I want to fill the NaN values with the average of Asin_code for that particular year. I am able to fill the values for 2016 with this code:
df["Asin_code"]=df.Asin_code.fillna(df.Asin_code.loc[(df.Year==2016)].mean(),axis=0)
But I am unable to do this for the whole dataframe.
Use groupby().transform() and fillna:
df['Asin_code'] = df['Asin_code'].fillna(df.groupby('Year').Asin_code.transform('mean'))
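To see why this works: transform('mean') returns a Series aligned with the original rows, holding each row's per-year mean, so fillna can pick the matching value for every gap. A small sketch on made-up numbers (the values below are illustrative, not the question's data):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Asin_code": [np.nan, 10.0, 20.0, 100.0, np.nan, 300.0],
})

# One mean per row, computed within that row's year (NaN values are skipped).
year_mean = df.groupby("Year")["Asin_code"].transform("mean")

# Only the missing entries are replaced; existing values are kept.
df["Asin_code"] = df["Asin_code"].fillna(year_mean)
print(df)
```

A year whose values are all NaN has no mean, so its rows would stay NaN.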

weighted average on a column of dataframe with provided weighted rate

Year Week_Number DC_Zip Asin_code
0 2016 1 12206 NaN
1 2016 1 29306 NaN
2 2016 1 33426 NaN
3 2016 1 37206 NaN
4 2016 1 45216 NaN
5 2016 1 60160 NaN
6 2016 1 76110 NaN
7 2016 1 80215 NaN
8 2016 1 84105 NaN
9 2016 1 85034 NaN
10 2016 1 93711 NaN
11 2016 1 98433 NaN
12 2016 2 12206 21.0
13 2016 2 29306 10.0
14 2016 2 33426 11.0
15 2016 2 37206 1.0
16 2016 2 45216 5.0
17 2016 2 60160 7.0
18 2016 2 76110 12.0
19 2016 2 80215 NaN
20 2016 2 84105 2.0
21 2016 2 85034 1.0
22 2016 2 93711 23.0
23 2016 2 98433 7.0
24 2016 3 12206 95.0
25 2016 3 29306 26.0
26 2016 3 33426 51.0
27 2016 3 37206 18.0
28 2016 3 45216 34.0
29 2016 3 60160 30.0
... ... ... ... ...
2778 2020 29 76110 33.0
2779 2020 29 80215 5.0
2780 2020 29 84105 3.0
2781 2020 29 85034 8.0
2782 2020 29 93711 53.0
2783 2020 29 98433 15.0
2784 2020 30 12206 75.0
2785 2020 30 29306 27.0
2786 2020 30 33426 34.0
2787 2020 30 37206 12.0
2788 2020 30 45216 14.0
2789 2020 30 60160 28.0
2790 2020 30 76110 47.0
2791 2020 30 80215 11.0
2792 2020 30 84105 3.0
2793 2020 30 85034 17.0
2794 2020 30 93711 62.0
2795 2020 30 98433 13.0
2796 2020 31 12206 109.0
2797 2020 31 29306 30.0
2798 2020 31 33426 31.0
2799 2020 31 37206 14.0
2800 2020 31 45216 23.0
2801 2020 31 60160 21.0
2802 2020 31 76110 25.0
2803 2020 31 80215 7.0
2804 2020 31 84105 4.0
2805 2020 31 85034 8.0
2806 2020 31 93711 71.0
2807 2020 31 98433 9.0
2808 rows × 4 columns
This is the sales data I am dealing with. I have to compute a weighted average on Asin_code with weighted rate = [5, 5, 20, 30, 40] for the respective years 2016, 2017, 2018, 2019 and 2020. I have to create a function that gives me a column containing the weighted average of Asin_code. NaN values should be dropped. We should also be able to change the weighted rate in the future to view more patterns in the data. Any help would be appreciated.
I am trying the following code:
for i in range(len(df.Asin_code)):
    df["Weighted_avg"]=rate[0]*df.Asin_code[i]/df.Asin_code.loc[(df.Year==2016)].sum()
I am just facing difficulties consolidating the data for the whole 5 years.
It becomes much simpler if you define your weights as a dict instead of a list; then a simple use of apply() works:
# define weights for year as a dict
wr = {2016:5, 2017:5, 2018:20, 2019:30, 2020:40}
df["Weighted_avg"] = df.apply(lambda r:
    # numerator is weight * Asin_code[i]
    ( r["Asin_code"] * wr[r["Year"]]
      /
      # denominator is sum(Asin_code for year)
      df.Asin_code.loc[(df.Year==r["Year"])].sum() ), axis=1)
output
Idx Year Week_Number DC_Zip Asin_code Weighted_avg
25 2016 3 29306 26.0 0.367232
26 2016 3 33426 51.0 0.720339
27 2016 3 37206 18.0 0.254237
28 2016 3 45216 34.0 0.480226
29 2016 3 60160 30.0 0.423729
2778 2020 29 76110 33.0 1.625616
2779 2020 29 80215 5.0 0.246305
2780 2020 29 84105 3.0 0.147783
2781 2020 29 85034 8.0 0.394089
2782 2020 29 93711 53.0 2.610837
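The apply() version recomputes the yearly sum for every row, which may be slow on a few thousand rows. A vectorized sketch of the same formula, using map() for the weight lookup and transform("sum") for the per-year denominator, is shown below on made-up numbers; the names weight and year_total are assumptions for illustration.

```python
import pandas as pd
import numpy as np

# weights per year, kept as a dict so they are easy to change later
wr = {2016: 5, 2017: 5}

df = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "Asin_code": [10.0, 30.0, 50.0, np.nan],
})

weight = df["Year"].map(wr)                                     # weight for each row's year
year_total = df.groupby("Year")["Asin_code"].transform("sum")   # sum per year, NaN skipped

# weight * Asin_code / sum(Asin_code for that year); NaN rows stay NaN
df["Weighted_avg"] = weight * df["Asin_code"] / year_total
print(df)
```

Rows with a missing Asin_code keep NaN in the new column and can be removed afterwards with dropna() if needed.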
Supplementary update
Updated request: weighted_average[at index 1]=rate[for year 2016]*Asin_code[at first index of 2016]+rate[for year 2017]*Asin_code[at first index of 2017]+rate[for year 2018]*Asin_code[at first index of 2018]+rate[for year 2019]*Asin_code[at first index of 2019]+rate[for year 2020]*Asin_code[at first index of 2020]
df.dropna().groupby("Year").agg({"Asin_code":"first"}).reset_index()\
.assign(wa=lambda dfa:
dfa.apply(lambda r: r["Asin_code"]*wr[r['Year']],axis=1))["wa"].sum()
df["Weighted_avg"] = df.apply(lambda r: ( (r["Asin_code"] *wr[r["Year"]]).sum(axis = 0)), axis=1)
Output
12 2016 2 12206 21.0 105.0
13 2016 2 29306 10.0 50.0
14 2016 2 33426 11.0 55.0
15 2016 2 37206 1.0 5.0
16 2016 2 45216 5.0 25.0
17 2016 2 60160 7.0 35.0
18 2016 2 76110 12.0 60.0
19 2016 2 80215 NaN NaN
20 2016 2 84105 2.0 10.0
21 2016 2 85034 1.0 5.0
22 2016 2 93711 23.0 115.0
23 2016 2 98433 7.0 35.0
24 2016 3 12206 95.0 475.0
25 2016 3 29306 26.0 130.0
26 2016 3 33426 51.0 255.0
27 2016 3 37206 18.0 90.0
28 2016 3 45216 34.0 170.0
29 2016 3 60160 30.0 150.0
... ... ... ... ... ...
2778 2020 29 76110 33.0 1320.0
2779 2020 29 80215 5.0 200.0
2780 2020 29 84105 3.0 120.0
2781 2020 29 85034 8.0 320.0
2782 2020 29 93711 53.0 2120.0
2783 2020 29 98433 15.0 600.0
2784 2020 30 12206 75.0 3000.0
2785 2020 30 29306 27.0 1080.0
2786 2020 30 33426 34.0 1360.0
2787 2020 30 37206 12.0 480.0
2788 2020 30 45216 14.0 560.0
2789 2020 30 60160 28.0 1120.0
2790 2020 30 76110 47.0 1880.0
2791 2020 30 80215 11.0 440.0
2792 2020 30 84105 3.0 120.0
2793 2020 30 85034 17.0 680.0
2794 2020 30 93711 62.0 2480.0
2795 2020 30 98433 13.0 520.0
2796 2020 31 12206 109.0 4360.0
2797 2020 31 29306 30.0 1200.0
2798 2020 31 33426 31.0 1240.0
2799 2020 31 37206 14.0 560.0
2800 2020 31 45216 23.0 920.0
2801 2020 31 60160 21.0 840.0
2802 2020 31 76110 25.0 1000.0
2803 2020 31 80215 7.0 280.0
2804 2020 31 84105 4.0 160.0
2805 2020 31 85034 8.0 320.0
2806 2020 31 93711 71.0 2840.0
2807 2020 31 98433 9.0 360.0
Got my solution with this.

Pandas df.pivot_table - aggfunc = sum not producing desired output

Say I have a data frame, sega_df:
MONTH Character Rings Chili Dogs Emeralds
0 Jun 2017 Sonic 25.0 10.0 6.0
5 Jun 2017 Sonic 19.0 15.0 0.0
8 Jun 2017 Shadow 4.0 1.0 0.0
9 Jun 2017 Shadow 23.0 1.0 0.0
12 Jun 2017 Knuckles 9.0 3.0 1.0
13 Jun 2017 Tails 10.0 6.0 0.0
22 Jul 2017 Sonic 5.0 20.0 0.0
23 Jul 2017 Shadow 3.0 3.0 7.0
24 Jul 2017 Knuckles 9.0 4.0 0.0
27 Jul 2017 Knuckles 11.0 2.0 0.0
28 Jul 2017 Tails 12.0 3.0 0.0
29 Jul 2017 Tails 12.0 5.0 0.0
My pivot_table command gives me a table of each character by row against each month, but the values are a seemingly random series of NaN or 0. The 0s are because there is more data with 0s in later months and I only posted the first few rows. The data types of the values in the three columns (Rings, Chili Dogs, and Emeralds) are numpy.float64, so I'm also curious whether that affects it, or whether the problem is how I define aggfunc.
My values argument and pivot_table command are as follows:
values = list(sega_df.columns.values)
test = pd.pivot_table(data = sega_df, values = values, index = 'Character', columns = 'MONTH', aggfunc='sum')
Here is my desired pivot_table output, -- with the sum of the three columns per character per month (eg. Sonic for month of June is [25 + 10 + 6 + 19 + 15 + 0] = 75.0):
MONTH Jun 2017 Jul 2017
Character
0 Sonic 75.0 25.0
1 Shadow 29.0 13.0
2 Knuckles 13.0 26.0
3 Tails 16.0 32.0
You just need a groupby sum, then a sum across the columns (axis=1), and then unstack:
df.groupby(['Character','MONTH']).sum().sum(1).unstack()
Out[953]:
MONTH Jul2017 Jun2017
Character
Knuckles 26.0 13.0
Shadow 13.0 29.0
Sonic 25.0 75.0
Tails 32.0 16.0
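If you would rather stay with pivot_table, one possible reading of the problem is that values was built from all columns, including Character and MONTH themselves. A sketch that first adds a per-row total and then pivots only that column is shown below; the names value_cols and Total are assumptions for illustration, and the frame is rebuilt from the rows posted in the question.

```python
import pandas as pd

sega_df = pd.DataFrame({
    "MONTH": ["Jun 2017"] * 6 + ["Jul 2017"] * 6,
    "Character": ["Sonic", "Sonic", "Shadow", "Shadow", "Knuckles", "Tails",
                  "Sonic", "Shadow", "Knuckles", "Knuckles", "Tails", "Tails"],
    "Rings": [25.0, 19.0, 4.0, 23.0, 9.0, 10.0, 5.0, 3.0, 9.0, 11.0, 12.0, 12.0],
    "Chili Dogs": [10.0, 15.0, 1.0, 1.0, 3.0, 6.0, 20.0, 3.0, 4.0, 2.0, 3.0, 5.0],
    "Emeralds": [6.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 7.0, 0.0, 0.0, 0.0, 0.0],
})

# Only the numeric measures, not the grouping columns.
value_cols = ["Rings", "Chili Dogs", "Emeralds"]

# Sum the three measures per row, then sum those totals per character and month.
totals = sega_df.assign(Total=sega_df[value_cols].sum(axis=1)).pivot_table(
    values="Total", index="Character", columns="MONTH", aggfunc="sum"
)
print(totals)
```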

How to select rows that do not consist of only NaN values and 0s

This is my dataframe:
cols = ['Country', 'Year', 'Orange', 'Apple', 'Plump']
data = [['US', 2008, 17, 29, 19],
['US', 2009, 11, 12, 16],
['US', 2010, 14, 16, 38],
['Spain', 2008, 11, None, 33],
['Spain', 2009, 12, 19, 17],
['France', 2008, 17, 19, 21],
['France', 2009, 19, 22, 13],
['France', 2010, 12, 11, 0],
['France', 2010, 0, 0, 0],
['Italy', 2009, None, None, None],
['Italy', 2010, 15, 16, 17],
['Italy', 2010, 0, None, None],
['Italy', 2011, 42, None, None]]
df = pd.DataFrame(data, columns=cols)
I want to select the rows in which Orange, Apple and Plump do not consist of only None values, only 0s, or a mix of the two. So the resulting output should be:
Country Year Orange Apple Plump
0 US 2008 17.0 29.0 19.0
1 US 2009 11.0 12.0 16.0
2 US 2010 14.0 16.0 38.0
3 Spain 2008 11.0 NaN 33.0
4 Spain 2009 12.0 19.0 17.0
5 France 2008 17.0 19.0 21.0
6 France 2009 19.0 22.0 13.0
7 France 2010 12.0 11.0 0.0
10 Italy 2010 15.0 16.0 17.0
12 Italy 2011 42.0 NaN NaN
Second, I want to drop the countries for which I don't have observations for all three years, so the resulting output should consist of only US and France. How can I get them?
I have tried something like:
df = df[(df['Orange'].notnull())| \
(df['Apple'].notnull()) | (df['Plump'].notnull()) | (df['Orange'] != 0 )| (df['Apple']!= 0) | (df['Plump']!= 0)]
Also I tried:
df = df[((df['Orange'].notnull())| \
(df['Apple'].notnull()) | (df['Plump'].notnull())) & ((df['Orange'] != 0 )| (df['Apple']!= 0) | (df['Plump']!= 0))]
In [307]: df[~df[['Orange','Apple','Plump']].fillna(0).eq(0).all(1)]
Out[307]:
Country Year Orange Apple Plump
0 US 2008 17.0 29.0 19.0
1 US 2009 11.0 12.0 16.0
2 US 2010 14.0 16.0 38.0
3 Spain 2008 11.0 NaN 33.0
4 Spain 2009 12.0 19.0 17.0
5 France 2008 17.0 19.0 21.0
6 France 2009 19.0 22.0 13.0
7 France 2010 12.0 11.0 0.0
10 Italy 2010 15.0 16.0 17.0
12 Italy 2011 42.0 NaN NaN
None values are read as NaN, so you can replace the 0s with NaN as well. After that you can do what MaxU suggested. That would be something like:
In: df = df.replace(0,np.nan)
df = df[df[['Orange','Apple','Plump']].notnull().any(axis=1)]
Out:
Country Year Orange Apple Plump
0 US 2008 17 29 19
1 US 2009 11 12 16
2 US 2010 14 16 38
3 Spain 2008 11 NaN 33
4 Spain 2009 12 19 17
5 France 2008 17 19 21
6 France 2009 19 22 13
7 France 2010 12 11 NaN
10 Italy 2010 15 16 17
12 Italy 2011 42 NaN NaN
For your second question, I understand that you want to get rid of the countries for which you don't have observations for 2008, 2009 and 2010.
For that you could do something like:
countries = []
for country, group in df.groupby('Country'):
    # keep the country only if all three years appear among its rows
    if {2008, 2009, 2010}.issubset(group['Year']):
        countries.append(country)
df = df[df.Country.isin(countries)]
Which will yield something like:
Country Year Orange Apple Plump
0 US 2008 17 29 19
1 US 2009 11 12 16
2 US 2010 14 16 38
5 France 2008 17 19 21
6 France 2009 19 22 13
7 France 2010 12 11 NaN
8 France 2010 NaN NaN NaN
Finally you can apply both solutions at the same time doing:
df[df[['Orange','Apple','Plump']].notnull().any(axis=1) & df.Country.isin(countries)]
Getting:
Country Year Orange Apple Plump
0 US 2008 17 29 19
1 US 2009 11 12 16
2 US 2010 14 16 38
5 France 2008 17 19 21
6 France 2009 19 22 13
7 France 2010 12 11 NaN
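Both steps can also be sketched end to end without changing the 0s in the data, using a boolean mask for the first step and groupby(...).filter() for the second. This is only one possible arrangement; the names value_cols, has_data, required and result are assumptions for illustration, and the year check is done after the empty rows are removed.

```python
import pandas as pd

cols = ['Country', 'Year', 'Orange', 'Apple', 'Plump']
data = [['US', 2008, 17, 29, 19], ['US', 2009, 11, 12, 16], ['US', 2010, 14, 16, 38],
        ['Spain', 2008, 11, None, 33], ['Spain', 2009, 12, 19, 17],
        ['France', 2008, 17, 19, 21], ['France', 2009, 19, 22, 13],
        ['France', 2010, 12, 11, 0], ['France', 2010, 0, 0, 0],
        ['Italy', 2009, None, None, None], ['Italy', 2010, 15, 16, 17],
        ['Italy', 2010, 0, None, None], ['Italy', 2011, 42, None, None]]
df = pd.DataFrame(data, columns=cols)

value_cols = ['Orange', 'Apple', 'Plump']

# Step 1: a row has data if at least one fruit column is neither NaN nor 0.
has_data = df[value_cols].fillna(0).ne(0).any(axis=1)
kept = df[has_data]

# Step 2: keep only countries whose remaining rows cover all required years.
required = {2008, 2009, 2010}
result = kept.groupby('Country').filter(lambda g: required.issubset(g['Year']))
print(result)
```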
