Fill multiple nulls for categorical data - python

I'm wondering if there is a pythonic way to fill nulls for categorical data by randomly choosing from the distribution of unique values. Basically proportionally / randomly filling categorical nulls based on the existing distribution of the values in the column...
Below is an example of what I'm already doing. (I'm using numbers as categories to save time; I'm not sure how to randomly insert letters.)
import numpy as np
import pandas as pd
np.random.seed([1])
df = pd.DataFrame(np.random.normal(10, 2, 20).round().astype(object))
df.rename(columns = {0 : 'category'}, inplace = True)
df.loc[::5] = np.nan
print(df)
category
0 NaN
1 12
2 4
3 9
4 12
5 NaN
6 10
7 12
8 13
9 9
10 NaN
11 9
12 10
13 11
14 9
15 NaN
16 10
17 4
18 9
19 9
This is how I'm currently imputing the values
df.category.value_counts()
9 6
12 3
10 3
4 2
13 1
11 1
df.category.value_counts() / 16  # 16 = number of non-null values
9 0.3750
12 0.1875
10 0.1875
4 0.1250
13 0.0625
11 0.0625
# to fill categorical info based on percentage
category_fill = np.random.choice((9, 12, 10, 4, 13, 11), size = 4, p = (.375, .1875, .1875, .1250, .0625, .0625))
df.loc[df.category.isnull(), "category"] = category_fill
The final output works; it just takes a while to write out by hand.
df.category.value_counts()
9 9
12 4
10 3
4 2
13 1
11 1
Is there a faster way to do this or a function that would serve this purpose?
Thanks for any and all help!

You could use stats.rv_discrete:
from scipy import stats
counts = df.category.value_counts()
dist = stats.rv_discrete(values=(counts.index, counts/counts.sum()))
fill_values = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = fill_values
EDIT: For general data (not restricted to integers) you can do:
dist = stats.rv_discrete(values=(np.arange(counts.shape[0]), counts/counts.sum()))
fill_idxs = dist.rvs(size=df.shape[0] - df.category.count())
df.loc[df.category.isnull(), "category"] = counts.iloc[fill_idxs].index.values
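Alternatively, if you'd rather avoid scipy, here is a minimal sketch in plain NumPy/pandas that automates exactly what the question does by hand (assuming the df from the question; the variable names are my own). Since np.random.choice samples the actual values, this also works for non-numeric categories such as letters:
import numpy as np
import pandas as pd

# distribution of the existing (non-null) values
counts = df.category.value_counts()
probs = counts / counts.sum()

# draw one replacement per null, proportional to the observed frequencies
n_missing = df.category.isnull().sum()
fill_values = np.random.choice(counts.index, size=n_missing, p=probs.values)
df.loc[df.category.isnull(), "category"] = fill_values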

Related

How to ignore NaN when applying rolling with Pandas

How can I ignore NaN when performing rolling on a df?
For example, given a df, perform rolling on column a, but ignore the NaN. This should produce something like:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 12050.00
6 NaN NaN
7 17514.0 19350.00
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 21018.67
11 NaN NaN
12 26164.0 27796.67
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
However, I don't know which part of this line should be tweaked to get the expected output above:
df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
Currently, the following code
import numpy as np
import pandas as pd
arr=[[6772],[7182],[8570],[11078],[11646],[13426],[np.nan],[17514],[18408],
[22128],[22520],[np.nan],[26164],[26590],[30636],[3119],[32166],[34774]]
df=pd.DataFrame(arr,columns=['a'])
w = 2
df['avg'] = df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
produces the following:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 13416.00 <<<
6 NaN 15248.50 <<<
7 17514.0 17869.00 <<<
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 22305.00 <<<
11 NaN 24350.50 <<<
12 26164.0 26477.50 <<<
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
<<< indicate where the values are different than the expected output
Update:
Adding fillna
df['avg'] = df['a'].fillna(value=0).rolling(2 * w + 1, center=True, min_periods=1).mean()
does not produce the expected output either, because the filled zeros are pulled into the window averages:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 8944.00
5 13426.0 10732.80
6 NaN 12198.80
7 17514.0 14295.20
8 18408.0 16114.00
9 22128.0 16114.00
10 22520.0 17844.00
11 NaN 19480.40
12 26164.0 21182.00
13 26590.0 17301.80
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
For example, the expected value at row 5 is 12050 = (11078 + 11646 + 13426) / 3.
IIUC, you want to restart the rolling when a NaN is met. One way would be to use pandas.DataFrame.groupby:
m = df.isna().any(axis=1)
df["avg"] = (df["a"].groupby(m.cumsum())
             .rolling(2 * w + 1, center=True, min_periods=1).mean()
             .reset_index(level=0, drop=True))
df["avg"] = df["avg"][~m]
Output:
a avg
0 6772.0 7508.000000
1 7182.0 8400.500000
2 8570.0 9049.600000
3 11078.0 10380.400000
4 11646.0 11180.000000
5 13426.0 12050.000000
6 NaN NaN
7 17514.0 19350.000000
8 18408.0 20142.500000
9 22128.0 20142.500000
10 22520.0 21018.666667
11 NaN NaN
12 26164.0 27796.666667
13 26590.0 21627.250000
14 30636.0 23735.000000
15 3119.0 25457.000000
16 32166.0 25173.750000
17 34774.0 23353.000000
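The trick here is that m.cumsum() increases by one at every NaN row, so each NaN starts a new group and the rolling window never crosses a NaN boundary; the NaN rows themselves are blanked out afterwards by df["avg"][~m]. A quick sketch to see the group labels (reusing df and m from above):
print(pd.concat([df["a"], m.cumsum().rename("group")], axis=1))
# rows 0-5 fall in group 0, rows 6-10 in group 1, rows 11-17 in group 2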

Creating a data frame named after values from another data frame

I have a data frame containing three columns, where Col_1 and Col_2 contain some arbitrary data:
data = {"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)}
df = pd.DataFrame(data)
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
and another data frame containing height values that should be used to segment the Height column of df:
data_segments = {"Section Height" : [1, 10, 20]}
df_segments = pd.DataFrame(data_segments)
Section Height
0 1
1 10
2 20
I want to create two new data frames: df_segment_0 containing all columns of the initial df, but only the rows whose Height lies between the first two values in df_segments (here 1 <= Height < 10); the same approach applies to df_segment_1 with the next pair of values. They should look like:
df_segment_0
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
df_segment_1
Height Col_1 Col_2
9 10 20 30
10 11 22 33
11 12 24 36
12 13 26 39
13 14 28 42
14 15 30 45
15 16 32 48
16 17 34 51
17 18 36 54
18 19 38 57
I tried the following code using the .loc method and added the suggestion of C Hecht to create a list of data frames:
df_segment_list = []
try:
    for index in df_segments.index:
        df_segment = df[["Height", "Col_1", "Col_2"]].loc[
            (df["Height"] >= df_segments["Section Height"][index])
            & (df["Height"] < df_segments["Section Height"][index + 1])
        ]
        df_segment_list.append(df_segment)
except KeyError:
    pass
Try-except is used only to ignore the error for the last entry, since there is no height for index=2. The data frames in this list can be accessed as C Hecht suggested:
df_segment_0 = df_segment_list[0]
Height Col_1 Col_2
0 1 2 3
1 2 4 6
2 3 6 9
3 4 8 12
4 5 10 15
5 6 12 18
6 7 14 21
7 8 16 24
8 9 18 27
However, I would like to automate the naming of the final data frames. I tried:
for i in range(0, len(df_segment_list)):
    name = "df_segment_" + str(i)
    name = df_segment_list[i]
I expected this code to simply automate df_segment_0 = df_segment_list[0]; instead I receive the error name 'df_segment_0' is not defined.
The reason I need separate data frames is that I will perform many subsequent operations using Col_1 and Col_2, so I need row-wise access to each one of them, for example:
df_segment_0 = df_segment_0.assign(col_3=df_segment_0["Col_1"] / df_segment_0["Col_2"])
How do I achieve this?
EDIT 1: Clarified question with the suggestion from C Hecht.
If you want to get all entries that are smaller than the current segment height in your segmentation data frame, here you go :)
import pandas as pd
df1 = pd.DataFrame({"Height": range(1, 20, 1), "Col_1": range(2, 40, 2), "Col_2": range(3, 60, 3)})
df_segments = pd.DataFrame({"Section Height": [1, 10, 20]})
def segment_data_frame(data_frame: pd.DataFrame, segmentation_plan: pd.DataFrame):
    df = data_frame.copy()  # making a safety copy because we mutate the df!
    for sh in segmentation_plan["Section Height"]:  # sh is the new maximum "Height"
        df_new = df[df["Height"] < sh]  # select all entries below the maximum "Height"
        df.drop(df_new.index, inplace=True)  # remove them from the original DataFrame
        yield df_new
# ATTENTION: segment_data_frame() will calculate each segment at runtime!
# So if you don't want to iterate over it but rather have one list to contain
# them all, you must use list(segment_data_frame(...)) or [x for x in segment_data_frame(...)]
for segment in segment_data_frame(df1, df_segments):
print(segment)
print()
print(list(segment_data_frame(df1, df_segments)))
If you want to execute certain steps on those segments, you can just iterate over the generator like so:
for segment in segment_data_frame(df1, df_segments):
do_stuff_with(segment)
If you want to keep track of the individual frames and name them, you can use a dictionary, as sketched below.
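A minimal sketch of that dictionary idea (the names segments and seg are my own, reusing segment_data_frame, df1 and df_segments from above):
segments = {f"df_segment_{i}": seg
            for i, seg in enumerate(segment_data_frame(df1, df_segments))}

# per-segment operations can then be done by name, e.g.:
for name, seg in segments.items():
    segments[name] = seg.assign(col_3=seg["Col_1"] / seg["Col_2"])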
Unfortunately I don't 100% understand what you have in mind, but I hope that the following should help you in finding the answer:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Section Height': [20, 90, 111, 232, 252, 3383, 3768, 3826, 3947, 4100], 'df_names': [f'df_section_{i}' for i in range(10)]})
df['shifted'] = df['Section Height'].shift(-1)
new_dfs = []
for index, row in df.iterrows():
    if np.isnan(row['shifted']):
        # Don't know what you want to do here
        pass
    else:
        new_df = pd.DataFrame({'heights': [i for i in range(int(row['Section Height']), int(row['shifted']))]})
        new_df.name = row['df_names']
        new_dfs.append(new_df)
new_dfs then contains dataframes that look like this:
heights
0 20
1 21
2 22
3 23
4 24
.. ...
65 85
66 86
67 87
68 88
69 89
[70 rows x 1 columns]
If you clarify your questions given this input, we could help you all the way, but this should hopefully point you in the right direction.
Edit: A small comment on using df.name: This is not really stable and if you do stuff like dropping a column, pickling/unpickling, etc. the name will likely be lost. But you can surely find a good solution to maintain the name depending on your needs.

np.random.choice exclude certain numbers?

I'm supposed to create code that will simulate rolling a 20-sided die (d20) 25 times using np.random.choice.
I tried this:
np.random.choice(20,25)
but this still includes 0s, which wouldn't appear on a die.
How do I account for the 0's?
Use np.arange:
import numpy as np
np.random.seed(42) # for reproducibility
result = np.random.choice(np.arange(1, 21), 50)  # 50 draws shown here; pass 25 for the 25 rolls in the question
print(result)
Output
[ 7 20 15 11 8 7 19 11 11 4 8 3 2 12 6 2 1 12 12 17 10 16 15 15
19 12 20 3 5 19 7 9 7 18 4 14 18 9 2 20 15 7 12 8 15 3 14 17
4 18]
Your original code np.random.choice(20, 25) draws numbers from 0 to 19, both inclusive. To understand why, you could check the documentation of np.random.choice, in particular on the first argument:
a : 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an
int, the random sample is generated as if a was np.arange(n)
np.random.choice() takes as its first argument an array of possible choices (if an int a is given, it works like np.arange(a)), so you can use list(range(1, 21)) to get the output you want.
Or simply add 1 to shift the range from 0-19 to 1-20:
np.random.choice(20, 25) + 1
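For what it's worth, newer NumPy code usually draws integers through the Generator API rather than the legacy np.random functions; a sketch of the same d20 roll (assuming NumPy >= 1.17):
import numpy as np

rng = np.random.default_rng(42)        # seeded for reproducibility
rolls = rng.integers(1, 21, size=25)   # high is exclusive, so this yields 1-20
print(rolls)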

Reassign index of a dataframe

I have the following dataframe:
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
Name: Passengers, dtype: float64
As you can see, the values are listed twice with index 1-12 / 1-12; instead, I would like to change the index to 1-24. The problem is what I see when plotting it:
plt.figure(figsize=(15,5))
plt.plot(esta2,color='orange')
plt.show()
I would like to see a continuous line from 1 to 24.
esta2 = esta2.reset_index(drop=True) will get you 0-23 (drop=True keeps the result a Series instead of inserting the old index as a column). If you need 1-24 then you could just do esta2.index = np.arange(1, len(esta2) + 1).
Quite simply:
df.index = [i for i in range(1,len(df.index)+1)]
df.index.name = 'Month'
print(df)
Val
Month
1 -0.075844
2 -0.089111
3 0.042705
4 0.002147
5 -0.010528
6 0.109443
7 0.198334
8 0.209830
9 0.075139
10 -0.062405
11 -0.211774
12 -0.109167
13 -0.075844
14 -0.089111
15 0.042705
16 0.002147
17 -0.010528
18 0.109443
19 0.198334
20 0.209830
21 0.075139
22 -0.062405
23 -0.211774
24 -0.109167
Just reassign the index:
df.index = pd.Index(range(1, len(df) + 1), name='Month')
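For completeness, a short end-to-end sketch (assuming esta2 is the Series from the question) showing that after reassigning the index the plot runs as one continuous line from 1 to 24:
import numpy as np
import matplotlib.pyplot as plt

esta2.index = np.arange(1, len(esta2) + 1)  # reindex 1-24

plt.figure(figsize=(15, 5))
plt.plot(esta2, color='orange')
plt.show()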

Creation dataframe from several list of lists

I need to build a dataframe from 10 lists of lists. I did it manually, but it takes time. What is a better way to do it?
I have tried to do it manually. It works fine (#1).
I tried code (#2) for better performance, but it returns only the last column.
#1
import pandas as pd
import numpy as np
a1T=[([7,8,9]),([10,11,12]),([13,14,15])]
a2T=[([1,2,3]),([5,0,2]),([3,4,5])]
print (a1T)
#Output[[7, 8, 9], [10, 11, 12], [13, 14, 15]]
vis1=np.array (a1T)
vis_1_1=vis1.T
tmp2=np.array (a2T)
tmp_2_1=tmp2.T
X=np.column_stack([vis_1_1, tmp_2_1])
dataset_all = pd.DataFrame({"Visab1":X[:,0], "Visab2":X[:,1], "Visab3":X[:,2], "Temp1":X[:,3], "Temp2":X[:,4], "Temp3":X[:,5]})
print (dataset_all)
Output: Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Actually I have a varying number of columns in the dataframe (500-1500), that's why I need auto-generated column names. The extra index (1, 2, 3) after the names Visab, Temp and so on is constant for every case. See the code below.
For better performance I tried:
#2
n = 3  # This is a varying parameter; it affects the number of columns in the table.
m = 2  # This is constant for every case; here it is 2, because we have "Visab", "Temp"
mlist = ('Visab', 'Temp')
nlist = [range(1, n)]
for j in range(1, n):
    for i in range(1, m):
        col = i + (j - 1) * n
        dataset_all = pd.DataFrame({mlist[j] + str(i): X[:, col]})
I expect output like
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
but there is no result (only the error expected an indented block).
Ok, so the number of columns n is the number of sublists in each list, right? You can measure that with len:
len(a1T)
#Output
3
I'll simplify the answer above so you don't need X and add automatic column-names creation:
my_lists = [a1T,a2T]
my_names = ["Visab","Temp"]
dfs=[]
for one_list,name in zip(my_lists,my_names):
n_columns = len(one_list)
col_names=[name+"_"+str(n) for n in range(n_columns)]
df = pd.DataFrame(one_list).T
df.columns = col_names
dfs.append(df)
dataset_all = pd.concat(dfs,axis=1)
#Output
Visab_0 Visab_1 Visab_2 Temp_0 Temp_1 Temp_2
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
Now it's much clearer. So you have:
X=np.column_stack([vis_1_1, tmp_2_1])
Let's create a list with the names of the columns:
columns_names = ["Visab1","Visab2","Visab3","Temp1","Temp2","Temp3"]
Now you can directly make a dataframe like this:
dataset_all = pd.DataFrame(X,columns=columns_names)
#Output
Visab1 Visab2 Visab3 Temp1 Temp2 Temp3
0 7 10 13 1 5 3
1 8 11 14 2 0 4
2 9 12 15 3 2 5
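To scale this to the 500-1500 columns mentioned in the question, here is a sketch that generates the names programmatically instead of typing them out (assuming the lists and base names line up pairwise; lists and base_names are my own variable names):
import numpy as np
import pandas as pd

lists = [a1T, a2T]              # extend with as many lists as you have
base_names = ["Visab", "Temp"]  # one base name per list

# stack each transposed list side by side and build "Visab1", "Visab2", ...
X = np.column_stack([np.array(lst).T for lst in lists])
columns = [f"{name}{i}" for name, lst in zip(base_names, lists)
           for i in range(1, len(lst) + 1)]
dataset_all = pd.DataFrame(X, columns=columns)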
