How to fill in missing values in a pandas dataframe - Python

I have a dataframe that contains missing values.
index month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
I want to fill in the missing months of the above dataframe using the list below.
list = [201501, 201502, 201503 ... 201612]
The result I want to get...
index month value
0 201501 100
1 201502 100
2 201503 100
3 201504 100
4 201505 100
5 201506 100
6 201507 172
7 201508 172
...
...
22 201611 98
23 201612 98

Setup
my_list = list(range(201501, 201509))
df = df.drop('index', axis=1)  # drop the 'index' column left over from pd.read_clipboard
print(df)
month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
pd.DataFrame.reindex
import numpy as np

df = (df.set_index('month')
        .reindex(index=np.sort(np.unique(df['month'].tolist() + my_list)))
        .ffill()
        .reset_index())
print(df)
month value
0 201501 100.0
1 201502 100.0
2 201503 100.0
3 201504 100.0
4 201505 100.0
5 201506 100.0
6 201507 172.0
7 201508 172.0
8 201602 181.0
9 201605 98.0
10 201612 98.0
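Note that my_list stops at 201508, while the question asks for months through 201612. A plain range() cannot produce the full span, because it would emit invalid month codes such as 201513 once the year rolls over. A minimal sketch (my addition, not part of the answer) that builds the complete list with pd.period_range:
import pandas as pd

# Build every yyyymm integer from 201501 through 201612;
# range(201501, 201613) would wrongly include 201513..201600.
my_list = [int(p.strftime("%Y%m")) for p in pd.period_range("2015-01", "2016-12", freq="M")]
# [201501, 201502, ..., 201512, 201601, ..., 201612]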

Using pandas.DataFrame.merge:
l = list(range(201501, 201509))
new_df = df.merge(pd.Series(l, name='month'), how='outer').sort_values('month').ffill()
new_df['index'] = range(new_df.shape[0])
Output:
index month value
0 0 201501 100.0
4 1 201502 100.0
5 2 201503 100.0
6 3 201504 100.0
7 4 201505 100.0
8 5 201506 100.0
1 6 201507 172.0
9 7 201508 172.0
2 8 201602 181.0
3 9 201605 98.0
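The leftmost column in that output is the leftover original index after sort_values. A small variation (my addition) that yields a clean 0..n index without the manual 'index' column:
new_df = (df.merge(pd.Series(l, name='month'), how='outer')
            .sort_values('month')
            .ffill()
            .reset_index(drop=True))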

Related

Loop for creating new columns and filling them from neighboring rows

I have the following df. I want to dynamically create new columns based on a number of days (day_number=2) and conditionally fill them based on "code" and "count".
Current format:
code count
id date
ABC1 2019-04-04 1 76
2019-04-05 2 82
Desired matrix-like format:
code count code1_day1 code2_day1 code1_day2 code2_day2
id date
ABC1 2019-04-04 1 76 76 0 0 82
2019-04-05 2 82
I have tried this, but it fills every column with the same values:
code = [1, 2]
for date, new in df.groupby(level=[0]):
    for col in range(day_number):  # day_number = 2
        for lvl in code:
            new[f"day{col+1}_code1"] = new['count'].where(new['code'] == 1)
            new[f"day{col+1}_code2"] = new['count'].where(new['code'] == 2)
So many thanks for your help!
A bigger example of the database:
code count new-col1 new_col2 ......
id date
ABC1
2019-04-04 1 76 76 0 79 0 82 0 83 0 88 0 55 3 65 6
2019-04-05 1 79 79 0 82 0 83 0 88 0 55 3 65 6 101 10
2019-04-06 1 82 82 0 83 0 88 0 55 3 65 6 101 10 120 14
2019-04-07 2 83 83 0 88 0 55 3 65 6 101 10 120 14 0 0
2019-04-08 1 88 88 0 55 3 65 6 101 10 120 14 0 0 0 0
2019-04-09 1 55 55 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-09 2 3 65 6 101 10 120 14 0 0 0 0 10 0
2019-04-10 1 65 101 10 120 14 0 0 0 0 10 0
2019-04-10 2 6 120 14 0 0 0 0 10 0
2019-04-11 1 101 0 0 0 0 10 0
Your sample data is not very usable, so I've simulated some.
Considered differently, the data is grouped, hence groupby() on ID (in the index) and code.
apply() after a groupby() is passed a sub-DataFrame; the required columns are built on that DataFrame.
import numpy as np
import pandas as pd

d = pd.date_range("01-jan-2021", "03-jan-2021")
df = pd.concat([
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 1, "count": np.random.randint(20, 50, len(d))}),
    pd.DataFrame({"ID": "ABC1", "date": d, "code": 2, "count": np.random.randint(20, 50, len(d))}),
]).sort_values(["ID", "date", "code"], ascending=[True, False, True]).set_index(["ID", "date"])

# pad an array with NaN up to the length of the second iterable
def nppad(a, s):
    return np.pad(a.astype(float), (0, len(s) - len(a)), "constant", constant_values=np.nan)

# per (ID, code) group, column codeX_dayN holds the count N-1 rows ahead
df2 = df.groupby(["ID", "code"]).apply(
    lambda dfa: dfa.assign(**{f"code{dfa.iloc[0, 0]}_day{i+1}": nppad(dfa["count"].values[i:], dfa)
                              for i in range(len(dfa))}))
Output:
code count code1_day1 code1_day2 code1_day3 code2_day1 code2_day2 code2_day3
ID date
ABC1 2021-01-03 1 40 40.0 38.0 46.0 NaN NaN NaN
2021-01-03 2 37 NaN NaN NaN 37.0 33.0 33.0
2021-01-02 1 38 38.0 46.0 NaN NaN NaN NaN
2021-01-02 2 33 NaN NaN NaN 33.0 33.0 NaN
2021-01-01 1 46 46.0 NaN NaN NaN NaN NaN
2021-01-01 2 33 NaN NaN NaN 33.0 NaN NaN
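The assign/nppad combination above is dense; the core idea is just "column dayN holds the count N-1 rows ahead, padded with NaN at the end". A minimal sketch of that idea on one hypothetical group (my addition), using shift instead of the padding helper:
import pandas as pd

s = pd.Series([76.0, 79.0, 82.0], name="count")  # made-up counts for one (ID, code) group
out = pd.DataFrame({f"day{i+1}": s.shift(-i) for i in range(len(s))})
print(out)
   day1  day2  day3
0  76.0  79.0  82.0
1  79.0  82.0   NaN
2  82.0   NaN   NaN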

Plotting in pivot table using label

My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To draw a plot over Month, I transpose the dataframe:
df = df.T
df
Name A B C D
Month
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
Ultimately, what I want to do is draw a plot with Month on the x-axis and value on the y-axis.
But I have two questions.
Q1.
Transposing changes the dtype of 'Label' (int -> float).
Can just the 'Label' row be kept as int?
The output I want:
df = df.T
df
Name A B C D
Month
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0 1 2 1
Q2.
Q1 is really in service of Q2.
When drawing the plot, I want to group by label (like seaborn's hue).
When plotting from the pivot table above, is there a way to make such grouping possible?
(matplotlib or seaborn, either is fine.)
The label doesn't have to be int, so if grouping works regardless, you don't need to answer Q1.
Thank you for reading.
Q2: You need to reshape the values, e.g. with DataFrame.melt, so that hue can be used:
import seaborn as sns

df1 = df.reset_index().melt(['Name', 'Label'])
print(df1)
sns.stripplot(data=df1, hue='Label', x='Name', y='value')
Q1: pandas does not support this; a column holds a single dtype, so converting the last row does not change the values back from float:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print(df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print(df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
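For reference, a self-contained sketch of the whole melt-then-plot flow, with the frame rebuilt from the tables above (values assumed from the question, so treat it as an illustration):
import pandas as pd
import seaborn as sns

df = pd.DataFrame({1: [120, 80, 150, 100],
                   2: [80.5, 110, 90.5, 105],
                   3: [120, 98.5, 105, 98.5],
                   4: [105.5, 105, 120, 110],
                   5: [140, 100, 190, 120],
                   'Label': [0, 1, 2, 1]},
                  index=pd.Index(['A', 'B', 'C', 'D'], name='Name'))

df1 = df.reset_index().melt(['Name', 'Label'], var_name='Month')
sns.lineplot(data=df1, hue='Label', x='Month', y='value')
# Label is melted as its own int column, so Q1's dtype problem never arises.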

Expanding a data set from months to days in Python for non-DateTime type data

How do I expand this monthly table (Table A) into a daily table (Table B) that spreads revenue across a 30-day period?
Table A
index Month Revenue ($)
0 1 300
1 2 330
2 3 390
(Assuming each month has 30 days)
Table B
index Month Day Revenue ($)
0 1 1 10
1 1 2 10
2 1 3 10
... ... ... ...
30 2 1 11
31 2 2 11
... ... ... ...
60 3 1 13
... ... ... ...
89 3 30 13
Try:
df = (pd.concat([df] * 30)
        .assign(Revenue=lambda x: x['Revenue'] / 30)
        .sort_values('Month')
        .reset_index(drop=True))
Create the days column:
number_of_months = df['Month'].nunique()  # 3 here
df['day'] = list(range(1, 31)) * number_of_months
print(df)
Month Revenue day
0 1 10.0 1
1 1 10.0 2
2 1 10.0 3
3 1 10.0 4
4 1 10.0 5
.. ... ... ...
85 3 13.0 26
86 3 13.0 27
87 3 13.0 28
88 3 13.0 29
89 3 13.0 30
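An alternative sketch (my addition, not from the answer) that avoids the sort entirely by repeating each row in place with Index.repeat and numbering days with groupby.cumcount:
import pandas as pd

df = pd.DataFrame({'Month': [1, 2, 3], 'Revenue': [300, 330, 390]})  # Table A
out = df.loc[df.index.repeat(30)].reset_index(drop=True)  # 30 copies of each month, already in order
out['Revenue'] = out['Revenue'] / 30
out['Day'] = out.groupby('Month').cumcount() + 1  # 1..30 within each month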

Pandas: Cumulative sum with a moving window (following and preceding rows)

I have the following dataset :
date sales
201201 5
201202 5
201203 5
201204 5
201205 5
201206 5
201207 5
201208 5
201209 5
201210 5
201211 5
201212 5
201301 100
201302 100
And I want to compute the cumulative sum of sales from the beginning of the series up to the current date plus the 11 following months (a 12-month forward window).
So here:
date sales expected
201201 5 60
201202 5 160
201203 5 260
201204 5 260
201205 5 260
201206 5 260
201207 5 260
201208 5 260
201209 5 260
201210 5 260
201211 5 260
201212 5 260
201301 100 260
201302 100 260
Following the question How to compute cumulative sum of previous N rows in pandas?, I tried:
df['sales'].rolling(window=12).sum()
However I am looking for something more like this :
df['sales'].rolling(window=['unlimited preceding, 11 following']).sum()
Use cumsum directly, shift the result by -11, then use ffill to fill the trailing NaNs with the previous value:
df['expected'] = df['sales'].cumsum().shift(-11).ffill()
And now:
print(df)
date sales expected
0 201201 5 60.0
1 201202 5 160.0
2 201203 5 260.0
3 201204 5 260.0
4 201205 5 260.0
5 201206 5 260.0
6 201207 5 260.0
7 201208 5 260.0
8 201209 5 260.0
9 201210 5 260.0
10 201211 5 260.0
11 201212 5 260.0
12 201301 100 260.0
13 201302 100 260.0
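Why shift(-11) works: "everything from the start through the current row plus the 11 following rows" is exactly the cumulative sum evaluated 11 rows ahead; the last 11 rows have no value that far ahead, so they become NaN and ffill carries the final complete total forward. A self-contained check with the data reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'date': list(range(201201, 201213)) + [201301, 201302],
                   'sales': [5] * 12 + [100, 100]})
df['expected'] = df['sales'].cumsum().shift(-11).ffill()
# expected: 60.0, 160.0, then 260.0 for every remaining row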

Equivalent of transform in R/ddply in Python/pandas?

In R's ddply function, you can compute any new columns group-wise, and append the result to the original dataframe, such as:
ddply(mtcars, .(cyl), transform, n=length(cyl)) # n is appended to the df
In Python/pandas, I have computed it first, and then merge, such as:
df1 = mtcars.groupby("cyl").apply(lambda x: pd.Series(x["cyl"].count(), index=["n"])).reset_index()
mtcars = pd.merge(mtcars, df1, on=["cyl"])
or something like that.
However, that always feels pretty clunky, so is it feasible to do it all at once?
Thanks.
You can add a column to a DataFrame by assigning the result of a groupby/transform operation to it:
mtcars['n'] = mtcars.groupby("cyl")['cyl'].transform('count')
import pandas as pd
import pandas.rpy.common as com
mtcars = com.load_data('mtcars')
mtcars['n'] = mtcars.groupby("cyl")['cyl'].transform('count')
print(mtcars.head())
yields
mpg cyl disp hp drat wt qsec vs am gear carb n
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 14
To add multiple columns, you could use groupby/apply. Make sure the function you apply returns a DataFrame with the same index as its input. For example,
mtcars[['n', 'total_wt']] = mtcars.groupby("cyl").apply(
    lambda x: pd.DataFrame({'n': len(x['cyl']), 'total_wt': x['wt'].sum()},
                           index=x.index))
print(mtcars.head())
yields
mpg cyl disp hp drat wt qsec vs am gear carb n total_wt
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7 21.820
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7 21.820
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11 25.143
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7 21.820
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 14 55.989
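Note that pandas.rpy was removed from pandas long ago (R interop now lives in the separate rpy2 package). A self-contained variant of the same idiom (my addition), with a hand-copied slice of mtcars and one transform per new column:
import pandas as pd

mtcars = pd.DataFrame({'cyl': [6, 6, 4, 6, 8],
                       'wt': [2.620, 2.875, 2.320, 3.215, 3.440]},
                      index=['Mazda RX4', 'Mazda RX4 Wag', 'Datsun 710',
                             'Hornet 4 Drive', 'Hornet Sportabout'])

mtcars['n'] = mtcars.groupby('cyl')['cyl'].transform('count')      # group size
mtcars['total_wt'] = mtcars.groupby('cyl')['wt'].transform('sum')  # group-wise sum, broadcast back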
