adding new rows to an existing dataframe - python

This is my dataframe. How do I add max_value, min_value, mean_value, median_value rows so that my index values will be like:
0
1
2
3
4
max_value
min_value
mean_value
median_value
Could anyone help me in solving this?

If you want to add rows, use DataFrame.agg:
df1 = df.append(df.agg(['max','min','mean','median']))
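Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the same row append can be written with pd.concat (a sketch of the equivalent call, assuming df as above):
# pandas >= 2.0 equivalent of df.append(df.agg([...]))
df1 = pd.concat([df, df.agg(['max', 'min', 'mean', 'median'])])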
If you want to add columns, use assign with min, max, mean and median:
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
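For reference, a minimal end-to-end run of the column variant (a sketch with a small hypothetical frame):
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = df.assign(max_value=df.max(axis=1),
                min_value=df.min(axis=1),
                mean_value=df.mean(axis=1),
                median_value=df.median(axis=1))
print(df2)
#    A  B  max_value  min_value  mean_value  median_value
# 0  1  3          3          1         2.0           2.0
# 1  2  4          4          2         3.0           3.0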

One way is as follows (thanks to @jezrael for the help):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(5, 4)), columns=list('ABCD'))
df1 = df.copy()
# column-wise stats, appended as rows
df.loc['max'] = df1.max()
df.loc['min'] = df1.min()
df.loc['mean'] = df1.mean()
df.loc['median'] = df1.median()
# row-wise stats, appended as columns
df['max'] = df1.max(axis=1)
df['min'] = df1.min(axis=1)
df['mean'] = df1.mean(axis=1)
df['median'] = df1.median(axis=1)
Output:
A B C D max min mean median
0 49.0 91.0 16.0 17.0 91.0 16.0 43.25 33.0
1 20.0 42.0 86.0 60.0 86.0 20.0 52.00 51.0
2 32.0 25.0 94.0 13.0 94.0 13.0 41.00 28.5
3 40.0 1.0 66.0 31.0 66.0 1.0 34.50 35.5
4 18.0 30.0 67.0 31.0 67.0 18.0 36.50 30.5
max 49.0 91.0 94.0 60.0 NaN NaN NaN NaN
min 18.0 1.0 16.0 13.0 NaN NaN NaN NaN
mean 31.8 37.8 65.8 30.4 NaN NaN NaN NaN
median 32.0 30.0 67.0 31.0 NaN NaN NaN NaN

This worked well:
df1 = df.copy()
df.loc['max']=df1.max()
df.loc['min']=df1.min()
df.loc['mean']=df1.mean()
df.loc['median']=df1.median()

Related

Apply fillna(method='bfill') only if the values are in the same year and month with Python

Let's say I have a panel dataframe with lots of NaNs inside, as follows:
import pandas as pd
import numpy as np

np.random.seed(2021)
dates = pd.date_range('20130226', periods=720)
df = pd.DataFrame(np.random.randint(0, 100, size=(720, 3)), index=dates, columns=list('ABC'))
for col in df.columns:
    df.loc[df.sample(frac=0.4).index, col] = np.nan
df
Out:
A B C
2013-02-26 NaN NaN NaN
2013-02-27 NaN NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 NaN 24.0
2013-03-02 12.0 70.0 70.0
... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 NaN NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
I need to apply df.fillna(method='bfill') or df.fillna(method='ffill') to the dataframe, but only within the same year and month:
For example, if I apply df.fillna(method='bfill'), the expected result will look like this:
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
How could I do that in Pandas? Thanks.
You could resample by M (month) and transform with bfill:
>>> df.resample("M").transform('bfill')
A B C
2013-02-26 62.0 NaN 44.0
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 74.0
2015-02-12 67.0 10.0 74.0
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
For specific columns:
>>> df[['A', 'B']] = df.resample("M")[['A', 'B']].transform('bfill')
>>> df
A B C
2013-02-26 62.0 NaN NaN
2013-02-27 62.0 NaN 44.0
2013-02-28 62.0 NaN 29.0
2013-03-01 21.0 70.0 24.0
2013-03-02 12.0 70.0 70.0
... ... ... ...
2015-02-11 38.0 42.0 NaN
2015-02-12 67.0 10.0 NaN
2015-02-13 27.0 10.0 74.0
2015-02-14 18.0 NaN NaN
2015-02-15 NaN NaN NaN
[720 rows x 3 columns]
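An equivalent formulation (a sketch, assuming df has the DatetimeIndex shown above) is to group by year and month and back-fill within each group, which makes the month boundary explicit. Note also that on pandas 2.2+ the resample alias "M" is deprecated in favour of "ME":
# back-fill only within each (year, month) group; values never cross
# a month boundary because each group is filled independently
filled = df.groupby([df.index.year, df.index.month]).bfill()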

how to ffill a multi-index dataframe based on a first level mask

I have a multi-index dataframe.
import numpy as np
import pandas as pd
from itertools import product

arrays = [['bar', 'baz', 'foo'], range(4)]
tuples = list(product(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
multi_ind = pd.DataFrame(np.random.randn(6, len(tuples)), index=range(6), columns=index)
Some values are NaNs:
multi_ind.loc[3,('bar',2)]=np.nan
multi_ind.loc[3,('bar',3)]=np.nan
multi_ind.loc[4,('bar',1)]=np.nan
For 'bar' I would like to fill all NaNs except the last, as described in:
Forward fill all except last value in python pandas dataframe
mask = multi_ind['bar']
last_valid_column_per_row = mask.apply(pd.Series.last_valid_index, axis=1)
mask = mask.apply(lambda series: series[:int(last_valid_column_per_row.loc[series.name])].ffill(), axis=1)
Then I would like to ffill() the other first levels too (e.g. baz, foo) using the same logic as for bar (up to the last valid index from df['bar']), and I would also like to set to NaN any value that was still NaN in bar.
How to achieve that in an efficient way?
Now I am doing the following, but it is very slow...
df_as_dict = {}
df = df.ffill(axis=1)  # start by ffilling
for first_level, gr in df.groupby(level=0, axis=1):
    gr[first_level][mask.isnull()] = np.nan  # then remove the NaNs (they should only be at the end)
    df_as_dict[first_level] = gr[first_level]
The code based on last_valid_index (in the indicated post) actually fills NaNs along the given axis:
- without the initial NaN cells (ffill has no previous value to take as a source),
- without the trailing NaN cells (whatever their number), precisely because last_valid_index terminates the fill just before the trailing run of NaNs,
but if you are happy with this arrangement, let it be.
I created the test DataFrame in the following, more concise way:
arrays = [['bar', 'baz', 'foo'], range(4)]
cols = pd.MultiIndex.from_product(arrays, names=['first', 'second'])
np.random.seed(2)
arr = np.arange(1, 6 * 12 + 1, dtype=float).reshape(6, -1)
# where to put NaNs (row indices / column indices)
ind = (np.array([0, 0, 1, 2, 2, 2, 3, 4, 4, 5, 5]),
       np.array([1, 2, 6, 1, 3, 5, 10, 2, 3, 10, 11]))
arr[ind] = np.nan
multi_ind = pd.DataFrame(arr, columns=cols)
so that it contains:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 NaN NaN 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 NaN 20.0 21.0 22.0 23.0 24.0
2 25.0 NaN 27.0 NaN 29.0 NaN 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 NaN 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
To get your result, run:
result = multi_ind.stack(level=0).apply(
    lambda row: row[: row.last_valid_index() + 1].ffill(), axis=1)\
    .unstack(level=1).swaplevel(axis=1).reindex(columns=multi_ind.columns)
Note that your last_valid_column_per_row is not needed.
It is enough to pass axis=1 to operate on rows, instead of
columns (like in the indicated post).
The result is:
first bar baz foo
second 0 1 2 3 0 1 2 3 0 1 2 3
0 1.0 1.0 1.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0
1 13.0 14.0 15.0 16.0 17.0 18.0 18.0 20.0 21.0 22.0 23.0 24.0
2 25.0 25.0 27.0 NaN 29.0 29.0 31.0 32.0 33.0 34.0 35.0 36.0
3 37.0 38.0 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 46.0 48.0
4 49.0 50.0 NaN NaN 53.0 54.0 55.0 56.0 57.0 58.0 59.0 60.0
5 61.0 62.0 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 NaN NaN
Details:
- stack(level=0) - put the bar, baz and foo "fragments" in consecutive rows.
- apply(….ffill(), axis=1) - fill each row, without the trailing sequence of NaNs (if any). Note that I added + 1 in order to include the last non-NaN value in the result; otherwise the last column would have been dropped.
- unstack(level=1) - restore the previous ("wide") arrangement, but unfortunately the order of column MultiIndex levels is reversed.
- swaplevel(axis=1) - restore the original order of column levels, but unfortunately the order of column names is wrong.
- reindex(…) - restore the original column order.
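If the per-row apply turns out to be slow, a vectorized variant of the same idea (a sketch; it should reproduce the stack/apply result for each first-level group) forward-fills along rows and then re-masks everything after each row's last originally valid value:
import numpy as np
import pandas as pd

def ffill_keep_trailing_nan(part):
    # forward fill along rows, then keep values only up to (and
    # including) each row's last originally valid cell
    filled = part.ffill(axis=1)
    upto_last_valid = part.iloc[:, ::-1].notna().cummax(axis=1).iloc[:, ::-1]
    return filled.where(upto_last_valid)

levels = multi_ind.columns.get_level_values('first').unique()
result2 = pd.concat({lvl: ffill_keep_trailing_nan(multi_ind[lvl]) for lvl in levels},
                    axis=1, names=['first', 'second'])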

Updating row values in a dataframe by finding similar rows based on a defined number of similar column values

I am trying to update rows in my dataframe to account for missing data by using a similarity threshold: comparing how many values are the same across rows. Below is what I am trying, but it is not updating the rows despite identifying the correct rows to fill. The current threshold is over half of the values being the same (so in this example, any row with 3 or more matching values), and I am looking for it to only use values that already exist within the dataframe.
threshold = .5
for index1, row1 in df.iterrows():
    if row1.isnull().values.any():
        for index2, row2 in df.iterrows():
            count = 0
            for col in df.columns:
                print(col)
                if row1[col] == row2[col] and index1 != index2:
                    count = count + 1
                else:
                    count = count
            if count > threshold * len(df.columns) and count < len(df.columns):
                row1.at[index1] = index2
                break
My input dataframe looks like this; as an example of what I am looking for, the second row should have its NaN replaced with the value from the same column in the first row:
CODE B2004 B2014 C2100 X3200 X1300
ID
20326 40.0 40.0 29.0 39.0 49.0
20338 40.0 NaN 29.0 39.0 49.0
20361 40.0 40.0 NaN 59.0 89.0
20381 40.0 40.0 NaN 59.0 NaN
20384 40.0 40.0 49.0 59.0 89.0
12385 40.0 40.0 29.0 29.0 55.0
12485 40.0 NaN NaN NaN 49.0
12492 35.0 35.0 NaN NaN 49.0
12685 35.0 35.0 29.0 39.0 49.0
12687 40.0 NaN 29.0 29.0 55.0
The expected dataframe would be this:
CODE B2004 B2014 C2100 X3200 X1300
ID
20326 40.0 40.0 29.0 39.0 49.0
20338 40.0 40.0 29.0 39.0 49.0
20361 40.0 40.0 49.0 59.0 89.0
20381 40.0 40.0 49.0 59.0 89.0
20384 40.0 40.0 49.0 59.0 89.0
12385 40.0 40.0 29.0 29.0 55.0
12485 40.0 NaN NaN NaN 49.0
12492 35.0 35.0 29.0 29.0 49.0
12685 35.0 35.0 29.0 39.0 49.0
12687 40.0 40.0 29.0 29.0 55.0
Any thoughts or ideas are appreciated!
I figured out what was wrong. Since row1 is only a copy of the data, assigning to it wasn't actually updating the DataFrame. By changing the second-to-last line to
df.loc[index1] = row2
I was able to solve the issue.
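As a side note, if the intent is to fill only the missing cells rather than overwrite the whole row (an assumption about the goal), a gentler variant is:
# keep row1's existing values; take row2's values only where row1 is NaN
df.loc[index1] = row1.fillna(row2)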

Selective Pandas dropna so that the DataFrame will be different lengths

I am attempting to drop the NaN values in my DataFrame df; however, I am having difficulty dropping them for each column without affecting the entire row. An example of my df can be seen below.
Advertising No Advertising
nan 7.0
71.0 nan
65.0 nan
14.0 nan
76.0 nan
nan 36.0
nan 9.0
73.0 nan
85.0 nan
17.0 nan
nan 103.0
My desired output is shown below.
Advertising No Advertising
71.0 7.0
65.0 36.0
14.0 9.0
76.0 103.0
73.0
85.0
17.0
The examples given are just a snippet of the total DataFrame.
Any help would be greatly appreciated.
Use justify (a helper function, reproduced after the output below) with DataFrame.dropna:
df = pd.DataFrame(justify(df.values, invalid_val=np.nan, axis=0, side='up'),
                  index=df.index,
                  columns=df.columns).dropna(how='all')
print(df)
Advertising No Advertising
0 71.0 7.0
1 65.0 36.0
2 14.0 9.0
3 76.0 103.0
4 73.0 NaN
5 85.0 NaN
6 17.0 NaN
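For completeness: justify is not a pandas or NumPy built-in but a community helper. A commonly cited implementation, reproduced here as a reference sketch:
import numpy as np

def justify(a, invalid_val=0, axis=1, side='left'):
    # push all valid values in `a` towards one side, along `axis`
    if invalid_val is np.nan:
        mask = ~np.isnan(a)
    else:
        mask = a != invalid_val
    justified_mask = np.sort(mask, axis=axis)
    if (side == 'up') | (side == 'left'):
        justified_mask = np.flip(justified_mask, axis=axis)
    out = np.full(a.shape, invalid_val)
    if axis == 1:
        out[justified_mask] = a[mask]
    else:
        out.T[justified_mask.T] = a.T[mask.T]
    return out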
Another, slower solution is to use DataFrame.apply with Series.dropna:
df = df.apply(lambda x: pd.Series(x.dropna().values))
print(df)
Advertising No Advertising
0 71.0 7.0
1 65.0 36.0
2 14.0 9.0
3 76.0 103.0
4 73.0 NaN
5 85.0 NaN
6 17.0 NaN
Mixing numeric values with strings (empty strings) is not a good idea, because pandas numeric operations will fail if you need to process the numbers later, so it is best avoided.
But it is possible with:
df = df.fillna('')

Transposing dataframe column, creating different rows per day

I have a dataframe that has one column and a timestamp index including anywhere from 2 to 7 days:
kWh
Timestamp
2017-07-08 06:00:00 0.00
2017-07-08 07:00:00 752.75
2017-07-08 08:00:00 1390.20
2017-07-08 09:00:00 2027.65
2017-07-08 10:00:00 2447.27
.... ....
2017-07-12 20:00:00 167.64
2017-07-12 21:00:00 0.00
2017-07-12 22:00:00 0.00
2017-07-12 23:00:00 0.00
I would like to transpose the kWh column so that one day's worth of values (hourly granularity, so 24 values/day) fill up a row. And the next row is the next day of values and so on (so five days of forecasted data has five rows with 24 elements each).
Because my query of the data comes in the vertical format, and my regression and subsequent analysis already occur in the vertical format, I don't want to change the process too much and am hoping there is a simpler way. I have tried creating a MultiIndex with df.index.hour and then using unstack(), but I get a huge dataframe with NaN values everywhere.
Is there an elegant way to do this?
If we start from a frame like
In [25]: df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
    ...:     "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
In [26]: df.head()
Out[26]:
kWh
Timestamp
2017-07-08 00:00:00 1
2017-07-08 01:00:00 2
2017-07-08 02:00:00 3
2017-07-08 03:00:00 4
2017-07-08 04:00:00 5
we can make date and hour columns and then pivot:
In [27]: df["date"] = df.index.date
In [28]: df["hour"] = df.index.hour
In [29]: df.pivot(index="date", columns="hour", values="kWh")
Out[29]:
hour 0 1 2 3 4 5 6 7 8 9 ... \
date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
hour 14 15 16 17 18 19 20 21 22 23
date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
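As a side note, pivot raises a ValueError if any (date, hour) pair occurs more than once; if duplicates are possible in your data, pivot_table with an explicit aggregation is the safer sketch (mean here is an assumption):
In [30]: df.pivot_table(index="date", columns="hour", values="kWh", aggfunc="mean")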
Not sure why your MultiIndex code doesn't work.
I'm assuming your MultiIndex code is something along these lines, which gives the same output as the pivot:
In []
df = pd.DataFrame({"kWh": [1]}, index=pd.date_range("2017-07-08",
                  "2017-07-12", freq="1H").rename("Timestamp")).cumsum()
df.index = pd.MultiIndex.from_arrays([df.index.date, df.index.hour], names=['Date', 'Hour'])
df.unstack()
df.unstack()
Out[]:
kWh ... \
Hour 0 1 2 3 4 5 6 7 8 9 ...
Date ...
2017-07-08 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 ...
2017-07-09 25.0 26.0 27.0 28.0 29.0 30.0 31.0 32.0 33.0 34.0 ...
2017-07-10 49.0 50.0 51.0 52.0 53.0 54.0 55.0 56.0 57.0 58.0 ...
2017-07-11 73.0 74.0 75.0 76.0 77.0 78.0 79.0 80.0 81.0 82.0 ...
2017-07-12 97.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
Hour 14 15 16 17 18 19 20 21 22 23
Date
2017-07-08 15.0 16.0 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0
2017-07-09 39.0 40.0 41.0 42.0 43.0 44.0 45.0 46.0 47.0 48.0
2017-07-10 63.0 64.0 65.0 66.0 67.0 68.0 69.0 70.0 71.0 72.0
2017-07-11 87.0 88.0 89.0 90.0 91.0 92.0 93.0 94.0 95.0 96.0
2017-07-12 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
[5 rows x 24 columns]
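The only difference worth noting is that unstack keeps kWh as the outer column level; selecting it should give columns identical to the pivot:
df.unstack()['kWh']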
