I am trying to set values in a DataFrame for a specific subset of a MultiIndex, but instead of the values being set I just get NaN values.
Here is an example:
df_test = pd.DataFrame(np.ones((10,2)),index = pd.MultiIndex.from_product([['even','odd'],[0,1,2,3,4]],names = ['parity','mod5']))
df_test.loc[('even',),1] = pd.DataFrame(np.arange(5)+5,index = np.arange(5))
df_test
              0    1
parity mod5
even   0    1.0  NaN
       1    1.0  NaN
       2    1.0  NaN
       3    1.0  NaN
       4    1.0  NaN
odd    0    1.0  1.0
       1    1.0  1.0
       2    1.0  1.0
       3    1.0  1.0
       4    1.0  1.0
whereas I expected the following output:
              0    1
parity mod5
even   0    1.0  5.0
       1    1.0  6.0
       2    1.0  7.0
       3    1.0  8.0
       4    1.0  9.0
odd    0    1.0  1.0
       1    1.0  1.0
       2    1.0  1.0
       3    1.0  1.0
       4    1.0  1.0
What do I need to do differently to get the expected result? I have tried a few other things like df_test.loc['even']['1'] but that doesn't even affect the DataFrame at all.
In this example, your indices happen to be conveniently ordered. If you need to do something like this when index matching matters but the ordering of your DataFrame's index is not guaranteed, it can be accomplished via DataFrame.update like this:
index = np.arange(5)
np.random.shuffle(index)
df_other = pd.DataFrame(np.arange(5) + 5, index=index).squeeze()
df_test.loc[('even',), 1].update(df_other)
The .squeeze() is needed to convert the DataFrame into a Series (whose shape and indices match those of df_test.loc[('even',), 1]).
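To see what update does in isolation, here is a minimal self-contained sketch (the flat-indexed Series below stands in for the .loc slice): update aligns on index labels, not on positions, so the row order of the argument does not matter.

import numpy as np
import pandas as pd

s = pd.Series(np.ones(5), index=np.arange(5))  # stand-in for df_test.loc[('even',), 1]
# shuffle the row order while keeping each label paired with its value
other = pd.Series(np.arange(5) + 5, index=np.arange(5)).sample(frac=1)
s.update(other)  # in-place; matches on index labels
print(s)         # labels 0..4 now hold 5.0..9.0, regardless of other's ordering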
You have:
df_test.loc[('even',),1] = pd.DataFrame(np.arange(5)+5,index = np.arange(5))
This assignment causes NaNs in df_test.loc[('even',),1] for two reasons. First, you are trying to assign a pd.DataFrame() to a single column. For this to work, you need the same index as well as the same column name (which defaults to 0 below, but we need 1). It is easier to use pd.Series(), in which case we don't need to worry about the name. Second, even with a pd.Series, you need to match the index, and index = np.arange(5) does not: it is a flat index, while the target rows carry a MultiIndex.
Try as follows:
df_test.loc[('even',),1] = pd.Series(np.arange(5,10), index = pd.MultiIndex.from_product([['even'],[0,1,2,3,4]]))

# or: df_test.loc[('even',),1] = pd.DataFrame(np.arange(5,10), columns=[1],
#         index = pd.MultiIndex.from_product([['even'],[0,1,2,3,4]]))

# or, if you don't want to bother with building the correct index,
# you could of course simply assign the raw values:
# df_test.loc[('even',),1] = np.arange(5,10)
print(df_test)
              0    1
parity mod5
even   0    1.0  5.0
       1    1.0  6.0
       2    1.0  7.0
       3    1.0  8.0
       4    1.0  9.0
odd    0    1.0  1.0
       1    1.0  1.0
       2    1.0  1.0
       3    1.0  1.0
       4    1.0  1.0
Related
I have a dataframe in the following format:
A B C D
01-01-2021 1.0 NaN NaN NaN
02-01-2021 2.0 1.0 NaN NaN
03-01-2021 2.0 2.0 NaN NaN
04-01-2021 3.0 2.0 NaN 1.0
05-01-2021 2.0 3.0 NaN 3.0
06-01-2021 1.0 2.0 1.0 2.0
07-01-2021 1.0 1.0 3.0 2.0
08-01-2021 2.0 2.0 1.0 3.0
09-01-2021 3.0 2.0 1.0 2.0
I want to create future-looking windows of width N=6 for each cell and, depending on the number of valid (non-NA) values in the cells of these windows, either return the first non-NA value in the window, shifted 5 rows (N-1) downwards, or NaN.
In the example dataframe, column A is fully valid, without any NaN values. We create the first future window of width N=6 for 01-01-2021, which includes the dates from 01-01-2021 to 06-01-2021. There are no NaN values, i.e. the total number of valid values (6) is above the threshold of thresh=3. Thus, the first value in column A of the resulting dataframe will be 1.0, on 06-01-2021: we simply take the uppermost valid (non-NA) value in the window and move it down by 5 days, from 01-01-2021 to 06-01-2021. The rest of the values in this column are analogous, so column A is simply shifted downwards by 5 days.
Column B has its first value missing (NaN). Here, the first value in the resulting dataframe will still appear on 06-01-2021, but it will be the value corresponding to 02-01-2021 in the original dataframe. Importantly, the second value in the resulting dataframe (on 07-01-2021) will be identical to the value of the day before, i.e. both will be 1.0 in the resulting dataframe.
In column C, the first and second future windows do not have enough non-NA values, so the values on 06-01-2021 and 07-01-2021 in the resulting dataframe will be NaN. On 08-01-2021, the resulting value will be 1.0, since that is the first non-NA value in the window created 5 days back, on 03-01-2021.
Here is what the resulting dataframe should look like:
A B C D
06-01-2021 1.0 1.0 NaN 1.0
07-01-2021 2.0 1.0 NaN 1.0
08-01-2021 2.0 2.0 1.0 1.0
09-01-2021 3.0 2.0 1.0 1.0
10-01-2021 2.0 3.0 1.0 3.0
11-01-2021 1.0 2.0 1.0 2.0
12-01-2021 1.0 1.0 3.0 2.0
13-01-2021 2.0 2.0 1.0 3.0
14-01-2021 3.0 2.0 1.0 2.0
I am aware of pandas's rolling functionality and that it has a min_periods parameter that closely resembles the functionality I am trying to apply here. I also know that groupby has a first method that is partly what I'd need. However, I am not sure how to connect the dots. My initial idea was to shift the entire dataframe upwards by 6 and use rolling with min_periods; however, rolling has no first method (unlike groupby), and df.shift(-6) removes the first rows of my dataframe, which are important for determining the values.
It sounds like the future-looking window of width N=6 with thresh=3 is exactly the same as filling each missing value from a value at most 3 days later. DataFrame.fillna was made for this: specify method="bfill" to fill upwards (axis=0), with at most limit=3 consecutive fills.
Turning the date column into datetime objects makes it easy to shift the dates 5 days into the future with DateOffset.
dt = pd.read_csv("data.txt", header=0)
dt["Date"] = pd.to_datetime(dt["Date"], dayfirst=True) + pd.DateOffset(days=5)
print(dt.fillna(method="bfill", axis=0, limit=3))
Date A B C D
0 2021-01-06 1.0 1.0 NaN 1.0
1 2021-01-07 2.0 1.0 NaN 1.0
2 2021-01-08 2.0 2.0 1.0 1.0
3 2021-01-09 3.0 2.0 1.0 1.0
4 2021-01-10 2.0 3.0 1.0 3.0
5 2021-01-11 1.0 2.0 1.0 2.0
6 2021-01-12 1.0 1.0 3.0 2.0
7 2021-01-13 2.0 2.0 1.0 3.0
8 2021-01-14 3.0 2.0 1.0 2.0
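Note that in recent pandas versions (2.1+), fillna(method=...) is deprecated in favor of the dedicated bfill method; the equivalent call would be:

print(dt.bfill(axis=0, limit=3))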
Data
Date,A,B,C,D
01-01-2021,1.0,NaN,NaN,NaN
02-01-2021,2.0,1.0,NaN,NaN
03-01-2021,2.0,2.0,NaN,NaN
04-01-2021,3.0,2.0,NaN,1.0
05-01-2021,2.0,3.0,NaN,3.0
06-01-2021,1.0,2.0,1.0,2.0
07-01-2021,1.0,1.0,3.0,2.0
08-01-2021,2.0,2.0,1.0,3.0
09-01-2021,3.0,2.0,1.0,2.0
Taking a pandas dataframe df, I would like to both take away (negate) the value in particular columns for all rows/entries and also add another value. This value to be added is a fixed additive, the same for each of the columns.
I believe I could reproduce df, say dfcopy = df.copy(), set all cell values in dfcopy to the particular number and then subtract df from dfcopy, but I am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So, for example, starting with:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then, negating only the values in columns (0, 3, 4) and adding 10 (for example), we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Pandas is very intuitive in letting you perform these operations.
Negate:
df.iloc[:, [0, 2, 7, 10, 11]] = -df.iloc[:, [0, 2, 7, 10, 11]]
Add a constant c:
df.iloc[:, [0, 2, 7, 10, 11]] = df.iloc[:, [0, 2, 7, 10, 11]] + c
Or change to a constant value c:
df.iloc[:, [0, 2, 7, 10, 11]] = c
And so on for any other arithmetic you can think of.
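For instance, a minimal runnable sketch of the combined operation (the column positions and the constant are arbitrary here):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4), columns=list('ABCD'))
c = 10
df.iloc[:, [0, 3]] = -df.iloc[:, [0, 3]] + c  # negate, then add the constant
print(df)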
I have a large dataframe that looks similar to this:
As you can tell, there are plenty of blanks. I want to propagate non-null values forward (so, for example, in the first row 1029 goes to the 1963.02.12 column, between 1029 and 1043), but only up to the last entry; that is, it should stop propagating when it encounters the last non-null value (for D that would be the 1992.03.23 column, but for A it'd be 1963.09.21, just outside the screenshot).
Is there a quicker way to achieve this without fiddling around with df.fillna(method='ffill', limit=x)? My original idea was to remember the date of the last entry, propagate values to the end of the row, and then fill the row with nulls after the saved date. I've been wondering if there is a cleverer method to achieve the same result.
This might not be very performant. I couldn't get a pure-pandas solution (which obviously wouldn't guarantee performance anyway!).
>>> df
a b c d e
0 0.0 NaN NaN 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
What happens if we just ffill everything?
>>> df.ffill(axis=1)
a b c d e
0 0.0 0.0 0.0 1.0 1.0
1 0.0 1.0 1.0 2.0 3.0
2 NaN 1.0 2.0 2.0 4.0
We need to go back and add NaNs for the last null column in each row:
>>> new_data = []
>>> for _, row in df.iterrows():
...     new_row = row.ffill()
...     null_columns = [col for col, is_null in zip(row.index, row.isnull().values) if is_null]
...     # re-set the row's last null column back to NaN
...     if null_columns:
...         last_null_column = null_columns[-1]
...         new_row.loc[last_null_column] = np.nan
...     new_data.append(new_row.to_dict())
...
>>> new_df = pd.DataFrame.from_records(new_data)
>>> new_df
a b c d e
0 0.0 0.0 0.0 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
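As an aside, here is a vectorized sketch of the literal requirement (forward-fill, but leave every cell after a row's last valid value as NaN). Note that it is not identical to the loop above: the loop re-blanks each row's last null column, while this version only blanks cells past the row's last valid value, so interior gaps stay filled.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.0, 0.0, np.nan],
                   'b': [np.nan, 1.0, 1.0],
                   'c': [np.nan, np.nan, 2.0],
                   'd': [1.0, 2.0, np.nan],
                   'e': [np.nan, 3.0, 4.0]})

# bfill(axis=1) is NaN exactly in the cells after a row's last valid value,
# so its notna() mask cuts the forward fill off at the right place
filled = df.ffill(axis=1).where(df.bfill(axis=1).notna())
print(filled)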
Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with the same columns/indexes, ordered over time:
import numpy as np
import pandas as pd
np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5, 3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now, I want to replace the numpy.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
So if the first DataFrame is:
a b c
0 NaN 1.0 0.0
1 1.0 1.0 NaN
2 0.0 NaN 0.0
3 NaN 0.0 2.0
4 NaN 2.0 2.0
and the 2nd is:
a b c
0 0.0 NaN NaN
1 NaN NaN NaN
2 0.0 1.0 NaN
3 NaN NaN 2.0
4 0.0 NaN 2.0
Then the output, output_list, should be a list of the same length as df_list, also having DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
a b c
0 0.0 1.0 0.0
1 1.0 1.0 NaN
2 0.0 1.0 0.0
3 NaN 0.0 2.0
4 0.0 2.0 2.0
I believe the update functionality is very well suited for this; see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame in place, in your case only the NaN elements of it.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
This will give you the desired output.
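If you would rather not mutate the frames in df_list in place, combine_first offers a non-destructive alternative; a sketch of the same chained filling:

output_list = [df_list[0]]
for df in df_list[1:]:
    # keep df's own values; fall back to the previous result where df is NaN
    output_list.append(df.combine_first(output_list[-1]))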
I have a big dataframe with many columns (around 1000). I have a list of columns (generated by a script, around 10 of them). And I would like to select all the rows of the original dataframe where at least one of the columns in my list is not null.
So if I would know the number of my columns in advance, I could do something like this:
list_of_cols = ['col1', ...]
df[
    df[list_of_cols[0]].notnull() |
    df[list_of_cols[1]].notnull() |
    ...
    df[list_of_cols[6]].notnull()
]
I can also iterate over the list of cols and create a mask which I would then apply to df, but this looks too tedious. Knowing how powerful pandas is with respect to dealing with NaN, I would expect there to be a much easier way to achieve what I want.
Use the thresh parameter in the dropna() method. By setting thresh=1, you specify: if the row has at least 1 non-null item, don't drop it.
df = pd.DataFrame(np.random.choice((1., np.nan), (1000, 1000), p=(.3, .7)))
list_of_cols = list(range(10))
df[list_of_cols].dropna(thresh=1).head()
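Note that df[list_of_cols].dropna(thresh=1) returns only the listed columns; if you want the full rows of the original dataframe, you can reuse its index (a small sketch):

df.loc[df[list_of_cols].dropna(thresh=1).index].head()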
Starting with this:
data = {'a': [np.nan, 0, 0, 0, 0, 0, np.nan, 0, 0, 0, 0, 0, 9, 9],
        'b': [np.nan, np.nan, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 7],
        'c': [np.nan, np.nan, 1, 1, 2, 2, 3, 3, 3, 1, 1, 1, 1, 1],
        'd': [np.nan, np.nan, 7, 9, 6, 9, 7, np.nan, 6, 6, 7, 6, 9, 6]}
df = pd.DataFrame(data, columns=['a','b','c','d'])
df
a b c d
0 NaN NaN NaN NaN
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
Rows where not all values are null (this removes row index 0):
df[~df.isnull().all(axis=1)]
a b c d
1 0.0 NaN NaN NaN
2 0.0 1.0 1.0 7.0
3 0.0 1.0 1.0 9.0
4 0.0 1.0 2.0 6.0
5 0.0 1.0 2.0 9.0
6 NaN 1.0 3.0 7.0
7 0.0 1.0 3.0 NaN
8 0.0 1.0 3.0 6.0
9 0.0 2.0 1.0 6.0
10 0.0 2.0 1.0 7.0
11 0.0 2.0 1.0 6.0
12 9.0 1.0 1.0 9.0
13 9.0 7.0 1.0 6.0
One can use boolean indexing:
df[~pd.isnull(df[list_of_cols]).all(axis=1)]
Explanation:
The expression ~pd.isnull(df[list_of_cols]).all(axis=1) returns a boolean Series that is applied as a filter to the dataframe:
isnull() applied to df[list_of_cols] creates a boolean mask for the dataframe df[list_of_cols], with True for the null elements in df[list_of_cols] and False otherwise
all(axis=1) returns True for a row if all of its elements are True
So, by negation ~ (not all null = at least one is non-null) one gets a mask for all rows that have at least one non-null element in the given list of columns.
An example:
Dataframe:
>>> df = pd.DataFrame({'A': [11, 22, 33, np.NaN],
...                    'B': ['x', np.NaN, np.NaN, 'w'],
...                    'C': ['2016-03-13', np.NaN, '2016-03-14', '2016-03-15']})
>>> df
A B C
0 11 x 2016-03-13
1 22 NaN NaN
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
Negated isnull mask (here with list_of_cols = ['B', 'C']):
>>> ~pd.isnull(df[list_of_cols])
B C
0 True True
1 False False
2 False True
3 True True
Apply all(axis=1) row-wise and negate:
>>> ~pd.isnull(df[list_of_cols]).all(axis=1)
0 True
1 False
2 True
3 True
dtype: bool
Boolean selection from dataframe:
>>> df[~pd.isnull(df[list_of_cols]).all(axis=1)]
A B C
0 11 x 2016-03-13
2 33 NaN 2016-03-14
3 NaN w 2016-03-15
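Equivalently, since "not all null" means "at least one non-null", the mask can be written positively with notna (alias notnull) and any, which selects the same rows:

>>> df[df[list_of_cols].notna().any(axis=1)]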