Get column name with last valid value for each index - python

I have a dataframe like this -
     A    B     C
0  1.0  NaN   3.0
1  2.0  3.0   NaN
2  2.0  NaN   NaN
3  NaN  NaN  53.0
I need to find the column name with the last valid value for each index. For example, for the above dataframe I want to get output like this:
['C', 'B', 'A', 'C']
I did try to get the column names, but was only able to grab the values by using iteritems() on the transpose of the dataframe. Also, since it loops through the dataframe, I don't find it very optimal. Please find my approach below:
l_val = []
for idx, row in df.T.iteritems():
    last_val = None
    for x in row:
        if not pd.isna(x):
            last_val = x
    l_val.append(last_val)
Returns the values rather than the column names:
[3.0, 3.0, 2.0, 53.0]
I have tried searching a lot, but most answers refer to the last_valid_index method, which returns the last valid index in a column; I don't see how I can use it for my problem. Can someone please suggest a fast way to do this?

You can do:
df.idxmax(axis=1).to_list()
Output:
['C', 'B', 'A', 'C']
EDIT:
The solution above gives the index of the maximum value, which only happens to match here. You can also have a dataframe where values in the first columns are greater than values in the later columns; in that case, use the solution below to get the index of the last valid value:
df.T.apply(pd.Series.last_valid_index).to_list()
Output:
['C', 'B', 'A', 'C']
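For reference, the same idea works without the transpose by passing axis=1 to apply; a runnable sketch on the sample frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 2, np.nan],
                   'B': [np.nan, 3.0, np.nan, np.nan],
                   'C': [3.0, np.nan, np.nan, 53]})

# last_valid_index returns the label of the last non-NaN entry of each row
result = df.apply(pd.Series.last_valid_index, axis=1).to_list()
print(result)  # ['C', 'B', 'A', 'C']
```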

Related

Python groupby mode() pick last item when having empty, single and multiple array lengths

I did check for possible solutions, but the most common ones didn't work.
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(pd.Series.mode)
This gives me a Series with arrays of different lengths, in this format:
2611BA             []
2611BB            4.0
2611BC    [3.0, 6.0]
QUESTION: How to select the last item to use as value for a new column?
Background: one column has rankings. Per group I want to take the mode() and use it as an imputed value for the NaNs in that group.
In case of multiple modes I want to take the highest. Sometimes a group has only NaNs; in that case it should or could stay like that. If a group has 8 NaNs and one ranking of 8, then the mode should be 8, disregarding the NaNs.
I am trying to create a new column with code like this:
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(
    lambda x: pd.Series.mode(x)[0])
Or
df_woningen.groupby(['postcode'], dropna=True)['energy_ranking'].agg(lambda x:x.value_counts(dropna=True).index[0])
But I get errors, and I believe it's because of the different lengths of the arrays:
TypeError: 'function' object is not subscriptable
IndexError: index 0 is out of bounds for axis 0 with size 0
Any idea how to solve this?
Assuming this example:
df = pd.DataFrame({'group': list('AAABBC'), 'value': [1,1,2,1,2,float('nan')]})
s = df.groupby('group')['value'].agg(pd.Series.mode)
Input:
group
A            1.0
B    [1.0, 2.0]
C             []
Name: value, dtype: object
You can use the str accessor and fillna:
s.str[-1].fillna(s.mask(s.str.len().eq(0)))
# or for numbers
# s.str[-1].fillna(pd.to_numeric(s, errors='coerce'))
Output:
group
A 1.0
B 2.0
C NaN
Name: value, dtype: float64
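To then put the highest mode back as an imputed value for the NaNs in each group (the stated goal), the result can be mapped over the group keys; a minimal sketch on the same toy frame:

```python
import pandas as pd

df = pd.DataFrame({'group': list('AAABBC'),
                   'value': [1, 1, 2, 1, 2, float('nan')]})

# per-group mode: a scalar, an array of ties, or an empty array (all-NaN group)
s = df.groupby('group')['value'].agg(pd.Series.mode)

# take the last (= highest, since mode is sorted) element of each tie array;
# scalar modes come through via fillna, all-NaN groups stay NaN
highest_mode = s.str[-1].fillna(s.mask(s.str.len().eq(0)))

# impute: fill each missing value with its group's highest mode
df['value'] = df['value'].fillna(df['group'].map(highest_mode))
```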
IIUC you can use a lambda function in conjunction with a -1 list index to get the data you are looking for:
data = {
    'Column1': ['2611BA', '2611BB', '2611BC'],
    'Column2': [[], [4.0], [3.0, 6.0]]
}
df = pd.DataFrame(data)
df['Column3'] = df['Column2'].apply(lambda x: x[-1] if len(x) > 0 else '')
df

Append to only one column in a dataframe python

I have an empty pandas dataframe. I want to append a value to one column at a time. I am trying to iterate through the columns using a for loop and append a value (5, for example). I wrote the code below, but it does not work. Any idea?
example:
df: ['a', 'b', 'c']
for column in df:
    df.append({column: 5}, ignore_index=True)
I want to implement this by iterating through the columns. The result should be:
df: ['a', 'b', 'c']
5 5 5
This sounds like a horrible idea, as it becomes extremely inefficient as your df grows in size, and I'm almost certain there is a much better way to do this if you would give more context. But for the sake of answering the question, you can use the shape of the df to figure out the row, use the column name as the column, and use .at to manually assign the value.
Here we assign 3 values to the df, one column at a time.
import pandas as pd

df = pd.DataFrame({'a': [], 'b': [], 'c': []})
values_to_add = [3, 4, 5]
for v in values_to_add:
    row = df.shape[0]
    for column in df.columns:
        df.at[row, column] = v
Output
     a    b    c
0  3.0  3.0  3.0
1  4.0  4.0  4.0
2  5.0  5.0  5.0
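As a side note, a common alternative is to assign an entire row at once with .loc and the current length of the frame, rather than one cell at a time; a minimal sketch under the same setup:

```python
import pandas as pd

df = pd.DataFrame({'a': [], 'b': [], 'c': []})

for v in [3, 4, 5]:
    # len(df) is the next free integer label on a default RangeIndex,
    # so this appends a whole new row with v in every column
    df.loc[len(df)] = [v] * len(df.columns)
```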

Pandas: Get values in column that have several different corresponding values in another column

Let's take this sample dataframe :
df=pd.DataFrame({'Name':['A','A','A','B','C','C','D','D'], 'Value':["6","2","1","1","2","2","3","2"]})
  Name Value
0    A     6
1    A     2
2    A     1
3    B     1
4    C     2
5    C     2
6    D     3
7    D     2
I would like to extract the values from Name that have at least 2 different values in the Value column. I could of course use a for loop to check each value of df["Name"].unique(), but it is very slow on my real, big dataframe. Do you know an efficient way of doing this?
Expected output :
['A', 'D']
Drop duplicates, group by Name, and keep the groups with more than one row.
Can you do that coding on your own?
One-liner version using GroupBy.nunique + query() + to_list:
df.groupby('Name').nunique().query('Value > 1').index.to_list()
Result:
['A', 'D']
Thanks to Prune's comment:
df_gb = df.drop_duplicates().groupby("Name",as_index=False).agg({"Value":"count"})
list(df_gb[df_gb["Value"]>1]["Name"].values)
Output :
['A', 'D']
Good; another option is using GroupBy and the nunique method:
df2 = df.groupby(['Name']).nunique()
list(df2[df2['Value'] > 1].index)
Output:
['A', 'D']
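For reference, a self-contained sketch of the nunique approach on the sample frame (selecting the Value column before nunique keeps the result a plain Series):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D'],
                   'Value': ["6", "2", "1", "1", "2", "2", "3", "2"]})

# number of distinct Values per Name, then keep names with more than one
counts = df.groupby('Name')['Value'].nunique()
result = counts[counts > 1].index.to_list()
print(result)  # ['A', 'D']
```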

How to avoid warning in Pandas? [duplicate]

I'm trying to select a subset of a subset of a dataframe, selecting only some columns, and filtering on the rows.
df.loc[df.a.isin(['Apple', 'Pear', 'Mango']), ['a', 'b', 'f', 'g']]
However, I'm getting this warning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
What's the correct way to slice and filter now?
TL;DR: There is likely a typo or spelling error in the column header names.
This is a change introduced in v0.21.1, and has been explained in the docs at length -
Previously, selecting with a list of labels, where one or more labels were missing, would always succeed, returning NaN for missing labels. This will now show a FutureWarning. In the future this will raise a KeyError (GH15747). This warning will trigger on a DataFrame or a Series for using .loc[] or [[]] when passing a list-of-labels with at least 1 missing label.
For example,
df
A B C
0 7.0 NaN 8
1 3.0 3.0 5
2 8.0 1.0 7
3 NaN 0.0 3
4 8.0 2.0 7
Try some kind of slicing as you're doing -
df.loc[df.A.gt(6), ['A', 'C']]
A C
0 7.0 8
2 8.0 7
4 8.0 7
No problem. Now, try replacing C with a non-existent column label -
df.loc[df.A.gt(6), ['A', 'D']]
FutureWarning: Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.
A D
0 7.0 NaN
2 8.0 NaN
4 8.0 NaN
So, in your case, the warning is raised because of the column labels you pass to loc. Take another look at them.
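As the warning message itself suggests, .reindex() is the forward-compatible alternative; a minimal sketch on the example frame above (column D is deliberately missing):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [7.0, 3.0, 8.0, np.nan, 8.0],
                   'B': [np.nan, 3.0, 1.0, 0.0, 2.0],
                   'C': [8, 5, 7, 3, 7]})

# reindex adds the missing column D as all-NaN instead of warning/raising
out = df.loc[df.A.gt(6)].reindex(columns=['A', 'D'])
print(out)
```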
This warning also occurs with an .append call when the list contains new columns. To avoid it, use:
df=df.append(pd.Series({'A':i,'M':j}), ignore_index=True)
Instead of,
df=df.append([{'A':i,'M':j}], ignore_index=True)
Full error message:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py:1472:
FutureWarning: Passing list-likes to .loc or with any missing label
will raise KeyError in the future, you can use .reindex() as an
alternative.
Thanks to https://stackoverflow.com/a/50230080/207661
If you want to retain the index you can pass list comprehension instead of a column list:
loan_data_inputs_train.loc[:,[i for i in List_col_without_reference_cat]]
I'm not sure I understood you correctly, but it seems the following approach could work for you:
df[df['a'].isin(['Apple', 'Pear', 'Mango'])][['a', 'b', 'f', 'g']]
Snippet description:
df['a'].isin(['Apple', 'Pear', 'Mango'])  # row filter: keeps rows whose value in column a is in the list
df[['a', 'b', 'f', 'g']]                  # column filter: selects a specific set of columns

