Write a function for .apply that analyzes a column as a whole? - python

I have a dataframe (it is the product of using the pivot function, which is why it has the c and a):
c           367235 419895 992194
a
1999-02-06     NaN      9    NaN
2000-04-03       2    NaN    NaN
1999-04-12     NaN    NaN      4
1999-08-08       2    NaN    NaN
1999-11-01       8      5      1
1999-12-08     NaN      3    NaN
1999-08-17     NaN    NaN     10
1999-10-22     NaN      3    NaN
1999-03-23     NaN      4    NaN
2000-09-29       9    NaN    NaN
1999-04-30     NaN    NaN      1
2000-09-30       9    NaN    NaN
I would like to add a new row at the bottom of this dataframe. Each cell in the new row will evaluate the column above it; if the column contains the numbers 9, 8 or 3, the cell will evaluate to "TRUE". If the column does not contain those numbers, the cell will evaluate to "FALSE". Ultimately, my goal is to delete the columns with a "FALSE" cell using the drop function, creating a dataset like so:
c           367235 419895
a
1999-02-06     NaN      9
2000-04-03       2    NaN
1999-04-12     NaN    NaN
1999-08-08       2    NaN
1999-11-01       8      5
1999-12-08     NaN      3
1999-08-17     NaN    NaN
1999-10-22     NaN      3
1999-03-23     NaN      4
2000-09-29       9    NaN
1999-04-30     NaN    NaN
2000-09-30       9    NaN
              TRUE   TRUE
My problem:
I can write a function that evaluates whether one of several numbers is in a list, but I cannot get this function to work with .apply.
That is, I found that this works for determining if a group of numbers is in a list:
How to check if one of the following items is in a list?
I tried to modify it as follows for the apply function:
def BIS(i):
    L1 = [9, 8, 3]
    if i in L1:
        return "TRUE"
    else:
        return "FALSE"

df_wanted.apply(BIS, axis=0)
this results in an error:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 367235')
This makes me think that although .apply takes an entire column as input, it cannot aggregate the truth values of the individual cells into a single truth value for the column. I looked up a.any() and a.bool(), and they look very useful, but I don't know where to put them. For example, this didn't work:
df_wanted.apply.any(BIS, axis = 0)
nor did this
df_wanted.apply(BIS.any, axis = 0).
Can anyone point me in the right direction? Many thanks in advance

You can use the .isin() method, combined with .any() to reduce each column to a single boolean (the columns hold numbers, so test against numbers rather than strings):
df.loc[:, df.isin([9, 8, 3]).any()]
And if you need to append the condition row to the data frame:
cond = df.isin([9, 8, 3]).any().rename("cond")
df.append(cond).loc[:, cond]
(DataFrame.append was removed in pandas 2.0; there you can use pd.concat([df, cond.to_frame().T]) instead.)
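For reference, a runnable sketch of this approach on a small stand-in frame (the column names follow the question; the data is abbreviated):

```python
import numpy as np
import pandas as pd

# Small stand-in for the pivoted frame in the question
df = pd.DataFrame(
    {367235: [np.nan, 2, 8], 419895: [9, np.nan, 5], 992194: [np.nan, np.nan, 1]},
    index=pd.to_datetime(["1999-02-06", "2000-04-03", "1999-11-01"]),
)

# isin() tests each cell; .any() reduces each column to one boolean
keep = df.isin([9, 8, 3]).any()

# Select only the columns that contain at least one of 9, 8 or 3
filtered = df.loc[:, keep]
print(list(filtered.columns))  # [367235, 419895]
```

The .any() call is exactly the aggregation step that .apply was missing: it turns a column of per-cell booleans into one column-level answer.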

Related

Pandas .iloc indexing coupled with boolean indexing in a DataFrame

I looked into existing threads regarding indexing, but none of them addresses the present use case.
I would like to alter specific values in a DataFrame based on their position, i.e., I'd like the values in the second column, from the first to the 4th row, to be NaN, and the values in the third column, first and second rows, to be NaN. Say we have the following `DataFrame`:
df = pd.DataFrame(np.random.standard_normal((7,3)))
print(df)
          0         1         2
0 -1.102888  1.293658 -2.290175
1 -1.826924 -0.661667 -1.067578
2  1.015479  0.058240 -0.228613
3 -0.760368  0.256324 -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
And I want alter df like below with the least amount of code:
          0         1         2
0 -1.102888       NaN       NaN
1 -1.826924       NaN       NaN
2  1.015479       NaN -0.228613
3 -0.760368       NaN -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
I tried using boolean indexing with .loc, but it resulted in an error:
df.loc[(:2,1:) & (2:4,1)] = np.nan
# exception message:
df.loc[(:2,1:) & (2:4,1)] = np.nan
^
SyntaxError: invalid syntax
I also thought about converting the DataFrame to a numpy ndarray object, but then I wouldn't know how to use boolean indexing in that case.
One way is to define the requirement explicitly and assign in a loop, to keep it clear:
d = {1: 4, 2: 2}  # column -> number of leading rows to set to NaN
for col, val in d.items():
    df.iloc[:val, col] = np.nan
print(df)
          0         1         2
0 -1.102888       NaN       NaN
1 -1.826924       NaN       NaN
2  1.015479       NaN -0.228613
3 -0.760368       NaN -0.259946
4  0.496348  0.437496  0.646149
5  0.717212  0.481687 -2.640917
6 -0.141584 -1.997986  1.226350
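Since there are only two slices here, the loop can also be unrolled into two direct .iloc assignments, which is arguably the clearest form:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.standard_normal((7, 3)))

# Rows 0-3 of column 1 and rows 0-1 of column 2 become NaN
df.iloc[:4, 1] = np.nan
df.iloc[:2, 2] = np.nan
```

The dict-and-loop form scales better when many columns need different row ranges; for just two, the explicit assignments read more directly.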

Iterate through rows and identify which column is true; assign a new column the name of the column header

I have the following DataFrame:
Index  Time Lost  Cause 1  Cause 2  Cause 3
0      40         x        NaN      NaN
1      15         NaN      x        NaN
2      65         x        NaN      NaN
3      10         NaN      NaN      x
There is only one "X" per row, which identifies the cause of the "Time Lost" column. I am trying to iterate through each row (and column) to determine which column holds the "X". I would then like to add a "Type" column with the name of the column header that was true for each row. This is what I would like as a result:
Index  Time Lost  Cause 1  Cause 2  Cause 3  Type
0      40         x        NaN      NaN      Cause 1
1      15         NaN      x        NaN      Cause 2
2      65         x        NaN      NaN      Cause 1
3      10         NaN      NaN      x        Cause 3
Currently my code looks like this, I am trying to iterate through the DataFrame. However, I'm not sure if there is a function or non-iterative approach to assign the proper value to the "Type" column:
cols = ['Cause 1', 'Cause 2', 'Cause 3']
for index, row in df.iterrows():
    for col in cols:
        if df.loc[index, col] == 'X':
            df.loc[index, 'Type'] = col
            continue
        else:
            df.loc[index, 'Type'] = 'Other'
            continue
The issue I get with this code is that it iterates but only identifies rows with the last item in the cols list and the remainder go to "Other".
Any help is appreciated!
You could use idxmax on the boolean array of your data:
df['Type'] = df.drop('Time Lost', axis=1).eq('x').idxmax(axis=1)
Note that this only reports the first cause if there are several.
output:
   Time Lost Cause 1 Cause 2 Cause 3     Type
0         40       x     NaN     NaN  Cause 1
1         15     NaN       x     NaN  Cause 2
2         65       x     NaN     NaN  Cause 1
3         10     NaN     NaN       x  Cause 3
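A runnable version of this answer, extended with a where() guard so that rows without any "x" get "Other" instead of a spurious first column (an extra fifth row is added here to show that case):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time Lost": [40, 15, 65, 10, 5],
    "Cause 1": ["x", np.nan, "x", np.nan, np.nan],
    "Cause 2": [np.nan, "x", np.nan, np.nan, np.nan],
    "Cause 3": [np.nan, np.nan, np.nan, "x", np.nan],
})

marks = df.drop("Time Lost", axis=1).eq("x")
# idxmax(axis=1) picks the first True column per row; rows with no "x"
# would wrongly get the first column, so guard them with where()
df["Type"] = marks.idxmax(axis=1).where(marks.any(axis=1), "Other")
print(df["Type"].tolist())  # ['Cause 1', 'Cause 2', 'Cause 1', 'Cause 3', 'Other']
```

This replaces the double loop entirely; the whole assignment is vectorized.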

thresh in dropna for DataFrame in pandas in python

import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3))
df1.iloc[:4, 1] = np.nan
df1.iloc[:2, 2] = np.nan
df1.dropna(thresh=1, axis=1)
It seems that no column has been dropped:
    0     1     2
0   0   NaN   NaN
1   3   NaN   NaN
2   6   NaN   8.0
3   9   NaN  11.0
4  12  13.0  14.0
If I run
df1.dropna(thresh=2, axis=1)
why does it give the following?
    0     2
0   0   NaN
1   3   NaN
2   6   8.0
3   9  11.0
4  12  14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, should the column be deleted?
thresh=N requires that a column has at least N non-NaNs to survive. In the first example, both columns have at least one non-NaN, so both survive. In the second example, only the last column has at least two non-NaNs, so it survives, but the previous column is dropped.
Try setting thresh to 4 to get a better sense of what's happening.
The thresh parameter sets the minimum number of non-NaN values a row needs in order not to be dropped; with axis=1, the same rule is applied to columns.
This searches along the columns and checks whether each column has at least 1 non-NaN value:
df1.dropna(thresh=1, axis=1)
Column 1 has only one non-NaN value, namely 13, but thresh=2 requires at least 2 non-NaNs, so that column fails the check and is dropped:
df1.dropna(thresh=2, axis=1)
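Putting both answers together in one runnable sketch, including the thresh=4 suggestion:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(15).reshape(5, 3).astype(float))
df1.iloc[:4, 1] = np.nan  # column 1 keeps a single non-NaN (13.0)
df1.iloc[:2, 2] = np.nan  # column 2 keeps three non-NaNs

# With axis=1, thresh is the minimum count of non-NaN values a column
# must have to survive
print(df1.dropna(thresh=1, axis=1).shape)  # (5, 3): every column has >= 1
print(df1.dropna(thresh=2, axis=1).shape)  # (5, 2): column 1 is dropped
print(df1.dropna(thresh=4, axis=1).shape)  # (5, 1): only column 0 survives
```

So thresh never counts NaNs directly; it counts the values that remain.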

Python: Create New Column Equal to the Sum of all Columns Starting from Column Number 9

I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import numpy.random as npr
import pandas as pd

df = pd.DataFrame(npr.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NA values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1)
print(df)
         0        1        2        3        4     test
0      NaN      NaN      NaN      NaN  0.73046  0.73046
1      NaN      NaN      NaN      NaN  0.79060  0.79060
2      NaN      NaN      NaN      NaN  0.53859  0.53859
3  0.97469  0.60224  0.90022  0.45015  0.52246  1.87283
4  0.84111  0.52958  0.71513  0.17180  0.34494  1.23187
5  0.21991  0.10479  0.60755  0.79287  0.11051  1.51094
6  0.64966  0.53332  0.76289  0.38522  0.92313  2.07124
7  0.40139  0.41158  0.30072  0.09303  0.37026  0.76401
8  0.59258  0.06255  0.43663  0.52148  0.62933  1.58744
9  0.12762  0.01651  0.09622  0.30517  0.78018  1.18156
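So the asker's line only needs axis=1 added. A minimal sketch with a stand-in frame (the real df_UBSrepscomp is not shown in the question, so a small all-ones frame with 12 columns is assumed here):

```python
import numpy as np
import pandas as pd

# Stand-in for df_UBSrepscomp: 3 rows, 12 columns of ones
df = pd.DataFrame(np.ones((3, 12)))

# axis=1 makes the sum run across each row, over columns 9, 10 and 11
df["test"] = df.iloc[:, 9:].sum(axis=1)
print(df["test"].tolist())  # [3.0, 3.0, 3.0]
```

Without axis=1, .sum() collapses each column instead, producing a Series indexed by column labels; assigning that to a row-indexed column yields the all-NaN result the asker saw.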

Remove NaN values from dataframe without fillna or Interpolate

I have a dataset:
            367235  419895  992194
1999-01-11       8       5       1
1999-03-23     NaN       4     NaN
1999-04-30     NaN     NaN       1
1999-06-02     NaN       9     NaN
1999-08-08       2     NaN     NaN
1999-08-12     NaN       3     NaN
1999-08-17     NaN     NaN      10
1999-10-22     NaN       3     NaN
1999-12-04     NaN     NaN       4
2000-03-04       2     NaN     NaN
2000-09-29       9     NaN     NaN
2000-09-30       9     NaN     NaN
When I plot it, using plt.plot(df, '-o') I get this:
But what I would like is for the datapoints from each column to be connected in a line, like so:
I understand that matplotlib does not connect datapoints that are separated by NaN values. I looked at all the options here for dealing with missing data, but all of them would essentially misrepresent the data in the dataframe. This is because each value within the dataframe represents an incident; if I replace the NaNs with scalar values or use the interpolate option, I get a bunch of points that are not actually in my dataset. Here's what interpolate looks like:
df_wanted2 = df.apply(pd.Series.interpolate)
If I try to use dropna, I'll lose entire rows/columns from the dataframe, and these rows hold valuable data.
Does anyone know a way to connect up my dots? I suspect I need to extract individual arrays from the dataframe and plot them, as is the advice given here, but this seems like a lot of work (and my actual dataframe is much bigger). Does anyone have a solution?
Use the interpolate method with the parameter 'index':
df.interpolate('index').plot(marker='o')
alternative answer
plot each column after dropping its NaNs:
for _, c in df.items():  # iteritems() was removed in pandas 2.0; items() is equivalent
    c.dropna().plot(marker='o')
extra credit
only interpolate from the first valid index to the last valid index for each column:
for _, c in df.items():
    fi, li = c.first_valid_index(), c.last_valid_index()
    c.loc[fi:li].interpolate('index').plot(marker='o')
Try iterating through with apply, and inside the applied function drop the missing values:
def make_plot(s):
    s.dropna().plot()

df.apply(make_plot)
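The effect of the per-column dropna() can be checked without a plotting backend: each column collapses to just its real observations, leaving no NaN gaps to break the line (abbreviated version of the question's data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    367235: [8, np.nan, np.nan, 2, 9],
    419895: [5, 4, 9, np.nan, np.nan],
}, index=pd.to_datetime(["1999-01-11", "1999-03-23", "1999-06-02",
                         "1999-08-08", "2000-09-29"]))

# After dropna(), each column is a gap-free series of only the real
# observations, which plot(marker='o') will connect with a line
lengths = {name: len(c.dropna()) for name, c in df.items()}
print(lengths)  # {367235: 3, 419895: 3}
```

Crucially, dropna() on a single Series removes only that column's NaNs, so no rows of other columns are lost, which was the asker's worry about using dropna on the whole frame.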
An alternative would be to outsource the NaN handling to the graphing library Plotly, using its connectgaps option.
import plotly
import pandas as pd

txt = """367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN"""

data_points = [line.split(' ') for line in txt.splitlines()[1:]]
df = pd.DataFrame(data_points)
data = list()
for i in range(1, len(df.columns)):
    data.append(plotly.graph_objs.Scatter(
        x=df.iloc[:, 0].tolist(),
        y=df.iloc[:, i].tolist(),
        mode='lines',  # the valid mode name is 'lines', not 'line'
        connectgaps=True
    ))
fig = dict(data=data)
# legacy online API; newer Plotly releases moved this into the chart_studio package
plotly.plotly.sign_in('user', 'token')
plot_url = plotly.plotly.plot(fig)
