working with NaN in a dataframe with if condition - python

I have 2 columns in a DataFrame and I am trying to enter a condition based on whether the second one is NaN and the first one has a value, unsuccessfully using:
if np.isfinite(train_bk['Product_Category_1']) and np.isnan(train_bk['Product_Category_2'])
and
if not (train_bk['Product_Category_2']).isnull() and (train_bk['Product_Category_3']).isnull()

I would use eval (note that `x != x` is True only for NaN, and that eval and replace both return a new DataFrame unless you assign the result back or pass inplace=True):
df = df.eval('ind = ((pc1 == pc1) & (pc2 != pc2)) * 2 + ((pc1 == pc1) & (pc2 == pc2)) * 3')
df = df.replace({'ind': {0: 1}})
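More commonly, this is done with element-wise masks rather than eval, since a plain `if` on a whole Series raises a "truth value is ambiguous" error. A minimal sketch, with hypothetical sample data standing in for train_bk:

```python
import numpy as np
import pandas as pd

# hypothetical data standing in for train_bk
train_bk = pd.DataFrame({
    'Product_Category_1': [1.0, 2.0, np.nan],
    'Product_Category_2': [np.nan, 5.0, np.nan],
})

# element-wise condition: first column has a value, second is NaN
mask = train_bk['Product_Category_1'].notna() & train_bk['Product_Category_2'].isna()
train_bk['ind'] = np.where(mask, 2, 1)
print(train_bk['ind'].tolist())  # [2, 1, 1]
```

np.where evaluates the whole mask at once, so no row-by-row `if` is needed.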

Related

Trying to compare to values in a pandas dataframe for max value

I've got a pandas dataframe, and I'm trying to fill a new column in the dataframe, which iteratively takes the maximum of two values situated in another column of the dataframe. I'm trying to build a loop to do this and save computation time, as I realise I could probably do it with more lines of code.
for x in jac_input.index:
    jac_output['Max Load'][x] = jac_input[['load'][x], ['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and this will return 'abc', as you are indexing a list. Not at all what you expect it to do (I imagine you meant to slice the Series).
For your code to be valid, you should do something like:
jac_input = pd.DataFrame({'load': [1,0,3,2,5,4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning, if you use jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ....
But you do not need all that, use vectorial code instead!
You can perform rolling on the reversed dataframe:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(1)
output (without assignment):
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
5 4.0
dtype: float64
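A self-contained check of the two vectorised approaches above, using the same toy data:

```python
import pandas as pd

jac_input = pd.DataFrame({'load': [1, 0, 3, 2, 5, 4]})

# rolling max over each value and its successor, computed on the reversed series
rolled = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]

# same result by pairing each value with its shifted successor
paired = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(axis=1)

print(rolled.tolist())  # [1.0, 3.0, 3.0, 5.0, 5.0, 4.0]
print(paired.tolist())  # [1.0, 3.0, 3.0, 5.0, 5.0, 4.0]
```

min_periods=1 in the rolling version is what lets the last row keep its own value instead of becoming NaN.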

Nan values not getting replaced

The values are getting replaced, but the moment I print the data it still shows the NaN values.
for col in data.columns:
    for each in range(len(data[col])):
        if math.isnan(data[col][each]) == True:
            data.replace(data[col][each], statistics.mean(data[col]))
data
dataset: https://docs.google.com/spreadsheets/d/1AVTVmUVs9lSe7I9EXoPs0gNaSIo9KM2PrXxwVWeqtME/edit?usp=sharing
Looks like what you are trying to do is to replace NaN values by the mean of each column, which has been treated here.
Regarding your problem: the function replace(a, b) replaces all the values in your DataFrame that are equal to a by b, and it returns a new DataFrame rather than modifying the original in place, so the result of each call in your loop is discarded.
Moreover, the function statistics.mean will return NaN if there is a NaN in your list, so you should use numpy.nanmean() instead.
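A minimal sketch of the vectorised fix, using hypothetical numeric data in place of the linked sheet: fill NaNs with the per-column means (DataFrame.mean skips NaN by default) and assign the result back:

```python
import numpy as np
import pandas as pd

# hypothetical data standing in for the linked spreadsheet
data = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                     'b': [np.nan, 4.0, 6.0]})

# replace each NaN with its column's mean; assign back so the change sticks
data = data.fillna(data.mean())
print(data['a'].tolist())  # [1.0, 2.0, 3.0]
print(data['b'].tolist())  # [5.0, 4.0, 6.0]
```

No explicit loop and no math.isnan check is needed; fillna only touches the missing cells.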

Get all rows in a Pandas DataFrame column that are in a list of strings - This pattern has match groups

Consider the code
import pandas as pd
# dfs
df_sample = pd.read_csv('...........')
array = ['' , '' , '' ....]
pattern = '|'.join(array)
# get all the rows
print(df_sample.COLUMN_NAME_XXX.str.contains(pattern))
How can I get the matching rows' contents, and not True/False as at the moment? Since I keep getting this:
manipulations.py:17: UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
print(df_sample.COLUMN_NAME_XXX.str.contains(pattern))
0 False
1 True
2 False
3 NaN
4 NaN
...
10942 False
10943 NaN
10944 NaN
10945 NaN
10946 NaN
Name: COLUMN_NAME_XXX, Length: 568743243, dtype: object
You should be able to pass that logical array directly back to the dataframe slicing operators, like:
df_sample[df_sample.COLUMN_NAME_XXX.str.contains(pattern)]
Which should return all rows where the condition inside the square brackets is satisfied. Conditions can be chained by formatting them like:
[(condition1) | (condition2)] #OR
[(condition1) & (condition2)] #AND
It seems to map NaN to False automatically, but if not you can add that as another step to the boolean dataframe by adding .fillna(value = False):
df_sample[df_sample.COLUMN_NAME_XXX.str.contains(pattern).fillna(value = False)]
Try via fillna():
m=df_sample.COLUMN_NAME_XXX.str.contains(pattern).fillna(False)
#Finally:
out=df[m]
#OR
out=df.loc[m]
Now if you print out, you will get your filtered DataFrame.
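As for the UserWarning itself: it only means the pattern contains capturing groups, which str.contains ignores. If you want to silence it, make the groups non-capturing with (?:...). A small sketch with made-up data (the pattern and values here are illustrative):

```python
import pandas as pd

s = pd.Series(['alpha', 'beta', None, 'gamma'])

# capturing groups like (al|gam) trigger the "match groups" UserWarning;
# the non-capturing form (?:al|gam) does not
pattern = '(?:al|gam)'
m = s.str.contains(pattern).fillna(False)
print(s[m].tolist())  # ['alpha', 'gamma']
```

fillna(False) handles the None entry, which str.contains otherwise maps to NaN.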

Pandas - change cell value based on conditions from cell and from column

I have a Dataframe with a lot of "bad" cells. Let's say, they have all -99.99 as values, and I want to remove them (set them to NaN).
This works fine:
df[df == -99.99] = None
But actually I want to delete all these cells ONLY if another cell in the same row is marked as 1 (e.g. in the column "Error").
I want to delete all -99.99 cells, but only if df["Error"] == 1.
The most straightforward solution, I think, is something like
df[(df == -99.99) & (df["Error"] == 1)] = None
but it gives me the error:
ValueError: cannot reindex from a duplicate axis
I tried every solution given on the internet but I can't get it to work! :(
Since my Dataframe is big I don't want to iterate it (which of course, would work, but take a lot of time).
Any hint?
Try using broadcasting while passing numpy values:
import numpy as np
import pandas as pd

# sample data, special value is -99
df = pd.DataFrame([[-99, -99, 1], [2, -99, 2],
                   [1, 1, 1], [-99, 0, 1]],
                  columns=['a', 'b', 'Errors'])
# note the double square brackets
df[(df == -99) & (df[['Errors']] == 1).values] = np.nan
Output:
a b Errors
0 NaN NaN 1
1 2.0 -99.0 2
2 1.0 1.0 1
3 NaN 0.0 1
At least, this is working (but with column iteration):
for i in df.columns:
    df.loc[df[i].isin([-99.99]) & df["Error"].isin([1]), i] = None
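An assignment-free alternative is DataFrame.mask, which sets cells to NaN wherever the condition is True. A sketch reusing data shaped like the sample above, but with -99.99 as the special value and an "Error" column:

```python
import pandas as pd

df = pd.DataFrame([[-99.99, -99.99, 1], [2, -99.99, 2],
                   [1, 1, 1], [-99.99, 0, 1]],
                  columns=['a', 'b', 'Error'])

# mask cells equal to -99.99, but only in rows where Error == 1;
# .to_numpy() on the single-column frame lets the row condition
# broadcast across all columns
out = df.mask(df.eq(-99.99) & df[['Error']].eq(1).to_numpy())
print(out)
```

Row 1 keeps its -99.99 in column b because its Error value is 2, not 1.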

Computing the first non-missing value from each column in a DataFrame

I have a DataFrame which looks like this:
1125400 5430095 1095751
2013-05-22 105.24 NaN 6507.58
2013-05-23 104.63 NaN 6393.86
2013-05-26 104.62 NaN 6521.54
2013-05-27 104.62 NaN 6609.31
2013-05-28 104.54 87.79 6640.24
2013-05-29 103.91 86.88 6577.39
2013-05-30 103.43 87.66 6516.55
2013-06-02 103.56 87.55 6559.43
I would like to compute the first non-NaN value in each column.
As Locate first and last non NaN values in a Pandas DataFrame points out, first_valid_index can be used. Unfortunately, it returns the first row where at least one element is not NaN and does not work per-column.
You should use the apply function which applies a function on either each column (default) or each row efficiently:
>>> first_valid_indices = df.apply(lambda series: series.first_valid_index())
>>> first_valid_indices
1125400 2013-05-22 00:00:00
5430095 2013-05-28 00:00:00
1095751 2013-05-22 00:00:00
first_valid_indices will then be a Series containing the first_valid_index for each column.
You could also define the lambda function as a normal function outside:
def first_valid_index(series):
    return series.first_valid_index()
and then call apply like this:
df.apply(first_valid_index)
The built in function DataFrame.groupby().column.first() returns the first non null value in the column, while last() returns the last.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.GroupBy.first.html
If you don't have a natural group to use, you can add a dummy column of 1s, then get the first (or last) non-null value using the groupby and first/last functions:
from pandas import DataFrame
df = DataFrame({'a': [None, 1, None], 'b': [None, 2, None]})
df['dummy'] = 1
df.groupby('dummy').first()
df.groupby('dummy').last()
By compute I assume you mean access?
The simplest way to do this is with the pd.Series.first_valid_index() method probably inside a dict comprehension:
values = {col : DF.loc[DF[col].first_valid_index(), col] for col in DF.columns}
values
Just to be clear, each column in a pandas DataFrame is a Series. So the above is the same as doing:
values = {}
for column in DF.columns:
    First_Non_Null_Index = DF[column].first_valid_index()
    values[column] = DF.loc[First_Non_Null_Index, column]
So the operation in my one line solution is on a per column basis. I.e. it is not going to create the type of error you seem to be suggesting in the edit you made to the question. Let me know if it does not work as expected.
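A runnable sketch of the dict-comprehension approach with toy data (note that first_valid_index returns None for an all-NaN column, which would make the .loc lookup fail):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 1.0, 2.0],
                   'b': [3.0, np.nan, 4.0]})

# first non-NaN value of each column, via first_valid_index on each Series
values = {col: df.loc[df[col].first_valid_index(), col] for col in df.columns}
print(values)  # {'a': 1.0, 'b': 3.0}
```

Each column yields its own first valid label, so the lookup is per-column, not per-row.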
