Find outliers in a dataframe and fill with NaN in Python - python

I am trying to write a function that finds the columns with "100" in the header and replaces every value above 100 in those columns with NaN:
import pandas as pd
data = {'first_100': ['25', '1568200', '5'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_column':['first_value', 'second_value', 'third_value'],
}
df = pd.DataFrame(data)
print (df)
so this is the output I am looking for

Use filter to identify the columns with '100', to_numeric to ensure having numeric values, then mask with a boolean array:
cols = df.filter(like='100').columns
df[cols] = df[cols].mask(df[cols].apply(pd.to_numeric, errors='coerce').gt(100))
output:
   first_100  second_column  third_100  fourth_column
0  25         first_value    89         first_value
1  NaN        second_value   9          second_value
2  5          third_value    NaN        third_value
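Since the question asks for a function, the two lines above could be wrapped up roughly as follows. This is only a sketch: the name `mask_above` and its parameters `token` and `limit` are made up for illustration and are not part of pandas.

```python
import pandas as pd


def mask_above(df, token="100", limit=100):
    """Return a copy of df where values above `limit`, in columns whose
    name contains `token`, are replaced with NaN."""
    out = df.copy()
    cols = out.filter(like=token).columns
    # Strings that cannot be parsed become NaN here and are never flagged.
    numeric = out[cols].apply(pd.to_numeric, errors="coerce")
    out[cols] = out[cols].mask(numeric.gt(limit))
    return out


data = {
    "first_100": ["25", "1568200", "5"],
    "second_column": ["first_value", "second_value", "third_value"],
    "third_100": ["89", "9", "589"],
    "fourth_column": ["first_value", "second_value", "third_value"],
}
cleaned = mask_above(pd.DataFrame(data))
print(cleaned)
```

Note that the surviving values stay as strings; only the comparison is done on the numeric conversion.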

Related

How to join multiple dataframe columns based on row index to specified column?

PROBLEM STATEMENT:
I'm trying to join multiple pandas data frame columns, based on row index, to a single column already in the data frame. Issues seem to happen when the data in a column is read in as np.nan.
EXAMPLE:
Original Data frame
time  msg   d0  d1  d2
0     msg0  a   b   c
1     msg1  x   x   x
2     msg0  a   b   c
3     msg2  1   2   3
What I want, if I were to filter for msg0 and msg2
time  msg   d0   d1   d2
0     msg0  abc  NaN  NaN
1     msg1  x    x    x
2     msg0  abc  NaN  NaN
3     msg2  123  NaN  NaN
MY ATTEMPT:
import numpy as np
import pandas as pd
df = pd.DataFrame({'time': ['0', '1', '2', '3'],
'msg': ['msg0', 'msg1', 'msg0', 'msg2'],
'd0': ['a', 'x', 'a', '1'],
'd1': ['b', 'x', 'b', '2'],
'd2': ['c', 'x', np.nan, '3']})
mask = df.index[((df['msg'] == "msg0") |
(df['msg'] == "msg1") |
(df['msg'] == "msg3"))].tolist()
# Is there a better way to combine all columns after a certain point?
# This works fine here but has issues when importing large data sets.
# the 'd0' will be set to NaN too, I think this is due to np.nan
# being set to some columns values when imported.
df.loc[mask, 'd0'] = df['d0'] + df['d1'] + df['d2']
df.iloc[mask, 3:] = "NaN"
The approach is somewhat similar to @mozway's answer, but I will spell it out in more detail to make it easier to follow.
1- Define your target columns and messages (just to make it easier to deal with)
# the messages to filter
msgs = ["msg0", "msg2"]
# the columns to filter
columns = df.columns.drop(['time', 'msg'])
# the column to contain the result
total_col = ["d0"]
2- Mask the rows based on the (msgs) column value
mask = df['msg'].isin(msgs)
3- Find the value of the combined values
# a- mask the dataframe to the target columns and rows.
# b- apply ''.join() to join all the column values
# c- to join columns not rows apply on axis = 1
new_total_col = df.loc[mask, columns].apply(lambda x: ''.join(x.dropna().astype(str)), axis=1)
4- Set all target columns and rows to np.nan and redefine the values of the "total" column
df.loc[mask, columns] = np.nan
df.loc[mask, total_col] = new_total_col
Result
   time  msg   d0   d1   d2
0  0     msg0  abc  NaN  NaN
1  1     msg1  x    x    x
2  2     msg0  ab   NaN  NaN
3  3     msg2  123  NaN  NaN
You can use:
cols = ['d0', 'd1', 'd2']
# get the rows matching the msg condition
m = df['msg'].isin(['msg0', 'msg2'])
# select the relevant columns for those rows
# concatenate the non-NaN values of each row
# convert to a one-column DataFrame so the other columns are assigned NaN
df.loc[m, cols] = (df
.loc[m, cols]
.agg(lambda r: ''.join(r.dropna()), axis=1)
.rename(cols[0]).to_frame()
)
print(df)
Output:
   time  msg   d0   d1   d2
0  0     msg0  abc  NaN  NaN
1  1     msg1  x    x    x
2  2     msg0  ab   NaN  NaN
3  3     msg2  123  NaN  NaN
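To see why the original attempt blanks out 'd0', and to package the fix, something like the following might help. It is a sketch only: `collapse_columns` is a made-up helper name, and it assumes the column is called 'msg' as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "time": ["0", "1", "2", "3"],
    "msg": ["msg0", "msg1", "msg0", "msg2"],
    "d0": ["a", "x", "a", "1"],
    "d1": ["b", "x", "b", "2"],
    "d2": ["c", "x", np.nan, "3"],
})

# Concatenating with + propagates NaN: one missing cell makes the
# whole concatenated value for that row NaN.
naive = df["d0"] + df["d1"] + df["d2"]


def collapse_columns(df, msgs, cols):
    """For rows whose msg is in `msgs`, join `cols` into cols[0] and
    set the remaining columns of those rows to NaN."""
    out = df.copy()
    rows = out["msg"].isin(msgs)
    joined = out.loc[rows, cols].apply(
        lambda r: "".join(r.dropna().astype(str)), axis=1
    )
    out.loc[rows, cols] = np.nan
    out.loc[rows, cols[0]] = joined
    return out


result = collapse_columns(df, ["msg0", "msg2"], ["d0", "d1", "d2"])
print(naive)
print(result)
```

Dropping the missing cells before joining is what keeps row 2 as 'ab' instead of NaN.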

create dataframe with outliers and then replace with nan

I am trying to make a function to spot the columns with "100" in the header and replace the values in these columns with NaN depending on multiple criteria.
I also want the function to return the value of the column "first_column" for each outlier.
For instance, let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values:
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_100':['25', '1568200', '5'],
}
df = pd.DataFrame(data)
print (df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column']+list(df.filter(like='100'))]
output 1:
   first_column   second_column  third_100  fourth_100
0  product_name   first_value    89         25
1  product_name2  second_value   9          NaN
2  product_name3  third_value    NaN        5
output 2:
   first_column   third_100  fourth_100
1  product_name2  9          1568200
2  product_name3  589        5
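If you want this as a single function with adjustable bounds, a rough sketch could look like this. The name `flag_outliers` and its parameters are hypothetical, and it uses `pd.to_numeric` instead of `astype(int)` so that unparsable strings do not raise.

```python
import pandas as pd


def flag_outliers(df, token="100", low=0, high=100, id_col="first_column"):
    """Return (cleaned, report): `cleaned` has out-of-range values replaced
    with NaN, `report` lists the rows that had at least one outlier."""
    target = df.filter(like=token)
    numeric = target.apply(pd.to_numeric, errors="coerce")
    mask = numeric.lt(low) | numeric.gt(high)
    # Extend the mask to every column so it lines up with df.
    cleaned = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
    report = df.loc[mask.any(axis=1), [id_col] + list(target.columns)]
    return cleaned, report


data = {
    "first_column": ["product_name", "product_name2", "product_name3"],
    "second_column": ["first_value", "second_value", "third_value"],
    "third_100": ["89", "9", "589"],
    "fourth_100": ["25", "1568200", "5"],
}
cleaned, report = flag_outliers(pd.DataFrame(data))
print(cleaned)
print(report)
```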

Replacing column indexes with the row below [duplicate]

The data I have to work with is a bit messy: it has header names inside its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?
I want to do something like:
header = df[df['old_header_name1'] == 'new_header_name1']
df.columns = header
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])
In [22]: df
Out[22]:
   0    1    2
0  1    2    3
1  foo  bar  baz
2  4    5    6
Set the column labels to equal the values in the 2nd row (index location 1):
In [23]: df.columns = df.iloc[1]
If the index has unique labels, you can drop the 2nd row using:
In [24]: df.drop(df.index[1])
Out[24]:
1  foo  bar  baz
0  1    2    3
2  4    5    6
If the index is not unique, you could use:
In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]:
1  foo  bar  baz
0  1    2    3
2  4    5    6
Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).
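A small illustration of that pitfall, using a made-up index with a repeated label "b":

```python
import pandas as pd

df = pd.DataFrame(
    [(1, 2, 3), ("foo", "bar", "baz"), (4, 5, 6)],
    index=["a", "b", "b"],  # hypothetical non-unique index
)

# Label-based: df.index[1] is "b", so every row labelled "b" is dropped.
by_label = df.drop(df.index[1])

# Position-based: only the row at position 1 is dropped.
by_position = df.iloc[pd.RangeIndex(len(df)).drop(1)]

print(len(by_label), len(by_position))
```

The label-based call leaves one row, while the positional one leaves two.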
This works (pandas v'0.19.2'):
df.rename(columns=df.iloc[0])
It would be easier to recreate the data frame.
This would also interpret the columns types from scratch.
headers = df.iloc[0]
new_df = pd.DataFrame(df.values[1:], columns=headers)
To rename the header without reassigning df:
df.rename(columns=df.iloc[0], inplace = True)
To drop the row without reassigning df:
df.drop(df.index[0], inplace = True)
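The rename and drop steps can also be combined into one small helper. This is just a sketch: `promote_row_to_header` is a made-up name, and the final `infer_objects()` call is an optional extra, added on the assumption that you want pandas to re-infer dtypes once the header row is gone.

```python
import pandas as pd


def promote_row_to_header(df, pos):
    """Use the row at integer position `pos` as column labels and drop it."""
    out = df.copy()
    out.columns = out.iloc[pos].tolist()
    # Select by position so duplicate index labels cannot cause extra drops.
    keep = [i for i in range(len(out)) if i != pos]
    out = out.iloc[keep].reset_index(drop=True)
    # Optional: the columns are object dtype at this point; re-infer them.
    return out.infer_objects()


df = pd.DataFrame([(1, 2, 3), ("foo", "bar", "baz"), (4, 5, 6)])
result = promote_row_to_header(df, 1)
print(result)
print(result.dtypes)
```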
You can specify the header row in read_csv or read_html via the header parameter, which gives the row number(s) to use as the column names and the start of the data. This has the advantage of automatically dropping all the preceding rows, which are presumably junk.
import pandas as pd
from io import StringIO
In[1]
csv = '''junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
'''
df = pd.read_csv(StringIO(csv), header=2)
print(df)
Out[1]
   pears  apples  lemons  plums  other
0  40     50      61      72     85
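One thing to watch with this sample: the fields are separated by a comma followed by a space, so the parsed column names keep that leading space unless you ask pandas to skip it. A quick sketch comparing the two, using read_csv's `skipinitialspace` parameter:

```python
import pandas as pd
from io import StringIO

csv = """junk1, junk2, junk3, junk4, junk5
junk1, junk2, junk3, junk4, junk5
pears, apples, lemons, plums, other
40, 50, 61, 72, 85
"""

# Default parsing keeps the space after each comma in the header names.
raw = pd.read_csv(StringIO(csv), header=2)

# skipinitialspace=True drops whitespace that follows the delimiter.
clean = pd.read_csv(StringIO(csv), header=2, skipinitialspace=True)

print(list(raw.columns))
print(list(clean.columns))
```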
Keeping it Python simple
The pandas DataFrame constructor accepts a columns argument, so why not use it with plain Python lists? It is much clearer what you are doing:
table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
['testsala.cxf', '86', '95', '92', '87'],
['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
['630.cxf', '18', '8', '11', '18'],
['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
['dedo uv no filtro.cxf', '52', '93', '48', '58']]
import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])
Or, if the header is not in the first row but further down (at list index 10, for instance):
columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
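If the position of the header row is not known in advance, you could search for it instead of hard-coding the index. A sketch, assuming the header row is the one whose first cell is 'name'; the junk first row below is invented for the example. Unlike the pop() version, this also discards the rows above the header.

```python
import pandas as pd

table = [
    ["exported by instrument", "", "", "", ""],  # hypothetical junk row
    ["name", "Rf", "Rg", "Rf,skin", "CRI"],
    ["testsala.cxf", "86", "95", "92", "87"],
    ["630.cxf", "18", "8", "11", "18"],
]

# Position of the first row that looks like the header.
header_pos = next(i for i, row in enumerate(table) if row[0] == "name")

# Everything after the header becomes data; rows before it are ignored.
data = pd.DataFrame(table[header_pos + 1:], columns=table[header_pos])
print(data)
```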

Python select column on the left from another column

I have a tricky problem to select column in a dataframe. I have a dataframe and multiple columns in it have the same name "PTime".
This is my dataframe:
   PTime  first_column  PTime  third_column  PTime  fourth_column
0  4      first_value   1      first_value   6      first_value
1  4      second_value  2      second_value  7      second_value
This is what I want:
   PTime  first_column  PTime  fourth_column
0  4      first_value   6      first_value
1  4      second_value  7      second_value
I will select my columns from a list:
My code:
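import pandas as pd
# Note: a dict literal keeps only the last value for a repeated key,
# so this builds a dataframe with a single 'PTime' column.
data = {'PTime': ['1', '1'],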
'first_column': ['first_value', 'second_value'],
'PTime': ['2', '2'],
'third_column': ['first_value', 'second_value'],
'PTime': ['4', '4'],
'fourth_column': ['first_value', 'second_value'],
}
list_c = ['PTime','first_column','fourth_column']
df = pd.DataFrame(data)
#df = df[df.columns.intersection(list_c)]
df = df[list_c]
df
So my goal is to select each column in the list together with the column immediately to its left. If you have any idea how to do that, thank you very much. Regards
I don't know of a direct way to select "the column to the left of" each listed column,
but here is a trick that produces the table you want:
   PTime  first_column  PTime  fourth_column
0  4      first_value   6      first_value
1  4      second_value  7      second_value
The idea is simply to remove the unwanted columns by position.
Because several columns share the same name, pandas cannot tell them apart by label, so dropping by name would remove more than the one you mean.
So first rename the duplicated columns so that every name is unique,
for example PTime1, PTime2, PTime3,
and then use positions to remove the ones you don't need:
df.drop(df.columns[i], axis=1,inplace=True)
or, without inplace:
df = df.drop(df.columns[i], axis=1)
To drop several columns at once, pass a list of positions. In your case, once the columns are renamed, it would be:
df.drop(df.columns[[2,3]],axis=1)
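One possible way to do the renaming step, as a sketch: `number_duplicates` is a made-up helper that appends a running number to any name occurring more than once, and the dataframe is rebuilt from the table in the question.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame(
    [["4", "first_value", "1", "first_value", "6", "first_value"],
     ["4", "second_value", "2", "second_value", "7", "second_value"]],
    columns=["PTime", "first_column", "PTime",
             "third_column", "PTime", "fourth_column"],
)


def number_duplicates(names):
    """Append 1, 2, 3... to every name that appears more than once."""
    totals = Counter(names)
    seen = Counter()
    renamed = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}{seen[name]}")
        else:
            renamed.append(name)
    return renamed


df.columns = number_duplicates(df.columns)
# Names are unique now, so dropping positions 2 and 3 removes exactly two columns.
trimmed = df.drop(columns=df.columns[[2, 3]])
print(trimmed)
```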
In my dataframe I will not have multiple columns with the same name. All names will be distinct.
So in the case I have ten columns to select it will be difficult to list them all in a list.
import pandas as pd
data = {'PTime1': ['1', '1'],
'first_column': ['first_value', 'second_value'],
'PTime2': ['2', '2'],
'third_column': ['first_value', 'second_value'],
'PTime3': ['4', '4'],
'fourth_column': ['first_value', 'second_value'],
}
list_c = ['first_column','fourth_column'] # columns to select
df = pd.DataFrame(data) # create the dataframe
list_index = [] # positions of the columns to keep
for col in list_c:
    index_no = df.columns.get_loc(col) # position of the listed column
    list_index.append(index_no-1) # position of the column to its left
    list_index.append(index_no) # position of the listed column itself
df = df.iloc[:, list_index] # subset the dataframe by position
df
Like this I can select the column from my list and the column on the left of each element in my list.
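One edge case to keep in mind: if a listed column is the very first one, index_no-1 becomes -1 and iloc silently picks the last column. A slightly more defensive sketch is shown below; `with_left_neighbours` is a made-up name, and it assumes all column names are unique so that get_loc returns an integer.

```python
import pandas as pd


def with_left_neighbours(df, names):
    """Select each column in `names` plus the column immediately to its left."""
    positions = []
    for name in names:
        pos = df.columns.get_loc(name)
        # Skip the neighbour when the column is already the leftmost one.
        if pos > 0 and pos - 1 not in positions:
            positions.append(pos - 1)
        if pos not in positions:
            positions.append(pos)
    return df.iloc[:, positions]


data = {
    "PTime1": ["1", "1"],
    "first_column": ["first_value", "second_value"],
    "PTime2": ["2", "2"],
    "third_column": ["first_value", "second_value"],
    "PTime3": ["4", "4"],
    "fourth_column": ["first_value", "second_value"],
}
df = pd.DataFrame(data)
selected = with_left_neighbours(df, ["first_column", "fourth_column"])
leftmost = with_left_neighbours(df, ["PTime1"])
print(selected)
```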
