I am trying to write a function that spots the columns with "100" in the header and replaces the values in those columns with NaN depending on multiple criteria.
I also want the function to return the value of the column "first_column" corresponding to each outlier.
For instance, let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values:
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5'],
        }
df = pd.DataFrame(data)
print (df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(axis=1), ['first_column']+list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
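Since the question asks for a function, here is a minimal sketch that wraps the two steps above into one helper (the function name and the lower/upper parameters are my own, purely illustrative):
import pandas as pd

def mask_100_columns(df, lower=0, upper=100):
    # select the "100" columns and convert them to numbers
    df2 = df.filter(like='100').astype(int)
    # flag values outside the allowed range
    mask = df2.lt(lower) | df2.gt(upper)
    # NaN-out the flagged cells in the original frame
    masked = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
    # rows with at least one outlier, keyed by "first_column"
    outliers = df.loc[mask.any(axis=1), ['first_column'] + list(df2.columns)]
    return masked, outliers

masked_df, outlier_rows = mask_100_columns(df)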
I'm doing something wrong when attempting to set a column for a masked subset of rows to the substring extracted from another column.
Here is some example code that illustrates the problem I am facing:
import pandas as pd
data = [
{'type': 'A', 'base_col': 'key=val'},
{'type': 'B', 'base_col': 'other_val'},
{'type': 'A', 'base_col': 'key=val'},
{'type': 'B', 'base_col': 'other_val'}
]
df = pd.DataFrame(data)
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')
print("df:")
print(df)
print("mask:")
print(mask)
print("extraction:")
print(df[mask]['base_col'].str.extract(r'key=(.*)'))
The output I get from the above code is as follows:
df:
type base_col derived_col
0 A key=val NaN
1 B other_val NaN
2 A key=val NaN
3 B other_val NaN
mask:
0 True
1 False
2 True
3 False
Name: type, dtype: bool
extraction:
0
0 val
2 val
The boolean mask is as I expect, and the extracted substrings on the subset of rows (indexes 0 and 2) are also as I expect, yet the new derived_col comes out as all NaN. The output I would expect in derived_col is 'val' for indexes 0 and 2, and NaN for the other two rows.
Please clarify what I am getting wrong here. Thanks!
You should assign a Series, not a DataFrame: str.extract returns a DataFrame here, so you should pick column 0.
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df[mask]['base_col'].str.extract(r'key=(.*)')[0]
df
Out[449]:
type base_col derived_col
0 A key=val val
1 B other_val NaN
2 A key=val val
3 B other_val NaN
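As a side note, a minimal sketch of an alternative: str.extract with expand=False returns a Series directly when the pattern has a single capture group, so you do not have to pick column 0 explicitly:
mask = df['type'] == 'A'
df.loc[mask, 'derived_col'] = df.loc[mask, 'base_col'].str.extract(r'key=(.*)', expand=False)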
I have input dataframe df1 like this:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
I want to replicate the rows in df1 with changes in the region based on the column name. Let's say for this example I need the output below, as all three regions are in the column name:
url_asia_us_emea
https://asia/image
https://asia/location
https://asia/video
https://us/image
https://us/location
https://us/video
https://emea/image
https://emea/location
https://emea/video
You could do something like this:
list_strings = [ 'us', 'ema', 'fr']
df_orig =pd.read_csv("ack.csv", sep=";")
which is
url
0 https://asia/image
1 https://asia/location
2 https://asia/video
and then
d = []
for element in list_strings:
    # re-read the original file and swap "asia" for the current region
    df = pd.read_csv("ack.csv", sep=";")
    df['url'] = df['url'].replace({'asia': str(element)}, regex=True)
    d.append(df)
df = pd.concat(d)
DF = pd.concat([df, df_orig])
which results in
index url
0 https://us/image
1 https://us/location
2 https://us/video
3 https://ema/image
4 https://ema/location
5 https://ema/video
6 https://fr/image
7 https://fr/location
8 https://fr/video
9 https://asia/image
10 https://asia/location
11 https://asia/video
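A sketch of a more compact variant (assuming, as in the question, that the source column is called url and that you want the us and emea copies plus the original asia rows) builds the copies in memory instead of re-reading the CSV for every region:
import pandas as pd

df_orig = pd.read_csv("ack.csv", sep=";")
regions = ['us', 'emea']

# one copy of the frame per region, with "asia" swapped for that region
variants = [df_orig.assign(url=df_orig['url'].str.replace('asia', r, regex=False))
            for r in regions]
out = pd.concat([df_orig, *variants], ignore_index=True)
print(out)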
My dataframe looks like this :
ID VALUE1 VALUE2 VALUE3
1 NaN [ab,c] Good
1 google [ab,c] Good
2 NaN [ab,c1] NaN
2 First [ab,c1] Good1
2 First [ab,c1]
3 NaN [ab,c] Good
Requirement is :
ID is the key. I have 3 rows for ID 2. So, I need to merge two rows into 1 row such that I have valid values (excluding Nulls and spaces) for all the columns.
My expected output is:
ID VALUE1 VALUE2 VALUE3
1 google [ab,c] Good
2 First [ab,c1] Good1
3 NaN [ab,c] Good
Do we have any pandas function to achieve this or should I have to seperate the data into two or more dataframes and process for merging based on NaN/spaces?
Thanks for your help
Micheal G has a more elegant solution above.
Here is my more time-consuming and amateur approach:
import pandas as pd
import numpy as np
df = pd.DataFrame({"ID": [1,1,2,2,2,3],
"V1": [np.nan,'google',np.nan,'First','First',np.nan],
"V2": [['ab','c'],['ab','c'],['ab','c1'],['ab','c1'],['ab','c1'],['ab','c']],
"V3": ['Good','Good',np.nan,np.nan,'Good1','Good']
})
uniq = df.ID.unique()  # get the unique values in ID
df = df.set_index(['ID'])  # we want the row with the fewest NaN's per ID;
# setting the index to ID makes the statements below faster and easier.
newDf = pd.DataFrame()
for i in uniq:  # run the loop once per unique value in column ID
    temp = df.loc[i]
    if isinstance(temp, pd.Series):  # only one row with this ID: add that row to our new DataFrame
        newDf = newDf.append(temp)
    else:
        NonNanCountSeries = temp.apply(lambda x: x.count(), axis=1)
        # Get the number of non-NaN's in each row; tolist() turns it into a list.
        NonNanCountList = NonNanCountSeries.tolist()
        newDf = newDf.append(temp.iloc[NonNanCountList.index(max(NonNanCountList))])
        # Let's break this down.
        # Find the max in NonNanCountList: max(NonNanCountList)
        # Find the index of that max, i.e. the row number with the
        # most non-NaN's: NonNanCountList.index(max(NonNanCountList))
        # Get the row by passing that index into temp.iloc
        # Append that row to newDf
print(newDf)
Which should return:
V1 V2 V3
1 google [ab, c] Good
2 First [ab, c1] Good1
3 NaN [ab, c] Good
Note, I capitalised Google.
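For reference, the more elegant approach mentioned above usually comes down to a single groupby, because GroupBy.first() returns the first non-null value per column. A minimal sketch with scalar columns (blank cells are converted to NaN first so they are skipped too; list-valued columns like VALUE2 may need extra care):
import pandas as pd
import numpy as np

df = pd.DataFrame({'ID': [1, 1, 2, 2, 2, 3],
                   'VALUE1': [np.nan, 'google', np.nan, 'First', 'First', np.nan],
                   'VALUE3': ['Good', 'Good', np.nan, 'Good1', '', 'Good']})

out = (df.replace('', np.nan)          # treat blank cells as missing
         .groupby('ID', as_index=False)
         .first())                     # first non-null value per column within each ID
print(out)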
import pandas as pd
import numpy as np
data = {'ID' : [1,1,2,2,2,3], 'VALUE1':['NaN','Google','NaN', 'First', 'First','NaN'], 'VALUE2':['abc', 'abc', 'abc1', 'abc1', 'abc1', 'abc'], 'VALUE3': ['Good', 'Good', 'NaN', 'Good1', '0', 'Good']}
df = pd.DataFrame(data)
df_ = df.replace('NaN', np.nan).fillna('zero', inplace=False)
df2 = df_.sort_values(['VALUE1', 'ID'])
mask = df2.ID.duplicated()
print (df_[~mask])
Output
ID VALUE1 VALUE2 VALUE3
1 1 Google abc Good
3 2 First abc1 Good1
5 3 zero abc Good
Finally, just be aware that the tilde character (~) in the mask is essential: duplicated() marks every repeated ID as True, so negating it keeps only the first (sorted-to-top) row for each ID.
In the following dataframe
#Create data
data = {'Day': [1,1,2,2,3,3],
'Where': ['A','B','A','B','B','B'],
'What': ['x','y','x','x','x','y'],
'Dollars': [100,200,100,100,100,200]}
index = range(len(data['Day']))
columns = ['Day','Where','What','Dollars']
df = pd.DataFrame(data, index=index, columns=columns)
df
I would like to add a column with the future values. In this case, the first value should be 100 as on day 2 at A x was sold for 100 dollars. The complete column should contain the values 100, None, None, 100, None, None.
I thought that I could index the cells in the following way
df2 = df
df2['Tomorrow_Dollars'] = df[df.Day == df2.Day+1,'Dollars']
but this throws the following error
TypeError: 'Series' objects are mutable, thus they cannot be hashed
Is there a solution to this or a smarter approach?
The idea is to create the missing combinations by reindexing with MultiIndex.from_product, reshape with unstack so there is one row per unique Day and a shift becomes possible, then reshape back and join for the new column:
df1 = df.set_index(['Day','Where','What'])
mux = pd.MultiIndex.from_product(df1.index.levels, names=df1.index.names)
s = df1.reindex(mux)['Dollars'].unstack([1,2]).shift(-1).unstack().rename('Tomorrow_Dollars')
df = df.join(s, on=['Where','What','Day'])
print (df)
Day Where What Dollars Tomorrow_Dollars
0 1 A x 100 100.0
1 1 B y 200 NaN
2 2 A x 100 NaN
3 2 B x 100 100.0
4 3 B x 100 NaN
5 3 B y 200 NaN
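A simpler alternative is a self-merge: shift the Day key back by one so that tomorrow's sale lines up with today's row, then merge on Day, Where and What. A minimal sketch (assuming "tomorrow" always means Day + 1):
nxt = df.assign(Day=df['Day'] - 1).rename(columns={'Dollars': 'Tomorrow_Dollars'})
out = df.merge(nxt[['Day', 'Where', 'What', 'Tomorrow_Dollars']],
               on=['Day', 'Where', 'What'], how='left')
print(out)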
I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
        lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop weird index
print(y)
You can try the code block below:
#create the Dataframe
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})
#Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
#Concat those groups to create column2
df2 = (pd.concat([b, a])
       .sort_values(by='label')
       .rename(columns={'column1': 'column2'})
       .reset_index()
       .drop('index', axis=1))
#Merge with the original Dataframe on the index (both frames are aligned row by row)
df = df.merge(df2.drop(columns='label'), left_index=True, right_index=True)[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data = {'label' :['a', 'a', 'b', 'b'],
'column1' :[1,2, 6,4]})
# iterate over the dataframe; for each row find the row with the same label and the other value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set the value in the new column
    df.at[index, 'column2'] = newvalue
df.head()
You can use groupby with apply to create a new Series in reversed order:
df['column2'] = df.groupby('label')["column1"] \
.apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
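If you prefer to skip building an intermediate Series, transform with a reversed slice gives the same result (a minimal sketch, assuming each label group should simply be reversed as above):
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.values[::-1])
print(df)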