Filter DataFrame after sklearn.feature_selection - python

I reduce dimensionality of a dataset (pandas DataFrame).
X = df.as_matrix()
sel = VarianceThreshold(threshold=0.1)
X_r = sel.fit_transform(X)
then I wanto to get back the reduced DataFrame (i.e. keep only ok columns)
I found only this ugly way to do so, which is very inefficient, do you have any cleaner idea?
cols_OK = sel.get_support() # which columns are OK?
c = list()
for i, col in enumerate(cols_OK):
if col:
c.append(df.columns[i])
return df[c]

I think you need if return mask:
cols_OK = sel.get_support()
df = df.loc[:, cols_OK]
and if return indices:
cols_OK = sel.get_support()
df = df.iloc[:, cols_OK]

Related

Find percent difference between two columns, that share same root name, but differ in suffix

My question is somewhat similar to subtracting-two-columns-named-in-certain-pattern
I'm having trouble performing operations on columns that share the same root substring, without a loop. Basically I want to calculate a percentage change using columns that end with '_PY' with another column that shares the same name except for the suffix.
What's a possible one line solution, or one that doesn't involve a for loop?
url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']
PY_cols = [col for col in df.columns if col.endswith("PY")]
reg_cols = [col.split("_PY")[0] for col in PY_cols]
for k,v in zip(reg_cols,PY_cols):
df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
df
You can use:
v = (df[df.columns[df.columns.str.endswith('_PY')]]
.rename(columns=lambda x: x.rsplit('_', maxsplit=1)[0]))
k = df[v.columns]
out = pd.concat([df, k.sub(v).div(v).mul(100).round(2).add_suffix('_YOY%')], axis=1)
Gotta subset the df into the columns you need. Then zip will pull the pairs you need to do the percent calculation.
url = r'https://www2.arccorp.com/globalassets/forms/corpstats.csv?1653338666304'
df = pd.read_csv(url)
df = df[df['TYPE'] == 'M']
df_cols = [col for col in df.columns]
PY_cols = [col for col in df.columns if col.endswith("PY")]
# find the matching column, where the names match without the suffix.
PY_use = [col for col in PY_cols if col.split("_PY")[0] in df_cols]
df_use = [col.split("_PY")[0] for col in PY_use]
for k,v in zip(df_use,PY_use):
df[f"{k}_YOY%"] = round((df[k] - df[v]) / df[v] * 100,2)
You can take advantage of numpy:
py_df_array = (df[df_use].values, df[PY_use].values)
perc_dif = np.round((py_df_array[0] - py_df_array[1]) / py_df_array[1] * 100, 2)
df_perc = pd.DataFrame(perc_def, columns=[f"{col}_YOY%" for col in df_use])
df = pd.concat([df, df_perc], axis=1)

Too many columns resulting in `PerformanceWarning: DataFrame is highly fragmented`

I have a list of filepaths in the first column of a dataframe. My goal is to create a second column that represents file categories, with categories reflecting the words in the filepath.
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
df["Animal"] =(df['filepath'].str.contains("dog|cat",case=False,regex=True))
df["Fish"] =(df['filepath'].str.contains("barracuda",case=False))
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
original_columns = x.index.tolist()
squeezed = x.dropna()
squeezed.index = [original_columns[n] for n in range(squeezed.count())]
return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)
This code works. The problem arises when I have 200 statements beginning with df['columnName'] =. Because I have so many, I get the error:
PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
To fix this I have tried:
dfAnimal = df.copy
dfAnimal['Animal'] = dfAnimal['filepath'].str.contains("dog|cat",case=False,regex=True)
dfFish = df.copy
dfFish["Fish"] =dfFish['filepath'].str.contains("barracuda",case=False)
df = pd.concat(dfAnimal,dfFish)
The above gives me errors such as method object is not iterable and method object is not subscriptable. I then tried df = df.loc[df['filepath'].isin(['cat','dog'])] but this only works when 'cat' or 'dog' is the only word in the column. How do I avoid the performance error?
Try creating all your new columns in a dict, and then convert that dict into a dataframe, and then use pd.concat to add the resulting dataframe (containing the new columns) to the original dataframe:
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
Added to your original code, it would be something like this:
import pandas as pd
import numpy as np
data = {'filepath': ['C:/barracuda/document.doc', 'C:/dog/document.doc', 'C:/cat/document.doc']
}
df = pd.DataFrame(data)
##### These are the new lines #####
new_columns = {
'Animal': df['filepath'].str.contains("dog|cat",case=False,regex=True),
'Fish': df['filepath'].str.contains("barracuda",case=False),
}
new_df = pd.DataFrame(new_columns)
df = pd.concat([df, new_df], axis=1)
##### End of new lines #####
df = df.loc[:, 'filepath':'Fish'].replace(True, pd.Series(df.columns, df.columns))
df = df.loc[:, 'filepath':'Fish'].replace(False,np.nan)
def squeeze_nan(x):
original_columns = x.index.tolist()
squeezed = x.dropna()
squeezed.index = [original_columns[n] for n in range(squeezed.count())]
return squeezed.reindex(original_columns, fill_value=np.nan)
df = df.apply(squeeze_nan, axis=1)
print(df)

check for value in pandas Dataframe cell which has a list

I have the following df:
df = pd.DataFrame(columns=['Place', 'PLZ','shortName','Parzellen'])
new_row1 = {'Place':'Winterthur', 'PLZ':[8400, 8401, 8402, 8404, 8405, 8406, 8407, 8408, 8409, 8410, 8411], 'shortName':'WIN', 'Parzellen':[]}
new_row2 = {'Place':'Opfikon', 'PLZ':[8152], 'shortName':'OPF', 'Parzellen':[]}
new_row3 = {'Place':'Stadel', 'PLZ':[8174], 'shortName':'STA', 'Parzellen':[]}
new_row4 = {'Place':'Kloten', 'PLZ':[8302], 'shortName':'KLO', 'Parzellen':[]}
new_row5 = {'Place':'Niederhasli', 'PLZ':[8155,8156], 'shortName':'NIH', 'Parzellen':[]}
new_row6 = {'Place':'Bassersdorf', 'PLZ':[8303], 'shortName':'BAS', 'Parzellen':[]}
new_row7 = {'Place':'Oberglatt', 'PLZ':[8154], 'shortName':'OBE', 'Parzellen':[]}
new_row8 = {'Place':'Bülach', 'PLZ':[8180], 'shortName':'BUE', 'Parzellen':[]}
df = df.append(new_row1, ignore_index=True)
df = df.append(new_row2, ignore_index=True)
df = df.append(new_row3, ignore_index=True)
df = df.append(new_row4, ignore_index=True)
df = df.append(new_row5, ignore_index=True)
df = df.append(new_row6, ignore_index=True)
df = df.append(new_row7, ignore_index=True)
df = df.append(new_row8, ignore_index=True)
print (df)
Now I have a number like 8405 and I want to know the Place or whole Row which has this number under df['PLZ'].
I also tried with classes but it was hard to get all Numbers of all Objects because I want to be able to call all PLZ in a list and also check, if I have any number, to which Place it belongs. Maybe there is an obvious better way and I just don't know it.
try with boolean masking and map() method:
df[df['PLZ'].map(lambda x:8405 in x)]
OR
via boolean masking and agg() method:
df[df['PLZ'].agg(lambda x:8405 in x)]
#you can also use apply() in place of agg
output of above code:
Place PLZ shortName Parzellen
0 Winterthur [8400, 8401, 8402, 8404, 8405, 8406, 8407, 840... WIN []

Pandas set element style dependent on another dataframe mith multi index

I have previously asked the question Pandas set element style dependent on another dataframe, which I have a working solution to, but now I am trying to apply it to a data frame with a multi index and I am getting an error, which I do not understand.
Problem
I have a pandas df and accompanying boolean matrix. I want to highlight the df depending on the boolean matrix.
Data
import pandas as pd
import numpy as np
from datetime import datetime
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
i = len(date)
dic = {'X':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
'Y':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B']),
'Z':pd.DataFrame(np.random.randn(i, 2),index = date, columns = ['A','B'])}
df = pd.concat(dic.values(),axis=1,keys=dic.keys())
boo = [True, False]
bool_matrix = {'X':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
'Y':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B']),
'Z':pd.DataFrame(np.random.choice(boo, (i,2), p=[0.3,.7]), index = date, columns = ['A','B'])}
bool_matrix =pd.concat(bool_matrix.values(),axis=1,keys=bool_matrix.keys())
My attempted solution
def highlight(value):
return 'background-color: green'
my_style = df.style
for column in df.columns:
for i in df[column].index:
data = bool_matrix.loc[i, column]
if data:
my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i, column])
my_style
Results
The above throws an AttributeError: 'Series' object has no attribute 'applymap'
I do not understand what is returning as a Series. This is a single value I am subsetting and this solution worked for non multi-indexed df's as shown below.
Without Multi-index
import pandas as pd
import numpy as np
from datetime import datetime
np.random.seed(24)
date = pd.date_range(start = datetime(2016,1,1), end = datetime(2016,2,1), freq = "D")
df = pd.DataFrame({'A': np.linspace(1, 100, len(date))})
df = pd.concat([df, pd.DataFrame(np.random.randn(len(date), 4), columns=list('BCDE'))],
axis=1)
df['date'] = date
df.set_index("date", inplace = True)
boo = [True, False]
bool_matrix = pd.DataFrame(np.random.choice(boo, (len(date), 5),p=[0.3,.7]), index = date,columns=list('ABCDE'))
def highlight(value):
return 'background-color: green'
my_style = df.style
for column in df.columns:
for i in bool_matrix.index:
data = bool_matrix.loc[i, column]
if data:
my_style = df.style.use(my_style.export()).applymap(highlight, subset = pd.IndexSlice[i,column])
my_style
Documentation
The docs make reference to CSS Classes and say that "Index label cells include level where k is the level in a MultiIndex." I am obviouly indexing this wrong, but am stumped on how to proceed.
It's very nice that there is a runable example.
You can use df.style.apply(..., axis=None) to apply a highlight method to the whole dataframe.
With your df and bool_matrix, try this:
def highlight(value):
d = value.copy()
for c in d.columns:
for r in df.index:
if bool_matrix.loc[r, c]:
d.loc[r, c] = 'background-color: green'
else:
d.loc[r, c] = ''
return d
df.style.apply(highlight, axis=None)
Or to make codes simple, you can try:
def highlight(value):
return bool_matrix.applymap(lambda x: 'background-color: green' if x else '')
df.style.apply(highlight, axis=None)
Hope this is what you need.

Python & Pandas: How to return a copy of a dataframe?

Here is the problem. I use a function to return a randomized data,
data1 = [3,5,7,3,2,6,1,6,7,8]
data2 = [1,5,2,1,6,4,3,2,7,8]
df = pd.DataFrame(data1, columns = ['c1'])
df['c2'] = data2
def randomize_data(df):
df['c1_ran'] = df['c1'].apply(lambda x: (x + np.random.uniform(0,1)))
df['c1']=df['c1_ran']
# df.drop(['c1_ran'], 1, inplace=True)
return df
temp_df = randomize_data(df)
display(df)
display(temp_df)
However, the df (source data) and the temp_df (randomized_data) is the same. Here is the result:
How can I make the temp_df and df different from each other?
I find I can get rid of the problem by adding df.copy() at the beginning of the function
def randomize_data(df):
df = df.copy()
But I'm not sure if this is the right way to deal with it?
Use DataFrame.assign():
def randomize_data(df):
return df.assign(c1=df.c1 + np.random.uniform(0, 1, df.shape[0]))
I think you are right, and DataFrame.copy() have an optional argument 'deep'. You can find details in http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.copy.html

Categories

Resources