Appending Entire Column into Dictionary - python

I'm working with a dataframe. If a column in the dataframe has less than a certain percentage of blanks, I want to add that column to a dictionary (and eventually turn that dictionary into a new dataframe).
features = {}
percent_is_blank = 0.4

for column in df:
    x = df[column].isna().mean()
    if x < percent_is_blank:
        features[column] = ??

new_df = pd.DataFrame.from_dict([features], columns=features.keys())
What would go in the "??"?

I think it is better to filter with DataFrame.loc:
new_df = df.loc[:, df.isna().mean() < percent_is_blank]
In your solution it is possible to use:
for column in df:
    x = df[column].isna().mean()
    if x < percent_is_blank:
        features[column] = df[column]
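
A minimal, self-contained sketch of the loc one-liner on toy data (the column names and values are invented for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5],            # 20% blank -> kept
    'b': [np.nan, np.nan, np.nan, 4, 5],  # 60% blank -> dropped
    'c': [1, 2, 3, 4, 5],                 # 0% blank -> kept
})

percent_is_blank = 0.4
new_df = df.loc[:, df.isna().mean() < percent_is_blank]
print(new_df.columns.tolist())  # ['a', 'c']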

Related

replace NaNs with 0 for df columns where column name contains specific string (pandas)

I have a dataframe as a result of a pivot which has several thousand columns (representing time-boxed attributes). Below is a much shortened version for illustration.
import numpy as np
import pandas as pd

d = {'incount - 14:00': [1, np.nan, 1, 1, np.nan, np.nan, np.nan, np.nan, 1],
     'incount - 15:00': [2, 1, 2, np.nan, np.nan, np.nan, 1, 4, np.nan],
     'outcount - 14:00': [2, np.nan, 1, 1, 1, 1, 2, 2, 1],
     'outcount - 15:00': [2, 2, 1, 1, np.nan, 2, np.nan, 1, 1]}
df = pd.DataFrame(data=d)
I want to replace the NaNs in columns that contain "incount" with 0 (leaving other columns untouched). I have tried the following, but predictably it does not recognise the column name:
df['incount'] = df['incount'].fillna(0)
I need the ability to search the column names and only impact those containing a defined string.
try this:
m = df.columns[df.columns.str.startswith('incount')]
df.loc[:, m] = df.loc[:, m].fillna(0)
print(df)
you can use:
# get the columns containing "incount" as a list
loop_cols = list(df.columns[df.columns.str.contains('incount', na=False)])
# or
# loop_cols = [col for col in df.columns if 'incount' in col]

print(loop_cols)
'''
['incount - 14:00', 'incount - 15:00']
'''

for i in loop_cols:
    df[i] = df[i].fillna(0)
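
An alternative without the loop: fillna also accepts a dict that maps column names to fill values, so the whole replacement can be a single call (a sketch assuming the same df as above):

df = df.fillna({col: 0 for col in df.columns if 'incount' in col})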

How to fix displaced rows in excel with pandas?

I need to fix a large excel database where in some columns some cells are blank and all the data from the row is moved one cell to the right.
For example: in the screenshot from the original post, the first cell of the last row is blank. I need a script that detects this and then moves all the values in that row one cell to the left.
I'm trying to do it with the function below. vencli_col is the dataset; df1 and df2 are copies. In df2 I drop column 12, which is where the error originates. I index the rows where the error happens and then try to replace them with the values from df2.
import numpy as np

df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan

i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)

for j in col:
    df1.iloc[j] = df2.iloc[j]

df1.head(25)
You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted
# This is assuming the error column is the second one as in the image you have
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]
# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index
# Shift the data 1 column to the left
# The right-hand side has to be converted to numpy, because otherwise the
# column names prevent the values from being copied into the destination columns
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: just realised there is an easier way using df.shift()
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
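
A minimal end-to-end sketch of the shift() version; the frame and column names here are invented, with 'id' standing in for the intact first column and 'col1' for the column where the blank shows up:

import pandas as pd

df = pd.DataFrame({
    'id':   ['r1', 'r2'],
    'col1': ['a',  ''],   # second row is blank: its values sit one cell too far right
    'col2': ['b',  'x'],
    'col3': ['c',  'y'],
})

columns_to_shift = df.columns[1:]
shifted_indexes = df[df['col1'] == ''].index
df.loc[shifted_indexes, columns_to_shift] = (
    df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
)
print(df)
#    id col1 col2 col3
# 0  r1    a    b    c
# 1  r2    x    y  NaN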

Pandas - contains from other DF

I have 2 dataframes, DFA and DFB (both shown as screenshots in the original post).
I need to check every row in DFA['item'] to see if it contains any of the values in DFB['original'], and if it does, add a new column DFA['my'] with the corresponding value from DFB['my']. The desired result is also shown as a screenshot in the post.
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching result from the 'my' column.
OK, maybe not the best solution, but it seems to be working. I did a cartesian join and then checked which records contain the data needed:
dfa['join'] = 1
dfb['join'] = 1

dfFull = dfa.merge(dfb, on='join').drop('join', axis=1)
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
dfFull[dfFull['match']]
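
On pandas 1.2 or newer, the dummy 'join' column is unnecessary because merge supports a built-in cross join. A self-contained sketch with invented sample values (the column names 'item', 'original', and 'my' come from the question):

import pandas as pd

dfa = pd.DataFrame({'item': ['red apple', 'green pear']})
dfb = pd.DataFrame({'original': ['apple', 'pear'], 'my': ['A', 'P']})

dfFull = dfa.merge(dfb, how='cross')  # pandas >= 1.2
dfFull['match'] = dfFull.apply(lambda x: x.original in x.item, axis=1)
print(dfFull[dfFull['match']])
#          item original my  match
# 0   red apple    apple  A   True
# 3  green pear     pear  P   True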

Select columns based on != condition

I have a dataframe and a list of some column names from that dataframe. How do I filter the dataframe so that it excludes those columns, i.e. keeps only the columns that are outside the specified list?
I tried the following:
quant_vair = X != true_binary_cols
but I get the error: Unable to coerce to Series, length must be 545: given 155
I've been battling for hours; any help will be appreciated.
This will help:
df.drop(columns=["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
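
Another option is Index.difference, which returns the column names not in the list (note that it sorts the surviving names by default); true_binary_cols is the list from the question:

df_filtered = df[df.columns.difference(true_binary_cols)]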

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records in a pandas data frame which contain diverse combinations of NaN in a 4-column frame. I have created a function called complete_cases to return the indexes of rows that meet the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() == len(dataframe.columns)]
    return indx
I am wondering whether this is optimal enough or if there is a better way to do this.
Absolutely. All you need to do is
df.dropna(axis=0, how='any', inplace=True)
This will remove all rows that have at least one missing value and updates the data frame in place.
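One caveat: the question actually describes rows where all columns are NaN; dropna matches that condition with how='all' (a one-line sketch):

df.dropna(axis=0, how='all', inplace=True)  # drops only the rows where every column is NaN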
I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[~df.isna().any(axis='columns')]
This keeps only the rows without any NaN, which is what complete.cases in R returns.
A possible solution:
1. Count the number of NA values per row, saving the count in a new column.
2. Based on this new column, filter the rows of the data frame as you wish.
3. Remove the (now) unnecessary column.
It is possible to do it with a lambda function. For example, if you want to remove rows that have 10 "NA" values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
df = df[df['count'] != 0]  # df.count is a method, so use bracket access here
del df['count']
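
For reference, both helper steps can be collapsed into a vectorized row-wise mask, avoiding apply and the temporary column (keeping the example threshold of 10 NaN values):

# keep rows whose NaN count is not exactly 10 (same result as the lambda above)
df = df[df.isna().sum(axis=1) != 10]

# independently, the original complete_cases function reduces to
indx = df.index[df.isna().all(axis=1)]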
