I have a data frame like the following:
I need to flag the rows matching each column's title, which means adding one extra column per index value.
Could you please suggest the solution?
Here is a solution:
import pandas as pd

# Build a sample dataframe
df = pd.DataFrame({"index": ["B", "D", "C", "A"]}).groupby(["index"]).count()
df["value"] = None

# Function to fill the matching column
def match(index, column):
    if index == column:
        return 1
    else:
        return ""

# Create one column per index value and fill it with the matching flag
for index in df.index.array:
    df[index] = df.apply(lambda row: match(row.name, index), axis=1)

# Print the result dataframe
print(df)
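For reference, with the sample frame above, the printed result should look roughly like this (the empty strings show as blanks):

      value  A  B  C  D
index
A      None  1
B      None     1
C      None        1
D      None           1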
As the title suggests I am attempting to assign quartiles to specific row values in my dataframe. However, I want to group this information by date and assign column quartiles based upon the values which exist on that date. I am not quite sure if my current method is working correctly. (I am utilizing a loop to achieve this for each column in my dataframe).
import os
import pandas as pd

# This function reads my dataframe and removes the Year, Month and Day columns,
# because the 'Concat' column (ddmmyyyy) is used as my key for grouping.
def readCSV():
    directory = 'Data'
    file = 'Copy of Student_Data_9120.csv'
    data = pd.read_csv(os.path.join(directory, file))
    data.drop(columns=['Year', 'Month', 'Day'], inplace=True)
    return data
def getDecile(data):
    # Find the columns in my dataframe
    test_list = data.columns.values.tolist()
    # 'Concat' is my primary key (dd/mm/yyyy) which I am grouping by
    remove_list = ['acc_date', 'permno', 'Portfolio_Formation_date', 'SUE', 'FYearEnd', 'Concat']
    # Remove the columns I don't need deciles for
    keyCols = filter(lambda i: i not in remove_list, test_list)
    # For each remaining column, group by date and assign each row's decile
    # on that date to the new column '<Column>_Decile'.
    for column in keyCols:
        name = column + '_Decile'
        data[name] = data.groupby(['Concat'])[column].transform(
            lambda x: pd.qcut(x.rank(method='first'), q=10, labels=range(1, 11))
        )
    return data
def printToCSV(quartileData):
    file = 'Data_With_Quartiles.csv'
    quartileData.to_csv(file)

data = readCSV()
quartileData = getDecile(data)
printToCSV(quartileData)
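As a sanity check of the transform step, here is a minimal sketch on made-up data (the 'Concat' values and the 'ret' column are placeholders); each date's ten values should land in deciles 1 through 10:

import pandas as pd

toy = pd.DataFrame({
    "Concat": ["01012020"] * 10 + ["02012020"] * 10,
    "ret": list(range(10)) + list(range(100, 110)),
})
toy["ret_Decile"] = toy.groupby("Concat")["ret"].transform(
    lambda x: pd.qcut(x.rank(method="first"), q=10, labels=range(1, 11))
)
print(toy)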
Hello all, I have a simple df as shown above. The Target column contains column names, created with a lambda from another table:
targetmain['Target'] = targetmain.apply(lambda row: row[row == 1].index.tolist(), axis=1)
What I want to do is create a new column named "Primary" based on the Target column, holding for each row the value found in the column that Target names (e.g. for Joe the "Primary" column should be 5, for jack 2, for avarel 0, for william 8).
Also, if the brackets are an issue, I can remove them as well.
First remove the lists by selecting the first value, then emulate lookup with factorize and reindex:
targetmain['Target'] = targetmain['Target'].str[0]
idx, cols = pd.factorize(targetmain['Target'])
targetmain['Primary'] = targetmain.reindex(cols, axis=1).to_numpy()[np.arange(len(targetmain)), idx]
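A quick check on a hypothetical frame (the column names here are invented):

import numpy as np
import pandas as pd

targetmain = pd.DataFrame({"Target": [["colB"], ["colA"]],
                           "colA": [2, 5],
                           "colB": [7, 3]})

targetmain['Target'] = targetmain['Target'].str[0]
idx, cols = pd.factorize(targetmain['Target'])
targetmain['Primary'] = targetmain.reindex(cols, axis=1).to_numpy()[np.arange(len(targetmain)), idx]
print(targetmain['Primary'].tolist())  # [7, 5]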
For old pandas versions use DataFrame.lookup:
targetmain['Target'] = targetmain['Target'].str[0]
targetmain['Primary'] = targetmain.lookup(targetmain.index, targetmain['Target'])
Considering that the values in the column Target are strings:
def primaryCount(row):
    row['Primary'] = row[row['Target']]
    return row

targetmain = targetmain.apply(primaryCount, axis=1)
You may need to extract your Target values beforehand, as @jezrael suggested:
targetmain['Target'] = targetmain['Target'].str[0]
EDIT: this solution can be simplified:
targetmain['Primary'] = targetmain.apply(lambda row: row[row['Target']], axis=1)
I need to fix a large Excel database where, in some columns, some cells are blank and all the data in the row is moved one cell to the right.
For example:
In this example I need a script that would detect that the first cell of the last row is blank and then move all the values one cell to the left.
I'm trying to do it with the code below. vencli_col is the dataset; df1 and df2 are copies. In df2 I drop 'Column12', which is where the error originates. I index the rows where the error happens and then try to replace them with the values from df2.
df1 = vencli_col.copy()
df2 = vencli_col.copy()
df2 = df1.drop(columns=['Column12'])
df2['droppedcolumn'] = np.nan

i = 0
col = []
for k, value in vencli_col.iterrows():
    i += 1
    if str(value['Column12']) == '' or str(value['Column12']) == str(np.nan):
        col.append(i + 1)

for j in col:
    df1.iloc[j] = df2.iloc[j]

df1.head(25)
You could do something like the below. It is not very pretty but it does the trick.
# Select the column names that are correct and the ones that are shifted.
# This assumes the error column is the second one, as in the image you posted.
correct_cols = df.columns[1:-1]
shifted_cols = df.columns[2:]

# Get the indexes of the rows that are NaN or ""
df = df.fillna("")
shifted_indexes = df[df["col1"] == ""].index

# Shift the data one column to the left.
# It has to be converted to numpy, because otherwise the column names
# prevent the values from being copied into the destination columns.
df.loc[shifted_indexes, correct_cols] = df.loc[shifted_indexes, shifted_cols].to_numpy()
EDIT: I just realised there is an easier way using df.shift():
columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
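For example, on a toy frame (the column names are assumed) where the second row is shifted one cell to the right:

import pandas as pd

df = pd.DataFrame({"col1": ["a", ""],
                   "col2": ["b", "a"],
                   "col3": ["c", "b"],
                   "col4": ["d", "c"]})

columns_to_shift = df.columns[1:]
shifted_indexes = df[df["col1"] == ""].index
df.loc[shifted_indexes, columns_to_shift] = df.loc[shifted_indexes, columns_to_shift].shift(-1, axis=1)
print(df)
# the bad row becomes ["", "b", "c", NaN]; if col1 should receive the
# value "a" as well, shift every column with columns_to_shift = df.columns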
I have the following dataframe:
df = pd.DataFrame(columns=['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame(columns=['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst = ['adam', 'beth']
out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Finally, fill the names that matched nothing:
df['Name corrected'] = df['Name corrected'].ffill()
# but note that in certain conditions ffill() gives you wrong values
Explanation:
lst = ['adam', 'beth']
# the list of reference words

out = pd.concat([df['Name'].str.contains(x, case=False).map({True: x}) for x in lst], axis=1)
# Check whether the 'Name' column contains each word in the list, one word at a time.
# Each check gives a boolean Series; mapping {True: x} turns True into the word itself
# and False into NaN. Concatenating the Series on axis=1 gives one column per word.

df['Name corrected'] = out.bfill(axis=1).iloc[:, 0]
# Backward-fill along axis=1 and take the first column

# Finally:
df['Name corrected'] = df['Name corrected'].ffill()
# Forward-fill the remaining missing values
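Since 'beht' and 'Beeth' contain no reference word at all (which is why the ffill() is needed and can fill the wrong value), a fuzzy alternative built on the standard library's difflib may be more robust. A minimal sketch, assuming the same df and reference list:

import difflib

ref_names = ['adam', 'beth']

def closest(name, choices=ref_names, cutoff=0.6):
    # return the closest reference name, or None if nothing is similar enough
    match = difflib.get_close_matches(name.lower(), choices, n=1, cutoff=cutoff)
    return match[0] if match else None

df['Name Corrected'] = df['Name'].map(closest)
# e.g. 'beht' -> 'beth', 'Aadam' -> 'adam'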
I have a dataframe and a list of some column names that correspond to it. How do I filter the dataframe so that it excludes that list of column names, i.e. keeps only the dataframe columns outside the specified list?
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns=["col1", "col2"])
You can either drop the columns from the dataframe, or build a list of the columns you want to keep:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
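A third option, if alphabetical column order is acceptable, is Index.difference, which also ignores names in the list that aren't present in the dataframe:

df_filtered = df[df.columns.difference(true_binary_cols)]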