How do I perform the following operation on a pandas DataFrame in Python?

Below are my two dfs:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (5, 3)), columns=['a', 'b', 'c'])
dd = pd.DataFrame(np.repeat(3, 2), columns=['a'])
I want to replace column 'a' of df with the values in column 'a' of dd. Any rows left empty should be filled with zero, but only for column 'a'; all other columns of df remain unchanged.
So column 'a' should contain 3, 3, 0, 0, 0.

This is probably not the cleanest way, but it works.
df['a'] = dd['a']
df['a'] = df['a'].fillna(0)
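A slightly cleaner one-liner (a sketch, assuming both frames use the default RangeIndex) reindexes dd's column to df's index and fills the missing rows with 0 directly:
# Rows of df.index that are missing from dd get fill_value instead of NaN
df['a'] = dd['a'].reindex(df.index, fill_value=0)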

Related

Replace elements of a DataFrame with values from another DataFrame

I want to replace elements of df2 with elements of df1 according to this rule: if the first-row, first-column element of df2 is 1, then the first-row, first-column element of df1 goes there; if it is zero, then 0 stays. If the last-column element of any df2 row is 1, then the last element of that df1 row goes there, and so on.
So I want to replace every 1 in df2 with the corresponding df1 element following that rule (each row's 1s take that row's df1 values in order). df3 is going to be like:
abcde0000;
abcd0e000;
abcd00e00;...
We can use the apply function for this, but first you have to concat both frames along axis=1. I am using a dummy table with just three rows; the approach works for any number of rows.
import pandas as pd
import numpy as np
# Dummy data
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
# Display the dataframes (display() works in Jupyter notebooks; use print() in plain Python scripts)
display(df1)
display(df2)
# Concat DFs
df3 = pd.concat([df1,df2],axis=1)
display(df3)
# Define function for replacing
def replace(letters, indexes):
    seek = 0
    for i in range(len(indexes)):
        if indexes[i] == 1:
            indexes[i] = letters[seek]
            seek += 1
    return ''.join(list(map(str, indexes)))
# Apply the replace function row-wise; iloc keeps the slicing positional
df4 = df3.apply(lambda x: replace(x.iloc[:5], x.iloc[5:]), axis=1)
# Display df4
display(df4)
The result is
0 abcde0000
1 abcd0e000
2 abcd00e00
dtype: object
I think this will solve your problem
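An equivalent NumPy sketch (an assumption here: each df2 row contains exactly as many 1s as the corresponding df1 row has letters, as in the dummy data) places the letters with a boolean mask instead of scanning element by element:
# Place each row's letters at the positions of the 1s, then join into a string
letters = df1.to_numpy()
masks = df2.to_numpy().astype(object)   # object dtype so strings can sit next to the 0s
rows = []
for lets, mask in zip(letters, masks):
    mask[mask == 1] = lets              # boolean assignment fills the 1s in order
    rows.append(''.join(map(str, mask)))
df4 = pd.Series(rows)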

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#note: in certain conditions ffill() can give wrong values (a name that matches no pattern inherits the previous row's name)
Explanation:
lst=['adam','beth']
#created a list of words
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
#For each word in the list, check whether the 'Name' column contains it (case-insensitive). That yields a boolean Series; .map({True: x}) turns True into the word itself and False into NaN. Concatenating those Series along axis=1 gives a DataFrame with one column per word.
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Backward-fill along axis=1 and take the first column
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#Forward filling the missing values
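Since the question mentions fuzzy matching, here is a minimal sketch using the standard library's difflib; the cutoff of 0.6 is an assumption you would tune for your data:
import difflib

def closest(name, choices=('adam', 'beth'), cutoff=0.6):
    # Compare the lowercased, de-punctuated name against the reference list and
    # return the closest match; fall back to the original name if nothing is close enough.
    matches = difflib.get_close_matches(name.lower().strip('.'), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else name

df['Name corrected'] = df['Name'].apply(closest)
This avoids the substring requirement of str.contains, so misspellings like 'beht' still map to 'beth'.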

How can I append values from one df to the bottom of a column in my second df in one expression?

I'm trying to append a list of values (column 'A') from a separate df to the bottom of my output (finalDf), where the values are always the same and don't need to be in order.
Here's what I have tried so far:
temp1 = pd.DataFrame(df['A'].append(df1['A'], ignore_index = True))
temp2 = pd.DataFrame(df['B'].append(df1['B'], ignore_index = True))
print(df.shape)
print(temp1.shape)
print(temp2.shape)
Shape output (example from my code, with 28 extra values from df1):
(11641, 6)
(11669, 1)
(11669, 1)
Appending the values seems to work based on the shape of temp1, but I can't seem to get the values from both col 'A' and col 'B' of df1 onto the bottom of df together in dfFinal; it's always either col 'A' or col 'B' from df1, never both.
TL;DR: how can I best take the values from col 'A' and col 'B' in df1 and append them to col 'A' and col 'B' in df to make dfFinal, which I can then export to CSV?
This can be done with the concat function along axis=0, i.e. it will join the data frames provided along rows. In layman's terms, it will join the second data frame below the first. Keep in mind that the number of columns should be the same in both data frames.
dfFinal = pd.concat([df, df1], axis=0, ignore_index=True)
Here, ignore_index=True discards the indexes carried over by the concatenation and creates a new one from 0 to n-1. Note that concat is a top-level pandas function (pd.concat), not a DataFrame method.
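A minimal sketch with dummy two-column frames (the values here are made up) showing the round trip to CSV:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df1 = pd.DataFrame({'A': [9, 9], 'B': [8, 8]})
dfFinal = pd.concat([df, df1], axis=0, ignore_index=True)
print(dfFinal.shape)                        # (4, 2): df1's rows sit below df's
dfFinal.to_csv('dfFinal.csv', index=False)  # export, as the question asks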
For more information: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html

Replacing the zeros of a dataframe with the difference between two preceding columns

I am trying to execute:
df['C'] = df['C'].replace(0, lambda x: pd.Timedelta(x['B']-x['A']).days)
which is meant to replace all instances of zeros in column 'C' of dataframe df with the difference between columns 'B' and 'A', which are of type datetime64. I want the result in 'C' to be in days.
After the above execution, column 'C' ends up containing the lambda object itself instead of the day difference. Instead I should get only the days, which is the difference between the first two columns. What am I doing wrong?
(Note: I don't want to do df['C'] = df['B'] - df['A'] because I want to replace only the zeros, not the existing non-zero values.)
Select the rows where 'C' is zero with a boolean mask in DataFrame.loc and assign the day difference obtained from Series.dt.days:
df.loc[df['C'] == 0, 'C'] = (df['B']-df['A']).dt.days
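A minimal runnable sketch with made-up dates (column names follow the question):
import pandas as pd

df = pd.DataFrame({'A': pd.to_datetime(['2021-01-01', '2021-01-05']),
                   'B': pd.to_datetime(['2021-01-10', '2021-01-06']),
                   'C': [0, 7]})
df.loc[df['C'] == 0, 'C'] = (df['B'] - df['A']).dt.days
print(df['C'].tolist())  # [9, 7]: the zero became the 9-day difference, the 7 is untouched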

Select columns based on != condition

I have a dataframe and I have a list of some column names that correspond to the dataframe. How do I filter the dataframe so that it != the list of column names, i.e. I want the dataframe columns that are outside the specified list.
I tried the following:
quant_vair = X != true_binary_cols
but get the output error of: Unable to coerce to Series, length must be 545: given 155
Been battling for hours, any help will be appreciated.
This will help:
df.drop(columns = ["col1", "col2"])
You can either drop the columns from the dataframe, or create a list that does not contain all these columns:
df_filtered = df.drop(columns=true_binary_cols)
Or:
filtered_col = [col for col in df if col not in true_binary_cols]
df_filtered = df[filtered_col]
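Equivalently, a boolean mask over the columns keeps the '!=' flavour of the original attempt (a sketch, assuming true_binary_cols holds labels that exist in df):
# isin builds a boolean mask over df.columns; ~ negates it to keep everything else
df_filtered = df.loc[:, ~df.columns.isin(true_binary_cols)]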
