Replacing a few zeros of a dataframe with the difference between two previous columns - python

I am trying to execute:
df['C'] = df['C'].replace(0, lambda x: pd.Timedelta(x['B']-x['A']).days)
which is meant to replace all zeros in column 'C' of dataframe 'df' with the difference between columns 'B' and 'A', which are of type datetime64. I want the result in 'C' to be in days. Here is a snapshot of the result after the above execution:
I am getting the column C values like this. Instead I should get only the days, which is the difference between the first two columns. What am I doing wrong?
(note: I don't want to do df['C'] = df['B'] - df['A'] because I want to replace only the zeros, not the existing non-zero values)

Select the rows where 'C' is 0 with a mask in DataFrame.loc and assign the day differences from Series.dt.days:
df.loc[df['C'] == 0, 'C'] = (df['B']-df['A']).dt.days
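For reference, a minimal runnable sketch of this approach, using made-up dates (the column names follow the question):

```python
import pandas as pd

# Hypothetical sample data: 'A' and 'B' are datetime64, 'C' holds some zeros
df = pd.DataFrame({
    'A': pd.to_datetime(['2021-01-01', '2021-01-05', '2021-02-01']),
    'B': pd.to_datetime(['2021-01-10', '2021-01-08', '2021-02-15']),
    'C': [0, 7, 0],
})

# Replace only the zeros in 'C' with the day difference B - A;
# the non-zero value 7 is left untouched
df.loc[df['C'] == 0, 'C'] = (df['B'] - df['A']).dt.days
print(df['C'].tolist())  # [9, 7, 14]
```

The boolean mask on the left-hand side of .loc restricts the assignment to the matching rows, while the right-hand side aligns on the index, so only the masked positions are overwritten.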

Related

How do I perform the following operation in a Python dataframe?

Below are my two dataframes:
df = pd.DataFrame(np.random.randint(1,10,(5,3)),columns=['a','b','c'])
dd = pd.DataFrame(np.repeat(3,2),columns=['a'])
I want to replace column 'a' of df with the values in column 'a' of dd. Any rows left empty should be filled with zero, but only for column 'a'. All other columns of df remain unchanged.
So column 'a' should contain 3, 3, 0, 0, 0.
This is probably not the cleanest way, but it works.
df['a'] = dd['a']
df['a'] = df['a'].fillna(0)
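Putting it together as a runnable sketch (the random seed is added only so the example is reproducible; an astype(int) is appended because the NaN introduced by index alignment turns the column into floats):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(1, 10, (5, 3)), columns=['a', 'b', 'c'])
dd = pd.DataFrame(np.repeat(3, 2), columns=['a'])

# Assigning dd['a'] aligns on the index: rows 0-1 get 3, rows 2-4 become NaN
df['a'] = dd['a']
# Fill the missing rows with zero, only for column 'a'
df['a'] = df['a'].fillna(0).astype(int)
print(df['a'].tolist())  # [3, 3, 0, 0, 0]
```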

Creating a new column based on multiple columns

I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A ~ E.
If all values in columns A ~ E are 0, then the value of col should be 0.
I've attached an image for a better understanding.
What is the most efficient way to do this in Python, without using a loop? Thanks.
If you need to test all columns, use DataFrame.max or DataFrame.any, casting to integers to map True/False to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or, if you need to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
If you need to specify the columns by a list, use a subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
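A small demo of the max/any variants on hypothetical 0/1 indicator columns:

```python
import pandas as pd

# Hypothetical 0/1 indicator columns A..E
df = pd.DataFrame({
    'A': [1, 0, 0],
    'B': [0, 0, 1],
    'C': [0, 0, 0],
    'D': [0, 0, 0],
    'E': [0, 0, 1],
})

# Row-wise max is 1 as soon as any column holds a 1
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
print(df['col'].tolist())  # [1, 0, 1]

# Equivalent with any(): True if at least one 1 per row, then cast to int
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
print(df['col'].tolist())  # [1, 0, 1]
```

Both variants are vectorised; which one reads better is mostly a matter of taste, though any() makes the "at least one" intent explicit.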

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#but note that in certain conditions ffill() can give you wrong values
Explanation:
lst=['adam','beth']
#created a list of words
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
#Check whether the 'Name' column contains each word from the list, one at a
#time. str.contains() gives a boolean Series, and map() turns True into that
#word and False into NaN. Concatenating the resulting Series along axis=1
#then produces a DataFrame with one column per reference word.
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Backward-filling values along axis=1 and taking the first column
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#Forward filling the missing values
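Since the question mentions fuzzy matching, here is an alternative sketch using the standard-library difflib instead of substring matching; it also handles transpositions like 'beht'. The 0.5 similarity cutoff is an assumption you may need to tune for your data:

```python
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam', 'adam', 'AdAm', 'adammm', 'Adam.',
                            'Bethh', 'beth.', 'beht', 'Beeth', 'Beth']})
ref = ['adam', 'beth']

def closest(name, choices=ref):
    # get_close_matches returns [] when nothing is similar enough,
    # in which case we keep the original name
    matches = difflib.get_close_matches(name.lower(), choices, n=1, cutoff=0.5)
    return matches[0] if matches else name

df['Name Corrected'] = df['Name'].map(closest)
print(df['Name Corrected'].tolist())
# ['adam', 'adam', 'adam', 'adam', 'adam', 'beth', 'beth', 'beth', 'beth', 'beth']
```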

Creating a column to have different values depending on its value in another column in pandas

Consider the following (dummy) pandas dataset:
How would the 'Type' column be constructed in Python?
To clarify, it needs to be constructed in a way that the 6 highest month values have an 'F' in Type, with the remaining rows having an 'A' in Type. This is because the Month column may not always be from 0-10 - e.g. if the Month ranged from 0-15, then Months 10-15 would have an 'F' in Type.
Even better with numpy:
month_limit = df['Month'].max()-5
df['type'] = np.where(df['Month'] >= month_limit, 'F', 'A')
Hope this will work:
month_limit = df['Month'].max()-5
df['Type'] = ['F' if x >= month_limit else 'A' for x in df['Month']]
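Both answers can be checked against a hypothetical Month column running 0-10, where the 6 highest months (5-10) should get 'F':

```python
import numpy as np
import pandas as pd

# Hypothetical months 0..10
df = pd.DataFrame({'Month': range(11)})

# The 6 highest month values get 'F', the rest 'A'
month_limit = df['Month'].max() - 5
df['Type'] = np.where(df['Month'] >= month_limit, 'F', 'A')
print(df['Type'].tolist())  # ['A', 'A', 'A', 'A', 'A', 'F', 'F', 'F', 'F', 'F', 'F']
```

Because the threshold is derived from the column's max, the same code works unchanged if Month ranges 0-15 instead.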

Iterate and input data into a column in a pandas dataframe

I have a pandas dataframe with a column that holds a small selection of strings. Let's call the column 'A'; every value in it is one of string_1, string_2, or string_3.
Now, I want to add another column and fill it with numeric values that correspond to the strings.
I created a dictionary
d = { 'string_1' : 1, 'string_2' : 2, 'string_3': 3}
I then initialized the new column:
df['B'] = pd.Series(index=df.index)
Now, I want to fill it with the integer values. I can call the values associated with the strings in the dictionary by:
for s in df['A']:
    n = d[s]
That works fine, but when I try a plain df['B'] = n inside the for-loop to fill the new column, it doesn't work, and I haven't been able to figure out the right indexing with pandas.
If I understand you correctly you can just call map:
df['B'] = df['A'].map(d)
This will perform the lookup and fill the values you are looking for.
Rather than initializing an empty column first, you can simply populate it with an apply:
df['B'] = df['A'].apply(d.get)
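A quick demo of the map-based lookup on made-up data; note that with map, any value of 'A' missing from the dictionary would become NaN:

```python
import pandas as pd

d = {'string_1': 1, 'string_2': 2, 'string_3': 3}
df = pd.DataFrame({'A': ['string_1', 'string_3', 'string_2', 'string_1']})

# Vectorised dictionary lookup: each string in 'A' is replaced by d[string]
df['B'] = df['A'].map(d)
print(df['B'].tolist())  # [1, 3, 2, 1]
```

apply(d.get) behaves the same here; map is the more idiomatic choice for a plain dictionary lookup.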
