Is there any way to shift row values in the dataframe? - python

I want to shift the values in row 10, starting with Fintech, into the next column and fill the city column in the same row with Bahamas. Is there any way to do that?
I found the dataframe.shift() function of pandas but it is limited to columns and it shifts all the values.

Use DataFrame.shift with filtered rows and axis=1:
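# select rows where 'Select Investors' is missing (None/NaN)
m = df['Select Investors'].isna()
# or, if the missing values are stored as the string 'None':
# m = df['Select Investors'].eq('None')
# shift only those rows one column to the right, from 'Country' onward
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)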
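A minimal, self-contained sketch of the row-wise shift above. The company names, investors, and exact column layout are made up for illustration, since the original table is not shown; only the 'Country' and 'Select Investors' column names come from the answer.

```python
import pandas as pd

# Hypothetical data: in row 1 the values sit one column too far to the left,
# so the last column ('Select Investors') is empty.
df = pd.DataFrame(
    {
        "Company": ["Acme", "Globex"],
        "Country": ["United States", "Bahamas"],
        "City": ["San Francisco", "Fintech"],
        "Industry": ["Fintech", "Investor X"],
        "Select Investors": ["Investor Y", None],
    },
    dtype=object,  # a single object dtype keeps the row-wise shift simple
)

# Rows that need fixing: the last column is missing.
m = df["Select Investors"].isna()

# Shift only those rows one column to the right, starting at 'Country'.
df.loc[m, "Country":] = df.loc[m, "Country":].shift(axis=1)

print(df)
```

After the shift, the first shifted column ('Country' here) is NaN in the affected rows and can be filled separately if needed.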

Related

Creating a new column based on multiple columns

I'm trying to create a new column based on other columns existing in my df.
My new column, col, should be 1 if there is at least one 1 in columns A ~ E.
If all values in columns A ~ E are 0, then the value of col should be 0.
I've attached an image for a better understanding.
What is the most efficient way to do this in Python without using a loop? Thanks.
To test all columns, use DataFrame.max, or DataFrame.any with a cast to integers to map True/False to 1/0:
df['col'] = df.max(axis=1)
df['col'] = df.any(axis=1).astype(int)
Or, to test only the columns between A and E, add DataFrame.loc:
df['col'] = df.loc[:, 'A':'E'].max(axis=1)
df['col'] = df.loc[:, 'A':'E'].any(axis=1).astype(int)
To specify the columns with a list, select that subset:
cols = ['A','B','C','D','E']
df['col'] = df[cols].max(axis=1)
df['col'] = df[cols].any(axis=1).astype(int)
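The two variants above can be compared on a small frame. The data below is invented for illustration and assumes columns A to E hold only 0 and 1.

```python
import pandas as pd

# Hypothetical 0/1 data in columns A..E.
df = pd.DataFrame(
    {
        "A": [0, 1, 0, 0],
        "B": [0, 0, 0, 1],
        "C": [0, 0, 0, 0],
        "D": [0, 1, 0, 0],
        "E": [0, 0, 0, 1],
    }
)
cols = ["A", "B", "C", "D", "E"]

# Row-wise maximum: 1 if any of the columns is 1 (valid only for 0/1 data).
by_max = df[cols].max(axis=1)

# Row-wise "any": True if any value is non-zero, then cast to 0/1.
by_any = df[cols].any(axis=1).astype(int)

df["col"] = by_any
print(df)
```

Selecting the columns explicitly matters once 'col' exists: calling df.max(axis=1) or df.any(axis=1) again on the whole frame would include 'col' itself.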

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
# note: ffill() can give wrong values when an unmatched name follows a row that belongs to a different name
Explanation:
lst=['adam','beth']
# create a list of the reference names
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
# for each reference name, test whether the 'Name' column contains it (case-insensitive); this gives a boolean Series. Mapping with {True:x} turns True into the name and False into NaN. Concatenating the resulting Series on axis=1 builds a DataFrame with one column per reference name
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
# backfill along axis=1 and take the first column, i.e. the first matching name in each row
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
# forward-fill the rows where no reference name matched
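Since the question mentions fuzzy matching, here is a hedged alternative sketch using the standard library's difflib.get_close_matches instead of str.contains plus ffill. This is a different technique from the answer above; the helper name closest and the cutoff of 0.6 (difflib's default) are assumptions chosen for illustration.

```python
import difflib

import pandas as pd

df = pd.DataFrame(
    {"Name": ["Aadam", "adam", "AdAm", "adammm", "Adam.",
              "Bethh", "beth.", "beht", "Beeth", "Beth"]}
)
ref = ["adam", "beth"]


def closest(name, choices=ref, cutoff=0.6):
    """Return the most similar reference name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Map every raw name to its closest reference name.
df["Name corrected"] = df["Name"].map(closest)
print(df)
```

Each row is matched independently, so the result does not depend on row order the way a forward fill does. For large reference tables this per-row comparison may be slow.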

Is there an equivalent Python function similar to complete.cases in R

I am removing a number of records from a four-column pandas data frame that contains various combinations of NaN. I have created a function called complete_cases that returns the indexes of the rows meeting the following condition: all columns in the row are NaN.
I have tried this function below:
def complete_cases(dataframe):
    indx = [x for x in list(dataframe.index)
            if dataframe.loc[x, :].isna().sum() ==
            len(dataframe.columns)]
    return indx
I am wondering whether this is optimal enough or whether there is a better way to do this.
Absolutely. All you need to do is
df.dropna(axis = 0, how = 'any', inplace = True)
This removes all rows that have at least one missing value and updates the data frame in place.
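I'd recommend using loc, isna, and any with the 'columns' axis, like this:
df.loc[df.isna().any(axis='columns')]
This selects the rows that have at least one NaN. Negate the mask with ~ to keep only the rows without NaN, which are the rows that complete.cases marks as TRUE in R.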
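The masks discussed in these answers can be put side by side. This is a small sketch on made-up data; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [2.0, 5.0, np.nan, 8.0],
    }
)

# True where a row has no missing values (the analogue of complete.cases in R).
complete = df.notna().all(axis="columns")

# True where every column of a row is NaN (the condition in the question).
all_missing = df.isna().all(axis="columns")

complete_rows = df[complete]                        # same rows as df.dropna(how="any")
all_missing_index = df.index[all_missing].tolist()  # index labels of all-NaN rows
without_empty_rows = df.dropna(how="all")           # drop only the all-NaN rows

print(complete_rows)
print(all_missing_index)
print(without_empty_rows)
```

Both masks are computed column-wise in one pass, so no Python-level loop over the index is needed.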
A possible solution:
Count the number of "NA" values in each row and save the result in a new column
Based on this new column, filter the rows of the data frame as you wish
Remove the (now) unnecessary column
It is possible to do this with a lambda function. For example, if you want to remove rows that have 10 "NA" values:
df['count'] = df.apply(lambda x: 0 if x.isna().sum() == 10 else 1, axis=1)
# use df['count'] here, because df.count is the DataFrame.count method
df = df[df['count'] != 0]
del df['count']
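The same count-then-filter idea can be sketched without a helper column or apply, by counting missing values per row with isna().sum(axis=1). The data and the threshold of 3 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; row 1 is missing in every column.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan],
        "b": [2.0, np.nan, 6.0],
        "c": [3.0, np.nan, np.nan],
    }
)

# Number of missing values in each row.
n_missing = df.isna().sum(axis=1)

# Drop the rows whose missing count equals the threshold (assumed to be 3 here).
threshold = 3
filtered = df[n_missing != threshold]

print(n_missing.tolist())
print(filtered)
```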

Pandas - Fill in missing values choosing values from a normal distribution

The code below will generate only one value of a normal distribution, and fill in all the missing values with this same value:
helper_df = df.dropna()
df = df.fillna(numpy.random.normal(loc=helper_df.mean(), scale=numpy.std(helper_df)))
What can we do to generate a value for each missing value?
You can create a series of normally distributed values. First extract the index of the NaN values in the column you are working on.
df: your dataframe
col: the column containing NaN values
index = df[df.col.isna()].index
value = np.random.normal(loc=df.col.mean(), scale=df.col.std(), size=df.col.isna().sum())
df['col'] = df['col'].fillna(pd.Series(value, index=index))
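You can create a series of random values with the same length and index as your dataframe, then apply fillna to the column. On a whole DataFrame, fillna matches a Series against the column labels, so apply it to a single column:
df['col'] = df['col'].fillna(pd.Series(np.random.normal(size=len(df)), index=df.index))
If a value in a row is not missing, fillna just ignores it.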
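A runnable sketch of drawing one value per missing entry. The column name col, the sample data, and the seed are assumptions for illustration; a seeded Generator is used only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the example is repeatable

# Hypothetical column with three missing values.
df = pd.DataFrame({"col": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan]})

# Boolean mask of the missing entries.
missing = df["col"].isna()

# One draw per missing entry, using the mean and std of the observed values
# (Series.mean and Series.std skip NaN by default).
draws = rng.normal(
    loc=df["col"].mean(),
    scale=df["col"].std(),
    size=missing.sum(),
)

# Write the draws into a copy of the column, only where values were missing.
df["col_filled"] = df["col"]
df.loc[missing, "col_filled"] = draws

print(df)
```

Passing size to the normal draw is what produces a separate value for each missing entry instead of a single shared one.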

Add values to bottom of DataFrame automatically with Pandas

I'm initializing a DataFrame:
columns = ['Thing','Time']
df_new = pd.DataFrame(columns=columns)
and then writing values to it like this:
counter = 0
for t in df.Thing.unique():
    df_temp = df[df['Thing'] == t]  # filtering the df
    df_new.loc[counter,'Thing'] = t  # writing the filter value to df_new
    df_new.loc[counter,'Time'] = df_temp['delta'].sum(axis=0)  # summing and adding that value to the df_new
    counter += 1  # increment the row index
Is there a better way to add new values to the dataframe each time without explicitly incrementing the row index with 'counter'?
If I'm interpreting this correctly, I think this can be done in one line:
newDf = df.groupby('Thing')['delta'].sum().reset_index()
By grouping by 'Thing', you have the various "t-filters" from your for-loop. We then apply a sum() to 'delta', but only within the various "t-filtered" groups. At this point, the dataframe has the various values of "t" as the indices, and the sums of the "t-filtered deltas" as a corresponding column. To get to your desired output, we then bump the "t's" into their own column via reset_index().
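The groupby approach can be checked on a small frame. The data is invented for illustration; renaming the summed column to 'Time' is an added step to match the column names in the question.

```python
import pandas as pd

# Hypothetical input with repeated 'Thing' values.
df = pd.DataFrame(
    {
        "Thing": ["a", "b", "a", "c", "b"],
        "delta": [1, 2, 3, 4, 5],
    }
)

# Sum 'delta' within each 'Thing'; as_index=False keeps 'Thing' as a column,
# which has the same effect as calling reset_index() afterwards.
df_new = (
    df.groupby("Thing", as_index=False)["delta"]
    .sum()
    .rename(columns={"delta": "Time"})
)

print(df_new)
```

The result has one row per distinct 'Thing', so no row counter is needed.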
