I have a pandas dataframe, and I want the numbers in column C to be added together to create a new column D.
For example
Thanks in advance.
Use Series.str.extractall to get the numbers separately, convert them to integers, and finally sum per the first level of the MultiIndex:
df['D'] = df['C'].str.extractall(r'(\d)')[0].astype(int).groupby(level=0).sum()
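A minimal runnable sketch; the question's sample data isn't shown here, so the values below are made up:

import pandas as pd

# Hypothetical data standing in for the question's column C.
df = pd.DataFrame({'C': ['1,2', '3,4,5', '6']})

# extractall yields one row per digit with a (row, match) MultiIndex;
# grouping by level 0 sums the digits back per original row.
df['D'] = df['C'].str.extractall(r'(\d)')[0].astype(int).groupby(level=0).sum()
print(df)
#        C   D
# 0    1,2   3
# 1  3,4,5  12
# 2      6   6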
Just a random question. If there's a dataframe, df, from the Boston Homes dataset, and I'm trying to do EDA on a few of the columns, assigned to a variable feature_cols that I could use afterwards to check for NAs, how would one go about this? I have the following, which is throwing an error:
This is what I was hoping to try to do after the above:
Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your pictures. The first is a KeyError: if you want to access a subset of the columns of a dataframe, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
However, this will return a dataframe with three columns. Second, the for loop you want to use cannot work in pandas as written; we usually iterate over rows, not columns, of a dataframe. Instead, you can use the one-liner:
df.isna().sum()
This will print the names of all columns of the dataframe along with the count of missing values in each column. Of course, if you want to check only a subset of columns, you can replace df by df[list_of_column_names].
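For example, a small sketch (the Boston Homes data isn't shown here, so the frame below is a hypothetical stand-in):

import pandas as pd
import numpy as np

# Hypothetical stand-in for the Boston Homes dataframe.
df = pd.DataFrame({'RM': [6.5, np.nan, 7.1],
                   'ZN': [18.0, 0.0, np.nan],
                   'B': [396.9, np.nan, 392.8]})

# Missing-value counts, restricted to the columns of interest.
print(df[['RM', 'ZN', 'B']].isna().sum())
# RM    1
# ZN    1
# B     1
# dtype: int64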
To access multiple columns, you need to store only the column names in a list, for example
feature_cols = ['RM','ZN','B']
and then access them as
x = df[feature_cols]
Now, to iterate over the columns of df, you can use:
for column in df[feature_cols]:
    print(df[column])  # or anything
As per your updated comment, if your end goal is to see null counts only, you can achieve this without looping, e.g.
df[feature_cols].info(verbose=True, show_counts=True)
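Putting the answer together as a runnable sketch (the values below are made up):

import pandas as pd
import numpy as np

# Hypothetical stand-in for the Boston Homes dataframe.
df = pd.DataFrame({'RM': [6.5, np.nan, 7.1],
                   'ZN': [18.0, 0.0, np.nan],
                   'B': [396.9, np.nan, 392.8]})
feature_cols = ['RM', 'ZN', 'B']

for column in df[feature_cols]:        # iterating yields column names
    print(column, df[column].isna().sum())

# info() reports the non-null count per column in one call.
df[feature_cols].info(verbose=True, show_counts=True)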
I have the dataframe below, which I need to group by the id column so that the corresponding values end up in a list in the same cell. Can anyone please help me with this?
I have this processed Dataframe:
Actual Dataframe:
In the actual dataframe, the list of values should be added in a new column called e_values for the respective id.
df['e_values'] = df.filter(like='col_').apply(list, axis=1)
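A small sketch of how this behaves, assuming the value columns are named col_1, col_2, and so on (the real column names aren't shown in the question):

import pandas as pd

df = pd.DataFrame({'id': [1, 2],
                   'col_1': [10, 30],
                   'col_2': [20, 40]})

# filter(like='col_') keeps only the value columns;
# apply(list, axis=1) packs each row's values into one list.
df['e_values'] = df.filter(like='col_').apply(list, axis=1)
print(df)
#    id  col_1  col_2  e_values
# 0   1     10     20  [10, 20]
# 1   2     30     40  [30, 40]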
If going from Actual to processed:
You can split with expand=True and then strip the square brackets with replace. This should give you a dataframe whose columns you can rename:
df['values'].str.split(',', expand=True).replace(regex={r'\[': '', r'\]': ''})
If going from processed to Actual, use:
df.set_index('id').agg(list, axis=1).reset_index()
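A runnable sketch of both directions, using invented data since the question's frames are only shown as images:

import pandas as pd

# Hypothetical frame whose 'values' column holds bracketed strings.
df = pd.DataFrame({'id': [1, 2],
                   'values': ['[10, 20]', '[30, 40]']})

# Actual -> processed: split the bracketed strings into separate columns.
expanded = (df['values'].str.split(',', expand=True)
                        .replace(regex={r'\[': '', r'\]': ''}))
print(expanded)

# processed -> Actual: collapse the non-id columns back into one list per row.
wide = pd.concat([df[['id']], expanded], axis=1)
print(wide.set_index('id').agg(list, axis=1).reset_index())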
I want a way to find the number of null elements in a DataFrame that gives just one number, not another Series or anything like that.
You can simply get all null values from the dataframe and count them; chaining a second sum collapses the per-column counts into a single number:
df.isnull().sum().sum()
Or you can use an individual column as well:
df['col_name'].isnull().sum()
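For instance, with made-up data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan]})
print(df.isnull().sum().sum())  # 3 -> one number for the whole frame
print(df['a'].isnull().sum())   # 1 -> nulls in a single column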
You could use pd.isnull() and sum:
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 1, 1], [2, 2, np.nan], [3, np.nan, np.nan]])
pd.isnull(df).values.sum()
which gives: 3
This code chunk will help:
# df is your dataframe
print(df.isnull().sum().sum())
I have a dataframe with two columns, Stock and DueDate, where I need to select the first row from each run of repeated consecutive entries based on the Stock column.
df:
I am expecting output like below.
Expected output:
My Approach
The approach I tried is to first mark which rows are repeating based on the Stock column by creating a new column, repeated_yes, and then subset the first row only if rows are repeating more than twice.
I used the lines of code below to create the new column repeated_yes:
ss = df.Stock.ne(df.Stock.shift())
df['repeated_yes'] = ss.groupby(ss.cumsum()).cumcount() + 1
so the new, updated dataframe looks like this:
df_new
But I am stuck on subsetting only rows 3 and 8 in order to attain the result. If there is any other effective approach, it would be helpful.
Edited:
I forgot to include the actual full question:
If there are any other rows below the last row in the dataframe df, it should not display any output.
Chain another mask, created by Series.duplicated with keep=False, using & for bitwise AND, and filter with boolean indexing:
ss = df.Stock.ne(df.Stock.shift())
ss1 = ss.cumsum().duplicated(keep=False)
df = df[ss & ss1]
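A worked sketch with invented data (the question's frame is only shown as an image):

import pandas as pd

df = pd.DataFrame({'Stock': ['A', 'A', 'A', 'B', 'C', 'C'],
                   'DueDate': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']})

ss = df.Stock.ne(df.Stock.shift())        # True at the start of each run
ss1 = ss.cumsum().duplicated(keep=False)  # True for runs longer than one row
print(df[ss & ss1])
#   Stock DueDate
# 0     A      d1
# 4     C      d5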
I have a dataset and I would like to merge the first two columns, then the next two, and so on.
You didn't show your column names, therefore I have put placeholder names on your columns. When you load this dataset into a pandas dataframe, I assume your dataframe variable is df.
In [2]: df
Out[2]: <your dataset>
First, get the sum of the first two columns and assign it to a single column:
In [3]: df['Total1'] = df['first_column'] + df['second_column']
Then get the sum of the third and fourth columns and assign it to another single column:
In [4]: df['Total2'] = df['third_column'] + df['fourth_column']
When all of that is complete, you can run this:
In [5]: df
Out[5]: <your dataset with Total1 and Total2 columns>
Hope it will help you!
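If the dataset has more than four columns, the same idea generalizes; here's a minimal reshape-based sketch, assuming an even number of numeric columns (the column names are placeholders):

import pandas as pd

df = pd.DataFrame({'first_column': [1, 2],
                   'second_column': [3, 4],
                   'third_column': [5, 6],
                   'fourth_column': [7, 8]})

# Pair adjacent columns (0 & 1, 2 & 3, ...) and sum each pair.
summed = df.to_numpy().reshape(len(df), -1, 2).sum(axis=2)
totals = pd.DataFrame(summed, index=df.index,
                      columns=[f'Total{i + 1}' for i in range(summed.shape[1])])
print(totals)
#    Total1  Total2
# 0       4      12
# 1       6      14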