I have the DataFrame below, which I need to group by the id column so that the corresponding values end up as a list in a single cell. Can anyone please help me with this?
I have this processed DataFrame:
Actual DataFrame:
In the actual DataFrame, the list values should be added in a new column called e_values, one list per id.
df['e_values'] = df.filter(like='col_').apply(list, axis=1)
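For illustration, here is a minimal sketch of that one-liner on a hypothetical wide frame (the col_* names are assumptions, since the real DataFrame is only shown as an image):
import pandas as pd

# Hypothetical wide frame; the real one is only shown as an image
df = pd.DataFrame({'id': [1, 2],
                   'col_1': [10, 40], 'col_2': [20, 50], 'col_3': [30, 60]})

# Collect every col_* value of each row into one list in a new column
df['e_values'] = df.filter(like='col_').apply(list, axis=1)
print(df[['id', 'e_values']])
#    id      e_values
# 0   1  [10, 20, 30]
# 1   2  [40, 50, 60]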
If going from Actual to processed:
You can split with expand=True and then strip the square brackets. This gives you a DataFrame whose columns you can rename:
df['values'].str.split(',', expand=True).replace(regex={r'\[': '', r'\]': ''})
If going from processed to Actual use:
df.set_index('id').agg(list, axis=1).reset_index()
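As a minimal sketch of this direction, assuming a hypothetical processed frame with an id column and the split values in numbered columns:
import pandas as pd

# Hypothetical "processed" frame: one value per column
processed = pd.DataFrame({'id': [1, 2],
                          0: [10, 40], 1: [20, 50], 2: [30, 60]})

# Row-wise aggregation back into a single list column
actual = processed.set_index('id').agg(list, axis=1).reset_index(name='values')
print(actual)
#    id        values
# 0   1  [10, 20, 30]
# 1   2  [40, 50, 60]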
Just a random question: if there's a DataFrame, df, from the Boston Homes dataset, and I'm trying to do EDA on a few of the columns, assigned to a variable feature_cols, which I could use afterwards to check for NAs, how would one go about this? I have the following, which is throwing an error:
This is what I was hoping to try to do after the above:
Any feedback would be greatly appreciated. Thanks in advance.
There are two problems in your pictures. The first is a KeyError: to access a subset of a DataFrame's columns, you need to pass the column names in a list, not a tuple, so the first line should be
feature_cols = df[['RM','ZN','B']]
However, this will return a DataFrame with three columns. The for loop you want to use cannot work with pandas; we usually iterate over the rows, not the columns, of a DataFrame. You can use the one-liner:
df.isna().sum()
This will print the names of all columns of the DataFrame along with the count of missing values in each column. Of course, if you want to check only a subset of columns, you can replace df by df[list_of_column_names].
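A minimal sketch tying the two lines together, with made-up values standing in for the Boston Homes data:
import numpy as np
import pandas as pd

# Made-up sample standing in for the Boston Homes data
df = pd.DataFrame({'RM': [6.5, np.nan, 5.9],
                   'ZN': [18.0, 0.0, np.nan],
                   'B': [396.9, 392.8, 394.6]})

feature_cols = ['RM', 'ZN', 'B']

# Missing-value count per column, restricted to the columns of interest
print(df[feature_cols].isna().sum())
# RM    1
# ZN    1
# B     0
# dtype: int64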
You only need to store the names of the columns in a list to access multiple columns, for example
feature_cols = ['RM','ZN','B']
Now access them as
x = df[feature_cols]
Now, to iterate over the columns of df, you can use
for column in df[feature_cols]:
    print(df[column])  # or anything
As per your updated comment, if your end goal is to see null counts only, you can achieve that without looping, e.g.
df[feature_cols].info(verbose=True, show_counts=True)  # null_counts=True on pandas < 1.2
data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc#gmail.com', 'pqr#hotmail.com', 'xyz#yahoo.com',
                  'abc#gmail.com', 'pqr#hotmail.com', 'xyz#gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb']}
df = pd.DataFrame(data, columns=['order_num', 'email', 'product_code'])
My data frame looks something like this:
Image of data frame
For the sake of simplicity, I omitted the other columns while making the example. What I need to do is group by the order_num column, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: Expected output appearance
but I do not wish to group by the other columns. Is there any alternative to performing the list operation? I tried using transform, but it threw a size-mismatch error, and I don't think it's the right way to go either.
If there are a lot of other columns and you need to group by order_num only, use Series.map for a new column filled with lists, then remove duplicates with DataFrame.drop_duplicates on the order_num column, and finally sort if necessary:
df['product_code'] = df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
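Using the sample data from the question (with a hypothetical timestamp column added, since the real one was omitted from the example), the whole thing looks like this:
import pandas as pd

data = {'order_num': [123, 234, 356, 123, 234, 356],
        'email': ['abc#gmail.com', 'pqr#hotmail.com', 'xyz#yahoo.com',
                  'abc#gmail.com', 'pqr#hotmail.com', 'xyz#gmail.com'],
        'product_code': ['rdcf1', '6fgxd', '2sdfs', '34fgdf', 'gvwt5', '5ganb'],
        # hypothetical timestamps; the real column is not shown in the question
        'timestamp': pd.to_datetime(['2021-01-03', '2021-01-01', '2021-01-02',
                                     '2021-01-03', '2021-01-01', '2021-01-02'])}
df = pd.DataFrame(data)

# Map each order_num to the full list of its product codes
df['product_code'] = df['order_num'].map(
    df.groupby('order_num')['product_code'].apply(list))

# Keep one row per order, then sort by timestamp
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
print(df)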
I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or, better yet, is there a better way to generate this new DataFrame?
It is a Series; for a one-column DataFrame use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
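A quick sketch showing that the named column is then fully operable (the frame contents and names here are illustrative):
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5], 'b': [4, 2], 'c': [3, 3], 'd': [2, 9]})

df2 = df1.max(axis=1).to_frame('maximum')
print(df2['maximum'].dtype)                        # checking the dtype works
df2 = df2.rename(columns={'maximum': 'row_max'})   # renaming works too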
I have an array of unique elements and a dataframe.
I want to find out if the elements of the array exist in all the rows of the DataFrame.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            pass  # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive; is there a better way to do the same?
Thanks in advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can chain several filters and combine them with the & (AND) and | (OR) logical operators.
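For example, a minimal sketch of chaining two filters (the data and column names are made up, since tableData is not shown):
import pandas as pd

# Made-up table; tableData is not shown in the question
df = pd.DataFrame({'Column1': ['x', 'ValueToFind', 'ValueToFind'],
                   'Column2': [1, 2, 3]})

# Parentheses around each condition are required when combining with & or |
df2 = df[(df['Column1'] == 'ValueToFind') & (df['Column2'] > 2)]
print(df2)   # only the last row survives both filters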
You can try
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        pass  # do your task
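Since the stated goal is to know whether each element exists in all rows, .all() may fit better than .any(); here is a minimal sketch with made-up data:
import pandas as pd

# Made-up data; the real MKT column is not shown
newDF = pd.DataFrame({'MKT': ['US UK', 'US DE', 'US FR']})
uniqueArray = ['US', 'UK']

# For each element, .all() reports whether it appears in every row of MKT
present_in_all = {i: newDF['MKT'].str.contains(i).all() for i in uniqueArray}
print(present_in_all)   # {'US': True, 'UK': False}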
You can use the isin() method of the pd.Series object.
Assuming you have a DataFrame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the values of MKT are contained in uniqueArray.
Now do your work on new_df, and join/merge/concat it with the former df as you wish.
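A short usage sketch with made-up data:
import pandas as pd

df = pd.DataFrame({'MKT': ['US', 'UK', 'DE', 'US']})
uniqueArray = ['US', 'UK']

# Rows whose MKT value is one of the array's elements
new_df = df[df.MKT.isin(uniqueArray)].copy()
print(new_df)   # keeps rows 0, 1 and 3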
I have a pandas DataFrame which I would like to slice after every 4 columns and then stack the slices vertically on top of each other, keeping the date as the index. Is this possible using np.vstack()? Thanks in advance!
ORIGINAL DATAFRAME
Please refer to the image for the DataFrame.
I want something like this:
WANT IT MODIFIED TO THIS
Until you provide a Minimal, Complete, and Verifiable example, I will not test this answer, but the following should work.
Given that we have the data stored in a pandas DataFrame called df, we can use pd.melt:
moltendfs = []
for i in range(4):
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))
newdf = pd.concat(moltendfs, axis=1)
We use iloc to take only every fourth column, starting with the i-th column. Then we reset_index in order to be able to keep the date column as our identifier variable. We use melt in order to melt our DataFrame. Finally we simply concatenate all of these molten DataFrames together side by side.
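As a runnable sketch under assumed data (a date index named 'date' and two blocks of four columns, since the real frame is only shown as an image):
import pandas as pd

# Assumed shape: a date index plus eight columns, i.e. two blocks of four
dates = pd.date_range('2021-01-01', periods=3, name='date')
df = pd.DataFrame([[1, 2, 3, 4, 5, 6, 7, 8]] * 3, index=dates,
                  columns=['a1', 'b1', 'c1', 'd1', 'a2', 'b2', 'c2', 'd2'])

moltendfs = []
for i in range(4):
    # every fourth column, starting at column i, with date kept as a column
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))

newdf = pd.concat(moltendfs, axis=1)
print(newdf.head())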