I have a data frame like this: Main DataFrame.
How can I extract a new data frame from it, like this:
Second DataFrame
I know df['Group'].value_counts().index shows me counts of each 'Group', but I don't know how to create a new dataframe from it.
You can get a new dataframe with the counts by using groupby, for example:
df.groupby("Group", as_index=False).size()
Other interesting options for groupby that you might want to use depending on use case are dropna=False and sort=False.
To rename the size column to Count, chain rename: .rename(columns={'size': 'Count'})
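Putting the two steps together, a minimal sketch could look like the following. The sample values are made up for illustration; only the column name Group comes from the question. It assumes pandas 1.1 or later, where size() returns a DataFrame when as_index=False is set.

```python
import pandas as pd

# Hypothetical data; only the column name "Group" is taken from the question.
df = pd.DataFrame({"Group": ["a", "b", "a", "c", "a", "b"]})

# Count rows per group, keep "Group" as a regular column,
# then rename the generated "size" column to "Count".
counts = (
    df.groupby("Group", as_index=False)
      .size()
      .rename(columns={"size": "Count"})
)

print(counts)
```

The result is an ordinary DataFrame with one row per group, so it can be filtered, sorted, or merged like any other frame.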
I have a list of columns from a dataframe
df_date=[df[var1],df[var2]]
I want to change the data in those columns to datetime type
for t in df_date:
    pd.DatetimeIndex(t)
For some reason it's not working.
I would like to understand what a more general solution is for applying several operations to several columns.
As an alternative, you can do:
for column_name in ["var1", "var2"]:
    df[column_name] = pd.DatetimeIndex(df[column_name])
You can use pandas.to_datetime and pandas.DataFrame.apply to convert a dataframe's entire content to datetime. You can also filter out the columns you need and apply it only to them.
df[['column1', 'column2']] = df[['column1', 'column2']].apply(pd.to_datetime)
Note that a list of series and a DataFrame are not the same thing.
A DataFrame is accessed like this:
df[[columns]]
While a list of series looks like this:
[seriesA, seriesB]
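As a small sketch of the apply approach above, the following converts two string columns and leaves the rest untouched. The column names var1 and var2 mirror the question; the values and the extra column are invented for illustration. The key point is that the result is assigned back to df, which the loop in the question never did.

```python
import pandas as pd

# Hypothetical sample; "var1" and "var2" stand in for the question's columns.
df = pd.DataFrame({
    "var1": ["2021-01-01", "2021-02-15"],
    "var2": ["2020-12-31", "2021-03-01"],
    "other": [1, 2],
})

date_columns = ["var1", "var2"]

# Select a sub-DataFrame (double brackets), convert each column,
# and assign the converted columns back into df.
df[date_columns] = df[date_columns].apply(pd.to_datetime)

print(df.dtypes)
```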
data={'order_num':[123,234,356,123,234,356],'email':['abc#gmail.com','pqr#hotmail.com','xyz#yahoo.com','abc#gmail.com','pqr#hotmail.com','xyz#gmail.com'],'product_code':['rdcf1','6fgxd','2sdfs','34fgdf','gvwt5','5ganb']}
df=pd.DataFrame(data,columns=['order_num','email','product_code'])
My data frame looks something like this:
Image of data frame
For the sake of simplicity, I omitted the other columns while making the example. I need to group by the column called order_num, apply(list) on product_code, sort the groups based on a timestamp column, and retain columns like email as they are.
I tried doing something like:
df.groupby(['order_num', 'email', 'timestamp'])['product_code'].apply(list).sort_values(by='timestamp').reset_index()
Output: Expected output appearance
but I do not wish to group by the other columns. Is there an alternative way to perform the list operation? I tried using transform, but it threw a size mismatch error, and I don't think it's the right way to go either.
If there are many other columns and you need to group by order_num only, use Series.map to fill the column with the per-group lists, then remove duplicate rows per order_num with DataFrame.drop_duplicates, and finally sort if necessary:
df['product_code']=df['order_num'].map(df.groupby('order_num')['product_code'].apply(list))
df = df.drop_duplicates('order_num').sort_values(by='timestamp')
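Here is a runnable sketch of that approach using the sample data from the question. The timestamp values are assumptions added for illustration, since the question omitted that column. Note that drop_duplicates keeps the first row per order_num, so the retained email and timestamp are those of the first occurrence.

```python
import pandas as pd

data = {
    "order_num": [123, 234, 356, 123, 234, 356],
    "email": ["abc#gmail.com", "pqr#hotmail.com", "xyz#yahoo.com",
              "abc#gmail.com", "pqr#hotmail.com", "xyz#gmail.com"],
    "product_code": ["rdcf1", "6fgxd", "2sdfs", "34fgdf", "gvwt5", "5ganb"],
    # Hypothetical timestamps; the question did not show this column.
    "timestamp": pd.to_datetime([
        "2021-01-03", "2021-01-01", "2021-01-02",
        "2021-01-04", "2021-01-05", "2021-01-06",
    ]),
}
df = pd.DataFrame(data)

# One list of product codes per order_num (a Series indexed by order_num).
codes_per_order = df.groupby("order_num")["product_code"].apply(list)

# Look up each row's list by its order_num, then keep one row per order.
df["product_code"] = df["order_num"].map(codes_per_order)
result = df.drop_duplicates("order_num").sort_values(by="timestamp")

print(result)
```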
I need to create a new column in my df that holds the mean of another existing column, but I need it to take into account each individual location over time rather than the mean of all the values in the existing column.
Based on the sample dataset below, what I am looking for is a new column that contains the Mean for each Site, not the mean of all the values independent of Site.
Sample Dataset
Use groupby and agg to compute the mean of that column per Site, then merge it back:
df = df.merge(df.groupby('Site',as_index=False).agg({'TIME_HOUR':'mean'})[['Site','TIME_HOUR']],on='Site',how='left')
Use groupby:
df.groupby('Site')['TIME_HOUR'].mean().reset_index()
This returns one row per Site, so merge the result back on Site to get it as a column of the original frame.
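As an alternative that neither answer above uses, groupby(...).transform("mean") returns a result aligned with the original rows, so it can be assigned directly as a new column. A minimal sketch, with invented sample values and a hypothetical column name SITE_MEAN:

```python
import pandas as pd

# Hypothetical data; only "Site" and "TIME_HOUR" come from the question.
df = pd.DataFrame({
    "Site": ["A", "A", "B", "B", "B"],
    "TIME_HOUR": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# transform broadcasts each group's mean back to every row of that group,
# so the result has the same length and index as df.
df["SITE_MEAN"] = df.groupby("Site")["TIME_HOUR"].transform("mean")

print(df)
```

This avoids the merge and the duplicated TIME_HOUR column names that a merge on Site would produce.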
I have a DataFrame with four columns and want to generate a new DataFrame with only one column containing the maximum value of each row.
Using df2 = df1.max(axis=1) gave me the correct results, but the column is titled 0 and is not operable, meaning I cannot check its data type or change its name, which is critical for further processing. Does anyone know what is going on here? Or better yet, is there a better way to generate this new DataFrame?
The result is a Series, not a DataFrame; to get a one-column DataFrame, use Series.to_frame:
df2 = df1.max(axis=1).to_frame('maximum')
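The difference between the two objects can be seen in a short sketch. The four-column frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical four-column frame.
df1 = pd.DataFrame({"a": [1, 5], "b": [4, 2], "c": [3, 3], "d": [0, 9]})

# max(axis=1) reduces each row to one value and returns a Series.
row_max = df1.max(axis=1)
print(type(row_max).__name__, row_max.dtype)

# to_frame wraps the Series in a DataFrame with a named column.
df2 = row_max.to_frame("maximum")
print(df2)
```

A Series has a dtype and a name attribute rather than column labels, which is why the output appeared to have an unusable column titled 0.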
I have a pandas dataframe which I would like to slice after every 4 columns and then stack the slices vertically on top of each other, keeping the date as the index. Is this possible using np.vstack()? Thanks in advance!
ORIGINAL DATAFRAME
Please refer to the image for the dataframe.
I want something like this
WANT IT MODIFIED TO THIS
Until you provide a Minimal, Complete, and Verifiable example, I will not test this answer but the following should work:
Given that the data is stored in a pandas DataFrame called df, we can use pd.melt:
moltendfs = []
for i in range(4):
    moltendfs.append(df.iloc[:, i::4].reset_index().melt(id_vars='date'))
newdf = pd.concat(moltendfs, axis=1)
We use iloc to take only every fourth column, starting with the i-th column. Then we reset_index in order to be able to keep the date column as our identifier variable. We use melt in order to melt our DataFrame. Finally we simply concatenate all of these molten DataFrames together side by side.
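If the goal is literally to cut the frame into consecutive blocks of four columns and stack those blocks vertically, a different sketch using pd.concat with axis=0 may be closer to the question. This is not the answer's method. It assumes the number of columns is a multiple of four and that every block can share the same column names; the names c1 to c4 and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical frame with a date index and eight columns (two blocks of four).
dates = pd.date_range("2021-01-01", periods=2, name="date")
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8],
     [9, 10, 11, 12, 13, 14, 15, 16]],
    index=dates,
    columns=list("abcdefgh"),
)

block_size = 4
new_names = ["c1", "c2", "c3", "c4"]  # hypothetical shared column names

blocks = []
for start in range(0, df.shape[1], block_size):
    # Take four consecutive columns and give every block the same labels
    # so that concat stacks them instead of aligning on different names.
    block = df.iloc[:, start:start + block_size].copy()
    block.columns = new_names
    blocks.append(block)

# axis=0 stacks the blocks on top of each other; the date index is kept.
stacked = pd.concat(blocks, axis=0)

print(stacked)
```

The dates repeat once per block in the resulting index, which is usually what a vertical stack of time-indexed blocks looks like.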