I have a database which has two columns with unique numbers. This is my reference dataframe (df_reference). In another dataframe (df_data) I want to get the rows whose values in a certain column exist in this reference dataframe. I tried things like:
df_new = df_data[df_data['ID'].isin(df_reference)]
However, this way I don't get any results. What am I doing wrong here?
From what I see, you are passing the whole dataframe to the .isin() method, but .isin() expects a list-like of values (for example a Series), not a whole DataFrame.
Try:
df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
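For example, a minimal sketch with made-up frames (the column names are assumptions from the question):
import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 3, 5]})
df_data = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})

df_new = df_data[df_data['ID'].isin(df_reference['ID'])]
print(df_new)   # keeps the rows with ID 1 and 3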
Convert the ID column to the index of the df_data dataframe. Then you could do:
df_data = df_data.set_index('ID')
matching_index = df_reference['ID']
df_new = df_data.loc[matching_index, :]
This should solve the issue.
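As a rough sketch of this approach, on the same kind of toy frames as above (column names are assumptions from the question):
import pandas as pd

df_reference = pd.DataFrame({'ID': [1, 3]})
df_data = pd.DataFrame({'ID': [1, 2, 3], 'value': ['a', 'b', 'c']})

df_data = df_data.set_index('ID')             # make ID the index
df_new = df_data.loc[df_reference['ID'], :]   # select the matching rows
print(df_new)
Note that in recent pandas versions .loc will raise a KeyError if df_reference contains an ID that is missing from df_data, whereas the .isin() filter simply skips such IDs.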
Newbie to pandas here, so go easy on me.
I have a dataframe with lots of columns. I want to do something like
df.groupby('row').diff()
However, the result of the groupby doesn't include the row column.
How do I include the row column in the groupby results?
Alternatively, is it possible to merge the groupby results back into the dataframe?
Create an index from the row column first:
df1 = df.set_index('row').groupby('row').diff().reset_index()
Or:
df1 = df.set_index('row').groupby(level=0).diff().reset_index()
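A quick sketch on made-up data (the column names here are just examples) showing the result with row kept as a column, and one way to attach the differences back to the original frame:
import pandas as pd

df = pd.DataFrame({'row': ['a', 'a', 'b', 'b'],
                   'x': [1, 3, 10, 14]})

# per-group differences, with 'row' kept as a column
df1 = df.set_index('row').groupby('row').diff().reset_index()
print(df1)

# or attach the result to the original frame as a new column
df['x_diff'] = df.groupby('row')['x'].diff()
print(df)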
You could use agg with np.diff:
df.groupby('row').agg(np.diff)
I have a table:
When I try to pivot in Python:
df.pivot(columns = 'Type', values = 'Value')
it returns only the columns created from the Type column (with their values), but columns Col1-Col5 do not appear in my dataframe.
In Power Query this is achieved very simply: I just choose the column I want to pivot and the values for that column:
And after this operation I get the following result:
How do I achieve the same result using pd.pivot ?
Thanks!
Not with pd.pivot alone; pd.pivot_table is the function made for this case:
table = pd.pivot_table(df, values='Value',
                       index=['Col1','Col2','Col3','Col4','Col5'],
                       columns=['Type'], aggfunc=np.sum)
You should be able to use pivot_table() to do this, following the same steps as outlined in this answer:
res = df.pivot_table(values='Value', index=['Col1','Col2','Col3','Col4','Col5'], columns='Type')
This creates a multi-index dataframe, which you can then flatten:
res.reset_index(inplace=True)
res.columns.name = None
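For instance, a small made-up frame shaped like the table above (column names taken from the question):
import pandas as pd

df = pd.DataFrame({'Col1': ['a', 'a'], 'Col2': ['b', 'b'], 'Col3': ['c', 'c'],
                   'Col4': ['d', 'd'], 'Col5': ['e', 'e'],
                   'Type': ['T1', 'T2'], 'Value': [10, 20]})

res = df.pivot_table(values='Value',
                     index=['Col1','Col2','Col3','Col4','Col5'],
                     columns='Type')
res.reset_index(inplace=True)   # turn the index levels back into columns
res.columns.name = None         # drop the leftover 'Type' label on the columns
print(res)                      # Col1-Col5 plus one column per Type value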
I have the below code that basically performs a group by operation, followed by a sum.
grouped = df.groupby(by=['Cabin'], as_index=False)['Fare'].sum()
I then rename the columns
grouped.columns = ['Cabin', 'testCol']
And I then merge the "grouped" dataframe with my original dataframe to bring in the aggregate.
df2 = df.merge(grouped, on='Cabin')
What this does is populate my initial dataframe with the 'testCol' column from my "grouped" dataframe.
Can this code be optimized to fit in one line or something similar?
It seems you need GroupBy.transform to create a new column of sums:
df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
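A quick check on toy data (values made up) that this matches the groupby/merge result:
import pandas as pd

df = pd.DataFrame({'Cabin': ['A', 'A', 'B'], 'Fare': [10.0, 20.0, 5.0]})

df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
print(df)   # Cabin A rows get 30.0, the Cabin B row gets 5.0
The transform keeps one value per original row, so there is no need for a second frame or a merge.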
I have a pandas dataframe like so:
import pandas as pd
import numpy as np
df = pd.DataFrame([['WY','M',2014,'Seth',5],
                   ['WY','M',2014,'Spencer',5],
                   ['WY','M',2014,'Tyce',5],
                   ['NY','M',2014,'Seth',25],
                   ['MA','M',2014,'Spencer',23]],
                  columns=['state','sex','year','name','number'])
print(df)
How do I manipulate the data to get a dataframe like:
df1 = pd.DataFrame([['M',2014,'Seth',30],
                    ['M',2014,'Spencer',28],
                    ['M',2014,'Tyce',5]],
                   columns=['sex','year','name','number'])
print(df1)
This is just part of a very large dataframe; how would I do this for every name and every year?
df[['sex','year','name','number']].groupby(['sex','year','name']).sum().reset_index()
For a brief description of what this does, from left to right:
Select only the columns we care about. We could replace this part with df.drop('state', axis=1).
Perform a groupby on the columns we care about.
Sum the remaining columns (in this case, just number).
Reset the index so that the columns ['sex','year','name'] are no longer a part of the index.
You can use pivot_table, passing the grouping columns as the index:
df.pivot_table(values='number', index=['sex','year','name'], aggfunc='sum').reset_index()
Group by the columns you want, sum number, and flatten the multi-index:
df.groupby(['sex','year','name'])['number'].sum().reset_index()
In your case the state column is not summable, so you can shorten this to:
df.groupby(['sex','year','name']).sum().reset_index()
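Running the groupby approach on the sample frame from the question reproduces the desired df1 (output roughly as shown in the comments):
import pandas as pd

df = pd.DataFrame([['WY','M',2014,'Seth',5],
                   ['WY','M',2014,'Spencer',5],
                   ['WY','M',2014,'Tyce',5],
                   ['NY','M',2014,'Seth',25],
                   ['MA','M',2014,'Spencer',23]],
                  columns=['state','sex','year','name','number'])

print(df.groupby(['sex','year','name'])['number'].sum().reset_index())
#   sex  year     name  number
# 0   M  2014     Seth      30
# 1   M  2014  Spencer      28
# 2   M  2014     Tyce       5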