How to include grouping variable in the groupby().diff() results - python

Newbie to pandas here, so go easy on me.
I have a dataframe with lots of columns. I want to do something like
df.groupby('row').diff()
However, the result of the groupby doesn't include the row column.
How do I include the row column in the groupby results?
Alternatively, is it possible to merge the groupby results in the dataframe?

Create an index from the row column first:
df1 = df.set_index('row').groupby('row').diff().reset_index()
Or:
df1 = df.set_index('row').groupby(level=0).diff().reset_index()
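For example, on a small made-up frame (the column names are purely illustrative), the grouping column survives the round trip:
import pandas as pd

df = pd.DataFrame({'row': ['a', 'a', 'b', 'b'],
                   'val': [1, 3, 10, 14]})

df1 = df.set_index('row').groupby('row').diff().reset_index()
# df1 keeps the 'row' column; 'val' is NaN on the first entry of each
# group and holds the within-group difference afterwards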

You could also use agg with np.diff (don't forget to import numpy as np; note that this collapses each group into a single array of differences rather than producing an aligned column):
import numpy as np
df.groupby('row').agg(np.diff)

Related

How should I filter one dataframe by entries from another one in pandas with isin?

I have two dataframes (df1, df2). The column names and indices are the same (only the column entries differ). Also, df2 has only 20 entries (which, as I said, also exist in df1).
I want to filter df1 by df2's entries, but when I try to do it with isin, nothing happens:
df1.isin(df2) or df1.index.isin(df2.index)
Please tell me what I'm doing wrong and how I should do it.
First of all, the isin function in pandas returns a DataFrame of booleans, not the result you want, so it makes sense that the commands you used did not work.
I am positive that the following post will help:
pandas - filter dataframe by another dataframe by row elements
If you want to select the entries in df1 with an index that is also present in df2, you should be able to do it with:
df1.loc[df2.index]
or if you really want to use isin:
df1[df1.index.isin(df2.index)]
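As a quick sketch of the difference between the two (made-up frames): .loc raises a KeyError if any label from df2 is missing from df1, while the isin version silently keeps only the labels that match.
import pandas as pd

df1 = pd.DataFrame({'x': range(5)}, index=list('abcde'))
df2 = pd.DataFrame({'y': [10, 20]}, index=list('bd'))

df1.loc[df2.index]               # rows 'b' and 'd'; KeyError if a label is absent
df1[df1.index.isin(df2.index)]   # same rows; unmatched labels are simply dropped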

Pandas Dataframe: How to drop_duplicates() based on index subset?

Wondering if someone could please help me with this:
I have a pandas df with a rather large number of columns (over 50). I'd like to remove duplicates based on a subset (columns 2 to 50).
I've been trying to use df.drop_duplicates(subset=["col1","col2",....]), but I'm wondering if there is a way to pass the column indices instead, so I don't have to write out all the column headers to consider for the drop and can instead do something along the lines of df.drop_duplicates(subset = [2:]).
Thanks upfront.
You can slice df.columns like:
df.drop_duplicates(subset = df.columns[2:])
Or:
df.drop_duplicates(subset = df.columns[2:].tolist())
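A minimal sketch (three columns instead of fifty, purely illustrative):
import pandas as pd

df = pd.DataFrame({'id':   [1, 2, 3],
                   'colA': ['x', 'x', 'y'],
                   'colB': [0, 0, 1]})

# drop duplicates judged on every column except the first
df.drop_duplicates(subset=df.columns[1:])
# keeps the rows with id 1 and 3; id 2 duplicates id 1 on ('colA', 'colB')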

How to find if a value exists in all rows of a dataframe?

I have an array of unique elements and a dataframe.
I want to find out if the elements in the array exist in all the rows of the dataframe.
P.S. I am new to Python.
This is the piece of code I've written.
for i in uniqueArray:
    for index, row in newDF.iterrows():
        if i in row['MKT']:
            # do something to find out if the element i exists in all rows
Also, this way of iterating is quite expensive, is there any better way to do the same?
Thanks in advance.
Pandas allows you to filter a whole column as if it were Excel:
import pandas
df = pandas.DataFrame(tableData)
Imagine your column names are "Column1", "Column2", etc.
df2 = df[df["Column1"] == "ValueToFind"]
df2 now has only the rows that have "ValueToFind" in df["Column1"]. You can combine several filters using the logical operators & (and) and | (or).
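For instance, to combine two conditions (the column names here are made up), wrap each one in parentheses and join them with & or |:
df2 = df[(df["Column1"] == "ValueToFind") & (df["Column2"] > 0)]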
You can try (note that a Series has no .contains method; it lives under the .str accessor):
for i in uniqueArray:
    if newDF['MKT'].str.contains(i).any():
        # do your task
You can use the isin() method of a pd.Series object.
Assuming you have a data frame named df, you can check whether your column 'MKT' includes any items of your uniqueArray:
new_df = df[df.MKT.isin(uniqueArray)].copy()
new_df will only contain the rows where the value of MKT appears in uniqueArray.
Now do your work on new_df, and join/merge/concat it to the former df as you wish.
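Putting the pieces together for the original "in every row" question, here is a sketch assuming MKT holds strings (uniqueArray and the column name come from the question; str.contains treats its argument as a regex, so pass regex=False for literal matching):
import pandas as pd

newDF = pd.DataFrame({'MKT': ['US,EU', 'US,ASIA', 'US,EU,ASIA']})
uniqueArray = ['US', 'EU']

for i in uniqueArray:
    # .all() is True only if every row of MKT contains the element
    print(i, newDF['MKT'].str.contains(i, regex=False).all())
# US True
# EU False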

pandas pivot_table returns empty dataframe

I get an empty dataframe when I try to group values using the pivot_table. Let's first create some stupid data:
import pandas as pd
df = pd.DataFrame({"size":['large','middle','xsmall','large','middle','small'],
"color":['blue','blue','red','black','red','red']})
When I use:
df1 = df.pivot_table(index='size', aggfunc='count')
it returns what I expect. Now I would like to have a complete pivot table with the colors as columns:
df2 = df.pivot_table(index='size', aggfunc='count',columns='color')
But this results in an empty dataframe. Why? How can I get a simple pivot table which counts the number of combinations?
Thank you.
You need to use len as the aggfunc, like so:
df.pivot_table(index='size', aggfunc=len, columns='color')
If you want to use count, here are the steps:
First add a frequency column, like so:
df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
Then create the pivot table using the frequency column:
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')
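Put together with the question's sample data, the frequency-column route looks like this (missing combinations show up as NaN):
import pandas as pd

df = pd.DataFrame({"size": ['large', 'middle', 'xsmall', 'large', 'middle', 'small'],
                   "color": ['blue', 'blue', 'red', 'black', 'red', 'red']})

df['freq'] = df.groupby(['color', 'size'])['color'].transform('count')
df.pivot_table(values='freq', index='size', aggfunc='count', columns='color')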
You need another column to be used as the values for aggregation.
Add a column:
df['freq'] = 1
Then your code will work.
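Continuing from the question's df, that is:
df['freq'] = 1  # any constant works; it just gives pivot_table a values column to count
df.pivot_table(index='size', aggfunc='count', columns='color')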

Code Optimization for groupby

I have the below code that basically performs a group by operation, followed by a sum.
grouped = df.groupby(by=['Cabin'], as_index=False)['Fare'].sum()
I then rename the columns
grouped.columns = ['Cabin', 'testCol']
I then merge the "grouped" dataframe with my original dataframe to attach the aggregate.
df2 = df.merge(grouped, on='Cabin')
This populates my initial dataframe with the 'testCol' values from the "grouped" dataframe.
Can this code be optimized to fit in one line or something similar?
It seems you need GroupBy.transform for a new column of sums:
df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
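As a self-contained sketch (Cabin and Fare echo the question; the data is made up), transform returns a result aligned to the original rows, so no merge is needed:
import pandas as pd

df = pd.DataFrame({'Cabin': ['A', 'A', 'B'],
                   'Fare': [10.0, 20.0, 5.0]})

df['testCol'] = df.groupby('Cabin')['Fare'].transform('sum')
# testCol is [30.0, 30.0, 5.0]: each row receives its cabin's total fare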
