Python Pandas to group columns only

I have a simple dataframe as below, and I want to reduce it to one row per unique Department/name pair:
I use:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jason', 'Amy', 'Jason', 'River', 'Kate', 'David', 'Jack', 'David'],
        'Department': ['Sales', 'Operation', 'Operation', 'Sales', 'Operation', 'Sales', 'Operation', 'Sales', 'Finance', 'Finance', 'Finance'],
        'Weight lost': [4, 4, 1, 4, 4, 4, 7, 2, 8, 1, 8],
        'Point earned': [2, 2, 1, 2, 2, 2, 4, 1, 4, 1, 4]}
df = pd.DataFrame(data)
final = (df.pivot_table(index=['Department', 'name'], values='Weight lost',
                        aggfunc='count', fill_value=0)
           .stack(dropna=False)
           .reset_index(name='Weight_lost_count'))
del final['level_2']
del final['Weight_lost_count']
print(final)
There seem to be unnecessary steps in the 'final' line. What would be a better way to write it?

Try groupby with head
out = df.groupby(['Department','name']).head(1)
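Note that groupby(...).head(1) keeps all four columns. If the goal is the first row of each Department/name pair with every column intact, a hedged equivalent (assuming "first occurrence wins") is drop_duplicates with a subset:
# keep the first row of each Department/name pair, all columns retained
out = df.drop_duplicates(subset=['Department', 'name'])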

Isn't this just drop_duplicates:
df[['Department','name']].drop_duplicates()
Output:
  Department   name
0      Sales  Jason
1  Operation  Molly
2  Operation   Tina
4  Operation    Amy
6  Operation  River
7      Sales   Kate
8    Finance  David
9    Finance   Jack
And to exactly match the final:
(df[['Department','name']].drop_duplicates()
.sort_values(by=['Department','name'])
)
Output:
  Department   name
8    Finance  David
9    Finance   Jack
4  Operation    Amy
1  Operation  Molly
6  Operation  River
2  Operation   Tina
0      Sales  Jason
7      Sales   Kate
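For comparison, a groupby-based sketch that yields the same sorted pairs; groupby sorts its keys by default, so no explicit sort_values is needed:
# one row per unique pair; keys come out sorted, matching the output above
out = (df.groupby(['Department', 'name'])
         .size()
         .reset_index()[['Department', 'name']])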

Related

How to split pandas dataframe into list of dataframes by id?

I have a big pandas dataframe (about 150,000 rows). I have tried groupby('id'), but it returns (key, group) tuples. I need just a list of dataframes, which I then convert into NumPy array batches to feed into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D).
So I have a pandas dataset:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same output (just a list of pandas dataframes). Also, I need the lists kept in their original, unsorted order; this is important because it is time-series data.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split the dataframe?
Or you can try:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
Group on id and use to_dict with the 'list' orientation to prepare one record per id (sort=False keeps the original group order):
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your exact need, but does something like this work for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each Id with a for loop and creating a new dataframe on each iteration. I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
    str1 = "df"
    str2 = str(Id)
    new_name = str1 + str2
    exec('{} = pd.DataFrame(df)'.format(new_name))
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3
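As a side note, a dictionary keyed by Id is usually safer than exec-generated variable names. A minimal sketch, assuming df is the full frame from the question with the id column renamed to Id:
# map each Id to its sub-dataframe instead of creating df1, df2, ... via exec
dfs = {i: g.reset_index(drop=True) for i, g in df.groupby("Id")}
print(dfs[1])  # the rows with Id == 1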

Can I make 4 new columns aggregating 4 previous ones?

I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
For the names that appear only once (John and Mike), two more columns with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!
df = pd.DataFrame(data)
_df = df[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'})
totals = (pd.concat([df[['A', 'B']], _df])
          .groupby('A', as_index=False)['B'].sum())
totals.columns = ['E', 'F']
df = df.merge(totals, how='left', left_on='A', right_on='E')
Although you can join on column C too; that's something you have to choose. Alternatively, if you want just columns E and F, skip the last line!
You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
A B
0 Dan 9
1 John 1
2 Mary 9
3 Mike 12
4 Tom 7
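Yet another sketch of the same reduction, using pd.lreshape (a real, if little-advertised, pandas helper) to pair up (A, B) and (C, D); df is rebuilt here from the question's data:
import pandas as pd
df = pd.DataFrame({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5],
                   'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
# stack the (A, B) and (C, D) pairs into long form, then sum per name
long_df = pd.lreshape(df, {'name': ['A', 'C'], 'value': ['B', 'D']})
totals = long_df.groupby('name', as_index=False)['value'].sum()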

Python string matching and give repeated numbers for unmatched strings

I have a set of words (list1): "management consultancy services better financial health"
import nltk

user_search = "management consultancy services better financial health"
user_split = nltk.word_tokenize(user_search)
user_length = len(user_split)
Assign: management=1, consultancy=2, services=3, better=4, financial=5, health=6.
Then compare this against some other lists of words, e.g.:
list2: ['us', 'paleri', 'home', 'us', 'consulting', 'services',
        'market', 'research', 'analysis', 'project', 'feasibility',
        'studies', 'market', 'strategy', 'business', 'plan',
        'model', 'health', 'human', etc...]
Wherever a match occurs, it should be reflected at the corresponding position as 1, 2, 3, etc. Unmatched positions should be filled with fresh numbers for their words, continuing past 6.
Expected output example:
[1] 7 8 9 10 11 3 12 13 14 15 16 17 18 19 20 21 22 6 23 24
This means strings 3 and 6, i.e. services and health, are present in this list (matched); the other numbers indicate unmatched words. user_length = 6, so numbering for unmatched positions starts from 7. How can I get such a result in Python?
You can use itertools.count to create a counter and iterate via next:
from itertools import count
user_search = "management consultancy services better financial health"
words = {v: k for k, v in enumerate(user_search.split(), 1)}
# {'better': 4, 'consultancy': 2, 'financial': 5,
# 'health': 6, 'management': 1, 'services': 3}
L = ['us', 'paleri', 'home', 'us', 'consulting', 'services',
'market', 'research', 'analysis', 'project', 'feasibility',
'studies', 'market', 'strategy', 'business', 'plan',
'model', 'health', 'human']
c = count(start=len(words)+1)
res = [next(c) if word not in words else words[word] for word in L]
# [7, 8, 9, 10, 11, 3, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 6, 23]
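Wrapped as a small reusable function (a sketch of the same technique, not part of the original answer):
from itertools import count

def positions(user_search, candidates):
    # 1-based position for known words; fresh numbers past the end otherwise
    words = {v: k for k, v in enumerate(user_search.split(), 1)}
    c = count(start=len(words) + 1)
    return [words[w] if w in words else next(c) for w in candidates]

positions(user_search, L)  # same list as res above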

Matching keywords of list elements with pandas columns

This question is a follow-up to an earlier one, so I am asking it as a new question.
If my dataframe B would be something like:
ID  category       words                      bucket_id
1   audi           a4, a6                     94
2   bugatti        veyron, chiron             86
3   mercedez       s-class, e-class           79
4   dslr           canon, nikon               69
5   apple          iphone,macbook,ipod        51
6   finance        sales,loans,sales price    12
7   politics       trump, election, votes     77
8   entertainment  spiderman,thor, ironmen    88
9   music          beiber, rihana,drake       14
...
I want each matched category along with its corresponding ID and bucket_id as a dictionary, something like:
{'id': 2, 'term': 'bugatti', 'bucket_id': 86}
{'id': 3, 'term': 'mercedez', 'bucket_id': 79}
{'id': 6, 'term': 'finance', 'bucket_id': 12}
{'id': 7, 'term': 'politics', 'bucket_id': 77}
{'id': 9, 'term': 'music', 'bucket_id': 14}
Edit: I only want to map keywords that exactly match a whole comma-delimited token in the words column, not substrings or matches that run together with other words.
EDIT:
df = pd.DataFrame({'ID': [1, 2, 3],
                   'category': ['bugatti', 'entertainment', 'mercedez'],
                   'words': ['veyron,chiron', 'spiderman,thor,ironmen',
                             's-class,e-class,s-class'],
                   'bucket_id': [94, 86, 79]})
print (df)
ID category words bucket_id
0 1 bugatti veyron,chiron 94
1 2 entertainment spiderman,thor,ironmen 86
2 3 mercedez s-class,e-class,s-class 79
A = ['veyron','s-class','derman']
idx = [i for i, x in enumerate(df['words']) for y in x.split(',') if y in A]
print (idx)
[0, 2, 2]
L = (df.loc[idx, ['ID','category','bucket_id']]
     .rename(columns={'category':'term'})
     .to_dict(orient='records'))
print (L)
[{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]
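Note that idx repeats a row when several of its tokens match A (row 2 appears twice above). A hedged variant that keeps each matching row once uses a boolean mask instead:
A_set = set(A)
mask = df['words'].apply(lambda s: bool(A_set.intersection(s.split(','))))
L = (df.loc[mask, ['ID', 'category', 'bucket_id']]
       .rename(columns={'category': 'term'})
       .to_dict(orient='records'))
# [{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
#  {'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]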

How to find out difference of two dataframes in terms of column name using Python

I want to find out the difference between two dataframes in terms of column names.
This is sample table 1:
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data=d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table 2:
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data=d2)
Table 2 (df2) does not have the month and year columns that df1 has. I want to find out which columns of df1 are missing from df2.
I know there's EXCEPT in SQL, but how do I do this using pandas/Python? Any suggestions?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns]
This finds the columns of df1 not in df2 and returns them as a list of column names.
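For completeness, a short sketch checking both directions at once; difference and symmetric_difference are standard pd.Index methods:
only_in_df1 = df1.columns.difference(df2.columns)           # Index(['month', 'year'], dtype='object')
only_in_df2 = df2.columns.difference(df1.columns)           # empty in this example
either_way = df1.columns.symmetric_difference(df2.columns)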
