Python Pandas to group columns only

I have a simple dataframe as below, and I want to reduce it to one row per unique Department/name pair:
I use:
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jason', 'Amy', 'Jason', 'River', 'Kate', 'David', 'Jack', 'David'],
        'Department': ['Sales', 'Operation', 'Operation', 'Sales', 'Operation', 'Sales', 'Operation', 'Sales', 'Finance', 'Finance', 'Finance'],
        'Weight lost': [4, 4, 1, 4, 4, 4, 7, 2, 8, 1, 8],
        'Point earned': [2, 2, 1, 2, 2, 2, 4, 1, 4, 1, 4]}
df = pd.DataFrame(data)
final = (df.pivot_table(index=['Department', 'name'], values='Weight lost',
                        aggfunc='count', fill_value=0)
           .stack(dropna=False)
           .reset_index(name='Weight_lost_count'))
del final['level_2']
del final['Weight_lost_count']
print(final)
There seem to be unnecessary steps in the 'final' line. What would be a better way to write it?

Try groupby with head
out = df.groupby(['Department','name']).head(1)
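Note that groupby(...).head(1) keeps all four columns. If the goal is the first row of each Department/name pair with every column intact, a hedged equivalent (assuming "first occurrence wins") is drop_duplicates with a subset:
# keep the first row of each Department/name pair, all columns retained
out = df.drop_duplicates(subset=['Department', 'name'])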

Isn't this just drop_duplicates:
df[['Department','name']].drop_duplicates()
Output:
  Department   name
0      Sales  Jason
1  Operation  Molly
2  Operation   Tina
4  Operation    Amy
6  Operation  River
7      Sales   Kate
8    Finance  David
9    Finance   Jack
And to exactly match the final:
(df[['Department','name']].drop_duplicates()
.sort_values(by=['Department','name'])
)
Output:
  Department   name
8    Finance  David
9    Finance   Jack
4  Operation    Amy
1  Operation  Molly
6  Operation  River
2  Operation   Tina
0      Sales  Jason
7      Sales   Kate
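For comparison, a groupby-based sketch that yields the same sorted pairs; groupby sorts its keys by default, so no explicit sort_values is needed:
# one row per unique pair; keys come out sorted, matching the output above
out = (df.groupby(['Department', 'name'])
         .size()
         .reset_index()[['Department', 'name']])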

Related

How to split pandas dataframe into list of dataframes by id?

I have a big pandas dataframe (about 150,000 rows). I have tried groupby('id'), but it returns (key, group) tuples. I need just a list of dataframes, which I then convert into NumPy array batches to feed into an autoencoder (like this https://www.datacamp.com/community/tutorials/autoencoder-keras-tutorial but 1D).
So I have a pandas dataset:
data = {'Name': ['Tom', 'Joseph', 'Krish', 'John', 'John', 'John', 'John', 'Krish'], 'Age': [20, 21, 19, 18, 18, 18, 18, 18],'id': [1, 1, 2, 2, 3, 3, 3, 3]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
df.head(10)
I need the same output (just a list of pandas dataframes). Also, I need the lists kept in their original, unsorted order; this is important because it is time-series data.
data1 = {'Name': ['Tom', 'Joseph'], 'Age': [20, 21],'id': [1, 1]}
data2 = {'Name': ['Krish', 'John', ], 'Age': [19, 18, ],'id': [2, 2]}
data3 = {'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18],'id': [3, 3, 3, 3]}
pd_1 = pd.DataFrame(data1)
pd_2 = pd.DataFrame(data2)
pd_3 = pd.DataFrame(data3)
array_list = [pd_1,pd_2,pd_3]
array_list
How can I split the dataframe?
Or you can try:
array_list = df.groupby(df.id.values).agg(list).to_dict('records')
Output:
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'],
'Age': [18, 18, 18, 18],
'id': [3, 3, 3, 3]}]
UPDATE:
If you need a dataframe list:
df_list = [g for _,g in df.groupby('id')]
#OR
df_list = [pd.DataFrame(i) for i in df.groupby(df.id.values).agg(list).to_dict('records')]
To reset the index of each dataframe:
df_list = [g.reset_index(drop=True) for _,g in df.groupby('id')]
Group on id and use to_dict with the 'list' orientation to prepare one record per id (sort=False keeps the original group order):
[g.to_dict('list') for _, g in df.groupby('id', sort=False)]
[{'Name': ['Tom', 'Joseph'], 'Age': [20, 21], 'id': [1, 1]},
{'Name': ['Krish', 'John'], 'Age': [19, 18], 'id': [2, 2]},
{'Name': ['John', 'John', 'John', 'Krish'], 'Age': [18, 18, 18, 18], 'id': [3, 3, 3, 3]}]
I am not sure about your exact need, but does something like this work for you?
df = df.set_index("id")
[df.loc[i].to_dict("list") for i in df.index.unique()]
or if you really want to keep your index in your list:
[df.query(f"id == {i}").to_dict("list") for i in df.id.unique()]
If you want to create new DataFrames storing the values:
(Previous answers are more relevant if you want to create a list)
This can be solved by iterating over each Id with a for loop and creating a new dataframe on each iteration. I refer you to #40498463 and the other answers for the usage of the groupby() function. Please note that I have changed the name of the id column to Id.
for Id, df in df.groupby("Id"):
    str1 = "df"
    str2 = str(Id)
    new_name = str1 + str2
    exec('{} = pd.DataFrame(df)'.format(new_name))
Output:
df1
Name Age Id
0 Tom 20 1
1 Joseph 21 1
df2
Name Age Id
2 Krish 19 2
3 John 18 2
df3
Name Age Id
4 John 18 3
5 John 18 3
6 John 18 3
7 Krish 18 3
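As a side note, a dictionary keyed by Id is usually safer than exec-generated variable names. A minimal sketch, assuming df is the full frame from the question with the id column renamed to Id:
# map each Id to its sub-dataframe instead of creating df1, df2, ... via exec
dfs = {i: g.reset_index(drop=True) for i, g in df.groupby("Id")}
print(dfs[1])  # the rows with Id == 1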

Can I make 4 new columns aggregating 4 previous ones?

I have a data set like this:
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
Where Dan in A has the corresponding number 3 in B, and where Dan in C has the corresponding number 6 in D.
I would like to create 2 new columns, one with the name Dan and the other with 9 (3+6).
Desired Output
data = ({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12], 'E': ['Dan', 'Tom', 'Mary'], 'F': [9, 7, 9], 'G': ['John', 'Mike'], 'H': [1, 12]})
For the names that appear only once (John and Mike), two more columns with their values unchanged.
I have tried using some for loops and .loc, but I am not anywhere close.
Thanks!
df = pd.DataFrame(data)
_df = df[['C', 'D']].rename(columns={'C': 'A', 'D': 'B'})
totals = (pd.concat([df[['A', 'B']], _df])
          .groupby('A', as_index=False)['B'].sum())
totals.columns = ['E', 'F']
df = df.merge(totals, how='left', left_on='A', right_on='E')
Although you can join on column C too; that's something you have to choose. Alternatively, if you want just columns E and F, skip the last line!
You can try this:
import pandas as pd
data = {'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5], 'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]}
df=pd.DataFrame(data)
df=df.rename(columns={"C": "A", "D": "B"})
df=df.stack().reset_index(0, drop=True).rename_axis("index").reset_index()
df=df.pivot(index=df.index//2, columns="index")
df.columns=map(lambda x: x[1], df.columns)
df=df.groupby("A", as_index=False).sum()
Outputs:
>>> df
A B
0 Dan 9
1 John 1
2 Mary 9
3 Mike 12
4 Tom 7
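Yet another sketch of the same reduction, using pd.lreshape (a real, if little-advertised, pandas helper) to pair up (A, B) and (C, D); df is rebuilt here from the question's data:
import pandas as pd
df = pd.DataFrame({'A': ['John', 'Dan', 'Tom', 'Mary'], 'B': [1, 3, 4, 5],
                   'C': ['Tom', 'Mary', 'Dan', 'Mike'], 'D': [3, 4, 6, 12]})
# stack the (A, B) and (C, D) pairs into long form, then sum per name
long_df = pd.lreshape(df, {'name': ['A', 'C'], 'value': ['B', 'D']})
totals = long_df.groupby('name', as_index=False)['value'].sum()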

Python string matching and give repeated numbers for unmatched strings

I have a set of words (list1): "management consultancy services better financial health"
import nltk

user_search = "management consultancy services better financial health"
user_split = nltk.word_tokenize(user_search)
user_length = len(user_split)
Assign: management=1, consultancy=2, services=3, better=4, financial=5, health=6.
Then compare this against some other lists of words, e.g.:
list2: ['us', 'paleri', 'home', 'us', 'consulting', 'services',
        'market', 'research', 'analysis', 'project', 'feasibility',
        'studies', 'market', 'strategy', 'business', 'plan',
        'model', 'health', 'human', etc...]
Wherever a match occurs, it should be reflected at the corresponding position as 1, 2, 3, etc. Unmatched positions should be filled with fresh numbers for their words, continuing past 6.
Expected output example:
[1] 7 8 9 10 11 3 12 13 14 15 16 17 18 19 20 21 22 6 23 24
This means strings 3 and 6, i.e. services and health, are present in this list (matched); the other numbers indicate unmatched words. user_length = 6, so numbering for unmatched positions starts from 7. How can I get such a result in Python?
You can use itertools.count to create a counter and iterate via next:
from itertools import count
user_search = "management consultancy services better financial health"
words = {v: k for k, v in enumerate(user_search.split(), 1)}
# {'better': 4, 'consultancy': 2, 'financial': 5,
# 'health': 6, 'management': 1, 'services': 3}
L = ['us', 'paleri', 'home', 'us', 'consulting', 'services',
'market', 'research', 'analysis', 'project', 'feasibility',
'studies', 'market', 'strategy', 'business', 'plan',
'model', 'health', 'human']
c = count(start=len(words)+1)
res = [next(c) if word not in words else words[word] for word in L]
# [7, 8, 9, 10, 11, 3, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 6, 23]
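Wrapped as a small reusable function (a sketch of the same technique, not part of the original answer):
from itertools import count

def positions(user_search, candidates):
    # 1-based position for known words; fresh numbers past the end otherwise
    words = {v: k for k, v in enumerate(user_search.split(), 1)}
    c = count(start=len(words) + 1)
    return [words[w] if w in words else next(c) for w in candidates]

positions(user_search, L)  # same list as res above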

Matching keywords of list elements with pandas columns

This question is a follow-up to an earlier one, so I am asking it as a new question.
If my dataframe B would be something like:
ID  category       words                      bucket_id
1   audi           a4, a6                     94
2   bugatti        veyron, chiron             86
3   mercedez       s-class, e-class           79
4   dslr           canon, nikon               69
5   apple          iphone,macbook,ipod        51
6   finance        sales,loans,sales price    12
7   politics       trump, election, votes     77
8   entertainment  spiderman,thor, ironmen    88
9   music          beiber, rihana,drake       14
...
I want each matched category along with its corresponding ID and bucket_id as a dictionary, something like:
{'id': 2, 'term': 'bugatti', 'bucket_id': 86}
{'id': 3, 'term': 'mercedez', 'bucket_id': 79}
{'id': 6, 'term': 'finance', 'bucket_id': 12}
{'id': 7, 'term': 'politics', 'bucket_id': 77}
{'id': 9, 'term': 'music', 'bucket_id': 14}
Edit: I only want to map keywords that exactly match a whole comma-delimited token in the words column, not substrings or matches that run together with other words.
EDIT:
df = pd.DataFrame({'ID': [1, 2, 3],
                   'category': ['bugatti', 'entertainment', 'mercedez'],
                   'words': ['veyron,chiron', 'spiderman,thor,ironmen',
                             's-class,e-class,s-class'],
                   'bucket_id': [94, 86, 79]})
print (df)
ID category words bucket_id
0 1 bugatti veyron,chiron 94
1 2 entertainment spiderman,thor,ironmen 86
2 3 mercedez s-class,e-class,s-class 79
A = ['veyron','s-class','derman']
idx = [i for i, x in enumerate(df['words']) for y in x.split(',') if y in A]
print (idx)
[0, 2, 2]
L = (df.loc[idx, ['ID','category','bucket_id']]
     .rename(columns={'category':'term'})
     .to_dict(orient='records'))
print (L)
[{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79},
{'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]
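Note that idx repeats a row when several of its tokens match A (row 2 appears twice above). A hedged variant that keeps each matching row once uses a boolean mask instead:
A_set = set(A)
mask = df['words'].apply(lambda s: bool(A_set.intersection(s.split(','))))
L = (df.loc[mask, ['ID', 'category', 'bucket_id']]
       .rename(columns={'category': 'term'})
       .to_dict(orient='records'))
# [{'ID': 1, 'term': 'bugatti', 'bucket_id': 94},
#  {'ID': 3, 'term': 'mercedez', 'bucket_id': 79}]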

How to find out difference of two dataframes in terms of column name using Python

I want to find out the difference between two dataframes in terms of column names.
This is sample table 1:
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data=d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table 2:
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data=d2)
Table 2 (df2) does not have the month and year columns that df1 has. I want to find out which columns of df1 are missing from df2.
I know there's EXCEPT in SQL, but how do I do this using pandas/Python? Any suggestions?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns]
This finds the columns of df1 not in df2 and returns them as a list of column names.
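For completeness, a short sketch checking both directions at once; difference and symmetric_difference are standard pd.Index methods:
only_in_df1 = df1.columns.difference(df2.columns)           # Index(['month', 'year'], dtype='object')
only_in_df2 = df2.columns.difference(df1.columns)           # empty in this example
either_way = df1.columns.symmetric_difference(df2.columns)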
