Python Pandas to group values in 2 columns - python

I have a data frame like the one below. The names fall into 5 groups, linked by the common value in column A.
I want to group the names. I tried:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
df_1 = df.groupby('A')['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
    print (row['A'], row['B'])
the outputs are:
('Brian', ['Ronald', 'Anthony'])
('Edward', ['David', 'Joseph'])
('James', ['John', 'Michael', 'William'])
('Jason', ['George', 'Kenneth', 'Steven'])
('Thomas', ['Christopher', 'Daniel'])
but I want one list for each group (it would be even better if there's an automatic way to assign a variable to each list), like:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
I tried row['B'].append(row['A']) but it returns None.
What's the right way to group them? thank you.

You can prepend the value of the grouping column A inside GroupBy.apply, using the .name attribute of each group:
s = df.groupby('A')['B'].apply(lambda x: [x.name] + list(x))
print (s)
A
Brian [Brian, Ronald, Anthony]
Edward [Edward, David, Joseph]
James [James, John, Michael, William]
Jason [Jason, George, Kenneth, Steven]
Thomas [Thomas, Christopher, Daniel]
Name: B, dtype: object
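On the "automatic way to assign a variable to each list" part of the question: instead of creating variables dynamically, a dictionary keyed by the value in column A is usually easier to work with. A minimal sketch on a shortened copy of the question's data; the name `groups` is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ["James", "James", "James", "Edward", "Edward"],
    'B': ["John", "Michael", "William", "David", "Joseph"],
})

# One entry per group: the key is the value from column A, and the list
# starts with that key followed by the group's values from column B.
groups = {name: [name] + grp.tolist() for name, grp in df.groupby('A')['B']}

print(groups['James'])
```

Each list can then be looked up by name, e.g. `groups['Edward']`.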

You can try this, using pd.Series.tolist():
for k,g in df.groupby('A')['B']:
    print([k]+g.tolist())
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']
You got None because list.append mutates the list in place and returns None.
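A small illustration of that difference, independent of pandas (the variable names are only for the example):

```python
names = ['Ronald', 'Anthony']

# append changes `names` itself and hands back None
returned = names.append('Brian')
print(returned)
print(names)

# concatenation builds a new list and leaves its operands untouched
combined = ['Brian'] + ['Ronald', 'Anthony']
print(combined)
```

So printing the result of `append` shows None, while the list it was called on has been changed.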

try the following:
import pandas as pd
data = {'A': ["James","James","James","Edward","Edward","Thomas","Thomas","Jason","Jason","Jason","Brian","Brian"],
'B' : ["John","Michael","William","David","Joseph","Christopher","Daniel","George","Kenneth","Steven","Ronald","Anthony"]}
df = pd.DataFrame(data)
#display(df)
df_1 = df.groupby('A')['B'].apply(list)
df_1 = df_1.to_frame().reset_index()
for index, row in df_1.iterrows():
    # The value in column A is a plain string, not a list,
    # so wrap it in a list before concatenating it with column B.
    print([row['A']] + row['B'])
output:
['Brian', 'Ronald', 'Anthony']
['Edward', 'David', 'Joseph']
['James', 'John', 'Michael', 'William']
['Jason', 'George', 'Kenneth', 'Steven']
['Thomas', 'Christopher', 'Daniel']

Related

Pandas merge only on where condition

Please help with the logic for applying a merge only where a condition is met.
In the example below, the merge should apply only where Name is John; every other row should show 0.
df1 = pd.DataFrame({'Name': ['John', 'Tom', 'Simon', 'Jose'],
'Age': [5, 6, 4, 5]})
df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Jose'],
'Class': ['Second', 'Third', 'Fifth']})
Expected result:
TIA
Use merge, and first filter df2 down to the rows you want:
df1.merge(df2[df2["Name"] == 'John'] , how = 'left' , on = 'Name')
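The left merge leaves missing values (NaN) in Class for every name other than John. The question's expected result is not shown, so the following is only a sketch under the assumption that those rows should hold the integer 0:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Tom', 'Simon', 'Jose'],
                    'Age': [5, 6, 4, 5]})
df2 = pd.DataFrame({'Name': ['John', 'Tom', 'Jose'],
                    'Class': ['Second', 'Third', 'Fifth']})

# Keep only John's row of df2, then left-merge so every row of df1 survives.
merged = df1.merge(df2[df2['Name'] == 'John'], how='left', on='Name')

# Cast to object so the column can hold both strings and the integer 0,
# then replace the missing values produced by the left merge.
merged['Class'] = merged['Class'].astype(object).fillna(0)

print(merged)
```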

Pandas rolling groupby with one group

I am doing groupby and rolling to a dataframe.
If I have more than 1 group, the result is a pandas series but if I only have 1 group then the result is a pandas dataframe. I replicated it below if you need to see what I am doing.
Is there a way to force pandas to return a series each time, even if there is only one group?
If you wish to recreate what I am seeing, you can run the below examples.
Example 1 (Series):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'Jill', 'Jill', 'Jill', 'Jill'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
Example 2 (Dataframe):
df = pd.DataFrame(data={'Name':['John', 'John', 'John', 'John', 'John', 'John', 'John'],'Score':[1,1, 1,2,2, 2, 2]})
df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum()
To get a Series in both cases, wrap the resulting values in pd.Series:
pd.Series(df.groupby('Name', as_index=False, sort=False)['Score'].rolling(2,min_periods=0).sum().values)
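Wrapping `.values` discards the original row labels. An alternative sketch, assuming a reasonably recent pandas where a grouped rolling sum comes back indexed by (group, original row label): drop the group level and keep the row labels. The helper name `rolling_sum_by_group` is made up for this example.

```python
import pandas as pd

def rolling_sum_by_group(df):
    """Rolling sum of Score within each Name, indexed like the input rows."""
    rolled = (df.groupby('Name', sort=False)['Score']
                .rolling(2, min_periods=0)
                .sum())
    # Drop the group level of the index and restore the original row order.
    return rolled.droplevel(0).sort_index()

one_group = pd.DataFrame({'Name': ['John'] * 7,
                          'Score': [1, 1, 1, 2, 2, 2, 2]})
two_groups = pd.DataFrame({'Name': ['John'] * 3 + ['Jill'] * 4,
                           'Score': [1, 1, 1, 2, 2, 2, 2]})

print(type(rolling_sum_by_group(one_group)))
print(type(rolling_sum_by_group(two_groups)))
```

Because the result keeps the input's index, it can be assigned straight back as a new column of the original frame.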

Get values from a list of dictionaries in a Pandas Dataframe

Okay, so I have a dataframe. Each element of column 'z' is a list of dictionaries.
For example, row two of column 'z' looks like this:
[ {'name': 'Tom', 'hw': [180, 79]},
{'name': 'Mark', 'hw': [119, 65]} ]
I would like it to just contain the 'name' values, in this case the element would be Tom and Mark without the 'hw' values.
I've tried converting it into a list, then removing every second element, but I lost which values came from the same row. Not every row has the same number of elements in it, some have 2 names, some might have 4.
One way using list comprehension with dict.get:
Example
df = pd.DataFrame({'z': [[{'name': 'Tom', 'hw': [180, 79]},
{'name': 'Mark', 'hw': [119, 65]}]]})
df['name'] = [[d.get('name') for d in x] for x in df['z']]
[out]
z name
0 [{'name': 'Tom', 'hw': [180, 79]}, {'name': 'M... [Tom, Mark]
You can also use pandas' Series.str.get (in this example each element of column 'col' is a single dict):
df['name']=df.col.str.get('name')
df
col name
0 {'name': 'Tom', 'hw': [180, 79]} Tom
1 {'name': 'Mark', 'hw': [119, 65]} Mark
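The two ideas can be combined for a column whose cells are lists of dicts of varying length: explode the lists, pull out the name, then regroup by the original row. This is only a sketch; the second row ('Ann') is invented sample data added to show rows with different numbers of names.

```python
import pandas as pd

df = pd.DataFrame({'z': [
    [{'name': 'Tom', 'hw': [180, 79]}, {'name': 'Mark', 'hw': [119, 65]}],
    [{'name': 'Ann', 'hw': [165, 58]}],   # hypothetical extra row
]})

names = (
    df['z'].explode()          # one dict per row, original index repeated
           .str.get('name')    # pull the 'name' value out of each dict
           .groupby(level=0)   # regroup by the original row label
           .agg(list)          # collect the names of each row into a list
)
df['name'] = names

print(df['name'])
```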

How to find out difference of two dataframes in terms of column name using Python

I want to find out the difference between two data frame in terms of column names.
This is sample table1
d1 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df1 = pd.DataFrame(data = d1)
df1['month'] = pd.DatetimeIndex(df1['DoB']).month
df1['year'] = pd.DatetimeIndex(df1['DoB']).year
This is sample table2
d2 = {'row_num': [1, 2, 3, 4, 5], 'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
df2 = pd.DataFrame(data = d2)
table 2 or df2 does not have the month and year column like df1. I want to find out which columns of df1 are missing in df2.
I know there's 'EXCEPT' in sql but how to do it using pandas/python , Any suggestions ?
There's a function meant just for this purpose: pd.Index.difference
df1.columns.difference(df2.columns)
Index(['month', 'year'], dtype='object')
And the corresponding columns:
df1[df1.columns.difference(df2.columns)]
month year
0 1 2010
1 1 2012
2 11 2014
3 11 2014
4 9 2016
You can do:
[col for col in df1.columns if col not in df2.columns]
to find the columns of df1 that are not in df2; the output is a list of column names.
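One practical difference between the two approaches is ordering: Index.difference sorts its result by default, while the list comprehension keeps the column order of df1. A small sketch with empty frames whose column order is chosen to make that visible (passing `sort=False` to `difference` should also keep the original order):

```python
import pandas as pd

# Empty frames are enough here, since only the column labels matter.
df1 = pd.DataFrame(columns=['row_num', 'name', 'year', 'month'])
df2 = pd.DataFrame(columns=['row_num', 'name'])

sorted_diff = df1.columns.difference(df2.columns)               # sorted labels
ordered_diff = df1.columns.difference(df2.columns, sort=False)  # df1's order
listcomp_diff = [col for col in df1.columns if col not in df2.columns]

print(list(sorted_diff))
print(list(ordered_diff))
print(listcomp_diff)
```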

Merging 2 list of dicts based on common values

So I have 2 list of dicts which are as follows:
list1 = [
{'name':'john',
'gender':'male',
'grade': 'third'
},
{'name':'cathy',
'gender':'female',
'grade':'second'
},
]
list2 = [
{'name':'john',
'physics':95,
'chemistry':89
},
{'name':'cathy',
'physics':78,
'chemistry':69
},
]
The output list i need is as follows:
final_list = [
    {'name':'john',
     'gender':'male',
     'grade':'third',
     'marks': {'physics':95, 'chemistry': 89}
    },
    {'name':'cathy',
     'gender':'female',
     'grade':'second',
     'marks': {'physics':78, 'chemistry': 69}
    },
]
First i tried with iteration as follows:
final_list = []
for item1 in list1:
    for item2 in list2:
        if item1['name'] == item2['name']:
            temp = dict(item2)
            temp.pop('name')
            final_list.append(dict(name=item1['name'], **temp))
However, this does not give me the desired result. I also tried pandas, where I have limited experience.
>>> import pandas as pd
>>> df1 = pd.DataFrame(list1)
>>> df2 = pd.DataFrame(list2)
>>> result = pd.merge(df1, df2, on=['name'])
However, I am clueless about how to get the data back into the format I need. Any help is appreciated.
You can first merge both dataframes
In [144]: df = pd.DataFrame(list1).merge(pd.DataFrame(list2))
Which would look like,
In [145]: df
Out[145]:
gender grade name chemistry physics
0 male third john 89 95
1 female second cathy 69 78
Then create a marks column holding a dict (wrapped in a list here):
In [146]: df['marks'] = df.apply(lambda x: [x[['chemistry', 'physics']].to_dict()], axis=1)
In [147]: df
Out[147]:
gender grade name chemistry physics \
0 male third john 89 95
1 female second cathy 69 78
marks
0 [{u'chemistry': 89, u'physics': 95}]
1 [{u'chemistry': 69, u'physics': 78}]
And, use to_dict(orient='records') method of selected columns of dataframe
In [148]: df[['name', 'gender', 'grade', 'marks']].to_dict(orient='records')
Out[148]:
[{'gender': 'male',
'grade': 'third',
'marks': [{'chemistry': 89L, 'physics': 95L}],
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': [{'chemistry': 69L, 'physics': 78L}],
'name': 'cathy'}]
Using your pandas approach, you can call
result.to_dict(orient='records')
to get it back as a list of dictionaries. It won't put marks in as a sub-field though, since there's nothing telling it to do that. physics and chemistry will just be fields on the same level as the rest.
You may also be having problems because your name is 'cathy' in the first list and 'kathy' in the second, which naturally won't get merged.
Create a function that adds a marks column; this column should contain a dictionary of the physics and chemistry marks:
def create_marks(row):
    row['marks'] = {'chemistry': row['chemistry'], 'physics': row['physics']}
    return row
result_with_marks = result.apply(create_marks, axis=1)
Out[19]:
gender grade name chemistry physics marks
male third john 89 95 {u'chemistry': 89, u'physics': 95}
female second cathy 69 78 {u'chemistry': 69, u'physics': 78}
then convert it to your desired result as follows
result_with_marks.drop( ['chemistry' , 'physics'], axis = 1).to_dict(orient = 'records')
Out[20]:
[{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89L, 'physics': 95L},
'name': 'john'},
{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69L, 'physics': 78L},
'name': 'cathy'}]
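The outputs above come from an older Python 2 session (hence the u'' and 89L forms). A compact sketch of the same idea for current pandas, without a row-wise apply: build the nested dicts with to_dict(orient='records') on just the subject columns. The name `subject_cols` is only illustrative.

```python
import pandas as pd

list1 = [{'name': 'john', 'gender': 'male', 'grade': 'third'},
         {'name': 'cathy', 'gender': 'female', 'grade': 'second'}]
list2 = [{'name': 'john', 'physics': 95, 'chemistry': 89},
         {'name': 'cathy', 'physics': 78, 'chemistry': 69}]

df = pd.DataFrame(list1).merge(pd.DataFrame(list2), on='name')
subject_cols = ['physics', 'chemistry']

# to_dict(orient='records') yields one dict per row; assigning that list
# back gives each row its own nested 'marks' dict.
df['marks'] = df[subject_cols].to_dict(orient='records')

# Drop the flat subject columns and convert back to a list of dicts.
final_list = df.drop(columns=subject_cols).to_dict(orient='records')

print(final_list)
```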
Since you want a list of dicts as output, you can easily do this without pandas. Use a dict keyed by name to store all the info, making one pass over each list instead of the O(n^2) nested loops in your own code:
out = {d["name"]: d for d in list1}
for d in list2:
    out[d.pop("name")]["marks"] = d
from pprint import pprint as pp
pp(list(out.values()))
Output:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]
That reuses the dicts in your lists. If you want to create new dicts instead, copy each one before changing it:
out = {d["name"]: d.copy() for d in list1}
for d in list2:
    marks = d.copy()
    k = marks.pop("name")
    out[k]["marks"] = marks
from pprint import pprint as pp
pp(list(out.values()))
The output is the same:
[{'gender': 'female',
'grade': 'second',
'marks': {'chemistry': 69, 'physics': 78},
'name': 'cathy'},
{'gender': 'male',
'grade': 'third',
'marks': {'chemistry': 89, 'physics': 95},
'name': 'john'}]
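The same one-pass idea can be wrapped in a function that leaves both input lists untouched. This is a sketch, not the answerer's code: the function name `merge_marks` is made up, and giving a person with no entry in the second list an empty marks dict is an assumption.

```python
def merge_marks(people, scores):
    """Return new dicts combining each person with a nested 'marks' dict."""
    # First pass: index the marks by name, leaving the 'name' key out.
    marks_by_name = {
        d["name"]: {k: v for k, v in d.items() if k != "name"}
        for d in scores
    }
    # Second pass: build a fresh dict per person; the inputs are not modified.
    return [
        {**person, "marks": marks_by_name.get(person["name"], {})}
        for person in people
    ]

list1 = [{'name': 'john', 'gender': 'male', 'grade': 'third'},
         {'name': 'cathy', 'gender': 'female', 'grade': 'second'}]
list2 = [{'name': 'john', 'physics': 95, 'chemistry': 89},
         {'name': 'cathy', 'physics': 78, 'chemistry': 69}]

final_list = merge_marks(list1, list2)
print(final_list)
```

The result follows the order of the first list, and the dicts in list1 and list2 keep their original keys.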
