How to limit duplicates to 5 per key in a pandas DataFrame?

import pandas as pd
col1 = ['A','B','A','C','A','B','A','C','A','C','A','A','A']
col2 = [1,1,4,2,4,5,6,3,1,5,2,1,1]
df = pd.DataFrame({'col1': col1, 'col2': col2})
For A we have [1,4,4,6,1,2,1,1], which is 8 items, but I want to limit each list to 5 items while converting the DataFrame to a dict of lists.
Expected output:
Dict = {'A':[1,4,4,6,1],'B':[1,5],'C':[2,3,5]}

Use pandas.DataFrame.groupby with apply:
df.groupby('col1')['col2'].apply(lambda x:list(x.head(5))).to_dict()
Output:
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}

Use DataFrame.groupby with a lambda function: convert each group to a list, keep the first 5 values by slicing, and finally convert the result to a dictionary with Series.to_dict:
d = df.groupby('col1')['col2'].apply(lambda x: x.tolist()[:5]).to_dict()
print(d)
{'A': [1, 4, 4, 6, 1], 'B': [1, 5], 'C': [2, 3, 5]}
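Both answers above trim each group inside a lambda. As a possible alternative, here is a minimal sketch that uses GroupBy.head to trim the rows first and only then builds the lists. The helper name limit_per_group and its parameters are made up for illustration and are not part of pandas.

```python
import pandas as pd

col1 = ['A', 'B', 'A', 'C', 'A', 'B', 'A', 'C', 'A', 'C', 'A', 'A', 'A']
col2 = [1, 1, 4, 2, 4, 5, 6, 3, 1, 5, 2, 1, 1]
df = pd.DataFrame({'col1': col1, 'col2': col2})


def limit_per_group(frame, key, value, n):
    """Hypothetical helper: map each key to its first n values, in row order."""
    # GroupBy.head(n) keeps the first n rows of every group, no lambda needed
    trimmed = frame.groupby(key).head(n)
    # Collect the remaining values of each group into a plain list
    return trimmed.groupby(key)[value].agg(list).to_dict()


result = limit_per_group(df, 'col1', 'col2', 5)
print(result)
```

This groups twice, so it is not necessarily faster than the lambda versions. It mainly helps when you also want the trimmed DataFrame itself.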

Related

Find rows differences between two dataframes [duplicate]

This question already has answers here:
Anti-Join Pandas
I have two dataframes that have the same structure/indexes.
df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 3, 2, 1]
})
df1.set_index('id', drop=False, inplace=True)
and
df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 10, 2, 1]
})
df2.set_index('id', drop=False, inplace=True)
And I would like to get this result:
expected = pd.DataFrame({'id': [3], 'column_a': [3], 'column_b': [3], 'column_c': [10]})
I tried using a for loop, but I need to handle a large amount of data and it was not performant enough.

Try merge, then filter on the indicator column:
>>> (df2.reset_index(drop=True)
...     .merge(df1.reset_index(drop=True), indicator="Exist", how="left")
...     .query("Exist == 'left_only'")
...     .drop("Exist", axis=1)
... )
   id  column_a  column_b  column_c
2   3         3         3        10
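To see why filtering on 'left_only' works, it can help to look at the indicator column before it is dropped. The sketch below rebuilds the two frames without set_index (an assumption made to keep the example short) and prints the intermediate merge result.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 3, 2, 1],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 10, 2, 1],
})

# With no `on=` argument, merge joins on every column the two frames share,
# so a df2 row only matches when all of its values also appear in one df1 row.
merged = df2.merge(df1, how='left', indicator='Exist')
print(merged['Exist'].tolist())

# Rows flagged 'left_only' exist in df2 but have no identical row in df1.
only_in_df2 = merged[merged['Exist'] == 'left_only'].drop(columns='Exist')
print(only_in_df2)
```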

What you're asking for could possibly be answered here.
Using the drop_duplicates example from that thread,
pd.concat([df1, df2]).drop_duplicates(keep=False)
you end up with the following DataFrame:
    id  column_a  column_b  column_c
id
3    3         3         3         3
3    3         3         3        10
Note that this approach returns the differing rows from both DataFrames.
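If only the df2 version of each differing row is wanted, and the two frames really do share the same index and columns as the question states, an element-wise comparison is one option. This is a sketch under that assumption, not the method from the linked thread. It uses the default index rather than set_index to keep the example short.

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 3, 2, 1],
})
df2 = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'column_a': [5, 4, 3, 2, 1],
    'column_b': [5, 4, 3, 2, 1],
    'column_c': [5, 4, 10, 2, 1],
})

# Element-wise comparison requires identically labelled frames.
# `changed` is True for each row in which at least one cell differs.
changed = (df1 != df2).any(axis=1)

# Keep only the df2 side of the rows that differ
diff = df2[changed]
print(diff)
```

Keep in mind that NaN compares as not equal to NaN, so rows with missing values in both frames would be reported as different.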

How to use one for loop to add two lists to a dictionary in Python? Expected result {'a':[1,2]}

I have written code that works, but when I tried to use only one for loop it didn't work out.
Here is the Python code:
lst_one = [1,2,3,4,5,6,7,8]
lst_two = ['a','b','c','a','b','c','a','a']
result = {}
for createname in range(len(lst_one)):
    result[lst_two[createname]] = []
for value in range(len(lst_one)):
    result[lst_two[value]].append(lst_one[value])
print(result)
The code above prints {'a': [1, 4, 7, 8], 'b': [2, 5], 'c': [3, 6]}
It works fine using two loops.
Is it possible to use one loop instead of two?
I am using a range loop, not lambda, zip, and so on.

Use zip + setdefault:
lst_one = [1, 2, 3, 4, 5, 6, 7, 8]
lst_two = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a']
result = {}
for o, t in zip(lst_one, lst_two):
    result.setdefault(t, []).append(o)
print(result)
Output:
{'a': [1, 4, 7, 8], 'b': [2, 5], 'c': [3, 6]}
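Since the question asks specifically for a range-based loop rather than zip, the same idea can be sketched with a single range loop. The key is created the first time it is seen, so the separate initialisation loop is no longer needed.

```python
lst_one = [1, 2, 3, 4, 5, 6, 7, 8]
lst_two = ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'a']

result = {}
for i in range(len(lst_one)):
    key = lst_two[i]
    # Create the empty list the first time this key appears
    if key not in result:
        result[key] = []
    result[key].append(lst_one[i])

print(result)
```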

You can use defaultdict, which creates a dictionary whose missing keys get a default value of the type you specify (list, int, dict, and so on). If the key is present, the operation is applied to its existing value; if not, the key is created with a default value first.
from collections import defaultdict
lst_one = [1,2,3,4,5,6,7,8]
lst_two = ['a','b','c','a','b','c','a','a']
result = defaultdict(list)
for a, b in zip(lst_one, lst_two):
    result[b].append(a)
print(dict(result))
Output:
{'a': [1, 4, 7, 8], 'b': [2, 5], 'c': [3, 6]}
If you don't want to use defaultdict, you can use the code below, which does the same thing explicitly:
lst_one = [1,2,3,4,5,6,7,8]
lst_two = ['a','b','c','a','b','c','a','a']
result = {}
for a, b in zip(lst_one, lst_two):
    if b not in result:
        result[b] = [a]
    else:
        result[b].append(a)
print(result)
Output:
{'a': [1, 4, 7, 8], 'b': [2, 5], 'c': [3, 6]}

I'd recommend using groupby from the itertools module if you want to condense this. Note that groupby only groups consecutive items, so the pairs must be sorted by key first:
from itertools import groupby
{k: [v for _, v in g] for k, g in groupby(sorted(zip(lst_two, lst_one)), key=lambda x: x[0])}

Convert list of lists to list of dictionaries based on order of items in sublist

I would like to convert my list of lists to a list of dictionaries. The values of the first list should be the keys, and the remaining lists should be treated as values.
For example:
[['a','b','c'],[1,2,3],[4,5,6],[7,8,9]]
should convert to
[{'a':[1,4,7]}, {'b': [2,5,8]},{'c': [3,6,9]}]
I found this but it didn't help me.
Any help would be greatly appreciated. Thanks

Use zip to transpose your array into [('a', 1, 4, 7), ...]; pop off the first element as key, listify the rest as value.
arr = [['a','b','c'],[1,2,3],[4,5,6],[7,8,9]]
[{ e[0]: list(e[1:])} for e in zip(*arr)]
# => [{'a': [1, 4, 7]}, {'b': [2, 5, 8]}, {'c': [3, 6, 9]}]
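To make the transpose step more concrete, here is the same approach split in two, so the intermediate result of zip(*arr) is visible. The variable name columns is only for illustration.

```python
arr = [['a', 'b', 'c'], [1, 2, 3], [4, 5, 6], [7, 8, 9]]

# zip(*arr) pairs up the i-th element of every sublist, i.e. it transposes arr
columns = list(zip(*arr))
print(columns)

# The first element of each column is the key, the rest become the value list
result = [{col[0]: list(col[1:])} for col in columns]
print(result)
```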

Using a list comprehension with sequence unpacking:
L = [['a','b','c'],[1,2,3],[4,5,6],[7,8,9]]
res = [{names: nums} for names, *nums in zip(*L)]
print(res)
[{'a': [1, 4, 7]}, {'b': [2, 5, 8]}, {'c': [3, 6, 9]}]

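Alternatively, build a single dictionary holding all the keys (note that this gives one dict inside a list rather than one dict per key):
a = [['a','b','c'],[1,2,3],[4,5,6],[7,8,9]]
columns = list(zip(*a[1:]))  # zip objects cannot be indexed in Python 3, so make a list
dictionary_values = [dict((a[0][i], list(columns[i])) for i in range(len(a[0])))]
Output:
[{'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]}]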

Filtering a Python dictionary based on a key's values

I have a dictionary dictM of the form
dictM = {movieID: [rating1, rating2, rating3, rating4]}
The key is a movieID, and rating1, rating2, rating3, rating4 are its values. There are several movieIDs with ratings. I want to move certain movieIDs, along with their ratings, to a new dictionary if a movieID has at least a certain number of ratings.
What I'm doing is:
for movie in dictM.keys():
    if len(dictM[movie]) >= 5:
        dF[movie] = d[movie]
But I'm not getting the desired result. Does someone know a solution for this?

You can use dictionary comprehension, as follows:
>>> dictM = {1: [1, 2, 3, 4], 2: [1, 2, 3]}
>>> {k: v for (k, v) in dictM.items() if len(v) ==4}
{1: [1, 2, 3, 4]}

You can try this using a simple dictionary comprehension:
dictM = {3: [4, 3, 2, 5, 1]}
new_dict = {a: b for a, b in dictM.items() if len(b) >= 5}
Your code above may not be producing any results for two reasons: first, you have not defined dF; second, the only value in your example dictM has length 4, but the if statement in your code requires 5 or more.

You don't delete the entries from the original dictionary. You could do it like this:
dictM = {1: [1, 2, 3],
         2: [1, 2, 3, 4, 5],
         3: [1, 2, 3, 4, 5, 6, 7],
         4: [1]}
dF = {}
for movieID in list(dictM):  # iterate over a copy of the keys so deleting is safe
    if len(dictM[movieID]) >= 5:
        dF[movieID] = dictM[movieID]  # put the movie + ratings in the new dict
        del dictM[movieID]  # remove the movie from the old dict
The result looks like this:
>>> dictM
{1: [1, 2, 3], 4: [1]}
>>> dF
{2: [1, 2, 3, 4, 5], 3: [1, 2, 3, 4, 5, 6, 7]}
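If mutating dictM while looping feels fragile, one alternative is to build both dictionaries in a single pass and leave the original untouched. This is only a sketch. The function name split_by_rating_count and its min_count parameter are invented for this example.

```python
def split_by_rating_count(ratings, min_count):
    """Hypothetical helper: return (kept, moved) without modifying `ratings`."""
    kept = {}
    moved = {}
    for movie_id, values in ratings.items():
        # Movies with at least min_count ratings go to `moved`, the rest stay
        target = moved if len(values) >= min_count else kept
        target[movie_id] = values
    return kept, moved


dictM = {1: [1, 2, 3],
         2: [1, 2, 3, 4, 5],
         3: [1, 2, 3, 4, 5, 6, 7],
         4: [1]}

dictM_rest, dF = split_by_rating_count(dictM, 5)
print(dictM_rest)
print(dF)
```

The rating lists themselves are shared between the old and new dictionaries, not copied, which is usually fine unless the lists are modified later.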

Get DataFrame selection's row positions

Instead of the index labels, I'd like to obtain the row positions, so I can use the result later with df.iloc[row_positions].
This is the example:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
print(df[df['a']>=2].index)
# Int64Index([2, 7], dtype='int64')
# How do I convert the index list [2, 7] to [1, 2] (the row position)
# I managed to do this for 1 index element, but how can I do this for the entire selection/index list?
df.index.get_loc(2)
Update
I could use a list comprehension to apply get_loc to each selected label, but perhaps there's a built-in pandas function.

You can use where from NumPy:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
np.where(df.a >= 2)
This returns the row positions:
(array([1, 2], dtype=int64),)

@ssm's answer is what I would normally use. However, to answer your specific query of how to select multiple rows, try this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
indices = df[df['a']>=2].index
print(df.loc[indices])
(.ix, which older pandas code would use here, was removed in pandas 1.0; .loc does label-based selection.)
[EDIT to answer the specific query]
How do I convert the index list [2, 7] to [1, 2] (the row position)?
Reset the index before filtering, so that the new index holds the original row positions:
df.reset_index().query('a >= 2').index
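As for the pandas built-in the question asks about, Index.get_indexer looks like a reasonable candidate: it maps a list of labels to their integer positions in one call. A minimal sketch, assuming the index labels are unique (get_indexer raises an error otherwise):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])

# Labels of the selected rows
labels = df[df['a'] >= 2].index

# get_indexer returns the integer position of each label within df.index
positions = df.index.get_indexer(labels)
print(positions.tolist())

# The positions can then be fed back to iloc
print(df.iloc[positions])
```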
