Double loop over a dictionary avoiding repetitions - Python

Consider the following code snippet:
foo = {'a': 0, 'b': 1, 'c': 2}
for k1 in foo:
    for k2 in foo:
        print(foo[k1], foo[k2])
The output will be
0 0
0 1
0 2
1 0
1 1
1 2
2 0
2 1
2 2
I do not care about the order within each pair of keys, so I would like code that outputs
0 0
0 1
0 2
1 1
1 2
2 2
I tried with
foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
for k1 in foo_2:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
but, as expected, I got
RuntimeError: dictionary changed size during iteration
Other solutions?

foo = {'a': 0, 'b': 1, 'c': 2}
foo_2 = foo.copy()
for k1 in foo:
    for k2 in foo_2:
        print(foo[k1], foo[k2])
    foo_2.pop(k1)
In your attempt you iterated over foo_2 in both loops, so popping k1 from foo_2 modified the dictionary while it was still being iterated, which raises the error. By iterating over foo in the outer loop, foo_2 is only modified between runs of its own (inner) iteration, so the error goes away.

>>> foo = {'a': 0, 'b': 1, 'c': 2}
>>> keys = list(foo.keys())
>>> for i, v in enumerate(keys):
...     for v2 in keys[i:]:
...         print(foo[v], foo[v2])
...
0 0
0 1
0 2
1 1
1 2
2 2

A basic approach (note that this compares the values, so it assumes they are comparable and distinct):
foo = {'a': 0, 'b': 1, 'c': 2}
for v1 in foo.values():
    for v2 in foo.values():
        if v1 <= v2:
            print(v1, v2)
It could be done with itertools.combinations_with_replacement as well:
from itertools import combinations_with_replacement
foo = {'a': 0, 'b': 1, 'c': 2}
print(*[f'{foo[k1]} {foo[k2]}' for k1, k2 in combinations_with_replacement(foo.keys(), r=2)], sep='\n')
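The same result can also be written as an explicit loop over the values, which some may find more readable (a minimal sketch of the same combinations_with_replacement idea):
from itertools import combinations_with_replacement

foo = {'a': 0, 'b': 1, 'c': 2}
# Each unordered pair of values is produced once, including a value paired with itself.
for v1, v2 in combinations_with_replacement(foo.values(), 2):
    print(v1, v2)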

You can also copy the values of the foo dictionary into a list and loop over that, shrinking the list as you go.
foo = {'a': 0, 'b': 1, 'c': 2}
val_list = list(foo.values())
for v1 in foo.values():
    for v2 in val_list:
        print(v1, v2)
    # Drop the first remaining value once it has been paired with everything.
    val_list.pop(0)

Related

How to find sum of dictionaries in a pandas DataFrame across all rows?

I have a DataFrame
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c':1, 'd':2}, {'a':5, 'c':21, 'd':4}, {'b':2, 'c':1, 'g':1, 'h':1, 'i':1}]})
I want to add up the elements across all rows, without using iterrows, to get:
a: 8
b: 6
c: 28
d: 6
g: 1
h: 1
i: 1
note: no element occurs twice in a single row in the original DataFrame.
Using collections.Counter, you can sum an iterable of Counter objects. Since Counter is a subclass of dict, you can then feed the result to pd.DataFrame.from_dict.
from collections import Counter
counts = sum(map(Counter, df['keywords']), Counter())
res = pd.DataFrame.from_dict(counts, orient='index')
print(res)
    0
a   8
b   6
c  28
d   6
g   1
h   1
i   1
Not sure how this compares in terms of optimization with #jpp's answer, but I'll give it a shot.
# What we're starting out with
df = pd.DataFrame({'keywords': [{'a': 3, 'b': 4, 'c': 5}, {'c':1, 'd':2}, {'a':5, 'c':21, 'd':4}, {'b':2, 'c':1, 'g':1, 'h':1, 'i':1}]})
# Turns the array of dictionaries into a DataFrame
values_df = pd.DataFrame(df["keywords"].values.tolist())
# Sums up the individual keys
sums = {key:values_df[key].sum() for key in values_df.columns}
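The per-key dictionary comprehension can also be collapsed into a single vectorized sum, building on values_df from the snippet above (a small sketch; sum() skips the NaN entries left by missing keys):
# Column-wise sum; NaN entries are skipped, astype(int) restores integer counts.
sums = values_df.sum().astype(int).to_dict()
print(sums)  # {'a': 8, 'b': 6, 'c': 28, 'd': 6, 'g': 1, 'h': 1, 'i': 1}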

How to sort pandas DataFrame with a key?

I'm looking for a way to sort a pandas DataFrame. pd.DataFrame.sort_values doesn't accept a key function. I could convert the column to a list and use sorted with a key, but that would be slow. The other option seems to involve a categorical index, but I don't have a fixed number of rows, so I don't know whether a categorical index is applicable.
Here is an example of the kind of data I want to sort:
Input DataFrame:
  clouds  fluff
0    {[}      1
1    >>>      2
2     {1      3
3    123      4
4  AAsda      5
5    aad      6
Output DataFrame:
  clouds  fluff
0    >>>      2
1    {[}      1
2     {1      3
3    123      4
4    aad      6
5  AAsda      5
The rule for sorting (in order of priority):
First, special characters (sorted by ASCII among themselves)
Next, numbers
Next, lower-case letters (lexicographical)
Next, upper-case letters (lexicographical)
In plain Python I'd do it like this:
from functools import cmp_to_key

def ks(a, b):
    # "Not exactly this but similar"
    if a.isupper():
        return -1
    else:
        return 1
Example:
sorted(['aa', 'AA', 'dd', 'DD'], key=cmp_to_key(ks))
Answer:
['DD', 'AA', 'aa', 'dd']
How would you do it with Pandas?
As of pandas 1.1.0, pandas.DataFrame.sort_values accepts a key argument, which must be a callable.
So in this case we would use:
df.sort_values(by='clouds', key=kf)
where kf is the key function; it operates on the Series being sorted, i.e. it accepts a Series and returns a Series.
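For example, a simple case-insensitive sort with this API would look like the following (a minimal sketch, not the exact priority rule from the question):
import pandas as pd

df = pd.DataFrame({'clouds': ['aad', 'AAsda', '123'], 'fluff': [6, 5, 4]})
# The key callable receives the column being sorted as a Series
# and must return a Series of the same length to sort by.
print(df.sort_values(by='clouds', key=lambda col: col.str.lower()))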
As of pandas 1.2.0, I did this:
import numpy as np
import pandas as pd

df = pd.DataFrame(['aa', 'dd', 'DD', 'AA'], columns=["data"])

# This is the sorting rule
rule = {
    "DD": 1,
    "AA": 10,
    "aa": 20,
    "dd": 30,
}

def particular_sort(series):
    """
    Must return one Series
    """
    return series.apply(lambda x: rule.get(x, 1000))

new_df = df.sort_values(by=["data"], key=particular_sort)
print(new_df)  # DD, AA, aa, dd
Of course, you can also write it inline, though it may be harder to read:
new_df = df.sort_values(by=["data"], key=lambda x: x.apply(lambda y: rule.get(y, 1000)))
print(new_df) # DD, AA, aa, dd
This might be useful, though I'm still not sure whether special characters can actually be sorted this way.
import pandas as pd

a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df['upper'] = df['a'].str.isupper()
df['lower'] = df['a'].str.islower()
df['int'] = df['a'].apply(isinstance, args=[int])
df2 = pd.concat([df[df['int'] == True].sort_values(by=['a']),
                 df[df['lower'] == True].sort_values(by=['a']),
                 df[df['upper'] == True].sort_values(by=['a'])])
print(df2)
   a  upper  lower    int
3  1    NaN    NaN   True
0  2    NaN    NaN   True
6  3    NaN    NaN   True
4  a  False   True  False
5  b  False   True  False
2  c  False   True  False
8  A   True  False  False
1  B   True  False  False
7  C   True  False  False
You can also do it in one step, without creating the extra True/False columns:
a = [2, 'B', 'c', 1, 'a', 'b', 3, 'C', 'A']
df = pd.DataFrame({"a": a})
df2 = pd.concat([df[df['a'].apply(isinstance, args=[int])].sort_values(by=['a']),
                 df[df['a'].str.islower() == True].sort_values(by=['a']),
                 df[df['a'].str.isupper() == True].sort_values(by=['a'])])
a
3 1
0 2
6 3
4 a
5 b
2 c
8 A
1 B
7 C
This seems to work:
import numpy as np
from typing import Callable
from pandas import DataFrame

def sort_dataframe_by_key(dataframe: DataFrame, column: str, key: Callable) -> DataFrame:
    """Sort a dataframe from a column using the key."""
    sort_ixs = sorted(np.arange(len(dataframe)), key=lambda i: key(dataframe.iloc[i][column]))
    return DataFrame(columns=list(dataframe), data=dataframe.iloc[sort_ixs].values)
It passes tests:
def test_sort_dataframe_by_key():
    dataframe = DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}])
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: x).equals(
        DataFrame([{'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}, {'a': 3, 'b': 4, 'c': 0}]))
    assert sort_dataframe_by_key(dataframe, column='a', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))
    assert sort_dataframe_by_key(dataframe, column='b', key=lambda x: -x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 1, 'b': 2, 'c': 3}, {'a': 2, 'b': 1, 'c': 1}]))
    assert sort_dataframe_by_key(dataframe, column='c', key=lambda x: x).equals(
        DataFrame([{'a': 3, 'b': 4, 'c': 0}, {'a': 2, 'b': 1, 'c': 1}, {'a': 1, 'b': 2, 'c': 3}]))

Python: How to get the statistics of the position of each item in multiple lists?

I want to analyze the items appearing in a collection of sequences and the positions in each sequence where an item appears.
For example:
dataframe['sequence_list'][0] = ['a','b', 'f', 'e']
dataframe['sequence_list'][1] = ['a','c', 'd', 'e']
dataframe['sequence_list'][2] = ['a','d']
...
dataframe['sequence_list'][i] = ['a','b', 'c']
What I want to get is:
How many times does 'a' appear in positions 0, 1, 2, 3 of the lists?
How many times does 'b' appear in positions 0, 1, 2, 3 of the lists?
...
Output would be like:
output[0, 'a'] = 4
output[1, 'a'] = 0
output[2, 'a'] = 0
output[3, 'a'] = 0
output[1, 'b'] = 2
...
The output format could be different. I just want to know whether there is a quick, matrix-style way of computing these statistics.
Start by converting the lists into Series using one of the two statements:
df_ser = dataframe.sequence_list.apply(pd.Series)
df_ser = pd.DataFrame(dataframe.sequence_list.tolist()) # ~30% faster?
The columns of the new dataframe are item positions for each row:
#    0  1    2    3
# 0  a  b    f    e
# 1  a  c    d    e
# 2  a  d  NaN  NaN
# 3  a  b    c  NaN
Convert the column numbers into the second-level index, then the second-level index into a column of its own:
df_col = df_ser.stack().reset_index(level=1)
# level_1 0
#0 0 a
#0 1 b
#0 2 f
#....
Count the combinations. This is your answer:
output = df_col.groupby(['level_1', 0]).size()
#level_1 0
#0 a 4
#1 b 2
# c 1
# d 1
#2 c 1
# d 1
# f 1
#3 e 2
You can have it as dictionary:
output.to_dict()
#{(0, 'a'): 4, (1, 'b'): 2, (1, 'c'): 1, (1, 'd'): 1,
# (2, 'c'): 1, (2, 'd'): 1, (2, 'f'): 1, (3, 'e'): 2}
All in one line:
dataframe.sequence_list.apply(pd.Series)\
    .stack().reset_index(level=1)\
    .groupby(['level_1', 0]).size().to_dict()
Setup
Using the setup
df = pd.DataFrame({'col': [['a','b', 'f', 'e'], ['a','c', 'd', 'e'], ['a','d'], ['a','b', 'c']]})
col
0 [a, b, f, e]
1 [a, c, d, e]
2 [a, d]
3 [a, b, c]
You can use apply + Counter (imported from collections):
from collections import Counter

pd.DataFrame(df.col.tolist()).apply(Counter)
which yields
0 {'a': 4}
1 {'b': 2, 'c': 1, 'd': 1}
2 {'f': 1, 'd': 1, None: 1, 'c': 1}
3 {'e': 2, None: 2}
dtype: object
one Counter per position (column index).
From there you can reshape the data however you need, e.g. fill in the missing keys to add the zeroes or, if appropriate, drop the None entries.
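For example, to get a dense position-by-item table with the zeroes filled in, the stack/groupby result from the first answer can simply be unstacked (a sketch using the same setup):
import pandas as pd

df = pd.DataFrame({'col': [['a', 'b', 'f', 'e'], ['a', 'c', 'd', 'e'],
                           ['a', 'd'], ['a', 'b', 'c']]})
positions = pd.DataFrame(df['col'].tolist()).stack().reset_index(level=1)
counts = positions.groupby(['level_1', 0]).size()
# Rows are positions, columns are items; missing combinations become 0.
print(counts.unstack(fill_value=0))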

python: How can I combine rows in a dataframe?

I tried to combine rows with the apply function on a DataFrame, but couldn't get it to work. I would like to combine rows into one list whenever the (c1, c2) column pair is the same.
For example:
Dataframe df1
   c1 c2  c3
0   0  x  {'a': 1, 'b': 2}
1   0  x  {'a': 3, 'b': 4}
2   0  y  {'a': 5, 'b': 6}
3   0  y  {'a': 7, 'b': 8}
4   2  x  {'a': 9, 'b': 10}
5   2  x  {'a': 11, 'b': 12}
Expected result:
Dataframe df1
   c1 c2  c3
0   0  x  [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1   0  y  [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2   2  x  [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
Source Pandas DF:
In [20]: df
Out[20]:
c1 c2 c3
0 0 x {'a': 1, 'b': 2}
1 0 x {'a': 3, 'b': 4}
2 0 y {'a': 5, 'b': 6}
3 0 y {'a': 7, 'b': 8}
4 2 x {'a': 9, 'b': 10}
5 2 x {'a': 11, 'b': 12}
Solution:
In [21]: df.groupby(['c1','c2'])['c3'].apply(list).to_frame('c3').reset_index()
Out[21]:
c1 c2 c3
0 0 x [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]
1 0 y [{'a': 5, 'b': 6}, {'a': 7, 'b': 8}]
2 2 x [{'a': 9, 'b': 10}, {'a': 11, 'b': 12}]
NOTE: I would recommend avoiding non-scalar values in Pandas DataFrame cells - this can cause various difficulties and performance issues.
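Following that recommendation, one way to expand the dict column into ordinary scalar columns is shown below (a minimal sketch, assuming the dicts share the keys 'a' and 'b' as in the example):
import pandas as pd

df = pd.DataFrame({'c1': [0, 0, 0, 0, 2, 2],
                   'c2': ['x', 'x', 'y', 'y', 'x', 'x'],
                   'c3': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, {'a': 5, 'b': 6},
                          {'a': 7, 'b': 8}, {'a': 9, 'b': 10}, {'a': 11, 'b': 12}]})
# Expand the dict column into scalar columns 'a' and 'b' alongside c1 and c2.
flat = pd.concat([df[['c1', 'c2']], pd.DataFrame(df['c3'].tolist())], axis=1)
print(flat)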

Calculating column wise for a matrix using numpy in python

With the following program I am trying to count, for each column, the number of occurrences of 0, 1, 2, and 3. The program is not working as desired. I read somewhere that the matrix should be sliced to compute the occurrences column-wise, but I am not sure how to do it. The program is written using numpy in Python. How can I do it with numpy?
import numpy as np

a = np.array([[2, 1, 1, 2, 1, 1, 2],  # t1 is horizontal
              [1, 1, 2, 2, 1, 1, 1],
              [2, 1, 1, 1, 1, 2, 1],
              [3, 3, 3, 2, 3, 3, 3],
              [3, 3, 2, 3, 3, 3, 2],
              [3, 3, 3, 2, 2, 2, 3],
              [3, 2, 2, 1, 1, 1, 0]])
print(a)
i = 0
j = 0
two = 0
zero = 0
one = 0
three = 0
r = a.shape[0]
c = a.shape[1]
for i in range(1, r):
    # print(repr(a))
    for j in range(1, c):
        # sele = a[i, j]
        if (a[i, j] == 0):
            zero += 1
        if (a[i, j] == 1):
            one += 1
        if (a[i, j] == 2):
            two += 1
        if (a[i, j] == 3):
            three += 1
        if i == c - 1:
            # print(zero)
            print(one)
        i += 0
        j = j + 1
        # print(two)
        # print(three)
    i = i + 1
    # print(zero)
Also I want to print it in the following manner:
column:         0  1  2  3  4  5  6
occurrences:
           0    0  0  0  0  0  0  1
           1    1  3  2  2  4  3  1
           2    2  1  3  4  1  2  2
           3    4  3  2  1  2  2  2
Here is the code using list functionality:
import numpy as np

inputArr = np.array([[2, 1, 1, 2, 1, 1, 2],
                     [1, 1, 2, 2, 1, 1, 1],
                     [2, 1, 1, 1, 1, 2, 1],
                     [3, 3, 3, 2, 3, 3, 3],
                     [3, 3, 2, 3, 3, 3, 2],
                     [3, 3, 3, 2, 2, 2, 3],
                     [3, 2, 2, 1, 1, 1, 0]])
occurance = dict()
toFindList = [0, 1, 2, 3]
for col in range(len(inputArr)):
    # Take the whole column and count each target value in it.
    collist = list(inputArr[:, col])
    occurance['col_' + str(col)] = {}
    for num in toFindList:
        occurcount = collist.count(num)
        occurance['col_' + str(col)][str(num)] = occurcount
for key, value in occurance.items():
    print(key, value)
Output:
col_0 {'0': 0, '1': 1, '2': 2, '3': 4}
col_1 {'0': 0, '1': 3, '2': 1, '3': 3}
col_2 {'0': 0, '1': 2, '2': 3, '3': 2}
col_3 {'0': 0, '1': 2, '2': 4, '3': 1}
col_4 {'0': 0, '1': 4, '2': 1, '3': 2}
col_5 {'0': 0, '1': 3, '2': 2, '3': 2}
col_6 {'0': 1, '1': 2, '2': 2, '3': 2}
This should give you the output format you want:
import numpy as np

def col_unique(a):
    # For each unique value, build a boolean mask of where it occurs, stack the
    # masks along a third axis, sum down the rows to get per-column counts, and
    # transpose so each row of the result corresponds to one unique value.
    return np.sum(np.dstack([np.in1d(a, i).reshape(a.shape) for i in np.unique(a)]), axis=0).T
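A quick usage sketch with the array from the question; each row of the result corresponds to one of the sorted unique values (0, 1, 2, 3) and each column to a column of a:
a = np.array([[2, 1, 1, 2, 1, 1, 2],
              [1, 1, 2, 2, 1, 1, 1],
              [2, 1, 1, 1, 1, 2, 1],
              [3, 3, 3, 2, 3, 3, 3],
              [3, 3, 2, 3, 3, 3, 2],
              [3, 3, 3, 2, 2, 2, 3],
              [3, 2, 2, 1, 1, 1, 0]])
print(col_unique(a))
# [[0 0 0 0 0 0 1]
#  [1 3 2 2 4 3 2]
#  [2 1 3 4 1 2 2]
#  [4 3 2 1 2 2 2]]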
