is there a simpler way to group and count with python?

is there a simpler way to group and count with python? - python

I am grouping and counting a set of data.
df = pd.DataFrame({'key': ['A', 'B', 'A'],
'data': np.ones(3,)})
df.groupby('key').count()
outputs
data
key
A 2
B 1
The piece of code above works though, I wonder if there is a simpler one.
'data': np.ones(3,) seems to be a placeholder and indispensable.
pd.DataFrame(['A', 'B', 'A']).groupby(0).count()
outputs
A
B
My question is, is there a simpler way to do this, produce the count of 'A' and 'B' respectively, without something like 'data': np.ones(3,) ?
It doesn't have to be a pandas method, numpy or python native function are also appreciated.

Use a Series instead.
>>> import pandas as pd
>>>
>>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D']
>>>
>>> pd.Series(data).value_counts()
D 5
A 3
C 2
B 1
dtype: int64

Use a defaultdict:
from collections import defaultdict
data = ['A', 'A', 'B', 'A', 'C', 'C', 'A']
d = defaultdict(int)
for element in data:
d[element] += 1
d # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})

There's not any grouping , just counting, so you can use
from collections import Counter
counter(['A', 'B', 'A'])

Related

Query dataframe by column name as a variable

I know this question has already been asked here, but my question a bit different. Lets say I have following df:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'), 'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'), 'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']
I use this line to count the total number of occurrence of each elements of myList in my df columns:
print(df.query('A in #myList ').A.count())
5
Now, I am trying to execute the same thing by looping through columns names. Something like this:
for col in df.columns:
print(df.query('col in #myList ').col.count())
Also, I was wondering if using query for this is the most efficient way?
Thanks for the help.

Use this :
df.isin(myList).sum()
A 5
B 5
C 6
dtype: int64
It checks every cell in the dataframe through myList and returns True or False. Sum uses the 1 or 0 reference and gets the total for each column

Append a value to list only if it has not occurred previously

I have a list of strings. I am trying to append the values to a new list but, only those values which are not consecutive.
like for example,
If i have a list like this,
['a', 'a', 'a', 'b', 'b', 'a']
I need output like
['a', 'b', 'a']

Use itertools.groupby
Ex:
from itertools import groupby
data = ['a', 'a', 'a', 'b', 'b', 'a']
print([k for k, _ in groupby(data)])
# --> ['a', 'b', 'a']

You can use zip_longest from itertools and compare with the next element in the list:
from itertools import zip_longest
a = ['a', 'a', 'a', 'b', 'b', 'a']
b = [i for i,j in zip_longest(a,a[1:]) if i!=j]
print(b)

Python Iterating through two lists only iterates through last element

I am trying to iterate through a double list but am getting the incorrect results. I am trying to get the count of each element in the list.
l = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
dict = {}
for words in l:
for letters in words:
dict[letters] = words.count(letters)
for x in countVocabDict:
print(x + ":" + str(countVocabDict[x]))
at the moment, I am getting:
<s>:1
a:1
b:2
c:2
</s>:1
It seems as if it is only iterating through the last list in 'l' : ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']
but I am trying to get:
<s>: 3
a: 4
b: 5
c: 6
</s>:3

In each inner for loop, you are not adding to the current value of dict[letters] but set it to whatever amount is counted for the current sublist (peculiarly) named word.
Fixing your code with a vanilla dict:
>>> l = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
>>> d = {}
>>>
>>> for sublist in l:
...: for x in sublist:
...: d[x] = d.get(x, 0) + 1
>>> d
{'<s>': 3, 'a': 4, 'b': 5, 'c': 6, '</s>': 3}
Note that I am not calling list.count in each inner for loop. Calling count will iterate over the whole list again and again. It is far more efficient to just add 1 every time a value is seen, which can be done by looking at each element of the (sub)lists exactly once.
Using a Counter.
>>> from collections import Counter
>>> Counter(x for sub in l for x in sub)
Counter({'<s>': 3, 'a': 4, 'b': 5, 'c': 6, '</s>': 3})
Using a Counter and not manually unnesting the nested list:
>>> from collections import Counter
>>> from itertools import chain
>>> Counter(chain.from_iterable(l))
Counter({'<s>': 3, 'a': 4, 'b': 5, 'c': 6, '</s>': 3})

The dictionary is being overwritten in every iteration, rather it should update
count_dict[letters] += words.count(letters)
Initialize the dictionary with defaultdict
from collections import defaultdict
count_dict = defaultdict(int)

As #Vishnudev said, you must add current counter. But dict[letters] must exists (else you'll get a KeyError Exception). You can use the get method of dict with a default value to avoir this:
l = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'],
['<s>', 'a', 'c', 'b', 'c', '</s>'],
['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
dict = {}
for words in l:
for letters in words:
dict[letters] = dict.get(letters, 0) + 1

As per your question, you seem to know that it only takes on the result of the last sublist. This happens because after every iteration your previous dictionary values are replaced and overwritten by the next iteration values. So, you need to maintain the previous states values and add it to the newly calculated values.
You can try this-
l = [['<s>', 'a', 'a', 'b', 'b', 'c', 'c', '</s>'], ['<s>', 'a', 'c', 'b', 'c', '</s>'], ['<s>', 'b', 'c', 'c', 'a', 'b', '</s>']]
d={}
for lis in l:
for x in lis:
if x in d:
d[x]+=1
else:
d[x]=1
So the resulting dictionary d will be as-
{'<s>': 3, 'a': 4, 'c': 6, 'b': 5, '</s>': 3}
I hope this helps!

how to get the unique value of a column pandas that contains list or value?

how to get the unique value of a column pandas that contains list or value ?
my column:
column | column
test | [A,B]
test | [A,C]
test | C
test | D
test | [E,B]
i want list like that :
list = [A, B, C, D, E]
thank you

You can apply pd.Series to split up the lists, then stack and unique.
import pandas as pd
df = pd.DataFrame({'col': [['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]})
df.col.apply(pd.Series).stack().unique().tolist()
Outputs
['A', 'B', 'C', 'D', 'E']

You can use a flattening function Credit #wim
import collections
def flatten(l):
for i in l:
if isinstance(i, collections.abc.Iterable) and not isinstance(i, str):
yield from flatten(i)
else:
yield i
Then use set
list(set(flatten(df.B)))
['A', 'B', 'E', 'C', 'D']
Setup
df = pd.DataFrame(dict(
B=[['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]
))

Create a combined list of random order from a fixed number of items

Say I have 3 different items being A, B and C. I want to create a combined list containing NA copies of A, NB copies of B and NC copies of C in random orders. So the results should look like this:
finalList = [A, C, A, A, B, C, A, C,...]
Is there a clean way to get around this using np.random.rand Pythonically? If not, any other packages besides numpy?

I don't think you need numpy for that. You can use the random builtin package:
import random
na = nb = nc = 5
l = ['A'] * na + ['B'] *nb + ['C'] * nc
random.shuffle(l)
list l will look something like:
['A', 'C', 'A', 'B', 'C', 'A', 'C', 'B', 'B', 'B', 'A', 'C', 'B', 'C', 'A']

You can define a list of tuples. Each tuple should contain a character and desired frequency. Then you can create a list where each element is repeated with specified frequency and finally shuffle it using random.shuffle
>>> import random
>>> l = [('A',3),('B',5),('C',10)]
>>> a = [val for val, freq in l for i in range(freq)]
>>> random.shuffle(a)
>>> ['A', 'B', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'C', 'B', 'C']

Yes, this is very much possible (and simple) with numpy. You'll have to create an array with your unique elements, repeat each element a specified number of times using np.repeat (using an axis argument makes this possible), and then shuffle with np.random.shuffle.
Here's an example with NA as 1, NB as 2, and NC as 3.
a = np.array([['A', 'B', 'C']]).repeat([1, 2, 3], axis=1).squeeze()
np.random.shuffle(a)
print(a)
array(['B', 'C', 'A', 'C', 'B', 'C'],
dtype='<U1')
Note that it is simpler to use numpy, specifying an array of unique elements and repeats, versus a pure python implementation when you have a large number of unique elements to repeat.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

is there a simpler way to group and count with python? - python

Use a Series instead. >>> import pandas as pd >>> >>> data = ['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D', 'D', 'D', 'D'] >>> >>> pd.Series(data).value_counts() D 5 A 3 C 2 B 1 dtype: int64

Use a defaultdict: from collections import defaultdict data = ['A', 'A', 'B', 'A', 'C', 'C', 'A'] d = defaultdict(int) for element in data: d[element] += 1 d # output: defaultdict(int, {'A': 4, 'B': 1, 'C': 2})

There's not any grouping , just counting, so you can use from collections import Counter counter(['A', 'B', 'A'])

Related

Query dataframe by column name as a variable

Append a value to list only if it has not occurred previously

Python Iterating through two lists only iterates through last element

how to get the unique value of a column pandas that contains list or value?

Create a combined list of random order from a fixed number of items

Categories

Resources