groupby and get entry with highest occurrence pandas [duplicate] - python

This question already has answers here: GroupBy pandas DataFrame and select most common value (13 answers). Closed 1 year ago.
I have a dataframe with cities and their product types, such as:
city  product_type
A     B
A     B
A     D
A     E
X     B
X     C
X     C
X     C
I want to know the most common product type for each city. For the above df, it would be product B for city A and product C for city X.
I am trying to solve this by first grouping, then iterating over the groups to find the product type with the most occurrences, but it doesn't seem to work:
d = df.groupby('city')['product_type']
prods = []
for name, group in d:
    l = [group]
    prod = max(l, key=l.count)
    prods.append(prod)
print(prods)  # list of the products with the highest occurrence in each city
This piece of code seems to give me ALL the product types, not just the most frequent ones.

You can try something like this:
import pandas as pd

data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'E', 'B', 'C', 'C', 'C']
})
result_dict = {city: city_data.product_type.value_counts().index[0]
               for city, city_data in data.groupby('city')}
print(result_dict)
This results in the dictionary {'A': 'B', 'X': 'C'}. Note that if more than one product has the same number of occurrences, this code will only return one of them.
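For reference, a shorter route is groupby with mode, which is also what the linked duplicate covers. A minimal sketch on the same data:

import pandas as pd

data = pd.DataFrame({
    'city': ['A', 'A', 'A', 'A', 'X', 'X', 'X', 'X'],
    'product_type': ['B', 'B', 'D', 'E', 'B', 'C', 'C', 'C']
})

# mode() returns every most-common value; [0] keeps the first on ties
most_common = data.groupby('city')['product_type'].agg(lambda s: s.mode()[0])
print(most_common.to_dict())  # {'A': 'B', 'X': 'C'}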

Related

How to get intersection of two arrays/lists in sqlalchemy

I have a similar problem to this one (the closest match is the answer using &&). For Postgres, I would like to get the intersection of an array column and a Python list. I've tried to do that with the && operator:
query(Table.array_column.op('&&')(cast(['a', 'b'], ARRAY(Unicode)))).filter(Table.array_column.op('&&')(cast(['a', 'b'], ARRAY(Unicode))))
but it seems that op('&&') returns a bool (which makes sense for filter), not the intersection.
So for table data:
id | array_column
 1 | {'7', 'xyz', 'a'}
 2 | {'b', 'c', 'd'}
 3 | {'x', 'y', 'ab'}
 4 | {'ab', 'ba', ''}
 5 | {'a', 'b', 'ab'}
I would like to get:
id | array_column
 1 | {'a'}
 2 | {'b'}
 5 | {'a', 'b'}
One way* to do this would be to unnest the array column, and then re-aggregate the rows that match the list values, grouping on id. This could be done as a subquery:
select id, array_agg(un)
from (select id, unnest(array_column) as un from tbl) t
where un in ('a', 'b')
group by id
order by id;
The equivalent SQLAlchemy construct is:
subq = sa.select(
    tbl.c.id, sa.func.unnest(tbl.c.array_column).label('col')
).subquery('s')

stmt = (
    sa.select(subq.c.id, sa.func.array_agg(subq.c.col))
    .where(subq.c.col.in_(['a', 'b']))
    .group_by(subq.c.id)
    .order_by(subq.c.id)
)
returning
(1, ['a'])
(2, ['b'])
(5, ['a', 'b'])
* There may well be more efficient ways.
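For context, here is a minimal, hypothetical setup to run the statement end to end; the table definition and connection URL below are assumptions for illustration, not taken from the question:

import sqlalchemy as sa
from sqlalchemy.dialects.postgresql import ARRAY

metadata = sa.MetaData()

# hypothetical table matching the shape of the question's data
tbl = sa.Table(
    'tbl', metadata,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('array_column', ARRAY(sa.Unicode)),
)

# placeholder connection URL
engine = sa.create_engine('postgresql+psycopg2://user:pass@localhost/dbname')

with engine.connect() as conn:
    for row in conn.execute(stmt):  # stmt as constructed above
        print(row)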

Filtering a dataframe using a dictionary's values

I have a dictionary mapping strings to lists of strings, for example:
{'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
I am looking to use this to filter a dataframe, creating new dfs where each key is the name of the new df and the columns to copy are those whose names contain any of the strings listed in the value.
So dataframe A would contain the columns of the original dataframe whose names contain 'A', and dataframe B would contain the columns containing 'A', 'B', or 'C'. I know that I need to use regex filtering to select the columns but am unsure how to do this.
Use DataFrame.filter with regex, joining the values with | (regex "or"); for key C this selects the columns containing B, E, or F:
d = {'A':['A'], 'B':['A', 'B', 'C'], 'C':['B', 'E', 'F']}
dfs = {k:df.filter(regex='|'.join(v)) for k, v in d.items()}
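A quick way to see it work, using a small frame with made-up column names for illustration:

import pandas as pd

# hypothetical columns whose names contain the letters being filtered on
df = pd.DataFrame([[1, 2, 3, 4, 5]], columns=['AX', 'BY', 'CZ', 'EQ', 'FR'])

d = {'A': ['A'], 'B': ['A', 'B', 'C'], 'C': ['B', 'E', 'F']}
dfs = {k: df.filter(regex='|'.join(v)) for k, v in d.items()}

print(dfs['C'].columns.tolist())  # ['BY', 'EQ', 'FR']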

Convert all rows of a Pandas dataframe column to comma-separated values with each value in single quote

I have a Pandas dataframe similar to:
df = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['Col'])
df
Col
0 a
1 b
2 c
3 d
I am trying to convert all rows of this column to a comma-separated string with each value in single quotes, like below:
'a', 'b', 'c', 'd'
I have tried the following with several different combinations, but this is the closest I got:
s = df['Col'].str.cat(sep="', '")
s
"a', 'b', 'c', 'd"
I think that the end result should be:
"'a', 'b', 'c', 'd'"
A quick fix would be:
"'" + df['Col'].str.cat(sep="', '") + "'"
"'a', 'b', 'c', 'd'"
Another alternative is to wrap each element in quotes and then use the plain str.join:
', '.join([f"'{i}'" for i in df['Col']])
"'a', 'b', 'c', 'd'"
Try this:
s = df['Col'].tolist()
Try something like this:
df = pd.DataFrame(['a', 'b', 'c', 'd'], columns=['Col1'])
values = df['Col1'].to_list()
with_quotes = ["'"+x+"'" for x in values]
','.join(with_quotes)
Output:
"'a','b','c','d'"

Query dataframe by column name as a variable

I know this question has already been asked here, but my question is a bit different. Let's say I have the following df:
import pandas as pd
df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'), 'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'), 'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']
I use this line to count the total number of occurrence of each elements of myList in my df columns:
print(df.query('A in @myList').A.count())
5
Now, I am trying to execute the same thing by looping through columns names. Something like this:
for col in df.columns:
    print(df.query('col in @myList').col.count())
Also, I was wondering: is using query the most efficient way to do this?
Thanks for the help.
Use this:
df.isin(myList).sum()
A 5
B 5
C 6
dtype: int64
isin checks every cell in the dataframe against myList and returns True or False; sum then treats True as 1 and False as 0 and totals each column.
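If you do want the per-column loop from the question, interpolating the column name with an f-string makes it work; a sketch (isin above remains the simpler choice):

import pandas as pd

df = pd.DataFrame({'A': ('a', 'b', 'c', 'd', 'e', 'a', 'b'),
                   'B': ('a', 'a', 'g', 'l', 'e', 'a', 'b'),
                   'C': ('b', 'b', 'g', 'a', 'e', 'a', 'b')})
myList = ['a', 'e', 'b']

# the f-string inserts the column name itself; @myList still resolves the variable
for col in df.columns:
    print(col, df.query(f'{col} in @myList')[col].count())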

How to get the unique values of a pandas column that contains lists or single values?

How do I get the unique values of a pandas column whose cells contain either a list or a single value?
my column:
column | column
test   | [A, B]
test   | [A, C]
test   | C
test   | D
test   | [E, B]
I want a list like this:
list = [A, B, C, D, E]
Thank you.
You can apply pd.Series to split up the lists, then stack and unique.
import pandas as pd
df = pd.DataFrame({'col': [['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]})
df.col.apply(pd.Series).stack().unique().tolist()
Outputs
['A', 'B', 'C', 'D', 'E']
You can use a flattening function (credit: @wim):
import collections.abc

def flatten(l):
    for i in l:
        if isinstance(i, collections.abc.Iterable) and not isinstance(i, str):
            yield from flatten(i)
        else:
            yield i
Then use set:
list(set(flatten(df.B)))
['A', 'B', 'E', 'C', 'D']
Setup
df = pd.DataFrame(dict(
    B=[['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]
))
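On pandas 0.25+, Series.explode offers a concise alternative, since it expands list cells into rows and passes scalars through unchanged. A sketch using the same setup:

import pandas as pd

df = pd.DataFrame(dict(
    B=[['A', 'B'], ['A', 'C'], 'C', 'D', ['E', 'B']]
))

# explode() gives each list element its own row; scalars are kept as-is
print(df.B.explode().unique().tolist())  # ['A', 'B', 'C', 'D', 'E']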
