I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
now I want to create a dataframe that gives me the count of modes for each column, for column AA the count is 3 for mode 10, for columns CC the count is 4 for mode 0, but for BB there are two modes 1 and 2, so for BB I want the sum of counts for the modes. so for BB the count is 2+2=4, for mode 1 and 2.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Another slightly more scalable solution using list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64
You can compare columns with modes and count matches by sum:
df = pd.DataFrame({'Columns': df.columns,
'Val':[df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
First we get the modes of the columns with DataFrame.mode
Then we compare each column to it's modes and use Series.isin to check the amount of modes and sum these.
modes = df.iloc[:, 1:].mode()
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
Used pyjanitor module to bring in the transform function and return a dataframe:
(df.melt(id_vars='In')
.groupby('variable')
.agg(numbers=('value','value_counts'))
.groupby_agg(by='variable',
#here, it subtracts the max of numbers(for each group) from each
number in the group
agg = lambda x : x - x.max(),
agg_column_name='numbers',
new_column_name = 'test'
)
.query('test==0')
.groupby('variable')
.agg(count=('numbers','sum'))
)
count
variable
AA 3
BB 4
CC 4
Related
This might be a quite easy problem but I can't deal with it properly and didn't find the exact answer here. So, let's say we have a Python Dataframe as below:
df:
ID a b c d
0 1 3 4 9
1 2 8 8 3
2 1 3 10 12
3 0 1 3 0
I want to remove all the rows that contain repeating values in different columns. In other words, I am only interested in keeping rows with unique values. Referring to the above example, the desired output should be:
ID a b c d
0 1 3 4 9
2 1 3 10 12
(I didn't change the ID values on purpose to make the comparison easier). Please let me know if you have any ideas. Thanks!
You can compare length of sets with length of columns names:
lc = len(df.columns)
df1 = df[df.apply(lambda x: len(set(x)) == lc, axis=1)]
print (df1)
a b c d
ID
0 1 3 4 9
2 1 3 10 12
Or test by Series.duplicated and Series.any:
df1 = df[~df.apply(lambda x: x.duplicated().any(), axis=1)]
Or DataFrame.nunique:
df1 = df[df.nunique(axis=1).eq(lc)]
Or:
df1 = df[[len(set(x)) == lc for x in df.to_numpy()]]
What I want to do is group by column A and then take the sum of first two rows, then assign that value as a new column. Example below:
DF:
ColA ColB
AA 2
AA 1
AA 5
AA 3
BB 9
BB 3
BB 2
BB 12
CC 0
CC 10
CC 5
CC 3
Desired DF:
ColA ColB NewCol
AA 2 3
AA 1 3
AA 5 3
AA 3 3
BB 9 12
BB 3 12
BB 2 12
BB 12 12
CC 0 10
CC 10 10
CC 5 10
CC 3 10
For AA, it looks at ColB and take the sum of the first two rows and assigns that summed value to newCol. I've tried this by creating a dictionary by looping through the unique ColA values, creating a subset dataframe of the first two rows, summing, then populating the dictionary with values. Then mapping the dictionary back - but my dataframe is VERY big and it takes forever. Any ideas?
Thank you!
You can use transform to get a new value per each row and a lambda function. In lambda you can use head(2) to get first 2 rows for each group and sum() them:
df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
I have a dataframe with a Keywords column. Each cell in that column has 5-10 individual values (comma seperated) which consist of 1 - 3 words. How can I count the most occurring keywords in the column?
I have tried df.Keywords.mode but it returns all values for each cell because they obviously don't occur multiple times within each cell.
Here an image to clarify:
All input is appreciated,
Thanks!
First use Series.str.split with expand=True for DataFrame and reshape by DataFrame.stack, then count by Series.value_counts and get top values by Series.head:
df = pd.DataFrame({'Keywords':['aa,bb,vv,vv','aa,aa,cc,bb','zz,bb,aa,ss']})
N = 5
df1 = (df.Keywords.str.split(',', expand=True)
.stack()
.value_counts()
.head(N)
.rename_axis('val')
.reset_index(name='count'))
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
Another solution if no missing values is flatten splitted lists and count by Counter:
from collections import Counter
N = 5
df1 = pd.DataFrame(Counter([y for x in df.Keywords for y in x.split(',')]).most_common(N),
columns=['val','count'])
print (df1)
val count
0 aa 4
1 bb 3
2 vv 2
3 zz 1
4 cc 1
I have a very large DF that contains data like the following:
import pandas as pd
df = pd.DataFrame()
df['CODE'] = [1,2,3,1,2,4,2,2,4,5]
df["DATA"] = [ 'AA', 'BB', 'CC', 'DD', 'AA', 'BB', 'EE', 'FF','GG', 'HH']
df.sort_values('CODE')
df
CODE DATA
0 1 AA
3 1 DD
1 2 BB
4 2 AA
6 2 EE
7 2 FF
2 3 CC
5 4 BB
8 4 GG
9 5 HH
because of the size I need to split it into chunks and parse it.
However equals element contained in the CODE column should not end up in different chunks, instead those should be added in the previous chunk even if the size is exceeded.
Basically if I choose a chunk size of 4 rows the first chunk could be increased up to include all elements with "2" and be:
chunk1:
CODE DATA
0 1 AA
3 1 DD
1 2 BB
4 2 AA
6 2 EE
7 2 FF
I found some posts about chunking and grouping like the following:
split dataframe into multiple dataframes based on number of rows
However the above provide an equal size chunking and I need a smart chunking that takes into account the values in the CODE column.
Any ideas how to do that?
I maybe came up with a solution, (still testing all cases), not very elegant though.
I create a recursive function returning the intervals to take:
def findrange(start,step):
for i in range(start,len(df)+1, step):
if i+step > len(df): return [i, len(df)]
if df.CODE[i+step:i+step+1].values != df.CODE[i+step-1:i+step].values:
return [i,i+step]
else:
return findrange(i,step+1)
Then I call the function to get the ranges and process the data
interval = [0,0]
idx = 0
N=2
while interval[1] < len(df):
if idx < interval[1]: idx = interval[1]
interval = findrange(idx, N)
idx+=N # this point became useless once interval[1] > idx
I tried with the DF posted using many different values for N > 0 and looks good.
if you have an approach more pandas like I am open to that.
I think you can create new column GROUPS by cumcount and then floor divide by N - get chunks for each CODE values:
N = 2
df['GROUPS'] = df.groupby('CODE').cumcount() // N
print (df)
CODE DATA GROUPS
0 1 AA 0
3 1 DD 0
1 2 BB 0
4 2 AA 0
6 2 EE 1
7 2 FF 1
2 3 CC 0
5 4 BB 0
8 4 GG 0
9 5 HH 0
groups = df.groupby(['CODE','GROUPS'])
for (frameno, frame) in groups:
print (frame.to_csv("%s.csv" % frameno))
You can also create new Series and use it for groupby:
chunked_ser = df.groupby('CODE').cumcount() // N
print (chunked_ser)
0 0
3 0
1 0
4 0
6 1
7 1
2 0
5 0
8 0
9 0
dtype: int64
groups = df.groupby([df.CODE,chunked_ser])
for (frameno, frame) in groups:
print (frame.to_csv("%s.csv" % frameno))
I got lost in Pandas doc and features trying to figure out a way to groupby a DataFrame by the values of the sum of the columns.
for instance, let say I have the following data :
In [2]: dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
In [3]: df = pd.DataFrame(dat)
In [4]: df
Out[4]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
I would like columns a, b and c to be grouped since they all have their sum equal to 1. The resulting DataFrame would have columns labels equals to the sum of the columns it summed. Like this :
1 9
0 2 2
1 1 3
2 0 4
Any idea to put me in the good direction ? Thanks in advance !
Here you go:
In [57]: df.groupby(df.sum(), axis=1).sum()
Out[57]:
1 9
0 2 2
1 1 3
2 0 4
[3 rows x 2 columns]
df.sum() is your grouper. It sums over the 0 axis (the index), giving you the two groups: 1 (columns a, b, and, c) and 9 (column d) . You want to group the columns (axis=1), and take the sum of each group.
Because pandas is designed with database concepts in mind, it's really expected information to be stored together in rows, not in columns. Because of this, it's usually more elegant to do things row-wise. Here's how to solve your problem row-wise:
dat = {'a':[1,0,0], 'b':[0,1,0], 'c':[1,0,0], 'd':[2,3,4]}
df = pd.DataFrame(dat)
df = df.transpose()
df['totals'] = df.sum(1)
print df.groupby('totals').sum().transpose()
#totals 1 9
#0 2 2
#1 1 3
#2 0 4