How to create a frequency table in pandas (Python)

If I have data like
Col1
A
B
A
B
A
C
I need output like
Col_value Count
A 3
B 2
C 1
I need Col_value and Count to be the column names,
so that I can access a column like a['Col_value'].

Use value_counts:
df = df['Col1'].value_counts().to_frame().reset_index()
df
A 3
B 2
C 1
then rename your columns if needed:
df.columns = ['Col_value','Count']
df
Col_value Count
0 A 3
1 B 2
2 C 1
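The counting and the renaming can also be done in one chain. This is a minimal sketch, assuming the sample data from the question. The names `df` and `freq` are illustrative, while `rename_axis` and `reset_index(name=...)` are standard pandas calls.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["A", "B", "A", "B", "A", "C"]})

# value_counts() returns a Series indexed by the distinct values.
# rename_axis names that index; reset_index(name=...) names the counts.
freq = (
    df["Col1"]
    .value_counts()
    .rename_axis("Col_value")
    .reset_index(name="Count")
)

print(freq)
print(freq["Col_value"].tolist())
```

Naming both columns explicitly avoids depending on the default column labels that `reset_index` produces, which differ between pandas versions.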

Another solution is groupby with aggregating size:
df = (df.groupby('Col1')
        .size()
        .reset_index(name='Count')
        .rename(columns={'Col1':'Col_value'}))
print(df)
Col_value Count
0 A 3
1 B 2
2 C 1

Use pd.crosstab as another alternative:
import pandas as pd
help(pd.crosstab)
Help on function crosstab in module pandas.core.reshape.pivot:
crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
Example:
df_freq = pd.crosstab(df['Col1'], columns='count')
df_freq.head()
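To get from the crosstab result to the two named columns the question asks for, the index can be reset and the columns relabelled. This is a minimal sketch, assuming the sample data from the question; the variable name `table` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Col1": ["A", "B", "A", "B", "A", "C"]})

# With a single string as `columns`, crosstab yields one count column.
df_freq = pd.crosstab(df["Col1"], columns="count")

# Move the values out of the index and use the names from the question.
table = df_freq.reset_index()
table.columns = ["Col_value", "Count"]

print(table)
```

Unlike value_counts, the rows come back ordered by value rather than by count.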

def frequencyTable(alist):
    '''
    list -> chart
    Returns None. Side effect is printing two columns showing each number that
    is in the list, and then a column indicating how many times it was in the list.
    Example:
    >>> frequencyTable([1, 3, 3, 2])
    ITEM FREQUENCY
    1 1
    2 1
    3 2
    '''
    countdict = {}
    for item in alist:
        if item in countdict:
            countdict[item] = countdict[item] + 1
        else:
            countdict[item] = 1
    itemlist = list(countdict.keys())
    itemlist.sort()
    print("ITEM", "FREQUENCY")
    for item in itemlist:
        print(item, " ", countdict[item])
    return None
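The same counting loop can be written with the standard library's `collections.Counter`. This is a minimal sketch of a variant that returns the table instead of printing it; the function name `frequency_table` and the variable `rows` are illustrative, not part of the answer above.

```python
from collections import Counter


def frequency_table(items):
    """Return a list of (item, count) pairs sorted by item."""
    counts = Counter(items)  # maps each distinct item to its count
    return sorted(counts.items())


rows = frequency_table([1, 3, 3, 2])
print("ITEM", "FREQUENCY")
for item, count in rows:
    print(item, count)
```

Returning the pairs keeps the function usable from other code, for example to build a DataFrame from `rows`.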

Related

Pandas groupby and get nunique of multiple columns in a dataframe

I have a dataframe like as below
stu_id,Mat_grade,sci_grade,eng_grade
1,A,C,A
1,A,C,A
1,B,C,A
1,C,C,A
2,D,B,B
2,D,C,B
2,D,D,C
2,D,A,C
tf = pd.read_clipboard(sep=',')
My objective is to
a) Find out how many different unique grades that a student got under Mat_grade, sci_grade and eng_grade
So, I tried the below
tf['mat_cnt'] = tf.groupby(['stu_id'])['Mat_grade'].nunique()
tf['sci_cnt'] = tf.groupby(['stu_id'])['sci_grade'].nunique()
tf['eng_cnt'] = tf.groupby(['stu_id'])['eng_grade'].nunique()
But this doesn't produce the expected output. Since I have more than 100K unique ids, an efficient and elegant solution would be really helpful.
I expect my output to be as below
You can put the column names in a list, call DataFrameGroupBy.nunique on those columns, and then rename:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = dict(zip(cols, new))
df = tf.groupby(['stu_id'], as_index=False)[cols].nunique().rename(columns=d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
Another idea is to use named aggregation:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = {v: (k,'nunique') for k, v in zip(cols, new)}
print (d)
{'mat_cnt': ('Mat_grade', 'nunique'),
'sci_cnt': ('sci_grade', 'nunique'),
'eng_cnt': ('eng_grade', 'nunique')}
df = tf.groupby(['stu_id'], as_index=False).agg(**d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
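The attempt in the question assigns a result indexed by stu_id to a frame indexed by row number, so the values do not line up. If the counts are wanted on every original row instead of one row per student, `transform('nunique')` is one option, because it returns a result aligned to the original index. This is a minimal sketch, assuming the sample data is built inline rather than read from the clipboard; the variable `counts` is illustrative.

```python
import pandas as pd

tf = pd.DataFrame({
    "stu_id":    [1, 1, 1, 1, 2, 2, 2, 2],
    "Mat_grade": list("AABCDDDD"),
    "sci_grade": list("CCCCBCDA"),
    "eng_grade": list("AAAABBCC"),
})

cols = ["Mat_grade", "sci_grade", "eng_grade"]
new = ["mat_cnt", "sci_cnt", "eng_cnt"]

# transform broadcasts each group's result back onto that group's rows,
# so `counts` has the same index as `tf`.
counts = tf.groupby("stu_id")[cols].transform("nunique")
counts.columns = new

tf = tf.join(counts)
print(tf)
```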

Keep pair of row data in pandas [duplicate]

I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows when A, and B are unique, i.e. I would like to keep only the rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get the row 2, as 0, 1, and 3 are in the uniques!
Solutions for selecting all duplicated rows:
You can use duplicated with the subset parameter and keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
Slightly modified solutions for selecting all unique rows:
# invert the boolean mask with ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
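The snippets above each assign the result back to df, so they are meant to be run one at a time on the original frame. A minimal sketch that computes the mask once and keeps both halves, assuming the sample data from the question (the names `mask`, `repeated` and `singletons` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "bar"],
    "B": [0, 1, 1, 1],
    "C": ["A", "A", "B", "A"],
})

# True for every row whose (A, B) pair occurs more than once.
mask = df.duplicated(subset=["A", "B"], keep=False)

repeated = df[mask]      # rows 1 and 2
singletons = df[~mask]   # rows 0 and 3

print(repeated)
print(singletons)
```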
I came up with a solution using groupby:
groupped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
uniques = groupped[groupped['count'] == 1]
duplicates = df[~df.index.isin(uniques.index)]
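This gives the following, which is not the desired rows 1 and 2: after reset_index, the index of groupped is a fresh 0..n-1 range that does not correspond to the row labels of df: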
A B C
2 foo 1 B
3 bar 1 A
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
Please use @jezrael's answer; I think it is the safest(?), as I am relying on pandas indexes here.
df1 = df.drop_duplicates(['A', 'B'],keep=False)
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.

How to give duplicated columns distinct names in Pandas [duplicate]

I have several columns with the same name in a df. I need to rename them, but the problem is that the df.rename method renames them all the same way. How can I rename the blah columns below to blah1, blah4, blah5?
df = pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns = ['blah','blah2','blah3','blah','blah']
df
# blah blah2 blah3 blah blah
# 0 0 1 2 3 4
# 1 5 6 7 8 9
Here is what happens when using the df.rename method:
df.rename(columns={'blah':'blah1'})
# blah1 blah2 blah3 blah1 blah1
# 0 0 1 2 3 4
# 1 5 6 7 8 9
Starting with Pandas 0.19.0 pd.read_csv() has improved support for duplicate column names
So we can try to use the internal method:
In [137]: pd.io.parsers.ParserBase({'names':df.columns})._maybe_dedup_names(df.columns)
Out[137]: ['blah', 'blah2', 'blah3', 'blah.1', 'blah.2']
Since Pandas 1.3.0:
pd.io.parsers.base_parser.ParserBase({'names':df.columns, 'usecols':None})._maybe_dedup_names(df.columns)
This is the "magic" function:
def _maybe_dedup_names(self, names):
    # see gh-7160 and gh-9424: this helps to provide
    # immediate alleviation of the duplicate names
    # issue and appears to be satisfactory to users,
    # but ultimately, not needing to butcher the names
    # would be nice!
    if self.mangle_dupe_cols:
        names = list(names)  # so we can index
        counts = {}
        for i, col in enumerate(names):
            cur_count = counts.get(col, 0)
            if cur_count > 0:
                names[i] = '%s.%d' % (col, cur_count)
            counts[col] = cur_count + 1
    return names
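Because `_maybe_dedup_names` is a private method whose location has already changed between versions, it may be safer to reproduce the same mangling with public calls. A minimal sketch, assuming the `name.N` suffix scheme shown above is wanted; the variable `occurrence` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(2 * 5).reshape(2, 5))
df.columns = ["blah", "blah2", "blah3", "blah", "blah"]

cols = pd.Series(df.columns)

# cumcount numbers the repeats of each name 0, 1, 2, ... in order of appearance.
occurrence = cols.groupby(cols).cumcount()

# Keep the first occurrence as-is and suffix the later ones.
df.columns = [
    name if n == 0 else f"{name}.{n}"
    for name, n in zip(cols, occurrence)
]

print(df.columns.tolist())
```

This only uses `Series.groupby` and `cumcount`, so it does not depend on pandas internals.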
I was looking to find a solution within Pandas more than a general Python solution.
When a label is duplicated, the columns' get_loc() method returns a boolean mask whose True values mark the positions where that label occurs. I then use the mask to assign new values into those positions. In my case, I know ahead of time how many dups I'm going to get and what I'm going to assign to them, but it looks like df.columns.get_duplicates() would return a list of all dups, which you can then use in conjunction with get_loc() if you need a more generic dup-weeding action.
'''UPDATED AS-OF SEPT 2020'''
cols = pd.Series(df.columns)
for dup in df.columns[df.columns.duplicated(keep=False)]:
    cols[df.columns.get_loc(dup)] = ([dup + '.' + str(d_idx)
                                      if d_idx != 0
                                      else dup
                                      for d_idx in range(df.columns.get_loc(dup).sum())]
                                     )
df.columns = cols
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
New Better Method (Update 03Dec2019)
The code below is better than the code above. It is copied from another answer below (@SatishSK):
#sample df with duplicate blah column
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
# you just need the following 4 lines to rename duplicates
# df is the dataframe that you want to rename duplicated columns
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
# rename the columns with the cols list.
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
You could use this:
def df_column_uniquify(df):
    df_columns = df.columns
    new_columns = []
    for item in df_columns:
        counter = 0
        newitem = item
        while newitem in new_columns:
            counter += 1
            newitem = "{}_{}".format(item, counter)
        new_columns.append(newitem)
    df.columns = new_columns
    return df
Then
import numpy as np
import pandas as pd
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
so that df:
blah blah2 blah3 blah blah
0 0 1 2 3 4
1 5 6 7 8 9
then
df = df_column_uniquify(df)
so that df:
blah blah2 blah3 blah_1 blah_2
0 0 1 2 3 4
1 5 6 7 8 9
You could assign directly to the columns:
In [12]:
df.columns = ['blah','blah2','blah3','blah4','blah5']
df
Out[12]:
blah blah2 blah3 blah4 blah5
0 0 1 2 3 4
1 5 6 7 8 9
[2 rows x 5 columns]
If you want to dynamically just rename the duplicate columns then you could do something like the following (code taken from answer 2: Index of duplicates items in a python list):
In [25]:
import collections
dups = collections.defaultdict(list)
dup_indices=[]
col_list=list(df.columns)
for i, e in enumerate(list(df.columns)):
    dups[e].append(i)
for k, v in sorted(dups.items()):
    if len(v) >= 2:
        dup_indices = v
for i in dup_indices:
    col_list[i] = col_list[i] + ' ' + str(i)
col_list
Out[25]:
['blah 0', 'blah2', 'blah3', 'blah 3', 'blah 4']
You could then use this to assign back, you could also have a function to generate a unique name that is not present in the columns prior to renaming.
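duplicated_idx = dataset.columns.duplicated()
duplicated = dataset.columns[duplicated_idx].unique()
rename_cols = []
counts = {}  # occurrences seen so far, per duplicated name
for col in dataset.columns:
    if col in duplicated:
        # number each occurrence separately so the new names are distinct
        counts[col] = counts.get(col, 0) + 1
        rename_cols.append(col + '_' + str(counts[col]))
    else:
        rename_cols.append(col)
dataset.columns = rename_cols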
Thank you @Lamakaha for the solution. Your idea gave me a chance to modify it and make it work in all cases.
I am using Python 3.7.3.
I tried your code on my data set, which had only one duplicated column, i.e. two columns with the same name. Unfortunately, the column names remained as-is without being renamed. On top of that, I got a warning that "get_duplicates() is deprecated and same will be removed in future version". I then used duplicated() coupled with unique() in place of get_duplicates(), which still did not yield the expected result.
I have modified your code a little, and it now works for my data set as well as in other general cases.
Here are runs of the code with and without the modification on the example data set from the question, along with the results:
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in df.columns.get_duplicates():
    cols[df.columns.get_loc(dup)]=[dup+'.'+str(d_idx) if d_idx!=0 else dup for d_idx in range(df.columns.get_loc(dup).sum())]
df.columns=cols
df
f:\Anaconda3\lib\site-packages\ipykernel_launcher.py:2: FutureWarning:
'get_duplicates' is deprecated and will be removed in a future
release. You can use idx[idx.duplicated()].unique() instead
Output:
blah blah2 blah3 blah blah.1
0 0 1 2 3 4
1 5 6 7 8 9
Two of the three "blah"(s) are not renamed properly.
Modified code
df=pd.DataFrame(np.arange(2*5).reshape(2,5))
df.columns=['blah','blah2','blah3','blah','blah']
df
cols=pd.Series(df.columns)
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '.' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
df.columns=cols
df
Output:
blah blah2 blah3 blah.1 blah.2
0 0 1 2 3 4
1 5 6 7 8 9
Here is a run of the modified code on another example:
cols = pd.Series(['X', 'Y', 'Z', 'A', 'B', 'C', 'A', 'A', 'L', 'M', 'A', 'Y', 'M'])
for dup in cols[cols.duplicated()].unique():
    cols[cols[cols == dup].index.values.tolist()] = [dup + '_' + str(i) if i != 0 else dup for i in range(sum(cols == dup))]
cols
Output:
0 X
1 Y
2 Z
3 A
4 B
5 C
6 A_1
7 A_2
8 L
9 M
10 A_3
11 Y_1
12 M_1
dtype: object
Hope this helps anybody who is seeking answer to the aforementioned question.
Since the accepted answer (by Lamakaha) is not working for recent versions of pandas, and because the other suggestions looked a bit clumsy, I worked out my own solution:
def dedupIndex(idx, fmt=None, ignoreFirst=True):
    # fmt: A string format that receives two arguments:
    #      name and a counter. By default: fmt='%s.%03d'
    # ignoreFirst: Disable/enable postfixing of first element.
    idx = pd.Series(idx)
    duplicates = idx[idx.duplicated()].unique()
    fmt = '%s.%03d' if fmt is None else fmt
    for name in duplicates:
        dups = idx==name
        ret = [ fmt%(name,i) if (i!=0 or not ignoreFirst) else name
                for i in range(dups.sum()) ]
        idx.loc[dups] = ret
    return pd.Index(idx)
Use the function as follows:
df.columns = dedupIndex(df.columns)
# Result: ['blah', 'blah2', 'blah3', 'blah.001', 'blah.002']
df.columns = dedupIndex(df.columns, fmt='%s #%d', ignoreFirst=False)
# Result: ['blah #0', 'blah2', 'blah3', 'blah #1', 'blah #2']
Here's a solution that also works for multi-indexes
# Take a df and rename duplicate columns by appending number suffixes
def rename_duplicates(df):
    import copy
    new_columns = df.columns.values
    suffix = {key: 2 for key in set(new_columns)}
    dup = pd.Series(new_columns).duplicated()
    if type(df.columns) == pd.core.indexes.multi.MultiIndex:
        # Need to be mutable, make it list instead of tuples
        for i in range(len(new_columns)):
            new_columns[i] = list(new_columns[i])
        for ix, item in enumerate(new_columns):
            item_orig = copy.copy(item)
            if dup[ix]:
                for level in range(len(new_columns[ix])):
                    new_columns[ix][level] = new_columns[ix][level] + f"_{suffix[tuple(item_orig)]}"
                suffix[tuple(item_orig)] += 1
        for i in range(len(new_columns)):
            new_columns[i] = tuple(new_columns[i])
        df.columns = pd.MultiIndex.from_tuples(new_columns)
    # Not a MultiIndex
    else:
        for ix, item in enumerate(new_columns):
            if dup[ix]:
                new_columns[ix] = item + f"_{suffix[item]}"
                suffix[item] += 1
        df.columns = new_columns
I just wrote this code; it uses a list comprehension to update all duplicated names.
df.columns = [x[1] if x[1] not in df.columns[:x[0]] else f"{x[1]}_{list(df.columns[:x[0]]).count(x[1])}" for x in enumerate(df.columns)]
Created a function with some tests so it should be drop in ready; this is a little different than Lamakaha's excellent solution since it renames the first appearance of a duplicate column:
from collections import defaultdict
from typing import Dict, List, Set
import pandas as pd
def rename_duplicate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename column headers to ensure no header names are duplicated.
    Args:
        df (pd.DataFrame): A dataframe with a single index of columns
    Returns:
        pd.DataFrame: The dataframe with headers renamed; inplace
    """
    if not df.columns.has_duplicates:
        return df
    duplicates: Set[str] = set(df.columns[df.columns.duplicated()].tolist())
    indexes: Dict[str, int] = defaultdict(lambda: 0)
    new_cols: List[str] = []
    for col in df.columns:
        if col in duplicates:
            indexes[col] += 1
            new_cols.append(f"{col}.{indexes[col]}")
        else:
            new_cols.append(col)
    df.columns = new_cols
    return df
def test_rename_duplicate_columns():
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "b"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a", "b"]
    df = pd.DataFrame(data=[[1, 2]], columns=["a", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "a.2"]
    df = pd.DataFrame(data=[[1, 2, 3]], columns=["a", "b", "a"])
    assert rename_duplicate_columns(df).columns.tolist() == ["a.1", "b", "a.2"]
We can just assign each column a different name.
Suppose the duplicated column names are ['a', 'b', 'c', 'd', 'd', 'c'].
Then just create a list of the names you want to assign:
c = ['a', 'b', 'c', 'd', 'D1', 'C1']
df.columns = c
This works for me.
This is my solution:
cols = []  # for tracking whether we have already seen a name
new_cols = []
for col in df.columns:
    cols.append(col)
    count = cols.count(col)
    if count > 1:
        new_cols.append(f'{col}_{count}')
    else:
        new_cols.append(col)
df.columns = new_cols
Here's an elegant solution:
Isolate a dataframe with only the repeated columns (looks like it will be a series but it will be a dataframe if >1 column with that name):
df1 = df['blah']
For each "blah" column, give it a unique number
df1.columns = ['blah_' + str(int(x)) for x in range(len(df1.columns))]
Isolate a dataframe with all but the repeated columns:
df2 = df[[x for x in df.columns if x != 'blah']]
Merge back together on indices:
df3 = pd.merge(df1, df2, left_index=True, right_index=True)
Et voila:
blah_0 blah_1 blah_2 blah2 blah3
0 0 3 4 1 2
1 5 8 9 6 7

Pandas DataFrame - list columns with lowest distinct values

I have the following code to find the columns in a data frame with the lowest number of distinct values and list them.
import pandas as pd
df = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df)
unique_counts = df.nunique()
lowest_distinct = 100
#
#Find the lowest distinct count across all columns
#
for column_name, distinct_count in unique_counts.items():
    if distinct_count < lowest_distinct:
        lowest_distinct = distinct_count
lowest_distinct_columns = []
#
#Collect the columns having that count
#
for column_name, distinct_count in unique_counts.items():
    if distinct_count == lowest_distinct:
        lowest_distinct_columns.append(column_name)
#
#Get the columns and values returned as a data frame
#
melted_df = df.melt(value_vars=lowest_distinct_columns,var_name='column', value_name='value')
print(melted_df)
It feels a bit clunky so I'm wondering if there is a better way to do it? Ultimately I'm trying to get a list of the columns and values that have the lowest number of distinct values.
Any thoughts or tips appreciated.
Cheers
David
Does this do what you want?
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts == lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
Use
In [114]: df[unique_counts[unique_counts == unique_counts.min()].index].melt(
     ...:     var_name='column', value_name='value')
Out[114]:
column value
0 C 1
1 C 1
2 C 2
3 D 3
4 D 3
5 D 4
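The melted output above keeps repeated values (C appears with 1 twice). If only the distinct values per column are wanted, a drop_duplicates step can be added. A minimal sketch, assuming the sample frame from the question; the names `lowest_cols` and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [2, 3, 4], "C": [1, 1, 2], "D": [3, 3, 4]})

unique_counts = df.nunique()

# Columns whose number of distinct values equals the minimum.
lowest_cols = unique_counts[unique_counts == unique_counts.min()].index.tolist()

# Long format with one row per distinct (column, value) pair.
result = (
    df[lowest_cols]
    .melt(var_name="column", value_name="value")
    .drop_duplicates()
    .reset_index(drop=True)
)

print(result)
```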
For older versions of pandas (< 0.20), consider apply to return a series:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
Thank you for the responses. The 3 solutions to the first part of the problem work equally well and the 2 responses to the second part of the problem also work very well.
I'll need to use them in practice to see if there is any material difference in performance or behaviour but to summarise the complete solutions:
@Parfait's solution:
unique_ser = df.apply(lambda col: col.nunique(), axis=0)
print(unique_ser)
# A 3
# B 3
# C 2
# D 2
lowest_unique_ser = unique_ser[unique_ser == unique_ser.min()]
print(lowest_unique_ser)
# C 2
# D 2
final_ser = df[lowest_unique_ser.index].apply(lambda col: col.unique().tolist(), axis=0)
print(final_ser)
# C (1, 2)
# D (3, 4)
and @Priker's
unique_counts = df.nunique()
lowest_distinct = unique_counts.min()
lowest_distinct_columns = unique_counts[unique_counts ==
lowest_distinct].index.tolist()
result = pd.DataFrame({col: df[col].unique() for col in lowest_distinct_columns})
Use
df1 = pd.DataFrame({"A": [1,2,3], "B": [2,3,4],"C":[1,1,2],"D":[3,3,4]})
print(df1)
unique_counts = df1.nunique()
A B C D
0 1 2 1 3
1 2 3 1 3
2 3 4 2 4
unique_counts[unique_counts==unique_counts.min()]
C 2
D 2
dtype: int64

How to iterate over DataFrame and generate a new DataFrame

I have a data frame that looks like this:
P Q L
1 2 3
2 3
4 5 6,7
The objective is to check whether there is any value in L and, if so, extract the values of the L and P columns:
P L
1 3
4,6
4,7
Note that there might be more than one value in L; in that case, I would need one row per value.
Below is my current script; it does not generate the expected result.
df2 = []
ego
other
newrow = []
for item in data_DF.iterrows():
    if item[1]["L"] is not None:
        ego = item[1]['P']
        other = item[1]['L']
        newrow = ego + other + "\n"
        df2.append(newrow)
data_DF2 = pd.DataFrame(df2)
First, you can extract all rows of the L and P columns where L is not missing like so:
df2 = df[~pd.isnull(df.L)].loc[:, ['P', 'L']].set_index('P')
Next, you can deal with the multiple values in some of the remaining L rows as follows:
df2 = df2.L.str.split(',', expand=True).stack()
df2 = df2.reset_index().drop('level_1', axis=1).rename(columns={0: 'L'}).dropna()
df2.L = df2.L.str.strip()
To explain: with P as index, the code splits the string content of the L column on ',' and distributes the individual elements across various columns. It then stacks the various new columns into a single new column, and cleans up the result.
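On pandas versions that provide DataFrame.explode (0.25 and later, which is an assumption about the reader's environment), the split-and-stack steps can be expressed more directly. A minimal sketch, assuming L holds comma-separated strings and that missing values are NaN; the sample frame and the name `out` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"P": [1, 2, 4], "Q": [2, 3, 5], "L": ["3", np.nan, "6,7"]})

out = (
    df.dropna(subset=["L"])                          # keep rows that have a value in L
      .assign(L=lambda d: d["L"].str.split(","))     # "6,7" -> ["6", "7"]
      .explode("L")                                  # one row per list element
      .loc[:, ["P", "L"]]
      .reset_index(drop=True)
)
out["L"] = out["L"].str.strip()

print(out)
```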
First I extract the multiple values of column L into a new Series s, whose index repeats the original index for each split value. Then I remove the columns L and Q, which are no longer needed, join s to the original df, and drop the rows with NaN values.
print(df)
P Q L
0 1 2 3
1 2 3 NaN
2 4 5 6,7
s = df['L'].str.split(',').apply(pd.Series).stack()
s.index = s.index.droplevel(-1) # to line up with df's index
s.name = 'L'
print(s)
0 3
2 6
2 7
Name: L, dtype: object
df = df.drop(['L', 'Q'], axis=1)
df = df.join(s)
print(df)
P L
0 1 3
1 2 NaN
2 4 6
2 4 7
df = df.dropna().reset_index(drop=True)
print(df)
P L
0 1 3
1 4 6
2 4 7
I was solving a similar issue when I needed to create a new dataframe as a subset of a larger dataframe. Here's how I went about generating the second dataframe:
import pandas as pd
rows = []
for i, row in df1.iterrows():
    if row['company_id'] == 12345 or row['company_id'] == 56789:
        rows.append(row)
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df2 = pd.DataFrame(rows).reset_index(drop=True)
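For a simple membership test like this one, a boolean mask avoids the row loop altogether. A minimal sketch with made-up sample data; the contents of `df1` and the name `wanted` are illustrative assumptions.

```python
import pandas as pd

df1 = pd.DataFrame({
    "company_id": [12345, 11111, 56789],
    "column1": ["a", "b", "c"],
    "column2": [1, 2, 3],
})

wanted = [12345, 56789]

# isin builds a boolean mask; indexing with it keeps only the matching rows.
df2 = df1[df1["company_id"].isin(wanted)].reset_index(drop=True)

print(df2)
```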
