Find unique values for each column - python

I am looking to find the unique values for each column in my dataframe. (Values unique for the whole dataframe)
Col1 Col2 Col3
1 A A B
2 C A B
3 B B F
Col1 has C as a unique value, Col2 has none and Col3 has F.
Any genius ideas ? thank you !

You can use stack to get a Series, then drop_duplicates with keep=False to remove every duplicated value, drop the first index level with reset_index, and finally reindex by the original columns:
df = (df.stack()
        .drop_duplicates(keep=False)
        .reset_index(level=0, drop=True)
        .reindex(index=df.columns))
print (df)
Col1 C
Col2 NaN
Col3 F
dtype: object
The solution above works nicely only if there is at most one unique value per column. Here is an attempt at a more general solution:
print (df)
Col1 Col2 Col3
1 A A B
2 C A X
3 B B F
s = df.stack().drop_duplicates(keep=False).reset_index(level=0, drop=True)
print (s)
Col1 C
Col3 X
Col3 F
dtype: object
s = s.groupby(level=0).unique().reindex(index=df.columns)
print (s)
Col1 [C]
Col2 NaN
Col3 [X, F]
dtype: object

I don't believe this is exactly what you want, but as useful information: you can find the unique values across a whole DataFrame using NumPy's np.unique() like so:
>>> np.unique(df[['Col1', 'Col2', 'Col3']])
['A' 'B' 'C' 'F']
You can also get unique values of a specific column, e.g. Col3:
>>> df.Col3.unique()
['B' 'F']
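For completeness, here is a minimal sketch of another way to get the per-column unique values: count occurrences over the whole frame with value_counts and keep only the values seen once (the approach and variable names are my own, not from the answers above):
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'C', 'B'],
                   'Col2': ['A', 'A', 'B'],
                   'Col3': ['B', 'B', 'F']}, index=[1, 2, 3])

# Count every value across the whole DataFrame, then keep those seen exactly once.
counts = df.stack().value_counts()
singletons = set(counts[counts == 1].index)   # {'C', 'F'}

# For each column, list the values that are unique to the whole frame.
result = {col: [v for v in df[col] if v in singletons] for col in df.columns}
print(result)   # {'Col1': ['C'], 'Col2': [], 'Col3': ['F']}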

Related

Group a pandas df on variable columns

I have a dataframe which I want to aggregate as follows: group by col1 and col6, and collect col2 and col3 into a new column.
col1 col2 col3 col6
a it1 3 f
a it2 5 f
b it6 7 g
b it7 8 g
I would like the result to look like this:
col1 col6 new_col
a f pd.DataFrame({"col2": ["it1", "it2"],"col3":[3,5]})
b g pd.DataFrame({"col2": ["it6", "it7"],"col3":[7,8]})
I tried the following:
def aggregate(gr):
    return pd.DataFrame({"col2": gr["col2"], "col3": gr["col3"]})
df.groupby("col1").agg(aggregate)
but aggregate seems not to be the right solution for this.
What is the right way to do this?
It is not entirely clear what you are trying to achieve so here are two ideas. First, if you are going to convert to json anyway, you can convert each group to json:
df.groupby(['col1','col6']).apply(lambda d: d.to_json())
produces
col1 col6
a f {"col1":{"0":"a","1":"a"},"col2":{"0":"it1","1...
b g {"col1":{"2":"b","3":"b"},"col2":{"2":"it6","3...
Second, you can have DataFrames inside a DataFrame; here is how you can do that:
dd = {}
for idx, gr in df.groupby(['col1','col6']):
    dd[idx] = aggregate(gr)

dfout = pd.DataFrame(columns=['newcol'], index=dd.keys())
for idx in dfout.index:
    dfout.at[idx, 'newcol'] = dd[idx]
Here is how it is printed nicely with the help of the tabulate package:
from tabulate import tabulate
print(tabulate(dfout, headers = 'keys'))
newcol
---------- ------------
('a', 'f') col2 col3
0 it1 3
1 it2 5
('b', 'g') col2 col3
2 it6 7
3 it7 8
So dfout has the right DataFrames inside. When converted to JSON it looks like this:
dfout.to_json()
'{"newcol":{"(\'a\', \'f\')":{"col2":{"0":"it1","1":"it2"},"col3":{"0":3,"1":5}},"(\'b\', \'g\')":{"col2":{"2":"it6","3":"it7"},"col3":{"2":7,"3":8}}}}'

Extract regex match from pandas column and use as index to access list element

I have a pandas dataframe that looks like this:
col1 col2 col3
0 A,B,C 0|0 1|1
1 D,E,F 2|2 3|3
2 G,H,I 4|4 0|0
My goal is to apply a function on col2 through the last column of the dataframe that splits the corresponding string in col1, using the comma as the delimiter, and uses the first number as the index to get the corresponding list element. For numbers that are greater than the length of the list, I'd like to replace with the 0th element of the list.
Expected output:
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
In reality, my dataframe has thousands of columns with millions of entries that need this replacement, so I need a method that doesn't refer to 'col2' and 'col3' explicitly (and ideally one that is computationally efficient).
You can use this code to create the original dataframe:
df = pd.DataFrame(
{
'col1': ['A,B,C', 'D,E,F', 'G,H,I'],
'col2': ['0|0', '2|2', '4|4'],
'col3': ['1|1', '3|3', '0|0']
}
)
Taking into account that you could have a lot of columns and the length of the arrays in col1 could vary, you can use the following generalization, which only loops through the columns:
for col in df.columns[1:]:
    df[col] = (df['col1'] + ',' + df[col].str.split('|').str[0]).str.split(',') \
        .apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
which outputs for your example:
>>> print(df)
col1 col2 col3
0 A,B,C A B
1 D,E,F F D
2 G,H,I G G
Explanation:
First, you get the index as a string from colX and append it to the string in col1, so that you get something like 'A,B,C,0'; splitting on the comma gives a list whose last element is the index you need (['A', 'B', 'C', '0']):
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',')
Then you apply a function that returns the i-th element, where i is the last element of the list; if i is out of range for the letters, it returns just the first element of the list.
(df['col1']+','+df[col].str.split('|').str[0]).str.split(',') \
.apply(lambda x: x[int(x[-1])] if int(x[-1]) < len(x[:-1]) else x[0])
Last but not least, you just put it in a loop over your desired columns.
I would first reduce your strange 'x|x' format down to the leading number:
df['col2'] = df['col2'].str.split('|', expand=True).iloc[:, 0]
df['col3'] = df['col3'].str.split('|', expand=True).iloc[:, 0]
Then split the letter mappings while keeping them aligned by row.
ndf = pd.concat([df, df['col1'].str.split(',', expand=True)], axis=1)
After that, map them back by row while making sure to prevent overflows:
def bad_mapping(row, c):
    value = int(row[c])
    if value <= 2:        # 2 == index of the last split letter column; adjust if col1 has more letters
        return row[value]
    else:
        return row[0]

for c in ['col2', 'col3']:
    ndf['mapped_' + c] = ndf.apply(lambda r: bad_mapping(r, c), axis=1)
Output looks like:
col1 col2 col3 0 1 2 mapped_col2 mapped_col3
0 A,B,C 0 1 A B C A B
1 D,E,F 2 3 D E F F D
2 G,H,I 4 0 G H I G G
Drop columns with df.drop(columns=['your', 'columns', 'here'], inplace=True) as needed.
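If the number of letters in col1 can vary by row, here is a sketch of a more general mapping of my own, with the bound derived per row instead of hardcoding 2 (it assumes col2/col3 have already been reduced to their leading number as shown above):
def safe_mapping(row, c):
    letters = row['col1'].split(',')            # e.g. ['A', 'B', 'C']
    i = int(row[c])
    return letters[i] if i < len(letters) else letters[0]

value_cols = df.columns[1:]                     # every original column after col1
for c in value_cols:
    df['mapped_' + c] = df.apply(lambda r, c=c: safe_mapping(r, c), axis=1)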

How to find common elements in several dataframes

I have the following dataframes:
df1 = pd.DataFrame({'col1': ['A','M','C'],
'col2': ['B','N','O'],
# plus many more
})
df2 = pd.DataFrame({'col3': ['A','A','A','B','B','B'],
'col4': ['M','P','Q','J','P','M'],
# plus many more
})
Which look like these:
df1:
col1 col2
A B
M N
C O
#...plus many more
df2:
col3 col4
A M
A P
A Q
B J
B P
B M
#...plus many more
The objective is to create a dataframe containing, for each row of df1, the col4 elements common to both of that row's values when looked up in col3 of df2. For example, let's look at row 1 of df1. We see that A is in col1 and B is in col2. Then, we go to df2 and check what col4 is for df2[df2['col3'] == 'A'] and df2[df2['col3'] == 'B']. We get, for A: ['M','P','Q'], and for B: ['J','P','M']. The intersection of these is ['M', 'P'], so what I want is something like this:
col1 col2 col4
A B M
A B P
....(and so on for the other rows)
The naive way to go about this is to iterate over rows and then get the intersection, but I was wondering if it's possible to solve this via merging techniques or other faster methods. So far, I can't think of any way how.
This should achieve what you want, using a combination of merge, groupby and set intersection:
# Getting tuple of all col1=col3 values in col4
df3 = pd.merge(df1, df2, left_on='col1', right_on='col3')
df3 = df3.groupby(['col1', 'col2'])['col4'].apply(tuple)
df3 = df3.reset_index()
# Getting tuple of all col2=col3 values in col4
df3 = pd.merge(df3, df2, left_on='col2', right_on='col3')
df3 = df3.groupby(['col1', 'col2', 'col4_x'])['col4_y'].apply(tuple)
df3 = df3.reset_index()
# Taking set intersection of our two tuples
df3['col4'] = df3.apply(lambda row: set(row['col4_x']) & set(row['col4_y']), axis=1)
# Dropping unnecessary columns
df3 = df3.drop(['col4_x', 'col4_y'], axis=1)
print(df3)
col1 col2 col4
0 A B {P, M}
If required, see this answer for examples of how to 'melt' col4.
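As a possible merge-only alternative (a sketch of my own, not from the answer above): merge df2 onto col1 and onto col2 separately, then inner-merge the two results so that only the col4 values shared by both sides survive:
import pandas as pd

df1 = pd.DataFrame({'col1': ['A', 'M', 'C'], 'col2': ['B', 'N', 'O']})
df2 = pd.DataFrame({'col3': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'col4': ['M', 'P', 'Q', 'J', 'P', 'M']})

# col4 values reachable from col1, and from col2, one merge each.
left = df1.merge(df2, left_on='col1', right_on='col3')[['col1', 'col2', 'col4']]
right = df1.merge(df2, left_on='col2', right_on='col3')[['col1', 'col2', 'col4']]

# The inner merge keeps only col4 values present for both col1 and col2.
result = left.merge(right, on=['col1', 'col2', 'col4']).drop_duplicates()
print(result)
#   col1 col2 col4
# 0    A    B    M
# 1    A    B    P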

Count unique symbols per column in Pandas

I was wondering how to calculate the number of unique symbols that occur in a single column in a dataframe. For example:
df = pd.DataFrame({'col1': ['a', 'bbb', 'cc', ''], 'col2': ['ddd', 'eeeee', 'ff', 'ggggggg']})
df
  col1     col2
0    a      ddd
1  bbb    eeeee
2   cc       ff
3       ggggggg
It should calculate that col1 contains 3 unique symbols, and col2 contains 4 unique symbols.
My code so far (but this might be wrong):
unique_symbols = [0]*203
i = 0
for col in df.columns:
    observed_symbols = []
    df_temp = df[[col]]
    df_temp = df_temp.astype('str')
    # This part is where I am not so sure
    for index, row in df_temp.iterrows():
        for symbol in row[col]:
            if symbol not in observed_symbols:
                observed_symbols.append(symbol)
    unique_symbols[i] = len(observed_symbols)
    i += 1
Thanks in advance
Option 1
str.join + set inside a dict comprehension
For problems like this, I'd prefer falling back to python, because it's so much faster.
{c : len(set(''.join(df[c]))) for c in df.columns}
{'col1': 3, 'col2': 4}
Option 2
agg
If you want to stay in pandas space.
df.agg(lambda x: set(''.join(x)), axis=0).str.len()
Or,
df.agg(lambda x: len(set(''.join(x))), axis=0)
col1 3
col2 4
dtype: int64
Here is one way:
df.apply(lambda x: len(set(''.join(x.astype(str)))))
col1 3
col2 4
Maybe
df.sum().apply(set).str.len()
Out[673]:
col1 3
col2 4
dtype: int64
One more option:
In [38]: df.applymap(lambda x: len(set(x))).sum()
Out[38]:
col1 3
col2 4
dtype: int64

Transforming a CSV from wide to long format

I have a csv like this:
col1,col2,col2_val,col3,col3_val
A,1,3,5,6
B,2,3,4,5
and I want to transform this CSV into this:
col1,col6,col7,col8
A,Col2,1,3
A,col3,5,6
There is a col3 and a col3_val, so I want to put the name 'col3' in col6, col3's value in col7, and col3_val's value in col8, all in the same row.
I think what you're looking for is df.melt and df.groupby:
In [63]: df.rename(columns=lambda x: x.strip('_val')).melt('col1')\
.groupby(['col1', 'variable'], as_index=False)['value'].apply(lambda x: pd.Series(x.values))\
.add_prefix('value')\
.reset_index()
Out[63]:
col1 variable value0 value1
0 A col2 1 3
1 A col3 5 6
2 B col2 2 3
3 B col3 4 5
Credit to John Galt for help with the second part.
If you wish to rename columns, assign the whole expression above to df_out and then do:
df_out.columns = ['col1', 'col6', 'col7', 'col8']
Saving this should be straightforward with df.to_csv.
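If you prefer something more explicit than the melt/groupby chain, here is a minimal loop-based sketch of my own (the colN/colN_val pairs are hardcoded as an assumption) that builds one block per pair and concatenates them:
import pandas as pd

# The example CSV from the question, built inline to keep the sketch self-contained.
df = pd.DataFrame({'col1': ['A', 'B'],
                   'col2': [1, 2], 'col2_val': [3, 3],
                   'col3': [5, 4], 'col3_val': [6, 5]})

blocks = []
for base in ['col2', 'col3']:                   # the colN / colN_val pairs
    block = df[['col1', base, base + '_val']].copy()
    block.columns = ['col1', 'col7', 'col8']
    block.insert(1, 'col6', base)               # keep the source column's name
    blocks.append(block)

out = pd.concat(blocks, ignore_index=True).sort_values(['col1', 'col6'])
print(out)
#   col1  col6  col7  col8
# 0    A  col2     1     3
# 2    A  col3     5     6
# 1    B  col2     2     3
# 3    B  col3     4     5
From there, out.to_csv('output.csv', index=False) writes the reshaped file ('output.csv' is just a placeholder name).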
