Pandas merge duplicate DataFrame columns preserving column names - python

How can I merge duplicate DataFrame columns and also keep all original column names?
e.g. If I have the DataFrame
import pandas as pd

df = pd.DataFrame({"col1": [0, 0, 1, 2, 5, 3, 7],
                   "col2": [0, 1, 2, 3, 3, 3, 4],
                   "col3": [0, 1, 2, 3, 3, 3, 4]})
I can remove the duplicate columns (yes, the transpose is slow for large DataFrames) with
df.T.drop_duplicates().T
but this only preserves one column name per unique column:
col1 col2
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
How can I keep the information on which columns were merged? e.g. something like
[col1] [col2, col3]
0 0 0
1 0 1
2 1 2
3 2 3
4 5 3
5 3 3
6 7 4
Thanks!

# group the columns by their values (one grouping key per row)
grouped_columns = df.groupby(list(df.values), axis=1).apply(lambda g: g.columns.tolist())
# pick one representative column from each group
unique_df = df.loc[:, grouped_columns.str[0]]
# a list can't be used as a column name, so join the grouped names into one string
unique_df.columns = grouped_columns.apply("-".join)
unique_df
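Note that DataFrame.groupby with axis=1 is deprecated in recent pandas. A rough equivalent that avoids it (a sketch, assuming only that each column's values can be hashed as a tuple):

# one hashable key per column: the tuple of its values
keys = df.apply(tuple)
# collect the column names that share the same value tuple
groups = df.columns.to_series().groupby(keys).agg(list)
unique_df = df.loc[:, [g[0] for g in groups]]
unique_df.columns = ["-".join(g) for g in groups]
unique_df

For the example above this should yield the columns col1 and col2-col3.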

I also used T and tuple to group by:
def f(x):
    # keep the first row of each group, renamed to the joined original column names
    d = x.iloc[[0]]
    d.index = ['-'.join(x.index.tolist())]
    return d

df.T.groupby(df.apply(tuple), group_keys=False).apply(f).T
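Here df.apply(tuple) supplies one hashable grouping key per column of the original frame:

df.apply(tuple)
col1    (0, 0, 1, 2, 5, 3, 7)
col2    (0, 1, 2, 3, 3, 3, 4)
col3    (0, 1, 2, 3, 3, 3, 4)
dtype: object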

Related

Pandas compare columns and drop rows based on values in another column

Is there a way to drop values in one column based on comparison with another column? Assuming the columns are of equal length
For example, iterate through each row and drop values in col1 greater than values in col2? Something like this:
df['col1'].drop.where(df['col1'] >= df['col2'])
import pandas as pd

d = {
    '1': [1, 2, 3, 4, 5],
    '2': [2, 4, 1, 6, 3]
}
df = pd.DataFrame(d)
print(df)
dfd = df.drop(df[(df['1'] >= df['2'])].index)
print('update')
print(dfd)
Output
1 2
0 1 2
1 2 4
2 3 1
3 4 6
4 5 3
update
1 2
0 1 2
1 2 4
3 4 6
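Equivalently, you can keep the rows you want with boolean indexing instead of dropping the others (same data as above):

# keep rows where column '1' is strictly less than column '2'
dfd = df[df['1'] < df['2']]

This selects the same three rows.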

Select rows of pandas dataframe in order of a given list with repetitions and keep the original index

After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first so we can set_index after the merging:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
    df.assign(idx=df.index)
      .merge(df2, on='A')
      .sort_values('order')
      .set_index('idx')
      .drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is (note that the index here is the merged frame's default index, not the original one):
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3
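If the values in A are unique, as they are in this example, the merge is not strictly necessary. A sketch using a value-to-index lookup (assuming that uniqueness):

# map each value of A back to its original index label
lookup = pd.Series(df.index, index=df['A'])
labels = [lookup[v] for v in list_of_values if v in lookup.index]
dfn = df.loc[labels]

This selects rows 2, 3, 1, 3, 2 in that order, skips the missing 8, and keeps the original index.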

Is there a function for getting the number of unique values in the dataframe in each group?

I have a dataframe that has two columns: label and value. For each label group, I would like to count the values that do not occur in any other group.
For example, given the following dataframe:
test_df = pd.DataFrame({
'label': [1, 1, 1, 1, 2, 2, 3, 3, 3],
'value': [0, 0, 1, 2, 1, 2, 2, 3, 4]})
test_df
label value
0 1 0
1 1 0
2 1 1
3 1 2
4 2 1
5 2 2
6 3 2
7 3 3
8 3 4
The expected output is:
label uni_val
0 1 1 -> {0} is unique value for this label compared to other labels
1 2 0 -> no unique values for this label compared to other labels
2 3 2 -> {3, 4} are unique values for this label compared to other labels
One way of doing this is to collect the unique values for each label and then count, per label, the values that appear under no other label.
test_df.groupby('label')['value'].unique()
label
1 [0, 1, 2]
2 [1, 2]
3 [2, 3, 4]
Name: value, dtype: object
Is there a more efficient and simpler way?
You could drop duplicates on ['label', 'value'], then drop duplicates on value:
(test_df.drop_duplicates(['label','value']) # remove duplicates on pair (label, value)
.drop_duplicates('value', keep=False) # only keep unique `value`
.groupby('label')['value'].count() # count as usual
.reindex(test_df.label.unique(), fill_value=0) # fill missing labels with 0
)
Output:
label
1 1
2 0
3 2
Name: value, dtype: int64
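A variant of the same idea counts how many distinct labels each value occurs under and keeps the values seen under exactly one label (a sketch on the same data):

# number of distinct labels per value
labels_per_value = test_df.drop_duplicates(['label', 'value'])['value'].value_counts()
unique_values = labels_per_value[labels_per_value == 1].index
(test_df[test_df['value'].isin(unique_values)]
    .groupby('label')['value'].nunique()
    .reindex(test_df['label'].unique(), fill_value=0))

This produces the same counts of 1, 0 and 2.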

Removing certain Rows from subset of df

I have a pandas dataframe. All the columns right of column#2 may only contain the value 0 or 1. If they contain a value that is NOT 0 or 1, I want to remove that entire row from the dataframe.
So I created a subset of the dataframe to only contain columns right of #2
Then I found the indices of the rows that had values other than 0 or 1 and deleted it from the original dataframe.
Please see the code below:
#reading data file:
data=pd.read_csv('MyData.csv')
#all the columns right of column#2 may only contain the value 0 or 1. So "prod" is a subset of the data df containing these columns:
prod = data.iloc[:,2:]
index_prod = prod[(prod != 0) & (prod != 1)].dropna().index
data = data.drop(index_prod)
However, when I run this, index_prod is empty, so nothing is dropped at all.
Update: a friend told me the data was not numeric and fixed it by converting it. How can I find that out myself? All the columns looked numeric to me, all numbers.
You can check the dtypes with DataFrame.dtypes:
print(data.dtypes)
Or list the non-numeric columns:
import numpy as np

print(data.columns.difference(data.select_dtypes(np.number).columns))
And then convert all values without first 2 to numeric:
data.iloc[:,2:] = data.iloc[:,2:].apply(lambda x: pd.to_numeric(x, errors='coerce'))
Or all columns:
data = data.apply(lambda x: pd.to_numeric(x, errors='coerce'))
Finally, filter the rows:
subset = data.iloc[:,2:]
data1 = data[subset.isin([0,1]).all(axis=1)]
Let's say you have this dataframe:
data = {'A': [1, 2, 3, 4, 5], 'B': [0, 1, 4, 3, 1], 'C': [2, 1, 0, 3, 4]}
df = pd.DataFrame(data)
A B C
0 1 0 2
1 2 1 1
2 3 4 0
3 4 3 3
4 5 1 4
Say you want to delete the rows where column B does not contain 0 or 1. That can be done with:
subset = df.iloc[:, 1:2]   # just column B
index = subset[(subset != 0) & (subset != 1)].dropna().index
df = df.drop(index)
df
A B C
0 1 0 2
1 2 1 1
4 5 1 4
df.reset_index(drop=True)
A B C
0 1 0 2
1 2 1 1
2 5 1 4
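For completeness, the isin solution above checks every column of the subset at once rather than a single one; applied to this example (rebuilding df first) it keeps only the row where both B and C are 0 or 1:

df = pd.DataFrame(data)   # rebuild the example frame
df[df.iloc[:, 1:].isin([0, 1]).all(axis=1)]

which leaves only row 1 (A=2, B=1, C=1).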

Use Pandas to list all values of a second column and count of the first column

Given this dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
I can use groupby to show the size of the group of combinations:
df.groupby(['A','B']).size()
A B
1 2 1
3 1
4 6 1
How can I combine the unique values of B into a list and also display the size of A like this?
A B
1 2,3 2
4 6 1
Use:
df['B'].astype(str).groupby(df['A']).agg([','.join,'size'])
join size
A
1 2,3 2
4 6 1
Group only on A and use .agg specifying a dictionary to use for each column.
df.groupby('A').agg({'B': list, 'A': 'size'}).rename(columns={'A': 'Size'})
B Size
A
1 [2, 3] 2
4 [6] 1
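With pandas 0.25 or later you can also use named aggregation to set the output column names directly (a sketch producing the same result):

df.groupby('A').agg(B=('B', lambda s: ','.join(s.astype(str))),
                    Size=('B', 'size'))

This gives B as the joined strings '2,3' and '6' with Size 2 and 1.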
