Pandas duplicated indexes still shows correct elements - python

I have a pandas DataFrame like this:
import pandas as pd

test = pd.DataFrame({'score1': pd.Series(['a', 'b', 'c', 'd', 'e']),
                     'score2': pd.Series(['b', 'a', 'k', 'n', 'c'])})
Output:
  score1 score2
0      a      b
1      b      a
2      c      k
3      d      n
4      e      c
I then split the score1 and score2 columns and concatenate them together:
In [283]: frame1 = test[['score1']]
frame2 = test[['score2']]
frame2.rename(columns={'score2': 'score1'}, inplace=True)
test = pd.concat([frame1, frame2])
test
Out[283]:
  score1
0      a
1      b
2      c
3      d
4      e
0      b
1      a
2      k
3      n
4      c
Notice the duplicate indexes. Now if I do a groupby and then retrieve a group using get_group(), pandas is still able to retrieve the elements with the correct index, even though the indexes are duplicated!
In [283]: groups = test.groupby('score1')
groups.get_group('a') # Get group with key a
Out[283]:
  score1
0      a
1      a
In [283]: groups.get_group('b') # Get group with key b
Out[283]:
  score1
1      b
0      b
I understand that pandas uses an inverted index data structure for storing the groups, which looks like this:
In [284]: groups.groups
Out[284]: {'a': [0, 1], 'b': [1, 0], 'c': [2, 4], 'd': [3], 'e': [4], 'k': [2], 'n': [3]}
If both a and b are stored at index 0, how does pandas show me the elements correctly when I do get_group()?

This digs into the internals (i.e., don't rely on this API!), but the way it works now is that there is a Grouping object that stores the groups in terms of positions rather than index labels.
In [25]: gb = test.groupby('score1')
In [26]: gb.grouper
Out[26]: <pandas.core.groupby.BaseGrouper at 0x4162b70>
In [27]: gb.grouper.groupings
Out[27]: [Grouping(score1)]
In [28]: gb.grouper.groupings[0]
Out[28]: Grouping(score1)
In [29]: gb.grouper.groupings[0].indices
Out[29]:
{'a': array([0, 6], dtype=int64),
'b': array([1, 5], dtype=int64),
'c': array([2, 9], dtype=int64),
'd': array([3], dtype=int64),
'e': array([4], dtype=int64),
'k': array([7], dtype=int64),
'n': array([8], dtype=int64)}
See here for where it's actually implemented.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L2091
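As a quick check against the example above, the 'b' group is stored as positions [1, 5] in the concatenated frame, and a positional lookup with .iloc returns exactly the rows that get_group('b') shows, duplicate labels and all (a small sketch; the grouper attributes shown above are internal and may differ between pandas versions):
import pandas as pd

frame1 = pd.DataFrame({'score1': ['a', 'b', 'c', 'd', 'e']})
frame2 = pd.DataFrame({'score1': ['b', 'a', 'k', 'n', 'c']})
test = pd.concat([frame1, frame2])

groups = test.groupby('score1')
print(groups.get_group('b'))   # labels 1 and 0, both 'b'
print(test.iloc[[1, 5]])       # same rows, selected by position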

Related

Filling column of dataframe based on 'groups' of values of another column

I am trying to fill values of a column based on the value of another column. Suppose I have the following dataframe:
import numpy as np
import pandas as pd

data = {'A': [4, 4, 5, 6],
        'B': ['a', np.nan, np.nan, 'd']}
df = pd.DataFrame(data)
I would like to fill column B, but only where the value of column A equals 4: all rows that share a value in column A should end up with the same value in column B (filled from the row that already has one).
Thus, the desired output should be:
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', np.nan, 'd']}
df = pd.DataFrame(data)
I am aware of the fillna method, but this gives the wrong output, as the third row also gets the value 'a' assigned:
df['B'] = df['B'].fillna(method="ffill")
data = {'A': [4, 4, 5, 6],
        'B': ['a', 'a', 'a', 'd']}
df = pd.DataFrame(data)
How can I get the desired output?
Group by column A and forward-fill within each group, so the fill never crosses into rows with a different value of A:
df['B'] = df.groupby('A')['B'].ffill()
Output:
>>> df
   A    B
0  4    a
1  4    a
2  5  NaN
3  6    d
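For comparison, a plain forward fill (no grouping) bleeds the 'a' into the A == 5 row, which is exactly the wrong output described in the question (a minimal sketch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [4, 4, 5, 6], 'B': ['a', np.nan, np.nan, 'd']})

print(df['B'].ffill().tolist())               # ['a', 'a', 'a', 'd']  -- fill crosses groups
print(df.groupby('A')['B'].ffill().tolist())  # ['a', 'a', nan, 'd']  -- fill stays within A == 4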

How to convert Pandas data frame to dict with values in a list

I have a huge pandas data frame with a structure like the example below:
import pandas as pd
df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
df
  col1  col2
0    A     1
1    A     2
2    B     5
3    C     2
4    C     4
5    C     6
The task is to build a dictionary with elements in col1 as keys and corresponding elements in col2 as values. For the example above the output should be:
A -> [1, 2]
B -> [5]
C -> [2, 4, 6]
Although I can write a solution such as
from collections import defaultdict

dd = defaultdict(list)
for row in df.itertuples():
    dd[row.col1].append(row.col2)
I wonder if somebody is aware of a more "Python-native" solution, using built-in pandas functions.
Without apply, we can do it with a dict comprehension over the groupby:
{x: y.tolist() for x, y in df.col2.groupby(df.col1)}
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
Use GroupBy.apply with list to get a Series of lists, then Series.to_dict:
d = df.groupby('col1')['col2'].apply(list).to_dict()
print(d)
{'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}
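If you'd rather avoid apply, agg(list) should give the same mapping (an equivalent sketch, not from the original answers):
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'A', 'B', 'C', 'C', 'C'], 'col2': [1, 2, 5, 2, 4, 6]})
d = df.groupby('col1')['col2'].agg(list).to_dict()
print(d)   # {'A': [1, 2], 'B': [5], 'C': [2, 4, 6]}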

How to get index for all the duplicates in a dataframe (pandas - python)

I have a data frame with multiple columns, and I want to find the duplicates in some of them. My columns go from A to Z. I want to know which lines have the same values in columns A, D, F, K, L, and G.
I tried:
df = df[df.duplicated(keep=False)]
df = df.groupby(df.columns.tolist()).apply(lambda x: tuple(x.index)).tolist()
However, this uses all of the columns.
I also tried
print(df[df.duplicated(['A', 'D', 'F', 'K', 'L', 'P'])])
However, this only returns the index of the later duplicates. I want the indices of all rows that share the same values.
Your final attempt is close. Instead of grouping by all columns, just use a list of the ones you want to consider:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                   'B': [3, 3, 3, 4, 4, 5],
                   'C': [6, 7, 8, 9, 10, 11]})
res = df.groupby(['A', 'B']).apply(lambda x: x.index.tolist()).reset_index()
print(res)
#    A  B          0
# 0  1  3  [0, 1, 2]
# 1  2  4     [3, 4]
# 2  2  5        [5]
A different layout using groupby:
df.index.to_series().groupby([df['A'],df['B']]).apply(list)
Out[449]:
A  B
1  3    [0, 1, 2]
2  4       [3, 4]
   5          [5]
dtype: object
You can have .groupby return a dict whose keys are the group labels (tuples when grouping by multiple columns) and whose values are the Index of rows in each group:
df.groupby(['A', 'B']).groups
#{(1, 3): Int64Index([0, 1, 2], dtype='int64'),
# (2, 4): Int64Index([3, 4], dtype='int64'),
# (2, 5): Int64Index([5], dtype='int64')}
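Since the question asks for duplicates specifically, you may only want the groups that contain more than one row; a small filter over .groups does that (a sketch using the example A/B columns above; substitute your own column list, e.g. ['A', 'D', 'F', 'K', 'L', 'G']):
groups = df.groupby(['A', 'B']).groups
dupes = {key: idx.tolist() for key, idx in groups.items() if len(idx) > 1}
print(dupes)   # {(1, 3): [0, 1, 2], (2, 4): [3, 4]}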

Efficient way to select row from a DataFrame based on varying list of columns

Suppose, we have the following DataFrame:
import pandas as pd

dt = {'A': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'c'],
      'B': ['x', 'x', 'x', 'y', 'y', 'z', 'x', 'z', 'y'],
      'C': [10, 14, 15, 11, 10, 14, 14, 11, 10],
      'D': [1, 3, 2, 1, 3, 5, 1, 4, 2]}
df = pd.DataFrame(data=dt)
I want to extract certain rows based on a dictionary where keys are column names and values are row values. For example:
d = {'A': 'a', 'B': 'x'}
d = {'A': 'a', 'B': 'y', 'C': 10}
d = {'A': 'b', 'B': 'z', 'C': 11, 'D': 4}
It can be done with a loop (consider the last dictionary):
for iCol in d:
    df = df[df[iCol] == d[iCol]]
Out[215]:
   A  B   C  D
7  b  z  11  4
Since the DataFrame is expected to be pretty large and may have many columns to select on, I am looking for an efficient way to solve the problem without a for loop over the columns.
Make the dict a Series and compare all the selected columns at once:
print(df[(df[list(d)] == pd.Series(d)).all(axis=1)])
Output:
   A  B   C  D
7  b  z  11  4
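To see why this works, you can break the one-liner into steps: df[list(d)] keeps only the columns named in the dict, the comparison against pd.Series(d) aligns on column names, and .all(axis=1) keeps rows where every condition holds (a step-by-step sketch of the same expression):
import pandas as pd

dt = {'A': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'c'],
      'B': ['x', 'x', 'x', 'y', 'y', 'z', 'x', 'z', 'y'],
      'C': [10, 14, 15, 11, 10, 14, 14, 11, 10],
      'D': [1, 3, 2, 1, 3, 5, 1, 4, 2]}
df = pd.DataFrame(data=dt)
d = {'A': 'b', 'B': 'z', 'C': 11, 'D': 4}

mask = (df[list(d)] == pd.Series(d)).all(axis=1)   # one boolean per row
print(df[mask])                                    # row 7: b z 11 4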

Assign new column using a set of sub-columns

I have a dataframe with a column 'name' of the form ['A', 'B', 'C', 'A', 'B', 'B', ...] and a set of arrays, one per name: say array_A = [0, 1, 2, ...], array_B = [3, 1, 0, ...], array_C, etc.
I want to create a new column 'value' by assigning array_A where the row name in the dataframe is 'A', and similarly for 'B' and 'C'.
Something like df['value'] = np.where(df['name'] == 'A', array_A, df['value']) won't do it, because it would either overwrite the values for other names or run into dimensionality issues.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})
arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
Desired output:
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1
You can use a for loop with a dictionary:
arrays = {'A': np.array([0, 1, 2]),
          'B': np.array([3, 1])}
df = pd.DataFrame({'name': ['A', 'B', 'A', 'A', 'B']})

for k, v in arrays.items():
    df.loc[df['name'] == k, 'value'] = v

df['value'] = df['value'].astype(int)
print(df)
  name  value
0    A      0
1    B      3
2    A      1
3    A      2
4    B      1
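Note that the loop assigns each array positionally to the rows where name matches, so it implicitly assumes len(arrays[k]) equals the number of matching rows for every key. A small guard makes that assumption explicit (a hypothetical check, not part of the original answer):
for k, v in arrays.items():
    n_rows = (df['name'] == k).sum()
    assert len(v) == n_rows, f"array for {k!r} has {len(v)} values but {n_rows} matching rows"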
