I have a large dataframe that mostly has unique IDs, but some IDs appear multiple times with different values stored. I want to group the rows with the same ID, apply a rule to select one row per group, and remove the others.
df = pd.DataFrame({'ID': [11, 11, 11, 11, 22, 22, 33],
                   'Source': [2, 2, 4, 3, 3, 2, 3],
                   'Price': [10, 20, 30, 40, 50, 60, 70]})
The logic is: if a group has a row with Source == 4, keep it and remove the others;
else, if a group has a row with Source == 2, keep it and remove the others;
else, if a group has a row with Source == 3, keep it and remove the others.
So the hierarchy is based on the Source column and it is 4 > 2 > 3.
Expected output:
expected = pd.DataFrame({'ID': [11, 22, 33],
                         'Source': [4, 2, 3],
                         'Price': [30, 60, 70]})
A possible solution is creating a new hierarchy column (if Source == 4 then hierarchy == 1, and so on), then sorting by it and selecting the first row per group. However, what I wonder most is: how can I do a conditional select after a groupby?
d = {4: 1, 2: 2, 3: 3}  # dict encoding the selection hierarchy
new = (df.assign(rank=df.Source.map(d))  # create a rank column that maps the hierarchy of selection
         .sort_values(by='rank')  # sort the new dataframe by rank
         .drop_duplicates(subset='ID', keep='first')  # keep only the first (best-ranked) row per ID
         .drop(columns='rank')  # drop the temporary sorting column
      )
print(new)
ID Source Price
2 11 4 30
5 22 2 60
6 33 3 70
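If you want to do the conditional selection after a groupby rather than via sorting, one option is to map the hierarchy to a numeric priority and let groupby().idxmax() pick the winning row label per ID. A minimal sketch on the question's data (the priority dict here is just another way to encode 4 > 2 > 3):
import pandas as pd

df = pd.DataFrame({'ID': [11, 11, 11, 11, 22, 22, 33],
                   'Source': [2, 2, 4, 3, 3, 2, 3],
                   'Price': [10, 20, 30, 40, 50, 60, 70]})

# Higher priority wins; idxmax returns the row label of the best row per ID
priority = {4: 3, 2: 2, 3: 1}
best_idx = df['Source'].map(priority).groupby(df['ID']).idxmax()
print(df.loc[best_idx])
#    ID  Source  Price
# 2  11       4     30
# 5  22       2     60
# 6  33       3     70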
I suspect you are really hunting for even and odd numbers, hence the 4, 2, 3 order. The code below should suffice and avoids the anonymous function, while offering some speed-up (depending on the data size); it is quite verbose in my opinion, though:
import numpy as np

(df.assign(even_odd=np.where(df.Source % 2 == 0, 'even', 'odd'))
   .groupby(['ID', 'even_odd'], as_index=False)
   .max()
   .drop_duplicates('ID', keep='first')
   .filter([*df.columns])
)
ID Source Price
0 11 4 30
2 22 2 60
4 33 3 70
Of course, this would fail if you had 5, 9, 6, 12, and so on, in which case another logic is required; this only works if the numbers are restricted to 4, 2, and 3.
With this data set I want to know the people (id) who have made payments for both types a and b, and to create a subset of the data with those people. (This is just an example set of data; the one I'm using is much larger.)
I've tried grouping by id and then making a subset of the data where the type count is >= 2. Then I tried creating another subset based on the condition df.loc[(df.type == 'a') & (df.type == 'b')]. I thought that if I grouped by id first and then ran that df.loc code it would work, but it doesn't.
Any help is much appreciated.
Thanks.
Separate the dataframe into two, one with type a payments and the other with type b payments, then merge them:
df_typea = df[df['type'] == 'a']
df_typeb = df[df['type'] == 'b']
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))
This will create a separate column for each payment type.
Now you can find the ids for which both payments have been made:
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
Note that this will create two records for items similar to id 9, for which there are more than two payments. I am assuming that you simply want to check whether any payments of type 'a' and 'b' have been made for each id. In this case, you can simply drop any duplicates:
df_payments_no_duplicates = df_payments['id'].drop_duplicates()
You first split your DataFrame into two DataFrames:
one with type a payments only
one with type b payments only
You then join both DataFrames on id.
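Putting the pieces together as a runnable sketch (the question's frame isn't shown, so the values here are made up for illustration):
import pandas as pd

# Hypothetical stand-in for the question's data
df = pd.DataFrame({'id': [1, 1, 2, 3, 4],
                   'type': ['a', 'b', 'a', 'b', 'a'],
                   'payment': [10, 15, 5, 20, 35]})

df_typea = df[df['type'] == 'a']
df_typeb = df[df['type'] == 'b']
df_merge = pd.merge(df_typea, df_typeb, how='outer', on='id', suffixes=('_a', '_b'))

# Rows where both sides of the merge matched have both payment types
df_payments = df_merge[(df_merge['type_a'] == 'a') & (df_merge['type_b'] == 'b')]
print(df_payments['id'].drop_duplicates().tolist())  # [1]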
You can use groupby to solve this problem. First, group by id and type; then you can group again to see whether the id had both types.
import pandas as pd
df = pd.DataFrame({"id" : [1, 1, 2, 3, 4, 4, 5, 5], 'payment' : [10, 15, 5, 20, 35, 30, 10, 20], 'type' : ['a', 'b', 'a','a','a','a','b', 'a']})
df_group = df.groupby(['id', 'type']).nunique()
#print(df_group)
'''
         payment
id type
1  a           1
   b           1
2  a           1
3  a           1
4  a           2
5  a           1
   b           1
'''
# if the value in this series is 2, the id has both a and b
data = df_group.groupby('id').size()
#print(data)
'''
id
1 2
2 1
3 1
4 1
5 2
dtype: int64
'''
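To finish the thought, you can pull out the ids whose group size is 2 (one row per type means both types are present):
# ids that have both payment types
both = data[data == 2].index.tolist()
print(both)  # [1, 5]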
You can use groupby and nunique to get the count of unique payment types for each id.
print(df.groupby('id')['type'].agg(['nunique']))
This will give you:
   nunique
id
1        2
2        1
3        1
4        1
5        1
6        2
7        1
8        1
9        2
If you want to list out only the rows whose id has both a and b types:
df['count'] = df.groupby('id')['type'].transform('nunique')
print(df[df['count'] > 1])
By using groupby.transform, each row is populated with its group's unique count. Then you can use count > 1 to filter the rows whose id has both a and b.
This will give you:
id payment type count
0 1 10 a 2
1 1 15 b 2
7 6 10 b 2
8 6 15 a 2
11 9 35 a 2
12 9 30 a 2
13 9 10 b 2
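If you only need the ids rather than the rows, the same mask reduces to a one-liner (the values follow the output above, which comes from the question's larger frame):
print(df[df['count'] > 1]['id'].unique())  # [1 6 9]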
You may also use the length of the set of 'type' values returned for a given id:
len(set(df[df['id']==1]['type'])) # returns 2
len(set(df[df['id']==2]['type'])) # returns 1
Thus, the following would give you an answer to your question
paid_both = []
for i in set(df['id']):
    if len(set(df[df['id'] == i]['type'])) == 2:
        paid_both.append(i)
## paid_both = [1, 6, 9]  # the ids who paid both
The loop above iterates through the unique id values and collects those for which the set has length 2, i.e. the people who have made payments of both types (a) and (b).
Let's say I have data like this:
df = pd.DataFrame({'category': ["blue", "blue", "blue", "blue", "green"],
                   'val1': [5, 3, 2, 2, 5],
                   'val2': [1, 3, 2, 2, 5]})
print(df)
category val1 val2
0 blue 5 1
1 blue 3 3
2 blue 2 2
3 blue 2 2
4 green 5 5
I want to filter by category, then select a column and a row range, like this:
print(df.loc[df['category'] == 'blue'].loc[1:2, 'val1'])
1 3
2 2
Name: val1, dtype: int64
This works for selecting the data I am interested in, but when I try to overwrite part of my dataframe with the above-selected data, I get "A value is trying to be set on a copy of a slice from a DataFrame".
I am familiar with this error message and I know it occurs when trying to overwrite something with a dataframe that was selected like df.loc[rows].loc[columns] instead of df.loc[rows, columns].
However, I can't figure out how to put all 3 things I am filtering for (a certain value for category, a certain column and a certain row range) into a single .loc[...]. How can I select the part of the data in a way that I can use it to overwrite part of the dataframe?
This happens because you are chaining two loc calls. My suggestion is to squash the two loc calls into one. You can do this by filtering first, then grabbing the resulting index to use in a single loc:
df.loc[df[df['category'].eq('blue')].index[1:3], 'val1'] = 123
Notice I have to use index[1:3] instead of index[1:2] because the end of the range is not inclusive for positional slicing (unlike loc, which slices inclusively by label).
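For completeness, here is the full round trip on the question's frame, showing that the single-loc assignment reaches the original dataframe without the warning:
import pandas as pd

df = pd.DataFrame({'category': ["blue", "blue", "blue", "blue", "green"],
                   'val1': [5, 3, 2, 2, 5],
                   'val2': [1, 3, 2, 2, 5]})

# Row labels come from the filtered index, the column is selected by name
df.loc[df[df['category'].eq('blue')].index[1:3], 'val1'] = 123
print(df)
#   category  val1  val2
# 0     blue     5     1
# 1     blue   123     3
# 2     blue   123     2
# 3     blue     2     2
# 4    green     5     5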
This is just to share a very basic concept for beginners of multiindex dataframes.
I noticed empty items in the index column of a two-level (MultiIndex) dataframe. Though this must be among the basics of MultiIndex dataframes, I was not familiar with it and had forgotten about it. I did not quickly notice what these might mean because my index values were very large numbers whose significance you do not even start to question. Sorting with df.sort_index(inplace=True) did not get rid of the empty items either. At first sight it seemed to me that the dataset itself had partly empty rows for the first index, and searching for "empty items of a multiindex" did not help either.
That is why I want to share this very simple problem with other beginners of multiindex dataframes.
Here are the "empty items" in index column 'A_idx':
A_idx B_idx
12344 12345  0.289163 -0.464633 -0.060487
      12345  0.224442  0.177609  2.156436
12346 12346 -0.262329 -0.248384  0.925580
12347 12347  0.051350  0.452014  0.206809
      12348  2.757255 -0.739196  0.183735
      12349 -0.064909 -0.963130  1.364771
12350 12351 -1.330857  1.881588 -0.262170
The "empty" items are part of the multiindex view and only appear when you output a df, it is helping you to understand the hierarchy. If you output the isolated Multiindex class, no item will be empty. Thus, the index items are never really empty, and the "empty" fields only appear for df ouputs:
If the "A_idx" index is assigned to more than one "B_idx" index value, the "A_idx" index is not repeated, because it is the parent.
If "A_idx" index points to more than one value row while "B_idx" index is repeating, B_idx is still repeated, because it is the child.
If you take the df.head(10) and find out that the "empty" index item is in line 1, you can also check this quickly in your df using df.iloc[1].reset_index(). You will see that the index item is not empty.
In the following, "first" and "second" are index names seemingly with equal rights to be either parent as they are on the same output line, but in reality the hierarchy goes from left to right.
first second
bar   one     0.289163 -0.464633 -0.060487
      two     0.224442  0.177609  2.156436
baz   one    -0.262329 -0.248384  0.925580
foo   one     0.051350  0.452014  0.206809
      two     2.757255 -0.739196  0.183735
      three  -0.064909 -0.963130  1.364771
qux   one    -1.330857  1.881588 -0.262170
Credit for the example goes to Access last elements of inner multiindex level in pandas dataframe.
This actually means:
first second
bar   one     0.289163 -0.464633 -0.060487
bar   two     0.224442  0.177609  2.156436
baz   one    -0.262329 -0.248384  0.925580
foo   one     0.051350  0.452014  0.206809
foo   two     2.757255 -0.739196  0.183735
foo   three  -0.064909 -0.963130  1.364771
qux   one    -1.330857  1.881588 -0.262170
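If you want the repeated parent labels printed instead of blanked, pandas has a display option for exactly this sparsified view. A small sketch (the index values are invented to mirror the example above):
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('foo', 'one'),
     ('foo', 'two'), ('foo', 'three'), ('qux', 'one')],
    names=['first', 'second'])
df = pd.DataFrame(np.random.randn(7, 3), index=idx)

# display.multi_sparse=False repeats every outer label in the output
with pd.option_context('display.multi_sparse', False):
    print(df)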
Here is an example of how to create such a hierarchy. The order of the column list passed to set_index() creates the hierarchy in that same order. You can check this in a small example I borrowed from pandas multiindex reindex by rows, with df2 covering a switch of the two indices. Only df shows the secret "empty items"; compare the df and df2 outputs:
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2012, 2013, 2013],
                   'sale': [55, 40, 84, 31]})
df2 = df.copy()
df = df.set_index(['year', 'month'])
df2 = df2.set_index(['month', 'year'])
df:
            sale
year month
2012 1        55
     4        40
2013 7        84
     10       31
df2 (month is now the unique parent level, so no label is blanked):
            sale
month year
1     2012    55
4     2012    40
7     2013    84
10    2013    31
df.index
Output:
MultiIndex([(2012,  1),
            (2012,  4),
            (2013,  7),
            (2013, 10)],
           names=['year', 'month'])
Or:
df2.index
Output:
MultiIndex([( 1, 2012),
            ( 4, 2012),
            ( 7, 2013),
            (10, 2013)],
           names=['month', 'year'])
Have a look at the levels in the df:
df.index.levels[0]
Int64Index([2012, 2013], dtype='int64', name='year')
df.index.levels[1]
Int64Index([1, 4, 7, 10], dtype='int64', name='month')
df2.index.levels[0]
Int64Index([1, 4, 7, 10], dtype='int64', name='month')
df2.index.levels[1]
Int64Index([2012, 2013], dtype='int64', name='year')
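Another quick way to confirm that the "empty" outer labels are really there: get_level_values returns one value per row, repeats included:
df.index.get_level_values('year')
Int64Index([2012, 2012, 2013, 2013], dtype='int64', name='year')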
If you want to check or clarify the different levels of the hierarchy in the output view, choose one row and reset the index:
df.iloc[1].reset_index()
Output:
  index  2012
            4
0  sale    40
Or:
df2.iloc[1].reset_index()
Output:
  index     4
         2012
0  sale    40
I am a beginner in Python and Pandas, and it has been 2 days since I opened Wes McKinney's book. So, this question might be a basic one.
I am using the Anaconda distribution (Python 3.6.6) and Pandas 0.21.0. I researched the following threads (https://pandas.pydata.org/pandas-docs/stable/advanced.html, the xs function at https://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-xs, Select only one index of multiindex DataFrame, Selecting rows from pandas by subset of multiindex, and https://pandas.pydata.org/pandas-docs/stable/indexing.html) before posting this. All of them explain how to subset a dataframe using either a hierarchical index or hierarchical columns, but not both.
Here's the data.
import pandas as pd
import numpy as np
from numpy import nan as NA
#Hierarchical index for row and column
data = pd.DataFrame(np.arange(36).reshape(6, 6),
                    index=[['a']*2 + ['b']*1 + ['c']*1 + ['d']*2,
                           [1, 2, 3, 1, 3, 1]],
                    columns=[['Title1']*3 + ['Title2']*3,
                             ['A']*2 + ['B']*2 + ['C']*2])
data.index.names = ['key1','key2']
data.columns.names = ['state','color']
Here are my questions:
Question 1: I'd like to access key1 = a, key2 = 1, state = Title1 (column), and color = A (column).
After a few trials and errors, I found that this version works (I really don't know why it works; my hypothesis is that data.loc['a', 1] gives an indexed dataframe, which is then subset, and so on):
data.loc['a',1].loc['Title1'].loc['A']
Is there a better way to subset above?
Question 2: How do I subset the data after deleting the indices?
data_wo_index = data.reset_index()
I'm relatively comfortable with data.table in R. So, I thought of using http://datascience-enthusiast.com/R/pandas_datatable.html to subset the data using my data.table knowledge.
I tried one step at a time, but even the first step (i.e., subsetting key1 == 'a') gave me an error:
data_wo_index[data_wo_index['key1']=='a']
Exception: cannot handle a non-unique multi-index!
I don't know why Pandas still thinks there is a multi-index; I have already reset it.
Question 3: If I run the data.columns command, I get the following output:
MultiIndex(levels=[['Title1', 'Title2'], ['A', 'B', 'C']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2]],
           names=['state', 'color'])
It seems to me that column names are also indexes. I say this because I see the MultiIndex class, which is also what I see if I run data.index:
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 1, 2, 3, 3], [0, 1, 2, 0, 2, 0]],
           names=['key1', 'key2'])
I am unsure why column names are also an object of the MultiIndex class. If they are indeed MultiIndex objects, then why do we need to set aside a few columns (e.g. key1 and key2 in the example above) as indices? In other words, why can't we just use column-based indices? (As a comparison, with data.table in R we can setkey on whatever columns we want.)
Question 4: Why are column names an object of the MultiIndex class? It would be great if someone could offer a theoretical treatment of this.
As a beginner, I'd really appreciate your thoughts. I have spent 3-4 hours researching this topic and have hit a dead-end.
First off, MultiIndexes can be tricky to work with, so it's worth considering whether they actually provide enough benefit for what you're doing (in terms of speed/organisation) to make those hassles worthwhile.
To answer your question 1, you can subset a MultiIndexed dataframe by providing tuples of the keys you want for each axis. So your first example subset can be done as:
# We want to use ":" to get all the states, but can't just
# have ":" by itself due to Python's syntax rules
# So pandas provides the IndexSlice object to wrap it in
slicer = pd.IndexSlice
data.loc[('a', 1), (slicer[:], 'A')]
Which gives:
state   color
Title1  A        0
        A        1
Name: (a, 1), dtype: int32
Wow, that seems like a lot of questions.
Q1: For multiple indexes, I recommend IndexSlice:
data.loc[pd.IndexSlice['a',1],pd.IndexSlice['Title1','A']]
Out[410]:
state   color
Title1  A        0
        A        1
Q2: When you reset the index of this complete data frame there will be some issues; I do not think you can do that in R without ftable. Here is a way of doing it with pandas:
data_wo_index.loc[np.concatenate(data_wo_index.loc[:,pd.IndexSlice['key1',:]].values=='a')]
Out[434]:
state key1 key2 Title1        Title2
color            A   A   B     B   C   C
0        a    1  0   1   2     3   4   5
1        a    2  6   7   8     9  10  11
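A simpler alternative (a sketch): reset_index on a frame with MultiIndex columns fills the missing column level with an empty string, so the new column can be addressed by its full tuple name:
data_wo_index = data.reset_index()
# ('key1', '') selects a plain Series, which makes a clean boolean mask
data_wo_index[data_wo_index[('key1', '')] == 'a']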
Q3: The column and index MultiIndex levels together offer 4 dimensions. Yes, you can use only columns or only an index to represent everything; just do stack:
data.stack()
Out[436]:
state            Title1  Title2
key1 key2 color
a    1    A           0       3
          B           1       4
          C           2       5
     2    A           6       9
          B           7      10
          C           8      11
b    3    A          12      15
          B          13      16
          C          14      17
c    1    A          18      21
          B          19      22
          C          20      23
d    3    A          24      27
          B          25      28
          C          26      29
     1    A          30      33
          B          31      34
          C          32      35
Q4: MultiIndex is one type of index, and pandas treats both the index and the columns as Index types.
For example:
df.index # index but just different type of index
Out[441]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
df.columns # index as well
Out[442]: Index(['A', 'B'], dtype='object')
Say I want a function that changes the value of a named column in a given row number of a DataFrame.
One option is to find the column's location and use iloc, like that:
def ChangeValue(df, rowNumber, fieldName, newValue):
    columnNumber = df.columns.get_loc(fieldName)
    df.iloc[rowNumber, columnNumber] = newValue
But I wonder if there is a way to use the magic of iloc and loc in one go, and skip the manual conversion.
Any ideas?
I suggest just using iloc combined with the Index.get_loc method, e.g.:
df.iloc[0:10, df.columns.get_loc('column_name')]
A bit clumsy, but simple enough.
MultiIndex has both get_loc and get_locs (which takes a sequence); unfortunately, Index just seems to have the former.
Using loc
One has to resort to either employing integer location iloc all the way, as suggested in this answer, or plain label-based location loc all the way, as shown here:
df.loc[df.index[[0, 7, 13]], 'column_name']
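Applied to the original question, the same idea gives a one-line setter (a sketch reusing the question's hypothetical names):
def ChangeValue(df, rowNumber, fieldName, newValue):
    # Translate the positional row into its label, then use a single loc call
    df.loc[df.index[rowNumber], fieldName] = newValue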
According to this answer,
ix usually tries to behave like loc but falls back to behaving like iloc if the label is not in the index.
So you should especially be able to use df.ix[rowNumber, fieldName] in case type(df.index) != type(rowNumber). (Note that ix was deprecated in pandas 0.20 and removed in 1.0, so this only applies to older versions.)
Even though it does not hold for every case, I'd like to add an easy one if you are looking for the top or bottom entries:
df.head(1)['column_name'] # first entry in 'column_name'
df.tail(5)['column_name'] # last 5 entries in 'column_name'
Edit: doing the following is not a good idea. I leave the answer as a counterexample.
You can do this:
df.iloc[rowNumber].loc[fieldName] = newValue
Example
import pandas as pd
def ChangeValue(df, rowNumber, fieldName, newValue):
    df.iloc[rowNumber].loc[fieldName] = newValue

df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
                  index=[4, 5, 6], columns=['A', 'B', 'C'])
print(df)
A B C
4 0 2 3
5 0 4 1
6 10 20 30
ChangeValue(df, 1, "B", 999)
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
But be careful: if newValue is not of the same type, the assignment does not work and fails silently:
ChangeValue(df, 1, "B", "Oops")
print(df)
A B C
4 0 2 3
5 0 999 1
6 10 20 30
There is some good info about working with columns data types here: Change column type in pandas
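For contrast, a sketch of the safer spelling on the same example: a single loc call on the original frame actually reaches the data, and a mismatched type upcasts the column instead of failing silently (recent pandas versions may warn about such upcasts):
df.loc[df.index[1], 'B'] = "Oops"  # one indexing call, no intermediate copy
print(df)
    A     B   C
4   0     2   3
5   0  Oops   1
6  10    20  30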