Say you have this MultiIndex-ed DataFrame:
import pandas as pd

df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]})
df = df.set_index(['country', 'biome'])
Which looks like this:
                area  count
country biome
DE      Lake      10      7
        Forest    20      5
FR      Lake      30      2
        Forest    40      3
I would like to retrieve the unique values per index level. This can be accomplished using
>>> df.index.levels[0]
['DE', 'FR']
>>> df.index.levels[1]
['Lake', 'Forest']
What I would really like to do is to retrieve these lists by addressing the levels by their name, i.e. 'country' and 'biome'. The shortest two ways I could find look like this:
>>> list(set(df.index.get_level_values('country')))
['DE', 'FR']
>>> df.index.levels[df.index.names.index('country')]
['DE', 'FR']
But none of them is very elegant. Is there a shorter and/or more performant way?
Pandas 0.23.0 finally introduced a much cleaner solution to this problem: the level argument to Index.unique():
In [3]: df.index.unique(level='country')
Out[3]: Index(['DE', 'FR'], dtype='object', name='country')
This is now the recommended solution. It is far more efficient because it avoids creating a complete representation of the level values in memory, and re-scanning it.
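For what it's worth, the level argument also accepts the level's integer position instead of its name; a small sketch with the df above (outputs shown as plain lists in the comments):
# level accepts either the level's name or its integer position
df.index.unique(level=0)        # same as level='country' -> ['DE', 'FR']
df.index.unique(level='biome')  # -> ['Lake', 'Forest']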
I guess you want the unique values in a certain level (addressed by level name) of a MultiIndex. I usually do the following, which is a bit long.
In [11]: df.index.get_level_values('country').unique()
Out[11]: array(['DE', 'FR'], dtype=object)
An alternative approach is to look up the level values with df.index.levels[level_index], where level_index can be inferred from df.index.names.index(level_name). In the above example, level_name = 'country'.
The answer proposed by @Happy001 computes the unique values, which may be computationally expensive.
If you're going to do the level lookup repeatedly, you could create a map of your index level names to level unique values with:
df_level_value_map = {
    name: level
    for name, level in zip(df.index.names, df.index.levels)
}
df_level_value_map['country']
But this is not in any way more efficient (or shorter) than your original attempts if you're only going to do this lookup once.
I really wish there was a method on indexes that returned such a dictionary (or series?) with a name like:
df.index.get_level_map(levels={...})
Where the levels parameter can limit the map to a subset of the existing levels. I could do without the parameter if it could be a property like:
df.index.level_map
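For what it's worth, the wished-for helper is easy to sketch on top of the existing API; get_level_map below is hypothetical and not a pandas method:
# Hypothetical helper sketching the wished-for API; not part of pandas.
def get_level_map(index, levels=None):
    names = index.names if levels is None else levels
    return {name: index.get_level_values(name).unique() for name in names}

get_level_map(df.index)                      # {'country': ..., 'biome': ...}
get_level_map(df.index, levels=['country'])  # {'country': Index(['DE', 'FR'], ...)}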
If you already know the index names, is it not straightforward to simply do:
df['country'].unique() ?
(Note this only works while 'country' is still an ordinary column, i.e. before set_index or after reset_index.)
I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
# step 1: efficiently create starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)

# step 2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = df['sample_id']
df.loc[df.custom2 == 1, 'custom2'] = df['weight'] / 2
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
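If you need many such columns, one hedged way to avoid the fragmentation warning mentioned in the question is to build all the np.where results first and attach them in a single concat (a minimal sketch reusing the df above):
# build all new columns at once, then attach them with a single concat
new_cols = pd.DataFrame({
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0.0),
}, index=df.index)
df = pd.concat([df, new_cols], axis=1)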
EDIT: A suggested possible duplicate (this question) is not a duplicate. I'm asking if a slice of a dataframe can be edited and have that slice affect the original dataframe. The "duplicate" Q/A suggested is just looking for an alternate to .loc. The simple answer to my original question appears to be, "no".
Original Question:
This question likely has a duplicate somewhere, but I couldn't find it. Also, I'm guessing what I'm about to ask isn't possible, but worth a shot.
I'm looking to be able to filter or mask a large dataframe, get a smaller dataframe for ease of coding, edit the smaller dataframe, and have it affect the larger dataframe.
So something like this:
df_full = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df_part = df_full[df_full['a'] == 2]
df_part['b'] = 'Kentucky Fried Chicken'
print(df_full)
Would result in:
   a                       b
0  1                       4
1  2  Kentucky Fried Chicken
2  3                       6
I'm well aware of the ability to use the .loc[row_indexer, col_indexer] functionality, but even with a mask variable as the row_indexer, it can be a little unwieldy for more complex purposes.
A little context - I'm loading large database tables into a dataframe and want to make many edits on a small slice of it. So the .loc[] gets tedious. Maybe I could filter out that small slice, edit it, then re-append to the original?
Any thoughts?
Short answer
No. You don't want to play the game where you have to keep checking / guessing whether you are using a copy or a view of a dataframe.
Single update: the right way
The .loc accessor is the way to go. There is nothing unwieldy about it, though it takes some getting used to.
However complex your criteria, if they boil down to a Boolean array, the .loc accessor is still often the right choice. You would need to show an example where it is genuinely difficult to implement.
df_full = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df_full.loc[df_full['a'] == 2, 'b'] = 'Kentucky Fried Chicken'
#    a                       b
# 0  1                       4
# 1  2  Kentucky Fried Chicken
# 2  3                       6
Single update: an alternative way
If you find .loc accessor difficult to implement, one alternative is numpy.where:
df_full['b'] = np.where(df_full['a'] == 2, 'Kentucky Fried Chicken', df_full['b'])
Multiple updates: for many conditions
pandas.cut, numpy.select or numpy.vectorize can be used to good effect to streamline your code. The usefulness of these will depend on the specific logic you are attempting to apply. The below question includes examples for each of these:
Numpy “where” with multiple conditions
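As a hedged illustration (the conditions and values below are made up, not taken from the linked question), numpy.select handles several conditions in one call:
import numpy as np

# each condition is paired with the choice at the same position;
# rows matching no condition keep their original value via default
conditions = [df_full['a'] == 1, df_full['a'] == 2]
choices = ['original recipe', 'Kentucky Fried Chicken']
df_full['b'] = np.select(conditions, choices, default=df_full['b'])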
I have a pandas dataframe containing indices that have a one-to-many relationship. A very simplified and shortened example of my data is shown in the DataFrame Example link. I want to get a list or Series or ndarray of the unique namIdx values in which nCldLayers <= 1. The final result should show indices of 601 and 603.
I am able to accomplish this with the 3 statements below, but I am wondering if there is a much better, more succinct way with perhaps 'filter', 'select', or 'where'.
grouped=(namToViirs['nCldLayers']<=1).groupby(namToViirs.index).all(axis=0)
grouped = grouped[grouped==True]
filterIndex = grouped.index
Is there a better approach to accomplishing this result by applying the logical condition (namToViirs['nCldLayers'] <= 1) in a subsequent part of the chain, i.e., first group, then apply the logical condition, and then retrieve only the namIdx values where the logical result is true for each member of the group?
I think your code is fine; you can just make two small changes:
In all(), the axis=0 can be omitted.
In grouped[grouped==True], the ==True can be omitted.
grouped = (namToViirs['nCldLayers'] <= 1).groupby(level='namIdx').all()
grouped = grouped[grouped]
filterIndex = grouped.index
print(filterIndex)
Int64Index([601, 603], dtype='int64', name='namIdx')
I think it is better to first filter by boolean indexing and then groupby, because fewer loops -> better performance.
For question 1, see jezrael's answer. For question 2, you could play with indexes as sets:
namToViirs.index[namToViirs.nCldLayers <= 1] \
.difference(namToViirs.index[namToViirs.nCldLayers > 1])
You might be interested in this answer.
The implementation is currently a bit hackish, but it should reduce your statement above to:
filterIndex = ((namToViirs['nCldLayers']<=1)
.groupby(namToViirs.index).all(axis=0)[W].index)
EDIT: also see this answer for an analogous approach not requiring external components, resulting in:
filterIndex = ((namToViirs['nCldLayers']<=1)
.groupby(namToViirs.index).all(axis=0)[lambda x : x].index)
Another option is to use .pipe() and a function which applies the desired filtering.
For instance:
filterIndex = ((namToViirs['nCldLayers']<=1)
.groupby(namToViirs.index)
.all(axis=0)
.pipe(lambda s : s[s])
.index)
Given a dataframe, I want to get the duplicated indexes whose rows do not have duplicate values in the columns, and see which values are different.
Specifically, I have this dataframe:
import pandas as pd
wget https://www.dropbox.com/s/vmimze2g4lt4ud3/alt_exon_repeatmasker_intersect.bed
alt_exon_repeatmasker = pd.read_table('alt_exon_repeatmasker_intersect.bed', header=None, index_col=3)
In [74]: alt_exon_repeatmasker.index.is_unique
Out[74]: False
And some of the indexes have duplicate values in the 9th column (the type of DNA repetitive element in this location), and I want to know what the different types of repetitive elements are for individual locations (each index = a genome location).
I'm guessing this will require some kind of groupby and hopefully some groupby ninja can help me out.
To simplify even further, if we only have the index and the repeat type,
genome_location1 MIR3
genome_location1 AluJb
genome_location2 Tigger1
genome_location3 AT_rich
So in the output I'd like to see all duplicate indexes and their repeat types, like this:
genome_location1 MIR3
genome_location1 AluJb
EDIT: added toy example
Also useful and very succinct:
df[df.index.duplicated()]
Note that this keeps only rows after the first occurrence of a duplicated index, so to see all the duplicated rows you'll want this:
df[df.index.duplicated(keep=False)]
df.groupby(level=0).filter(lambda x: len(x) > 1)['type']
We added the filter method for this kind of operation. You can also use masking and transform for equivalent results, but this is faster, and a little more readable too.
Important:
The filter method was introduced in version 0.12, but it failed to work on DataFrames/Series with nonunique indexes. The issue -- and a related issue with transform on Series -- was fixed for version 0.13, which should be released any day now.
Clearly, nonunique indexes are the heart of this question, so I should point out that this approach will not help until you have pandas 0.13. In the meantime, the transform workaround is the way to go. Beware that if you try that on a Series with a nonunique index, it too will fail.
There is no good reason why filter and transform should not be applied to nonunique indexes; it was just poorly implemented at first.
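For reference, the transform workaround mentioned above is essentially the same as the transform-based answer further below; a minimal sketch:
# keep rows whose index value occurs more than once
df[df.groupby(level=0)['type'].transform(len) > 1]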
Even faster and better:
df.index.get_duplicates()
As of 9/21/18, pandas raises FutureWarning: 'get_duplicates' is deprecated and will be removed in a future release, suggesting the following instead:
df.index[df.index.duplicated()].unique()
>>> df[df.groupby(level=0).transform(len)['type'] > 1]
type
genome_location1 MIR3
genome_location1 AluJb
More succinctly:
df[df.groupby(level=0).type.count() > 1]
FYI, for a multi-index:
df[df.groupby(level=[0,1]).type.count() > 1]
This gives you index values along with a preview of duplicated rows
def dup_rows_index(df):
    # rows whose values duplicate an earlier row
    dup = df[df.duplicated()]
    print('Duplicated index loc:', dup.index.tolist())
    return dup
I am using a rather large dataset of ~37 million data points that are hierarchically indexed into three categories: country, productcode, year. The country variable (which is the country name) is rather messy data, consisting of items such as 'Austral', which represents 'Australia'. I have built a simple guess_country() that matches letters to words and returns a best guess and confidence interval from a known list of country_names. Given the length of the data and the nature of the hierarchy, it is very inefficient to apply .map() to the Series country. [The guess_country function takes ~2 ms per request.]
My question is: Is there a more efficient .map() which takes the Series and performs map on only unique values? (Given there are a LOT of repeated countrynames)
There isn't, but if you want to only apply to unique values, just do that yourself. Get mySeries.unique(), then use your function to pre-calculate the mapped alternatives for those unique values and create a dictionary with the resulting mappings. Then use pandas map with the dictionary. This should be about as fast as you can expect.
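A minimal sketch of that recipe, assuming the messy names live in an ordinary column named country (reset_index first if they live in the index) and that guess_country and country_names exist as in the question:
# pre-compute the mapping once per unique value, then map the whole Series
unique_names = data['country'].unique()
mapping = {name: guess_country(name, country_names)[0] for name in unique_names}
data['country'] = data['country'].map(mapping)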
One solution is to make use of the hierarchical indexing in the DataFrame!
data = data.set_index(keys=['COUNTRY', 'PRODUCTCODE', 'YEAR'])
data.index = data.index.set_levels(
    data.index.levels[0].map(lambda x: guess_country(x, country_names)[0]),
    level=0)
This works well: because COUNTRY is level 0 of the index, replacing that level's values runs guess_country() only once per unique country name, and the replacement propagates through the whole DataFrame.
Call guess_country() on unique country names, and make a country_map Series object with the original name as the index, converted name as the value. Then you can use country_map[df.country] to do the conversion.
import pandas as pd

c = ["abc", "abc", "ade", "ade", "ccc", "bdc", "bxy", "ccc", "ccx", "ccb", "ccx"]
v = range(len(c))
df = pd.DataFrame({"country": c, "data": v})

def guess_country(c):
    # dummy stand-in for the real guess_country(): just take the first letter
    return c[0]

uc = df.country.unique()
country_map = pd.Series(list(map(guess_country, uc)), index=uc)
df["country_id"] = country_map[df.country].values
print(df)