Pandas: Edit part of dataframe, have it affect main dataframe - python

EDIT: A suggested possible duplicate (this question) is not a duplicate. I'm asking whether a slice of a dataframe can be edited and have that edit affect the original dataframe. The suggested "duplicate" Q/A is just looking for an alternative to .loc. The simple answer to my original question appears to be "no".
Original Question:
This question likely has a duplicate somewhere, but I couldn't find it. Also, I'm guessing what I'm about to ask isn't possible, but worth a shot.
I'm looking to be able to filter or mask a large dataframe, get a smaller dataframe for ease of coding, edit the smaller dataframe, and have it affect the larger dataframe.
So something like this:
df_full = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df_part = df_full[df_full['a'] == 2]
df_part['b'] = 'Kentucky Fried Chicken'
print(df_full)
Would result in:
a b
0 1 4
1 2 Kentucky Fried Chicken
2 3 6
I'm well aware of the ability to use the .loc[row_indexer, col_indexer] functionality, but even with a mask variable as the row_indexer, it can be a little unwieldy for more complex purposes.
A little context - I'm loading large database tables into a dataframe and want to make many edits on a small slice of it. So the .loc[] gets tedious. Maybe I could filter out that small slice, edit it, then re-append to the original?
Any thoughts?

Short answer
No. You don't want to play the game where you have to keep checking / guessing whether you are using a copy or a view of a dataframe.
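A quick demonstration of the problem (a minimal sketch, assuming a reasonably recent pandas version):
import pandas as pd

df_full = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Boolean indexing returns a copy, so writing into the slice does not reach
# df_full; older pandas versions emit a SettingWithCopyWarning here, and with
# copy-on-write enabled (pandas >= 2.0) the write never propagates at all.
df_part = df_full[df_full['a'] == 2]
df_part['b'] = 'Kentucky Fried Chicken'

print(df_full)  # column 'b' still holds 4, 5, 6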
Single update: the right way
The .loc accessor is the way to go. There is nothing unwieldy about it, though it takes some getting used to.
However complex your criteria, if they boil down to a Boolean array, the .loc accessor is still often the right choice. You would need to show an example where it is genuinely difficult to implement.
df_full = pd.DataFrame({'a':[1,2,3], 'b':[4,5,6]})
df_full.loc[df_full['a'] == 2, 'b'] = 'Kentucky Fried Chicken'
# a b
# 0 1 4
# 1 2 Kentucky Fried Chicken
# 2 3 6
Single update: an alternative way
If you find the .loc accessor awkward for your use case, one alternative is numpy.where:
import numpy as np

df_full['b'] = np.where(df_full['a'] == 2, 'Kentucky Fried Chicken', df_full['b'])
Multiple updates: for many conditions
pandas.cut, numpy.select or numpy.vectorize can be used to good effect to streamline your code. The usefulness of these will depend on the specific logic you are attempting to apply. The below question includes examples for each of these:
Numpy “where” with multiple conditions
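For instance, a minimal numpy.select sketch (the column names and choice values here are purely illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Conditions are evaluated in order; the first match wins, and 'default'
# fills the rows where no condition matches.
conditions = [df['a'] == 2, df['a'] == 3]
choices = ['Kentucky Fried Chicken', 'Popeyes']
df['label'] = np.select(conditions, choices, default='other')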

Related

Can I use pd.concat to add new columns equal to other columns in DataFrame?

I am new to Python and am converting SQL to Python and want to learn the most efficient way to process a large dataset (rows > 1 million and columns > 100). I need to create multiple new columns based on other columns in the DataFrame. I have recently learned how to use pd.concat for new boolean columns, but I also have some non-boolean columns that rely on the values of other columns.
In SQL I would use a single case statement (case when age > 1000 then sample_id else 0 end as custom1, etc...). In Python I can achieve the same result in 2 steps (pd.concat + loc find & replace) as shown below. I have seen references in other posts to using the apply method but have also read in other posts that the apply method can be inefficient.
My question is then, for the code shown below, is there a more efficient way to do this? Can I do it all in one step within the pd.concat (so far I haven't been able to get that to work)? I am okay doing it in 2 steps if necessary. I need to be able to handle large integers (100 billion) in my custom1 element and have decimals in my custom2 element.
And finally, I tried using multiple separate np.where statements but received a warning that my DataFrame was fragmented and that I should try to use concat. So I am not sure which approach overall is most efficient or recommended.
Update - after receiving a comment and an answer pointing me towards use of np.where, I decided to test the approaches. Using a data set with 2.7 million rows and 80 columns, I added 25 new columns. First approach was to use the concat + df.loc replace as shown in this post. Second approach was to use np.where. I ran the test 10 times and np.where was faster in all 10 trials. As noted above, I think repeated use of np.where in this way can cause fragmentation, so I suppose now my decision comes down to faster np.where with potential fragmentation vs. slower use of concat without risk of fragmentation. Any further insight on this final update is appreciated.
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])
#step 1: efficiently create starting custom columns using concat
df = pd.concat(
    [
        df,
        (df["age"] > 1000).rename("custom1").astype(int),
        (df["weight"] < 100).rename("custom2").astype(float),
    ],
    axis=1,
)
#step2: assign final values to custom columns based on other column values
df.loc[df.custom1 == 1, 'custom1'] = (df['sample_id'])
df.loc[df.custom2 == 1, 'custom2'] = (df['weight'] / 2)
Thanks for any feedback you can provide...I appreciate your time helping me.
The standard way to do this is with numpy.where:
import numpy as np
df['custom1'] = np.where(df.age.gt(1000), df.sample_id, 0)
df['custom2'] = np.where(df.weight.lt(100), df.weight / 2, 0)
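If the fragmentation warning from assigning many np.where columns one by one is a concern, one possible compromise (a sketch, not a benchmarked recommendation) is to compute all the new columns first and attach them with a single concat:
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [120, 4000],
                   'weight': [505.31, 29.01],
                   'sample_id': [999999999999, 555555555555]},
                  index=['rock1', 'rock2'])

# Compute every new column with np.where, collect them in a dict,
# then add them all at once so the frame is not repeatedly extended.
new_cols = {
    'custom1': np.where(df.age.gt(1000), df.sample_id, 0),
    'custom2': np.where(df.weight.lt(100), df.weight / 2, 0),
}
df = pd.concat([df, pd.DataFrame(new_cols, index=df.index)], axis=1)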

Pandas loc uses inclusive ranges and iloc uses exclusive on upper side [duplicate]

For some reason, the following 2 calls to iloc / loc produce different behavior:
>>> import pandas as pd
>>> df = pd.DataFrame(dict(A=range(3), B=range(3)))
>>> df.iloc[:1]
A B
0 0 0
>>> df.loc[:1]
A B
0 0 0
1 1 1
I understand that loc considers the row labels, while iloc considers the integer-based indices of the rows. But why is the upper bound for the loc call considered inclusive, while the iloc bound is considered exclusive?
Quick answer:
It often makes more sense to do end-inclusive slicing when using labels, because it requires less knowledge about other rows in the DataFrame.
Whenever you care about labels instead of positions, end-exclusive label slicing introduces position-dependence in a way that can be inconvenient.
Longer answer:
Any function's behavior is a trade-off: you favor some use cases over others. Ultimately the behavior of .loc and .iloc is a subjective design decision by the Pandas developers (as the comment by #ALlollz indicates, this behavior is intentional). But to understand why they might have designed it that way, think about what makes label slicing different from positional slicing.
Imagine we have two DataFrames df1 and df2:
df1 = pd.DataFrame(dict(X=range(4)), index=['a','b','c','d'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b','c','z'])
df1 contains:
X
a 0
b 1
c 2
d 3
df2 contains:
X
b 0
c 1
z 2
Let's say we have a label-based task to perform: we want to get rows between b and c from both df1 and df2, and we want to do it using the same code for both DataFrames. Because b and c don't have the same positions in both DataFrames, simple positional slicing won't do the trick. So we turn to label-based slicing.
If .loc were end-exclusive, to get rows between b and c we would need to know not only the label of our desired end row, but also the label of the next row after that. As constructed, this next label would be different in each DataFrame.
In this case, we would have two options:
Use separate code for each DataFrame: df1.loc['b':'d'] and df2.loc['b':'z']. This is inconvenient because it means we need to know extra information beyond just the rows that we want.
For either dataframe, get the positional index first, add 1, and then use positional slicing: df.iloc[df.index.get_loc('b'):df.index.get_loc('c')+1]. This is just wordy.
But since .loc is end-inclusive, we can just say .loc['b':'c']. Much simpler!
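A quick sketch with the two example frames above:
import pandas as pd

df1 = pd.DataFrame(dict(X=range(4)), index=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(dict(X=range(3)), index=['b', 'c', 'z'])

# The same label-based slice works for both frames, even though
# 'b' and 'c' sit at different positions in each index.
print(df1.loc['b':'c'])  # rows b and c of df1
print(df2.loc['b':'c'])  # rows b and c of df2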
Whenever you care about labels instead of positions, and you're trying to write position-independent code, end-inclusive label slicing re-introduces position-dependence in a way that can be inconvenient.
That said, maybe there are use cases where you really do want end-exclusive label-based slicing. If so, you can use #Willz's answer in this question:
df.loc[start:end].iloc[:-1]

Filtering syntax for pandas dataframe groupby with logic condition

I have a pandas dataframe containing indices that have a one-to-many relationship. A very simplified and shortened example of my data is shown in the DataFrame Example link. I want to get a list or Series or ndarray of the unique namIdx values in which nCldLayers <= 1. The final result should show indices of 601 and 603.
I am able to accomplish this with the 3 statements below, but I am wondering if there is a much better, more succinct way with perhaps 'filter', 'select', or 'where'.
grouped=(namToViirs['nCldLayers']<=1).groupby(namToViirs.index).all(axis=0)
grouped = grouped[grouped==True]
filterIndex = grouped.index
Is there a better approach to accomplishing this result by applying the logical condition (namToViirs['nCldLayers'] <= 1) in a subsequent part of the chain, i.e., first group, then apply the logical condition, and then retrieve only the namIdx values where the logical result is true for each member of the group?
I think your code works nicely; you can just make two small changes:
In all(), the axis=0 can be omitted.
In grouped[grouped==True], the ==True can be omitted.
grouped = (namToViirs['nCldLayers']<=1).groupby(level='namIdx').all()
grouped = grouped[grouped]
filterIndex = grouped.index
print (filterIndex)
Int64Index([601, 603], dtype='int64', name='namIdx')
I think it is better to filter by boolean indexing first and then groupby, because working on fewer rows means better performance.
For question 1, see jezrael's answer. For question 2, you could play with indexes as sets:
namToViirs.index[namToViirs.nCldLayers <= 1] \
    .difference(namToViirs.index[namToViirs.nCldLayers > 1])
You might be interested in this answer.
The implementation is currently a bit hackish, but it should reduce your statement above to:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index).all(axis=0)[W].index)
EDIT: also see this answer for an analogous approach not requiring external components, resulting in:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index).all(axis=0)[lambda x : x].index)
Another option is to use .pipe() and a function which applies the desired filtering.
For instance:
filterIndex = ((namToViirs['nCldLayers']<=1)
               .groupby(namToViirs.index)
               .all(axis=0)
               .pipe(lambda s : s[s])
               .index)

High-dimensional data structure in Python

What is the best way to store and analyze high-dimensional data in Python? I like the Pandas DataFrame and Panel, where I can easily manipulate the axes. Now I have a hyper-cube (dim >= 4) of data. I have been thinking of things like a dict of Panels, or tuples as panel entries. I wonder if there is a high-dimensional panel-like structure in Python.
update 20/05/16:
Thanks very much for all the answers. I have tried MultiIndex and xarray; however, I am not yet in a position to comment on either of them. For my problem I will try to use an ndarray instead, as I found the labels are not essential and I can save them separately.
update 16/09/16:
I ended up using MultiIndex. The ways to manipulate it are pretty tricky at first, but I have more or less gotten used to it now.
MultiIndex is most useful for higher-dimensional data, as explained in the docs and this SO answer, because it allows you to work with any number of dimensions in a DataFrame environment.
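As a minimal sketch (the dimension names and sizes below are made up), a 4-D cube can be stored as a Series with a 4-level MultiIndex:
import numpy as np
import pandas as pd

# Hypothetical axis labels for a small 2x2x2x2 hyper-cube.
dims = {
    'time':    ['t0', 't1'],
    'channel': ['c0', 'c1'],
    'row':     ['r0', 'r1'],
    'col':     ['k0', 'k1'],
}
index = pd.MultiIndex.from_product(list(dims.values()), names=list(dims.keys()))
cube = pd.Series(np.arange(16, dtype=float), index=index)

# Label-based selection across any level.
print(cube.xs('t0', level='time').head())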
In addition to the Panel, there is also Panel4D - currently in experimental stage. Given the advantages of MultiIndex I wouldn't recommend using either this or the three dimensional version. I don't think these data structures have gained much traction in comparison, and will indeed be phased out.
If you need labelled arrays and pandas-like smart indexing, you can use xarray package which is essentially an n-dimensional extension of pandas Panel (panels are being deprecated in pandas in future in favour of xarray).
Otherwise, it may sometimes be reasonable to use plain numpy arrays which can be of any dimensionality; you can also have arbitrarily nested numpy record arrays of any dimension.
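A minimal xarray sketch (the dimension names and sizes here are made up):
import numpy as np
import xarray as xr

# A labelled 4-D array: xarray keeps the axis labels with the data.
data = xr.DataArray(
    np.random.rand(2, 3, 4, 5),
    dims=['time', 'channel', 'row', 'col'],
    coords={'time': ['t0', 't1'], 'channel': ['c0', 'c1', 'c2']},
)

# Pandas-like label-based selection on any dimension.
print(data.sel(time='t0', channel='c1').shape)  # (4, 5)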
I recommend continuing to use DataFrame but utilize the MultiIndex feature. DataFrame is better supported and you preserve all of your dimensionality with the MultiIndex.
Example
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])
print(df4)
Looks like:
One Two
a b a b
one A 1 2 1 2
B 3 4 3 4
two A 1 2 1 2
B 3 4 3 4
This is a hyper-cube of data, and you'll be much better served in terms of support, existing questions and answers, fewer bugs, and many other benefits.
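Selecting from the hyper-cube is then ordinary MultiIndex indexing. A short sketch, rebuilding df4 from the example above:
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'], index=['A', 'B'])
df3 = pd.concat([df for _ in [0, 1]], keys=['one', 'two'])
df4 = pd.concat([df3 for _ in [0, 1]], axis=1, keys=['One', 'Two'])

# Select the outer row group 'one' and the outer column group 'Two'.
print(df4.loc['one', 'Two'])

# Address a single cell via full (row, column) label tuples.
print(df4.loc[('one', 'A'), ('Two', 'b')])  # 2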

Pandas: Get unique MultiIndex level values by label

Say you have this MultiIndex-ed DataFrame:
df = pd.DataFrame({'country': ['DE', 'DE', 'FR', 'FR'],
                   'biome': ['Lake', 'Forest', 'Lake', 'Forest'],
                   'area': [10, 20, 30, 40],
                   'count': [7, 5, 2, 3]})
df = df.set_index(['country', 'biome'])
Which looks like this:
area count
country biome
DE Lake 10 7
Forest 20 5
FR Lake 30 2
Forest 40 3
I would like to retrieve the unique values per index level. This can be accomplished using
>>> df.index.levels[0]
['DE', 'FR']
>>> df.index.levels[1]
['Lake', 'Forest']
What I would really like to do, is to retrieve these lists by addressing the levels by their name, i.e. 'country' and 'biome'. The shortest two ways I could find looks like this:
>>> list(set(df.index.get_level_values('country')))
['DE', 'FR']
>>> df.index.levels[df.index.names.index('country')]
['DE', 'FR']
But none of them is very elegant. Is there a shorter and/or more performant way?
Pandas 0.23.0 finally introduced a much cleaner solution to this problem: the level argument to Index.unique():
In [3]: df.index.unique(level='country')
Out[3]: Index(['DE', 'FR'], dtype='object', name='country')
This is now the recommended solution. It is far more efficient because it avoids creating a complete representation of the level values in memory, and re-scanning it.
I guess you want the unique values in a certain level (addressed by level name) of a MultiIndex. I usually do the following, which is a bit long.
In [11]: df.index.get_level_values('country').unique()
Out[11]: array(['DE', 'FR'], dtype=object)
An alternative approach is to get the level values by calling df.index.levels[level_index], where level_index can be inferred from df.index.names.index(level_name). In the above example, level_name = 'country'.
The answer proposed by #Happy001 computes the unique values, which may be computationally intensive.
If you're going to do the level lookup repeatedly, you could create a map of your index level names to level unique values with:
df_level_value_map = {
    name: level
    for name, level in zip(df.index.names, df.index.levels)
}
df_level_value_map['country']
But this is not in any way more efficient (or shorter) than your original attempts if you're only going to do this lookup once.
I really wish there was a method on indexes that returned such a dictionary (or series?) with a name like:
df.index.get_level_map(levels={...})
Where the levels parameter can limit the map to a subset of the existing levels. I could do without the parameter if it could be a property like:
df.index.level_map
If you already know the index names, is it not straightforward to simply do:
df['co'].unique() ?
