How to sort index in Dask following pivot_table - python

Trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this:
# make dataframe, first in pandas and then in dask
df = pd.DataFrame({'A':['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'], 'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a','b', 'c'], 'dist': [0, .1, .2, .1, 0, .3, .4, .1, 0]})
df.sort_values(by='A', inplace=True)
dd = dask.dataframe.from_pandas(df, chunksize=3) # just for demo's sake, you obviously don't ever want a chunksize of 3
print(dd.known_divisions) # Here I get True, which means my data is sorted
# now pivot and see if the index remains sorted
dd = dd.categorize('B')
pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
print(pivot_dd.known_divisions) # Here I get False, which makes me sad
I would love to find a way to get pivot_dd to have a sorted index, but I don't see a sort_index method in dask and cannot set 'A' as an index without getting a KeyError (it already is the index!).
In this toy example, I could pivot the pandas table first and then sort. The real application I have in mind won't allow me to do that.
Thanks in advance for any help/suggestions.

This may not be what you were wishing for, and perhaps not even the best answer, but it does seem to work. The first wrinkle is that pivot operations create a categorical index for the columns, which is annoying. You could do the following.
>>> pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
>>> pivot_dd.columns = list(pivot_dd.columns)
>>> pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)
>>> pivot_dd.known_divisions
True
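Putting that together with the toy frame from the question, a minimal end-to-end sketch (assuming the same setup and the dask API used above; variable names here are mine) could look like this:
import pandas as pd
import dask.dataframe

# Toy frame from the question, sorted on the future index column
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                   'dist': [0, .1, .2, .1, 0, .3, .4, .1, 0]})
df = df.sort_values(by='A')

ddf = dask.dataframe.from_pandas(df, chunksize=3)
ddf = ddf.categorize('B')  # pivot_table needs categorical columns

pivot_dd = ddf.pivot_table(index='A', columns='B', values='dist')

# Drop the categorical column index, then re-establish sorted divisions
pivot_dd.columns = list(pivot_dd.columns)
pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)

print(pivot_dd.known_divisions)  # True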

Related

Difference between index.name and index.names in pandas

I have just started learning pandas and this is really my first question here, so please don't mind if it's too basic!
When should I use df.index.name and when df.index.names?
I would really appreciate knowing the difference and when to apply each.
Many thanks
name returns the name of the Index or MultiIndex.
names returns a list of the level names of an Index (just one level) or a MultiIndex (more than one level).
Index:
df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
print(df1.index.name, df1.index.names)
# a ['a']
MultiIndex:
df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
print(df2.index.name, df2.index.names)
# None ['a', 'b']
df2.index.name = 'my_multiindex'
print(df2.index.name, df2.index.names)
# my_multiindex ['a', 'b']
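Both attributes are also settable, which is handy for renaming. A small illustrative sketch (the new names are arbitrary):
import pandas as pd

df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
df1.index.name = 'idx'                 # rename a plain Index

df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
df2.index.names = ['first', 'second']  # rename the levels of a MultiIndex

print(df1.index.name)   # idx
print(df2.index.names)  # ['first', 'second']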

Parallel Processing of Loop of Pandas Columns

I have the following code which I would like to speed up.
EDIT: we would like the columns in 'colsi' to be shifted within the groups defined by the columns in 'colsj'. Pandas allows us to shift multiple columns at once through vectorization of 'colsi'. I loop through each group column and perform the vectorized shifts. Then I fill the NaNs with the medians of the columns in 'colsi'. The reindex is just to create new blank columns before they are assigned. The issue is that I have many group columns, and looping through each is becoming time-consuming.
EDIT2: My goal is to engineer new columns from the lag within each group. I have many group columns and many columns to be shifted. 'colsi' contains the columns to be shifted. 'colsj' contains the group columns. I am able to vectorize 'colsi', but looping through each group column in 'colsj' is still time-consuming.
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()
for j in colsj:
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)
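To make the intent concrete, here is a tiny sketch (with invented data, just for illustration) of what the per-group shift inside the loop does:
import pandas as pd

# Invented toy data: 'a' is a value column, 'd' a group column
df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'd': ['x', 'x', 'y', 'y']})

# Lag 'a' within each group of 'd'; the first row of each group becomes NaN
print(df.groupby('d')['a'].shift())
# 0    NaN
# 1    1.0
# 2    NaN
# 3    3.0
# Those leading NaNs are then filled with the column medians, as in the loop above.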
Parallelization seems to be a good way to do it. Leaning on this code, I attempted the following but it didn't work:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=3)
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()
def funct(j):
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)

for j in colsj:
    pool.apply_async(funct, (j))
I do not have any experience with parallel processing, so I am not sure what's missing here. Please advise.

Unpredictable pandas slice assignment behavior with no SettingWithCopyWarning

It's well known (and understandable) that pandas behavior is essentially unpredictable when assigning to a slice. But I'm used to being warned about it by the SettingWithCopyWarning.
Why is the warning not generated in either of the following two code snippets, and what techniques could reduce the chance of writing such code unintentionally?
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data['a']
new_data.loc[0] = 100 # no warning, propagates to data
data[0] == 100
True
I thought the explanation was that pandas only produces the warning when the parent DataFrame is still reachable from the current context. (This would be a weakness of the detection algorithm, as my previous examples show.)
In the next snippet, AFAIK the original two-column DataFrame is no longer reachable, and yet the pandas warning mechanism manages to trigger (luckily):
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
new_data = data['a']
data = data[['a']]
new_data.loc[0] = 100 # warning, so we're safe
Edit:
While investigating this, I found another case of a missing warning:
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # no warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Even though an almost identical example does trigger a warning:
data = pd.DataFrame({'a': [1, 2, 2], 'b': ['a', 'b', 'c']})
data = data.groupby('a')
new_data = data.filter(lambda g: len(g)==1)
new_data.loc[0, 'a'] = 100 # warning, does not propagate to data
assert data.filter(lambda g: True).loc[0, 'a'] == 1
Update: I'm responding to the answer by @firelynx here because it's hard to put it in a comment.
In the answer, @firelynx says that the first code snippet results in no warning because I'm taking the entire dataframe. But even if I take only part of it, I still don't get a warning:
# pandas 0.18.1, python 3.5.1
import pandas as pd
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': range(3)})
new_data = data[['a', 'b']]
data = data['a']
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
data[0] == 1
True
Explaining what you're doing, step by step
The DataFrame you create is not a view
data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})
data._is_view
False
new_data is also not a view, because you are taking all columns
new_data = data[['a', 'b']]
new_data._is_view
False
Now you are assigning data to be the Series 'a'
data = data['a']
type(data)
pandas.core.series.Series
Which is a view
data._is_view
True
Now you update a value in the non-copy new_data
new_data.loc[0, 'a'] = 100 # no warning, doesn't propagate to data
This should not give a warning. It is the whole dataframe.
The Series you've created flags itself as a view, but it's not a DataFrame and does not behave as a DataFrame view.
Avoiding writing code like this
The Series vs. DataFrame problem is a very common one in pandas [citation not needed if you've worked with pandas for a while].
The problem is really that you should always be writing
data[['a']] not data['a']
The former creates a DataFrame, the latter a Series.
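For instance, a minimal check you can run yourself:
import pandas as pd

data = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']})

print(type(data[['a']]))  # <class 'pandas.core.frame.DataFrame'>
print(type(data['a']))    # <class 'pandas.core.series.Series'>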
Some people argue that you should never write data['a'] and should use data.a instead, so that you can add warnings to your environment for data['a'] code.
This does not work. First of all, the data.a syntax causes cognitive dissonance.
A DataFrame is a collection of columns. In Python we access members of collections with the [] operator and attributes with the . operator. Switching these around causes cognitive dissonance for any Python programmer, especially when you start doing things like del data.a and notice that it does not work. See this answer for a more extensive explanation.
Clean code to the rescue
It is hard to see the difference between data[['a']] and data['a'].
This is a smell. We should be doing neither.
The proper way, following clean code principles and the Zen of Python ("Explicit is better than implicit"), is this:
columns = ['a']
data[columns]
This may not seem so mind-boggling, but take a look at the following example:
data[['ad', 'cpc', 'roi']]
What does this mean? What are these columns? What data are you getting here?
These are the first questions to arrive in anyone's head when reading this line of code.
How do you solve it? Not with a comment.
ad_performance_columns = ['ad', 'cpc', 'roi']
data[ad_performance_columns]
More explicit is always better.
For more, please consider buying a book on clean code. Maybe this one.

Pandas - Category variable and group by - is this a bug?

I came across a strange result while playing around with pandas, and I am not sure why it works like this. Wondering if it is a bug.
cf = pd.DataFrame({'sc': ['b' , 'b', 'c' , 'd'], 'nn': [1, 2, 3, 4], 'mvl':[10, 20, 30, 40]})
df = cf.groupby('sc').mean()
df.loc['b', 'mvl']
This gives "15.0" as the result.
cf1 = cf
cf1['sc'] = cf1['sc'].astype('category', categories=['b', 'c', 'd'], ordered=True)
df1 = cf1.groupby('sc').mean()
df1.loc['b','mvl']
This gives as result a Series:
sc
b 15.0
Name: mvl, dtype: float64
type(df1.loc['b','mvl']) -> pandas.core.series.Series
type(df.loc['b','mvl']) -> numpy.float64
Why would declaring the variable as categorical change the output of the loc from a scalar to a Series?
I hope it is not a stupid question. Thanks!
This may be a pandas bug. The difference is due to the fact that when you group on a categorical variable, you get a categorical index. You can see it more simply without any groupby:
nocat = pandas.Series(['a', 'b', 'c'])
cat = nocat.astype('category', categories=['a', 'b', 'c'], ordered=True)
xno = pandas.Series([8, 88, 888], index=nocat)
xcat = pandas.Series([8, 88, 888], index=cat)
>>> xno.loc['a']
8
>>> xcat.loc['a']
a 8
dtype: int64
The docs note that indexing operations on a CategoricalIndex preserve the categorical index. It appears they even do this if you get only one result, which doesn't exactly contradict the docs but seems like undesirable behavior.
There is a related pull request that seems to fix this behavior, but it was only recently merged. It looks like the fix should be in pandas 0.18.1.
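If you need the scalar in the meantime, one possible workaround (just a sketch, not part of the original answer) is to pull the single element out of the returned object explicitly:
import pandas

# df1 is the grouped-by-categorical result from the question above
val = df1.loc['b', 'mvl']
if isinstance(val, pandas.Series):
    val = val.iloc[0]   # collapse the length-1 Series to a scalar
print(val)  # 15.0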

Pandas column access w/column names containing spaces

If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column" and "string" and "pandas" and could find no previous answer. Is this the only way, or is there something better?
Old post, but may be interesting: an idea (which is destructive, but does the job if you want it quick and dirty) is to rename columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
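For example, with made-up column names:
import pandas as pd

df1 = pd.DataFrame({'key col': [1, 2], 'data 1': [3, 4]})
df1.columns = [c.replace(' ', '_') for c in df1.columns]
print(df1.columns.tolist())  # ['key_col', 'data_1']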
I think the default way is to use the bracket method instead of the dot notation.
import pandas as pd
df1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing it as an attribute, are more for convenience.
If you want to supply a column name containing spaces to a pandas method like assign, you can dictionarize your inputs.
df.assign(**{'space column': (lambda x: x['space column2'])})
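A minimal sketch of that trick (the column names here are invented):
import pandas as pd

df = pd.DataFrame({'space column2': [1, 2, 3]})

# Keyword arguments can't contain spaces, but **-unpacking a dict works,
# because assign collects its column specifications via **kwargs
df = df.assign(**{'space column': lambda x: x['space column2'] * 2})
print(df)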
While the accepted answer works for column-specification when using dictionaries or []-selection, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
>>> df.assign("data 2"=lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
You can do it with df['Column Name']
If you want to apply filtering, that's also possible with column names having spaces in it, e.g. filtering for NULL-values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
           (df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.
