Refer to pandas dataframe columns or index, depending on parameter - python

I am writing a function that operates on the labels of a pandas dataframe and I want to have a parameter axis to decide whether to operate on index or columns.
So I wrote something like:
if axis == 0:
    to_sort = df.index
elif axis == 1:
    to_sort = df.columns
else:
    raise AttributeError
where df is a pandas dataframe.
Is there a better way of doing this?
Note I am not asking for a code review, but more specifically asking if there is a pandas attribute (something like labels would make sense to me) that allows me to get index or columns depending on a parameter/index to be passed.
For example (code not working):
df.labels[0] # index
df.labels[1] # columns

Short answer: You can use iloc(axis=...)
Documentation: http://pandas.pydata.org/pandas-docs/stable/advanced.html
You can also specify the axis argument to .loc to interpret the passed
slicers on a single axis.
(They seem to have omitted iloc in regards to the axis parameter)
A complete example
df = pd.DataFrame({"A":['a1', 'a2'], "B":['b1', 'b2']})
print(df)
Output:
    A   B
0  a1  b1
1  a2  b2
With axis=0 (this selects row 0, a Series whose index holds the column labels)
print(df.iloc(axis=0)[0].index)
Output:
Index(['A', 'B'], dtype='object')
With axis=1 (this selects column 0, a Series whose index holds the row labels)
print(df.iloc(axis=1)[0].index)
Output:
RangeIndex(start=0, stop=2, step=1)

Looking at reindex documentation examples, I realized I can do something like this:
Let the parameter be axis={'index', 'columns'}
Get the relevant labels using getattr: labels = getattr(df, axis)
Open to other pandas specific solutions.
If I were forced to use axis={1, 0}, then @Bharath's suggestion to use a helper function makes sense.
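A minimal sketch of such a helper, accepting either the integer or the string form of the axis. The name get_labels is hypothetical, not a pandas function. Note that pandas also exposes DataFrame.axes, a list holding the row labels and the column labels in that order, which is close to the df.labels attribute asked about.

```python
import pandas as pd


def get_labels(df, axis=0):
    """Return df.index for axis 0/'index' and df.columns for axis 1/'columns'.

    Hypothetical helper for illustration; not part of pandas.
    """
    names = {0: "index", 1: "columns", "index": "index", "columns": "columns"}
    try:
        attr = names[axis]
    except KeyError:
        raise ValueError(f"No axis named {axis!r}") from None
    return getattr(df, attr)


df = pd.DataFrame({"A": ["a1", "a2"], "B": ["b1", "b2"]})

print(get_labels(df, 0))          # row labels
print(get_labels(df, "columns"))  # column labels
print(df.axes[1])                 # built-in alternative: axes[0] is index, axes[1] is columns
```

If integer axes are all that is needed, df.axes[axis] alone may be enough and the helper can be dropped.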

Related

Assigning values to cross selection of MultiIndex DataFrame (ellipsis style of numpy)

In numpy we can select along the last axis with ellipsis indexing, for instance array[..., 4].
In Pandas DataFrames for structuring large amounts of data, I like to use MultiIndex (which I see as some kind of additional dimensions of the DataFrame). If I want to select a given subset of a DataFrame df, in this case all columns 'key' in the last level of the columns MultiIndex, I can do it with the cross selection method xs:
# create sample multiindex dataframe
mi = pd.MultiIndex.from_product((('a', 'b', 'c'), (1, 2), ('some', 'key', 'foo')))
data = pd.DataFrame(data=np.random.rand(20, 18), columns=mi)
# make cross selection:
xs_df = data.xs('key', axis=1, level=-1)
But if I want to assign values to the cross selection, xs won't work.
The documentation proposes to use IndexSlice to access and set values to a cross selection:
idx = pd.IndexSlice
data.loc[:, idx[:, :, 'key']] *= 10
This works well as long as I explicitly spell out the number of levels by inserting the correct number of : before 'key'.
Assuming I just want to pass the number of levels to a selection function, or for instance always select the last level regardless of how many levels the DataFrame has, this won't work (afaik).
My current workaround is to build slice(None) entries for the n_levels to skip:
n_levels = data.columns.nlevels - 1 # assuming I want to select the last level
data.loc[:, (*n_levels*[slice(None)], 'key')] *= 100
This is imho a quite nasty and cumbersome workaround. Is there any more pythonic/nicer/better way?
In this case, you may be better off with get_level_values:
s = data.columns.get_level_values(-1) == 'key'
data.loc[:,s] *= 10
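The mask approach above can be wrapped in a small function so that the number of levels never has to be spelled out. This is only a sketch: scale_level is a hypothetical name, and the frame is filled with ones instead of random data so the effect is easy to see.

```python
import numpy as np
import pandas as pd

# Same column layout as in the question, but filled with ones.
mi = pd.MultiIndex.from_product((("a", "b", "c"), (1, 2), ("some", "key", "foo")))
data = pd.DataFrame(np.ones((4, 18)), columns=mi)


def scale_level(df, label, factor, level=-1):
    """Multiply, in place, every column whose `level` entry equals `label`.

    Hypothetical helper; `level` can be a position (negative counts from
    the end) or a level name, as accepted by get_level_values.
    """
    mask = df.columns.get_level_values(level) == label
    df.loc[:, mask] *= factor
    return df


scale_level(data, "key", 10)
print(data.iloc[0])
```

Because the boolean mask has one entry per column, the same call works unchanged for a column index with any number of levels.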
I think we can also use update, passing drop_level=False to xs so the selection keeps its full column labels:
data.update(data.xs('key',level=-1,axis=1,drop_level=False)*10)
I don't think there is a straightforward way to index and set values the way you want. Adding to the previous answers, I'd suggest naming your column levels, which makes it easier to wrangle with the query method:
# assign names to the column levels
data.columns = data.columns.set_names(['first','second','third'])
# select the columns of interest
ind = data.T.query('third=="key"').index
# assign values
data.loc(axis=1)[ind] *= 10

Index based style.format

You can specify a format for each column by using df.style.format(). However, I want this behavior index based instead of column based. I realise it's a bit more tricky because a column has a specific datatype, while a row can be mixed.
Is there a workaround to get it anyway? The df.style.apply() method has the flexibility, but I don't think it supports number formatting, only (CSS) styling.
Some sample data:
import pandas as pd
df = pd.DataFrame([[150.00, 181.00, 186.00],
                   [  5.85,   3.73,   2.12]],
                  index=['Foo', 'Bar'],
                  columns=list('ABC'))
If I transpose the DataFrame, it's easy:
mapper = {'Foo': '{:.0f}',
          'Bar': '{:.1f}%'}
df.T.style.format(mapper)
But I want this formatting without transposing, something like:
df.style.format(mapper, axis=1)
You may not need to use the Styler class for this if the target is to re-format row values. You can use that mapper dictionary to match the formats you want, through a map and apply combination by row. The following should be a decent start:
df.apply(lambda s: s.map(mapper.get(s.name).format), axis=1)
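A self-contained sketch of that row-wise approach, using the sample data from the question. The format_rows wrapper and its default argument are assumptions added for illustration; the result is a DataFrame of strings, not a Styler.

```python
import pandas as pd

df = pd.DataFrame([[150.00, 181.00, 186.00],
                   [5.85, 3.73, 2.12]],
                  index=["Foo", "Bar"],
                  columns=list("ABC"))

mapper = {"Foo": "{:.0f}", "Bar": "{:.1f}%"}


def format_rows(frame, formats, default="{}"):
    """Format each row with the format string registered for its index label.

    Hypothetical helper; rows without an entry in `formats` fall back to `default`.
    """
    return frame.apply(
        lambda row: row.map(formats.get(row.name, default).format), axis=1
    )


formatted = format_rows(df, mapper)
print(formatted)
```

Since the values become strings, this is best done as a last step before display, after any numeric work is finished.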

Pandas DataFrame naming only 1 column

Is there a way with a pandas DataFrame to name only the first column, or the first and second, even if there are 4 columns?
Here is my code:
for x in range(1, len(table2_query) + 1):
    if x == 1:
        cursor.execute(table2_query[x])
        df = pd.DataFrame(data=cursor.fetchall(), columns=['Q', col_name[x-1]])
and it gives me this:
AssertionError: 2 columns passed, passed data had 4 columns
Consider the df:
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df
then use rename and pass a dictionary with the name changes to the argument columns:
df.rename(columns=dict(A='a', B='b'))
Instantiating a DataFrame while only naming a subset of the columns
When constructing a dataframe with pd.DataFrame, you either don't pass an index/columns argument and let pandas auto-generate the index/columns object, or you pass one in yourself. If you pass it in yourself, it must match the dimensions of your data. Mimicking pandas' auto-generation while overriding just the names you want is not worth the trouble: it is ugly and probably slow. In other words, I can't think of a good reason to do it.
On the other hand, it is super easy to rename the columns/index values. In fact, we can rename just a few. I think below is more in line with the spirit of your question:
df = pd.DataFrame(np.arange(8).reshape(2, 4)).rename(columns=str).rename(columns={'1': 'A', '3': 'F'})
df
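Applied to the situation in the question, a sketch might look like the following. The rows list is a made-up stand-in for cursor.fetchall(), and the names 'Q' and 'name' are placeholders. The auto-generated column labels are integers, so they can be renamed by position without converting them to strings first.

```python
import pandas as pd

# Stand-in for cursor.fetchall(): four values per row (invented data).
rows = [(1, "x", 10.0, True), (2, "y", 20.0, False)]

# Build the frame with auto-generated labels 0..3, then rename only the first two.
df = pd.DataFrame(rows).rename(columns={0: "Q", 1: "name"})

print(list(df.columns))  # ['Q', 'name', 2, 3]
```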

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce
df2= pd.DataFrame(list(map(lambda w,r,e: doStuff(w,r,e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns, but NaN for the df1 values (and vice versa if df1/df2 are switched in the append).
So my problem is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with doStuff, while df1 does not have that column. When appending one pd.DataFrame to another, the result will have all columns of both pd.DataFrame objects. Naturally, you will have some NaN's for ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
I hope that helps, if not please rephrase the description of what you're after.
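To make the difference concrete, here is a small sketch with invented numbers and a stand-in doStuff that simply adds its arguments. It uses pd.concat for both cases, since DataFrame.append was removed in pandas 2.0.

```python
import pandas as pd

# Invented data; doStuff is assumed to be a simple sum for illustration.
df1 = pd.DataFrame({"wins": [1, 2], "runs": [3, 4], "expected": [0.5, 1.5]})
df2 = pd.DataFrame(
    {"specialSauce": [w + r + e
                      for w, r, e in zip(df1["wins"], df1["runs"], df1["expected"])]}
)

# Stacking rows (what append did): columns are unioned, gaps become NaN.
stacked = pd.concat([df2, df1], ignore_index=True)

# Placing the frames side by side: rows are matched on the index, no NaN.
side_by_side = pd.concat([df1, df2], axis=1)

print(stacked)
print(side_by_side)
```

The side-by-side result only lines up because both frames share the same default index; frames with different indexes would be aligned on their labels instead.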
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
   winnings   returns     spent      runs      wins  expected  specialSauce
0  0.166085  0.781964  0.852285 -0.707071 -0.931657  0.886661     -0.752067
1 -0.055704  1.163688  0.079710  0.155916 -1.212917 -0.045265     -1.102266
2 -0.554241  1.928014  0.271214 -0.462848  0.452802  1.692924      1.682878
3  0.627985  3.047389 -1.594841 -1.099262 -0.308115  4.356977      2.949601
4  0.796156  3.228755 -0.273482 -0.661442 -0.111355  2.827409      2.054611
Also
You tried to use pd.DataFrame.append(). Per the linked documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame that is being appended to, i.e. it adds rows. To add a column alongside the existing ones, you would have wanted to use pd.concat() with axis=1.

Pandas: Selecting column from data frame

Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!), you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
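The likely explanation is that assigning to a name that is not an existing column just sets an ordinary Python attribute on the object, which is why reading df.b still works while the columns stay unchanged. A minimal sketch follows; recent pandas versions emit a UserWarning on such an assignment, which is silenced here.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": range(4)})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas warns about this assignment
    df.b = range(4)                  # sets a plain attribute, not a column

df["c"] = range(4)                   # bracket assignment creates a real column

print(list(df.columns))  # ['a', 'c']
print(df.b)              # the attribute is still readable
```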
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred as it allows for spaces and other characters that are not valid in Python identifiers.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same but for me the first method handles spaces in column names and illegal characters so is preferred, example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print(df[' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
Really when you use dot . it's trying to find a key as an attribute, if for some reason you have used column names that match an attribute then using dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
      index
0  0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
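A short sketch of the safe pattern, with invented values: bracket access always refers to the column, whatever its name, while dot access only works when the name does not clash with an existing DataFrame attribute.

```python
import pandas as pd

df = pd.DataFrame({"index": [0.1, 0.2, 0.3], "rate": [1, 2, 3]})

row_labels = df.index       # the DataFrame's row labels, not the column
column = df["index"]        # the column that happens to be named 'index'

# For a non-clashing name, both spellings return the same Series.
same = df.rate.equals(df["rate"])

print(row_labels)
print(column)
print(same)
```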
If you are working on an ML project and want to extract the feature and target variables separately, the code below will be useful.
It selects the features by slicing the list of column names, which can then be applied to the dataframe. In this code, data is the DataFrame.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output looks like this:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]
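If the goal is the data itself and not just the column names, positional slicing with iloc is a possible shortcut. This sketch assumes, as the snippet above does, that the target is the last column; the tiny frame is invented for illustration.

```python
import pandas as pd

# Invented sample; the last column plays the role of the target.
data = pd.DataFrame({"age": [30, 41], "job": ["admin", "tech"], "output": [0, 1]})

X = data.iloc[:, :-1]   # every column except the last (features)
y = data.iloc[:, -1]    # the last column (target)

print(list(X.columns))  # ['age', 'job']
print(y.name)           # output
```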
