Index based style.format - python

You can specify a format for each column by using df.style.format(), however, i want this behavior but then index based instead of column based. I realise its a bit more tricky because a column has a specific datatype, and a row can be mixed.
Is there a workaround to get it anyway? The df.style.apply() method has the flexibility, but i don't think it supports number formatting, only (CSS) styling.
Some sample data:
import pandas as pd
df = pd.DataFrame([[150.00, 181.00, 186.00],
[ 5.85, 3.73, 2.12]],
index=['Foo', 'Bar'],
columns=list('ABC'))
If i transpose the Dataframe, is easy:
mapper = {'Foo': '{:.0f}',
'Bar': '{:.1f}%'}
df.T.style.format(mapper)
But i want this formatting without transposing, something like:
df.style.format(mapper, axis=1)

You may not need to use the Styler class for this if the target is to re-format row values. You can use that mapper dictionary to match the formats you want, through a map and apply combination by row. The following should be a decent start:
df.apply(lambda s: s.map(mapper.get(s.name).format), axis=1)
Thanks!

Related

Filters in Pandas - why doesn't this work?

This is a basic question so apologies in advance.
I am using Pandas and I am grouping data with the following line:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()['keyword']
This is referencing the following:
The data frame: [page_serp_df]
Grouping by the column: meta_keywords_1_length
Counting with the filter: keyword column
What I don't understand is why does the filtering condition have to be ['keyword'] i.e. a string in quotes?
For example, this doesn't work and it is very counterintuituve to me:
page_serp_df.groupby([page_serp_df.meta_keywords_1_length]).count()[page_serp_df.keyword]
Thanks in advance!
I think there is a misunderstanding on what the .count() method returns.
Try to follow this example:
Create a sample data frame
df = pd.DataFrame({
'A':[0,1,0,1, 1],
'B':[100,200,300, 400, 500],
'C': [1,2,3,4,5]
})
This is what the count() method will return after groupby
# similarly to your example I am grouping by A and counting
df.groupby([df.A]).count()
As you can see, the count() method returns a dataframe itself, having the count of each other column values for the column where the grouped column has the same value.
After that, you can query for a specific column form the return of count() like this
df.groupby([df.A]).count()['C']
But the second case in your example, which in my example would correspond to df.groupby([df.A]).count()[df.C]
Will throw an error!
In fact, you would query a dataframe (in this case df.groupby([df.A]).count()) via a pandas Series but as you know you need a string or a column from df.columns.
You can check yourself that df.C and 'C' are two very different variable types.
print(type(df.C))
print(type('C'))
# <class 'pandas.core.series.Series'>
# <class 'str'>
If for some reason your code still works with the equivalent of df.C there might be some contingency like the only value of the df.C is a string with the same name of a column.. or something unintentional like that.

Python.pandas: how to select rows where objects start with letters 'PL'

I have specific problem with pandas: I need to select rows in dataframe which start with specific letters.
Details: I've imported my data to dataframe and selected columns that I need. I've also narrowed it down to row index I need. Now I also need to select rows in other column where objects START with letters 'pl'.
Is there any solution to select row only based on first two characters in it?
I was thinking about
pl = df[‘Code’] == pl*
but it won't work due to row indexing. Advise appreciated!
Use startswith for this:
df = df[df['Code'].str.startswith('pl')]
Fully reproducible example for those who want to try it.
import pandas as pd
df = pd.DataFrame([["plusieurs", 1], ["toi", 2], ["plutot", 3]])
df.columns = ["Code", "number"]
df = df[df.Code.str.startswith("pl")] # alternative is df = df[df["Code"].str.startswith("pl")]
If you use a string method on the Series that should return you a true/false result. You can then use that as a filter combined with .loc to create your data subset.
new_df = df.loc[df[‘Code’].str.startswith('pl')].copy()
The condition is just a filter, then you need to apply it to the dataframe. as filter you may use the method Series.str.startswith and do
df_pl = df[df['Code'].str.startswith('pl')]

Pandas DataFrame naming only 1 column

Is there a way with Pandas Dataframe to name only the first or first and second column even if there's 4 columns :
Here
for x in range(1, len(table2_query) + 1):
if x == 1:
cursor.execute(table2_query[x])
df = pd.DataFrame(data=cursor.fetchall(), columns=['Q', col_name[x-1]])
and it gives me this :
AssertionError: 2 columns passed, passed data had 4 columns
Consider the df:
df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('ABCD'))
df
then use rename and pass a dictionary with the name changes to the argument columns:
df.rename(columns=dict(A='a', B='b'))
Instantiating a DataFrame while only naming a subset of the columns
When constructing a dataframe with pd.DataFrame, you either don't pass an index/columns argument and let pandas auto-generate the index/columns object, or you pass one in yourself. If you pass it in yourself, it must match the dimensions of your data. The trouble of mimicking the auto-generation of pandas while augmenting just the ones you want is not worth the trouble and is ugly and is probably non-performant. In other words, I can't even think of a good reason to do it.
On the other hand, it is super easy to rename the columns/index values. In fact, we can rename just a few. I think below is more in line with the spirit of your question:
df = pd.DataFrame(np.arange(8).reshape(2, 4)).rename(columns=str).rename(columns={'1': 'A', '3': 'F'})
df

Pandas cannot use datetime in multiindex

I spend lots of time try to insert data into pandas' DataFrame
but just cannot as I expected.
there are two index:
1. current_time
2. company_name
After I use data.ix[] to insert a row,
the Dataframe create another column (named by the company_name)
Can anyone give me some advice, please.
import pandas
data=pandas.DataFrame(columns=['Date', 'Name', 'd1'])
data.set_index(['Date', 'Name'], inplace=True)
now = pandas.datetime.now()
data.ix[now, 'ACompany'] = [1]
To let pandas know the now, 'ACompany' are the levels of the index, you have to use some extra parantheses:
data.ix[(now, 'ACompany'), :] = 1
By just doing data.ix[now, 'ACompany'], pandas will by default try to interpret this as index=now, column='ACompany' (in the sense of .ix[rows, columns])
Further, it is recommended to use .loc instead of .ix if you want to index solely by the labels.

Pandas: Selecting column from data frame

Pandas beginner here. I'm looking to return a full column's data and I've seen a couple of different methods for this.
What is the difference between the two entries below, if any? It looks like they return the same thing.
loansData['int_rate']
loansData.int_rate
The latter is basically syntactic sugar for the former. There are (at least) a couple of gotchas:
If the name of the column is not a valid Python identifier (e.g., if the column name is my column name?!, you must use the former.
Somewhat surprisingly, you can only use the former form to completely correctly add a new column (see, e.g., here).
Example for latter statement:
import pandas as pd
df = pd.DataFrame({'a': range(4)})
df.b = range(4)
>> df.columns
Index([u'a'], dtype='object')
For some reason, though, df.b returns the correct results.
They do return the same thing. The column names in pandas are akin to dictionary keys that refer to a series. The column names themselves are named attributes that are part of the dataframe object.
The first method is preferred as it allows for spaces and other illegal operators.
For a more complete explanation, I recommend you take a look at this article:
http://byumcl.bitbucket.org/bootcamp2013/labs/pd_types.html#pandas-types
Search 'Access using dict notation' to find the examples where they show that these two methods return identical values.
They're the same but for me the first method handles spaces in column names and illegal characters so is preferred, example:
In [115]:
df = pd.DataFrame(columns=['a', ' a', '1a'])
df
Out[115]:
Empty DataFrame
Columns: [a, a, 1a]
Index: []
In [116]:
print(df.a) # works
print([' a']) # works
print(df.1a) # error
File "<ipython-input-116-4fa4129a400e>", line 3
print(df.1a)
^
SyntaxError: invalid syntax
Really when you use dot . it's trying to find a key as an attribute, if for some reason you have used column names that match an attribute then using dot will not do what you expect.
Example:
In [121]:
df = pd.DataFrame(columns=['index'], data = np.random.randn(3))
df
Out[121]:
index
0 0.062698
1 -1.066654
2 -1.560549
In [122]:
df.index
Out[122]:
Int64Index([0, 1, 2], dtype='int64')
The above has now shown the index as opposed to the column 'index'
In case if you are working on any ML projects and you want to extract feature and target variables separately and need to have them separably.
Below code will be useful: This is selecting features through indexing as a list and applying them to the dataframe. in this code data is DF.
len_col=len(data.columns)
total_col=list(data.columns)
Target_col_Y=total_col[-1]
Feature_col_X=total_col[0:-1]
print('The dependent variable is')
print(Target_col_Y)
print('The independent variables are')
print(Feature_col_X)
The output for the same can be obtained as given below:
The dependent variable is
output
The independent variables are
['age', 'job', 'marital', 'education','day_of_week', ... etc]

Categories

Resources