I'm trying to apply a simple function on a pd.DataFrame but I need the index of each element while applying.
Consider this DataFrame:
  CLM_1 CLM_1
A   foo   bar
B   bar   foo
C   bar   foo
and I want a pd.Series as a result, like so:
A    'A'
B    'B'
C    'C'
Length: 3, dtype: object
My approach:
I used df.apply(lambda row: row.index, axis=1) which obviously didn't work.
Use to_series() on the index:
>>> df.index.to_series()
A A
B B
C C
dtype: object
If you want to use the index in a function, you can assign it as a column and then apply whatever function you need:
df["index"] = df.index
>>> df.apply(lambda row: row["CLM_1"]+row["index"], axis=1)
A fooA
B barB
C barC
dtype: object
I used the name attribute of the row being applied, and it worked just fine! No need to add more columns to my DataFrame.
df.apply(lambda row: row.name, axis=1)
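A minimal sketch, assuming the sample frame above (keeping just the first CLM_1 column):
import pandas as pd

df = pd.DataFrame({'CLM_1': ['foo', 'bar', 'bar']}, index=['A', 'B', 'C'])

# row.name holds the index label of the row currently being applied
df.apply(lambda row: row.name, axis=1)
# A    A
# B    B
# C    C
# dtype: object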
I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). It is not what I need. I need one value (one float number). How can I do it in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
Most answers use iloc, which is good for selection by position.
If you need selection by label, loc would be more convenient.
For getting a value explicitly (equivalent to the deprecated
df.get_value('a', 'A')):
# This is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
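If the filter might match no rows, .values[0] will raise an IndexError. A small guarded sketch (the None fallback is just an illustrative choice):
matches = df.loc[df.wd == 1, 'col_name']
val = matches.values[0] if not matches.empty else None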
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like behavior changed between pandas 0.10.1 and 0.13.1.
I upgraded from 0.10.1 to 0.13.1. Before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] gets a single-value Series rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26 118.2
Name: Close, dtype: float64
The quickest and easiest options I have found are the following. 501 represents the row index.
df.at[501, 'column_name']
df.get_value(501, 'column_name')  # Note: get_value is deprecated in modern pandas; prefer .at
In later pandas versions, you can get the scalar by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is a good practice, but I noticed I can also get just the value by casting the series as float.
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
A B C
a b c
i 1 1 2 3
ii 2 4 5 6
iii 3 7 8 9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you don't need conditionals), AFAIK you still need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')
If more than one row matches the filter, this obtains the first row's value. An exception will be raised if the filter results in an empty data frame.
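In modern pandas, where get_value has been removed, an equivalent sketch would be:
result = df_filt['VALUE'].iloc[0]
# or, label-based on the first filtered index (same lookup as get_value above):
result = df_filt.at[df_filt.index[0], 'VALUE']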
Converting it to an integer worked for me (note this only works when the row holds a single value):
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
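For example, a minimal sketch with a made-up one-row frame:
df = pd.DataFrame({'a': [1], 'b': [4]})
row = df.iloc[0]
row.to_json()
# '{"a":1,"b":4}'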
Context:
I am new to Pandas and I need a function that creates new columns based on existing columns. The new column's name will be the name of the original column plus new characters (for example: create an "As New" column from the "As" column).
Can I access the old column header string to make the name of the new column?
Problem:
I have df['columnA'] and need to get the string "columnA".
If I understand you correctly, this may be what you're looking for.
You can use str.contains() for the columns, then use string formatting to create the new column name.
df = pd.DataFrame({'col1':['A', 'A', 'B','B'], 'As': ['B','B','C','C'], 'col2': ['C','C','A','A'], 'col3': [30,10,14,91]})
col = df.columns[df.columns.str.contains('As')]
df['%s New' % col[0]] = 'foo'
print (df)
As col1 col2 col3 As New
0 B A C 30 foo
1 B A C 10 foo
2 C B A 14 foo
3 C B A 91 foo
Assuming that you have an empty DataFrame df with columns, you could access the columns of df as a list with:
>>> df.columns
Index(['columnA', 'columnB'], dtype='object')
.columns will allow you to overwrite the columns of df, but you don't need to pass in another Index. You can pass it a regular list, like so:
>>> df.columns = ['columna', 'columnb']
>>> df
Empty DataFrame
Columns: [columna, columnb]
Index: []
This can be done through the columns attribute.
cols = df.columns
# Do whatever operation you want on the list of strings in cols
df.columns = cols
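A minimal sketch, assuming you want to derive every new name from the old one (the '_new' suffix is hypothetical):
cols = df.columns
df.columns = [c + '_new' for c in cols]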
I have a dataframe with this type of data (too many columns):
col1 int64
col2 int64
col3 category
col4 category
col5 category
Columns look like this:
Name: col3, dtype: category
Categories (8, object): [B, C, E, G, H, N, S, W]
I want to convert all the values in each column to integer like this:
[1, 2, 3, 4, 5, 6, 7, 8]
I solved this for one column by this:
dataframe['c'] = pandas.Categorical.from_array(dataframe.col3).codes
Now I have two columns in my dataframe - old col3 and new c and need to drop old columns.
That's bad practice. It works, but there are too many columns in my dataframe and I don't want to do it manually.
How can I do this more cleverly?
First, to convert a Categorical column to its numerical codes, you can do this more easily with: dataframe['c'].cat.codes.
Further, it is possible to automatically select all columns with a certain dtype in a dataframe using select_dtypes. This way, you can apply the above operation on multiple, automatically selected columns.
First making an example dataframe:
In [75]: df = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
In [76]: df['col2'] = df['col2'].astype('category')
In [77]: df['col3'] = df['col3'].astype('category')
In [78]: df.dtypes
Out[78]:
col1 int64
col2 category
col3 category
dtype: object
Then by using select_dtypes to select the columns, and then applying .cat.codes on each of these columns, you can get the following result:
In [80]: cat_columns = df.select_dtypes(['category']).columns
In [81]: cat_columns
Out[81]: Index([u'col2', u'col3'], dtype='object')
In [83]: df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
In [84]: df
Out[84]:
col1 col2 col3
0 1 0 0
1 2 1 1
2 3 2 0
3 4 0 1
4 5 1 1
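If you also want to keep the mapping from codes back to the original labels, one possible sketch (run it before overwriting the columns with their codes):
mappings = {col: dict(enumerate(df[col].cat.categories)) for col in cat_columns}
# e.g. mappings['col2'] == {0: 'a', 1: 'b', 2: 'c'}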
This works for me:
pandas.factorize( ['B', 'C', 'D', 'B'] )[0]
Output:
[0, 1, 2, 0]
If your concern was only that you were making an extra column and deleting it later, just don't use a new column in the first place.
dataframe = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab'), 'col3':list('ababb')})
dataframe.col3 = pd.Categorical.from_array(dataframe.col3).codes
You are done. Now, as Categorical.from_array is deprecated, use Categorical directly:
dataframe.col3 = pd.Categorical(dataframe.col3).codes
If you also need the mapping back from index to label, there is an even better way for the same:
dataframe.col3, mapping_index = pd.Series(dataframe.col3).factorize()
Check below:
print(dataframe)
print(mapping_index.get_loc("b"))  # position of label 'b' in the mapping
Here multiple columns need to be converted, so one approach I used is:
for col_name in df.columns:
    if df[col_name].dtype == 'object':
        df[col_name] = df[col_name].astype('category')
        df[col_name] = df[col_name].cat.codes
This converts all string/object type columns to categorical, then applies codes to each category.
What I do is replace values, like this:
df['col'].replace(to_replace=['category_1', 'category_2', 'category_3'], value=[1, 2, 3], inplace=True)
In this way, if the col column has categorical values, they get replaced by the numerical values.
For converting categorical data in column C of dataset data, we need to do the following:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()  # initializing an object of class LabelEncoder
data['C'] = labelencoder.fit_transform(data['C'])  # fitting and transforming the desired categorical column
To convert all the columns in the Dataframe to numerical data:
df2 = df2.apply(lambda x: pd.factorize(x)[0])
Answers here seem outdated. Pandas now has a factorize() function and you can create categories as:
df.col.factorize()
Function signature:
pandas.factorize(values, sort=False, na_sentinel=-1, size_hint=None)
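A quick sketch of how the two return values are typically used:
codes, uniques = df['col'].factorize()
df['col'] = codes  # integer codes; -1 marks missing values
# uniques holds the original labels, positioned by code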
One of the simplest ways to convert the categorical variable into dummy/indicator variables is to use get_dummies provided by pandas.
Say, for example, we have data in which Sex is a categorical value (male & female),
and you need to convert it into a dummy/indicator variable. Here is how to do it:
training_data = pd.read_csv("../titanic/train.csv")
features = ["Age", "Sex"]  # here Sex is a categorical value
X_train = pd.get_dummies(training_data[features])
print(X_train)
Age Sex_female Sex_male
20 0 1
33 1 0
40 1 0
22 1 0
54 0 1
You can use .replace as follows:
df['col3']=df['col3'].replace(['B', 'C', 'E', 'G', 'H', 'N', 'S', 'W'],[1,2,3,4,5,6,7,8])
or .map:
df['col3'] = df['col3'].map({'B': 1, 'C': 2, 'E': 3, 'G': 4, 'H': 5, 'N': 6, 'S': 7, 'W': 8})  # keys are the existing values, values are the new codes
categorical_columns = ['sex', 'class', 'deck', 'alone']
for column in categorical_columns:
    df[column] = pd.factorize(df[column])[0]
Factorize will map each unique categorical value in a column to a distinct integer (sequentially, starting from 0).
@Quickbeam2k1, see below:
import numpy as np
import pandas as pd

dataset = pd.read_csv('Data2.csv')
np.set_printoptions(threshold=np.nan)  # in newer NumPy, use sys.maxsize instead of np.nan
X = dataset.iloc[:, :].values
Using sklearn
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
You can do it with less code, like below:
f = pd.DataFrame({'col1': [1, 2, 3, 4, 5], 'col2': list('abcab'), 'col3': list('ababb')})
f['col1'] = f['col1'].astype('category').cat.codes
f['col2'] = f['col2'].astype('category').cat.codes
f['col3'] = f['col3'].astype('category').cat.codes
f
Just use manual matching:
mapping = {'Non-Travel': 0, 'Travel_Rarely': 1, 'Travel_Frequently': 2}
df['BusinessTravel'] = df['BusinessTravel'].apply(lambda x: mapping.get(x))
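Equivalently, .map performs the same dictionary lookup in one call (unmatched values become NaN):
df['BusinessTravel'] = df['BusinessTravel'].map(mapping)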
For a certain column, if you don't care about the ordering, use this:
df['col1_num'] = df['col1'].apply(lambda x: np.where(df['col1'].unique() == x)[0][0])
If you care about the ordering, specify the categories as a list and use this:
df['col1_num'] = df['col1'].apply(lambda x: ['first', 'second', 'third'].index(x))
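A vectorized sketch of the ordered case, using pd.Categorical with an explicit category list (the category names are just the example's):
df['col1_num'] = pd.Categorical(df['col1'], categories=['first', 'second', 'third']).codes
# values not in the category list get code -1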
This is basically a pandas syntax question.
I have a dataframe that contains, among other things, rows that are tagged with a Quantification and a Calibration, both of which are text. There are >100,000 rows, but only ~200 unique Quantification tags and ~10 unique Calibration tags. I'm trying to concatenate these into a single tag, and I ran into a curiosity:
this works:
df['n_q'] = df['Quantification'] + " (" + df['Calibration'] + ')'
but this doesn't:
df['n_q'] = "{0} ({1})".format(df['Quantification'], df['Calibration'])
The latter seems to give every row the same giant string, which I guess is all the tags concatenated.
My question is how can I do what I want to do using str.format?
One way is to use an apply:
In [11]: df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['A', 'B'])
In [12]: df['A'] + ' (' + df['B'] + ')'
Out[12]:
0 a (b)
1 c (d)
dtype: object
In [13]: df.apply(lambda x: '{0} ({1})'.format(*x), axis=1)
Out[13]:
0 a (b)
1 c (d)
dtype: object
Note: this works when you are using all columns.
You can reference by column names for a neater and more robust solution:
In [14]: df.apply(lambda x: '{A} ({B})'.format(**x), axis=1)
Out[14]:
0 a (b)
1 c (d)
dtype: object
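Since apply over rows can be slow on >100,000 rows, a possible alternative sketch is a list comprehension over the two columns:
df['n_q'] = ['{0} ({1})'.format(q, c)
             for q, c in zip(df['Quantification'], df['Calibration'])]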