Let's say I have the calculation below:
import pandas as pd
dat = pd.DataFrame({'xx1' : [1,2,3], 'aa2' : ['qq', '4', 'd'], 'xx3' : [4,5,6]})
dat2 = (dat
.assign(xx1 = lambda x : [str(i) for i in x['xx1'].values])
.assign(xx3 = lambda x : [str(i) for i in x['xx3'].values])
)
Basically, I need to find the columns whose names match the pattern xx followed by a sequence of digits (i.e. xx1, xx2, xx3, etc.) and then apply some transformation to those columns (e.g. apply the str function).
One way to do this is as above, i.e. find those columns manually and transform each one. I wonder if there is a way to generalise this approach. I would prefer to keep using a pipe as above.
Any pointers would be very helpful.
You could do:
# Matches all columns starting with 'xx' with a sequence of numbers afterwards.
cols_to_transform = dat.columns[dat.columns.str.match('^xx[0-9]+$')]
# Transform to apply (column-wise).
transform_function = lambda c: c.astype(str)
# If you want a new DataFrame and not modify the other in-place.
dat2 = dat.copy()
dat2[cols_to_transform] = dat2[cols_to_transform].transform(transform_function, axis=0)
To use it within assign:
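# Here I put a lambda to avoid precomputing all the transformations in the dict comprehension.
# col is bound as a default argument: assign calls the lambdas only after the comprehension
# has finished, so a plain closure would make every lambda read the last column name.
dat.assign(**{col: lambda df, col=col: df[col].astype(str) for col in cols_to_transform})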
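Since the question asks for something that fits into a method chain, one possible sketch wraps the selection and the transformation in a helper that can be handed to pipe. The helper name transform_matching and its parameters are invented for this illustration and are not part of pandas; it also assumes the column names are strings.

```python
import pandas as pd

def transform_matching(df, pattern, func):
    """Return a new frame with func applied to each column whose name fully matches pattern."""
    # Index.str.fullmatch anchors the pattern at both ends of the column name.
    cols = df.columns[df.columns.str.fullmatch(pattern)]
    # Each transformed column is computed eagerly here, so no lambda late-binding is involved.
    return df.assign(**{col: func(df[col]) for col in cols})

dat = pd.DataFrame({'xx1': [1, 2, 3], 'aa2': ['qq', '4', 'd'], 'xx3': [4, 5, 6]})
dat2 = dat.pipe(transform_matching, r'xx[0-9]+', lambda c: c.astype(str))
print(dat2['xx1'].tolist())
```

Because assign returns a new frame, the original dat is left unchanged and the helper can be chained with further pipe or assign calls.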
import pandas as pd
frame = pd.DataFrame({'xx1' : [1,2,3], 'aa2' : ['qq', '4', 'd'], 'xx3' : [4,5,6]})
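def parse_column(col, vals):
    # Convert only columns named "xx" followed by digits.
    if "xx" == col[:2] and col[2:].isdigit():
        return [str(i) for i in vals]
    return vals
# DataFrame.iteritems() was removed in pandas 2.0; items() yields the same (name, Series) pairs.
for name, col in frame.items():
    frame[name] = parse_column(name, col.values)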
You can iterate over the columns, getting each column's name and its values as a Series.
The rather niche str.isdigit() method is built into Python, and it comes in useful here.
One option is to select the relevant columns, apply your function and assign them back to the dataframe via unpacking:
result = dat.assign(**dat.filter(regex=r"xx\d+").astype(str))
result.dtypes
xx1 object
aa2 object
xx3 object
dtype: object
dat.dtypes
xx1 int64
aa2 object
xx3 int64
dtype: object
I need to perform some aggregations on a pandas dataframe. I'm using pandas version 1.3.3.
It seems I am only able to use builtin python functions, such as the max function, to aggregate columns that contain strings. Trying to do the same thing using any custom function (even one that only calls the builtin max) causes an error, as shown in the example below.
Can anyone tell me what I'm doing wrong in this example, and what is the correct way to use a custom function for string aggregation?
import pandas as pd
# Define a dataframe with two columns - one with strings (a-e), one with numbers (1-5)
foo = pd.DataFrame(
data={
'string_col': ['a', 'b', 'c', 'd', 'e'],
'num_col': [1,2,3,4,5]
}
)
# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
return ", ".join(vals)
# This works - gives a pandas Series with string_col = e, and num_col = 5
a = foo.agg(func={'string_col': max, 'num_col': max})
# This crashes with 'ValueError: cannot perform both aggregation and transformation operations simultaneously'
b = foo.agg(func={'string_col': lambda x: max(x), 'num_col': max})
# Crashes with same error
c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})
If you try to run:
foo['string_col'].agg(','.join)
you will see that you get back a Series:
0 a
1 b
2 c
3 d
4 e
Name: string_col, dtype: object
Indeed, your custom function is applied per element, not to the whole Series, hence the error "cannot perform both aggregation and transformation operations simultaneously".
You can change your function to:
# Custom aggregation function to concatenate strings
def custom_aggregation_funcion(vals):
return ", ".join(vals.to_list())
c = foo.agg(func={'string_col': custom_aggregation_funcion, 'num_col': max})
output:
string_col a, b, c, d, e
num_col 5
dtype: object
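If relying on how agg dispatches a custom function feels fragile, an alternative sketch is to compute each aggregate directly on its column and collect the results in a Series. This uses Series.str.cat, which joins all values of a string column when no other Series is passed; the variable name result is just for illustration.

```python
import pandas as pd

foo = pd.DataFrame({
    'string_col': ['a', 'b', 'c', 'd', 'e'],
    'num_col': [1, 2, 3, 4, 5],
})

# Aggregate each column explicitly, so every function sees the whole column.
result = pd.Series({
    'string_col': foo['string_col'].str.cat(sep=', '),
    'num_col': foo['num_col'].max(),
})
print(result)
```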
I am wondering how to pass pandas data frame column values into a regular expression. I have tried the code below but get "TypeError: 'Series' objects are mutable, thus they cannot be hashed".
I'm after the result below. (I could just use a different regex, but I was wondering how this might be done dynamically.)
Thoughts appreciated :)
to_search   search_string  search_results
ABC-T3-123  ABC            ABC-T3
ABC-T2-123  ABC            ABC-T2
DEF-T1-123  DEF            DEF-T1
import pandas as pd
# create list for data frame
data = [['ABC-T3-123', 'ABC'], ['ABC-T2-123', 'ABC'], ['DEF-T1-123', 'DEF']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['to_search', 'search_string'])
df['search_results']=df['to_search'].str.extract("(" + df['search_string'] + "-T[0-9])")
I know that you want an efficient solution, but these pandas string methods generally do not accept a Series as the pattern. Here is an apply-based solution, which I think is the only viable approach short of simplifying the regular expression:
import re
searched = df.apply(lambda row: re.search("(" + row['search_string'] + "-T[0-9])", row['to_search']).group(1), axis=1)
Output:
>>> searched
0 ABC-T3
1 ABC-T2
2 DEF-T1
dtype: object
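A variation on the same row-wise idea, sketched under two assumptions that are not in the question: the search string should be matched literally (hence re.escape), and rows without a match should yield None instead of raising. The helper name extract_prefix is made up for this example.

```python
import re
import pandas as pd

df = pd.DataFrame(
    [['ABC-T3-123', 'ABC'], ['ABC-T2-123', 'ABC'], ['DEF-T1-123', 'DEF']],
    columns=['to_search', 'search_string'],
)

def extract_prefix(text, prefix):
    # re.escape keeps any regex metacharacters in the prefix from being interpreted.
    match = re.search(re.escape(prefix) + r"-T[0-9]", text)
    return match.group(0) if match else None

# zip over the two columns avoids building a Series object for every row.
df['search_results'] = [
    extract_prefix(text, prefix)
    for text, prefix in zip(df['to_search'], df['search_string'])
]
print(df)
```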
Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or, with newer versions of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code returns a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects columns by data type through its include and exclude arguments, and columns fetches the column names of the result.
The output of the above code is:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
This is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as per the requirement as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
Reference, as shown in Jupyter Notebook:
To select all numeric types, use np.number or 'number'.
To select strings you must use the object dtype, but note that this will return all object dtype columns.
See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'.
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'.
To select pandas categorical dtypes, use 'category'.
To select pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'.
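To make the selectors above concrete, here is a small sketch that picks out datetime, timedelta and categorical columns by name. The frame and its column names are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'n': [1, 2],
    'when': pd.to_datetime(['2021-01-01', '2021-01-02']),
    'delta': pd.to_timedelta(['1 days', '2 days']),
    'cat': pd.Categorical(['a', 'b']),
})

# Each selector string maps to one family of dtypes.
datetime_cols = df.select_dtypes(include='datetime').columns.tolist()
timedelta_cols = df.select_dtypes(include='timedelta').columns.tolist()
category_cols = df.select_dtypes(include='category').columns.tolist()
print(datetime_cols, timedelta_cols, category_cols)
```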
Although this is an old question, I think the following is simpler than the other answers:
df[df.describe().columns]
By default, describe() on a DataFrame with mixed column types summarises only the numeric columns, so the columns of its output are exactly the numeric ones.
Please see the code below:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())
# Plain object is used here: the np.object alias was removed in NumPy 1.24.
if dataset.select_dtypes(include=[object]).shape[1] > 0:
    display(dataset.select_dtypes(include=[object]).describe())
This way you can check whether the data set has numeric columns (such as float and int) or string columns. The second if statement checks for string values, which are stored with the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks whether all values in a column are True, returning a Series of Booleans that can be used to index the desired columns. (.loc replaces the .ix indexer, which was removed in pandas 1.0.)
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
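One difference worth checking before choosing between this and select_dtypes is how Boolean columns are treated. As far as I know, is_numeric_dtype counts bool as numeric while select_dtypes(include='number') leaves bool columns out; the sketch below, with an invented frame, lets you verify that on your own pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [0.5, 1.5, 2.5],
    'C': ['foo', 'bar', 'baz'],
    'flag': [True, False, True],
})

# Column names judged numeric by the dtype check (bool counts as numeric here).
by_check = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
# Column names selected through the 'number' selector (bool is not included).
by_select = df.select_dtypes(include='number').columns.tolist()
print(by_check, by_select)
```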
I'm not sure when this function was introduced.
import numpy as np
import pandas as pd

def is_type(df, baseType):
    # One Boolean per column: is the column's NumPy scalar type a subclass of baseType?
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])

def is_float(df):
    # np.floating covers every NumPy float width; the np.float alias was removed in NumPy 1.24.
    return is_type(df, np.floating)

def is_number(df):
    return is_type(df, np.number)

def is_integer(df):
    return is_type(df, np.integer)
I have the following DataFrame from a SQL query:
(Pdb) pp total_rows
ColumnID RespondentCount
0 -1 2
1 3030096843 1
2 3030096845 1
and I pivot it like this:
total_data = total_rows.pivot_table(columns=['ColumnID'])
which produces
(Pdb) pp total_data
ColumnID -1 3030096843 3030096845
RespondentCount 2 1 1
[1 rows x 3 columns]
When I convert this dataframe into a dictionary (using total_data.to_dict('records')[0]), I get
{3030096843: 1, 3030096845: 1, -1: 2}
but I want to make sure the 303 columns are cast as strings instead of integers so that I get this:
{'3030096843': 1, '3030096845': 1, -1: 2}
One way to convert to string is to use astype:
total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
However, perhaps you are looking for the to_json function, which will convert keys to valid json (and therefore your keys to strings):
In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'
In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'
Note: you can pass in a buffer/file to save this to, along with some other options...
If you need to convert ALL columns to strings, you can simply use:
df = df.astype(str)
This is useful if you need everything except a few columns to be strings/objects, then go back and convert the other ones to whatever you need (integer in this case):
df[["D", "E"]] = df[["D", "E"]].astype(int)
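When only some columns need converting, astype also accepts a dictionary mapping column names to dtypes, which avoids the convert-everything-then-convert-back round trip. A minimal sketch with made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [1.5, 2.5], 'D': [3, 4]})

# Only the listed columns are converted; every other column keeps its dtype.
converted = df.astype({'A': str, 'B': str})
print(converted['A'].tolist(), converted['D'].tolist())
```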
pandas >= 1.0: It's time to stop using astype(str)!
Prior to pandas 1.0 (well, 0.25 actually) this was the de facto way of declaring a Series/column as a string:
# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')
From pandas 1.0 onwards, consider using "string" type instead.
# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype
Here's why, as quoted by the docs:
You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text
while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear than 'string'.
See also the section on Behavioral Differences between "string" and object.
Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than NumPy, which is good because NumPy types are not powerful enough. For example, NumPy has no way of representing missing data in integer arrays (since type(NaN) == float), but pandas can, using nullable integer columns.
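The nullable integer point can be seen in a couple of lines. This is only an illustrative sketch; the variable names are arbitrary.

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value forces the integers to float64.
as_float = pd.Series([1, 2, None])
# The nullable extension dtype "Int64" keeps integers and stores the gap as <NA>.
as_nullable_int = pd.Series([1, 2, None], dtype="Int64")

print(as_float.dtype, as_nullable_int.dtype)
```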
Why should I stop using it?
Accidentally mixing dtypes
The first reason, as outlined in the docs is that you can accidentally store non-text data in object columns.
# pandas <= 0.25
pd.Series(['a', 'b', 1.23]) # whoops, this should have been "1.23"
0 a
1 b
2 1.23
dtype: object
pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23] # oops, pandas was storing this as float all the time.
# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")
0 a
1 b
2 1.23
dtype: string
pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23'] # it's a string and we just averted some potentially nasty bugs.
Challenging to differentiate strings and other python objects
Another obvious example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.
Consider,
# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
A B
0 a {}
1 b [1, 2, 3]
2 c 123
Up to pandas 0.25, there was virtually no way to tell that "A" and "B" do not hold the same type of data.
# pandas <= 0.25
df.dtypes
A object
B object
dtype: object
df.select_dtypes(object)
A B
0 a {}
1 b [1, 2, 3]
2 c 123
From pandas 1.0, this becomes a lot simpler:
# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes
A string
B object
dtype: object
df.select_dtypes("string")
A
0 a
1 b
2 c
Readability
This is self-explanatory ;-)
OK, so should I stop using it right now?
...No. As of writing this answer (version 1.1), there are no performance benefits but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!
Here's another approach, particularly useful for converting multiple columns to strings rather than a single column:
In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
...: 'A': [20, 30.0, np.nan],
...: 'B': ["a45a", "a3", "b1"],
...: 'C': [10, 5, np.nan]})
...:
In [79]: df.dtypes ## Current datatype
Out[79]:
A float64
B object
C float64
dtype: object
## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str)
In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]:
A object
B object
C object
dtype: object
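Columns A and C above contain NaN, so it is worth checking what happens to missing values during conversion. The sketch below compares element-wise str with the dedicated "string" dtype: the former produces the literal text 'nan', the latter keeps the entry missing.

```python
import numpy as np
import pandas as pd

s = pd.Series([20, 30.0, np.nan])

# Element-wise str() turns NaN into the three-character text 'nan'.
as_python_str = s.map(str)
# The "string" extension dtype converts the numbers but keeps the missing entry as <NA>.
as_string_dtype = s.astype("string")

print(as_python_str.tolist())
print(as_string_dtype.isna().tolist())
```

If downstream code filters on isna(), the second form is usually the safer choice.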
There are four ways to convert a column to strings:
1. astype(str)
df['column_name'] = df['column_name'].astype(str)
2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)
3. map(str)
df['column_name'] = df['column_name'].map(str)
4. apply(str)
df['column_name'] = df['column_name'].apply(str)
Let's see the performance of each:
#importing libraries
import numpy as np
import pandas as pd
import time
#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')
#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')
#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')
#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')
Output
time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds
In this run, map(str) and apply(str) take less time than the other two techniques.
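Single time.time() measurements can be noisy, so a hedged variation is to repeat each conversion with the standard-library timeit module and keep the best run. The series size and repeat count below are arbitrary choices for a quick check, and the ranking may differ across pandas versions and machines.

```python
import timeit
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(1, 1000, size=100_000))

# Each candidate converts the same integer Series to strings.
candidates = {
    'astype(str)': lambda: s.astype(str),
    'values.astype(str)': lambda: s.values.astype(str),
    'map(str)': lambda: s.map(str),
    'apply(str)': lambda: s.apply(str),
}

# Best of three single runs per candidate, in seconds.
timings = {
    name: min(timeit.repeat(func, number=1, repeat=3))
    for name, func in candidates.items()
}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f'{name}: {seconds:.4f} s')
```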
I usually use this one:
df['Column'].map(str)
pandas version: 1.3.5
Updated answer
df['colname'] = df['colname'].astype(str) => this works by default. But if you have rebound the name str earlier (for example str = "myString"), astype(str) no longer receives the built-in type and fails. In that case, pass the dtype by name instead:
df['colname'] = df['colname'].astype('str')
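The shadowing problem can be reproduced in isolation. In this sketch the built-in is shadowed only inside a function, so nothing else is affected; the function name is made up, and the exact exception type raised for an unknown dtype name is an assumption, which is why two types are caught.

```python
import pandas as pd

def convert_with_shadowed_str(series):
    str = "myString"  # deliberately shadows the built-in, inside this function only
    try:
        series.astype(str)  # now the same as series.astype("myString")
        shadow_failed = False
    except (TypeError, ValueError):
        shadow_failed = True
    # The dtype name 'str' does not depend on what the variable str refers to.
    return shadow_failed, series.astype('str')

shadow_failed, converted = convert_with_shadowed_str(pd.Series([1, 2, 3]))
print(shadow_failed, converted.tolist())
```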
===========
(Note: incorrect old explanation)
df['colname'] = df['colname'].astype('str') => converts dataframe column into a string type
df['colname'] = df['colname'].astype(str) => gives an error
1. .map(repr) is very fast
If you want to convert values to strings in a column, consider .map(repr). For multiple columns, consider .applymap(str).
df['col_as_str'] = df['col'].map(repr)
# multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(str)
# or
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda col: col.map(repr))
In fact, a timeit test shows that map(repr) is 3 times faster than astype(str) (and is faster than any other method mentioned on this page). Even for multiple columns, this runtime difference still holds. The following is the runtime plot of various methods mentioned here.
astype(str) has very little overhead but for larger frames/columns, map/applymap outperforms it.
2. Don't convert to strings in the first place
There's very little reason to convert a numeric column into strings given pandas string methods are not optimized and often get outperformed by vanilla Python string methods. If not numeric, there are dedicated methods for those dtypes. For example, datetime columns should be converted to strings using pd.Series.dt.strftime().
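For the datetime case mentioned above, a short sketch of dt.strftime; the dates and the format string are chosen only for illustration.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2021-01-05', '2021-12-31']))

# Format each timestamp as day/month/year text.
formatted = dates.dt.strftime('%d/%m/%Y')
print(formatted.tolist())
```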
One way numeric->string seems to be used is in a machine learning context where a numeric column needs to be treated as categorical. In that case, instead of converting to strings, consider other dedicated methods such as pd.get_dummies or sklearn.preprocessing.LabelEncoder or sklearn.preprocessing.OneHotEncoder to process your data instead.
3. Use rename to convert column names to specific types
The specific question in the OP is about converting column names to strings, which can be done by rename method:
df = total_rows.pivot_table(columns=['ColumnID'])
df.rename(columns=str).to_dict('records')
# [{'-1': 2, '3030096843': 1, '3030096845': 1}]
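The question actually wants -1 to stay an integer while the other IDs become strings. rename also accepts a function, so one possible sketch is a conditional rename. The frame is rebuilt by hand here instead of through the SQL query and pivot, and the rule "positive IDs become strings" is my reading of the desired output.

```python
import pandas as pd

# Hand-built stand-in for the pivoted frame shown in the question.
total_data = pd.DataFrame(
    [[2, 1, 1]],
    columns=[-1, 3030096843, 3030096845],
    index=['RespondentCount'],
)

# Convert only the positive column labels to strings; leave -1 untouched.
renamed = total_data.rename(columns=lambda c: str(c) if c > 0 else c)
result = renamed.to_dict('records')[0]
print(result)
```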
The code used to produce the above plots:
import numpy as np
import pandas as pd
from perfplot import plot
plot(
setup=lambda n: pd.Series(np.random.default_rng().integers(0, 100, size=n)),
kernels=[lambda s: s.astype(str), lambda s: s.astype("string"), lambda s: s.apply(str), lambda s: s.map(str), lambda s: s.map(repr)],
labels= ['col.astype(str)', 'col.astype("string")', 'col.apply(str)', 'col.map(str)', 'col.map(repr)'],
n_range=[2**k for k in range(4, 22)],
xlabel='Number of rows',
title='Converting a single column into string dtype',
equality_check=lambda x,y: np.all(x.eq(y)));
plot(
setup=lambda n: pd.DataFrame(np.random.default_rng().integers(0, 100, size=(n, 100))),
kernels=[lambda df: df.astype(str), lambda df: df.astype("string"), lambda df: df.applymap(str), lambda df: df.apply(lambda col: col.map(repr))],
labels= ['df.astype(str)', 'df.astype("string")', 'df.applymap(str)', 'df.apply(lambda col: col.map(repr))'],
n_range=[2**k for k in range(4, 18)],
xlabel='Number of rows in dataframe',
title='Converting every column of a 100-column dataframe to string dtype',
equality_check=lambda x,y: np.all(x.eq(y)));
Using .apply() with a lambda conversion function also works in this case:
total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))
For entire dataframes you can use .applymap().
(but in any case probably .astype() is faster)
Currently I do it like this:
df_pg['store_id'] = df_pg['store_id'].astype('string')