Pandas str.count - python

Consider the following dataframe. I want to count the number of '$' characters that appear in each string, using the str.count method in pandas (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.count.html).
>>> import pandas as pd
>>> df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
>>> df['A'].str.count('$')
0    1
1    1
2    1
Name: A, dtype: int64
I was expecting the result to be [2,2,1]. What am I doing wrong?
In plain Python, the count method of the built-in str type returns the correct result.
>>> a = "$$$$abcd"
>>> a.count('$')
4
>>> a = '$abcd$dsf$'
>>> a.count('$')
3

$ has a special meaning in RegEx - it's end-of-line, so try this:
In [21]: df.A.str.count(r'\$')
Out[21]:
0    2
1    2
2    1
Name: A, dtype: int64
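If the substring to count comes from a variable rather than a fixed '$', the standard library's re.escape builds the escaped pattern for you. A minimal sketch (the variable name needle is my own):

import re
import pandas as pd

df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
needle = '$'                           # any literal substring, possibly containing regex metacharacters
df['A'].str.count(re.escape(needle))   # re.escape('$') produces the pattern '\$'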

As the other answers have noted, the issue here is that $ denotes the end of the line. If you do not intend to use regular expressions, you may find that using str.count (that is, the method of the built-in type str) is faster than its pandas counterpart:
In [39]: df['A'].apply(lambda x: x.count('$'))
Out[39]:
0    2
1    2
2    1
Name: A, dtype: int64
In [40]: %timeit df['A'].str.count(r'\$')
1000 loops, best of 3: 243 µs per loop
In [41]: %timeit df['A'].apply(lambda x: x.count('$'))
1000 loops, best of 3: 202 µs per loop
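One caveat with the apply version (worth checking if your column can contain missing values): calling str.count directly raises on NaN, while the pandas .str accessor propagates it. A small sketch on a toy frame of my own:

import numpy as np
import pandas as pd

df2 = pd.DataFrame(['$$a', np.nan, '$c'], columns=['A'])
df2['A'].str.count(r'\$')   # 2.0, NaN, 1.0 -- missing values propagate
# df2['A'].apply(lambda x: x.count('$'))   # AttributeError: 'float' object has no attribute 'count'
df2['A'].apply(lambda x: x.count('$') if isinstance(x, str) else 0)   # explicit guard needed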

Try the pattern [$] so that $ is not treated as end-of-line (see this cheatsheet): if you place it inside square brackets [], it is treated as a literal character:
In [3]: df = pd.DataFrame(['$$a', '$$b', '$c'], columns=['A'])
   ...: df['A'].str.count('[$]')
Out[3]:
0    2
1    2
2    1
Name: A, dtype: int64

Taking a cue from @fuglede:
pd.Series([x.count('$') for x in df.A.values.tolist()], df.index)
As pointed out by @jezrael, the above fails when there is a null value, so...
def tc(x):
    try:
        return x.count('$')
    except AttributeError:  # null values are floats and have no count method
        return 0

pd.Series([tc(x) for x in df.A.values.tolist()], df.index)
Timings:
import numpy as np

np.random.seed([3, 1415])
df = pd.Series(np.random.randint(0, 100, 100000)) \
       .apply(lambda x: '$' * x).to_frame('A')
df.A.replace('', np.nan, inplace=True)

def tc(x):
    try:
        return x.count('$')
    except AttributeError:
        return 0
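With that setup in place, the comparison above might be run along these lines (a sketch of the calls only; timing results will vary by machine and pandas version):

%timeit df.A.str.count(r'\$')
%timeit pd.Series([tc(x) for x in df.A.values.tolist()], df.index)
%timeit df.A.apply(tc)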

Related

How Do I Input Message Data Into a DataFrame Using pandas? [duplicate]

I have the following DataFrame:
from pandas import *
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})
It looks like this:
   bar foo
0    1   a
1    2   b
2    3   c
Now I want to have something like:
      bar
0  1 is a
1  2 is b
2  3 is c
How can I achieve this?
I tried the following:
df['foo'] = '%s is %s' % (df['bar'], df['foo'])
but it gives me a wrong result:
>>> print df.ix[0]
bar a
foo 0 a
1 b
2 c
Name: bar is 0 1
1 2
2
Name: 0
Sorry for a dumb question, but this one pandas: combine two columns in a DataFrame wasn't helpful for me.
df['bar'] = df.bar.map(str) + " is " + df.foo
This question has already been answered, but I believe it would be good to throw some useful methods not previously discussed into the mix, and compare all methods proposed thus far in terms of performance.
Here are some useful solutions to this problem, in increasing order of performance.
DataFrame.agg
This is a simple str.format-based approach.
df['baz'] = df.agg('{0[bar]} is {0[foo]}'.format, axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c
You can also use f-string formatting here:
df['baz'] = df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c
char.array-based Concatenation
Convert the columns to concatenate as chararrays, then add them together.
import numpy as np

a = np.char.array(df['bar'].values)
b = np.char.array(df['foo'].values)
df['baz'] = (a + b' is ' + b).astype(str)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c
List Comprehension with zip
I cannot overstate how underrated list comprehensions are in pandas.
df['baz'] = [str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])]
Alternatively, using str.join to concat (will also scale better):
df['baz'] = [
    ' '.join([str(x), 'is', y]) for x, y in zip(df['bar'], df['foo'])]
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c
List comprehensions excel in string manipulation, because string operations are inherently hard to vectorize, and most pandas "vectorised" functions are basically wrappers around loops. I have written extensively about this topic in For loops with pandas - When should I care?. In general, if you don't have to worry about index alignment, use a list comprehension when dealing with string and regex operations.
The list comp above by default does not handle NaNs. However, you could always write a function wrapping a try-except if you needed to handle it.
def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        return np.nan

df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]
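For instance, on a hypothetical frame with a missing foo value (my own example), the wrapper degrades gracefully:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'foo': ['a', np.nan, 'c'], 'bar': [1, 2, 3]})
df2['baz'] = [try_concat(x, y) for x, y in zip(df2['bar'], df2['foo'])]
# row 1 gets NaN instead of raising: str(2) + ' is ' + nan -> TypeError -> caught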
perfplot Performance Measurements
Graph generated using perfplot. Here's the complete code listing.
Functions
def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)

def danielvelkov(df):
    return df.assign(baz=df.apply(
        lambda x: '%s is %s' % (x['bar'], x['foo']), axis=1))

def chrimuelle(df):
    return df.assign(
        baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is '))

def vladimiryashin(df):
    return df.assign(baz=df.astype(str).apply(lambda x: ' is '.join(x), axis=1))

def erickfis(df):
    return df.assign(
        baz=df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs1_format(df):
    return df.assign(baz=df.agg('{0[bar]} is {0[foo]}'.format, axis=1))

def cs1_fstrings(df):
    return df.assign(baz=df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs2(df):
    a = np.char.array(df['bar'].values)
    b = np.char.array(df['foo'].values)
    return df.assign(baz=(a + b' is ' + b).astype(str))

def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])
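The linked listing isn't reproduced here, but the perfplot driver might look along these lines (a sketch; the setup frame and size range are my assumptions, not the original benchmark's):

import numpy as np
import pandas as pd
import perfplot

perfplot.show(
    # build a frame of n rows to feed each kernel
    setup=lambda n: pd.DataFrame({'foo': ['a'] * n, 'bar': np.arange(n)}),
    kernels=[brenbarn, danielvelkov, chrimuelle, vladimiryashin,
             erickfis, cs1_format, cs1_fstrings, cs2, cs3],
    labels=['brenbarn', 'danielvelkov', 'chrimuelle', 'vladimiryashin',
            'erickfis', 'cs1_format', 'cs1_fstrings', 'cs2', 'cs3'],
    n_range=[2**k for k in range(20)],
    xlabel='number of rows',
    equality_check=None,  # kernels return DataFrames; skip pairwise output comparison
)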
The problem in your code is that you want to apply the operation on every row. The way you've written it though takes the whole 'bar' and 'foo' columns, converts them to strings and gives you back one big string. You can write it like:
df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)
It's longer than the other answer but is more generic (can be used with values that are not strings).
You could also use
df['bar'] = df['bar'].str.cat(df['foo'].values.astype(str), sep=' is ')
df.astype(str).apply(lambda x: ' is '.join(x), axis=1)
0    1 is a
1    2 is b
2    3 is c
dtype: object
series.str.cat is the most flexible way to approach this problem:
For df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})
df.foo.str.cat(df.bar.astype(str), sep=' is ')
0    a is 1
1    b is 2
2    c is 3
Name: foo, dtype: object
OR
df.bar.astype(str).str.cat(df.foo, sep=' is ')
0    1 is a
1    2 is b
2    3 is c
Name: bar, dtype: object
Unlike .join() (which is for joining lists contained in a single Series), this method is for joining two Series together. It also allows you to ignore or replace NaN values as desired.
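The NaN handling mentioned above goes through the na_rep parameter of str.cat; a quick sketch on a toy pair of Series (my own example):

import pandas as pd

s1 = pd.Series(['1', None, '3'])
s2 = pd.Series(['a', 'b', 'c'])
s1.str.cat(s2, sep=' is ')                # row 1 becomes NaN
s1.str.cat(s2, sep=' is ', na_rep='-')    # row 1 becomes '- is b'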
@DanielVelkov's answer is the proper one, BUT
using f-strings (formatted string literals) is faster:
# Daniel's
%timeit df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)
## 963 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# String literals - python 3
%timeit df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
## 849 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I think the most concise solution for arbitrary numbers of columns is a short-form version of this answer:
df.astype(str).apply(' is '.join, axis=1)
You can shave off two more characters with df.agg(), but it's slower:
df.astype(str).agg(' is '.join, axis=1)
It's been 10 years and no one has proposed the simplest and most intuitive way, which is 50% faster than all of the examples proposed over those 10 years.
df.bar.astype(str) + ' is ' + df.foo
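One thing to watch with this approach (my own note, easy to verify): astype(str) renders missing values as the literal string 'nan', so they quietly survive the concatenation rather than propagating as NaN:

import numpy as np
import pandas as pd

df2 = pd.DataFrame({'foo': ['a', 'b'], 'bar': [1.0, np.nan]})
df2.bar.astype(str) + ' is ' + df2.foo   # row 1 -> 'nan is b', not NaN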
I have encountered a specific case of my own with 10^11 rows in my dataframe, and in this case none of the proposed solutions is appropriate. I used categories, and this should work fine in all cases where the number of unique strings is not too large. This is easily done in R with factors (X x Y), but I could not find any other way to do it in Python (I'm new to Python). If anyone knows a place where this is implemented, I'd be glad to know.
import itertools
import pandas as pd

def Create_Interaction_var(df, Varnames):
    '''
    :df        data frame
    :Varnames  list of 2 column names, say "X" and "Y".
               The two columns should be strings or categories.
    Converts the string columns to categories and adds a column with the
    "interaction of X and Y" (X x Y), named "Interaction-X-Y".
    '''
    df.loc[:, Varnames[0]] = df.loc[:, Varnames[0]].astype("category")
    df.loc[:, Varnames[1]] = df.loc[:, Varnames[1]].astype("category")
    CatVar = "Interaction-" + "-".join(Varnames)
    Var0Levels = pd.DataFrame(enumerate(df.loc[:, Varnames[0]].cat.categories)).rename(columns={0: "code0", 1: "name0"})
    Var1Levels = pd.DataFrame(enumerate(df.loc[:, Varnames[1]].cat.categories)).rename(columns={0: "code1", 1: "name1"})
    NbLevels = len(Var0Levels)
    names = pd.DataFrame(list(itertools.product(dict(enumerate(df.loc[:, Varnames[0]].cat.categories)),
                                                dict(enumerate(df.loc[:, Varnames[1]].cat.categories)))),
                         columns=['code0', 'code1']).merge(Var0Levels, on="code0").merge(Var1Levels, on="code1")
    names = names.assign(Interaction=[str(x) + '_' + y for x, y in zip(names["name0"], names["name1"])])
    names["code01"] = names["code0"] + NbLevels * names["code1"]
    df.loc[:, CatVar] = df.loc[:, Varnames[0]].cat.codes + NbLevels * df.loc[:, Varnames[1]].cat.codes
    df.loc[:, CatVar] = df[[CatVar]].replace(names.set_index("code01")[["Interaction"]].to_dict()['Interaction'])[CatVar]
    df.loc[:, CatVar] = df.loc[:, CatVar].astype("category")
    return df
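A small sanity check on a toy frame (my own example, not the author's):

import pandas as pd

df = pd.DataFrame({'X': ['a', 'b', 'a'], 'Y': ['u', 'u', 'v']})
df = Create_Interaction_var(df, ['X', 'Y'])
print(df['Interaction-X-Y'].tolist())   # ['a_u', 'b_u', 'a_v']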
from pandas import *
x = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})
x
x['bar'] = x.bar.astype("str") + " " + "is" + " " + x.foo
x.drop(['foo'], axis=1)

Removing leading zeros from pandas.core.series.Series

I have a pandas.core.series.Series with data
0    [00115840, 00110005, 001000033, 00116000...
1    [00267285, 00263627, 00267010, 0026513...
2    [00335595, 00350750]
I want to remove the leading zeros from the series. I tried
x.astype('int64')
But got error message
ValueError: setting an array element with a sequence.
Can you suggest me how to do this in python 3.x?
s = pd.Series(s.apply(pd.Series).astype(int).values.tolist())
s
Out[282]:
0    [1, 2]
1    [3, 4]
dtype: object
Data input:
s = pd.Series([['001','002'],['003','004']])
Update: thanks to Jez and cold for pointing it out :-)
pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
Out[317]:
0    [115840, 110005, 1000033, 116000]
1    [267285, 263627, 267010, 26513]
2    [335595, 350750]
dtype: object
If you want to convert the lists of strings to lists of integers, use a list comprehension:
s = pd.Series([[int(y) for y in x] for x in s], index=s.index)
s = s.apply(lambda x: [int(y) for y in x])
Sample:
a = [['00115840', '00110005', '001000033', '00116000'],
     ['00267285', '00263627', '00267010', '0026513'],
     ['00335595', '00350750']]
s = pd.Series(a)
print(s)
0    [00115840, 00110005, 001000033, 00116000]
1    [00267285, 00263627, 00267010, 0026513]
2    [00335595, 00350750]
dtype: object

s = s.apply(lambda x: [int(y) for y in x])
print(s)
0    [115840, 110005, 1000033, 116000]
1    [267285, 263627, 267010, 26513]
2    [335595, 350750]
dtype: object
EDIT:
If you want integers only, you can flatten the values and cast to int:
s = pd.Series([item for sublist in s for item in sublist]).astype(int)
Alternative solution:
import itertools
s = pd.Series(list(itertools.chain(*s))).astype(int)
print(s)
0     115840
1     110005
2    1000033
3     116000
4     267285
5     263627
6     267010
7      26513
8     335595
9     350750
dtype: int32
Timings:
a = [['00115840', '00110005', '001000033', '00116000'],
     ['00267285', '00263627', '00267010', '0026513'],
     ['00335595', '00350750']]
s = pd.Series(a)
s = pd.concat([s]*1000).reset_index(drop=True)
In [203]: %timeit pd.Series([[int(y) for y in x] for x in s], index=s.index)
100 loops, best of 3: 4.66 ms per loop
In [204]: %timeit s.apply(lambda x: [int(y) for y in x])
100 loops, best of 3: 5.13 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ sol
In [205]: %%timeit
     ...: v = pd.Series(np.concatenate(s.values.tolist()))
     ...: v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
     ...:
1 loop, best of 3: 226 ms per loop
#Wen solution
In [211]: %timeit pd.Series(s.apply(pd.Series).stack().astype(int).groupby(level=0).apply(list))
1 loop, best of 3: 1.12 s per loop
Solutions with flattening (idea of @cᴏʟᴅsᴘᴇᴇᴅ):
In [208]: %timeit pd.Series([item for sublist in s for item in sublist]).astype(int)
100 loops, best of 3: 2.55 ms per loop
In [209]: %timeit pd.Series(list(itertools.chain(*s))).astype(int)
100 loops, best of 3: 2.2 ms per loop
#cᴏʟᴅsᴘᴇᴇᴅ sol
In [210]: %timeit pd.Series(np.concatenate(s.values.tolist()))
100 loops, best of 3: 7.71 ms per loop
Flatten your data with np.concatenate -
s
0    [00115840, 36869, 262171, 39936]
1    [00267285, 92055, 93704, 11595]
2    [00335595, 119272]
Name: 1, dtype: object
v = pd.Series(np.concatenate(s.tolist()))
Or (thanks to jezrael for the suggestion), using .values.tolist which is faster -
v = pd.Series(np.concatenate(s.values.tolist()))
v
0    00115840
1       36869
2      262171
3       39936
4    00267285
5       92055
6       93704
7       11595
8    00335595
9      119272
dtype: object
Now, what you're doing with astype should work -
v.astype(int)
0     115840
1      36869
2     262171
3      39936
4     267285
5      92055
6      93704
7      11595
8     335595
9     119272
dtype: int64
If you have data as floats, use astype(float) instead.
If you want to, you could reshape the result back to its original format using groupby + agg -
v.astype(int).groupby(s.index.repeat(s.str.len())).agg(pd.Series.tolist)
0    [115840, 36869, 262171, 39936]
1    [267285, 92055, 93704, 11595]
2    [335595, 119272]
dtype: object
If you want a crisper solution, you could try the following, assuming a is the original series:
b = a.explode().astype(int)
a = b.groupby(b.index).agg(list)
This is slower, though, than the solutions posted by @cs95 and @jezrael.
# where x is a series
x = x.str.lstrip('0')
The line below should work if you have a mixed dtype:
df['col'] = df['col'].apply(lambda x: x.lstrip('0') if isinstance(x, str) else x)
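Note one edge case with lstrip (worth a quick check on your data): it strips every leading zero, so a value consisting only of zeros collapses to the empty string:

import pandas as pd

pd.Series(['007', '0', '010']).str.lstrip('0')
# 0     7
# 1          <- '0' becomes ''
# 2    10
# dtype: object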
