This is how I currently do it:
# Turn all table elements to strings
df = df.astype(str)
df.columns = df.columns.map(str)
df.index = df.index.map(str)
Is there a one-liner that will turn the df data, columns and indices into strings?
Update
Out of curiosity I timed the various answers.
My method: 909 µs ± 37.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#Wen's method: 749 µs ± 41.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
#COLDSPEED's method: 732 ns ± 44.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Hence the accepted answer.
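For anyone who wants to reproduce numbers like these, a minimal sketch of a plausible setup (the actual frame behind the timings isn't shown, so its shape is an assumption):

import numpy as np
import pandas as pd

# Assumed test frame; time each of the answers below against it in IPython with %timeit.
df = pd.DataFrame(np.random.randn(100, 10))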
This isn't a bad question at all. Well, there's the obvious astype solution by #Wen. There are a couple of innovative solutions as well.
Let's try something a bit more interesting with operator.methodcaller.
from operator import methodcaller
df, df.columns, df.index = map(
methodcaller('astype', str), (df, df.columns, df.index)
)
Since you mentioned a one-liner, you can recreate your df:
new_df = pd.DataFrame(
data=df.values.astype(str),
columns=df.columns.astype(str),
index=df.index.astype(str)
)
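A quick sanity check (sketch) that the rebuilt frame really is all strings:

# Cell values, column labels and index labels should all be Python strings (object dtype).
print(new_df.dtypes.unique())                     # [dtype('O')]
print(new_df.columns.dtype, new_df.index.dtype)   # object object
print(type(new_df.index[0]))                      # <class 'str'>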
Related
I am looking for a more efficient way of applying the code below. It is functional, but when I run it on very large dataframes it becomes quite slow. Is there a more efficient way to do functionally the same thing? The output is just a unique column created from a list of columns, like col1_col2_col3.
df['unique_thread'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
Try with
df['unique_thread'] = df[cols].astype(str).agg('_'.join, axis=1)
Before you try out BENY's answer, I would suggest you have a look at these timing results:
In [15]: %timeit a[0] + '_' + a[1]
250 µs ± 5.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [16]: %timeit a.agg('_'.join,1)
675 µs ± 12.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since you already know that you have to operate on 3 columns, you should do something like this:
df['unique_thread'] = df[cols[0]] + '_' + df[cols[1]] + '_' + df[cols[2]]
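If the number of columns isn't fixed at three, a sketch of the same idea generalized with functools.reduce (assuming cols is the list of column names, as in the question):

from functools import reduce

# Chain the '_' concatenation column by column instead of hard-coding three columns;
# each step is still plain vectorized Series addition.
df['unique_thread'] = reduce(
    lambda acc, c: acc + '_' + df[c].astype(str),
    cols[1:],
    df[cols[0]].astype(str),
)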
I have a pandas dataframe with two columns, where I need to check whether the value at each row of column A is a string that starts with the value of the corresponding row of column B, or vice versa.
It seems that the Series method .str.startswith cannot deal with vectorized input, so I needed to zip over the two columns in a list comprehension and create a new pd.Series with the same index as either of the two columns.
I would like this to be a vectorized operation with the .str accessor available to operate on iterables, but something like this returns NaN:
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df['a'].str.startswith(df['b'])
while my working solution is the following:
pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
I suspect that there may be a better way to tackle this issue, as it would also benefit all string methods on Series.
Is there any more beautiful or efficient method to do this?
One idea is to use np.vectorize, but because it still calls Python string methods element by element, performance is only a bit better than your solution:
def fun(a, b):
    return a.startswith(b) or b.startswith(a)
f = np.vectorize(fun)
a = pd.Series(f(df['a'],df['b']), index=df.index)
print (a)
0 True
1 False
dtype: bool
df = pd.DataFrame(data={'a':['x','yy'], 'b':['xyz','uvw']})
df = pd.concat([df] * 10000, ignore_index=True)
In [132]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in df[['a', 'b']].to_numpy()])
42.3 ms ± 516 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [133]: %timeit pd.Series(f(df['a'],df['b']), index=df.index)
9.81 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit pd.Series(index=df.index, data=[a.startswith(b) or b.startswith(a) for a,b in zip(df['a'],df['b'])])
14.1 ms ± 262 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#sammywemmy solution
In [135]: %timeit pd.Series([any((a.startswith(b), b.startswith(a))) for a, b in df.to_numpy()], index=df.index)
46.3 ms ± 683 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
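If the check is needed in more than one place, a small sketch wrapping the vectorized version in a helper (the name mutual_startswith is just illustrative):

import numpy as np
import pandas as pd

def mutual_startswith(left, right):
    # True where either string starts with the other; expects two aligned Series of str.
    f = np.vectorize(lambda a, b: a.startswith(b) or b.startswith(a))
    return pd.Series(f(left, right), index=left.index)

df = pd.DataFrame(data={'a': ['x', 'yy'], 'b': ['xyz', 'uvw']})
print(mutual_startswith(df['a'], df['b']))
# 0     True
# 1    False
# dtype: bool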
Currently I am generating a unique uuid in each row using a loop, like this:
df['uuid'] = df.apply(lambda x: uuid.uuid4(), axis=1)
Is there a way to do this without a loop?
It's still a loop, but it is a little faster than the current approach:
df['uuid'] = [uuid.uuid4() for x in range(df.shape[0])]
24.4 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Using apply
df['uuid'] = df.apply(lambda x: uuid.uuid4(), axis=1)
1.25 ms ± 39.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
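uuid.uuid4() has to be called once per row either way, so the gain comes purely from skipping the apply machinery; a minimal self-contained sketch (the column name x is just illustrative):

import uuid
import pandas as pd

df = pd.DataFrame({'x': range(5)})
df['uuid'] = [uuid.uuid4() for _ in range(len(df))]
print(df['uuid'].is_unique)  # True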
Data
pb = {"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222"},"mark_up":{"0":1.2987,"1":1.5625,"2":1.3698,"3":1.3333,"4":1.4589}}
data = {"id":{"0":"K69","1":"K70","2":"K71","3":"K72","4":"K73","5":"K74","6":"K75","7":"K79","8":"K86","9":"K100"},"cost":{"0":29.74,"1":9.42,"2":9.42,"3":9.42,"4":9.48,"5":9.48,"6":24.36,"7":5.16,"8":9.8,"9":3.28},"mark_up_id":{"0":"123","1":"456","2":"789","3":"111","4":"222","5":"333","6":"444","7":"555","8":"666","9":"777"}}
pb = pd.DataFrame(data=pb).set_index('mark_up_id')
df = pd.DataFrame(data=data)
Expected Output
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
Or..
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
Question
Is there a function that does the above, or is this the best way to go about this type of operation?
I would use the second solution you propose, or better, this:
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
I think using update can be inconvenient because it doesn't return anything; Series.fillna is more flexible in that respect.
We can also use DataFrame.assign in order to keep working on the DataFrame that the assignment returns:
df.assign( Cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']) )
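The practical benefit of assign is that it returns the modified frame, so further steps can be chained; a sketch using a lowercase cost key so the existing column is overwritten (the trailing sort_values is only there to show the chaining):

result = (
    df.assign(cost=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost']))
      .sort_values('cost', ascending=False)
)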
Time comparison with the join method
%%timeit
df['cost']=(df['mark_up_id'].map(pb['mark_up']) + df['cost']).fillna(df['cost'])
#945 µs ± 46 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
test = df.join(pb, on='mark_up_id', how='left')
test['cost'].update(test['cost'] + test['mark_up'])
test.drop('mark_up',axis=1,inplace=True)
#3.59 ms ± 137 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
slow..
%%timeit
df['cost'].update(df['mark_up_id'].map(pb['mark_up']) + df['cost'])
#985 µs ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, I recommend you read: Understanding inplace and When should I use apply.
I've run across some legacy code with data stored as a single-row pd.DataFrame.
My intuition would be that working with a pd.Series would be faster in this case - I don't know exactly how pandas optimizes internally, but I know that it can and does.
Is my intuition correct? Or is there no significant difference for most actions?
(to clarify - obviously the best practice would not be a single row DataFrame, but I'm asking about performance)
Yes, for a large number of columns there will be a noticeable impact on performance.
You should consider that a DataFrame is essentially a dict of Series, so when you perform an operation on the single row, pandas has to coalesce all the column values first before performing the operation.
Even for 100 elements you can see there is a hit:
s = pd.Series(np.random.randn(100))
df = pd.DataFrame(np.random.randn(1,100))
%timeit s.sum()
%timeit df.sum(axis=1)
104 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
194 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In my opinion there is no reason to have a single-row df; the same can be achieved with a Series whose index values are the column names of that df.
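If you do inherit such a frame, a minimal sketch of collapsing it into a Series keyed by the column names:

# Either call turns a 1-row DataFrame into a Series whose index is the original columns.
s = df.iloc[0]
s = df.squeeze(axis=0)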
The performance degradation isn't linear: for a 10k-element array it's still not quite 2x worse:
s = pd.Series(np.random.randn(10000))
df = pd.DataFrame(np.random.randn(1,10000))
%timeit s.sum()
%timeit df.sum(axis=1)
149 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)