Pandas DataFrame to Dict Format with new Keys

What would be the best way to convert this:
                               deviceid devicetype
0  b569dcb7-4498-4cb4-81be-333a7f89e65f     Google
1  04d3b752-f7a1-42ae-8e8a-9322cda4fd7f    Android
2  cf7391c5-a82f-4889-8d9e-0a423f132026    Android
into this:
0 {"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":["Google"]}
1 {"deviceid":"04d3b752-f7a1-42ae-8e8a-9322cda4fd7f","devicetype":["Android"]}
2 {"deviceid":"cf7391c5-a82f-4889-8d9e-0a423f132026","devicetype":["Android"]}
I've tried df.to_dict() but that just gives:
{'deviceid': {0: 'b569dcb7-4498-4cb4-81be-333a7f89e65f',
1: '04d3b752-f7a1-42ae-8e8a-9322cda4fd7f',
2: 'cf7391c5-a82f-4889-8d9e-0a423f132026'},
'devicetype': {0: 'Google', 1: 'Android', 2: 'Android'}}

You can use apply with to_json:
In [11]: s = df.apply((lambda x: x.to_json()), axis=1)
In [12]: s[0]
Out[12]: '{"deviceid":"b569dcb7-4498-4cb4-81be-333a7f89e65f","devicetype":"Google"}'
To get the list for the device type you could do this manually:
In [13]: s1 = df.apply((lambda x: {"deviceid": x["deviceid"], "devicetype": [x["devicetype"]]}), axis=1)
In [14]: s1[0]
Out[14]: {'deviceid': 'b569dcb7-4498-4cb4-81be-333a7f89e65f', 'devicetype': ['Google']}
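If a plain list of row dicts is acceptable, to_dict('records') produces one dict per row directly; wrapping devicetype in a one-element list first gives the target shape. A minimal sketch using the sample data above:

import pandas as pd

df = pd.DataFrame({
    'deviceid': ['b569dcb7-4498-4cb4-81be-333a7f89e65f',
                 '04d3b752-f7a1-42ae-8e8a-9322cda4fd7f',
                 'cf7391c5-a82f-4889-8d9e-0a423f132026'],
    'devicetype': ['Google', 'Android', 'Android'],
})

# Wrap each devicetype in a one-element list, then emit one dict per row.
records = df.assign(devicetype=[[t] for t in df['devicetype']]).to_dict('records')
records[0]
# {'deviceid': 'b569dcb7-4498-4cb4-81be-333a7f89e65f', 'devicetype': ['Google']}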

To expand on the previous answer, to_dict() should be a little faster than to_json().
This appears to be true for a larger test DataFrame, but the to_dict() method is actually a little slower for the example you provided.
Large test set
In [1]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
100 loops, best of 3: 5.88 ms per loop
In [2]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
100 loops, best of 3: 3.91 ms per loop
Provided example
In [3]: %timeit s = df.apply((lambda x: x.to_json()), axis=1)
1000 loops, best of 3: 375 µs per loop
In [4]: %timeit s = df.apply((lambda x: x.to_dict()), axis=1)
1000 loops, best of 3: 450 µs per loop
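Both apply variants above construct a Series object per row, which accounts for much of the runtime. If speed matters at scale, a plain comprehension over zip of the two columns sidesteps apply entirely (a minimal sketch, not timed here):

records = [{'deviceid': d, 'devicetype': [t]}
           for d, t in zip(df['deviceid'], df['devicetype'])]
s = pd.Series(records, index=df.index)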

Related

Name of first Pandas column with sorted list

Suppose I have the following dataframe:
df = pd.DataFrame({'AB': [['ab', 'ef', 'bd'], ['abc', 'efg', 'cd'], ['bd', 'aaaa']],
                   'CD': [['xy', 'gh'], ['trs', 'abc'], ['ab', 'bcd', 'efg']],
                   'EF': [['uxyz', 'abc'], ['peter', 'adam'], ['ab', 'zz', 'bd']]})
df
               AB              CD             EF
0    [ab, ef, bd]        [xy, gh]    [uxyz, abc]
1  [abc, efg, cd]      [trs, abc]  [peter, adam]
2      [bd, aaaa]  [ab, bcd, efg]   [ab, zz, bd]
I want to extract the column which contains a sorted list. In this case it is CD, since ['ab','bcd','efg'] is sorted in ascending order. It is guaranteed that no list is empty and that each list contains at least two elements. I am stuck on how to combine applymap and the sort function in pandas.
I tried to come up with the solution from here but couldn't figure out a way to combine applymap and sort.
I am working in Python 2.7 with pandas.
Use applymap with sorted
In [2078]: df.applymap(sorted).eq(df).any()
Out[2078]:
AB False
CD True
EF False
dtype: bool
Get result into a list
In [2081]: cond = df.applymap(sorted).eq(df).any()
In [2082]: cond[cond].index
Out[2082]: Index([u'CD'], dtype='object')
In [2083]: cond[cond].index.tolist()
Out[2083]: ['CD']
If you need the specific columns with their data:
In [2086]: df.loc[:, cond]
Out[2086]:
CD
0 [xy, gh]
1 [trs, abc]
2 [ab, bcd, efg]
And, to get the first matching column name:
In [2092]: cond[cond].index[0]
Out[2092]: 'CD'
Use applymap and, for filtering the columns, loc:
df = df.loc[:, df.applymap(lambda x: sorted(x) == x).any()]
print (df)
CD
0 [xy, gh]
1 [trs, abc]
2 [ab, bcd, efg]
And for column names:
a = df.applymap(lambda x: sorted(x) == x).any()
print (a)
AB False
CD True
EF False
dtype: bool
L = a.index[a].tolist()
print (L)
['CD']
Timings
Conclusion - df.applymap(lambda x: sorted(x) == x) is approximately as fast as df.applymap(sorted) == df:
#3k rows
df = pd.concat([df]*1000).reset_index(drop=True)
In [134]: %timeit df.applymap(lambda x: sorted(x) == x)
100 loops, best of 3: 8.08 ms per loop
In [135]: %timeit df.applymap(sorted).eq(df)
100 loops, best of 3: 9.96 ms per loop
In [136]: %timeit df.applymap(sorted) == df
100 loops, best of 3: 9.84 ms per loop
In [137]: %timeit df.applymap(lambda x: (np.asarray(x[:-1]) <= np.asarray(x[1:])))
10 loops, best of 3: 62 ms per loop
#30k rows
df = pd.concat([df]*10000).reset_index(drop=True)
In [126]: %timeit df.applymap(lambda x: sorted(x) == x)
10 loops, best of 3: 77.5 ms per loop
In [127]: %timeit df.applymap(sorted).eq(df)
10 loops, best of 3: 81.1 ms per loop
In [128]: %timeit df.applymap(sorted) == df
10 loops, best of 3: 75.7 ms per loop
In [129]: %timeit df.applymap(lambda x: (np.asarray(x[:-1]) <= np.asarray(x[1:])))
1 loop, best of 3: 617 ms per loop
#300k rows
df = pd.concat([df]*100000).reset_index(drop=True)
In [131]: %timeit df.applymap(lambda x: sorted(x) == x)
1 loop, best of 3: 750 ms per loop
In [132]: %timeit df.applymap(sorted).eq(df)
1 loop, best of 3: 801 ms per loop
In [133]: %timeit df.applymap(sorted) == df
1 loop, best of 3: 744 ms per loop
In [134]: %timeit df.applymap(lambda x: (np.asarray(x[:-1]) <= np.asarray(x[1:])))
1 loop, best of 3: 6.25 s per loop
Checking for sortedness without sorting (though, per the timings above, the per-cell array construction makes this slower than plain sorted for short lists, despite the better complexity):
import numpy as np

# Compare each element with its right neighbour; avoids the O(n log n) sort.
is_sorted = lambda x: (np.asarray(x[:-1]) <= np.asarray(x[1:])).all()
df.applymap(is_sorted).any()
AB False
CD True
EF False
dtype: bool
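For completeness, the same check can be written without applymap as a plain comprehension over the columns (a minimal sketch, equivalent to the .any() logic above):

sorted_cols = [c for c in df.columns
               if any(sorted(lst) == lst for lst in df[c])]
sorted_cols
# ['CD']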

Is there a more efficient and elegant way to filter pandas index by date?

I often use DatetimeIndex.date, especially in groupby methods. However, DatetimeIndex.date is slow compared to DatetimeIndex.year/month/day. From what I understand, this is because the .date attribute builds an array of Python datetime.date objects over the index, while index.year/month/day just return integer arrays. I have made a small example function that performs a bit better and would speed up some of my code (at least for finding the values in a groupby), but I feel that there must be a better way:
In [217]: index = pd.date_range('2011-01-01', periods=100000, freq='h')
In [218]: data = np.random.rand(len(index))
In [219]: df = pd.DataFrame({'data':data},index)
In [220]: def func(df):
...: groupby = df.groupby([df.index.year, df.index.month, df.index.day]).mean()
...: index = pd.date_range(df.index[0], periods = len(groupby), freq='D')
...: groupby.index = index
...: return groupby
...:
In [221]: df.groupby(df.index.date).mean().equals(func(df))
Out[221]: True
In [222]: df.groupby(df.index.date).mean().index.equals(func(df).index)
Out[222]: True
In [223]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.32 s per loop
In [224]: %timeit func(df)
10 loops, best of 3: 89.2 ms per loop
Does the pandas/index have a similar functionality that I am not finding?
You can even improve it a little bit:
In [69]: %timeit func(df)
10 loops, best of 3: 84.3 ms per loop
In [70]: %timeit df.groupby(pd.TimeGrouper('1D')).mean()
100 loops, best of 3: 6 ms per loop
In [84]: %timeit df.groupby(pd.Grouper(level=0, freq='1D')).mean()
100 loops, best of 3: 6.48 ms per loop
In [71]: (func(df) == df.groupby(pd.TimeGrouper('1D')).mean()).all()
Out[71]:
data True
dtype: bool
(Note: the pd.TimeGrouper used above was later deprecated in favor of the equivalent pd.Grouper, shown in In [84].)
Another solution - using the DataFrame.resample() method:
In [73]: (df.resample('1D').mean() == func(df)).all()
Out[73]:
data True
dtype: bool
In [74]: %timeit df.resample('1D').mean()
100 loops, best of 3: 6.63 ms per loop
UPDATE: grouping by the string:
In [75]: %timeit df.groupby(df.index.strftime('%Y%m%d')).mean()
1 loop, best of 3: 2.6 s per loop
In [76]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.07 s per loop
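One more vectorized option, for the record: DatetimeIndex.normalize() truncates every timestamp to midnight, so grouping on it buckets by day while keeping a proper DatetimeIndex on the result (a minimal sketch, not timed above):

daily = df.groupby(df.index.normalize()).mean()
# same daily means as df.groupby(df.index.date).mean(),
# but the result keeps a DatetimeIndex instead of object-dtype dates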

Difference between df.loc['col name'], df.loc[index]['col name'] and df.loc[index, 'col name'] in pandas?

I have a dataframe df with a column named 'Store'. If I want to retrieve the column, the following lines work equally well: df['Store'] or df[:]['Store'] or df.loc[:, 'Store'].
What is the difference between these? And should one be used over the others?
Thank you.
df.loc[index, 'col name'] is more idiomatic and preferred, especially if you want to filter rows
Demo: for a DataFrame of shape 1,000,000 x 3
In [26]: df = pd.DataFrame(np.random.rand(10**6,3), columns=list('abc'))
In [27]: %timeit df[df.a < 0.5]['a']
10 loops, best of 3: 45.8 ms per loop
In [28]: %timeit df.loc[df.a < 0.5]['a']
10 loops, best of 3: 45.8 ms per loop
In [29]: %timeit df.loc[df.a < 0.5, 'a']
10 loops, best of 3: 37 ms per loop
For a construction where you need only one column and don't filter rows, like df[:]['Store'], it's better to simply use df['Store']:
In [30]: %timeit df[:]['a']
1000 loops, best of 3: 436 µs per loop
In [31]: %timeit df.loc[:]['a']
10000 loops, best of 3: 25.9 µs per loop
In [36]: %timeit df['a'].loc[:]
10000 loops, best of 3: 26.5 µs per loop
In [32]: %timeit df.loc[:, 'a']
10000 loops, best of 3: 126 µs per loop
In [33]: %timeit df['a']
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.17 µs per loop
Unconditional access of multiple columns:
In [34]: %timeit df[['a','b']]
10 loops, best of 3: 22 ms per loop
In [35]: %timeit df.loc[:, ['a','b']]
10 loops, best of 3: 22.6 ms per loop
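Beyond speed, the single .loc[rows, col] call matters for assignment: chained indexing can write to a temporary copy and trigger SettingWithCopyWarning, leaving the original frame unchanged. A minimal sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 3), columns=list('abc'))

# Chained indexing: the assignment may land on a temporary copy
# and raise SettingWithCopyWarning.
df[df.a < 0.5]['a'] = 0

# One .loc call: unambiguously writes into df itself.
df.loc[df.a < 0.5, 'a'] = 0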

Turn pandas series to series of lists or numpy array to array of lists

I have a series s
s = pd.Series([1, 2])
What is an efficient way to make s look like
0 [1]
1 [2]
dtype: object
Here's one approach that extracts into array and extends to 2D by introducing a new axis with None/np.newaxis -
pd.Series(s.values[:,None].tolist())
Here's a similar one, but extends to 2D by reshaping -
pd.Series(s.values.reshape(-1,1).tolist())
Runtime test using #P-robot's setup -
In [43]: s = pd.Series(np.random.randint(1,10,1000))
In [44]: %timeit pd.Series(np.vstack(s.values).tolist()) # #Nickil Maveli's soln
100 loops, best of 3: 5.77 ms per loop
In [45]: %timeit pd.Series([[a] for a in s]) # #P-robot's soln
1000 loops, best of 3: 412 µs per loop
In [46]: %timeit s.apply(lambda x: [x]) # #mgc's soln
1000 loops, best of 3: 551 µs per loop
In [47]: %timeit pd.Series(s.values[:,None].tolist()) # Approach1
1000 loops, best of 3: 307 µs per loop
In [48]: %timeit pd.Series(s.values.reshape(-1,1).tolist()) # Approach2
1000 loops, best of 3: 306 µs per loop
If you want the result to still be a pandas Series you can use the apply method:
In [1]: import pandas as pd
In [2]: s = pd.Series([1, 2])
In [3]: s.apply(lambda x: [x])
Out[3]:
0 [1]
1 [2]
dtype: object
This does it:
import numpy as np
np.array([[a] for a in s],dtype=object)
array([[1],
[2]], dtype=object)
Adjusting atomh33ls' answer, here's a series of lists:
output = pd.Series([[a] for a in s])
type(output)
>> pandas.core.series.Series
type(output[0])
>> list
Timings for a selection of the suggestions:
import numpy as np, pandas as pd
s = pd.Series(np.random.randint(1,10,1000))
>> %timeit pd.Series(np.vstack(s.values).tolist())
100 loops, best of 3: 3.2 ms per loop
>> %timeit pd.Series([[a] for a in s])
1000 loops, best of 3: 393 µs per loop
>> %timeit s.apply(lambda x: [x])
1000 loops, best of 3: 473 µs per loop
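One caveat with the list-comprehension variants: pd.Series([[a] for a in s]) builds a fresh default integer index, whereas s.apply(lambda x: [x]) preserves the original index. To keep the index with the faster comprehension, pass it explicitly (a small sketch):

output = pd.Series([[a] for a in s], index=s.index)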

Vectorizing a very simple pandas lambda function in apply

pandas apply/map is my nemesis; even on small datasets it can be agonizingly slow. Below is a very simple example with nearly a three-order-of-magnitude difference in speed. I create a Series with 1 million values and simply want to map values greater than .5 to 'Yes' and the rest to 'No'. How do I vectorize this or speed it up significantly?
ser = pd.Series(np.random.rand(1000000))
# vectorized and fast
%%timeit
ser > .5
1000 loops, best of 3: 477 µs per loop
%%timeit
ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 255 ms per loop
np.where(cond, A, B) is the vectorized equivalent of A if cond else B:
import numpy as np
import pandas as pd
ser = pd.Series(np.random.rand(1000000))
mask = ser > 0.5
result = pd.Series(np.where(mask, 'Yes', 'No'))
expected = ser.map(lambda x: 'Yes' if x > .5 else 'No')
assert result.equals(expected)
In [77]: %timeit mask = ser > 0.5
1000 loops, best of 3: 1.44 ms per loop
In [76]: %timeit np.where(mask, 'Yes', 'No')
100 loops, best of 3: 14.8 ms per loop
In [73]: %timeit pd.Series(np.where(mask, 'Yes', 'No'))
10 loops, best of 3: 86.5 ms per loop
In [74]: %timeit ser.map(lambda x: 'Yes' if x > .5 else 'No')
1 loop, best of 3: 223 ms per loop
Since this Series only takes on two distinct values, you might consider using a Categorical instead:
In [94]: cat = pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes']); cat
Out[94]:
[No, Yes, No, Yes, Yes, ..., Yes, No, Yes, Yes, No]
Length: 1000000
Categories (2, object): [No, Yes]
In [95]: %timeit pd.Categorical.from_codes(codes=mask.astype(int), categories=['No', 'Yes'])
100 loops, best of 3: 6.26 ms per loop
(mask.astype(int) maps False to 0 and True to 1, so the categories must be listed as ['No', 'Yes'] for values above .5 to become 'Yes'.)
Not only is this faster, it is more memory efficient since it avoids creating the array of strings. The category codes are an array of ints which map to categories:
In [96]: cat.codes
Out[96]: array([0, 1, 0, ..., 1, 1, 0], dtype=int8)
In [97]: cat.categories
Out[97]: Index(['No', 'Yes'], dtype='object')
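If a full Series is needed rather than a bare Categorical, wrapping it is cheap since the int8 codes are reused, and the memory advantage over an object-dtype string Series is easy to verify with memory_usage (a small sketch):

result = pd.Series(cat)                # categorical-dtype Series
result.memory_usage(deep=True)         # roughly 1 MB of int8 codes plus two category strings
pd.Series(np.where(mask, 'Yes', 'No')).memory_usage(deep=True)   # far larger: one Python string per row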
