Check whether start and end are available in the data using Python pandas

df1
id start end data
1 2001 2004 [[2004,1],[2003,2],[2002,6],[2001,0.9]]
2 2001 2004 [[2005,1],[2003,2],[2002,6],[2001,0.9]]
3 2001 2004 [[2004,1],[2003,2],[2002,6]]
output
id missed_one
2 2004
3 2001
That is the desired output. For each row I have to check every year from start to end against the years available in data; if any year is missing, it should appear in the output.
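For reference, the sample frame (df1 above, simply df in the answers below) can be rebuilt roughly like this (a sketch; dtypes are assumed):
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3],
    'start': [2001, 2001, 2001],
    'end': [2004, 2004, 2004],
    'data': [[[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]],
             [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]],
             [[2004, 1], [2003, 2], [2002, 6]]],
})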

You can use set differencing
df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
1 {}
2 {2004}
3 {2001}
dtype: object
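To get the requested id/missed_one layout from that result, a minimal sketch (res being the Series shown above):
res = df[['start', 'end']].agg(set, 1) - df.data.transform(lambda k: set([item for z in k for item in z]))
out = pd.DataFrame([(i, y) for i, s in zip(df['id'], res) for y in s],
                   columns=['id', 'missed_one'])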

Using a list comprehension and zip:
import numpy as np

out = df.assign(missing=[
    [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
So if you want only rows with a year missing:
out.loc[out.missing.notnull()]
id start end data missing
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] [2004]
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] [2001]
If you only want to show a single missing value, and not a list of missing values, you can use next:
df.assign(missing=[
    next((i for i in range(start, end+1) if i not in {d for d, _ in datum}), np.nan)
    for datum, start, end in zip(df.data, df.start, df.end)
])
id start end data missing
0 1 2001 2004 [[2004, 1], [2003, 2], [2002, 6], [2001, 0.9]] NaN
1 2 2001 2004 [[2005, 1], [2003, 2], [2002, 6], [2001, 0.9]] 2004.0
2 3 2001 2004 [[2004, 1], [2003, 2], [2002, 6]] 2001.0
Some timings:
df = pd.concat([df]*10000)
In [145]: %%timeit
...: out = df.assign(missing=[
...:     [i for i in range(start, end+1) if i not in {d for d, _ in datum}] or np.nan
...:     for datum, start, end in zip(df.data, df.start, df.end)
...: ])
...:
72.3 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [146]: %%timeit
...: df[['start', 'end']].agg(set,1) - df.data.transform(lambda k: set([item for z in k for item in z]))
...:
503 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

You can subtract sets:
# if necessary, convert the strings to nested lists first
import ast
df['data'] = df['data'].apply(ast.literal_eval)
df = df.set_index('id')
ranges = df[['start', 'end']].apply(lambda x: set(range(x['start'], x['end'] + 1)), axis=1)
data = df['data'].apply(lambda k: set([z[0] for z in k]))
out = (ranges - data).to_dict()
print (out)
{1: set(), 2: {2004}, 3: {2001}}
df1 = pd.DataFrame([(k, v1) for k, v in out.items() for v1 in v], columns=['id','missed_one'])
print (df1)
id missed_one
0 2 2004
1 3 2001
Details:
print (ranges)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2004}
3 {2001, 2002, 2003, 2004}
print (data)
id
1 {2001, 2002, 2003, 2004}
2 {2001, 2002, 2003, 2005}
3 {2002, 2003, 2004}
Name: data, dtype: object

Related

How to perform a groupby operation on a pandas Dataframe where the average over a list column is taken?

I have a pandas DataFrame like below:
df = pd.DataFrame({"A": [1, 1, 1, 2, 2],
"B": ["apple", "apple", "banana", "pineapple", "pineapple"],
"C": [[6, 5, 2], [2, 10, 2], [5, 37, 1], [4, 19, 2], [1, 5, 1]]})
Now I want to perform a groupby-operation on columns A and B, and get the average of the lists in column C. The average of multiple lists is defined as an element-wise average, so the average of all 1st elements in the 1st position of the list, the average of all 2nd elements in the second position of the list and so on...
The desired output for this example looks like this:
A B C
1 apple [4, 7.5, 2]
1 banana [5, 37, 1]
2 pineapple [2.5, 12, 1.5]
(It is always guaranteed that the lists for each group have the same length)
How to solve this?
Usually I know how to perform groupby operations, either as list aggregations or as averages, but I could not find how to do this when comparing multiple lists. Should a groupby operation not be the most efficient solution, I'm also open to other suggestions.
Approach 1
Here, we create a new dataframe from the lists contained in column C and set the index of this newly created dataframe to columns A and B. Then we aggregate this frame by taking the mean on the levels present in the index.
Then, using .values + tolist, we take the view of the mean values as a numpy array, convert this view to a list of lists and assign it to column C.
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).mean(level=[0, 1])
out.drop(out.columns.tolist(), 1).assign(C=out.values.tolist()).reset_index()
Approach 2
A naive approach, which can be slower when dealing with big dataframes. Here we group the dataframe by columns A and B and apply a lambda function on column C; the lambda function creates a numpy array from the lists and takes the mean along axis=0.
import numpy as np

out = df.groupby(['A', 'B'])['C'].apply(
    lambda s: np.array(list(s)).mean(axis=0)).reset_index()
Result
A B C
0 1 apple [4.0, 7.5, 2.0]
1 1 banana [5.0, 37.0, 1.0]
2 2 pineapple [2.5, 12.0, 1.5]
Performance Profiling
On sample dataframe with 50000 rows and 30000 unique groups
df = pd.concat([df.assign(B=df['B'] + str(i))
for i in range(10000)], ignore_index=True)
%%timeit
s = df.set_index(['A', 'B'])
out = pd.DataFrame(list(s['C']), s.index).mean(level=[0, 1])
_ = out.drop(out.columns.tolist(), 1).assign(C=out.values.tolist()).reset_index()
# 173 ms ± 19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
_ = df.groupby(['A', 'B'])['C'].apply(lambda s: np.array(list(s)).mean(axis=0)).reset_index()
# 2.24 s ± 68.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
TRY:
df = pd.concat([df.pop('C').apply(pd.Series), df], 1).groupby(
['A', 'B']).mean().apply(list, 1).reset_index()
or:
df = df.T.apply(pd.Series.explode).T.convert_dtypes().groupby(
['A', 'B']).mean().apply(list, 1).reset_index()
Try This
df = df.groupby(['A','B'])['C'].agg(list).reset_index()
df['C'] = df['C'].apply(lambda x: np.mean(x, axis=0))
Output
A B C
0 1 apple [4.0, 7.5, 2.0]
1 1 banana [5.0, 37.0, 1.0]
2 2 pineapple [2.5, 12.0, 1.5]
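Another sketch along the same lines, assuming pandas 0.25+ for explode and equal-length lists per group (which the question guarantees): flatten the lists, track each element's position, and aggregate twice:
s = df.explode('C')
s['pos'] = s.groupby(level=0).cumcount()   # position of each element inside its original list
s['C'] = s['C'].astype(float)
out = (s.groupby(['A', 'B', 'pos'])['C'].mean()   # element-wise means
         .groupby(level=['A', 'B']).agg(list)     # collect back into lists
         .reset_index())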

Create a sliding window of data index positions

I am trying to write a function that returns the index positions of a sliding window over a Pandas DataFrame as a list of (train, test) tuples.
Example:
df.head(10)
col_a col_b
0 20.1 6.0
1 19.1 7.1
2 19.1 8.9
3 16.5 11.0
4 16.0 11.1
5 17.4 8.7
6 19.3 9.7
7 22.8 12.6
8 21.4 11.9
9 23.0 12.8
def split_function(df, train_length, test_length):
    some_logic_to_split_dataframe
    split_indices = [(train_idx, test_idx) for index_tuples in split_dataframe_logic]
    return split_indices
Desired outcome:
train_length = 2
test_length = 1
split_indices = split_function(df, train_length, test_length)
split_indices
output:
[((0,1), (2)), ((1,2),(3)),...,((7,8), (9)) etc]
The function's loop/generator expression would also need to terminate when the test index reaches the last observation.
All help very much appreciated
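A minimal plain-Python sketch of such a split_function, assuming a default RangeIndex, could look like this:
def split_function(df, train_length, test_length):
    window = train_length + test_length
    # slide one row at a time; the last window ends on the final observation
    return [(tuple(range(i, i + train_length)),
             tuple(range(i + train_length, i + window)))
            for i in range(len(df) - window + 1)]

split_indices = split_function(df, train_length=2, test_length=1)
# [((0, 1), (2,)), ((1, 2), (3,)), ..., ((7, 8), (9,))]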
I would suggest using the rolling method offered by pandas.
import numpy as np

split_indices = []

def split(x):
    split_indices.append((x.index[:train_length], x.index[-test_length:]))
    return np.nan

# raw=False passes each window as a Series, so x.index is available
df['col_a'].rolling(train_length + test_length).apply(split, raw=False)
This code will create the following split_indices
>>> split_indices
[(Int64Index([0, 1], dtype='int64'), Int64Index([2], dtype='int64')),
(Int64Index([1, 2], dtype='int64'), Int64Index([3], dtype='int64')),
(Int64Index([2, 3], dtype='int64'), Int64Index([4], dtype='int64')),
(Int64Index([3, 4], dtype='int64'), Int64Index([5], dtype='int64')),
(Int64Index([4, 5], dtype='int64'), Int64Index([6], dtype='int64')),
(Int64Index([5, 6], dtype='int64'), Int64Index([7], dtype='int64')),
(Int64Index([6, 7], dtype='int64'), Int64Index([8], dtype='int64')),
(Int64Index([7, 8], dtype='int64'), Int64Index([9], dtype='int64'))]
Afterwards you can easily get the data of your dataframe for a given index:
>>> df.loc[split_indices[3][0]]
col_a col_b
3 16.5 11.0
4 16.0 11.1

How to concatenate Pandas Dataframe columns dynamically?

I have a dataframe df (see program below) whose column names and number are not fixed.
However, there is a list ls which will have the list of columns of df that needs to be appended together.
I tried
df['combined'] = df[ls].apply(lambda x: '{}{}{}'.format(x[0], x[1], x[2]), axis=1)
but here I am assuming that the list ls has 3 elements, which is hard-coding and incorrect. What if the list has 10 elements? I want to dynamically read the list and append the columns of the dataframe.
import pandas as pd

def main():
    df = pd.DataFrame({
        'col_1': [0, 1, 2, 3],
        'col_2': [4, 5, 6, 7],
        'col_3': [14, 15, 16, 19],
        'col_4': [22, 23, 24, 25],
        'col_5': [30, 31, 32, 33],
    })
    ls = ['col_1', 'col_4', 'col_3']
    df['combined'] = df[ls].apply(lambda x: '{}{}'.format(x[0], x[1]), axis=1)
    print(df)

if __name__ == '__main__':
    main()
You can use ''.join after converting the columns' data type to str:
df[ls].astype(str).apply(''.join, axis=1)
#0 02214
#1 12315
#2 22416
#3 32519
#dtype: object
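To write the result back into the frame, or to join with a delimiter, the same pattern works (a sketch):
df['combined'] = df[ls].astype(str).apply(''.join, axis=1)
# with a separator between the columns:
df['combined'] = df[ls].astype(str).apply('-'.join, axis=1)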
You can use a cumulative sum over the strings for more speed, i.e.
df[ls].astype(str).cumsum(1).iloc[:,-1].values
Output :
0 02214
1 12315
2 22416
3 32519
Name: combined, dtype: object
If you need to add a space, first append ' ' and then take the sum, i.e.
n = (df[ls].astype(str)+ ' ').sum(1)
0 0 22 14
1 1 23 15
2 2 24 16
3 3 25 19
dtype: object
Timings :
ndf = pd.concat([df]*10000)
%%timeit
ndf[ls].astype(str).cumsum(1).iloc[:,-1].values
1 loop, best of 3: 538 ms per loop
%%timeit
ndf[ls].astype(str).apply(''.join, axis=1)
1 loop, best of 3: 1.93 s per loop
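A further option (a sketch, not timed here) is to reduce over element-wise string addition of the columns, which concatenates whole columns at a time:
from functools import reduce
df['combined'] = reduce(lambda a, b: a + b, (df[c].astype(str) for c in ls))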

Hierarchical data: efficiently build a list of every descendant for each node

I have a two column data set depicting multiple child-parent relationships that form a large tree. I would like to use this to build an updated list of every descendant for each node.
Original Input:
child parent
1 2010 1000
7 2100 1000
5 2110 1000
3 3000 2110
2 3011 2010
4 3033 2100
0 3102 2010
6 3111 2110
Graphical depiction of relationships:
Expected output:
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
Originally I decided to use a recursive solution with DataFrames. It works as intended, but Pandas is awfully inefficient. My research has led me to believe that an implementation using NumPy arrays (or other simple data structures) would be much faster on large data sets (of 10's of thousands of records).
Solution using data frames:
import pandas as pd
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):
        list_of_children = list()
        list_of_children.append(df[df['parent'] == parent_id]['child'].values)
        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))
        # flatten list
        list_of_children = [item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)
    return new_df
Because using pandas DataFrames in this way is very inefficient on large data sets, I need to improve the performance of this operation. My understanding is that this can be done by using more efficient data structures better suited for looping and recursion. I want to perform this same operation in the most efficient way possible.
Specifically, I'm asking for optimization of speed.
This is a method using numpy to iterate down the tree a generation at a time.
Code:
import numpy as np
import pandas as pd # only used to return a dataframe
def list_ancestors(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Return pandas dataframe of all descendant/ancestor node pairs

    Ex:
    df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                       'parent': [100, 100, 200, 200, 201, 300]})
    df
        child parent
    0   200   100
    1   201   100
    2   300   200
    3   301   200
    4   302   201
    5   400   300

    list_ancestors(df.values)
    returns
        descendant ancestor
    0   200        100
    1   201        100
    2   300        200
    3   300        100
    4   301        200
    5   301        100
    6   302        201
    7   302        100
    8   400        300
    9   400        200
    10  400        100
    """
    ancestors = []
    for ar in trace_nodes(edges):
        ancestors.append(np.c_[np.repeat(ar[:, 0], ar.shape[1]-1),
                               ar[:, 1:].flatten()])
    return pd.DataFrame(np.concatenate(ancestors),
                        columns=['descendant', 'ancestor'])
def trace_nodes(edges):
    """
    Take edge list of a rooted tree as a numpy array with shape (E, 2),
    child nodes in edges[:, 0], parent nodes in edges[:, 1]
    Yield numpy array with cross-section of tree and associated
    ancestor nodes

    Ex:
    df = pd.DataFrame({'child': [200, 201, 300, 301, 302, 400],
                       'parent': [100, 100, 200, 200, 201, 300]})
    df
        child parent
    0   200   100
    1   201   100
    2   300   200
    3   301   200
    4   302   201
    5   400   300

    trace_nodes(df.values)
    yields
    array([[200, 100],
           [201, 100]])
    array([[300, 200, 100],
           [301, 200, 100],
           [302, 201, 100]])
    array([[400, 300, 200, 100]])
    """
    mask = np.in1d(edges[:, 1], edges[:, 0])
    gen_branches = edges[~mask]
    edges = edges[mask]
    yield gen_branches
    while edges.size != 0:
        mask = np.in1d(edges[:, 1], edges[:, 0])
        next_gen = edges[~mask]
        gen_branches = numpy_col_inner_many_to_one_join(next_gen, gen_branches)
        edges = edges[mask]
        yield gen_branches
def numpy_col_inner_many_to_one_join(ar1, ar2):
    """
    Take two 2-d numpy arrays ar1 and ar2,
    with no duplicate values in first column of ar2
    Return inner join of ar1 and ar2 on
    last column of ar1, first column of ar2

    Ex:
    ar1 = np.array([[1, 2, 3],
                    [4, 5, 3],
                    [6, 7, 8],
                    [9, 10, 11]])
    ar2 = np.array([[ 1, 2],
                    [ 3, 4],
                    [ 5, 6],
                    [ 7, 8],
                    [ 9, 10],
                    [11, 12]])

    numpy_col_inner_many_to_one_join(ar1, ar2)
    returns
    array([[ 1, 2, 3, 4],
           [ 4, 5, 3, 4],
           [ 9, 10, 11, 12]])
    """
    ar1 = ar1[np.in1d(ar1[:, -1], ar2[:, 0])]
    ar2 = ar2[np.in1d(ar2[:, 0], ar1[:, -1])]
    if 'int' in ar1.dtype.name and ar1[:, -1].min() >= 0:
        bins = np.bincount(ar1[:, -1])
        counts = bins[bins.nonzero()[0]]
    else:
        counts = np.unique(ar1[:, -1], False, False, True)[1]
    left = ar1[ar1[:, -1].argsort()]
    right = ar2[ar2[:, 0].argsort()]
    return np.concatenate([left[:, :-1],
                           right[np.repeat(np.arange(right.shape[0]),
                                           counts)]], 1)
Timing Comparison:
Test cases 1 & 2 provided by @taky2, test cases 3 & 4 comparing performance on tall and wide tree structures respectively – most use cases are likely somewhere in the middle.
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}
)
df2 = pd.DataFrame(
{
'child': [4321, 3102, 4023, 2010, 5321, 4200, 4113, 6525, 4010, 4001,
3011, 5010, 3000, 3033, 2110, 6100, 3111, 2100, 6016, 4311],
'parent': [3111, 2010, 3000, 1000, 4023, 3011, 3033, 5010, 3011, 3102,
2010, 4023, 2110, 2100, 1000, 5010, 2110, 1000, 5010, 3033]
}
)
df3 = pd.DataFrame(np.r_[np.c_[np.arange(1, 501), np.arange(500)],
np.c_[np.arange(501, 1001), np.arange(500)]],
columns=['child', 'parent'])
df4 = pd.DataFrame(np.r_[np.c_[np.arange(1, 101), np.repeat(0, 100)],
np.c_[np.arange(1001, 11001),
np.repeat(np.arange(1, 101), 100)]],
columns=['child', 'parent'])
%timeit get_ancestry_dataframe_flat(df)
10 loops, best of 3: 53.4 ms per loop
%timeit add_children_of_children(df)
1000 loops, best of 3: 1.13 ms per loop
%timeit all_descendants_nx(df)
1000 loops, best of 3: 675 µs per loop
%timeit list_ancestors(df.values)
1000 loops, best of 3: 391 µs per loop
%timeit get_ancestry_dataframe_flat(df2)
10 loops, best of 3: 168 ms per loop
%timeit add_children_of_children(df2)
1000 loops, best of 3: 1.8 ms per loop
%timeit all_descendants_nx(df2)
1000 loops, best of 3: 1.06 ms per loop
%timeit list_ancestors(df2.values)
1000 loops, best of 3: 933 µs per loop
%timeit add_children_of_children(df3)
10 loops, best of 3: 156 ms per loop
%timeit all_descendants_nx(df3)
1 loop, best of 3: 952 ms per loop
%timeit list_ancestors(df3.values)
10 loops, best of 3: 104 ms per loop
%timeit add_children_of_children(df4)
1 loop, best of 3: 503 ms per loop
%timeit all_descendants_nx(df4)
1 loop, best of 3: 238 ms per loop
%timeit list_ancestors(df4.values)
100 loops, best of 3: 2.96 ms per loop
Notes:
get_ancestry_dataframe_flat not timed on cases 3 & 4 due to time and memory concerns.
add_children_of_children modified to identify root node internally, but allowed to assume a unique root. First line root_node = (set(dataframe.parent) - set(dataframe.child)).pop() added.
all_descendants_nx modified to accept a dataframe as an argument, instead of pulling from an external namespace.
Example demonstrating proper behavior:
np.all(get_ancestry_dataframe_flat(df2).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True) ==\
list_ancestors(df2.values).sort_values(['descendant', 'ancestor'])\
.reset_index(drop=True))
Out[20]: True
Here is a method which builds a dict to allow easier navigation of the tree. It then walks the tree once, adding the children to their grandparents and above, and finally adds the new data to the dataframe.
Code:
def add_children_of_children(dataframe, root_node):
    # build a dict of lists to allow easy tree descent
    tree = {}
    for idx, (child, parent) in dataframe.iterrows():
        tree.setdefault(parent, []).append(child)

    data = []

    def descend_tree(parent):
        # get list of children of this parent
        children = tree[parent]
        # reverse order so that we can modify the list while looping
        for child in reversed(children):
            if child in tree:
                # descend tree and find children which need to be added
                lower_children = descend_tree(child)
                # add children from below to parent at this level
                data.extend([(c, parent) for c in lower_children])
                # return lower children to parents above
                children.extend(lower_children)
        return children

    descend_tree(root_node)
    return dataframe.append(
        pd.DataFrame(data, columns=dataframe.columns))
Timings:
There are three test methods in the test code, seconds from a timeit run:
0.073 - add_children_of_children() from above.
0.153 - add_children_of_children() with the output sorted.
3.385 - original get_ancestry_dataframe_flat() pandas implementation.
So a native data structure approach is considerably faster than the original implementation.
Test Code:
import pandas as pd
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
def method1():
    # the root node is the node which is not a child
    root = set(df.parent) - set(df.child)
    assert len(root) == 1, "Number of roots != 1 '{}'".format(root)
    return add_children_of_children(df, root.pop())

def method2():
    dataframe = method1()
    names = ['ancestor', 'descendant']
    rename = {o: n for o, n in zip(dataframe.columns, reversed(names))}
    return dataframe.rename(columns=rename) \
        .sort_values(names).reset_index(drop=True)

def method3():
    return get_ancestry_dataframe_flat(df)

def get_ancestry_dataframe_flat(df):

    def get_child_list(parent_id):
        list_of_children = list()
        list_of_children.append(
            df[df['parent'] == parent_id]['child'].values)
        for i, r in df[df['parent'] == parent_id].iterrows():
            if r['child'] != parent_id:
                list_of_children.append(get_child_list(r['child']))
        # flatten list
        list_of_children = [
            item for sublist in list_of_children for item in sublist]
        return list_of_children

    new_df = pd.DataFrame(columns=['descendant', 'ancestor']).astype(int)
    for index, row in df.iterrows():
        temp_df = pd.DataFrame(columns=['descendant', 'ancestor'])
        temp_df['descendant'] = pd.Series(get_child_list(row['parent']))
        temp_df['ancestor'] = row['parent']
        new_df = new_df.append(temp_df)

    new_df = new_df\
        .drop_duplicates()\
        .sort_values(['ancestor', 'descendant'])\
        .reset_index(drop=True)
    return new_df
print(method2())
print(method3())
from timeit import timeit
print(timeit(method1, number=50))
print(timeit(method2, number=50))
print(timeit(method3, number=50))
Test Results:
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110
0.0737142168563
0.153700592966
3.38558308083
A solution using networkx; there may be a more efficient method in the documentation, but this nested loop does the trick.
import pandas as pd
from timeit import timeit
df = pd.DataFrame(
{
'child': [3102, 2010, 3011, 3000, 3033, 2110, 3111, 2100],
'parent': [2010, 1000, 2010, 2110, 2100, 1000, 2110, 1000]
}, columns=['child', 'parent']
)
In networkx 2.0, use from_pandas_edgelist to create a directed graph:
import networkx as nx
DiG = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
Simply iterate over the nodes and the ancestors of each node.
for n1 in DiG.nodes():
    for n2 in nx.ancestors(DiG, n1):
        print(n1, n2)
3000 1000
3000 2110
3011 1000
3011 2010
2100 1000
2110 1000
3111 1000
3111 2110
3033 1000
3033 2100
2010 1000
3102 1000
3102 2010
Wrapped into a function:
def all_descendants_nx():
    DiG = nx.from_pandas_edgelist(df, 'parent', 'child', create_using=nx.DiGraph())
    return pd.DataFrame.from_records([(n1, n2) for n1 in DiG.nodes() for n2 in nx.ancestors(DiG, n1)],
                                     columns=['descendant', 'ancestor'])
print(timeit(all_descendants_nx, number=50)) #to compare to Stephen's nice answer
0.05033063516020775
all_descendants_nx()
descendant ancestor
0 3000 1000
1 3000 2110
2 3011 1000
3 3011 2010
4 2100 1000
5 2110 1000
6 3111 1000
7 3111 2110
8 3033 1000
9 3033 2100
10 2010 1000
11 3102 1000
12 3102 2010
Here is one way using isin() and map
df_new = df.append(df[df['parent'].isin(df['child'].values.tolist())])\
           .reset_index(drop=True)
df_new.loc[df_new.duplicated(), 'parent'] = df_new.loc[df_new.duplicated(), 'parent']\
    .map(df.set_index('child')['parent'])
df_new = df_new.sort_values('parent').reset_index(drop=True)
df_new.columns = ['descendant', 'ancestor']
You get
descendant ancestor
0 2010 1000
1 2100 1000
2 2110 1000
3 3000 1000
4 3011 1000
5 3033 1000
6 3102 1000
7 3111 1000
8 3011 2010
9 3102 2010
10 3033 2100
11 3000 2110
12 3111 2110

Accessing every 1st element of Pandas DataFrame column containing lists

I have a Pandas DataFrame with a column containing list objects:
A
0 [1,2]
1 [3,4]
2 [8,9]
3 [2,6]
How can I access the first element of each list and save it into a new column of the DataFrame? To get a result like this:
A new_col
0 [1,2] 1
1 [3,4] 3
2 [8,9] 8
3 [2,6] 2
I know this could be done via iterating over each row, but is there any "pythonic" way?
As always, remember that storing non-scalar objects in frames is generally disfavoured, and should really only be used as a temporary intermediate step.
That said, you can use the .str accessor even though it's not a column of strings:
>>> df = pd.DataFrame({"A": [[1,2],[3,4],[8,9],[2,6]]})
>>> df["new_col"] = df["A"].str[0]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
>>> df["new_col"]
0 1
1 3
2 8
3 2
Name: new_col, dtype: int64
You can use map and a lambda function
df.loc[:, 'new_col'] = df.A.map(lambda x: x[0])
Use apply with x[0]:
df['new_col'] = df.A.apply(lambda x: x[0])
print df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
You can use the method str.get:
df['A'].str.get(0)
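As with the other approaches, the result can be assigned to a new column; str.get should also give NaN rather than an error if a list happens to be empty (a sketch):
df['new_col'] = df['A'].str.get(0)  # an empty list would come back as NaN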
You can just use a conditional list comprehension which takes the first value of any iterable or else uses None for that item. List comprehensions are very Pythonic.
df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
>>> df
A new_col
0 [1, 2] 1
1 [3, 4] 3
2 [8, 9] 8
3 [2, 6] 2
Timings
df = pd.concat([df] * 10000)
%timeit df['new_col'] = [val[0] if hasattr(val, '__iter__') else None for val in df["A"]]
100 loops, best of 3: 13.2 ms per loop
%timeit df["new_col"] = df["A"].str[0]
100 loops, best of 3: 15.3 ms per loop
%timeit df['new_col'] = df.A.apply(lambda x: x[0])
100 loops, best of 3: 12.1 ms per loop
%timeit df.A.map(lambda x: x[0])
100 loops, best of 3: 11.1 ms per loop
Removing the safety check ensuring an iterable:
%timeit df['new_col'] = [val[0] for val in df["A"]]
100 loops, best of 3: 7.38 ms per loop
