I have a Pandas DataFrame with several list-valued columns that I would like to split. Each list has the same length, and they all have to be split at the same indices.
What I have now uses a suggestion from here, but I cannot make it work:
import numpy as np
import pandas as pd
from itertools import chain

split_size = 2

def split_list(arr, keep_partial=False):
    arrs = []
    while len(arr) >= split_size:
        sub = arr[:split_size]
        arrs.append(sub)
        arr = arr[split_size:]
    if keep_partial:
        arrs.append(arr)
    return arrs
df = pd.DataFrame({'id': [1, 2, 3],
                   't': [[1,2,3,4], [1,2,3,4,5,6], [0,2]],
                   'v': [[0,-1,1,0], [0,-1,1,0,2,-2], [0,0]]})

def chainer(lst):
    return list(chain.from_iterable(split_list(lst, split_size)))

def chain_col(col):
    return col.apply(lambda x: chainer(x))

lens = df.t.apply(lambda x: len(split_list(x)))
pd.DataFrame({'id': np.repeat(df.id, lens), 't': chain_col(df.t), 'v': chain_col(df.v)})
The problem is that it repeats each full list rather than splitting it across rows. I think the issue is the use of chain.from_iterable, but without it I simply get the list of lists (i.e. the split lists) repeated, rather than each split placed on its own row of the DataFrame.
My data set is not very large (a few thousand rows), so if there is a better way I'd be happy to learn. I looked at explode but that seems to split the data set based on a single column and I want multiple columns to be split in the same way.
My desired output for id = 1 is:
1. a row with t = [1,2] and v = [0,-1]
2. another row with t = [3,4] and v = [1,0]
Ideally I'd add a sub-index to each 'id' (e.g. 1 -> 1.1 and 1.2, so I can distinguish them) but that's a cosmetic thing, not my main problem.
Using explode, pd.concat and GroupBy:
Note: this answer uses the explode method, available only in pandas>=0.25.0.
d1 = df.explode('t').drop(columns='v')
d2 = df.explode('v').drop(columns=['id', 't'])
df2 = pd.concat([d1, d2], axis=1)
df2

s = df2.groupby('id')['id'].cumcount() // 2
final = df2.groupby(['id', s]).agg({'t': list,
                                    'v': list}).reset_index(level=0)
final['id'] = final['id'].astype(str).str.cat('.' + final.groupby('id').cumcount().add(1).astype(str))
Output
    id       t        v
0  1.1  [1, 2]  [0, -1]
1  1.2  [3, 4]   [1, 0]
0  2.1  [1, 2]  [0, -1]
1  2.2  [3, 4]   [1, 0]
2  2.3  [5, 6]  [2, -2]
0  3.1  [0, 2]   [0, 0]
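As a side note, and only if your pandas version allows it: from pandas 1.3 onward, explode accepts a list of columns, which removes the need for the two single-column explodes and the concat. A minimal sketch of the same pipeline under that assumption:

df3 = df.explode(['t', 'v'])                     # requires pandas >= 1.3
g = df3.groupby('id').cumcount() // 2            # chunk number within each id
final = df3.groupby(['id', g]).agg({'t': list, 'v': list}).reset_index(level=0)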
IIUC, here is one way: use a function which splits a list into chunks of size n, then applymap to split each cell, followed by explode and concat:
def split_lists(l, n):
    """Split a list into chunks of size n."""
    for i in range(0, len(l), n):
        yield l[i:i + n]

def explode_multiple(x):
    """Use the previous function to split each cell,
    explode each column and concat them into a DataFrame."""
    m = x.applymap(lambda cell: [*split_lists(cell, 2)])
    m = pd.concat([m.explode(i).loc[:, i] for i in m.columns], axis=1).reset_index()
    return m
explode_multiple(df.set_index('id'))  # set id as the index, since the other columns contain lists
   id       t        v
0   1  [1, 2]  [0, -1]
1   1  [3, 4]   [1, 0]
2   2  [1, 2]  [0, -1]
3   2  [3, 4]   [1, 0]
4   2  [5, 6]  [2, -2]
5   3  [0, 2]   [0, 0]
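One version caveat, stated as an assumption about your environment: in pandas 2.1+ applymap is deprecated in favor of DataFrame.map, so under that assumption the first line of explode_multiple would become:

m = x.map(lambda cell: [*split_lists(cell, 2)])  # pandas >= 2.1 drop-in for applymap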
Related
Assume that we have a dataframe and in its columns we have lists. How can I count the number of elements per list? For example:

A        B
(1,2,3)  (1,2,3,4)
(1)      (1,2,3)

I would like to create 2 new columns with the counts, something like the following:

A        B          C  D
(1,2,3)  (1,2,3,4)  3  4
(1)      (1,2,3)    1  3

where C corresponds to the number of elements in the list in column A for that row, and D to the number of elements in the list in column B for that row.
I cannot just do
df['A'] = len(df['A'])
because that returns the length of my DataFrame.
You can use the .apply method on the Series for the column df['A'].
>>> import pandas as pd
>>> pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
      column
0     [1, 2]
1        [1]
2  [1, 2, 3]
>>> df = pd.DataFrame({"column": [[1, 2], [1], [1, 2, 3]]})
>>> df["column"].apply
<bound method Series.apply of 0       [1, 2]
1          [1]
2    [1, 2, 3]
Name: column, dtype: object>
>>> df["column"].apply(len)
0    2
1    1
2    3
Name: column, dtype: int64
>>> df["column"] = df["column"].apply(len)
>>>
See Python Pandas, apply function for a more general discussion of apply.
You can use pandas' apply with the len function on each column, as below, to obtain what you are looking for:
# package importation
import pandas as pd

# creating a sample dataframe
df = pd.DataFrame(
    {
        'A': [[1,2,3],[32,4],[45,67,23,54,3],[],[0]],
        'B': [[2],[3],[2,3],[5,6,1],[98,44]]
    },
    index=['z','y','m','n','o']
)

# computing the length of the list in each column
df['items_in_A'] = df['A'].apply(len)
df['items_in_B'] = df['B'].apply(len)

# check the output
print(df)
Output

                     A          B  items_in_A  items_in_B
z            [1, 2, 3]        [2]           3           1
y              [32, 4]        [3]           2           1
m  [45, 67, 23, 54, 3]     [2, 3]           5           2
n                   []  [5, 6, 1]           0           3
o                  [0]   [98, 44]           1           2
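For what it's worth, the .str accessor handles list-valued cells as well (Series.str.len is documented to count the elements of sequences, not just the characters of strings), so an equivalent without apply:

# Series.str.len also counts elements of lists, not just characters of strings
df['items_in_A'] = df['A'].str.len()
df['items_in_B'] = df['B'].str.len()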
Given the following list of lists representing column names:
names = [['a','b'],['c','c'],['b','c']]
and the following dataframe
df

   a  b  c
0  1  2  6
1  1  3  2
2  4  6  4
I would like to generate the list with the same dimensions as names with the following functionality:
lst = []
for idx, cols in enumerate(names):
    lst.append([])
    for col in cols:
        lst[-1].append(df.iloc[idx][col])

lst:
[[1, 2], [2, 2], [6, 4]]
I.e., the names array points to the columns pulled from df at the relevant row index.
I'm trying to avoid the nested loop.
You can select multiple columns with a list:
lst = []
for idx, cols in enumerate(names):
    lst.append(df.iloc[idx][cols].tolist())

# or a list comprehension
lst = [df.iloc[idx][cols].tolist() for idx, cols in enumerate(names)]
print(lst)
[[1, 2], [2, 2], [6, 4]]
You said that the length of names is the same as the length of the dataframe, and that you don't want to loop over names or use a nested loop. In that case, would looping over a range be allowed?
index = range(len(names))
[df.iloc[i][names[i]].tolist() for i in index]
Out[16]: [[1, 2], [2, 2], [6, 4]]
Or with df.loc:
[df.loc[i,names[i]].tolist() for i in index]
Out[35]: [[1, 2], [2, 2], [6, 4]]
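If you want to avoid a Python-level loop entirely, here is a NumPy-based sketch; it assumes every inner list in names has the same length (true in your example), since it relies on a rectangular array of column labels:

import numpy as np

names_arr = np.array(names)                                   # shape (n_rows, k)
col_idx = df.columns.get_indexer(names_arr.ravel()).reshape(names_arr.shape)
row_idx = np.arange(len(names_arr))[:, None]                  # broadcastable row indices
lst = df.to_numpy()[row_idx, col_idx].tolist()
# [[1, 2], [2, 2], [6, 4]]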
In Python, if I have the list
[1,2,3,4,1,2,3,1,2,1]
I want to split it wherever the values decrease, like this:
[1,2,3,4], [1,2,3], [1,2], [1]
How do I code it?
You can use Pandas to do this in three lines:
import pandas as pd
s = pd.Series([1,2,3,4,1,2,3,1,2,1])
s.groupby(s.diff().lt(0).cumsum()).apply(list).tolist()
Output:
[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
Details of how it works:
First, create a pandas Series from the list, then use the diff method of pd.Series to get the difference from the previous value:
s.diff()
0 NaN
1 1.0
2 1.0
3 1.0
4 -3.0
5 1.0
6 1.0
7 -2.0
8 1.0
9 -1.0
dtype: float64
The negative values indicate the start of a new "sub" list. So, we use lt(0) to mark the positions where a new "sub" list should start.
s.diff().lt(0)
0 False
1 False
2 False
3 False
4 True
5 False
6 False
7 True
8 False
9 True
dtype: bool
Next, we are going to use cumsum to create a grouping key. cumsum only increments on True, so consecutive False values share the same group number; each True bumps the counter, and the following run of False values takes that new value, until the next True.
s.diff().lt(0).cumsum()
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 2
8 2
9 3
dtype: int32
Now, we can use groupby with apply to create a new Series with these sublists as rows. We group on the newly created grouping key from above and apply the Python list constructor to the values in each group, thus creating the "sub" lists.
s.groupby(s.diff().lt(0).cumsum()).apply(list)
0 [1, 2, 3, 4]
1 [1, 2, 3]
2 [1, 2]
3 [1]
dtype: object
Lastly, we apply the tolist method on the series to return the series as a list.
s.groupby(s.diff().lt(0).cumsum()).apply(list).tolist()
Final Output:
[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
This might be the algorithm you are looking for -
a = [1,2,3,4,1,2,3,1,2,1]
b = []
c = []
for i in range(len(a) - 1):
    b.append(a[i])
    if a[i] > a[i+1]:
        c.append(b)
        b = []
b.append(a[-1])  # don't drop the last element
c.append(b)      # keep the trailing run
print(c)
It outputs a list of ascending runs -
[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
Let me know if that helps.
If you are looking to split the list when the next number is less than the previous one, then this might help:
arr = [1,2,3,4,1,2,3,1,2,1]
b = []
start = 0
for i in range(1, len(arr)):  # start at 1 so arr[0] is never compared with arr[-1]
    if arr[i] < arr[i-1]:
        b.append(arr[start:i])
        start = i
b.append(arr[start:])
print(b)
Output:
[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
Hope this helps.
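If NumPy is available, the same split can be written very compactly; a sketch of that alternative:

import numpy as np

arr = np.array([1, 2, 3, 4, 1, 2, 3, 1, 2, 1])
# cut right after every position where the next value is smaller
parts = np.split(arr, np.where(np.diff(arr) < 0)[0] + 1)
print([p.tolist() for p in parts])
# [[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]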
Just for fun, I wanted to see if you could rework the code given in the docs as a sample implementation of itertools.groupby to suit your needs in a general way. The result is a generator whose elements are sub-generators representing your sub-lists. The determination when to split is done by a user-defined function of two variables that accepts each successive pair of neighboring elements and returns True when they are in different groups:
from collections import deque

class splitby:
    # [''.join(s) for s in splitby('AAAABBBCCDAABBB', operator.ne)] --> ['AAAA', 'BBB', 'CC', 'D', 'AA', 'BBB']
    def __init__(self, iterable, splitter):
        self.splitfunc = splitter
        self.it = iter(iterable)
        self.segment = None

    def __iter__(self):
        return self

    def __next__(self):
        if self.segment:
            # drain any unconsumed remainder of the previous segment
            deque(self.segment, maxlen=0)
            if self.segment is None:
                raise StopIteration
        else:
            self.curvalue = next(self.it)
        self.segment = self._splitter()
        return self.segment

    def _splitter(self):
        split = False
        while not split:
            yield self.curvalue
            prev = self.curvalue
            try:
                self.curvalue = next(self.it)
            except StopIteration:
                self.segment = None
                return
            split = self.splitfunc(prev, self.curvalue)
The whole thing can be applied to your input list with a splitter function of operator.gt or int.__gt__ if your list will only ever contain ints. A suitable wrapping in list will not only properly consume the elements, but will also make the output match your question:
from operator import gt
x = [1, 2, 3, 4, 1, 2, 3, 1, 2, 1]
[list(s) for s in splitby(x, gt)]
The result is:
[[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
Here is an IDEOne link: https://ideone.com/UW483U
TL;DR
This is massive overkill for most situations, so don't do it this way. I was just having some fun, but the code here does technically solve your problem. If you put the class into your library somewhere, the actual usage is a one-liner.
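If you like the generator approach but want less machinery, here is a compact eager variant; it assumes Python 3.10+ for itertools.pairwise and a non-empty input list:

from itertools import pairwise  # Python >= 3.10

def split_decreasing(seq):
    out = [[seq[0]]]            # assumes seq is non-empty
    for prev, cur in pairwise(seq):
        if cur < prev:          # the value dropped: start a new run
            out.append([])
        out[-1].append(cur)
    return out

print(split_decreasing([1, 2, 3, 4, 1, 2, 3, 1, 2, 1]))
# [[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]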
So I'm using Python 3 and importing pandas as pd, and I have a DataFrame, df, whose elements are lists of ints. I'm trying to write a function that will remove an element from a list in a specific cell of the DataFrame.
What I have is:
def eliminate(r, c, v):
    '''for row r and column c eliminate value v'''
    df[c][r].remove(v)
However, when I run the function it removes v from the list in every cell. I'm not sure what's going wrong here.
You can use loc for selecting and swap c and r:
def eliminate(r, c, v):
    '''for row r and column c eliminate value v'''
    df.loc[r, c].remove(v)
    return df
Sample:
df = pd.DataFrame({'A': [[1,2,3],[3,4]],
                   'B': [[4,5],[6, 3]]})
print (df)
           A       B
0  [1, 2, 3]  [4, 5]
1     [3, 4]  [6, 3]
def eliminate(r, c, v):
    '''for row r and column c eliminate value v'''
    # another solution:
    # df.loc[r, c] = [x for x in df.loc[r, c] if x != v]
    df.loc[r, c].remove(v)
    return df
print (eliminate(0, 'A', 2))
        A       B
0  [1, 3]  [4, 5]
1  [3, 4]  [6, 3]

print (eliminate(1, 'B', 3))
        A       B
0  [1, 3]  [4, 5]
1  [3, 4]     [6]
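As a hedged guess at why the original version appeared to change every cell: if the DataFrame was built so that several cells reference the same list object, .remove mutates that one list in place and the change shows up everywhere. A hypothetical reconstruction:

shared = [1, 2, 3]
df_bad = pd.DataFrame({'A': [shared, shared]})  # both cells hold the *same* list object
df_bad.loc[0, 'A'].remove(2)
print(df_bad)  # both rows now show [1, 3]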
I am currently using Pandas in python 2.7. My dataframe looks similar to this:
>>> df
        0
1  [1, 2]
2  [2, 3]
3  [4, 5]
Is it possible to filter rows by values in column 1? For example, if my filter value is 2, the filter should return a dataframe containing the first two rows.
I have tried a couple of ways already. The best I can think of is a list comprehension that returns the indices of rows in which the value exists; then I could filter the DataFrame with that list of indices. But this would be very slow if I want to filter multiple times with different values. Ideally, I would like something that uses the built-in Pandas functions to speed up the process.
You can use boolean indexing:
import pandas as pd
df = pd.DataFrame({'0':[[1, 2],[2, 3], [4, 5]]})
print (df)
        0
0  [1, 2]
1  [2, 3]
2  [4, 5]

print (df['0'].apply(lambda x: 2 in x))
0     True
1     True
2    False
Name: 0, dtype: bool

print (df[df['0'].apply(lambda x: 2 in x)])
        0
0  [1, 2]
1  [2, 3]
You can also use boolean indexing with a list comprehension:
>>> df[[2 in row for row in df['0']]]
        0
0  [1, 2]
1  [2, 3]
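A loop-free variant, hedged because it needs pandas >= 0.25 for Series.explode (newer than the pandas 0.x era of this question): explode the column once, test for the value, and collapse back per original row:

# explode repeats the original index, so group by level=0 to get one flag per row
mask = df['0'].explode().eq(2).groupby(level=0).any()
print (df[mask])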