I want to convert a pandas DataFrame into a list. For example, given the DataFrame below, I want a list of lists containing the values of each column.
Dataframe (df)
   A  B   C
0  0  4   8
1  1  5   9
2  2  6  10
3  3  7  11
Expected result
[[0,1,2,3], [4,5,6,7], [8,9,10,11]]
If I use df.values.tolist(), it returns a row-based list like below.
[[0,4,8], [1,5,9], [2,6,10], [3,7,11]]
It is possible to transpose the dataframe, but I want to know whether there are better solutions.
I think the simplest way is to transpose.
Use T or numpy.ndarray.transpose:
df1 = df.T.values.tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Or:
df1 = df.values.transpose().tolist()
print (df1)
[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
Another solution, using a list comprehension (thanks to John Galt):
L = [df[x].tolist() for x in df.columns]
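If you don't mind going through a dict, DataFrame.to_dict('list') gives the same column lists (a small alternative sketch, not from the original answers):
# to_dict('list') maps each column name to its list of values;
# dict order follows column order, so this yields [A, B, C].
L = list(df.to_dict('list').values())
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]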
Related
I have a large pandas Series in which each row is a list of numbers.
I want to detect rows that are subsets of other rows and delete them from the Series.
My solution uses two for loops, but it is very slow; can anyone suggest a faster way?
For example, we must delete rows 2 and 4 in the sample below because they are subsets of rows 1 and 3 respectively.
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
First, you could sort the lists (since they contain numbers) and convert them to strings. Then, for every string, simply check whether it is a substring of any other row; if so, it is a subset. Since everything is sorted, the order of the numbers will not affect this step.
Finally, keep only the rows that are not identified as subsets.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'cycles': [[9, 5, 4, 3], [9, 5, 4], [2, 4, 3], [2, 3]],
    'members': [4, 3, 3, 2]
})
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
1     [9, 5, 4]        3
2     [2, 4, 3]        3
3        [2, 3]        2
df['cycles'] = df['cycles'].map(np.sort)
df['cycles_str'] = [','.join(map(str, c)) for c in df['cycles']]
# Here we check if matches are >1, because it will match with itself once!
df['is_subset'] = [df['cycles_str'].str.contains(c_str).sum() > 1 for c_str in df['cycles_str']]
df = df.loc[df['is_subset'] == False]
df = df.drop(['cycles_str', 'is_subset'], axis=1)
         cycles  members
0  [3, 4, 5, 9]        4
2     [2, 3, 4]        3
Edit - the above doesn't work for [1, 2, 4] and [1, 2, 3, 4]: "1,2,4" is not a substring of "1,2,3,4", even though it is a subset.
I rewrote the code. This version uses two loops and set.issubset inside a list comprehension:
# check if >1 True, as it will match with itself once!
df['is_subset'] = [
    [set(y).issubset(set(x)) for x in df['cycles']].count(True) > 1
    for y in df['cycles']
]
df = df.loc[df['is_subset'] == False]
df = df.drop('is_subset', axis=1)
print(df)
         cycles  members
0  [9, 5, 4, 3]        4
2     [2, 4, 3]        3
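For reference, here is a minimal sketch applying the same set-based check directly to the cycles Series from the question (my adaptation, not part of the original answer):
import pandas as pd
cycles = pd.Series([[1, 2, 3, 4], [3, 4], [5, 6, 9, 7], [5, 9]])
# A row always matches itself, so a count > 1 means it is a
# subset of at least one other row.
is_subset = [sum(set(y).issubset(set(x)) for x in cycles) > 1 for y in cycles]
cycles = cycles[[not s for s in is_subset]]
print(cycles)
# 0    [1, 2, 3, 4]
# 2    [5, 6, 9, 7]
# dtype: object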
I have a pandas Series whose elements are tuples of lists. Each tuple has length exactly 2, and there are a bunch of NaNs. I am trying to split each list in the tuple into its own column.
import pandas as pd
import numpy as np
df = pd.DataFrame({'val': [([1, 2, 3], [4, 5, 6]),
                           ([7, 8, 9], [10, 11, 12]),
                           np.nan]})
Expected Output:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
If you know the length of the tuples is exactly 2, you can do:
df["x"] = df.val.str[0]
df["y"] = df.val.str[1]
print(df[["x", "y"]])
Prints:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
You could also convert the column to a list and pass it to the DataFrame constructor (filling None with np.nan as well):
out = pd.DataFrame(df['val'].tolist(), columns=['x','y']).fillna(np.nan)
Output:
           x             y
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
One way using pandas.Series.apply:
new_df = df["val"].apply(pd.Series)
print(new_df)
Output:
           0             1
0  [1, 2, 3]     [4, 5, 6]
1  [7, 8, 9]  [10, 11, 12]
2        NaN           NaN
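If you prefer the x/y column names used in the other answers, renaming afterwards works (a small usage note):
new_df = df["val"].apply(pd.Series).rename(columns={0: "x", 1: "y"})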
I have two pandas DataFrames of different lengths. df1 has unique ids in column id, and these ids occur (multiple times) in df2.colA. For each df1.id I'd like to collect all occurrences in df2.colA into a new list column of df1, either as the matching indices of df2.colA or together with other row entries of the matches (e.g. df2.colB).
Example:
df1.id = [1, 2, 3, 4]
df2.colA = [3, 4, 4, 2, 1, 1]
df2.colB = [5, 9, 6, 5, 8, 7]
So that my operation creates something like:
df1.colAB = [ [[1,8],[1,7]], [[2,5]], [[3,5]], [[4,9],[4,6]] ]
I've tried a bunch of approaches with mapping, explicit looping (super slow), checking with isin, etc.
You could use pandas apply to iterate over each row of df1 while building the list of matches in df2. For each id, select the rows of df2 where colA equals that id, and use a list comprehension inside the apply to pair the id with the corresponding colB values.
import pandas as pd
# setup
df1 = pd.DataFrame({'id':[1,2,3,4]})
print(df1)
df2 = pd.DataFrame({
    'colA': [3, 4, 4, 2, 1, 1],
    'colB': [5, 9, 6, 5, 8, 7]
})
print(df2)
# code
df1['colAB'] = df1['id'].apply(
    lambda i: [[i, b] for b in df2.loc[df2.colA == i, 'colB']])
print(df1)
Output from df1:
   id             colAB
0   1  [[1, 8], [1, 7]]
1   2          [[2, 5]]
2   3          [[3, 5]]
3   4  [[4, 9], [4, 6]]
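If df2 is large, scanning it once per id can still be slow. A vectorized sketch (my alternative, not part of the original answer) builds all the lists in one pass with groupby and maps them onto df1.id:
# Pair colA and colB row-wise, group the pairs by colA,
# then map the per-id lists onto df1.id.
pairs = pd.Series(df2[['colA', 'colB']].values.tolist(), index=df2.index)
df1['colAB'] = df1['id'].map(pairs.groupby(df2['colA']).apply(list))
# Same output as above.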
I have a pandas DataFrame in which a column is formed by arrays, so every cell is an array.
Say there is a column A in dataframe df, such that
A = [ [1, 2, 3],
[4, 5, 6],
[7, 8, 9],
... ]
I want to operate on each array and get, e.g., the maximum of each array, and store it in another column.
In the example, I would like to obtain another column
B = [ 3,
6,
9,
...]
I have tried these approaches so far, none of which gives what I want.
df['B'] = np.max(df['A'])
df.applymap(lambda B: A.max())
df['B'] = df.applymap(lambda B: np.max(np.array(df['A'].tolist()), 0))
How should I proceed? And is this the best way to have my dataframe organized?
You can just apply(max). It doesn't matter whether the values are lists or np.arrays.
df = pd.DataFrame({'a': [[1, 2, 3], [4, 5, 6], [7, 8, 9]]})
df['b'] = df['a'].apply(max)
print(df)
Outputs
           a  b
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
Here is one way without apply:
df['B'] = np.max(df['A'].values.tolist(), axis=1)

           A  B
0  [1, 2, 3]  3
1  [4, 5, 6]  6
2  [7, 8, 9]  9
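One caveat: the NumPy version assumes every list has the same length, because it first builds a rectangular array; apply(max) has no such restriction. A quick illustration (my note, with a hypothetical ragged input):
ragged = pd.DataFrame({'A': [[1, 2], [3, 4, 5]]})
ragged['B'] = ragged['A'].apply(max)  # fine: 2 and 5
# np.max(ragged['A'].values.tolist(), axis=1) fails here, because
# NumPy cannot build a rectangular array from ragged lists.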
Let's consider a DataFrame A with three columns, a, b, and c, and a Series B of the same size as A, where each row of B contains the name of one of A's columns. I want to construct the Series containing the values from A at the columns specified by B.
The simplest example would be the following:
idxs = np.arange(0, 5)
A = pd.DataFrame({
    'a': [3, 1, 5, 7, 8],
    'b': [5, 6, 7, 3, 1],
    'c': [2, 7, 8, 2, 1],
}, index=idxs)
B = pd.Series(['b', 'c', 'c', 'a', 'a'], index=idxs)
I need to apply some operation that gives a result identical to the following Series:
C = pd.Series([5, 7, 8, 7, 8], index=idxs)
In such a simple example one can perform 'broadcasting' as follows on pure NumPy arrays:
d = {'a': 0, 'b': 1, 'c': 2}
AA = A.rename(columns=d).to_numpy()  # .as_matrix() was removed in pandas 1.0
BB = B.apply(lambda x: d[x]).to_numpy()
CC = AA[idxs, BB]
That works, but in my real problem I have a multi-indexed DataFrame, and things become more complicated.
Is it possible to do this using pandas tools?
The first thing that comes to mind is:
A['idx'] = B
C = A.apply(lambda x: x[x['idx']], axis=1)
It works!
You can use DataFrame.lookup:
pd.Series(A.lookup(B.index, B), index=B.index)
0 5
1 7
2 8
3 7
4 8
dtype: int64
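Note: DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0. On newer versions, an equivalent sketch uses Index.get_indexer instead (assuming, as in the example, a default positional index):
import numpy as np
rows = np.arange(len(A))
cols = A.columns.get_indexer(B)  # column position for each label in B
C = pd.Series(A.to_numpy()[rows, cols], index=B.index)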
A NumPy solution involving broadcasting is:
A.to_numpy()[B.index, (A.columns.values == B.to_numpy()[:, None]).argmax(1)]
# array([5, 7, 8, 7, 8])
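One caveat on the argmax trick (my note, not from the original answer): argmax returns 0 when a row of B matches no column, so an unknown label silently picks the first column instead of raising:
bad = pd.Series(['b', 'z'])  # 'z' is not a column of A
(A.columns.values == bad.to_numpy()[:, None]).argmax(1)
# array([1, 0]) -- 'z' silently maps to column 0 ('a');
# the get_indexer approach above returns -1 for missing labels instead.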