I have a pandas DataFrame where one of the columns is full of lists:
import pandas
df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
And I'd like to make a pivot table that shows each list and a count of its occurrences:
List       Count
[a, b, c]  2
[d, e, f]  1
Because list is a non-hashable type, what aggregate functions could do this?
You can zip a list of rows and a list of counts, then make a dataframe from the zip object:
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

rows = []
counts = []
for index, row in df.iterrows():
    if row[1] not in rows:
        rows.append(row[1])
        counts.append(1)
    else:
        counts[rows.index(row[1])] += 1

df = pandas.DataFrame(zip(rows, counts))
print(df)
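A variant of the same idea can be sketched with the standard library's collections.Counter, converting each unhashable list to a hashable tuple first:

```python
from collections import Counter

import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

# Lists are unhashable, so map each one to a tuple before counting
counts = Counter(df[1].map(tuple))
result = pandas.DataFrame(list(counts.items()), columns=['List', 'Count'])
print(result)
```

This avoids the quadratic `rows.index(...)` lookups of the manual loop.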
The solution I ended up using was:
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])
print(df[1])

# Convert the unhashable lists to hashable tuples (thanks Ch3steR)
df[1] = df[1].map(tuple)
df2 = pandas.pivot_table(df, index=df[1], aggfunc='count')
print(df2)
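For comparison, the same count can be sketched without pivot_table, using value_counts on the tuple-converted column:

```python
import pandas

df = pandas.DataFrame([[1, ['a', 'b', 'c']],
                       [2, ['d', 'e', 'f']],
                       [3, ['a', 'b', 'c']]])

# value_counts also needs hashable values, so map to tuples first
counts = df[1].map(tuple).value_counts()
print(counts)
```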
I get different results in a for loop from print and df.at. Can this be explained?
import pandas as pd

data = [['A', []], ['B', []], ['C', []], ['D', []]]
df = pd.DataFrame(data, columns=['Act', 'PreviousActs'])

actssofar = []
for i, row in df.iterrows():
    actssofar.append(row['Act'])
    print(i, actssofar)
    df.at[i, 'PreviousActs'] = actssofar
Now, the output of the print function in the for loop is this:
0 ['A']
1 ['A', 'B']
2 ['A', 'B', 'C']
3 ['A', 'B', 'C', 'D']
But the output of the dataframe is this:
  Act  PreviousActs
0   A  [A, B, C, D]
1   B  [A, B, C, D]
2   C  [A, B, C, D]
3   D  [A, B, C, D]
Logically, shouldn't it show the same step-by-step appending behavior as the print function, since we are filling the dataframe with the same value?
If I understand correctly, the problem is that, when the loop finishes, your dataframe contains ['A', 'B', 'C', 'D'] for all rows. This happens because you are passing the list by reference, which means all rows store the same list. You should add a list() call to create a new list every time you assign it to the dataframe.
import pandas as pd

data = [['A', []], ['B', []], ['C', []], ['D', []]]
df = pd.DataFrame(data, columns=['Act', 'PreviousActs'])

actssofar = []
for i, row in df.iterrows():
    actssofar.append(row['Act'])
    print(i, actssofar)
    df.at[i, 'PreviousActs'] = list(actssofar)  # copy, so rows don't share one list
An alternative using slicing. Note that a slice like actssofar[:i+1] also creates a new list on every iteration, so it is equivalent to the list() call in memory terms; it just makes the "snapshot up to row i" explicit.
import pandas as pd

data = [['A', []], ['B', []], ['C', []], ['D', []]]
df = pd.DataFrame(data, columns=['Act', 'PreviousActs'])

actssofar = []
for i, row in df.iterrows():
    actssofar.append(row['Act'])
    print(i, actssofar)
    df.at[i, 'PreviousActs'] = actssofar[:i + 1]  # the slice is a new list too
You need to copy the list before putting it in the DataFrame. It's a mutable object, and what you are currently storing in the DataFrame is a reference to the original list, not a copy of it. Every element in the PreviousActs column is the same list.
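A minimal sketch of the aliasing, independent of pandas, may make this clearer:

```python
# Appending the same list object: every cell sees later mutations
shared = []
cells = []
for act in ['A', 'B']:
    shared.append(act)
    cells.append(shared)           # stores a reference, not a snapshot
print(cells)                       # [['A', 'B'], ['A', 'B']]

# Appending a copy: each cell keeps the state it was given
grown = []
snapshots = []
for act in ['A', 'B']:
    grown.append(act)
    snapshots.append(list(grown))  # list() makes an independent copy
print(snapshots)                   # [['A'], ['A', 'B']]
```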
As you can see in the picture, I want to convert the array to a list and finally write it to Excel. But the problem is that all the data has square brackets and quotation marks around it. What should I do to remove them?
Please help me solve this problem, thank you!
To put it simply: my list is like [['a'], ['b'], ['c']] and I want to convert it to a, b, c.
Here is my code:
import numpy as np
import pandas as pd

def top_rank(result, md_adj, miRNAs, diseases):
    row, col = result.shape
    rows_list = []
    for i in range(3):
        pidx = np.argsort(-result[:, i])       # candidate indices, best score first
        sidx = np.argwhere(md_adj[:, i] == 1)  # indices of known associations
        indices = np.argwhere(np.isin(pidx, sidx))
        index = np.delete(pidx, indices)       # drop the known ones
        a = diseases[i]
        b = miRNAs[index]
        c = np.vstack([a, b]).tolist()
        rows_list.append(c)
    df = pd.DataFrame(rows_list)
    df = df.T
    df.to_excel('test.xlsx')
If you have nested lists [['a'], ['b'], ['c']] then you can use a for-loop (here a list comprehension) to flatten it to ['a', 'b', 'c']:
data = [['a'], ['b'], ['c']]
flatten = [row[0] for row in data]
print(flatten)
Or you can use the fact that ["a"] + ["b"] gives ["a", "b"], so sum() with [] as the starting value also works:
data = [['a'], ['b'], ['c']]
flatten = sum(data, [])
print(flatten)
And if you have a numpy.array then you can simply use arr.flatten():
import numpy as np
data = [['a'], ['b'], ['c']]
arr = np.array(data)
flatten = arr.flatten()
print(flatten)
BUT ... the images show that you actually have [['X', 'Y'], ['a'], ['b'], ['c']], where the first element has two values. That needs a different method to create the flattened ['X Y', 'a', 'b', 'c']: a for-loop with join().
data = [['X', 'Y'], ['a'], ['b'], ['c']]
flatten = [' '.join(row) for row in data]
print(flatten)
The same using map() (note the " ".join, to keep the same 'X Y' result):
data = [['X', 'Y'], ['a'], ['b'], ['c']]
flatten = list(map(" ".join, data))
print(flatten)
print(flatten)
And when you have the flattened list, then your code
rows_list = [flatten]
df = pd.DataFrame(rows_list)
df = df.T
print(df)
gives
0
0 X Y
1 a
2 b
3 c
without [] and ''
BTW:
If you create a dictionary rows[a] = b (after converting a to a string and b to a flattened list), then you don't need to transpose with df = df.T:
import pandas as pd
a = [['X', 'Y']]
b = [['a'], ['b'], ['c']]
print('a:', a)
print('b:', b)
print('---')
a = " ".join(sum(a, []))
b = sum(b, [])
print('a:', a)
print('b:', b)
print('---')
rows = dict()
rows[a] = b
df = pd.DataFrame(rows)
print(df)
gives
a: [['X', 'Y']]
b: [['a'], ['b'], ['c']]
---
a: X Y
b: ['a', 'b', 'c']
---
X Y
0 a
1 b
2 c
I have a pandas dataframe being generated by some other piece of code - the dataframe may have different number of columns each time it is generated: let's call them col1,col2,...,coln where n is not fixed. Please note that col1,col2,... are just placeholders, the actual names of columns can be arbitrary like TimeStamp or PrevState.
From this, I want to convert each column into a list, with the name of the list being the same as the column. So, I want a list named col1 with the entries in the first column of the dataframe and so on till coln.
How do I do this?
Thanks
Creating separate named variables is not recommended; it is better to create a dictionary:
d = df.to_dict('list')
And then select a list by key, using the column name:
print (d['col'])
Sample:
df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
})
d = df.to_dict('list')
print (d)
{'A': ['a', 'b', 'c', 'd', 'e', 'f'], 'B': [4, 5, 4, 5, 5, 4], 'C': [7, 8, 9, 4, 2, 3]}
print (d['A'])
['a', 'b', 'c', 'd', 'e', 'f']
import pandas as pd

df = pd.DataFrame()
df["col1"] = [1, 2, 3, 4, 5]
df["colTWO"] = [6, 7, 8, 9, 10]

# Creates a variable named after each column. Use repr of a plain list:
# repr of an ndarray is "array([...])", which exec() could not evaluate
# without numpy's array() in scope.
for col_name in df.columns:
    exec(col_name + " = " + repr(df[col_name].tolist()))
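If dynamically created names are really wanted, one possible sketch writes into globals() directly instead of building code strings with exec(); note that dynamic names are still generally discouraged for the same reasons:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 2, 3, 4, 5],
                   "colTWO": [6, 7, 8, 9, 10]})

# Bind each column's values to a module-level name matching the column
for col_name in df.columns:
    globals()[col_name] = df[col_name].tolist()

print(col1)     # [1, 2, 3, 4, 5]
```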
Suppose I have two datasets
DS1
ArrayCol
[1,2,3,4]
[1,2,3]
DS2
Key Name
1 A
2 B
3 C
4 D
how to look up the values in the array to map the "Name" so that I can have another dataset like the following?
DS3
COlNew
[A,B,C,D]
[A,B,C]
Thanks. It's in Databricks, so any method is fine: Python, SQL, Scala, ...
You can try this:
ds1 = [[1, 2, 3, 4], [1, 2, 3]]
ds2 = {1: 'A', 2: 'B', 3: 'C', 4: 'D'}
new_data = [[ds2[cell] for cell in col] for col in ds1]
print(new_data)
output:
[['A', 'B', 'C', 'D'], ['A', 'B', 'C']]
Hope that helps. :)
Let's consider that your datasets are in files; then you can do something like this, making use of a dict:
f = open("ds1.txt").readlines()
g = open("ds2.txt").readlines()

# Build the key -> name lookup from the tab-separated second file
u = dict(item.rstrip().split("\t") for item in g)
for i in f:
    i = i.rstrip().strip('][').split(',')
    print([u[col] for col in i])
Output
['A', 'B', 'C', 'D']
['A', 'B', 'C']
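If the two datasets are already pandas DataFrames, one possible sketch builds the lookup dict from DS2 and maps it over each array (the DataFrame layouts below are assumptions based on the question):

```python
import pandas as pd

ds1 = pd.DataFrame({'ArrayCol': [[1, 2, 3, 4], [1, 2, 3]]})
ds2 = pd.DataFrame({'Key': [1, 2, 3, 4], 'Name': ['A', 'B', 'C', 'D']})

# Build the lookup once from DS2, then translate every array element
lookup = dict(zip(ds2['Key'], ds2['Name']))
ds3 = pd.DataFrame({'ColNew': ds1['ArrayCol'].map(lambda arr: [lookup[k] for k in arr])})
print(ds3)
```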
I have a dataframe containing strings and NaNs. I want to str.lower() certain columns by name: to_lower = ['b', 'd', 'e']. Ideally I could do it with a method on the whole dataframe, rather than with a method on df[to_lower]. I have
df[to_lower] = df[to_lower].apply(lambda x: x.astype(str).str.lower())
but I would like a way to do it without assigning to the selected columns.
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b']})
to_lower = ['a']
df2 = df.copy()
df2[to_lower] = df2[to_lower].apply(lambda x: x.astype(str).str.lower())
You can use the assign method and unpack the result as keyword arguments:
df = pd.DataFrame({'a': ['A', 'a'], 'b': ['B', 'b'], 'c': ['C', 'c']})
to_lower = ['a', 'b']
df.assign(**df[to_lower].apply(lambda x: x.astype(str).str.lower()))
# a b c
#0 a b C
#1 a b c
You want this:
for column in to_lower:
    df[column] = df[column].str.lower()
This is far more efficient assuming you have more rows than columns.