I have a Python list:
my_list = [1, 'V']
I have a pandas DataFrame:
   A  B  C
0  f  v  b
1  f  i  n
2  f  i  m
I need to create a new column in my DataFrame in which every row holds my_list:
   A  B  C         D
0  f  v  b  [1, 'V']
1  f  i  n  [1, 'V']
2  f  i  m  [1, 'V']
As far as I understand, Python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
   A  B       H
0  f  i  [n, m]
1  f  v     [b]
Is this possible without converting the type of my_list? What is the easiest way to do it?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but neither gave the result I expected.
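You can try np.repeat, setting its repeats argument to the number of rows, which you can read from the shape of the DataFrame. Wrap the list in a 2-D object array and repeat along axis 0; calling np.repeat on my_list directly flattens it into six values, and the assignment then fails with a length mismatch.
import numpy as np
import pandas as pd
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
# dtype=object keeps 1 as an int instead of converting it to the string '1'
df['new_col'] = np.repeat(np.array([my_list], dtype=object), df.shape[0], axis=0).tolist()
This repeats my_list once for each row of the DataFrame.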
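If NumPy is not needed, a plain-Python sketch of the same idea is to build one list per row. The column names follow the question; copying the list for each row is my own assumption, made so that editing one cell does not change the others.

```python
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'], 'B': ['v', 'i', 'i'], 'C': ['b', 'n', 'm']})

# One independent copy of my_list per row; the outer list has len(df) items,
# so pandas stores each inner list as a single cell value.
df['D'] = [list(my_list) for _ in range(len(df))]

print(df)
```

The direct assignment `df['D'] = my_list` fails here because pandas tries to spread the two list items over three rows.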
Alternatively, you can build a new array from my_list, join it to the existing data with hstack, and form a new DataFrame from the result. Note that this produces two extra string columns rather than a single column of lists. The code below has been tested and works fine.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another version that has been tested:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))
I'm a little bit stuck. I have a DataFrame with a list in a column.
id  list
1   [a, b]
2   [a,a,a,b]
3   c,b,b
4   [c,a]
5   [f,f,b]
In general, the possible values are a, b, c, d, e and f.
I want to count how often two values appear together in a list, and also how often a value appears more than once in a list.
I want to use these counts to create a heatmap with all values on the x and y axes, where each cell shows, e.g., how many times a appears in a list with itself, or how many times a and b appear together.
I tried this so far, but it is not exactly the solution I want.
# Make new columns and count values
df['a'] = df['list'].explode().str.contains('a').groupby(level=0).any().astype('int')
df['b'] = df['list'].explode().str.contains('b').groupby(level=0).any().astype('int')
df['c'] = df['list'].explode().str.contains('c').groupby(level=0).any().astype('int')
df['d'] = df['list'].explode().str.contains('d').groupby(level=0).any().astype('int')
df['e'] = df['list'].explode().str.contains('e').groupby(level=0).any().astype('int')
df['f'] = df['list'].explode().str.contains('f').groupby(level=0).any().astype('int')
Here I hit the first problem: I create a new df whose rows are named after the list values and that counts the values in the lists, but a value is also counted when it appears only once in a list.
# make x axis
df_explo = df.explode(['list'],ignore_index=True)
# get sum of all
df2=df_explo.groupby(['list']).agg({'a':'sum','b':'sum','c':'sum','d':'sum','e':'sum','f':'sum'}).reset_index()
# set index to list
df3 = df2.set_index('list')
# create heatmap
sns.heatmap(df3,cmap='RdYlGn_r', linewidths=0.5,annot=True,fmt="d")
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations
data = [
['a', 'b'],
['a', 'a', 'a', 'b'],
['b', 'b', 'b'],
['c', 'a'],
['f', 'f', 'b']
]
letters = ['a', 'b', 'c', 'd', 'e', 'f']
duplicate_occurrences = pd.DataFrame(0, index=[0], columns=letters)
co_occurrences = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    duplicates = [k for k, v in Counter(l).items() if v > 1]
    for d in duplicates:
        duplicate_occurrences[d] += 1
    co = list(combinations(set(l), 2))
    for a, b in co:
        co_occurrences.loc[a, b] += 1
        co_occurrences.loc[b, a] += 1
plt.figure(figsize=(7, 1))
sns.heatmap(duplicate_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Duplicate Occurrence Counts')
plt.show()
sns.heatmap(co_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Co-Occurrence Counts')
plt.show()
The first plot shows how often each letter occurs at least twice in a list, the second shows how often each pair of letters occurs together in a list.
In case you want to plot the duplicate occurrences on the diagonal, you could do it e.g. as follows:
df = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    for k, v in Counter(l).items():
        if v > 1:
            df.loc[k, k] += 1
    for a, b in combinations(set(l), 2):
        df.loc[a, b] += 1
        df.loc[b, a] += 1
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
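The same counts can also be computed without explicit pair loops. The sketch below is an alternative to the loop above, not the answer's own method: it reuses the data and letters lists from the answer, builds a per-list count table (the names counts, present and result are mine), and uses a matrix product for the co-occurrences.

```python
from collections import Counter

import numpy as np
import pandas as pd

data = [['a', 'b'], ['a', 'a', 'a', 'b'], ['b', 'b', 'b'], ['c', 'a'], ['f', 'f', 'b']]
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# One row per list, one column per letter; cells hold how often the letter occurs.
counts = pd.DataFrame([dict(Counter(l)) for l in data], columns=letters).fillna(0).astype(int)

# 1 where a letter is present in a list, 0 otherwise.
present = (counts > 0).astype(int).to_numpy()

# present.T @ present counts, for every pair of letters, the lists containing both.
matrix = present.T @ present

# Overwrite the diagonal with the number of lists in which a letter is duplicated.
np.fill_diagonal(matrix, (counts > 1).sum().to_numpy())

result = pd.DataFrame(matrix, index=letters, columns=letters)
print(result)
```

The resulting table can be passed to sns.heatmap in the same way as the loop-built one.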
I wrote a simple recursive function to drop the column with maximum sum until the dataframe has been reduced to the size I want. Here's the code:
s = pd.DataFrame({'a': [1,1,1,1,1,1],
                  'b': [2,2,2,2,2,2],
                  'c': [3,3,3,3,3,3],
                  'd': [4,4,4,4,4,4],
                  'e': [5,5,5,5,5,5]})
def recSelect(inputdf):
    if inputdf.shape[1]<=2:
        return inputdf
    else:
        total = inputdf.sum()
        idx = total.idxmax()
        inputdf.drop(idx, axis=1, inplace=True)
        return recSelect(inputdf)
recSelect(s)
In the code above, column 'e' is removed first, then column 'd', and then 'c'. My question is: how do I properly return each 'idx' and get the list ['e', 'd', 'c']?
Here's what I have tried, but it is not working:
idxs = [] # create an empty list
def recSelect(inputdf):
    if inputdf.shape[1]<=2:
        return inputdf
    else:
        total = inputdf.sum()
        idx = total.idxmax()
        idxs.append(idx) # append each idx
        inputdf.drop(idx, axis=1, inplace=True)
        return recSelect(inputdf), idxs
Try to avoid a global variable; pass the list through the recursion instead. Add an extra parameter to the function that stores the removed column names. Its default value is None rather than [] so that the list isn't shared between separate calls: initialise the empty list on the first call, and update it each time you remove a column.
import pandas as pd
s = pd.DataFrame({'a': [1,1,1,1,1,1],
                  'b': [2,2,2,2,2,2],
                  'c': [3,3,3,3,3,3],
                  'd': [4,4,4,4,4,4],
                  'e': [5,5,5,5,5,5]})
def recSelect(inputdf, removed=None):
    # test for None explicitly so that a caller-supplied empty list is kept
    if removed is None:
        removed = []
    if inputdf.shape[1]<=2:
        return inputdf, removed
    else:
        total = inputdf.sum()
        idx = total.idxmax()
        inputdf.drop(idx, axis=1, inplace=True)
        removed.append(idx)
        return recSelect(inputdf, removed)
vals, removed = recSelect(s)
print(removed)
prints
['e', 'd', 'c']
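Note that drop(..., inplace=True) also changes the DataFrame passed in by the caller. If that is not wanted, a variant along these lines leaves the input untouched; the name rec_select and the keep parameter are my own additions, not part of the original answer.

```python
import pandas as pd

def rec_select(frame, keep=2, removed=None):
    """Drop the column with the largest sum until `keep` columns remain."""
    if removed is None:
        removed = []
    if frame.shape[1] <= keep:
        return frame, removed
    idx = frame.sum().idxmax()
    # drop() without inplace returns a new DataFrame, so `frame` is not modified
    return rec_select(frame.drop(columns=idx), keep, removed + [idx])

s = pd.DataFrame({'a': [1] * 6, 'b': [2] * 6, 'c': [3] * 6, 'd': [4] * 6, 'e': [5] * 6})
reduced, removed = rec_select(s)
print(removed)
print(list(s.columns))
```

Here s still has all five columns after the call, while reduced holds only the two that remain.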
You can try the following code:
import pandas as pd
your_list = list() # the tracking list
s = pd.DataFrame({'a': [1,1,1,1,1,1],
                  'b': [2,2,2,2,2,2],
                  'c': [3,3,3,3,3,3],
                  'd': [4,4,4,4,4,4],
                  'e': [5,5,5,5,5,5]})
def recSelect(inputdf):
    if inputdf.shape[1]<=2:
        return inputdf
    else:
        total = inputdf.sum()
        idx = total.idxmax()
        your_list.append(idx) # append the dropped idx
        inputdf.drop(idx, axis=1, inplace=True)
        return recSelect(inputdf)
recSelect(s)
print(your_list) # there you go!
Output: ['e', 'd', 'c']
If you are okay with a different approach to the problem: why write a recursive function when you can sort the columns by their sums and read the order off the index? For example:
s = pd.DataFrame({'a': [1,1,1,1,1,1],
                  'b': [2,2,2,2,2,2],
                  'c': [5,5,5,5,5,5],
                  'd': [4,4,4,4,4,4],
                  'e': [1,5,5,5,5,5]})
sum_order = s.sum().sort_values().index
ndf = s[sum_order[:2]]
li = sum_order[2:][::-1].tolist()
Output:
ndf
   a  b
0  1  2
1  1  2
2  1  2
3  1  2
4  1  2
5  1  2
li
['c', 'e', 'd']
When you are working in pandas, try to avoid loops as much as possible.
I want to create a matrix.
Input:
data = [
{'a': 2, 'g': 1},
{'p': 3, 'a': 5, 'cat': 4}
...
]
Output:
     a  p  cat  g
1st  2  0    0  1
2nd  5  3    4  0
This is my code, but I think it is not smart, and it is very slow when the data is huge.
Is there a better way to do this?
Thank you.
data = [
    {'a': 2, 'g': 1},
    {'p': 3, 'a': 5, 'cat': 4}
]
### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map) # e.g. ['a', 'p', 'cat', 'g'] (set order is arbitrary)
### Create matrix ###
result = []
for row in data:
    matrix = [0] * len(key_map)
    for k, v in row.items():
        matrix[key_map.index(k)] = v
    result.append(matrix)
print(result)
# [[2, 0, 0, 1], [5, 3, 4, 0]] for the key order shown above
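Edited
Following @wwii's advice, using pandas looks good:
from pandas import DataFrame
result = DataFrame(data, index=range(len(data)))
# as_matrix() no longer exists in current pandas; .values gives the same array
print(result.fillna(0).astype(int).values.tolist())
# e.g. [[2, 0, 1, 0], [5, 4, 0, 3]] (column order depends on the pandas version)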
You can use a set comprehension to generate key_map:
key_map = list({key for row in data for key in row})
Here is a partial answer. I couldn't get the columns in the order specified - it is limited by how the keys get ordered in the set, key_map. It uses string formatting to line the data up - you can play around with the spacing to fit larger or smaller numbers.
# ordinal from
# http://code.activestate.com/recipes/576888-format-a-number-as-an-ordinal/
from ordinal import ordinal
data = [
{'a': 2, 'g': 1},
{'p': 3, 'a': 5, 'cat': 4}
]
### Get keyword map ###
key_map = set()
for row in data:
    key_map = key_map.union(set(row.keys()))
key_map = list(key_map) # e.g. ['a', 'p', 'g', 'cat'] (set order is arbitrary)
# strings to format the output
header = '{: >10}{: >8}{: >8}{: >8}'.format(*key_map)
line_fmt = '{: <8}{: >2}{: >8}{: >8}{: >8}'
print(header)
def ordered_data(d, keys):
    """Returns an ordered list of dictionary values.
    returns 0 if key not in d
    d --> dict
    keys --> list of keys
    returns list
    """
    return [d.get(key, 0) for key in keys]
for i, thing in enumerate(data):
    print(line_fmt.format(ordinal(i+1), *ordered_data(thing, key_map)))
Output
         a       p       g     cat
1st      2       0       1       0
2nd      5       3       0       4
It might be worthwhile to dig into the Pandas docs and look at its DataFrame - it might make life easier.
I second the answer using pandas DataFrames. However, my code should be a bit simpler than yours.
In [1]: import pandas as pd
In [5]: data = [{'a': 2, 'g': 1},{'p': 3, 'a': 5, 'cat': 4}]
In [6]: df = pd.DataFrame(data)
In [7]: df
Out[7]:
   a  cat    g    p
0  2  NaN    1  NaN
1  5    4  NaN    3
In [9]: df = df.fillna(0)
In [10]: df
Out[10]:
   a  cat  g  p
0  2    0  1  0
1  5    4  0  3
I did my coding in IPython, which I highly recommend!
To save to csv, just use an additional line of code:
df.to_csv('filename.csv')
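If the column order from the question matters, one option (a sketch, assuming a reasonably recent pandas) is to pass the desired order through the columns argument. The order and labels lists below are written out by hand for this example.

```python
import pandas as pd

data = [{'a': 2, 'g': 1}, {'p': 3, 'a': 5, 'cat': 4}]
order = ['a', 'p', 'cat', 'g']   # column order requested in the question
labels = ['1st', '2nd']          # hand-written row labels for this example

# Keys missing from a row become NaN; replace them with 0 and restore the int dtype.
df = pd.DataFrame(data, index=labels, columns=order).fillna(0).astype(int)
print(df)
print(df.values.tolist())
```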
I am new to Python, so these are just suggestions that will hopefully be helpful :)
You can change the middle part to this, which will save you some time when building key_map:
key_map = []
for row in data:
    key_map.extend(row.keys())
key_map = list(set(key_map))
In your version, union builds a new set for every row, so the keys collected so far are copied each time.
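For comparison, here is a pure-Python sketch that collects the keys in a single call and builds the rows with dict.get. Sorting the keys is my own choice, made only so that the column order is reproducible.

```python
data = [{'a': 2, 'g': 1}, {'p': 3, 'a': 5, 'cat': 4}]

# set().union(*data) iterates over the keys of every dict in one call.
key_map = sorted(set().union(*data))

# dict.get(key, 0) fills in 0 for keys a row does not have.
result = [[row.get(key, 0) for key in key_map] for row in data]

print(key_map)
print(result)
```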
I have a list of numpy arrays that contains a list of name-value pairs which are both strings. Every name and value can be found multiple times in the list, and I would like to convert it to a binary matrix.
The columns represent the values while the rows represent a key/name, and when a field is set to 1 it represents that particular name value pair.
E.g.
I have
A : aa
A : bb
A : cc
B : bb
C : aa
and I want to convert it to
   aa  bb  cc
A   1   1   1
B   0   1   0
C   1   0   0
I have some code that does this, but I was wondering if there is an easier, out-of-the-box way of doing this with numpy or some other library.
This is my code so far:
resources = set(result[:,1])
resourcesDict = {}
i = 0
for r in resources:
    resourcesDict[r] = i
    i = i + 1
clients = set(result[:,0])
clientsDict = {}
i = 0
for c in clients:
    clientsDict[c] = i
    i = i + 1
arr = np.zeros((len(clientsDict),len(resourcesDict)), dtype = 'bool')
for line in result[:,0:2]:
    arr[clientsDict[line[0]],resourcesDict[line[1]]] = True
and result contains the following:
array([["a","aa"],["a","bb"],..]
I feel that using pandas.DataFrame.pivot is the best way:
>>> df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
   foo bar  baz
0  one   A    1
1  one   B    2
2  one   C    3
3  two   A    4
4  two   B    5
5  two   C    6
Or you can load your pair list using
>>> df = pd.read_csv('ratings.csv')
Then
>>> df.pivot(index='foo', columns='bar', values='baz')
     A  B  C
one  1  2  3
two  4  5  6
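For the name/value pairs in the question there is no third column to use as values, so pivot does not apply directly. A possible alternative, shown here as a sketch rather than as part of the answer above, is pd.crosstab; the column names name and value are placeholders I chose.

```python
import pandas as pd

pairs = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
df = pd.DataFrame(pairs, columns=['name', 'value'])

# crosstab counts how often each (name, value) pair occurs;
# clip(upper=1) turns the counts into a 0/1 matrix if pairs repeat.
binary = pd.crosstab(df['name'], df['value']).clip(upper=1)
print(binary)
```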
You probably have something like
m_dict = {'A': ['aa', 'bb', 'cc'], 'B': ['bb'], 'C': ['aa']}
I would go like this:
from collections import defaultdict
res = {}
for k, v in m_dict.items():
    res[k] = defaultdict(int)
    for col in v:
        # index by the single value col, not by the whole list v
        res[k][col] = 1
Edit:
Given your format, it would probably be more along the lines of:
m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1
Both give:
>>> res['A']['aa']
1
>>> res['B']['aa']
0
This is a job for np.unique. It is not clear what format your data is in, but you need to get two 1-D arrays, one with the keys, another with the values, e.g.:
import numpy as np
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
# np.int was removed from NumPy; the built-in int is equivalent here
out = np.zeros((len(rows), len(cols)), dtype=int)
out[row_idx, col_idx] += 1
>>> out
array([[1, 1, 1],
       [0, 1, 0],
       [1, 0, 0]])
>>> rows
array(['A', 'B', 'C'], dtype='<U2')
>>> cols
array(['aa', 'bb', 'cc'], dtype='<U2')
If you have no repeated key-value pairs, this code will work just fine. If there are repetitions, they are counted only once by the indexed += above, so I would suggest abusing scipy's sparse module, which sums duplicate entries:
import scipy.sparse as sps
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
# toarray() converts the sparse matrix to a dense ndarray
out = sps.coo_matrix((np.ones_like(row_idx), (row_idx, col_idx))).toarray()
>>> out
array([[1, 2, 1],
       [0, 1, 0],
       [1, 0, 0]])
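If SciPy is not available, NumPy's unbuffered np.add.at also accumulates repeated index pairs. This is a sketch of an alternative, reusing the kvp array from the example with the repeated pair:

```python
import numpy as np

kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)

out = np.zeros((len(rows), len(cols)), dtype=int)
# Unlike out[row_idx, col_idx] += 1, np.add.at adds once per occurrence,
# so the repeated ('A', 'bb') pair is counted twice.
np.add.at(out, (row_idx, col_idx), 1)
print(out)
```

For a strictly binary matrix, the counts can be reduced afterwards with (out > 0).astype(int).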
d = {'A': ['aa', 'bb', 'cc'], 'C': ['aa'], 'B': ['bb']}
rows = 'ABC'
cols = ('aa', 'bb', 'cc')
print('  ', ' '.join(cols))
for row in rows:
    print(row, ' ', end='')
    for col in cols:
        print(' 1' if col in d.get(row) else ' 0', end=' ')
    print()
Output:
   aa bb cc
A   1  1  1
B   0  1  0
C   1  0  0