Python pandas: list of sublists: total number of items

I've a list like this one:
categories_list = [
['a', array([ 12994, 1262824, 145854, 92469]),
'b', array([273300]),
'c', array([341395, 32857711])],
['a', array([ 356424311, 165573412, 2032850784]),
'b', array([2848105, 228835]),
'c', array([])],
['a', array([1431689, 30655043, 1739919]),
'b', array([597, 251911, 246600]),
'c', array([35590])]
]
where each array belongs to the letter before it.
Example: 'a' -> array([ 12994, 1262824, 145854, 92469]), 'b' -> array([273300]), and in the last sub-list 'a' -> array([1431689, 30655043, 1739919]), and so on.
So, is it possible to retrieve the total number of items for each letter?
Desiderata:
----------
a 10
b 6
c 3
All suggestions are welcome

pd.DataFrame(
[dict(zip(x[::2], [len(y) for y in x[1::2]])) for x in categories_list]
).sum()
a 10
b 6
c 3
dtype: int64
I'm aiming to create a list of dictionaries, so I have to fill in the ...... with something that turns each sub-list into a dictionary:
[ ...... for x in categories_list]
If I call dict on a list or generator of tuples, it builds a dictionary whose keys are the first element of each tuple and whose values are the second.
dict(...list of tuples...)
zip gives me that generator of tuples:
zip(list one, list two)
In each sub-list, the keys are at the even indices [0, 2, 4, ...] and the values are at the odd indices [1, 3, 5, ...]:
#   even     odd
zip(x[::2], x[1::2])
But x[1::2] holds the arrays, and I don't want the arrays; I want their lengths:
#   even     odd
zip(x[::2], [len(y) for y in x[1::2]])
pandas.DataFrame takes a list of dictionaries and creates a dataframe with one row per dictionary and one column per key.
Finally, sum adds up the lengths in each column.
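The same steps can be sketched without pandas. This is only an illustration under stated assumptions, not the answer's method: plain Python lists stand in for the NumPy arrays (len works the same on both), and collections.Counter is swapped in for the DataFrame sum.

```python
from collections import Counter

# Plain lists stand in for the NumPy arrays of the question (an assumption
# made so the sketch runs without NumPy); len() behaves the same on both.
categories_list = [
    ['a', [12994, 1262824, 145854, 92469], 'b', [273300], 'c', [341395, 32857711]],
    ['a', [356424311, 165573412, 2032850784], 'b', [2848105, 228835], 'c', []],
    ['a', [1431689, 30655043, 1739919], 'b', [597, 251911, 246600], 'c', [35590]],
]

totals = Counter()
for x in categories_list:
    # Keys sit at the even indices, arrays at the odd indices.
    per_row = dict(zip(x[::2], [len(y) for y in x[1::2]]))
    totals.update(per_row)  # Counter.update adds counts instead of replacing them

print(dict(totals))
```

Each per_row dictionary corresponds to one row of the DataFrame, and the running Counter plays the role of the column-wise sum.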

I use groupby to group on the keys in columns 0, 2 and 4 (which hold a, b and c respectively) and then count the items in the following column. The count for each group is len(set(group)) for the number of distinct items, or len(group) if you just want the total number of items. Note that groupby only merges consecutive rows with the same key, which is fine here because every row has the same key in a given column. See the code below:
from itertools import groupby, chain
count_distincts = []
cols = [0, 2, 4]
for c in cols:
    for gid, group in groupby(categories_list, key=lambda x: x[c]):
        # flatten all arrays that belong to this key into one list
        group = list(chain(*[list(g[c + 1]) for g in group]))
        count_distincts.append([gid, len(set(group))])
Output (the value of count_distincts): [['a', 10], ['b', 6], ['c', 3]]
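The difference between len(set(group)) and len(group) only shows up when a value repeats. A small sketch with made-up data (the rows below are hypothetical, not taken from the question):

```python
from itertools import chain

# Hypothetical rows: the value 7 appears twice under key 'a'.
rows = [
    ['a', [7, 8]],
    ['a', [7, 9]],
]

flat = list(chain.from_iterable(r[1] for r in rows))
total_items = len(flat)          # counts every item, repeats included
distinct_items = len(set(flat))  # counts each value once

print(total_items, distinct_items)
```

In the question's data every value is unique, so both counts agree there.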

Related

Loop Over every Nth item in Dictionary

Can anyone advise how to loop over every Nth item in a dictionary?
Essentially I have a dictionary of dataframes and I want to create a new dictionary from every 3rd dataframe (including the first), based on the positions in the original. Once I have this I would like to concatenate the dataframes together.
So, for example, if I have 12 dataframes, I would like the new dataframe to contain the first, fourth, seventh, tenth, etc.
Thanks in advance!
If the dict is required, you can slice a tuple of the dict keys:
custom_dict = {
'first': 1,
'second': 2,
'third': 3,
'fourth': 4,
'fifth': 5,
'sixth': 6,
'seventh': 7,
'eighth': 8,
'ninth': 9,
'tenth': 10,
'eleventh': 11,
'twelfth': 12,
}
for key in tuple(custom_dict)[::3]:
    print(custom_dict[key])
Then you can call pandas.concat:
df = pd.concat(
    [
        custom_dict[key]
        for key in tuple(custom_dict)[::3]
    ],
    # axis=0  # to append one DataFrame to another vertically
    axis=1    # to append one DataFrame to another horizontally
)
This assumes custom_dict[key] returns a pandas.DataFrame, not an int as in my code above.
What you ask is a bit strange. Anyway, you have two main options.
Convert your dictionary values to a list and slice that:
out = pd.concat(list(dfs.values())[::3])
output:
   a  b  c
0  x  x  x
0  x  x  x
0  x  x  x
0  x  x  x
Slice your dictionary keys and generate a sub-dictionary:
out = pd.concat({k: dfs[k] for k in list(dfs)[::3]})
output:
        a  b  c
df1  0  x  x  x
df4  0  x  x  x
df7  0  x  x  x
df10 0  x  x  x
Input used:
dfs = {f'df{i+1}': pd.DataFrame([['x']*3], columns=['a', 'b', 'c']) for i in range(12)}
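As a further illustration, the "every third item" selection can also be written with itertools.islice, which avoids building an intermediate list of keys. This is a sketch only: strings stand in for the DataFrames (an assumption that keeps the example free of pandas), and it relies on dicts preserving insertion order (Python 3.7+).

```python
from itertools import islice

# Strings stand in for the DataFrames (an assumption to keep the sketch
# free of pandas); dicts preserve insertion order in Python 3.7+.
dfs = {f'df{i + 1}': f'frame {i + 1}' for i in range(12)}

# islice(iterable, start, stop, step): take items 0, 3, 6, 9 lazily.
every_third = dict(islice(dfs.items(), 0, None, 3))

print(list(every_third))
```

With real DataFrames as values, every_third could be passed to pd.concat in the same way as the sub-dictionary above.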

How to get the closest indexes within a list

If there are two lists:
One being the items:
items = ['A', 'A', 'A', 'B', 'B', 'C', 'C']
The other being their indexes:
index = [0, 15, 20, 2, 16, 7, 17]
ie. The first 'A' is in index 0, the second 'A' is in index 15, etc.
How would I be able to get the closest indexes for the unique items, A, B, and C
ie. Get 15, 16, 17?
You can achieve this with a simple script. Taking those two lists as input, you find the first occurrence of each letter in the letter list and take the corresponding entry of the number list:
# items and index are the two lists from the question
list_of_repeated = []
list_of_closest = []
for letter in items:
    if letter in list_of_repeated:
        continue
    else:
        list_of_repeated.append(letter)
        list_of_closest.append(index[items.index(letter)])
What you are trying to do is minimize the sum of differences between indices.
You can find the minimal combination like this:
import numpy as np
from itertools import product
items = ['A', 'A', 'A', 'B', 'B', 'C', 'C']
index = [0, 15, 20, 2, 16, 7, 17]
cost = np.inf
for combination in product(*[list(filter(lambda x: x[0] == i, zip(items, index))) for i in set(items)]):
    diff = sum(abs(np.ediff1d([i[1] for i in combination])))
    if diff < cost:
        cost = diff
        idx = combination
print(idx)
This brute-forces the solution; there may be more elegant or faster ways to do it, but this is what comes to mind on the fly.
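The same brute-force idea can be sketched with the standard library only. Here "closest" is taken to mean the combination with the smallest spread (max minus min), which is an assumption about the intent and differs slightly from the sum of consecutive differences used above; the names candidates, best and closest are illustrative.

```python
from collections import defaultdict
from itertools import product

items = ['A', 'A', 'A', 'B', 'B', 'C', 'C']
index = [0, 15, 20, 2, 16, 7, 17]

# Collect the candidate indices of each item, keeping first-seen order.
candidates = defaultdict(list)
for item, idx in zip(items, index):
    candidates[item].append(idx)

# Try one index per item and keep the combination with the smallest spread.
# "Closest" is taken to mean smallest max - min, which is an assumption.
best = min(product(*candidates.values()), key=lambda combo: max(combo) - min(combo))
closest = dict(zip(candidates, best))

print(closest)
```

Like the version above, this enumerates every combination, so the cost grows with the product of the group sizes.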

A column in my dataframe does not seem to correspond to the input List (python)

I want to assign one of the columns of my dataframe to a list. I used the code below.
listone = [['a', 'b', 'c'], ['m', 'g'], ['h'], ['y', 't', 'r']]
df['Letter combinations'] = listone
The 'Letter Combinations' column in the dataframe doesn't correspond to the list, instead seems to assign random elements to each row in the column. I was wondering if this method indexes the elements differently causing a change in the order or if there is something wrong with my code. Any help would be appreciated!
Edit: Here is my complete code
listone = [[a, b, c], [m, g], [h], [y, t, r]]
numbers = [1, 2, 3, 4]
my_matrix = {'Numbers': numbers}
sample = pd.DataFrame(my_matrix)
sample['Letter combinations'] = listone
sample
My output looks like:
```
   Numbers Letter combination
0        1                [b]
1        2             [m, g]
2        3                 []
3        4                [r]
```
You need to make listone a Series, i.e.:
sample['Letter combinations'] = pd.Series(listone)
sample
   Numbers Letter combinations
0        1           [a, b, c]
1        2              [m, g]
2        3                 [h]
3        4           [y, t, r]

Python/Numpy subarray selection

I have some NumPy code which I'm trying to decipher. There's a line v1 = v1[:, a1.tolist()] which takes a NumPy array a1 and converts it into a list. I'm confused as to what v1[:, a1.tolist()] actually does. I know that v1 is being set to an array selected from v1 by [:, a1.tolist()], but what's getting selected? More precisely, what is [:, a1.tolist()] doing?
The syntax you observed is easier to understand if you split it in two parts:
1. Using a list as index
With numpy the meaning of
a[[1,2,3]]
is
[a[1], a[2], a[3]]
In other words, using a list as an index is like building a list of the elements found at each of those indices.
2. Selecting a column with [:,x]
The meaning of
a2[:, x]
is
[a2[0][x],
a2[1][x],
a2[2][x],
...
a2[n-1][x]]
I.e. it selects one column from a matrix.
Summing up
The meaning of
a[:, [1, 3, 5]]
is therefore
[[a[ 0 ][1], a[ 0 ][3], a[ 0 ][5]],
[a[ 1 ][1], a[ 1 ][3], a[ 1 ][5]],
...
[a[n-1][1], a[n-1][3], a[n-1][5]]]
In other words, a copy of a with a selection of columns (possibly duplicated or reordered; the elements in the list of indexes don't need to be distinct or sorted).
Assuming a simple example like a 2D array, v1[:, a1.tolist()] selects all rows of v1, but only the columns listed in a1.
Simple example:
>>> x
array([['a', 'b', 'c'],
       ['d', 'f', 'g']],
      dtype='|S1')
>>> x[:,[0]]
array([['a'],
       ['d']],
      dtype='|S1')
>>> x[:,[0, 1]]
array([['a', 'b'],
       ['d', 'f']],
      dtype='|S1')
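To make the selection rule concrete, here is a pure-Python sketch of what a[:, columns] computes for a 2D array, written with nested lists instead of NumPy. The helper name select_columns is hypothetical and only serves the illustration.

```python
def select_columns(matrix, columns):
    """Return a new list of rows holding only the given columns, in order."""
    return [[row[c] for c in columns] for row in matrix]

x = [['a', 'b', 'c'],
     ['d', 'f', 'g']]

first_two = select_columns(x, [0, 1])
reordered = select_columns(x, [2, 0, 0])  # indices may repeat or be unsorted

print(first_two)
print(reordered)
```

The first call mirrors x[:,[0, 1]] from the session above; the second shows that the index list can reorder and duplicate columns.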

Iterate over two nested 2D lists where list2 has list1's row numbers

I'm new to Python. So I want to get this done with loops without using some fancy stuff like generators. I have two 2D arrays, one integer array and the other string array like this:
Integer 2D list:
Here, dataset2d[0][0] is the number of rows in the table and dataset2d[0][1] is the number of columns, so the 2D list below has 6 rows and 4 columns:
dataset2d = [
[6, 4],
[0, 0, 0, 1],
[1, 0, 2, 0],
[2, 2, 0, 1],
[1, 1, 1, 0],
[0, 0, 1, 1],
[1, 0, 2, 1]
]
String 2D list:
partition2d = [
['A', '1', '2', '4'],
['B', '3', '5'],
['C', '6']
]
partition2d[*][0], i.e. the first column, is a label. For group A, 1, 2 and 4 are the row numbers that I need to pick up from dataset2d and apply a formula to. So I will read 1, go to row 1 in dataset2d and read the first column value, i.e. dataset2d[1][0]; then I will read 2 from partition2d, go to row 2 of dataset2d and read the first column, i.e. dataset2d[2][0]. Similarly, next I'll read dataset2d[4][0].
Then I will do some calculations, get a value and store it in a 2D list, then go to the next column in dataset2d for those rows. So in this example, the next values read would be dataset2d[1][1], dataset2d[2][1], dataset2d[4][1]. Again I do some calculation, get one value for that column and store it. I'll do this until I reach the last column of dataset2d.
The next row in partition2d is [B, 3, 5]. So I'll start with dataset2d[3][0], dataset2d[5][0] and get a value for that column by a formula. Then I read dataset2d[3][1], dataset2d[5][1], etc. until I reach the last column. I do this until all rows in partition2d are read.
What I tried:
for partitionRow in partition2d:
    for partitionCol in partitionRow:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
What problem I'm facing:
partition2d is a string array where I need to skip the first column which has characters like A,B,C.
I want to iterate in dataset2d column wise only over the row numbers given in partition2d. So the colDataset should increment only after I'm done with that column.
Update1:
I'm reading the contents from a text file, and the data in 2D lists can vary, depending on file content and size, but the structure of file1 i.e dataset2d and file2 i.e partition2d will be the same.
Update2: Since Eric asked what the output should look like:
0.842322 0.94322 0.34232 0.900009 (For A)
0.642322 0.44322 0.24232 0.800009 (For B)
This is just an example and the numbers are randomly typed by me.
So the first number, 0.842322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group A, considering rows 1, 2, 4.
The second number, 0.94322, is the result of applying the formula to column 1 of dataset2d, i.e. dataset2d[partitionCol][1], for group A, considering rows 1, 2, 4.
The third number, 0.34232, is the result of applying the formula to column 2 of dataset2d, i.e. dataset2d[partitionCol][2], for group A, considering rows 1, 2, 4. Similarly we get 0.900009.
The first number in the second row, 0.642322, is the result of applying the formula to column 0 of dataset2d, i.e. dataset2d[partitionCol][0], for group B, considering rows 3, 5. And so on.
You can use NumPy (I hope this is not too fancy for you):
import numpy
dataset2D = [ [6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1] ]
dataset2D_size = dataset2D[0]
dataset2D = numpy.array(dataset2D)
partition2D = [ ['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6'] ]
for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    # Take the specified rows
    rows = dataset2D[row_indices]
    # Iterate the columns (this is the power of Python!)
    for column in zip(*rows):
        # Now, column will contain one column of data from specified row indices
        print column,  # Apply your formula here
    print
Or, if you don't want to install NumPy, here is what you can do (this is what you want, actually):
dataset2D = [ [6, 4], [0, 0, 0, 1], [1, 0, 2, 0], [2, 2, 0, 1], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 2, 1] ]
partition2D = [ ['A', '1', '2', '4'], ['B', '3', '5'], ['C', '6'] ]
dataset2D_size = dataset2D[0]
for partition in partition2D:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]
    rows = [dataset2D[row_idx] for row_idx in row_indices]
    for column in zip(*rows):
        print column,
    print
Both will print:
(0, 1, 1) (0, 0, 1) (0, 2, 1) (1, 0, 0)
(2, 0) (2, 0) (0, 1) (1, 1)
(1,) (0,) (2,) (1,)
Explanation of the second version (without NumPy):
[dataset2D[row_idx] for row_idx in row_indices]
This takes each requested row (dataset2D[row_idx]) and collects them in a list. So the result of this expression is a list of lists, one per specified row index.
for column in zip(*rows):
zip(*rows) then iterates column-wise, which is what you want. It works by taking the first element of each row and combining them into a tuple, then the second element of each row, and so on. In each iteration, the resulting tuple is stored in the variable column.
So inside the for column in zip(*rows): loop you already have the column-wise elements from the specified rows.
To apply your formula, replace print column, with whatever you want to do. For example, I modified the code to include the row and column numbers:
    print 'Processing partition %s' % label
    for (col_num, column) in enumerate(zip(*rows)):
        print 'Column number: %d' % col_num
        for (row_num, element) in enumerate(column):
            print '[%d,%d]: %d' % (row_indices[row_num], col_num, element)
which will result in:
Processing partition A
Column number: 0
[1,0]: 0
[2,0]: 1
[4,0]: 1
Column number: 1
[1,1]: 0
[2,1]: 0
[4,1]: 1
Column number: 2
[1,2]: 0
[2,2]: 2
[4,2]: 1
Column number: 3
[1,3]: 1
[2,3]: 0
[4,3]: 0
Processing partition B
Column number: 0
[3,0]: 2
[5,0]: 0
Column number: 1
[3,1]: 2
[5,1]: 0
Column number: 2
[3,2]: 0
[5,2]: 1
Column number: 3
[3,3]: 1
[5,3]: 1
Processing partition C
Column number: 0
[6,0]: 1
Column number: 1
[6,1]: 0
Column number: 2
[6,2]: 2
Column number: 3
[6,3]: 1
I hope this helps.
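For readers on Python 3, the same column-wise iteration can be sketched as follows. The real formula is not given in the question, so a plain sum is used as a placeholder; the function name formula and the results dictionary are assumptions made for the illustration.

```python
dataset2d = [
    [6, 4],
    [0, 0, 0, 1],
    [1, 0, 2, 0],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 2, 1],
]
partition2d = [
    ['A', '1', '2', '4'],
    ['B', '3', '5'],
    ['C', '6'],
]


def formula(column):
    # Placeholder: the actual formula is not stated in the question.
    return sum(column)


results = {}
for partition in partition2d:
    label = partition[0]
    row_indices = [int(i) for i in partition[1:]]  # skip the label, convert to int
    rows = [dataset2d[i] for i in row_indices]
    # zip(*rows) yields one tuple per column of the selected rows
    results[label] = [formula(column) for column in zip(*rows)]

for label, values in results.items():
    print(label, values)
```

Each label ends up with one value per column, which matches the shape of the expected output described in Update2.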
Here's an extensible solution using a generator:
def partitions(data, p):
    for partition in p:
        label = partition[0]
        row_indices = [int(i) for i in partition[1:]]  # row numbers are stored as strings
        rows = [data[row_idx] for row_idx in row_indices]
        columns = zip(*rows)
        yield label, columns
for label, columns in partitions(dataset2D, partition2D):
    print "Processing", label
    for column in columns:
        print column
To address your problems:
What problem I'm facing:
partition2d is a string array where I need to skip the first column which has characters like A,B,C.
I want to iterate in dataset2d column wise only over the row numbers given in partition2d. So the colDataset should increment only after I'm done with that column.
Problem 1 can be solved using slicing: if you want to iterate over a row of partition2d from the second element onwards, you can write for partitionCol in partitionRow[1:]. This slices the row from the second element to the end.
So something like:
for partitionRow in partition2d:
    for partitionCol in partitionRow[1:]:
        for colDataset in dataset2d:
            print dataset2d[partitionCol][colDataset]
Problem 2: I didn't understand what you want :)
partition2d is a string array where I need to skip the first column which has characters like A,B,C.
This is called slicing:
for partitionCol in partitionRow[1:]:
The above snippet will skip the first column.
for colDataset in dataset2d:
already does what you want. There is no index-based loop structure here as in C++. You could still do it in a very un-Pythonic way:
i = 0
for i in range(len(dataset2d)):
    print dataset2d[partitionCol][i]
    i += 1  # redundant: the for loop already advances i
This is a very bad way of doing things. For arrays and matrices, I suggest you don't reinvent the wheel (that is also Pythonic) and look at NumPy, especially at:
numpy.loadtxt
Setup:
d = [[6,4],[0,0,0,1],[1,0,2,0],[2,2,0,1],[1,1,1,0],[0,0,1,1],[1,0,2,1]]
s = [['A',1,2,4],['B',3,5],['C',6]]
The results are put into a list l:
l = []
for r in s:  # go over each [character, index0, index1, ...]
    new_r = [r[0]]  # new list for the values picked via indexN; starts with the character
    for i, c in enumerate(r[1:]):  # go over each indexN, using enumerate to keep track of N
        new_r.append(d[c][i])  # c is the row number taken from s; i (the N in indexN) is the column
    l.append(new_r)  # add that new list to l
Resulting in
>>> l
[['A', 0, 0, 1], ['B', 2, 0], ['C', 1]]
The execution of the first iteration would look like:
for r in s:
    # -> r = ['A', 1, 2, 4]
    new_r = [r[0]]  # = ['A']
    for i, c in enumerate(r[1:]):  # r[1:] = [1, 2, 4]
        # -> i = 0, c = 1
        new_r.append(d[c][i])  # d[1][0]
        # -> i = 1, c = 2
        # ...
