Find size/internal structure of list in Python - python

If I have a list c like so:
a = [1,2,3,4]
c = [a,a]
What's the simplest way of finding that it's a list of length two where each element is a list of length 4? If I do len(c) I get 2, but it doesn't give any indication that those elements are lists, or what their sizes are, unless I explicitly do something like
print(type(c[0]))
print(len(c[0]))
print(len(c[1]))
I could do something like
import numpy as np
np.asarray(c).shape
which gives me (2,4), but that only works when the internal lists are of equal size. If instead, the list is like
a = [1,2,3,4]
b = [1,2]
d = [a,b]
then np.asarray(d).shape just gives me (2,). In this case, I could do something like
import pandas as pd
df = pd.DataFrame(d)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
0 2 non-null int64
1 2 non-null int64
2 1 non-null float64
3 1 non-null float64
dtypes: float64(2), int64(2)
memory usage: 144.0 bytes
From this, I can tell that there are lists inside the original list, but I would like to be able to see this without using pandas. What's the best way to look at the internal structure of a list?

Depending on the output format you expect, you could write a recursive function that returns nested tuples of length and shape.
Code
def shape(lst):
    length = len(lst)
    shp = tuple(shape(sub) if isinstance(sub, list) else 0 for sub in lst)
    if any(x != 0 for x in shp):
        return length, shp
    else:
        return length
Examples
lst = [[1, 2, 3, 4], [1, 2, 3, 4]]
print(shape(lst)) # (2, (4, 4))
lst = [1, [1, 2]]
print(shape(lst)) # (2, (0, 2))
lst = [1, [1, [1]]]
print(shape(lst)) # (2, (0, (2, (0, 1))))

This approach returns the type and length of each element of the list; the first item describes the parent list itself.
def check(item):
    res = [(type(item), len(item))]
    for i in item:
        res.append((type(i), (len(i) if hasattr(i, '__len__') else None)))
    return res
>>> a = [1,2,3,4]
>>> c = [a,a]
>>> check(c)
[(list, 2), (list, 4), (list, 4)]


Lists become pd.Series, then again lists with one dimension more

I have another problem with pandas; I will never get the hang of this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem. I create a dataframe, with list elements (in the actual code the lists come from grouping ops and I cannot change them, basically they contain all the values in a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
           x          y
0  [1, 2, 3]  [4, 5, 6]
However, the columns are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, tolist() returned a list of lists (why?). Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in zip(col1, col2):
print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I need only to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that seems not very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in two columns df["x"] and df["y"]
This should calculate df['s']
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
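Run on the one-row frame from the question, that apply call yields a single cell holding a list of three distances (a minimal, self-contained check):

```python
import numpy as np
import pandas as pd

# one-row frame, as built in the question
df = pd.DataFrame.from_dict([{'x': [1, 2, 3], 'y': [4, 5, 6]}])
df['s'] = df.apply(
    lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])],
    axis=1,
)
print(df['s'][0])  # approximately [4.123, 5.385, 6.708]
```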
Basically, tolist() returned a list of lists (why?):
Because your dataframe has only 1 row, with two columns, and both columns contain a list as their value. Returning that column as a list of its values therefore gives a list with one element: the inner list.
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6) )}
df = pd.DataFrame.from_dict(values)
print(df) # yields
   x  y
0  1  4
1  2  5
2  3  6
whereas your original
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6) )}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
           x          y
0  [1, 2, 3]  [4, 5, 6]
An elegant way to compute sqrt(x^2 + y^2) is to convert the dataframe as follows:
new_df = df.iloc[0, :].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
   x  y
0  1  4
1  2  5
2  3  6
Now compute the sqrt(x^2 + y^2)
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64
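Another sketch, assuming pandas 1.3+ (which added multi-column DataFrame.explode), reaches the same result without transposing:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict([{'x': [1, 2, 3], 'y': [4, 5, 6]}])
flat = df.explode(['x', 'y']).astype(float)  # one row per list element
s = np.sqrt(flat['x'] ** 2 + flat['y'] ** 2)
print(s.tolist())  # approximately [4.123, 5.385, 6.708]
```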

How to store two space-separated variables as one element in a list to output the element later as integer datatypes

I'd like to append two space-separated integer variables to a list as an element, then later output the contents of the list on newlines. For example,
storage = a + 3, b + 3
lst.append(storage)
Later, when printing the elements of the list, I get:
for i in lst:
    print(i)
>>> (4, 7)
>>> (3, 6)
>>> (7, 7)
Instead, I'd like the output to be exactly:
>>> 4 7
>>> 3 6
>>> 7 7
separated on newlines as a space-separated pair of integers without commas and not part of a list. In addition, I also input singular integers between the pairs and would like to output them on a newline as well:
for i in lst:
    print(i)
Expected output:
>>> 1
>>> 4 7
>>> 3 6
>>> -1
>>> 7 7
>>> 3
How can I do this without using list comprehension/mapping/defined functions/importing?
Test each element to see if it's a tuple, and if it is, use the * operator to spread it as multiple args to print().
>>> lst = [1, (4, 7), (3, 6), -1, (7, 7), 3]
>>> for i in lst:
...     if isinstance(i, tuple):
...         print(">>>", *i)
...     else:
...         print(">>>", i)
...
>>> 1
>>> 4 7
>>> 3 6
>>> -1
>>> 7 7
>>> 3

Spark - calculate the sum of features for each sample

If I have an RDD which looks like below, then I know how to calculate the sum of my features per sample:
import numpy as np
from pyspark import SparkContext
x = np.arange(10) # first sample with 10 features [0 1 2 3 4 5 6 7 8 9]
y = np.arange(10) # second sample with 10 features [0 1 2 3 4 5 6 7 8 9]
z = (x,y)
sc = SparkContext()
rdd1 = sc.parallelize(z)
rdd1.sum()
The output will be an array like this: ([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]), which is what I want.
My Question is:
If I construct an RDD by parsing a csv file as below, in which each element is a tuple or list, how can I calculate the sum of the tuple/list elements (each feature) as above? If I use sum I get this error:
Rdd : [(0.00217010083485, 0.00171658370653), (7.24521659993e-05, 4.18413109325e-06), ....]
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
[EDIT] To be more specific:
rdd = sc.parallelize([(1,3),(2,4)])
I want my output to be [3,7]. Each tuple is a data instance that I have, and each element of tuple is my feature. I want to calculate the sum of each feature per all data samples.
In that case, you will need the reduce method: zip each pair of tuples and add them element by element:
rdd.reduce(lambda x, y: [t1+t2 for t1, t2 in zip(x, y)])
# [3, 7]
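The same element-wise reduction can be checked outside Spark with functools.reduce over a plain list standing in for the RDD:

```python
from functools import reduce

# each tuple is one sample, each position one feature
data = [(1, 3), (2, 4)]
sums = reduce(lambda x, y: [t1 + t2 for t1, t2 in zip(x, y)], data)
print(sums)  # [3, 7]
```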
You can do something like this:
z = zip(x, y)
# z yields (0, 0), (1, 1), (2, 2), ...
list(map(np.sum, z))
that should do the trick.
Here is a solution using a PySpark dataframe, for the case where your rdd is large:
rdd = sc.parallelize([(1, 3),(2, 4)])
df = rdd.toDF()  # transform rdd to dataframe
col_sum = df.groupby().sum().rdd.map(lambda x: x.asDict()).collect()[0]
[v for k, v in col_sum.items()]  # sum of columns: [3, 7]

pandas drop_duplicates using comparison function

Is it somehow possible to use pandas.drop_duplicates with a comparison operator which compares two objects in a particular column in order to identify duplicates? If not, what is the alternative?
Here is an example where it could be used:
I have a pandas DataFrame which has lists as values in a particular column and I would like to have duplicates removed based on column A
import pandas as pd
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2]]} )
print(df)
giving me
A
0 [1, 2]
1 [2, 3]
2 [1, 2]
Using pandas.drop_duplicates
df.drop_duplicates( 'A' )
gives me a TypeError
[...]
TypeError: type object argument after * must be a sequence, not itertools.imap
However, my desired result is
A
0 [1, 2]
1 [2, 3]
My comparison function would be here:
def cmp(x, y):
    return x == y
But in principle it could be something else, e.g.,
def cmp(x, y):
    return x == y and len(x) > 1
How can I remove duplicates based on the comparison function in an efficient way?
Moreover, what could I do if I had more columns to compare, each using a different comparison function?
Option 1
df[~pd.DataFrame(df.A.values.tolist()).duplicated()]
Option 2
df[~df.A.apply(pd.Series).duplicated()]
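Both options expand the lists column-wise before testing for duplicates; a self-contained run of Option 1 on the example frame (note that unequal-length lists would be padded with NaN):

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2]]})
mask = pd.DataFrame(df.A.values.tolist()).duplicated()  # [False, False, True]
out = df[~mask]
print(out)
```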
IIUC, your question is how to use an arbitrary function to determine what is a duplicate. To emphasize this, let's say that two lists are duplicates if the sum of the first item plus the square of the second item is the same in each case.
In [118]: df = pd.DataFrame( {'A': [[1,2],[4,1],[2,3]]} )
(Note that the first and second lists are equivalent under this key, although not the same.)
Python typically prefers key functions to comparison functions, so here we need a function to say what is the key of a list; in this case, it is lambda l: l[0] + l[1]**2.
We can use groupby + first to group by the values of the key function, then take the first of each group:
In [119]: df.groupby(df.A.apply(lambda l: l[0] + l[1]**2)).first()
Out[119]:
A
A
5 [1, 2]
11 [2, 3]
Edit
Following further edits in the question, here are a few more examples using
df = pd.DataFrame( {'A': [[1,2],[2,3],[1,2], [1], [1], [2]]} )
Then for
def cmp(x,y):
return x==y
this could be
In [158]: df.groupby(df.A.apply(tuple)).first()
Out[158]:
A
A
(1,) [1]
(1, 2) [1, 2]
(2,) [2]
(2, 3) [2, 3]
for
def cmp(x,y):
return x==y and len(x)>1
this could be
In [184]: class Key(object):
   .....:     def __init__(self):
   .....:         self._c = 0
   .....:     def __call__(self, l):
   .....:         if len(l) < 2:
   .....:             self._c += 1
   .....:             return self._c
   .....:         return tuple(l)
   .....:
In [187]: df.groupby(df.A.apply(Key())).first()
Out[187]:
A
A
1 [1]
2 [1]
3 [2]
(1, 2) [1, 2]
(2, 3) [2, 3]
Alternatively, this could also be done much more succinctly via
In [190]: df.groupby(df.A.apply(lambda l: np.random.rand() if len(l) < 2 else tuple(l))).first()
Out[190]:
A
A
0.112012068449 [2]
0.822889598152 [1]
0.842630848774 [1]
(1, 2) [1, 2]
(2, 3) [2, 3]
but some people don't like these Monte-Carlo things.
Lists are unhashable in nature. Try converting them to hashable types such as tuples and then you can continue to use drop_duplicates:
df['A'] = df['A'].map(tuple)
df.drop_duplicates('A').applymap(list)
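Put together on the example frame, that round-trip looks like this (applymap is deprecated in newer pandas in favour of DataFrame.map, but matches the answer as written):

```python
import pandas as pd

df = pd.DataFrame({'A': [[1, 2], [2, 3], [1, 2]]})
df['A'] = df['A'].map(tuple)                  # tuples are hashable
out = df.drop_duplicates('A').applymap(list)  # convert back to lists
print(out)
```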
One way of implementing this with a function is based on computing value_counts of the series object: duplicated values get aggregated, and we are interested only in the index part (which is unique), not the actual counts.
def series_dups(col_name):
    ser = df[col_name].map(tuple).value_counts(sort=False)
    return (pd.Series(data=ser.index.values, name=col_name)).map(list)
series_dups('A')
0 [1, 2]
1 [2, 3]
Name: A, dtype: object
If you do not want to convert the values to tuple but rather process the values as they are, you could do:
Toy data:
df = pd.DataFrame({'A': [[1,2], [2,3], [1,2], [3,4]],
'B': [[10,11,12], [11,12], [11,12,13], [10,11,12]]})
df
def series_dups_hashable(frame, col_names):
    for col in col_names:
        ser, indx = np.unique(frame[col].values, return_index=True)
        frame[col] = pd.Series(data=ser, index=indx, name=col)
    return frame.dropna(how='all')
series_dups_hashable(df, ['A', 'B']) # Apply to subset/all columns you want to check

How to exclude one min and one max number?

I have list:
numbers = [2,3,1,6,5]
And I must remove one min and one max number:
sorted(numbers)[1:-1]
And this is ok, but I want get additional information - position of removed numbers in original list:
remains = sorted(numbers)[1:-1]
min_number_position = 2
max_number_position = 3
How to do it? Numbers can be repeated.
Just use the min and max functions, together with the index method of list, to get the positions. Note you must delete the higher index first, otherwise deleting the first element shifts the second position:
min_position = numbers.index(min(numbers))
max_position = numbers.index(max(numbers))
for position in sorted((min_position, max_position), reverse=True):
    del numbers[position]
A pure-Python solution, by creating an argsorted list of indices (as numpy.argsort() would produce). Example -
numbers = [2,3,1,6,5]
argsorted = sorted(range(len(numbers)),key=lambda x:numbers[x])
maxpos,minpos = argsorted[-1],argsorted[0]
remains = [numbers[i] for i in argsorted[1:-1]]
Demo -
>>> numbers = [2,3,1,6,5]
>>> argsorted = sorted(range(len(numbers)),key=lambda x:numbers[x])
>>> argsorted
[2, 0, 1, 4, 3]
>>> maxpos,minpos = argsorted[-1],argsorted[0]
>>> remains = [numbers[i] for i in argsorted[1:-1]]
>>> remains
[2, 3, 5]
>>> maxpos
3
>>> minpos
2
If you can use the numpy library, this can be done easily using array.argsort(). Example -
nnumbers = np.array(numbers)
nnumargsort = nnumbers.argsort()
minpos,maxpos = nnumargsort[[0,-1]]
remains = nnumbers[nnumargsort[1:-1]]
Demo -
In [136]: numbers = [2,3,1,6,5]
In [137]: nnumbers = np.array(numbers)
In [138]: nnumargsort = nnumbers.argsort()
In [139]: minpos,maxpos = nnumargsort[[0,-1]]
In [140]: remains = nnumbers[nnumargsort[1:-1]]
In [141]: remains
Out[141]: array([2, 3, 5])
In [142]: maxpos
Out[142]: 3
In [143]: minpos
Out[143]: 2
>>> import operator
>>> sorted(enumerate(numbers), key=operator.itemgetter(1))
[(2, 1), (0, 2), (1, 3), (4, 5), (3, 6)]
The rest is left as an exercise for the reader.
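A sketch completing that exercise with the same enumerate/itemgetter pairs:

```python
import operator

numbers = [2, 3, 1, 6, 5]
pairs = sorted(enumerate(numbers), key=operator.itemgetter(1))
# pairs == [(2, 1), (0, 2), (1, 3), (4, 5), (3, 6)]
min_number_position = pairs[0][0]
max_number_position = pairs[-1][0]
remains = [value for _, value in pairs[1:-1]]
print(min_number_position, max_number_position, remains)  # 2 3 [2, 3, 5]
```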
You can use a function that returns the indices of the min and max via the list.index method:
>>> def func(li):
...     sorted_li = sorted(li)
...     return (li.index(sorted_li[0]), sorted_li[1:-1], li.index(sorted_li[-1]))
...
>>> min_number_position,remains,max_number_position=func(numbers)
>>> min_number_position
2
>>> remains
[2, 3, 5]
>>> max_number_position
3
In Python 3.x you can use starred unpacking in the assignment:
>>> def func(li):
...     mi, *re, ma = sorted(li)
...     return (li.index(mi), re, li.index(ma))
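A quick check of that Python 3 version on the same list:

```python
def func(li):
    mi, *re, ma = sorted(li)
    return (li.index(mi), re, li.index(ma))

print(func([2, 3, 1, 6, 5]))  # (2, [2, 3, 5], 3)
```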
