Lists become pd.Series, then lists again with one dimension more - python

I have another problem with pandas; I will never get the hang of this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem: I create a dataframe whose elements are lists (in the actual code the lists come from grouping operations and I cannot change them; basically they contain all the values of a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
x y
0 [1, 2, 3] [4, 5, 6]
However, the lists are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, the tolist() returned a list of lists (why?).
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in zip(col1, col2):
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I need only to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that seems not very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in the two columns df["x"] and df["y"]?

This should calculate df['s']
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
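As a quick check (a minimal sketch, reusing the one-row dataframe built above), the new cell holds all three hypotenuses as a list:
print(df['s'][0])
# [4.123105625617661, 5.385164807134504, 6.708203932499369]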

Basically, the tolist() returned a list of lists (why?):
Because your dataframe has only 1 row, with two columns, and both columns contain a list as their value. So returning that column as a list of its values gives a list with 1 element (the list that is the value).
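A minimal sketch of the same effect: the column has one row, so its list form has one element, which is itself the list.
col1 = df["x"].tolist()
print(col1)     # [[1, 2, 3]] -- one element per row
print(col1[0])  # [1, 2, 3]   -- the row's value is the original list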
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
x y
0 1 4
1 2 5
2 3 6
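With that layout the computation is fully vectorized; a small sketch using the values from the question:
values = {'x': [1, 2, 3], 'y': [4, 5, 6]}
df = pd.DataFrame(values)
df['s'] = np.sqrt(df['x']**2 + df['y']**2)
print(df['s'].tolist())
# [4.123105625617661, 5.385164807134504, 6.708203932499369]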

values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df) # yields
x y
0 [1, 2, 3] [4, 5, 6]
An elegant way to compute sqrt(x^2 + y^2) is to first reshape the dataframe as follows:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
x y
0 1 4
1 2 5
2 3 6
Now compute the sqrt(x^2 + y^2)
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64
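As an alternative sketch, assuming pandas >= 1.3 (which supports exploding several columns at once), the same reshaping can be written with explode:
new_df = df.explode(['x', 'y']).reset_index(drop=True)
new_df['s'] = np.sqrt(new_df['x'].astype(float)**2 + new_df['y'].astype(float)**2)
Note the astype(float): explode leaves the exploded columns with object dtype.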

Related

Pandas Dataframe selecting a single value each row

I want to access an element inside a pandas dataframe; my df looks like below:
index  A        B
0      3, 2, 1  5, 6, 7
1      3, 2, 1  5, 6, 7
2      3, 2, 1  5, 6, 7
I want to print the second value of A for every index, for example; the problem is that I don't know how to select them.
Output should be
(2,2,2)
Assuming "3, 2, 1" is a list, you can do this with :
df.A.apply(lambda x: x[1])
if this is a string, you can do this with :
df.A.apply(lambda x: x.split(", ")[1])
If the entries in A are a non-string iterable (e.g. a list or tuple), you can use pandas string indexing:
df['A'].str[1]
Full example:
>>> import pandas as pd
>>> a = (3, 2, 1)
>>> df = pd.DataFrame([[a], [a], [a]], columns=['A'])
>>> df
A
0 (3, 2, 1)
1 (3, 2, 1)
2 (3, 2, 1)
>>> df['A'].str[1]
0 2
1 2
2 2
Name: A, dtype: int64
If the entries are strings, you can use pandas string methods to split them into a list and apply the same approach above:
>>> import pandas as pd
>>> a = '3,2,1'
>>> df = pd.DataFrame([[a], [a], [a]], columns=['A'])
>>> df
A
0 3,2,1
1 3,2,1
2 3,2,1
>>> df['A'].str.split(',').str[1]
0 2
1 2
2 2
Name: A, dtype: object
If column A contains string values:
import pandas as pd

data = {
    "A": ["3, 2, 1", "3, 2, 1", "3, 2, 1"],
    "B": ["5, 6, 7", "5, 6, 7", "5, 6, 7"]
}
df = pd.DataFrame(data)
output = df["A"].apply(lambda x: x.split(",")[1].strip()).to_list()
print(output)
Result:
['2', '2', '2']
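If numeric values are wanted instead of strings, a small sketch casting each piece to int (int() tolerates the leading space):
output = [int(s.split(",")[1]) for s in df["A"]]
print(output)
# [2, 2, 2]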

Pandas - add a row at the end of a for loop iteration

So I have a for loop that gets a series of values and makes some tests:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'])
for value in list:
    if value > 3:
        df['columnX']="A"
    else:
        df['columnX']="B"
        df['columnZ']="Another value only to be filled in this condition"
    df['columnY']=value-1
How can I do this and keep all the values in a single row for each loop iteration, no matter what the if outcome is? Can I leave some columns empty?
I mean something like the following process:
[create empty row] -> [process] -> [fill column X] -> [process] -> [fill column Y if true] ...
Like:
[index columnX columnY columnZ]
[0 A 0 NULL ]
[1 A 1 NULL ]
[2 B 2 "..." ]
[3 B 3 "..." ]
[4 B 4 "..." ]
I am not sure I understand exactly, but I think this may be a solution:
list = [1, 2, 3, 4, 5, 6]
d = {'columnX':[],'columnY':[]}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    d['columnY'].append(value-1)
df = pd.DataFrame(d)
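For the input list above, a quick check of what this builds (values 4, 5, 6 map to "A", and columnY is value-1):
print(df)
  columnX  columnY
0       B        0
1       B        1
2       B        2
3       A        3
4       A        4
5       A        5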
For the second question, just add another condition:
list = [1, 2, 3, 4, 5, 6]
d = {'columnX':[],'columnY':[], 'columnZ':[]}
for value in list:
    if value > 3:
        d['columnX'].append("A")
    else:
        d['columnX'].append("B")
    if condition:
        d['columnZ'].append(xxx)
    else:
        d['columnZ'].append(None)
df = pd.DataFrame(d)
Following the example you have given, I have changed your code a bit to achieve the result you shared:
list = [1, 2, 3, 4, 5, 6]
df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'])
for index, value in enumerate(list):
    temp = []
    if value > 3:
        #df['columnX']="A"
        temp.append("A")
        temp.append(None)
    else:
        #df['columnX']="B"
        temp.append("B")
        temp.append("Another value") # or you can add any conditions
    #df['columnY']=value-1
    temp.append(value-1)
    df.loc[index] = temp
print(df)
This produces the result:
columnX columnY columnZ
0 B Another value 0.0
1 B Another value 1.0
2 B Another value 2.0
3 A None 3.0
4 A None 4.0
5 A None 5.0
df.index is printed as: Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
You may just initialize your DataFrame with an index sized to the input list, then get the power of the np.where routine:
In [111]: lst = [1, 2, 3, 4, 5, 6]
...: df = pd.DataFrame(columns=['columnX','columnY', 'columnZ'], index=range(len(lst)))
In [112]: int_arr = np.array(lst)
In [113]: df['columnX'] = np.where(int_arr > 3, 'A', 'B')
In [114]: df['columnZ'] = np.where(int_arr > 3, df['columnZ'], '...')
In [115]: df['columnY'] = int_arr - 1
In [116]: df
Out[116]:
columnX columnY columnZ
0 B 0 ...
1 B 1 ...
2 B 2 ...
3 A 3 NaN
4 A 4 NaN
5 A 5 NaN

Find size/internal structure of list in Python

If I have a list c like so:
a = [1,2,3,4]
c = [a,a]
What's the simplest way of finding that it's a list of length two where each element is a list of length 4? If I do len(c) I get 2, but it doesn't give any indication that those elements are lists, or their sizes, unless I explicitly do something like
print(type(c[0]))
print(len(c[0]))
print(len(c[1]))
I could do something like
import numpy as np
np.asarray(c).shape
which gives me (2, 4), but that only works when the internal lists are of equal size. If instead the list is like
a = [1,2,3,4]
b = [1,2]
d = [a,b]
then np.asarray(d).shape just gives me (2,). In this case, I could do something like
import pandas as pd
df = pd.DataFrame(d)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 4 columns):
0 2 non-null int64
1 2 non-null int64
2 1 non-null float64
3 1 non-null float64
dtypes: float64(2), int64(2)
memory usage: 144.0 bytes
From this, I can tell that there are lists inside the original list, but I would like to be able to see this without using pandas. What's the best way to look at the internal structure of a list?
Depending on the output format you expect, you could write a recursive function that returns nested tuples of length and shape.
Code
def shape(lst):
    length = len(lst)
    shp = tuple(shape(sub) if isinstance(sub, list) else 0 for sub in lst)
    if any(x != 0 for x in shp):
        return length, shp
    else:
        return length
Examples
lst = [[1, 2, 3, 4], [1, 2, 3, 4]]
print(shape(lst)) # (2, (4, 4))
lst = [1, [1, 2]]
print(shape(lst)) # (2, (0, 2))
lst = [1, [1, [1]]]
print(shape(lst)) # (2, (0, (2, (0, 1))))
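Applied to the ragged example from the question, a quick check:
lst = [[1, 2, 3, 4], [1, 2]]
print(shape(lst)) # (2, (4, 2))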
This approach returns the type and length of each element of the list, with the first item giving the parent list's info.
def check(item):
    res = [(type(item), len(item))]
    for i in item:
        res.append((type(i), (len(i) if hasattr(i, '__len__') else None)))
    return res
>>> a = [1,2,3,4]
>>> c = [a,a]
>>> check(c)
[(<class 'list'>, 2), (<class 'list'>, 4), (<class 'list'>, 4)]
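The same helper also handles the ragged input from the question; a quick sketch:
>>> d = [[1, 2, 3, 4], [1, 2]]
>>> check(d)
[(<class 'list'>, 2), (<class 'list'>, 4), (<class 'list'>, 2)]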

Pandas ValueError when calling apply with axis=1 and setting lists of varying length as cell-value

While calling apply on a pandas dataframe with axis=1, I get a ValueError when trying to set a list as a cell value.
Note: the lists in different rows are of varying lengths, and this seems to be the cause, but I'm not sure how to overcome it.
import numpy as np
import pandas as pd
data = [{'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1},
        {'a': 45, 'b': '412\t34\t324', 'c': 2}]
df = pd.DataFrame.from_dict(data)
print("df: ")
print(df)
def get_rank_array(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids))

def get_rank_list(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids)).tolist()
df['rank'] = df.apply(lambda row: get_rank_array(row['b'].split('\t')), axis=1)
ValueError: could not broadcast input array from shape (2) into shape (3)
df['rank'] = df.apply(lambda row: get_rank_list(row['b'].split('\t')), axis=1)
print("df: ")
print(df)
df:
a b c rank
0 1 3412 0 [6]
1 88 56\t23 1 [0, 0]
2 45 412\t34\t324 2 [3, 3, 6]
get_rank_list works in producing the expected result above, but get_rank_array does not.
I understand the (3,) shape comes from the number of columns in the dataframe, and (2,) is from the length of the list after splitting 56\t23 in the second row.
But I do not get the reason behind the error itself.
When
data = [{'a': 45, 'b': '412\t34\t324', 'c': 2},
        {'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1}]
the error occurs with lists too.
Observe -
df.apply(lambda x: [0, 1, 2])
a b c
0 0 0 0
1 1 1 1
2 2 2 2
df.apply(lambda x: [0, 1])
a [0, 1]
b [0, 1]
c [0, 1]
dtype: object
Pandas does two things inside apply:
it special-cases np.arrays and lists, and
it attempts to snap the results into a DataFrame if the shape is compatible.
Note that arrays are special-cased a little differently from lists: if the shape is not compatible, for lists the result is a Series (as you see in the second output above), but for arrays,
df.apply(lambda x: np.array([0, 1, 2]))
a b c
0 0 0 0
1 1 1 1
2 2 2 2
df.apply(lambda x: np.array([0, 1]))
ValueError: Shape of passed values is (3, 2), indices imply (3, 3)
In short, this is a consequence of the pandas internals. For more information, peruse the apply function code on GitHub.
To get your desired output, use a list comprehension and assign the result to df['new']. Don't use apply.
df['new'] = [
    np.random.randint(0, 10, len(x.split('\t'))).tolist() for x in df.b
]
df
a b c new
0 1 3412 0 [8]
1 88 56\t23 1 [4, 2]
2 45 412\t34\t324 2 [9, 0, 3]
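A variant sketch, reusing get_rank_list from the question: mapping over the single column sidesteps the DataFrame-level shape inference entirely, since a Series simply stores whatever object each call returns.
df['new'] = df['b'].str.split('\t').map(get_rank_list)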

Spark - calculate the sum of features for each sample

If I have an RDD which looks like below, then I know how to calculate the sum of my features per data sample:
import numpy as np
from pyspark import SparkContext
x = np.arange(10) # first sample with 10 features [0 1 2 3 4 5 6 7 8 9]
y = np.arange(10) # second sample with 10 features [0 1 2 3 4 5 6 7 8 9]
z = (x,y)
sc = SparkContext()
rdd1 = sc.parallelize(z)
rdd1.sum()
The output will be an array like this: ([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18]), which is what I want.
My Question is:
Suppose I construct an RDD by parsing a CSV file as below, in which each element of the RDD is a tuple or list. How can I calculate the sum of the tuple/list elements (each feature) like above? If I use sum I get this error:
Rdd : [(0.00217010083485, 0.00171658370653), (7.24521659993e-05, 4.18413109325e-06), ....]
TypeError: unsupported operand type(s) for +: 'int' and 'tuple'
[EDIT] To be more specific:
rdd = sc.parallelize([(1,3),(2,4)])
I want my output to be [3,7]. Each tuple is a data instance that I have, and each element of tuple is my feature. I want to calculate the sum of each feature per all data samples.
In that case, you will need the reduce method: zip consecutive tuples and add them element by element:
rdd.reduce(lambda x, y: [t1+t2 for t1, t2 in zip(x, y)])
# [3, 7]
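Equivalently, a small sketch that converts each tuple to a numpy array first, so that the built-in sum behaves exactly as in the first example from the question:
rdd.map(np.array).sum()
# array([3, 7])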
You can do something like this:
z = zip(x, y)
# z is [(0, 0), (1, 1), (2, 2), ...]
list(map(np.sum, z))
That should do the trick (note the list() wrapper, since map is lazy in Python 3).
Here I add a solution using a PySpark dataframe, for the larger RDDs that you may have:
rdd = sc.parallelize([(1, 3), (2, 4)])
df = rdd.toDF()  # transform rdd to dataframe
col_sum = df.groupby().sum().rdd.map(lambda x: x.asDict()).collect()[0]
[v for k, v in col_sum.items()]  # sum of columns: [3, 7]
