Pandas ValueError when calling apply with axis=1 and setting lists of varying length as cell-value - python

While calling apply on a Pandas DataFrame with axis=1, I get a ValueError when trying to set a list as the cell value.
Note: the lists in different rows have varying lengths, and this seems to be the cause, but I am not sure how to overcome it.
import numpy as np
import pandas as pd
data = [{'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1},
        {'a': 45, 'b': '412\t34\t324', 'c': 2}]
df = pd.DataFrame.from_dict(data)
print("df: ")
print(df)
def get_rank_array(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids))

def get_rank_list(ids):
    ids = list(map(int, ids))
    return np.random.randint(0, 10, len(ids)).tolist()
df['rank'] = df.apply(lambda row: get_rank_array(row['b'].split('\t')), axis=1)
ValueError: could not broadcast input array from shape (2) into shape (3)
df['rank'] = df.apply(lambda row: get_rank_list(row['b'].split('\t')), axis=1)
print("df: ")
print(df)
df:
a b c rank
0 1 3412 0 [6]
1 88 56\t23 1 [0, 0]
2 45 412\t34\t324 2 [3, 3, 6]
get_rank_list works in producing the expected result above, but get_rank_array does not.
I understand that the (3,) shape comes from the number of columns in the dataframe, and (2,) comes from the length of the list after splitting 56\t23 in the second row.
But I do not understand the reason behind the error itself.
When the rows are reordered so that the first row's list has length 3 (matching the number of columns):
data = [{'a': 45, 'b': '412\t34\t324', 'c': 2},
        {'a': 1, 'b': '3412', 'c': 0}, {'a': 88, 'b': '56\t23', 'c': 1}]
the error occurs with lists too.

Observe -
df.apply(lambda x: [0, 1, 2])
a b c
0 0 0 0
1 1 1 1
2 2 2 2
df.apply(lambda x: [0, 1])
a [0, 1]
b [0, 1]
c [0, 1]
dtype: object
Pandas does two things inside apply:
it special-cases np.arrays and lists, and
it attempts to snap the results into a DataFrame if the shape is compatible.
Note that arrays are special-cased a little differently from lists: if the shape is not compatible, a list result becomes a Series (as you see in the second output above), whereas an array result raises an error, as the following pair of examples shows.
df.apply(lambda x: np.array([0, 1, 2]))
a b c
0 0 0 0
1 1 1 1
2 2 2 2
df.apply(lambda x: np.array([0, 1]))
ValueError: Shape of passed values is (3, 2), indices imply (3, 3)
In short, this is a consequence of the pandas internals. For more information, peruse the apply function code on GitHub.
To get your desired output, use a list comprehension and assign the result to df['new']. Don't use apply.
df['new'] = [
    np.random.randint(0, 10, len(x.split('\t'))).tolist() for x in df.b
]
df
a b c new
0 1 3412 0 [8]
1 88 56\t23 1 [4, 2]
2 45 412\t34\t324 2 [9, 0, 3]
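As an aside, since the ranking function only needs column b, a plain Series.map over that single column sidesteps row-wise apply entirely, so pandas never tries to expand the per-row results into columns. A minimal sketch reusing df and get_rank_list from the question:
# Map over the single column instead of applying across rows.
# Series.map with a function returning a list gives an object-dtype
# Series of lists, so no shape alignment is attempted.
df['rank'] = df['b'].map(lambda s: get_rank_list(s.split('\t')))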

Lists become pd.Series, then again lists with one dimension more

I have another problem with pandas; I will never get the hang of this library.
First, this is - I think - how zip() is supposed to work with lists:
import numpy as np
import pandas as pd
a = [1,2]
b = [3,4]
print(type(a))
print(type(b))
vv = zip([1,2], [3,4])
for i, v in enumerate(vv):
    print(f"{i}: {v}")
with output:
<class 'list'>
<class 'list'>
0: (1, 3)
1: (2, 4)
Problem: I create a dataframe with list elements (in the actual code the lists come from grouping operations and I cannot change them; basically they contain all the values of a dataframe grouped by a column).
# create dataframe
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)
x y
0 [1, 2, 3] [4, 5, 6]
However, the lists are now pd.Series:
print(type(df["x"]))
<class 'pandas.core.series.Series'>
If I do this:
col1 = df["x"].tolist()
col2 = df["y"].tolist()
print(f"col1 is of type {type(col1)}, with length {len(col1)}, first el is {col1[0]} of type {type(col1[0])}")
col1 is of type <class 'list'>, with length 1, first el is [1, 2, 3] of type <class 'list'>
Basically, tolist() returned a list of lists (why?).
Indeed:
print("ZIP AND ITER")
vv = zip(col1, col2)
for v in vv:
    print(v)
ZIP AND ITER
([1, 2, 3], [4, 5, 6])
I need only to compute this:
# this fails because x (y) is a list
# df['s'] = [np.sqrt(x**2 + y**2) for x, y in zip(df["x"], df["y"])]
I could use df["x"][0], but that seems not very elegant.
Question:
How am I supposed to compute sqrt(x^2 + y^2) when x and y are in the two columns df["x"] and df["y"]?
This should calculate df['s']:
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)
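For the one-row frame from the question, this stores the whole result list in the single cell of the new column 's'. A quick self-contained check (my own sketch, values rounded for readability):
import numpy as np
import pandas as pd

# Rebuild the one-row frame from the question and apply the answer's lambda.
df = pd.DataFrame({'x': [[1, 2, 3]], 'y': [[4, 5, 6]]})
df['s'] = df.apply(lambda row: [np.sqrt(x**2 + y**2) for x, y in zip(row["x"], row["y"])], axis=1)

# 's' is an object column whose single cell holds the whole list.
print([round(v, 3) for v in df['s'].iloc[0]])   # [4.123, 5.385, 6.708]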
Basically, the tolist() returned a list of list (why?):
Because your dataframe has only one row, with two columns, and both columns contain a list as their value. So returning a column as a list of its values returns a list with one element (the list that is the value).
I think you wanted to create a dataframe like this:
values = {'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}
df = pd.DataFrame.from_dict(values)
print(df)  # yields
x y
0 1 4
1 2 5
2 3 6
whereas what you actually wrote was:
values = [{'x': list( (1, 2, 3) ), 'y': list( (4, 5, 6))}]
df = pd.DataFrame.from_dict(values)
print(df)  # yields
x y
0 [1, 2, 3] [4, 5, 6]
An elegant way to compute sqrt(x^2 + y^2) is to first reshape the dataframe as follows:
new_df = df.iloc[0,:].apply(pd.Series).T.reset_index(drop=True)
This yields the following output:
x y
0 1 4
1 2 5
2 3 6
Now compute the sqrt(x^2 + y^2)
np.sqrt(new_df['x']**2 + new_df['y']**2)
This yields:
0 4.123106
1 5.385165
2 6.708204
dtype: float64
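As an aside, on pandas 1.3 or newer (where DataFrame.explode accepts a list of columns - an assumption worth checking against your version), the same reshaping can be sketched without apply:
import numpy as np
import pandas as pd

# Sketch, assuming pandas >= 1.3 for multi-column explode.
df = pd.DataFrame({'x': [[1, 2, 3]], 'y': [[4, 5, 6]]})
long_df = df.explode(['x', 'y']).astype(float).reset_index(drop=True)
long_df['s'] = np.sqrt(long_df['x'] ** 2 + long_df['y'] ** 2)
print(long_df)
#      x    y         s
# 0  1.0  4.0  4.123106
# 1  2.0  5.0  5.385165
# 2  3.0  6.0  6.708204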

Why is the result of a Pandas' melt Fortran contiguous and not C-contiguous?

I ran into some pandas melt behavior that undermines my mental model of that function and I wonder if somebody could explain why this is sane/logical/desirable behavior.
The following snippet melts down a dataframe and then converts the result into a numpy array. Since I'm melting all columns, I would have expected the result to be similar to what np.ndarray.ravel() would do, i.e., create a 1D view into the data and add a column with the respective column names (var names). However - to my surprise - melt actually makes a copy of the data and reorders it as F-contiguous. Why is F-contiguity a good idea here?
expected_flat = np.arange(100*3)
expected_full = expected_flat.reshape(100, 3)
# expected_full is view into flat array
assert expected_full.base is expected_flat
assert expected_flat.flags["C_CONTIGUOUS"]
test_df = pd.DataFrame(
    expected_flat.reshape(100, 3),
    columns=["a", "b", "c"],
)
# test_df, too, is a view into flat array
reconstructed = test_df.to_numpy()
assert reconstructed.base is expected_flat
flatten_melt = test_df.melt(var_name="col", value_name="foobar")
flatten_melt_numpy = flatten_melt.foobar.to_numpy()
# flatten_melt is NOT a view and reordered
assert flatten_melt_numpy.base is not expected_flat
assert np.allclose(flatten_melt_numpy, expected_flat) == False
# the confusing part is that the array is now F-contiguous
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")
assert np.allclose(reconstructed_melt, expected_full)
Construct a frame from a pair of "series":
In [322]: df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
In [323]: df
Out[323]:
a b
0 1 4
1 2 5
2 3 6
In [324]: arr = df.to_numpy()
In [325]: arr
Out[325]:
array([[1, 4],
[2, 5],
[3, 6]])
In [326]: arr.flags
Out[326]:
C_CONTIGUOUS : False
F_CONTIGUOUS : True
...
In [327]: arr.strides
Out[327]: (8, 24)
The resulting array is F_CONTIGUOUS.
If I make a frame from a 2d array, the values keep the same order as the input, in this case order 'C':
In [328]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [329]: df1
Out[329]:
a b
0 1 2
1 3 4
2 5 6
In [330]: df1.to_numpy().strides
Out[330]: (16, 8)
Create it from an order-'F' array, and the result is the same as in the first case:
In [332]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2, order="F"),
     ...:                    columns=["a", "b"])
In [333]: df1
Out[333]:
a b
0 1 4
1 2 5
2 3 6
In [334]: df1.to_numpy().strides
Out[334]: (8, 24)
melt
Going back to the frame created from an order C:
In [335]: df1 = pd.DataFrame(np.arange(1, 7).reshape(3, 2), columns=["a", "b"])
In [336]: df2 = df1.melt()
In [337]: df2
Out[337]:
variable value
0 a 1
1 a 3
2 a 5
3 b 2
4 b 4
5 b 6
Notice how the value column is a vertical concatenation of the 'a' and 'b' columns. This is what the method examples show. I don't use pivot enough to know if this is a natural interpretation of that or not.
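As a side check (my own sketch, using the array behind df1 from In [335]), the value column is exactly a column-wise, i.e. Fortran-order, ravel of the original data:
import numpy as np

arr = np.arange(1, 7).reshape(3, 2)            # the data behind df1
print(arr.ravel(order="F"))                     # [1 3 5 2 4 6]
print(np.concatenate([arr[:, 0], arr[:, 1]]))   # same values: columns stacked vertically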
The melted frame, too, converts to an order-'F' array:
In [338]: df2.to_numpy()
Out[338]:
array([['a', 1],
['a', 3],
['a', 5],
['b', 2],
['b', 4],
['b', 6]], dtype=object)
In [339]: _.strides
Out[339]: (8, 48)
In df1 both columns are int dtype, and can be stored as a 2d array:
In [340]: df1.dtypes
Out[340]:
a int64
b int64
dtype: object
df2's columns have different dtypes, object (string) and int, so they are stored as separate arrays. to_numpy constructs an object-dtype array from them, and it is order 'F':
In [341]: df2.dtypes
Out[341]:
variable object
value int64
dtype: object
We get a hint of this storage from:
In [352]: df1._mgr
Out[352]:
BlockManager
Items: Index(['a', 'b'], dtype='object')
Axis 1: RangeIndex(start=0, stop=3, step=1)
NumericBlock: slice(0, 2, 1), 2 x 3, dtype: int64
In [353]: df2._mgr
Out[353]:
BlockManager
Items: Index(['variable', 'value'], dtype='object')
Axis 1: RangeIndex(start=0, stop=6, step=1)
ObjectBlock: slice(0, 1, 1), 1 x 6, dtype: object
NumericBlock: slice(1, 2, 1), 1 x 6, dtype: int64
How a dataframe stores its values is a complex subject, and I have not read a comprehensive description. I've only gathered bits and pieces from experimenting like this.
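If C-contiguity matters downstream, a C-ordered copy can be forced with standard NumPy calls. This is just a sketch on top of the question's flatten_melt_numpy and expected_full, not something melt itself offers:
import numpy as np

# Undo the melt ordering: order="F" matches how melt stacked the columns.
reconstructed_melt = flatten_melt_numpy.reshape(100, 3, order="F")

# np.ascontiguousarray copies into C order only when the layout differs.
c_ordered = np.ascontiguousarray(reconstructed_melt)
assert c_ordered.flags["C_CONTIGUOUS"]
assert np.array_equal(c_ordered, expected_full)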

pandas groupby ID and select row with minimal value of specific columns

I want to select the whole row in which the minimal value across 3 selected columns is found, in a dataframe like this:
It is supposed to look like this afterwards:
I tried something like
dfcheckminrow = dfquery[dfquery == dfquery['A':'C'].min().groupby('ID')]
obviously it didn't work out well.
Thanks in advance!
Bkeesey's answer looks like it almost got you to your solution. I added one more step to get the overall minimum for each group.
import pandas as pd
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3],
                   'C': [1, 2, 3, 4, 5, 6],
                   })
# set "ID" as the index
df = df.set_index('ID')
# get the group-wise min for each column
mindf = df[['A','B']].groupby('ID').transform('min')
# get the min between columns and add it to df
df['min'] = mindf.apply(min, axis=1)
# filter df for when A or B matches the min
df2 = df.loc[(df['A'] == df['min']) | (df['B'] == df['min'])]
print(df2)
In my simplified example, I'm just finding the minimum between columns A and B. Here's the output:
A B C min
ID
1 14 1 2 1
2 100 2 3 2
3 1 100 5 1
One method to filter the initial DataFrame based on a groupby conditional is to use transform to find the minimum for each "ID" group and then use loc to keep the rows of the initial DataFrame where any(axis=1) (checking across each row) is met.
# create sample df
df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})
# set "ID" as the index
df = df.set_index('ID')
Sample df:
A B
ID
1 30 10
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
Use groupby and transform to find the minimum value per "ID" group.
Then use loc to filter the initial df to the rows where any(axis=1) holds:
df.loc[(df == df.groupby('ID').transform('min')).any(axis=1)]
Output:
A B
ID
1 14 1
2 100 2
2 67 5
3 1 100
3 20 3
In this example only the first row is removed, since neither of its values is the minimum of its column within its "ID" group.
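If the goal is instead a single row per ID - the one containing the smallest value across the selected columns - a row-wise min combined with groupby/idxmin is another sketch worth considering (shown on the sample df before set_index, since idxmin needs unique index labels):
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 2, 3, 3],
                   'A': [30, 14, 100, 67, 1, 20],
                   'B': [10, 1, 2, 5, 100, 3]})

# smallest value in each row across the columns of interest
row_min = df[['A', 'B']].min(axis=1)

# index label of the row holding the smallest row-wise min per ID group
idx = row_min.groupby(df['ID']).idxmin()

print(df.loc[idx])
#    ID    A    B
# 1   1   14    1
# 2   2  100    2
# 4   3    1  100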

How to aggregate two largest values per group in pandas?

I was going through this link: Return top N largest values per group using pandas
and found multiple ways to find the topN values per group.
However, I prefer the dictionary method with the agg function and would like to know whether it is possible to get the equivalent of the dictionary method for the following problem.
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': [1, 1, 2, 2, 1],
                   'C': [10, 20, 30, 40, 50],
                   'D': ['X', 'Y', 'X', 'Y', 'Y']})
print(df)
A B C D
0 1 1 10 X
1 1 1 20 Y
2 1 2 30 X
3 2 2 40 Y
4 2 1 50 Y
I can do this:
df1 = df.groupby(['A'])['C'].nlargest(2).droplevel(-1).reset_index()
print(df1)
A C
0 1 30
1 1 20
2 2 50
3 2 40
# also this
df1 = df.sort_values('C', ascending=False).groupby('A', sort=False).head(2)
print(df1)
# also this
df.set_index('C').groupby('A')['B'].nlargest(2).reset_index()
Required
df.groupby('A', as_index=False).agg(
    {'C': lambda ser: ser.nlargest(2)}  # something like this
)
Is it possible to use the dictionary here?
If you want to get a dictionary like {A value: 2 top values from C},
you can run:
df.groupby(['A'])['C'].apply(lambda x: x.nlargest(2).tolist()).to_dict()
For your DataFrame, the result is:
{1: [30, 20], 2: [50, 40]}
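If a DataFrame rather than a dict is wanted (closer to the 'Required' shape), the same apply can be reset back to columns. This is a sketch, not a dictionary-style agg, since agg's handling of list-returning lambdas varies across pandas versions:
out = (df.groupby('A')['C']
         .apply(lambda ser: ser.nlargest(2).tolist())
         .reset_index())
print(out)
#    A         C
# 0  1  [30, 20]
# 1  2  [50, 40]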

Compare Two Columns in Two Pandas Dataframes

I have two pandas dataframes:
df1:
a b c
1 1 2
2 1 2
3 1 3
df2:
a b c
4 0 2
5 5 2
1 1 2
df1 = {'a': [1, 2, 3], 'b': [1, 1, 1], 'c': [2, 2, 3]}
df2 = {'a': [4, 5, 1], 'b': [0, 5, 1], 'c': [2, 2, 2]}
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
I'm looking for a function that will display whether df1 and df2 contain the same value in column a.
In the example I provided df1.a and df2.a both have a=1.
If df1 and df2 do not have an entry where the values in column a are equal, then the function should return None or False.
How do I do this? I've tried a couple of combinations of pandas.merge.
Define your own function by using isin and any:
def yourf(x, y):
    if any(x.isin(y)):
        #print(x[x.isin(y)])
        return x[x.isin(y)]
    else:
        return 'No match'  # you can change here to None
yourf(df1.a, df2.a)
Out[316]:
0    1
Name: a, dtype: int64
yourf(df1.b,df2.c)
Out[318]: 'No match'
You could use set intersection:
def col_intersect(df1, df2, col='a'):
    s1 = set(df1[col])
    s2 = set(df2[col])
    common = s1 & s2
    return common if common else None
Using merge as you tried, you could try this:
def col_match(df1, df2, col='a'):
    merged = df1.merge(df2, how='inner', on=col)
    if len(merged):
        return merged[col]
    else:
        return None
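A quick usage sketch of col_match with the sample frames above (my own example, relying on df2['a'] containing the value 1 as in the displayed table); with no overlap in the chosen column it falls through and returns None:
print(col_match(df1, df2, col='a'))
# 0    1
# Name: a, dtype: int64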
