I have a dataframe something like this:
| ID | M001 | M002 | M003 | M004 |
|------|------|------|------|------|
| E001 | 3 | 4 | 3 | 2 |
| E002 | 4 | 5 | 5 | 3 |
| E003 | 4 | 3 | 5 | 4 |
And I want the output as lists: for each unique ID such as E001, E002, I want a list of its responses in each of M001, M002, and so on.
My required output is a different variable for each ID, let's say:
E001_response = [["M001",3],["M002",4],["M003",3],["M004",2]]
You can create a Series of lists with a custom lambda function:
s = df.set_index('ID').apply(lambda x: list(map(list, zip(df.columns[1:], x))), axis=1)
print (s)
ID
E001 [[M001, 3], [M002, 4], [M003, 3], [M004, 2]]
E002 [[M001, 4], [M002, 5], [M003, 5], [M004, 3]]
E003 [[M001, 4], [M002, 3], [M003, 5], [M004, 4]]
dtype: object
Then it is possible to use globals, but it is better not to create variables by name:
for k, v in s.items():
    globals()[f'{k}_response'] = v
print (E001_response)
[['M001', 3], ['M002', 4], ['M003', 3], ['M004', 2]]
It is better to create a dictionary:
d = s.to_dict()
print (d)
{'E001': [['M001', 3], ['M002', 4], ['M003', 3], ['M004', 2]],
'E002': [['M001', 4], ['M002', 5], ['M003', 5], ['M004', 3]],
'E003': [['M001', 4], ['M002', 3], ['M003', 5], ['M004', 4]]}
print (d['E001'])
[['M001', 3], ['M002', 4], ['M003', 3], ['M004', 2]]
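For comparison, here is a minimal self-contained sketch that builds the same dictionary without apply, using a plain dict comprehension over the rows (the df construction below just rebuilds the question's example so the snippet runs on its own):

import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({'ID': ['E001', 'E002', 'E003'],
                   'M001': [3, 4, 4], 'M002': [4, 5, 3],
                   'M003': [3, 5, 5], 'M004': [2, 3, 4]})

# One [column, value] pair per measure column, keyed by ID.
d = {row.ID: [[col, getattr(row, col)] for col in df.columns[1:]]
     for row in df.itertuples(index=False)}
print(d['E001'])  # [['M001', 3], ['M002', 4], ['M003', 3], ['M004', 2]]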
Something like this:
new_df = df.set_index('ID').apply(lambda x: list(zip(x.index, x)), axis=1)
Output:
ID
E001 [(M001, 3), (M002, 4), (M003, 3), (M004, 2)]
E002 [(M001, 4), (M002, 5), (M003, 5), (M004, 3)]
E003 [(M001, 4), (M002, 3), (M003, 5), (M004, 4)]
dtype: object
Similar to this question (Scala), but I need combinations in PySpark (pair combinations of array column).
Example input:
df = spark.createDataFrame(
    [([0, 1],),
     ([2, 3, 4],),
     ([5, 6, 7, 8],)],
    ['array_col'])
Expected output:
+------------+------------------------------------------------+
|array_col |out |
+------------+------------------------------------------------+
|[0, 1] |[[0, 1]] |
|[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
|[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
+------------+------------------------------------------------+
Native Spark approach. I've translated this answer to PySpark.
Python 3.8+ (the walrus operator := is used so that "array_col", which is repeated several times in this script, is written only once):
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.filter(
        F.transform(
            F.flatten(F.transform(
                c := "array_col",
                lambda x: F.arrays_zip(F.array_repeat(x, F.size(c)), c)
            )),
            lambda x: F.array(x["0"], x[c])
        ),
        lambda x: x[0] < x[1]
    )
)
df.show(truncate=0)
# +------------+------------------------------------------------+
# |array_col |out |
# +------------+------------------------------------------------+
# |[0, 1] |[[0, 1]] |
# |[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
# |[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
# +------------+------------------------------------------------+
Alternative without walrus operator:
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.filter(
        F.transform(
            F.flatten(F.transform(
                "array_col",
                lambda x: F.arrays_zip(F.array_repeat(x, F.size("array_col")), "array_col")
            )),
            lambda x: F.array(x["0"], x["array_col"])
        ),
        lambda x: x[0] < x[1]
    )
)
Alternative for Spark 2.4+
from pyspark.sql import functions as F
df = df.withColumn(
    "out",
    F.expr("""
        filter(
            transform(
                flatten(transform(
                    array_col,
                    x -> arrays_zip(array_repeat(x, size(array_col)), array_col)
                )),
                x -> array(x["0"], x["array_col"])
            ),
            x -> x[0] < x[1]
        )
    """)
)
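To see what the transform/flatten/filter pipeline above computes, here is the same idea as a minimal pure-Python sketch (an illustration only, not Spark code): build the cross product of the array with itself, then keep only the ordered pairs.

def pair_combinations(arr):
    # Cross product of arr with itself (what transform + flatten builds) ...
    crossed = [[x, y] for x in arr for y in arr]
    # ... then keep only ordered pairs x < y (what the filter step does).
    return [p for p in crossed if p[0] < p[1]]

print(pair_combinations([2, 3, 4]))  # [[2, 3], [2, 4], [3, 4]]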
pandas_udf is an efficient and concise approach in PySpark.
from pyspark.sql import functions as F
import pandas as pd
from itertools import combinations
@F.pandas_udf('array<array<int>>')
def pudf(c: pd.Series) -> pd.Series:
    return c.apply(lambda x: list(combinations(x, 2)))
df = df.withColumn('out', pudf('array_col'))
df.show(truncate=0)
# +------------+------------------------------------------------+
# |array_col |out |
# +------------+------------------------------------------------+
# |[0, 1] |[[0, 1]] |
# |[2, 3, 4] |[[2, 3], [2, 4], [3, 4]] |
# |[5, 6, 7, 8]|[[5, 6], [5, 7], [5, 8], [6, 7], [6, 8], [7, 8]]|
# +------------+------------------------------------------------+
Note: in some systems, instead of 'array<array<int>>' you may need to provide types from pyspark.sql.types, e.g. ArrayType(ArrayType(IntegerType())).
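For reference, a sketch of the same UDF with an explicit DataType return type instead of the DDL string, assuming integer arrays as in the example input:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType
import pandas as pd
from itertools import combinations

@F.pandas_udf(ArrayType(ArrayType(IntegerType())))
def pudf(c: pd.Series) -> pd.Series:
    # combinations() yields tuples; convert each to a list for the array type.
    return c.apply(lambda x: [list(p) for p in combinations(x, 2)])

df = df.withColumn('out', pudf('array_col'))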
Here is my dataframe:
| col1      | col2      | col3      |
|-----------|-----------|-----------|
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] |
I also have this function:
def joiner(col1, col2, col3):
    snip = []
    snip.append(col1)
    snip.append(col2)
    snip.append(col3)
    return snip
I want to call this on each of the columns and assign it to a new column.
My end goal would be something like this:
| col1      | col2      | col3      | col4                            |
|-----------|-----------|-----------|---------------------------------|
| [1,2,3,4] | [1,2,3,4] | [1,2,3,4] | [[1,2,3,4],[1,2,3,4],[1,2,3,4]] |
Just apply list on axis=1; it'll create a list for each row:
>>> df['col4'] = df.apply(list, axis=1)
OUTPUT:
col1 col2 col3 col4
0 [1, 2, 3, 4] [1, 2, 3, 4] [1, 2, 3, 4] [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
You can just do
df['col'] = df.values.tolist()
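A minimal self-contained sketch showing that the two answers give the same nested lists (the one-row frame below is just illustrative):

import pandas as pd

df = pd.DataFrame({'col1': [[1, 2, 3, 4]],
                   'col2': [[1, 2, 3, 4]],
                   'col3': [[1, 2, 3, 4]]})

df['col4'] = df.apply(list, axis=1)  # row-wise apply builds one list per row
# .values.tolist() produces the same nested lists in a single call
assert df['col4'].tolist() == df[['col1', 'col2', 'col3']].values.tolist()
print(df['col4'][0])  # [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]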
Hi I have just started to learn Python programming.
I wrote this code:
a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 2, 3], [4, 5, 6]]
c = []
d = []
for i in range(len(a)):
    for j in range(len(a[0])):
        d.append(a[i][j] + b[i][j])
    c.append(d)
print(c)
I got this output:
[[2, 4, 6, 8, 10, 12], [2, 4, 6, 8, 10, 12]]
But to my understanding the output should be:
[[2, 4, 6], [2, 4, 6, 8, 10, 12]]
So could someone please explain the output to me?
Thank you.
You need to copy the list to get your desired output.
a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 2, 3], [4, 5, 6]]
c = []
d = []
d1 = []
for i in range(len(a)):
    for j in range(len(a[0])):
        d.append(a[i][j] + b[i][j])
    d1 = d.copy()  # copy d into d1; appending the copy means later appends to d do not change the list already stored in c
    c.append(d1)
print(c)
I tried to debug it and got the same answer; here is the trace:
| i = | j = | d = |
| --- | --- | ------------- |
| 0 | 0 | [ 2 ] |
| 0 | 1 | [ 2, 4 ] |
| 0 | 2 | [ 2, 4, 6 ] |
end of i = 0 iteration, so d = [ 2, 4, 6 ]
end of i = 0 iteration, so c = [ [ 2, 4, 6 ] ]
| i = | j = | d = |
| --- | --- | ---------------------- |
| 1 | 0 | [ 2, 4, 6, 8 ] |
| 1 | 1 | [ 2, 4, 6, 8, 10 ] |
| 1 | 2 | [ 2, 4, 6, 8, 10, 12 ] |
end of i = 1 iteration, so d = [ 2, 4, 6, 8, 10, 12 ]
end of i = 1 iteration, c prints as [ [ 2, 4, 6, 8, 10, 12 ], [ 2, 4, 6, 8, 10, 12 ] ], not [ [ 2, 4, 6 ], [ 2, 4, 6, 8, 10, 12 ] ]
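That happens because c.append(d) stores a reference to d, not a snapshot of its current contents. A tiny sketch makes the aliasing visible:

d = [2, 4, 6]
c = []
c.append(d)            # c[0] is the SAME object as d, not a copy
d.extend([8, 10, 12])
print(c)               # [[2, 4, 6, 8, 10, 12]] -- c[0] grew along with d
print(c[0] is d)       # True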
I have a buyer (buyerid) and this buyer can buy several different cars (carid).
I would like to list which cars he has bought.
Here I would like to summarize all cars for each buyer and save them as a list.
For example, buyer 1 bought the cars with ID 1 and ID 2, so this buyer's list should contain [1,2].
How do I make such a list?
If I call the method .values.tolist(), I get each row as a list, but I want the Carid values summarized by buyer.
import pandas as pd
d = {'Buyerid': [1,1,2,2,3,3,3,4,5,5,5],
     'Carid': [1,2,3,4,4,1,2,4,1,3,5],
     'Carid2': [1,2,3,4,4,1,2,4,1,3,5]}
df = pd.DataFrame(data=d)
print(df)
ls = df.values.tolist()
print(ls)
Buyerid Carid Carid2
0 1 1 1
1 1 2 2
2 2 3 3
3 2 4 4
4 3 4 4
5 3 1 1
6 3 2 2
7 4 4 4
8 5 1 1
9 5 3 3
10 5 5 5
[[1, 1, 1], [1, 2, 2], [2, 3, 3], [2, 4, 4], [3, 4, 4], [3, 1, 1], [3, 2, 2], [4, 4, 4], [5, 1, 1], [5, 3, 3], [5, 5, 5]]
# What I want as list
[[1,2],[3,4],[4,1,2],[4],[1,3,5]]
If you need to select specific columns for processing, use GroupBy.apply with np.unique (if order is not important):
import numpy as np

L = (df.groupby(['Buyerid'])[['Carid','Carid2']]
       .apply(lambda x: np.unique(x).tolist()).tolist())
Or if you need to process all columns except Buyerid, use:
L = (df.set_index('Buyerid')
       .groupby('Buyerid')
       .apply(lambda x: np.unique(x).tolist())
       .tolist())
print (L)
[[1, 2], [3, 4], [1, 2, 4], [4], [1, 3, 5]]
If ordering is important, use DataFrame.melt to unpivot, with duplicates removed by DataFrame.drop_duplicates:
L1 = (df.melt('Buyerid')
        .drop_duplicates(['Buyerid','value'])
        .groupby('Buyerid')['value']
        .agg(list)
        .tolist())
print (L1)
[[1, 2], [3, 4], [4, 1, 2], [4], [1, 3, 5]]
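For comparison, a sketch that also keeps first-seen order without melt, deduplicating per group with the order-preserving dict.fromkeys (the example frame is rebuilt so the snippet runs on its own):

import pandas as pd

d = {'Buyerid': [1,1,2,2,3,3,3,4,5,5,5],
     'Carid': [1,2,3,4,4,1,2,4,1,3,5],
     'Carid2': [1,2,3,4,4,1,2,4,1,3,5]}
df = pd.DataFrame(data=d)

# Concatenate both car columns per buyer, then dedup while keeping order.
out = (df.groupby('Buyerid')[['Carid', 'Carid2']]
         .apply(lambda g: list(dict.fromkeys(g['Carid'].tolist()
                                             + g['Carid2'].tolist())))
         .tolist())
print(out)  # [[1, 2], [3, 4], [4, 1, 2], [4], [1, 3, 5]]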
I have a Pandas DataFrame that looks like this:
| Index | Value |
|-------|--------------|
| 1 | [1, 12, 123] |
| 2 | [12, 123, 1] |
| 3 | [123, 12, 1] |
and I want to append a third column with the lengths of the list elements:
| Index | Value | Expected_value |
|-------|--------------|----------------|
| 1 | [1, 12, 123] | [1, 2, 3] |
| 2 | [12, 123, 1] | [2, 3, 1] |
| 3 | [123, 12, 1] | [3, 2, 1] |
I've tried to use a Python lambda function with map, like this:
dataframe["Expected_value"] = dataframe.Value.map(lambda x: len(str(x)))
but instead of lists I got the sum of those lengths:
| Index | Value | Expected_value |
|-------|--------------|----------------|
| 1 | [1, 12, 123] | 6 |
| 2 | [12, 123, 1] | 6 |
| 3 | [123, 12, 1] | 6 |
You can use a list comprehension with map:
dataframe["Expected_value"] = dataframe.Value.map(lambda x: [len(str(y)) for y in x])
Or a nested list comprehension:
dataframe["Expected_value"] = [[len(str(y)) for y in x] for x in dataframe.Value]
It is also possible to get the lengths of the integers directly with math.log10 (positive integers only, since log10 fails for zero and negative values):
import math
dataframe["Expected_value"] = [[int(math.log10(y))+1 for y in x] for x in dataframe.Value]
print (dataframe)
Index Value Expected_value
0 1 [1, 12, 123] [1, 2, 3]
1 2 [12, 123, 1] [2, 3, 1]
2 3 [123, 12, 1] [3, 2, 1]
Use a list comprehension:
[[len(str(y)) for y in x] for x in df['Value'].tolist()]
# [[1, 2, 3], [2, 3, 1], [3, 2, 1]]
df['Expected_value'] = [[len(str(y)) for y in x] for x in df['Value'].tolist()]
df
Index Value Expected_value
0 1 [1, 12, 123] [1, 2, 3]
1 2 [12, 123, 1] [2, 3, 1]
2 3 [123, 12, 1] [3, 2, 1]
If you need to handle missing data:
import numpy as np

def foo(x):
    try:
        return [len(str(y)) for y in x]
    except TypeError:
        return np.nan

df['Expected_value'] = [foo(x) for x in df['Value'].tolist()]
df
Index Value Expected_value
0 1 [1, 12, 123] [1, 2, 3]
1 2 [12, 123, 1] [2, 3, 1]
2 3 [123, 12, 1] [3, 2, 1]
This is probably the best option in terms of performance when dealing with object-dtype data. More reading at For loops with pandas - When should I care?.
Another solution with pd.DataFrame, applymap and agg:
pd.DataFrame(df['Value'].tolist()).astype(str).applymap(len).agg(list, axis=1)
0 [1, 2, 3]
1 [2, 3, 1]
2 [3, 2, 1]
dtype: object
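One caveat: DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map, so on recent versions the same line becomes:

pd.DataFrame(df['Value'].tolist()).astype(str).map(len).agg(list, axis=1)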