Convert multiple rows into one row with multiple columns in pyspark?

I have something like this (I've simplified the number of columns for brevity; there are about 10 other attributes):
id  name  foods    foods_eaten  color  continent
1   john  apples   2            red    Europe
1   john  oranges  3            red    Europe
2   jack  apples   1            blue   North America
I want to convert it to:
id  name  apples  oranges  color  continent
1   john  2       3        red    Europe
2   jack  1       0        blue   North America
Edit: I updated the data to show a few more of the columns. I've done:
df_piv = df.groupBy(['id', 'name', 'color', 'continent', ...]).pivot('foods').avg('foods_eaten')
Is there a simpler way to do this sort of thing? As far as I can tell, I'll need to groupby almost every attribute to get my result.
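One simplification: the grouping list can be derived from df.columns instead of being typed out. A minimal sketch, assuming the df above and a running SparkSession (F.first stands in for avg, since each (group, food) pair appears once):
from pyspark.sql import functions as F

grouping_cols = [c for c in df.columns if c not in ('foods', 'foods_eaten')]
df_piv = (df.groupBy(grouping_cols)
            .pivot('foods')
            .agg(F.first('foods_eaten'))
            .fillna(0))  # missing foods become 0, as in the desired output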

Extending from what you have done so far and leveraging the approach here:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import collect_list
>>> data = [{'id': 1, 'name': 'john', 'foods': "apples"}, {'id': 1, 'name': 'john', 'foods': "oranges"}, {'id': 2, 'name': 'jack', 'foods': "banana"}]
>>> dataframe = spark.createDataFrame(data)
>>> dataframe.show()
+-------+---+----+
| foods| id|name|
+-------+---+----+
| apples| 1|john|
|oranges| 1|john|
| banana| 2|jack|
+-------+---+----+
>>> grouping_cols = ["id", "name"]
>>> other_cols = [c for c in dataframe.columns if c not in grouping_cols]
>>> df = dataframe.groupBy(grouping_cols).agg(*[collect_list(c).alias(c) for c in other_cols])
>>> df.show()
+---+----+-----------------+
| id|name| foods|
+---+----+-----------------+
| 1|john|[apples, oranges]|
| 2|jack| [banana]|
+---+----+-----------------+
>>> df_sizes = df.select(*[F.size(col).alias(col) for col in other_cols])
>>> df_max = df_sizes.agg(*[F.max(col).alias(col) for col in other_cols])
>>> max_dict = df_max.collect()[0].asDict()
>>> df_result = df.select('id', 'name', *[df[col][i] for col in other_cols for i in range(max_dict[col])])
>>> df_result.show()
+---+----+--------+--------+
| id|name|foods[0]|foods[1]|
+---+----+--------+--------+
| 1|john| apples| oranges|
| 2|jack| banana| null|
+---+----+--------+--------+
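If the generated foods[0]-style names are awkward downstream, the same select can alias them. A small variation on the snippet above (the foods_1/foods_2 names are just an illustration):
df_result = df.select(
    'id', 'name',
    *[df[c][i].alias(f'{c}_{i + 1}') for c in other_cols for i in range(max_dict[c])]
)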

Related

How to do text-to-columns in pandas and create new columns?

I have a CSV file as shown below. The names column has names separated by commas; I want to split them on the comma, append them to new columns, and create the same CSV, similar to text-to-columns in Excel. The problem is some rows have a varying number of names.
| Address | Name                |
| 1st st  | John, Smith         |
| 2nd st. | Andrew, Jane, Aaron |
My pandas code looks something like this:
df1 = pd.read_csv('sample.csv')
df1['Name'] = df1['Name'].str.split(',', expand=True)
df1.to_csv('results.csv',index=None)
Of course this doesn't work, because "columns must be same length as key". The expected output is:
| Address | Name   |       |       |
| 1st st  | John   | Smith |       |
| 2nd st. | Andrew | Jane  | Aaron |
Count the max number of commas, then assign to new columns accordingly:
max_commas = df['name'].str.split(',').transform(len).max()
df[[f'name_{x}' for x in range(max_commas)]] = df['name'].str.split(',', expand=True)
input df:
      col                        name
0  1st st                 john, smith
1  2nd st          andrew, jane, aron
2  3rd st  harry, philip, anna, james
output:
      col                        name  name_0  name_1  name_2  name_3
0  1st st                 john, smith    john   smith    None    None
1  2nd st          andrew, jane, aron  andrew    jane    aron    None
2  3rd st  harry, philip, anna, james   harry  philip    anna   james
You can do:
out = df.join(df['Name'].str.split(', ',expand=True).add_prefix('name_'))
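To round-trip back to a CSV as the question asks, a minimal sketch assuming the sample.csv layout above (the str.strip removes the space left after each comma):
import pandas as pd

df1 = pd.read_csv('sample.csv')
names = df1['Name'].str.split(',', expand=True)               # one column per name, padded with None
names = names.apply(lambda col: col.str.strip())              # drop the space left after each comma
names.columns = [f'Name_{i}' for i in range(names.shape[1])]
df1.drop(columns='Name').join(names).to_csv('results.csv', index=None)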

Moving row cell to column if index is same

I have a dataframe like the example below:
Type   | date
Apple  | 01/01/2021
Apple  | 10/02/2021
Orange | 05/01/2021
Orange | 20/20/2020
Is there an easy way to transform the data as below?
Type   | Date
Apple  | 01/01/2020 | 10/20/2021
Orange | 05/01/2020 | 20/20/2020
The stack function does not meet my requirement.
You could group by "type", collect the "date" values and make a new dataframe.
import pandas as pd

df = pd.DataFrame({'type': ['Apple', 'Apple', 'Orange', 'Orange'], 'date': ['01/01/2021', '10/02/2021', '05/01/2021', '20/20/2020']})
d = {}
for fruit, group in df.groupby('type'):
    d[fruit] = group.date.values
pd.DataFrame(d).T
                 0           1
Apple   01/01/2021  10/02/2021
Orange  05/01/2021  20/20/2020
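The same reshape also works without an explicit loop: number each type's dates with cumcount, then pivot. A sketch with the same sample frame; unlike the dictionary version, this pads unequal groups with NaN instead of failing:
df['n'] = df.groupby('type').cumcount()                   # 0, 1, ... within each type
out = df.pivot(index='type', columns='n', values='date')
out

n                0           1
type
Apple   01/01/2021  10/02/2021
Orange  05/01/2021  20/20/2020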

Add prefix to ffill, identifying values which were carried forward

Is there a way to add a prefix when filling NAs with ffill in pandas? I have a dataframe containing taxonomic information, like so:
| Kingdom  | Phylum        | Class       | Order           | Family           | Genus         |
| Bacteria | Firmicutes    | Bacilli     | Lactobacillales | Lactobacillaceae | Lactobacillus |
| Bacteria | Bacteroidetes | Bacteroidia | Bacteroidales   |                  |               |
| Bacteria | Bacteroidetes |             |                 |                  |               |
Since not all of the taxa in my dataframe can be classified fully, I have some empty cells. Replacing the empty cells with NA and using ffill, I can fill these with the last valid string in each row, but I would like to add a string (for example "Unknown_Bacteroidales") so I can identify which ones were carried forward.
So far I tried taxa_formatted = "unknown_" + taxonomy.fillna(method='ffill', axis=1), but this of course adds the "unknown_" prefix to everything in the dataframe.
You can do this using boolean masking with df.isna.
import numpy as np

df = df.replace("", np.nan)  # skip this step if NaNs are already present
d = df.ffill()
d[df.isna()] += "(Copy)"
d
    Kingdom         Phylum              Class                Order                  Family                Genus
0  Bacteria     Firmicutes            Bacilli      Lactobacillales        Lactobacillaceae        Lactobacillus
1  Bacteria  Bacteroidetes        Bacteroidia        Bacteroidales  Lactobacillaceae(Copy)  Lactobacillus(Copy)
2  Bacteria  Bacteroidetes  Bacteroidia(Copy)  Bacteroidales(Copy)  Lactobacillaceae(Copy)  Lactobacillus(Copy)
You can use df.add here.
d = df.ffill(axis=1)
df.add("unknown_" + d[df.isna()], fill_value='')
    Kingdom         Phylum                  Class                  Order                 Family                  Genus
0  Bacteria     Firmicutes                Bacilli        Lactobacillales       Lactobacillaceae          Lactobacillus
1  Bacteria  Bacteroidetes            Bacteroidia          Bacteroidales  unknown_Bacteroidales  unknown_Bacteroidales
2  Bacteria  Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes
You need to use mask and update:
# make true NaNs first
# df = df.replace('', np.nan)
s = df.isnull()
df = df.ffill(axis=1)
df.update('unknown_' + df.mask(~s))
print(df)
   Bacteria     Firmicutes                Bacilli        Lactobacillales       Lactobacillaceae          Lactobacillus
0  Bacteria  Bacteroidetes            Bacteroidia          Bacteroidales  unknown_Bacteroidales  unknown_Bacteroidales
1  Bacteria  Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes  unknown_Bacteroidetes
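All three answers share the same idea: record where the NaNs were, fill, then modify only those positions. A compact equivalent using where (a sketch; it assumes empty strings were already replaced with NaN):
filled = df.ffill(axis=1)
result = filled.where(df.notna(), 'unknown_' + filled)  # keep original values, prefix only the filled ones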

Find rows that share values

I have a pandas dataframe that look like this:
df = pd.DataFrame({'name': ['bob', 'tim', 'jane', 'john', 'andy'], 'favefood': [['kfc', 'mcd', 'wendys'], ['mcd'], ['mcd', 'popeyes'], ['wendys', 'kfc'], ['tacobell', 'innout']]})
-------------------------------
name | favefood
-------------------------------
bob | ['kfc', 'mcd', 'wendys']
tim | ['mcd']
jane | ['mcd', 'popeyes']
john | ['wendys', 'kfc']
andy | ['tacobell', 'innout']
For each person, I want to find out how many favefood's of other people overlap with their own.
I.e., for each person I want to find out how many other people have a non-empty intersection with them.
The resulting dataframe would look like this:
------------------------------
name | overlap
------------------------------
bob | 3
tim | 2
jane | 2
john | 1
andy | 0
The problem is that I have about 2 million rows of data. The only way I can think of doing this would be through a nested for-loop, i.e. for each person, go through the entire dataframe to see what overlaps (this would be extremely inefficient). Would there be any way to do this more efficiently using pandas notation? Thanks!
Logic behind it:
s = df['favefood'].explode().str.get_dummies().sum(level=0)  # in newer pandas: .groupby(level=0).sum()
s.dot(s.T).ne(0).sum(axis=1) - 1
Out[84]:
0 3
1 2
2 2
3 1
4 0
dtype: int64
df['overlap'] = s.dot(s.T).ne(0).sum(axis=1) - 1
Method from sklearn:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['favefood']), columns=mlb.classes_, index=df.index)
s.dot(s.T).ne(0).sum(axis=1) - 1
0 3
1 2
2 2
3 1
4 0
dtype: int64
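A caveat for the 2 million rows mentioned in the question: s.dot(s.T) materialises an n x n matrix, which will not fit in memory at that scale if dense. A sparse sketch of the sklearn method (memory still grows with the number of overlapping pairs):
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer(sparse_output=True)
m = mlb.fit_transform(df['favefood'])        # sparse people x foods indicator matrix
co = m @ m.T                                 # people x people shared-food counts
df['overlap'] = (co > 0).sum(axis=1).A1 - 1  # subtract 1 to drop self-overlap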

Add UUIDs to pandas DF

Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The UUIDs should be generated from the uuid.uuid4() function.
My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.
While that would work, I'm wondering if there's a more efficient way to do this?
Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame, you should use the pandas function transform.
Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument, it really doesn't matter what the column is.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 will be called as many times as there are distinct groups.
How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
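A variant of the same idea that scans the frame once instead of once per unique name: build the name-to-UUID mapping up front, then map it (a sketch):
import uuid

uuid_map = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(uuid_map)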
