Python, Splitting Multiple Strings in a Column - python

Good afternoon. I am trying to split the text in a column into a specific format.
Here is my table:
UserId  Application
1       Grey Blue::Black Orange;White:Green
2       Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId  Application
1       Grey Blue
1       White Orange
2       Yellow Purple
2       Blue Pink
Basically, for every ;-separated string in a cell, I would like to keep only the part before the ::.
So far my code is
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')

df['Application'] = df.Role.str.split(';|::|').map(lambda x: x[0::2])
unnesting(df.drop('Role', 1), ['Application'])
This code returns:
UserId  Application
1       Grey Blue, White Orange
2       Yellow Purple, Blue Pink
Please assist. I don't know whether I should be using pandas or numpy to solve this problem.

Maybe you can try using extractall
yourdf=df.set_index('UserId').Application.str.extractall(r'(\w+):').reset_index(level=0)
# You can add .rename(columns={0: 'Application'}) at the end
Out[87]:
       UserId  0
match
0      1       Grey
1      1       White
0      2       Yellow
1      2       Blue
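The pattern `(\w+):` captures a single word directly before a colon. If the whole first string of each `;`-separated segment is wanted (for example "Grey Blue" rather than one word), a different pattern is needed. The sketch below is one possibility, not the answerer's method: the regular expression and the rebuilt sample frame are assumptions made for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

# (?:^|;)   -> start of the cell or a ';' separator (not captured)
# ([^:;]+): -> everything up to the first ':' of that segment (captured)
out = (
    df.set_index("UserId")["Application"]
      .str.extractall(r"(?:^|;)([^:;]+):")
      .reset_index(level=0)              # move UserId back to a column
      .rename(columns={0: "Application"})
      .reset_index(drop=True)            # drop the 'match' index level
)
print(out)
```

Each match becomes its own row, so no separate unnesting step is needed with this approach.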
Update: look at the unnesting function. After splitting the string and selecting the values we need, we store them in a list. When a column holds list values, I recommend using unnesting:
df['LIST'] = df.Application.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop(columns='Application'), ['LIST'])
Out[111]:
   LIST           UserId
0  Grey Blue      1
0  White          1
1  Yellow Purple  2
1  Blue Pink      2
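My own helper function:
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # Repeat each index label once per element of its list
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')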
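On pandas 0.25 and later, the built-in `DataFrame.explode` can stand in for the unnesting helper. The following is a minimal sketch under that assumption. The sample frame is rebuilt from the question, and the `LIST` column name follows the answer.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout).
df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

# Split on ';', '::' or ':' and keep every second piece, i.e. the part
# in front of each colon separator.
df["LIST"] = df["Application"].str.split(";|::|:").map(lambda parts: parts[0::2])

# explode() turns each list element into its own row and repeats UserId.
result = df.drop(columns="Application").explode("LIST").reset_index(drop=True)
print(result)
```

The `reset_index(drop=True)` call is optional. Without it the rows keep the repeated original index (0, 0, 1, 1), as in the output of the helper.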

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets, by taking 3 values from the first, second, third and fourth rows each.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.
df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: combinations with 3 sets.
If you select one element from each set, how many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of combinations really is 12.
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row at random from the table above then gives one set drawn from all the combinations.
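If the goal is instead every set made of 3 values from each row, as the question describes, one option is to combine `itertools.combinations` (per row) with `itertools.product` (across rows). This is a sketch under assumptions, not the answer's method: it treats every cell of a row, including column 0, as eligible, and it keeps the values in their original left-to-right order.

```python
from itertools import chain, combinations, product

import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame([
    ["Table", "Black", "Brown", "Red", "Blue", "Green", "Amber"],
    ["Chair", "Magenta", "Turquoise", "White", "Orange", "Violet", "Pink"],
    ["Window", "Indigo", "Yellow", "Cerulean", "Grey", "Peach", "Aqua"],
])

# All ways to pick 3 of the 7 values in each row: C(7, 3) = 35 per row.
per_row = [list(combinations(row, 3)) for row in df.to_numpy().tolist()]

# One pick per row, flattened into a single 9-value record.
records = [tuple(chain.from_iterable(picks)) for picks in product(*per_row)]

result = pd.DataFrame(records)
print(result.shape)
```

With 3 rows this gives 35 * 35 * 35 = 42875 sets. The count grows quickly with more rows, so sampling may be preferable for larger frames.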

Pandas count size of groupby groups idiomatically

I often want a dataframe of counts for how many members are in each group after a groupby operation in pandas. I have a verbose way of doing it with size and reset index and rename, but I'm sure there is a better way.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
np.random.seed(0)
colors = ['red','green','blue']
cdf = pd.DataFrame({
    'color1': np.random.choice(colors, 10),
    'color2': np.random.choice(colors, 10),
})
print(cdf)
#better way to do next line? (somehow use agg?)
gb_count = cdf.groupby(['color1','color2']).size().reset_index().rename(columns={0:'num'})
print(gb_count)
#cdf.groupby(['color1','color2']).count() #<-- this doesn't work
Final output:
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
To avoid getting a MultiIndex, use as_index=False (in pandas 1.1 and later this returns a DataFrame with a size column):
cdf.groupby(['color1','color2'], as_index=False).size()
color1 color2 size
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
If you explicitly want to name the new column num, you can use reset_index with name=..., since groupby(...).size() returns a Series:
cdf.groupby(['color1','color2']).size().reset_index(name='num')
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
Another way is to call size through agg, convert the result with to_frame (passing the preferred column name), and then reset the index:
gb_count = cdf.groupby(['color1','color2']).agg('size').to_frame('num').reset_index()
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
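To check that the variants agree, here is a small self-contained sketch. The fixed sample data is hypothetical (chosen so the result does not depend on a random seed), and the `as_index=False` form assumes pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical fixed data, so the counts are easy to verify by hand.
cdf = pd.DataFrame({
    "color1": ["red", "red", "blue", "green", "red"],
    "color2": ["green", "green", "red", "blue", "red"],
})

# Variant 1: size() returns a Series; name the count column on reset.
by_reset = cdf.groupby(["color1", "color2"]).size().reset_index(name="num")

# Variant 2: as_index=False returns a DataFrame with a 'size' column.
by_flag = (
    cdf.groupby(["color1", "color2"], as_index=False)
       .size()
       .rename(columns={"size": "num"})
)

print(by_reset)
```

Both frames list the groups in sorted key order, so they can be compared column by column.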

Split (Explode) String into Multiple columns and rows - Python

Good afternoon. I am trying to split the text in a column into a specific format.
Here is my table:
UserId  Application
1       Grey Blue::Black Orange;White:Green
2       Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId  Application    Role
1       Grey Blue      Black Orange
1       White          Green
2       Yellow Purple  Orange Grey
2       Blue Pink      Red
So far my code is
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')

df['Application'] = df.Roles.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop('Roles', 1), ['Application'])
The code above outputs:
UserId  Application
1       Grey Blue
1       White
2       Yellow Purple
2       Blue Pink
I do not know how to add the second column (Role) for the part after the ::.
Given this dataframe:
   UserId  Application
0  1       Grey Blue::Black Orange;White::Green
1  2       Yellow Purple::Orange Grey;Blue Pink::Red
you could at least achieve the last two columns directly via
df.Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_0', 'level_1'])
which results in
   0              1
0  Grey Blue      Black Orange
1  White          Green
2  Yellow Purple  Orange Grey
3  Blue Pink      Red
However, defining UserId as index before would also provide the proper UserId column:
result = df.set_index('UserId').Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_1'])
result.columns = ['UserId', 'Application', 'Role']
   UserId  Application    Role
0  1       Grey Blue      Black Orange
1  1       White          Green
2  2       Yellow Purple  Orange Grey
3  2       Blue Pink      Red
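The same result can be sketched with `DataFrame.explode` (pandas 0.25 and later) instead of `stack`. This is an alternative under assumptions, not the answer's method: the sample frame uses the question's data, where one separator is a single colon, so the split pattern `::?` accepts either one or two colons.

```python
import pandas as pd

# Sample data as posted in the question (note the single ':' in row 1).
df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

# One row per ';'-separated segment, with UserId repeated.
pairs = (
    df.assign(Application=df["Application"].str.split(";"))
      .explode("Application")
      .reset_index(drop=True)
)

# Split each segment once on ':' or '::' into two columns.
parts = pairs["Application"].str.split("::?", n=1, expand=True)

result = pd.DataFrame({
    "UserId": pairs["UserId"],
    "Application": parts[0],
    "Role": parts[1],
})
print(result)
```

Resetting the index before the second split keeps the row labels unique, which makes it straightforward to line the two new columns up with UserId.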

ffill not retaining all columns

I have a df like this:
Key Class
1 Green
1 NaN
1 NaN
2 Red
2 NaN
2 NaN
and I want to forward fill Class. I'm using this code:
df=df.Class.fillna(method='ffill')
and this returns:
Green
Green
Green
Red
Red
Red
How can I retain the Key column while doing this?
df['Class'] = df['Class'].ffill()
In your code you assign the filled Series back to df, which replaces the whole DataFrame. Assign it to the Class column only, as above.
Another way is to forward fill the whole DataFrame:
In [126]:
df.ffill()
Out[126]:
Key Class
0 1 Green
1 1 Green
2 1 Green
3 2 Red
4 2 Red
5 2 Red
You can also set the inplace parameter to True if you don't want to create a new copy of your df:
df.ffill(inplace=True)
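A self-contained sketch of the column-only fill, using the data from the question. The second variant, which fills within each Key group, is an addition not found in the answers. It gives the same result here and would differ only if a group started with NaN.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "Key": [1, 1, 1, 2, 2, 2],
    "Class": ["Green", np.nan, np.nan, "Red", np.nan, np.nan],
})

# Fill only the Class column; Key is left untouched.
filled = df.copy()
filled["Class"] = filled["Class"].ffill()

# Variant (assumption: fills should not cross Key boundaries).
grouped = df.copy()
grouped["Class"] = grouped.groupby("Key")["Class"].ffill()

print(filled)
```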

Concatenate two pandas dataframe with the same index but on different positions

I have a data frame like
id value_right color_right value_left color_left
1 asd red dfs blue
2 dfs blue afd green
3 ccd yellow asd blue
4 hty red hrr red
I need to get the left values below the right values, something like
id value color
1 asd red
1 dfs blue
2 dfs blue
2 afd green
3 ccd yellow
3 asd blue
4 hty red
4 hrr red
I tried to split in two data frames and to interleave using the id, but I got only half of the data on it, using the mod of the value of the id. Any ideas?
Select the left-side and right-side columns, rename them to common names, concatenate the two frames, and sort on the 'id' column:
In [205]:
left = df[['id','value_left','color_left']].rename(columns={'value_left':'value','color_left':'color'})
right = df[['id','value_right','color_right']].rename(columns={'value_right':'value','color_right':'color'})
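# sort_values replaces the removed DataFrame.sort; a stable sort keeps right before left within each id
merged = pd.concat([right, left]).sort_values('id', kind='mergesort')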
merged
Out[205]:
id value color
0 1 asd red
0 1 dfs blue
1 2 dfs blue
1 2 afd green
2 3 ccd yellow
2 3 asd blue
3 4 hty red
3 4 hrr red
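Because the columns follow a `name_side` pattern, `pd.wide_to_long` is another possible route. This is a sketch of an alternative, not the answer's method. The `side` column name is a hypothetical label introduced here, and the descending sort on it is only a way to place "right" before "left".

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value_right": ["asd", "dfs", "ccd", "hty"],
    "color_right": ["red", "blue", "yellow", "red"],
    "value_left": ["dfs", "afd", "asd", "hrr"],
    "color_left": ["blue", "green", "blue", "red"],
})

# Columns named <stub>_<suffix> are reshaped into one column per stub;
# the suffix ('right' / 'left') ends up in the new 'side' index level.
long_df = pd.wide_to_long(
    df, stubnames=["value", "color"], i="id", j="side", sep="_", suffix=r"\w+"
).reset_index()

# 'right' sorts after 'left', so sort side descending to list it first.
merged = (
    long_df.sort_values(["id", "side"], ascending=[True, False])
           .loc[:, ["id", "value", "color"]]
           .reset_index(drop=True)
)
print(merged)
```

This avoids writing out the rename mapping by hand, which may help when there are many value/color style column pairs.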
