Split (Explode) String into Multiple columns and rows - Python

Good afternoon, I am trying to split text in a column into a specific format.
Here is my table:
UserId Application
1 Grey Blue::Black Orange;White:Green
2 Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId Application Role
1 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
2 Blue Pink Red
So far my code is:
def unnesting(df, explode):
    # Repeat each index label once per element of the list in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining columns by index
    return df1.join(df.drop(columns=explode), how='left')
df['Application'] = df.Roles.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop(columns='Roles'), ['Application'])
The code produces the following output:
UserId Application
1 Grey Blue
1 White
2 Yellow Purple
2 Blue Pink
I do not know how to add the second column (Role) to the code, i.e. the part produced by the second split after ::

Given this dataframe:
UserId Application
0 1 Grey Blue::Black Orange;White::Green
1 2 Yellow Purple::Orange Grey;Blue Pink::Red
you could at least achieve the last two columns directly via
df.Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_0', 'level_1'])
which results in
0 1
0 Grey Blue Black Orange
1 White Green
2 Yellow Purple Orange Grey
3 Blue Pink Red
However, setting UserId as the index first also yields the proper UserId column:
result = df.set_index('UserId').Application.str.split(';', expand=True).stack().str.split('::', expand=True).reset_index().drop(columns=['level_1'])
result.columns = ['UserId', 'Application', 'Role']
UserId Application Role
0 1 Grey Blue Black Orange
1 1 White Green
2 2 Yellow Purple Orange Grey
3 2 Blue Pink Red
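Note that the table in the question has a single colon in White:Green, while the answer's dataframe uses White::Green. A minimal sketch, assuming the single colon is intentional and that both : and :: should separate Application from Role, could combine DataFrame.explode with str.extract. The regular expression and the intermediate names (pairs, split) are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

# One row per ';'-separated "Application::Role" pair
pairs = (
    df.assign(Application=df["Application"].str.split(";"))
      .explode("Application")
      .reset_index(drop=True)
)

# Assumption: the separator is one or two colons; named groups become column names
split = pairs["Application"].str.extract(r"^(?P<Application>[^:]+)::?(?P<Role>.+)$")

result = pd.concat([pairs[["UserId"]], split], axis=1)
print(result)
```

With a plain split('::'), the White:Green entry would stay in one piece and its Role would be missing, which is why the pattern here accepts either form.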

Related

Finding combinations of values across multiple rows and columns in pandas dataframe

I am trying to create all possible sets of values from a pandas dataframe, taking 3 values from each row.
For example, if I consider the dataframe below:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Amber
Chair Magenta Turquoise White Orange Violet Pink
Window Indigo Yellow Cerulean Grey Peach Aqua
I want to generate all possible solution sets, by taking 3 values from the first, second, third and fourth rows each.
This is what I tried:
from itertools import product
uniques = [df[i].unique().tolist() for i in df.columns ]
pd.DataFrame(product(*uniques), columns = df.columns)
But this generates all combinations with all 6 columns like this:
0 1 2 3 4 5 6
Table Black Brown Red Blue Green Aqua
Table Black Brown Red Blue Green Violet
Here, all values of Row 1 remain the same except for the last value, and all combinations are generated like this.
But what I need is this:
0 1 2 3 4 5 6 7 8 9
Table Black Red Blue Magenta White Orange Yellow Peach Aqua
Here, the first three values are from Row 1, the second 3 values are from Row 2, and the last 3 values are from Row 3.
Similarly, I want to display all such sets, and store them in a new dataframe.
Any help will be appreciated.
df
###
0 1 2 3 4 5 6
0 Table Black Brown Red Blue Green Amber
1 Chair Magenta Turquoise White Orange Violet Pink
2 Window Indigo Yellow Cerulean Grey Peach Aqua
from itertools import product
import random
uniques = [df[i].unique().tolist() for i in df.columns]
rl = list(product(*uniques))
pd.DataFrame(random.choices(rl))
product() generates all the combinations from sets, but you want a random result from the combinations list.
0 1 2 3 4 5 6
0 Table Black Brown White Orange Peach Aqua
Supplement: combinations with 3 sets.
If you select one element from each set, how many combinations are there?
2 * 2 * 3 = 12
Let's check whether the total number of combinations is 12.
list_of_lists = [['Yellow','Blue'],['Cat','Dog'],['Swim','Run','Sleep']]
combination = product(*list_of_lists)
combination_list = list(combination)
pd.DataFrame(combination_list)
###
0 1 2
0 Yellow Cat Swim
1 Yellow Cat Run
2 Yellow Cat Sleep
3 Yellow Dog Swim
4 Yellow Dog Run
5 Yellow Dog Sleep
6 Blue Cat Swim
7 Blue Cat Run
8 Blue Cat Sleep
9 Blue Dog Swim
10 Blue Dog Run
11 Blue Dog Sleep
Choosing one row at random from the table above then yields a single set from the combinations.
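The approach above samples from the column-wise product. If the goal is instead the one described in the question (three values taken from each row), a possible sketch uses itertools.combinations per row and then product across rows. It assumes that column 0 holds a row label rather than a selectable value, and that order within a row does not matter; the names per_row and all_sets are illustrative.

```python
from itertools import combinations, product

import pandas as pd

df = pd.DataFrame([
    ["Table", "Black", "Brown", "Red", "Blue", "Green", "Amber"],
    ["Chair", "Magenta", "Turquoise", "White", "Orange", "Violet", "Pink"],
    ["Window", "Indigo", "Yellow", "Cerulean", "Grey", "Peach", "Aqua"],
])

# Assumption: column 0 is a label; columns 1..6 hold the values to choose from.
# For each row, list every way to pick 3 of its 6 values (20 per row).
per_row = [list(combinations(row[1:], 3)) for row in df.to_numpy().tolist()]

# Pick one triple from each row and flatten the three triples into 9 values.
all_sets = pd.DataFrame([sum(triples, ()) for triples in product(*per_row)])

print(all_sets.shape)
```

Each row contributes 20 triples, so the result has 20 * 20 * 20 = 8000 rows of 9 values. The row label could be prepended as an extra column if it is needed in the output.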

Pandas count size of groupby groups idiomatically

I often want a dataframe of counts for how many members are in each group after a groupby operation in pandas. I have a verbose way of doing it with size and reset index and rename, but I'm sure there is a better way.
Here's an example of what I'd like to do:
import pandas as pd
import numpy as np
np.random.seed(0)
colors = ['red','green','blue']
cdf = pd.DataFrame({
    'color1': np.random.choice(colors, 10),
    'color2': np.random.choice(colors, 10),
})
print(cdf)
#better way to do next line? (somehow use agg?)
gb_count = cdf.groupby(['color1','color2']).size().reset_index().rename(columns={0:'num'})
print(gb_count)
#cdf.groupby(['color1','color2']).count() #<-- this doesn't work
Final output:
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
To avoid getting your MultiIndex, use as_index=False:
cdf.groupby(['color1','color2'], as_index=False).size()
color1 color2 size
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
If you explicitly want to name your new column num, you can use reset_index with name=..., since groupby(...).size() returns a Series:
cdf.groupby(['color1','color2']).size().reset_index(name='num')
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
Another way is to call agg('size'), convert the result with to_frame() (passing the preferred column name), and then reset the index:
gb_count = cdf.groupby(['color1','color2']).agg('size').to_frame('num').reset_index()
color1 color2 num
0 blue green 1
1 blue red 1
2 green blue 3
3 red green 4
4 red red 1
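As a further option not mentioned in the answers above, DataFrame.value_counts (available in pandas 1.1 and later) counts unique rows over a subset of columns. The sketch below compares it with the groupby version. By default value_counts sorts by count, so the explicit sort_values call is only there to make the two results line up.

```python
import numpy as np
import pandas as pd

np.random.seed(0)
colors = ['red', 'green', 'blue']
cdf = pd.DataFrame({
    'color1': np.random.choice(colors, 10),
    'color2': np.random.choice(colors, 10),
})

# Reference result from the groupby approach
via_groupby = cdf.groupby(['color1', 'color2']).size().reset_index(name='num')

# Same counts via value_counts, re-sorted by the group keys
via_value_counts = (
    cdf.value_counts(['color1', 'color2'])
       .reset_index(name='num')
       .sort_values(['color1', 'color2'])
       .reset_index(drop=True)
)

print(via_value_counts)
```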

Python, Splitting Multiple Strings in a Column

Good afternoon, I am trying to split text in a column into a specific format.
Here is my table:
UserId Application
1 Grey Blue::Black Orange;White:Green
2 Yellow Purple::Orange Grey;Blue Pink::Red
I would like it to read the following:
UserId Application
1 Grey Blue
1 White Orange
2 Yellow Purple
2 Blue Pink
Basically, I would like to keep the first string of every :: instance for every string in a given cell.
So far my code is:
def unnesting(df, explode):
    # Repeat each index label once per element of the list in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining columns by index
    return df1.join(df.drop(columns=explode), how='left')
df['Application'] = df.Role.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop(columns='Role'), ['Application'])
The code produces the following output:
UserId Application
1 Grey Blue, White Orange
2 Yellow Purple, Blue Pink
Please assist; I don't know whether I should be using pandas or numpy to solve this problem.
Maybe you can try using extractall:
yourdf=df.set_index('UserId').Application.str.extractall(r'(\w+):').reset_index(level=0)
# You can add rename(columns={0: 'Application'}) at the end
Out[87]:
UserId 0
match
0 1 Grey
1 1 White
0 2 Yellow
1 2 Blue
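One caveat worth checking: \w+ does not match spaces, so with multi-word names such as those in the question's table, the pattern above would capture only the word directly before each colon. A hedged sketch of a pattern that keeps the whole name follows. The regular expression is an illustrative assumption about the format: entries separated by ;, with the name ending at the first colon.

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

# Capture everything between the start of the cell (or a ';') and the next ':'
names = (
    df.set_index("UserId")["Application"]
      .str.extractall(r"(?:^|;)([^:;]+):")
      .reset_index(level=0)
      .rename(columns={0: "Application"})
)

print(names)
```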
Update: look at unnesting. After we split the string and select the values we need, we store them in a list. When you have a list type in your columns, I recommend using unnesting:
df['LIST'] = df.Application.str.split(';|::|:').map(lambda x: x[0::2])
unnesting(df.drop(columns='Application'), ['LIST'])
Out[111]:
LIST UserId
0 Grey Blue 1
0 White 1
1 Yellow Purple 2
1 Blue Pink 2
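My own helper function:
def unnesting(df, explode):
    # Repeat each index label once per element of the list in that row
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten each list column into one long column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # Re-attach the remaining columns by index
    return df1.join(df.drop(columns=explode), how='left')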
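On pandas 0.25 or later, the built-in DataFrame.explode covers the single-column case that the helper handles here. The following is a sketch of the same split-and-select step under that assumption, not a drop-in replacement for every use of unnesting (for example, exploding several list columns at once needs a newer pandas version).

```python
import pandas as pd

df = pd.DataFrame({
    "UserId": [1, 2],
    "Application": [
        "Grey Blue::Black Orange;White:Green",
        "Yellow Purple::Orange Grey;Blue Pink::Red",
    ],
})

exploded = (
    # Split on ';', '::' or ':' and keep every other piece (the names)
    df.assign(LIST=df["Application"].str.split(";|::|:").map(lambda x: x[0::2]))
      .drop(columns="Application")
      # One row per list element; the original index labels are repeated
      .explode("LIST")
)

print(exploded)
```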

Convert text table to pandas dataframe

Many times when I'm trying to answer questions on Stack Overflow, the question contains a table, which I have to convert to a pandas dataframe in order to process it. For example, in this question:
http://stackoverflow.com/questions/43172116/pandas-count-some-value-in-all-columns
My question is, is there a faster way to convert it to a dataframe rather than writing:
df = pd.DataFrame({'graph': [1, 2, 3, 4, 5, 6],
                   0: ['blue', 'blue', 'red', 'red', 'blue', 'blue'],
                   1: ['blue', 'blue', 'red', 'blue', 'red', 'blue'],
                   2: ['blue', 'blue', 'blue', 'red', 'blue', 'blue'],
                   3: ['blue', 'blue', 'blue', 'red', 'red', 'blue'],
                   4: ['blue', 'blue', 'red', 'blue', 'red', 'blue']})
given that I can copy the text:
graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue
Make sure the desired data set is in the clipboard and use the pd.read_clipboard() method.
Step by step:
1. Select the desired data set.
2. Press Ctrl+C (on MS Windows).
3. Execute: df = pd.read_clipboard()
In [40]: df = pd.read_clipboard()
In [41]: df
Out[41]:
graph 0 1 2 3 4
0 1 blue blue blue blue blue
1 2 blue blue blue blue blue
2 3 blue red blue blue red
3 4 red blue red red blue
4 5 red red blue red red
5 6 blue blue blue blue blue
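When a clipboard is not available (for instance in a script or on a remote session), a similar result can be had by pasting the text into a string and parsing it with pd.read_csv and a whitespace separator. This is a sketch of that alternative. It assumes that no cell contains spaces, and note that the column labels come back as strings ('0', '1', ...).

```python
import io

import pandas as pd

text = """\
graph 0 1 2 3 4
1 blue blue blue blue blue
2 blue blue blue blue blue
3 blue red blue blue red
4 red blue red red blue
5 red red blue red red
6 blue blue blue blue blue
"""

# Treat any run of whitespace as the column separator
df = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(df)
```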

Pandas: How do I check for value match between columns in same dataframe?

I am a complete coding novice and have been experimenting with Pandas. This is my first post. Thank you in advance for your help!
I would like to remove any rows where cat1 does not match either dog1 or dog2. It does not have to match both, just one or the other.
cat1 dog1 dog2
0 red red blue
1 red green blue
2 blue red blue
3 blue blue green
4 red green blue
I would like the end result to be as follows:
cat1 dog1 dog2
0 red red blue
2 blue red blue
3 blue blue green
How do I do this?
This is really simple:
df.query('cat1 == dog1 or cat1 == dog2')
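For readers new to pandas, the same filter can also be written as an explicit boolean mask, which may make the row-by-row comparison easier to follow. This is a minimal sketch that rebuilds the example table from the question; the names mask and filtered are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "cat1": ["red", "red", "blue", "blue", "red"],
    "dog1": ["red", "green", "red", "blue", "green"],
    "dog2": ["blue", "blue", "blue", "green", "blue"],
})

# True where cat1 equals dog1 or dog2 in the same row
mask = (df["cat1"] == df["dog1"]) | (df["cat1"] == df["dog2"])

# Keep only the matching rows; the original index labels are preserved
filtered = df[mask]

print(filtered)
```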
