Calculating the Entropy of Data in a Table or Matrix - python

Color   Height   Sex
--------------------
Red     Short    Male
Red     Tall     Male
Blue    Medium   Female
Green   Medium   Female
Green   Tall     Female
Green   Short    Male
How do I compute the entropy of the table as a whole in Python?
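One common reading of "entropy of the table" is the Shannon entropy of the joint distribution of rows, or, per column, of that column's value counts. The sketch below assumes that reading; the shannon_entropy helper and the hand-built DataFrame are illustrative and not part of the original post:
import numpy as np
import pandas as pd

# Rebuild the example table by hand
df = pd.DataFrame({
    'Color':  ['Red', 'Red', 'Blue', 'Green', 'Green', 'Green'],
    'Height': ['Short', 'Tall', 'Medium', 'Medium', 'Tall', 'Short'],
    'Sex':    ['Male', 'Male', 'Female', 'Female', 'Female', 'Male'],
})

def shannon_entropy(counts):
    """Entropy in bits from a vector of counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log2(p)))

# Joint entropy over whole rows (all six rows are distinct here -> log2(6) bits)
joint = shannon_entropy(df.value_counts())

# Per-column entropies, e.g. to inspect or sum separately
per_column = {col: shannon_entropy(df[col].value_counts()) for col in df.columns}

print(joint)       # ~2.585
print(per_column)  # Color ~1.459, Height ~1.585, Sex 1.0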

Related

Splitting a column into two in a dataframe

Its solution is definitely out there, but I couldn't find it, so I'm posting it here.
I have a dataframe like this:
  object_Id  object_detail
0 obj00      red mug
1 obj01      red bowl
2 obj02      green mug
3 obj03      white candle holder
I want to split the column object_detail into two columns, name and object_color, based on a list that contains the color names:
COLOR = ['red', 'green', 'blue', 'white']
print(df)
# want to perform some operation to get the output below
  object_Id  object_detail        object_color  name
0 obj00      red mug              red           mug
1 obj01      red bowl             red           bowl
2 obj02      green mug            green         mug
3 obj03      white candle holder  white         candle holder
This is my first time using a dataframe, so I am not sure how to achieve this with pandas. I could do it by converting the column to a list and applying a filter, but I think there are easier ways that I might be missing. Thanks
Use Series.str.extract with the list values joined by | as a regex alternation; the first group captures the color and the second group captures everything after the whitespace as the new name column:
pat = "|".join(COLOR)
df[['object_color','name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)', expand=True)
print (df)
  object_Id  object_detail        object_color  name
0 obj00      red mug              red           mug
1 obj01      red bowl             red           bowl
2 obj02      green mug            green         mug
3 obj03      white candle holder  white         candle holder
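For reference, a self-contained version of the above; the DataFrame construction is an assumption based on the question's sample data:
import pandas as pd

df = pd.DataFrame({
    'object_Id': ['obj00', 'obj01', 'obj02', 'obj03'],
    'object_detail': ['red mug', 'red bowl', 'green mug', 'white candle holder'],
})
COLOR = ['red', 'green', 'blue', 'white']

# Alternation of the known colors, then capture the color and the remaining name
pat = "|".join(COLOR)
df[['object_color', 'name']] = df['object_detail'].str.extract(rf'({pat})\s+(.*)')
print(df)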

Pandas replace dataframe values based on criteria

I have a master dataframe, df:
Colour  Item   Price
Blue    Car    40
Red     Car    30
Green   Truck  50
Green   Bike   30
I then have a price correction dataframe, df_pc:
Colour  Item  Price
Red     Car   60
Green   Bike  70
If there is a match on Colour and Item in the price correction dataframe, I want to replace the price in the master df, so the expected output is:
Colour  Item   Price
Blue    Car    40
Red     Car    60
Green   Truck  50
Green   Bike   70
I can't currently find a way of doing this.
Use Index.isin to filter out non-matching rows and then DataFrame.combine_first:
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
  Colour  Item   Price
0 Blue    Car    40.0
1 Green   Bike   70.0
2 Green   Truck  50.0
3 Red     Car    60.0
Another test, with different data:
print (df_pc)
  Colour  Item  Price
0 Red     Car   60
1 Orange  Bike  70   <- non-matching row
df = df.set_index(['Colour','Item'])
df_pc = df_pc.set_index(['Colour','Item'])
df_pc = df_pc[df_pc.index.isin(df.index)]
df = df_pc.combine_first(df).reset_index()
print (df)
  Colour  Item   Price
0 Blue    Car    40.0
1 Green   Bike   30.0
2 Green   Truck  50.0
3 Red     Car    60.0
Here is a way using combine_first():
df_pc.set_index(['Colour','Item']).combine_first(
    df.set_index(['Colour','Item'])).reset_index()
  Colour  Item   Price
0 Blue    Car    40.0
1 Green   Bike   70.0
2 Green   Truck  50.0
3 Red     Car    60.0
EDIT:
If you want only the matching items to be updated, we can also use merge with fillna:
print(df_pc)
  Colour  Item  Price
0 Red     Car   60
1 Orange  Bike  70   # changed row, not matching
(df.merge(df_pc, on=['Colour','Item'], how='left', suffixes=('_x',''))
   .assign(Price=lambda x: x['Price'].fillna(x['Price_x']))
   .reindex(df.columns, axis=1))
  Colour  Item   Price
0 Blue    Car    40.0
1 Red     Car    60.0
2 Green   Truck  50.0
3 Green   Bike   30.0
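Another option, not from the answers above: DataFrame.update overwrites values at matching index labels in place and simply ignores rows of df_pc that have no match in df, so no Index.isin pre-filter is needed. A sketch:
master = df.set_index(['Colour','Item'])           # keyed copy of the master table
master.update(df_pc.set_index(['Colour','Item']))  # in place: only intersecting (Colour, Item) rows change
df = master.reset_index()
print(df)
Unlike combine_first, update never adds rows, so a non-matching row such as Orange Bike is dropped automatically.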

Fill column with conditional mode of another column

Given the below list, I'd like to fill in the 'Color Guess' column with the mode of the 'Color' column conditional on 'Type' and 'Size' and ignoring NULL, #N/A, etc.
For example, what's the most common color for SMALL CATS, what's the most common color for MEDIUM DOGS, etc.
Type  Size    Color  Color Guess
Cat   small   brown
Dog   small   black
Dog   large   black
Cat   medium  white
Cat   medium  #N/A
Dog   large   brown
Cat   large   white
Cat   large   #N/A
Dog   large   brown
Dog   medium  #N/A
Cat   small   #N/A
Dog   small   white
Dog   small   black
Dog   small   brown
Dog   medium  white
Dog   medium  #N/A
Cat   large   brown
Dog   small   white
Dog   large   #N/A
As BarMar already stated in the comments, we can use pd.Series.mode here, as in the linked answer. The only trick is that we have to use groupby.transform, since we want the result back in the same shape as your dataframe:
df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(lambda x: pd.Series.mode(x)[0])
   Type  Size    Color  Color Guess
0  Cat   small   brown  brown
1  Dog   small   black  black
2  Dog   large   black  brown
3  Cat   medium  white  white
4  Cat   medium  NaN    white
5  Dog   large   brown  brown
6  Cat   large   white  brown
7  Cat   large   NaN    brown
8  Dog   large   brown  brown
9  Dog   medium  NaN    white
10 Cat   small   NaN    brown
11 Dog   small   white  black
12 Dog   small   black  black
13 Dog   small   brown  black
14 Dog   medium  white  white
15 Dog   medium  NaN    white
16 Cat   large   brown  brown
17 Dog   small   white  black
18 Dog   large   NaN    brown
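One caveat, an assumption about edge cases not present in this sample: pd.Series.mode drops NaN, so a group whose Color values are all missing yields an empty mode and the [0] lookup then raises an IndexError. A guarded variant could look like this:
import numpy as np

def safe_mode(s):
    m = s.mode()  # NaN values are dropped by default
    return m.iat[0] if not m.empty else np.nan

df['Color Guess'] = df.groupby(['Type', 'Size'])['Color'].transform(safe_mode)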

Creating a stacked chart using categorical data

I am going to design a stacked bar chart using the categorical data in this data frame:
gender  distress
female  high
male    low
female  high
male    high
male    medium
female  high
male    medium
male    medium
female  low
I know that I can filter the data based on gender, count the distress values, and then draw the stacked chart. Is there a faster way to do it?
First use crosstab, then plot:
pd.crosstab(df.gender,df.distress).plot(kind='bar',stacked=True)
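A self-contained sketch of the same idea; the DataFrame construction and the matplotlib import are assumptions added for illustration:
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'gender':   ['female', 'male', 'female', 'male', 'male',
                 'female', 'male', 'male', 'female'],
    'distress': ['high', 'low', 'high', 'high', 'medium',
                 'high', 'medium', 'medium', 'low'],
})

# Count distress levels per gender, then draw the counts as stacked bars
pd.crosstab(df.gender, df.distress).plot(kind='bar', stacked=True)
plt.tight_layout()
plt.show()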

Partial string matching on large data set

So I have been working on this for a while and just haven't gotten anywhere, and I'm not sure what to do. I'm fairly new to pandas and Python.
The data set is actually 15,000 product names, all in different formats: some have multiple dashes (up to 6), some hyphens, different lengths. The rows are product names with variants.
The code I'm using keeps returning only the first letter, as opposed to the partial string, when I use it on a large data set.
It works just fine on the small data set I was using to test it.
I'm assuming this is happening because I haven't created a stop condition for when a full partial string is matched; it's trying to match whole words as opposed to individual characters and stopping when it finds a difference.
What is the best way to overcome this on a large data set? What am I missing, or am I going to have to do this manually?
Original test data set
1.star t-shirt-large-red
2.star t-shirt-large-blue
3.star t-shirt-small-red
4.beautiful rainbow skirt small
5.long maxwell logan jeans- light blue -32L-28W
6.long maxwell logan jeans- Dark blue -32L-28W
Desired data set/output:
  COL1                             COL2          COL3    COL4
1 [star t-shirt]                   [large]       [red]   NONE
2 [star t-shirt]                   [large]       [blue]  NONE
3 [star t-shirt]                   [small]       [red]   NONE
4 [beautiful rainbow skirt small]  [small]       NONE    NONE
5 [long maxwell logan jeans]       [light blue]  [32L]   [28W]
6 [long maxwell logan jeans]       [Dark blue]   [32L]   [28W]
Here is the code I was helped with in a previous question I asked:
df['onkey'] = 1
df1 = pd.merge(df[['name','onkey']],df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x:[x.name_x,x.name_y],axis=1)
from os.path import commonprefix
df1['COL1'] = df1['list'].apply(lambda x:commonprefix(x))
df1['COL1_num'] = df1['COL1'].apply(lambda x:len(x))
df1 = df1[(df1['COL1_num']!=0)]
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
df = df.rename(columns ={'name':'name_x'})
df = pd.merge(df,df1[['name_x','COL1']],on='name_x',how ='left')
df['len'] = df['COL1'].apply(lambda x: len(x))
df['other'] = df.apply(lambda x: x.name_x[x.len:],axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1]=='-' else x)
df['other'] = df['other'].apply(lambda x:x.split('-'))
df = df[['COL1','other']]
df = pd.concat([df['COL1'],df['other'].apply(pd.Series)],axis=1)
  COL1                           0           1     2
0 star t-shirt                   large       red   NaN
1 star t-shirt                   large       blue  NaN
2 star t-shirt                   small       red   NaN
3 beautiful rainbow skirt small  NaN         NaN   NaN
4 long maxwell logan jeans       light blue  32L   28W
5 long maxwell logan jeans       Dark blue   32L   28W
Update:
This is your input list of products; some have variants and some don't.
When searching for duplicate strings to determine which products have variants and which don't, nothing comes up, because they are all seen as unique values due to the variants being appended at the end of the string.
So what I would like to do is group the partial or similar strings together (the longest match), extract the longest matching string within the group, and then put the differences into other columns.
If the product/string is unique, just print it into the column with the extracted longest string.
star t-shirt-large-red
star t-shirt-large-blue
star t-shirt-small-red
beautiful rainbow skirt small
long maxwell logan jeans- light blue -32L-28W
long maxwell logan jeans- Dark blue -32L-28W
Organic and natural candy - 3 Pack - Mint
Organic and natural candy - 3 Pack - Vanilla
Organic and natural candy - 3 Pack - Strawberry
Organic and natural candy - 3 Pack - Chocolate
Organic and natural candy - 3 Pack - Banana
Organic and natural candy - 3 Pack - Cola
Organic and natural candy - 12 Pack Assorted
Morgan T-shirt Company - Small/Medium-Blue
Morgan T-shirt Company - Medium/Large-Blue
Morgan T-shirt Company - Medium/Large-red
Morgan T-shirt Company - Small/Medium-Red
Morgan T-shirt Company - Small/Medium-Green
Morgan T-shirt Company - Medium/Large-Green
Nelly dress leopard small
col1 col2 col3 col4
star t-shirt large red
star t-shirt large blue
star t-shirt small red
beautiful rainbow skirt small
Long maxwell logan jeans light blue 32L 28W
Long maxwell logan jeans Dark blue 32L 28W
Organic and natural candy 3 Pack Mint
Organic and natural candy 3 Pack Vanilla
Organic and natural candy 3 Pack Strawberry
Organic and natural candy 3 Pack Chocolate
Organic and natural candy 3 Pack Banana
Organic and natural candy 3 Pack Cola
Organic and natural candy 12 Pack Assorted
Morgan T-shirt Company Small/Medium Blue
Morgan T-shirt Company Medium/Large Blue
Morgan T-shirt Company Medium/Large Red
Morgan T-shirt Company Small/Medium Red
Morgan T-shirt Company Small/Medium Green
Morgan T-shirt Company Medium/Large Green
Nelly dress Leopard Small
Bijoux
Princess PJ-set
Lemon tank top Yellow Medium
Constructing a DataFrame df as follows:
import pandas as pd

# note: DataFrame.append was removed in pandas 2.x; pd.concat or a list-based
# constructor can be used there instead
df = pd.DataFrame()
df = df.append(['1.star t-shirt-large-red'])
df = df.append(['2.star t-shirt-large-blue'])
df = df.append(['4.beautiful rainbow skirt small'])
df = df.append(['5.long maxwell logan jeans- light blue -32L-28W'])
df = df.append(['6.long maxwell logan jeans- Dark blue -32L-28W'])
df.columns = ['Product']
The following code
(a) strips any whitespace,
(b) splits by the period ('.') and grabs what follows,
(c) replaces 't-shirt' with 'tshirt' because of further operations (change this back if you want after the operation)
(d) splits again by '-' and expands to give your dataframe.
df['Product'].str.strip().str.split('.').str.get(1).str.replace('t-shirt', 'tshirt').str.split('-', expand = True)
Output:
  0                              1           2     3
0 star tshirt                    large       red   None
0 star tshirt                    large       blue  None
0 beautiful rainbow skirt small  None        None  None
0 long maxwell logan jeans       light blue  32L   28W
0 long maxwell logan jeans       Dark blue   32L   28W
Given the inconsistency in nomenclature for your products, there will be edge cases that are missed (e.g. beautiful rainbow skirt small). You may have to fish them out again.
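If you want the split pieces attached back onto the frame under the COL1..COL4 names from the question, one way is sketched below; the number of columns produced by the split depends on the data (four for this sample), so the hard-coded column list is an assumption:
df = df.reset_index(drop=True)  # the append calls above all produced index 0
parts = (df['Product'].str.strip()
           .str.split('.').str.get(1)
           .str.replace('t-shirt', 'tshirt')
           .str.split('-', expand=True)
           .apply(lambda col: col.str.strip()))  # trim stray spaces around each piece
parts.columns = ['COL1', 'COL2', 'COL3', 'COL4']  # assumes at most four pieces
df = pd.concat([df, parts], axis=1)
print(df)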
A solution which is quite simple to understand, debug and flexibly extend is the following:
Consider that your initial product names are held in a list called strings.
Then the solution is the following line:
mydf = pd.concat([pd.DataFrame([make_row(row, 4)], columns=['COL1', 'COL2', 'COL3', 'COL4']) for row in strings], ignore_index=True)
where we have defined the parsing function make_row to be:
import numpy as np

def make_row(string, num_cols):
    cols = [item.strip() for item in string[2:].split('-')]  # ignore numbering, split on hyphen, strip whitespace
    if len(cols) < num_cols:
        cols += [np.nan] * (num_cols - len(cols))  # pad missing values with NaN
    return cols
The first line defining cols could also be simply cols = string.split('-'), in which case you could do the formatting afterwards with:
mydf.applymap(lambda x: x if pd.isnull(x) else str.strip(x))
Now in your case, I see that there is a hyphen in some of your product names, in which case you might want to 'sanitize' them in advance (or inside make_row, as you wish), with something like:
strings = [item.replace('t-shirt', 'tshirt') for item in strings]
Example input:
strings = ['1.one-two-three', '2. one-two', '3.one-two-three-four', '4.one - two -three -four ']
Output:
  COL1  COL2  COL3   COL4
0 one   two   three  NaN
1 one   two   NaN    NaN
2 one   two   three  four
3 one   two   three  four
Output for question's data (after correcting typo for item 4):
  COL1                           COL2        COL3  COL4
0 star tshirt                    large       red   NaN
1 star tshirt                    large       blue  NaN
2 star tshirt                    small       red   NaN
3 beautiful rainbow skirt small  NaN         NaN   NaN
4 long maxwell logan jeans       light blue  32L   28W
5 long maxwell logan jeans       Dark blue   32L   28W
Edit:
If you additionally want to "group" the items together, then you can:
a) Use sort_values (pandas doc) on the column COL1 after you get a dataframe as described above, to simply display the rows corresponding to the same product one after the other, or
b) use groupby to actually get a grouped dataframe like this:
grouped_df = mydf.groupby("COL1")
This will allow you to get each group like this:
grouped_df.get_group("star tshirt")
Producing the following output:
  COL1         COL2   COL3  COL4
0 star tshirt  large  red   NaN
1 star tshirt  large  blue  NaN
2 star tshirt  small  red   NaN
