Extraction of a common column pandas - python

I have two data frames, I need to extract data in Column_3 of the second dataframe DF2.
Question 1: How should I create "Column_3" from "Column_1" and "Column_2" of the first dataframe?
DF1 =
Column_1 Column_2 Column_3
Red Apple small Red Apple small
Green fruit Large Green fruit Large
Yellow Banana Medium Yellow Banana Medium
Pink Mango Tiny Pink Mango Tiny
Question 2: I need to extract "n_col3" from n_col1 & n_col2 but that should be similar to the column_3 of data frame 1. (see the brackets for info of what to be extracted)
Note: If all the information of Column_3 is not available in Column_1 & Column_2 like in Row 1 & Row 3, Only that information that is available should be extracted)
DF2 =
n_col1 n_col2 n_col3
L854 fruit Charlie Green LTD Large fruit Large(Green missing Fruit Large extracted)
Red alpha 8 small Tango G250 Apple Red Apple small(all information extracted)
Mk43 Mango Beta Tiny J448 T Mango Tiny(Pink missing, Mango Tiny is extracted)
M40 Yellow Medium Romeo Banana Yellow Banana Medium(all information extracted)
I want to extract that column so that I can do further processing of merging. Can anyone help me with this. Thank you in advance.

Related

How To Span Columns Pandas with Multiple Column Cells

:)
I have the following DataFrame
Fruit Type
Fruit1
Fruit2
Fruit3
Fruit4
Berries
Raspberry
Blueberry
Passionfruit
Kiwi
Raspberry
Blueberry
Passionfruit
Kiwi
Citrus
Grapefruit
Mandarins
Lemon
Lime
Grapefruit
Mandarins
Lemon
Lime
Melons
Rockmelon
Honeydew
Muskmelon
Zucchini
Rockmelon
Honeydew
Muskmelon
Zucchini
I'm trying to get the Fruit Types to span across all the other columns like titles across their fruit families. So just 1 cell spanning multiple columns, like when you drag multiple cells horizontally in excel and click merge cells.
I don't know how to show the output in Stack but I want it to look like 'Berries' is one cell which spans columns Fruit 1, Fruit2, Fruit3 and Fruit 4 and the same for Citrus and Melons and then I can remove the Fruit Type Column entirely.

python Pandas: VLOOKUP multiple cells on column

I'm struggling with next task: I would like to identify using pandas (or any other tool on python) if any of multiple cells (Fruit 1 through Fruit 3) in each row from Table 2 contains in column Fruits of Table1. And at the end obtain "Contains Fruits Table 2?" table.
Fruits
apple
orange
grape
melon
Name
Fruit 1
Fruit 2
Fruit 3
Contains Fruits Table 2?
Mike
apple
Yes
Bob
peach
pear
orange
Yes
Jack
banana
No
Rob
peach
banana
No
Rita
apple
orange
banana
Yes
Fruits in Table 2 can be up to 40 columns. Number of rows in Table1 is about 300.
I hope it is understandable, and someone can help me resolve this.
I really appreciate the support in advance!
Try:
filter DataFrame to include columns that contain the word "Fruit"
Use isin to check if the values are in table1["Fruits"]
Return True if any of fruits are found
map True/False to "Yes"/"No"
table2["Contains Fruits Table 2"] = table2.filter(like="Fruit")
.isin(table1["Fruits"].tolist())
.any(axis=1)
.map({True: "Yes", False: "No"})
>>> table2
Name Fruit 1 Fruit 2 Fruit 3 Contains Fruits Table 2
0 Mike apple None None Yes
1 Bob peach pear orange Yes
2 Jack banana None None No
3 Rob peach banana None No
4 Rita apple orange banana Yes
​~~~

Using pandas, how can I sort a table on all values that contains a string element from a list of string elements?

I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id
value
1
The grapefruit is delicious! But the pear tastes awful.
2
I am a big fan og apple products
3
The quick brown fox jumps over the lazy dog
4
An apple a day keeps the doctor away
Using pandas I would like to create a filter which will give me only the id and values for those rows, which contain one or more of the values together with a column, showing which values are contained in the string, like this:
id
value
value contains substrings:
1
The grapefruit is delicious! But the pear tastes awful.
grapefruit, pear
2
I am a big fan og apple products
apple
4
An apple a day keeps the doctor away
apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple

Counting rows that have same values in spcific columns in csv

I have a csv that i want to count how many rows that match specific columns, what would be the best way to do this? So for example if this was the csv:
fruit days characteristic1 characteristic2
0 apple 1 red sweet
1 orange 2 round sweet
2 pineapple 5 prickly sweet
3 apple 4 yellow sweet
the output i would want would be
1 apple: red,sweet
A csv is a file with values that are separated by commas. I would recommend turning this into a .txt file and using this same format. Then establish consistent spacing throughout your file (using tab for example). So that when you loop through each line it knows where the actual information is. Then when you what info is in what column you print those specific values.
# Use a tab in between each column
fruit days charac1 charac2
0 apple1 1 red sweet
1 orange 2 round sweet
2 pineapple 5 prickly sweet
3 apple 4 yellow sweet
This is just to get you started.

Partial string matching on large data set

So I have been working on this for a while and just haven't got any where and just not sure what to do.Fairly new to pandas and python.
Data set is actually 15,000 product names. All in different formats, some have multiple dashes up to 6, some hyphens, different lengths,The rows are product names with variants.
The code i'm using keeps returning only the first letter as oppose to the partial string when I use it on a large data set.
Works just fine on a small data set which I was using to test it.
I'm assuming this is happening because:
I haven't created a stop section when it matches a full partial string
because its trying to match up words as oppose to individual characters and stopping when it finds a difference.
What is the best way to overcome this on a large data set, what am I missing? or am I going to have to do this manual?
Original test data set
`1.star t-shirt-large-red
2.star t-shirt-large-blue
3.star t-shirt-small-red
4.beautiful rainbow skirt small
5.long maxwell logan jeans- light blue -32L-28W
6.long maxwell logan jeans- Dark blue -32L-28W`
Desired data set/output:
`COL1 COL2 COL3 COL4
1[star t-shirt] [large] [red] NONE
2[star t-shirt] [large] [blue] NONE
3[star t-shirt] [small] [red] NONE
4[beautiful rainbow skirt small] [small] NONE NONE
5[long maxwell logan jeans] [light blue] [32L] [28W]
6[long maxwell logan jeans] [Dark blue] [32L] [28W]`
Here is the code I was helped with in a previous question I asked:
`df['onkey'] = 1
df1 = pd.merge(df[['name','onkey']],df[['name','onkey']], on='onkey')
df1['list'] = df1.apply(lambda x:[x.name_x,x.name_y],axis=1)
from os.path import commonprefix
df1['COL1'] = df1['list'].apply(lambda x:commonprefix(x))
df1['COL1_num'] = df1['COL1'].apply(lambda x:len(x))
df1 = df1[(df1['COL1_num']!=0)]
df1 = df1.loc[df1.groupby('name_x')['COL1_num'].idxmin()]
df = df.rename(columns ={'name':'name_x'})
df = pd.merge(df,df1[['name_x','COL1']],on='name_x',how ='left')`
`df['len'] = df['COL1'].apply(lambda x: len(x))
df['other'] = df.apply(lambda x: x.name_x[x.len:],axis=1)
df['COL1'] = df['COL1'].apply(lambda x: x.strip())
df['COL1'] = df['COL1'].apply(lambda x: x[:-1] if x[-1]=='-' else x)
df['other'] = df['other'].apply(lambda x:x.split('-'))
df = df[['COL1','other']]
df = pd.concat([df['COL1'],df['other'].apply(pd.Series)],axis=1)`
` COL1 0 1 2
0 star t-shirt large red NaN
1 star t-shirt large blue NaN
2 star t-shirt small red NaN
3 beautiful rainbow skirt small NaN NaN
4 long maxwell logan jeans light blue 32L 28W
5 long maxwell logan jeans Dark blue 32L 28W`
***************update*****************
This is your input list of product,some have variants and some don't.
When searching for duplicates strings to determine what are the products with variants and products without variants;nothing comes up because they are all seen as unique values due to the variants being added on at the end of the string.
So what I would like to do is group the partial or similar strings together(the longest match), extract the longest matching string within the group and then put the differences into other columns.
If the product /string is unique just print into the column with the extracted longest string.
star t-shirt-large-red
star t-shirt-large-blue
star t-shirt-small-red
beautiful rainbow skirt small
long maxwell logan jeans- light blue -32L-28W
long maxwell logan jeans- Dark blue -32L-28W
Organic and natural candy - 3 Pack - Mint
Organic and natural candy - 3 Pack - Vanilla
Organic and natural candy - 3 Pack - Strawberry
Organic and natural candy - 3 Pack - Chocolate
Organic and natural candy - 3 Pack - Banana
Organic and natural candy - 3 Pack - Cola
Organic and natural candy - 12 Pack Assorted
Morgan T-shirt Company - Small/Medium-Blue
Morgan T-shirt Company - Medium/Large-Blue
Morgan T-shirt Company - Medium/Large-red
Morgan T-shirt Company - Small/Medium-Red
Morgan T-shirt Company - Small/Medium-Green
Morgan T-shirt Company - Medium/Large-Green
Nelly dress leopard small
col1 col2 col3 col4
star t-shirt large red
star t-shirt large blue
star t-shirt small red
beautiful rainbow skirt small
Long maxwell logan jeans light blue 32L 28W
Long maxwell logan jeans Dark blue 32L 28W
Organic and natural candy 3 Pack Mint
Organic and natural candy 3 Pack Vanilla
Organic and natural candy 3 Pack Strawberry
Organic and natural candy 3 Pack Chocolate
Organic and natural candy 3 Pack Banana
Organic and natural candy 3 Pack Cola
Organic and natural candy 12 Pack Assorted
Morgan T-shirt Company Small/Medium Blue
Morgan T-shirt Company Medium/Large Blue
Morgan T-shirt Company Medium/Large Red
Morgan T-shirt Company Small/Medium Red
Morgan T-shirt Company Small/Medium Green
Morgan T-shirt Company Medium/Large Green
Nelly dress Leopard Small
Bijoux
Princess PJ-set
Lemon tank top Yellow Medium
Constructing a DataFrame df as follows:
df = pd.DataFrame()
df = df.append(['1.star t-shirt-large-red'])
df = df.append(['2.star t-shirt-large-blue'])
df = df.append(['4.beautiful rainbow skirt small'])
df = df.append(['5.long maxwell logan jeans- light blue -32L-28W'])
df = df.append(['6.long maxwell logan jeans- Dark blue -32L-28W'])
df.columns = ['Product']
The following code
(a) strips any whitespace,
(b) splits by the period ('.') and grabs what follows,
(c) replaces 't-shirt' with 'tshirt' because of further operations (change this back if you want after the operation)
(d) splits again by '-' and expands to give your dataframe.
df['Product'].str.strip().str.split('.').str.get(1).str.replace('t-shirt', 'tshirt').str.split('-', expand = True)
Output:
0 1 2 3
0 star tshirt large red None
0 star tshirt large blue None
0 beautiful rainbow skirt small None None None
0 long maxwell logan jeans light blue 32L 28W
0 long maxwell logan jeans Dark blue 32L 28W
Given the inconsistency in nomenclature for your product, there will be edge-cases that are missed (ex : beautiful rainbow skirt small). You may have to fish them out again.
A solution which is quite simple to understand, debug and flexibly extend is the following:
Consider that your initial product names are held in a list called strings.
Then the solution is the following line:
mydf = pd.concat([pd.DataFrame([make_row(row, 4)], columns=['COL1', 'COL2', 'COL3', 'COL4']) for row in strings], ignore_index=True)
where we have defined the parsing function make_row to be:
def make_row(string, num_cols):
cols = [item.strip() for item in string[2:].split('-')] # ignore numbering, split on hyphen and strip whitespace
if len(cols) < num_cols:
cols += [np.nan]*(num_cols - len(cols)) # fill with NaN missing values
return cols
The first line defining cols could also be simply cols = string.split('-'), in which case you could do the formatting afterwards with:
mydf.applymap(lambda x: x if pd.isnull(x) else str.strip(x))
Now in your case, I see that there is a hyphen in some of your product names, in which case you might want to 'sanitize' them in advance (or inside make_row, as you wish), with something like:
strings = [item.replace('t-shirt', 'tshirt') for item in strings]
Example input:
strings = ['1.one-two-three', '2. one-two', '3.one-two-three-four', '4.one - two -three -four ']
Output:
COL1 COL2 COL3 COL4
0 one two three NaN
1 one two NaN NaN
2 one two three four
3 one two three four
Output for question's data (after correcting typo for item 4):
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN
3 beautiful rainbow skirt small NaN NaN
4 long maxwell logan jeans light blue 32L 28W
5 long maxwell logan jeans Dark blue 32L 28W
Edit:
If you additionally want to "group" the items together, then you can:
a) Use sort_values (pandas doc) on the column COL1 after you get a dataframe as described above to simply display the rows corresponding to the same product the one after the other, or
b) use group_by to actually get a grouped dataframe like this:
grouped_df = mydf.groupby("COL1")
This will allow you to get each group like this:
grouped_df.get_group("star tshirt")
Producing following output:
COL1 COL2 COL3 COL4
0 star tshirt large red NaN
1 star tshirt large blue NaN
2 star tshirt small red NaN

Categories

Resources