Add extra column to dataframe with specific words from existing column - python

I am new to Python and am struggling with Pandas. More specifically, I have a column (Sensory scores) in a dataframe that consists of multiple words, like this:
Treatment  Sensory scores
A          soft, short
B          soft, tender
C          short, tender
Now I want to add extra columns 'soft', 'short' and 'tender' to the dataframe, whereby the individual scores are extracted and quantified like this:
Treatment  Sensory scores  soft  short  tender
A          soft, short     1     1      0
B          soft, tender    1     0      1
C          short, tender   0     1      1
What is the best way to program this in Pandas? Any help or suggestions are appreciated. Many thanks in advance.
Coen

You need to split the values first, then you can use pivot_table to sum a dummy column (count):
# one row per individual score
df = df.set_index("Treatment")
df = pd.DataFrame(df["Sensory scores"].str.split(', ').explode())
# dummy column to aggregate
df["count"] = 1
# one column per score; 0 where the score is absent
df = df.pivot_table(index=df.index, columns="Sensory scores", values="count", aggfunc="sum", fill_value=0)
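As a side note, Series.str.get_dummies can do the split-and-indicator step in one call; a minimal sketch, assuming the columns are literally named "Treatment" and "Sensory scores":
import pandas as pd

df = pd.DataFrame({
    "Treatment": ["A", "B", "C"],
    "Sensory scores": ["soft, short", "soft, tender", "short, tender"],
})

# one 0/1 indicator column per unique score
result = df.join(df["Sensory scores"].str.get_dummies(sep=", "))
print(result)
#   Treatment Sensory scores  short  soft  tender
# 0         A    soft, short      1     1       0
# 1         B   soft, tender      0     1       1
# 2         C  short, tender      1     0       1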

Related

filtering pandas dataframe when data contains two parts

I have a pandas dataframe and want to filter down to all the rows that contain a certain criteria in the “Title” column.
The rows I want to filter down to are all rows that contain the format “(Axx)” (where xx are 2 numbers).
The data in the “Title” column doesn’t just consist of “(Axx)” data.
The data in the “Title” column looks like so:
“some_string (Axx)”
I've been playing around with different methods but can't seem to get it.
I think the closest I've gotten is:
df.filter(regex=r'(D\d{2})', axis=0)
but it's not correct, as the entries aren't being filtered (df.filter matches against the index labels, not the column values).
Use Series.str.contains with escaped parentheses \( \) and $ for the end of the string, then filter with boolean indexing:
df = pd.DataFrame({'Title':['(D89)','aaa (D71)','(D5)','(D78) aa','D72']})
print (df)
       Title
0      (D89)
1  aaa (D71)
2       (D5)
3   (D78) aa
4        D72

df1 = df[df['Title'].str.contains(r'\(D\d{2}\)$')]
print (df1)
       Title
0      (D89)
1  aaa (D71)
If you need to match only the exact (Dxx) format, use Series.str.match (it anchors at the start of the string):
df2 = df[df['Title'].str.match(r'\(D\d{2}\)$')]
print (df2)
   Title
0  (D89)
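In pandas 1.1+ there is also Series.str.fullmatch, which anchors at both ends, so the trailing $ is not needed:
df3 = df[df['Title'].str.fullmatch(r'\(D\d{2}\)')]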

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe into two (or even more) new columns. I think this is down to the fact that the dataframe I'm working with comes from a really big CSV file, almost 10 GB. Once it is loaded into a Pandas dataframe, it is ~60 million rows and 5 columns.
Example
Initially, the dataframe looks something like this:
In [1]: df
Out[1]:
               category  other_col
0            animal.cat          5
1            animal.dog          3
2  clothes.shirt.sports          6
3           shoes.laces          1
4                  None          0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column in three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
   other_col main_cat sub_category_1 sub_category_2
0          5   animal            cat           None
1          3   animal            dog           None
2          6  clothes          shirt         sports
3          1    shoes          laces           None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it uses up too much memory when applied to a dataframe that big, and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how I should proceed after that; I've obviously tried searching for this kind of problem here on Stack Overflow, but I didn't manage to find anything useful.
Any help is appreciated!
split and join methods do the job:
results = df['category'].str.split('.', expand=True)
df_after = df.join(results)
after doing that you can freely filter the resulting dataframe
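Since the question also mentions reading in chunks, here is a minimal sketch of that route, assuming the data lives in a CSV (the file names and chunk size here are hypothetical); each chunk is processed independently and appended to an output file, so the full 10 GB never sits in memory at once:
import pandas as pd

new_cols = ['main_cat', 'sub_category_1', 'sub_category_2']

# hypothetical file names; tune chunksize to the available memory
reader = pd.read_csv('data.csv', chunksize=1_000_000)
for i, chunk in enumerate(reader):
    chunk = chunk[chunk['category'].notnull()]
    split = chunk['category'].str.split('.', expand=True)
    split = split.reindex(columns=range(3))  # pad rows with fewer dots to 3 columns
    split.columns = new_cols
    out = chunk.drop(columns='category').join(split)
    out.to_csv('split_output.csv', mode='w' if i == 0 else 'a',
               header=(i == 0), index=False)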

Adding the quantities of products in a dataframe column in Python

I'm trying to calculate the sum of weights in a column of an Excel sheet that contains the product title, with the help of NumPy/Pandas. I've already managed to load the sheet into a dataframe and isolate the rows that contain the particular product that I'm looking for:
dframe = xlsfile.parse('Sheet1')
dfFent = dframe[dframe['Product:'].str.contains("ABC") == True]
But I can't seem to find a way to sum up its weights, due to the obvious complexity of the problem (as shown below). For example, if the column 'Product Title' contains values like:
1 gm ABC
98% pure 12 grams ABC
0.25 kg ABC Powder
ABC 5gr
where ABC is the product whose weight I'm looking to add up. Is there any way that I can add these weights all up to get a total of 268 gm? Any help or resources pointing to the solution would be highly appreciated. Thanks! :)
You can use extractall for values with units or percentages:
(?P<a>\d+\.\d+|\d+) extracts a float or int to column a
\s* matches zero or more spaces between the number and the unit
(?P<b>[a-z%]+) extracts the lowercase unit or percent sign after the number to column b
# add all possible units to dictionary (conversion factors to grams, or a multiplier for %)
d = {'gm':1,'gr':1,'grams':1,'kg':1000,'%':.01}
df1 = df['Product:'].str.extractall(r'(?P<a>\d+\.\d+|\d+)\s*(?P<b>[a-z%]+)')
print (df1)
            a      b
  match
0 0         1     gm
1 0        98      %
  1        12  grams
2 0      0.25     kg
3 0         5     gr
Then convert the first column to numeric and map the second by the dictionary of units. Then reshape by unstack, multiply the columns row-wise with prod, and finally sum:
a = df1['a'].astype(float).mul(df1['b'].map(d)).unstack().prod(axis=1).sum()
print (a)
267.76
Similar solution:
a = df1['a'].astype(float).mul(df1['b'].map(d)).prod(level=0).sum()
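Note that the level argument of Series.prod is deprecated in recent pandas versions; the equivalent with an explicit groupby is:
a = df1['a'].astype(float).mul(df1['b'].map(d)).groupby(level=0).prod().sum()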
You need to do some data wrangling to get the column consistent in the same format. You may do some matching to get the Product column aligned and consistent, similar to date-time formatting.
For example, you may do the following things (a compact sketch follows the list):
Make a separate column with only the values (float)
Change % values to decimals and multiply by the quantity
Convert values given in kg to grams
With the strings gone, sum the float-only column to get the total
Pandas can work well with this problem.
Note: There is no shortcut to this problem; you need to get rid of the strings mixed with the decimal values before calculating the sum.
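A compact sketch of those steps, using the sample values from the question (it boils down to the same extract-and-convert idea as the answer above):
import pandas as pd

s = pd.Series(['1 gm ABC', '98% pure 12 grams ABC', '0.25 kg ABC Powder', 'ABC 5gr'])
parts = s.str.extractall(r'(?P<value>\d+\.?\d*)\s*(?P<unit>[a-z%]+)')
# convert each number to grams (% becomes a multiplier instead)
factor = parts['unit'].map({'gm': 1, 'gr': 1, 'grams': 1, 'kg': 1000, '%': 0.01})
grams = (parts['value'].astype(float) * factor).groupby(level=0).prod()
print(grams.sum())  # total in grams, ~267.76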

Create new columns for a dataframe by parsing column values and populate new columns with values from another column python

I need to add new columns to a dataframe based on lists within a certain column. The new columns need to be the set derived from all the lists in the column.
I then have another column with lists corresponding to the first, but the data is slightly different. I need these values to populate the new columns, provided the values are not in a "do not include" list.
Here is an example:
                         Disease                                    Status
0                     Asthma|ARD                                Ph II|Ph I
1  Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported
This should become:
                         Disease                                    Status Asthma   ARD Arthritis Inflammation
0                     Asthma|ARD                                Ph II|Ph I  Ph II  Ph I
1  Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported                 Ph III     Approved
Here the "do not include" list would just be ['No development reported'], however there are more terms I would like to add to it.
The dataframe I am working with has many columns. I am interested in developing a function to which I can simply pass the df, the column names, and a "do not include" list, and that will perform this task in an efficient way (ideally without any, or very few, loops).
My current approach has been to create a set from the Disease column, add it to the dataframe through pd.concat, and then loop through each row, split the values in the two columns, and loop through the "Disease" list to put the correct status in each disease column.
The problem with this is that my dataframe is ~12k rows, and this becomes exceptionally time-intensive.
It seems that you have multiple values in each individual cell (from your previous and current questions). It would be far far easier to tidy up your data first and then continue with your analysis. Try to put each value in each column in its own cell.
df1 = pd.concat([df[col].str.split('|', expand=True).stack().reset_index(1, drop=True) for col in df.columns], axis=1)
Output of df1
              0                        1
0        Asthma                    Ph II
0           ARD                     Ph I
1     Arthritis                   Ph III
1  Inflammation                 Approved
1        Asthma  No development reported
And then you can pivot this from here and select only the columns you care about
cols = ['Asthma', 'ARD']
df2 = df1.reset_index().pivot(index='index',columns=0, values=1)[cols]
Output of df2
0                       Asthma   ARD
index
0                        Ph II  Ph I
1      No development reported  None
Then just concatenate this DataFrame to your original
pd.concat((df, df2), axis=1)

                             Disease                                    Status  \
index
0                         Asthma|ARD                                Ph II|Ph I
1      Arthritis|Inflammation|Asthma  Ph III|Approved|No development reported

                        Asthma   ARD
index
0                        Ph II  Ph I
1      No development reported  None
make the exclusion list a set
str.extractall was a style choice; str.split will be faster
query to get rid of things not to include
join
dont_include = {'No development reported'}  # the full status string from the sample data
d1 = df.stack().str.extractall('([^|]+)')[0].unstack(1) \
       .reset_index(1, drop=True).query('Status not in @dont_include') \
       .set_index('Disease', append=True).Status.unstack().fillna('')
df.join(d1)
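Since str.split is noted as faster, here is a minimal sketch of the same idea with split and explode, assuming pandas >= 1.3 for multi-column explode (column names are taken from the example above):
import pandas as pd

df = pd.DataFrame({
    'Disease': ['Asthma|ARD', 'Arthritis|Inflammation|Asthma'],
    'Status': ['Ph II|Ph I', 'Ph III|Approved|No development reported'],
})
dont_include = {'No development reported'}

# split both columns into lists, then explode them together (pandas >= 1.3)
long = df.apply(lambda col: col.str.split('|')).explode(['Disease', 'Status'])
long = long[~long['Status'].isin(dont_include)]

# pivot: one column per disease, status as the value
wide = long.set_index('Disease', append=True)['Status'].unstack(fill_value='')
print(df.join(wide))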

How to select values in between strings and place in column of dataframe using regex in python

I have a large dataframe containing a column titled "Comment"
Within the comment section I need to pull out 3 values and place them into separate columns, i.e. duty cycle, gas, and pressure.
"Data collection START for Duty Cycle: 0, Gas: Vacuum Pressure: 0.000028 Torr"
Currently I am using .split and .tolist to parse the string:
#split string and sort into columns
df1 = pd.DataFrame(eventsDf.comment.str.split().tolist(),columns="0 0 0 0 0 0 dutyCycle 0 Gas 0 Pressure 0 ".split())
#join dataFrames
eventsDf = pd.concat([eventsDf, df1], axis=1)
#drop columns not needed
eventsDf.drop(['comment','0',],axis=1,inplace=True)
I found this method rather "hacky", in that if the structure of the comment section changes, my code would be useless... can anyone show me a more efficient/robust way to go about doing this? Thank you so much!
Use str.extract with a regex:
regex = r'Duty Cycle: (?P<Duty_Cycle>\d+), Gas: (?P<Gas>\w+) Pressure: (?P<Pressure>\S+) Torr'
df1 = eventsDf.comment.str.extract(regex, expand=True)
df1
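A quick end-to-end check on the sample comment from the question (the final join/drop mirrors the original approach):
import pandas as pd

eventsDf = pd.DataFrame({'comment': [
    'Data collection START for Duty Cycle: 0, Gas: Vacuum Pressure: 0.000028 Torr',
]})

regex = r'Duty Cycle: (?P<Duty_Cycle>\d+), Gas: (?P<Gas>\w+) Pressure: (?P<Pressure>\S+) Torr'
df1 = eventsDf.comment.str.extract(regex, expand=True)
eventsDf = eventsDf.join(df1).drop(columns='comment')
print(eventsDf)
#   Duty_Cycle     Gas  Pressure
# 0          0  Vacuum  0.000028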
