How to extract a specific "word" from a list - python

I have a dataframe.
dict_df = {'code': {0: 'a02',
1: 'a03',
2: 'a04',
3: 'a05',
4: 'a06',
5: 'a07',
6: 'a08',
7: 'a09',
8: 'a10'},
'name': {0: 'Dr Mike',
1: ' Dr. Benjamin',
2: 'Doctor Dre',
3: 'ApotekOne',
4: 'Aptek Two',
5: 'Apotek 3',
6: 'DrVladrimir',
7: ' dR Sarah inc.',
8: 'DR.John'}}
df = pd.DataFrame(dict_df)
I'm trying to extract different strings into another column. I'll take "dr" as an example, but the same applies to all of them.
For "dr" I need it in any form or shape (dr, DR, Dr, dR), plus:
before "dr" there can be a blank or any other char except a letter or a number (ex. Dr)
after "dr" there can be a blank, a point, or any other char except a letter or a number (ex. DR.John)
if there is no special char after "dr" (ex. blank, point, etc.) and the next char is an uppercase letter, it is a match ("Dre" is not a match but "DrVlad" is)
What I have so far, which doesn't cover all the conditions above:
df['inclusions']= df['name'].str.findall(r'(?i)dr|doctor|apotek|aptek|two').str.join(", ").str.lower()
Also, if the "inclusions" column ends up with "dr" twice, how can I keep only one (no duplicates)?
Thank you.

IIUC, you can use an anchor to the start (^), a scoped case-insensitive group ((?i:…)), and a negative lookahead ((?![a-z])) that stays case-sensitive, so an uppercase letter right after the keyword still counts as a match:
df['inclusions']= (df['name']
.str.findall(r'^\s*(?i:dr|doctor|apotek|aptek|two)(?![a-z])')
.str.join(", ")
.str.lower()
)
output:
code name inclusions
0 a02 Dr Mike dr
1 a03 Dr. Benjamin dr
2 a04 Doctor Dre doctor
3 a05 ApotekOne apotek
4 a06 Aptek Two aptek
5 a07 Apotek 3 apotek
6 a08 DrVladrimir dr
7 a09 dR Sarah inc. dr
8 a10 DR.John dr
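On the follow-up about duplicates: a minimal sketch (using the OP's original unanchored pattern, not part of the answer above) that dedupes each row's matches before joining. dict.fromkeys drops repeats while keeping order:
df['inclusions'] = (df['name']
    .str.findall(r'(?i)dr|doctor|apotek|aptek|two')
    .apply(lambda matches: ", ".join(dict.fromkeys(m.lower() for m in matches)))
)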

Performing an Operation On Grouped Pandas Data

I have a pandas DataFrame with the following information:
year state candidate percvotes electoral_votes perc_evotes vote_frac vote_int
1976 ALABAMA CARTER, JIMMY 55.727269 9 5.015454 0.015454 5
1976 ALABAMA FORD, GERALD 42.614871 9 3.835338 0.835338 3
1976 ALABAMA MADDOX, LESTER 0.777613 9 0.069985 0.069985 0
1976 ALABAMA BUBAR, BENJAMIN 0.563808 9 0.050743 0.050743 0
1976 ALABAMA HALL, GUS 0.165194 9 0.014867 0.014867 0
where percvotes is the percentage of the total votes cast that the candidate received (calculated earlier), electoral_votes is the number of electoral college votes for that state, perc_evotes is the calculated percentage of the electoral votes, and vote_frac and vote_int are the fractional and whole-number parts of the electoral votes earned, respectively. This data repeats for each election year and then by state within each year; each candidate has one such row per state.
What I want to do is allocate the leftover electoral votes to the candidate with the highest fraction. This number differs by state and year. In this case there would be 1 leftover electoral vote (9 total votes and 5+3=8 already allocated), and it will go to 'FORD, GERALD' since he has 0.835338 in the vote_frac column. Sometimes there are 2 or 3 left unallocated.
I have a solution that adds the data to a dictionary, but it uses for loops. I know there must be a better way to do this in a more "pandas" way. I have touched on groupby in this loop, but I feel like I am not utilizing pandas to its full potential.
My for loop:
results = {}
grouped = electdf.groupby(["year", "state"])
for key, group in grouped:
    year, state = key
    group['vote_remaining'] = group['electoral_votes'] - group['vote_int'].sum()
    remaining = group['vote_remaining'].iloc[0]
    top_fracs = group['vote_frac'].nlargest(remaining)
    group['total'] = (group['vote_frac'].isin(top_fracs)).astype(int) + group['vote_int']
    if year not in results:
        results[year] = {}
    for candidate, evotes in zip(group['candidate'], group['total']):
        if candidate not in results[year] and evotes:
            results[year][candidate] = 0
        if evotes:
            results[year][candidate] += evotes
Thanks in advance!
Perhaps an apply function that finds the available electoral votes and the votes already cast, and conditionally adds the difference (available minus cast) to the 'vote_int' column of the row with the max 'vote_frac':
import pandas as pd
df = pd.DataFrame({'year': {0: 1976, 1: 1976, 2: 1976, 3: 1976, 4: 1976},
'state': {0: 'ALABAMA', 1: 'ALABAMA', 2: 'ALABAMA',
3: 'ALABAMA', 4: 'ALABAMA'},
'candidate': {0: 'CARTER, JIMMY', 1: 'FORD, GERALD',
2: 'MADDOX, LESTER', 3: 'BUBAR, BENJAMIN',
4: 'HALL, GUS'},
'percvotes': {0: 55.727269, 1: 42.614871, 2: 0.777613, 3: 0.563808,
4: 0.165194},
'electoral_votes': {0: 9, 1: 9, 2: 9, 3: 9, 4: 9},
'perc_evotes': {0: 5.015454, 1: 3.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_frac': {0: 0.015454, 1: 0.835338, 2: 0.069985,
3: 0.050743, 4: 0.014867},
'vote_int': {0: 5, 1: 3, 2: 0, 3: 0, 4: 0}})
def apply_extra_e_votes(grp):
    # Get the available electoral votes
    # (assumes the first row in the group contains the
    # correct number of electoral votes for the group)
    available_e_votes = grp['electoral_votes'].iloc[0]
    # Get the sum of the vote_int column
    current_e_votes = grp['vote_int'].sum()
    # If there are more available votes than votes cast
    if available_e_votes > current_e_votes:
        # Update the 'vote_int' column at the max value of 'vote_frac'
        grp.loc[
            grp['vote_frac'].idxmax(),
            'vote_int'
        ] += available_e_votes - current_e_votes  # (remaining votes)
    return grp
# Groupby and Apply Function
new_df = df.groupby(['year', 'state']).apply(apply_extra_e_votes)
# For Display
print(new_df.to_string(index=False))
Output:
 year    state        candidate  percvotes  electoral_votes  perc_evotes  vote_frac  vote_int
 1976  ALABAMA    CARTER, JIMMY  55.727269                9     5.015454   0.015454         5
 1976  ALABAMA     FORD, GERALD  42.614871                9     3.835338   0.835338         4
 1976  ALABAMA   MADDOX, LESTER   0.777613                9     0.069985   0.069985         0
 1976  ALABAMA  BUBAR, BENJAMIN   0.563808                9     0.050743   0.050743         0
 1976  ALABAMA        HALL, GUS   0.165194                9     0.014867   0.014867         0
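Note that this hands all leftover votes to a single row. If each leftover vote should instead go to a different candidate, mirroring the nlargest logic in the question's loop, a sketch could look like this (the helper name is made up):
def allocate_leftovers(grp):
    # give one leftover vote each to the rows with the largest fractions
    leftover = int(grp['electoral_votes'].iloc[0] - grp['vote_int'].sum())
    if leftover > 0:
        top_idx = grp['vote_frac'].nlargest(leftover).index
        grp.loc[top_idx, 'vote_int'] += 1
    return grp

new_df = df.groupby(['year', 'state'], group_keys=False).apply(allocate_leftovers)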

Error on value column in group by value counts

I have code below as:
df[('name')]['cash_amount'].valuecounts(normalize=True).sum()
I want to use value_counts with normalize=True because I want to calculate each name's cash as a percentage of the total cash in the column.
I am trying to calculate the total each name has in the cash_amount column,
but I get an error that says KeyError: 'cash_amount'.
df looks like
input:
name | cash_amount
bob $400
chris $500
amy $100
amy $100
bob $100
bob $100
output:
name | %
bob .46
chris .38
amy .15
I looked for any white space in the column names and tried df.columns = df.columns.str.strip(), but I still get the same error.
First remove the $ from your string and convert to float or int. $ is a special regex character, so you need to escape it with \ (and, in newer pandas, pass regex=True to str.replace). Then .groupby and get the percentage of the total by taking the sum for each group and dividing it by the total sum:
import pandas as pd
df = pd.DataFrame({'name': {0: 'bob', 1: 'chris', 2: 'amy', 3: 'amy', 4: 'bob', 5: 'bob'},
'cash_amount': {0: '$400', 1: '$500', 2: '$100', 3: '$100', 4: '$100', 5: '$100'}})
df['cash_amount'] = df['cash_amount'].str.replace(r'\$', '', regex=True).astype(float)
df = ((df.groupby('name')['cash_amount'].sum() / df['cash_amount'].sum())
.rename('%').reset_index())
df
Out[1]:
name %
0 amy 0.153846
1 bob 0.461538
2 chris 0.384615
Alternatively, you can use df.replace with a regex, then groupby() and divide each group's sum (via apply with a lambda) by the total sum:
df['cash_amount']=df.replace(regex=r'\$', value='')['cash_amount'].astype(int)
(df.groupby('name').cash_amount.apply(lambda x: x.sum())/df.cash_amount.sum()).rename('%').reset_index()
name %
0 amy 0.153846
1 bob 0.461538
2 chris 0.384615
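As a side note, even with the KeyError fixed, value_counts(normalize=True) would give each name's share of rows, not its share of cash, which is why groupby is the right tool here:
df['name'].value_counts(normalize=True)                         # bob 0.50, amy 0.33, chris 0.17 (row counts)
df.groupby('name')['cash_amount'].sum() / df['cash_amount'].sum()   # bob 0.46, chris 0.38, amy 0.15 (cash shares)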

Extract Volume information from pandas series - Pandas, Regex

I have a Pandas series which can be produced by the below code:
Input:
l = ['abcd 1942 Lmauu 40% 70cl',
'something again something 1.5 L',
'some other stuff 45% 70 CL',
'not the exact data 3LTR',
'abcd 100Ltud 6%(8)500ML',
'cdef 6%(8)500 ml',
'a packet 24 x 27.5 cl ( PET )']
ser = pd.Series(l)
Problem Statement and expected output:
I am trying to extract the volumes from the series into a dataframe, such that the volume is in one column and the unit of measure in the other. The expected output can be reproduced using the code below:
d = {0: {0: '70',
1: '1.5',
2: '70',
3: '3',
4: '500',
5: '500',
6: '27.5'},
1: {0: 'cl', 1: 'L', 2: 'CL', 3: 'LTR', 4: 'ML', 5: 'ml', 6: 'cl'}}
expected_output = pd.DataFrame(d)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
My attempt:
Here is what I have tried. I have come very near to what I want, but not quite: as you can see, I don't get the last volume. I think it is because I included $ in my regex, but without it I was not able to parse the volume; in the string 'abcd 1942 Lmauu 40% 70cl', for example, '1942 L' would have been returned. Also, I want the unit of measure only in the second column, not in the first as shown in my output, but that is secondary.
print(ser.str.extract(r'((?i)([\d]+?[.])?\d+?[\s+]?(cl$|ml$|ltr$|L$)(?:$))').iloc[:,[0,-1]])
0 2
0 70cl cl
1 1.5 L L
2 70 CL CL
3 3LTR LTR
4 500ML ML
5 500 ml ml
6 NaN NaN
Please suggest what I should do here.
You may use
r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b'
Details
(?i) - case insensitive mode on
\b - a word boundary
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits followed with an optional sequence of a dot and one or more digits
\s* - 0+ whitespaces
(cl|ml|ltr|L) - cl, ml, ltr or L (mind the case insensitive matching)
\b - a word boundary
Test:
>>> ser.str.extract(r'(?i)\b(\d+(?:\.\d+)?)\s*(cl|ml|ltr|L)\b', expand=True)
0 1
0 70 cl
1 1.5 L
2 70 CL
3 3 LTR
4 500 ML
5 500 ml
6 27.5 cl
It is better to use named capturing groups, so that the result columns have meaningful names.
I also simplified your regex a bit and changed the units of measure to lower case.
So change your code to:
res = ser.str.extract(r'(?i)(?P<Amount>\d+(?:\.\d+)?)\s?(?P<Unit>[CM]?L|LTR)\b')
res.Unit = res.Unit.str.lower()
The result is:
Amount Unit
0 70 cl
1 1.5 l
2 70 cl
3 3 ltr
4 500 ml
5 500 ml
6 27.5 cl
Note also that $ in (cl$|ml$|ltr$|L$) is wrong, because in at least one case there is additional text after the unit of measure.
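If a single numeric volume column is ultimately wanted, one possible follow-up to the named-group version (the conversion factors here are an assumption, not part of the question):
factors = {'cl': 0.01, 'ml': 0.001, 'l': 1.0, 'ltr': 1.0}  # assumed: cl and ml expressed in litres
res['litres'] = res['Amount'].astype(float) * res['Unit'].map(factors)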

Set minimal spacing between values

I have the following dataframe, where the value column is sorted:
df = pd.DataFrame({'variable': {0: 'Chi', 1: 'San Antonio', 2: 'Dallas', 3: 'PHL', 4: 'Houston', 5: 'NY', 6: 'Phoenix', 7: 'San Diego', 8: 'LA', 9: 'San Jose', 10: 'SF'}, 'value': {0: 191.28, 1: 262.53, 2: 280.21, 3: 283.08, 4: 290.75, 5: 295.72, 6: 305.6, 7: 357.89, 8: 380.07, 9: 452.71, 10: 477.67}})
Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 283.08
4 Houston 290.75
5 NY 295.72
6 Phoenix 305.60
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
I want to find values where the distance between neighboring values is smaller than 10:
df['value'].diff() < 10
Output:
0 False
1 False
2 False
3 True
4 True
5 True
6 True
7 False
8 False
9 False
10 False
Name: value, dtype: bool
Now I want to equally space the True values that are too close to each other. The idea is to take the last value before the True run (280.21) and cumulatively add 5 for each subsequent True value: first True = 280.21 + 5, second True = 280.21 + 10, third True = 280.21 + 15, and so on.
Expected Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 285.21 <-
4 Houston 290.21 <-
5 NY 295.21 <-
6 Phoenix 300.21 <-
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
My solution:
mask = df['value'].diff() < 10
df.loc[mask, 'value'] = 5
full_mask = mask | mask.shift(-1, fill_value=False)
df.loc[full_mask, 'value'] = df.loc[full_mask, 'value'].cumsum()
Maybe there is a more elegant one.
Let's try this:
df = pd.DataFrame({'variable': {0: 'Chi', 1: 'San Antonio', 2: 'Dallas', 3: 'PHL', 4: 'Houston', 5: 'NY', 6: 'Phoenix', 7: 'San Diego', 8: 'LA', 9: 'San Jose', 10: 'SF'}, 'value': {0: 191.28, 1: 262.53, 2: 280.21, 3: 283.08, 4: 290.75, 5: 295.72, 6: 305.6, 7: 357.89, 8: 380.07, 9: 452.71, 10: 477.67}})
s = df['value'].diff() < 10            # True where a value is too close to the previous one
add_amt = s.cumsum().mask(~s) * 5      # 5, 10, 15, ... on the True rows, NaN elsewhere
df_out = df.assign(value=df['value'].mask(add_amt.notna()).ffill()  # reset flagged rows to the last untouched value
                   + add_amt.fillna(0))                             # then add the offsets
df_out
Output:
variable value
0 Chi 191.28
1 San Antonio 262.53
2 Dallas 280.21
3 PHL 285.21
4 Houston 290.21
5 NY 295.21
6 Phoenix 300.21
7 San Diego 357.89
8 LA 380.07
9 San Jose 452.71
10 SF 477.67
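For intuition, the intermediate offset series on this data (a quick check, not part of the original answer):
print(add_amt.tolist())
# [nan, nan, nan, 5.0, 10.0, 15.0, 20.0, nan, nan, nan, nan]
# 'value' is masked to NaN on those four rows, forward-filled from 280.21, then the offsets are added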

Check if column in dataframe is missing values

I have a column full of state names.
I know how to iterate down through it, but I don't know what syntax to use to have it check for empty values as it goes. I tried isnull(), but that seems to be the wrong approach. Does anyone know a way?
I was thinking something like:
for state_name in datFrame.state_name:
    if pd.isnull(state_name):
        print('no name value')  # plus the other values from the row
    else:
        print('row is good')
df.head():
state_name state_ab city zip_code
0 Alabama AL Chickasaw 36611
1 Alabama AL Louisville 36048
2 Alabama AL Columbiana 35051
3 Alabama AL Satsuma 36572
4 Alabama AL Dauphin Island 36528
to_dict():
{'state_name': {0: 'Alabama',
1: 'Alabama',
2: 'Alabama',
3: 'Alabama',
4: 'Alabama'},
'state_ab': {0: 'AL', 1: 'AL', 2: 'AL', 3: 'AL', 4: 'AL'},
'city': {0: 'Chickasaw',
1: 'Louisville',
2: 'Columbiana',
3: 'Satsuma',
4: 'Dauphin Island'},
'zip_code': {0: '36611', 1: '36048', 2: '35051', 3: '36572', 4: '36528'}}
Based on your description, you can use np.where to check if rows are either null or empty strings.
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
(MCVE) For example, suppose you have the following dataframe
state
0 New York
1 Nevada
2
3 None
4 New Jersey
then,
state status
0 New York Good
1 Nevada Good
2 Not Good
3 None Not Good
4 New Jersey Good
It's always worth mentioning that you should avoid loops whenever possible, because they are much slower than vectorized masking.
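A self-contained version of that example (the construction of the demo frame is assumed, not shown in the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['New York', 'Nevada', '', None, 'New Jersey']})
df['status'] = np.where(df['state'].eq('') | df['state'].isnull(), 'Not Good', 'Good')
print(df)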
