Mapping a keyword to a dataframe column using pandas in Python

I have a dataframe, df:
Name    Stage  Description
Sri     1      Sri is one of the good singer in this two
        2      Thanks for reading
Ram     1      Ram is one of the good cricket player
ganesh  1      good driver
and a list:
my_list = ["one"]
I tried:
mask = df["Description"].str.contains('|'.join(my_list), na=False)
but it gives output_df:
Name  Stage  Description
Sri   1      Sri is one of the good singer in this two
Ram   1      Ram is one of the good cricket player
My desired output, desired_df, is:
Name  Stage  Description
Sri   1      Sri is one of the good singer in this two
      2      Thanks for reading
Ram   1      Ram is one of the good cricket player
It has to take the Stage column into account: I want all the rows associated with a matching description.

I think you need:
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
3 ganesh 1 good driver
# replace empty or whitespace-only names with the previous value
df['Name'] = df['Name'].mask(df['Name'].str.strip() == '').ffill()
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 Sri 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
3 ganesh 1 good driver
# get all names whose Description matches the keyword
my_list = ["one"]
names = df.loc[df["Description"].str.contains("|".join(my_list), na=False), 'Name']
print (names)
0 Sri
2 Ram
Name: Name, dtype: object
# select all rows whose Name is among the matched names
df = df[df['Name'].isin(names)]
print (df)
Name Stage Description
0 Sri 1 Sri is one of the good singer in this two
1 Sri 2 Thanks for reading
2 Ram 1 Ram is one of the good cricket player
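After the forward-fill, the two remaining steps can also be collapsed into a single groupby/transform filter. A minimal sketch, assuming the question's sample data:

```python
import pandas as pd

# Sample data matching the question; blank Name cells mark continuation rows.
df = pd.DataFrame({
    'Name': ['Sri', '', 'Ram', 'ganesh'],
    'Stage': [1, 2, 1, 1],
    'Description': ['Sri is one of the good singer in this two',
                    'Thanks for reading',
                    'Ram is one of the good cricket player',
                    'good driver'],
})

my_list = ['one']

# Forward-fill the blank names, then keep every row of any group whose
# Description contains the keyword somewhere in the group.
df['Name'] = df['Name'].mask(df['Name'].str.strip() == '').ffill()
keep = df.groupby('Name')['Description'].transform(
    lambda s: s.str.contains('|'.join(my_list), na=False).any()
)
result = df[keep]
print(result)
```

transform broadcasts the per-group boolean back to every row, so the "Thanks for reading" continuation row survives along with its group.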

Your code is finding "one" in the Description fields of the dataframe and returning only the matching rows.
If you want the second row as well, you will have to add an array element for the second match,
e.g. 'Thanks', so something like my_list = ["one", "Thanks"].

Related

How do I use medical codes to determine what disease a person has, using Jupyter?

I'm currently trying to use a number of medical codes to find out whether a person has a certain disease. I searched for a couple of days but couldn't find anything, so I'm hoping someone can help me with this. Assuming I've imported Excel file 1 into df1 and Excel file 2 into df2, how do I use Excel file 2 to identify which diseases the patients in Excel file 1 have, and indicate them with a header column? Below is an example of what the data looks like. I'm using pandas in a Jupyter notebook.
Excel file 1:
Patient  Primary Diagnosis  Secondary Diagnosis  Secondary Diagnosis 2  Secondary Diagnosis 3
Alex     50322              50111
John     50331              60874                50226                  74444
Peter    50226              74444
Peter    50233              88888
Excel File 2:
Primary Diagnosis       Medical Code
Diabetes Type 2         50322
Diabetes Type 2         50331
Diabetes Type 2         50233
Cardiovescular Disease  50226
Hypertension            50111
AIDS                    60874
HIV                     74444
HIV                     88888
Intended output (each column is "Positive for <disease>"):
Patient  Diabetes Type 2  Cardiovascular Disease  Hypertension  AIDS  HIV
Alex     1                1                       0             0     0
John     1                1                       0             1     1
Peter    1                1                       0             0     1
You can use melt, merge and pivot_table:
out = (
    df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
       .merge(df2, on='Medical Code')   # attach the disease name to each code
       .assign(dummy=1)                 # marker value to pivot on
       .pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
       .add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)
Output:
Patient  Positive for AIDS  Positive for Cardiovescular Disease  Positive for Diabetes Type 2  Positive for HIV  Positive for Hypertension
Alex     0                  0                                    1                             0                 1
John     1                  1                                    1                             1                 0
Peter    0                  1                                    1                             1                 0
IIUC, you could melt df1, map the codes through the reshaped df2, and finally pivot_table the result:
diseases = df2.set_index('Medical Code')['Primary Diagnosis']
(df1
 .reset_index()
 .melt(id_vars=['index', 'Patient'])
 .assign(disease=lambda d: d['value'].map(diseases),  # code -> disease name
         value=1)
 .pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)
output:
disease AIDS Cardiovescular Disease Diabetes Type 2 HIV Hypertension
Patient
Alex 0 0 1 0 1
John 1 1 1 1 0
Peter 0 1 1 1 0
Maybe you could convert Excel file 2 into some form of key-value pair, replace the diagnosis columns in file 1 with the corresponding disease names, and later apply some form of encoding, like one-hot, to file 1. Not sure if this approach would definitely help, but just sharing my thoughts.
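The key-value-pair idea above can be sketched concretely with map and get_dummies. This assumes the diagnosis columns have already been stacked into one long "Medical Code" column (e.g. via melt), which simplifies the question's wide layout:

```python
import pandas as pd

# Hypothetical long-format version of the question's Excel file 1:
# one row per (patient, code), already melted from the wide layout.
df1 = pd.DataFrame({
    'Patient': ['Alex', 'Alex', 'John', 'John', 'John', 'John',
                'Peter', 'Peter', 'Peter', 'Peter'],
    'Medical Code': [50322, 50111, 50331, 60874, 50226, 74444,
                     50226, 74444, 50233, 88888],
})
# Excel file 2: code -> disease lookup.
df2 = pd.DataFrame({
    'Primary Diagnosis': ['Diabetes Type 2', 'Diabetes Type 2',
                          'Diabetes Type 2', 'Cardiovescular Disease',
                          'Hypertension', 'AIDS', 'HIV', 'HIV'],
    'Medical Code': [50322, 50331, 50233, 50226, 50111, 60874, 74444, 88888],
})

# Key-value mapping (code -> disease), then one-hot encode per patient.
code_map = df2.set_index('Medical Code')['Primary Diagnosis']
disease = df1['Medical Code'].map(code_map)
onehot = (pd.get_dummies(disease)
            .groupby(df1['Patient']).max()
            .add_prefix('Positive for '))
print(onehot)
```

groupby(...).max() collapses the per-row dummies so a patient with two codes for the same disease (like Peter's two HIV codes) still gets a single 1.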

How would I go about iterating through each row in a column and keeping a running tally of every substring that comes up? Python

Essentially what I am trying to do is go through the "External_Name" column, row by row, and get a count of the unique substrings within each string, similar to .value_counts().
External_Name                            Specialty
ABMC Hyperbaric Medicine and Wound Care  Hyperbaric/Wound Care
ABMC Kaukauna Laboratory Services        Laboratory
AHCM Sinai Bariatric Surgery Clinic      General Surgery
...                                      ...
For example, after running through the first three rows of "External_Name" the output would be something like:
Output      Count
ABMC        2
Hyperbaric  1
Medicine    1
and         1
Wound       1
Care        1
So on and so forth. Any help would be really appreciated!
You can split at whitespace with str.split(), then explode the resulting word lists into individual rows and count the values with value_counts.
>>> df.External_Name.str.split().explode().value_counts()
ABMC 2
Hyperbaric 1
Medicine 1
and 1
Wound 1
Care 1
Kaukauna 1
Laboratory 1
Services 1
AHCM 1
Sinai 1
Bariatric 1
Surgery 1
Clinic 1
Name: External_Name, dtype: int64
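For completeness, the one-liner above in a self-contained form, with the three sample names taken from the question:

```python
import pandas as pd

df = pd.DataFrame({'External_Name': [
    'ABMC Hyperbaric Medicine and Wound Care',
    'ABMC Kaukauna Laboratory Services',
    'AHCM Sinai Bariatric Surgery Clinic',
]})

# Split each name on whitespace, flatten the word lists to one word per
# row with explode, then count occurrences.
counts = df['External_Name'].str.split().explode().value_counts()
print(counts)
```

Note this splits only on whitespace; punctuation attached to a word would count as part of it.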

Python: Compare 2 columns and write data to Excel sheets

I need to compare two columns together: "EMAIL" and "LOCATION".
I'm using Email because it's more accurate than name for this issue.
My objective is to find the total number of locations each person worked
at, use that total to select which sheet the data will be written to,
and copy the original data over to the new sheet (tab).
I need the original data copied over with all the duplicate
locations, which is where this problem stumps me.
Full Excel sheet: shown as an image (had to use images because the post was flagged as spam).
The Excel sheet (SAMPLE) I'm reading in as a dataframe is shown in a second image.
Example:
TOMAPPLES@EXAMPLE.COM worked at WENDYS, FRANKS HUT, and WALMART. That sums up to 3 different locations, which I would add to a new sheet called "3 Different Locations".
SJONES22@GMAIL.COM worked at LONDONS TENT and YOUTUBE. That's 2 different locations, which I would add to a new sheet called "2 Different Locations".
MONTYJ@EXAMPLE.COM worked only at WALMART. This user would be added to the sheet "1 Location".
Outcome: the data copied to new sheets (images of Sheet 2, Sheet 3 and Sheet 4, each holding the rows for that many different locations).
Thanks for taking the time to look at my problem =)
Hi, check if the lines below work for you:
import pandas as pd
df = pd.read_excel('sample.xlsx')
df1 = df.groupby(['Name', 'Location', 'Job']).count().reset_index()
# count locations and emails per name/location/job/email combination
df2 = (df.groupby(['Name', 'Location', 'Job', 'Email'])
         .agg({'Location': 'count', 'Email': 'count'})
         .rename(columns={'Location': 'Location Count', 'Email': 'Email Count'})
         .reset_index())
print(df1)
print('\n\n')
print(df2)
Below is the output; change the columns to check more variations.
df1
Name Location Job Email
0 Monty Jakarta Manager 1
1 Monty Mumbai Manager 1
2 Sahara Jonesh Paris Cook 2
3 Tom App Jakarta Buser 1
4 Tom App Paris Buser 2
df2 (display truncated):
Name Location ... Location Count Email Count
0 Monty Jakarta ... 1 1
1 Monty Mumbai ... 1 1
2 Sahara Jonesh Paris ... 2 2
3 Tom App Jakarta ... 1 1
4 Tom App Paris ... 2 2
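The counts above don't yet route the rows to sheets. A sketch of the full question (count distinct locations per email, then bucket every original row, duplicates included, by that count); the 'Email' and 'Location' column names are assumed from the screenshots:

```python
import pandas as pd

# Hypothetical data mirroring the question's examples.
df = pd.DataFrame({
    'Email': ['TOMAPPLES@EXAMPLE.COM'] * 3 + ['SJONES22@GMAIL.COM'] * 2
             + ['MONTYJ@EXAMPLE.COM'],
    'Location': ['WENDYS', 'FRANKS HUT', 'WALMART',
                 'LONDONS TENT', 'YOUTUBE', 'WALMART'],
})

# Number of distinct locations per email, broadcast back onto every row.
n_locs = df.groupby('Email')['Location'].transform('nunique')

# Bucket the original rows by that count; duplicates stay with their group.
sheets = {f'{n} Different Locations': g for n, g in df.groupby(n_locs)}
for name, g in sheets.items():
    print(name, '->', len(g), 'rows')

# To actually write the tabs (requires openpyxl):
# with pd.ExcelWriter('output.xlsx') as writer:
#     for name, g in sheets.items():
#         g.to_excel(writer, sheet_name=name, index=False)
```

Because transform('nunique') broadcasts per row, grouping by it keeps every duplicate location row in the right sheet.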

Sort text in second column based on values in first column

In Python I would like to separate the text into different rows based on the value of the first number. So:
Harry went to School 100
Mary sold goods 50
Sick man
using the provided information below:
number text
1 Harry
1 Went
1 to
1 School
1 100
2 Mary
2 sold
2 goods
2 50
3 Sick
3 Man
for i in xrange(0, len(df['number'])-1):
if df['number'][i+1] == df['number'][i]:
# append text (e.g Harry went to school 100)
else:
# new row (Mary sold goods 50)
You can use groupby:
for name, group in df.groupby('number'):
    print(' '.join(group['text']))
Result
Harry Went to School 100
Mary sold goods 50
Sick Man
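If a Series of joined sentences is wanted instead of printed lines, agg with str.join does the same thing. A sketch with the question's data:

```python
import pandas as pd

# The question's data as a frame.
df = pd.DataFrame({
    'number': [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'text': ['Harry', 'Went', 'to', 'School', '100',
             'Mary', 'sold', 'goods', '50', 'Sick', 'Man'],
})

# agg(' '.join) collapses each group's text column into one sentence.
sentences = df.groupby('number')['text'].agg(' '.join)
print(sentences.tolist())
```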

groupby and join text column

I have a csv file with the header text|business_id.
I want to group all texts related to one business, so I used:
review_data = review_data.groupby(['business_id'])['text'].apply("".join)
The review_data is like:
text \
0 mr hoagi institut walk doe seem like throwback...
1 excel food superb custom servic miss mario mac...
2 yes place littl date open weekend staff alway ...
business_id
0 5UmKMjUEUNdYWqANhGckJw
1 5UmKMjUEUNdYWqANhGckJw
2 5UmKMjUEUNdYWqANhGckJw
I get this error: TypeError: sequence item 131: expected string, float found
these are the lines 130 to 132:
130 use order fair often past 2 year food get progress wors everi time order doesnt help owner alway regist rude everi time final decid im done dont think feel let inconveni order food restaur let alon one food isnt even good also insid dirti heck deliv food bmw cant buy scrub brush found golden dragon collier squar 100 time better|SQ0j7bgSTazkVQlF5AnqyQ
131 popular denni|wqu7ILomIOPSduRwoWp4AQ
132 want smth quick late night would say denni|wqu7ILomIOPSduRwoWp4AQ
I think you need to filter out the null data with boolean indexing before the groupby:
print(review_data)
text business_id
0 mr hoagi 5UmKMjUEUNdYWqANhGckJw
1 excel food 5UmKMjUEUNdYWqANhGckJw
2 NaN 5UmKMjUEUNdYWqANhGckJw
3 yes place 5UmKMjUEUNdYWqANhGckJw
review_data = review_data[review_data['text'].notnull()]
print(review_data)
text business_id
0 mr hoagi 5UmKMjUEUNdYWqANhGckJw
1 excel food 5UmKMjUEUNdYWqANhGckJw
3 yes place 5UmKMjUEUNdYWqANhGckJw
review_data=review_data.groupby(['business_id'])['text'].apply("".join)
print(review_data)
business_id
5UmKMjUEUNdYWqANhGckJw mr hoagi excel food yes place
Name: text, dtype: object
