Searching values from one dataframe in another dataframe using pandas - Python

I have two datasets, patient data and disease data.
The patient dataset stores diseases as alphanumeric codes, which I want to look up in the disease dataset to display the disease name.
Patient dataset snapshot
Disease dataset snapshot
I want to use the groupby function on the ICD column to find the occurrence of each disease, rank the counts in descending order, and display the top 5. I have been trying to find a reference for this, but could not.
Would appreciate the help!
EDIT!!
avg2 = joined.groupby('disease_name').TIME_DELTA.mean().disease_name.value_counts()
I am getting this error "'Series' object has no attribute 'disease_name'"

Assuming that your data are in two pandas dataframes called patients and diseases, and that the diseases dataset has the columns disease_id and disease_name, this could be a solution:
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined.disease_name.value_counts().head(5)
This solution joins the data together and then uses value_counts instead of grouping. It should solve what I perceive you are asking for, even if it is not exactly the functionality you asked for.
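For illustration, here is a minimal end-to-end sketch of that approach with toy data; the dataframe and column names (patients, diseases, ICD, disease_id, disease_name) are assumed from the question:
import pandas as pd

# Toy data shaped like the two datasets described in the question
patients = pd.DataFrame({'patient_id': [1, 2, 3, 4],
                         'ICD': ['A01', 'B02', 'A01', 'C03']})
diseases = pd.DataFrame({'disease_id': ['A01', 'B02', 'C03'],
                         'disease_name': ['Typhoid', 'Measles', 'Influenza']})

# Join the code column to the disease lookup, then count each disease name
joined = patients.merge(diseases, left_on='ICD', right_on='disease_id')
top_5 = joined.disease_name.value_counts().head(5)
print(top_5)  # Typhoid 2, Measles 1, Influenza 1
Regarding the edit: groupby('disease_name')...mean() returns a Series indexed by disease_name, so chaining .disease_name onto it fails; call .reset_index() first (or use the Series index directly) if you need that column back.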

Related

How could I create a column with matching values from different datasets with different lengths

I want to create a new column in the dataset in which a ZipCode is assigned to a specific Region.
There are 5 Regions in total, and every Region consists of some number of ZipCodes. I would like to use the two different datasets to create this new column.
I tried some code already, but failed because the Series are not identically labeled. How should I tackle this problem?
I have two datasets, one of them has 1518 rows x 3 columns and the other one has 46603 rows x 3 columns.
As you can see in the picture:
df1 is the first dataset with the Postcode and Regio columns, which are the ZipCodes assigned to the corresponding Regio.
df2 is the second dataset where the Regio column is missing as you can see. I would like to add a new column into the df2 dataset which contains the corresponding Regio.
I hope someone could help me out.
Kind regards.
I believe you need to map the zip code from dataframe 2 to the Regio column from the first dataframe (assuming Postcode and ZipCode are the same field).
First create a dictionary from df1, then look up each zip code in it and assign the result to the new column:
zip_dict = dict(zip(df1.Postcode, df1.Regio))
df2['Regio'] = df2.ZipCode.replace(zip_dict)
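As a runnable sketch with toy data (column names assumed from the screenshots), using .map as a close alternative to .replace; unlike replace, map leaves unmatched zip codes as NaN rather than keeping the code:
import pandas as pd

# df1: lookup table of Postcode -> Regio; df2: the larger dataset missing Regio
df1 = pd.DataFrame({'Postcode': [1011, 1012, 2511],
                    'Regio': ['North', 'North', 'West']})
df2 = pd.DataFrame({'ZipCode': [1012, 2511, 1011, 9999]})

zip_dict = dict(zip(df1.Postcode, df1.Regio))

# Look each ZipCode up in the dictionary; 9999 has no match and becomes NaN
df2['Regio'] = df2['ZipCode'].map(zip_dict)
print(df2)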

Extract unique value with multiple columns from DataFrame

I have a dataframe where I want to extract values from two columns, where the criterion is the unique values of one of the columns. In the image below, I want to extract the unique values of 'education' along with the corresponding values from 'education-num'. I can easily extract the unique values with df['education'].unique(), but I am stuck on extracting the corresponding 'education-num'.
image of the dataframe.
(Originally the task was to compute the population of people with an education of Bachelors, Masters or Doctorate, and I assume this would be easier by comparing 'education-num' rather than using logical operators on strings. But if there is any way to do it directly from 'education', that would also be helpful.
Edit: it turns out DataFrame.isin helps to select rows by a list of strings, as given in the solution here.)
P.S. stack-overflow didn't allow me to post the image directly and posted a link to it instead...😒
Select columns by subset and call DataFrame.drop_duplicates:
df1 = df[['education', 'education-num']].drop_duplicates()
If need count population use:
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')
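A small demo of both lines, plus the DataFrame.isin filter mentioned in the question's edit; the toy data below is assumed to be shaped like the census dataset referenced there:
import pandas as pd

df = pd.DataFrame({'education': ['Bachelors', 'Masters', 'Bachelors', 'Doctorate', 'HS-grad'],
                   'education-num': [13, 14, 13, 16, 9]})

# Unique (education, education-num) pairs
df1 = df[['education', 'education-num']].drop_duplicates()

# Row count per pair
df2 = df.groupby(['education', 'education-num']).size().reset_index(name='count')

# Population holding one of the three degrees, via isin on the string column
population = len(df[df['education'].isin(['Bachelors', 'Masters', 'Doctorate'])])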

Compare two date columns in pandas DataFrame to validate third column

Background info
I'm working on a DataFrame where I have successfully joined two different datasets of football players using fuzzymatcher. These datasets did not have keys for an exact match and instead had to be joined by player names. An example match of the name columns from the two databases being merged is the following:
long_name name
L. Messi Lionel Andrés Messi Cuccittini
As part of validating an 18,000-row database, I want to check the two date-of-birth columns in the merged DataFrame df, ensuring that the columns match like the example below:
dob birth_date
1987-06-24 1987-06-24
Both date columns have been converted from strings to dates using pd.to_datetime(), e.g.
df['birth_date'] = pd.to_datetime(df['birth_date'])
My question
My query: I have another column called 'value'. I want to update my pandas DataFrame so that if the two date columns match, the entry is unchanged; if they don't match, the data in the value column should be changed to null. This is something I can do quite easily in Excel with a date-diff calculation, but I'm unsure how in pandas.
My current code is the following:
df.loc[(df['birth_date'] != df['dob']),'value'] = np.nan
Reason for this step (feel free to skip)
The reason for this code is that it will quickly show me fuzzy matches that are inaccurate (approx 10% of total database) and allow me to quickly fix those.
Ideally I also need to work on the matching algorithm to ensure a perfect date match; however, my current algorithm works quite well in its current state and the project is nearly complete. Any advice on this I'd be happy to hear, if this is something you know about.
Many thanks in advance!
IIUC:
Try np.where, which works as follows:
np.where(condition, x, y)  # where condition is True assign x, otherwise y
Here the condition is df['birth_date'] != df['dob'], x is np.nan, and y is the prevailing df['value']:
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
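A quick check with toy data (dates assumed already parsed with pd.to_datetime); note that the asker's own df.loc line is an equivalent way to write this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'dob':        pd.to_datetime(['1987-06-24', '1990-01-01']),
                   'birth_date': pd.to_datetime(['1987-06-24', '1991-05-05']),
                   'value': [100.0, 80.0]})

# Keep value where the dates match, otherwise null it out
df['value'] = np.where(df['birth_date'] != df['dob'], np.nan, df['value'])
# Row 0 keeps 100.0; row 1 becomes NaN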

GradientBoostingClassifier and many columns

I use a GradientBoosting classifier to predict the gender of users. The data have a lot of predictors, and one of them is the country. For each country I have a binary column, and exactly one of the country columns is set to 1 in each row. But this representation is very slow from a computational point of view. Is there any way to represent the country columns with only one column, in a correct way?
You can replace the binary variables with the actual country name and collapse all of these columns into one column. Then use LabelEncoder on this column to create a proper integer variable and you should be all set.
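A minimal sketch of that idea, assuming the one-hot columns are named like country_* with exactly one 1 per row (the names here are illustrative, not from the question):
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'country_US': [1, 0, 0],
                   'country_DE': [0, 1, 0],
                   'country_FR': [0, 0, 1],
                   'age': [23, 35, 41]})

country_cols = [c for c in df.columns if c.startswith('country_')]

# idxmax(axis=1) returns the name of the column holding the 1 in each row
df['country'] = df[country_cols].idxmax(axis=1).str.replace('country_', '', regex=False)
df = df.drop(columns=country_cols)

# Encode the single country column as integers
df['country'] = LabelEncoder().fit_transform(df['country'])
For tree-based models like gradient boosting this usually works acceptably, though keep in mind the integer order it imposes is arbitrary.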

Understanding groupby and pandas

I'm trying to use pandas on a movie dataset to find the 10 critics with the most reviews, and to list their names in a table with the name of the magazine publication they work for and the dates of their first and last reviews.
The movie dataset starts as a CSV file which in Excel looks something like this:
critic fresh date publication title reviewtext
r.ebert fresh 1/2/12 Movie Mag Toy Story 'blahblah'
n.bob rotten 4/2/13 Time Ghostbusters 'blahblah'
r.ebert rotten 3/31/09 Movie Mag CasaBlanca 'blahblah'
(you can assume that a critic posts reviews at only one magazine/publication)
Then my basic code starts out like this:
reviews = pd.read_csv('reviews.csv')
reviews = reviews[~reviews.reviewtext.isnull()]
reviews = reviews[reviews.fresh != 'none']
reviews = reviews[reviews.reviewtext.str.len() > 0]
most_rated = reviews.groupby('critic').size().sort_values(ascending=False)[:30]
print(most_rated)
output>>>
critic
r.ebert 2
n.bob 1
Then I know how to isolate the top ten critics and the number of reviews they've made (shown above), but I'm still not familiar with pandas groupby, and using it seems to get rid of the rest of the columns (and along with it things like publication and dates). When that code runs, it only prints a list of the movie critics and how many reviews they've done, not any of the other column data.
Honestly I'm lost as to how to do it. Do I need to append data from the original reviews back onto my sorted dataframe? Do I need to make a function to apply onto the groupby function? Tips or suggestions would be very helpful!
As DanB says, groupby() just splits your DataFrame into groups. Then you apply some number of functions to each group, and pandas stitches the results together as best it can, indexed by the original group identifiers. Other than that, as far as I understand, there's no "memory" of what the original group looked like.
Instead, you have to specify what you want to output to contain. There are a few ways to do this -- I'd look into 'agg' and 'apply'. 'Agg' is for functions that return a single value for the whole group, whereas apply is much more flexible.
If you specify what you are looking to do, I can be more helpful. For now, I'll just give you two examples.
Suppose you want, for each reviewer, the number of reviews as well as the dates of the first and last reviews and the movies that were reviewed first and last. Since each of these is a single value per group, use 'agg':
grouped_reviews = reviews.groupby('critic')
grouped_reviews.agg({'date': ['first', 'last', 'size'], 'title': ['first', 'last']})
Suppose you want to return a dataframe of the first and last review by each reviewer. We can use 'apply', which works with any function that outputs a pandas object. So we'll write a function that takes each group and returns a dataframe of just the first and last rows:
def get_first_and_last(df):
    # keep just the first and last rows of each critic's group
    return df.iloc[[0, -1]]

grouped_reviews.apply(get_first_and_last)
If you are more specific about what you are looking to do, I can give you a more specific answer.
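For the concrete task in the question (the top 10 critics with their publication and first/last review dates), here is a sketch using the modern sort_values and named-aggregation APIs instead of the deprecated .order from the question's code:
import pandas as pd

reviews = pd.read_csv('reviews.csv', parse_dates=['date'])

summary = (reviews.sort_values('date')
                  .groupby('critic')
                  .agg(publication=('publication', 'first'),  # one publication per critic
                       n_reviews=('title', 'size'),
                       first_review=('date', 'first'),
                       last_review=('date', 'last'))
                  .sort_values('n_reviews', ascending=False)
                  .head(10))
print(summary)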
