I have a healthcare dataset that includes several free-text columns (such as medical history, doctor notes, etc.). I want to use these notes to help build a 'criteria list' for the patients who stayed at the hospital for less than 2 days (I have that flagged in the dataset).
I'm new to NLP and have only done coursework projects where a single column of text is used, but this dataset has multiple columns, so how do I go about it? Do I combine all the columns into one big string and then do all the text cleaning and processing, or is there another option?
Here's a screenshot of the dataset; I couldn't find any other way to display it:
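For what it's worth, here is a minimal sketch of the "combine the columns" option in pandas; the column names are invented stand-ins for the real ones:

```python
import pandas as pd

# Hypothetical column names standing in for the real dataset
df = pd.DataFrame({
    "medical_history": ["hypertension, type 2 diabetes", None],
    "doctor_notes": ["patient stable on discharge", "follow-up in 2 weeks"],
    "short_stay": [1, 0],  # the "< 2 days" flag mentioned above
})

text_cols = ["medical_history", "doctor_notes"]
# fillna("") keeps a missing note from turning the whole combined row into NaN
df["all_text"] = df[text_cols].fillna("").agg(" ".join, axis=1)

# Text for the short-stay patients only, ready for cleaning/tokenizing
short_stay_text = df.loc[df["short_stay"] == 1, "all_text"]
print(short_stay_text)
```

Combining keeps the pipeline simple, and you can still keep the original per-column text around in case you later want to treat doctor notes differently from medical history.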
I'm working on a project for school that requires me to use NLP, and I'm working in Python in a Jupyter Notebook. I decided to use this dataset of game reviews on Steam by Australian users, provided by UC San Diego. I've taken the downloaded data and compiled it into a large dataframe that consists of a user_id and then data for up to 10 reviews by that user (no user has submitted more than 10 reviews in the dataset). There are 6 columns for each review: the name of the game being reviewed, a count of users finding the review funny, a count of users finding the review helpful, a count of users finding the review not helpful, whether the user recommended the game or not, and the text review itself. This amounts to a dataframe with 61 columns: 1 for the user_id and then 6 for each of the 10 reviews (if a user submits fewer reviews, the data for the "missing" reviews is NaN).
To get the textual reviews together into a single array for tokenizing, I concatenated the 10 review_text columns into a new array, which came out to 257,990 elements (the length of my dataframe × 10 reviews), though many of these were NaN. I then dropped the NaN entries from this array, leaving a total of 59,305 elements. Here is a picture of a bit of the dataframe and the operations to compile the text reviews into a single object.
After that, I imported word_tokenize and sent_tokenize from nltk and attempted to use sent_tokenize() on the array. This failed (it requires a string, and I was providing a bytes-like object), but another answer I found on Stack Overflow indicated that this was an encoding issue that could be fixed by casting to a string (thus the str() casting you see here). Upon performing sent_tokenize() on the array, I'm suddenly left with only 11 elements, and looking at the results of the tokenization, I can see that I'm clearly missing many, many reviews that should have been tokenized. Using word_tokenize() then expands this to 374 words, and there are obviously far more than 374 words in the 59,000 reviews I want to examine.
I'm not sure why this is happening, and I feel like I'm either very close to "getting it" or wildly off base. I tried searching for reasons sent_tokenize would omit elements, but didn't have much success, so I'm hoping someone here can give me a push in the right direction.
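One plausible culprit: calling str() on the whole array gives you the array's printed representation, which NumPy abbreviates with an ellipsis, so the tokenizer only ever sees that short summary string. Here is a sketch of one way around it, assuming the ten columns are named review_text_1 through review_text_10 (the real names may differ) and df is the dataframe described above:

```python
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # one-time download of the tokenizer models

# Assumed names for the 10 review-text columns
text_cols = [f"review_text_{i}" for i in range(1, 11)]

# stack() flattens the 10 columns into one Series and drops NaNs for us
reviews = df[text_cols].stack().reset_index(drop=True)

# Decode any bytes objects individually instead of str()-casting the array;
# str(whole_array) returns the array's truncated repr, which is likely why
# only ~11 "sentences" survived
reviews = reviews.map(
    lambda r: r.decode("utf-8", errors="replace") if isinstance(r, bytes) else str(r)
)

# Tokenize per review, then flatten the results
sentences = [s for review in reviews for s in sent_tokenize(review)]
words = [w for review in reviews for w in word_tokenize(review)]
```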
I'm trying to break up product names into categories. For example, if the product is "Demi Baguette", the category should be "Baguette" and the sub-category "Demi". I have looked at NLP articles, but nothing seems to be what I need, as they all focus on sentences and running text.
I've seen other questions answered by saying to use a dict; however, there are over 15 thousand rows in the Excel file, so that's not really possible.
Any ideas as to how I can tackle this or where I can look?
Here is an example of my data.
So I would want the category to be "Soup", then sub-categories based on flavour, e.g. "Chicken", and misc labels such as "Cream".
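A rule-based sketch of one approach: rather than a dict of all 15,000 product names, keep a short keyword list per level and scan each name against it. All the keyword sets below are invented examples:

```python
# Illustrative keyword sets; the real lists would come from the data
categories = {"Baguette", "Soup", "Bread"}
subcategories = {"Demi", "Chicken", "Tomato"}
misc_labels = {"Cream", "Organic"}

def split_name(name: str):
    """Split a product name into (category, sub-category, misc labels)."""
    tokens = name.title().split()
    category = next((t for t in tokens if t in categories), None)
    sub = next((t for t in tokens if t in subcategories), None)
    misc = [t for t in tokens if t in misc_labels]
    return category, sub, misc

print(split_name("Demi Baguette"))          # ('Baguette', 'Demi', [])
print(split_name("Cream of Chicken Soup"))  # ('Soup', 'Chicken', ['Cream'])
```

The keyword lists stay small because many products share the same category and flavour words, which is what makes this cheaper than enumerating every row.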
I'm working on a DataFrame in Python that contains data from different people. One of the columns contains the "Current Job" (as a string). I would like to classify the different kinds of jobs this DataFrame contains.
But I have some issues with this column:
Bad spelling
Non-existent jobs (for example, one of the jobs is "Spinning in my chair")
It's in Spanish, so you can find gendered variants of the same job, such as "operario" and "operaria"
Some jobs are written in different ways but are basically the same job, such as "operario de máquinas" and "operario de maquinaria"
Stemming can deal with similar words such as "operario" and "operaria", but I would like to know if there are other tools I could learn to deal with this kind of problem.
Thanks in advance.
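Beyond stemming, fuzzy string matching is one tool worth knowing for this. Here is a sketch using the rapidfuzz library; the canonical job list and the threshold are invented for illustration, not definitive:

```python
from rapidfuzz import fuzz, process  # pip install rapidfuzz

# Invented canonical list; in practice you would curate this from the data
canonical = ["operario de máquinas", "administrativo", "conductor"]

def normalize_job(raw: str, threshold: float = 75):
    """Map a raw job string to its closest canonical title, or None."""
    match = process.extractOne(
        raw.lower().strip(), canonical, scorer=fuzz.token_sort_ratio
    )
    if match and match[1] >= threshold:
        return match[0]
    return None  # below threshold: too garbled, or a joke entry

print(normalize_job("operaria de maquinaria"))  # likely 'operario de máquinas'
print(normalize_job("spinning in my chair"))    # None
```

The threshold doubles as a filter for the non-existent jobs: joke entries tend to score far below any real title, so they come back as None instead of being forced into a category.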
OK, I'll ask in more detail. I'm updating the question and will add an image as well. I have the sectors and the job vacancy data for those sectors, as in the picture. The first column is dates, and it's the index; the other 18 columns are job vacancy data for the sectors.
Now my question is:
When I chart calculations such as seasonality and moving average, I get a separate set of tables for each of the 18 sectors.
For example, the healthcare industry, or mining.
I have exactly 18 of each of these three tables. By the end of the data preprocessing stage, I will have hundreds of tables. I wanted to describe them table by table in the readme.md when I upload them to my GitHub profile, but that won't be possible this way. Do you think I'm going about this the right way? I keep going back and forth on it. Or am I making things difficult for myself?
Is there another way to analyze them? Can't I merge them? I am open to suggestions at this point; I am doing time series analysis for the first time.
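One way to avoid hand-writing 18 sets of tables is to loop over the sector columns and collect everything in a single long-format frame. A sketch, assuming df has the date index and one column per sector as described above, and monthly data (period=12 is an assumption):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

pieces = []
for sector in df.columns:
    series = df[sector].dropna()
    # Decompose into trend + seasonal components (monthly seasonality assumed)
    decomp = seasonal_decompose(series, model="additive", period=12)
    pieces.append(pd.DataFrame({
        "sector": sector,
        "trend": decomp.trend,
        "seasonal": decomp.seasonal,
        "moving_avg": series.rolling(window=12).mean(),
    }))

# One tidy table instead of 18 x 3 separate ones; easy to filter per sector
summary = pd.concat(pieces)
print(summary[summary["sector"] == "healthcare"].head())
```

A single long-format table like this is also much easier to summarize in a README: one description of the columns covers all 18 sectors.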
I'm an Excel newbie and an aspiring data analyst. I have this data, and I want to find the distribution of city-wise shopping experience. Column M has the shopping experience rated from 1 to 5.
What I tried
I am not able to google how to do this at all. I tried running a correlation, but the built-in Excel data analysis tool does not let me run it on non-numeric data, and I am not able to group the City cells either. I thought of replacing every city with a numeric alias, but I don't know how to do that either. How should I search for, or go about, this problem?
Update: I was thinking of some way to get this out of the cities column.
I am thinking this is better done in Python.
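If you do go the Python route, a sketch of the same analysis in pandas; the file and column names ("City", "Shopping Experience") are assumptions based on the question:

```python
import pandas as pd

# Assumed file and column names; adjust to the real workbook
df = pd.read_excel("survey.xlsx")

# Count of each rating (1-5) per city: the distribution asked for
distribution = pd.crosstab(df["City"], df["Shopping Experience"])

# Average rating per city, the pandas analogue of AVERAGEIF
avg_by_city = df.groupby("City")["Shopping Experience"].mean()

print(distribution)
print(avg_by_city)
```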
How about something like this? I have just taken the cities and data to show AVERAGEIF, SUMIF and COUNTIF:
I used Data Validation to provide the list to select from.