I've been working on data classification as part of a research project, but since there are thousands of distinct values, I thought it best to use Python to automate the process rather than going through each record and classifying it manually.
Basically, I have a dataframe in which one column is entitled "description" and another is entitled "codes". Each row of the "description" column contains a survey response about activities. The descriptions are all different but may contain certain keywords. I have a list of some 40 codes with which to classify each row based on its text. My plan was to manually create some columns in the CSV file and, in each column, type a keyword corresponding to one of the codes. A loop (or a function containing a loop) would then go through each row of the dataframe and, whenever a substring matching any of the keywords is found, update the "codes" column with the code corresponding to that keyword.
My Dilemma
For example:
Suppose the list of codes is "Dance", "Nap", "Run", and "Fight", stored in a separate dataframe column. This dataframe, together with the manually entered keyword columns, is shown below (there can be more than two keyword columns; I used two for illustration purposes).
This dataframe is named "classes".
|category|Keyword1|Keyword2|
|---|---|---|
|Dance|dance|danc|
|Nap|sleep|slept|
|Run|run|quick|
|Fight|kick|unch|
The other dataframe is as follows with the "codes" column initially blank.
This dataframe is named "data".
|description|codes|
|---|---|
|Iwasdancingthen||
|She Slept||
|He landed a kick||
|We are family||
The function or loop will search through the "description" column above and check whether any of the keywords appear in a given row. If they do, the corresponding code is applied (as shown in the resulting dataframe below in bold). If not, that row of the "codes" column is left blank. The loop should run as many times as there are keyword columns; in this case it will run twice, since there are two keyword columns.
|description|codes|
|---|---|
|Iwasdancingthen|**Dance**|
|She Slept|**Nap**|
|He landed a kick|**Fight**|
|We are family||
FYI: the keywords don't have to be complete words; I'd like partial words to match too, as you can see above.
Also, it should be noted that the loop or function should ignore case and should find keywords even inside run-together strings such as "Iwasdancingthen".
I hope you understand what I'm trying to do.
What I tried:
At first, I tried using a dictionary and manipulating it somehow, following the advice here:
search keywords in dataframe cell
However, this didn't work too well, as many "NaN" values popped up and it became too complicated, so I tried a different route using lists. The code I used was based on another user's advice:
How to conditionally update DataFrame column in Pandas
Here's what I did:
# Create lists from the classes dataframe
Keyword1list = classes["Keyword1"].values.tolist()
Category = classes["category"].values.tolist()
I then used the following loop for classification:
for i in range(len(Keyword1list)):
    data.loc[data["description"] == Keyword1list[i], "codes"] = Category[i]
However, the resulting output still gives me "NaN" for every row. Also, I don't know how to loop over every keyword column (in this case, the two columns "Keyword1" and "Keyword2").
I'd really appreciate it if anyone could help me with a function or loop that works. Thanks in advance!
Edit: It was pointed out to me that some descriptions might contain multiple keywords. I forgot to mention that the codes in the "classes" dataframe are ordered by rank so that the ones that appear first on the dataframe should take priority; for example, if both "dance" and "nap" are in a description, the code listed higher in the "classes" dataframe (i.e. dance) should be selected and inputted into the "codes" column. I hope there's a way to do that.
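For reference, here is a rough sketch of one way the matching could work. It swaps the `==` comparison for `str.contains` (my own choice, not something from the question's linked answers), and iterates over the `classes` rows in reverse so that earlier, higher-ranked rows overwrite later ones, giving the priority behaviour described in the edit:

```python
import pandas as pd

# Sample frames mirroring the question (column names as described).
classes = pd.DataFrame({
    "category": ["Dance", "Nap", "Run", "Fight"],
    "Keyword1": ["dance", "sleep", "run", "kick"],
    "Keyword2": ["danc", "slept", "quick", "unch"],
})
data = pd.DataFrame({
    "description": ["Iwasdancingthen", "She Slept", "He landed a kick", "We are family"],
    "codes": ["", "", "", ""],
})

# Pick up every keyword column automatically, however many there are.
keyword_cols = [c for c in classes.columns if c.startswith("Keyword")]

# Iterate bottom-up so that higher-ranked categories overwrite lower ones.
for _, row in classes[::-1].iterrows():
    for col in keyword_cols:
        # case=False ignores case; regex=False treats the keyword literally,
        # and substring matching handles run-together strings.
        hit = data["description"].str.contains(row[col], case=False, regex=False)
        data.loc[hit, "codes"] = row["category"]

print(data)
```

This is only a sketch under the assumptions above; with 40 real codes the same two loops apply unchanged, since `keyword_cols` is built from whatever keyword columns exist.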
Goal
I want to split the response from Google Sentiment Analysis into four columns, then merge with original content dataframe.
Situation
I'm running the Google sentiment analysis on a column of text in a python dataframe.
Here's a sample for one of the returned rows. The column is 'sentiment':
magnitude: 0.6000000238418579\nscore: -0.6000000238418579
I then need to split that cell into four new columns: one for magnitude, one for its returned value, one for score, and one for its returned value.
What I've tried
Currently, I'm using this method to do that:
df02 = df01['sentiment'].astype(str).str.split(expand=True)
I'm then merging those four columns with the original dataframe that contains the analyzed text field and other values.
However, if sentiment returns no results, the sentiment cell is empty. And if all rows have empty sentiment cells, then it won't create four new columns. And that breaks my attempt to merge the two dataframes.
So I'm trying to understand how I can insert None into the new four column cells if the sentiment cell value is empty in the source dataframe. That way, at least I'll have four columns, with the values for each of the four new cells being None.
I've received input that I should use apply() and fillna(), but I don't understand how that should be handled in my case, and the documentation isn't clear to me. It seems the method above needs added code that inserts None when no value is detected, but I'm not familiar enough with Python or pandas to know where to start on that.
EXAMPLE
What the returned data looks like. If no rows have an entry, the four columns won't be created, and they are required for my next step of merging this dataframe back into the dataframe with the original text content.
|index|0|1|2|3|
|---|---|---|---|---|
|0|||||
|1|||||
|2|||||
|3|||||
|4|||||
|5|magnitude:|0.6000000238418579|score:|-0.6000000238418579|
|6|magnitude:|0.10000000149011612|score:|0.10000000149011612|
|7|magnitude:|0.10000000149011612|score:|-0.10000000149011612|
|8|magnitude:|0.699999988079071|score:|-0.699999988079071|
|9|magnitude:|0.699999988079071|score:|-0.30000001192092896|
|10|magnitude:|0.699999988079071|score:|-0.30000001192092896|
As mentioned by @dsx, the response from Google Sentiment Analysis can be split into columns using the code below:
pd.DataFrame(df['sentiment'].apply(sentiment_pass).tolist(),columns=['magnitude', 'score'], index=df.index)
Sentiment Analysis is used to identify the prevailing emotions within the text using natural language processing. For more information, you can check this link.
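A minimal sketch of one way to guarantee the four columns exist even when some (or all) sentiment cells are empty; the sample data and the final column names here are my own assumptions, not from the question:

```python
import pandas as pd

# Sample frame: two populated cells and one empty string, mirroring
# the question (if empty cells are NaN rather than "", fillna("") first).
df01 = pd.DataFrame({
    "sentiment": ["magnitude: 0.6\nscore: -0.6", "", "magnitude: 0.1\nscore: 0.1"],
})

# str.split(expand=True) only creates as many columns as the widest row
# (and none at all if every cell is empty), so reindex forces exactly
# four columns regardless of the data.
df02 = df01["sentiment"].astype(str).str.split(expand=True).reindex(columns=range(4))

# Replace the missing cells with None, as the question asks.
df02 = df02.astype(object).where(df02.notna(), None)
df02.columns = ["magnitude_label", "magnitude", "score_label", "score"]

# Index-aligned merge back onto the original frame.
merged = df01.join(df02)
```

Because `reindex(columns=range(4))` runs unconditionally, the downstream merge no longer depends on at least one row having a sentiment value.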
I'm a beginner in coding, and I wrote some code in python pandas that I don't fully understand and need some clarification on.
Let's say this is the data: DeathYear, Age, Gender, and Country are all columns in an Excel file.
How to plot a table with non-numeric values in python?
I saw this question and used this command:
df.groupby('Gender')['Gender'].count().plot.pie(autopct='%.2f',figsize=(5,5))
it works and gives me a pie chart of the percentage of each gender,
but the normal pie chart command that I know for numerical data looks like this:
df["Gender"].plot.pie(autopct="%.2f",figsize=(5,5))
My question is: why did we add the .count()?
Is it to transform non-numerical data into numerical data?
And why did we use the groupby and type the column twice, ('Gender')['Gender']?
I'll address the second part of your question first, since it makes more sense to explain it that way.
The reason that you use ('Gender')['Gender'] is that it does two different things. The first ('Gender') is the argument to the groupby function. It tells you that you want the DataFrame to be grouped by the 'Gender' column. Note that the groupby function needs to have a column or level to group by or else it will not work.
The second ['Gender'] tells you to only look at the 'Gender' column in the resulting DataFrame. The easiest way to see what the second ['Gender'] does is to compare the output of df.groupby('Gender').count() and df.groupby('Gender')['Gender'].count() and see what happens.
One detail that I omitted in the first part, for clarity, is that the output of df.groupby('Gender') is not a DataFrame but a DataFrameGroupBy object. The details of what exactly this object is are not important to your question; the key is that to get a DataFrame back you need a function that says what to put in the rows of the DataFrame you wish to create. The .count() function is one of those options (along with many others, such as .mean()). In your case, since you want the total counts to make a pie chart, .count() does exactly that: it counts the number of times 'Female' and 'Male' appear in the 'Gender' column, and those sums become the entries in the corresponding rows. The resulting DataFrame can then be used to create a pie chart. So you are correct: the .count() function transforms the non-numeric 'Female' and 'Male' entries into numeric values corresponding to how often those entries appeared in the initial DataFrame.
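The comparison described above can be seen with a small made-up frame (the column names mirror the Excel file mentioned in the question):

```python
import pandas as pd

# A small frame standing in for the Excel data described above.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Female"],
    "Age": [34, 51, 27, 40],
    "Country": ["US", "UK", "US", "FR"],
})

# Counting every remaining column per group vs. counting just 'Gender':
per_column = df.groupby("Gender").count()             # DataFrame: one count column each for Age and Country
gender_only = df.groupby("Gender")["Gender"].count()  # Series: Female -> 3, Male -> 1

# gender_only holds exactly the numbers that
# .plot.pie(autopct='%.2f') turns into slice percentages.
```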
I have a CSV file:
salary = pd.read_csv('./datasets/salary.csv')
Is it possible to have an output like this?
Apologies for sharing only the concept, as you did not provide any code in the question. Consider adding example code if the concept alone is hard to follow.
This will require creating a new column, "Label", in the dataframe for each matching "Salary" value. For example, check the table in the link below:
Click to see a sample table to achieve desired columns
This "Label" column can be filled using if/else statements. Use a string function, or == "string in the Salary column", to write the conditional statements. Additionally, use a for loop if the dataframe has multiple entries for each type of salary. Second, create the three new columns of interest, i.e. "per annum", "p.a. + Super", and "p.d.". Now use if/else statements again on the Label column to enter values row-wise in each column of interest, based on the conditional statement.
This should let you achieve the desired entries.
Hope it helps.
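Since the question showed no data, the sketch below invents some Salary strings purely for illustration; `np.select` stands in for the if/else chain described above:

```python
import pandas as pd
import numpy as np

# Hypothetical data: the real CSV layout wasn't shown in the question,
# so these Salary strings are assumptions for illustration only.
salary = pd.DataFrame({
    "Salary": ["$80,000 per annum", "$90,000 p.a. + Super", "$400 p.d."],
})

# Build the "Label" column; np.select applies the first matching
# condition, playing the role of an if/elif/else chain.
labels = ["per annum", "p.a. + Super", "p.d."]
conditions = [
    salary["Salary"].str.contains("per annum", regex=False),
    salary["Salary"].str.contains("p.a. + Super", regex=False),
    salary["Salary"].str.contains("p.d.", regex=False),
]
salary["Label"] = np.select(conditions, labels, default="")

# Spread each amount into the column matching its label, blank elsewhere.
for label in labels:
    salary[label] = salary["Salary"].where(salary["Label"] == label, "")
```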
I am working with a CSV file and I need to find the greatest several items in a column. I was able to find the top value just by doing the standard looping through and comparing values.
My idea to get the top few values would be to either store all of the values from that column into an array, sort it, and then pull the last three indices. However I'm not sure if that would be a good idea in terms of efficiency. I also need to pull other attributes associated with the top value and it seems like separating out these column values would make everything messy.
Another thing that I thought about doing is having three variables and doing a running top value sort of deal, where every time I find something bigger I compare the "top three" amongst each other and reorder them. That also seems a bit complex and I'm not sure how I would implement it.
I would appreciate some ideas, or being told if I'm missing something obvious. Let me know if you need to see my sample code (I felt it was probably unnecessary here).
Edit: To clarify, if the column values are something like [2,5,6,3,1,7] I would want to have the values first = 7, second = 6, third = 5
Pandas looks perfect for your task:
import pandas as pd
df = pd.read_csv('data.csv')
df.nlargest(3, 'column name')
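One detail worth noting, since the question also asks about pulling the other attributes: `nlargest` returns whole rows, not just the ranked column. A quick sketch with made-up column names, using the values from the question's edit:

```python
import pandas as pd

# Stand-in for the CSV; 'value' is the column to rank, 'name' is
# a hypothetical extra attribute that travels along with each row.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "value": [2, 5, 6, 3, 1, 7],
})

top3 = df.nlargest(3, "value")  # rows with values 7, 6, 5, all columns intact
print(top3)
```

So there is no need to separate the column out, sort it, and rejoin the other attributes by hand.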
I have a spreadsheet that comes to me with a column containing FQDNs of computers. However, filtering this is difficult because of the unique names, so I ended up putting a new column next to the FQDN column and entering a less unique value based on that name. An example of this would be:
dc01spmkt.domain.com
new column value = "MARKETING"
All of the hosts will have a 3 letter designation so people can filter on the new column with the more generic titles.
My question is: Is there a way that I can script this so that when the raw sheet comes I can run the script and it will look for values in the old column to populate the new one? So if it finds 'mkt' together in the hostname field it writes MARKETING, or if it finds 'sls' it writes SALES?
If I understand you correctly, you should be able to do this with an IF/ISNUMBER/SEARCH formula as follows:
=IF(ISNUMBER(SEARCH("mkt",A1))=TRUE,"Marketing",IF(ISNUMBER(SEARCH("sls",A1))=TRUE,"Sales",""))
which would yield you the following:
asdfamkt Marketing
sls Sales
aj;sldkjfa
a;sldkfja
mkt Marketing
sls Sales
What this does: SEARCH returns the numbered position at which the text you are searching for begins in the field. ISNUMBER then returns TRUE or FALSE depending on whether SEARCH returned a number, meaning it found the three letters in question. Finally, IF says that when ISNUMBER is TRUE, you want to output "Marketing" (or whichever label applies).
You can chain out the IF arguments for as many three-letter designations as you need.
Hope this helped!
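If the sheet ever needs to be processed in Python rather than Excel, the same SEARCH-and-label logic could be sketched in pandas; the "Department" column name and the sample FQDNs below are my own assumptions:

```python
import pandas as pd
import numpy as np

# Hypothetical FQDNs mirroring the example in the question.
df = pd.DataFrame({
    "FQDN": ["dc01spmkt.domain.com", "dc02spsls.domain.com", "web01.domain.com"],
})

# Mirror the nested IF/ISNUMBER/SEARCH formula: first match wins,
# and anything with no match is left blank.
conditions = [
    df["FQDN"].str.contains("mkt", case=False),
    df["FQDN"].str.contains("sls", case=False),
]
df["Department"] = np.select(conditions, ["MARKETING", "SALES"], default="")
```

Extending it to more three-letter designations just means adding a condition and a label, rather than nesting another IF.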