Counting unique words in a pandas column - python

I am having some difficulties with the following data (from a pandas dataframe):
Text
0 Selected moments from Fifa game t...
1 What I learned is that I am ...
3 Bill Gates kept telling us it was comi...
5 scenario created a month before the...
... ...
1899 Events for May 19 – October 7 - October CTOvision.com
1900 Office of Event Services and Campus Center Ope...
1901 How the CARES Act May Affect Gift Planning in ...
1902 City of Rohnert Park: Home
1903 iHeartMedia, Inc.
I would need to extract the count of unique words per row (after removing punctuation). So, for example:
Unique
0 6
1 6
3 8
5 6
... ...
1899 8
1900 8
1901 9
1902 5
1903 2
I tried to do it as follows:
df["Unique"]=df['Text'].str.lower()
df["Unique"]==Counter(word_tokenize('\n'.join( file["Unique"])))
but I did not get any count, only a list of words (without their frequency in that row).
Can you please tell me what is wrong?

First remove all punctuation, if you don't want it counted. Then leverage sets: str.split() followed by map(set) gives you a set of words per row, and since a set keeps only unique elements, its length is the unique-word count.
Chained
df['Text'].str.replace(r'[^\w\s]+', '', regex=True).str.split().map(set).str.len()
Stepwise
df['Text'] = df['Text'].str.replace(r'[^\w\s]+', '', regex=True)
df['New Text'] = df['Text'].str.split().map(set).str.len()

So, I'm just updating this as per the comments. This solution strips punctuation as well (it needs import string), and wraps the tokens in set() so that only unique words are counted.
df['Unique'] = df['Text'].apply(lambda x: len(set(x.translate(str.maketrans('', '', string.punctuation)).strip().split())))

Try this:
import pandas as pd
from collections import Counter

data = {'A': {0: 'John', 1: 'Bob'},
        'Desc': {0: 'Bill ,Gates Started Microsoft at 18 Bill',
                 1: 'Bill Gates, Again .Bill Gates and Larry Ellison'}}
df = pd.DataFrame(data)
df['Desc'] = df['Desc'].str.replace(r'[^\w\s]+', '', regex=True)
print(df.loc[:, "Desc"])
print(Counter(" ".join(df.loc[0:0, "Desc"]).split(" ")).items())
print(len(Counter(" ".join(df.loc[0:0, "Desc"]).split(" ")).items()))
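Putting the pieces together, here is a minimal end-to-end sketch of the per-row unique-word count; the sample rows are invented to mirror the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['Selected moments, from a Fifa game: a game!',
                            'What I learned is that I am...']})

# Lowercase, strip punctuation, split into words, deduplicate with set, count
df['Unique'] = (df['Text']
                .str.lower()
                .str.replace(r'[^\w\s]+', '', regex=True)
                .str.split()
                .map(set)
                .str.len())
print(df['Unique'].tolist())  # -> [6, 6]
```

Note that .str.len() works here because pandas applies len() element-wise to the object-dtype sets.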


How can I change a dataframe and remove duplicate cells

I have a dataframe like this:
I want to change it to this:
Here is one way to do it (an MRE in the question would have made it easier to share the result with this answer):
# Mask the value with an empty string when it matches the previous row
df['Model'] = df['Model'].mask(df['Model'].eq(df['Model'].shift(1)), '')
df
You can use df.groupby with group_keys=True.
df.groupby("Model", group_keys=True).apply(lambda x: x).drop('Model',axis=1)
                    tip   segment    pd    gear
Model
Mazda        0        3  Japanese  2020    auto
             1        2  Japanese  2016  manual
             2        3  Japanese  2020    auto
Toyota Camry 3      glx  Japanese  2019  manual
             4      gli  Japanese  2018  manual
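As a self-contained sketch of the mask approach above (the small frame is invented from the visible rows):

```python
import pandas as pd

df = pd.DataFrame({'Model': ['Mazda', 'Mazda', 'Mazda', 'Toyota Camry', 'Toyota Camry'],
                   'gear': ['auto', 'manual', 'auto', 'manual', 'manual']})

# Blank out Model wherever it repeats the value from the previous row
df['Model'] = df['Model'].mask(df['Model'].eq(df['Model'].shift(1)), '')
print(df['Model'].tolist())  # -> ['Mazda', '', '', 'Toyota Camry', '']
```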

How would I go about iterating through each row in a column and keeping a running tally of every substring that comes up? Python

Essentially what I am trying to do is go through the "External_Name" column, row by row, and get a count of unique substrings within each string, kind of like .value_counts().
External_Name                              Specialty
ABMC Hyperbaric Medicine and Wound Care    Hyperbaric/Wound Care
ABMC Kaukauna Laboratory Services          Laboratory
AHCM Sinai Bariatric Surgery Clinic        General Surgery
...                                        ...
For example, after running through the first three rows in "External_Name" the output would be something like
Output       Count
ABMC         2
Hyperbaric   1
Medicine     1
and          1
Wound        1
Care         1
So on and so forth. Any help would be really appreciated!
You can split at whitespace with str.split(), then explode the resulting word lists into individual rows and count the values with value_counts.
>>> df.External_Name.str.split().explode().value_counts()
ABMC 2
Hyperbaric 1
Medicine 1
and 1
Wound 1
Care 1
Kaukauna 1
Laboratory 1
Services 1
AHCM 1
Sinai 1
Bariatric 1
Surgery 1
Clinic 1
Name: External_Name, dtype: int64
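If you want the tally back as a two-column frame like the desired output above, a small sketch (the sample rows are taken from the question, the column names Output and Count are the question's):

```python
import pandas as pd

df = pd.DataFrame({'External_Name': ['ABMC Hyperbaric Medicine and Wound Care',
                                     'ABMC Kaukauna Laboratory Services',
                                     'AHCM Sinai Bariatric Surgery Clinic']})

# Split into words, one word per row, count them, then move the index into a column
counts = (df['External_Name'].str.split().explode()
          .value_counts()
          .rename_axis('Output')
          .reset_index(name='Count'))
print(counts.head(1))  # ABMC appears twice, so it tops the list
```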

How to group by column and take average of column weighted by another column?

I am trying to carry out what I thought would be a typical groupby and average problem on a DataFrame, but this has gotten a bit more complex than I had anticipated since the problem will deal with string/ordinal years and float values. I am using python. I will explain below.
I have a data frame showing different model years for different models of refrigerators across several counties in a state. I want to find the average model year of refrigerator for each county.
I have this example dataframe (abbreviated since the full dataframe would be far too long to show):
County_ID Type Year Population
--------------------------------------------
1 A 2022 54355
1 A 2021 54645
1 A 2020 14554
...
1 B 2022 23454
1 B 2021 34657
1 B 2020 12343
...
1 C 2022 23454
1 C 2021 34537
1 C 2020 23323
...
2 A 2022 54355
2 A 2021 54645
2 A 2020 14554
...
2 B 2022 23454
2 B 2021 34657
2 B 2020 12343
...
2 C 2022 23454
2 C 2021 34537
2 C 2020 23323
...
3 A 2022 54355
3 A 2021 54645
3 A 2020 14554
...
3 B 2022 23454
3 B 2021 34657
3 B 2020 12343
...
3 C 2022 23454
3 C 2021 34537
3 C 2020 23323
...
And so I kept this abbreviated for space, but the idea is that I have many counties in my data, with county IDs running from 1 all the way to 50, so 50 counties. In this example there are 3 types of refrigerators, and for each type the model-year vintages are shown, i.e. how old the refrigerator is. Population shows how many of each physical unit (a unique pair of type and year) is found in each county. What I am trying to find is, for each County_ID, the average year.
And so I want to produce the following DataFrame:
County_ID Average_vintage
--------------------------------
1 XXXX.XX
2 XXXX.XX
3 XXXX.XX
4 XXXX.XX
5 XXXX.XX
6 XXXX.XX
...
But here is what confuses me: I want to find the average year, but year is ordinal data and not float. What I want to do, I think, is weight it by population: if you want the average vintage of refrigerators, the vintage with a higher population should have more influence on the average. So I want to weight the vintages by population and treat the years like floats, so the average can carry a decimal, e.g. the average refrigerator vintage for County 22 is 2015.48 or something like that. That is what I am going for. I am trying this:
avg_vintage = df.groupby(['County_ID']).mean()
but I don't think this is really going to make much sense, since I need to account for how many (population) of each refrigerator there actually are in each county. How can I find the average year/vintage for each County, considering how many of each refrigerator (population) are found in each County using python?
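A population-weighted mean does exactly this; a minimal sketch using numpy.average (column names are from the question, the numbers are invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'County_ID': [1, 1, 2, 2],
                   'Year': [2020, 2022, 2018, 2022],
                   'Population': [100, 300, 50, 150]})

# Within each county, weight each model year by its population
avg_vintage = df.groupby('County_ID').apply(
    lambda g: np.average(g['Year'], weights=g['Population']))
print(avg_vintage)  # County 1 -> 2021.5, County 2 -> 2021.0
```

Years are treated as plain numbers here, which is what makes a fractional "average vintage" like 2015.48 possible.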

Sorting values in pandas series [duplicate]

This question already has answers here:
changing sort in value_counts
(4 answers)
Closed 3 years ago.
I have a movies dataframe that looks like this...
title decade
movie name 1 2000
movie name 2 1990
movie name 3 1990
movie name 4 2000
movie name 5 2010
movie name 6 1980
movie name 7 1980
I want to plot the number of movies per decade, which I am doing this way:
freq = movies['decade'].value_counts()
#freq returns me following
2000 56
1980 41
1990 37
1970 21
2010 9
# as you can see the value_counts() method returns a series sorted by the frequencies
freq = movies['decade'].value_counts(sort=False)
# now the frequencies are not sorted, because I want the distribution to be in sequence of decade year
# and not its frequency so I do something like this...
movies = movies.sort_values(by='decade', ascending=True)
freq = movies['decade'].value_counts(sort=False)
Now the Series freq should be sorted with respect to decades, but it is not, although movies is sorted.
Can someone tell me what I am doing wrong? Thanks.
The expected output I am looking for is something like this...
1970 21
1980 41
1990 37
2000 56
2010 9
movies['decade'].value_counts()
returns a series with the decade as index, sorted descending by count. To sort by decade instead, just append .sort_index():
movies['decade'].value_counts().sort_index()
or
movies['decade'].value_counts().sort_index(ascending=False)
should do the trick.
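A self-contained sketch with the decades from the question's small sample (so the counts differ from the full data shown above):

```python
import pandas as pd

movies = pd.DataFrame({'decade': [2000, 1990, 1990, 2000, 2010, 1980, 1980]})

# Count movies per decade, then order by decade instead of by frequency
freq = movies['decade'].value_counts().sort_index()
print(freq.index.tolist())  # -> [1980, 1990, 2000, 2010]
```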

Conditional copy of values from one column to another columns

I have a pandas dataframe that looks something like this:
name job jobchange_rank date
Thisguy Developer 1 2012
Thisguy Analyst 2 2014
Thisguy Data Scientist 3 2015
Anotherguy Developer 1 2018
The jobchange_rank represents each individual's (based on name) ranked change in position, where rank 1 represents his/her first position, rank 2 the second, etc.
Now for the fun part. I want to create a new column where I can see a person's previous job, something like this:
name job jobchange_rank date previous_job
Thisguy Developer 1 2012 None
Thisguy Analyst 2 2014 Developer
Thisguy Data Scientist 3 2015 Analyst
Anotherguy Developer 1 2018 None
I've created the following code to get the "None" values where there was no job change:
df.loc[df['jobchange_rank'].sub(df['jobchange_rank'].min()) == 0, 'previous_job'] = 'None'
Sadly, I can't seem to figure out how to get the values from the other column where the needed condition applies.
Any help is more then welcome!
Thanks in advance.
This answer assumes that your DataFrame is sorted by name and jobchange_rank, if that is not the case, sort first.
# df = df.sort_values(['name', 'jobchange_rank'])
m = df['name'].eq(df['name'].shift())
df['job'].shift().where(m)
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Or using a groupby + shift (assuming at least sorted by jobchange_rank)
df.groupby('name')['job'].shift()
0 NaN
1 Developer
2 Analyst
3 NaN
Name: job, dtype: object
Although the groupby + shift is more concise, on larger inputs, if your data is already sorted like your example, it may be faster to avoid the groupby and use the first solution.
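For completeness, a runnable sketch of the groupby + shift version using the question's frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Thisguy', 'Thisguy', 'Thisguy', 'Anotherguy'],
                   'job': ['Developer', 'Analyst', 'Data Scientist', 'Developer'],
                   'jobchange_rank': [1, 2, 3, 1]})

# Within each name, the previous job is just the job column shifted down one row
df['previous_job'] = df.groupby('name')['job'].shift()
print(df['previous_job'].tolist())  # -> [nan, 'Developer', 'Analyst', nan]
```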
