How to count the occurrences of a value in a data frame? - python

I have a data frame [df] that looks like this but much larger:
   title of the novel       author            publishing year  mentionned cities
0  Beasts and creatures     Bruno Ivory       2021             New York
1  Monsters                 Renata Mcniar     2023             New York
2  At risk                  Charles Dobi      2020             London
3  Manuela and Ricardo      Lucas Zacci       2022             Rio de Janeiro
4  War against the machine  Angelina Trotter  1999             Rio de Janeiro
I would like to add another column with the objective of counting all the occurrences of the cities. The problem is that I want to maintain the year of each occurrence, as I work with history. In other words, it is important for me to be able to know when the city was mentioned.
The expected outcome would look like this:
   title of the novel       author            publishing year  mentionned cities  Counter
0  Beasts and creatures     Bruno Ivory       2021             New York           1
1  Monsters                 Renata Mcniar     2023             New York           2
2  At risk                  Charles Dobi      2020             London             1
3  Manuela and Ricardo      Lucas Zacci       2022             Rio de Janeiro     1
4  War against the machine  Angelina Trotter  1999             Rio de Janeiro     2
So far, I have just managed to count all the occurrences, but I could not relate it to the publishing years. The code I am using is:
df['New York'] = df.eq('New York').sum().to_frame().T
Can someone help me, please?
Edit:
I tried joining two dataframes and got something interesting, but not what I really wanted. The problem is that it does not keep the publishing year on track.
df['counter'] = df.groupby('mentionned cities')['mentionned cities'].transform('count')
result = pd.concat([df['New York'], df], axis=1, join='inner')
display(result)
Output:
   title of the novel       author            publishing year  mentionned cities  Counter
0  Beasts and creatures     Bruno Ivory       2021             New York           2
1  Monsters                 Renata Mcniar     2023             New York           2
2  At risk                  Charles Dobi      2020             London             1
3  Manuela and Ricardo      Lucas Zacci       2022             Rio de Janeiro     1
4  War against the machine  Angelina Trotter  1999             Rio de Janeiro     2
The problem still lingers.

df['Counter'] = df.groupby('mentionned cities').cumcount() + 1
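For reference, here is a minimal sketch of that answer applied to a reconstruction of the sample data (my own reconstruction; the column name is kept spelled as in the question):

import pandas as pd

# Rebuild the relevant columns of the sample frame.
df = pd.DataFrame({
    'publishing year': [2021, 2023, 2020, 2022, 1999],
    'mentionned cities': ['New York', 'New York', 'London',
                          'Rio de Janeiro', 'Rio de Janeiro'],
})

# cumcount numbers the rows within each city group starting at 0, so
# adding 1 yields a running 1, 2, 3, ... per city while every other
# column (including the publishing year) stays untouched.
df['Counter'] = df.groupby('mentionned cities').cumcount() + 1
print(df)
#    publishing year mentionned cities  Counter
# 0             2021          New York        1
# 1             2023          New York        2
# 2             2020            London        1
# 3             2022    Rio de Janeiro        1
# 4             1999    Rio de Janeiro        2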

Perhaps you could use a for loop to iterate through the 'mentionned cities' column, and use a dict to count the occurrences of the cities:
city_count = {}
count_column = []
for city in df['mentionned cities']:
    city_count[city] = city_count.get(city, 0) + 1
    count_column.append(city_count[city])
df['Counter'] = count_column

If I understand it right, you can just concatenate both columns and do the same thing you did before. You first need to convert the year to a string, which can be done with the str function (or astype(str) in pandas); then it is easy to join both strings with +.
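A minimal sketch of that idea, assuming the column names from the question; the combined key counts occurrences of each city-year pair, mirroring the transform attempt from the edit above:

# Convert the year to a string and glue it to the city name, then
# count how often each combined city-year key appears.
df['city_year'] = df['mentionned cities'] + ' ' + df['publishing year'].astype(str)
df['Counter'] = df.groupby('city_year')['city_year'].transform('count')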

str.findall returns all NA's

I have this df1 with a lot of different news articles. An example of a news article is this:
'Today is Monday Aug. 17 the 230th day of 2020 . There are 136 days left in the year . On August 17 2017 a van plowed through pedestrians along a packed promenade in the Spanish city of Barcelona killing 13 people and injuring 120 . A 14th victim died later from injuries . Another man was stabbed to death in a carjacking that night as the van driver made his getaway and a woman died early the next day in a vehicle-and-knife attack in a nearby coastal town . Six by police two more died when a bomb workshop exploded . In 1915 a mob in Cobb County Georgia lynched Jewish businessman Leo Frank 31 whose death sentence for the murder of 13-year-old Mary Phagan had been commuted to life imprisonment . Frank who d maintained his innocence was pardoned by the state of Georgia in 1986 . In 1960 the newly renamed Beatles formerly the Silver Beetles began their first gig in Hamburg West Germany Teamsters union president Jimmy Hoffa was sentenced in Chicago to five years in federal prison for defrauding his union s pension fund . Hoffa was released in 1971 after President Richard Nixon commuted his sentence for this conviction and jury tampering . In 1969 Hurricane Camille slammed into the Mississippi coast as a Category 5 storm that was blamed for 256 U.S. deaths three in Cuba . In 1978 the first successful trans-Atlantic balloon flight ended as Maxie Anderson Ben Abruzzo and Larry Newman landed In 1982 the first commercially produced compact discs a recording of ABBA s The Visitors were pressed at a Philips factory near Hanover West Germany .'
And I have this df2 with all the words from the news articles in the column "Word" with their corresponding LIWC category in the second column.
Data example:
data = {'Word': ['killing','even','guilty','brain'], 'Category': ['Affect', 'Adverb', 'Anx','Body']}
What I'm trying to do is: to calculate, for each article in df1, how many words of each category in df2 occur. So I want to create a column for each category mentioned in df2["Category"].
And it should look like this in the end:
Content | Achieve | Affiliation | Affect
article text here | 6 | 2 | 2
article text here | 2 | 43 | 2
article text here | 6 | 8 | 8
article text here | 2 | 13 | 7
Since it's all strings, I tried str.findall, but this returns all NA's for everything. This is what I tried:
from collections import Counter

liwc = df1['articles'].str.findall(fr"'({'|'.join(df2)})'") \
                      .apply(lambda x: pd.Series(Counter(x), index=df2["Category"].unique())) \
                      .fillna(0).astype(int)
Either a pandas or an R solution would be equally great.
First flatten the df2 values into a dictionary and add word boundaries \b...\b, then pass the pattern to Series.str.extractall so you can use Series.map; create a DataFrame via reset_index, and finally pass the result to crosstab and append it to the original with DataFrame.join:
df1 = pd.DataFrame({'articles': ['Today is killing Aug. 17 the 230th day of 2020',
                                 'Today is brain Aug. 17 the guilty day of 2020 ']})
print(df1)
articles
0 Today is killing Aug. 17 the 230th day of 2020
1 Today is brain Aug. 17 the guilty day of 2020
If the Word column contains lists of values, as in the picture:
data = {'Word': [['killing'], ['even'], ['guilty'], ['brain']],
        'Category': ['Affect', 'Adverb', 'Anx', 'Body']}
df2 = pd.DataFrame(data)
print(df2)
Word Category
0 [killing] Affect
1 [even] Adverb
2 [guilty] Anx
3 [brain] Body
d = {x: b for a, b in zip(df2['Word'], df2['Category']) for x in a}
print(d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
If df2 is different:
data = {'Word': ['killing', 'even', 'guilty', 'brain'],
        'Category': ['Affect', 'Adverb', 'Anx', 'Body']}
df2 = pd.DataFrame(data)
print(df2)
      Word Category
0  killing   Affect
1     even   Adverb
2   guilty      Anx
3    brain     Body
d = dict(zip(df2['Word'], df2['Category']))
print(d)
{'killing': 'Affect', 'even': 'Adverb', 'guilty': 'Anx', 'brain': 'Body'}
import re

# thanks to Wiktor Stribiżew for improving the pattern
pat = r"\b(?:{})\b".format("|".join(re.escape(x) for x in d))

df = df1['articles'].str.extractall(rf'({pat})')[0].map(d).reset_index(name='Category')
df = df1.join(pd.crosstab(df['level_0'], df['Category']))
print(df)
articles Affect Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 1 1
You can craft a custom regex with named capturing groups and use str.extractall.
With your dictionary the custom regex would be '(?P<Affect>\\bkilling\\b)|(?P<Adverb>\\beven\\b)|(?P<Anx>\\bguilty\\b)|(?P<Body>\\bbrain\\b)'
Then groupby + max the notna results, convert to int, and join to the original dataframe:
regex = '|'.join(fr'(?P<{k}>\b{v}\b)' for v, k in zip(*data.values()))

(df1.join(df1['articles'].str.extractall(regex, flags=2)  # flags=2 is re.IGNORECASE
          .notna().groupby(level=0).max()
          .astype(int)
          )
)
output:
articles Affect Adverb Anx Body
0 Today is killing Aug. 17 the 230th day of 2020 1 0 0 0
1 Today is brain Aug. 17 the guilty day of 2020 0 0 1 1
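Note that .max() only records presence (0/1) per article; if a word can occur several times and you want actual counts, summing the non-NaN matches should work instead (a sketch under the same assumptions as the answer above):

counts = (df1['articles'].str.extractall(regex, flags=2)
                         .notna().groupby(level=0).sum().astype(int))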

How do you use a function to aggregate dataframe columns and sort them into quarters, based on European-format dates?

Hi I’m new to pandas and struggling with a challenging problem.
I have 2 dataframes:
Df1
Superhero ID  Superhero   City
212121        Spiderman   New york
364331        Ironman     New york
678523        Batman      Gotham
432432        Dr Strange  New york
665544        Thor        Asgard
123456        Superman    Metropolis
555555        Nightwing   Gotham
666666        Loki        Asgard
And
Df2
SID     Mission End date
665544  10/10/2020
665544  03/03/2021
212121  02/02/2021
665544  05/12/2020
212121  15/07/2021
123456  03/06/2021
666666  12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter their missions will be complete. Also note the dates are written in the European format (day/month/year).
I am able to summarize how many heroes are in each city with the line:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which gives me:
City        Count
New york    3
Gotham      2
Asgard      2
Metropolis  1
I need to add another column that lists whether the hero will be free from missions in certain quarters.
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
If a hero ID in Df2 does not have a mission end date, the count should increase by one. If they do have an end date, the count goes to the quarter in which that end date falls.
So in the end it should look like this:
City        Total Count  No. of heroes free in Q3  No. of heroes free in Q4  Free in Q1 2021+
New york    3            2                         0                         1
Gotham      2            2                         2                         0
Asgard      2            1                         2                         0
Metropolis  1            0                         0                         1
I think I need to use the Python datetime library to get the current date and time, then create a custom function which I can apply to each row using a lambda. Something similar to the code below:
from datetime import date

today = date.today()
q1 = '05/04/2021'
q3 = '05/10/2020'
q4 = '05/01/2021'
count = 0

def QuarterCount(Eid, AssignmentEnd)
    if df1['Superhero ID'] == df2['SID']:
        if df2['Mission End date'] < q3:
            ++count
            return count
        elif df2['Mission End date'] > q3 && < q4:
            ++count
            return count
        elif df2['Mission End date'] > q1:
            ++count
            return count

df['No. of heroes free in Q3'] = df1[].apply(lambda x(QuarterCount))
Please help me correct my syntax or logic or let me know if there is a better way to do this. Learning pandas is challenging but oddly fun. I'd appreciate any help you can provide :)
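A hedged sketch of one possible approach (my own naming, not a tested solution): parse the dates with dayfirst=True, map the months onto the custom quarters defined above, then merge with Df1 and pivot by city. Heroes with no mission at all would still need to be added to the free counts separately.

import pandas as pd

# Assumes df1 and df2 are DataFrames with the columns shown above.
# Parse European-format dates (day/month/year).
df2['end'] = pd.to_datetime(df2['Mission End date'], dayfirst=True)

# Map calendar months onto the custom quarters from the question
# (Apr-Jun -> Q1, Jul-Sep -> Q2, Oct-Dec -> Q3, Jan-Mar -> Q4).
quarter_map = {4: 'Q1', 5: 'Q1', 6: 'Q1', 7: 'Q2', 8: 'Q2', 9: 'Q2',
               10: 'Q3', 11: 'Q3', 12: 'Q3', 1: 'Q4', 2: 'Q4', 3: 'Q4'}
df2['quarter'] = df2['end'].dt.month.map(quarter_map)

# Attach each hero's city, then count heroes per city and quarter.
merged = df2.merge(df1, left_on='SID', right_on='Superhero ID')
summary = pd.crosstab(merged['City'], merged['quarter'])
print(summary)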

Pandas read_html returned column with NaN values in Python

I am trying to parse the table located here using the pandas read_html function. I was able to parse the table; however, the Capacity column came back as NaN, and I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url = 'Above url'
df1 = pd.read_html(wiki_url, index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',
                  header=[0], flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
...
To remove anything inside square brackets, use (regex=True keeps this working on newer pandas, where str.replace no longer treats the pattern as a regex by default):
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
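Since the cleaned column still mixes plain integers with comma-separated strings, a possible follow-up (my addition, not part of the original answer) to get a numeric Capacity column:

# Strip footnote markers and thousands separators, then coerce to
# numbers; astype(str) guards against cells already parsed as ints.
df['Capacity'] = pd.to_numeric(df['Capacity'].astype(str)
                                 .str.replace(r'\[.*?\]', '', regex=True)
                                 .str.replace(',', '', regex=False),
                               errors='coerce')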
Hope this helps.
Pandas is only able to get the superscript (for whatever reason) rather than the actual value; if you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where there are footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape the data yourself using BeautifulSoup, since pandas is looking at, and therefore returning, the wrong data.
The answer posted by @anky_91 was correct. I wanted to try another approach without using regex; below is my solution.
df4 = pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',
                   header=[0], flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" prefix used by @anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

Pandas group by but keep another column

Say that I have a dataframe that looks something like this
date location year
0 1908-09-17 Fort Myer, Virginia 1908
1 1909-09-07 Juvisy-sur-Orge, France 1909
2 1912-07-12 Atlantic City, New Jersey 1912
3 1913-08-06 Victoria, British Columbia, Canada 1912
I want to use the pandas groupby function to create an output that shows the total number of incidents by year but also keeps the location column, displaying one of the locations from that year. Any one of them works. So it would look something like this:
total location
year
1908 1 Fort Myer, Virginia
1909 1 Juvisy-sur-Orge, France
1912 2 Atlantic City, New Jersey
Can this be done without doing funky joining? The furthest I can get is using the normal groupby
df = df.groupby(['year']).count()
But that only gives me something like this:
      date  location
year
1908     1         1
1909     1         1
1912     2         2
How can I display one of the locations in this dataframe?
You can use groupby.agg and use 'first' to extract the first location in each group:
res = df.groupby('year')['location'].agg(['first', 'count'])
print(res)
# first count
# year
# 1908 Fort Myer, Virginia 1
# 1909 Juvisy-sur-Orge, France 1
# 1912 Atlantic City, New Jersey 2
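If you want the column names to match the desired layout, a rename (my naming choice, not part of the original answer) should do:

res = res.rename(columns={'count': 'total', 'first': 'location'})[['total', 'location']]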

Pandas .groupby automatically selecting column

From the following dataset:
I'm trying to use .groupby to create a set where I get the average Status Count per User Location. I've already done this for Follower Count by using
groupLoc = df.groupby('User Location')
groupCount = groupLoc.mean()
groupCount
Which automatically selected User Location vs Follower Count. Now I'm trying to do the same for User Location vs Status Count, but it's automatically including Follower Count again.
Anyone know how to fix this? Thanks in advance!
I think you need groupby with mean:
print(df.groupby('User Location', as_index=False)['Follower Count'].mean())
User Location Follower Count
0 Canada 1654.500000
1 Chicago 9021.000000
2 Indonesia 1352.666667
3 London 990.000000
4 Los Angeles CA 86.000000
5 New York 214.000000
6 Singapore 106.500000
7 Texas 181.000000
8 UK 2431.000000
9 indonesia 316.000000
10 null 295.750000
print(df.groupby('User Location', as_index=False)['Status Count'].mean())
User Location Status Count
0 Canada 39299.000000
1 Chicago 6402.000000
2 Indonesia 12826.000000
3 London 4864.666667
4 Los Angeles CA 3230.000000
5 New York 2947.000000
6 Singapore 6785.500000
7 Texas 901.000000
8 UK 81440.000000
9 indonesia 17662.000000
10 null 29610.875000
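If you want both averages side by side instead, you can select a list of columns before calling mean:

print(df.groupby('User Location', as_index=False)[['Follower Count', 'Status Count']].mean())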
