I'm trying to parse some data and I can't seem to use .replace to remove the junk, e.g. [BluRay] and other data that is not the year or resolution. My end goal is to end up with columns of: Movie Name, Year and Resolution
,Movie Name,others
0,James Bond The Spy Who Loved Me ,1977) [1080p]
1,James Bond Live And Let Die ,1973) [1080p]
2,No Time To Die ,2021) [1080p] [BluRay] [5.1] [YTS.MX]
3,James Bond The Man With The Golden Gun ,1974) [1080p]
4,Casino Royale ,2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5,James Bond Moonraker ,1979) [1080p]
6,James Bond Licence To Kill ,1989) [1080p]
7,James Bond A View To A Kill ,1985) [1080p]
8,James Bond The Living Daylights ,1987) [1080p]
The code I'm using is:
df['others']=df['others'].replace(to_replace=[['','BluRay']],value='')
Can anyone see where I'm going wrong?
Given:
Movie Name others
0 James Bond The Spy Who Loved Me 1977) [1080p]
1 James Bond Live And Let Die 1973) [1080p]
2 No Time To Die 2021) [1080p] [BluRay] [5.1] [YTS.MX]
3 James Bond The Man With The Golden Gun 1974) [1080p]
4 Casino Royale 2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5 James Bond Moonraker 1979) [1080p]
6 James Bond Licence To Kill 1989) [1080p]
7 James Bond A View To A Kill 1985) [1080p]
8 James Bond The Living Daylights 1987) [1080p]
To clean this up I would do:
df['Movie Name'] = df['Movie Name'].str.strip()
df[['Year', 'Resolution']] = df['others'].str.extract(r'(\d{4})\).*\[(.*p)]')
print(df[['Movie Name', 'Year', 'Resolution']])
Output:
Movie Name Year Resolution
0 James Bond The Spy Who Loved Me 1977 1080p
1 James Bond Live And Let Die 1973 1080p
2 No Time To Die 2021 1080p
3 James Bond The Man With The Golden Gun 1974 1080p
4 Casino Royale 2006 2160p
5 James Bond Moonraker 1979 1080p
6 James Bond Licence To Kill 1989 1080p
7 James Bond A View To A Kill 1985 1080p
8 James Bond The Living Daylights 1987 1080p
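For what it's worth, the original .replace call never matched because Series.replace compares whole cell values by default, so 'BluRay' is not found inside a longer string. Substring removal needs Series.str.replace (or replace with regex=True). A minimal sketch on one sample value, with the bracketed tags taken from the data above:

```python
import pandas as pd

df = pd.DataFrame({'others': ['2021) [1080p] [BluRay] [5.1] [YTS.MX]']})

# str.replace operates on substrings; plain Series.replace only matches
# entire cell values unless regex=True is passed.
df['others'] = df['others'].str.replace(
    r'\s*\[(BluRay|5\.1|YTS\.MX)\]', '', regex=True
)
print(df['others'].iloc[0])  # 2021) [1080p]
```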
I have some football data that I am modifying for analysis. I basically want to calculate career and yearly per game averages on a weekly basis for several stats.
Example
What I have:
Player        Year  Week  Rushing Yards  Catches
Seth Johnson  2020  1     100            4
Seth Johnson  2020  2     80             2
Seth Johnson  2021  1     50             3
Seth Johnson  2021  2     40             2
What I want:
Player        Year  Week  Rushing Yards  Catches  Career Rushing Yards per Game  Career Catches per Game  Yearly Rushing Yards per Game  Yearly Catches per Game
Seth Johnson  2020  1     100            4        100                            4                        100                            4
Seth Johnson  2020  2     80             2        90                             3                        90                             3
Seth Johnson  2021  1     50             3        76.67                          3                        50                             3
Seth Johnson  2021  2     40             2        67.5                           2.75                     45                             2.5
I figure I could calculate the Career stats and Yearly stats separately and then join everything on Player/Year/Week, but I'm not sure how to go about calculating the moving averages given that the window would be dependent on Year and Week.
I've tried things like looping through the desired categories and calculating rolling averages:
new_df['Career ' + category + ' per Game'] = df.groupby('Player')[category].apply(lambda x: x.rolling(3, min_periods=0).mean())
But I'm not finding the creativity necessary to make the appropriate custom window for rolling(). Does anyone have any ideas here?
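One way around the custom-window problem is to notice these are expanding (cumulative) means rather than fixed-size rolling ones: the career average expands over all of a player's games, and the yearly average expands within each Player/Year group. A sketch using groupby(...).transform with expanding().mean(), with the column names and values taken from the desired-output table above:

```python
import pandas as pd

df = pd.DataFrame({
    'Player': ['Seth Johnson'] * 4,
    'Year': [2020, 2020, 2021, 2021],
    'Week': [1, 2, 1, 2],
    'Rushing Yards': [100, 80, 50, 40],
    'Catches': [4, 2, 3, 2],
})

# Rows must be in chronological order before taking cumulative means.
df = df.sort_values(['Player', 'Year', 'Week'])

for category in ['Rushing Yards', 'Catches']:
    # Career: expanding mean over everything the player has done so far.
    df['Career ' + category + ' per Game'] = (
        df.groupby('Player')[category].transform(lambda s: s.expanding().mean())
    )
    # Yearly: expanding mean that restarts for each player/year.
    df['Yearly ' + category + ' per Game'] = (
        df.groupby(['Player', 'Year'])[category].transform(lambda s: s.expanding().mean())
    )
```

transform keeps the result aligned with the original index, so no join back is needed.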
I'm looking for a pandas expression that will find the margin of victory, in percentage terms, between the two main candidates: find the candidate with the greater number of votes, work out what percentage of the county vote they have, then subtract the lesser candidate's vote percentage, all while ignoring the third-party candidate. I want to go from this:
YEAR STATE County CANDIDATE VOTES
2016 Ohio Medina County Donald Trump 184211
2016 Ohio Medina County Hillary Clinton 398271
2016 Ohio Medina County Gary Johnson 12993
2016 Ohio Cuyahoga County Donald Trump 54810
2016 Ohio Cuyahoga County Hillary Clinton 32182
2016 Ohio Cuyahoga County Gary Johnson 2975
..to this
YEAR STATE County CANDIDATE VOTES MARGIN OF VICTORY
2016 Ohio Medina County Donald Trump 184211 Hillary Clinton +35.1%
2016 Ohio Medina County Hillary Clinton 398271 Hillary Clinton +35.1%
2016 Ohio Medina County Gary Johnson 12993 Hillary Clinton +35.1%
2016 Ohio Cuyahoga County Donald Trump 54810 Donald Trump +24.6%
2016 Ohio Cuyahoga County Hillary Clinton 32182 Donald Trump +24.6%
2016 Ohio Cuyahoga County Gary Johnson 2975 Donald Trump +24.6%
I'm unsure how you intend to calculate the margins, so the below outlines the approach you can take rather than the exact answer.
You would first need to aggregate the data at the YEAR/STATE/County level:
df_agg = df.groupby(by=['YEAR','STATE','County'])['VOTES'].sum().reset_index().rename(columns={'VOTES':'AGG_VOTES'})
You can join this back to the original df and use the AGG_VOTES column to generate the required statistics:
df = pd.merge(df, df_agg, on=['YEAR','STATE','County'])
df['Candidate_Percentage'] = df['VOTES'] * 100 / df['AGG_VOTES']
You can further aggregate your df to roll it up into a consolidated df with the required margin of victory.
Here is a very awkward solution, but it works. The function do_county processes one county at a time.
def do_county(data):
    # Normalize by the county's total votes
    normalized = data.set_index('CANDIDATE').sort_values('VOTES') / data['VOTES'].sum()
    # Take the diff between the top two candidates
    return normalized.diff().tail(1)

df.groupby(['County', 'STATE', 'YEAR']).apply(do_county)
# VOTES
#County STATE YEAR CANDIDATE
#Cuyahoga County Ohio 2016 Donald Trump 0.251514
#Medina County Ohio 2016 Hillary Clinton 0.359478
I am sure there is a better way to solve the problem.
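To get all the way to the "Candidate +X%" strings of the desired output, the per-county percentages can be fed through a small group-wise function and merged back onto every row. A sketch, noting that the margins computed here come straight from the sample vote counts and so may differ slightly from the percentages quoted in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'YEAR': [2016] * 6,
    'STATE': ['Ohio'] * 6,
    'County': ['Medina County'] * 3 + ['Cuyahoga County'] * 3,
    'CANDIDATE': ['Donald Trump', 'Hillary Clinton', 'Gary Johnson'] * 2,
    'VOTES': [184211, 398271, 12993, 54810, 32182, 2975],
})

# Each candidate's share of their county's total vote.
df['PCT'] = df['VOTES'] * 100 / df.groupby(['YEAR', 'STATE', 'County'])['VOTES'].transform('sum')

def margin(group):
    # Winner's share minus the runner-up's share, formatted as a label;
    # nlargest(2) ignores the third-party candidate automatically.
    top2 = group.nlargest(2, 'PCT')
    lead = top2['PCT'].iloc[0] - top2['PCT'].iloc[1]
    return '%s +%.1f%%' % (top2['CANDIDATE'].iloc[0], lead)

margins = (df.groupby(['YEAR', 'STATE', 'County'])
             .apply(margin)
             .rename('MARGIN OF VICTORY')
             .reset_index())
df = df.merge(margins, on=['YEAR', 'STATE', 'County'])
```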
I have the following data, with two columns, "name" and "gross", in a table called train_df:
gross name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to read and extract the year from "name" to create a new column called "year", which I will then use to filter the data set by a specific year.
The new table will look like the following:
year gross name
2009 760507625.0 Avatar (2009)
1997 658672302.0 Titanic (1997)
2015 652270625.0 Jurassic World (2015)
2012 623357910.0 The Avengers (2012)
2008 534858444.0 The Dark Knight (2008)
I tried the apply and lambda approach, but got no results:
train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]
Is there a way to use something like Contains (as in C#) or isin to filter the strings in Python?
If you know for sure that your years are going to be at the end of the string, you can do
df['year'] = df['name'].str[-5:-1].astype(int)
This takes the column name, uses the str accessor to access the value of each row as a string, and takes the -5:-1 slice from it. Then, it converts the result to int, and sets it as the year column. This approach will be much faster than iterating over the rows if you have lots of data.
Alternatively, you could use regex for more flexibility using the .extract() method of the str accessor.
df['year'] = df['name'].str.extract(r'\((\d{4})\)').astype(int)
This extracts groups matching the expression \((\d{4})\) (Try it here), which means capture the numbers inside a pair of parentheses containing exactly four digits, and will work anywhere in the string. To anchor it to the end of your string, use a $ at the end of your regex like so: \((\d{4})\)$. The result is the same using regex and using string slicing.
Now we have our new dataframe:
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
5 532177324.0 Rogue One (2016) 2016
6 474544677.0 Star Wars: Episode I - The Phantom Menace (1999) 1999
7 459005868.0 Avengers: Age of Ultron (2015) 2015
8 448139099.0 The Dark Knight Rises (2012) 2012
9 436471036.0 Shrek 2 (2004) 2004
10 424668047.0 The Hunger Games: Catching Fire (2013) 2013
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006) 2006
12 415004880.0 Toy Story 3 (2010) 2010
13 409013994.0 Iron Man 3 (2013) 2013
14 408084349.0 Captain America: Civil War (2016) 2016
15 408010692.0 The Hunger Games (2012) 2012
16 403706375.0 Spider-Man (2002) 2002
17 402453882.0 Jurassic Park (1993) 1993
18 402111870.0 Transformers: Revenge of the Fallen (2009) 2009
19 400738009.0 Frozen (2013) 2013
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2 (... 2011
21 380843261.0 Finding Nemo (2003) 2003
22 380262555.0 Star Wars: Episode III - Revenge of the Sith (... 2005
23 373585825.0 Spider-Man 2 (2004) 2004
24 370782930.0 The Passion of the Christ (2004) 2004
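With the year column in place, the filtering the question asks about is a plain boolean mask rather than startswith or contains. A sketch on a few rows of the sample data:

```python
import pandas as pd

train_df = pd.DataFrame({
    'gross': [760507625.0, 658672302.0, 652270625.0],
    'name': ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)'],
})

# Slice the year out of the name, then filter with a boolean mask.
train_df['year'] = train_df['name'].str[-5:-1].astype(int)
films_2015 = train_df[train_df['year'] == 2015]
```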
You can use a regular expression with pandas.Series.str.extract for this:
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])
print(df.head())
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
The regular expression:
\(: find a literal opening parenthesis
(\d{4}): then, find 4 digits appearing next to each other; the parentheses here mean we're storing the 4 digits as a capture group (in this case, the group of digits we want to extract from the larger string)
\): then, find a closing parenthesis
$: all of the above MUST occur at the end of the string
When all of the above criteria are met, those 4 digits are extracted, or if no match is found, NaN is returned for that row.
Try this.
df = ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)','The Avengers (2012)', 'The Dark Knight (2008)', 'Rogue One (2016)','Star Wars: Episode I - The Phantom Menace (1999)','Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)','Shrek 2 (2004)', 'Boiling Point (1990)', 'Terror Firmer (1999)', 'Adam's Apples (2005)', 'I Want You (1998)', 'Chalet Girl (2011)','Love, Honor and Obey (2000)', 'Perrier's Bounty (2009)','Into the White (2012)', 'The Decoy Bride (2011)','I Spit on Your Grave 2 (2013)']
for i in df:
    mov_title = i[:-7]
    year = i[-5:-1]
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction
def getYear(val):
    startIndex = val.find('(')
    endIndex = val.find(')')
    return val[startIndex + 1:endIndex]

I'm not much of a Python dev, but I believe this will do. You will just need to loop through, passing each value to the above function. Each call will give you the year extracted for you.
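To use this without an explicit loop, the function can be applied across the column with Series.apply. Note that find('(') locates the first parenthesis, so this sketch assumes the year's parentheses are the only ones in the title:

```python
import pandas as pd

def getYear(val):
    # Take whatever sits between the first '(' and the first ')'.
    startIndex = val.find('(')
    endIndex = val.find(')')
    return val[startIndex + 1:endIndex]

train_df = pd.DataFrame({'name': ['Avatar (2009)', 'Titanic (1997)']})
train_df['year'] = train_df['name'].apply(getYear).astype(int)
```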
I have just started python and am trying to rewrite one of my perl scripts in python. Essentially, I had a long script to convert a csv to json.
I've tried to import my csv into a pandas dataframe, and wanted to insert a header row at the top, since my csv lacks that.
Code:
import pandas
db=pandas.read_csv("netmedsdb.csv",header=None)
db
Output:
0 1 2 3
0 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
I wrote the following code to insert the first element at row 0, column 0:
db.insert(loc=0,column='0',value='Brand')
db
Output:
0 0 1 2 3
0 Brand 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 Brand BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 Brand KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 Brand RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 Brand 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 Brand AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 Brand RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 Brand VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 Brand VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
But unfortunately I got the word "Brand" inserted at column 0 in all rows.
I'm trying to add the header columns "Brand", "Generic", "Price", "Company".
You only need the names parameter in read_csv:
import pandas as pd
from io import StringIO

temp = u"""a,b,10,d
e,f,45,r
"""
# after testing, replace StringIO(temp) with 'netmedsdb.csv'
df = pd.read_csv(StringIO(temp), names=["Brand", "Generic", "Price", "Company"])
print(df)
Brand Generic Price Company
0 a b 10 d
1 e f 45 r
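If the file has already been read without names, an alternative to re-reading it is to assign the header afterwards. DataFrame.insert, by contrast, adds a whole new column of values, which is why "Brand" appeared in every row. A sketch on two sample rows:

```python
import pandas as pd

# As read with header=None: columns are the default integers 0..3.
db = pd.DataFrame([
    ['3M CAVILON NO STING BARRIER FILM SPRAY 28ML', 'OTC 0', 'Rs.880.00', '3M INDIA LTD'],
    ['BACTI BAR SOAP 75GM', 'OTC', 'Rs.98.00', '6TH SKIN PHARMACEUTICALS PVT LTD'],
])

# Replace the default integer header with real column names.
db.columns = ['Brand', 'Generic', 'Price', 'Company']
```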
I would like to create a corpus composed by the body of different articles stored in a JSON format. They are in different files named after the year, for example:
with open('Scot_2005.json') as f:
data = [json.loads(line) for line in f]
corresponds to a newspaper, the Scotsman, for the year 2005. Moreover, the rest of the files for this newspaper are named Scot_2006 ... Scot_2015. Also, I have another newspaper, the Scottish Daily Mail, that covers only the years 2014-2015: SDM_2014, SDM_2015. I would like to create a common list with the body of all these articles:
doc_set = [d['body'] for d in data]
My problem is looping the first part of the code that I posted so that data corresponds to all articles rather than just the ones from a given newspaper at a given year. Any ideas of how to accomplish this task? In my attempt I try using Pandas such:
for i in range(2005,2016):
    df = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    doc_set = df.body
The problems with this method seem to be: it does not append all the years, and I am not sure how to include other newspapers with time intervals other than 2005-15. The outcome of this method looks like:
date
2015-12-31 The Institute of Directors (IoD) has added its...
2015-12-31 It is startling to see how much the Holyrood l...
2015-12-31 A hike in interest rates in the new year will ...
2015-12-31 The First Minister has resolved to make 2016 a...
2015-12-30 The Scottish Government announced yesterday th...
2015-12-30 The Footsie closed lower amid falling oil pric...
2015-12-28 BEFORE we start the guessing game for 2016, a ...
2015-12-27 AS WE ushered in 2015, few would have predicte...
2015-12-23 No matter how hard Derek McInnes and his Aberd...
2015-12-21 THE HEAD of a Scottish Government task force s...
2015-12-17 A Scottish local authority has fought off a le...
2015-12-17 Markets lifted after the Federal Reserve hiked...
2015-12-17 Significant increases in UK quotas for fish in...
2015-12-17 WAR of words with Donald Trump suggests its ti...
2015-12-16 SCOTLAND'S national performance companies have...
2015-12-15 Markets jumped ahead of what investors expect ...
2015-12-14 Political uncertainty in back seat as transpor...
2015-12-11 The International Monetary Fund (IMF) has warn...
2015-12-08 Scotland has a "spring in its step" with the j...
2015-12-07 London's leading share index struggled for dir...
2015-12-03 REDUCING carbon is just the start of it, write...
2015-11-26 One of the country's most prized salmon rivers...
2015-11-23 Tax and legislative changes undermine strong f...
2015-11-23 A second House of Lords committee has called f...
2015-11-14 At first glance, Scotland's economic performan...
2015-11-13 THE United States has long been viewed as the ...
2015-11-12 IT IS vital for a new governance group to rest...
2015-11-12 Former SSE chief Ian Marchant has criticised r...
2015-11-11 Telecoms firm TalkTalk said it will take a hit...
2015-11-09 Improvements to consumer rights legislation ma...
...
2015-02-25 Traders baulked at an assault on the 7,000 lev...
2015-02-24 BRITISH military personnel are to be deployed ...
2015-02-20 DAVID Cameron has announced a £859 million inv...
2015-02-16 Falling oil prices and slowing inflation have ...
2015-02-14 DEFENCE spending cuts and falling oil prices h...
2015-02-14 Brent crude rallied to a 2015 high and helped ...
2015-02-12 THE HOUSING markets in Scotland and Northern I...
2015-02-10 INVESTMENT in Scotland's commercial property m...
2015-02-09 Investors took flight after Greece's new gover...
2015-02-01 Experts say large numbers are delaying decisio...
2015-01-29 MORE than 300 jobs are at risk after Tesco sai...
2015-01-27 THE Three Bears have hit out at the Rangers bo...
2015-01-21 GEORGE Osborne has challenged the right of SNP...
2015-01-19 Employment figures this week should show Briti...
2015-01-19 Why haven't petrol pump prices fallen as fast ...
2015-01-18 Without an agreement on immediate action, the...
2015-01-17 A SECOND independence referendum could be trig...
2015-01-14 THE RETAILER, which like its rivals has come u...
2015-01-14 HOUSE prices in Scotland rose by more than 4 p...
2015-01-13 HOUSE builder Taylor Wimpey is preparing for a...
2015-01-13 Supermarket group Sainsbury's today said it wo...
2015-01-13 INFLATION has tumbled to its lowest level on r...
2015-01-12 BUSINESSES are bullish about their prospects ...
2015-01-11 FOR decades, oil has dripped through our natio...
2015-01-09 Shares in the housebuilding sector fell heavil...
2015-01-08 THE Bank of England is expected to leave inter...
2015-01-05 COMPANIES in Scotland are more optimistic abou...
2015-01-04 UK is doing OK, but uncertainty looms on mid-y...
2015-01-02 The London market began the new year in a subd...
2015-01-02 The famous election mantra of Bill Clinton's c...
Name: body, dtype: object
Assuming you have a file list:
file_name_list = ( 'Scot_2005.json', 'APJ_2006.json' )
You can append to a list like this:
data = list()
for file_name in file_name_list:
    with open(file_name, 'r') as json_file:
        for line in json_file:
            data.append(json.loads(line))
If you want to create the file_name_list programmatically, you can use the glob library.
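A sketch of the glob approach: collect every file matching the Prefix_YYYY.json naming pattern in one pass, so newspapers with different year ranges are picked up without hard-coding the intervals (the naming pattern and the 'body' field come from the question):

```python
import glob
import json

def load_bodies(pattern):
    """Collect the 'body' of every article across all matching JSON-lines files."""
    doc_set = []
    for file_name in sorted(glob.glob(pattern)):
        with open(file_name) as json_file:
            for line in json_file:
                doc_set.append(json.loads(line)['body'])
    return doc_set

# Matches Scot_2005.json ... Scot_2015.json, SDM_2014.json, SDM_2015.json, etc.
doc_set = load_bodies('*_[0-9][0-9][0-9][0-9].json')
```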