Trouble merging scraped data using Pandas and NumPy in Python
I am trying to collect information from a lot of different URLs and combine the data based on year and golfer name. Right now I write the information to CSV and then match it with pd.merge(), but that forces me to use a unique name for each dataframe being merged. I also tried a NumPy array, but I am stuck on the final step of getting all the separate data merged together.
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np
base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']
alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html',
        'http://www.pgatour.com/stats/categories.ROTT_INQ.html',
        'http://www.pgatour.com/stats/categories.RAPP_INQ.html',
        'http://www.pgatour.com/stats/categories.RARG_INQ.html',
        'http://www.pgatour.com/stats/categories.RPUT_INQ.html',
        'http://www.pgatour.com/stats/categories.RSCR_INQ.html',
        'http://www.pgatour.com/stats/categories.RSTR_INQ.html',
        'http://www.pgatour.com/stats/categories.RMNY_INQ.html',
        'http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:])  #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma = []
for i in beta:
    if i not in gamma:
        gamma.append(i)
#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section', {'class': 'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y = j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i, str(y + '.csv')])
                print([i, str(y + '.csv')])
    except Exception as e:
        print(e)
        pass
# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = []
title = []
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] + y + end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table', {'id': 'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                        for l in title1:
                            if l not in title:
                                title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table', {'id': 'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
from functools import reduce  # needed for the rolling merge below

dfs = [df1, df2, df3]  # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)
The URLs, stat types, and desired format are below (the "..." just stands for everything in between; I'm trying to get all of a golfer's data onto one row).

URLs for the data below:

['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Statistic titles:

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE

Desired format (one row per golfer and year, wrapped here across two lines):

year  rankthisweek  ranklastweek  name           events  rating  rounds  avg
2017  2             3             Rickie Fowler  10      8.8     62      .614

TOTAL SG:APP  MEASURED ROUNDS  ....  %      # SAVES  # BUNKERS  TOTAL O/U PAR
26.386        43               ....  70.37  76       108        +7.00
UPDATE (per comments)
This question is partly about technical methods (Pandas merge()), but it also seems like an opportunity to discuss useful workflows for data collection and cleaning. As such I'm adding a bit more detail and explanation than what is strictly required for a coding solution.
You can basically use the same approach as in my original answer to get data from the different URL categories. I'd recommend keeping a dict of {url: data} entries as you iterate over your URL list, and then building cleaned data frames from that dict.
There's a little legwork involved in setting up the cleaning portion, as you need to adjust for the different columns in each URL category. I've demonstrated a manual approach below, using only a few test URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize column names programmatically (there's a small sketch of that after the cols dict in the UPDATED CLEANING section below). That feels out of scope for this question.
As long as you're sure there's a year and PLAYER NAME field in each URL, the following merge should work. As before, let's assume that you don't need to write to CSV, and for now let's leave off making any optimizations to your scraping code:
First, define the url categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the url itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the url category that contains information about multiple years of player data.
import pandas as pd
# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
'http://www.pgatour.com/stats/stat.02568.html',
'http://www.pgatour.com/stats/stat.111.html']
# we'll store data from each url category in this dict.
url_data = {}
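As a small aside, the per-year URLs described above are built by splicing the year into the category URL. The loop below does this inline with url[:-4] + y + end; the helper here is just an illustrative sketch of the same idea and is not part of the original code.

# sketch only: 'stat.02356.html' -> 'stat.02356.2017.html', 'stat.02356.2016.html', ...
def year_urls(category_url, years, end='.html'):
    stem = category_url[:-len(end)]   # drop the trailing '.html'
    return [stem + '.' + y + end for y in years]

# year_urls('http://www.pgatour.com/stats/stat.02356.html', ['2017', '2016'])
# ['http://www.pgatour.com/stats/stat.02356.2017.html',
#  'http://www.pgatour.com/stats/stat.02356.2016.html']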
Now iterate over urls. Within the urls loop, this code is all the same as in my original answer, which in turn came from the OP, only with some variable names adjusted to reflect the new capturing method.
for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
        #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] + y + end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table', {'id': 'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                    for l in url_data[url]["title1"]:
                        if l not in url_data[url]["title"]:
                            url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table', {'id': 'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass
Now for each key url in url_data, rows_sp contains the data you're interested in for that particular url category.
Note that rows_sp will now actually be url_data[url]["rows_sp"] when we iterate over url_data, but the next few code blocks are from my original answer, and so use the old rows_sp variable name.
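For instance, the example block below came from something like the following, using the first test URL (the exact key is whatever URL string you iterated over):

rows_sp = url_data['http://www.pgatour.com/stats/stat.02356.html']["rows_sp"]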
# example rows_sp
[['year',
'RANK THIS WEEK',
'RANK LAST WEEK',
'PLAYER NAME',
'EVENTS',
'RATING',
'year',
'year',
'year',
'year'],
['2017'],
['2017', '1', '1', 'Sam Burns', '1', '9.2'],
['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
['2017', '2', '3', 'Whee Kim', '2', '8.8'],
['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
...
]
Writing rows_sp directly to a data frame shows that the data aren't quite in the right format:
pd.DataFrame(rows_sp).head()
0 1 2 3 4 5 6 \
0 year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING year
1 2017 None None None None None None
2 2017 1 1 Sam Burns 1 9.2 None
3 2017 2 3 Rickie Fowler 10 8.8 None
4 2017 2 2 Dustin Johnson 10 8.8 None
7 8 9
0 year year year
1 None None None
2 None None None
3 None None None
4 None None None
pd.DataFrame(rows_sp).dtypes
0 object
1 object
2 object
3 object
4 object
5 object
6 object
7 object
8 object
9 object
dtype: object
With a little cleanup, we can get rows_sp into a data frame with appropriate numeric data types:
df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
"PLAYER NAME","EVENTS","RATING",
"year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)
df.head()
year RANK THIS WEEK RANK LAST WEEK PLAYER NAME EVENTS RATING
2 2017 1 1.0 Sam Burns 1 9.2
3 2017 2 3.0 Rickie Fowler 10 8.8
4 2017 2 2.0 Dustin Johnson 10 8.8
5 2017 2 3.0 Whee Kim 2 8.8
6 2017 2 3.0 Thomas Pieters 3 8.8
UPDATED CLEANING
Now that we have a series of url categories to contend with, each with a different set of fields to clean, the above section gets a little more complicated. If you only have a few pages, it may be feasible to just visually review the fields for each category, and store them, like this:
cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', 'AVERAGE',
'TOTAL SG:APP', 'MEASURED ROUNDS',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
},
'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'ROUNDS', '%', '# SAVES', '# BUNKERS',
'TOTAL O/U PAR', 'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
'%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
},
'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
'PLAYER NAME', 'EVENTS', 'RATING',
'year1', 'year2', 'year3', 'year4'],
'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
'EVENTS', 'RATING']
}
}
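As mentioned earlier, with many more URL categories you would want to build this dict programmatically rather than by hand. One hedged sketch (not from the original answer) reuses the header rows already collected in url_data and naively assumes that every field other than year and PLAYER NAME is numeric:

# sketch only: derive per-category column metadata from the scraped headers
cols = {}
for url, data in url_data.items():
    page = url.split("/")[-1]            # e.g. 'stat.02356.html'
    header = data["title"]               # ['year', 'RANK THIS WEEK', ..., 'year', 'year', ...]
    extra = header.count('year') - 1     # duplicate 'year' entries appended once per extra season
    real = header[:len(header) - extra]  # 'year' plus the true field names
    cols[page] = {'columns': real + ['year%d' % (i + 1) for i in range(extra)],
                  'numeric': [c for c in real if c not in ('year', 'PLAYER NAME')]}

Either way, the cleaning loop below works off whatever ends up in cols.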
And then you can loop over url_data again and store in a dfs collection:
dfs = {}
for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], axis=1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df
At this point, we're ready to merge all the different data categories by year and PLAYER NAME. (You could actually have merged iteratively in the cleaning loop, but I'm separating here for demonstrative purposes.)
master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])
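If you prefer the reduce-style merge from the question, the same result can be produced by folding pd.merge over the cleaned frames. A sketch, assuming every frame really does contain year and PLAYER NAME:

from functools import reduce

master = reduce(lambda left, right: pd.merge(left, right, on=['year', 'PLAYER NAME']),
                dfs.values())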
Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)
RANK THIS WEEK_x RANK LAST WEEK_x EVENTS RATING \
PLAYER NAME year
Aam Hawin 2015 66 66.0 7 8.2
2016 80 80.0 12 8.1
2017 72 45.0 8 8.2
Aam Scott 2013 45 45.0 10 8.2
RANK THIS WEEK_y RANK LAST WEEK_y ROUNDS_x AVERAGE \
PLAYER NAME year
Aam Hawin 2015 136 136 95 -0.183
2016 122 122 93 -0.061
2017 56 52 84 0.296
Aam Scott 2013 16 16 61 0.548
TOTAL SG:APP MEASURED ROUNDS RANK THIS WEEK \
PLAYER NAME year
Aam Hawin 2015 -14.805 81 86
2016 -5.285 87 39
2017 18.067 61 8
Aam Scott 2013 24.125 44 57
RANK LAST WEEK ROUNDS_y % # SAVES # BUNKERS \
PLAYER NAME year
Aam Hawin 2015 86 95 50.96 80 157
2016 39 93 54.78 86 157
2017 6 84 61.90 91 147
Aam Scott 2013 57 61 53.85 49 91
TOTAL O/U PAR
PLAYER NAME year
Aam Hawin 2015 47.0
2016 43.0
2017 27.0
Aam Scott 2013 11.0
You may want to do a bit more cleaning on the merged columns, as some are duplicated across data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicate field names seem to contain exactly the same information, so you might just drop the _y version of each one.
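Here is a hedged sketch of that last cleanup (the _x/_y suffixes are pandas' defaults for overlapping column names). It only drops a _y column when it is genuinely identical to its _x partner, since fields like RANK THIS WEEK legitimately differ between categories:

# sketch: drop each '_y' column only when it matches its '_x' counterpart
for col in [c[:-2] for c in master.columns if c.endswith('_y')]:
    if master[col + '_x'].equals(master[col + '_y']):
        master = master.drop(columns=[col + '_y']).rename(columns={col + '_x': col})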
Related
I have to create a dataframe from webscraping in a specific way
I have to create a dataframe in Python by building a bunch of lists from a table in a Wikipedia article.

Code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv
import pandas as pd
import numpy as np

url = "https://en.wikipedia.org/wiki/Texas_Killing_Fields"
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
all_tables = soup.find_all('table')
all_sortable_tables = soup.find_all('table', class_='wikitable sortable')
right_table = all_sortable_tables
A = []
B = []
C = []
D = []
E = []
for row in right_table.find_all('tr'):
    cells = row.find_all('td')
    if len(cells) == 5:
        row.strip('\n')
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))
        C.append(cells[2].find(text=True))
        D.append(cells[3].find(text=True))
        E.append(cells[4].find(text=True))

df = pd.DataFrame(A, columns=['Victim'])
df['Victim'] = A
df['Age'] = B
df['Residence'] = C
df['Last Seen'] = D
df['Discovered'] = E

I keep getting an attribute error: "ResultSet object has no attribute 'find_all'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" I have tried a bunch of methods and nothing has helped. I'm also following a tutorial the teacher gave us and it isn't helpful either. Tutorial: https://alanhylands.com/how-to-web-scrape-wikipedia-python-urllib-beautiful-soup-pandas/#heading-10.-loop-through-the-rows

First time here as a questioner, by the way.
Note: as mentioned by @ggorlen, using an existing API would be the best approach. I would also recommend a more structured way of storing your data, to avoid that bunch of lists:

data = []
for row in soup.select('table.wikitable.sortable tr:has(td)'):
    data.append(
        dict(
            zip([h.text.strip() for h in soup.select('table.wikitable.sortable tr th')[:5]],
                [c.text.strip() for c in row.select('td')][:5])
        )
    )
pd.DataFrame(data)

Just as an alternative approach, you can scrape tables with pandas.read_html(), since you already imported pandas. It also uses BeautifulSoup and does the job for you:

import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/Texas_Killing_Fields')[1]
df.iloc[:, :5]  ### displays only the first 5 columns as in your example

Output:

Victim            Age  Residence          Last seen          Discovered
Brenda Jones      14   Galveston, Texas   July 1, 1971       July 2, 1971
Colette Wilson    13   Alvin, Texas       June 17, 1971      November 26, 1971
Rhonda Johnson    14   Webster, Texas     August 4, 1971     January 3, 1972
Sharon Shaw       13   Webster, Texas     August 4, 1971     January 3, 1972
Gloria Gonzales   19   Houston, Texas     October 28, 1971   November 23, 1971
...
Reading nested json to pandas dataframe
I have the URL below, which returns a JSON response. I need to read this JSON into a pandas dataframe and perform operations on top of it. This is a case of nested JSON with multiple lists and dicts within dicts.

URL: 'http://api.nobelprize.org/v1/laureate.json'

I have tried the code below:

import json, pandas as pd, requests
resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
print(df.head(5))

Output:

  id       firstname    surname        born        died  \
0  1  Wilhelm Conrad    Röntgen  1845-03-27  1923-02-10
1  2      Hendrik A.    Lorentz  1853-07-18  1928-02-04
2  3          Pieter     Zeeman  1865-05-25  1943-10-09
3  4           Henri  Becquerel  1852-12-15  1908-08-25
4  5          Pierre      Curie  1859-05-15  1906-04-19
             bornCountry bornCountryCode                bornCity  \
0  Prussia (now Germany)              DE  Lennep (now Remscheid)
1        the Netherlands              NL                  Arnhem
2        the Netherlands              NL              Zonnemaire
3                 France              FR                   Paris
4                 France              FR                   Paris
       diedCountry diedCountryCode   diedCity gender  \
0          Germany              DE     Munich   male
1  the Netherlands              NL        NaN   male
2  the Netherlands              NL  Amsterdam   male
3           France              FR        NaN   male
4           France              FR      Paris   male
                                              prizes
0  [{'year': '1901', 'category': 'physics', 'shar...
1  [{'year': '1902', 'category': 'physics', 'shar...
2  [{'year': '1902', 'category': 'physics', 'shar...
3  [{'year': '1903', 'category': 'physics', 'shar...
4  [{'year': '1903', 'category': 'physics', 'shar...

But here prizes comes back as a list. If I create a separate dataframe for prizes, it has affiliations as a list. I want all columns to come out as separate columns, and some entries may or may not have prizes, so that case needs to be handled as well. I went through this article: https://towardsdatascience.com/all-pandas-json-normalize-you-should-know-for-flattening-json-13eae1dfb7dd. It looks like we'll have to use meta and errors='ignore' here, but I have not been able to get it working. Appreciate your inputs. Thanks.
You would have to do this in a few steps.

The first step would be to extract the first record_path = ['laureates']. The second one would be record_path = ['laureates', 'prizes'] for the nested JSON records, with the meta path as the id from the parent record. Then combine the two datasets by joining on the id column, drop the unnecessary columns, and store the result:

import json, pandas as pd, requests

resp = requests.get('http://api.nobelprize.org/v1/laureate.json')
df0 = pd.json_normalize(json.loads(resp.content), record_path=['laureates'])
df1 = pd.json_normalize(json.loads(resp.content), record_path=['laureates', 'prizes'],
                        meta=[['laureates', 'id']])
output = pd.merge(df0, df1, left_on='id', right_on='laureates.id').drop(
    ['prizes', 'laureates.id'], axis=1, inplace=False)
print('Shape of data ->', output.shape)
print('Columns ->', output.columns)

Shape of data -> (975, 18)
Columns -> Index(['id', 'firstname', 'surname', 'born', 'died', 'bornCountry',
       'bornCountryCode', 'bornCity', 'diedCountry', 'diedCountryCode',
       'diedCity', 'gender', 'year', 'category', 'share', 'motivation',
       'affiliations', 'overallMotivation'],
      dtype='object')
I found an alternative solution as well, with less code. This works:

from flatten_json import flatten

data = winners['laureates']
dict_flattened = (flatten(record, '.') for record in data)
df = pd.DataFrame(dict_flattened)
print(df.shape)

(968, 43)
Pandas.read_html only getting header of html table
I'm using pandas.read_html to try to get a table from a website. For some reason it's not giving me the entire table, just the header row. How can I fix this?

Code:

import pandas as pd

term_codes = {"fall": "10", "spring": "20", "summer": "30"}
# year must be last number in school year: 2021-2022 so we pick 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = "https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code=" + term_code + "&term_subj=" + department + "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"

def findCourseTable():
    dfs = pd.read_html(url)
    print(dfs[0])
    #df = dfs[1]
    #df.to_csv(r'courses.csv', index=False)

if __name__ == "__main__":
    findCourseTable()

Output:

Empty DataFrame
Columns: [CRN, COURSE ID, CRSE ATTR, TITLE, INSTRUCTOR, CRDT HRS, MEET DAY:TIME, PROJ ENR, CURR ENR, SEATS AVAIL, STATUS]
Index: []
The page contains malformed HTML code, so use flavor="html5lib" in pd.read_html to read it correctly:

import pandas as pd

term_codes = {"fall": "10", "spring": "20", "summer": "30"}
# year must be last number in school year: 2021-2022 so we pick 2022
year = "2022"
department = "CSCI"
term_code = year + term_codes["fall"]
url = (
    "https://courselist.wm.edu/courselist/courseinfo/searchresults?term_code="
    + term_code
    + "&term_subj="
    + department
    + "&attr=0&attr2=0&levl=0&status=0&ptrm=0&search=Search"
)

df = pd.read_html(url, flavor="html5lib")[0]
print(df)

Prints:

     CRN    COURSE ID  CRSE ATTR                           TITLE                        INSTRUCTOR  CRDT HRS  MEET DAY:TIME  PROJ ENR  CURR ENR SEATS AVAIL  STATUS
0  16064  CSCI 100 01  C100, NEW                  Reading#Russia  Willner, Dana; Prokhorova, Elena         4  MWF:1300-1350        10        10          0*  CLOSED
1  14614  CSCI 120 01        NaN  A Career in CS? And Which One?                     Kemper, Peter         1    M:1700-1750        36        20          16    OPEN
2  16325  CSCI 120 02        NEW    Concepts in Computer Science                   Deverick, James         3   TR:0800-0920        36        25          11    OPEN
3  12372  CSCI 140 01   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi         4  MWF:0900-0950        36        24          12    OPEN
4  14620  CSCI 140 02   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi         4  MWF:1100-1150        36        27           9    OPEN
5  13553  CSCI 140 03   NEW, NQR    Programming for Data Science                 Khargonkar, Arohi         4  MWF:1300-1350        36        25          11    OPEN

...and so on.
how to " sort csv file" python
I'm trying to create a new file that contains only the data for movies with a rank above 9. The dataset I'm analyzing contains ratings on many movies obtained from IMDB. The data fields are:

Votes: the number of people rating the movie
Rank: the average rating of the movie
Title: the name of the movie
Year: the year in which the movie was released

The code I tried:

import csv

filename = "IMDB.txt"
with open(filename, 'rt', encoding='utf-8-sig') as imdb_file:
    imdb_reader = csv.DictReader(imdb_file, delimiter='\t')
    with open('new file.csv', 'w', newline='') as high_rank:
        fieldnames = ['Votes', 'Rank', 'Title', 'Year']
        writer = csv.DictWriter(high_rank, fieldnames=fieldnames)
        writer.writeheader()
        for line_number, current_row in enumerate(imdb_reader):
            if float(current_row['Rank']) > 9.0:
                writer.writerow(dict(current_row))

but unfortunately it's not working. What should I do?
Based on your comment, your locale default encoding appears to be something that doesn't support the whole Unicode range. You need to specify an encoding for the output file that will handle arbitrary Unicode characters. Typically, on non-Windows systems you'd use 'utf-8'; on Windows, you might use 'utf-16' or 'utf-8-sig' (Windows programs often assume UTF-8 without an explicit signature is in the locale encoding, and misinterpret it). The fix is as simple as changing:

with open('new file.csv', 'w', newline='') as high_rank:

to:

with open('new file.csv', 'w', encoding='utf-8', newline='') as high_rank:

changing the specified encoding to whatever makes sense for your OS and use case.
Let's consider you have a CSV file named temp.csv and you want to filter films with a rank above 9 (inclusive). One simple way to do this is to use the pandas module. It gives you the ability to:

read .csv files with the pd.read_csv method (doc)
filter the data as you want
export the data to a new file: for .csv output, df.to_csv does the job (doc)

Assume you have the dataframe shown below. The following code does the job:

# import modules
import pandas as pd

# Path - name of your file
filename = "temp.csv"

# Read the csv file
df = pd.read_csv(filename, sep=";")
print(df)
#    Votes  Rank                              Film  Year
# 0     15    16          The Shawshank Redemption  1994
# 1   2004     5                     The Godfather  1972
# 2    486    13            The Godfather: Part II  1974
# 3    529     9  Il buono, il brutto, il cattivo.  1966
# 4    289    12                      Pulp Fiction  1994
# 5     98    11                         Inception  2010
# 6     69    18                  Schindler's List  1993
# 7      3     7                         Angry Men  1957
# 8    584    14   One Flew Over the Cuckoo's Nest  1975

# Filter the csv file
df_filtered = df[df["Rank"] >= 9]
print(df_filtered)
#    Votes  Rank                              Film  Year
# 0     15    16          The Shawshank Redemption  1994
# 2    486    13            The Godfather: Part II  1974
# 3    529     9  Il buono, il brutto, il cattivo.  1966
# 4    289    12                      Pulp Fiction  1994
# 5     98    11                         Inception  2010
# 6     69    18                  Schindler's List  1993
# 8    584    14   One Flew Over the Cuckoo's Nest  1975

# name the new csv file, e.g. 'temp_new.csv'
new_filename = filename[:-4] + "_new" + filename[-4:]

# Export dataframe to csv file
df_filtered.to_csv(new_filename)

The new .csv file then contains only the filtered rows.
Join dataframes based on partial string-match between columns
I have a dataframe and I want to check which of its rows are present in another df.

after_h.sample(10, random_state=1)

                      movie  year  ratings
108  Mechanic: Resurrection  2016      4.0
206                Warcraft  2016      4.0
106               Max Steel  2016      3.5
107           Me Before You  2016      4.5

I want to check whether the above movies are present in another df:

                             FILM  Votes
0  Avengers: Age of Ultron (2015)   4170
1               Cinderella (2015)    950
2                  Ant-Man (2015)   3000
3          Do You Believe? (2015)    350
4                Max Steel (2016)    560

I want something like this as my final output:

        FILM  votes
0  Max Steel    560
There are two ways: get the row indices of partial matches, using either FILM.startswith(title) or FILM.contains(title). Either of:

df1[ df1.movie.apply( lambda title: df2.FILM.str.startswith(title) ).any(1) ]

df1[ df1['movie'].apply(lambda title: df2['FILM'].str.contains(title)).any(1) ]

         movie  year  ratings
106  Max Steel  2016      3.5

Alternatively, you can use merge() if you convert the compound string column df2['FILM'] into its two component columns, the movie title and (year).

# see code at bottom to recreate your dataframes
df2[['movie','year']] = df2.FILM.str.extract('([^\(]*) \(([0-9]*)\)')
# reorder columns and drop 'FILM' now we have its subfields 'movie','year'
df2 = df2[['movie','year','Votes']]
df2['year'] = df2['year'].astype(int)

df2.merge(df1)

       movie  year  Votes  ratings
0  Max Steel  2016    560      3.5

(Acknowledging much help from @user3483203 here and in the Python chat room.)

Code to recreate the dataframes:

import pandas as pd
from pandas.compat import StringIO

dat1 = """movie  year  ratings
108  Mechanic: Resurrection  2016  4.0
206  Warcraft  2016  4.0
106  Max Steel  2016  3.5
107  Me Before You  2016  4.5"""

dat2 = """FILM  Votes
0  Avengers: Age of Ultron (2015)  4170
1  Cinderella (2015)  950
2  Ant-Man (2015)  3000
3  Do You Believe? (2015)  350
4  Max Steel (2016)  560"""

df1 = pd.read_csv(StringIO(dat1), sep='\s{2,}', engine='python', index_col=0)
df2 = pd.read_csv(StringIO(dat2), sep='\s{2,}', engine='python')
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)

               FILM  VOTES
4  Max Steel (2016)    560
smci's option 1 is nearly there; the following worked for me:

df1['Votes'] = ''
df1['Votes'] = df1['movie'].apply(lambda title: df2[df2['FILM'].str.startswith(title)]['Votes'].any(0))

Explanation:

Create a Votes column in df1
Apply a lambda to every movie string in df1
The lambda looks up df2, selecting all rows in df2 where FILM starts with the movie title
Select the Votes column of the resulting subset of df2
Take the first value in this column with any(0)