Spyder (Python 3.8) web scraping question - python

Using the code below, I am trying to pull baseball lineups into a data frame. Starting at line 24, I am receiving the error "ValueError: not enough values to unpack (expected 2, got 1)". Is anyone able to assist in resolving this issue? Thanks!
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.baseballpress.com/lineups/2022-08-05"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup-card"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup-card-header .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup-card-header .player")
    ]
    data.append([*header, h_p1, h_p2])

    for p1, p2 in zip(
        card.select(".col--min:nth-of-type(1) .player"),
        card.select(".col--min:nth-of-type(2) .player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_csv("MLB Games.csv", index=False)
print(df.head(10).to_markdown(index=False))
I receive the following error when running the code above:
\Users\15156\AppData\Local\Programs\Spyder\pkgs\pandas\compat\_optional.py", line 141, in import_optional_dependency
raise ImportError(msg)
ImportError: Missing optional dependency 'tabulate'. Use pip or conda to install tabulate.
When I type %pip install tabulate into the console I receive this error message:
Note: you may need to restart the kernel to use updated packages.
C:\Users\15156\AppData\Local\Programs\Spyder\Python\python.exe: No module named pip
However, if I restart the kernel I still receive the same error message. I have looked around and tried installing the package using the commands below:
(base) PS C:\Users\15156> conda activate base
(base) PS C:\Users\15156> conda create -n myenv spyder-kernels nltk
Collecting package metadata (current_repodata.json): done
Solving environment: done

==> WARNING: A newer version of conda exists. <==
  current version: 4.12.0
  latest version: 4.13.0

Please update conda by running

    $ conda update -n base -c defaults conda

## Package Plan ##

  environment location: C:\Users\15156\miniconda3\envs\myenv

  added / updated specs:
    - nltk
    - spyder-kernels
The packages were downloaded and installed, and I have looked in the directory it lists as the environment location. However, when I run %pip install kernel again it still says the module cannot be found, spitting out the same error as above. Has anyone run into this issue before?

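The "No module named pip" message means the Python interpreter bundled with the standalone Spyder installer ships without pip, so %pip has nothing to call. A minimal sketch of one possible workaround, assuming (an assumption, not something the installer guarantees) that the bundled interpreter still includes the standard-library ensurepip module; run it in the Spyder console:

import subprocess
import sys

# Bootstrap pip into the interpreter Spyder is actually running (assumes
# ensurepip ships with that interpreter)...
subprocess.run([sys.executable, "-m", "ensurepip", "--upgrade"], check=True)
# ...then install the missing optional dependency into that same interpreter.
subprocess.run([sys.executable, "-m", "pip", "install", "tabulate"], check=True)

Alternatively, since miniconda is already installed, installing tabulate into the myenv environment (conda install -n myenv tabulate) and pointing Spyder at that environment's interpreter (Preferences > Python interpreter) avoids touching the bundled interpreter at all.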
There are several errors in your code. First, you never import requests. Next, the first two return statements in get_name() have nothing after them; the returned expression needs to be brought up onto the same line. Finally, get_name() calls get_text() on the tags it selects, so it already returns strings; you don't need to access a .text attribute on its results when assigning to p1 and p2. Here is the corrected code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.baseballpress.com/lineups/2022-08-05"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup-card"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup-card-header .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup-card-header .player")
    ]
    data.append([*header, h_p1, h_p2])

    for p1, p2 in zip(
        card.select(".col--min:nth-of-type(1) .player"),
        card.select(".col--min:nth-of-type(2) .player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_csv("73264662.csv", index=False)
print(df.head(10).to_markdown(index=False))
This prints:
| Team1   | Date             | Team2 | Player1               | Player2                  |
|:--------|:-----------------|:------|:----------------------|:-------------------------|
| Marlins | August, 5 2:20pm | Cubs  | Edward Cabrera (R)    | Justin Steele (L)        |
| Marlins | August, 5 2:20pm | Cubs  | Miguel Rojas (R) SS   | Rafael Ortega (L) CF     |
| Marlins | August, 5 2:20pm | Cubs  | Joey Wendle (L) 2B    | Contreras                |
| Marlins | August, 5 2:20pm | Cubs  | Garrett Cooper (R) 1B | Patrick Wisdom (R) 1B    |
| Marlins | August, 5 2:20pm | Cubs  | Jesus Aguilar (R) DH  | Ian Happ (S) LF          |
| Marlins | August, 5 2:20pm | Cubs  | De La Cruz            | Nelson Velazquez (R) RF  |
| Marlins | August, 5 2:20pm | Cubs  | JJ Bleday (L) CF      | Yan Gomes (R) C          |
| Marlins | August, 5 2:20pm | Cubs  | Peyton Burdick (R) LF | Zach McKinstry (L) 3B    |
| Marlins | August, 5 2:20pm | Cubs  | Stallings             | Christopher Morel (R) SS |
| Marlins | August, 5 2:20pm | Cubs  | Leblanc               | Nick Madrigal (R) 2B     |
and produces a CSV with all of today's games.
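One caveat on the corrected code: if a lineup card ever lists fewer than two pitchers in its header (a postponed game, for instance), the two-element unpack is exactly what raises the "not enough values to unpack" ValueError from the question. A small defensive sketch, reusing the selectors above, that skips such cards:

    players = [get_name(p) for p in card.select(".lineup-card-header .player")]
    if len(players) != 2:
        continue  # skip cards that don't list both starting pitchers
    h_p1, h_p2 = players
    data.append([*header, h_p1, h_p2])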

Related

How to write a function in Python pandas to append rows to a dataframe in a loop?

I am being provided with a data set and I am writing a function. My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood-group list (that I created) and I am trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {
    'id': [2539, 2595, 3647, 3831, 12937, 18198, 258838, 258876, 267535, 385824],
    'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
             'THE VILLAGE OF HARLEM....NEW YORK !', 'Cozy Entire Floor of Brownstone',
             '1 Stop fr. Manhattan! Private Suite,Landmark Block', 'Little King of Queens',
             'Oceanview,close to Manhattan', 'Affordable rooms,all transportation',
             'Home Away From Home-Room in Bronx', 'New York City- Riverdale Modern two bedrooms unit'],
    'price': [149, 225, 150, 89, 130, 70, 250, 50, 50, 120],
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Queens',
                            'Queens', 'Staten Island', 'Staten Island', 'Bronx', 'Bronx'],
}
df = pd.DataFrame(dict1)
df
I created a function as follows
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']

# Creating a function to find the cheapest place in neighbourhood group
dfdf = pd.DataFrame(columns = ['id','name','price','neighbourhood_group'])

def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group']==elem]
        cheapest = data.loc[data['price']==min(data['price'])]
        dfdf = cheapest.copy()

cheapest_place(nbd_grp)
My expected output is:

+--------+-------------------------------------+-------+---------------------+
| id     | name                                | price | neighbourhood group |
+--------+-------------------------------------+-------+---------------------+
| 267535 | Home Away From Home-Room in Bronx   | 50    | Bronx               |
| 18198  | Little King of Queens               | 70    | Queens              |
| 258876 | Affordable rooms,all transportation | 50    | Staten Island       |
| 3831   | Cozy Entire Floor of Brownstone     | 89    | Brooklyn            |
| 3647   | THE VILLAGE OF HARLEM....NEW YORK ! | 150   | Manhattan           |
+--------+-------------------------------------+-------+---------------------+
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = (
    df.groupby('neighbourhood_group').price.agg(min)
      .reset_index()
      .merge(df, on=['neighbourhood_group', 'price'])
)
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id     | name                                |
+-----+---------------------+-------+--------+-------------------------------------+
| 0   | Bronx               | 50    | 267535 | Home Away From Home-Room in Bronx   |
| 1   | Brooklyn            | 89    | 3831   | Cozy Entire Floor of Brownstone     |
| 2   | Manhattan           | 150   | 3647   | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3   | Queens              | 70    | 18198  | Little King of Queens               |
| 4   | Staten Island       | 50    | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
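An equivalent set-based route, assuming you want exactly one row per group (idxmin keeps the first row on a price tie, whereas the merge keeps every tied row), is to index the original frame by each group's minimum-price row label:

df_cheapest = df.loc[df.groupby('neighbourhood_group')['price'].idxmin()]

This returns the full matching rows, including the id and name columns, without a separate merge step.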

Handle csv file with almost similar records but different times - need to group them as one record

I am attempting to solve the lab below and am having issues. The problem involves a CSV input, and the solution needs to meet the following criteria. Any help or tips at all would be appreciated. My code is at the end of the problem along with my output.
Each row contains the title, rating, and all showtimes of a unique movie.
A space is placed before and after each vertical separator ('|') in each row.
Column 1 displays the movie titles and is left justified with a minimum of 44 characters.
If the movie title has more than 44 characters, output the first 44 characters only.
Column 2 displays the movie ratings and is right justified with a minimum of 5 characters.
Column 3 displays all the showtimes of the same movie, separated by a space.
This is the input:
16:40,Wonders of the World,G
20:00,Wonders of the World,G
19:00,End of the Universe,NC-17
12:45,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
15:00,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
19:30,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
10:00,Adventure of Lewis and Clark,PG-13
14:30,Adventure of Lewis and Clark,PG-13
19:00,Halloween,R
This is the expected output:
Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00
My code so far:
import csv

rawMovies = input()
repeatList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    for movie in moviesList:
        time = movie[0]
        #print(time)
        show = movie[1]
        if len(show) > 45:
            show = show[0:44]
        #print(show)
        rating = movie[2]
        #print(rating)
        print('{0: <44} | {1: <6} | {2}'.format(show, rating, time))
My output doesn't have the rating aligned to the right, and I have no idea how to filter out repeated movies without losing the time portion of each row:
Wonders of the World                         | G      | 16:40
Wonders of the World                         | G      | 20:00
End of the Universe                          | NC-17  | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG     | 12:45
Buffalo Bill And The Indians or Sitting Bull | PG     | 15:00
Buffalo Bill And The Indians or Sitting Bull | PG     | 19:30
Adventure of Lewis and Clark                 | PG-13  | 10:00
Adventure of Lewis and Clark                 | PG-13  | 14:30
Halloween                                    | R      | 19:00
You could collect the input data in a dictionary, with the title-rating-tuples as keys and the showtimes collected in a list, and then print the consolidated information. For example (you have to adjust the filename):
import csv

movies = {}
with open("data.csv", "r") as file:
    for showtime, title, rating in csv.reader(file):
        movies.setdefault((title, rating), []).append(showtime)

for (title, rating), showtimes in movies.items():
    print(f"{title[:44]: <44} | {rating: >5} | {' '.join(showtimes)}")
Output:
Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00
Since the input seems to come in connected blocks you could also use itertools.groupby (from the standard library) and print while reading:
import csv
from itertools import groupby
from operator import itemgetter

with open("data.csv", "r") as file:
    for (title, rating), group in groupby(
        csv.reader(file), key=itemgetter(1, 2)
    ):
        showtimes = " ".join(time for time, *_ in group)
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
For this, consider the max length of the rating string. Subtract the length of the rating from that value, make a string of spaces of that length, and prepend it to the rating.
So basically:

your_desired_str = ' ' * (6 - len(Rating)) + Rating

Also, just replace

'somestr {}'.format(value)

with f-strings, which are much easier to read:

f'somestr {value}'
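For what it's worth, str.rjust and a format-spec width do the same padding without the manual arithmetic; a tiny sketch with a hypothetical rating value:

rating = "NC-17"
print(' ' * (6 - len(rating)) + rating)  # manual padding, as above
print(rating.rjust(6))                   # same result with str.rjust
print(f"{rating: >6}")                   # same result with a format spec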
Below is what I ended up with after some tips from the community.
import csv

rawMovies = input()
outputList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    movieold = [' ', ' ', ' ']
    for movie in moviesList:
        if movieold[1] == movie[1]:
            outputList[-1][2] += ' ' + movie[0]
        else:
            time = movie[0]
            # print(time)
            show = movie[1]
            if len(show) > 45:
                show = show[0:44]
            # print(show)
            rating = movie[2]
            # print(rating)
            outputList.append([show, rating, time])
        movieold = movie

#print(outputList)
for movie in outputList:
    print('{0: <44} | {1: <5} | {2}'.format(movie[0], movie[1].rjust(5), movie[2]))
I would use Python's groupby() function for this which helps you to group consecutive rows with the same value.
For example:
import csv
from itertools import groupby

with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)
    for title, entries in groupby(csv_movies, key=lambda x: x[1]):
        movies = list(entries)
        showtimes = ' '.join(row[0] for row in movies)
        rating = movies[0][2]
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")
Giving you:
Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00
So how does groupby() work?
When reading a CSV file you get one row at a time. What groupby() does is group consecutive rows together into mini-lists of rows that share the same value. The value it looks for is given by the key parameter; in this case the lambda function is passed a row at a time and returns the current value of x[1], which is the title. groupby() keeps reading rows until that value changes, then yields the current group (entries here) as an iterator.
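A minimal standalone illustration of that behaviour, using a few made-up rows rather than the movies file:

from itertools import groupby

rows = [("16:40", "A"), ("20:00", "A"), ("19:00", "B")]
for title, entries in groupby(rows, key=lambda row: row[1]):
    print(title, [time for time, _ in entries])
# A ['16:40', '20:00']
# B ['19:00']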
This approach does assume that the rows you wish to group are consecutive in the file. You could even write your own kind of group-by generator function:
def group_by_title(csv):
    title = None
    entries = []
    for row in csv:
        if title and row[1] != title:
            yield title, entries
            entries = []
        title = row[1]
        entries.append(row)
    if entries:
        yield title, entries

with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)
    for title, entries in group_by_title(csv_movies):
        showtimes = ' '.join(row[0] for row in entries)
        rating = entries[0][2]
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")

Pretty print a pandas dataframe in VS Code

I'd like to know if it's possible to display a pandas dataframe in VS Code while debugging (first picture) as it is displayed in PyCharm (second picture)?
Thanks for any help.
df print in vs code:
df print in pycharm:
As of the January 2021 release of the python extension, you can now view pandas dataframes with the built-in data viewer when debugging native python programs. When the program is halted at a breakpoint, right-click the dataframe variable in the variables list and select "View Value in Data Viewer"
Tabulate is an excellent library to achieve a fancy/pretty print of a pandas df.
More information: https://pypi.org/project/tabulate/
Please follow these steps to achieve a pretty print (note: for easy illustration I will create a simple dataframe in Python):
1) install tabulate
pip install --upgrade tabulate
This statement will always install the latest version of the tabulate library.
2) import statements
import pandas as pd
from tabulate import tabulate
3) create simple temporary dataframe
temp_data = {'Name': ['Sean', 'Ana', 'KK', 'Kelly', 'Amanda'],
             'Age': [42, 52, 36, 24, 73],
             'Maths_Score': [67, 43, 65, 78, 97],
             'English_Score': [78, 98, 45, 67, 64]}
df = pd.DataFrame(temp_data, columns=['Name', 'Age', 'Maths_Score', 'English_Score'])
4) without tabulate our dataframe print will be:
print(df)
     Name  Age  Maths_Score  English_Score
0    Sean   42           67             78
1     Ana   52           43             98
2      KK   36           65             45
3   Kelly   24           78             67
4  Amanda   73           97             64
5) after using tabulate your pretty print will be :
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------+-------+---------------+-----------------+
|    | Name   |   Age |   Maths_Score |   English_Score |
|----+--------+-------+---------------+-----------------|
|  0 | Sean   |    42 |            67 |              78 |
|  1 | Ana    |    52 |            43 |              98 |
|  2 | KK     |    36 |            65 |              45 |
|  3 | Kelly  |    24 |            78 |              67 |
|  4 | Amanda |    73 |            97 |              64 |
+----+--------+-------+---------------+-----------------+
Nice and crisp print, enjoy!
Use VS Code's Jupyter notebook support:
Choose between attach-to-local-script or launch mode, whichever you prefer.
Include a breakpoint() where you want to break, if using attach mode.
When debugging, use the debug console to run:
display(df_consigne_errors)
I have not found a similar feature for VS Code. If you require this feature you might consider using Spyder IDE. Spyder IDE Homepage
In addition to @Shantanu's answer, pandas' to_markdown function, which requires the tabulate library to be installed, provides various plain-text table formats that display in the VS Code editor, such as:
df = pd.DataFrame(data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
print(df.to_markdown())
|    | animal_1   | animal_2   |
|---:|:-----------|:-----------|
|  0 | elk        | dog        |
|  1 | pig        | quetzal    |
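Note that to_markdown forwards extra keyword arguments to tabulate, so the other tabulate table styles are available as well; for example, with the same frame:

print(df.to_markdown(tablefmt="grid"))  # tabulate's "grid" style
print(df.to_markdown(tablefmt="psql"))  # the psql style shown in the answer above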

How to do more customization in a Zeppelin notebook?

I'm using the Hortonworks sandbox, version 2.5. The Zeppelin service is running successfully. I created a Zeppelin notebook with sample data from a CSV file, for example the data below:
+-----+------+----------------+--------+-------+
| id  | name | specialization | county | state |
+-----+------+----------------+--------+-------+
| 001 | xxxx | Android        | Bronx  | NY    |
| 002 | yyyy | ROR            | Rome   | NY    |
| 003 | zzzz | Bigdata        | Bronx  | NY    |
| 004 | pppp | IOS            | Dallas | TX    |
| 005 | qqq  | IOS            | Dallas | TX    |
+-----+------+----------------+--------+-------+
I have pie and bar charts and a SQL table. In the pie chart, each state (like TX) is shown with its respective count.
When I click the TX portion of the pie chart, I want the data to be filtered dynamically across the entire notebook, in all widgets (SQL table, bar chart, etc.). Instead, all of the data is still displayed in the SQL table; the underlying table contains 70,000 records, and I want only the TX state records.
Please tell me how to implement this functionality in Zeppelin.
As of 0.7.0, you can create your own charts, like https://github.com/1ambda/zeppelin-highcharts-columnrange.
This is called Helium (pluggable) visualization.
Here are some resources you can refer to:
All available helium visualizations: http://zeppelin.apache.org/helium_packages.html
How to write new helium visualization: http://zeppelin.apache.org/docs/0.7.0/development/writingzeppelinvisualization.html
Zeppelin built in samples: https://github.com/apache/zeppelin/tree/branch-0.7/zeppelin-web/src/app/visualization/builtins

Using Pandas and sqlite3

I am trying to use the pivot_table function of pandas to produce a table that, for each party and each state, shows how much the party received in total contributions from that state.
Is this the right way to do it, or do I have to go into the database and fetch the data directly? The code below gives an error.
party_and_state = candidates.merge(contributors, on='id')
party_and_state.pivot_table(df,index=["party","state"],values=["amount"],aggfunc=[np.sum])
The expected result could be something like the table below.
The first column is the state name; underneath the D column are that party's totals from each state, and the same applies to the R column.
+-----------------+---------+--------+
| state           | D       | R      |
+-----------------+---------+--------+
| AK              | 500     | 900    |
| IL              | 600     | 877    |
| FL              | 200     | 400    |
| UT              | 300     | 300    |
| CA              | 109     | 90     |
| MN              | 800     | 888    |
+-----------------+---------+--------+
Consider the generalized pandas merge, called with pd as the qualifier instead of a dataframe, since the join fields are named differently and hence require the left_on and right_on args. Additionally, do not pass df in when running pivot_table as a method of a dataframe, since the calling df is already passed into the function.
Below uses the contributors_with_candidate_id and candidates text files. Also, per your desired result, you may want to use the columns arg of pivot_table:
import numpy as np
import pandas as pd

contributors = pd.read_table('contributors_with_candidate_id.txt', sep="|")
candidates = pd.read_table('candidates.txt', sep="|")

party_and_state = pd.merge(contributors, candidates,
                           left_on=['candidate_id'], right_on=['id'])

party_and_state.pivot_table(index=["party", "state"],
                            values=["amount"], aggfunc=np.sum)
#                amount
# party state
# D     CA      1660.80
#       DC       200.09
#       FL      4250.00
#       IL       200.00
#       MA       195.00
# ...
# R     AK      1210.00
#       AR     14200.00
#       AZ       120.00
#       CA     -6674.53
#       CO     -5823.00
party_and_state.pivot_table(index=["state"], columns=["party"],
                            values=["amount"], aggfunc=np.sum)
#         amount
# party        D         R
# state
# AK         NaN   1210.00
# AR         NaN  14200.00
# AZ         NaN    120.00
# CA     1660.80  -6674.53
# CO         NaN  -5823.00
# CT         NaN   2300.00
Do note, you can do the merge as an inner join in SQL with read_sql:
party_and_state = pd.read_sql("SELECT c.*, n.* FROM contributors c " +
                              "INNER JOIN candidates n ON c.candidate_id = n.id",
                              con=db)

party_and_state.pivot_table(index=["state"], columns=["party"],
                            values=["amount"], aggfunc=np.sum)
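If you would rather push the pivot itself into SQL, conditional aggregation can produce the same wide layout; a sketch assuming the same sqlite3 connection db and that party only takes the values 'D' and 'R':

party_pivot = pd.read_sql("""
    SELECT c.state,
           SUM(CASE WHEN n.party = 'D' THEN c.amount END) AS D,
           SUM(CASE WHEN n.party = 'R' THEN c.amount END) AS R
    FROM contributors c
    INNER JOIN candidates n ON c.candidate_id = n.id
    GROUP BY c.state
""", con=db)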
