I'd like to know if it's possible to display a pandas dataframe in VS Code while debugging (first picture) the way it is displayed in PyCharm (second picture)?
Thanks for any help.
df print in vs code:
df print in pycharm:
As of the January 2021 release of the Python extension, you can view pandas dataframes with the built-in Data Viewer when debugging native Python programs. When the program is halted at a breakpoint, right-click the dataframe variable in the Variables list and select "View Value in Data Viewer".
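A minimal script to try this out with (the dataframe contents here are arbitrary): set a breakpoint on the `print` line, start the debugger, then right-click `df` in the Variables pane.

```python
import pandas as pd

# Any dataframe will do; break on the print() line below,
# then right-click `df` in the Variables pane and pick
# "View Value in Data Viewer".
df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
print(df)
```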
Tabulate is an excellent library for fancy/pretty printing of a pandas dataframe:
Information: https://pypi.org/project/tabulate/
Please follow these steps to achieve the pretty print:
(Note: for easy illustration I will create a simple dataframe in Python.)
1) Install tabulate:
pip install --upgrade tabulate
This statement always installs the latest version of the tabulate library.
2) import statements
import pandas as pd
from tabulate import tabulate
3) create simple temporary dataframe
temp_data = {'Name': ['Sean', 'Ana', 'KK', 'Kelly', 'Amanda'],
             'Age': [42, 52, 36, 24, 73],
             'Maths_Score': [67, 43, 65, 78, 97],
             'English_Score': [78, 98, 45, 67, 64]}
df = pd.DataFrame(temp_data, columns=['Name', 'Age', 'Maths_Score', 'English_Score'])
4) Without tabulate, the dataframe prints as:
print(df)
     Name  Age  Maths_Score  English_Score
0    Sean   42           67             78
1     Ana   52           43             98
2      KK   36           65             45
3   Kelly   24           78             67
4  Amanda   73           97             64
5) After using tabulate, your pretty print will be:
print(tabulate(df, headers='keys', tablefmt='psql'))
+----+--------+-------+---------------+-----------------+
| | Name | Age | Maths_Score | English_Score |
|----+--------+-------+---------------+-----------------|
| 0 | Sean | 42 | 67 | 78 |
| 1 | Ana | 52 | 43 | 98 |
| 2 | KK | 36 | 65 | 45 |
| 3 | Kelly | 24 | 78 | 67 |
| 4 | Amanda | 73 | 97 | 64 |
+----+--------+-------+---------------+-----------------+
Nice and crisp print, enjoy! Please add a comment if you like my answer!
Use VS Code's Jupyter notebooks support.
Choose between attach-to-local-script or launch mode, up to you.
Include a breakpoint() where you want to break if using attach mode.
When debugging, use the Debug Console to run:
display(df_consigne_errors)
I have not found a similar feature for VS Code. If you require this feature, you might consider using the Spyder IDE: Spyder IDE Homepage
In addition to @Shantanu's answer, pandas' to_markdown function, which requires the tabulate library to be installed, provides various plain-text table formats that display well in the VS Code editor, such as:
df = pd.DataFrame(data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]})
print(df.to_markdown())
| | animal_1 | animal_2 |
|---:|:-----------|:-----------|
| 0 | elk | dog |
| 1 | pig | quetzal |
I am given a data set and am writing a function. My objective is quite simple: I have an Airbnb database with various columns. I am using a for loop over a neighbourhood-group list (that I created) and am trying to extract (append) the data related to each element into an empty dataframe.
Example:
import pandas as pd
import numpy as np
dict1 = {'id': [2539, 2595, 3647, 3831, 12937, 18198, 258838, 258876, 267535, 385824],
         'name': ['Clean & quiet apt home by the park', 'Skylit Midtown Castle',
                  'THE VILLAGE OF HARLEM....NEW YORK !', 'Cozy Entire Floor of Brownstone',
                  '1 Stop fr. Manhattan! Private Suite,Landmark Block', 'Little King of Queens',
                  'Oceanview,close to Manhattan', 'Affordable rooms,all transportation',
                  'Home Away From Home-Room in Bronx', 'New York City- Riverdale Modern two bedrooms unit'],
         'price': [149, 225, 150, 89, 130, 70, 250, 50, 50, 120],
         'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan', 'Brooklyn', 'Queens',
                                 'Queens', 'Staten Island', 'Staten Island', 'Bronx', 'Bronx']}
df = pd.DataFrame(dict1)
df
I created a function as follows:
nbd_grp = ['Bronx','Queens','Staten Islands','Brooklyn','Manhattan']

# Creating a function to find the cheapest place in each neighbourhood group
dfdf = pd.DataFrame(columns=['id','name','price','neighbourhood_group'])

def cheapest_place(neighbourhood_group):
    for elem in nbd_grp:
        data = df.loc[df['neighbourhood_group']==elem]
        cheapest = data.loc[data['price']==min(data['price'])]
        dfdf = cheapest.copy()

cheapest_place(nbd_grp)
My expected output is:
| id     | name                                | price | neighbourhood_group |
|--------|-------------------------------------|-------|---------------------|
| 267535 | Home Away From Home-Room in Bronx   | 50    | Bronx               |
| 18198  | Little King of Queens               | 70    | Queens              |
| 258876 | Affordable rooms,all transportation | 50    | Staten Island       |
| 3831   | Cozy Entire Floor of Brownstone     | 89    | Brooklyn            |
| 3647   | THE VILLAGE OF HARLEM....NEW YORK ! | 150   | Manhattan           |
My advice is that anytime you are working in a database or in a dataframe and you think "I need to loop", you should think again.
When in a dataframe you are in a world of set-based logic and there is likely a better set-based way of solving the problem. In your case you can groupby() your neighbourhood_group and get the min() of the price column and then merge or join that result set back to your original dataframe to get your id and name columns.
That would look something like:
df_min_price = df.groupby('neighbourhood_group').price.agg(min).reset_index().merge(df, on=['neighbourhood_group','price'])
+-----+---------------------+-------+--------+-------------------------------------+
| idx | neighbourhood_group | price | id | name |
+-----+---------------------+-------+--------+-------------------------------------+
| 0 | Bronx | 50 | 267535 | Home Away From Home-Room in Bronx |
| 1 | Brooklyn | 89 | 3831 | Cozy Entire Floor of Brownstone |
| 2 | Manhattan | 150 | 3647 | THE VILLAGE OF HARLEM....NEW YORK ! |
| 3 | Queens | 70 | 18198 | Little King of Queens |
| 4 | Staten Island | 50 | 258876 | Affordable rooms,all transportation |
+-----+---------------------+-------+--------+-------------------------------------+
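An equivalent set-based sketch (not part of the answer above, same idea with one less step) uses idxmin, which returns the index label of the cheapest row per group, so `.loc` can pull the full rows directly:

```python
import pandas as pd

# a small subset of the question's data, enough to show the pattern
df = pd.DataFrame({
    "id": [2539, 2595, 3831],
    "name": ["Clean & quiet apt home by the park", "Skylit Midtown Castle",
             "Cozy Entire Floor of Brownstone"],
    "price": [149, 225, 89],
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
})

# idxmin gives, per group, the index of the row with the lowest price
cheapest = df.loc[df.groupby("neighbourhood_group")["price"].idxmin()]
print(cheapest)
```

One difference to be aware of: when two listings tie for the minimum price, the merge version keeps all tied rows, while idxmin keeps only the first.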
Using the code below, I am trying to pull baseball lineups into a dataframe. At line 24, I am receiving the error "ValueError: not enough values to unpack (expected 2, got 1)". Is anyone able to assist in resolving this issue? Thanks!
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.baseballpress.com/lineups/2022-08-05"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup-card"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup-card-header .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup-card-header .player")
    ]
    data.append([*header, h_p1, h_p2])
    for p1, p2 in zip(
        card.select(".col--min:nth-of-type(1) .player"),
        card.select(".col--min:nth-of-type(2) .player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_csv("MLB Games.csv", index=False)
print(df.head(10).to_markdown(index=False))
I receive the following error when running the code above:
\Users\15156\AppData\Local\Programs\Spyder\pkgs\pandas\compat\_optional.py", line 141, in import_optional_dependency
raise ImportError(msg)
ImportError: Missing optional dependency 'tabulate'. Use pip or conda to install tabulate.
When I type %pip install tabulate into the console I receive this error message:
Note: you may need to restart the kernel to use updated packages.
C:\Users\15156\AppData\Local\Programs\Spyder\Python\python.exe: No module named pip
However, if I restart the kernel I still receive the same error message. I have looked around and tried installing the package as shown below:
(base) PS C:\Users\15156> conda activate base
(base) PS C:\Users\15156> conda create -n myenv spyder-kernels nltk
Collecting package metadata (current_repodata.json): done
Solving environment: done
==> WARNING: A newer version of conda exists. <==
current version: 4.12.0
latest version: 4.13.0
Please update conda by running
$ conda update -n base -c defaults conda
## Package Plan ##
environment location: C:\Users\15156\miniconda3\envs\myenv
added / updated specs:
- nltk
- spyder-kernels
The packages were downloaded and installed, and I have looked into where it says the environment location is. However, when I run %pip install kernel again, it still says that the module cannot be found, spitting out the same error as above. Has anyone run into this issue before?
You have several errors in your code. First, you don't import requests. Next, the first two return statements in get_name() don't have anything following them; you need to bring the next line up onto the same line. Finally, since get_name() returns the result of calling get_text(), it already returns strings, so you don't need to access a .text attribute on them when assigning to p1 and p2. Here is the corrected code:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.baseballpress.com/lineups/2022-08-05"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

def get_name(tag):
    if tag.select_one(".desktop-name"):
        return tag.select_one(".desktop-name").get_text()
    elif tag.select_one(".mobile-name"):
        return tag.select_one(".mobile-name").get_text()
    else:
        return tag.get_text()

data = []
for card in soup.select(".lineup-card"):
    header = [
        c.get_text(strip=True, separator=" ")
        for c in card.select(".lineup-card-header .c")
    ]
    h_p1, h_p2 = [
        get_name(p) for p in card.select(".lineup-card-header .player")
    ]
    data.append([*header, h_p1, h_p2])
    for p1, p2 in zip(
        card.select(".col--min:nth-of-type(1) .player"),
        card.select(".col--min:nth-of-type(2) .player"),
    ):
        p1 = get_name(p1).split(maxsplit=1)[-1]
        p2 = get_name(p2).split(maxsplit=1)[-1]
        data.append([*header, p1, p2])

df = pd.DataFrame(
    data, columns=["Team1", "Date", "Team2", "Player1", "Player2"]
)
df.to_csv("73264662.csv", index=False)
print(df.head(10).to_markdown(index=False))
This prints:
| Team1 | Date | Team2 | Player1 | Player2 |
|:--------|:-----------------|:--------|:----------------------|:-------------------------|
| Marlins | August, 5 2:20pm | Cubs | Edward Cabrera (R) | Justin Steele (L) |
| Marlins | August, 5 2:20pm | Cubs | Miguel Rojas (R) SS | Rafael Ortega (L) CF |
| Marlins | August, 5 2:20pm | Cubs | Joey Wendle (L) 2B | Contreras |
| Marlins | August, 5 2:20pm | Cubs | Garrett Cooper (R) 1B | Patrick Wisdom (R) 1B |
| Marlins | August, 5 2:20pm | Cubs | Jesus Aguilar (R) DH | Ian Happ (S) LF |
| Marlins | August, 5 2:20pm | Cubs | De La Cruz | Nelson Velazquez (R) RF |
| Marlins | August, 5 2:20pm | Cubs | JJ Bleday (L) CF | Yan Gomes (R) C |
| Marlins | August, 5 2:20pm | Cubs | Peyton Burdick (R) LF | Zach McKinstry (L) 3B |
| Marlins | August, 5 2:20pm | Cubs | Stallings | Christopher Morel (R) SS |
| Marlins | August, 5 2:20pm | Cubs | Leblanc | Nick Madrigal (R) 2B |
and produces a CSV with all of today's games.
data.table is popular for R, but it also has a Python version, datatable. However, I don't see anything in the docs about applying a user-defined function over a datatable.
Here's a toy example (in pandas) where a user function is applied over a dataframe to look for po-box addresses:
import re
import pandas as pd

df = pd.DataFrame({'customer': [101, 102, 103],
                   'address': ['12 main st', '32 8th st, 7th fl', 'po box 123']})
customer | address
----------------------------
101 | 12 main st
102 | 32 8th st, 7th fl
103 | po box 123
# User-defined function:
def is_pobox(s):
    rslt = re.search(r'^p(ost)?\.? *o(ffice)?\.? *box *\d+', s)
    if rslt:
        return True
    else:
        return False
# Using .apply() for this example:
df['is_pobox'] = df.apply(lambda x: is_pobox(x['address']), axis = 1)
# Expected Output:
customer | address           | is_pobox
---------|-------------------|---------
101      | 12 main st        | False
102      | 32 8th st, 7th fl | False
103      | po box 123        | True
Is there a way to do this .apply operation in datatable? Would be nice, because datatable seems to be quite a bit faster than pandas for most operations.
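As an aside (this is not a datatable answer): in pandas, this particular check doesn't need `.apply()` at all, since `Series.str.contains` accepts a regex and is vectorized. A sketch, using the question's pattern with the groups made non-capturing to avoid a pandas warning:

```python
import pandas as pd

df = pd.DataFrame({"customer": [101, 102, 103],
                   "address": ["12 main st", "32 8th st, 7th fl", "po box 123"]})

# str.contains runs the regex against every element; na=False
# makes missing addresses come out False instead of NaN
pattern = r"^p(?:ost)?\.? *o(?:ffice)?\.? *box *\d+"
df["is_pobox"] = df["address"].str.contains(pattern, regex=True, na=False)
print(df)
```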
Tour = Tour Name
Start = Available reservations at the start
End = Amount of reservations left
csv file columns:
ID | Tour | Start | End
12345 | Italy | 100 | 80
13579 | China | 50 | 30
24680 | France | 50 | 30
I have this so far
import pandas as pd
df = pd.read_csv("items4.csv",sep=",").set_index('ID')
d = dict(zip(df.index,df.values.tolist()))
print(d)
{12345: ['Italy', 100, 80], 13579: ['China', 50, 30], 24680: ['France', 50, 30]} #This is the output
I want to make a bar chart that looks something like this with this given data.
IIUC, call set_index and plot.bar:
df
ID Tour Start End
0 12345 Italy 100 80
1 13579 China 50 30
2 24680 France 50 30
import matplotlib.pyplot as plt

df.set_index('Tour')[['Start', 'End']].plot.bar()
plt.show()
If you're interested in annotating the bars too, take a look at Annotate bars with values on Pandas bar plots.
You can also do this without set_index()
df.plot.bar(x = 'Tour', y = ['Start', 'End'])
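A minimal sketch of the annotation idea, using matplotlib's bar_label (available in matplotlib 3.4+); the data here mirrors the question's CSV:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"Tour": ["Italy", "China", "France"],
                   "Start": [100, 50, 50],
                   "End": [80, 30, 30]})

ax = df.plot.bar(x="Tour", y=["Start", "End"])
# each y column produces one bar container; label every bar with its height
for container in ax.containers:
    ax.bar_label(container)
plt.show()
```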
Hi all, so using this past link, I am trying to consolidate columns of values into rows using groupby:
hp = hp[hp.columns[:]].groupby('LC_REF').apply(lambda x: ','.join(x.dropna().astype(str)))
#what I have
22     | 23     | 24      | LC_REF
TV     | WATCH  | HELLO   | 2C16
SCREEN | SOCCER | WORLD   | 2C16
TEST   | HELP   | RED     | 2C17
SEND   | PLEASE | PARFAIT | 2C17
#desired output
22 | TV,SCREEN
23 | WATCH, SOCCER
24 | HELLO, WORLD
25 | TEST, SEND
26 | HELP,PLEASE
27 | RED, PARFAIT
Or some sort of variation where column 22,23,24 is combined and grouped by LC_REF. My current code turns all of column 22 into one row, all of column 23 into one row, etc. I am so close I can feel it!! Any help is appreciated
It seems you need (the chain is wrapped in parentheses so it can span multiple lines):
df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))
        .stack()
        .rename_axis(('LC_REF','a'))
        .reset_index(name='vals'))
print(df)
print (df)
LC_REF a vals
0 2C16 22 TV,SCREEN
1 2C16 23 WATCH,SOCCER
2 2C16 24 HELLO,WORLD
3 2C17 22 TEST,SEND
4 2C17 23 HELP,PLEASE
5 2C17 24 RED,PARFAIT
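The same reshaping can also be sketched with melt, which unpivots the value columns into rows before grouping (column names taken from the question's example):

```python
import pandas as pd

hp = pd.DataFrame({"22": ["TV", "SCREEN", "TEST", "SEND"],
                   "23": ["WATCH", "SOCCER", "HELP", "PLEASE"],
                   "24": ["HELLO", "WORLD", "RED", "PARFAIT"],
                   "LC_REF": ["2C16", "2C16", "2C17", "2C17"]})

# melt turns columns 22/23/24 into (LC_REF, a, vals) rows;
# groupby then joins the values within each (LC_REF, column) pair
out = (hp.melt(id_vars="LC_REF", var_name="a", value_name="vals")
         .groupby(["LC_REF", "a"])["vals"]
         .agg(",".join)
         .reset_index())
print(out)
```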