I used pd.read_html to try to import a table, but I'm getting one long string instead when I run it. Is there a simple way to change the format of the result to get one word per row rather than a long string, or should I be using a function other than pd.read_html? Thank you!
Here is my code:
import requests
import pandas as pd
url ='http://www.linfo.org/acronym_list.html'
dfs = pd.read_html(url, header =0)
df = pd.concat(dfs)
df
I also used this and got the same result:
import pandas as pd
url ='http://www.linfo.org/acronym_list.html'
data = pd.read_html(url, header=0)
data[0]
Out[1]:
ABCDEFGHIJKLMNOPQRSTUVWXYZ A AMD Advanced Micro Devices API application programming interface ARP address resolution protocol ARPANET Advanced Research Projects Agency Network AS autonomous system ASCII American Standard Code for Information Interchange AT&T American Telephone and Telegraph Company ATA advanced technology attachment ATM asynchronous transfer mode B B byte BELUG Bellevue Linux Users Group BGP border gateway protocol...
The problem is how the table was created on this site.
According to https://www.w3schools.com/html/html_tables.asp, an HTML table is defined with the <table> tag. Each table row is defined with the <tr> tag. A table header is defined with the <th> tag. By default, table headings are bold and centered. A table data/cell is defined with the <td> tag.
If you press CTRL+SHIFT+I, you can inspect the HTML elements of your site, and you will see that this site does not follow this standard. That is why you are not getting the correct dataframe using pandas.read_html.
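For comparison, read_html has no trouble with a table that does follow that standard; here is a small, self-contained sketch (the HTML string below is made up purely for illustration):
import pandas as pd

# a well-formed table using <table>, <tr>, <th> and <td>, as described above
html = """
<table>
  <tr><th>Acronym</th><th>Meaning</th></tr>
  <tr><td>AMD</td><td>Advanced Micro Devices</td></tr>
  <tr><td>API</td><td>application programming interface</td></tr>
</table>
"""
df = pd.read_html(html, header=0)[0]
print(df)   # two rows: AMD and API, each with its meaning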
I have used the following packages
import pandas as pd
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.exceptions import KustoServiceError
from azure.kusto.data.helpers import dataframe_from_result_table
I would like to see the detailed output of the results; I followed the official tutorial, but I am not sure if I am doing it correctly:
x = dataframe_from_result_table(response.primary_results[0])
The result looks like this:
Empty DataFrame
Columns: [Resource]
Index: []
Is this result wrong, or is it normal?
If it is normal, how do I access the values? What would it look like if the executed query actually had output?
I want to see the content of the Resource column (Columns: [Resource]), because that is where the output I want should be. I am using translation software, please bear with me.
According to the official documentation, I should be able to manipulate the data with pandas, but I cannot retrieve the values.
kusto query results
When I use other query statements, the result shows:
Empty DataFrame
Columns: [Tag,Level,Sequence,Message,Metrics]
Index: []
How do I retrieve the values of Tag, Level, Sequence, Message, Metrics from the results?
The result class looks like this
You can play with the publicly available help cluster.
Please note that the connection requires an interactive login.
A login window will pop up when you execute the code.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder
from azure.kusto.data.helpers import dataframe_from_result_table
cluster = "https://help.kusto.windows.net"
db = "Samples"
query = """
StormEvents
| summarize count() by EventType
| top 5 by count_
"""
kcsb = KustoConnectionStringBuilder.with_interactive_login(cluster)
client = KustoClient(kcsb)
response = client.execute(db, query)
df = dataframe_from_result_table(response.primary_results[0])
print(df)
EventType count_
0 Thunderstorm Wind 13015
1 Hail 12711
2 Flash Flood 3688
3 Drought 3616
4 Winter Weather 3349
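Once you have the DataFrame, individual values can be read with ordinary pandas indexing; the same pattern applies to the Tag, Level, Sequence, Message, and Metrics columns from your own query. For example:
# pull a whole column as a Python list
event_types = df["EventType"].tolist()

# or walk the rows one by one
for row in df.itertuples(index=False):
    print(row.EventType, row.count_)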
P.S.
Another option is to leverage the KWE (Kusto Web Explorer) experience.
Get your own free cluster and easily ingest data using OneClick.
I have recently worked on a scraper for NBA box scores from www.basketball-reference.com and am very new to Beautiful Soup. I have attempted to use widgets, however many are broken, so that is not an option. So, I have attempted alternative methods for extracting the two stat tables. There is code that works for some games, such as this:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'Https://www.basketball-reference.com/boxscores/202110190MIL.html/'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
df1 = pd.read_html(html)[7]
df2 = pd.read_html(html)[12]
print(df1)
print(df2)
The two data frames this code outputs are my desired result.
While this works on specific games, the location of the tables is somewhat inconsistent across different games and years, so for other links the index is either out of bounds or returns the wrong table. I have tried to incorporate a myriad of exceptions that map to different locations; however, this is quite cumbersome, slow, and ineffective.
The class and id attributes, however, seem to be structured the same across all games, from what I can tell, or are at the very least a lot more consistent, but I cannot come up with a method to extract them universally. Ultimately I need the two basic box score tables extracted in full (including Team Totals) into two separate data frames from any potential game link. My input data has both team names, and I believe the three-letter abbreviations are included in the ids, so I am able to use them as such.
If anyone can provide any help with this, that would be amazing. I have provided several other games with the alternative structures below as examples. Thank you in advance.
https://www.basketball-reference.com/boxscores/202204010ORL.html/
https://www.basketball-reference.com/boxscores/202206160BOS.html/
https://www.basketball-reference.com/boxscores/194910290TRI.html/
I don't think that pandas' read_html() method can handle these tables. You may have to do it manually, along the lines below. I tried it on two of your 4 urls and it works, but you may have to tweak it for some other pages or for a particular presentation. You should remember that universal solutions in web scraping are rare, and even if one works today, it may not work next week.
This solution involves beautifulsoup and css selectors. I apologize for not adding a line-by-line explanation of the code, but it's getting late here; once you review this, it should become more or less self-evident.
So for a given url:
import requests
import pandas as pd
from bs4 import BeautifulSoup

# url is one of the box score links from the question; fetch it and strip the
# comment markers that hide some tables, as in the question's own code
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')

tabs = soup.select('table[id*="-game-basic"]')
for tab in tabs:
    cols, players = [], []
    # the second header row holds the real column names
    for s in tab.select('thead tr:nth-child(2) th'):
        cols.append(s.text)
    # body rows plus the Team Totals row in the footer
    for j in tab.select('tbody tr, tfoot tr'):
        player = [dat.text for dat in j.select('td, th')]
        players.append(player)
    # some of the rows in some tables have blank cells, and these need padding
    max_length = len(max(players, key=len))
    players_plus = [player + [""] * (max_length - len(player)) for player in players]
    print(pd.DataFrame(players_plus, columns=cols))
    print('------------------------')
Just wanted to point out one thing first. When you use pd.read_html(), it returns a list of tables/dataframes, so there's no need to call it twice. Doing so is also less efficient, since it parses the same HTML twice (and, if you pass a URL rather than a string, makes the same HTTP request twice).
So rather than doing this:
df1 = pd.read_html(html)[7]
df2 = pd.read_html(html)[12]
Do:
dfs = pd.read_html(html) # <-- returns a list of dataframes
df1 = dfs[7]
df2 = dfs[12]
So, to your question, there are a couple of ways to attack this:
Insert some logic that checks the length of the rows and/or columns, and pull out the tables that match.
Note: I don't like this solution in this case, as there are multiple tables that fit these criteria, so it's not the most robust in this situation.
urls = ['https://www.basketball-reference.com/boxscores/202110190MIL.html/',
'https://www.basketball-reference.com/boxscores/202204010ORL.html/',
'https://www.basketball-reference.com/boxscores/202206160BOS.html/',
'https://www.basketball-reference.com/boxscores/194910290TRI.html/']
for url in urls:
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    output = []
    dfs = pd.read_html(html)
    # keep only the tables whose shape looks like a full box score
    for each in dfs:
        if len(each.columns) >= 21 or len(each) >= 14:
            output.append(each)
You can use the html attributes to get the table you want.
This is the better option: pull out the specific class or id you are after. This uses a regex to find the ids, which are of the form box-<team id>-game-basic:
import requests
import pandas as pd
from bs4 import BeautifulSoup
import re
urls = ['https://www.basketball-reference.com/boxscores/202110190MIL.html/',
'https://www.basketball-reference.com/boxscores/202204010ORL.html/',
'https://www.basketball-reference.com/boxscores/202206160BOS.html/',
'https://www.basketball-reference.com/boxscores/194910290TRI.html/']
for url in urls:
    html = requests.get(url).text.replace('<!--', '').replace('-->', '')
    soup = BeautifulSoup(html, 'html.parser')
    tables = soup.find_all('table', {'id': re.compile('box-.*-game-basic')})
    dfs = pd.read_html(str(tables), header=1)
    df1 = dfs[0]
    df2 = dfs[1]
    print(df1)
    print(df2)
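Since your input data has the team names and the three-letter abbreviations appear in those ids, you can also target a single team's table directly. A hedged sketch for a game in which Milwaukee played (e.g. the first URL above), assuming soup has been built for that page as in the loop:
# pull just Milwaukee's basic box score by its id (id pattern as above)
mil_table = soup.find('table', {'id': 'box-MIL-game-basic'})
df_mil = pd.read_html(str(mil_table), header=1)[0]
print(df_mil)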
I have an HTML document (it's a 10-K filing from Apple) which I would like to store as a string with readable content in a single pandas dataframe column. So far I have used BeautifulSoup to translate the HTML document into "readable" text. Using the get_text() function, I am able to print the text in a readable format; however, I am not able to store this text as a string that can be inserted into a dataframe.
This is the format I am looking for:
Ticker    10-K
AAPL      Apple Inc. builds Macs ...
Here my thoughts so far:
from bs4 import BeautifulSoup
import pandas as pd
file = open("filing-details.html", "r")
soup = BeautifulSoup(file)
#print(soup.get_text()) gives me an output in a readable form, such as "Apple Inc. builds Macs"
print(soup.get_text())
Now I would like to store this generated string, "Apple Inc. builds Macs", so that I can insert it into a dataframe. Therefore, I tried this:
text = soup.get_text()
df1["txt"] = text
df1
However, this gives me the following output:
Ticker    10-K
AAPL      \n10-K\n1\na10-k20179302017.htm\n10-K\n\n\n\n...
I also tried the following, but received the same result:
df1["txt"] = str(soup.get_text())
Does somebody know how I can store the exact same output I receive by using print(soup.get_text()) as a string in a dataframe? (The dataframe is later used for textual analysis.)
I would really appreciate your help. Thank you in advance!
I would recommend creating a dictionary with the title as key and the text as value, and converting it into a DataFrame.
import pandas as pd

new_data = {
    "Ticker": ["AAPL"],           # add more tickers here if you have more filings
    "10-K": [soup.get_text()],    # and the corresponding filing texts here
}
df = pd.DataFrame(new_data)
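If you have several filings, a hedged variation is to collect one dict per filing and build the DataFrame at the end (the list of ticker/filename pairs below is just a placeholder):
from bs4 import BeautifulSoup
import pandas as pd

rows = []
for ticker, path in [("AAPL", "filing-details.html")]:   # placeholder list of filings
    with open(path, "r") as f:
        soup = BeautifulSoup(f, "html.parser")
    rows.append({"Ticker": ticker, "10-K": soup.get_text()})

df = pd.DataFrame(rows)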
Try this method and please reply if it worked.
I'm trying to access the table details to ultimately put into a dataframe and save as a csv with a limited number of rows (the dataset is massive) from the following site: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
I'm just starting out with web scraping and was practicing on this dataset. I can effectively pull tags like div, but when I try soup.findAll('tr') or td, it returns an empty set.
The table appears to be embedded in different code (see link above), so that's maybe my issue, but I'm still unsure how to access the detail rows, headers, etc. Selenium maybe?
Thanks in advance!
By the looks of it, the website already allows you to export the data:
As it would seem, the original link is:
https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2/data
The .csv download link is:
https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
The .json link is:
https://data.cityofchicago.org/resource/ijzp-q8t2.json
Therefore you could simply extract the ID of the data, in this case ijzp-q8t2, and replace it on the download links above. Here is the official documentation of their API.
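For example, a minimal sketch of the CSV route with pandas (the output filename is just a placeholder; nrows limits how many rows are parsed into the DataFrame):
import pandas as pd

csv_url = "https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(csv_url, nrows=2000)                 # keep only the first 2000 data rows
df.to_csv("chicago_crimes_sample.csv", index=False)   # placeholder output filename
Alternatively, the same dataset can be queried through their Socrata API with the sodapy client: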
import pandas as pd
from sodapy import Socrata
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata(data.cityofchicago.org,
# MyAppToken,
# username="user@example.com",
# password="AFakePassword")
# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("ijzp-q8t2", limit=2000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
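To cover the "save as a csv" part of the question, the limited sample can then be written out (the filename is arbitrary):
results_df.to_csv("chicago_crimes_sample.csv", index=False)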
I've crawled a tracklist of 36,000 songs that have been played on the Danish national radio station P3. I want to do some statistics on how frequently each of the genres has been played within this period, so I figured the discogs API might help with labeling each track with a genre. However, the documentation for the API doesn't seem to include an example for querying the genre of a particular song.
I have a CSV file with 3 columns: Artist, Title & Test (Test is where I want the API to put the genre label for each song).
Here's a sample of the script I've built so far:
import json
import pandas as pd
import requests
import discogs_client
d = discogs_client.Client('ExampleApplication/0.1')
d.set_consumer_key('key-here', 'secret-here')
input = pd.read_csv('Desktop/TEST.csv', encoding='utf-8',error_bad_lines=False)
df = input[['Artist', 'Title', 'Test']]
df.columns = ['Artist', 'Title','Test']
for i in range(0, len(list(df.Artist))):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)

df.to_csv('Desktop/TEST2.csv', encoding='utf-8', index=False)
This script has been working with a dummy file with 3 records in it so far, for mapping the artist of a given ID#. But as soon as the file gets larger (e.g. 2000 records), it returns an HTTPError when it cannot find the artist.
I have some questions regarding this approach:
1) Would you recommend using the search query function in the API for retrieving a variable such as 'Genre', or do you think it is possible to retrieve Genre with a 'd.' function from the API?
2) Will I need to acquire an API key? I have successfully mapped the 3 records without an API key so far. It looks like the key is free, though.
Here's the guide I have been following:
https://github.com/discogs/discogs_client
And here's the documentation for the API:
https://www.discogs.com/developers/#page:home,header:home-quickstart
Maybe you need to re-read the discogs_client examples; I am not an expert myself, but a newbie trying to use this API.
AFAIK, g = d.artist(x) fails because x must be an integer, not a string.
So you must first do a search, then get the artist id, then call d.artist(artist_id).
Sorry for not providing an example, I am a Python newbie right now ;)
Also have you checked acoustid for
It's probably a rate limit.
Read the status code of your response; you should find a 429 Too Many Requests.
Unfortunately, if that's the case, the only solution is to add a sleep in your code to make one request per second.
Check out the API doc:
http://www.discogs.com/developers/#page:home,header:home-rate-limiting
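For example, a minimal tweak to the loop from your question (it only adds a pause between requests; everything else is unchanged):
import time

for i in range(len(df.Artist)):
    x = df.Artist[i]
    g = d.artist(x)
    df.Test[i] = str(g)
    time.sleep(1)   # roughly one request per second, to stay under the rate limit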
I found this guide:
https://github.com/neutralino1/discogs_client.
Access the API with your key and try something like:
d = discogs_client.Client('something.py', user_token=auth_token)
release = d.release(774004)
genre = release.genres
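To get from an Artist/Title pair to a genre, a hedged sketch of the same idea (the keyword arguments follow the Discogs database search parameters; I have not tested this):
# search for a release by artist and track title, then read its genres
results = d.search(artist='Some Artist', track='Some Title', type='release')
if results.count:
    release = results[0]        # first match; you may want smarter selection
    print(release.genres)       # a list of genre strings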
If you found a better solution please share.