Trouble creating pandas dataframe from lists - python

I am having some trouble creating a pandas df from lists I generate while scraping data from the web. Here I am using beautifulsoup to pull a few pieces of information about local farms from localharvest.org (farm name, city, and description). I am able to scrape the data effectively, creating a list of objects on each pass. The trouble I'm having is outputting these lists into a tabular df.
My complete code is as follows:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
name = item.contents[1].text
fname.append(name)
city = item.contents[3].text
fcity.append(city)
desc = item.find_all("div", {'class': 'short-desc'})[0].text
fdesc.append(desc)
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')
Interestingly, the print(df) function shows that all three lists have been passed to the dataframe. But the resultant .CSV output contains only a single column of values (fcity) with the fname and fdesc column labels present. Interstingly, If I do something crazy like try to force tab delineated output with df.to_csv('farmdata.csv', sep='\t'), I get a single column with jumbled output, but it appears to at least be passing the other elements of the dataframe.
Thanks in advance for any input.

Try stripping out the newline and space characters:
import requests
from bs4 import BeautifulSoup
import pandas
url = "http://www.localharvest.org/search.jsp?jmp&lat=44.80798&lon=-69.22736&scale=8&ty=6"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("div", {'class': 'membercell'})
fname = []
fcity = []
fdesc = []
for item in data:
name = item.contents[1].text.split()
fname.append(' '.join(name))
city = item.contents[3].text.split()
fcity.append(' '.join(city))
desc = item.find_all("div", {'class': 'short-desc'})[0].text.split()
fdesc.append(' '.join(desc))
df = pandas.DataFrame({'fname': fname, 'fcity': fcity, 'fdesc': fdesc})
print (df)
df.to_csv('farmdata.csv')

Consider, instead of using lists of the information for each farm entity that you scrape, to use a list of dictionaries, or a dict of dicts. eg:
[{name:farm1, city: San Jose... etc},
{name: farm2, city: Oakland...etc}]
Now you can call Pandas.DataFrame.from_dict() on the above defined list of dicts.
Pandas method: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html
An answer that might describe this solution in more detail: Convert Python dict into a dataframe

It works for me:
# Taking a few slices of each substring of a given string after stripping off whitespaces
df['fname'] = df['fname'].str.strip().str.slice(start=0, stop=20)
df['fdesc'] = df['fdesc'].str.strip().str.slice(start=0, stop=20)
df.to_csv('farmdata.csv')
df
fcity fdesc fname
0 South Portland, ME Gromaine Farm is pro Gromaine Farm
1 Newport, ME We are a diversified Parker Family Farm
2 Unity, ME The Buckle Farm is a The Buckle Farm
3 Kenduskeag, ME Visit wiseacresfarm. Wise Acres Farm
4 Winterport, ME Winter Cove Farm is Winter Cove Farm
5 Albion, ME MISTY BROOK FARM off Misty Brook Farm
6 Dover-Foxcroft, ME We want you to becom Ripley Farm
7 Madison, ME Hide and Go Peep Far Hide and Go Peep Far
8 Etna, ME Fail Better Farm is Fail Better Farm
9 Pittsfield, ME We are a family farm Snakeroot Organic Fa
Maybe you had a lot of empty spaces which was misinterpreted by the default delimiter(,) and hence picked up fcity column as it contained(,) in it which led to the ordering getting affected.

Related

Trouble Looping through JSON elements pulled using API

I am trying to pull search results data from an API on a website and put it into a pandas dataframe. I've been able to successfully pull the info from the API into a JSON format.
The next step I'm stuck on is how to loop through the search results on a particular page and then again for each page of results.
Here is what I've tried so far:
#Step 1: Connect to an API
import requests
import json
response_API = requests.get('https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page=1')
#200
#Step 2: Get the data from API
data = response_API.text
#Step 3: Parse the data into JSON format
parse_json = json.loads(data)
#Step 4: Extract data
title = parse_json['results'][0]['title']
pub_date = parse_json['results'][0]['publication_date']
agency = parse_json['results'][0]['agencies'][0]['name']
Here is where I've tried to put this all into a loop:
import numpy as np
import pandas as pd
df=[]
for page in np.arange(0,7):
url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
response_API = requests.get(url)
print(response_API.status_code)
data = response_API.text
parse_json = json.loads(data)
for i in parse_json:
title = parse_json['results'][i]['title']
pub_date = parse_json['results'][i]['publication_date']
agency = parse_json['results'][i]['agencies'][0]['name']
df.append([title,pub_date,agency])
cols = ["Title", "Date","Agency"]
df = pd.DataFrame(df,columns=cols)
I feel like I'm close to the correct answer, but I'm not sure how to move forward from here. I need to iterate through the results where I placed the i's when parsing through the json data, but I get an error that reads, "Type Error: list indices must be integers or slices, not str". I understand I can't put the i's in those spots, but how else am I supposed to iterate through the results?
Any help would be appreciated!
Thank you!
I think you are very close!
import numpy as np
import pandas as pd
import requests
BASE_URL = "'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}"
results = []
for page in range(0, 7):
response = requests.get(BASE_URL.format(page=page))
if response.ok:
resp_json = response.json()
for res in resp_json["results"]:
results.append(
[
res["title"],
res["publication_date"],
[agency["name"] for agency in res["agencies"]]
]
)
df = pd.DataFrame(results, columns=["Title", "Date", "Agencies"])
In this block of code, I used the requests library's built-in .json() method, which can automatically convert a response's text to a JSON dict (if it's in the proper format).
The if response.ok is a little less-verbose way provided by requests to check if the status code is < 400, and can prevent errors that might occur when attempting to parse the response if there was a problem with the HTTP call.
Finally, I'm not sure what data you need exactly for your DataFrame, but each object in the
"results" list from the pages pulled from that website has "agencies" as a list of agencies... wasn't sure if you wanted to drop all that data, so I kept the names as a list.
*Edit:
In case the response objects don't contain the proper keys, we can use the .get() method of Python dictionaries.
# ...snip
for res in resp_json["results"]:
results.append(
[
res.get("title"), # This will return `None` as a default, instead of causing a KeyError
res.get("publication_date"),
[
# Here, get the 'raw_name' or None, in case 'name' key doesn't exist
agency.get("name", agency.get("raw_name"))
for agency in res.get("agencies", [])
]
]
)
Slightly different approach: rather than iterating through the response, read into a dataframe then save what you need. The saves the first agency name in the list.
df_list=[]
for page in np.arange(0,7):
url = 'https://www.federalregister.gov/api/v1/documents.json?conditions%5Bpublication_date%5D%5Bgte%5D=09%2F01%2F2021&conditions%5Bterm%5D=economy&order=relevant&page={page}'.format(page=page)
response_API = requests.get(url)
# print(response_API.status_code)
data = response_API.text
parse_json = json.loads(data)
df = pd.json_normalize(parse_json['results'])
df['Agency'] = df['agencies'][0][0]['raw_name']
df_list.append(df[['title', 'publication_date', 'Agency']])
df_final = pd.concat(df_list)
df_final
title publication_date Agency
0 Determination of the Promotion of Economy and ... 2021-09-28 OFFICE OF MANAGEMENT AND BUDGET
1 Corporate Average Fuel Economy Standards for M... 2021-09-03 OFFICE OF MANAGEMENT AND BUDGET
2 Public Hearing for Corporate Average Fuel Econ... 2021-09-14 OFFICE OF MANAGEMENT AND BUDGET
3 Investigation of Urea Ammonium Nitrate Solutio... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
4 Call for Nominations To Serve on the National ... 2021-09-08 OFFICE OF MANAGEMENT AND BUDGET
.. ... ... ...
15 Energy Conservation Program: Test Procedure fo... 2021-09-14 DEPARTMENT OF COMMERCE
16 Self-Regulatory Organizations; The Nasdaq Stoc... 2021-09-09 DEPARTMENT OF COMMERCE
17 Regulations To Improve Administration and Enfo... 2021-09-20 DEPARTMENT OF COMMERCE
18 Towing Vessel Firefighting Training 2021-09-01 DEPARTMENT OF COMMERCE
19 Patient Protection and Affordable Care Act; Up... 2021-09-27 DEPARTMENT OF COMMERCE
[140 rows x 3 columns]

If-condition is not executed in a for-loop when scraping data from kworb.net

I need to collect data on the countries where artists are streamed most frequently on Spotify. To do that, I am using this source that contains a list of 10.000 artists.
So the aim of my code is to create a table with two columns:
artist name;
country where the artist is streamed the most.
I wrote a code (see below) that gets this information from each artist's personal page (here is an example for Drake). An artist's name is taken from the title of a page and the country code -- from table column heading preceded by the column titled "Global". For some artists, there is no column titled "Global" and I need to account for this condition. And here is where my problems comes in.
I am using the following if-condition:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
But only the first condition is executed, where the code extracts the text from the 4th column. Alternatively, I tried the reverse condition:
if "<th>Global</th>" in soup2.find_all('table')[0].find_all('th'):
Country = soup2.find_all('table')[0].find_all('th')[5].text
else:
Country = soup2.find_all('table')[0].find_all('th')[4].text
country.append(Country)
But the code still extracts the text from the 4th column, even if I want it to extract it from the 5th column when the 4th column is titled "Global".
This reproducible code is run for a subset of artists, for whom there is a column titled "Global" (e.g. LANY) and for whom there is none (e.g. Henrique & Diego)(#391 to #395 as of June 16, 2019):
from time import sleep
from random import randint
from requests import get
from bs4 import BeautifulSoup as bs
import pandas as pd
response1 = get('https://kworb.net/spotify/artists.html', headers = headers)
soup1 = bs(response1.text, 'html.parser')
table = soup1.find_all('table')[0]
rows = table.find_all('tr')[391:396] #selected subset of 10.000 artists
artist = []
country = []
for row in rows:
artist_url = row.find('a')['href']
response2 = get('https://kworb.net/spotify/' + artist_url)
sleep(randint(8,15))
soup2 = bs(response2.text, 'html.parser')
Artist = soup2.find('title').text[:-24]
artist.append(Artist)
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'): #problem suspected in this if-condition
Country = soup2.find_all('table')[0].find_all('th')[4].text
else:
Country = soup2.find_all('table')[0].find_all('th')[5].text
country.append(Country)
df = pd.DataFrame({'Artist': artist,
'Country': country
})
print(df)
As a result, I get the following:
Artist Country
0 YNW Melly Global
1 Henrique & Diego BR
2 LANY Global
3 Parson James Global
4 ANAVITÃRIA BR
While the actual output, as of June 16, 2019, should be:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR
I suspect the wrong if-condition for the variable country. I would appreciate any help with regard to that.
You compare bs4 object with string.
Need first get text from each found object then compare with string:
replace:
if "<th>Global</th>" not in soup2.find_all('table')[0].find_all('th'):
with:
# get text options from html
found_options = [item.text for item in soup2.find_all('table')[0].find_all('th')]
if "Global" not in found_options:
Output:
Artist Country
0 YNW Melly US
1 Henrique & Diego BR
2 LANY PH
3 Parson James US
4 ANAVITÃRIA BR

How to Use Beautiful Soup to Scrape SEC's Edgar Database and Receive Desire Data

Apologies in advance for long question- I am new to Python and I'm trying to be as explicit as I can with a fairly specific situation.
I am trying to identify specific data points from SEC Filings on a routine basis however I want to automate this instead of having to manually go search a companies CIK ID and Form filing. So far, I have been able to get to a point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:
index cik conm type date path
0 0 1000045 NICHOLAS FINANCIAL INC 10-Q 2019-02-14 edgar/data/1000045/0001193125-19-039489.txt
1 1 1000045 NICHOLAS FINANCIAL INC 4 2019-01-15 edgar/data/1000045/0001357521-19-000001.txt
2 2 1000045 NICHOLAS FINANCIAL INC 4 2019-02-19 edgar/data/1000045/0001357521-19-000002.txt
3 3 1000045 NICHOLAS FINANCIAL INC 4 2019-03-15 edgar/data/1000045/0001357521-19-000003.txt
4 4 1000045 NICHOLAS FINANCIAL INC 8-K 2019-02-01 edgar/data/1000045/0001193125-19-024617.txt
Despite having all this information, as well as being able to download these text files and see the underlying data, I am unable to parse this data as it is in xbrl format and is a bit out of my wheelhouse. Instead I came across this script (kindly provided from this site https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):
from bs4 import BeautifulSoup
import requests
import sys
# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'
# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text
# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
cells = row.find_all('td')
if len(cells) > 3:
if '2015' in cells[3].text:
doc_link = 'https://www.sec.gov' + cells[1].a['href']
# Exit if document link couldn't be found
if doc_link == '':
print("Couldn't find the document link")
sys.exit()
# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text
# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
cells = row.find_all('td')
if len(cells) > 3:
if 'INS' in cells[3].text:
xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text
# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
if tag.name == 'us-gaap:stockholdersequity':
print("Stockholder's equity: " + tag.text)
Just running this script works exactly how I'd like it to. It returns the stockholders equity for a given company (IBM in this case) and I can then take that value and write it to an excel file.
My two-part question is this:
I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote it to a list of tuples - I think thats what its called- it looks like this [('1009759', 'D', '20190215'),('1009891', 'D', '20190206'),...]). How do I take this data, replace the initial part of the script I found, and loop through it efficiently so I can end up with a list of desired values each company, filing, and date?
Is there generally a better way to do this? I would think there would be some sort of API or python package in order to query the data I'm interested in. I know there is some high level information out there for Form 10-Ks and Form 10-Qs however I am in Form Ds which is somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.
Thank you for the help!
You need to define a function which can be essentially most of the code you have posted and that function should take 3 keyword arguments (your 3 values). Then rather than define the three in your code, you just pass in those values and return a result.
Then you take your list which you created and make a simple for loop around it to cal the function you defined with those three values and then do something with the result.
def get_data(value1, value2, value3):
# your main code here but replace with your arguments above.
return content
for company in companies:
content = get_data(value1, value2, value3)
# do something with content
Assuming you have a dataframe sec with correctly named columns for your list of filings, above, you first need to extract from the dataframe the relevant information into three lists:
cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)
Then you create your base_url, with the items inserted and get your data:
for c, t, d in zip(cik, typ, dat):
base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
edgar_resp = requests.get(base_url)
And go from there.

scraping data from wikipedia table

I'm just trying to scrape data from a wikipedia table into a panda dataframe.
I need to reproduce the three columns: "Postcode, Borough, Neighbourhood".
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'xml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = []
for link in links:
Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighbourhood'] = pd.Series(Neighbourhood)
df
And it returns only the borough...
Thanks
You may be overthinking the problem, if you only want the script to pull one table from the page. One import, one line, no loops:
import pandas as pd
url='https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M'
df=pd.read_html(url, header=0)[0]
df.head()
Postcode Borough Neighbourhood
0 M1A Not assigned Not assigned
1 M2A Not assigned Not assigned
2 M3A North York Parkwoods
3 M4A North York Victoria Village
4 M5A Downtown Toronto Harbourfront
You need to iterate over each row in the table and store the data row by row, not just in one giant list. Try something like this:
import pandas
import requests
from bs4 import BeautifulSoup
website_text = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
soup = BeautifulSoup(website_text,'xml')
table = soup.find('table',{'class':'wikitable sortable'})
table_rows = table.find_all('tr')
data = []
for row in table_rows:
data.append([t.text.strip() for t in row.find_all('td')])
df = pandas.DataFrame(data, columns=['PostalCode', 'Borough', 'Neighbourhood'])
df = df[~df['PostalCode'].isnull()] # to filter out bad rows
then
>>> df.head()
PostalCode Borough Neighbourhood
1 M1A Not assigned Not assigned
2 M2A Not assigned Not assigned
3 M3A North York Parkwoods
4 M4A North York Victoria Village
5 M5A Downtown Toronto Harbourfront
Basedig provides a platform to download Wikipedia tables as Excel, CSV or JSON files directly. Here is a link to the Wikipedia source: https://www.basedig.com/wikipedia/
If you do not find the dataset you are looking for on Basedig, send them the link to your article and they'll parse it for you.
Hope this helps

Scraping wikipedia table to pandas data frame

I need to scrape a wikipedia table to a pandas data frame and create three columns: PostalCode, Borough, and Neighborhoods.
https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M
Here is the code that I have used:
import requests
website_url = requests.get('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M').text
from bs4 import BeautifulSoup
soup = BeautifulSoup(website_url,'lxml')
print(soup.prettify())
My_table = soup.find('table',{'class':'wikitable sortable'})
My_table
links = My_table.findAll('a')
links
Neighbourhood = [ ]
for link in links:
Neighbourhood.append(link.get('title'))
print (Neighbourhood)
import pandas as pd
df = pd.DataFrame([])
df['PostalCode', 'Borough', 'Neighborhood'] = Neighbourhood
df
And it returns that:
(PostalCode, Borough, Neighborhood)
0 North York
1 Parkwoods
2 North York
3 Victoria Village
4 Downtown Toronto
5 Harbourfront (Toronto)
6 Downtown Toronto
7 Regent Park
8 North York
I can't figure out how to pick up the postcode and the neighbourhood from the wikipedia table.
Thank you
pandas allow you to do it in one line of code:
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M')[0]
Provide the error message.
By looking at it, first you have df['Neighbourhoods'] = Neighbourhoods where your list has the name Neighborhoods.
You have two small errors:
df = pd.dataframe() should be df = pd.DataFrame([])
You also misspelled Neighborhoods as Neighbourhoods the second time.
You might also need to change soup = BeautifulSoup(website_url,'lxml') to soup = BeautifulSoup(website_url,'xml'), but we can't help you more without knowing your exact error message.
Instead of using
df = pd.dataframe()
df['Neighbourhoods'] = Neighbourhoods
You can use
df['Neighbourhoods'] = pd.Series(Neighbourhoods)
This would solve your error and add new columns similarly using pd.Series(listname) or you can give list of lists containing PostalCode, Borough, and Neighborhoods using this code
df = pd.Dataframe(list_of_lists)
It looks like you're only picking up one of the columns here:
links = My_table.findAll('a')
You should be looking for 'tr' rather than 'a' as that signifies a new row in the table.
You should then use a for loop to populate a list of lists, this code should work:
v = []
for tr in values:
td = tr.find_all('td')
row = [i.text for i in td]
v.append(row)
df = pd.DataFrame.from_records(v)

Categories

Resources