I am trying to scrape data from web pages, and the scraping itself works.
With the script below I get all of the div class data, but I am confused about how to write the data to a CSV file so that, for example:
the first-name data goes in the First Name column,
the last-name data goes in the Last Name column,
and so on.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})
for i in range(len(name_box)):
    data = name_box[i].text.strip()
Data:
Information Type
Individual
First Name
KACHAM
Middle Name
Last Name
RAJESHWAR
Father Full Name
RAMAIAH
Do you have any Past Experience ?
No
Do you have any registration in other State than registred State?
No
House Number
8-2-293/82/A/446/1
Building Name
SAI KRUPA
Street Name
ROAD NO 20
Locality
JUBILEE HILLS
Landmark
JUBILEE HILLS
State
Telangana
Division
Division 1
District
Hyderabad
Mandal
Shaikpet
Village/City/Town
Pin Code
500033
Office Number
04040151614
Fax Number
Website URL
Authority Name
Plan Approval Number
1/18B/06558/2018
Project Name
SKV S ANANDA VILAS
Project Status
New Project
Proposed Date of Completion
17/04/2024
Litigations related to the project ?
No
Project Type
Residential
Are there any Promoter(Land Owner/ Investor) (as defined by Telangana RERA Order) in the project ?
Yes
Sy.No/TS No.
00
Plot No./House No.
10-2-327
Total Area(In sqmts)
526.74
Area affected in Road widening/FTL of Tanks/Nala Widening(In sqmts)
58.51
Net Area(In sqmts)
1
Total Building Units (as per approved plan)
1
Proposed Building Units(as per agreement)
1
Boundaries East
PLOT NO 213
Boundaries West
PLOT NO 215
Boundaries North
PLOT NO 199
Boundaries South
ROAD NO 8
Approved Built up Area (In Sqmts)
1313.55
Mortgage Area (In Sqmts)
144.28
State
Telangana
District
Hyderabad
Mandal
Maredpally
Village/City/Town
Street
ROAD NO 8
Locality
SECUNDERABAD COURT
Pin Code
500026
The above is the data I get after running the code.
Edit
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    print(data)

fname = 'out.csv'
with open(fname) as f:
    next(f)
    for line in f:
        head = []
        value = []
        for row in line:
            head.append(row)
            print(row)
Expected
Information Type | First | Middle Name | Last Name | ......
Individual | KACHAM | | RAJESHWAR | .....
I have 200 URLs, but the data is not the same for every URL; some fields are missing. I want to write the file in such a way that if a value is not available, nothing is written and the cell is just left blank.
Please suggest. Thank you in advance.
To write to CSV you need to know which values belong in the head and which in the body; in this case the head values are the HTML elements that contain a <label>.
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = 'http://rerait.telangana.gov.in/PrintPreview/PrintPreview/UHJvamVjdElEPTQmRGl2aXNpb249MSZVc2VySUQ9MjAyODcmUm9sZUlEPTEmQXBwSUQ9NSZBY3Rpb249U0VBUkNIJkNoYXJhY3RlckQ9MjImRXh0QXBwSUQ9'
page = urlopen(html)
data = BeautifulSoup(page, 'html.parser')
name_box = data.findAll('div', attrs={'class': 'col-md-3 col-sm-3'})

heads = []
values = []
for i in range(len(name_box)):
    data = name_box[i].text.strip()
    dataHTML = str(name_box[i])
    if 'PInfoType' in dataHTML:
        # <div class="col-md-3 col-sm-3" id="PInfoType">
        # empty value, maybe additional data for "Information Type"
        continue
    if 'for="2"' in dataHTML:
        # <label for="2">No</label>
        # it looks like a head but is actually a value
        values.append(data)
    elif '<label' in dataHTML:
        # <label for="PersonalInfoModel_InfoTypeValue">Information Type</label>
        # head, i.e. the top row
        heads.append(data)
    else:
        # <div class="col-md-3 col-sm-3">Individual</div>
        # value for the second row
        values.append(data)

csvData = ', '.join(heads) + '\n' + ', '.join(values)
with open("results.csv", 'w') as f:
    f.write(csvData)
print("finish.")
Question: How to write a CSV file from scraped data
Read the data into a dict and use csv.DictWriter(...) to write it to a CSV file.
Documentation about:
csv.DictWriter
while
next
break
Mapping Types — dict
Skip the first line, as it's the title
Loop Data lines
key = next(data)
value = next(data)
Break loop if no further data
Build dict[key] = value
After finishing the loop, write dict to CSV file
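A minimal sketch of these steps, reusing the name_box result from the scraping code above (the file name and the simple key/value pairing are illustrative; it does not yet handle labels whose value is missing):
import csv

# iterate the stripped texts of the scraped divs
texts = iter(div.text.strip() for div in name_box)
next(texts)  # skip the first line, as it's the title

record = {}
while True:
    try:
        key = next(texts)    # label, e.g. "First Name"
        value = next(texts)  # its value, e.g. "KACHAM"
    except StopIteration:
        break                # no further data
    record[key] = value

with open('out.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=record.keys())
    writer.writeheader()
    writer.writerow(record)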
Output:
{'Individual': '', 'Father Full Name': 'RAMAIAH', 'First Name': 'KACHAM', 'Middle Name': '', 'Last Name': 'RAJESHWAR',... (omitted for brevity)
Related
I want to prepare a dataframe of universities, their abbreviations and website links.
My code:
import requests
import pandas as pd

abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
abb_df_list = pd.read_html(abb_html)
Current result:
ValueError: No tables found
Expected result:
df =
| | university_full_name | uni_abb | uni_url|
---------------------------------------------------------------------
| 0 | Albert Einstein College of Medicine | AECOM | https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine|
That's one funky page you have there...
First, there are indeed no tables in there. Second, some organizations don't have links, others have redirect links and still others use the same abbreviation for more than one organization.
So you need to bring in the heavy artillery: xpath...
import pandas as pd
import requests
from lxml import html as lh

url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
doc = lh.fromstring(response.text)

rows = []
for uni in doc.xpath('//h2[./span[@class="mw-headline"]]//following-sibling::ul//li'):
    info = uni.text.split(' – ')
    abb = info[0]
    # for those w/ no links
    if not uni.xpath('.//a'):
        rows.append((abb, " ", info[1]))
    # now to account for those using the same abbreviation for multiple teams
    for a in uni.xpath('.//a'):
        dat = a.xpath('./@*')
        # for those with redirects
        if len(dat) == 3:
            del dat[1]
        link = f"https://en.wikipedia.org{dat[0]}"
        rows.append((abb, link, dat[1]))

# and now, at last, to the dataframe
cols = ['abb', 'url', 'full name']
df = pd.DataFrame(rows, columns=cols)
df
Output:
abb url full name
0 AECOM https://en.wikipedia.org/wiki/Albert_Einstein_... Albert Einstein College of Medicine
1 AFA https://en.wikipedia.org/wiki/United_States_Ai... United States Air Force Academy
etc.
Note: you can rearrange the order of columns in the dataframe, if you are so inclined.
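For example, a hedged one-liner to put the full name first (column names taken from the cols list above):
df = df[['full name', 'abb', 'url']]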
Select and iterate only the expected <li> elements and extract their information, but be aware there is a university without an <a> (SUI – State University of Iowa), so this has to be handled with an if-statement, as in the example:
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States"
response = requests.get(url)
soup = BeautifulSoup(response.text)
data = []
for e in soup.select('h2 + ul li'):
    data.append({
        'abb': e.text.split('-')[0],
        'full_name': e.text.split('-')[-1],
        'url': 'https://en.wikipedia.org' + e.a.get('href') if e.a else None
    })

pd.DataFrame(data)
Output:
   abb                           full_name                                        url
0  AECOM                         Albert Einstein College of Medicine              https://en.wikipedia.org/wiki/Albert_Einstein_College_of_Medicine
1  AFA                           United States Air Force Academy                  https://en.wikipedia.org/wiki/United_States_Air_Force_Academy
2  Annapolis                     U.S. Naval Academy                               https://en.wikipedia.org/wiki/United_States_Naval_Academy
3  A&M                           Texas A&M University, but also others; see A&M   https://en.wikipedia.org/wiki/Texas_A%26M_University
4  A&M-CC or A&M-Corpus Christi  Corpus Christi                                   https://en.wikipedia.org/wiki/Texas_A%26M_University%E2%80%93Corpus_Christi
...
There are no tables on this page, only lists. So the goal will be to go through the <ul> and then the <li> tags, skipping the paragraphs you are not interested in (the first and those after the 26th).
You can extract the abbreviation (uni_abb) of each university this way:
uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
while to get the url you have to access the 'href' and 'title' attributes inside the <a> tag:
for a in li.find_all('a', href=True):
    title = a['title']
    url = f"https://en.wikipedia.org/{a['href']}"
Accumulate the extracted information into a list, and finally create the dataframe by assigning appropriate column names.
Here is the complete code, in which I use BeautifulSoup:
import requests
import pandas as pd
from bs4 import BeautifulSoup
abb_url = 'https://en.wikipedia.org/wiki/List_of_colloquial_names_for_universities_and_colleges_in_the_United_States'
abb_html = requests.get(abb_url).content
soup = BeautifulSoup(abb_html)
l = []
for ul in soup.find_all("ul")[1:26]:
    for li in ul.find_all("li"):
        uni_abb = li.text.strip().replace(' - ', ' – ').replace(' — ', ' – ').split(' – ')[0]
        for a in li.find_all('a', href=True):
            l.append((a['title'], uni_abb, f"https://en.wikipedia.org/{a['href']}"))

df = pd.DataFrame(l, columns=['university_full_name', 'uni_abb', 'uni_url'])
Result:
university_full_name uni_abb uni_url
0 Albert Einstein College of Medicine AECOM https://en.wikipedia.org//wiki/Albert_Einstein...
1 United States Air Force Academy AFA https://en.wikipedia.org//wiki/United_States_A...
I am trying to scrape a keyword in an xml document with BeautifulSoup but am unsure how to do so.
The xml document contains "Central Index Key," which changes each time for each document scraped. How would I be able to log the central index key for every unique xml document I scrape?
A sample is below. I want to log the string "0001773427" in this example:
<SEC-DOCUMENT>0001104659-22-079974.txt : 20220715
<SEC-HEADER>0001104659-22-079974.hdr.sgml : 20220715
<ACCEPTANCE-DATETIME>20220715060341
ACCESSION NUMBER: 0001104659-22-079974
CONFORMED SUBMISSION TYPE: 8-K
PUBLIC DOCUMENT COUNT: 14
CONFORMED PERIOD OF REPORT: 20220714
ITEM INFORMATION: Departure of Directors or Certain Officers; Election of Directors; Appointment of Certain Officers: Compensatory Arrangements of Certain Officers
ITEM INFORMATION: Financial Statements and Exhibits
FILED AS OF DATE: 20220715
DATE AS OF CHANGE: 20220715
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: SpringWorks Therapeutics, Inc.
CENTRAL INDEX KEY: 0001773427
STANDARD INDUSTRIAL CLASSIFICATION: BIOLOGICAL PRODUCTS (NO DIAGNOSTIC SUBSTANCES) [2836]
IRS NUMBER: 000000000
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: 8-K
SEC ACT: 1934 Act
SEC FILE NUMBER: 001-39044
FILM NUMBER: 221084206
BUSINESS ADDRESS:
STREET 1: 100 WASHINGTON BOULEVARD
CITY: STAMFORD
STATE: CT
ZIP: 06902
BUSINESS PHONE: 203-883-9490
MAIL ADDRESS:
STREET 1: 100 WASHINGTON BOULEVARD
CITY: STAMFORD
STATE: CT
ZIP: 06902
</SEC-HEADER>
The problem is that your document is SGML, not XML. XPath requires that your document be XML.
Use a different tool or convert your document to XML if you wish to use XPath. An example of a SGML to XML converter is sx by James Clark.
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Sample Company Name AdminContact@<sample company domain>.com'}
r = requests.get('https://www.sec.gov/Archives/edgar/data/1773427/000110465922079974/0001104659-22-079974.txt', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
print([x.strip().replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0])
This will return:
CENTRAL INDEX KEY: 0001773427
If you only want the key:
print([x.replace('\t', ' ') for x in soup.text.splitlines() if 'CENTRAL INDEX KEY:' in x ][0].split(':')[1].strip())
I'm trying to get the main body data from this website
I want to get a data frame (or any other object that makes life easier) as output, with the subheadings as column names and the body text under each subheading as the rows of that column.
My code is below:
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html, 'lxml')  # or "html.parser"
article = soup.find(class_="entry-content")

headings = []
lines = []
my_df = pd.DataFrame(index=range(100))

for strong in article.findAll('strong'):
    if strong.parent.name == 'p':
        if strong.find(text=re.compile("News")):
            headings.append(strong.text)
#headings

k = 0
for ul in article.findAll('ul'):
    for li in ul.findAll('li'):
        lines.append(li.text)
    lines = lines + [""]
    my_df[k] = pd.Series(lines)
    k = k + 1
my_df
I want to use the "headings" list to get the data frame column names.
Clearly I'm not writing the correct logic. I explored nextSibling, descendants and other attributes too, but I can't figure out the correct logic. Can someone please help?
Once you get a headline, use .find_next() to get that news article's list. Then add the items to a list stored under the headline as a key in a dictionary. Finally, use pd.concat() with ignore_index=False.
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
url = "https://www.bankersadda.com/17th-september-2021-daily-gk-update/"
page = requests.get(url)
html = page.text
soup = BeautifulSoup(html,'lxml') #"html.parser")
article = soup.find(class_ = "entry-content")
headlines = {}
news_headlines = article.find_all('p',text=re.compile("News"))
for news_headline in news_headlines:
    end_of_news = False
    sub_title = news_headline.find_next('p')
    headlines[news_headline.text] = []
    #print(news_headline.text)
    while end_of_news == False:
        headlines[news_headline.text].append(sub_title.text)
        articles = sub_title.find_next('ul')
        for li in articles.findAll('li'):
            headlines[news_headline.text].append(li.text)
            #print(li.text)
        sub_title = articles.find_next('p')
        if 'News' in sub_title.text or sub_title.text == '':
            end_of_news = True

df_list = []
for headings, lines in headlines.items():
    temp = pd.DataFrame({headings: lines})
    df_list.append(temp)

my_df = pd.concat(df_list, ignore_index=False, axis=1)
Output:
print(my_df)
National News ... Obituaries News
0 1. Cabinet approves 100% FDI under automatic r... ... 11. Eminent Kashmiri Writer Aziz Hajini passes...
1 The Union Cabinet, chaired by Prime Minister N... ... Noted writer and former secretary of Jammu and...
2 A total of 9 structural and 5 process reforms ... ... He has over twenty books in Kashmiri to his cr...
3 Change in the definition of AGR: The definitio... ... 12. Former India player and Mohun Bagan great ...
4 Rationalised Spectrum Usage Charges: The month... ... Former India footballer and Mohun Bagan captai...
5 Four-year Moratorium on dues: Moratorium has b... ... Bhabani Roy helped Mohun Bagan win the Rovers ...
6 Foreign Direct Investment (FDI): The governmen... ... 13. 2 times Olympic Gold Medalist Yuriy Sedykh...
7 Auction calendar fixed: Spectrum auctions will... ... Double Olympic hammer throw gold medallist Yur...
8 Important takeaways for all competitive exams: ... He set the world record for the hammer throw w...
9 Minister of Communications: Ashwini Vaishnaw. ... He won his first gold medal at the 1976 Olympi...
[10 rows x 8 columns]
I am trying to scrape a Zoho Analytics table from this webpage for a project at university. For the moment I have no ideas. I can't see the values in the inspector, and therefore I cannot use BeautifulSoup in Python (my favourite tool).
Does anybody have any idea?
Thanks a lot,
Joseph
I tried it with BeautifulSoup; it seems you can't soup the values inside the table because they are not on the page itself but stored externally.
EDIT:
https://analytics.zoho.com/open-view/938032000481034014
This is the link where the table and its data are stored.
So I tried scraping from it with bs4 and it works.
The class of the rows is "zdbDataRowDiv"
Try:
container = page_soup.findAll("div", {"class": "zdbDataRowDiv"})
Code explanation:
container # the variable where your data is stored, name it how you like
page_soup # your html page you souped with BeautifulSoup
findAll("tag",{"attribute":"value"}) # this function finds every tag which has the specific value inside its attribute
They are stored within the <script> tags in json format. Just a matter of pulling those out and parsing:
from bs4 import BeautifulSoup
import pandas as pd
import requests
import json
url = 'https://flo.uri.sh/visualisation/4540617/embed'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
scripts = soup.find_all('script')
for script in scripts:
    if 'var _Flourish_data_column_names = ' in script.text:
        json_str = script.text
        col_names = json_str.split('var _Flourish_data_column_names = ')[-1].split(',\n')[0]
        cols = json.loads(col_names)
        data = json_str.split('_Flourish_data = ')[-1].split(',\n')[0]

loop = True
while loop == True:
    try:
        jsonData = json.loads(data)
        loop = False
        break
    except:
        data = data.rsplit(';', 1)[0]

rows = []
headers = cols['rows']['columns']
for row in jsonData['rows']:
    rows.append(row['columns'])

table = pd.DataFrame(rows, columns=headers)
for col in headers[1:]:
    table.loc[table[col] != '', col] = 'A'
Output:
print (table)
Company Climate change Forests Water security
0 Danone A A A
1 FIRMENICH SA A A A
2 FUJI OIL HOLDINGS INC. A A A
3 HP Inc A A A
4 KAO Corporation A A A
.. ... ... ... ...
308 Woolworths Limited A
309 Workspace Group A
310 Yokogawa Electric Corporation A A
311 Yuanta Financial Holdings A
312 Zalando SE A
[313 rows x 4 columns]
I am a fairly new data worker in the public health field. Any help is appreciated.
Basically our goal is to create a quick way to extract the title and meta description from a list of URLs. We are using Python. We do not need anything else from the webpage.
I have the following list called "urlList":
urlList = https://www.freeclinics.com/cit/ca-los_angeles?sa=X&ved=2ahUKEwjew7SbgMXoAhUJZc0KHYHUB-oQ9QF6BAgIEAI,
https://www.freeclinics.com/cit/ca-los_angeles,
https://www.freeclinics.com/co/ca-los_angeles,
http://cretscmhd.psych.ucla.edu/healthfair/HF%20Services/LinkingPeopletoServices_CLinics_List_bySPA.pdf
Then I was able to extract the title and description of one of the URLs (see the code below). I was hoping to loop this over the list. I am open to any form of data export, i.e. it can be a data table, .csv, or .txt file.
I know my current print output shows the title and description as strings, where the description output is in [ ]. This is fine. The main concern of this post is looping through the whole urlList.
urlList = "https://www.freeclinics.com/cit/ca-los_angeles?sa=X&ved=2ahUKEwjew7SbgMXoAhUJZc0KHYHUB-oQ9QF6BAgIEAI"
response = requests.get(urlList)
soup = BeautifulSoup(response.text)
metas = soup.find_all('meta')
print((soup.title.string),[ meta.attrs['content'] for meta in metas if 'name' in meta.attrs and meta.attrs['name'] == 'description' ])
>> Output: Free and Income Based Clinics Los Angeles CA ['Search below and find all of the free and income based health clinics in Los Angeles CA. We have listed out all of the Free Clinics listings in Los Angeles, CA below']
P.S. At most, the urlList will have 10-20 links. All are very similar in page structure.
You can define a function that takes urlList as an argument and returns a list of lists, where each sublist contains a title and its corresponding description.
Try this:
import requests
from bs4 import BeautifulSoup

def extract_info(url_list):
    info = []
    for url in url_list:
        with requests.get(url) as response:
            soup = BeautifulSoup(response.text, "lxml")
            title = soup.find('title').text if soup.find('title') else None
            description = soup.find('meta', {"name": "description"})["content"] if soup.find('meta', {"name": "description"}) else None
            info.append([title, description])
    return info
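For example, calling it with the URLs from the question (the urlList variable name is taken from the question):
urlList = [
    "https://www.freeclinics.com/cit/ca-los_angeles?sa=X&ved=2ahUKEwjew7SbgMXoAhUJZc0KHYHUB-oQ9QF6BAgIEAI",
    "https://www.freeclinics.com/cit/ca-los_angeles",
    "https://www.freeclinics.com/co/ca-los_angeles",
]
print(extract_info(urlList))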
Output:
[['Free and Income Based Clinics Los Angeles CA',
'Search below and find all of the free and income based health clinics in '
'Los Angeles CA. We have listed out all of the Free Clinics listings in Los '
'Angeles, CA below']
...
]]