Exporting values from a spreadsheet using Python for web scraping (BeautifulSoup4)

A. My Objective:
Use Python to extract unique OCPO IDs from an Excel spreadsheet and use these IDs to web-scrape the corresponding company names and NIN IDs. (Note: both NIN and OCPO IDs are unique to a single company.)
B. Details:
i. Extract OCPO IDs from an Excel Spreadsheet using openpyxl.
ii. Search OCPO IDs one-by-one in a business registry (https://focus.kontur.ru/) and find corresponding company names and company IDs (NIN) using BeautifulSoup4.
Example: A search for OCPO ID "00044428" yields a matching company name ПАО "НК "РОСНЕФТЬ" and corresponding NIN ID "7706107510."
iii. Save the list of company names and NIN IDs back to Excel.
C. My progress:
i. I'm able to extract the list of OCPO IDs from Excel to Python.
# Pull the Packages
import openpyxl
import requests
import sys
from bs4 import BeautifulSoup
# Pull OCPO IDs from the first column of the spreadsheet
wb = openpyxl.load_workbook(r"C:\Users\ksong\Desktop\book1.xlsx")
sheet = wb.active
for cellobjc in sheet.columns[0]:
    print(cellobjc.value)
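To reuse the IDs in later steps rather than only printing them, here is a minimal sketch that collects them into a list. It reads column A via sheet["A"] (current openpyxl versions may not allow subscripting sheet.columns), and the zfill(8) is only an assumption for IDs that Excel stored as numbers and therefore lost their leading zeros:
# Collect the OCPO IDs from column A into a list for later lookups
ocpo_ids = []
for cellobjc in sheet["A"]:
    if cellobjc.value is None:                      # skip empty cells
        continue
    ocpo_ids.append(str(cellobjc.value).zfill(8))   # restore leading zeros, e.g. "00044428"
print(ocpo_ids)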
ii. I'm able to search an OCPO ID and have Python scrape the matching company name and the corresponding NIN ID.
# Part 1a: Pull the Website
r = requests.get("https://focus.kontur.ru/search?query=" + "00044428")
r.encoding = "UTF-8"
# Part 1b: Pull the Content
c = r.content
soup = BeautifulSoup(c, "html.parser", from_encoding="UTF-8")
# Part 2a: Pull Company name
name = soup.find("a", attrs={'class':"js-subject-link"})
name_box = name.text.strip()
print(name_box)
D. Help
i. How do I write a loop so that each OCPO ID is searched individually, so that instead of a list of OCPO IDs I get a list of search results? In other words, each OCPO ID should be searched and matched with its corresponding company name and NIN ID. Each ID would have to be fed into the URL as "https://focus.kontur.ru/search?query=" + "########".
ii. Also, what code should I use for Python to save all the search results in an Excel Spreadsheet?

1) Create an empty workbook to write to (Workbook comes from openpyxl):
from openpyxl import Workbook
wb2 = Workbook()
ws1 = wb2.active
2) Put all of the code from the second box inside your for loop from the first box.
3) Change "00044428" to str(cellobjc.value).
4) At the end of each loop iteration, append your row to the new worksheet:
row = [cellobjc.value, name_box, other_variables]
ws1.append(row)
5) After the loop finishes, save your file:
wb2.save("results.xlsx")

Related

List is putting everything in one row instead of multiple rows in Excel? (Pandas/Python)

I have an Excel file with a list of ZIP codes. The ZIP codes are fed into an ISP search engine that shows the best internet providers near you. I take the top three using Selenium and append them to three different lists. I then try to output those top three internet providers into three columns, which works, but everything ends up in one row instead of multiple rows under each column.
This is the current result (screenshot omitted):
And this is the expected result (I also don't want the \n or the brackets printed out; screenshot omitted):
This is the function:
#Function to loop through all of the ZIP codes in the spreadsheet
def CheckXfinityList():
    #Read the excel file
    df = pd.read_excel("BranchAddressList.xlsx")
    #Specify the column we want info from and assign it to a variable
    address = df["Zip Code"]
    #Create lists for the providers to put into their columns
    pv1 = []
    pv2 = []
    pv3 = []
    for x in address:
        try:
            #srchbtn = SeleniumSetup.driver.find_element("xpath", '//*[@id="main-content"]/div[1]/div/div/div/div/div/button/img')
            #if srchbtn exists
            #srchbtn.click()
            SeleniumSetup("https://www.highspeedinternet.com/providers")
            #Find the "Zip code" box and enter text into it from the excel document
            link = SeleniumSetup.driver.find_element("id", 'providerHero')
            link.send_keys(x)
            #Hit the "Enter" key to search the zip code
            link.send_keys(Keys.ENTER)
            time.sleep(5)
            #Grab the top provider names
            provider1 = SeleniumSetup.driver.find_element("xpath", '//*[@id="residential"]/div[3]/div/div[1]')
            provider2 = SeleniumSetup.driver.find_element("xpath", '//*[@id="residential"]/div[4]/div[1]/div[1]')
            provider3 = SeleniumSetup.driver.find_element("xpath", '//*[@id="residential"]/div[5]/div/div[1]')
            pv1.append(provider1.text)
            pv2.append(provider2.text)
            pv3.append(provider3.text)
        except:
            print("Couldn't find the value")
        finally:
            SeleniumSetup.driver.quit()
    #Create a new Excel file that will list the top 3 internet providers in their area
    top3 = pd.DataFrame({'Zip Code': [address],
                         'Internet Provider 1': [pv1],
                         'Internet Provider 2': [pv2],
                         'Internet Provider 3': [pv3]})
    #Create a Pandas Excel writer using XlsxWriter as the engine.
    writer = pd.ExcelWriter('TopProviders.xlsx', engine='xlsxwriter')
    #Convert the dataframe to an XlsxWriter Excel object.
    top3.to_excel(writer, sheet_name='Sheet1', index=False)
    #Close the Pandas Excel writer and output the Excel file.
    writer.save()
I've seen methods on here to just .explode the list, but that hasn't worked for me, and I think the issue is how I am appending them to the lists; perhaps I need another loop for them? Any help is appreciated!
@Yagami-Light had the correct solution: removing the brackets around the variables in the DataFrame solved the issue. The brackets were making a list of a list.
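For reference, a minimal sketch of the corrected DataFrame construction described above (it assumes address, pv1, pv2, and pv3 are the lists built in the loop and each ends up with one entry per ZIP code):
import pandas as pd

# Pass the lists themselves, not [list]-wrapped lists, so each ZIP code
# and its providers land on their own row.
top3 = pd.DataFrame({'Zip Code': list(address),
                     'Internet Provider 1': pv1,
                     'Internet Provider 2': pv2,
                     'Internet Provider 3': pv3})
top3.to_excel('TopProviders.xlsx', sheet_name='Sheet1', index=False)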

Python Edgar package - get CIK number

I am reading S-1 filings from EDGAR (SEC). I get my initial data from Bloomberg. Through the company name I can look up the matching CIK number using get_cik_by_company_name(company_name: str). I should be able to get the CIK number, which I then want to save in a list -> cik_list. However, it is not working: invalid syntax at str.
BloombergList is the Excel file Bloomberg created with all of the relevant company names. In column 4 I have the names, which I import as a list, then get the matching CIK for each, and then export the CIK list in the right order back to BloombergList - theoretically.
I would be happy if someone can help. Thanks in advance.
#needed packages
import pandas as pd
from openpyxl import load_workbook
from edgar import Edgar
#from excel get company names
book = load_workbook('BloombergList.xlsx')
sheet = book['Cleaned up']
for row in sheet.rows:
    row_list = row[4].value
    print(row_list)
#use edgar package to get CIK numbers from row_list
edgar = Edgar()
cik_list = []
for x in row_list:
    possible_companies = edgar.find_company_name(x)
    cik_list.append(get_cik_by_company_name(company_name: str))
#export generated CIK numbers back to excel
df = pd.DataFrame({'CIK': [cik_list]})
df.to_excel('BloombergList.xlsx', sheet_name="CIK", index=False)
print("done")

Trying to incorporate RegEX with Excel exercise from "Automate the Boring Stuff with Python"

I am trying to modify the project code from Automate the Boring Stuff with Python, chapter 13, page 315, to incorporate regular expressions. In my Excel sheet I have a list of sensor names that need the "Unit" column filled out appropriately (see the image below for an example). I have updated the project's dictionary to map the suffixes _BattV, _ft_H2O, etc. to their corresponding units; each sensor name is structured XXX_123_BattV, where XXX is the project code, 123 is the sensor number, and _BattV is the suffix indicating the sensor type. I would like to match the last chunk of each sensor name using regex so that the code fills in 'volts' for every sensor ending in _BattV, and so on.
Here is the code I have modified from the project example so far.
#! python3
# Adapted from updateProduce.py (Automate the Boring Stuff) - fills in the Unit column.
import openpyxl
filename = 'stackoverflowexample.xlsx'
wb = openpyxl.load_workbook(filename)
sheet = wb['stackoverflowexample']
# The sensor-name suffixes and their corresponding units
UNIT_UPDATES = {
    '_BattV': 'volts',
    '_ft_H2O': 'Feet of H20',
    '_GWE': 'Elevation (ft)',
    '_PSI': 'PSI',
    '_TempC': 'deg C'}
# Loop through the rows and update the units.
for rowNum in range(2, sheet.max_row):  # skip the first row
    Sensor_name = sheet.cell(row=rowNum, column=1).value
    if Sensor_name in UNIT_UPDATES:
        sheet.cell(row=rowNum, column=2).value = UNIT_UPDATES[Sensor_name]
wb.save('updatedstackoverflowexample.xlsx')
Here is what I have gleaned from the RegEx section of the book:
import re
unitRegex = re.compile(r'_BattV|_ft_H2O|_GWE|_PSI|_TempC')
voltRegex = re.compile(r'.*_BattV')
fth20Regex = re.compile(r'.*_ft_H2O')
gweRegex = re.compile(r'.*_GWE')
psiRegex = re.compile(r'.*_PSI')
tempRegex = re.compile(r'.*_TempC')
mo = unitRegex.search('insert cell data here')
I am also curious whether it is better to run the regex first and feed all of the matches into the dictionary and then run the rest, or to incorporate it within the for loop.
Finally, here is the example screenshot of the Excel spreadsheet showing the structure of the data (image omitted):
While you can do this with a regex, you might simply split the string:
units = {"BattV":"volts", …}
for cell in ws[A][1:]:
project, sensor, unit = cell.value.split("_")
cell.offset(column=2).value = units[unit]
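Since the question specifically asks about regex, here is a hedged sketch of a regex-based version of the same loop, reusing the question's dictionary and variable names and assuming the suffix always sits at the end of the sensor name:
import re
import openpyxl

wb = openpyxl.load_workbook('stackoverflowexample.xlsx')
sheet = wb['stackoverflowexample']

UNIT_UPDATES = {
    '_BattV': 'volts',
    '_ft_H2O': 'Feet of H2O',
    '_GWE': 'Elevation (ft)',
    '_PSI': 'PSI',
    '_TempC': 'deg C'}

# Match any of the known suffixes at the end of the sensor name
unitRegex = re.compile(r'(_BattV|_ft_H2O|_GWE|_PSI|_TempC)$')

for rowNum in range(2, sheet.max_row + 1):  # include the last row
    Sensor_name = sheet.cell(row=rowNum, column=1).value
    if Sensor_name is None:
        continue
    mo = unitRegex.search(Sensor_name)
    if mo:
        sheet.cell(row=rowNum, column=2).value = UNIT_UPDATES[mo.group(1)]

wb.save('updatedstackoverflowexample.xlsx')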

Scraping worldometers homepage to pull COVID-19 table data, but values don't pull in correctly (Python)

I'm scraping the worldometers home page in Python to pull the data in the table, but I am struggling because the numeric values aren't pulling in correctly. (The string values are fine, e.g. Country: USA, Spain, Italy...)
import requests
import lxml.html as lh
import pandas as pd
from tabulate import tabulate
url="https://www.worldometers.info/coronavirus/"
#Create a handle, page, to handle the contents of the website
page = requests.get(url)
#Store the contents of the website under doc
doc = lh.fromstring(page.content)
#Parse data that are stored between <tr>..</tr> of HTML
tr_elements = doc.xpath('//tr')
#Create empty list
col = []
colLen = len(tr_elements[1])
i = 0
#For each row, store each first element (header) and an empty list
for t in tr_elements[0]:
    i += 1
    name = t.text_content()
    print('%d:"%s"' % (i, name))
    col.append((name, []))
print(colLen)
#Since our first row is the header, data is stored on the second row onwards
for j in range(1, len(tr_elements)):
    #T is our j'th row
    T = tr_elements[j]
    if len(T) != len(tr_elements[0]): break
    #i is the index of our column
    i = 0
    #Iterate through each element of the row
    for t in T.iterchildren():
        data = t.text_content()
        #Append the data to the empty list of the i'th column
        col[i][1].append(data)
        #Increment i for the next column
        i += 1
Dict={title:column for (title,column) in col}
df=pd.DataFrame(Dict)
df.head()
#Print Total Cases Col (this is incorrect when comparing to the webpage)
print(col[1][0:])
#Print Country Col (this is correct)
print(col[0][0:])
I can't seem to figure out what the issue is. Please help me solve it. I'm also open to suggestions for doing this another way :)
(Screenshots omitted: the data table on the webpage, the command prompt output for Country, which is correct, and the command prompt output for Total Cases, which is incorrect.)
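As an alternative approach (not part of the original post), a hedged sketch using pandas.read_html, which parses every <table> on the page into a DataFrame; the assumption that the main table is the first one depends on the page's current layout:
import pandas as pd

url = "https://www.worldometers.info/coronavirus/"
tables = pd.read_html(url)   # one DataFrame per <table> on the page (needs lxml or html5lib)
df = tables[0]               # assumption: the main country table is the first table
print(df.head())
df.to_excel("worldometers.xlsx", index=False)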

Read data from excel after a string matches

I want to read an entire row of data and store it in variables, then later use them in Selenium to write the values into web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all of the above details, and store them in separate variables, like:
INC # = INC0000, Priority = Moderate, Date = 11/2/2020
Is this feasible? I tried writing code for it and failed. Please suggest possible ways to do this.
I would:
load the sheet into a pandas DataFrame
filter the corresponding column in the DataFrame by the INC # of interest
convert the row to dictionary (assuming the INC filter produces only 1 row)
get the corresponding value in the dictionary to assign to the corresponding webelement
Example:
import pandas as pd
df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
dict_data = df[df['INC #'] == "INC00000"].to_dict("records")[0]  # <-- assumes the incident numbers are in a column named "INC #" and the filter matches exactly one row
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
.
.
.
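For completeness, a minimal end-to-end sketch of this approach using the question's example values; the file name, sheet name, and column names are assumptions:
import pandas as pd

df = pd.read_excel("incidents.xlsx", sheet_name="Sheet1")   # hypothetical file and sheet names

inc_number = "INC00000"
row = df[df["INC #"] == inc_number].to_dict("records")[0]   # assumes exactly one matching row

inc = row["INC #"]          # e.g. "INC00000"
priority = row["Priority"]  # e.g. "Moderate"
date = row["Date"]          # e.g. "11/2/2020"

# These values can then be sent to Selenium web elements,
# e.g. some_webelement.send_keys(priority)
print(inc, priority, date)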
Please find the code below and adjust the variables as needed after saving your Excel file as a CSV:
(Dummy data image omitted.)
import csv
# Set up the input file for the script
gTrack = open("file1.csv", "r")
# Set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)
id_index = header.index("id")
date_index = header.index("date ")
var1_index = header.index("var1")
var2_index = header.index("var2")
# Make an empty list
cList = []
# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])
# Print the collected rows
print(cList)
