Export HTML table to Excel without page refresh using Python

I have a web page on which the user can generate a table by entering the number of rows and columns as input.
Now I want to export this HTML table to an Excel file using Python. After some googling, I came across the to_excel snippet shown below.
import pandas as pd
# The webpage URL whose table we want to extract
url = "https://www.geeksforgeeks.org/extended-operators-in-relational-algebra/"
# Assign the table data to a Pandas dataframe
table = pd.read_html(url)[0]
# Store the dataframe in Excel file
table.to_excel("data.xlsx")
As you can observe from the above code, the program navigates to the specified url. But on my web page, if the url is hit, all the data is gone (after the page refresh), because I am generating the rows and columns on the fly, without a page refresh.
Can someone suggest an alternate approach for exporting an HTML table to Excel using Python?

Don't pass the URL; pass a raw string containing the HTML:
Parameters:
io: (str, path object or file-like object)
A URL, a file-like object, or a raw string containing HTML. Note that
lxml only accepts the http, ftp and file url protocols. If you have a
URL that starts with 'https' you might try removing the 's'.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html
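For example, a minimal sketch, assuming the generated table's markup is posted back to the server as a string (the HTML below is a placeholder; on recent pandas versions a literal string should be wrapped in StringIO):
import pandas as pd
from io import StringIO

# Placeholder markup standing in for the table the user generated client-side
html = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>30</td></tr>
</table>
"""

# read_html accepts raw HTML; newer pandas deprecates bare literal strings,
# so wrap the string in StringIO to be safe
table = pd.read_html(StringIO(html))[0]
table.to_excel("data.xlsx")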

Related

Create column containing the url of the hyperlinked text

I have a data source with a column containing hyperlinked text. When I read it with pandas, the hyperlinks are gone. I still want to get the URL of each row and put it into a new column called "URL".
So the idea is to create a new column that contains the URL. In this example, the pandas dataframe will have 4 columns:
Agreement Code
URL
Entity Name
Agreement Date
As far as I know, pandas doesn't have this functionality; there is an open feature request for hyperlinks here. However, you can use openpyxl to accomplish this task:
import openpyxl
### Loads the worksheet
wb = openpyxl.load_workbook('file_name.xlsx')
ws = wb['sheet_name']  # get_sheet_by_name is deprecated in recent openpyxl
### You can access the hyperlinks like this by changing row number
print(ws.cell(row=2, column=1).hyperlink.target)
You can iterate row-wise to get all the hyperlinks and store them in a new column, as sketched below. For more details regarding openpyxl, please refer to the docs.
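A minimal sketch of that iteration, assuming the hyperlinks sit in column A and the file and sheet names are placeholders:
import openpyxl
import pandas as pd

wb = openpyxl.load_workbook('file_name.xlsx')
ws = wb['sheet_name']  # placeholder sheet name

# Collect the target of each hyperlink in column A, skipping the header row
urls = []
for row in ws.iter_rows(min_row=2, min_col=1, max_col=1):
    cell = row[0]
    urls.append(cell.hyperlink.target if cell.hyperlink else None)

# Attach the collected targets as the new "URL" column
df = pd.read_excel('file_name.xlsx')
df.insert(1, 'URL', urls)  # place URL as the second column, per the desired layout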

Exporting Pandas Dataframe as CSV

This is a question concerning how to allow a user to export a Pandas dataframe to CSV format in Python 3.
For context, I have a Django view that accepts POST requests from jQuery, such that when a user clicks a button on my website, it triggers a POST request to that Django view and performs some filtering to generate a Pandas dataframe. I want users to be able to export the dataframe on their end, not onto my personal local machine/project directory.
I make a sharp distinction between "downloading" and "exporting". Downloading can be done easily through the pd.to_csv method and basically saves the CSV file to a specified directory on my local machine (or my project folder, in fact). The problem is that the behavior I want is "exporting", which I define as a user, upon clicking a button, being able to get the dataframe on their local machine.
The way I do "exporting" currently is by converting the dataframe to an HTML table element, returning the HTML as the response of the POST request to jQuery, and using vanilla JS to inspect the table element and export the data on the user's end, following a protocol similar to "How do I export html table data as .csv file?". The problem, however, is that when the dataframe grows too big, it becomes impossible to inspect the associated table element to generate a CSV file.
Any suggestion for exporting a Pandas dataframe to CSV is appreciated - it could be an original solution, in fact.
Try this in your view function:
import csv
import pandas as pd
from django.http import HttpResponse

def get(request):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="{filename}.csv"'.format(filename='myname')
    writer = csv.writer(response)
    df = pd.DataFrame([{"name": "haha", "age": 18}, {"name": "haha", "age": 18}])
    writer.writerow([column for column in df.columns])
    writer.writerows(df.values.tolist())
    return response
df.to_csv('directory/file_name.csv')
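Since HttpResponse is file-like, the same view can also hand the response object straight to pandas; a minimal sketch of that variant:
import pandas as pd
from django.http import HttpResponse

def get(request):
    response = HttpResponse(content_type='text/csv')
    response['Content-Disposition'] = 'attachment; filename="myname.csv"'
    df = pd.DataFrame([{"name": "haha", "age": 18}, {"name": "haha", "age": 18}])
    df.to_csv(response, index=False)  # to_csv accepts any file-like object
    return response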

How to write to an existing excel file without over-writing existing data using pandas

I know similar questions have been posted before, but I haven't found anything that works for this case. I hope you can help.
Here is a summary of the issue:
I am writing a web-scraping code using Selenium (for an assignment).
The code uses a for-loop to go from one page to another.
The output of the code is a dataframe from each page, which is exported to Excel (basically a table).
Dataframes from all the web pages are to be captured in one Excel sheet only (not multiple sheets within the Excel file).
Each web page has the same data format (i.e. the number of columns and the column headers are the same, but the row values vary).
For info, I am using pandas, as it helps convert the output from the website to Excel.
The problem I'm facing is that when the dataframe is exported to Excel, it overwrites the data from the previous iteration. Hence, when the scraping is completed, I only get the data from the last for-loop iteration.
Please advise which line(s) of code I need to add in order for all the iterations to be captured in the Excel sheet; in other words, and more specifically, each iteration should export its data to Excel starting from the first empty row.
Here is an extract from the code:
for i in range(50, 60):
    url = urlA + str(i)  # this is the url generator; urlA is the main link excluding pagination
    driver.get(url)
    time.sleep(random.randint(3, 7))
    text = driver.find_element_by_xpath('/html/body/pre').text
    data = pd.DataFrame(eval(text))
    export_excel = data.to_excel(xlpath)
Thanks Dijkgraaf. Your proposal worked.
Here is the full code for others (for future reference).
Apologies for the formatting, I couldn't set it properly. Anyway, I hope the below is of some use to someone in the future.
xlpath = "c:/projects/excelfile.xlsx"
df = pd.DataFrame()  # creating an empty dataframe before the for-loop starts
urlA = "www.yourwebsite.com"  # placeholder: the main link excluding pagination
for i in range(1, 10):
    url = urlA + str(i)  # url generator for pagination (to loop through the pages)
    driver.get(url)
    text = driver.find_element_by_xpath('/html/body/pre').text  # gets text from the site
    data = pd.DataFrame(eval(text))  # evaluates the extracted text and converts it to a pandas dataframe
    df = df.append(data)  # appends the new data to the dataframe (df) created before the for-loop
export_excel = df.to_excel(xlpath)  # exports the consolidated dataframe (df) to excel
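Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On current pandas, the same accumulation can be written with pd.concat, reusing the names from the snippet above:
frames = []
for i in range(1, 10):
    driver.get(urlA + str(i))
    text = driver.find_element_by_xpath('/html/body/pre').text
    frames.append(pd.DataFrame(eval(text)))  # collect each page's dataframe in a list
df = pd.concat(frames, ignore_index=True)  # single concatenation at the end
df.to_excel(xlpath)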

Python: Saving AJAX response data to .json and save this to pandas DataFrame

Hello and thank you for taking the time to read this.
I am looking to extract company information from a particular stock exchange and then save this information to a pandas DataFrame.
Each firm has its own webpage, all determined by the "KodeEmiten" ending. These codes are saved in a column of the first DataFrame:
df = pd.DataFrame.from_dict(data['data'])
Now my goal is to use these codes to call each company's website individually and create a json file for each:
for i in range(len(df)):
    requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()
While this works, I can't save the results to a new DataFrame due to "list index out of range" and incorrect-keyword errors. There is significantly more information in the XHR responses than I actually need, and I believe the differing structures are what cause the errors when trying to save them to a new DataFrame. I'm really just interested in getting the data under these XHR keys:
AnakPerusahaan, Direktur, Komisaris, PemegangSaham
So my question is two-in-one:
a) How can I extract just the information under those specific XHR keys (all of them are tables)?
b) How can I save those to a new dataframe (or even a list, I don't really mind)?
import requests
import pandas as pd
import json
import time
# gets broad data of main page of the stock exchange
sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = sxow.json() # save the request as .json file
df = pd.DataFrame.from_dict(data['data']) #creates DataFrame based on the data (.json) file
# add: compare file contents and overwrite original if same
cdate = time.strftime ("%Y%m%d") # creating string-variable w/ current date year|month|day
df.to_excel(f"{cdate}StockExchange_Overview.xlsx") # converts DataFrame to Excel file, can't overwrite existing file
for i in range(len(df)):
    requests.get(f'https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten={df.loc[i, "KodeEmiten"]}').json()
    # This is where I'm completely stuck
You don't need to convert the result to a dataframe. You can just loop through the json object and concatenate the url to get each company's website details.
Follow the code below:
import requests
import pandas as pd
import json
import time
# gets broad data of main page of the stock exchange
sxow = requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfiles?draw=1&columns%5B0%5D%5Bdata%5D=KodeEmiten&columns%5B0%5D%5Bname%5D&columns%5B0%5D%5Bsearchable%5D=true&columns%5B0%5D%5Borderable%5D=false&columns%5B0%5D%5Bsearch%5D%5Bvalue%5D&columns%5B0%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B1%5D%5Bdata%5D=KodeEmiten&columns%5B1%5D%5Bname%5D&columns%5B1%5D%5Bsearchable%5D=true&columns%5B1%5D%5Borderable%5D=false&columns%5B1%5D%5Bsearch%5D%5Bvalue%5D&columns%5B1%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B2%5D%5Bdata%5D=NamaEmiten&columns%5B2%5D%5Bname%5D&columns%5B2%5D%5Bsearchable%5D=true&columns%5B2%5D%5Borderable%5D=false&columns%5B2%5D%5Bsearch%5D%5Bvalue%5D&columns%5B2%5D%5Bsearch%5D%5Bregex%5D=false&columns%5B3%5D%5Bdata%5D=TanggalPencatatan&columns%5B3%5D%5Bname%5D&columns%5B3%5D%5Bsearchable%5D=true&columns%5B3%5D%5Borderable%5D=false&columns%5B3%5D%5Bsearch%5D%5Bvalue%5D&columns%5B3%5D%5Bsearch%5D%5Bregex%5D=false&start=0&length=700&search%5Bvalue%5D&search%5Bregex%5D=false&_=155082600847')
data = sxow.json() # save the request as .json file
list_of_json = []
for nested_json in data['data']:
    list_of_json.append(requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten=' + nested_json['KodeEmiten']).json())
    time.sleep(1)
The list_of_json will contain all the json results you requested.
Here nested_json is the loop variable to loop through the array of json of different KodeEmiten.
This is a slight improvement on #bigbounty's approach:
Since the aim is to save the information to a list and then use said list further in the script, a list comprehension is actually a tad faster.
i.e.
list_of_json = [requests.get('https://www.idx.co.id/umbraco/Surface/ListedCompany/GetCompanyProfilesDetail?emitenType=&kodeEmiten=' + nested_json['KodeEmiten']).json() for nested_json in data['data']]
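To then pull out just the four sections from part (a), here is a minimal sketch, assuming each profile JSON exposes those keys as lists of records (the key names come from the question; the per-section structure is an assumption):
import pandas as pd

keys = ['AnakPerusahaan', 'Direktur', 'Komisaris', 'PemegangSaham']
frames = []
for profile in list_of_json:
    for key in keys:
        if profile.get(key):  # skip profiles where a section is missing or empty
            part = pd.DataFrame(profile[key])  # assumes the section is a list of records
            part['section'] = key  # remember which section the rows came from
            frames.append(part)
extracted = pd.concat(frames, ignore_index=True)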

switching a specific part of URL and save the result to CSV

I have an Excel file that has an 'ID' key and an API call URL.
the Excel looks like this:
And the URL is this:
http://api.*******.com/2.0/location/{property_id}?key=123456ASDFG
The result of the API call is in JSON format
I would like to iterate the Excel's property_id values through the URL and save the result to CSV, one row each.
What I have done so far is:
import requests
import json
url = "http://api.*******.com/2.0/location/{property_id}?key=123456ASDFG"
response = requests.get(url)
data = response.text
print(data)
The result is basically the same as what I get when I put the URL in the Chrome browser.
I somehow have to read each row of column A in the Excel, swap its value in for {property_id} in the URL,
then append the result to the CSV as the row number increases.
I'm very new to APIs and I have no idea where to start.
I was trying to find similar questions on Stack Overflow and could not find any (maybe wrong keywords?).
Any help is very much appreciated.
Thanks
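A minimal sketch of that loop, assuming the IDs sit in a property_id column (the file and column names are hypothetical; the masked host and key are kept from the question):
import pandas as pd
import requests

ids = pd.read_excel('properties.xlsx')['property_id']  # hypothetical file/column names

rows = []
for property_id in ids:
    url = f'http://api.*******.com/2.0/location/{property_id}?key=123456ASDFG'
    rows.append(requests.get(url).json())

# Flatten the JSON records and write one CSV row per property
pd.json_normalize(rows).to_csv('results.csv', index=False)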
