Create column containing the url of the hyperlinked text - python

I have a data source that has a column containing hyperlinked text. When I read it with pandas, the hyperlinks are gone. I still want to get the URL for each row and put it in a new column called "URL".
So, the idea is to create a new column that contains the URL. In this example, the pandas dataframe will have 4 columns:
Agreement Code
URL
Entity Name
Agreement Date

To my knowledge, pandas doesn't have this functionality yet, as there is an open feature request for hyperlinks here. However, you can use openpyxl to accomplish this task:
import openpyxl
# Load the workbook and select the worksheet
wb = openpyxl.load_workbook('file_name.xlsx')
ws = wb['sheet_name']  # wb.get_sheet_by_name() is deprecated
# You can access a cell's hyperlink like this, changing the row number as needed
print(ws.cell(row=2, column=1).hyperlink.target)
You can iterate row-wise to get all the hyperlinks and store them in a new column. For more details regarding openpyxl, please refer to the docs.
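For example, here is a minimal sketch that builds the "URL" column next to the pandas dataframe (it assumes the hyperlinked text sits in column A, that the file and sheet names match yours, and that row 1 is the header; cells without a hyperlink get None):
import openpyxl
import pandas as pd
df = pd.read_excel('file_name.xlsx', sheet_name='sheet_name')
wb = openpyxl.load_workbook('file_name.xlsx')
ws = wb['sheet_name']
# Rows 2..max_row hold the data; column 1 holds the hyperlinked text
urls = []
for row in range(2, ws.max_row + 1):
    cell = ws.cell(row=row, column=1)
    urls.append(cell.hyperlink.target if cell.hyperlink else None)
df['URL'] = urls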

Related

How to keep the same style for a table using openpyxl and python?

My goal was to add data to an already existing table using openpyxl and Python. I did it using the .cell(row, column).value method.
After doing this I had a problem, because the table I was writing the data into was not expanding correctly. So I found this method, and it worked fine:
from openpyxl import load_workbook
wb = load_workbook('file_name.xlsx')
bs_sheet = wb['sheet_name']
# Get the last row that contains data
ok = bs_sheet.max_row
# Expand the table's range down to that row
bs_sheet.tables['Table1'].ref = "A1:H" + str(ok)
What I initially thought was that the format of the table would expand accordingly. What I mean by that is, if a column had a formula, expanding the table with openpyxl would also extend the formula (position, etc.), just like it does when you expand the table manually. But it doesn't, and this is where I have a problem, because I haven't found anything on it.
What I am having trouble with is that when extending the table, the formatting already applied to the existing rows doesn't extend down to the new rows. Is there a way I could fix this?
Using xlwings helps keep the same format (including justifications and formulas) of a table.
When inserting data, the table will expand automatically. See the example below:
import xlwings as xw
wb = xw.Book('test_book.xlsx')
sheet = wb.sheets[0]
tableau = sheet.tables[0]
tableau.name = 'new'  # rename the table (optional)
sheet.range((3, 1)).value = 'VAMONOS'  # writing just below the table makes it expand
wb.save('test_book.xlsx')
wb.close()
In this example, it adds a value to an already existing table (and also changes the name of the table). You will see the table already expanded when you open the file again.
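If you would rather stay with openpyxl, a minimal sketch of a workaround is to copy the styles and formulas from the last already-formatted row down to the new rows yourself, since openpyxl won't do it automatically (the file name, sheet name, source row, and the A..H column range are assumptions matching the snippets above):
from copy import copy
from openpyxl import load_workbook
from openpyxl.formula.translate import Translator
wb = load_workbook('file_name.xlsx')
ws = wb['sheet_name']
src_row = 2  # assumed: last row that already carries the table's formatting/formulas
for row in range(src_row + 1, ws.max_row + 1):  # the newly added rows
    for col in range(1, 9):  # table columns A..H
        src = ws.cell(row=src_row, column=col)
        dst = ws.cell(row=row, column=col)
        # Styles must be copied cell by cell
        dst.font = copy(src.font)
        dst.border = copy(src.border)
        dst.fill = copy(src.fill)
        dst.number_format = src.number_format
        dst.alignment = copy(src.alignment)
        # Re-anchor formulas for the new row
        if isinstance(src.value, str) and src.value.startswith('='):
            dst.value = Translator(src.value, origin=src.coordinate).translate_formula(dst.coordinate)
wb.save('file_name.xlsx')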

I want to sort the data in each sheet of an Excel file with respect to a column. (The Excel file has multiple sheets)

[Screenshot: Excel Data]
This is the data I have in an Excel file. There are 10 sheets containing different data, and I want to sort the data in each sheet by the 'BA_Rank' column in descending order.
After sorting the data, I have to write the sorted data to an Excel file.
(For example, the data that was in sheet1 of the unsorted file should be written to sheet1 of the sorted file, and so on...)
If I remove the heading from the first row, I can use the pandas sort_values() function to sort the data in the first sheet and save it, like this:
import pandas as pd
import xlrd
doc = xlrd.open_workbook('without_sort.xlsx')
xl = pd.read_excel('without_sort.xlsx')
length = doc.nsheets
# print(length)
# for i in range(0, length):
#     sheet = xl.parse(i)
result = xl.sort_values('BA_Rank', ascending=False)
result.to_excel('SortedData.xlsx')
print(result)
So is there any way I can sort the data without removing the header from the first row?
And how can I iterate over the sheets so as to sort the data in multiple sheets?
(Note: all the sheets contain the same columns, and I need to sort every sheet by 'BA_Rank' in descending order.)
First, you don't need to call xlrd yourself when using pandas; it's done under the hood.
Secondly, the read_excel method is really smart. You can (and in my opinion should) specify the sheet you're pulling data from. You can also set lines to skip, say where the header line is, or ignore it (and then set column names manually). Check the docs, they're quite extensive.
If the "10 sheets" count is merely anecdotal, you could use something like xlrd to extract the workbook's sheet count and work by index (or extract the sheet names directly).
The sorting looks right to me.
Finally, if you want to save it all in the same workbook, I would use openpyxl or some similar library (there are many others, like pyexcelerate for large files).
This procedure pretty much always looks like the following (sketched below):
Create/open the destination file (often the same method)
Write the data, sheet by sheet
Close/save the file
If the data is to be written all on the same sheet, pd.concat([all_dataframes]).to_excel("path_to_store") should get it done.
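For this question, a minimal sketch of that procedure (the file names come from the snippet above; sheet_name=None makes pandas return a dict of {sheet name: dataframe}):
import pandas as pd
# Read every sheet at once, headers included
sheets = pd.read_excel('without_sort.xlsx', sheet_name=None)
# Sort each sheet by 'BA_Rank' (descending) and write it back under the same name
with pd.ExcelWriter('SortedData.xlsx') as writer:
    for name, df in sheets.items():
        df.sort_values('BA_Rank', ascending=False).to_excel(writer, sheet_name=name, index=False)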

How to write to an existing excel file without over-writing existing data using pandas

I know similar questions have been posted before, but I haven't found anything that works for this case. I hope you can help.
Here is a summary of the issue:
I'm writing a web scraping code using selenium (for an assignment).
The code uses a for-loop to go from one page to another.
The output of the code is a dataframe from each page that is exported to Excel (basically a table).
The dataframes from all the web pages are to be captured in one Excel sheet only (not multiple sheets within the Excel file).
Each web page has the same data format (i.e. the number of columns and column headers are the same, but the row values vary).
For info, I'm using pandas, as it helps convert the output from the website to Excel.
The problem I'm facing is that when the dataframe is exported to Excel, it overwrites the data from the previous iteration; hence, when the scraping is completed, I only get the data from the last for-loop iteration.
Please advise which line(s) of code I need to add so that all the iterations are captured in the Excel sheet; in other words, each iteration should export its data starting from the first empty row.
Here is an extract from the code:
for i in range(50, 60):
    url = urlA + str(i)  # this is the URL generator; urlA is the main link excluding pagination
    driver.get(url)
    time.sleep(random.randint(3, 7))
    text = driver.find_element_by_xpath('/html/body/pre').text
    data = pd.DataFrame(eval(text))
    export_excel = data.to_excel(xlpath)  # this runs every iteration and overwrites the file
Thanks Dijkgraaf, your proposal worked.
Here is the full code for others (for future reference).
Apologies for the formatting, I couldn't set it properly. Anyway, I hope the below is of some use to someone in the future.
xlpath = "c:/projects/excelfile.xlsx"
df = pd.DataFrame()  # create the dataframe before the for-loop (it is empty before the loop starts)
urlA = 'https://www.yourwebsite.com/'  # main link excluding pagination (placeholder)
for i in range(1, 10):
    url = urlA + str(i)  # URL generator for pagination (to loop through the pages)
    driver.get(url)
    text = driver.find_element_by_xpath('/html/body/pre').text  # gets text from the site
    data = pd.DataFrame(eval(text))  # evaluates the extracted text and converts it to a pandas dataframe
    df = df.append(data)  # appends the new data to the dataframe (df) created before the for-loop
df.to_excel(xlpath)  # exports the consolidated dataframe (df) to Excel, once, after the loop
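Note that DataFrame.append was deprecated and later removed in pandas 2.0. A minimal sketch of the same pattern on current pandas, reusing the hypothetical urlA, driver, and xlpath from above, collects the frames in a list and concatenates once:
import pandas as pd
frames = []
for i in range(1, 10):
    driver.get(urlA + str(i))
    text = driver.find_element_by_xpath('/html/body/pre').text
    frames.append(pd.DataFrame(eval(text)))
# One concatenation and one write, after the loop
pd.concat(frames, ignore_index=True).to_excel(xlpath, index=False)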

How do I execute this python code automatically in Excel cells?

I need to extract the domain (for example: http://www.example.com/example-page, http://test.com/test-page) from a list of websites in an Excel sheet, and reduce each one to its bare domain (example.com, test.com). I have got the code part figured out, but I still need to get these commands to work automatically on the cells of a column in the Excel sheet.
here's the code
I think you should read in the data as a pandas DataFrame (pd.read_excel), make a function out of your code, then apply it to the dataframe (df.apply). Then it is easy to save back to Excel with DataFrame.to_excel().
Of course, you will need pandas to be installed.
Something like:
import pandas as pd
dframe = pd.read_excel(io='your path', sheet_name='your sheet')
dframe['domains'] = dframe['urls col name'].apply(your_function)
dframe.to_excel('your path')
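For instance, a minimal sketch of such a function using urllib.parse (the column name 'urls col name' is the placeholder from above, and the file names are assumptions):
import pandas as pd
from urllib.parse import urlparse
def extract_domain(url):
    # Return the bare domain of a URL, e.g. 'example.com'
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith('www.') else netloc
dframe = pd.read_excel('websites.xlsx')
dframe['domains'] = dframe['urls col name'].apply(extract_domain)
dframe.to_excel('websites_with_domains.xlsx', index=False)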
Best

Making a dictionary of single key, multiple values from Excel data by dealing with merged cells using xlrd in Python

I am using xlrd to extract two columns from an Excel file which has around 300 rows of data per sheet.
I have extracted the two columns into two lists and have made a dictionary using dict(zip(list1, list2)).
The problem I am facing is that some of the entries in list1 are merged cells, so they have multiple values in list2.
Sample input:
Request: 4.01
04.01.01
04.01.02
06.01.01
06.01.04.01
06.01.04.02
6.08
Request is the key, extracted from column A, and all the numbers are values from column B.
How do I make a dictionary in such cases?
Code snippet:
import xlrd
file_loc = 'D:/Tool/HC.xlsx'
workbook = xlrd.open_workbook(file_loc)
sheet = workbook.sheet_by_index(0)
tot_cols = sheet.ncols
tot_rows = sheet.nrows
File_name_list = []
FD_list = []
Extraction of the values:
for row in range(tot_rows):
    new_list = sheet.cell_value(row, 1)   # key column
    File_name_list.append(new_list)
    new_list2 = sheet.cell_value(row, 3)  # value column
    FD_list.append(new_list2)
dic = dict(zip(File_name_list, FD_list))  # making a dictionary, but due to merged cells not all the values get mapped
If your problem is indeed coming from merged cells, you can unmerge them as explained here. But as I said, CSV is a better choice - more portable and easier to work with (I admit I loathe Microsoft stuff). Basically, here are the details:
To get rid of all the merged cells in an Excel 2007 workbook, follow these steps:
Make a backup copy of the workbook, and store it somewhere safe.
Right-click one of the sheet tabs, and click Select All Sheets
On the active sheet, click the Select All button, at the top left of the worksheet
On the Ribbon's Home tab, click the drop down arrow for Merge & Center
Click Unmerge Cells
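If you would rather handle the merged cells in Python, a minimal sketch (assuming, as in the snippet above, that the keys are in column index 1 and the values in column index 3, and that the continuation rows of a merged key cell read as empty) carries the last seen key forward and collects the values into lists:
import xlrd
workbook = xlrd.open_workbook('D:/Tool/HC.xlsx')
sheet = workbook.sheet_by_index(0)
mapping = {}
current_key = None
for row in range(sheet.nrows):
    key = sheet.cell_value(row, 1)
    value = sheet.cell_value(row, 3)
    if key:  # a new key (the anchor cell of a merged range, or an unmerged cell)
        current_key = key
        mapping.setdefault(current_key, [])
    if current_key is not None and value:
        mapping[current_key].append(value)
# Each key now maps to the list of all its values, e.g.
# {'Request: 4.01': ['04.01.01', '04.01.02', ...], ...}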
