Saving scraped documents in two sheets in an excel file - python

I've created a scraper which is supposed to parse some documents from a webpage and save them to an Excel file with two sheets. However, when I run it, I can see that it only saves the documents of the last link in a single sheet, whereas there should be two sheets with the documents from the two links. I even printed the results to see what is happening in the background, but I found nothing wrong there. I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in one Excel file? Thanks in advance for taking a look into it.
Here is my code:
import requests
from lxml import html
from pyexcel_ods3 import save_data

name_list = ['Altronix','APC']

def docs_parser(link, name):
    res = requests.get(link)
    root = html.fromstring(res.text)
    vault = {}
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

def refining_docs(new_link, vault):
    res = requests.get(new_link).text
    root = html.fromstring(res)
    sheet = root.cssselect("#BrandContent h2")[0].text
    for elem in root.cssselect(".ProductDetails"):
        name_url = elem.cssselect("a[class]")[0].attrib['href']
        vault.setdefault(sheet, []).append([str(name_url)])
    save_data("docs.ods", vault)

if __name__ == '__main__':
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name)
But when I write code for another site in the same way, it meets the expectation, creating different sheets and saving the documents in them. Here is the link:
https://www.dropbox.com/s/bgyh1xxhew8hcvm/Pyexcel_so.txt?dl=0

Question: I think the first sheet is overwritten and the second one is never created. How do I get around this so that the data is saved in two sheets in an Excel file?
You overwrite the workbook file on every link that is appended.
You should never call save_data(...) within a loop, only once at the end of your script.
Comparing your two scripts, there is no difference: both behave the same, overwriting the workbook file again and again. Maybe the file I/O gets overloaded, as you overwrite the workbook file more than 160 times within a short time.
The first script should create 13 sheets:
data sheet:powerpivot-etc links:20
data sheet:flappy-owl-videos links:1
data sheet:reporting-services-videos links:20
data sheet:csharp links:14
data sheet:excel-videos links:9
data sheet:excel-vba-videos links:20
data sheet:sql-server-videos links:9
data sheet:report-builder-2016-videos links:4
data sheet:ssrs-2016-videos links:5
data sheet:sql-videos links:20
data sheet:integration-services links:19
data sheet:excel-vba-user-form links:20
data sheet:archived-videos links:16
The second script should create 2 sheets:
vault sheet:Altronix links:16
vault sheet:APC links:16
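A minimal restructuring sketch, assuming the same page structure and the pyexcel_ods3 API from the question: share one vault dict across all names and call save_data exactly once, after the loop.

import requests
from lxml import html
from pyexcel_ods3 import save_data

name_list = ['Altronix', 'APC']

def docs_parser(link, name, vault):
    res = requests.get(link)
    root = html.fromstring(res.text)
    for post in root.cssselect(".SubBrandList a"):
        if post.text == name:
            refining_docs(post.attrib['href'], vault)

def refining_docs(new_link, vault):
    root = html.fromstring(requests.get(new_link).text)
    sheet = root.cssselect("#BrandContent h2")[0].text
    for elem in root.cssselect(".ProductDetails"):
        # one row per document URL, keyed by the brand's sheet name
        vault.setdefault(sheet, []).append([str(elem.cssselect("a[class]")[0].attrib['href'])])

if __name__ == '__main__':
    vault = {}  # shared across all names
    for name in name_list:
        docs_parser("http://store.immediasys.com/brands/", name, vault)
    save_data("docs.ods", vault)  # written once, with one sheet per brand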

Related

Search keywords from multiple Excel columns/rows in multiple PDF files

I am new to the Python world and am trying to build a solution I struggle to develop. The goal is to check that some mandatory information (keywords) is present in a PDF. I have an Excel file where each row corresponds to a transaction, and I need to check that every transaction (and the mandatory information related to it) is in a corresponding PDF sent during the day.
So, on one side, I have several Excel rows in a sheet with the mandatory information (corresponding to the info on each transaction), and on the other side, I have a folder with several PDFs.
I try to extract the data of each PDF so the workflow can check whether the information for each row of my Excel file is in a single PDF. I checked some questions raised here and tried to apply some solutions to my problem, but I haven't managed to obtain a fully working solution.
I have been able to build the partial code that will extract the pdf data and look for the keywords:
import os
from glob import glob
import re
from PyPDF2 import PdfFileReader

def search_page(pattern, page):
    yield from pattern.findall(page.extractText())

def search_document(pattern, path):
    document = PdfFileReader(path)
    for page in document.pages:
        yield from search_page(pattern, page)

searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compile(r'\b(?:%s)\b' % '|'.join(searchWords))  # re.compile, not re.compiler

for path in glob('path of my folder with all the pdf files'):
    matches = search_document(pattern, path)
    # inspired by a solution on Stack Overflow used to count the occurrences of keywords
Also, I think that using pandas to build the list of keywords should work, but I can't use it in my previous code; the search tool wants a string, not a list.
import pandas as pd

df = pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df)  # I wanted to check that the code was selecting the right columns only, as some other columns have unnecessary information
I don't know how to build a searchWords list for each row of my Excel file and feed it into the first part of the code. I also don't know how to require that ALL the keywords of a list (a row in Excel) are found, as it is mandatory to have all the information of a transaction in the same PDF. When it finds all the info, it should return "ok row 1" or something like that and then run the check for the second row, etc. (and report an error if it doesn't find all the information).
P.S.: Originally, I wanted only to extract the data with a Python script and add it to an Alteryx workflow, but the Python tool of Alteryx doesn't accept some packages in my company.
I would be very thankful for any help!
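A hedged sketch of one possible approach, assuming the PyPDF2 version used in the partial code above and the same placeholder paths; the column selection and the all()/any() logic are illustrative assumptions, not a confirmed solution. It builds one keyword list per Excel row and reports "ok" for a row only if a single PDF contains all of that row's keywords.

import re
from glob import glob

import pandas as pd
from PyPDF2 import PdfFileReader

def document_text(path):
    # concatenate the text of every page of one PDF
    reader = PdfFileReader(path)
    return ' '.join(page.extractText() for page in reader.pages)

df = pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
pdf_texts = {path: document_text(path) for path in glob('path of my folder with all the pdf files')}

for i, row in df.iterrows():
    # one keyword list per transaction row; empty cells are skipped
    keywords = [str(value) for value in row.dropna()]
    # the row passes only if one single PDF contains ALL of its keywords
    found = any(
        all(re.search(r'\b%s\b' % re.escape(keyword), text) for keyword in keywords)
        for text in pdf_texts.values()
    )
    print('ok row %d' % (i + 1) if found else 'error row %d' % (i + 1))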

Reduce API call from gspread library

I'm currently trying to get data using the gspread API from a Drive folder containing about 50 Excel files, each containing about 10 sheets (about 500 sheets in total).
I want to get 3 specific columns from every sheet of every file and append them to a dataframe.
I got this code working:
# imports implied by the snippet (added for completeness)
from operator import itemgetter

import gspread as gs
import pandas as pd
from gspread_dataframe import get_as_dataframe, set_with_dataframe
from oauth2client.service_account import ServiceAccountCredentials

scope = ["https://spreadsheets.google.com/feeds",
         'https://www.googleapis.com/auth/spreadsheets',
         "https://www.googleapis.com/auth/drive.file",
         "https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("gspread/service_account.json", scope)
client = gs.authorize(creds)

file_list = client.list_spreadsheet_files()
file_list = list(map(itemgetter('name'), file_list))

df_final = pd.DataFrame()  # added: accumulator for the appended frames

for files in file_list:
    file = client.open(files)
    worksheet_list = file.worksheets()
    print(files)
    for sheet in worksheet_list[1:]:
        print(sheet)
        set_with_dataframe(sheet, df)
        df = get_as_dataframe(sheet, parse_dates=True, usecols=[5, 7, 9], skiprows=1, header=None)
        df.drop(df.index[10:100], axis=0, inplace=True)
        print('Ajout de : ', df)
        df_final = df_final.append(df)

df_final.fillna('', inplace=True)
df_final.reset_index(drop=True, inplace=True)
print(df_final)
The thing is I always get an error 429 (too many API calls) and I can't figure out how to reduce the number of calls. Even when using the get_all_values() or get_all_records() functions, it still has to loop through all the sheets.
Even when adding a time.sleep(30) for each file.
If I'm not wrong, I make about 22 calls for each file of 10 sheets: 2 per file (file = client.open(files) and worksheet_list = file.worksheets()) and 2 for each sheet (set_with_dataframe() and get_as_dataframe()).
I could add a time.sleep() for each worksheet instead, but it would take very long (500+ sheets).
I could also change the pause time depending on the number of sheets (some files have more).
My questions are:
1) Is there a function that could get all the data from all the sheets instead of having to loop through all the sheets?
2) If not, is there a solution to reduce the calls without having to use pauses? If yes, that would be a solution and would reduce the calls drastically.
Thanks in advance and nice evening,
Alex
To answer your questions:
There will soon be a method in gspread that returns all the data from all the sheets. See this issue here.
As of today the above solution is not available, so in order to pull all the data from a spreadsheet you can build your own request and use gspread to handle that request for you. Have a look at the library source code, in the Worksheet class, at the methods that return values (like get_range or get or range). You can easily build your own request and request the full spreadsheet values.
On the side, if you wish to reduce API calls you can do so by opening your spreadsheet files using their IDs instead of listing all the files and then looping over them. You will drop the initial call to the Drive API that lists the files.
The method list_spreadsheet_files is costly because the response is paginated, so it will make as many calls as there are pages (I don't know the size of a page). Then the method open is costly too, because it makes a new call to list_spreadsheet_files and then loops over the result to open the right file using its ID.
I would definitely start with that part of your code to reduce API calls.
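A sketch of that advice, assuming a recent gspread release where Spreadsheet.values_batch_get is available, and assuming you keep your own (hypothetical) list of file IDs instead of listing the Drive folder. This brings each file down to roughly one metadata call plus one batched values call.

import gspread as gs
import pandas as pd

client = gs.authorize(creds)  # creds built exactly as in the question
spreadsheet_ids = ['<file id 1>', '<file id 2>']  # hypothetical: one ID per file

frames = []
for key in spreadsheet_ids:
    book = client.open_by_key(key)                    # opens by ID, no Drive listing involved
    titles = [ws.title for ws in book.worksheets()]   # one metadata call
    # one values.batchGet call covering every sheet of this file (skipping the first, as in the question)
    result = book.values_batch_get([f"'{title}'" for title in titles[1:]])
    for value_range in result['valueRanges']:
        rows = value_range.get('values', [])
        df = pd.DataFrame(rows[1:])                   # drop the header row
        frames.append(df.iloc[:, [5, 7, 9]])          # keep the 3 columns of interest

df_final = pd.concat(frames, ignore_index=True).fillna('')
print(df_final)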

openpyxl blocking excel file after first read

I am trying to overwrite a value in a given cell using openpyxl. I have two sheets. One is called Raw; it is populated by API calls. The second is Data, which is fed off the Raw sheet. The two sheets have exactly the same shape (cols/rows). I am comparing the two to see if there is a bay assignment in Raw. If there is, grab it into the Data sheet. If both Raw and Data have that value missing, then run a complex algorithm (irrelevant to this question) to assign a bay number based on logic.
I am having problems with rewriting the Excel file using openpyxl.
Here's an example of my code:
import pandas as pd  # imports implied by the snippet
from openpyxl import load_workbook

data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
no_bay_res = data_df[data_df['Bay assignment'].isnull()].reset_index()  # grab rows where there is no bay assignment in a specific column

book = load_workbook("Algo Build v23test.xlsx")
sheet = book["MondayData"]

for index, reservation in no_bay_res.iterrows():
    idx = int(reservation['index'])
    if pd.isna(raw_df.iloc[idx, 13]):
        continue
    else:
        value = raw_df.iat[idx, 13]
        data_df.iloc[idx, 13] = value
        sheet.cell(idx + 2, 14).value = int(value)

book.save("Algo Build v23test.xlsx")
book.close()
print(value)  # 302
Now the problem is that it seems book.close() is not working. The book object is still callable in Python. It overwrites the Excel file totally fine. However, if I try to run these two lines again:
data_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayData')
raw_df = pd.read_excel('Algo Build v23test.xlsx', sheet_name='MondayRaw')
I am getting datasets full of NULL values, except for the value that was replaced (image attached).
However, if I open that Excel file manually from the folder and save it (CTRL+S) and try running the code again - it works properly. Weirdest problem.
I need to loop the code above for Monday-Sunday, so I need it to be able to read the data again without manually resaving the file.
For some reason, pandas will read all the formula cells as NaN after the file has been written by openpyxl (openpyxl does not evaluate formulas, so the saved file carries no cached results for pandas to read) until the file has been opened, saved and closed in Excel again. Here's code that helps do that from within the script. However, it is rather slow.
import pandas as pd
import xlwings as xl

def df_from_excel(path, sheet_name):
    # open and re-save the workbook in a headless Excel instance so formula results are cached
    app = xl.App(visible=False)
    book = app.books.open(path)
    book.save()
    app.kill()
    return pd.read_excel(path, sheet_name)
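A usage sketch with the helper above and the sheet names from the question:

# re-read both sheets through Excel so the formula results are materialised
data_df = df_from_excel('Algo Build v23test.xlsx', 'MondayData')
raw_df = df_from_excel('Algo Build v23test.xlsx', 'MondayRaw')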
I got the same problem; the only workaround I found was to terminate excel.exe manually from the Task Manager. After that everything went fine.

Corrupt File As Result of Accessing SharePoint

I am trying to create a Python script that will take 10+ Excel tables of data and append them into a single table. I'd also like to clear the destination table's contents first, so every time the code runs it provides updated values from the 10 files.
I've tried to access these files within SharePoint, but every time I do, the file comes back corrupt. I'm in the first stages of this code, but I'd like it to at least be able to pull the data into Python for me to start manipulating.
import requests
from requests.auth import HTTPBasicAuth
file = "https://ts.company.com/:x:/r/sites/XXXX/Shared%20Documents/General/Folder/XXXX.xlsx?d=w2ef8664034b243c89400e549b0cbf36d&csf=1&e=b5mA10"
username = 'robert.xxxxx#company.com'
password = 'password123'
resp=requests.get(file, auth=HTTPBasicAuth(username, password))
output = open('test.xlsx', 'wb')
output.write(resp.content)
output.close()
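One hedged diagnostic sketch (an assumption about the likely cause, not a confirmed fix): when SharePoint rejects basic authentication it typically returns an HTML sign-in or error page, and saving that response produces a "corrupt" .xlsx. Checking the status code and content type before writing shows whether the response is really the workbook.

resp = requests.get(file, auth=HTTPBasicAuth(username, password))
print(resp.status_code)
print(resp.headers.get('Content-Type'))

# a real .xlsx arrives as a spreadsheet/zip payload, not text/html
if resp.ok and 'html' not in resp.headers.get('Content-Type', ''):
    with open('test.xlsx', 'wb') as output:
        output.write(resp.content)
else:
    print('Not a workbook - probably an authentication or redirect page')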

Is there a way for 2 scripts to write to the same file?

I have limited Python experience, but I am determined to learn. I am trying to create a script that writes some data inputs to Excel until stopped. It is very straightforward when a single person is using it, but the problem is that 2 people will be using it at once.
I am thinking about keeping it simple and just having 2 copies of the same script running at the same time, but the problem comes when the file is saved. If I have two files being saved with the same name, one is going to overwrite the other and the data will be lost. Is there a way to have the scripts create files with different names without having to manually change the code? (This would eventually be scaled up to 20 computers running it.)
The loop looks like:
import xlwt
from xlwt import Workbook

wb = Workbook()
s1 = wb.add_sheet('Sheet 1')

data = []
row = 0    # added: starting row for the writes below
user = ''  # added: initial value so the while condition can be evaluated

while user != '0':
    user = input('Scan ID Badge: ')
    data.append(user)
    order = input('Scan order: ')
    data.append(order)
    item = input('Scan item barcode: ')
    data.append(item)
    for i in range(len(data)):
        s1.write(row, i, data[i])
    wb.save('OrderData.xls')
    data = []
    row += 1
If you want to use a tabular form of data storage anyway, you could switch to a real database and periodically create an Excel-like summary of the database file.
If you know all of the users using this script will be using machines with different network names, you could include the computer name in the XLS name:
import platform
filename = 'AssociateEfficiencyTemp-' + platform.node() + '.xls'
# ...
wb.save(filename)
(You can also use getpass.getuser() to (try and) get the username of the user running the script.)
You can then write another script that reads all of the separate files (glob.glob('AssociateEfficiencyTemp-*.xls') etc.) and combines them.
(I would suggest using another format than .xls for the intermediary files though, such as plain text files of JSON lines.)
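A minimal combiner sketch, assuming the per-machine .xls files produced above; reading legacy .xls through pandas needs the xlrd package installed, and writing .xlsx needs openpyxl.

from glob import glob
import pandas as pd

# stack every per-machine file into one table, preserving the raw rows
parts = [pd.read_excel(path, header=None) for path in glob('AssociateEfficiencyTemp-*.xls')]
combined = pd.concat(parts, ignore_index=True)
combined.to_excel('OrderDataCombined.xlsx', index=False, header=False)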
