Generating a table with docx from a dataframe in python

Hello,
Currently I'm working on a project in which I have to generate some documents with the docx library in Python. I want to know how to generate a docx table from a dataframe, so that the output contains all the columns and rows of the dataframe I've created. Here is my code, but it's not working correctly because I can't get the final output:
table = doc.add_table(rows = len(detalle_operaciones_total1), cols=5)
table.style = 'Table Grid'
table.rows[0].cells[0].text = 'Nombre'
table.rows[0].cells[1].text = 'Operacion Nro'
table.rows[0].cells[2].text = 'Producto'
table.rows[0].cells[3].text = 'Monto en moneda de origen'
table.rows[0].cells[4].text = 'Monto en moneda local'
for y in range(1, len(detalle_operaciones_total1)):
    Nombre = str(detalle_operaciones_total1.iloc[y, 0])
    Operacion = str(detalle_operaciones_total1.iloc[y, 1])
    Producto = str(detalle_operaciones_total1.iloc[y, 2])
    Monto_en_MO = str(detalle_operaciones_total1.iloc[y, 3])
    Monto_en_ML = str(detalle_operaciones_total1.iloc[y, 4])
    table.rows[y].cells[0].text = Nombre
    table.rows[y].cells[1].text = Operacion
    table.rows[y].cells[2].text = Producto
    table.rows[y].cells[3].text = Monto_en_MO
    table.rows[y].cells[4].text = Monto_en_ML
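One likely cause of the incomplete output: the header occupies row 0 of the table, so with rows=len(detalle_operaciones_total1) there is no room for every dataframe row, and the loop that starts at 1 never writes the first row of the dataframe. A minimal sketch of the usual fix, assuming doc is the python-docx Document created earlier and the dataframe has exactly those five columns in that order:

df = detalle_operaciones_total1
headers = ['Nombre', 'Operacion Nro', 'Producto',
           'Monto en moneda de origen', 'Monto en moneda local']

# one extra row so the header does not eat the first data row
table = doc.add_table(rows=len(df) + 1, cols=len(headers))
table.style = 'Table Grid'

for col, title in enumerate(headers):
    table.rows[0].cells[col].text = title

# dataframe row i goes into table row i + 1, below the header
for i in range(len(df)):
    for col in range(len(headers)):
        table.rows[i + 1].cells[col].text = str(df.iloc[i, col])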


Python Index out of range Error in lib loop issue

Hi, everything fine? I hope so.
I'm dealing with this issue: list index out of range.
Error message:
c:\Users.....\Documents\t.py:41: FutureWarning: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas. This is the only engine in pandas that supports writing in the xls format. Install openpyxl and write to an xlsx file instead. You can set the option io.excel.xls.writer to 'xlwt' to silence this warning. While this option is deprecated and will also raise a warning, it can be globally set and the warning suppressed.
read_file.to_excel(planilhaxls, index = None, header=True)
The goal: I need a loop that takes a specific line from a worksheet such as sheet_1.csv, the corresponding line from sheet_2.csv, and from a third sheet as well, and stores them in 3 columns of a sheet_output.csv.
Issue: I'm getting an "index out of range" error and I don't know what to do about it.
Question: Is there any other way I can do this?
The code is below:
import xlrd as ex
import pyautogui as pag
import os
import pyperclip as pc
import pandas as pd
import pygetwindow as pgw
import openpyxl
#Inputs
numerolam = int(input('Escolha o número da lamina: '))
amostra = input('Escoha a amostra: (X, Y, W ou Z): ')
milimetro_inicial = int(input("Escolha o milimetro inicial: "))
milimetro_final = int(input("Escolha o milimetro final: "))
tipo = input("Escolha o tipo - B para Branco & E para Espelho: ")
linha = int(input("Escolha a linha da planilha: "))
# Code conversion
if tipo == 'B':
    tipo2 = 'BRA'
else:
    tipo2 = 'ESP'
# xlsx file names
#planilhaxlsx = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.xlsx'
#planilhaxls = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.xls'
#planilhacsv = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.csv'
#planilhacsv_ = f'A{numerolam}{amostra}{milimetro_final}{tipo2}.csv'
#arquivoorigin = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.opj'
# Folder
pasta = f'L{numerolam}{amostra}'
while milimetro_inicial < milimetro_final:
    planilhaxlsx = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.xlsx'
    planilhaxls = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.xls'
    planilhacsv = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.csv'
    planilhacsv_ = f'A{numerolam}{amostra}{milimetro_final}{tipo2}.csv'
    arquivoorigin = f'A{numerolam}{amostra}{milimetro_inicial}{tipo2}.opj'
    # Convert the .csv file to .xls and .xlsx
    read_file = pd.read_csv(planilhacsv)
    read_file.to_excel(planilhaxls, index=None, header=True)
    #read_file.to_excel(planilhaxlsx, index=None, header=True)
    # Open the .xls file with xlrd - Excel file
    book = ex.open_workbook(planilhaxls)
    sh = book.sheet_by_index(0)
    # Variable declarations
    coluna_inicial = 16  # Q - starts at 0
    valor = []
    index = 0
    # Loop that stores the row's value for columns Q-Z in valor[0..len-1]
    while coluna_inicial < 25:
        # ERROR ON THE LINE BELOW
        temp = sh.cell_value(linha, coluna_inicial)
        valor.append(temp)  # Append the value
        print(index)
        print(valor[index])
        index += 1
        coluna_inicial += 1
    # Open the output spreadsheet
    wb = openpyxl.Workbook()
    ws = wb.active
    # Start the write loop
    colunas = ['A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z']
    idx_colunas = 0
    contador_loop = colunas[idx_colunas]
    linha_loop = 1
    index_out = 0
    s = f'{contador_loop}{linha_loop}'
    print(s)
    while linha_loop < len(valor):
        valor[index_out] = "{}".format(valor[index_out])
        ws[s].value = valor[index_out]
        print(valor[index_out] + ' feito')
        linha_loop += 1
        idx_colunas += 1
        index_out += 1
    # Save the output spreadsheet
    wb.save("teste.xlsx")
    milimetro_inicial += 1
Your problem is on this line
temp = sh.cell_value(linha, coluna_inicial)
Two index parameters are used here, linha and coluna_inicial. 'linha' appears to be a static value, so the problem would seem to be with 'coluna_inicial', which is increased by 1 on each iteration:
coluna_inicial += 1
The loop continues while 'coluna_inicial' is less than 25. I suggest you check the number of columns in the sheet 'sh' using
sh.ncols
either for debugging or as the preferred upper bound of your loop. If it is less than 25, you will get the index error as soon as 'coluna_inicial' reaches 'sh.ncols'.
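As a sketch of that idea, the column loop could be capped by the sheet's real width (ultima_coluna is just an illustrative name):

ultima_coluna = min(25, sh.ncols)  # never index past the sheet's last column
while coluna_inicial < ultima_coluna:
    temp = sh.cell_value(linha, coluna_inicial)
    valor.append(temp)
    coluna_inicial += 1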
Additional information:
Since this is an xls file, there is no need for delimiter settings; your code as-is should open it correctly. However, the xls workbook to be opened is determined by the parameters the user enters at the start, which presumably means there are several files in the directory to choose from. Are you sure you are checking the xls file that your code run is actually opening? Also, if there is more than one sheet in the workbook(s), are you opening the correct sheet?
You can print the workbook name to be sure which one is being opened. Also, by adding verbosity to the open_workbook command (level 2 should be high enough), it will print details of the available sheets to the console when the book is opened, including the number of rows and columns in each.
print(planilhaxls)
book = ex.open_workbook(planilhaxls, verbosity=2)
sh = book.sheet_by_index(0)
print(sh.name)
E.g.
BOF: op=0x0809 vers=0x0600 stream=0x0010 buildid=14420 buildyr=1997 -> BIFF80
sheet 0('Sheet1') DIMENSIONS: ncols=21 nrows=21614
BOF: op=0x0809 vers=0x0600 stream=0x0010 buildid=14420 buildyr=1997 -> BIFF80
sheet 1('Sheet2') DIMENSIONS: ncols=13 nrows=13
The print(sh.name) call shown above checks the name of the sheet that 'sh' is assigned to.

Update dictionary within dictionary dynamically return same character count for different parameters

I'm trying to retrieve Wikipedia pages' character counts for articles in different languages. I'm using a dictionary whose keys are the page names and whose values are dictionaries with the language as key and the count as value.
The code is:
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicty = {}
dicto ={}
numz = 0
for x in langs:
wikipedia.set_lang(x)
for y in pages:
pagelang = wikipedia.page(y)
splittedpage = pagelang.content
dicto[y] = dicty
for char in splittedpage:
numz +=1
dicty[x] = numz
If I print dicto, I get
{"L'arte della gioia": {'it': 72226, 'en': 111647}, 'Il nome della rosa': {'it': 72226, 'en': 111647}}
The count should be different for the two pages.
Please try this code. I didn't run it because I don't have the wikipedia module.
Notes:
Since your expected result is dict[page, dict[lang, cnt]], I think it is more natural to iterate over the pages first and then over the languages. If you want to iterate over the languages first for performance reasons, please comment.
The character count of a text can simply be len(text); there is no need to iterate and count it yourself.
Use descriptive variable names; you will quickly get lost in variables named x and y.
Also note that in your original code dicto[y] = dicty stores the same inner dictionary under every page key, and numz is never reset, which is why both pages show identical counts.
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicto = {}
for page in pages:
lang_cnt_dict = {}
for lang in langs:
wikipedia.set_lang(lang)
page_lang = wikipedia.page(page)
chars_cnt = len(pagelang.content)
lang_cnt_dict[lan] = chars_cnt
dicto[page] = lang_cnt_dict
print(dicto)
Update:
If you want to iterate over the languages first:
pages = ["L'arte della gioia", "Il nome della rosa"]
langs = ["it", "en"]
dicto = {}
for lang in langs:
wikipedia.set_lang(lang)
for page in pages:
page_lang = wikipedia.page(page)
chars_cnt = len(pagelang.content)
if page in dicto:
dicto[page][lang] = chars_cnt
else:
dicto[page] = {lang: chars_cnt}
print(dicto)

Scrapy using loops in Python

I am trying to scrape a web page. The part of the web data I'm working with is route_data:
route_data = ["javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/215_14.jpg', '/wc/htdocs/web', 'Batet Lamaña, Meritxell (Presidenta del Congreso de los Diputados)', 'Diputada por Barcelona', 'G.P. Socialista' ,'','');",
"javascript:mostrarFotografiaHemiciclo( '/wc/htdocs/web/img/diputados/peq/168_14.jpg', '/wc/htdocs/web', 'Rodríguez Gómez de Celis, Alfonso (Vicepresidente Primero)', 'Diputado por Sevilla', 'G.P. Socialista' ,'','');",]
I create a dictionary with empty values.
dictionary_data = {"Nombre":None, "Territorio":None, "Partido":None, "url":None}
For each line, I have to save the following in dictionary_data:
url = /wc/htdocs/web/img/diputados/peq/215_14.jpg
Nombre = Batet Lamaña, Meritxell
Territorio = Diputada por Barcelona
Partido = G.P. Socialista
To do this, I loop over route_data:
for i in route_data:
    text = i.split(",")
    nombre = text[2:4]
    territorio = text[4]
    partido = text[5]
But the output is:
[" 'Batet Lamaña", " Meritxell (Presidenta del Congreso de los Diputados)'"] 'Diputada por Barcelona' 'G.P. Socialista'
[" 'Rodríguez Gómez de Celis", " Alfonso (Vicepresidente Primero)'"] 'Diputado por Sevilla' 'G.P. Socialista'
How can I get this into the dictionary correctly?
A simple solution would be:
import re

all_routes = []
for i in route_data:
    # grab every '...' quoted chunk and drop the surrounding quotes
    text = [t.strip("' ") for t in re.findall(r"'.+?'", i)]
    all_routes.append(
        {"Nombre": re.sub(r'\(.*?\)', '', text[2]).strip(),
         "Territorio": text[3],
         "Partido": text[-2],
         "Url": text[0]})

Parse xml w/ xsd to CSV with Python?

I am trying to parse a very large XML file which I downloaded from OSHA's website and convert it into a CSV so I can use it in a SQLite database along with some other spreadsheets. I would just use an online converter, but the OSHA file is apparently too big for all of them.
I wrote a script in Python which looks like this:
import csv
import xml.etree.cElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
xml_data_to_csv = open('Out.csv', 'w')
list_head = []
Csv_writer = csv.writer(xml_data_to_csv)
count = 0
for element in root.findall('data'):
    List_nodes = []
    if count == 0:
        inspection_number = element.find('inspection_number').tag
        list_head.append(inspection_number)
        establishment_name = element.find('establishment_name').tag
        list_head.append(establishment_name)
        city = element.find('city')
        list_head.append(city)
        state = element.find('state')
        list_head.append(state)
        zip_code = element.find('zip_code')
        list_head.append(zip_code)
        sic_code = element.find('sic_code')
        list_head.append(sic_code)
        naics_code = element.find('naics_code')
        list_head.append(naics_code)
        sampling_number = element.find('sampling_number')
        list_head.append(sampling_number)
        office_id = element.find('office_id')
        list_head.append(office_id)
        date_sampled = element.find('date_sampled')
        list_head.append(date_sampled)
        date_reported = element.find('date_reported')
        list_head.append(date_reported)
        eight_hour_twa_calc = element.find('eight_hour_twa_calc')
        list_head.append(eight_hour_twa_calc)
        instrument_type = element.find('instrument_type')
        list_head.append(instrument_type)
        lab_number = element.find('lab_number')
        list_head.append(lab_number)
        field_number = element.find('field_number')
        list_head.append(field_number)
        sample_type = element.find('sample_type')
        list_head.append(sample_type)
        blank_used = element.find('blank_used')
        list_head.append(blank_used)
        time_sampled = element.find('time_sampled')
        list_head.append(time_sampled)
        air_volume_sampled = element.find('air_volume_sampled')
        list_head.append(air_volume_sampled)
        sample_weight = element.find('sample_weight')
        list_head.append(sample_weight)
        imis_substance_code = element.find('imis_substance_code')
        list_head.append(imis_substance_code)
        substance = element.find('substance')
        list_head.append(substance)
        sample_result = element.find('sample_result')
        list_head.append(sample_result)
        unit_of_measurement = element.find('unit_of_measurement')
        list_head.append(unit_of_measurement)
        qualifier = element.find('qualifier')
        list_head.append(qualifier)
        Csv_writer.writerow(list_head)
        count = +1
    inspection_number = element.find('inspection_number').text
    List_nodes.append(inspection_number)
    establishment_name = element.find('establishment_name').text
    List_nodes.append(establishment_name)
    city = element.find('city').text
    List_nodes.append(city)
    state = element.find('state').text
    List_nodes.append(state)
    zip_code = element.find('zip_code').text
    List_nodes.append(zip_code)
    sic_code = element.find('sic_code').text
    List_nodes.append(sic_code)
    naics_code = element.find('naics_code').text
    List_nodes.append(naics_code)
    sampling_number = element.find('sampling_number').text
    List_nodes.append(sampling_number)
    office_id = element.find('office_id').text
    List_nodes.append(office_id)
    date_sampled = element.find('date_sampled').text
    List_nodes.append(date_sampled)
    date_reported = element.find('date_reported').text
    List_nodes.append(date_reported)
    eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
    List_nodes.append(eight_hour_twa_calc)
    instrument_type = element.find('instrument_type').text
    List_nodes.append(instrument_type)
    lab_number = element.find('lab_number').text
    List_nodes.append(lab_number)
    field_number = element.find('field_number').text
    List_nodes.append(field_number)
    sample_type = element.find('sample_type').text
    List_nodes.append(sample_type)
    blank_used = element.find('blank_used').text
    List_nodes.append(blank_used)
    time_sampled = element.find('time_sampled').text
    List_nodes.append(time_sampled)
    air_volume_sampled = element.find('air_volume_sampled').text
    List_nodes.append(air_volume_sampled)
    sample_weight = element.find('sample_weight').text
    List_nodes.append(sample_weight)
    imis_substance_code = element.find('imis_substance_code').text
    List_nodes.append(imis_substance_code)
    substance = element.find('substance').text
    List_nodes.append(substance)
    sample_result = element.find('sample_result').text
    List_nodes.append(sample_result)
    unit_of_measurement = element.find('unit_of_measurement').text
    List_nodes.append(unit_of_measurement)
    qualifier = element.find('qualifier').text
    List_nodes.append(qualifier)
    Csv_writer.writerow(List_nodes)
xml_data_to_csv.close()
But when I run the code I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not totally sure.
Does anyone know what the issue is here?
The code below is a 'compact' version of your code.
It assumes that the XML structure looks like the one in the script variable xml (based on https://www.osha.gov/opengov/sample_data_2011.zip).
The main difference between this sample code and yours is that I define the fields I want to collect once (see FIELDS) and use that definition across the script.
import xml.etree.ElementTree as ET
FIELDS = ['lab_number', 'instrument_type'] # TODO add more fields
xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
<DATA_RECORD>
<inspection_number>316180165</inspection_number>
<establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
<city>EUFAULA</city>
<state>AL</state>
<zip_code>36027</zip_code>
<sic_code>1799</sic_code>
<naics_code>238990</naics_code>
<sampling_number>434866166</sampling_number>
<office_id>418600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>TBD</instrument_type>
<lab_number>L13645</lab_number>
<field_number>S1</field_number>
<sample_type>B</sample_type>
<blank_used>N</blank_used>
<time_sampled></time_sampled>
<air_volume_sampled></air_volume_sampled>
<sample_weight></sample_weight>
<imis_substance_code>S777</imis_substance_code>
<substance>Soil</substance>
<sample_result>0</sample_result>
<unit_of_measurement>AAAAA</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD>
<DATA_RECORD>
<inspection_number>315516757</inspection_number>
<establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
<city>WORCESTER</city>
<state>MA</state>
<zip_code>1608</zip_code>
<sic_code>1771</sic_code>
<naics_code>238110</naics_code>
<sampling_number>423259902</sampling_number>
<office_id>112600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>GRAV</instrument_type>
<lab_number>L13355</lab_number>
<field_number>9831B</field_number>
<sample_type>P</sample_type>
<blank_used>N</blank_used>
<time_sampled>184</time_sampled>
<air_volume_sampled>340.4</air_volume_sampled>
<sample_weight>.06</sample_weight>
<imis_substance_code>9135</imis_substance_code>
<substance>Particulates not otherwise regulated (Total Dust)</substance>
<sample_result>0.176</sample_result>
<unit_of_measurement>M</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD></main>'''
root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
with open('out.csv', 'w') as out:
    out.write(','.join(FIELDS) + '\n')
    for record in records:
        values = [record.find(f).text for f in FIELDS]
        out.write(','.join(values) + '\n')
out.csv
lab_number,instrument_type
L13645,TBD
L13355,GRAV
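To run the same idea against the downloaded data.xml instead of the inline sample, the tree can be parsed from disk and csv.writer used for the output; csv.writer also takes care of quoting values that contain commas and of missing (None) text, which the plain ','.join above does not. A sketch along those lines, reusing the element names from the sample:

import csv
import xml.etree.ElementTree as ET

FIELDS = ['lab_number', 'instrument_type']  # TODO add more fields

root = ET.parse('data.xml').getroot()

with open('out.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(FIELDS)
    for record in root.findall('.//DATA_RECORD'):
        # findtext returns '' when the element is empty or missing
        writer.writerow([record.findtext(f, default='') for f in FIELDS])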

What can I do to scrape 10000 pages without appearing captchas?

Hi there, I've been trying to collect all the information from 10,000 pages of this site for a school project. I thought everything was fine until I got an error on page 4. I checked the page manually and found that it now asks me for a captcha.
What can I do to avoid it? Maybe set a timer between the searches?
Here is my code.
import bs4, requests, csv
g_page = requests.get("http://www.usbizs.com/NY/New_York.html")
m_page = bs4.BeautifulSoup(g_page.text, "lxml")
get_Pnum = m_page.select('div[class="pageNav"]')
MAX_PAGE = int(get_Pnum[0].text[9:16])
print("Recolectando información de la página 1 de {}.".format(MAX_PAGE))
contador = 0
information_list = []
for k in range(1, MAX_PAGE):
    c_items = m_page.select('div[itemtype="http://schema.org/Corporation"] a')
    c_links = []
    i = 0
    for link in c_items:
        c_links.append(link.get("href"))
        i += 1
    for j in range(len(c_links)):
        temp = []
        s_page = requests.get(c_links[j])
        i_page = bs4.BeautifulSoup(s_page.text, "lxml")
        print("Ingresando a: {}".format(c_links[j]))
        info_t = i_page.select('div[class="infolist"]')
        info_1 = info_t[0].text
        info_2 = info_t[1].text
        temp = [info_1, info_2]
        information_list.append(temp)
        contador += 1
    with open("list_information.cv", "w") as file:
        writer = csv.writer(file)
        for row in information_list:
            writer.writerow(row)
    print("Información de {} clientes recolectada y guardada correctamente.".format(j+1))
    g_page = requests.get("http://www.usbizs.com/NY/New_York-{}.html".format(k+1))
    m_page = bs4.BeautifulSoup(g_page.text, "lxml")
    print("Recolectando información de la página {} de {}.".format(k+1, MAX_PAGE))
print("Programa finalizado. Información recolectada de {} clientes.".format(contador))
