Reading YAML file in python with `---` at the end and beginning

stream = open(afile, 'r')
self.meta = yaml.load(stream)
You can easily read a YAML file in Python this way, but when the file has --- at the end I get an error (the same happens with ...):
yaml.composer.ComposerError: expected a single document in the stream
in "", line 2, column 1
but found another document
in "", line 6, column 1
But YAML specs allow that:
YAML uses three dashes (“---”) to separate directives from document content. This also serves to signal the start of a document if no directives are present. Three dots ( “...”) indicate the end of a document without starting a new one, for use in communication channels.
So, how do you read this
---
title: "El punt de llibre"
abstract: "Estimar a quina pàgina està el punt de llibre"
keywords: ["when", "activitat", "3/3", "grup", "estimació", "aproximació", "funció lineal - proporcionalitat", "ca"]
comments: true
---
in python?

Your YAML stream/file appears to have more than one document in it. For example, trying to parse this would give the same error message:
---
title: "El punt de llibre"
abstract: "Estimar a quina pàgina està el punt de llibre"
keywords: ["when", "activitat", "3/3", "grup", "estimació", "aproximació", "funció lineal - proporcionalitat", "ca"]
comments: true
---
title: "El punt de llibre"
abstract: "Estimar a quina pàgina està el punt de llibre"
keywords: ["when", "activitat", "3/3", "grup", "estimació", "aproximació", "funció lineal - proporcionalitat", "ca"]
comments: true
---
title: "El punt de llibre"
abstract: "Estimar a quina pàgina està el punt de llibre"
keywords: ["when", "activitat", "3/3", "grup", "estimació", "aproximació", "funció lineal - proporcionalitat", "ca"]
comments: true
To process such a stream you could use the following approach:
import yaml
with open('test.yaml') as f_yaml:
    for doc in yaml.safe_load_all(f_yaml):
        print(doc)
Which would show you the following:
{'keywords': ['when', 'activitat', '3/3', 'grup', u'estimaci\xf3', u'aproximaci\xf3', u'funci\xf3 lineal - proporcionalitat', 'ca'], 'abstract': u'Estimar a quina p\xe0gina est\xe0 el punt de llibre', 'comments': True, 'title': 'El punt de llibre'}
{'keywords': ['when', 'activitat', '3/3', 'grup', u'estimaci\xf3', u'aproximaci\xf3', u'funci\xf3 lineal - proporcionalitat', 'ca'], 'abstract': u'Estimar a quina p\xe0gina est\xe0 el punt de llibre', 'comments': True, 'title': 'El punt de llibre'}
{'keywords': ['when', 'activitat', '3/3', 'grup', u'estimaci\xf3', u'aproximaci\xf3', u'funci\xf3 lineal - proporcionalitat', 'ca'], 'abstract': u'Estimar a quina p\xe0gina est\xe0 el punt de llibre', 'comments': True, 'title': 'El punt de llibre'}

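If your YAML source contains more than one document, you can still get just the first one by iterating over the result of yaml.load_all (or yaml.safe_load_all) and stopping after the first item.
However, it seems strange that a ... causes PyYaml to break, and you may want to report that as a bug.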
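For a file like the one in the question, where the extra document is just an artefact of the closing ---, another option is to cut out the first document as text and pass only that to the parser. A minimal sketch, assuming the markers sit alone on their own lines (the helper name first_document is made up here, and directives or content on the same line as --- are not handled):

```python
def first_document(text):
    """Return the text of the first YAML document in a multi-document stream."""
    lines = text.splitlines()
    # Skip an optional leading '---' start-of-document marker.
    start = 1 if lines and lines[0].rstrip() == "---" else 0
    body = []
    for line in lines[start:]:
        # '---' starts the next document, '...' ends the current one.
        if line.rstrip() in ("---", "..."):
            break
        body.append(line)
    return "\n".join(body)


sample = '---\ntitle: "El punt de llibre"\ncomments: true\n---\n'
front_matter = first_document(sample)
```

The returned string can then be passed to yaml.safe_load as usual.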

Use ruamel.yaml to handle a YAML file with comments and spaces:
import ruamel.yaml
yaml = ruamel.yaml.YAML()
with open(yaml_file) as f:
    for doc in yaml.load_all(f):
        print(doc)  # the loop body is missing in the original; printing is a placeholder


insert xml attributes texts inside json file

I am iterating over an XML file to insert some of its attributes inside a JSON to develop a Corpus. For some reason, when inserting the date and the body of the XML it always inserts the same text inside the JSON line
The XML file format (I'm trying to get the title, timestamp and text from all the pages):
<text bytes="34" xml:space="preserve">/* Utilizar MediaWiki:Common.js */</text>
My code:
if __name__ == "__main__":
    path = 'local path'
    tree = ET.parse(path)
    root = tree.getroot()
    with open("C:/Users/User/Documents/Github/News-Corpus/corpus.json", "w") as f:
        for page in tqdm(root.findall('page')):
            title = page.find('title').text
            dictionary = {}
            dictionary["title"] = title
            for revision in root.iter('revision'):
                timestamp = revision.find('timestamp').text
                dictionary["timestamp"] = timestamp
                body = revision.find('text').text
                dictionary["body"] = body
The output I get:
{"title": "MediaWiki:Monobook.js", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Administrators", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allmessages", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allmessagestext", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
{"title": "MediaWiki:Allpagessubmit", "timestamp": "2022-09-16T13:07:15Z", "body": "Los pasajeros les fue devuelta el importe de su billete de vuelo perdida?"}
As you can see, I always get the same timestamp and the same body. Does anyone know why this happens? Help is much appreciated.
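A likely cause, judging from the code shown: root.iter('revision') walks every revision element in the whole file on each pass of the outer loop, so every dictionary ends up holding the values of the last revision in the dump. Searching from page instead keeps the lookup inside the current page. A minimal sketch with made-up sample XML (the two pages, their timestamps and the pages_to_records helper are invented for illustration; a real MediaWiki export may also carry an XML namespace that has to be included in the tag names):

```python
import json
import xml.etree.ElementTree as ET

# Invented sample data, shaped like the structure described in the question.
SAMPLE = """<mediawiki>
  <page>
    <title>MediaWiki:Monobook.js</title>
    <revision>
      <timestamp>2005-01-01T00:00:00Z</timestamp>
      <text>/* Utilizar MediaWiki:Common.js */</text>
    </revision>
  </page>
  <page>
    <title>MediaWiki:Administrators</title>
    <revision>
      <timestamp>2006-02-02T00:00:00Z</timestamp>
      <text>Administradores</text>
    </revision>
  </page>
</mediawiki>"""


def pages_to_records(root):
    """Build one dict per revision, looking only inside the current page."""
    records = []
    for page in root.findall("page"):
        # page.iter(...) is scoped to this page; root.iter(...) would scan the whole file.
        for revision in page.iter("revision"):
            records.append({
                "title": page.findtext("title"),
                "timestamp": revision.findtext("timestamp"),
                "body": revision.findtext("text"),
            })
    return records


records = pages_to_records(ET.fromstring(SAMPLE))
# One JSON object per line, as in the output shown in the question.
lines = [json.dumps(record, ensure_ascii=False) for record in records]
```

Each entry of lines can then be written to the corpus file followed by a newline.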

How to add a new line into the comment of Glue Table Schema using Pyspark?

I tried \\n, but it didn't work. I think some other escape sequence or reserved word might do it, but I can't find the right one.
table_config = [
'dbName': f'gomez_datalake_{env}_{team}_{dataset}_db',
'table': 'ConFac',
'partitionKey': 'DL_PERIODO',
'schema': [
['TIPO_DE_VALOR', 'STRING', 2, None,

AttributeError: 'ChatBot' object has no attribute 'input'

I'm having trouble finding the error in my code:
from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer
from chatterbot.comparisons import JaccardSimilarity
from chatterbot.comparisons import LevenshteinDistance
from chatterbot.conversation import Statement
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
# Create an instance of the ChatBot class
chatbot = ChatBot(
    database='./database.sqlite5',  # database file (created automatically if it does not exist)
    input_adapter='chatterbot.input.TerminalAdapter',  # the question is read from the terminal
    output_adapter='chatterbot.output.TerminalAdapter',  # the answer is written to the terminal
    # A logic adapter is a class that returns a response to a given question.
    # You can use as many logic adapters as you like.
    #'chatterbot.logic.MathematicalEvaluation',  # logic adapter that answers maths questions in English
    #'chatterbot.logic.TimeLogicAdapter',  # logic adapter that answers questions about the current time in English
"import_path": "chatterbot.logic.BestMatch",
"statement_comparison_function": "chatterbot.comparisons.levenshtein_distance",
"response_selection_method": "chatterbot.response_selection.get_most_frequent_response"
# 'import_path': 'chatterbot.logic.LowConfidenceAdapter',
# 'threshold': 0.51,
# 'default_response': 'Disculpa, no te he entendido bien. ¿Puedes ser más específico?.'
# 'import_path': 'chatterbot.logic.SpecificResponseAdapter',
# 'input_text': 'Eso es todo',
# 'output_text': 'Perfecto. Hasta la próxima'
trainer = ChatterBotCorpusTrainer(chatbot)
# '¿Cómo estás?',
# 'Bien.',
# 'Me alegro.',
# 'Gracias.',
# 'De nada.',
# '¿Y tú?'
levenshtein_distance = LevenshteinDistance(None)
disparate = Statement('No te he entendido')  # turn a sentence into a Statement
entradaDelUsuario = ""  # holds whatever the user typed
while entradaDelUsuario != "adios":
    entradaDelUsuario = chatbot.input.process_input_statement()  # read the user's input
    statement, respuesta = chatbot.generate_response(entradaDelUsuario)
    print('¿Qué debería haber dicho?')
    entradaDelUsuarioCorreccion = chatbot.input.process_input_statement()
    print("He aprendiendo que cuando digas {} debo responder {}".format(entradaDelUsuarioAnterior.text,entradaDelUsuarioCorreccion.text))
    print("\n%s\n\n" % respuesta)
I have tried to follow the tutorial. I am new to Python and would appreciate help finding the error, since the following appears when I run the script:
AttributeError: 'ChatBot' object has no attribute 'input'
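As far as I can tell, ChatterBot 1.0 and later no longer have input and output adapters, so a ChatBot instance has no input attribute; tutorials written for older releases still use chatbot.input. The usual replacement is to read the text yourself and pass it to get_response. A minimal sketch of such a loop, with the bot passed in as a plain callable so it can be tried without ChatterBot installed (chat_loop and its parameters are names invented for this example):

```python
def chat_loop(get_response, read=input, write=print, stop_word="adios"):
    """Read lines until stop_word is entered, writing a response for each line."""
    while True:
        text = read()
        if text == stop_word:
            break
        # get_response stands in for a bot method that maps text to a reply.
        write(get_response(text))


# Dry run with a fake bot and scripted input instead of the terminal.
scripted = iter(["hola", "que tal", "adios"])
replies = []
chat_loop(lambda text: text.upper(), read=lambda: next(scripted), write=replies.append)
```

With a real bot this would be called as chat_loop(chatbot.get_response).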

xls to JSON using python3 xlrd

I have to directly convert an xls file to a JSON document using python3 and xlrd.
Table is here.
It's divided into three main categories (PUBLICATION, CONTENU, CONCLUSION) whose names are in column one (the first column is zero), and the number of rows per category can vary. Each row has three key values (INDICATEURS, EVALUATION, PROPOSITION) in columns 3, 5 and 7. There can be empty lines or missing values.
I have to convert that table to the following JSON data, which I have written by hand as a reference. It's valid.
"INDICATEUR": "Page de garde",
"EVALUATION": "Inexistante ou non conforme",
"PROPOSITION D'AMELIORATION": "Consulter l'example sur CANVAS"
"INDICATEUR": "Page de garde",
"EVALUATION": "Titre du TFE non conforme",
"PROPOSITION D'AMELIORATION": "Utilisez le titre avalisé par le conseil des études"
"INDICATEUR": "Orthographe et grammaire",
"EVALUATION": "Nombreuses fautes",
"PROPOSITION D'AMELIORATION": "Faire relire le document"
"INDICATEUR": "Nombre de page",
"EVALUATION": "Nombre de pages grandement différent à la norme",
"INDICATEUR": "Développement du sujet",
"EVALUATION": "Présentation de l'entreprise",
"INDICATEUR": "Développement du sujet",
"EVALUATION": "Plan de localisation inutile",
"PROPOSITION D'AMELIORATION": "Supprimer le plan de localisation"
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran excessives",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran Inutiles",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
"INDICATEUR": "Figures et capture d'écran",
"EVALUATION": "Captures d'écran illisibles",
"PROPOSITION D'AMELIORATION": "Pour chaque figure et capture d'écran se poser la question 'Qu'est-ce que cela apporte à mon sujet ?'"
"INDICATEUR": "Conclusion",
"EVALUATION": "Conclusion inexistante",
"INDICATEUR": "Bibliographie",
"EVALUATION": "Inexistante",
"INDICATEUR": "Bibliographie",
"EVALUATION": "Non normalisée",
"PROPOSITION D'AMELIORATION": "Ecrire la bibliographie selon la norme APA"
"EVALUATION": "Grave manquement sur le plan de la présentation",
"PROPOSITION D'AMELIORATION": "Lire le document 'Conseil de publication' disponible sur CANVAS"
"EVALUATION": "Risque de refus du document par le conseil des études",
My intention is to loop through lines, check rows[1] to identify the category, and sub-loop to add data as dictionary in a list by category.
Here is my code so far :
import xlrd
file = '/home/eh/Documents/Base de Programmation/Feedback/EvaluationEI.xls'
wb = xlrd.open_workbook(file)
sheet = wb.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
def readRows():
    for rownum in range(2,sheet.nrows):
        rows = sheet.row_values(rownum)
        indicateur = rows[3]
        evaluation = rows[5]
        amelioration = rows[7]
        publication = []
        contenu = []
        conclusion = []
        if rows[1] == "PUBLICATION":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                publication.append("INDICATEUR : " + indicateur , "EVALUATION : " + evaluation , "PROPOSITION D'AMELIORATION : " + amelioration)
        if rows[1] == "CONTENU":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                contenu.append("INDICATEUR : " + indicateur , "EVALUATION : " + evaluation , "PROPOSITION D'AMELIORATION : " + amelioration)
        if rows[1] == "CONCLUSION":
            if rows[3] == '' and rows[5] == '' and rows[7] == '':
                conclusion.append("INDICATEUR : " + indicateur , "EVALUATION : " + evaluation , "PROPOSITION D'AMELIORATION : " + amelioration)
    print (publication)
    print (contenu)
    print (conclusion)
I am having a hard time figuring out how to sub-loop for the right number of rows to separate data by categories.
Any help would be welcome.
Thank you in advance
Using the json package and the OrderedDict (to preserve key order), I think this gets to what you're expecting, and I've modified slightly so we're not building a string literal, but rather a dict which contains the data that we can then convert with json.dumps.
As Ron noted above, your previous attempt was skipping the lines where rows[1] was not equal to one of your three key values.
This should read every line, appending to the last non-empty key:
import json
from collections import OrderedDict
import xlrd

def readRows(file, s_index=0):
    """
    file: path to xls file
    s_index: sheet_index for the xls file
    returns a dict of OrderedDict of list of OrderedDict which can be parsed to JSON
    """
    d = {"EVALUATION" : OrderedDict()} # this will be the main dict for our JSON object
    wb = xlrd.open_workbook(file)
    sheet = wb.sheet_by_index(s_index)
    # getting the data from the worksheet
    data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
    categorie = None # no category seen yet
    # fill the dict with data:
    for _,row in enumerate(data[3:]):
        if row[1]: # if there's a value, then this is a new categorie element
            categorie = row[1]
            d["EVALUATION"][categorie] = []
        if categorie:
            i,e,a = row[3::2][:3]
            if i or e or a: # as long as there's any data in this row, we write the child element
                # this line is missing in the original; keys taken from the reference JSON
                d["EVALUATION"][categorie].append(OrderedDict([("INDICATEUR", i), ("EVALUATION", e), ("PROPOSITION D'AMELIORATION", a)]))
    return d
This returns a dict which can be easily serialized to JSON.
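To see the shape of the result without an .xls file, the same carry-forward idea can be run on a hand-written list of rows. This is only a sketch: the rows below are made up from values in the question, their assignment to categories is a guess, and group_rows is a name invented for this example.

```python
import json
from collections import OrderedDict


def group_rows(rows):
    """Attach each non-empty row to the most recent category seen in column 1."""
    grouped = OrderedDict()
    categorie = None
    for row in rows:
        if row[1]:
            # A value in column 1 starts (or resumes) a category.
            categorie = row[1]
            grouped.setdefault(categorie, [])
        if categorie is None:
            continue  # data before the first category has nowhere to go
        i, e, a = row[3], row[5], row[7]
        if i or e or a:
            grouped[categorie].append(OrderedDict([
                ("INDICATEUR", i),
                ("EVALUATION", e),
                ("PROPOSITION D'AMELIORATION", a),
            ]))
    return {"EVALUATION": grouped}


# Invented sample rows: eight columns each, data in columns 1, 3, 5 and 7.
ROWS = [
    ["", "PUBLICATION", "", "Page de garde", "", "Inexistante ou non conforme", "", "Consulter l'example sur CANVAS"],
    ["", "", "", "Orthographe et grammaire", "", "Nombreuses fautes", "", "Faire relire le document"],
    ["", "", "", "", "", "", "", ""],
    ["", "CONCLUSION", "", "Conclusion", "", "Conclusion inexistante", "", ""],
]

result = group_rows(ROWS)
print(json.dumps(result, indent=2, ensure_ascii=False))
```

The second row has no category of its own and lands under PUBLICATION, and the fully empty row is skipped.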
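Write to file if needed:
import io # for python 2
d = readRows(file,0)
with io.open(r'c:\debug\output.json','w',encoding='utf8') as out:
    json.dump(d, out, indent=2, ensure_ascii=False)
Note: in Python 3, I don't think you need io.open; the built-in open takes the same encoding argument.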
Is pandas not an option? Would add as a comment but don't have the rep.
From the documentation:
import pandas
df = pandas.read_excel('path_to_file.xls')
df.to_json(path_or_buf='output_path.json', orient='table')

Unicode elements in list save to file

I have two questions:
1) What have I done wrong in the script below? The result is not encoded properly and all non-standard characters are stored incorrectly. When I print out the data list it gives me a proper list of unicode types:
[u'Est-ce que tu peux traduire \xc3\xa7a pour moi? \n \n \n Can you translate this for me?'], [u'Chicago est tr\xc3\xa8s diff\xc3\xa9rente de Boston. \n \n \n Chicago is very different from Boston.'],
After that I strip all extra spaces and newlines, and the result in the file looks like this (it looks the same when printed and when saved to the file):
Est-ce que tu peux traduire ça pour moi?;Can you translate this for me?
Chicago est très différente de Boston.;Chicago is very different from Boston.
2) Which scripting language other than Python would you recommend?
import requests
import unicodecsv, os
from bs4 import BeautifulSoup
import re
import html5lib
countries = ["fr"] #,"id","bn","my","chin","de","es","fr","hi","ja","ko","pt","ru","th","vi","zh"]
for country in countries:
    f = open("phrase_" + country + ".txt","w")
    w = unicodecsv.writer(f, encoding='utf-8')
    toi = 1
    print(country)
    while toi<2:
        url = ""+ country +"/english-phrases.cfm?newCategoryShowed=" + str(toi) + "&sortBy=28"
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html5lib')
        [s.extract() for s in soup('script')]
        [s.extract() for s in soup('style')]
        [s.extract() for s in soup('head')]
        [s.extract() for s in soup("table" , { "height" : "102" })]
        [s.extract() for s in soup("td", { "class" : "copyLarge"})]
        [s.extract() for s in soup("td", { "width" : "21%"})]
        [s.extract() for s in soup("td", { "colspan" : "3"})]
        [s.extract() for s in soup("td", { "width" : "25%"})]
        [s.extract() for s in soup("td", { "class" : "blacktext"})]
        [s.extract() for s in soup("div", { "align" : "center"})]
        data = []
        rows = soup.find_all('tr', {"class": re.compile("Data.")})
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele])
        wordsList = []
        for index, item in enumerate(data):
            str_tmp = "".join(data[index]).encode('utf-8')
            str_tmp = re.sub(r' +\n\s+', ';', str_tmp)
            str_tmp = re.sub(r' +', ' ', str_tmp)
            print(str_tmp)
        toi += 1
You should use r.text, not r.content: content is the raw bytes and text is the decoded text:
soup = BeautifulSoup(r.text, 'html5lib')
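The \xc3\xa7 in the printed list is what the UTF-8 bytes of ç look like after being decoded with a single-byte codec such as Latin-1, which is presumably what happened somewhere between the raw bytes and the parsed text. A small Python 3 sketch of that mechanism, using one phrase from the question (the variable names are only illustrative):

```python
original = "Est-ce que tu peux traduire ça pour moi?"

raw = original.encode("utf-8")        # the bytes a server would send
garbled = raw.decode("latin-1")       # wrong codec: every byte becomes its own character
repaired = garbled.encode("latin-1").decode("utf-8")  # undo the wrong step, decode properly

print(repr(garbled))
print(repaired)
```

Decoding with the right codec in the first place avoids the repair step entirely.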
You can just write utf-8 encoded to file:
with open("out.txt","w") as f:
    for d in data:
        d = " ".join(d).encode("utf-8")
        d = re.sub(r'\n\s+', ';', d)
        d = re.sub(r' +', ' ', d)
        f.write(d)
Fais attention en conduisant. ;Be careful driving.Fais attention. ;Be careful.Est-ce que tu peux traduire ça pour moi? ;Can you translate this for me?Chicago est très différente de Boston. ;Chicago is very different from Boston.Ne t'inquiète pas. ;Don't worry.Tout le monde le sais. ;Everyone knows it.Tout est prêt. ;Everything is ready.Excellent. ;Excellent.De temps en temps. ;From time to time.Bonne idée. ;Good idea.Il l'aime beaucoup. ;He likes it very much.A l'aide! ;Help!Il arrive bientôt. ;He's coming soon.Il a raison. ;He's right.Il est très ennuyeux. ;He's very annoying.Il est très célèbre. ;He's very famous.Comment ça va? ;How are you?Comment va le travail? ;How's work going?Dépêche-toi! ;Hurry!J'ai déjà mangé. ;I ate already.Je ne vous entends pas. ;I can't hear you.Je ne sais pas m'en servir. ;I don't know how to use it.Je ne l'aime pas. ;I don't like him.Je ne l'aime pas. ;I don't like it.Je ne parle pas très bien. ;I don't speak very well.Je ne comprends pas. ;I don't understand.Je n'en veux pas. ;I don't want it.Je ne veux pas ça. ;I don't want that.Je ne veux pas te déranger. ;I don't want to bother you.Je me sens bien. ;I feel good.Je sors du travail à six heures. ;I get off of work at 6.J'ai mal à la tête. ;I have a headache.J'espère que votre femme et vous ferez un bon voyage. ;I hope you and your wife have a nice trip.Je sais. ;I know.Je l'aime. ;I like her.J'ai perdu ma montre. ;I lost my watch.Je t'aime. ;I love you.J'ai besoin de changer de vêtements. ;I need to change clothes.J'ai besoin d'aller chez moi. ;I need to go home.Je veux seulement un en-cas. ;I only want a snack.Je pense que c'est bon. ;I think it tastes good.Je pense que c'est très bon. ;I think it's very good.Je pensais que les vêtements étaient plus chers. ;I thought the clothes were cheaper.J'allais quitter le restaurant quand mes amis sont arrivés. ;I was about to leave the restaurant when my friends arrived.Je voudrais faire une promenade. 
;I'd like to go for a walk.Si vous avez besoin de mon aide, faites-le-moi savoir s'il vous plaît. ;If you need my help, please let me know.Je t'appellerai vendredi. ;I'll call you when I leave.Je reviendrai plus tard. ;I'll come back later.Je paierai. ;I'll pay.Je vais le prendre. ;I'll take it.Je t'emmenerai à l'arrêt de bus. ;I'll take you to the bus stop.Je suis un Américain. ;I'm an American.Je nettoie ma chambre. ;I'm cleaning my room.J'ai froid. ;I'm cold.Je viens te chercher. ;I'm coming to pick you up.Je vais partir. ;I'm going to leave.Je vais bien, et toi? ;I'm good, and you?Je suis content. ;I'm happy.J'ai faim. ;I'm hungry.Je suis marié. ;I'm married.Je ne suis pas occupé. ;I'm not busy.Je ne suis pas marié. ;I'm not married.Je ne suis pas encore prêt. ;I'm not ready yet.Je ne suis pas sûr. ;I'm not sure.Je suis désolé, nous sommes complets. ;I'm sorry, we're sold out.J'ai soif. ;I'm thirsty.Je suis très occupé. Je n'ai pas le temps maintenant. ;I'm very busy. I don't have time now.Est-ce que Monsieur Smith est un Américain? ;Is Mr. Smith an American?Est-ce que ça suffit? ;Is that enough?C'est plus long que deux kilomètres. ;It's longer than 2 miles.Je suis ici depuis deux jours. ;I've been here for two days.J'ai entendu dire que le Texas était beau comme endroit. ;I've heard Texas is a beautiful place.Je n'ai jamais vu ça avant. ;I've never seen that before.Juste un peu. ;Just a little.Juste un moment. ;Just a moment.Laisse-moi vérifier. ;Let me check.laisse-moi y réfléchir. ;Let me think about it.Allons voir. ;Let's go have a look.Pratiquons l'anglais. ;Let's practice English.Pourrais-je parler à madame Smith s'il vous plaît? ;May I speak to Mrs. Smith please?Plus que ça. ;More than that.Peu importe. ;Never mind.La prochaine fois. ;Next time.Non, merci. ;No, thank you.Non. ;No.N'importe quoi. ;Nonsense.Pas récemment. ;Not recently.Pas encore. ;Not yet.Rien d'autre. ;Nothing else.Bien sûr. ;Of course.D'accord. 
;Okay.S'il vous plaît remplissez ce formulaire. ;Please fill out this form.S'il vous plaît emmenez-moi à cette adresse. ;Please take me to this address.S'il te plaît écris-le. ;Please write it down.Vraiment? ;Really?Juste ici. ;Right here.Juste là. ;Right there.A bientôt. ;See you later.A demain. ;See you tomorrow.A ce soir. ;See you tonight.Elle est jolie. ;She's pretty.Désolé de vous déranger. ;Sorry to bother you.Arrête! ;Stop!Tente ta chance. ;Take a chance.Réglez ça dehors. ;Take it outside.Dis-moi. ;Tell me.Merci Mademoiselle. ;Thank you miss.Merci Monsieur. ;Thank you sir.Merci beaucoup. ;Thank you very much.Merci. ;Thank you.Merci pour tout. ;Thanks for everything.Merci pour ton aide. ;Thanks for your help.Ça a l'air super. ;That looks great.Ça sent mauvais. ;That smells bad.C'est pas mal. ;That's alright.Ça suffit. ;That's enough.C'est bon. ;That's fine.C'est tout. ;That's it.Ce n'est pas juste. ;That's not fair.Ce n'est pas vrai. ;That's not right.C'est vrai. ;That's right.C'est dommage. ;That's too bad.C'est trop. ;That's too many.C'est trop. ;That's too much.Le livre est sous la table. ;The book is under the table.Ils vont revenir tout de suite. ;They'll be right back.Ce sont les mêmes. ;They're the same.Ils sont très occupés. ;They're very busy.Ça ne marche pas. ;This doesn't work.C'est très difficile. ;This is very difficult.C'est très important. ;This is very important.Essaie-le/la. ;Try it.Très bien, merci. ;Very good, thanks.Nous l'aimons beaucoup. ;We like it very much.Voudriez-vous prendre un message s'il vous plaît? ;Would you take a message please?Oui, vraiment. ;Yes, really.Vos affaires sont toutes là. ;Your things are all here.Tu es belle. ;You're beautiful.Tu es très sympa. ;You're very nice.Tu es très intelligent. ;You're very smart.
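On Python 3 the explicit encode call is not needed, because open can do the encoding itself. A sketch of the same clean-up written that way, with two rows copied from the question standing in for the scraped data (the clean helper, the temporary output path and the one-phrase-per-line layout are choices made for this example, not part of the answer above):

```python
import os
import re
import tempfile


def clean(cells):
    """Join the cell texts and collapse the whitespace between phrase and translation."""
    text = " ".join(cells)
    text = re.sub(r"\s*\n\s*", ";", text)  # a run of newlines and spaces becomes one separator
    return re.sub(r" +", " ", text)         # squeeze any remaining repeated spaces


data = [
    ["Est-ce que tu peux traduire ça pour moi? \n \n \n Can you translate this for me?"],
    ["Chicago est très différente de Boston. \n \n \n Chicago is very different from Boston."],
]

path = os.path.join(tempfile.mkdtemp(), "out.txt")
# The file object encodes to UTF-8 on write, so the strings stay as text throughout.
with open(path, "w", encoding="utf-8") as f:
    for cells in data:
        f.write(clean(cells) + "\n")

with open(path, encoding="utf-8") as f:
    written = f.read().splitlines()
```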
Also, you never use the lists those comprehensions build, so they seem a little pointless; plain for loops would do the same job.

