I have a problem similar to this question, in that I need to insert newlines in a YAML mapping value string and prefer not to insert \n myself. The answer suggests using:
Data: |
  Some data, here and a special character like ':'
  Another line of data on a separate line
instead of
Data: "Some data, here and a special character like ':'\n
Another line of data on a separate line"
which also adds a newline at the end, and that is unacceptable for me.
I tried using Data: > but that turned out to give completely different results. I have been stripping the final newline after reading in the YAML file; of course that works, but it is not elegant. Is there a better way to preserve newlines without adding an extra one at the end?
I am using python 2.7 fwiw
If you use |, the scalar becomes a literal block style scalar. But the default behaviour of | is clipping, and that doesn't get you the string you want (as it leaves the final newline).
You can "modify" the behaviour of | by attaching a block chomping indicator:
Strip
Stripping is specified by the “-” chomping indicator. In this case, the final line break and any trailing empty lines are excluded from the scalar’s content.
Clip
Clipping is the default behavior used if no explicit chomping indicator is specified. In this case, the final line break character is preserved in the scalar’s content. However, any trailing empty lines are excluded from the scalar’s content.
Keep
Keeping is specified by the “+” chomping indicator. In this case, the final line break and any trailing empty lines are considered to be part of the scalar’s content. These additional lines are not subject to folding.
By adding the strip chomping indicator '-' to '|', you can strip the final newline:¹
import ruamel.yaml as yaml

yaml_str = """\
Data: |-
  Some data, here and a special character like ':'
  Another line of data on a separate line
"""

data = yaml.load(yaml_str)
print(data)
gives:
{'Data': "Some data, here and a special character like ':'\nAnother line of data on a separate line"}
¹ This was done using ruamel.yaml of which I am the author. You should get the same result with PyYAML (of which ruamel.yaml is a superset, preserving comments and literal scalar blocks on round-trip).
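For reference, here is a minimal sketch (not from the original answer) comparing the three chomping indicators side by side; it uses PyYAML's safe_load, and ruamel.yaml loads these documents the same way:

import yaml  # PyYAML

# note the trailing blank line in the document, so keep ('+') has something extra to keep
template = "Data: {indicator}\n  line one\n  line two\n\n"

for indicator in ("|", "|-", "|+"):
    loaded = yaml.safe_load(template.format(indicator=indicator))
    print("%-2s %r" % (indicator, loaded["Data"]))

# |  'line one\nline two\n'     clip: single final newline kept
# |- 'line one\nline two'       strip: final newline removed
# |+ 'line one\nline two\n\n'   keep: trailing empty line kept as well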
Related
I have a long string in a dictionary which I will dump to a YAML file.
As an example
import yaml

d = {'test': {'long_string': "this is a long string that does not successfully split when it sees the character '\n' which is an issue"}}
ff = open('./test.yaml', 'w+')
yaml.safe_dump(d, ff)
Which produces the following output in the YAML file
test:
  long_string: "this is a long string that does not successfully split when it sees\
    \ the character '\n' which is an issue"
I want the string inside the YAML file to be split onto a new line only where the "\n" occurs; also, I don't want any characters indicating that it's a newline. I want the output as follows:
test:
  long_string: "this is a long string that does not successfully split when it sees the character ''
    which is an issue"
What do I need to do to make yaml.dump or yaml.safe_dump achieve this?
There is no general solution. YAML is a format intentionally designed in a way that lets the implementation decide on the exact representation of values.
What you can do is to suggest a format. The dumper will honor this suggestion if possible. The one scalar format that breaks at literal newlines in the value and nowhere else is a literal block scalar. This code will dump your string as such if possible:
import yaml, sys
class as_block(str):
    @staticmethod
    def represent(dumper, data):
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')

yaml.SafeDumper.add_representer(as_block, as_block.represent)

d = {'test': {'long_string': as_block(
    'this is a long string that does not successfully split when it sees the character\n which is an issue')}}
yaml.safe_dump(d, sys.stdout)
Output:
test:
  long_string: |-
    this is a long string that does not successfully split when it sees the character
     which is an issue
I use as_block for the string that should be written as a block scalar.
You can theoretically use this for all strings, but be aware that long_string and test would then also be written as block scalars, which is most probably not what you want.
This will not work when there is space before the line break, because YAML ignores space at the end of a line of a block scalar, so the serializer will choose another format to not lose the space character(s).
You can also take a step back and ask yourself why this is an issue in the first place. A YAML implementation is perfectly able to load the generated YAML and reconstruct your string.
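To illustrate that last point, a minimal sketch: whichever representation the dumper chooses, loading the file back reconstructs the exact string.

import yaml

original = "line one\nline two"          # embedded newline, no trailing one
dumped = yaml.safe_dump({"long_string": original})
assert yaml.safe_load(dumped)["long_string"] == original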
I'm using Python to parse a .csv file that contains line breaks in most values. This isn't an issue, since values are delimited by ".
However, I've noticed that during the construction of the .csv file at one point in time, long values were split into multiple lines (but kept within the same value), with an = character put at the end of one line to signify "the following line break is actually a concatenation". A minimal working example: the value
Hello, world!
How are you today?
could be represented as
"Hello, world!\n
How are you t=\n
oday?"
where \n denotes the one-byte line break character.
Does CSV have the concept of "line continuation characters"? The documentation of Python's csv library does not mention anything about it under the formatting section, and hence I wonder if this is common practice and if Python nevertheless has support. I know how to write a parser that concatenates these lines (a simple str.replace(v,"=\n","") probably suffices), but I'm just curious whether this is an idiosyncrasy of my file.
This turns out not to be a feature of CSV, but of MIME (and since my dataset consists of e-mails, this answers my question).
This usage of equals characters is part of quoted-printable encoding, and can be handled by the quopri Python module. See this answer for more details.
Using this module is better than a simple str.replace(v, "=\n", ""), because e-mails can contain other quoted-printable tokens that need decoding and do not appear on line ends (e.g. =09 to represent a horizontal tab). With quopri, you would write:
import quopri
v = ...
original = quopri.decodestring(v.encode("utf-8")).decode("utf-8")
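Applied to the example value from the question, a minimal sketch (the variable names here are just for illustration):

import quopri

v = "Hello, world!\nHow are you t=\noday?"    # the question's example, with a soft line break "=" + newline
original = quopri.decodestring(v.encode("utf-8")).decode("utf-8")
print(original)                               # Hello, world!
                                              # How are you today?
print(quopri.decodestring(b"a=09b"))          # b'a\tb' -- =09 decodes to a horizontal tab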
I stream data via Server-Sent Events and get about 500,000 datasets, but instead of getting one JSON document I get this (an example of 2 of the 500,000 datasets; this is how it looks when opened in gedit, where all question marks are \" and all new lines are \n):
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
My goal is to get this into a database. I thought I would put this into a dictionary, then create a pandas dataframe from it, and from there get it into a database. But this turned out to be quite cumbersome. I ended up with something like this:
c1 = data_json[1:-1]
c2 = c1.replace('{data:{', '{\"data\":{')
c3 = c2.replace('}data:{', ', ')
c4 = '{' + c3 + '}'
but even here I have some problems, since I have to add \n\n for the new lines. But as soon as I change c3 to c2.replace('}\n\ndata:{', ', ') I get Process finished with exit code 137 (interrupted by signal 9: SIGKILL). Coming from .NET I could handle this quite easily with a deserializer, and I am wondering if there is a similar way to deserialize the data.
I get the data via sseclient and would be able to store it as bytes instead of a string if that would help, just FYI.
Any suggestions?
Juggling with replaces is of course a convoluted path; the language has parsers for this kind of escaping built in. The simplest would be passing the string that contains the JSON through an eval call. But eval is seldom needed and should be avoided in most cases as "not elegant", if not outright unsafe (though being unsafe really only applies when you have no control over the input data, and even then ast.literal_eval instead of plain eval can mitigate that). Anyway, there are other problems with the format that would prevent eval from working outright: the missing quotes around the outermost data:, for example.
Rants aside, if your file content is actually:
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
It has two problems: "under-quoting" of the outermost data: and "over-escaping" of the inner data.
In an interactive Python session, using the "raw string" marker, I can input your example line exactly as it would be read from a file:
In [263]: a = r"""data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n"""
In [264]: print(a)
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
So, on to removing one level of backslashes: Python has a "unicode_escape" text encoding, but it only works on bytes objects. We therefore resort to the "latin1" encoding, as it provides a byte-for-byte conversion of the string in "a" to bytes, and then apply "unicode_escape" to remove the extra "\":
In [266]: b = a.encode("latin1").decode("unicode_escape")
In [267]: print(b, "\n", repr(b))
data:{"data":["Kendrick","Lamar"]}
data:{"data":["David","Bowie"]}
'data:{"data":["Kendrick","Lamar"]}\n\ndata:{"data":["David","Bowie"]}\n\n'
Now it is easy to parse: we split the resulting string at "\n\n" and get a list with one record (what you are calling a "dataset") per element. Then we use string manipulation to get rid of the leading "data:", and finally json.loads can work on the remaining part.
so:
import json

raw_data = open("mystrangefile.pseudo_json").read()
data = raw_data.encode("latin1").decode("unicode_escape")
records = [json.loads(record.split(":", 1)[-1])
           for record in data.split("\n\n") if record.strip()]
And "records" now should contain well behaved Python objects dictionaries, you can put in a database. (Unless Pandas can provide automatic mapping of the columns to a databas, it seems to be an uneeded step - a raw connection.executemany(""" INSERT ...""", records) with a proper open DB connection should suffice.
Also, as a side note, you mentioned that you could handle this easily with a .NET deserializer: that is only true if your files are not as broken as you have shown us; no standard deserializer could know how to handle such a specific data format out of the box. But if you actually are more proficient in another language/technology, you could write just a converter from the broken input to a properly encoded file and use that as an intermediate step.
I'm not completely sure if I understood the format in which you get the string correctly, so please correct me if I'm wrong here:
data_json = 'data:{\\"data\\":[\\"Kendrick\\",\\"Lamar\\"]}\\n\\ndata:{\\"data\\":[\\"David\\",\\"Bowie\\"]}\\n\\n'
Your first line seems to strip the first and last character, which I don't see. Are there any additional characters you are stripping away here?
The two following substring replacements seem to have no effect as the substrings are not present in the initial string (if I got it correctly in the first place).
And finally in the last line you are wrapping your result with { and } which is not correct for lists in json. It should be [...]
I can't really tell why you would get a SIGKILL here, though. It does not throw any errors for me, it just does not do what you want it to do. Maybe you're running out of memory with all the 500k examples?
However, this would be a working solution (again, given that I got the initial string correctly):
c1 = data_json.replace('\\n\\n', '')  # removing escaped newlines
c2 = c1.replace('data:', ',')         # replacing the extra 'data:' prefixes with the json delimiter ','
c3 = c2.replace('\\', '')             # removing artificial escapes
c4 = c3[1:]                           # removing the leading ',' introduced in c2
c5 = '[' + c4 + ']'                   # wrapping as a json list
Now you should be able to json.loads(c5) or whatever you need to do with that string.
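For example, assuming the input string really has the shape shown above:

import json

result = json.loads(c5)
print(result[0]["data"])   # ['Kendrick', 'Lamar']
print(len(result))         # 2 here, ~500000 for the full stream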
While going through LPTHW, I've set out to read the code here:
https://github.com/BrechtDeMan/secretsanta/blob/master/pairing.py
I've been trying to understand why the output CSV has double quotes. There are several questions here about this problem, but I'm not grokking it.
Where are the quotes getting introduced?
Edit: I wrote the author a couple of weeks back but haven't heard back.
Edit 2: An example of the output...
"Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"
Double quotes are introduced in the write_file function.
CSV files look simple on the surface, but sooner or later you will encounter more complex problems. The first one is: what should happen if the character denoting the delimiter occurs in field content? Because there is no real standard for the CSV format, different people had different ideas about the correct answer to this question.
Python's csv library tries to abstract this complexity and the various approaches, and to make it easier to read and write CSV files following different rules. This is done with Dialect class objects.
The author of the write_file function decided to construct the output row manually, by joining all fields and delimiter characters together, but then used the csv module to actually write the data into the file:
writer.writerow([givers_list[ind][1] + ',' + givers_list[ind][2]
+ ',' + givers_list[ind][3]
+ ',' + givers_list[rand_vec[ind]][1] + ','
+ givers_list[rand_vec[ind]][2] + ',' + givers_list[rand_vec[ind]][3]])
This inconsistent usage of the csv module resulted in the entire row of data being treated as a single field. Because that field contains characters used as field delimiters, Dialect.quoting decides how it should be handled. The default quoting configuration, csv.QUOTE_MINIMAL, says that such a field should be quoted using Dialect.quotechar, which defaults to the double quote character ("). That's why the entire row eventually ends up surrounded by double quote characters.
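A minimal sketch of that behaviour, using the same values as the example output above:

import csv, sys

writer = csv.writer(sys.stdout)   # default dialect: QUOTE_MINIMAL, quotechar '"'
writer.writerow(["Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"])   # one single field
# "Alice,101,alice#mail.org,Wendy,204,wendy#mail.org"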
A fast and easy, but not correct, solution would be to change the quoting algorithm to csv.QUOTE_NONE. This tells the writer object to never surround fields and instead escape special characters with Dialect.escapechar. According to the documentation, leaving it at None (the default) will raise an error. I guess that setting it to an empty string could do the job.
The correct solution is feeding writer.writerow the expected input data: a list of fields. This should do (untested):
writer.writerow([givers_list[ind][1], givers_list[ind][2],
givers_list[ind][3],
givers_list[rand_vec[ind]][1],
givers_list[rand_vec[ind]][2], givers_list[rand_vec[ind]][3]])
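A standalone sketch of what that produces, with hypothetical values standing in for the givers_list lookups:

import csv, sys

writer = csv.writer(sys.stdout)
writer.writerow(["Alice", "101", "alice#mail.org", "Wendy", "204", "wendy#mail.org"])
# Alice,101,alice#mail.org,Wendy,204,wendy#mail.org
# no surrounding quotes, because no individual field contains the delimiter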
In general, (double) quotes are needed when there is a separator character inside a field, and if there are quotes inside that field, they need to be 'escaped' with another quote.
Do you have an example of the output and the quotes you are talking about?
Edit (after example):
Ok, the whole row is treated as one field here. As Miroslaw Zalewski mentioned, those values should be treated as separate fields instead of one long string.
Could somebody tell me which character is a non-ASCII character in the following:
Columns(str) – comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword.
Essentially I am defining a class and trying to give it a comment to define how it works:
def test(self, uniprot_id):
    '''
    Same as the UniProt.search() method arguments:
    search(query, frmt='tab', columns=None, include=False, sort='score', compress=False, limit=None, offset=None, maxTrials=10)
    query (str) -- query must be a valid uniprot query. See http://www.uniprot.org/help/text-search, http://www.uniprot.org/help/query-fields See also example below
    frmt (str) -- a valid format amongst html, tab, xls, asta, gff, txt, xml, rdf, list, rss. If tab or xls, you can also provide the columns argument. (default is tab)
    include (bool) -- include isoform sequences when the frmt parameter is fasta. Include description when frmt is rdf.
    sort (str) -- by score by default. Set to None to bypass this behaviour
    compress (bool) -- gzip the results
    limit (int) -- Maximum number of results to retrieve.
    offset (int) -- Offset of the first result, typically used together with the limit parameter.
    maxTrials (int) -- this request is unstable, so we may want to try several time.
    Columns(str) -- comma-seperated list of values. Works only if format is tab or xls. For UnitprotKB, some possible columns are: id, entry name, length, organism. Some column names must be followed by a database name (i.e. ‘database(PDB)’). Again see uniprot website for more details. See also _valid_columns for the full list of column keyword.
    '''
    u = UniProt()
    uniprot_entry = u.search(uniprot_id)
    return uniprot_entry
Without line 52, i.e. the one beginning with 'Columns' in the docstring, this works as expected, but as soon as I describe what 'columns' is I get the following error:
SyntaxError: Non-ASCII character '\xe2' in file /home/cw00137/Documents/Python/Identify_gene.py on line 52, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
Does anybody know what is going on?
You are using 'fancy' curly quotes in that line:
>>> u'‘database(PDB)’'
u'\u2018database(PDB)\u2019'
That's a U+2018 LEFT SINGLE QUOTATION MARK at the start and U+2019 RIGHT SINGLE QUOTATION MARK at the end.
Use ASCII quotes (U+0027 APOSTROPHE or U+0022 QUOTATION MARK) or declare an encoding other than ASCII for your source.
You are also using a U+2013 EN DASH:
>>> u'Columns(str) –'
u'Columns(str) \u2013'
Replace that with a U+002D HYPHEN-MINUS.
All three characters encode to UTF-8 with a leading E2 byte:
>>> u'\u2013 \u2018 \u2019'.encode('utf8')
'\xe2\x80\x93 \xe2\x80\x98 \xe2\x80\x99'
which you then see reflected in the SyntaxError exception message.
You may want to avoid using these characters in the first place. It could be that your OS is replacing these as you type, or you are using a word processor instead of a plain text editor to write your code and it is replacing these for you. You probably want to switch that feature off.
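If you want to locate such characters yourself, a small sketch (not part of the original answer; works in Python 2 and 3 on a unicode string):

line = u'Columns(str) \u2013 list \u2018database(PDB)\u2019'
for i, ch in enumerate(line):
    if ord(ch) > 127:
        print("position %d: %s %r" % (i, hex(ord(ch)), ch))
# prints the index, code point and character for each non-ASCII character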
I previously encountered the same problem and the same error; Python 2 defaults to ASCII encoding for source files.
You can try declaring the following comment in the .py file's first or second line:
# -*- coding: utf-8 -*-