Misformatted Quarto R magic chunk output

Consider the following Quarto document:
---
title: "Untitled"
format: pdf
jupyter: python3
---
```{python}
#|echo: false
%load_ext rpy2.ipython
```
```{python}
#| result: asis
%%R
x <- c(1, 2)
names(x) <- c('a', 'b')
print(x)
```
The output is misformatted. Could someone please help me fix that?


How to rewrite my append line in Python 3 code?

My code
token = open('out.txt', 'r')
linestoken = token.readlines()
tokens_column_number = 1
r = []
for x in linestoken:
    r.append(x.split()[tokens_column_number])
token.close()
print(r)
Output
'"tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz",'
Desired output
"tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz"
How do I get rid of the '"' and ',' characters?
It would be nice to see your input data. I have created an input file which is, I hope, similar to yours.
My test file content:
example1 "tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz", aaaa
example2 "tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz", bbbb
You need to remove the '"' and ',' characters.
You can do it with replace (https://docs.python.org/3/library/stdtypes.html#str.replace):
r.append(x.split()[tokens_column_number].replace('"', "").replace(",", ""))
You can do it with strip (https://docs.python.org/3/library/stdtypes.html#str.strip):
r.append(x.split()[tokens_column_number].strip('",'))
You can do it with re.sub (https://docs.python.org/3/library/re.html#re.sub):
import re
...
for x in linestoken:
    x = re.sub('[",]', "", x.split()[tokens_column_number])
    r.append(x)
...
Output (the same for all three approaches):
>>> python3 test.py
['tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz', 'tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz']
As you can see above, the output (r) is a list; if you want the result as a single string, use join (https://docs.python.org/3/library/stdtypes.html#str.join).
Output with print(",".join(r)):
>>> python3 test.py
tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz,tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz
Output with print("\n".join(r)):
>>> python3 test.py
tick2/tick_calculated_2_2020-05-27T11-59-06.json.gz
tick2/tick_calculated_2_2020-05-27T11-59-07.json.gz
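Putting the pieces together, a minimal self-contained version (using strip, plus a with block so the file is closed automatically; the same out.txt layout is assumed) could look like this:
# Take column 1 of each line and strip the surrounding '"' and ',' characters
with open('out.txt', 'r') as token:
    r = [line.split()[1].strip('",') for line in token]
print("\n".join(r))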

How to delimit string based on regular expression using Python

I have a Python string with the data below.
--- Data-['tag']-['cli'] command ---> show date:
Current time: 2020-03-12 11:36:37 PDT
--- Data-['tag']-['shell'] command ---> show version:
OS Kernel 64-bit
[builder_stable]
--- Data-['tag']-['cli'] command ---> show host:
Model: New
I want to split the above string on any line that starts with "--- Data" and ends with ":", irrespective of the contents between "--- Data" and the ":" character.
My python code is shown below.
array = data.split("--- Data")
for word in array:
    print(word)
I want the delimited chunks returned in order, with the delimiter included as well.
For e.g.
First split result should be like:
--- Data-['tag']-['cli'] command ---> show date:
Current time: 2020-03-12 11:36:37 PDT
Second split result be like:
--- Data-['tag']-['shell'] command ---> show version:
OS Kernel 64-bit
[builder_stable]
And so on. Any help?
You can use re.findall with a pattern that looks for the delimiter pattern and then lazily matches any characters until the next delimiter pattern or the end of the string:
import re
s = '''--- Data-['tag']-['cli'] command ---> show date:
Current time: 2020-03-12 11:36:37 PDT
--- Data-['tag']-['shell'] command ---> show version:
OS Kernel 64-bit
[builder_stable]
--- Data-['tag']-['cli'] command ---> show host:
Model: New'''
delimiter = r'--- Data[^\n]*?:'
print(re.findall(r'{0}.*?(?={0}|$)'.format(delimiter), s, re.S))
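If I am reading the pattern right, this should print the three chunks for the sample string above (Python's repr of the resulting list, with newlines shown as \n):
["--- Data-['tag']-['cli'] command ---> show date:\nCurrent time: 2020-03-12 11:36:37 PDT\n", "--- Data-['tag']-['shell'] command ---> show version:\nOS Kernel 64-bit\n[builder_stable]\n", "--- Data-['tag']-['cli'] command ---> show host:\nModel: New"]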
Another solution:
import re
s = '''--- Data-['tag']-['cli'] command ---> show date:
Current time: 2020-03-12 11:36:37 PDT
--- Data-['tag']-['shell'] command ---> show version:
OS Kernel 64-bit
[builder_stable]
--- Data-['tag']-['cli'] command ---> show host:
Model: New'''
split_start = "--- Data"
l = re.split(split_start, s)
curr_split = [split_start+cs for cs in l if cs != ""]
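A variation on the same idea, as a sketch (it needs Python 3.7+, which allows splitting on a zero-width pattern): splitting on a lookahead keeps each "--- Data" header attached to its chunk, so there is no need to prepend split_start afterwards. Using the same sample string s:
import re

# The lookahead is zero-width, so each '--- Data' header stays at the start of its chunk
chunks = [c for c in re.split(r'(?=--- Data)', s) if c]
for chunk in chunks:
    print(chunk)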

How to convert the Record Separator character into a line break

Hello, I'm using PySpark for this purpose.
I have a txt file that contains this information:
c-234r4|Julio|38|Madrida-533r2|Ana|32|Madrida-543r4|Sonia|33|Bilbaob-654r4|Jorge|23|Barcelona
As you can see, all the records are concatenated using the Record Separator character (U+001E).
I'm trying the following, but without results:
df = spark.read.load("s3://my-bucket/txt_file/data.txt", format="csv", sep="|", inferSchema="true", encoding="UTF-8", escape='U+001E')
df.show(10, False)
Error:
Py4JJavaError: An error occurred while calling o496.load.
: java.lang.RuntimeException: escape cannot be more than one character
at org.apache.spark.sql.execution.datasources.csv.CSVOptions.getChar(CSVOptions.scala:52)
The final result must look like this:
+-------+-----+---+--------------+
|_c0 |_c1 |_c2|_c3 |
+-------+-----+---+--------------+
|c-234r4|Julio|38 |Madrid |
|a-533r2|Ana |32 |Madrid |
|a-543r4|Sonia|33 |Bilbao |
|b-654r4|Jorge|23 |Barcelona |
+-------+-----+---+--------------+
Options tested:
option 1 --> this is totally wrong
option 2 --> this shows rows as columns, which is also wrong
Can somebody give me advice? I need an idea to resolve this in my current role.
I'd appreciate it, thanks.
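One idea, sketched here rather than tested: the error says escape must be a single character, and 'U+001E' is a six-character string. Assuming the records really are joined by the ASCII Record Separator, you could try passing the actual control character as the record delimiter via the lineSep option (available for the CSV reader in Spark 3.0+), instead of as an escape:
# Sketch, untested: treat the U+001E control character as the line delimiter.
# lineSep for the csv reader is a Spark 3.0+ option; '\u001e' is the real
# Record Separator character, not the literal text 'U+001E'.
df = (spark.read
      .option("sep", "|")
      .option("lineSep", "\u001e")
      .option("inferSchema", "true")
      .option("encoding", "UTF-8")
      .csv("s3://my-bucket/txt_file/data.txt"))
df.show(10, False)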

Is there a way to read a multi-line csv file in Apache Beam using the ReadFromText transform (Python)?

Is there a way to read a multi-line CSV file using the ReadFromText transform in Python? I have a file in which one record spans two lines, and I am trying to make Apache Beam read the input as one element, but I cannot get it to work.
import apache_beam

def print_each_line(line):
    print(line)

path = './input/testfile.csv'
# Here are the contents of testfile.csv
# foo,bar,"blah blah
# more blah blah",baz
p = apache_beam.Pipeline()
(p
 | 'ReadFromFile' >> apache_beam.io.ReadFromText(path)
 | 'PrintEachLine' >> apache_beam.FlatMap(lambda line: print_each_line(line)))
p.run()
# Here is the output:
# foo,bar,"blah blah
# more blah blah",baz
The above code parses the input as two lines even though the standard for multi-line csv files is to wrap multi-line elements within double-quotes.
Beam doesn't support parsing CSV files. You can however use Python's csv.reader. Here's an example:
import apache_beam
import csv

def print_each_line(line):
    print(line)

p = apache_beam.Pipeline()
(p
 | apache_beam.Create(["test.csv"])
 | apache_beam.FlatMap(lambda filename:
       csv.reader(apache_beam.io.filesystems.FileSystems.open(filename)))
 | apache_beam.FlatMap(print_each_line))
p.run()
Output:
['foo', 'bar', 'blah blah\nmore blah blah', 'baz']
None of the answers worked for me, but this did:
(
    p
    | beam.Create(['data/test.csv'])
    | beam.FlatMap(lambda filename:
          csv.reader(io.TextIOWrapper(beam.io.filesystems.FileSystems.open(known_args.input))))
    | "Take only name" >> beam.Map(lambda x: x[0])
    | WriteToText(known_args.output)
)
ReadFromText parses a text file as newline-delimited elements. So ReadFromText treats two lines as two elements. If you would like to have the contents of the file as a single element, you could do the following:
contents = []
contents.append(open(path).read())
p = apache_beam.Pipeline()
p | apache_beam.Create(contents)
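For completeness: newer Beam releases also ship a pandas-backed DataFrame API whose read_csv understands quoted multi-line fields. A minimal sketch, assuming a Beam version that includes apache_beam.dataframe:
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

# read_csv delegates to pandas, so quoted multi-line fields are parsed as one record
with beam.Pipeline() as p:
    rows = p | read_csv('./input/testfile.csv')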

Creating RDF file using csv file as input

I need to convert a CSV file to RDF with rdflib. I already have the code that reads the CSV, but I do not know how to convert it to RDF.
I have the following code:
import csv
from rdflib.graph import Graph

# Open the input file (text mode: csv.reader in Python 3 expects str, not bytes)
with open('data.csv', 'r', newline='') as fcsv:
    g = Graph()
    csvreader = csv.reader(fcsv)
    y = True
    for row in csvreader:
        if y:
            names = row
            y = False
        else:
            for i in range(len(row)):
                continue
print(g.serialize(format='xml'))
Can someone explain and give me an example?
Example csv file
Courtesy of KRontheWeb, I use the following example CSV file to answer your question:
https://github.com/KRontheWeb/csv2rdf-tutorial/blob/master/example.csv
"Name";"Address";"Place";"Country";"Age";"Hobby";"Favourite Colour"
"John";"Dam 52";"Amsterdam";"The Netherlands";"32";"Fishing";"Blue"
"Jenny";"Leidseplein 2";"Amsterdam";"The Netherlands";"12";"Dancing";"Mauve"
"Jill";"52W Street 5";"Amsterdam";"United States of America";"28";"Carpentry";"Cyan"
"Jake";"12E Street 98";"Amsterdam";"United States of America";"42";"Ballet";"Purple"
Import Libraries
import pandas as pd #for handling csv and csv contents
from rdflib import Graph, Literal, RDF, URIRef, Namespace #basic RDF handling
from rdflib.namespace import FOAF , XSD #most common namespaces
import urllib.parse #for parsing strings to URI's
Read in the csv file
url='https://raw.githubusercontent.com/KRontheWeb/csv2rdf-tutorial/master/example.csv'
df=pd.read_csv(url,sep=";",quotechar='"')
# df # uncomment to check for contents
Define a graph 'g' and namespaces
g = Graph()
ppl = Namespace('http://example.org/people/')
loc = Namespace('http://mylocations.org/addresses/')
schema = Namespace('http://schema.org/')
Create the triples and add them to graph 'g'
It's a bit dense, but each g.add() consists of three parts: subject, predicate, object. For more info, check the really friendly rdflib documentation, section 1.1.3 onwards at https://buildmedia.readthedocs.org/media/pdf/rdflib/latest/rdflib.pdf
for index, row in df.iterrows():
    g.add((URIRef(ppl+row['Name']), RDF.type, FOAF.Person))
    g.add((URIRef(ppl+row['Name']), URIRef(schema+'name'), Literal(row['Name'], datatype=XSD.string)))
    g.add((URIRef(ppl+row['Name']), FOAF.age, Literal(row['Age'], datatype=XSD.integer)))
    g.add((URIRef(ppl+row['Name']), URIRef(schema+'address'), Literal(row['Address'], datatype=XSD.string)))
    g.add((URIRef(loc+urllib.parse.quote(row['Address'])), URIRef(schema+'name'), Literal(row['Address'], datatype=XSD.string)))
Note that:
I borrow namespaces from rdflib and create some myself;
It is good practice to define the datatype whenever you can;
I create URIs from the addresses (an example of string handling).
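As an optional extra: rdflib auto-generates prefixes such as ns1 and ns2 in the Turtle output (as in the snippet below); if you prefer readable prefixes, bind the namespaces to the graph before serializing:
# Optional: register human-readable prefixes for the Turtle serialization
g.bind('foaf', FOAF)
g.bind('schema', schema)
g.bind('loc', loc)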
Check the results
print(g.serialize(format='turtle').decode('UTF-8'))
A snippet of the output:
<http://example.org/people/Jake> a ns2:Person ;
    ns1:address "12E Street 98"^^xsd:string ;
    ns1:name "Jake"^^xsd:string ;
    ns2:age 42 .
Save the results to disk
g.serialize('mycsv2rdf.ttl',format='turtle')
There is "A commandline tool for semi-automatically converting CSV to RDF" in rdflib/rdflib/tools/csv2rdf.py
csv2rdf.py \
-b <instance-base> \
-p <property-base> \
[-D <default>] \
[-c <classname>] \
[-i <identity column(s)>] \
[-l <label columns>] \
[-s <N>] [-o <output>] \
[-f configfile] \
[--col<N> <colspec>] \
[--prop<N> <property>] \
[-d <delim>] \
[-C] [files...]
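For instance, a hypothetical invocation for the example.csv above might look like this (the base URIs and output file name are invented for illustration; -d passes the semicolon delimiter the sample file uses):
csv2rdf.py -b http://example.org/people/ -p http://example.org/props/ \
    -d ';' -o example.ttl example.csv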
Have a look at pyTARQL which has recently been added to the RDFlib family of tools. It is specifically for parsing and serializing CSV to RDF.
