Creating an RDF file using a CSV file as input - Python

I need to convert a CSV file to RDF with rdflib. I already have the code that reads the CSV, but I do not know how to convert it to RDF.
I have the following code:
import csv
from rdflib.graph import Graph

# Open the input file
with open('data.csv', 'rb') as fcsv:
    g = Graph()
    csvreader = csv.reader(fcsv)
    y = True
    for row in csvreader:
        if y:
            names = row
            y = False
        else:
            for i in range(len(row)):
                continue
    print(g.serialize(format='xml'))
fcsv.close()
Can someone explain and give me an example?

Example csv file
Courtesy of KRontheWeb, I use the following example CSV file to answer your question:
https://github.com/KRontheWeb/csv2rdf-tutorial/blob/master/example.csv
"Name";"Address";"Place";"Country";"Age";"Hobby";"Favourite Colour"
"John";"Dam 52";"Amsterdam";"The Netherlands";"32";"Fishing";"Blue"
"Jenny";"Leidseplein 2";"Amsterdam";"The Netherlands";"12";"Dancing";"Mauve"
"Jill";"52W Street 5";"Amsterdam";"United States of America";"28";"Carpentry";"Cyan"
"Jake";"12E Street 98";"Amsterdam";"United States of America";"42";"Ballet";"Purple"
Import Libraries
import pandas as pd #for handling csv and csv contents
from rdflib import Graph, Literal, RDF, URIRef, Namespace #basic RDF handling
from rdflib.namespace import FOAF , XSD #most common namespaces
import urllib.parse #for parsing strings to URI's
Read in the csv file
url='https://raw.githubusercontent.com/KRontheWeb/csv2rdf-tutorial/master/example.csv'
df=pd.read_csv(url,sep=";",quotechar='"')
# df # uncomment to check for contents
Define a graph 'g' and namespaces
g = Graph()
ppl = Namespace('http://example.org/people/')
loc = Namespace('http://mylocations.org/addresses/')
schema = Namespace('http://schema.org/')
Create the triples and add them to graph 'g'
It's a bit dense, but each g.add() consists of three parts: subject, predicate, object. For more info, check the really friendly rdflib documentation, section 1.1.3 onwards at https://buildmedia.readthedocs.org/media/pdf/rdflib/latest/rdflib.pdf
for index, row in df.iterrows():
    g.add((URIRef(ppl+row['Name']), RDF.type, FOAF.Person))
    g.add((URIRef(ppl+row['Name']), URIRef(schema+'name'), Literal(row['Name'], datatype=XSD.string)))
    g.add((URIRef(ppl+row['Name']), FOAF.age, Literal(row['Age'], datatype=XSD.integer)))
    g.add((URIRef(ppl+row['Name']), URIRef(schema+'address'), Literal(row['Address'], datatype=XSD.string)))
    g.add((URIRef(loc+urllib.parse.quote(row['Address'])), URIRef(schema+'name'), Literal(row['Address'], datatype=XSD.string)))
Note that:
I borrow namespaces from rdflib and create some myself;
It is good practice to define the datatype whenever you can;
I create URI's from the addresses (example of string handling).
Check the results
print(g.serialize(format='turtle').decode('UTF-8'))  # in rdflib 6+, serialize() already returns a str, so drop .decode('UTF-8')
A snippet of the output:
<http://example.org/people/Jake> a ns2:Person ;
    ns1:address "12E Street 98"^^xsd:string ;
    ns1:name "Jake"^^xsd:string ;
    ns2:age 42 .
Save the results to disk
g.serialize('mycsv2rdf.ttl',format='turtle')

There is "A commandline tool for semi-automatically converting CSV to RDF" in rdflib/rdflib/tools/csv2rdf.py
csv2rdf.py \
    -b <instance-base> \
    -p <property-base> \
    [-D <default>] \
    [-c <classname>] \
    [-i <identity column(s)>] \
    [-l <label columns>] \
    [-s <N>] [-o <output>] \
    [-f configfile] \
    [--col<N> <colspec>] \
    [--prop<N> <property>] \
    [-d <delim>] \
    [-C] [files...]
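For example, a hypothetical invocation for the example.csv above (the base URIs, class name, output file and delimiter are placeholders of my own choosing; see the usage above for what each option means):
csv2rdf.py \
    -b http://example.org/instances/ \
    -p http://example.org/props/ \
    -c Person \
    -d ';' \
    -o output.nt \
    example.csv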

Have a look at pyTARQL which has recently been added to the RDFlib family of tools. It is specifically for parsing and serializing CSV to RDF.

Related

csvwriter: comma separators create undesirable columns between the actual values

I am trying to create a CSV file that I can open with Excel from an API data extraction (I don't know how to include it here) by using csv.writer; however, for now the separators are treated as values and added as extra columns between the actual values.
My code looks like this:
import csv
resources_list = "resources_list.csv"
list_keys_res = []
list_keys_res = list(res['items'][0].keys())
list_items_res = []
list_items_res = list(res['items'])
resources = open(resources_list, 'w', newline='')
with resources:
    # identifying header
    writer = csv.writer(resources, delimiter=';')
    header_res = writer.writerow([list_keys_res[1]]+[";"]+[list_keys_res[0]]+[";"]+[list_keys_res[2]]+[";"]+[list_keys_res[4]])
    resource = None
    # loop to write the data into each row of the csv file - a row describes 1 resource
    for i in list_items_res:
        resource = list_items_res.index(i)
        list_values_res = []
        list_values_res = list(list_items_res[resource].values())
        writer.writerow([list_values_res[1]]+[";"]+[list_values_res[0]]+[';']+[list_values_res[2]]+[";"]+[list_values_res[4]])
        i = resource + 1
My goal is to have :
name ; id ; userName ; email
"Resource 1" ; 1 ; res1 ; myaddress1#host.com
"Resource 2" ; 2 ; res2 ; myaddress2#host.com
"Resource 3" ; 3 ; res3 ; myaddress3#host.com
...
And for now I have :
name;";";id;";";userName;";";email
Resource 1;";";1;";"; res1;";"; myaddress1#host.com
Resource 2;";";2;";"; res2;";"; myaddress2#host.com
Resource 3;";";3;";"; res3;";"; myaddress3#host.com
...
Here is how it looks in Excel for now.
The code works just fine; I just can't seem to get the format right. Thanks in advance, I hope this will help others as well!
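A minimal sketch of the usual fix, assuming res has the structure implied by the question: let csv.writer insert the ';' delimiter itself and pass only the values to writerow, without any manual ";" elements:
import csv

# `res` is assumed to be the API response dict used in the question
list_keys_res = list(res['items'][0].keys())

with open("resources_list.csv", 'w', newline='') as resources:
    writer = csv.writer(resources, delimiter=';')
    # header row: just the column names, the writer adds the ';' between them
    writer.writerow([list_keys_res[1], list_keys_res[0], list_keys_res[2], list_keys_res[4]])
    # one row per resource
    for item in res['items']:
        values = list(item.values())
        writer.writerow([values[1], values[0], values[2], values[4]])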

Split large CSV file based on row value

The problem
I have a csv file called data.csv. On each row I have:
timestamp: int
account_id: int
data: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
This file is ordered by timestamp.
The number of rows is too big to store all rows in memory.
Order of magnitude: 100 M rows, number of accounts: 5 M.
How can I quickly get all rows of a given account_id ? What would be the best way to make the data accessible by account_id ?
Things I tried
To generate a sample:
import os
import random
import shutil

import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
    csv_file.write('timestamp,account_id,value\n')
    for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
        account_id = random.randint(1, N_ACCOUNT)
        data = random.random()
        csv_file.write(f'{timestamp},{account_id},{data}\n')

# Clean result folder
if os.path.isdir('./result'):
    shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates a file for each account: read the rows one by one from the original CSV and write each row to the file that corresponds to its account (opening and closing a file for each row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file_path = f'result/{account_id}.csv'
        file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
        with open(account_file_path, file_opening_mode) as account_file:
            account_file.write(row)
        p_bar.update(1)
Issues:
It is quite slow (I think it is inefficient to open and close a file for each row); it takes around 4 minutes for 1 M rows. Even if it works, will it be fast? Given an account_id I know the name of the file I should read, but the system has to look through 5 M files to find it. Should I create some kind of binary tree of folders with the leaves being the files? (A sketch of that idea follows.)
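A minimal sketch of that idea, assuming numeric account ids: bucket the per-account files into subfolders keyed by a suffix of the id, so that no single directory holds millions of entries (the two-level layout and the two-digit bucket are my own choices, not something from the question):
import os

def account_path(account_id: str, root: str = './result') -> str:
    # e.g. account 1000042 -> ./result/42/1000042.csv
    bucket = account_id.zfill(2)[-2:]
    folder = os.path.join(root, bucket)
    os.makedirs(folder, exist_ok=True)
    return os.path.join(folder, f'{account_id}.csv')

# usage inside the split loop:
# with open(account_path(account_id), 'a') as account_file:
#     account_file.write(row)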
Solution 2 (works on the small example, not on the large CSV file)
Same idea as solution 1, but instead of opening / closing a file for each row, keep the open files in a dictionary.
Code:
# A dict that will contain all open files
account_file_dict = {}

# Given an account id, return the file to write to (creating a new file if it does not exist)
def get_account_file(account_id):
    file = account_file_dict.get(account_id, None)
    if file is None:
        file = open(f'./result/{account_id}.csv', 'w')
        account_file_dict[account_id] = file
        file.__enter__()
    return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file = get_account_file(account_id)
        account_file.write(row)
        p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to open 5 M files simultaneously (one per account), and I get the error OSError: [Errno 24] Too many open files: './result/33725.csv'. (A sketch of a workaround follows.)
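A minimal sketch of one way around the open-file limit, keeping the same result/ layout: hold only a bounded number of handles open at once and close the least recently used one when the cap is reached (the cap of 1000 is my own choice; it must stay below your OS limit, see ulimit -n):
from collections import OrderedDict

MAX_OPEN_FILES = 1000
open_files = OrderedDict()  # account_id -> file handle, in least-recently-used order

def get_account_file(account_id):
    file = open_files.get(account_id)
    if file is not None:
        open_files.move_to_end(account_id)  # mark as recently used
        return file
    if len(open_files) >= MAX_OPEN_FILES:
        _, oldest = open_files.popitem(last=False)  # evict the least recently used handle
        oldest.close()
    # append mode so a re-opened file keeps the rows written before it was evicted
    file = open(f'./result/{account_id}.csv', 'a')
    open_files[account_id] = file
    return file

# after the split loop, close whatever is still open:
# for f in open_files.values():
#     f.close()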
Solution 3 (works on the small example, not on the large CSV file)
Use the awk command; solution from: split large csv text file based on column value
Code (run after generating the file):
awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (28229 is an example; it usually fails around 28 k). I assume it is also because I am opening too many files.
@VinceM:
While not quite 15 GB, I do have a 7.6 GB one with 3 columns:
-- 148 mn prime numbers, their base-2 log, and their hex

in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin

f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
    ( gawk -v ___="${__}" -Mbe 'BEGIN {
        ___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
      } ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
  gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"

out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]
( for __ in '12' '34' '56' '78' '9'; do; ( gawk -v ___="${__}" -Mbe "${f}" &)
0.36s user 3 out9: 13.2GiB 0:02:06 [ 106MiB/s] [ 106MiB/s]

Using only 5 instances of gawk with the big-integer package GNU GMP, each with a designated subset of leading digit(s) of the prime number, it managed to calculate the full-precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.
If it can square that quickly, then merely grouping by account_id should be a walk in the park.
Have a look at https://docs.python.org/3/library/sqlite3.html
You could import the data, create the required indexes and then run queries normally. There are no dependencies except for Python itself.
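A minimal sketch of that approach (the table name, database file and the queried account id are my own placeholders; the one-time import will take a while for 100 M rows):
import csv
import sqlite3

con = sqlite3.connect('data.db')
cur = con.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS rows (timestamp INTEGER, account_id INTEGER, value REAL)')

# one-time import of the CSV
with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip header
    cur.executemany('INSERT INTO rows VALUES (?, ?, ?)', reader)
con.commit()

# an index on account_id makes the lookups fast
cur.execute('CREATE INDEX IF NOT EXISTS idx_account ON rows (account_id)')
con.commit()

# all rows of a given account
rows = cur.execute(
    'SELECT timestamp, account_id, value FROM rows WHERE account_id = ?', (42,)
).fetchall()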
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
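If polars is an option, a lazy scan with a filter only materializes the matching rows (a sketch using the column names from the question):
import polars as pl

matching = (
    pl.scan_csv("data.csv")              # lazy: nothing is read yet
    .filter(pl.col("account_id") == 1)   # pushed down into the CSV scan
    .collect()                           # reads the file, keeps only matching rows
)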
If you have to query the raw data every time and you are limited to plain Python only, then you can either write code that reads it manually and yields matched rows, or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)
However, this won't be faster than reading the 100 M row CSV file with csv.reader.
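For reference, a sketch of the manual approach mentioned above, which the convtools pipeline is roughly equivalent to:
import csv

def rows_for_account(path, account_id):
    """Yield the rows (as dicts) whose account_id matches; one full pass over the file."""
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            if row['account_id'] == str(account_id):
                yield row

matched = list(rows_for_account('data.csv', 1))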

Parsing XML output of BLAST results after using Biopython

I have a FASTA file (test.fasta) which contains many sequences which I aligned with BLASTN using Biopython.
import Bio
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

sequence_data = open("/Users/Desktop/test.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", sequence_data)

with open('results.xml', 'w') as save_file:
    blast_results = result_handle.read()
    save_file.write(blast_results)
The alignments were saved as an XML file.
Now, I would like to parse the XML output in order to get the list of all the species found to have a match, and possibly keep only specific species:
Example xml:
<Hit_num>1</Hit_num>
<Hit_id>gi|2020514704|emb|FR989945.1|</Hit_id>
<Hit_def>Plebejus argus genome assembly, chromosome: 19</Hit_def>
<Hit_accession>FR989945</Hit_accession>
<Hit_len>13381465</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>44.5672</Hsp_bit-score>
<Hsp_score>48</Hsp_score>
<Hsp_evalue>1.07773</Hsp_evalue>
<Hsp_query-from>65</Hsp_query-from>
<Hsp_query-to>99</Hsp_query-to>
<Hsp_hit-from>12008397</Hsp_hit-from>
<Hsp_hit-to>12008366</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>-1</Hsp_hit-frame>
<Hsp_identity>31</Hsp_identity>
<Hsp_positive>31</Hsp_positive>
<Hsp_gaps>3</Hsp_gaps>
<Hsp_align-len>35</Hsp_align-len>
<Hsp_qseq>ACTATCTTTTATTTAGATTAGGTTCAGTATCCCTC</Hsp_qseq>
<Hsp_hseq>ACTATGTTTTATTT---TTAGGTTCAGTATCCCTC</Hsp_hseq>
<Hit_num>2</Hit_num>
<Hit_id>gi|1812775970|gb|CP048843.1|</Hit_id>
<Hit_def>Crassostrea gigas strain QD chromosome 5</Hit_def>
<Hit_accession>CP048843</Hit_accession>
<Hit_len>60957391</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_bit-score>42.7638</Hsp_bit-score>
<Hsp_score>46</Hsp_score>
<Hsp_evalue>3.76165</Hsp_evalue>
<Hsp_query-from>63</Hsp_query-from>
<Hsp_query-to>95</Hsp_query-to>
<Hsp_hit-from>42721025</Hsp_hit-from>
<Hsp_hit-to>42720993</Hsp_hit-to>
<Hsp_query-frame>1</Hsp_query-frame>
<Hsp_hit-frame>-1</Hsp_hit-frame>
<Hsp_identity>29</Hsp_identity>
<Hsp_positive>29</Hsp_positive>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>33</Hsp_align-len>
<Hsp_qseq>ATACTATCTTTTATTTAGATTAGGTTCAGTATC</Hsp_qseq>
<Hsp_hseq>ATACTGTATTTTGTTTAGATTAGGTTCAGTTTC</Hsp_hseq>
Expected output would be in this case:
Plebejus argus genome assembly, chromosome: 19
Crassostrea gigas strain QD chromosome 5
In addition, I would like to keep, for example, the matches where "Hit_def" contains Homo sapiens, but I have not figured that out yet.
I have written something like this so far:
results_handle = open('results.xml')
for record in NCBIXML.parse(results_handle):
    for alignment in record.alignments:
        for hit in alignment.hits:
            print(hit_def)
However, I keep getting some errors:
ValueError: I/O operation on closed file.
or
ValueError: More than one record found in handle
Any advice?
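A minimal sketch of how this is usually done with NCBIXML, assuming results.xml is the file saved above: each alignment object already carries the hit description as hit_def (there is no hits attribute to loop over), and the species filter is just a substring check (the "Homo sapiens" filter is the example from the question):
from Bio.Blast import NCBIXML

with open('results.xml') as results_handle:
    for record in NCBIXML.parse(results_handle):
        for alignment in record.alignments:
            print(alignment.hit_def)  # e.g. "Plebejus argus genome assembly, chromosome: 19"
            if "Homo sapiens" in alignment.hit_def:
                print("kept:", alignment.hit_def)

Note that NCBIXML.parse handles output with several query records, whereas NCBIXML.read raises "More than one record found in handle" in that case.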

How to solve a problem decoding from a malformed JSON format

Hello everyone. I need help opening and reading the file.
Got this txt file - https://yadi.sk/i/1TH7_SYfLss0JQ
It is a dictionary
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}
But it was written using json into txt file.
#This is how I dump the data into a txt
json.dump(after,open(os.path.join(os.getcwd(), 'before_log.txt'), 'a'))
So, the file structure is
{"id0":"url0", "id1":"url1", ..., "idn":"urln"}{"id2":"url2", "id3":"url3", ..., "id4":"url4"}{"id5":"url5", "id6":"url6", ..., "id7":"url7"}
And it is all a string....
I need to open it, check for repeated IDs, delete them and save it again.
But I am getting: json.loads shows ValueError: Extra data
I tried these:
How to read line-delimited JSON from large file (line by line)
Python json.loads shows ValueError: Extra data
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 190)
But I am still getting that error, just in a different place.
Right now I got as far as:
with open('111111111.txt', 'r') as log:
    before_log = log.read()
before_log = before_log.replace('}{', ', ').split(', ')
mu_dic = []
for i in before_log:
    mu_dic.append(i)
This eliminates the problem of several {}{}{} dictionaries/JSONs in a row.
Maybe there is a better way to do this?
Your file size is 9.5 MB, so it'll take you a while to open and debug it manually.
So, using the head and tail tools (normally found in any GNU/Linux distribution), you'll see that:
# You can use Python as well to read chunks from your file
# and see the nature of it and what it's causing a decode problem
# but i prefer head & tail because they're ready to be used :-D
$> head -c 217 111111111.txt
{"1933252590737725178": "https://instagram.fiev2-1.fna.fbcdn.net/vp/094927bbfd432db6101521c180221485/5CC0EBDD/t51.2885-15/e35/46950935_320097112159700_7380137222718265154_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net",
$> tail -c 219 111111111.txt
, "1752899319051523723": "https://instagram.fiev2-1.fna.fbcdn.net/vp/a3f28e0a82a8772c6c64d4b0f264496a/5CCB7236/t51.2885-15/e35/30084016_2051123655168027_7324093741436764160_n.jpg?_nc_ht=instagram.fiev2-1.fna.fbcdn.net"}
$> head -c 294879 111111111.txt | tail -c 12
net"}{"19332
So the first guess is that your file is a malformed series of JSON data, and the best approach is to separate each }{ with a \n for further manipulation.
So, here is an example of how you can solve your problem using Python:
import json

input_file = '111111111.txt'
output_file = 'new_file.txt'

data = ''
with open(input_file, mode='r', encoding='utf8') as f_file:
    # this with statement part can be replaced by
    # using sed under your OS like this example:
    # sed -i 's/}{/}\n{/g' 111111111.txt
    data = f_file.read()
    data = data.replace('}{', '}\n{')

seen, total_keys, to_write = set(), 0, {}

# split the lines of the in memory data
for elm in data.split('\n'):
    # convert the line to a valid Python dict
    converted = json.loads(elm)
    # loop over the keys
    for key, value in converted.items():
        total_keys += 1
        # if the key is not seen then add it for further manipulations
        # else ignore it
        if key not in seen:
            seen.add(key)
            to_write.update({key: value})

# write the dict's keys & values into a new file as a JSON format
with open(output_file, mode='a+', encoding='utf8') as out_file:
    out_file.write(json.dumps(to_write) + '\n')

print(
    'found duplicated key(s): {seen} from {total}'.format(
        seen=total_keys - len(seen),
        total=total_keys
    )
)
Output:
found duplicated key(s): 43836 from 45367
And finally, the output file will be a valid JSON file and the duplicated keys will be removed with their values.
The basic difference between the file structure and actual JSON format is the missing commas, and that the objects are not enclosed within [ ]. So the same can be achieved with the code snippet below:
import json

with open('json_file.txt') as f:
    # Read complete file
    a = f.read()

# Convert into single line string
b = ''.join(a.splitlines())
# Add , after each object
b = b.replace("}", "},")
# Add opening and closing brackets and ignore last comma added in prev step
b = '[' + b[:-1] + ']'
x = json.loads(b)

Python - formatting output with varying numbers of spaces

I'm new to Python and am struggling to find a way to format output like the below into CSV.
The following code runs an expect script that yields columns separated by varying numbers of spaces.
import subprocess
out = subprocess.check_output([get_script, "|", "grep Up"], shell=True)
print out
1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356
What I want is to remove all whitespace and add a "," separator.
1501,4122:1501,Mesh,1.2.3.4,Up,262075,261927
1502,4121:1502,Mesh,1.2.3.5,Up,262089,261552
1502,4122:1502,Spok,1.2.3.6,Up,262074,261784
701000703,4121:701000703,Mesh,1.2.3.7,Up,262081,261356
I can achieve this via awk with awk -v OFS=',' '{$1=$1};1' but am struggling to find a python equivalent.
Any guidance is most appreciated!
Split each line on space and then join the result with a comma
# The commented out step is needed if out is not a list of lines already
# out = out.strip().split('\n')
for line in out:
    print ','.join(line.split())
You can convert out as follows:
import csv
import StringIO
out = """1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356"""
csv_input = csv.reader(StringIO.StringIO(out), delimiter=' ', skipinitialspace=True)
with open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(csv_input)
Giving you output.csv file containing:
1501,4122:1501,Mesh,1.2.3.4,Up,262075,261927
1502,4121:1502,Mesh,1.2.3.5,Up,262089,261552
1502,4122:1502,Spok,1.2.3.6,Up,262074,261784
701000703,4121:701000703,Mesh,1.2.3.7,Up,262081,261356
StringIO is used to make your out string appear as a file object for the csv module to use.
Tested using Python 2.7.9
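For what it's worth, a sketch of the same approach on Python 3, where StringIO lives in the io module and the output file is opened in text mode with newline='':
import csv
import io

out = """1501  4122:1501  Mesh  1.2.3.4  Up  262075  261927
1502  4121:1502  Mesh  1.2.3.5  Up  262089  261552"""

csv_input = csv.reader(io.StringIO(out), delimiter=' ', skipinitialspace=True)

with open('output.csv', 'w', newline='') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerows(csv_input)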
This should be possible by using the sub function in the re (regex) library.
import re

table_str = """
1501 4122:1501 Mesh 1.2.3.4 Up 262075 261927
1502 4121:1502 Mesh 1.2.3.5 Up 262089 261552
1502 4122:1502 Spok 1.2.3.6 Up 262074 261784
701000703 4121:701000703 Mesh 1.2.3.7 Up 262081 261356
"""

# re.sub returns a new string; it does not modify table_str in place
csv_str = re.sub(" +", ",", table_str)
print(csv_str)
The " +" pattern in the sub call greedily matches each run of one or more spaces and replaces it with a single comma.
