Loading multiple files with bonobo-etl - python

I'm new to bonobo-etl and I'm trying to write a job that loads multiple files at once, but I can't get the CsvReader to work with the @use_context_processor decorator. A snippet of my code:
def input_file(self, context):
    yield 'test1.csv'
    yield 'test2.csv'
    yield 'test3.csv'

@use_context_processor(input_file)
def extract(f):
    return bonobo.CsvReader(path=f, delimiter='|')
def load(*args):
    print(*args)

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(extract, load)
    return graph
When I run the job I get something like <bonobo.nodes.io.csv.CsvReader object at 0x7f849678dc88> rather than the lines of the CSV.
If I hardcode the reader like graph.add_chain(bonobo.CsvReader(path='test1.csv',delimiter='|'),load), it works.
Any help would be appreciated.
Thank you.

As bonobo.CsvReader does not (yet) support reading file names from the input stream, you need to use a custom reader for that.
Here is a solution that works for me on a set of csvs I have:
import bonobo
import bonobo.config
import bonobo.util
import glob
import csv

@bonobo.config.use_context
def read_multi_csv(context, name):
    with open(name) as f:
        reader = csv.reader(f, delimiter=';')
        headers = next(reader)
        if not context.output_type:
            context.set_output_fields(headers)
        for row in reader:
            yield tuple(row)

def get_graph(**options):
    graph = bonobo.Graph()
    graph.add_chain(
        glob.glob('prenoms_*.csv'),
        read_multi_csv,
        bonobo.PrettyPrinter(),
    )
    return graph

if __name__ == '__main__':
    with bonobo.parse_args() as options:
        bonobo.run(get_graph(**options))
A few comments on this snippet, in reading order:
The use_context decorator injects the node execution context into the transformation call, which lets us call .set_output_fields(...) using the headers of the first CSV.
The headers of the other CSVs are ignored; in my case they are all the same. You may need slightly more complex logic for your own case.
Then we just generate the filenames in a bonobo.Graph instance using glob.glob (in my case, the stream will contain: prenoms_2004.csv prenoms_2005.csv ... prenoms_2011.csv prenoms_2012.csv) and pass them to our custom reader, which will be called once for each file, open it, and yield its lines.
Hope that helps!

Related

How to implement a check for the presence of a file and access to a file?

I need to test my methods using the pytest library. I have written the following tests:
def test_read_data_from_file():  # check for a missing file (1)
    try:
        read_data_from_file('example.csv')
    except FileNotFoundError:
        pytest.fail('File not found.')

def test_roots():  # check for lack of access to the file (2)
    try:
        read_data_from_file('err.csv')
    except PermissionError:
        pytest.fail('The file is not readable.')
The method itself:
def read_data_from_file(filename):
    # we will return this list.
    rows_csv = []
    with open(filename) as csvfile:
        read_csv = csv.reader(csvfile)
        # for convenience when performing split, we separate the data in each line with a space.
        for row in read_csv:
            rows_csv.append(' '.join(row))
        rows_csv.pop(0)
    return rows_csv
How can I change the tests so that "ExceptionInfo" is used in their implementation?(with pytest.raises(FileNotFoundError) as exception_info:)
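For reference, a minimal sketch of how that pattern is typically used (the module name in the import is hypothetical; the file names come from the question, and the tests assume example.csv really is missing and err.csv really is unreadable):
import pytest

from mymodule import read_data_from_file  # hypothetical module name; import from wherever the method lives

def test_read_data_from_file_missing():  # (1) the file does not exist
    with pytest.raises(FileNotFoundError) as exception_info:
        read_data_from_file('example.csv')
    # exception_info is an ExceptionInfo object; .value is the raised exception
    assert 'example.csv' in str(exception_info.value)

def test_read_data_from_file_unreadable():  # (2) the file exists but is not readable
    with pytest.raises(PermissionError) as exception_info:
        read_data_from_file('err.csv')
    assert exception_info.type is PermissionError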

Python how to read from and write to different files using multiprocessing

I have several files, and I would like to read those files, filter some keywords, and write them into different files. I use Process(), but it turns out that it takes more time to run the readwrite function.
Do I need to separate the read and the write into two functions? How can I read multiple files at one time and write the keywords from different files to different CSVs?
Thank you very much.
def readwritevalue():
    for file in gettxtpath():  # gettxtpath will return a list of files
        file1 = file + ".csv"
        # Identify some variables
        # Read the file
        with open(file) as fp:
            for line in fp:
                # Process the data
                data1 = xxx
                data2 = xxx
                ....
        # Write it to different files
        with open(file1, "w") as fp1:
            print(data1, file=fp1)
            w = csv.writer(fp1)
            w.writerow(data2)
            ...

if __name__ == '__main__':
    p = Process(target=readwritevalue)
    t1 = time.time()
    p.start()
    p.join()
I want to edit my question. I have more functions that modify the CSVs generated by the readwritevalue() function.
So, if Pool.map() is fine, will it be OK to change all the remaining functions like this? However, it seems that it did not save much time.
def getFormated(file):  # Merge each csv with a well-defined formatted csv and generate a final report by writing all the csvs to one output csv
    csvMerge('Format.csv', file, file1)

getResult()

if __name__ == "__main__":
    pool = Pool(2)
    pool.map(readwritevalue, [file for file in gettxtpath()])
    pool.map(GetFormated, [file for file in getcsvName()])
    pool.map(Otherfunction, file_list)
    t1 = time.time()
    pool.close()
    pool.join()
You can extract the body of the for loop into its own function, create a multiprocessing.Pool object, then call pool.map() like so (I’ve used more descriptive names):
import csv
import multiprocessing

def read_and_write_single_file(stem):
    data = None
    with open(stem, "r") as f:
        ...  # populate data somehow
    csv_file = stem + ".csv"
    with open(csv_file, "w", encoding="utf-8") as f:
        w = csv.writer(f)
        for row in data:
            w.writerow(row)

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    result = pool.map(read_and_write_single_file, get_list_of_files())
See the linked documentation for how to control the number of workers, tasks per worker, etc.
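As an illustration of those parameters, here is a minimal sketch (the worker function and file list below are placeholders, not part of the question):
import multiprocessing

def process_one_file(path):
    # placeholder worker: read `path`, filter keywords, write the matching CSV
    return path

if __name__ == "__main__":
    # `processes` caps the number of worker processes;
    # `maxtasksperchild` recycles a worker after that many tasks.
    pool = multiprocessing.Pool(processes=4, maxtasksperchild=10)
    results = pool.map(process_one_file, ["a.txt", "b.txt", "c.txt"], chunksize=1)
    pool.close()
    pool.join()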
I may have found an answer myself. Not so sure if it is indeed a good answer, but the time is 6 times shorter than before.
def readwritevalue(file):
    with open(file, 'r', encoding='UTF-8') as fp:
        ...  # data processing
    file1 = file + ".csv"
    with open(file1, "w") as fp2:
        ...  # write data

if __name__ == "__main__":
    pool = Pool(processes=int(mp.cpu_count() * 0.7))
    pool.map(readwritevalue, [file for file in gettxtpath()])
    t1 = time.time()
    pool.close()
    pool.join()

writer.writerow does not work for writing multiple CSVs in a for loop

Please look at the pseudocode below:
def main():
    queries = ['A', 'B', 'C']
    for query in queries:
        filename = query + '.csv'
        writer = csv.writer(open(filename, 'wt', encoding='utf-8'))
        ...
        FUNCTION(query)

def FUNCTION(query):
    ...
    writer.writerow(XXX)
I'd like to write to multiple CSV files, so I use a for loop to generate the different file names and then write into each file from another function.
However, this is not working; the files end up empty.
If I get rid of main() or remove the for loop:
writer = csv.writer(open(filename, 'wt', encoding='utf-8'))
...
FUNCTION(query)

def FUNCTION(query):
    ...
    writer.writerow(XXX)
it works.
I don't know why. Is this related to the for loop or to main()?
A simple fix is to pass the file handle, not the name, to FUNCTION. Since the file has been opened in main, you don't need or want the name in the subroutine, just the file handle, so change the call to FUNCTION(writer) and the definition to
def FUNCTION(writer):
and use writer.writerow(xxx) wherever you need to stream output in the subroutine.
Note: you changed the name of the file pointer from writer to write in your example.
I think the possible reason is that you didn't close the file pointer. You can use a context manager like:
with open(filename, 'wt', encoding='utf-8') as f:
    writer = csv.writer(f)
    ...
    FUNCTION(query)
which will close the file for you automatically.
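Putting the two suggestions together, i.e. pass the writer into the function and let a with block close the file, a minimal sketch might look like this (the row content is a placeholder):
import csv

def FUNCTION(writer):
    # write through the handle created in main()
    writer.writerow(['placeholder', 'row'])

def main():
    queries = ['A', 'B', 'C']
    for query in queries:
        filename = query + '.csv'
        with open(filename, 'wt', encoding='utf-8') as f:
            writer = csv.writer(f)
            FUNCTION(writer)

if __name__ == '__main__':
    main()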

Load csv file in processing.py

I am trying to load a csv file in processing.py as a table. The Java environment allows me to use the loadTable() function; however, I'm unable to find an equivalent function in the Python environment.
The missing functionality could be added as follows:
import csv

class Row(object):
    def __init__(self, dict_row):
        self.dict_row = dict_row

    def getFloat(self, key):
        return float(self.dict_row[key])

    def getString(self, key):
        return self.dict_row[key]

class loadTable(object):
    def __init__(self, csv_filename, header):
        with open(csv_filename, "rb") as f_input:
            csv_input = csv.DictReader(f_input)
            self.data = [Row(row) for row in csv_input]

    def rows(self):
        return self.data
This reads the csv file into memory using Python's csv.DictReader class, which treats each row in the csv file as a dictionary. For each row, it creates an instance of a Row class, which then lets you retrieve entries in the format required. Currently I have only coded getFloat() and getString() (the latter being the default format for all csv values).
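For example, usage of the helper above might look like this (the file name and column names are just placeholders):
table = loadTable("positions.csv", "header")
for row in table.rows():
    x = row.getFloat("x")          # numeric column, converted to float
    label = row.getString("name")  # string column, returned as-is
    print(x, label)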
You could create an empty Table object with this:
from processing.data import Table
t = Table()
And then populate it as discussed at https://discourse.processing.org/t/creating-an-empty-table-object-in-python-mode-and-some-other-hidden-data-classes/25121
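For what it's worth, a rough sketch of populating it by hand, assuming the Java Table methods (addColumn, addRow, setString, setFloat) are exposed to Python mode as described in that thread:
from processing.data import Table

t = Table()
t.addColumn("name")
t.addColumn("hp")

row = t.addRow()
row.setString("name", "Bulbasaur")  # assumed column/value, for illustration only
row.setFloat("hp", 45)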
But I think a Python dict as proposed by @martin-evans would be nice. You load it like this:
import csv
from codecs import open  # optional, to have encoding="utf-8" in Python 2

with open("data/pokemon.csv", encoding="utf-8") as f:
    data = list(csv.DictReader(f))  # a list of dicts, with the column headers as keys

Transferring CSV data into different Functions in Python

I need some help. Basically, I have to create a function that reads a csv file, then transfer that data to another function that uses it to generate an XML file.
Here is my code:
import csv
from xml.etree.ElementTree import Element, SubElement, Comment, tostring
from xml.etree.ElementTree import ElementTree
import xml.etree.ElementTree as etree

def read_csv():
    with open('1250_12.csv', 'r') as data:
        reader = csv.reader(data)
        return reader

def generate_xml(reader):
    root = Element('Solution')
    root.set('version', '1.0')
    tree = ElementTree(root)
    head = SubElement(root, 'DrillHoles')
    head.set('total_holes', '238')
    description = SubElement(head, 'description')
    current_group = None
    i = 0
    for row in reader:
        if i > 0:
            x1, y1, z1, x2, y2, z2, cost = row
            if current_group is None or i != current_group.text:
                current_group = SubElement(description, 'hole', {'hole_id': "%s" % i})
                information = SubElement(current_group, 'hole', {'collar': ', '.join((x1, y1, z1)),
                                                                 'toe': ', '.join((x2, y2, z2)),
                                                                 'cost': cost})
        i += 1

def main():
    reader = read_csv()
    generate_xml(reader)

if __name__ == '__main__':
    main()
but I get an error when I try to pass reader; the error is: ValueError: I/O operation on closed file
Turning the reader into a list should work:
def read_csv():
    with open('1250_12.csv', 'r') as data:
        return list(csv.reader(data))
You tried to read from a closed file. list will trigger the reader to read the whole file.
The with statement tells Python to clean up the context manager (in this case, a file) once control exits its body. Since functions exit when they return, there's no way to get data out of the function while the file is still open.
Other answers suggest reading the whole thing into a list, and returning that; this works, but may be awkward if the file is very large.
Fortunately, we can use generators:
def read_csv():
    with open('1250_12.csv', 'r') as data:
        reader = csv.reader(data)
        for row in reader:
            yield row
Since we yield from inside the with, we don't have to clean up the file before getting some rows. Once the data is consumed, (or if the generator is itself cleaned up,) the file will be closed.
When you read a csv file, it is very important to put that data into a list. This is because you cannot perform most operations on the csv.reader object itself, and once you have looped through it to the end of the file, you can no longer do anything with it unless you open and read the file again. So let's just change your read_csv function:
def read_csv():
    with open('1250_12.csv', 'r') as data:
        reader = csv.reader(data)
        x = [row for row in reader]
        return x
Now you are manipulating a list and everything should work perfectly!
