Is there a way to generate a file on HDFS directly?
I want to avoid generating a local file and then copying it to HDFS with a command like:
hdfs dfs -put - "file_name.csv"
Or is there a Python library for this?
Have you tried HdfsCli?
To quote its Reading and writing files documentation:
# Loading a file in memory.
with client.read('features') as reader:
    features = reader.read()

# Directly deserializing a JSON object.
with client.read('model.json', encoding='utf-8') as reader:
    from json import load
    model = load(reader)
The write method is extremely slow when I use hdfscli. Is there any way to speed it up?
with client.write(conf.hdfs_location + '/' + conf.filename, encoding='utf-8', buffersize=10000000) as f:
    writer = csv.writer(f, delimiter=conf.separator)
    for i in tqdm(range(10000000000)):  # tqdm wraps an iterable, not a bare int
        row = [column.get_value() for column in conf.columns]
        writer.writerow(row)
Thanks a lot.
hdfs dfs -put does not require you to create a file locally. There is also no need to create a zero-byte file on HDFS (touchz) and append to it (appendToFile). You can write a file directly on HDFS as:
hadoop fs -put - /user/myuser/testfile
Hit Enter, type the text you want to put in the file at the prompt, and press Ctrl+D when you are finished.
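The same stdin trick works from Python by piping generated rows straight into hadoop fs -put -, so no local file is ever created. A sketch; the HDFS path and the rows are placeholders:

```python
import csv
import io
import subprocess

def stream_rows(rows, binary_out):
    # wrap the binary stream so csv.writer can write text to it
    text = io.TextIOWrapper(binary_out, encoding="utf-8", newline="")
    writer = csv.writer(text)
    writer.writerows(rows)
    text.flush()
    text.detach()  # leave binary_out open for the caller

def put_to_hdfs(rows, hdfs_path):
    # `hadoop fs -put -` reads the file content from stdin
    proc = subprocess.Popen(["hadoop", "fs", "-put", "-", hdfs_path],
                            stdin=subprocess.PIPE)
    stream_rows(rows, proc.stdin)
    proc.stdin.close()
    return proc.wait()

# usage (needs a Hadoop client on the PATH):
# put_to_hdfs([["a", 1], ["b", 2]], "/user/myuser/generated.csv")
```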
Two ways to write local files to HDFS using Python:
One way is the hdfs Python package:
Code snippet:
from hdfs import InsecureClient
hdfsclient = InsecureClient('http://localhost:50070', user='madhuc')
hdfspath="/user/madhuc/hdfswritedata/"
localpath="/home/madhuc/sample.csv"
hdfsclient.upload(hdfspath, localpath)
Output location: '/user/madhuc/hdfswritedata/sample.csv'
The other way is the subprocess Python package, using PIPE.
Code snippet:
from subprocess import PIPE, Popen

# put file into hdfs
put = Popen(["hadoop", "fs", "-put", localpath, hdfspath], stdin=PIPE, bufsize=-1)
put.communicate()
if put.returncode == 0:
    print("File Saved Successfully")
An existing Python package requires a filepath as an input parameter so one of its methods can parse the file at that path. I want to use this specific package in a cloud environment where I can't write files to the hard drive. I don't have direct control over the package's code, and it's not easy to switch to another environment where I could write files to the hard drive. So I'm looking for a way to write a file to an in-memory filepath and let the parser read directly from it. Is this possible in Python? Or are there other solutions?
Example Python code that works by using the hard drive, which should be changed so that no hard drive is used:
temp_filepath = "./temp.txt"
with open(temp_filepath, "wb") as file:
    file.write(b"some binary data")  # bytes, since the file is opened in binary mode

model = Model()
model.parse(temp_filepath)
Example Python code that uses a memory filesystem to store the file, but which does not let the parser read the file from the memory filesystem:
from fs import open_fs

temp_filepath = "./temp.txt"
with open_fs('osfs://~/') as home_fs:
    home_fs.writetext(temp_filepath, "some binary data")

model = Model()
model.parse(temp_filepath)
You're probably looking for StringIO or BytesIO from the io module.
import io

content = b"some binary data"  # whatever you would otherwise write to disk
with io.BytesIO() as tmp:
    tmp.write(content)
    # to continue working, rewind the file pointer
    tmp.seek(0)
    # work with tmp
pathlib may also be useful here.
I am trying to run a Python zip file which is retrieved using requests.get. The zip file has several directories of Python files in addition to the __main__.py, so in the interest of easily sending it as a single file, I am zipping it.
I know the file is being sent correctly, as I can save it to a file and then run it; however, I want to execute it without writing it to storage.
The working part is more or less as follows:
import requests
response = requests.get("http://myurl.com/get_zip")
I can write the zip to file using
with open("myapp.zip", "wb") as f:
    f.write(response.content)
and manually run it from command line. However, I want something more like
exec(response.content)
This doesn't work since it's still compressed, but you get the idea.
I am also open to ideas that replace the zip with some other format of sending the code over internet, if it's easier to execute it from memory.
A possible solution is this:
import io
import requests
from zipfile import ZipFile

response = requests.get("http://myurl.com/get_zip")

# Read the contents of the zip into a bytes object.
binary_zip = io.BytesIO(response.content)
# Convert the bytes object into a ZipFile.
zip_file = ZipFile(binary_zip, "r")

# Iterate over all files in the zip (folders should also be ok).
for script_file in zip_file.namelist():
    exec(zip_file.read(script_file))
But it is a bit convoluted and probably can be improved.
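One possible refinement (a sketch, not tested against the original app): execute only the .py members, share one namespace between them, and run __main__.py last so the helpers it relies on are already defined:

```python
import io
from zipfile import ZipFile

def exec_zip(zip_bytes):
    namespace = {}
    with ZipFile(io.BytesIO(zip_bytes)) as zf:
        # skip folders and non-Python members
        names = [n for n in zf.namelist() if n.endswith(".py")]
        # False sorts before True, so __main__.py comes last
        names.sort(key=lambda n: n.endswith("__main__.py"))
        for name in names:
            exec(zf.read(name), namespace)
    return namespace
```

Note that exec-ing everything into one namespace is not real import semantics; for proper package imports, writing the zip to disk and running it (or using zipimport) remains the supported route.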
I have a large zip file that contains many jar files. I want to read the content of the jar files.
What I tried is to read the inner jar file into memory, which seems to work (see below). However, I am not sure how large the jar files can be, and I am concerned that they won't fit into memory.
Is there a streaming solution for this problem?
hello.zip
+- hello.jar
   +- Hello.class
#!/usr/local/bin/python3

import io
import zipfile

zip = zipfile.ZipFile('hello.zip', 'r')
for zipname in zip.namelist():
    if zipname.endswith('.jar'):
        print(zipname)
        jarname = zip.read(zipname)
        memfile = io.BytesIO(jarname)
        jar = zipfile.ZipFile(memfile)
        for f in jar.namelist():
            print(f)
hello.jar
META-INF/
META-INF/MANIFEST.MF
Hello.class
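Since Python 3.7 the file object returned by ZipFile.open() is seekable, so a nested ZipFile can read the jar in place instead of buffering it all in memory first. A sketch along those lines (the suffix mirrors the example above):

```python
import zipfile

def list_nested(outer, suffix=".jar"):
    # `outer` can be a path or a file-like object
    entries = []
    with zipfile.ZipFile(outer) as z:
        for name in z.namelist():
            if name.endswith(suffix):
                with z.open(name) as inner_stream:  # no full read into memory
                    with zipfile.ZipFile(inner_stream) as inner:
                        entries.extend(inner.namelist())
    return entries
```

Individual members can then be read in chunks via inner.open(member) if they are themselves large. One caveat: backward seeks on a compressed member are satisfied by re-reading from its start, so this trades some CPU for bounded memory.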
I have many CSV files that I need to get from a URL. I found this reference: How to read a CSV file from a URL with Python?
It does almost what I want, but I don't want to go through Python to read the CSV and then have to save it. I just want to save the CSV file from the URL directly to my hard drive.
I have no problem with for loops and cycling through my URLs. It is simply a matter of saving the CSV file.
If all you want to do is save a CSV, then I wouldn't suggest using Python at all; this is more of a Unix question. Assuming you're working on some kind of *nix system, I would suggest just using wget. For instance:
wget http://someurl/path/to/file.csv
You can run this command directly from python like so:
import subprocess

# note: the values in save_locations already include the .csv extension
bashCommand = lambda url, filename: "wget -O %s %s" % (filename, url)
save_locations = {'http://someurl/path/to/file.csv': 'test.csv'}

for url, filename in save_locations.items():
    process = subprocess.Popen(bashCommand(url, filename).split(), stdout=subprocess.PIPE)
    output = process.communicate()[0]
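If you'd rather stay in Python (for example, for portability to systems without wget), the standard library can stream the response straight to disk without parsing the CSV. A sketch using the same placeholder URL:

```python
import shutil
import urllib.request

def save_csv(url, filename):
    # stream the response body directly to a file; nothing is parsed
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out)

# save_csv('http://someurl/path/to/file.csv', 'test.csv')
```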
I have more than 40 txt files that need to be loaded into a table in MySQL. Each file contains 3 columns of data, each column listing one specific type of data. The format of every txt file is exactly the same, but the file names vary. First I tried LOAD DATA LOCAL INFILE 'path/*.txt' INTO TABLE xxx, because I thought the *.txt wildcard might let MySQL load all the txt files in the folder, but it turned out it doesn't.
So how can I get MySQL or Python to do this? Or do I need to merge them into one file manually first and then use the LOAD DATA LOCAL INFILE command?
Many thanks!
If you want to avoid merging your text files, you can easily "scan" the folder and run the SQL import query for each file:
import os

for dirpath, dirsInDirpath, filesInDirPath in os.walk("yourFolderContainingTxtFiles"):
    for myfile in filesInDirPath:
        # the file path must be quoted inside the SQL statement
        sqlQuery = "LOAD DATA INFILE '%s' INTO TABLE xxxx (col1,col2,...);" % os.path.join(dirpath, myfile)
        # execute the query here using your mysql connector.
        # I used string formatting to build the query, but you should use the safe
        # placeholders provided by the mysql api instead of %s, to protect against SQL injections
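Since the file names vary, glob can collect them and hand each path to the connector as a bound parameter (a sketch; the folder, table, and column names are the placeholders used above):

```python
import glob
import os

def build_load_queries(folder):
    # one (query, params) pair per .txt file; the placeholder lets the
    # connector quote the path safely
    query = "LOAD DATA LOCAL INFILE %s INTO TABLE xxxx (col1, col2, col3)"
    paths = sorted(glob.glob(os.path.join(folder, "*.txt")))
    return [(query, (path,)) for path in paths]

# for sql, params in build_load_queries("yourFolderContainingTxtFiles"):
#     cursor.execute(sql, params)   # cursor from your mysql connector
```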
The simplest and most reliable way is to merge your data into one file. That's fairly easy using Python:
fout = open("out.txt", "a")

# first file:
for line in open("file1.txt"):
    fout.write(line)

# now the rest (range excludes its upper bound, hence the + 1):
for num in range(2, NB_FILES + 1):
    f = open("file" + str(num) + ".txt")
    for line in f:
        fout.write(line)
    f.close()
fout.close()
Then run the command you know (... INFILE ...) to load the single file into MySQL. This works fine as long as the separator between columns is strictly the same in every file. Tabs are best in my opinion ;)
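Since the file names vary rather than following a file1.txt, file2.txt pattern, the same merge can be sketched with glob and shutil.copyfileobj, which also copies in blocks instead of line by line:

```python
import glob
import shutil

def merge_files(pattern, out_path):
    # concatenate every file matching the pattern into one output file
    with open(out_path, "wb") as fout:
        for path in sorted(glob.glob(pattern)):
            with open(path, "rb") as f:
                shutil.copyfileobj(f, fout)

# merge_files("path/*.txt", "out.txt")
```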