Read/Write files on hdfs using Python

I am new to Python. I want to read a file from HDFS, which I have managed to do.
After reading the file I perform some string operations, and I want to write the modified contents to an output file.
I read the file using subprocess (which took a lot of time), since open didn't work for me.
from subprocess import Popen, PIPE
cat = Popen(["hadoop", "fs", "-cat", "/user/hdfs/test-python/input/test_replace"], stdout=PIPE)
Now the question is how to write the modified contents to the output file.
Your help is highly appreciated.

You can use a library for reading and writing to HDFS, like https://github.com/mtth/hdfs
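If installing a library is not an option, the same hadoop command-line client used for reading can also write: hadoop fs -put - <dest> reads the file body from standard input. A minimal sketch, assuming the hadoop CLI is on PATH and Python 3.5+ (for subprocess.run); the helper names are hypothetical, not part of any library:
```python
import subprocess

def hdfs_put_command(hdfs_path):
    """Build the hadoop CLI command that uploads stdin to hdfs_path."""
    # "-put -" makes the hadoop client read the file body from stdin.
    # Without "-f", the command fails if hdfs_path already exists.
    return ["hadoop", "fs", "-put", "-", hdfs_path]

def pipe_text_to(command, text):
    """Run command, feeding text (UTF-8 encoded) to its standard input."""
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, input=text.encode("utf-8"), check=True)
```
With the modified contents in a string named modified, calling pipe_text_to(hdfs_put_command(output_path), modified) would upload it, where output_path is whatever HDFS destination you choose.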

Related

Converting .csv file to .mat file without reading the csv

I have a Python project that writes its output as CSV. These outputs can sometimes be as large as 15-16 GB. When I try to save the data with scipy, the machine runs out of RAM and CPU and the program is killed, so I need to convert the CSV file to a .mat file without reading the file. Is there a way to do that?

Yes and no. You can't do anything with the file unless you read it, but you don't have to read all of it at once. I don't know all the details, but usually you can use fopen and fscanf to read just a few lines of the CSV file, process them however you like, and save the partial result. Then repeat from fscanf for the next few lines, again and again, until the file is exhausted.
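In Python the same read-a-little, save-a-little loop can be written with the standard csv module. A minimal sketch, assuming the rows can be processed independently of each other; the function name and default chunk size are illustrative, and saving each chunk (for example to its own .mat file) is left to the caller:
```python
import csv
from itertools import islice

def iter_csv_chunks(path, chunk_size=100000):
    """Yield lists of at most chunk_size rows without loading the whole file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        while True:
            # islice pulls only the next chunk_size rows from the reader.
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk
```
Only one chunk is held in memory at a time, so peak memory depends on chunk_size rather than on the size of the file.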

How to not read a CSV file if it's being written to at that instant?

Is this something that can be done in Python or any other language? Is there a way to detect whether a CSV file is being written to at that moment?

So you want to update the CSV file atomically. As you have realized, writing over the existing file in place is not atomic and will get you in trouble: a reader can see a half-written file.
The trick is to write the new data to a temporary file and then move the temp file over the live file. The move operation is atomic (for practical purposes) as long as both files are on the same filesystem.
create-new-csv-data > new-data.csv
mv new-data.csv data.csv
For probably more info than you want to know about how atomic a mv really is, see for example https://unix.stackexchange.com/questions/322038/is-mv-atomic-on-my-fs.
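The same write-then-move pattern can be sketched in Python with os.replace, which renames over an existing file. This is a minimal sketch under the assumption that the temp file is created in the target's own directory; the function name is illustrative:
```python
import os
import tempfile

def atomic_write_text(path, text):
    """Write text to path so readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem as the target,
    # otherwise the final rename cannot be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # rename over the live file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```
A reader that opens data.csv at any moment gets either the complete old contents or the complete new contents, never a partial write.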

Python Zipfile - is entire file unzipped to memory?

I have some code which I am using to open a large zip which contains some csv files and then parse them.
I am using this code below but I am now wondering if I am actually unzipping the entire file into memory and then extracting the file contents to disk as well, after which I read the files in one by one.
import zipfile
def unzip_file(file_path):
    zip_ref = zipfile.ZipFile(file_path, 'r')
    extracted = zip_ref.namelist()
    zip_ref.extractall('/tmp/extracts')
    zip_ref.close()
    return extracted
Is this actually unzipping the files and their contents into memory and then extracting the files straight to disk? I use the extracted variable afterwards, as it contains the list of file names I need to process, but I don't want to load each file into memory here and then read it again later.

Your concern is that you are wasting memory or being inefficient in the manner you are reading the files when extracting them. The answer to if you're doing anything "wrong" is simply: "No". Your code is correct and it does not keep files in memory after you have finished the function call.
A few notes on what you can improve though.
Use Context Managers to Automatically Close File
The ZipFile is also a context manager and it is generally considered best practice to use it to make sure that files are closed and cleaned up from memory correctly. Instead of calling .close() manually you could do the following:
from zipfile import ZipFile
with ZipFile(file_path, "r") as zip_ref:
    zip_ref.extractall("/tmp/extracts")
It will then close the file automatically once the with block ends, even if an error occurs inside it, so you do not have to worry about the archive staying open or lingering in memory.
Read Files without Extracting
Since you are extracting the files to a /tmp/ folder, I guess(?) that you actually don't want to store the files on disk. Perhaps all you want to do is to read the data and do something with it.
You can read each file within the zip file without extracting them to disk.
with ZipFile('spam.zip') as myzip:
    with myzip.open('eggs.txt') as myfile:
        print(myfile.read())
This might be a better solution depending on what you want to achieve. You can see more in the Python docs for the zipfile module.
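Since the archive in the question holds CSV files, the read-without-extracting idea can be combined with the csv module. A minimal sketch, assuming the members are UTF-8 text; the function name and the encoding default are assumptions, not something stated in the question:
```python
import csv
import io
import zipfile

def iter_zipped_csv_rows(zip_path, member_name, encoding="utf-8"):
    """Yield rows of one CSV member, decompressing it as it is read."""
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as raw:
            # ZipFile.open returns a binary stream; wrap it to get text.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            for row in csv.reader(text):
                yield row
```
Looping over archive.namelist() and calling this for each name would process every file without writing anything to /tmp/extracts.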

Beginner Python: Saving an excel file while it is open

I have a simple problem that I hope will have a simple solution.
I am writing Python (2.7) code using the xlwt package to write Excel files. The program takes data and writes it out to a file that is saved constantly. The problem is that whenever I have the file open to check the data and Python tries to save the file, the program crashes.
Is there any way to make Python save the file while I have it open for reading?

My experience is that sashkello is correct: Excel locks the file. Even OpenOffice/LibreOffice do this. They lock the file on disk and create a temporary version as a working copy. Any program trying to access the open file will be denied by the OS. The reason is that many corporations treat Excel files as databases, but the users have no understanding of the issues involved in concurrency and synchronisation.
I am on Linux and I get this behaviour (at least when the file is on a Samba share). Look in the same directory as your file: if a file called .~lock.[filename]# exists, then you will be unable to read your file from another program. I'm not sure what enforces this lock, but I suspect it's an NTFS attribute. Note that even a simple cp or cat fails: cp: error reading ‘CATALOGUE.ods’: Input/output error
UPDATE: The actual locking mechanism appears to be "oplocks", a concept connected to Windows shares: http://oreilly.com/openbook/samba/book/ch05_05.html . If the share is managed by Samba, the workaround is to disable oplocks on certain file types, e.g.:
veto oplock files = /*.xlsx/
If you aren't using a share or NTFS on Linux, then I guess you should be able to read and write the file as long as your script has write permission. By default only the user who created the file has write access.
WORKAROUND 2: The restriction only seems to apply if you have the file open in Excel/LO as writable, however LO at least allows you to open a file as read-only (Go to File -> Properties -> Security, set Read-Only, Save and re-open the file). I don't know if this will also make it RO for xlwt though.
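The .~lock.[filename]# marker described above can be checked from Python before attempting a save. A minimal sketch; the function names are hypothetical, and it only detects the OpenOffice/LibreOffice marker file, so a False result does not prove the file is free:
```python
import os

def office_lock_path(path):
    """Return the .~lock.[filename]# path that would sit next to path."""
    directory, name = os.path.split(os.path.abspath(path))
    return os.path.join(directory, ".~lock." + name + "#")

def looks_locked(path):
    """True if the OpenOffice/LibreOffice lock marker exists for path."""
    # Other programs may lock the file by other means that this check
    # cannot see, so treat False as "no marker found", not "safe to write".
    return os.path.exists(office_lock_path(path))
```
A script could call looks_locked before saving and, if it returns True, wait or write to a different file name instead of crashing.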

Hah, funny I ran across your post. I actually just implemented this tonight.
The issue is that the Excel libraries either read or write, not both: xlrd reads and xlwt writes, so you cannot read and write through the same object. If you have another way to save the data, please use it. I'm in a position where I don't have that option, and you might be too.
You're going to need xlutils; it's the bread and butter of this.
Here's some example code:
import xlrd
from xlutils.copy import copy
wb_filename = 'example.xls'
wb_object = xlrd.open_workbook(wb_filename)
# You can now read from wb_object to your heart's content.
# To write, copy the workbook into a writable object and work on the copy.
write_object = copy(wb_object)
# Write to it all you want, then save that object.
write_object.save(wb_filename)
And that's it. Note that if you read the object, write to the copy, and then read the original object again, it won't reflect your changes. You either need to recreate wb_object or keep some sort of table in memory to track changes as you work.

python beginner questions

I just installed Python.
I am trying to run this script:
import csv
reader = csv.reader(open("some.csv", "rb"))
for row in reader:
    print row
I am running on Windows.
Do I have to type each line individually into the Python shell, or can I save this code into a text file and then run it from the shell?
Where does some.csv have to be in order to run it? In the same c:\python26 folder?
What is this code supposed to do?

1. Yes, you can save the code to a file. The interactive shell is mainly for learning syntax and toying with ideas, not for writing programs.
a. By convention the script has a .py extension, e.g., csvprint.py. To run it, enter python csvprint.py at the command prompt. This loads csvprint.py from the current directory and runs it.
2. The some.csv file has to be in the current working directory, which doesn't have to be (in fact, almost never should be) the Python folder. Usually this will be your home directory, or some kind of working area that you set up, like C:\work. It's entirely up to you, though.
3. The csv module reads each row of comma-separated values from the file as a list of strings, and the loop prints each row to the console.
One final note: The usual way to write such logic is to take the input from the command-line rather than hard-coding it. Like so:
import csv
import sys
reader = csv.reader(open(sys.argv[1], "rb"))
for row in reader:
    print row
And run it like so:
python csvprint.py some.csv
In this case you can put some.csv anywhere:
python csvprint.py C:\stuff\csvfiles\some.csv
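The snippets above are Python 2, matching the Python 2.6 install in the question. On Python 3 the equivalent would look roughly like the sketch below: print is a function, and the csv documentation recommends opening the file in text mode with newline="" instead of "rb". The function name is illustrative:
```python
import csv

def read_rows(path):
    """Return every row of the CSV file at path as a list of strings."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="") as handle:
        return list(csv.reader(handle))

# In a script, the command-line form shown above becomes:
#     import sys
#     for row in read_rows(sys.argv[1]):
#         print(row)
```
Using with also closes the file when reading is done, which the one-line open call in the original leaves to the interpreter.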

When you have IDLE open, click File > New Window. (Or hit Ctrl + N)
This opens up a new window for you that's basically just a text editor with Python syntax highlighting. This is where you can write a program and save it. To execute it quickly, hit F5.

You can do both! To run the code from a text file (such as 'csvread.py', but the extension doesn't matter), type: python csvread.py at the command prompt. Make sure your PATH is set to include the Python installation directory.
"some.csv" needs to be in the current directory.
This code opens the file and wraps it in a csv reader object. Iterating over the reader yields each row of the CSV in order, and the loop prints it. Check the documentation for a more detailed example: http://docs.python.org/library/csv.html

Type the code into a *.py file, and then execute it.
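I think the file should be in the same folder as your *.py script, assuming you run the script from that folder (a relative name like some.csv is looked up in the current working directory).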
This opens a file stored in comma separated value format and prints the contents of each row.

All import does is make another module's code available to yours. From the docs: "Python code in one module gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the import machinery, but it is not the only way". As for the csv module, its documentation says: The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
The CSV module implements classes to read and write tabular data in CSV format. It allows programmers to say, “write this data in the format preferred by Excel,” or “read data from this file which was generated by Excel,” without knowing the precise details of the CSV format used by Excel. Programmers can also describe the CSV formats understood by other applications or define their own special-purpose CSV formats. All your code is doing is looping through the rows of that file and printing each one.
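As a small illustration of the "format preferred by Excel" idea, the sketch below writes the same rows once with the built-in excel dialect and once with a semicolon delimiter. The sample rows and the helper name are made up for the example:
```python
import csv
import io

def rows_to_text(rows, **fmt):
    """Serialize rows to CSV text using the given csv.writer options."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer, **fmt)
    writer.writerows(rows)
    return buffer.getvalue()

rows = [["name", "note"], ["Ann", "likes a, b"]]
# The excel dialect quotes the field that contains a comma.
excel_text = rows_to_text(rows, dialect="excel")
# With ";" as the delimiter, the comma needs no quoting.
semicolon_text = rows_to_text(rows, delimiter=";")
```
The rows are identical in both cases; only the delimiter and quoting differ, which is the detail the module hides from the programmer.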
