Saving multiple items to HDFS with (spark, python, pyspark, jupyter)

Saving multiple items to HDFS with (spark, python, pyspark, jupyter) - python

I´m used to program in Python. My company now got a Hadoop Cluster with Jupyter installed. Until now I never used Spark / Pyspark for anything.
I am able to load files from HDFS as easy as this:
text_file = sc.textFile("/user/myname/student_grades.txt")
And I´m able to write output like this:
text_file.saveAsTextFile("/user/myname/student_grades2.txt")
The thing I´m trying to achieve is to use a simple "for loop" to read text files one-by-one and write it's content into one HDFS file. So I tried this:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
text_file = sc.textFile("/user/myname/" + i)
text_file.saveAsTextFile("/user/myname/all.txt")
So this works for the first element of the list, but then gives me this error message:
Py4JJavaError: An error occurred while calling o714.saveAsTextFile.
: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory
XXXXXXXX/user/myname/all.txt already exists
To avoid confusion I "blured"-out the IP address with XXXXXXXX.
What is the right way to do this?
I will have tons of datasets (like 'text1', 'text2' ...) and want to perform a python function with each of them before saving them into HDFS. But I would like to have the results all together in "one" output file.
Thanks a lot!
MG
EDIT:
It seems like that my final goal was not really clear. I need to apply a function to each text file seperately and then I want to append the output to the existing output directory. Something like this:
for i in list:
text_file = sc.textFile("/user/myname/" + i)
text_file = really_cool_python_function(text_file)
text_file.saveAsTextFile("/user/myname/all.txt")

I wanted to post this as comment but could not do so as I do not have enough reputation.
You have to convert your RDD to dataframe and then write it in append mode. To convert RDD to dataframe please look into this answer:
https://stackoverflow.com/a/39705464/3287419
or this link http://spark.apache.org/docs/latest/sql-programming-guide.html
To save dataframe in append mode below link may be useful:
http://spark.apache.org/docs/latest/sql-programming-guide.html#save-modes
Almost same question is here also Spark: Saving RDD in an already existing path in HDFS . But the answer provided is for scala. I hope something similar can be done in python also.
There is yet another (but ugly) approach. Convert your RDD to string. Let the resulting string be resultString . Use subprocess to append that string to destination file i.e.
subprocess.call("echo "+resultString+" | hdfs dfs -appendToFile - <destination>", shell=True)

you can read multiple files and save them by
textfile = sc.textFile(','.join(['/user/myname/'+f for f in list]))
textfile.saveAsTextFile('/user/myname/all')
you will get all part files within output directory.

If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.

I would try this, it should be fine:
list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt']
for i in list:
text_file = sc.textFile("/user/myname/" + i)
text_file.saveAsTextFile(f"/user/myname/{i}")

Related

How do I import images with filenames corresponding to column values in a dataframe?

I'm a doctor trying to learn some code for work, and was hoping you could help me solve a problem I have with regards to importing multiple images into python.
I am working in Jupyter Notebook, where I have created a dataframe (named df_1) using pandas. In this dataframe each row represents a patient, and the first column shows the case number for each patient (e.g. 85).
Now, what I want to do is import multiple images (.bmp) from a given folder(same location as the .ipynb file). There are many images in this folder, and I do not want all of them - only the ones who have filenames corresponding to the "case_number" column in my dataframe (e.g. 85.bmp).
I already read this post, but I must admit it was way to complicated for me to understand.
Is there some simple loop (or something else) I could create to import all images with filenames corresponding to the values of the "case number" column in the dataframe?
I was imagining something like the below would be possible, I just do not know how to write it.
for i=[(df_1['case_number'()]
cv2.imread('[i].bmp')
The images don't really need to be implemented in the dataframe, but I would like to be able to view them in my notebook by using e.g. plt.imshow(85) afterwards.
Here is an image of the head of my dataframe
Thank you for helping!

You can access all of your files using this:
imageList = []
for i in range(0, len(df_1)):
cv2.imread('./' + str(df_1['case_number'][i]) + '.bmp')
imageList.append('./' + str(df_1['case_number'][i]) + '.bmp')
plt.imshow(imagelist[x])
This is looping through every item in the case_number column, the ./ shows that your file is within the current directory, using the directory path leading up to your current file. And by making everything a string and joining it you make it so that the file path is readable. The path created by joining the strings should look something like ./85.bmp, which should open your desired file. Also, you are appending the filenames to the list so that they can be accessed by the plt.imshow()
If you would like to access the files based on their name, you can use another variable (which could be set as an input) and implement the code below
fileName = input('Enter Your Value: ')
inputFile = imageList.index('./' + fileName + '.bmp')
and from here, you could use the same plt.imshow(imagelist[x]), but replace the x with the inputFile variable.

How to print rdd in python in spark

I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried foreach and println functions as well and I am not able to display file data.
I am working in python and totally new to both python and spark as well.

This is really easy just do a collect
You must be sure that all the data fits the memory on your master
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case You must just take a sample by using take method.
# I use an exagerated number to remind you it is very large and won't fit the memory in your master so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
Another example using .ipynb:

How to operate on unsaved Excel file?

I'd like to automate a loop:
ABAQUS generates a Excel file;
Matlab utilises data in Excel file;
loop 1 and 2.
Now my question is: after step 1, the Excel file from ABAQUS is unsaved as Book1. I cannot use Matlab command to save it. Is there a way not to save this ''Book1'' file, but use the data in it? Or if I can find where it is so I can use the data inside? (I assume that Excel always saves the file even though user doesn't?)
Thank you!

As agentp mentioned, if you are running Abaqus via a Python script, you can just use Python to create a .txt file to save all the relevant information. If well structured, a .txt file can be as readable as an Excel spreadsheet. Because Matlab and Python have intrinsic functions to read and write files this communication can be easily done.
As for Matlab calling Abaqus, you can use something similar to:
system('abaqus cae nogui=YOUR_SCRIPT.py')

Your script that pipes to Excel should have some code similar to this:
abq_ExcelUtilities.excelUtilities.XYtoExcel(
xyDataNames='S:Mises PI: PART-1-1 E: 4309 IP: 1', trueName='')
writing the same data to a report (.rpt) file the code looks like this:
x0 = session.xyDataObjects['S:Mises PI: PART-1-1 E: 4309 IP: 1']
session.writeXYReport(fileName='abaqus.rpt', xyData=(x0, ))
now to "roll your own", use that x0 object: x0.data is a regular python tuple holding the actual data which you can write to a file however you like, eg:
file=open('myfile.csv','w')
for point in x0.data: file.write('%g,%g\n'%point)
file.close()
(you can comment or delete the writeXYReport call )

Writing to CSV - string being recognized as date in Excel

This is probably an easy fix, but I can't seem to figure it out...
outputting a list to CSV in Python using the following code:
w = csv.writer(file('filename.csv','wb'))
w.writerows(mylist)
One of the list items is a ratio, so it contains values like '23/54', '9/12', etc. Excel is recognizing some of these values (like 9/12) as a date. What's the easiest way to solve this?
Thanks

Because csv is a text-only format, you cannot tell Excel anything about how to interpret the data, I am afraid.
You'd have to generate actual Excel files (using xlwt for example, documentation and tutorials available on http://www.python-excel.org/).

You could do this:
# somelist contains data like '12/51','9/43' etc
mylist = ["'" + val + "'" for val in somelist]
w = csv.writer(open('filename.csv','wb'))
for me in mylist:
w.writerow([me])
This would ensure your data is written as it is to csv.

Open URL stored in a csv file

I'm almost an absolute beginner in Python, but I am asked to manage some difficult task. I have read many tutorials and found some very useful tips on this website, but I think that this question was not asked until now, or at least in the way I tried it in the search engine.
I have managed to write some url in a csv file. Now I would like to write a script able to open this file, to open the urls, and write their content in a dictionary. But I have failed : my script can print these addresses, but cannot process the file.
Interestingly, my script dit not send the same error message each time. Here the last : req.timeout = timeout
AttributeError: 'list' object has no attribute 'timeout'
So I think my script faces several problems :
1- is my method to open url the right one ?
2 - and what is wrong in the way I build the dictionnary ?
Here is my attempt below. Thanks in advance to those who would help me !
import csv
import urllib
dict = {}
test = csv.reader(open("read.csv","rb"))
for z in test:
sock = urllib.urlopen(z)
source = sock.read()
dict[z] = source
sock.close()
print dict

First thing, don't shadow built-ins. Rename your dictionary to something else as dict is used to create new dictionaries.
Secondly, the csv reader creates a list per line that would contain all the columns. Either reference the column explicitly by urllib.urlopen(z[0]) # First column in the line or open the file with a normal open() and iterate through it.
Apart from that, it works for me.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Saving multiple items to HDFS with (spark, python, pyspark, jupyter) - python

you can read multiple files and save them by textfile = sc.textFile(','.join(['/user/myname/'+f for f in list])) textfile.saveAsTextFile('/user/myname/all') you will get all part files within output directory.

If the text files all have the same schema, you could use Hive to read the whole folder as a single table, and directly write that output.

I would try this, it should be fine: list = ['text1.txt', 'text2.txt', 'text3.txt', 'text4.txt'] for i in list: text_file = sc.textFile("/user/myname/" + i) text_file.saveAsTextFile(f"/user/myname/{i}")

Related

How do I import images with filenames corresponding to column values in a dataframe?

How to print rdd in python in spark

How to operate on unsaved Excel file?

Writing to CSV - string being recognized as date in Excel

Open URL stored in a csv file

Categories

Resources