sc.textFile(path) lets you read an HDFS file, but it does not accept parameters (such as skipping a number of rows, has_headers, ...).
In the "Learning Spark" O'Reilly e-book, the following function is suggested for reading a CSV (Example 5-12. Python load CSV example):
import csv
import StringIO

def loadRecord(line):
    """Parse a CSV line"""
    input = StringIO.StringIO(line)
    reader = csv.DictReader(input, fieldnames=["name", "favouriteAnimal"])
    return reader.next()

input = sc.textFile(inputFile).map(loadRecord)
My question is about how to be selective about the rows "taken":
How to avoid loading the first row (the headers)
How to remove a specific row (for instance, row 5)
I see some decent solutions here: select range of elements, but I'd like to see if there is anything simpler.
Thx!
Don't worry about loading the rows/lines you don't need. When you do:
input = sc.textFile(inputFile)
you are not loading the file. You are just getting an object that will allow you to operate on the file. So to be efficient, it is better to think in terms of getting only what you want. For example:
header = input.take(1)[0]
rows = input.filter(lambda line: line != header)
Note that here I am not using an index to refer to the line I want to drop, but rather its value. This has the side effect that other lines with the same value will also be ignored, but it is more in the spirit of Spark: Spark distributes your text file across the nodes in partitions, and the concept of a line number is lost within each partition. This is also why this is not easy to do in Spark (Hadoop): each partition has to be treated as independent, and a global line number would break that assumption.
If you really need to work with line numbers, I recommend that you add them to the file outside of Spark (see here) and then just filter by this column inside of Spark.
Edit: Added the zipWithIndex solution as suggested by @Daniel Darabos.
(sc.textFile('test.txt')
 .zipWithIndex()               # [(u'First', 0), (u'Second', 1), ...]
 .filter(lambda x: x[1] != 5)  # drop the row with index 5
 .map(lambda x: x[0])          # [u'First', u'Second', ...]
 .collect())
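Combining this with the loadRecord parser from the question, a header-skipping load could look like the following (just a sketch, reusing loadRecord and the field names of the book example):
raw = sc.textFile(inputFile)
header = raw.take(1)[0]   # the header line
records = raw.filter(lambda line: line != header).map(loadRecord)
records.first()           # e.g. a dict with the 'name' and 'favouriteAnimal' keys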
I'm still new to Python and can't manage to do what I'm looking for. I'm using Python 3.7.0.
I have one file, called log.csv, containing a log of CANbus messages.
I want to check the content of the columns labelled Data2 and Data3 when the value in the ID column is 348.
If they are both different from "00", I want to build a new string called fault_code from Data3+Data2.
Then I want to check in another CSV file where this code appears and print column 6 of that row (labelled Description), but I only want to do this last part once per fault_code.
Here is my code:
import csv

CAN_ID = "348"

with open('0.csv') as log:
    reader = csv.reader(log,delimiter=',')
    for log_row in reader:
        if log_row[1] == CAN_ID:
            if (log_row[5]+log_row[4]) != "0000":
                fault_code = log_row[5]+log_row[4]
                with open('Fault_codes.csv') as fault:
                    readerFC = csv.reader(fault,delimiter=';')
                    for fault_row in readerFC:
                        if "0x"+fault_code in readerFC:
                            print("{fault_row[6]}")
Here is a part of the log.csv file
Timestamp,ID,Data0,Data1,Data2,Data3,Data4,Data5,Data6,Data7,
396774,313,0F,00,28,0A,00,00,C2,FF
396774,314,00,00,06,02,10,00,D8,00
396775,**348**,2C,00,**00,00**,FF,7F,E6,02
and this is a part of faultcode.csv
Level;LED Flashes;UID;FID;Type;Display;Message;Description;RecommendedAction
1;2;1;**0x4481**;Warning;F12001;Handbrake Fault;Handbrake is active;Release handbrake
1;5;1;**0x4541**;Warning;F15001;Fan Fault;blablabla;blablalba
1;5;2;**0x4542**;Warning;F15002;blablabla
Also, can you think of a better way to do this task? I've read that pandas can be very good for large files. As log.csv can have 100,000+ rows, maybe it's a better idea to use it. What do you think?
Thank you for your help!
Be careful with your indentation: you get this error because you sometimes use spaces and sometimes tabs to indent.
As PM 2Ring said, reading Fault_codes.csv every time you read one line of your log is really not efficient.
You should read the fault codes once and store their content in RAM (if it fits). You can use pandas for that and keep the content in a DataFrame. I would do this before reading the logs.
You do not need to store all the log.csv lines in RAM, so I'd keep reading it line by line with the csv module, do my stuff, write to a new file, and read the next line. No need to use pandas here, as it would fill your RAM for nothing.
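As a rough illustration of that approach (a minimal sketch, assuming the column layout from your sample files: ID in column 1 of the log, Data2/Data3 in columns 4/5, and FID and Description columns in Fault_codes.csv):
import csv
import pandas as pd

CAN_ID = "348"

# Read the fault-code table once and keep it in memory.
faults = pd.read_csv('Fault_codes.csv', sep=';')

seen = set()  # fault codes already reported, so each is handled only once

with open('log.csv') as log:
    reader = csv.reader(log, delimiter=',')
    next(reader)  # skip the header line
    for row in reader:
        if row[1] == CAN_ID and (row[5] + row[4]) != "0000":
            fault_code = "0x" + row[5] + row[4]
            if fault_code not in seen:
                seen.add(fault_code)
                match = faults[faults['FID'] == fault_code]
                if not match.empty:
                    print(match.iloc[0]['Description'])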
This is my function for building a record of a user's performed actions in a CSV file. It gets the username from a global and adds the increment given by the amount parameter to the specific cell of the CSV that matches the user's row and today's date.
In brief, the function reads the CSV into a list, applies any modification to the data, and then rewrites the whole list back into the CSV file.
The first item of each row is the username, and the header row holds the dates.
Accs\Dates,12/25/2016,12/26/2016,12/27/2016
user1,217,338,653
user2,261,0,34
user3,0,140,455
However, I'm not sure why the header sometimes gets pushed down to the second row, and the data gets wiped entirely when it crashes.
Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue.
I'm thinking maybe I could write the stats separately and uniquely for each user and combine them later, hence eliminating the possible clash in writing, although it would be great if I could just improve on what I have here and read/write everything in one file.
Any fail-safe way to do what I'm trying to do here?
# Search current user in first rows and update the count in the column for today's date
# 'amount' will be added to the respective position
def dailyStats(self, amount, code=None):

    def initStats():
        # prepping table
        with open(self.stats, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if row:
                    self.statsTable.append(row)
                    self.statsNames.append(row[0])

    def getIndex(list, match):
        # get the index of the matched date or user
        for i, j in enumerate(list):
            if j == match:
                return i

    self.statsTable = []
    self.statsNames = []
    self.statsDates = None
    initStats()

    today = datetime.datetime.now().strftime('%m/%d/%Y')
    user_index = None
    today_index = None

    # append header if the csv is empty
    if len(self.statsTable) == 0:
        self.statsTable.append([r'Accs\Dates'])
        # rebuild updated table
        initStats()

    # add new user/date if not found in first row/column
    self.statsDates = self.statsTable[0]
    if getIndex(self.statsNames, self.username) is None:
        self.statsTable.append([self.username])
    if getIndex(self.statsDates, today) is None:
        self.statsDates.append(today)

    # rebuild statsNames after table appended
    self.statsNames = []
    for row in self.statsTable:
        self.statsNames.append(row[0])

    # getting the index of user (row) and date (column)
    user_index = getIndex(self.statsNames, self.username)
    today_index = getIndex(self.statsDates, today)

    # in the matched user's row, if there are dates before today with no data,
    # append 0 (e.g. user1,0,0,0,) until the column of today's date is reached
    if len(self.statsTable[user_index]) < today_index + 1:
        for i in range(0, today_index + 1 - len(self.statsTable[user_index])):
            self.statsTable[user_index].append(0)

    # insert pv or tb code if found
    if code is None:
        self.statsTable[user_index][today_index] = amount + int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0))
    else:
        self.statsTable[user_index][today_index] = str(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0)) + ' - ' + code

    # Writing final table
    with open(self.stats, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(self.statsTable)

    # return the summation of the user's total count
    total_follow = 0
    for i in range(1, len(self.statsTable[user_index])):
        total_follow += int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][i])).group(0))
    return total_follow
As David Z says, concurrency is more likely the cause of your problem.
I will add that the CSV format is not suitable for database-style storage, indexing, or sorting, because it is plain text and sequential.
You could handle this with an RDBMS for storing and updating your data and periodically processing your stats; CSV then becomes just an import/export format.
Python offers an SQLite binding in its standard library. If you build a connector that imports/updates the CSV content into an SQLite schema and then dumps the results back out as CSV, you will be able to handle concurrency and keep your native format without worrying about installing a database server or new Python packages.
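As a rough sketch of that idea (the stats.db file name and the table layout are made up for illustration; each script opens its own connection, and SQLite locks the database file so concurrent writes cannot corrupt it):
import csv
import sqlite3

conn = sqlite3.connect('stats.db')
conn.execute("""CREATE TABLE IF NOT EXISTS stats (
                    username TEXT,
                    date     TEXT,
                    total    INTEGER DEFAULT 0,
                    PRIMARY KEY (username, date))""")

def add_amount(username, date, amount):
    # one transaction per update; SQLite serializes concurrent writers for us
    with conn:
        conn.execute("INSERT OR IGNORE INTO stats (username, date, total) VALUES (?, ?, 0)",
                     (username, date))
        conn.execute("UPDATE stats SET total = total + ? WHERE username = ? AND date = ?",
                     (amount, username, date))

def export_csv(path):
    # dump the table back into the original wide Accs\Dates layout
    dates = [r[0] for r in conn.execute("SELECT DISTINCT date FROM stats ORDER BY date")]
    users = [r[0] for r in conn.execute("SELECT DISTINCT username FROM stats")]
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow([r'Accs\Dates'] + dates)
        for user in users:
            counts = dict(conn.execute("SELECT date, total FROM stats WHERE username = ?", (user,)))
            writer.writerow([user] + [counts.get(d, 0) for d in dates])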
Also, I need to point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue.
More likely than not that is exactly your issue. When two things are trying to write to the same file at the same time, the outputs from the two sources can easily get mixed up together, resulting in a file full of gibberish.
An easy way to fix this is just what you mentioned in the question: have each process (or thread) write to its own file, and then have separate code that combines all those files at the end. That's what I would probably do.
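For the per-process-file route, the combining step can be as simple as this (a sketch, with hypothetical stats_part_*.csv part files and a stats_combined.csv output):
import csv
import glob

# Merge all per-process part files into one combined CSV.
with open('stats_combined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for part in sorted(glob.glob('stats_part_*.csv')):
        with open(part, newline='') as f:
            writer.writerows(csv.reader(f))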
If you don't want to do that, what you can do is have the different processes/threads send their information to an "aggregator process", which puts everything together and writes it to the file; the key is that only the aggregator ever writes to the file. Of course, doing that requires you to build in some method of interprocess communication (IPC), and that in turn can be tricky, depending on how you do it. Actually, one of the best ways to implement IPC for simple programs is to use temporary files, which is essentially the same thing as in the previous paragraph.
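For instance, a minimal sketch of the aggregator idea using multiprocessing.Queue (the worker payload and file name here are made up for illustration):
import csv
import multiprocessing as mp

def worker(queue, name, n):
    # each worker sends rows to the aggregator instead of touching the file itself
    for i in range(n):
        queue.put([name, i])
    queue.put(None)  # sentinel: this worker is done

def aggregator(queue, n_workers, path):
    # only this process ever writes to the CSV, so writes cannot interleave
    finished = 0
    with open(path, 'w', newline='') as f:
        writer = csv.writer(f)
        while finished < n_workers:
            row = queue.get()
            if row is None:
                finished += 1
            else:
                writer.writerow(row)

if __name__ == '__main__':
    q = mp.Queue()
    workers = [mp.Process(target=worker, args=(q, 'user%d' % i, 3)) for i in range(4)]
    agg = mp.Process(target=aggregator, args=(q, len(workers), 'stats.csv'))
    agg.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    agg.join()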
I have a lot of data files with unknown names. I have figured out a way to get them all read and printed, but I want to make graphs of them, so I need the data in a workable form.
The data files are very neatly arranged (every line of the header contains information on what is stored there), but I am having trouble making a script that selects the data I need. The first 50+ lines of the file contain headers, of which I only need a few; this poses no problem when using something like:
for filename in glob.glob(fullpath):
    with open(filename, 'r') as f:
        for line in f:
            if 'xx' in line:
                Do my thing
            if 'yy' in line:
                Do my thing etc.
But below the headers there is a block of data with an undetermined number of columns and an undetermined number of lines (the number of columns and what each column contains is specified in the headers). I can't get this read in a way that allows a graph to be made with, for example, matplotlib. (I can get it right by manually copying the data to a separate file and reading that into a plottable format, but that is not something I want to do for every file every time...) The line before the data starts contains the very useful #eoh, but I can't figure out how to combine the selective reading of the first 50+ lines with then switching to reading everything into an array. If there are better methods to do what I want (including selecting the directory and seeing which files are there and readable), I am open to suggestions.
Update:
The solution proposed by @ImportanceOfBeingErnest seems very useful, but I can't get it to work.
So I'll start with the data mentioned as missing in the answer.
Columnnames are given in the following format:
#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2
In this format, NUMBER1 is the column number, UNIT is the unit of the measurement, MEASUREMENT is what is measured, and NUMBER2 is a numeric code for what is measured.
The data is separated by spaces but that won't be a problem, I suspect.
I tried to implement the reading of the headers in the loop to determine where they end, but this had no visible effect; even the print commands I added to check intermediate results showed nothing.
Once I put 'print line' after 'for line in f:', I thought I could see what went wrong, but it appears as if the whole loop is ignored, including the break command, which causes an error since the file is done reading and no data is left to read for the other parts...
Any help would be appreciated.
First of all, if the header has a certain character at the beginning of each line, this can be used to filter the header out automatically. Using numpy, you could call numpy.loadtxt(filename, delimiter=";", comments="#") to load the data, and every line starting with # would simply be ignored.
I don't know if this is the case here?!
In the case that you describe, where you have a header-ending flag #eoh, you could first read the header line by line to find out how many lines to skip and then use that number when loading the file.
I have assembled a little example of how it could work.
import glob
import numpy as np
import matplotlib.pyplot as plt

def readFile(filename):
    # first find the number of lines to skip in the header
    eoh = 0
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if "#eoh" in line:
                break
    # now at this point we need to find out about the column names
    # but as no data is given as example, this is impossible
    columnnames = []
    # load the file by skipping eoh lines,
    # the rest should be well behaving
    a = np.genfromtxt(filename, skip_header=eoh, delimiter=";")
    return a, columnnames

def plot(a, columnnames, show=True, save=False, filename="something"):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    # plot the fourth column against the second
    ax.plot(a[:, 1], a[:, 3])
    # if we had some column names, we could also plot
    # the column named "ulf" against the one named "alf"
    # ax.plot(a[:, columnnames.index("alf")], a[:, columnnames.index("ulf")])
    # now save and/or show
    if save:
        plt.savefig(filename + ".png")
    if show:
        plt.show()

if __name__ == "__main__":
    fullpath = "path/to/files/file*.txt"  # or whatever
    for filename in glob.glob(fullpath):
        a, columnnames = readFile(filename)
        plot(a, columnnames, show=True, save=False, filename=filename[:-4])
One remaining problem is the names of the columns. Since you did not provide any example data, it's hard to estimate how to exactly do that.
This all assumes that you do not have any missing data in between or anything of that kind. If that were the case, you'd need to use the additional arguments of numpy.genfromtxt() to filter the data accordingly.
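Given the #COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2 format described in the update (and the fact that the data block is space separated), the column names could be collected while scanning the header, roughly like this (a sketch; which field you want to use as the name may differ):
def readFile(filename):
    # scan the header: count lines up to #eoh and pick up column names on the way
    eoh = 0
    columnnames = []
    with open(filename, "r") as f:
        for line in f:
            eoh = eoh + 1
            if line.startswith("#COLUMNINFO="):
                # "#COLUMNINFO= NUMBER1, UNIT, MEASUREMENT, NUMBER2"
                fields = line.split("=", 1)[1].split(",")
                columnnames.append(fields[2].strip())  # use the MEASUREMENT field as the name
            if "#eoh" in line:
                break
    # the data block is whitespace separated, so no delimiter is needed
    a = np.genfromtxt(filename, skip_header=eoh)
    return a, columnnames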
I'm currently trying to make an automation script for writing new files from a master where there are two strings I want to replace (x1 and x2) with values from a 21 x 2 array of numbers (namely, [[0,1000],[50,950],[100,900],...,[1000,0]]). Additionally, with each double replacement, I want to save that change as a unique file.
Here's my script as it stands:
import numpy

lines = []
x1x2 = numpy.array([[0,1000],[50,950],[100,900],...,[1000,0]])

for i,j in x1x2:
    with open("filenamexx.inp") as infile:
        for line in infile:
            linex1 = line.replace('x1',str(i))
            linex2 = line.replace('x2',str(j))
            lines.append(linex1)
            lines.append(linex2)
    with open("filename"+str(i)+str(j)+".inp", 'w') as outfile:
        for line in lines:
            outfile.write(line)
With my current script there are a few problems. First, the string replacements are being done separately, i.e. I end up with a new file that contains the contents of the master file twice, where one line has the first change and the next line reflects the second change separately. Second, with each subsequent iteration, the new files have the contents of the previous file prepended (i.e. filename100900.inp will contain its unique contents as well as the contents of both filename01000.inp and filename50950.inp before it). Anyone think they can take a crack at solving my problem?
Note: I've looked at regex module solutions (something like this: https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch01s19.html) in order to do multiple replacements in a single pass, but I'm not sure if the way I'm indexing is translatable to a dictionary object.
I'm not sure I understood the second issue, but you can use replace more than once on the same string, so:
s = "x1 stuff x2"
s = s.replace('x1',str(1)).replace('x2',str(2))
print(s)
will output:
1 stuff 2
No need to do this twice for two different variables. As for the second issue, it just seems that you're not resetting the lines variable before starting to write a new file. So once you finish writing a file, just add:
lines = []
That should be enough to solve both issues.
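Putting both fixes together, the loop from the question would look roughly like this (a sketch, keeping the file naming from the question and only the explicitly listed array values):
import numpy

x1x2 = numpy.array([[0, 1000], [50, 950], [100, 900], [1000, 0]])  # shortened for the example

for i, j in x1x2:
    lines = []  # reset for every output file
    with open("filenamexx.inp") as infile:
        for line in infile:
            # both replacements applied to the same line, appended once
            lines.append(line.replace('x1', str(i)).replace('x2', str(j)))
    with open("filename" + str(i) + str(j) + ".inp", 'w') as outfile:
        outfile.writelines(lines)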
Could I please be advised on the following problem? I have CSV files which I would like to compare. The first contains coordinates of specific points in the genome (e.g. chr3: 987654 – 987654). The other CSV files contain coordinates of genomic regions (e.g. chr3: 135596 – 123456789). I would like to cross-compare my first file with the other files to see if any point locations in the first file overlap with any regional coordinates in the other files, and to write this set of overlaps into a separate file. To keep things simple for a start, I have drafted a simple piece of code to cross-compare two CSV files. Strangely, my code runs and prints the coordinates but does not write the point coordinates into a separate file. My first question is whether my approach (in my code) to comparing these two files is optimal, or is there a better way of doing this? Secondly, why is it not writing to a separate file?
import csv

Region = open ('Region_test1.csv', 'rt', newline = '')
reader_Region = csv.reader (Region, delimiter = ',')
DMC = open ('DMC_test.csv', 'rt', newline = '')
reader_DMC = csv.reader (DMC, delimiter = ',')
DMC_testpoint = open ('DMC_testpoint.csv', 'wt', newline ='')
writer_Exon = csv.writer (DMC_testpoint, delimiter = ',')

for col in reader_Region:
    Chr_region = col[0]
    Start_region = int(col[1])
    End_region = int(col [2])
    for col in reader_DMC:
        Chr_point = col[0]
        Start_point = int(col [1])
        End_point = int(col[2])
        if Chr_region == Chr_point and Start_region <= Start_point and End_region >= End_point:
            print (True, col)
        else:
            print (False, col)
        writer_Exon.writerow(col)

Region.close()
DMC.close()
A couple of things are wrong, not the least of which is that you never check whether your files opened successfully. The most glaring is that you never close your writer, so its output may never get flushed to disk.
That said, this is an incredibly non-optimal way to go about the program. File I/O is slow: you don't want to keep rereading everything in a factorial fashion. Given that your search requires all possible comparisons, you'll want to store at least one of the two files completely in memory, and potentially use a generator/iterator over the other if you don't wish to hold both complete data sets in memory.
Once you have both sets loaded, proceed to do your intersection checks.
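A minimal sketch of that structure, assuming the file names from the question and that each row is chromosome, start, end as in your code:
import csv

# Load all regions into memory once.
with open('Region_test1.csv', newline='') as region_file:
    regions = [(row[0], int(row[1]), int(row[2]))
               for row in csv.reader(region_file)]

# Stream the points and write every point that falls inside any region.
with open('DMC_test.csv', newline='') as point_file, \
     open('DMC_testpoint.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    for row in csv.reader(point_file):
        chrom, start, end = row[0], int(row[1]), int(row[2])
        for r_chrom, r_start, r_end in regions:
            if chrom == r_chrom and r_start <= start and r_end >= end:
                writer.writerow(row)
                break
# both files are closed by the with blocks, so the output is flushed to disk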
I'd suggest you take a look at http://docs.python.org/2/library/csv.html for how to use a csv reader, because what you are doing doesn't appear to make sense: col[0], col[1], and col[2] aren't going to be what you think they are.
These are style and readability things but:
The names of some iteration variables seem a bit off; for col in ... should probably be for token in ..., because you are processing token by token, not column by column or line by line, etc.
Additionally, it would be nice to pick a consistent naming convention for your variables; sometimes you start with uppercase, sometimes you save the uppercase for after your '_'.
Putting a space between some objects and function names and not others is also very odd. But again, these don't change the functionality of your code.