Spark Remove Line (Python)

I have a dataset that I am running statistical functions on, and I may need to remove the first and last lines (depending on whether the file has a header and a trailer). What would be the easiest way to accomplish this?
dataSplit = sc.textFile(inputFile).map(lambda line: line.split(","))

I'm just a beginner with Spark, but I think this would work. Please correct me if it doesn't, or if there are better practices.
# get file
inputRDD = sc.textFile(inputFile).cache()
# get header
header = inputRDD.first()
# get trailer, but be careful with large RDDs and collect()!
trailer = inputRDD.collect()[-1]
# remove header and trailer
# (note: this also drops any data line identical to the header or trailer)
filtered_inputRDD = inputRDD.filter(lambda x: x != header).filter(lambda x: x != trailer)
# afterwards you can split
dataSplit = filtered_inputRDD.map(lambda line: line.split(","))
I tried something different to get the trailer in a more efficient way:
# this is a helper function which iterates through
# the part it gets and returns the last item of the part
#
# item is set to "empty" in case part is empty
# replace it with desired output for empty parts
def iterate(part):
    item = "empty"
    my_iter = iter(part)
    for item in my_iter:
        pass
    return item
# instead of collecting the RDD and returning the last item
# it now does a mapPartitions first and iterates through every part
# and returns the last items of every partition
# then you only have to collect [numPartitions] rows and the
# selection of the last item is much easier
trailer_efficient = inputRDD.mapPartitions(lambda x: [iterate(x)]).collect()[-1]
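One caveat with the approach above: if the last partition happens to be empty, collect()[-1] returns the placeholder rather than the real trailer. The sketch below illustrates the "last item per partition" idea while skipping empty partitions. It uses plain Python lists to stand in for Spark partitions, and the names last_item and find_trailer are hypothetical helpers, not part of the Spark API.

```python
def last_item(part, default=None):
    """Return the last item of an iterable, or `default` if it is empty."""
    item = default
    for item in part:
        pass
    return item


def find_trailer(partitions):
    """Pick the last item of the last non-empty partition."""
    lasts = [last_item(p) for p in partitions]
    non_empty = [x for x in lasts if x is not None]
    return non_empty[-1] if non_empty else None


# Plain lists stand in for Spark partitions (an assumption for illustration).
partitions = [["header", "a,1"], ["b,2", "trailer"], []]
trailer = find_trailer(partitions)
print(trailer)
```

In Spark itself, the same filtering could presumably be applied to the collected per-partition results before taking the last element.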

Related

How to decode a list and remove items from two lists when there is a match in both of them based on an index?

I have two lists which contain the following type of information.
List #1:
Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3", ...same format elements]
List #2:
Reply_List = ["1/1/0", "1/3/1", "1/2/0", ...same format elements]
From the "Reply" list, I want to take the second field of each "#/#/#" string (here 1, 3, 2, and so on) and check whether it matches the second field of an item in the "Request" list. If there is a match, I want to return a new list containing the third field of the request string with the third field of the matching reply string appended.
The result would be like the following.
Result = ["1.34.0", "1.3.5.0", "1.2.3.1"]
Note that the 0 was appended to 1.34, the 0 was appended to 1.3.5, and the 1 was appended to 1.2.3, each taken from the entry in the "Reply" list whose second field matches. The matching item could be placed anywhere in the "Reply" list.
My code for the problem stated above is shown below.
def get_list_of_error_codes(self, Reply_List, Request_List):
    # I am not sure if this is the right way to decode all the elements in the list?
    decoded_Reply_List = Reply_List.decode("utf-8")
    Result = [
        f"{i.split('/')[-1]}.{j.split('/')[-1]}"
        for i in Request_List
        for j in decoded_Reply_List
        if (i.split("/")[1] == j.split("/")[1])
    ]
    return Result
res = get_list_of_error_codes(Reply_List, Request_List)
print(res) # ["1.34.0", "1.3.5.0", "1.2.3.1"]
Issues I am facing right now:
I am NOT sure if I decode the Reply_List correctly and in the proper manner. Can someone help me also verify this?
I am not sure on how to also remove the corresponding items for the Reply_List and Request_List when I find a match based on the condition if (i.split("/")[1] == j.split("/")[1]).

You can use list comprehension to decode the list:
decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
In this case, if you wanted to also remove items from the list while you create the new list, I would say list comprehension isn't the right move. Just go with the nested for loops:
def get_list_of_error_codes(self, Reply_List, Request_List):
    decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
    Result = []
    for i in list(Request_List):
        for j in decoded_Reply_List:
            if (i.split("/")[1] == j.split("/")[1]):
                Result.append(f"{i.split('/')[-1]}.{j.split('/')[-1]}")
                Reply_List.remove(j)
                break
        else:
            continue
        Request_List.remove(i)
    return Result
Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3"]
Reply_List = [b"1/1/0", b"1/3/1", b"1/2/0"]
print(get_list_of_error_codes("Foo", Reply_List, Request_List))
# Intended output: ['1.34.0', '1.3.5.0', '1.2.3.1']
# (see the EDIT below: as written, Reply_List.remove(j) raises ValueError)
Some things to note:
I added a break so that we don't keep looking for matches if we find one. It will only match the first pair, then move on.
In for i in list(Request_List), I added the list() cast to effectively make a copy of the list. This allows us to remove entries from Request_List without disrupting the loop. I didn't do this for for j in decoded_Reply_List because it's already a copy of Reply_List. (I assumed you wanted to remove the entries from Reply_List)
The last is the else: continue. We don't want to reach Request_List.remove(i) if we didn't find a match. If break is called, else will not be called, which means we will reach Request_List.remove(i). But if the loop completes without finding a match, the loop will then enter else and we will skip the removal step by calling continue
EDIT:
Actually, Reply_List.remove(j) fails: j has been decoded to a str in this method, so it no longer equals the bytes entry stored in Reply_List, and remove() raises a ValueError. Here's some revised code which solves this issue:
def get_list_of_error_codes(Reply_List, Request_List):
    # decoded_Reply_List = [li.decode(encoding='utf-8') for li in Reply_List]
    Result = []
    for i in list(Request_List):
        for j in list(Reply_List):
            dj = j.decode(encoding='utf-8')
            if (i.split("/")[1] == dj.split("/")[1]):
                Result.append(f"{i.split('/')[-1]}.{dj.split('/')[-1]}")
                Reply_List.remove(j)
                break
        else:
            continue
        Request_List.remove(i)
    return Result
Request_List = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3"]
Reply_List = [b"1/1/0", b"1/3/1", b"1/2/0"]
print("Result: ", get_list_of_error_codes(Reply_List, Request_List))
print("Reply_List: ", Reply_List)
print("Request_List: ", Request_List)
# Output:
# Result: ['1.34.0', '1.3.5.0', '1.2.3.1']
# Reply_List: []
# Request_List: []
What I've done is that instead of creating a separate decoded list, I just decode the entries as they're looped through, and then remove the un-decoded entry from Reply_List. This should be a little more efficient too, since we're not looping through Reply_List twice now.
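If the lists grow large, the nested loops become quadratic. As an alternative sketch (not the approach used in the answer above), the replies can be indexed by their second field in a dictionary first. This assumes that the replies are bytes, that every string has exactly three "/"-separated fields, and that second fields are unique within the reply list. The function name match_error_codes is hypothetical, and it returns the leftovers instead of mutating its inputs.

```python
def match_error_codes(reply_list, request_list):
    """Pair each request with the reply that shares its second field."""
    # Map the second field of each reply to its third field.
    # Assumption: second fields are unique within reply_list.
    reply_by_key = {}
    for raw in reply_list:
        _, key, value = raw.decode("utf-8").split("/", 2)
        reply_by_key[key] = value

    result = []
    unmatched_requests = []
    for request in request_list:
        _, key, payload = request.split("/", 2)
        if key in reply_by_key:
            # pop() removes the matched reply so it cannot match twice
            result.append(f"{payload}.{reply_by_key.pop(key)}")
        else:
            unmatched_requests.append(request)
    # Whatever is left in reply_by_key never matched a request.
    return result, unmatched_requests, reply_by_key


requests = ["1/1/1.34", "1/2/1.3.5", "1/3/1.2.3"]
replies = [b"1/1/0", b"1/3/1", b"1/2/0"]
result, unmatched, leftover = match_error_codes(replies, requests)
print(result)
```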

Python - Get item from a list under a list

I have a list like below.
list = [[Name,ID,Age,mark,subject],[karan,2344,23,87,Bio],[karan,2344,23,87,Mat],[karan,2344,23,87,Eng]]
I need to get only the name 'karan' as output.
How can I get that?

This is a 2D list.
list[i][j]
will give you the i-th list within your list and the j-th item within that list.
So to get karan you want list[1][0].

I upvoted Lio Elbammalf, but decided to provide an answer that made a couple of assumptions that should have been clarified in the question:
1. The first item of the list holds the headers; they are actually in the list (not just added for the question), and they are provided because there is no guarantee that the columns will always be in the same order.
2. This is probably a CSV file.
Ignoring 2 for the moment, what you would want to do is remove the "headers" from the list (so that the rest of the list is uniform), and then find the index of "Name" (your desired output).
myinput = [["Name","ID","Age","mark","subject"],
           ["karan",2344,23,87,"Bio"],
           ["karan",2344,23,87,"Mat"],
           ["karan",2344,23,87,"Eng"]]
## Remove the headers from the list to simplify everything
headers = myinput.pop(0)
## Figure out where to find the person's Name
nameindex = headers.index("Name")
## Build a list of the Name in each row
names = [stats[nameindex] for stats in myinput]
If the name is guaranteed to be the same in each row, then you can just use myinput[0][nameindex], as suggested in the other answer.
Now, if 2 is true, I'm assuming you're using the csv module, in which case load the file using the DictReader class and then just access each row using the 'Name' key:
import csv
def loadfile(myfile):
    with open(myfile) as f:
        reader = csv.DictReader(f)
        return list(reader)
def getname(rows):
    ## This is the same return as above, and again you can just
    ## return rows[0]['Name'] if you know you only need the first one
    return [row['Name'] for row in rows]
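To try the DictReader approach without a file on disk, here is a small sketch that reads the same rows from an in-memory string. The inline CSV text is an assumption for illustration, mirroring the data in the question. Note that DictReader returns every value as a string, including the numeric columns.

```python
import csv
import io

# Inline CSV text stands in for a real file (an assumption for illustration).
csv_text = (
    "Name,ID,Age,mark,subject\n"
    "karan,2344,23,87,Bio\n"
    "karan,2344,23,87,Mat\n"
)

# DictReader uses the first row as the keys for every following row.
rows = list(csv.DictReader(io.StringIO(csv_text)))
names = [row["Name"] for row in rows]
print(names)
```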

In Python 3 you can do this
_, [x, _, _, _, _], *_ = ls
Now x will be karan.

Parsing sequences from a FASTA file in python

I have a text file:
>name_1
data_1
>name_2
data_2
>name_3
data_3
>name_4
data_4
>name_5
data_5
I want to store the headers (name_1, name_2, ...) in one list and the data (data_1, data_2, ...) in another list in a Python program.
def parse_fasta_file(fasta):
    desc=[]
    seq=[]
    seq_strings = fasta.strip().split('>')
    for s in seq_strings:
        if len(s):
            sects = s.split()
            k = sects[0]
            v = ''.join(sects[1:])
            desc.append(k)
            seq.append(v)
for l in sys.stdin:
    data = open('D:\python\input.txt').read().strip()
    parse_fasta_file(data)
    print seq
This is the code I have tried, but I am not able to get the answer.

The most fundamental error is trying to access a variable outside of its scope.
def function (stuff):
    seq = whatever
function('data')
print seq ############ error
You cannot access seq outside of function. The usual way to do this is to have function return a value, and capture it in a variable within the caller.
def function (stuff):
    seq = whatever
    return seq
s = function('data')
print s
(I have deliberately used different variable names inside the function and outside. Inside function you cannot access s or data, and outside, you cannot access stuff or seq. Incidentally, it would be quite okay, but confusing to a beginner, to use a different variable with the same name seq in the mainline code.)
With that out of the way, we can attempt to write a function which returns a list of sequences and a list of descriptions for them.
def parse_fasta (lines):
    descs = []
    seqs = []
    data = ''
    for line in lines:
        if line.startswith('>'):
            if data: # have collected a sequence, push to seqs
                seqs.append(data)
                data = ''
            descs.append(line[1:]) # Trim '>' from beginning
        else:
            data += line.rstrip('\r\n')
    # there will be yet one more to push when we run out
    seqs.append(data)
    return descs, seqs
This isn't particularly elegant, but should get you started. A better design would be to return a list of (description, data) tuples where the description and its data are closely coupled together.
descriptions, sequences = parse_fasta(open('file', 'r').read().split('\n'))
The sys.stdin loop in your code does not appear to do anything useful.
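The tuple-based design mentioned above could look roughly like the following sketch. It is written in Python 3 syntax, and the name parse_fasta_records is hypothetical. Lines that appear before the first '>' header are ignored, which is an assumption about how stray input should be handled.

```python
def parse_fasta_records(lines):
    """Return a list of (description, sequence) tuples."""
    records = []
    desc = None      # description of the record being collected
    chunks = []      # sequence lines collected for that record
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if desc is not None:
                records.append((desc, "".join(chunks)))
            desc = line[1:]  # trim '>' from the beginning
            chunks = []
        elif desc is not None:
            chunks.append(line.strip())
    # push the final record once the input runs out
    if desc is not None:
        records.append((desc, "".join(chunks)))
    return records


sample = ">name_1\ndata_1\n>name_2\ndata_2\n"
records = parse_fasta_records(sample.splitlines())
print(records)
```

Because each description travels with its own data, a record with an empty sequence cannot shift the two lists out of step.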

Is there a way to do it faster?

ladder has around 15,000 elements, and this code snippet runs in 5-8 seconds. Is there any way to make it faster? I tried it without checking for duplicates and without building the accs list, and the time dropped to 2-3 seconds, but I don't want duplicates in the CSV file.
I'm working in Python 2.7.9.
accs = []
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    row = ''
    for element in ladder:
        if element['account']['name'] not in accs:
            accs.append(element['account']['name'])
            row += element['account']['name']
            if 'twitch' in element['account']:
                row += "," + element['account']['twitch']['name'] + ","
            else:
                row += ",,"
            row += str(element['account']['challenges']['total']) + "\n"
    out.write(row)

seen = set()
results = []
for user in ladder:
    acc = user['account']
    name = acc['name']
    if name not in seen:
        seen.add(name)
        twitch_name = acc['twitch']['name'] if "twitch" in acc else ''
        challenges = acc['challenges']['total']
        results.append("%s,%s,%d" % (name, twitch_name, challenges))
with codecs.open('test.csv','w', encoding="UTF-8") as out:
    out.write("\n".join(results))

You can’t do much about the loop, since you need to go through every element in ladder after all. However, you can improve this membership test:
if element['account']['name'] not in accs:
Since accs is a list, this will essentially loop through all items of accs and check if the name is in there. And you loop for every element in ladder, so this can easily become very inefficient.
Use a set instead of a list for accs, as this gives you constant-time membership lookup on average. That reduces your algorithm from quadratic to linear complexity. For that, just use accs = set() and change your code to use accs.add() instead of append.
Another issue is that you are doing string concatenation. Every time you do someString + "something" you are throwing away that string object and creating a new one. This can become inefficient for a high number of operations too. Instead, use a list here to collect all the elements you want to write, and then join them:
row = []
row.append(element['account']['name'])
if 'twitch' in element['account']:
    row.append(element['account']['twitch']['name'])
else:
    row.append('')
row.append(str(element['account']['challenges']['total']))
out.write(','.join(row))
out.write('\n')
Alternatively, since you are writing to a file anyway, you could just call out.write multiple times with each string part.
Finally, you could also look into the csv module if you are interested in writing out CSV data.
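Putting the set lookup and the csv module together might look like the sketch below. It is written for Python 3 (the question uses 2.7.9) and writes to an in-memory buffer so it can run on its own. The sample ladder entries and the function name write_ladder are made up for illustration.

```python
import csv
import io

# Hypothetical sample data shaped like the `ladder` entries in the question.
ladder = [
    {"account": {"name": "alice", "twitch": {"name": "alice_tv"},
                 "challenges": {"total": 3}}},
    {"account": {"name": "bob", "challenges": {"total": 5}}},
    {"account": {"name": "alice", "twitch": {"name": "alice_tv"},
                 "challenges": {"total": 3}}},
]


def write_ladder(ladder, out):
    """Write one CSV row per unique account name to the file-like `out`."""
    seen = set()  # set membership tests take constant time on average
    writer = csv.writer(out, lineterminator="\n")
    for element in ladder:
        acc = element["account"]
        if acc["name"] in seen:
            continue  # skip duplicates
        seen.add(acc["name"])
        twitch_name = acc["twitch"]["name"] if "twitch" in acc else ""
        writer.writerow([acc["name"], twitch_name, acc["challenges"]["total"]])


buffer = io.StringIO()
write_ladder(ladder, buffer)
print(buffer.getvalue())
```

The csv writer also takes care of quoting, so a name containing a comma would not corrupt the row.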

python: getting rid of values from a list

drug_input=['MORPHINE','CODEINE']
def some_function(drug_input):
    generic_drugs_mapping={'MORPHINE':0,
                           'something':1,
                           'OXYCODONE':2,
                           'OXYMORPHONE':3,
                           'METHADONE':4,
                           'BUPRENORPHINE':5,
                           'HYDROMORPHONE':6,
                           'CODEINE':7,
                           'HYDROCODONE':8}
row is a list.
I would like to set every member of row to '' EXCEPT those at the indexes that drug_input maps to, in this case 0 and 7.
So row[1,2,3,4,5,6,8]=''
If row is initially:
row[0]='blah'
row[1]='bla1'
...
...
row[8]='bla8'
I need:
row[0]='blah' (same as before)
row[1]=''
row[2]=''
row[3]=''
...
...
row[7]='bla7'
row[8]=''
How do I do this?

You could first create a set of all the indexes that should be kept, and then set all the other ones to '':
keep = set(generic_drugs_mapping[drug] for drug in drug_input)
for i in range(len(row)):
    if i not in keep:
        row[i] = ''
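As a quick check of the idea, here is a self-contained sketch that builds a new list instead of modifying row in place. The row values and the trimmed-down mapping are made up for illustration.

```python
# Subset of the mapping from the question (enough for this example).
generic_drugs_mapping = {"MORPHINE": 0, "OXYCODONE": 2, "CODEINE": 7}
drug_input = ["MORPHINE", "CODEINE"]

# Hypothetical row with nine entries: 'bla0' ... 'bla8'.
row = ["bla%d" % i for i in range(9)]

# Indexes to keep, then blank out everything else.
keep = {generic_drugs_mapping[drug] for drug in drug_input}
blanked = [value if i in keep else "" for i, value in enumerate(row)]
print(blanked)
```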

I'd set up a defaultdict unless you really need it to be a list:
from collections import defaultdict # put this at the top of the file
class EmptyStringDict(defaultdict):
    __missing__ = lambda self, key: ''
newrow = EmptyStringDict()
for drug in drug_input:
    keep = generic_drugs_mapping[drug]
    newrow[keep] = row[keep]
saved_len = len(row) # use this later if you need the old row length
row = newrow
Having a list that's mostly empty strings is wasteful. This will build an object that returns '' for every value except the ones actually inserted. However, you'd need to change any iterating code to use xrange(saved_len). Ideally, though, you would just modify the code that uses the list so as not to need such a thing.
If you really want to build the list:
newrow = [''] * len(row) # build a list of empty strings
for drug in drug_input:
    keep = generic_drugs_mapping[drug]
    newrow[keep] = row[keep] # fill it in where we need to
row = newrow # throw the rest away
