I have a text file that contains the following:
n 1 id 10 12:17:32 type 6 is transitioning
n 2 id 10 12:16:12 type 5 is active
n 2 id 10 12:18:45 type 6 is transitioning
n 3 id 10 12:16:06 type 6 is transitioning
n 3 id 10 12:17:02 type 6 is transitioning
...
I need to sort these lines in Python by the timestamp. I can read the file line by line, collect all the timestamps, and sort them with sorted(timestamps), but then I still need to rearrange the lines according to the sorted timestamps.
How do I get the indices of the sorted timestamps?
Is there a more elegant solution (I'm sure there is)?
import time

nID = []
mID = []
ts = []
ntype = []
comm = []

with open('changes.txt') as fp:
    while True:
        line = fp.readline()
        if not line:
            break
        lx = line.split(' ')
        nID.append(lx[1])
        mID.append(lx[3])
        ts.append(lx[4])
        ntype.append(lx[6])
        comm.append(lx[7:])
So now I can use sorted(ts) to sort the timestamps, but that does not give me the indices of the sorted timestamp values.
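The best I have come up with so far is an argsort-style workaround like the following (an untested sketch that keeps the whole lines around so they can be reordered by the sort indices), which is why I am asking whether there is something more elegant:

with open('changes.txt') as fp:
    lines = [line for line in fp if line.strip()]

# the timestamp is the 5th whitespace-separated field, e.g. 12:17:32
ts = [line.split()[4] for line in lines]

# indices that would sort ts, then the lines reordered by those indices
order = sorted(range(len(ts)), key=lambda i: ts[i])
sorted_lines = [lines[i] for i in order]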
I'm new to Hadoop mrjob. I have a text file where each line consists of "id groupId value". I am trying to calculate the median of all values in the text file using Hadoop MapReduce, but I'm stuck when it comes to calculating only the overall median value. What I get instead is a median value for each id, like:
"123213" 5.0
"123218" 2
"231532" 1
"234634" 7
"234654" 2
"345345" 9
"345445" 4.5
"345645" 2
"346324" 2
"436324" 6
"436456" 2
"674576" 10
"781623" 1.5
The output should be like "median value of all values is: ####". I was influenced by this article: https://computehustle.com/2019/09/02/getting-started-with-mapreduce-in-python/
My Python file median-mrjob.py:
from mrjob.job import MRJob
from mrjob.step import MRStep

class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats, combiner=self.reducer_count_stats),
            MRStep(reducer=self.reducer_sort_by_values),
            MRStep(reducer=self.reducer_retrieve_median)
        ]

    def mapper_get_stats(self, _, line):
        line_arr = line.split(" ")
        values = int(float(line_arr[-1]))
        id = line_arr[0]
        yield id, values

    def reducer_count_stats(self, key, values):
        yield str(sum(values)).zfill(2), key

    def reducer_sort_by_values(self, values, ids):
        for id in ids:
            yield id, values

    def reducer_retrieve_median(self, id, values):
        valList = []
        median = 0
        for val in values:
            valList.append(int(val))

        N = len(valList)
        # find the median
        if N % 2 == 0:
            # if N is even
            m1 = N / 2
            m2 = (N / 2) + 1
            # Convert to integer, match position
            m1 = int(m1) - 1
            m2 = int(m2) - 1
            median = (valList[m1] + valList[m2]) / 2
        else:
            m = (N + 1) / 2
            # Convert to integer, match position
            m = int(m) - 1
            median = valList[m]

        yield (id, median)

if __name__ == '__main__':
    MRMedian.run()
My original text files are about 1 million and 1 billion lines of data, but I have created a test file with arbitrary data, named input.txt:
781623 2 2.3243
781623 1 1.1243
234654 1 2.122
123218 8 2.1245
436456 22 2.26346
436324 3 6.6667
346324 8 2.123
674576 1 10.1232
345345 1 9.56135
345645 7 2.1231
345445 10 6.1232
231532 1 1.1232
234634 6 7.124
345445 6 3.654376
123213 18 8.123
123213 2 2.1232
What I care about is the values, considering that there might be duplicates. I run the code from the terminal with python median-mrjob.py input.txt.
Update: The point of the assignment is not to use any libraries, so I need to sort the list manually (or maybe only part of it, as I understood) and calculate the median manually (hard-coding). Otherwise the goal of using MapReduce disappears. Using PySpark is not allowed in this assignment. Check this link for more inspiration: Computing median in map reduce
The output should be like "median value of all values is: ####"
Then you need to force all data to one reducer first (effectively defeating the purpose of using MapReduce).
You'd do that by not using the ID as the key and discarding it
def mapper_get_stats(self, _, line):
    line_arr = line.split()
    if line_arr:  # prevent empty lines
        value = float(line_arr[-1])
        yield None, value
After that, sort and find the median (I fixed your parameter order)
def reducer_retrieve_median(self, key, values):
    import statistics
    yield None, f"median value of all values is: {statistics.median(values)}"  # automatically sorts the data
So, only two steps:
class MRMedian(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_stats),
            MRStep(reducer=self.reducer_retrieve_median)
        ]
For the given file, you should see
null "median value of all values is: 2.2938799999999997"
original text files are about 1 million and 1 billion lines of data
Not that it matters, but which is it?
You should upload the file to HDFS first; then you can use better tools than mrjob for this, such as Hive or Pig.
This is in continuation to my previous question:
How to fetch the modified rows after comparing 2 versions of same data frame
I am now done with the MODIFICATIONS; however, I am using the method below for finding the INSERTS and DELETES.
It works fine, but it takes a lot of time, typically for a CSV file which has 10 columns and 10M rows.
For my problem:
INSERTs are the records which are not in the old file but are in the new file.
DELETEs are the records which are in the old file but not in the new file.
Below is the code:
import pandas as pd

def getInsDel(df_old, df_new, key):
    # concatenating old and new data to generate comparisons
    df = pd.concat([df_new, df_old])
    df = df.reset_index(drop=True)

    # doing a group by to get the frequency of each key
    print('Grouping data for frequency of key...')
    df_gpby = df.groupby(list(df.columns))
    idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1]
    df_delta = df.reindex(idx)
    df_delta_freq = df_delta.groupby(key).size().reset_index(name='Freq')

    # Filtering data for frequency = 1, since these will be the target records for DELETE and INSERT
    print('Creating data frame to get records with Frequency = 1 ...')
    filter = df_delta_freq['Freq'] == 1
    df_delta_freq_ins_del = df_delta_freq.where(filter)

    # Dropping rows with NULL
    df_delta_freq_ins_del = df_delta_freq_ins_del.dropna()

    print('Creating data frames of Insert and Deletes ...')
    # Creating INSERT dataFrame
    df_ins = pd.merge(df_new,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')

    # Creating DELETE dataFrame
    df_del = pd.merge(df_old,
                      df_delta_freq_ins_del[key],
                      on=key,
                      how='inner')

    print('size of INSERT file: ' + str(df_ins.shape))
    print('size of DELETE file: ' + str(df_del.shape))

    return df_ins, df_del
The section where I am doing the group by for the frequency of each key takes around 80% of the total time, so for my CSV it takes around 12-15 minutes.
There must be a faster approach for doing this.
For your reference, below is my expected result:
For example, Old data is:
ID Name X Y
1 ABC 1 2
2 DEF 2 3
3 HIJ 3 4
and new data set is:
ID Name X Y
2 DEF 2 3
3 HIJ 55 42
4 KLM 4 5
Where ID is the Key.
Insert_DataFrame should be:
ID Name X Y
4 KLM 4 5
Deleted_DataFrame should be:
ID Name X Y
1 ABC 1 2
To be deleted:
delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only']
delete.dropna(axis=1, inplace=True)
To be inserted:
insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only']
insert.dropna(axis=1, inplace=True)
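For example, applying those two snippets to the small old/new frames from the question (a quick sketch; the frames are typed in by hand here):

import pandas as pd

old = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['ABC', 'DEF', 'HIJ'],
                    'X': [1, 2, 3], 'Y': [2, 3, 4]})
new = pd.DataFrame({'ID': [2, 3, 4], 'Name': ['DEF', 'HIJ', 'KLM'],
                    'X': [2, 55, 4], 'Y': [3, 42, 5]})

delete = pd.merge(old, new, how='left', on='ID', indicator=True)
delete = delete.loc[delete['_merge'] == 'left_only']
delete.dropna(axis=1, inplace=True)  # drops the all-NaN columns that came from `new`
print(delete)                        # the row with ID 1

insert = pd.merge(new, old, how='left', on='ID', indicator=True)
insert = insert.loc[insert['_merge'] == 'left_only']
insert.dropna(axis=1, inplace=True)
print(insert)                        # the row with ID 4

The surviving columns carry _x suffixes (and the _merge indicator column remains), so they can be renamed or dropped afterwards if needed.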
I have a data frame with 384 rows (and an additional dummy one at the beginning).
Each row has 4 variables I wrote manually, 3 calculated fields based on those 4 variables,
and 3 that compare each calculated variable to the previous row. Each of these fields can take one of two values (basically True/False).
Final goal: I want to arrange the data frame in such a way that each of the 64 possible combinations of the 6 calculated fields (2^6) occurs 6 times (2^6 * 6 = 384).
Each iteration builds a frequency table (pivot), and if one of the counts differs from 6 it breaks and re-randomizes the order.
The problem is that there are 384!-12*6! possible combinations, and my computer has been running the following script for over 4 days without finding a solution.
import pandas as pd
from numpy import random

# a function that calculates if a row is congruent or in-congruent
def set_cong(df):
    if df["left"] > df["right"] and df["left_size"] > df["right_size"] or df["left"] < df["right"] and df["left_size"] < df["right_size"]:
        return "Cong"
    else:
        return "InC"

# open file and calculate the basic fields
DF = pd.read_csv("generator.csv")
DF["distance"] = abs(DF.right - DF.left)
DF["CR"] = DF.left > DF.right
DF["Cong"] = DF.apply(set_cong, axis=1)

again = 1
# main loop to try and find optimal order
while again == 1:
    # make a copy of the DF to not have to load it each iteration
    df = DF.copy()
    again = 0
    df["rand"] = [[random.randint(low=1, high=100000)] for i in range(df.shape[0])]
    # as 3 of the fields are calculated based on the previous row, the first one is a dummy and when sorted needs to stay first
    df.rand.loc[0] = 0
    Sorted = df.sort_values(['rand'])
    Sorted["Cong_n1"] = Sorted.Cong.eq(Sorted.Cong.shift())
    Sorted["Side_n1"] = Sorted.CR.eq(Sorted.CR.shift())
    Sorted["Dist_n1"] = Sorted.distance.eq(Sorted.distance.shift())
    # here the dummy is deleted
    Sorted = Sorted.drop(0, axis=0)
    grouped = Sorted.groupby(['distance', 'CR', 'Cong', 'Cong_n1', 'Dist_n1', "Side_n1"])
    for name, group in grouped:
        if group.shape[0] != 6:
            again = 1
            break

Sorted.to_csv("Edos.csv", sep="\t", index=False)
print("bye")
the data frame looks like this:
left right size_left size_right distance cong CR distance_n1 cong_n1 side_n1
1 6 22 44 5 T F dummy dummy dummy
5 4 44 22 1 T T F T F
2 3 44 22 1 F F T F F
I'm stuck in the process of creating a list of columns. I tried to avoid using defaultdict.
Thanks for any help!
Here is my code:
import csv

# Read CSV file
with open('input.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    #-----------#
    row_list = []
    column_list = []
    year = []
    suburb = []
    for each in reader:
        row_list = row_list + [each]
        year = year + [each[0]]  # create list of years
        suburb = suburb + [each[2]]  # create list of suburbs
        for (i, v) in enumerate(each[3:-1]):
            column_list[i].append(v)
            #print i,v
    #print column_list[0]
My error message:
19 suburb = suburb + [each[2]]#create list of suburb
20 for i,v in enumerate(each[3:-1]):
---> 21 column_list[i].append(v)
22 #print i,v
23 #print column_list[0]
IndexError: list index out of range
printed result of (i,v):
0 10027
1 14513
2 3896
3 23362
4 77966
5 5817
6 24699
7 9805
8 62692
9 33466
10 38792
0 0
1 122
2 0
3
4 137
5 0
6 0
7
8
9 77
10
Basically, I want the lists to look like this:
column[0]=['10027','0']
column[1]=['14513','122']
A sample of my CSV file was attached as an image (not reproduced here).
Yes, as Alex mentioned, the problem is indeed due to trying to access an index before creating/initializing it. As an alternative solution you can also consider this:
for (i, v) in enumerate(each[3:-1]):
    if len(column_list) < i + 1:
        column_list.append([])
    column_list[i].append(v)
Hope it helps!
The error happens because column_list is empty and so you can't access column_list[i] because it doesn't exist. It doesn't matter that you want to append to it because you can't append to something nonexistent, and appending doesn't create it from scratch.
column_list = defaultdict(list) would indeed solve this but since you don't want to do that, the simplest is to make sure that column_list starts out with plenty of empty lists to append to. Like this:
column_list = [[] for _ in range(size)]
where size is the number of columns, the length of each[3:-1], which is apparently 11 according to your output.
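Applied to the reading loop from the question, that might look like the following sketch (size is assumed to be 11 here, matching the printed (i, v) output above):

import csv

size = 11                                # number of value columns in each[3:-1]
column_list = [[] for _ in range(size)]  # one empty list per column, ready to append to
row_list, year, suburb = [], [], []

with open('input.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for each in reader:
        row_list.append(each)
        year.append(each[0])      # list of years
        suburb.append(each[2])    # list of suburbs
        for i, v in enumerate(each[3:-1]):
            column_list[i].append(v)

print(column_list[0])   # e.g. ['10027', '0'] for the sample shown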
I have written a program in Python where I use a hash table to read data from a file and then add up the data in the last column for entries that share the same value in the 2nd column. For example, for all entries in column 2 with the same value, the corresponding last-column values are added together.
I have implemented the above successfully. Now I want to sort the table in descending order according to the last-column totals and print those values along with the corresponding 2nd-column (key) values. I am not able to figure out how to do this. Can anyone please help?
The pmt.txt file is of the form:
0.418705 2 3 1985 20 0
0.420657 4 5 119 3849 5
0.430000 2 3 1985 20 500
and so on...
So, for example, for the number 2 in column 2, I have added up all the last-column data corresponding to every '2' in the 2nd column. This process then continues for the next set of numbers like 4, 5, etc. in column 2.
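For instance, with the three sample lines above, the two lines with '2' in column 2 have 0 and 500 in the last column, so (together with the +40 my code adds to every value) the stored total for key '2' would be (0 + 40) + (500 + 40) = 580, and the total for key '4' would be 5 + 40 = 45.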
I'm using python 3
import math

source_ip = {}

f = open("pmt.txt", "r", 1)
lines = f.readlines()
for line in lines:
    s_ip = line.split()[1]
    bit_rate = int(line.split()[-1]) + 40
    if s_ip in source_ip.keys():
        source_ip[s_ip] = source_ip[s_ip] + bit_rate
        print(source_ip[s_ip])
    else:
        source_ip[s_ip] = bit_rate
f.close()

for k in source_ip.keys():
    print(str(k) + ": " + str(source_ip[k]))
    print("-----------")
It sounds like you want to use the sorted function with a key parameter that gets the value from the key/value tuple:
sorted_items = sorted(source_ip.items(), key=lambda x: x[1])
You could also use itemgetter from the operator module, rather than a lambda function:
import operator
sorted_items = sorted(source_ip.items(), key=operator.itemgetter(1))
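Since you want descending order, reverse=True can be added, and the key/total pairs printed; for example (a small sketch building on the snippet above):

sorted_items = sorted(source_ip.items(), key=operator.itemgetter(1), reverse=True)
for key, total in sorted_items:
    print('{}: {}'.format(key, total))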
How about something like this?
#!/usr/local/cpython-3.4/bin/python

import collections

source_ip = collections.defaultdict(int)

with open("pmt.txt", "r", 1) as file_:
    for line in file_:
        fields = line.split()
        s_ip = fields[1]
        bit_rate = int(fields[-1]) + 40
        source_ip[s_ip] += bit_rate
        print(source_ip[s_ip])

# sort descending by the summed last-column value, as the question asks
for key, value in sorted(source_ip.items(), key=lambda kv: kv[1], reverse=True):
    print('{}: {}'.format(key, value))
    print("-----------")