I have been working with a dataset that has been reduced to the following structure:
10,47,110,296,318,356,364,377,454,527,539,590,593,597,648,858,1097,1197,1206,1214,1221,1265,1291,1721,1961,2571,2628,2706,2716,3147,3578,3717,3793,4306,4993,5952,6539,7153,7438
Where each row of the RDD has the above structure.
I am attempting to count each pair within the row and insert the counts into a dictionary. A sample output for this dictionary would be:
(10,47): 1, (10, 110):1, (10,296):1 etc.
I was able to get a basic implementation working, but it was taking ten minutes longer on larger datasets than a simpler non-dictionary approach in PySpark (I am practicing the pairs and stripes MapReduce algorithms).
Previously, I was calling my own reduce function that would iterate through all the combinations of pairs and then emit the counts. Is there a better way to do this?
The end goal is to count the pairs in each row of an RDD and end up with a dictionary of (val1, val2): count
With the above data example as an RDD called dataRDD, I have been performing the following:
pairCount = dataRDD.map(combinePairs)
Where combinePairs is defined as
from collections import defaultdict
import itertools

goodDict = defaultdict(int)

def combinePairs(data):
    # split the comma-separated row and count every pair of values in it
    data = data.split(',')
    for v in itertools.combinations(data, 2):
        first = v[0]
        second = v[1]
        pair = (first, second)
        goodDict[pair] = goodDict[pair] + 1
    return goodDict
Any suggestions greatly appreciated
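For reference, a minimal sketch of a pairs-style alternative, assuming the same comma-separated rows in dataRDD as above: rather than mutating a shared dictionary inside map (each Spark task works on its own copy of goodDict, so counts are never merged across partitions), emit every pair from flatMap and let reduceByKey do the counting.
import itertools

def emitPairs(line):
    # emit ((first, second), 1) for every pair combination in the row
    values = line.split(',')
    return [((a, b), 1) for a, b in itertools.combinations(values, 2)]

pairCounts = dataRDD.flatMap(emitPairs).reduceByKey(lambda x, y: x + y)
# collectAsMap() brings the result back to the driver as the desired {(val1, val2): count} dictionary
pairCountDict = pairCounts.collectAsMap()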
Related
I have an RDD like this:
[('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
and I want to remove first N elements from it.
For example if N = 3, then the new RDD should be like this:
[('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)]
I had to do some maps and reduces, and as you might know, map and reduce are only available on RDDs. More importantly, I have a task that must be done with RDDs only.
I'm new to PySpark and don't know how to do it. I've looked for an answer, but didn't find anything.
I'd like to just remove the first elements and not iterate through all elements.
Thank you for any help you can offer.
I have never used PySpark before, but is it possible to get the first n elements first, and then filter them out of the RDD?
Here is some code that I tried to write, but I am not sure whether it will work. I referred to How to remove elements and how to delete elemts from one rdd based on other rdd and create new rdd in pyspark?
from pyspark import SparkContext
sc = SparkContext('local')
n = 3
rdd = sc.parallelize([('FRO11987', 104),('SNA90258', 550),('ELE91550', 23),('ELE52966', 380),('FRO90334', 63),('FRO84225', 74),('SNA80192', 258)])
first_list = rdd.take(n)  # take() already returns a plain Python list, so no collect() is needed
filtered_rdd = rdd.filter(lambda x: x not in first_list)
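For comparison, a hedged alternative that avoids collecting anything to the driver: zipWithIndex attaches a position to each element, so the first n elements can be dropped by index (this also behaves correctly if the same value appears again later in the RDD).
filtered_rdd = (rdd.zipWithIndex()
                   .filter(lambda pair: pair[1] >= n)
                   .map(lambda pair: pair[0]))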
I am iterating through a pandas dataframe (df) and adding scores to a dictionary containing python lists (scores):
for index, row in df.iterrows():
    scores[row["key"]][row["pos"]] = scores[row["key"]][row["pos"]] + row["score"]
The scores dictionary initially is not empty. The dataframe is very large and this loop takes a long time. Is there a way to do this without a loop or speed it up in some other way?
A for loop seems somewhat inevitable, but we can speed things up with NumPy's fancy indexing and Pandas' groupby:
# group the scores over `key` and gather them in a list
grouped_scores = df.groupby("key").agg(list)

# for each key, value in the dictionary...
for key, val in scores.items():
    # first lookup the positions to update and the corresponding scores
    pos, score = grouped_scores.loc[key, ["pos", "score"]]
    # then fancy indexing with `pos`: reaching all positions at once
    scores[key][pos] += score
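One caveat on the fancy-indexing step: it assumes the values in scores are NumPy arrays rather than plain Python lists, and if the same (key, pos) combination occurs more than once in df, np.add.at is needed so that repeated positions accumulate. A toy illustration with invented numbers:
import numpy as np

arr = np.zeros(5)
pos = [0, 2, 2]             # position 2 appears twice
score = [1.0, 0.5, 0.5]
np.add.at(arr, pos, score)  # accumulates duplicates; arr[pos] += score would apply only one of them
print(arr)                  # [1.  0.  1.  0.  0.]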
I want to change the target value in the data set within a certain interval. Doing this with 500 rows takes about 1.5 seconds, but I have around 100,000 rows. Most of the execution time is spent in this process. I want to speed this up.
What is the fastest and most efficient way to append rows to a DataFrame?
I tried the solution in this link and tried to create a dictionary, but I couldn't get it to work.
Here is the code, which takes around 1.5 seconds for 500 rows.
def add_new(df, base, interval):
    df_appended = pd.DataFrame()
    np.random.seed(5)
    s = np.random.normal(base, interval / 3, 4)
    s = np.append(s, base)
    for i in range(0, 5):
        df_new = df
        df_new["DeltaG"] = s[i]
        df_appended = df_appended.append(df_new)
    return df_appended
DataFrames in pandas are contiguous pieces of memory, so appending or concatenating DataFrames is very inefficient: these operations create a new DataFrame and copy over all of the data from the old ones.
Basic Python structures such as lists and dicts are not: when you append a new element, Python just creates a pointer to the new element of the structure.
So my advice: do all your data processing on lists or dicts and convert them to a DataFrame at the end.
Another option is to create a preallocated DataFrame of the final size and just change values in it using .iloc. But this only works if you know the final size of your resulting DataFrame.
Good examples with code: Add one row to pandas DataFrame
If you need more code examples - let me know.
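For illustration, here is a hedged sketch of that advice applied to the add_new function from the question (the column name "DeltaG" and the random draw are taken from the question; the rest is an assumption): gather the pieces in a plain Python list and concatenate once at the end.
import numpy as np
import pandas as pd

def add_new(df, base, interval):
    # same random draw as in the question
    np.random.seed(5)
    s = np.append(np.random.normal(base, interval / 3, 4), base)
    parts = []                                  # plain Python list: cheap to append to
    for value in s:
        part = df.copy()                        # copy so the original df is not mutated
        part["DeltaG"] = value
        parts.append(part)
    return pd.concat(parts, ignore_index=True)  # a single concatenation at the end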
def add_new(df1, base, interval, has_interval):
    dictionary = {}
    if has_interval == 0:
        for i in range(0, 5):
            dictionary[i] = df1.copy()
    elif has_interval == 1:
        np.random.seed(5)
        s = np.random.normal(base, interval / 3, 4)
        s = np.append(s, base)
        for i in range(0, 5):
            df_new = df1
            df_new[4] = s[i]
            dictionary[i] = df_new.copy()
    return dictionary
It works. It takes around 10 seconds for the whole data set. Thanks for your answers.
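As a small follow-up (an assumption on my part, in case a single DataFrame is eventually needed): the dictionary of DataFrames can be combined once at the end instead of piece by piece.
import pandas as pd

d = add_new(df1, base, interval, has_interval)
result = pd.concat(list(d.values()), ignore_index=True)  # one concat over all five copies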
I have a dictionary "c" with 30000 keys and around 600000 unique values (around 20 unique values per key)
I want to create a new pandas series 'DOC_PORTL_ID' that takes the value from each row of column 'image_keys', looks up which key it belongs to in my dictionary, and returns that key. So I wrote a function like this:
def find_match(row, c):
    for key, val in c.items():
        for item in val:
            if item == row['image_keys']:
                return key
and then I use .apply to create my new column like:
df_image_keys['DOC_PORTL_ID'] = df_image_keys.apply(lambda x: find_match(x, c), axis =1)
This takes a long time. I am wondering if I can improve my snippet to make it faster.
I googled a lot and was not able to find the best way of doing this. Any help would be appreciated.
You're using your dictionary as a reverse lookup. And frankly, you haven't given us enough information about the dictionary. Are the 600,000 values unique? If not, you're only returning the first one you find. Is that expected?
Assuming they are unique:
reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys['DOC_PORTL_ID'] = df_image_keys['image_keys'].map(reverse_dict)
This is as good as you've done yourself. If those values are not unique, you'll have to provide a better explanation of what you expect to happen.
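For completeness, a toy end-to-end example of the reverse lookup (the dictionary and DataFrame contents here are invented for illustration):
import pandas as pd

c = {"DOC1": ["imgA", "imgB"], "DOC2": ["imgC"]}
df_image_keys = pd.DataFrame({"image_keys": ["imgC", "imgA", "imgZ"]})

reverse_dict = {val: key for key, values in c.items() for val in values}
df_image_keys["DOC_PORTL_ID"] = df_image_keys["image_keys"].map(reverse_dict)
# image keys that are not in the dictionary come back as NaN ("imgZ" here)
print(df_image_keys)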
I need to convert an RDD with two rows into an RDD with one row. Example:
rdd1=a
b
I need:
rdd2=(a,b)
How can I do this step in PySpark?
The question may be stupid, but I'm new to Spark.
"UPDATE"
This is for performing a cartesian between rdd2 and rdd3, starting from rdd1. Like:
rdd3:(k,l)
(c,g)
(f,x)
I want this output:
rddOut:[(a,b),(k,l)]
[(a,b),(c,g)]
[(a,b),(f,x)]
Thanks in advance
Update to my answer:
initRDD = sc.parallelize(list('aeiou')).map(lambda x: (x, ord(x))).collect()
ssc = StreamingContext(sc, batchDuration=3)
lines = ssc.socketTextStream('localhost', 9999)
items = lines.flatMap(lambda x: x.split())
counts = items.countByValue().map(lambda x: ([x] + initRDD))
It looks like broadcast rather than cartesian.
Can you explain a little bit more about what you need? Having an RDD with a single row is not a good idea, as you lose all parallelism.
If you want to collect the data by key, you can convert the RDD into an RDD of pairs (key and value). Then you can do reduceByKey in order to collect everything by the key to a list simply by having the reduce function be a list concatenation.
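A minimal sketch of that reduceByKey idea, with invented data:
pairs = sc.parallelize([("k", 1), ("k", 2), ("j", 3)])
grouped = pairs.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b)
# grouped.collect() -> e.g. [('k', [1, 2]), ('j', [3])]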
If my understanding of your question is correct, using flatMap for this will get you the required output.
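To make that concrete, here is a hedged sketch for the rdd1/rdd3 example in the question, assuming rdd1 = sc.parallelize(['a', 'b']) and rdd3 = sc.parallelize([('k', 'l'), ('c', 'g'), ('f', 'x')]): glue rdd1 into a single pair (it only has two rows), then combine it with every row of rdd3, either via cartesian or a simple map.
pair = tuple(rdd1.collect())      # ('a', 'b') -- safe here because rdd1 has only two rows
rdd2 = sc.parallelize([pair])
rddOut = rdd2.cartesian(rdd3)     # (('a','b'), ('k','l')), (('a','b'), ('c','g')), (('a','b'), ('f','x'))
# or, without cartesian, ship the pair to every row with a map:
# rddOut = rdd3.map(lambda row: [pair, row])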