Sort by key (Month) using RDDs in Pyspark

I have this RDD and want to sort it by month (Jan --> Dec). How can I do it in PySpark?
Note: I don't want to use spark.sql or DataFrames.
+-----+-----+
|Month|count|
+-----+-----+
| Oct| 1176|
| Sep| 1167|
| Dec| 2084|
| Aug| 1126|
| May| 1176|
| Jun| 1424|
| Feb| 1286|
| Nov| 1078|
| Mar| 1740|
| Jan| 1544|
| Apr| 1080|
| Jul| 1237|
+-----+-----+

You can use rdd.sortBy with a helper dictionary built from Python's calendar module, or create your own month dictionary:
import calendar
d = {i:e for e,i in enumerate(calendar.month_abbr[1:],1)}
#{'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6, 'Jul': 7,
#'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
myrdd.sortBy(keyfunc=lambda x: d.get(x[0])).collect()
[('Jan', 1544),
('Feb', 1286),
('Mar', 1740),
('Apr', 1080),
('May', 1176),
('Jun', 1424),
('Jul', 1237),
('Aug', 1126),
('Sep', 1167),
('Oct', 1176),
('Nov', 1078),
('Dec', 2084)]
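
The key function can be checked locally without a Spark cluster, because sortBy applies it to each element in the same way Python's built-in sorted does. The following is a minimal sketch, assuming the RDD holds the (month, count) pairs from the question. The names `rows`, `month_order` and `month_key` are stand-ins introduced here, and the English month abbreviations assume the default locale.

```python
import calendar

# Map each abbreviated month name to its position in the year (Jan -> 1, ..., Dec -> 12).
month_order = {name: i for i, name in enumerate(calendar.month_abbr[1:], 1)}

# Hypothetical stand-in for the RDD contents shown in the question.
rows = [('Oct', 1176), ('Sep', 1167), ('Dec', 2084), ('Aug', 1126),
        ('May', 1176), ('Jun', 1424), ('Feb', 1286), ('Nov', 1078),
        ('Mar', 1740), ('Jan', 1544), ('Apr', 1080), ('Jul', 1237)]


def month_key(row):
    # Same idea as the lambda passed to sortBy: look up the month's number.
    return month_order[row[0]]


sorted_rows = sorted(rows, key=month_key)
print(sorted_rows[:3])  # [('Jan', 1544), ('Feb', 1286), ('Mar', 1740)]
```

Indexing the dictionary instead of calling .get makes an unexpected month name fail with a KeyError rather than producing a None key, which Python 3 cannot compare with integers.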

myList = myrdd.collect()
my_list_dict = dict(myList)
# Months in calendar order.
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
newList = []
for m in months:
    newList.append((m, my_list_dict[m]))
print(newList)
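
The collect-and-reorder approach can also be written so that a month absent from the data is skipped instead of raising a KeyError. This is only a sketch, and it assumes the collected data is small enough to fit on the driver. `collected` is a hypothetical stand-in for the result of myrdd.collect(), and the English month names assume the default locale.

```python
import calendar

# Hypothetical stand-in for myrdd.collect(); most months are absent here.
collected = [('Oct', 1176), ('Jan', 1544), ('Mar', 1740)]
counts = dict(collected)

# Full calendar order, Jan..Dec.
months = list(calendar.month_abbr[1:])

# Keep calendar order, skipping months with no entry in the data.
ordered = [(m, counts[m]) for m in months if m in counts]
print(ordered)  # [('Jan', 1544), ('Mar', 1740), ('Oct', 1176)]
```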

Related

Pyspark - replace values in column with dictionary

I'm avoiding repeating the .when function 12 times, so I thought about a dictionary. I don't know if it's a limitation of the Spark function or a logic error. Does the function allow this concatenation?
months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
          '7': 'Jul', '8': 'Aug', '9': 'Sep', '10':'Oct', '11': 'Nov', '12':'Dec'}
for num, month in months.items():
    custoDF1 = custoDF.\
        withColumn("Month",
                   when(col("Nummes") == num, month)
                   .otherwise(month))
custoDF1.select(col('Nummes').alias('NumMonth'), 'month').distinct().orderBy("NumMonth").show(200)

You can use the replace method of the DataFrame class:
import pyspark.sql.functions as F
months = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun',
'7': 'Jul', '8': 'Aug', '9': 'Sep', '10':'Oct', '11': 'Nov', '12':'Dec'}
df = (df.withColumn('month', F.col('NumMonth').cast('string'))
        .replace(months, subset=['month']))
df.show()
+--------+-----+
|NumMonth|month|
+--------+-----+
| 1| Jan|
| 2| Feb|
| 3| Mar|
| 4| Apr|
| 5| May|
| 6| Jun|
| 7| Jul|
| 8| Aug|
| 9| Sep|
| 10| Oct|
| 11| Nov|
| 12| Dec|
+--------+-----+
Here I had to cast NumMonth to string because your mapping in months dictionary had string keys; alternatively, you can change them to integer and avoid casting to string.

How to create a python dict with a default value from a list?

I have a list of names (say, months). How can I create a dict that maps each name to the same value (say 0), without a comprehension if possible?
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
What I need is:
months_values = {'Jan': 0, 'Feb': 0, 'Mar': 0, 'Apr': 0, 'May': 0, 'Jun': 0}

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
months_dict = dict.fromkeys(months,0)

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
dict1 = dict.fromkeys(months, 0)
print(dict1)
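
One caveat with dict.fromkeys, sketched here as an aside: the default value is a single object shared by every key. That is harmless for an immutable value such as 0, but with a mutable default such as a list, every key refers to the same list. The variable names below are illustrative only.

```python
months = ['Jan', 'Feb', 'Mar']

# Immutable default: updating one key rebinds it and leaves the others alone.
counts = dict.fromkeys(months, 0)
counts['Jan'] += 1
print(counts)  # {'Jan': 1, 'Feb': 0, 'Mar': 0}

# Mutable default: all keys share the same list object.
shared = dict.fromkeys(months, [])
shared['Jan'].append('x')
print(shared['Feb'])  # ['x']

# A comprehension creates a separate list per key.
separate = {m: [] for m in months}
separate['Jan'].append('x')
print(separate['Feb'])  # []
```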

Combine Bar and line plots in one chart for matplotlib [duplicate]

I am trying to plot a chart with the 1st and 2nd columns of data as bars and then a line overlay for the 3rd column of data.
I have tried the following code, but it creates two separate charts, and I would like everything on one chart.
left_2013 = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
'2013_val': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 6]})
right_2014 = pd.DataFrame({'month': ['jan', 'feb'], '2014_val': [4, 5]})
right_2014_target = pd.DataFrame({'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'],
'2014_target_val': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})
df_13_14 = pd.merge(left_2013, right_2014, how='outer')
df_13_14_target = pd.merge(df_13_14, right_2014_target, how='outer')
df_13_14_target[['month','2013_val','2014_val','2014_target_val']].head(12)
plt.figure()
df_13_14_target[['month','2014_target_val']].plot(x='month',linestyle='-', marker='o')
df_13_14_target[['month','2013_val','2014_val']].plot(x='month', kind='bar')
This is what I currently get

The DataFrame plotting methods return a matplotlib AxesSubplot or list of AxesSubplots. (See the docs for plot, or boxplot, for instance.)
You can then pass that same Axes to the next plotting method (using ax=ax) to draw on the same axes:
ax = df_13_14_target[['month','2014_target_val']].plot(x='month',linestyle='-', marker='o')
df_13_14_target[['month','2013_val','2014_val']].plot(x='month', kind='bar',
ax=ax)
import pandas as pd
import matplotlib.pyplot as plt
left_2013 = pd.DataFrame(
{'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep',
'oct', 'nov', 'dec'],
'2013_val': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9, 6]})
right_2014 = pd.DataFrame({'month': ['jan', 'feb'], '2014_val': [4, 5]})
right_2014_target = pd.DataFrame(
{'month': ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep',
'oct', 'nov', 'dec'],
'2014_target_val': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]})
df_13_14 = pd.merge(left_2013, right_2014, how='outer')
df_13_14_target = pd.merge(df_13_14, right_2014_target, how='outer')
ax = df_13_14_target[['month', '2014_target_val']].plot(
x='month', linestyle='-', marker='o')
df_13_14_target[['month', '2013_val', '2014_val']].plot(x='month', kind='bar',
ax=ax)
plt.show()

How to fill missing values in a list that belong to another list using Python?

Consider lists of this kind:
month_list = ['Mar', 'Aug', 'Okt', 'Nov']
value_for_each_month = [4, 10, 8, 5]
So, each value belongs to the month in the month_list, e.g. 'Mar' --> 4, 'Aug' --> 10 and so on..
Now, how to fill both lists in Python to achieve this result:
month_list_new = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
value_for_each_month_new = [0, 0, 4, 0, 0, 0, 0, 10, 0, 8, 5, 0]

Create a dictionary mapping month names to values...
>>> month_list = ['Mar', 'Aug', 'Okt', 'Nov']
>>> value_for_each_month = [4, 10, 8, 5]
>>> month_values = dict(zip(month_list, value_for_each_month))
>>> month_values
{'Aug': 10, 'Mar': 4, 'Nov': 5, 'Okt': 8}
... then use that dict in a list comprehension:
>>> month_list_new = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
>>> value_for_each_month_new = [month_values.get(m, 0) for m in month_list_new]
>>> value_for_each_month_new
[0, 0, 4, 0, 0, 0, 0, 10, 0, 0, 5, 0]
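
Note that the input spells October as 'Okt' while the full list uses 'Oct', so the lookup does not find it and the value 8 falls back to 0. The sketch below handles that mismatch explicitly and takes the full month list from the standard calendar module. The `aliases` mapping is an assumption introduced for illustration and is not part of the original answer. The English abbreviations assume the default locale.

```python
import calendar

month_list = ['Mar', 'Aug', 'Okt', 'Nov']
value_for_each_month = [4, 10, 8, 5]

# Hypothetical alias table: normalise alternative spellings before the lookup.
aliases = {'Okt': 'Oct'}

month_values = {aliases.get(m, m): v
                for m, v in zip(month_list, value_for_each_month)}

# Abbreviations Jan..Dec in calendar order.
month_list_new = list(calendar.month_abbr[1:])

# Months without a value default to 0.
value_for_each_month_new = [month_values.get(m, 0) for m in month_list_new]
print(value_for_each_month_new)  # [0, 0, 4, 0, 0, 0, 0, 10, 0, 8, 5, 0]
```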

python appending elements to a list from a list

I would like to create a list that adds elements alternately from 2 separate lists in Python.
I had the following idea but it doesn't seem to work:
t1 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
t2 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
t3= [len(t1)+len(t2)]
a = 0
while a < len(t1)+len(t2):
    t3.extend(t1[a])
    t3.extend(t2[a])
    a = a + 1
print t3
So basically I would like ['Jan', 31, 'Feb', 28, 'Mar', 31, etc.]

The shortest solution may be:
list(sum(zip(t2, t1), ()))

In Python you don't need to "reserve capacity" for a list. Just write
t3 = []
In fact, t3 = [len(t1)+len(t2)] doesn't even create a list of length 24; it creates a list with a single entry, [24].
t1[a] and t2[a] are elements you want to add to the list. To add an element, you use the .append method:
t3.append(t1[a])
t3.append(t2[a])
.extend is used to add a list (in fact, any iterable) to a list, e.g.
t3.extend([t1[a], t2[a]])
The problem itself can be solved easily using list comprehensions.
[a for l in zip(t2, t1) for a in l]
Many other improvements could be made (e.g. using a for loop instead of a while loop). You could take it to http://codereview.stackexchange.com.
(BTW, this code does not handle leap years.)
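
The difference between append and extend described in this answer can be seen in a short sketch. Shortened lists are used for brevity, and the names `not_reserved`, `chars` and `flat` are illustrative only.

```python
t1 = [31, 28, 31]
t2 = ['Jan', 'Feb', 'Mar']

# This does not reserve 6 slots; it builds a one-element list holding 6.
not_reserved = [len(t1) + len(t2)]
print(not_reserved)  # [6]

# append adds one element at a time.
t3 = []
for month, days in zip(t2, t1):
    t3.append(month)
    t3.append(days)
print(t3)  # ['Jan', 31, 'Feb', 28, 'Mar', 31]

# extend adds every element of an iterable; a string is iterated
# character by character, which is rarely what is wanted here.
chars = []
chars.extend('Jan')
print(chars)  # ['J', 'a', 'n']

# The nested comprehension gives the same result as the loop.
flat = [x for pair in zip(t2, t1) for x in pair]
print(flat == t3)  # True
```

Calling extend with an int, as the question's loop does with t1[a], raises a TypeError because an int is not iterable.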

Here you go:
t1 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
t2 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
t3 = list()
for i, j in zip(t1, t2):
    t3.append(i)
    t3.append(j)
print(t3)

Just zip the lists and flatten the result.
>>> from itertools import chain
>>> t1 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
>>> t2 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
... 'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
>>> list(chain(*zip(t2, t1)))
['Jan', 31, 'Feb', 28, 'Mar', 31, 'Apr', 30, 'May', 31, 'Jun', 30, 'Jul', 31, 'Aug', 31, 'Sept', 30, 'Oct', 31, 'Nov', 30, 'Dec', 31]
Without chain:
>>> [x for tup in zip(t2, t1) for x in tup]
['Jan', 31, 'Feb', 28, 'Mar', 31, 'Apr', 30, 'May', 31, 'Jun', 30, 'Jul', 31, 'Aug', 31, 'Sept', 30, 'Oct', 31, 'Nov', 30, 'Dec', 31]

You probably need to read more about Python lists and their methods. t3 = [len(t1)+len(t2)] is not necessary at all. I guess you have a C background and are trying to initialize the list with a size. In Python you don't have to initialize the list size (it grows automatically). Python lists also keep their items in the order in which you added them, so there is no need to switch to a tuple to preserve the sequence.
Happy Coding

t1 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
t2 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
arr = []
for i in range(12):
    arr.append(t2[i])
    arr.append(t1[i])
print(arr)
Output -
['Jan', 31, 'Feb', 28, 'Mar', 31, 'Apr', 30, 'May', 31, 'Jun', 30, 'Jul', 31, 'Aug', 31, 'Sept', 30, 'Oct', 31, 'Nov', 30, 'Dec', 31]
You can alternatively write -
import itertools
arr = list(itertools.chain.from_iterable(zip(t2, t1)))

In Python, you don't need to create a list with a fixed length as you would an array in some other languages, so the third line should just be t3 = [].
Also, the extend() method appends every item of an iterable to a list. To add a single new value, use the append() method instead.

Python is a dynamically typed language: the type of a name is determined when a value is assigned to it.
So basically you can do it this way:
t1 = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
t2 = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
      'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']
t3 = []
for a in range(len(t1)):
    t3.append(t1[a])
    t3.append(t2[a])
print(t3)
