I have 1 million records and I want to try Spark for this. I have a list of items and want to check, for each record, whether it contains any of the items in the list.
l = ['domestic', 'private']
text = ["On the domestic front, growth seems to have stalled, private investment and credit off-take is feeble, inflation seems to be bottoming out and turning upward, current account situation is not looking too promising, FPI inflows into debt and equity have slowed, and fiscal deficit situation of states is grim.", "Despite the aforementioned factors, rupee continues to remain strong against the USD and equities continue to outperform.", "This raises the question as to whether the asset prices are diverging from fundamentals and if so when are they expected to fall in line. We examine each of the above factors in a little more detail below.Q1FY18 growth numbers were disappointing with the GVA, or the gross value added, coming in at 5.6 percent. Market participants would be keen to ascertain whether the disappointing growth in Q1 was due to transitory factors such as demonetisation and GST or whether there are structural factors at play. There are silver linings such as a rise in core GVA (GVA excluding agri and public services), a rise in July IIP (at 1.2%), pickup in activity in the cash-intensive sectors, pick up in rail freight and containers handled by ports.However, there is a second school of thought as well, which suggests that growth slowdown could be structural. With demonetisation and rollout of GST, a number of informal industries have now been forced to enter the formal setup."]
res = {}
for rec in text:
    for word in l:
        if word in rec:
            res[rec] = 1
            break
print(res)
This is a simple Python script, and I want to execute the same logic with PySpark in a distributed manner to reduce the execution time (will this same code work?).
Can you please guide me on how to do this? I am sorry, as I am very new to Spark; your help will be much appreciated.
After instantiating a SparkContext and/or a SparkSession, you'll have to convert your list of records to a DataFrame:
df = spark.createDataFrame(
    sc.parallelize([[rec] for rec in text]),
    ["text"]
)
df.show()
+--------------------+
| text|
+--------------------+
|On the domestic f...|
|Despite the afore...|
|This raises the q...|
+--------------------+
Now you can check, for each line, whether any of the words in l are present:
res = df.withColumn("res", df.text.rlike('|'.join(l)).cast("int"))
res.show()
+--------------------+---+
| text|res|
+--------------------+---+
|On the domestic f...| 1|
|Despite the afore...| 0|
|This raises the q...| 0|
+--------------------+---+
rlike performs regex matching, so this flags every record that contains one of the words as a substring; it would also match e.g. "privately", so add \b word boundaries to the pattern if you need whole-word matches.
Note that '|'.join(l) is evaluated on the driver and shipped to the executors as part of the query plan, so no broadcast is needed for this version. sc.broadcast is for copying an object like l to every node once, so that executor-side code (such as a UDF) doesn't have to fetch it from the driver for every task.
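For illustration, here is a minimal sketch of the broadcast-plus-UDF variant where the broadcast actually pays off (assuming the same sc, df, and l as above; this is one possible approach, not the only one):
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

bl = sc.broadcast(l)  # copy l to every executor once

@F.udf(IntegerType())
def contains_any(s):
    # bl.value reads the local broadcast copy on each executor
    return int(any(w in s for w in bl.value))

res = df.withColumn("res", contains_any(df.text))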
Hope this helps
I have a list in Python like this:
list1 = ['Security Name % to Net Assets* DEBENtURES 0.04, Britannia Industries Ltd. EQUity & RELAtED 96.83, HDFC Bank 6.98, ICICI 4.82']
If I get the length of this list using len(list1), it returns 1, simply because the whole thing is a single string.
How can I transform my list1 such that it would look like this:
list1_altered = ['Security Name % to Net Assets* DEBENtURES 0.04', 'Britannia Industries Ltd. EQUity & RELAtED 96.83', 'HDFC Bank 6.98', 'ICICI 4.82']
after which len(list1_altered) should return 4.
I have tried using replace(",", "\',\'"), but it doesn't give the desired result.
Please help me do this.
Replacing the commas with quoted commas doesn't change the fact that you still have a single string. You can't turn one object (a string) into another (a list of strings) that way.
Use str.split to generate a list of multiple strings split on a separator:
list1_altered = list1[0].split(',')
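Note that every piece after the first will keep the space that followed the comma, so a quick follow-up (a small sketch using the list from the question) is to strip each piece and confirm the length:
# split on the commas, then strip leading/trailing whitespace from each piece
list1_altered = [part.strip() for part in list1[0].split(',')]
print(len(list1_altered))  # 4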
I am trying to remove the bold tags (<b> Some text in bold here </b>) from this XML document (but I want to keep the text covered by the tags intact). The bold tags are present around the following words/text: Objectives, Design, Setting, Participants, Interventions, Main outcome measures, Results, Conclusion, and Trial registrations.
This is my Python code:
import urllib.request
import xml.etree.ElementTree as etree
urlHead = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&rettype=abstract&id='
pmid = "28420629"
completeUrl = urlHead + pmid
response = urllib.request.urlopen(completeUrl)
tree = etree.parse(response)
studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
print(studyAbstractPart.text)
The problem with this code is that it finds the text under each "AbstractText" tag but stops at (and ignores) the text inside the bold tags and everything after them. In principle, I need all the text between the <AbstractText> </AbstractText> tags; the bold formatting <b> </b> is just an obstruction to it.
You can use the itertext() method to get all the text in <AbstractText> and its subelements. An element's .text attribute only holds the text up to its first child element, which is why your loop stops at the first <b>; itertext() walks the whole subtree and yields every text fragment in document order.
studyAbstractParts = tree.findall('.//AbstractText')
for studyAbstractPart in studyAbstractParts:
    for t in studyAbstractPart.itertext():
        print(t)
Output:
Objectives
To determine whether preoperative dexamethasone reduces postoperative vomiting in patients undergoing elective bowel surgery and whether it is associated with other measurable benefits during recovery from surgery, including quicker return to oral diet and reduced length of stay.
Design
Pragmatic two arm parallel group randomised trial with blinded postoperative care and outcome assessment.
Setting
45 UK hospitals.
Participants
1350 patients aged 18 or over undergoing elective open or laparoscopic bowel surgery for malignant or benign pathology.
Interventions
Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia compared with standard care.
Main outcome measures
Primary outcome: reported vomiting within 24 hours reported by patient or clinician.
vomiting with 72 and 120 hours reported by patient or clinician; use of antiemetics and postoperative nausea and vomiting at 24, 72, and 120 hours rated by patient; fatigue and quality of life at 120 hours or discharge and at 30 days; time to return to fluid and food intake; length of hospital stay; adverse events.
Results
1350 participants were recruited and randomly allocated to additional dexamethasone (n=674) or standard care (n=676) at induction of anaesthesia. Vomiting within 24 hours of surgery occurred in 172 (25.5%) participants in the dexamethasone arm and 223 (33.0%) allocated standard care (number needed to treat (NNT) 13, 95% confidence interval 5 to 22; P=0.003). Additional postoperative antiemetics were given (on demand) to 265 (39.3%) participants allocated dexamethasone and 351 (51.9%) allocated standard care (NNT 8, 5 to 11; P<0.001). Reduction in on demand antiemetics remained up to 72 hours. There was no increase in complications.
Conclusions
Addition of a single dose of 8 mg intravenous dexamethasone at induction of anaesthesia significantly reduces both the incidence of postoperative nausea and vomiting at 24 hours and the need for rescue antiemetics for up to 72 hours in patients undergoing large and small bowel surgery, with no increase in adverse events.
Trial registration
EudraCT (2010-022894-32) and ISRCTN (ISRCTN21973627).
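If you would rather have each abstract section as a single string instead of a stream of fragments, you can join what itertext() yields; a small sketch under the same setup as above:
for studyAbstractPart in studyAbstractParts:
    # stitch the fragments around the <b> tags back together
    print(''.join(studyAbstractPart.itertext()))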
Given a pandas DataFrame (received from a database), I'm trying to output the result to the console in such a way that it is complete and readable.
The challenge I have is with the long text in two columns: LPQ_REASON and LPQ_RESOLUTION. You will note from the output below (print df) that both LPQ columns are truncated with three dots (...), so I can't read the text. This happens despite my initial settings of:
pd.set_option('display.max_rows', 1500)
pd.set_option('display.max_columns', 1500)
pd.set_option('display.width', 1000)
so the result on the console looks like this:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (h... This Memo was issued with conjunction to our j... 3979
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% (h... This Memo was issued on matching viloation for... 3979
The optimal solution I'm looking for (if doable) is to print the entire row, like this:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (here This Memo was issued with conjunction to our 3979
comes the rest of the sentence. it might be analysis to foster a better bs when writing
long, or not, it might be short or whatever)
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% yes and This Memo was issued on matching viloation for 3979
here I'm going to write the entire sentence who cares on what violation. just issued.
as if I really remember what was written. ha
Not as optimal as you want, but you can try the following:
pd.set_option('display.max_colwidth', 100)
where 100 is a column width of your choosing (in recent pandas versions you can pass None to disable truncation entirely). Note that this will not create multi-line cells, just very wide columns.
Alternatively, and not very elegantly, you can use the 'tabulate' library (https://pypi.python.org/pypi/tabulate), which creates nice text tables like:
+--------+-------+
| item | qty |
+========+=======+
| spam | 42 |
+--------+-------+
| eggs | 451 |
+--------+-------+
| bacon | 0 |
+--------+-------+
With tabulate you can use the '\n' newline character: just iterate over your text cells and insert a '\n' every X characters (let's say every 50). Simple code for that:
for i in range(len(data)):
    data.at[i, 'text'] = data.at[i, 'text'][:50] + '\n' + data.at[i, 'text'][50:]
The above inserts only a single line break, but you can extend it to make multiple breaks for longer text; it also doesn't check whether it breaks in the middle of a word. Make sure to do this on a copy of the data, because it changes your DataFrame, and if you then print it with a regular print you will see the '\n' stuck in the middle of the text.
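Putting the two ideas together, here is a minimal sketch that wraps the long columns with textwrap (which, unlike the slicing above, respects word boundaries) and renders the embedded newlines as real multi-line cells via tabulate; the column names come from the question, the rest (df, the width of 50) are assumptions:
import textwrap
from tabulate import tabulate

wrapped = df.copy()  # work on a copy so the original frame stays untouched
for col in ['LPQ_REASON', 'LPQ_RESOLUTION']:
    # textwrap.fill inserts '\n' roughly every 50 characters, on word boundaries
    wrapped[col] = wrapped[col].map(lambda s: textwrap.fill(str(s), width=50))

# the 'grid' format draws cell borders and honours embedded newlines
print(tabulate(wrapped, headers='keys', tablefmt='grid'))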
Suppose there is a string called likes_and_dislikes, formatted visually as a table, as shown below.
How can I parse the string and return a list of (like, dislike) tuples? The top header row (likes, dislikes) has to be excluded from the list of tuples.
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """
The key here is to examine the table thoroughly and understand what you are trying to pull out.
First of all, parsing strings like this is generally easier when done line by line, so you need to split on the table row separators and then parse the columns within each row. We do this primarily because a single like or dislike can span several lines.
1. Getting each row
We don't know how wide the table might be, so we use a regular expression to break it up like so:
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # drop the header and the tail
This gives us a list whose elements correspond to our multi-line rows. The slice at the end removes the header and any trailing whitespace that we don't want to process. However, we still have the problem of pulling together strings that span multiple lines within a cell.
2. Finding a like and dislike
If we iterate through this list of rows, we know each row has a like and a dislike spanning an unknown number of lines. We initialise the like and dislike each as a list to make concatenation quicker at the end.
for p in pairs:
    like, dislike = [], []
3. Dealing with each line
Within each row, we split on newlines, then split each line on the pipes (|).
for l in p.split('\n'):
    pair = l.split('|')
4. Pulling out each like and dislike
If the split yields more than one value, the line contains a like/dislike pair for us to capture, so we append to our per-row like and dislike lists (not likes or dislikes, which hold the final formatted strings). We also strip each piece to remove any leading or trailing whitespace.
if len(pair) > 1:
    # not a blank line
    like.append(pair[1].strip())
    dislike.append(pair[2].strip())
5. Creating the final text
Once we are done processing a row, we join its pieces with a single space and append the results to our likes and dislikes lists.
if len(like) > 0:
    likes.append(" ".join(like))
if len(dislike) > 0:
    dislikes.append(" ".join(dislike))
6. Using our new data structure
Now we can use these two new lists however we choose, either printing each list separately...
from pprint import pprint
print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
... or zip() them together to create a list of paired likes and dislikes!
print "A set of paired likes and dislikes"
pprint(zip(likes,dislikes),indent=4)
The complete code:
likes_and_dislikes="""
+------------------------------------+-----------------------------------+
| likes | dislikes |
+------------------------------------+-----------------------------------+
| Meritocracy | Favoritism, ass-kissing, politics |
+------------------------------------+-----------------------------------+
| Healthy debates and collaboration | Ego-driven rhetoric, drama and FUD|
| | to get one's way |
+------------------------------------+-----------------------------------+
| Autonomy given by confident leaders| Micro-management by insecure |
| capable of attracting top-tier | managers compensating for a weak, |
| talent | immature team |
+------------------------------------+-----------------------------------+ """
import re
from pprint import pprint

likes, dislikes = [], []
pairs = re.split(r"\+-*\+-*\+\n?", likes_and_dislikes)[2:-1]  # drop the header and the tail
for p in pairs:
    like, dislike = [], []
    for l in p.split('\n'):
        pair = l.split('|')
        if len(pair) > 1:
            # not a blank line
            like.append(pair[1].strip())
            dislike.append(pair[2].strip())
    if len(like) > 0:
        likes.append(" ".join(like))
    if len(dislike) > 0:
        dislikes.append(" ".join(dislike))

print("Likes:")
pprint(likes, indent=4)
print("Dislikes:")
pprint(dislikes, indent=4)
print("A set of paired likes and dislikes")
pprint(list(zip(likes, dislikes)), indent=4)
This results in:
Likes:
[ 'Meritocracy',
'Healthy debates and collaboration ',
'Autonomy given by confident leaders capable of attracting top-tier talent']
Dislikes:
[ 'Favoritism, ass-kissing, politics',
"Ego-driven rhetoric, drama and FUD to get one's way",
'Micro-management by insecure managers compensating for a weak, immature team']
A set of paired likes and dislikes
[ ('Meritocracy', 'Favoritism, ass-kissing, politics'),
( 'Healthy debates and collaboration ',
"Ego-driven rhetoric, drama and FUD to get one's way"),
( 'Autonomy given by confident leaders capable of attracting top-tier talent',
'Micro-management by insecure managers compensating for a weak, immature team')]
You can see the complete code in action on codepad.
That's (one of) the table formats used in reST (reStructuredText, a Pythonic form of markup), and there are various parsers kicking around for it.
Here's one, on the old python.org site: http://legacy.python.org/scripts/ht2html/docutils/parsers/rst/tableparser.py
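For example, here is a rough sketch using the GridTableParser that ships with docutils; treat it as an assumption to verify, since the exact return structure (rows of (morerows, morecols, offset, cell-lines) tuples) can differ between docutils versions:
from docutils.parsers.rst.tableparser import GridTableParser
from docutils.statemachine import StringList

# the parser expects just the table lines, with no surrounding blank lines
lines = [ln.rstrip() for ln in likes_and_dislikes.strip().splitlines()]
colspecs, head_rows, body_rows = GridTableParser().parse(StringList(lines))

# with +---+ separators only (no +===+ header line), every row lands in
# body_rows, so the first row here is the likes/dislikes header
for row in body_rows:
    print(tuple(' '.join(s.strip() for s in cell[3] if s.strip()) for cell in row if cell))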
I have the following CSV file (each line has a dynamic number of characters, but the columns are fixed... hope I am making sense):
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
I am trying to get the results into another CSV file in the following format, grouped by student ID and ordered by date ascending:
001,#Math;A;02/20/2013#Biology;B;05/01/2013#Literature;B;03/02/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013
There is one constraint, though: the input file is huge, around 200 million rows. I tried using C#, storing the data in a DB, and writing a SQL query; it is very slow and not acceptable. After googling, I hear Python is very powerful for these operations. I am new to Python and have started playing with the code. I would really appreciate the Python gurus helping me get the results I mentioned above.
content='''
**001** Math **02/20/2013** A
**001** Literature **03/02/2013** B
**002** Biology **01/01/2013** A
**003** Biology **04/08/2013** A
**001** Biology **05/01/2013** B
**002** Math **03/10/2013** C
'''
from collections import defaultdict

lines = content.split("\n")
items_iter = (line.split() for line in lines if line.strip())

aggregated = defaultdict(list)
for items in items_iter:
    stud, class_, date, grade = (t.strip('*') for t in items)
    aggregated[stud].append((class_, grade, date))

for stud, data in aggregated.items():
    full_grades = [';'.join(items) for items in data]
    print('{},#{}'.format(stud, '#'.join(full_grades)))
Output:
001,#Math;A;02/20/2013#Literature;B;03/02/2013#Biology;B;05/01/2013
002,#Biology;A;01/01/2013#Math;C;03/10/2013
003,#Biology;A;04/08/2013
Of course, this is ugly, hackish code, just to show you how it can be done in Python. When working with large streams of data, use generators and iterators; don't use file.readlines(), just iterate over the file object. Iterators don't read all the data at once; they read chunk by chunk as you iterate, and not earlier.
If you are concerned about whether 200m records fit in memory, then do the following (a streaming sketch follows below):
sort the records into separate "buckets" (as in bucket sort) by student ID
cat all_records.txt | grep 001 > stud_001.txt # do the same for the other students
do the processing per bucket
merge the per-bucket results
grep is just an example; write a fancier script (in awk, or in Python again) that filters by student ID, e.g. everything with ID < 1000 first, then 1000 <= ID < 2000, and so on. You can do this safely because the records of different students are disjoint.
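As a concrete illustration of the streaming advice, here is a minimal sketch that assumes the input has been pre-sorted by student ID (for example with GNU sort, which spills to disk on huge files) and then emits one output line per student without ever holding more than one student's records in memory; the file names are placeholders:
from datetime import datetime
from itertools import groupby

def records(path):
    # iterate over the file lazily; never call readlines() on a 200m-row file
    with open(path) as f:
        for line in f:
            if line.strip():
                stud, class_, date, grade = (t.strip('*') for t in line.split())
                yield stud, class_, date, grade

with open('grades_by_student.csv', 'w') as out:
    # groupby only works because the input is already sorted by student ID
    for stud, group in groupby(records('sorted_records.txt'), key=lambda r: r[0]):
        # records per student are few, so sorting them by date in memory is cheap
        rows = sorted(group, key=lambda r: datetime.strptime(r[2], '%m/%d/%Y'))
        full_grades = '#'.join(';'.join((c, g, d)) for _, c, d, g in rows)
        out.write('{},#{}\n'.format(stud, full_grades))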