How to load complex data using PySpark - python

I have a CSV dataset that looks like the sample below; here is the data in text form:
Timestamp,How old are you?,What industry do you work in?,Job title,What is your annual salary?,Please indicate the currency,Where are you located? (City/state/country),How many years of post-college professional work experience do you have?,"If your job title needs additional context, please clarify here:","If ""Other,"" please indicate the currency here: "
4/24/2019 11:43:21,35-44,Government,Talent Management Asst. Director,75000,USD,"Nashville, TN",11 - 20 years,,
4/24/2019 11:43:26,25-34,Environmental nonprofit,Operations Director,"65,000",USD,"Madison, Wi",8 - 10 years,,
4/24/2019 11:43:27,18-24,Market Research,Market Research Assistant,"36,330",USD,"Las Vegas, NV",2 - 4 years,,
4/24/2019 11:43:27,25-34,Biotechnology,Senior Scientist,34600,GBP,"Cardiff, UK",5-7 years,,
4/24/2019 11:43:29,25-34,Healthcare,Social worker (embedded in primary care),55000,USD,"Southeast Michigan, USA",5-7 years,,
4/24/2019 11:43:29,25-34,Information Management,Associate Consultant,"45,000",USD,"Seattle, WA",8 - 10 years,,
4/24/2019 11:43:30,25-34,Nonprofit ,Development Manager ,"51,000",USD,"Dallas, Texas, United States",2 - 4 years,"I manage our fundraising department, primarily overseeing our direct mail, planned giving, and grant writing programs. ",
4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",
Now, I tried the code below to load the data:
df = spark.read.option("header", "true").option("multiline", "true").option(
    "delimiter", ",").csv("path")
For the last record it splits the text across the wrong columns, so the output is not as expected:
The value for the last column, i.e. "If ""Other,"" please indicate the currency here: ", should be null, and the entire string should instead end up in the previous column, "If your job title needs additional context, please clarify here:".
I also tried .option('quote','/"').option('escape','/"'), but that didn't work either.
However, when I loaded this file using pandas, it parsed correctly. I was surprised that pandas could identify where each new column starts. I could define a string schema for all the columns and load the result back into a Spark data frame, but since I am on a lower Spark version that would not run in a distributed manner, so I was exploring how Spark itself can handle this efficiently.
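For reference, a minimal pandas sketch (the file path is a placeholder): pandas follows the CSV convention that a doubled quote inside a quoted field is a literal quote, which is why it parses this file as-is.
import pandas as pd

# pandas' default parser (doublequote=True) treats "" inside a quoted
# field as an escaped quote, so the file loads cleanly without options.
df = pd.read_csv("path")  # placeholder path
print(df.iloc[-1, -2])    # the long clarification text, quotes intact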
Any help is much appreciated.

The main issue is the consecutive double quotes in your CSV file.
You have to escape the extra double quotes in your CSV file,
like this:
4/24/2019 11:43:30,25-34,Higher Education,Student Records Coordinator,"54,371",USD,Philadelphia,8 - 10 years,equivalent to Assistant Registrar,
4/25/2019 8:35:51,25-34,Marketing,Associate Product Manager,"43,000",USD,"Cincinnati, OH, USA",5-7 years,"I started as the Marketing Coordinator, and was given the \"Associate Product Manager\" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.",
After this it generates the expected result:
df2 = spark.read.option("header",True).csv("sample1.csv")
df2.show(10,truncate=False)
******** Output ********
|4/25/2019 8:35:51 |25-34 |Marketing |Associate Product Manager |43,000 |USD |Cincinnati, OH, USA |5-7 years |I started as the Marketing Coordinator, and was given the ""Associate Product Manager"" title as a promotion. My duties remained mostly the same and include graphic design work, marketing, and product management.|null |null |
Or you can use the code below, which handles the embedded quotes without editing the file:
df2 = spark.read.option("header",True).option("multiline","true").option("escape","\"").csv("sample1.csv")

Related

Extract information from table-columns of a PDF file with a pattern

My bank only gives me my activity statement as a PDF, whereas business clients can get a .csv.
I need to extract data from the relevant table in that multi-page PDF. The relevant table sits below the bank's letterhead, and each page is prepended with a 1-row, 6-column information table.
The cells are like:
date | transaction information | debit | credit
13.09./13.09. | payment by card ACME-SHOP Reference {SOME 26 digits code} {SHOP OWNER WITH LINE-B\\REAK} Terminal {some 8 digits code} {DATE-FORMAT LIKE THIS: 2022-09-06T14:25:11} SequenceNr. 012 Expirationd. {4 digit code} | -312,12 |
After the last such row, two tables follow in a horizontal split:
table 1 | table 2
3-row, 2-column table with information on your account | 3-row, 2-column table with new balance
I have taken a look at the tabula library for Python, but it doesn't seem to offer any option for pattern matching.
So my question is whether there is a more sophisticated open-source solution to this problem. It doesn't have to be in Python, but I guessed that the Python community would be a good place to start looking for this kind of extraction tool.
Otherwise I guess I have to extract the columns and then do pattern matching to re-accumulate the data.
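For what it's worth, a minimal sketch of that extract-then-match approach, assuming tabula-py (which wraps tabula and needs a Java runtime); the file name is a placeholder and the row pattern is taken from the sample above:
import re
import tabula  # tabula-py; requires a Java runtime

# Extract every table on every page as pandas DataFrames.
tables = tabula.read_pdf("statement.pdf", pages="all", multiple_tables=True)

# Keep only rows whose first cell matches the booking-date pattern,
# e.g. "13.09./13.09." from the sample above.
row_re = re.compile(r'^\d{2}\.\d{2}\./\d{2}\.\d{2}\.$')
for df in tables:
    for _, row in df.iterrows():
        first = str(row.iloc[0])
        if row_re.match(first):
            print(first, '|', row.iloc[1])  # date | transaction information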
Thanks

How to locate the element in the API?

I am new to web scraping. I am trying to scrape the "When purchased online" label on Target, but I did not find it in the HTML.
Does anyone know how to locate the element in the HTML? Any help is appreciated. Thanks!
Product Url:
https://www.target.com/c/allergy-sinus-medicines-treatments-health/-/N-4y5ny?Nao=144
https://www.target.com/p/genexa-dextromethorphan-kids-39-cough-and-chest-congestion-suppressant-4-fl-oz/-/A-80130848#lnk=sametab
I have no idea which element you want to get, but the API sends JSON data, not HTML, so you can simply convert it to a dictionary/list and use keys/indexes to get the value.
You have to find the correct keys in the JSON data manually,
or you can write a script that searches the JSON (using for-loops and recursion).
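For instance, a small sketch of such a search helper (the function name is made up):
# Hypothetical helper: walk parsed JSON and yield every path at which
# a given key occurs, so you can locate values like 'current_retail'.
def find_key(obj, key, path=''):
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield f'{path}.{k}', v
            yield from find_key(v, key, f'{path}.{k}')
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from find_key(v, key, f'{path}[{i}]')

# Example: list(find_key(data, 'formatted_current_price'))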
Minimal working code (I found the keys manually):
import requests

url = 'https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130848&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA'  # JSON

response = requests.get(url)
data = response.json()

product = data['data']['product']

print('price:', product['price']['current_retail'])
print('title:', product['item']['product_description']['title'])
print('description:', product['item']['product_description']['downstream_description'])
print('------------')

for bullet in product['item']['product_description']['bullet_descriptions']:
    print(bullet)
print('------------')

print(product['item']['product_description']['soft_bullets']['title'])
for bullet in product['item']['product_description']['soft_bullets']['bullets']:
    print('-', bullet)
print('------------')

for attribute in product['item']['wellness_merchandise_attributes']:
    print('-', attribute['value_name'])
    print(' ', attribute['wellness_description'])
Result:
price: 13.99
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
description: Genexa Kids’ Cough & Chest Congestion is real medicine, made clean - a powerful cough suppressant and expectorant that helps control cough, relieves chest congestion and helps thin and loosen mucus. This liquid, non-drowsy medicine has the same active ingredients you need (dextromethorphan HBr and guaifenesin), but without the artificial ones you don’t (dyes, common allergens, parabens). We only use ingredients people deserve to make the first gluten-free, non-GMO, certified vegan medicines to help your little ones feel better. <br /><br />Genexa is the first clean medicine company. Founded by two dads who believe in putting People Over Everything, Genexa makes medicine with the same active ingredients people need, but without the artificial ones they don’t. It’s real medicine, made clean.
------------
<B>Suggested Age:</B> 4 Years and Up
<B>Product Form:</B> Liquid
<B>Primary Active Ingredient:</B> Dextromethorphan
<B>Package Quantity:</B> 1
<B>Net weight:</B> 4 fl oz (US)
------------
highlights
- This is an age restricted item and will require us to take a quick peek at your ID upon pick-up
- Helps relieve kids’ chest congestion and makes coughs more productive by thinning and loosening mucus
- Non-drowsy so your little ones (ages 4+) can get back to playing
- Our medicine is junk-free, with no artificial sweeteners or preservatives, no dyes, no parabens, and no common allergens
- Certified gluten-free, vegan, and non-GMO
- Flavored with real organic blueberries
- Gentle on little tummies
------------
- Dye-Free
A product that either makes an unqualified on-pack statement indicating that it does not contain dye, or carries an unqualified on-pack statement such as "no dyes" or "dye-free."
- Gluten Free
A product that has an unqualified independent third-party certification, or carries an on-pack statement relating to the finished product being gluten-free.
- Non-GMO
A product that has an independent third-party certification, or carries an unqualified on-pack statement relating to the final product being made without genetically engineered ingredients.
- Vegan
A product that carries an unqualified independent, third-party certification, or carries on-pack statement relating to the product being 100% vegan.
- HSA/FSA Eligible
Restrictions apply; contact your insurance provider about plan allowances and requirements
EDIT:
Information "When purchased online" (or "at Cedar Rapids South") are in different url.
For example
Product url:
https://www.target.com/p/genexa-kids-39-diphenhydramine-allergy-liquid-medicine-organic-agave-4-fl-oz/-/A-80130847
API product data:
https://redsky.target.com/redsky_aggregations/v1/web/pdp_client_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&tcin=80130847&is_bot=false&member_id=0&store_id=1771&pricing_store_id=1771&has_pricing_store_id=true&scheduled_delivery_store_id=1771&has_financing_options=true&visitor_id=01819D268B380201B177CA755BCE70CC&has_size_context=true&latitude=41.9831&longitude=-91.6686&zip=52404&state=IA
API "at Cedar Rapids South":
https://redsky.target.com/redsky_aggregations/v1/web_platform/product_fulfillment_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&is_bot=false&tcin=80130847&store_id=1771&zip=52404&state=IA&latitude=41.9831&longitude=-91.6686&scheduled_delivery_store_id=1771&required_store_id=1771&has_required_store_id=true
But in some situations it may use other information in the product data to show "When purchased online" instead of "at Cedar Rapids South", and this can be hardcoded in the JavaScript. For example, a product which displays "When purchased online" has a formatted_price of $13.99, but a product which displays "at Cedar Rapids South" has a formatted_price of "See price in cart".
import requests

url = 'https://redsky.target.com/redsky_aggregations/v1/web/plp_search_v1?key=9f36aeafbe60771e321a7cc95a78140772ab3e96&brand_id=q643lel65ir&channel=WEB&count=24&default_purchasability_filter=true&offset=0&page=%2Fb%2Fq643lel65ir&platform=desktop&pricing_store_id=1771&store_ids=1771%2C1768%2C1113%2C3374%2C1792&useragent=Mozilla%2F5.0+%28X11%3B+Linux+x86_64%3B+rv%3A101.0%29+Gecko%2F20100101+Firefox%2F101.0&visitor_id=01819D268B380201B177CA755BCE70CC'  # JSON

response = requests.get(url)
data = response.json()

for product in data['data']['search']['products']:
    print('title:', product['item']['product_description']['title'])
    print('price:', product['price']['current_retail'])
    print('formatted:', product['price']['formatted_current_price'])
    print('---')
Result:
title: Genexa Kids' Diphenhydramine Allergy Liquid Medicine - Organic Agave - 4 fl oz
price: 7.99
formatted: See price in cart
---
title: Genexa Dextromethorphan Kids' Cough and Chest Congestion Suppressant - 4 fl oz
price: 13.99
formatted: $13.99
---

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
*Page 2 of 12*
*R/CR.A/251/2009 JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints given below. Be aware that this is only a single file; I have around 40k files, and it should run on all of them. The files differ in places, but the basic format of every file is the same.
Constraints:
It should start the text extraction after the "metadata". The metadata is the information about the file from the start, i.e. "IN THE HIGH COURT OF GUJARAT", up to "ORAL JUDGMENT". In all my files there are various POINTS after that string ends, and I need each of these points as a separate paragraph (the sample text has 2 points; I need them in different paragraphs).
Check the lines in italics; these are the page headers in the text/PDF file. I need to remove these, as they do not add any meaning to the text content I want.
These files are available in both TEXT and PDF format, so I can use either. But I am new to Python, so I don't know how and where to start; I only have basic knowledge of Python.
This data is going to be made into a "corpus" for further processing while building a large expert system, so I hope you see what needs to be done.
Read the official Python docs!
Start with Python's basic str type and its methods. One of its methods, find, locates substrings in your text.
Use Python's slicing notation to extract the portion of text you need, e.g.:
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
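For instance, a minimal regex sketch along those lines, reusing text, pos2, and meta_end from the snippet above, and assuming the points are numbered "1.", "2.", ... at the start of a line as in the sample:
import re

# The judgment body starts after the metadata, i.e. after meta_end.
body = text[pos2 + len(meta_end):]

# Split into paragraphs at numbered points like "1. ", "2. " at a line start.
points = re.split(r'\n\s*(?=\d+\.\s)', body)

# Drop page-header lines such as "Page 1 of 12" or "R/CR.A/251/2009 JUDGMENT".
header_re = re.compile(r'(?m)^\s*(Page \d+ of \d+|R/CR\.A/\d+/\d+\s+JUDGMENT)\s*$')
paragraphs = [header_re.sub('', p).strip() for p in points]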
As for italics and other text formatting, you won't ever be able to detect it in plain text (unless you have some 'meta' markers, e.g. [i] tags).

Rearrange data using python

So, following the question I asked before (Replacing special patterns in a string, reading from a file), I seemed to resolve the problem there. But I have now been stuck on this problem for a month.
With the code there, the program reads the data from a file and parses it (dividing the fields with tabs). Now I want the program to recognize the data (in this example it needs to recognize the authors, year of publication, and ISSN) and rearrange that data into a specific form/pattern, for example:
INPUT DATA: Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; Freire, Sérgio (2013) “Future-oriented activities as a concept for improved disaster risk management. Disaster Advances”, 6(12), 1-10. (IF = 2.272) E-ISSN 2278-4543. REVISTA INDEXADA NO WEB OF SCIENCE
AUTHORS:Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; Freire, Sérgio
YEAR: 2013
ISSN: 2278-4543
TEMPLATE 1: AUTHOR YEAR ISSN (Aubrecht, Christoph || 2013 || 2278-4543 )
TEMPLATE 2: YEAR ISSN AUTHOR (2013 || 2278-4543 || Aubrecht, Christoph )
TEMPLATE 3: ISSN YEAR AUTHOR (2278-4543 || 2013 || Aubrecht, Christoph )
The goal is then to import/export this data into Excel and then into an SQL database. From my research I concluded (not sure if correctly) that Django might be a good way to go, since it can create templates (https://docs.djangoproject.com/en/dev/ref/templates/api/), or even data frames using pandas (How to rearrange some elements as a data frame, Rearrange data for pandas dataframe?), but I'm not sure how to implement either, or whether that is even the best way to do it (or whether it's even possible for a program to recognize those elements and rearrange them).
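For what it's worth, a minimal sketch of the recognition step using plain regular expressions, assuming every record carries a 4-digit year in parentheses and an ISSN-style code as in the example above:
import re

record = ('Aubrecht, Christoph; Özceylan, Aubrecht Dilek; Klerx, Joachim; '
          'Freire, Sérgio (2013) "Future-oriented activities as a concept for '
          'improved disaster risk management. Disaster Advances", 6(12), 1-10. '
          '(IF = 2.272) E-ISSN 2278-4543.')

year = re.search(r'\((\d{4})\)', record).group(1)             # first (YYYY)
issn = re.search(r'\b(\d{4}-\d{3}[\dX])\b', record).group(1)  # NNNN-NNNC
authors = record.split(f'({year})')[0].strip()                # text before the year
first_author = authors.split(';')[0].strip()

print(f'{first_author} || {year} || {issn}')   # TEMPLATE 1
print(f'{year} || {issn} || {first_author}')   # TEMPLATE 2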
Similar questions I've searched were:
using a key to rearrange string (didn't help)
Rearranging letters from a list

Design pattern for parsing data that will be grouped to two different ways and flipped

I'm looking for an easily maintainable and extendable design model for a script to parse an excel workbook into two separate workbooks after pulling data from other locations like the command line, and a database. The high level details are as follows.
I need to parse an Excel workbook containing a sheet that lists unique question names. The only reliable information that can be parsed from a question name is the book code identifying the title and edition of the textbook the question is associated with; the rest of the question name is not standardized well enough to be reliably parsed by computer. The general form of the question name is best described by the following regular expression:
'^(\w+)\s(\w{1,2})\.(\w{1,2})\.(\w{1,3})\.(\w{1,3}\.)*$'
The first sub-pattern is the book code; the second is, 90% of the time, the chapter; and the rest of the sub-patterns could be section, problem type, problem number, or question type information. There is no simple logic, at least not one I can find.
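For illustration, a quick check of that pattern against a made-up question name (the name and book code are hypothetical):
import re

PATTERN = re.compile(r'^(\w+)\s(\w{1,2})\.(\w{1,2})\.(\w{1,3})\.(\w{1,3}\.)*$')

m = PATTERN.match('ALG2e 3.2.15.MC.')  # hypothetical question name
if m:
    book_code = m.group(1)       # the only reliable field
    maybe_chapter = m.group(2)   # usually, but not always, the chapter
    print(book_code, maybe_chapter)  # ALG2e 3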
There will be a minimum of three other columns in this spreadsheet: one column will be the chapter the question is associated with, the second the section within that chapter, and the third some kind of asset indicated by a uniform resource locator.
1 | 1 | qname1 | url | description | url | description ...
1 | 1 | qname2 | url | description
1 | 1 | qname3 | url | description | url | description | url |
The asset can be indicated by a full or partial uniform resource locator; a partial URL needs to be completed before it can be fed into the application. Theoretically there is no limit to the number of asset columns; the assets are grouped in columns by type. Sometimes additional data has to be retrieved from a database or combined with the book code before the asset URL is complete and can be understood by the application that will be using the asset.
The type is an abstraction. There are eight types right now, each with its own logic for how the uniform resource locator is handled and/or completed, and I have to add a new type and its logic every three or four months.
For each asset URL there may be a description column, a character string for display in the application, but not always. (I've already worked out validating the description text and squashing MS's obscure code page down to something 7-bit ASCII can handle.)
Now that all the details are filled in, I can get to the actual problem of parsing the file. I need to split the information in this Excel workbook into two separate workbooks. The first workbook groups all the questions by section in rows, with the first cell being the section doublet and the rest of the cells in the row the question names.
1.1 | qname1 | qname2 | qname3 | qname4 |
1.2 | qname1 | qname2 | qname3 |
1.3 | qname1 | qname2 | qname3 | qname4 | qname5
There is no set number of questions for each section as you can see from the above example.
The second workbook is more complicated: there is one row per asset, and question names that have more than one asset are duplicated. There will be four or five columns on this sheet. The first is the question name for the asset, the second a media type used to select the correct icon for the asset in the application, the third a string representing the asset type, the fourth the full and complete uniform resource locator for the asset, and the fifth the optional text description for the asset.
q1 | mtype1 | atype1 | url | description
q1 | mtype2 | atype2 | url | description
q1 | mtype2 | atype3 | url | description
q2 | mtype1 | atype1 | url | description
q2 | mtype2 | atype3 | url | description
For the original six types I had a script that parsed the source Excel workbook into the other two workbooks, and I was able to add two more types, until I ran aground on the implementation of the ninth and tenth types.
What broke my script was the fact that the ninth type is actually a sub-type of one of the original six, but with entirely different logic, and my mostly procedural script could not accommodate that without duplicating a lot of code. I also had a lot of bugs in the script, so this time around I will be writing the tests first.
I'm stuck with the format of the resulting two workbooks; this script is glue code, and development went ahead with the project without bothering to get a complete spec from the sponsor. I work for the same company as the developers but in the editorial department, editorial being co-sponsor of the project, and I am expected to fix pesky details like this (I'm foaming at the mouth as I type this).
I've tried factories, I've tried different object models, but each resulting workbook is so different that when I find a design that works for generating one workbook, the code is not really reusable for generating the other. What I would really like are ideas for a maintainable and extensible design for parsing the source workbook into both workbooks with maximum code reuse, and/or sympathy.
Not sure if I can help, but at the least, sympathy for you I do have :-)
Have you tried using Strategies? If you haven't, check out the link; there is even a simple Python example there. If your types differ only in the way they handle the URLs, you could encapsulate the different logics in strategy subclasses. In the worst case there may be duplicated logic between some subclasses, but at least the rest of your app can be happily oblivious to it, and adding new types should be simple. You might even be able to reuse part of the duplicated logic by, e.g., parameterizing strategies differently. I am entirely speculating here, without knowing the concrete details of your problem.
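A rough sketch of the idea, with entirely hypothetical type names and URL logic:
from abc import ABC, abstractmethod

class AssetStrategy(ABC):
    """One subclass per asset type; each owns its URL-completion logic."""
    @abstractmethod
    def complete_url(self, partial_url: str, book_code: str) -> str: ...

class PassThroughStrategy(AssetStrategy):
    def complete_url(self, partial_url, book_code):
        return partial_url  # URL is already complete

class BookPrefixStrategy(AssetStrategy):
    BASE = 'https://assets.example.com'  # hypothetical host
    def complete_url(self, partial_url, book_code):
        return f'{self.BASE}/{book_code}/{partial_url.lstrip("/")}'

# Registry keyed by asset type: a new type is one class plus one entry,
# and the rest of the script never needs to change.
STRATEGIES = {
    'passthrough': PassThroughStrategy(),
    'book_prefix': BookPrefixStrategy(),
}

def resolve(asset_type, partial_url, book_code):
    return STRATEGIES[asset_type].complete_url(partial_url, book_code)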
Hope this helps...
