I extract information from a website and am able to get the data. However, I am unable to extend the list with the data for 'K', although it works for 'J' and 'L'.
print "hello from python 2"
from lxml import html
import requests
import csv
import pandas as pd
import MySQLdb as mdb
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
bursa = ['J','K','L']
bursalist = []
print bursalist
for x in range (len(bursa)):
    try:
        page = requests.get('https://www.malaysiastock.biz/Listed-Companies.aspx?type=A&value='+bursa[x])
        tree = html.fromstring(page.content)
        tree1 = tree.xpath('//td/h3/a[contains(text(),"(")][not(contains(text(),"(F"))]/text()')
        tree2 = tree.xpath('.//td/h3/text()')
        len(tree1)
        alist = []
        for y in range (len(tree1)):
            a = tree1[y].split('(')
            b = a[1].split(')')
            alist.append(a[0])
            alist.append(b[0])
            alist.append(tree2[2*y])
            alist.append(tree2[2*y+1])
        a = alist
        bursalist.extend(alist)
        print bursalist
    except Exception:
        print "no data"
Notice that print bursalist only shows data from 'J' and 'L', while 'K' is missing. If I print alist instead, the 'K' data is there, but it never gets extended into bursalist.
Please advise if there is a more robust way to do this.
print alist shows:
['K1 ', '0111', 'K-ONE TECHNOLOGY BERHAD', 'Technology Equipment ', 'KAB ', '0193', 'KEJURUTERAAN ASASTERA BERHAD', 'Industrial Engineering ', 'KAMDAR ', '8672', 'KA
MDAR GROUP (M) BERHAD', 'Retailers ', 'KANGER ', '0170', 'KANGER INTERNATIONAL BERHAD', 'Household Goods ', 'KAREX ', '5247', 'KAREX BERHAD', 'Personal Goods ', 'KAR
YON ', '0054', 'KARYON INDUSTRIES BERHAD', 'Chemicals ', 'KAWAN ', '7216', 'KAWAN FOOD BERHAD', 'Food & Beverages ', 'KEINHIN ', '7199', 'KEIN HING INTERNATIONAL BER
HAD', 'Metals ', 'KEN ', '7323', 'KEN HOLDINGS BERHAD', 'Property ', 'KENANGA ', '6483', 'KENANGA INVESTMENT BANK BERHAD', 'Other Financials ', 'KERJAYA ', '7161', '
KERJAYA PROSPEK GROUP BERHAD', 'Construction ', 'KESM ', '9334', 'KESM INDUSTRIES BERHAD', 'Semiconductors ', 'KEYASIC ', '0143', 'KEY ASIC BERHAD', 'Semiconductors
', 'KFIMA ', '6491', 'KUMPULAN FIMA BERHAD', 'Diversified Industrials ', 'KGB ', '0151', 'KELINGTON GROUP BERHAD', 'Industrial Engineering ', 'KGROUP ', '0036', 'KEY
ALLIANCE GROUP BERHAD', 'Technology Equipment ', 'KHEESAN ', '6203', 'KHEE SAN BERHAD', 'Food & Beverages ', 'KHIND ', '7062', 'KHIND HOLDINGS BERHAD', 'Household G
oods ', 'KHJB ', '0210', 'KIM HIN JOO (MALAYSIA) BERHAD', 'Retailers ', 'KIALIM ', '6211', 'KIA LIM BERHAD', 'Building Materials ', 'KIMHIN ', '5371', 'KIM HIN INDUS
TRY BERHAD', 'Building Materials ', 'KIMLUN ', '5171', 'KIMLUN CORPORATION BERHAD', 'Construction ', 'KINSTEL ', '5060', 'KINSTEEL BHD', 'Metals ', 'KIPREIT ', '5280
', 'KIP REAL ESTATE INVESTMENT TRUST', 'Real Estate Investment Trusts ', 'KKB ', '9466', 'KKB ENGINEERING BERHAD', 'Industrial Engineering ', 'KLCC ', '5235SS', 'KLC
C PROPERTY HOLDINGS BERHAD', 'Real Estate Investment Trusts ', 'KLCI1XI ', '0835EA', 'KENANGA KLCI DAILY (-1X) INVERSE ETF', 'KENANGA KLCI DAILY 2X LEVERAGED ETF', '
KLCI2XL ', '0834EA', 'KUALA LUMPUR KEPONG BERHAD', 'Plantation ', 'KLK ', '2445', 'KLUANG RUBBER COMPANY (MALAYA) BERHAD', 'Plantation ', 'KLUANG ', '2453', 'KIM LOONG RESOURCES BERHAD', 'Plantation ', 'KMLOONG ', '5027', 'KNM GROUP BERHAD', 'Other Energy Resources ', 'KNM ', '7164', 'KNUSFORD BERHAD', 'Industrial Services ', 'KNUSFOR ', '5035', 'KOBAY TECHNOLOGY BERHAD', 'Industrial Materials, Components & Equipment ', 'KOBAY ', '6971', 'KOMARKCORP BERHAD', 'Packaging Materials ', 'KOMARK ', '7017', 'KOSSAN RUBBER INDUSTRIES BERHAD', 'Health Care Equipment & Services ', 'KOSSAN ', '7153', 'KOTRA INDUSTRIES BERHAD', 'Pharmaceuticals ', 'KOTRA ', '0002', 'KPJ HEALTHCARE BERHAD', 'Health Care Providers ', 'KPJ ', '5878', 'KUMPULAN POWERNET BERHAD', 'Personal Goods ', 'KPOWER ', '7130', 'KERJAYA PROSPEK PROPERTY BERHAD', 'Property ', 'KPPROP ', '7077', 'KUMPULAN PERANGSANG SELANGOR BERHAD', 'Diversified Industrials ', 'KPS ', '5843', 'KPS CONSORTIUM BERHAD', 'Wood & Wood Products ', 'KPSCB ', '9121', 'KRETAM HOLDINGS BERHAD', 'Plantation ', 'KRETAM ', '1996', 'KRONOLOGI ASIA BERHAD', 'Digital Services ', 'KRONO ', '0176', 'KECK SENG (MALAYSIA) BERHAD', 'Diversified Industrials ', 'KSENG ', '3476', 'KSL HOLDINGS BERHAD', 'Property ', 'KSL ', '5038', 'K.SENG SENG CORPORATION BERHAD', 'Industrial Materials, Components & Equipment ', 'KSSC ', '5192', 'K-STAR SPORTS LIMITED', 'Personal Goods ', 'KSTAR ', '5172', 'KONSORTIUM TRANSNASIONAL BERHAD', 'Travel, Leisure & Hospitality ', 'KTB ', '4847', 'KIM TECK CHEONG CONSOLIDATED BERHAD', 'Consumer Services ', 'KTC ', '0180', 'KUB MALAYSIA BERHAD', 'Industrial Services ', 'KUB ', '6874', 'KUCHAI DEVELOPMENT BERHAD', 'Other Financials ', 'KUCHAI ', '2186', 'KWANTAS CORPORATION BERHAD', 'Plantation ', 'KWANTAS ', '6572', 'KYM HOLDINGS BERHAD', 'Pack
aging Materials ', 'KYM ', '8362']
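One possible explanation, offered as a guess from the output above rather than a confirmed diagnosis: the 'K' list ends with a name and code but no company or sector, which is what you would see if tree2 ran out of items and tree2[2*y] raised an IndexError. The broad except would then print "no data" and skip bursalist.extend(alist). The sketch below is Python 3, uses no network access, and the names collect and pages are hypothetical; it catches the error per entry so that one incomplete entry does not discard the whole page.

```python
def collect(pages):
    """pages maps a letter to (labels, details).

    labels look like 'K1 (0111)'; details is a flat list holding a
    company name and a sector for each label. Both are stand-ins for
    tree1 and tree2 in the question.
    """
    result = []
    for letter, (labels, details) in pages.items():
        for i, label in enumerate(labels):
            name, _, rest = label.partition('(')
            code = rest.partition(')')[0]
            try:
                company, sector = details[2 * i], details[2 * i + 1]
            except IndexError:
                # Only this entry is incomplete; keep what was gathered so far.
                print("incomplete entry for %s on page %s" % (name.strip(), letter))
                continue
            result.extend([name, code, company, sector])
    return result


# Two entries taken from the output above; the ETF has no sector text.
pages = {
    'K': (
        ['K1 (0111)', 'KLCI1XI (0835EA)'],
        ['K-ONE TECHNOLOGY BERHAD', 'Technology Equipment ',
         'KENANGA KLCI DAILY (-1X) INVERSE ETF'],
    ),
}
bursalist = collect(pages)
print(bursalist)
```

This only stops the data from being lost. When an entry in the middle of the page has no sector, every later entry is still shifted by one, as the output above suggests, so pairing each name with the cells of its own table row would probably be the more robust fix.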
I have the following list of tuples.
lst =
[
('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])
]
I want to flatten the lists in each tuple to be part of the tuples of lst.
I found "How do I make a flat list out of a list of lists?" but have no idea how to adapt it to my case.
You can use unpacking:
lst = [('LexisNexis', ['IT Services and IT Consulting ', ' New York City, NY']),
('AbacusNext', ['IT Services and IT Consulting ', ' La Jolla, California']),
('Aderant', ['Software Development ', ' Atlanta, GA']),
('Anaqua', ['Software Development ', ' Boston, MA']),
('Thomson Reuters Elite', ['Software Development ', ' Eagan, Minnesota']),
('Litify', ['Software Development ', ' Brooklyn, New York'])]
output = [(x, *l) for (x, l) in lst]
print(output)
# [('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
# ('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
# ('Aderant', 'Software Development ', ' Atlanta, GA'),
# ('Anaqua', 'Software Development ', ' Boston, MA'),
# ('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
# ('Litify', 'Software Development ', ' Brooklyn, New York')]
I've found the answer by Deacon, which uses abc from collections. It is worth trying too.
from collections import abc

def flatten(obj):
    for o in obj:
        # Flatten any iterable class except for strings.
        if isinstance(o, abc.Iterable) and not isinstance(o, str):
            yield from flatten(o)
        else:
            yield o

[tuple(flatten(i)) for i in lst]
Out[47]:
[('LexisNexis', 'IT Services and IT Consulting ', ' New York City, NY'),
('AbacusNext', 'IT Services and IT Consulting ', ' La Jolla, California'),
('Aderant', 'Software Development ', ' Atlanta, GA'),
('Anaqua', 'Software Development ', ' Boston, MA'),
('Thomson Reuters Elite', 'Software Development ', ' Eagan, Minnesota'),
('Litify', 'Software Development ', ' Brooklyn, New York')]
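For the data in the question both approaches give the same result. They differ only when the nesting goes deeper than one level. A small sketch of that difference, using a made-up record (the extra nesting is hypothetical and not in the original data):

```python
from collections import abc


def flatten(obj):
    for o in obj:
        # Recurse into any iterable except strings.
        if isinstance(o, abc.Iterable) and not isinstance(o, str):
            yield from flatten(o)
        else:
            yield o


# Hypothetical record with one more level of nesting than in the question.
record = ('Aderant', ['Software Development ', [' Atlanta', ' GA']])

name, details = record
one_level = (name, *details)         # unpacking removes exactly one level
all_levels = tuple(flatten(record))  # the generator removes every level

print(one_level)
print(all_levels)
```

Unpacking is the simpler choice when every tuple is known to be (name, list); the recursive generator is the safer one when the depth is not known in advance.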
I am using the lxml xpath of python. I am able to extract text if I give the full path to a HTML tag. However I can't extract all the text from a tag and it's child elements into a list. So for example given this html I would like to get all the texts of the "example" class:
<div class="example">
"Some text"
<div>
"Some text 2"
<p>"Some text 3"</p>
<p>"Some text 4"</p>
<span>"Some text 5"</span>
</div>
<p>"Some text 6"</p>
</div>
I would like to get:
["Some text", "Some text 2", "Some text 3", "Some text 4", "Some text 5", "Some text 6"]
mzjn's answer is correct. After some trial and error I've managed to get it working, and this is what the final code looks like. You need to put //text() at the end of the XPath. It has not been refactored yet, so there will definitely be some mistakes and bad practices, but it works.
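import urllib.request

import requests
from bs4 import BeautifulSoup
from lxml import html
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed connections a few times before giving up.
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
page = session.get("The url you are webscraping")
content = page.content
htmlsite = urllib.request.urlopen("The url you are webscraping")
soup = BeautifulSoup(htmlsite, 'lxml')
htmlsite.close()
tree = html.fromstring(content)
# Attributes are addressed with @ in XPath; the trailing //text() collects the text of all descendants.
scraped = tree.xpath('//html[contains(@class, "no-js")]/body/div[contains(@class, "container")]/div[contains(@class, "content")]/div[contains(@class, "row")]/div[contains(@class, "col-md-6")]/div[contains(@class, "clearfix")]//text()')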
I've tried it out on the team introduction page of keeleyteton.com. It returned the following list which is correct (although needs lots of amending!) because they are in different tags and some are children tags. Thank you for the help!
['\r\n ', '\r\n ', 'Nicholas F. Galluccio', '\r\n ', '\r\n ', 'Managing Director and Portfolio Manager', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Scott R. Butler', '\r\n ', '\r\n ', 'Senior Vice President and Portfolio Manager ', '\r\n ', 'Teton Small Cap Select Value', '\r\n ', 'Keeley Teton Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Thomas E. Browne, Jr., CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Brian P. Leonard, CFA', '\r\n ', '\r\n
', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Robert M. Goldsborough', '\r\n ', '\r\n ', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Mid Cap Dividend Value', '\r\n ', '\r\n ', '\r\n ', 'Brian R. Keeley, CFA', '\r\n ', '\r\n ', 'Portfolio Manager', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Edward S. Borland', '\r\n ', '\r\n
', 'Research Analyst', '\r\n ', 'Keeley Teton Small and Small Mid Cap Value', '\r\n ', '\r\n ', '\r\n ', 'Kevin M. Keeley', '\r\n ', '\r\n ', 'President', '\r\n
', '\r\n ', '\r\n ', 'Deanna B. Marotz', '\r\n ', '\r\n ', 'Chief Compliance Officer', '\r\n ']
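The '\r\n ' entries in the list above are whitespace-only text nodes sitting between the tags. One way to clean them up is to strip every string and drop the empty ones. The sketch below applies this to the example markup from the question. To stay self-contained it uses the standard library's xml.etree.ElementTree instead of lxml. Its itertext() walks the text of an element and all its children in document order, much like //text() does, and lxml elements offer itertext() as well. Removing the double quotes is an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

snippet = '''<div class="example">
"Some text"
<div>
"Some text 2"
<p>"Some text 3"</p>
<p>"Some text 4"</p>
<span>"Some text 5"</span>
</div>
<p>"Some text 6"</p>
</div>'''

root = ET.fromstring(snippet)

# Strip surrounding whitespace and the literal double quotes from each text node.
texts = [t.strip().strip('"') for t in root.itertext()]
# Drop the nodes that contained nothing but whitespace.
texts = [t for t in texts if t]

print(texts)
```

The same two list comprehensions can be applied directly to the list returned by tree.xpath('...//text()').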
I'm trying to write a nested dictionary to a CSV file and running into issues; either the file doesn't write anything, or it errors out.
The dictionary looks something like this:
finalDict = 'How would you rate the quality of the product?': [{'10942625544': 'High '
'quality'},
{'10942625600': 'Neither '
'high nor '
'low '
'quality'},
{'10942625675': 'Neither '
'high nor '
'low '
'quality'},
{'10942625736': 'Very high '
'quality'},
{'10942625788': 'Neither '
'high nor '
'low '
'quality'},
{'10942625827': 'Neither '
'high nor '
'low '
'quality'},
{'10942625878': 'Neither '
'high nor '
'low '
'quality'},
{'10942625932': 'High '
'quality'},
{'10942625977': 'High '
'quality'},
{'10942626027': 'Neither '
'high nor '
'low '
'quality'},
{'10942626071': 'High '
'quality'},
{'10942626128': 'High '
'quality'},
{'10942626180': 'Very high '
'quality'},
{'10942626227': 'Very high '
'quality'},
{'10942626278': 'High '
'quality'},
{'10942626332': 'Low '
'quality'},
{'10942626375': 'Very high '
'quality'},
{'10942626430': 'Low '
'quality'},
{'10942626492': 'Low '
'quality'}],
'How would you rate the value for money of the product?': [{'10942625544': 'Above '
'average'},
{'10942625600': 'Below '
'average'},
{'10942625675': 'Average'},
{'10942625736': 'Excellent'},
{'10942625788': 'Above '
'average'},
{'10942625827': 'Below '
'average'},
{'10942625878': 'Average'},
{'10942625932': 'Average'},
{'10942625977': 'Above '
'average'},
{'10942626027': 'Above '
'average'},
{'10942626071': 'Above '
'average'},
{'10942626128': 'Average'},
{'10942626180': 'Excellent'},
{'10942626227': 'Average'},
{'10942626278': 'Average'},
{'10942626332': 'Below '
'average'},
{'10942626375': 'Excellent'},
{'10942626430': 'Poor'},
{'10942626492': 'Below '
'average'}],
I've tried working off of Write Nested Dictionary to CSV but am struggling to adapt it to my specific case.
My code currently looks like:
def writeToCsv(finalDict):
    csv_columns = ['Question', 'UserID', 'Answer']
    filename = "output.csv"
    with open(filename, "w") as filename:
        w = csv.DictWriter(filename, fieldnames=csv_columns)
        w.writeheader()
        for data in finalDict: #where I'm stuck
Any recommendations would be appreciated!
This is an option:
import csv

def writeToCsv(finalDict):
    csv_columns = ['Question', 'UserID', 'Answer']
    filename = "output.csv"
    with open(filename, "w") as fl:
        w = csv.DictWriter(fl, fieldnames=csv_columns, lineterminator='\n')
        w.writeheader()
        # One CSV row per (question, user, answer) triple.
        for question, data in finalDict.items():
            for item in data:
                for user, answer in item.items():
                    w.writerow(dict(zip(csv_columns, (question, user, answer))))
An equivalent loop that builds each row explicitly (it goes inside the with block, in place of the loop above):
        for question, data in finalDict.items():
            for resp in data:
                row = {'Question': question,
                       'UserID': list(resp.keys())[0],
                       'Answer': list(resp.values())[0]}
                w.writerow(row)
I have tried a lot of suggestions but I am unable to remove carriage returns. I am new to Python and am trying it out by cleaning a CSV file.
import csv
filepath_i = 'C:\Source Files\Data Source\Flat File Source\PatientRecords.csv'
filepath_o = 'C:\Source Files\Data Source\Flat File Source\PatientRecords2.csv'
rows = []
with open(filepath_i, 'rU', newline='') as csv_file:
    #filtered = (line.replace('\r\n', '') for line in csv_file)
    filtered = (line.replace('\r', '') for line in csv_file)
    csv_reader = csv.reader(csv_file, delimiter=',')
    i = 0
    for row in csv_reader:
        print(row)
        i = i + 1
        if(i == 10):
            break
#with open(filepath_o, 'w',newline='' ) as writeFile:
#    writer = csv.writer(writeFile,lineterminator='\r')
#    for row in csv_reader:
#        #rows.append(row.strip())
#        rows.append(row.strip())
#    writer.writerows(rows)
Input
DRG Definition,Provider Id,Provider Name,Provider Street Address,Provider City,Provider State,Provider Zip Code,Hospital Referral Region Description,Hospital Category,Hospital Type, Total Discharges ,Covered Charges , Total Payments ,Medicare Payments
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10001,SOUTHEAST ALABAMA MEDICAL CENTER,1108 ROSS CLARK CIRCLE,DOTHAN,AL,36301,AL - Dothan,Specialty Centers,Government Funded,91,"$32,963.07 ","$5,777.24 ","$4,763.73 "
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10005,MARSHALL MEDICAL CENTER SOUTH,"2505 U S HIGHWAY
431 NORTH",BOAZ,AL,35957,AL - Birmingham,Specialty Centers,Private Institution,14,"$15,131.85 ","$5,787.57 ","$4,976.71 "
039 - EXTRACRANIAL PROCEDURES W/O CC/MCC,10006,ELIZA COFFEE MEMORIAL HOSPITAL,205 MARENGO STREET,FLORENCE,AL,35631,AL - Birmingham,Rehabilitation Centers,Private Institution,24,"$37,560.37 ","$5,434.95 ","$4,453.79 "
Output (4th column 'Provider Street Address')
['DRG Definition', 'Provider Id', 'Provider Name', 'Provider Street Address', 'Provider City', 'Provider State', 'Provider Zip Code', 'Hospital Referral Region Description', 'Hospital Category', 'Hospital Type', ' Total Discharges ', 'Covered Charges ', ' Total Payments ', 'Medicare Payments']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10001', 'SOUTHEAST ALABAMA MEDICAL CENTER', '1108 ROSS CLARK CIRCLE', 'DOTHAN', 'AL', '36301', 'AL - Dothan', 'Specialty Centers', 'Government Funded', '91', '$32,963.07 ', '$5,777.24 ', '$4,763.73 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10005', 'MARSHALL MEDICAL CENTER SOUTH', '2505 U S HIGHWAY \n431 NORTH', 'BOAZ', 'AL', '35957', 'AL - Birmingham', 'Specialty Centers', 'Private Institution', '14', '$15,131.85 ', '$5,787.57 ', '$4,976.71 ']
I ran this on my side and it works:
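# The 'U' mode flag was removed in Python 3.11, so plain 'r' is used here.
with open(filepath_i, 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        # Remove line breaks embedded in the street address (4th column).
        row[3] = row[3].replace("\n","").replace("\r","")
        print(row)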
Output:
['DRG Definition', 'Provider Id', 'Provider Name', 'Provider Street Address', 'Provider City', 'Provider State', 'Provider Zip Code', 'Hospital Referral Region Description', 'Hospital Category', 'Hospital Type', ' Total Discharges ', 'Covered Charges ', ' Total Payments ', 'Medicare Payments']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10001', 'SOUTHEAST ALABAMA MEDICAL CENTER', '1108 ROSS CLARK CIRCLE', 'DOTHAN', 'AL', '36301', 'AL - Dothan', 'Specialty Centers', 'Government Funded', '91', '$32,963.07 ', '$5,777.24 ', '$4,763.73 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10005', 'MARSHALL MEDICAL CENTER SOUTH', '2505 U S HIGHWAY 431 NORTH', 'BOAZ', 'AL', '35957', 'AL - Birmingham', 'Specialty Centers', 'Private Institution', '14', '$15,131.85 ', '$5,787.57 ', '$4,976.71 ']
['039 - EXTRACRANIAL PROCEDURES W/O CC/MCC', '10006', 'ELIZA COFFEE MEMORIAL HOSPITAL', '205 MARENGO STREET', 'FLORENCE', 'AL', '35631', 'AL - Birmingham', 'Rehabilitation Centers', 'Private Institution', '24', '$37,560.37 ', '$5,434.95 ', '$4,453.79 ']
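If the goal is a cleaned copy of the file rather than printed rows, the same replacement can be applied to every field while writing. This is a minimal sketch under a few assumptions: the function name strip_line_breaks is made up, it cleans all columns rather than only the fourth, and it uses in-memory buffers in place of the real files so that it runs anywhere.

```python
import csv
import io


def strip_line_breaks(src, dst):
    """Copy CSV rows from src to dst, removing \\r and \\n inside fields."""
    reader = csv.reader(src, delimiter=',')
    writer = csv.writer(dst, lineterminator='\n')
    for row in reader:
        writer.writerow([f.replace('\r', '').replace('\n', '') for f in row])


# A shortened version of the problem row from the question.
source = io.StringIO(
    '10005,MARSHALL MEDICAL CENTER SOUTH,"2505 U S HIGHWAY \r\n431 NORTH",BOAZ\r\n'
)
target = io.StringIO()
strip_line_breaks(source, target)
print(target.getvalue())
```

With real files, open both with newline='' as the csv module documentation recommends, for example open(filepath_i, newline='') for reading and open(filepath_o, 'w', newline='') for writing.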
import numpy as np
import pandas as pd
I am trying to read a CSV file using pandas.
This is the data that I scraped.
Please note that each line starts and ends with square brackets [] (maybe it's a list). What should I write so that the entire data is in table form? I don't know how to separate the brackets from the data.
[]
['Auburn University (Online Master of Business Administration with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' /Campus ', ' Raymond J. Harbert College of Business ']
['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']
['The University of Alabama (Master of Science in Marketing, Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Manderson Graduate School of Business ']
['The University of Alabama (MS in Operations Management - Decision Analytics Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (M.S. degree in Applied Statistics, Data Mining Track)', ' Masters ', ' US', ' AL', ' /Campus ', ' Manderson Graduate School of Business ']
['The University of Alabama (MBA with concentration in Business Analytics)', ' Masters ', ' US', ' AL', ' Online/ ', ' Culverhouse College of Commerce ']
['Arkansas Tech University (Business Data Analytics)', ' Bachelors ', ' US', ' AR', ' /Campus ', ' Business ']
['University of Arkansas (Graduate Certificate in Business Analytics)', ' Certificate ', ' US', ' AR', ' Online/ ', ' Sam M. Walton College of Business ']
['University of Arkansas (Master of Information Systems with Business Analytics Concentration)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of Business ']
['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of
How should I read this CSV file? I want all the data in table form. Please help.
Your problem is exactly what the error message is telling you it is. The error is in parsing this line:
['The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics)', ' Masters ', ' US', ' AL', '
Online/ ', ' Manderson Graduate School of Business ']
The code ignores quote characters and breaks the line up into fields, making a break wherever it finds the delimiter ", ". You're expecting this to be a single field:
The University of Alabama (Master of Science in Marketing,
Specialization in Marketing Analytics
but this "field" has an instance of the delimiter ", " in it, which the CSV parser will honor because it is ignoring the fact that you have this value in quotes. So this piece of data is broken into two fields:
['The University of Alabama (Master of Science in Marketing
and
Specialization in Marketing Analytics)'
This results in the line being broken into 7 fields, and your code is expecting only 6.
Note that in addition, your items are going to include the quotes, which may not be what you're expecting either, and those square braces don't belong there. In short, this isn't a well formed CSV file.
UPDATE: I'm a regex weenie. I do everything with regex expressions, and can't ignore a challenge like this. Here's a regex-based solution that will read exactly what you want out of this data. If you want it to recognize the last line of your data, you should add "']" to the end of that line.
import regex
from pprint import pprint

def parse_file(file):
    linepat = regex.compile(r"\[\s*('([^']*)')?(\s*,\s*'([^']*)')*\s*\]")
    with open(file) as f:
        r = []
        while True:
            line = f.readline()
            if not line:
                break
            line = line.strip()
            if len(line) == 0:
                continue
            m = linepat.match(line)
            if m and m.captures(4):
                fields = [m.group(2)] + [s.strip() for s in m.captures(4)]
                r.append(fields)
    return r

def main():
    r = parse_file("/tmp/blah.csv")
    pprint(r)

main()
Result:
[['Auburn University (Online Master of Business Administration with '
'concentration in Business Analytics)',
'Masters',
'US',
'AL',
'/Campus',
'Raymond J. Harbert College of Business'],
...
['University of Arkansas (Professional Master of Information Systems)',
'Masters',
'US',
'AR',
'/Campus',
'Sam M. Walton College of']]
Note that this doesn't use the built-in 're' module. That module doesn't deal with repeating groups, which is a must for this kind of problem. Also note that this doesn't involve Pandas. I don't know anything about that module, I assume it is trivial to feed the clean, parsed data from this code into Pandas if that's where you really want it.
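As a different technique from the regex above, and only as a sketch: each line of the scraped data looks like a printed Python list, so the standard library's ast.literal_eval can parse it directly, quotes and embedded commas included. The function name parse_lines is made up for illustration. Lines that are not complete list literals, such as the truncated last line, are skipped here rather than repaired.

```python
import ast


def parse_lines(lines):
    """Turn lines that look like Python list literals into lists of clean strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            fields = ast.literal_eval(line)
        except (ValueError, SyntaxError):
            # Not a complete literal, e.g. a line cut off before the closing "']".
            continue
        if isinstance(fields, list) and fields:  # also skips the empty "[]" line
            rows.append([str(f).strip() for f in fields])
    return rows


sample = [
    "[]",
    "['Auburn University (Data Science)', ' Bachelors ', ' US', ' AL', ' /Campus ', ' Business ']",
    "['University of Arkansas (Professional Master of Information Systems)', ' Masters ', ' US', ' AR', ' /Campus ', ' Sam M. Walton College of",
]
rows = parse_lines(sample)
print(rows)
```

The resulting list of lists can be passed to pd.DataFrame(rows) to get the table; column names are not part of the data and would have to be supplied separately.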
The basic method to read file.csv:
def process(string):
    print("Processing:", string)

data = []
with open("file.csv") as f:
    for line in f:
        # Drop the trailing newline before handing the line on.
        line = line.replace("\n", "")
        process(line)
        data.append(line)