Creating a corpus from different JSON files - python

I would like to create a corpus composed of the body of different articles stored in JSON format. They are in different files named after the year, for example:
with open('Scot_2005.json') as f:
    data = [json.loads(line) for line in f]
corresponds to a newspaper, the Scotsman, for the year 2005. Moreover, the rest of the files for this newspaper are named APJ_2006 ... APJ_2015. Also, I have another newspaper, the Scottish Daily Mail, that covers only the years 2014-2015: SDM_2014, SDM_2015. I would like to create a common list with the body of all these articles:
doc_set = [d['body'] for d in data]
My problem is looping the first part of the code that I posted so that data corresponds to all articles rather than just the ones from a given newspaper in a given year. Any ideas on how to accomplish this task? In my attempt I tried using pandas like so:
for i in range(2005, 2016):
    df = pandas.DataFrame([json.loads(l) for l in open('Scot_%d.json' % i)])
    doc_set = df.body
The problems with this method seem to me to be: it overwrites doc_set on each iteration rather than appending all years, and I am not sure how to include other newspapers with time intervals other than 2005-15. The outcome of this method looks like:
date
2015-12-31 The Institute of Directors (IoD) has added its...
2015-12-31 It is startling to see how much the Holyrood l...
2015-12-31 A hike in interest rates in the new year will ...
2015-12-31 The First Minister has resolved to make 2016 a...
2015-12-30 The Scottish Government announced yesterday th...
2015-12-30 The Footsie closed lower amid falling oil pric...
2015-12-28 BEFORE we start the guessing game for 2016, a ...
2015-12-27 AS WE ushered in 2015, few would have predicte...
2015-12-23 No matter how hard Derek McInnes and his Aberd...
2015-12-21 THE HEAD of a Scottish Government task force s...
2015-12-17 A Scottish local authority has fought off a le...
2015-12-17 Markets lifted after the Federal Reserve hiked...
2015-12-17 Significant increases in UK quotas for fish in...
2015-12-17 WAR of words with Donald Trump suggests its ti...
2015-12-16 SCOTLAND'S national performance companies have...
2015-12-15 Markets jumped ahead of what investors expect ...
2015-12-14 Political uncertainty in back seat as transpor...
2015-12-11 The International Monetary Fund (IMF) has warn...
2015-12-08 Scotland has a "spring in its step" with the j...
2015-12-07 London's leading share index struggled for dir...
2015-12-03 REDUCING carbon is just the start of it, write...
2015-11-26 One of the country's most prized salmon rivers...
2015-11-23 Tax and legislative changes undermine strong f...
2015-11-23 A second House of Lords committee has called f...
2015-11-14 At first glance, Scotland's economic performan...
2015-11-13 THE United States has long been viewed as the ...
2015-11-12 IT IS vital for a new governance group to rest...
2015-11-12 Former SSE chief Ian Marchant has criticised r...
2015-11-11 Telecoms firm TalkTalk said it will take a hit...
2015-11-09 Improvements to consumer rights legislation ma...
...
2015-02-25 Traders baulked at an assault on the 7,000 lev...
2015-02-24 BRITISH military personnel are to be deployed ...
2015-02-20 DAVID Cameron has announced a £859 million inv...
2015-02-16 Falling oil prices and slowing inflation have ...
2015-02-14 DEFENCE spending cuts and falling oil prices h...
2015-02-14 Brent crude rallied to a 2015 high and helped ...
2015-02-12 THE HOUSING markets in Scotland and Northern I...
2015-02-10 INVESTMENT in Scotland's commercial property m...
2015-02-09 Investors took flight after Greece's new gover...
2015-02-01 Experts say large numbers are delaying decisio...
2015-01-29 MORE than 300 jobs are at risk after Tesco sai...
2015-01-27 THE Three Bears have hit out at the Rangers bo...
2015-01-21 GEORGE Osborne has challenged the right of SNP...
2015-01-19 Employment figures this week should show Briti...
2015-01-19 Why haven't petrol pump prices fallen as fast ...
2015-01-18 Without an agreement on immediate action, the...
2015-01-17 A SECOND independence referendum could be trig...
2015-01-14 THE RETAILER, which like its rivals has come u...
2015-01-14 HOUSE prices in Scotland rose by more than 4 p...
2015-01-13 HOUSE builder Taylor Wimpey is preparing for a...
2015-01-13 Supermarket group Sainsbury's today said it wo...
2015-01-13 INFLATION has tumbled to its lowest level on r...
2015-01-12 BUSINESSES are bullish about their ­prospects ...
2015-01-11 FOR decades, oil has dripped through our natio...
2015-01-09 Shares in the housebuilding sector fell heavil...
2015-01-08 THE Bank of England is expected to leave inter...
2015-01-05 COMPANIES in Scotland are more optimistic abou...
2015-01-04 UK is doing OK, but uncertainty looms on mid-y...
2015-01-02 The London market began the new year in a subd...
2015-01-02 The famous election mantra of Bill Clinton's c...
Name: body, dtype: object

Assuming you have a file list:
file_name_list = ( 'Scot_2005.json', 'APJ_2006.json' )
You can append to a list like this:
data = list()
for file_name in file_name_list:
    with open(file_name, 'r') as json_file:
        for line in json_file:
            data.append(json.loads(line))
If you want to create the file_name_list programmatically, you can use the glob library.
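For instance, a sketch that builds the list from each paper's prefix and year range (the prefixes and ranges below are taken from the question; adjust them to match your actual filenames), then collects every article body into one list:

```python
import json

# Assumed mapping of newspaper prefix -> (first year, last year)
papers = {'Scot': (2005, 2015), 'SDM': (2014, 2015)}

file_name_list = [
    f'{prefix}_{year}.json'
    for prefix, (start, end) in papers.items()
    for year in range(start, end + 1)
]

def load_bodies(file_names):
    """Collect the 'body' field of every article in every file."""
    doc_set = []
    for file_name in file_names:
        with open(file_name) as json_file:
            for line in json_file:
                doc_set.append(json.loads(line)['body'])
    return doc_set
```

This keeps the year ranges per paper in one place, so adding a third newspaper is a one-line change.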

Why isn't replace working in pandas dataframe?

I'm trying to parse some data and I cannot seem to use .replace to remove the junk, i.e. [BluRay] and other data that is not year or resolution. My end goal is to end up with columns of: Movie Name, Year and Resolution.
,Movie Name,others
0,James Bond The Spy Who Loved Me ,1977) [1080p]
1,James Bond Live And Let Die ,1973) [1080p]
2,No Time To Die ,2021) [1080p] [BluRay] [5.1] [YTS.MX]
3,James Bond The Man With The Golden Gun ,1974) [1080p]
4,Casino Royale ,2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5,James Bond Moonraker ,1979) [1080p]
6,James Bond Licence To Kill ,1989) [1080p]
7,James Bond A View To A Kill ,1985) [1080p]
8,James Bond The Living Daylights ,1987) [1080p]
The code I'm using is:
df['others']=df['others'].replace(to_replace=[['','BluRay']],value='')
Can anyone see where I'm going wrong?
Given:
Movie Name others
0 James Bond The Spy Who Loved Me 1977) [1080p]
1 James Bond Live And Let Die 1973) [1080p]
2 No Time To Die 2021) [1080p] [BluRay] [5.1] [YTS.MX]
3 James Bond The Man With The Golden Gun 1974) [1080p]
4 Casino Royale 2006) [2160p] [4K] [BluRay] [5.1] [YTS.MX]
5 James Bond Moonraker 1979) [1080p]
6 James Bond Licence To Kill 1989) [1080p]
7 James Bond A View To A Kill 1985) [1080p]
8 James Bond The Living Daylights 1987) [1080p]
To clean this up I would do:
df['Movie Name'] = df['Movie Name'].str.strip()
df[['Year', 'Resolution']] = df['others'].str.extract(r'(\d{4})\).*\[(.*p)]')
print(df[['Movie Name', 'Year', 'Resolution']])
Output:
Movie Name Year Resolution
0 James Bond The Spy Who Loved Me 1977 1080p
1 James Bond Live And Let Die 1973 1080p
2 No Time To Die 2021 1080p
3 James Bond The Man With The Golden Gun 1974 1080p
4 Casino Royale 2006 2160p
5 James Bond Moonraker 1979 1080p
6 James Bond Licence To Kill 1989 1080p
7 James Bond A View To A Kill 1985 1080p
8 James Bond The Living Daylights 1987 1080p
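As an aside, the reason the original .replace call did nothing is that Series.replace matches whole cell values by default; passing regex=True makes it do substring replacement instead. A minimal sketch on one sample value (the pattern below is an assumption covering the junk tags shown in the question):

```python
import pandas as pd

df = pd.DataFrame({'others': ['2021) [1080p] [BluRay] [5.1] [YTS.MX]']})
# With regex=True, replace matches substrings instead of entire cells.
df['others'] = df['others'].replace(r'\s*\[(BluRay|5\.1|YTS\.MX)\]', '', regex=True)
print(df['others'].iloc[0])  # 2021) [1080p]
```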

Python web scraper will not work on deeply nested tags

This is my second week of coding in Python. I wanted to write a scraper that would return a location and its phone number for all locations. The scraper is incomplete and I have tried a few versions but they all return either an empty list [] or an error.
import requests
from bs4 import BeautifulSoup
webpage_response = requests.get('https://www.orangetheory.com/en-us/locations/')
webpage = webpage_response.content
soup = BeautifulSoup(webpage, "html.parser")
soup.find_all(attrs={'class': 'aria-label'})
To get info about all fitness clubs in the USA from that page, you can use the following example:
import requests
import pandas as pd
url = "https://api.orangetheory.co/partners/v2/studios?country=United%20States&sort=studioName"
data = requests.get(url).json()
df = pd.DataFrame([d[0] for d in data["data"]])
df = pd.concat([df, df.pop("studioLocation").apply(pd.Series)], axis=1)
print(df)
df.to_csv("data.csv", index=False)
Prints:
studioId studioUUId mboStudioId studioName studioNumber description studioStatus openDate reOpenDate taxRate logoUrl contactEmail timeZone environment studioProfiles physicalAddress physicalCity physicalState physicalPostalCode physicalRegion physicalCountryId physicalCountry phoneNumber latitude longitude
0 2266 f627d35c-9e2b-452a-8017-bfbcccff5a4d 610952.0 14th Street, DC 0943 The science of excess post-exercise oxygen consumption(EPOC) takes your results to new heights in this exciting group fitness concept. You will feel new energy and see amazing results with only 2-4 workouts per week. Each 60-minute class is broken into intervals of high-energy cardiovascular training and strength training. Use a variety of equipment including treadmills, rowing machines, suspension training, and free weights to tone your body and gain energy throughout the day. Exciting and inspiring group classes motivate you to beat plateaus and stick to your goals. Pay-as-you-go or get deep discounts with customized packages.\r\n\r\nThe best part of the Orange Experience is the results. You can burn calories for up to 38 hours after your workout! Active 2017-09-02 00:00:00 2020-06-27 00:00:00 0.000000 https://clients.mindbodyonline.com/studios/OrangetheoryFitnessWashingtonDC0943/logo_mobile.png?imageversion=1513008044 studiomanager0943#orangetheoryfitness.com America/New_York PROD {'isWeb': 1, 'introCapacity': 1, 'isCrm': 1} 1925 14th Street NW Suite C Washington District of Columbia 20009 DC-01 1 United States 2028691700 38.91647339 -77.03197479
1 47914 01ddd24d-58bf-4959-bcb2-34587d6e48fc 660917.0 2021 Virtual Summit 10001 None Coming Soon None None 0.000000 None studiomanager10001#orangetheoryfitness.com America/New_York PROD {'isWeb': 0, 'introCapacity': 0, 'isCrm': 0} * * * * NV-01 1 United States -81.66339500 -15.58054000
2 2964 9fd74853-4bad-4f1d-a9c7-fcbf27eb1651 576312.0 Abilene, TX 0862 None Active 2018-09-22 00:00:00 2020-05-18 00:00:00 0.000000 None studiomanager0862#orangetheoryfitness.com America/Chicago PROD {'isWeb': 1, 'introCapacity': 1, 'isCrm': 1} 3950 Catclaw Drive Abilene Texas 79606 TX-06 1 United States 3254006191 32.40399933 -99.77462006
3 3139 2a5a5bc7-ea4a-4a2a-b166-56ce5e6ee7e2 415638.0 Acworth 1188 None Active 2019-04-17 00:00:00 2020-05-17 00:00:00 0.000000 None studiomanager1188#orangetheoryfitness.com America/New_York PROD {'isWeb': 1, 'introCapacity': 1, 'isCrm': 1} 4391 Acworth Dallas Rd NW Suite 212 Acworth Georgia 30101 GA-01 1 United States 7706748722 34.05842590 -84.72319794
...
and saves data.csv (screenshot from LibreOffice omitted).
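Since the original question only asked for each location's name and phone number, you can narrow the frame after flattening. A sketch on a hand-made record shaped like the API payload above (in practice data would come from requests.get(url).json()):

```python
import pandas as pd

# One sample record shaped like the API response shown above
data = {"data": [[{"studioName": "14th Street, DC",
                   "studioLocation": {"phoneNumber": "2028691700"}}]]}

df = pd.DataFrame([d[0] for d in data["data"]])
# Flatten the nested studioLocation dict into its own columns
df = pd.concat([df, df.pop("studioLocation").apply(pd.Series)], axis=1)
print(df[["studioName", "phoneNumber"]])
```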

How to get a list of tickers in Jupyter Notebook?

Write code to get a list of tickers for all S&P 500 stocks from Wikipedia. As of 2/24/2021, there are 505 tickers in that list. You can use any method you want as long as the code actually queries the following website to get the list:
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
One way would be to use the requests module to get the HTML code and then use the re module to extract the tickers. Another option would be the .read_html function in pandas and then export the tickers column to a list.
You should save the tickers in a list with the name sp500_tickers
This will grab the data in the table named 'constituents'.
# find a specific table by table count
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
df = pd.read_html(str(table))
print(df[0].to_json(orient='records'))
Result:
[{"Symbol":"MMM","Security":"3M Company","SEC filings":"reports","GICS Sector":"Industrials","GICS Sub-Industry":"Industrial Conglomerates","Headquarters Location":"St. Paul, Minnesota","Date first added":"1976-08-09","CIK":66740,"Founded":"1902"},{"Symbol":"ABT","Security":"Abbott Laboratories","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"North Chicago, Illinois","Date first added":"1964-03-31","CIK":1800,"Founded":"1888"},{"Symbol":"ABBV","Security":"AbbVie Inc.","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Pharmaceuticals","Headquarters Location":"North Chicago, Illinois","Date first added":"2012-12-31","CIK":1551152,"Founded":"2013 (1888)"},{"Symbol":"ABMD","Security":"Abiomed","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"Danvers, Massachusetts","Date first added":"2018-05-31","CIK":815094,"Founded":"1981"},{"Symbol":"ACN","Security":"Accenture","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"IT Consulting & Other Services","Headquarters Location":"Dublin, Ireland","Date first added":"2011-07-06","CIK":1467373,"Founded":"1989"},{"Symbol":"ATVI","Security":"Activision Blizzard","SEC filings":"reports","GICS Sector":"Communication Services","GICS Sub-Industry":"Interactive Home Entertainment","Headquarters Location":"Santa Monica, California","Date first added":"2015-08-31","CIK":718877,"Founded":"2008"},{"Symbol":"ADBE","Security":"Adobe Inc.","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"Application Software","Headquarters Location":"San Jose, California","Date first added":"1997-05-05","CIK":796343,"Founded":"1982"},
Etc., etc., etc.
That's JSON. If you want a table, kind of like what you would use in Excel, simply print the df.
Result:
[ Symbol Security SEC filings GICS Sector \
0 MMM 3M Company reports Industrials
1 ABT Abbott Laboratories reports Health Care
2 ABBV AbbVie Inc. reports Health Care
3 ABMD Abiomed reports Health Care
4 ACN Accenture reports Information Technology
.. ... ... ... ...
500 YUM Yum! Brands Inc reports Consumer Discretionary
501 ZBRA Zebra Technologies reports Information Technology
502 ZBH Zimmer Biomet reports Health Care
503 ZION Zions Bancorp reports Financials
504 ZTS Zoetis reports Health Care
GICS Sub-Industry Headquarters Location \
0 Industrial Conglomerates St. Paul, Minnesota
1 Health Care Equipment North Chicago, Illinois
2 Pharmaceuticals North Chicago, Illinois
3 Health Care Equipment Danvers, Massachusetts
4 IT Consulting & Other Services Dublin, Ireland
.. ... ...
500 Restaurants Louisville, Kentucky
501 Electronic Equipment & Instruments Lincolnshire, Illinois
502 Health Care Equipment Warsaw, Indiana
503 Regional Banks Salt Lake City, Utah
504 Pharmaceuticals Parsippany, New Jersey
Date first added CIK Founded
0 1976-08-09 66740 1902
1 1964-03-31 1800 1888
2 2012-12-31 1551152 2013 (1888)
3 2018-05-31 815094 1981
4 2011-07-06 1467373 1989
.. ... ... ...
500 1997-10-06 1041061 1997
501 2019-12-23 877212 1969
502 2001-08-07 1136869 1927
503 2001-06-22 109380 1873
504 2013-06-21 1555280 1952
[505 rows x 9 columns]]
Alternatively, you can export the DataFrame to a CSV file. Note that pd.read_html returns a list of DataFrames, so index it first:
df[0].to_csv('constituents.csv')
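Since the assignment asks for a list named sp500_tickers, take the Symbol column of the first parsed table; a sketch using a two-row stand-in for the list that pd.read_html returns:

```python
import pandas as pd

# Stand-in for the list returned by pd.read_html; the real first
# element is the Wikipedia 'constituents' table.
df = [pd.DataFrame({"Symbol": ["MMM", "ABT"],
                    "Security": ["3M Company", "Abbott Laboratories"]})]

sp500_tickers = df[0]["Symbol"].tolist()
print(sp500_tickers)  # ['MMM', 'ABT']
```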

Extracting year from a column of string movie names

I have the following data, having two columns, "name" and "gross" in table called train_df:
gross name
760507625.0 Avatar (2009)
658672302.0 Titanic (1997)
652270625.0 Jurassic World (2015)
623357910.0 The Avengers (2012)
534858444.0 The Dark Knight (2008)
532177324.0 Rogue One (2016)
474544677.0 Star Wars: Episode I - The Phantom Menace (1999)
459005868.0 Avengers: Age of Ultron (2015)
448139099.0 The Dark Knight Rises (2012)
436471036.0 Shrek 2 (2004)
424668047.0 The Hunger Games: Catching Fire (2013)
423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006)
415004880.0 Toy Story 3 (2010)
409013994.0 Iron Man 3 (2013)
408084349.0 Captain America: Civil War (2016)
408010692.0 The Hunger Games (2012)
403706375.0 Spider-Man (2002)
402453882.0 Jurassic Park (1993)
402111870.0 Transformers: Revenge of the Fallen (2009)
400738009.0 Frozen (2013)
381011219.0 Harry Potter and the Deathly Hallows: Part 2 (2011)
380843261.0 Finding Nemo (2003)
380262555.0 Star Wars: Episode III - Revenge of the Sith (2005)
373585825.0 Spider-Man 2 (2004)
370782930.0 The Passion of the Christ (2004)
I would like to read and extract the date from "name" to create a new column that will be called "year", which I will then use to filter the data set by specific year.
The new table will look like the following:
year gross name
2009 760507625.0 Avatar (2009)
1997 658672302.0 Titanic (1997)
2015 652270625.0 Jurassic World (2015)
2012 623357910.0 The Avengers (2012)
2008 534858444.0 The Dark Knight (2008)
I tried the apply and lambda approach, but got no results:
train_df[train_df.apply(lambda row: row['name'].startswith('2014'),axis=1)]
Is there a way to use contains (as in C#) or "isin" to filter the strings in Python?
If you know for sure that your years are going to be at the end of the string, you can do
df['year'] = df['name'].str[-5:-1].astype(int)
This takes the column name, uses the str accessor to access the value of each row as a string, and takes the -5:-1 slice from it. Then, it converts the result to int, and sets it as the year column. This approach will be much faster than iterating over the rows if you have lots of data.
Alternatively, you could use regex for more flexibility using the .extract() method of the str accessor.
df['year'] = df['name'].str.extract(r'\((\d{4})\)').astype(int)
This extracts groups matching the expression \((\d{4})\), which means: capture the four digits inside a pair of parentheses. It will match anywhere in the string; to anchor it to the end of your string, add a $ at the end of your regex like so: \((\d{4})\)$. The result is the same using regex and using string slicing.
Now we have our new dataframe:
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
5 532177324.0 Rogue One (2016) 2016
6 474544677.0 Star Wars: Episode I - The Phantom Menace (1999) 1999
7 459005868.0 Avengers: Age of Ultron (2015) 2015
8 448139099.0 The Dark Knight Rises (2012) 2012
9 436471036.0 Shrek 2 (2004) 2004
10 424668047.0 The Hunger Games: Catching Fire (2013) 2013
11 423315812.0 Pirates of the Caribbean: Dead Man's Chest (2006) 2006
12 415004880.0 Toy Story 3 (2010) 2010
13 409013994.0 Iron Man 3 (2013) 2013
14 408084349.0 Captain America: Civil War (2016) 2016
15 408010692.0 The Hunger Games (2012) 2012
16 403706375.0 Spider-Man (2002) 2002
17 402453882.0 Jurassic Park (1993) 1993
18 402111870.0 Transformers: Revenge of the Fallen (2009) 2009
19 400738009.0 Frozen (2013) 2013
20 381011219.0 Harry Potter and the Deathly Hallows: Part 2 (... 2011
21 380843261.0 Finding Nemo (2003) 2003
22 380262555.0 Star Wars: Episode III - Revenge of the Sith (... 2005
23 373585825.0 Spider-Man 2 (2004) 2004
24 370782930.0 The Passion of the Christ (2004) 2004
You can use a regular expression with pandas.Series.str.extract for this:
df["year"] = df["name"].str.extract(r"\((\d{4})\)$", expand=False)
df["year"] = pd.to_numeric(df["year"])
print(df.head())
gross name year
0 760507625.0 Avatar (2009) 2009
1 658672302.0 Titanic (1997) 1997
2 652270625.0 Jurassic World (2015) 2015
3 623357910.0 The Avengers (2012) 2012
4 534858444.0 The Dark Knight (2008) 2008
The regular expression:
\(: find a literal opening parenthesis
(\d{4}): then, find 4 digits appearing next to each other
The parentheses here mean we're storing the 4 digits as a capture group (in this case it's the group of digits we want to extract from the larger string)
\): then, find a closing parenthesis
$: all of the above MUST occur at the end of the string
When all of the above criteria are met, get those 4 digits; if no match is found, return NaN for that row.
Try this.
df = ['Avatar (2009)', 'Titanic (1997)', 'Jurassic World (2015)', 'The Avengers (2012)', 'The Dark Knight (2008)', 'Rogue One (2016)', 'Star Wars: Episode I - The Phantom Menace (1999)', 'Avengers: Age of Ultron (2015)', 'The Dark Knight Rises (2012)', 'Shrek 2 (2004)', 'Boiling Point (1990)', 'Terror Firmer (1999)', "Adam's Apples (2005)", 'I Want You (1998)', 'Chalet Girl (2011)', 'Love, Honor and Obey (2000)', "Perrier's Bounty (2009)", 'Into the White (2012)', 'The Decoy Bride (2011)', 'I Spit on Your Grave 2 (2013)']
for i in df:
    mov_title = i[:-7]
    year = i[-5:-1]
    print(mov_title)  # do your actual extraction
    print(year)  # do your actual extraction
def getYear(val):
    startIndex = val.find('(')
    endIndex = val.find(')')
    return val[startIndex + 1:endIndex]
I am not much of a Python dev, but I believe this will do. You will just need to loop through, passing each value to the above function. On each call you will get the year extracted for you.

Inserting a header row for pandas dataframe

I have just started Python and am trying to rewrite one of my Perl scripts in Python. Essentially, I had a long script to convert a CSV to JSON.
I've tried to import my CSV into a pandas DataFrame, and wanted to insert a header row at the top, since my CSV lacks one.
Code:
import pandas
db=pandas.read_csv("netmedsdb.csv",header=None)
db
Output:
0 1 2 3
0 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
I wrote the following code to insert the first element at row 0, column 0:
db.insert(loc=0,column='0',value='Brand')
db
Output:
0 0 1 2 3
0 Brand 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 Brand BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 Brand KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 Brand RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 Brand 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 Brand AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 Brand RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 Brand VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 Brand VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
But unfortunately I got the word "Brand" inserted at column 0 in all rows.
I'm trying to add the header columns "Brand", "Generic", "Price", "Company".
You only need the names parameter in read_csv:
import io
import pandas as pd

temp = u"""a,b,10,d
e,f,45,r
"""
# after testing, replace io.StringIO(temp) with 'netmedsdb.csv'
df = pd.read_csv(io.StringIO(temp), names=["Brand", "Generic", "Price", "Company"])
print(df)
Brand Generic Price Company
0 a b 10 d
1 e f 45 r
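An equivalent alternative, if you have already read the file with header=None, is to assign the header afterwards via the columns attribute; a small sketch (using io.StringIO as a stand-in for the real CSV file):

```python
import io
import pandas as pd

csv_text = "a,b,10,d\ne,f,45,r\n"
# Read without a header, then assign the column names afterwards
db = pd.read_csv(io.StringIO(csv_text), header=None)
db.columns = ["Brand", "Generic", "Price", "Company"]
print(db)
```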
