Big Bird Pegasus Summarization output is repeating itself - python

I am trying to use Big Bird Pegasus to summarize various long texts. The output is repeating the same concept in each sentence.
Here is my code using a news article I copied from NPR. The text is longer than the 4096 token limit, so it takes the first few thousand words from my input.
import torch
from transformers import BigBirdPegasusForConditionalGeneration, AutoTokenizer
model = BigBirdPegasusForConditionalGeneration.from_pretrained("google/bigbird-pegasus-large-arxiv")
tokenizer = AutoTokenizer.from_pretrained("google/bigbird-pegasus-large-arxiv")
src_text = '''
On a sleepy cul-de-sac amid the bucolic vineyards and grassy hills of California's Sonoma Valley, a $4 million house has become the epicenter of a summer-long spat between angry neighbors and a new venture capital-backed startup buying up homes around the nation. The company is called Pacaso. It says it's the fastest company in American history to achieve the "unicorn" status of a billion-dollar valuation — but its quarrels in wine country, one of the first regions where it's begun operations, foreshadow business troubles ahead.
Brad Day and his wife, Holly Kulak, were first introduced to Pacaso in May after a romantic sunset dinner in their yard. "And we just saw this drone, coming up and over our backyard," Day says. "And we're like, what is that?"
Pacaso denies directing or paying a drone operator to film the neighborhood. But its website does have drone photos of the house in question, located at 1405 Old Winery Court. It says it bought the photos after the fact.
Nonetheless, after the drone incident, Day and Kulak got suspicious about what was going on in their neighborhood. About a week later, their neighbors told them they were moving and selling their house to a limited liability corporation, or LLC. But they were super vague about it.
Day and Kulak began speaking with other residents on their cul-de-sac. One of them, Nancy Gardner, had learned from a friend in nearby Napa Valley about a new company called Pacaso that was buying houses in the area. The company was co-founded by a Napa resident, and it converts houses into LLCs. Pacaso then sells shares of these corporate houses to multiple investors. Gardner Googled Pacaso, and, sure enough, the house on their cul-de-sac was on its website. The company had named the house "Chardonnay" and was now selling investors the chance to buy a one-eighth share of it for $606,000.
Pacaso was founded in October 2020 by Austin Allison and Spencer Rascoff, two former executives at Zillow. The company is based in San Francisco, and as is typical of tech startups in the Silicon Valley area, its founders tell a lofty story about their business that's about more than just making money. The company says the motivation for the venture began when Allison and his wife, both based in Napa, bought a second home in Lake Tahoe. The night after they closed on the house, Allison says in a promotional video, he and his wife sat around a fire "thinking how appreciative we were to be second homeowners. And, from that moment, I've always been inspired about making the dream of second home ownership possible for more people."
To make second home ownership possible for more people — and, of course, make money — Pacaso uses a "fractional home ownership" model. They buy a house, lightly refurbish it, furnish it and then create an LLC for it. They then divvy up ownership of this corporatized house into eight fractions and sell those shares on their website.
If you buy a share in a house, you're able to stay in it 44 nights per year in increments that can't exceed 14 consecutive days per visit. You can also "gift" these stays to friends or family. Pacaso offers an app to handle the logistics of booking stays. It oversees management, maintenance and cleaning of the property. In exchange for all this, it charges 12% of the home's purchase price upfront and monthly fees going forward. If you buy a share in a house, you have to hold on to it for a year. After that, you can sell it and profit from any appreciation in the home's value (or be on the hook for any depreciation).
When Day, Kulak and their neighbors learned about Pacaso's business model, they were appalled. They saw the venture capital-backed company as invading their community and converting their neighbor's house into a revolving carousel of vacationers. They imagined endless parties, noise and cars overflowing their cul-de-sac. They worried those staying at "Chardonnay" would drive too fast and fail to heed local concerns about wildfires and droughts. But, most of all, they feared the Pacaso house and more like it would destroy their sense of community and turn their neighborhood into an "adult Disneyland."
The county, Day says, had designated their neighborhood an "exclusion zone," which bans Airbnb-style, short-term rentals to preserve the "residential character" of communities. But Pacaso argues that its clients are not short-term renters. They are co-owners of an LLC. This also means they don't have to pay the typical taxes on short-term rentals. Likewise, in the nearby town of St. Helena, Pacaso was trying to circumnavigate a city ban against timeshares with the same argument. Day says he and his neighbors saw Pacaso's newfangled business model as nothing more than a "glorified timeshare" with a legal strategy aimed at "skirting regulations that are designed to keep communities intact."
The cul-de-sac sprang into action. It formed an organization called Sonomans Together Opposing Pacaso, which, not coincidentally, has the acronym STOP. It contacted the county Board of Supervisors. It created an anti-Pacaso website and circulated an online petition. It flooded the local newspaper with op-eds and letters to the editor. It lobbied local real estate agents not to work with Pacaso. "It feels like we're waging a war by land, air and sea," Day says.
Protest signs festoon the neighborhood's lawns, fences and cars. They say things such as "Stop Pacaso" and "Not here, Pacaso!" Day's favorite sign reads, "The Pacaso house is the big one on the right with no soul."
The signs, of course, make the prospect of buying a share in the Pacaso house awkward, to say the least. Alfred Miller, however, bought a share in "Chardonnay" before ever seeing it in person. Miller is a risk management consultant based in Los Angeles. He believes in Pacaso's business model. And he likes wine and Sonoma's climate. As he researched "Chardonnay" online, he liked the modern architecture and pool, and he decided he'd buy a one-eighth share of the house. It wasn't until a couple weeks after he made the purchase that he first drove up to Sonoma and witnessed the spectacle around his new investment.
"So, imagine me as a new owner driving up, and I get to the corner of Old Winery Court," Miller says. "There's a full-on, professionally printed sign that says 'No Pacaso.' '' Miller then turned right onto Old Winery Court "and the more I drive into the neighborhood, the more signs I see. Brad Day has three vehicles in front of his house, and each vehicle has an anti-Pacaso sign on it. I pull into the driveway — there are two signs on each side of the property. I mean, it was not what I would call very welcoming."
As it did on Old Winery Court, controversy erupted in Napa after the company bought a home worth $1.13 million. That's about 35% higher than Napa's median home price. Pacaso insists it only buys luxury and ultra-luxury houses, and it therefore isn't competing with local middle-class families in the housing market. But this home, located two blocks from a high school, didn't quite fit its talking points. Some Napans were pissed. Pacaso says the house was the victim of trespassing and "illegal signage." Pacaso even claims it had to file a police report after a local wrote to the company and said, "I will burn down any home you buy in Napa. This is no joke."
Pacaso's CEO, who lives in Napa, saw firsthand how angry Napans were, and the company responded. In June, Pacaso agreed to sell the Napa home in a traditional manner "to a whole home buyer" rather than convert it into a corporation and sell it to multiple people. The company also pledged to beef up its "Owner Code Of Conduct" to include "decibel limits on all home sound systems," create a "local liaison" dedicated to assisting neighbors, not buy any homes in the area for under $2 million, and, for each house sold in Napa and Sonoma counties, donate $20,000 to a local nonprofit dedicated to affordable housing.
But while it has been trying to placate local communities with business reforms, Pacaso has waged a court battle with the town of St. Helena over whether its homes should be classified as timeshares. Pacaso is dead set against that classification. One reason might be that timeshares have a bad rap: While they're a popular way to go on vacations, their costs and associated fees tend to make them money losers rather than a profitable investment.
Potentially even more damaging to Pacaso's ambitions, however: Timeshares are banned in many vacation communities around the nation. Hence, Pacaso has strong reasons to insist its homes are not timeshares.
"Unlike a timeshare model, the co-owners that Pacaso serves collectively own real estate, not time," says Ellen Haberle, director of community and government relations for Pacaso.
St. Helena disagrees, declaring Pacaso homes are not allowed in the town because of a city ordinance against timesharing. "Simply calling them co-ownership arrangements does not change that fact," City Attorney Ethan Walsh said. In response to the ban, Pacaso sued the town in federal court. The lawsuit is still pending.
Pacaso says it plans to expand across North America and Europe. Given the company's billion-dollar valuation, investors seem to believe that many people will be attracted to its model of fractional second home ownership. But local residents will likely continue to fight the unicorn stampeding into their towns.
'''
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device)  # keep the model on the same device as the inputs
batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
translated = model.generate(**batch)
tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
Here is the output. It doesn't mention anything but the topic of fractional ownership. I repeated this with another input text and got similar results: each sentence was a slight variation of the others.
tgt_text
['the notion of a fractional ownership in a real property was introduced in the 19th century.<n> fractional ownership in a real property was defined to be the fraction of the value of the property minus the cost of its construction.<n> the fractional ownership of a real property was defined to be the fraction of the value of the property minus the cost of its construction.<n> the fractional ownership of a home is defined to be the fraction of the value of the home minus the cost of its construction. <n> the notion of a fractional ownership in a real property was introduced in the 19th century.<n> the fractional ownership of a home is the fraction of the value of the home minus the cost of its construction.<n> the fractional ownership of a real property was defined to be the fraction of the value of the home minus the cost of its construction.<n> the fractional ownership of a home is the fraction of the value of the home minus the cost of its construction.<n> the fractional ownership of a real property was defined to be the fraction of the value of the home minus the cost of its construction.<n> the fractional ownership of a home was defined to be the fraction of the value of the home minus the cost of its construction.<n> the']

Have you tried passing no_repeat_ngram_size to generate()? It prevents any n-gram of that size from appearing more than once in the output. E.g.,
translated = model.generate(**batch, no_repeat_ngram_size=3)
This generates the following text:
the notion of a fractional ownership in a real property was introduced in the 19th century.<n> fractional ownership refers to a property that is more than one half of the value of the property. in recent years<n>, fractional ownership has been used to describe a variety of real
estate phenomena, such as the construction of bridges and tunnels, the development of canals and other drainages, as well as the movement of people and goods. here<n> we consider the fractional ownership
of a home, which can be thought of as the difference between the values of the two halves of the home.
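To make the effect of no_repeat_ngram_size more concrete, here is a minimal sketch of the idea in plain Python. It is an illustration only, not the library's actual implementation: the function name banned_next_tokens is hypothetical, and it works on whitespace-split words rather than real tokenizer ids. Given the text generated so far, it returns the words that would complete an n-gram that has already appeared, which is the set a decoder would refuse to emit next.

```python
def banned_next_tokens(generated, n):
    """Return the tokens that would complete an n-gram already present in `generated`.

    Hypothetical helper for illustration; real decoders apply this per beam on token ids.
    """
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens are the prefix the next token would extend.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    # Scan every earlier n-gram; if it starts with the same prefix, its last token is banned.
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned


tokens = "the fractional ownership of a home is the fractional".split()
banned = banned_next_tokens(tokens, 3)
print(banned)
```

With n=3 the trigram "the fractional ownership" already exists, so after the second "the fractional" the word "ownership" is blocked and the decoder has to continue differently.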

Related

capitalization of word from column A in a text from column B in csv file

I have a CSV file with two columns: column A holds brand names and column B holds descriptions.
I want to capitalize the brand names in the descriptions. For each row, check whether the brand name from column A appears in the description; if it does, capitalize it, and if not, move on to the next row, and so on.
Is there a solution for that? I got stuck on it. Here is a sample from the data set:
column A
& Other Stories
#HomeOffice
1-800-FLOWERS.COM, INC.
10 Corso Como
100% CAPRI
ITALIA SRL
Column B
They believe in sharing stories. their concept is built around
inviting ctomers to be involved in the creative process behind the
brand. They share everything from the sketch of a shoe to the behind
the scenes of a photo shoot, and views from their ateliers in paris,
stockholm and los angeles. Words such as personal, diverse and
uncomplicated are present in everything they do. They aim to create
collections for all fashion loving women. They want to encourage
personal style with their wide range of products and make everyone
feel welcome. By keeping things uncomplicated and flexible, it’s easy
for them to adapt to new visions and different collections as fashion
always changes. That’s the beauty of it! their Collections are
designed in their three ateliers: paris, stockholm and los angeles.
They are each fantastic yet very different cities. They love the
contrast between the passionate parisian atmosphere, the minimalist,
pragmatic stockholm feel, and the laidback los angeles vibe, and
especially what happens when they all come together. Each person
working with & other stories is an essential part of their group of
creatives and valued individual within the company. They believe in
being spontaneo, personal and flexible, which makes it easy to
collaborate within all parts of their brand and enables growth, for
you and . Currently, & other stories has over 1, 500 employees and
continues to expand, opening more stores around the world.
#Homeoffice Creates a unique, original workplace with a long service life. Stainable sitting sta desks, ergonomic seats and high quality
accessories make your home workplace unique and ergonomic. Pure dutch
design. Go to http: //www. Hashtaghomeoffice. Nl for more information
and inspiration!
#Homeoffice Creates a unique, original workplace with a long service life. Stainable sitting sta desks, ergonomic seats and high quality
accessories make your home workplace unique and ergonomic. Pure dutch
design. Go to http: //www.hashtaghomeoffice. Nl for more information
and inspiration!
Founded in 1991, 10 corso como is recognized as the first concept
destination, blending culture with trends, promoting a close link
between fashion and design. Known as the first «concept store», it
turned the retail concept in a hub for lifestyle and fashion. Today,
30 years later, 10 corso como, with the new presidency and
entrepreneurship of tiziana fati, together with the artistic direction
of carla sozzani, continues the same integrity and aesthetic identity
that has made 10 corso como a symbol of milan, of creativity and made
in italy. For further information about 10 corso como visit www.
10corsocomo. Com connect with 10 corso como on instagram: #10corsocomo
The 100% capri luxury brand was born in 2000 from the idea of ​​toni
aiello who, focing everything on quality, on rich history and on the
charm of made in italy, it is aimed at a target of ctomers who love
lifestyle and the good life, who spend the summer in capri and winter
in saint moritz, also making incursion to the caribbean. People who
like to spend, but who ask first of all. The foc straight to the
linen, why did you want an inseparable combination with 100% capri, a
little? How To do the red luxury car and the ferrari brand. Today
thanks to new technologies the tailoring is able to offer fifteen
different qualities of material, from waterproof linen at lino's garza
and offer collections for men, women, children, but also for the home.
Aiello "wears" even boats and planes. All by invoiced by tens of
millions of euros and an international presence all over the world.
From capri to the fascinating hotel de rsie in rome, from st.
Barthelemy (elegant and refined island of the french antilles) to bal
harbour in florida (the most luxurio mall in the united states), from
sicily (inside the prestigio vegetable golf) to cape town. Openings
are also scheduled for abu dhabi and hong kong.
Thanks
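One possible approach, sketched with the standard library only: for each row, search the description case-insensitively for the brand name and upper-case every match. This assumes "capitalize" means upper-casing; the function name capitalize_brand, the column headers brand and description, and the shortened sample rows are all hypothetical. re.escape is used because brand names such as "1-800-FLOWERS.COM, INC." contain regex metacharacters.

```python
import csv
import io
import re


def capitalize_brand(brand, description):
    """Upper-case every case-insensitive occurrence of `brand` in `description`."""
    brand = brand.strip()
    if not brand:
        # An empty pattern would match everywhere, so leave the text untouched.
        return description
    pattern = re.compile(re.escape(brand), flags=re.IGNORECASE)
    # A function replacement avoids backslash handling in the replacement string.
    return pattern.sub(lambda m: m.group(0).upper(), description)


# Shortened stand-in for the real file; the header names are assumptions.
sample_csv = io.StringIO(
    'brand,description\n'
    '& Other Stories,Each person working with & other stories is an essential part of their group.\n'
    '10 Corso Como,"Founded in 1991, 10 corso como is recognized as the first concept destination."\n'
    '100% CAPRI,No brand mention in this text.\n'
)
rows = list(csv.DictReader(sample_csv))
results = [capitalize_brand(r["brand"], r["description"]) for r in rows]
for line in results:
    print(line)
```

To restore the spelling from column A instead of upper-casing, the lambda could return brand rather than m.group(0).upper(). Note that this matches the brand anywhere in the text, including inside longer words, and it does not match variants written without spaces such as "10corsocomo".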

Convert csv to Json to send to watson-personality insight API with Python

I am trying to convert a CSV file containing some reviews that I've extracted into JSON, to use as input to IBM Watson Personality Insights.
The csv (WordFinal.csv) with the reviews is like this:
ID,Review,advice,con,date,employeeType,overallStar,position,pro,reviewLink,reviewNo,summary
3,ive been with amazon for 3 years before i finally resigned i started directly out of college as an area manager and promoted once a year to operation manager and was starting to get tapped for senior operations manager the progression at this company is unbelievable and growth potential is nearly limitless you get to work with a highly dedicated team that strives to raise the bar dont get me wrong you may come across individuals who also do the minimum and cut corners to get results but that is the case anywhere i was able to lead teams of up to 700 associates and indirectly oversee 1500 associates during the holiday season all at the age of 25 i was able to influence shift wide department wide building wide and even region wide procedures processes decisions for the company . my only con was the work life balance i was put on nights my entire time with amazon in order to continue to help the lower performing shifts and this would lead to having a daily schedule of wake up go straight to work usually forget to eat leave after 14 hours go to the gym eat maybe sleep wake up after 4 5 hours do it again based on location your leadership team s culture may vary drastically in california i had an extremely supportive senior and general manager team once i came to new jersey the senior team was pretty blind to their own operation super focused on creating reports continuous innovation projects in excel sheets versus just letting us go out there and do the actual improvements there was an extremely slow uptake on just do it recommendations and it was highly evident that this mentality was built into the department as well so again take it with a grain of salt ,"Cali - keep it up, you have culture nailedJersey - This is amazon, you can't raise the bar at amazon without actually being innovating and taking big steps to change things up.",my only con was the work life balance i was put on nights my entire time with amazon in order to continue to help the lower performing 
shifts and this would lead to having a daily schedule of wake up go straight to work usually forget to eat leave after 14 hours go to the gym eat maybe sleep wake up after 4 5 hours do it again based on location your leadership team s culture may vary drastically in california i had an extremely supportive senior and general manager team once i came to new jersey the senior team was pretty blind to their own operation super focused on creating reports continuous innovation projects in excel sheets versus just letting us go out there and do the actual improvements there was an extremely slow uptake on just do it recommendations and it was highly evident that this mentality was built into the department as well so again take it with a grain of salt ," Nov 9, 2017",Former Employee - Operations Manager,3,I worked at Amazon full-time (More than 3 years),ive been with amazon for 3 years before i finally resigned i started directly out of college as an area manager and promoted once a year to operation manager and was starting to get tapped for senior operations manager the progression at this company is unbelievable and growth potential is nearly limitless you get to work with a highly dedicated team that strives to raise the bar dont get me wrong you may come across individuals who also do the minimum and cut corners to get results but that is the case anywhere i was able to lead teams of up to 700 associates and indirectly oversee 1500 associates during the holiday season all at the age of 25 i was able to influence shift wide department wide building wide and even region wide procedures processes decisions for the company ,https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036.htm/Reviews/Employee-Review-Amazon-RVW17818923.htm,empReview_17818923,Take it with a grain of salt
19, company is on an unstoppable growth trajectory amazons business model is incredible is riding a number of secular trends ecommerce cloud ai and the stock is a winner employees are making more money than expected leadership principles drive a high performance culture that focuses on customers it feels great to work on products that customers love you get to work on very difficult problems with smart people once you establish yourself as a high performer you have a high level of job security and internal mobility teams are constantly hiring and building really innovative things and you are encouraged to move around and explore teams tend to be lean and you will be asked to learn a lot quickly ownership is highly valued office environment is really desirable located in a great downtown seattle neighborhood many people walk to work bring dogs to the office and restaurants and bars are very accessible amazon veterans tend to be incredibly talented individuals and other companies realize it being successful at amazon is well respected in the industry . 
work life balance can be a challenge work demands are high and teams are often too lean you have to set your own boundaries even with kind managers overachievers will feel under water frugality as a core value goes overboard if amazon doesn t have to give it you it won t no perks no free food or drinks bad coffee unsubsidized cafeterias mediocre hardware for non technical people there doesn t seem to be a morale budget and you will have few official team outings compensation policies are not employee friendly 401k matching is subpar once your signing cash bonus is fully vested your entire compensation will be base salary and stock base salary is capped at 160k across the company stock vests twice a year if you re below a director so your compensation is very lumpy stock price appreciation is taken into consideration in your total compensation targets ie if the value of previously offered shares increases the company will count that as a raise and might not grant you additional stock bonuses despite strong performance ,None, work life balance can be a challenge work demands are high and teams are often too lean you have to set your own boundaries even with kind managers overachievers will feel under water frugality as a core value goes overboard if amazon doesn t have to give it you it won t no perks no free food or drinks bad coffee unsubsidized cafeterias mediocre hardware for non technical people there doesn t seem to be a morale budget and you will have few official team outings compensation policies are not employee friendly 401k matching is subpar once your signing cash bonus is fully vested your entire compensation will be base salary and stock base salary is capped at 160k across the company stock vests twice a year if you re below a director so your compensation is very lumpy stock price appreciation is taken into consideration in your total compensation targets ie if the value of previously offered shares increases the company will count that as a raise 
and might not grant you additional stock bonuses despite strong performance ," Oct 29, 2016",Current Employee - Product Manager,4,I have been working at Amazon full-time (More than 3 years), company is on an unstoppable growth trajectory amazons business model is incredible is riding a number of secular trends ecommerce cloud ai and the stock is a winner employees are making more money than expected leadership principles drive a high performance culture that focuses on customers it feels great to work on products that customers love you get to work on very difficult problems with smart people once you establish yourself as a high performer you have a high level of job security and internal mobility teams are constantly hiring and building really innovative things and you are encouraged to move around and explore teams tend to be lean and you will be asked to learn a lot quickly ownership is highly valued office environment is really desirable located in a great downtown seattle neighborhood many people walk to work bring dogs to the office and restaurants and bars are very accessible amazon veterans tend to be incredibly talented individuals and other companies realize it being successful at amazon is well respected in the industry ,https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036.htm/Reviews/Employee-Review-Amazon-RVW12494284.htm,empReview_12494284,Incredible growth opportunity with downsides
The format accepted by IBM-Watson is as follows:
{
  "contentItems": [
    {
      "content": "Wow, I liked #TheRock before, now I really SEE how special he is. The daughter story was IT for me. So great! #MasterClass",
      "contenttype": "text/plain",
      "created": 1447639154000,
      "id": "666073008692314113",
      "language": "en"
    },
    {
      "content": ".#TheRock how did you Know to listen to your gut and Not go back to football? #Masterclass",
      "contenttype": "text/plain",
      "created": 1447638226000,
      "id": "666069114889179136",
      "language": "en"
    },
    {
      "content": ".#TheRock moving back in with your parents so humbling. \" on the other side of your pain is something good if you can hold on\" #masterclass",
      "contenttype": "text/plain",
      "created": 1447638067000,
      "id": "666068446325665792",
      "language": "en"
    }
  ]
}
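As a rough sketch of that target shape, the following builds a contentItems payload from the ID and summary columns of a CSV using only the standard library. The helper name rows_to_content_items, the shortened sample data, the fixed "en" language, and leaving out the created field are all assumptions for illustration; check the Watson documentation for which fields are actually required. The header row is left for csv.DictReader to read, so each column is addressed by its own name.

```python
import csv
import io
import json


def rows_to_content_items(csv_text, id_col="ID", text_col="summary"):
    """Build a {"contentItems": [...]} dict from CSV text (hypothetical helper)."""
    # No fieldnames argument: DictReader takes the keys from the header row.
    reader = csv.DictReader(io.StringIO(csv_text))
    items = []
    for row in reader:
        items.append({
            "content": row[text_col],
            "contenttype": "text/plain",
            "id": row[id_col],
            "language": "en",  # assumption: all reviews are in English
        })
    return {"contentItems": items}


# Shortened stand-in for WordFinal.csv with only three of its columns.
sample = (
    "ID,Review,summary\n"
    "3,some review text,Take it with a grain of salt\n"
    "19,another review,Incredible growth opportunity with downsides\n"
)
payload = rows_to_content_items(sample)
print(json.dumps(payload, indent=2))
```

For a real file, the same loop would run over csv.DictReader(open(path, newline="", encoding="utf8")) and the result would be written with json.dump.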
I'm trying to generate output in the format IBM Watson accepts, but I can't figure out how to achieve this. My code is below:
import json, csv
csvfile = 'C:\\WordFinal.csv'
jsonfile = 'C:\\OutputJ.json'
fieldnames=['ID','summary']
data= {}
with open(csvfile) as cs:
    #reader = csv.DictReader(cs)
    reader = csv.DictReader(cs, fieldnames)
    for row in reader:
        csvid = row["ID"]
        data[csvid] = row
with open(jsonfile, "w",encoding='utf8') as js:
    js.write(json.dumps(data,indent=2))
And the output (I only need ID and summary columns from my csv - first and last column):
{
"ID": {
"ID": "ID",
"summary": "Review",
"null": [
"advice",
"con",
"date",
"employeeType",
"overallStar",
"position",
"pro",
"reviewLink",
"reviewNo",
"summary"
]
},
"3": {
"ID": "3",
"summary": "ive been with amazon for 3 years before i finally resigned i started directly out of college as an area manager and promoted once a year to operation manager and was starting to get tapped for senior operations manager the progression at this company is unbelievable and growth potential is nearly limitless you get to work with a highly dedicated team that strives to raise the bar dont get me wrong you may come across individuals who also do the minimum and cut corners to get results but that is the case anywhere i was able to lead teams of up to 700 associates and indirectly oversee 1500 associates during the holiday season all at the age of 25 i was able to influence shift wide department wide building wide and even region wide procedures processes decisions for the company . my only con was the work life balance i was put on nights my entire time with amazon in order to continue to help the lower performing shifts and this would lead to having a daily schedule of wake up go straight to work usually forget to eat leave after 14 hours go to the gym eat maybe sleep wake up after 4 5 hours do it again based on location your leadership team s culture may vary drastically in california i had an extremely supportive senior and general manager team once i came to new jersey the senior team was pretty blind to their own operation super focused on creating reports continuous innovation projects in excel sheets versus just letting us go out there and do the actual improvements there was an extremely slow uptake on just do it recommendations and it was highly evident that this mentality was built into the department as well so again take it with a grain of salt ",
"null": [
"Cali - keep it up, you have culture nailedJersey - This is amazon, you can't raise the bar at amazon without actually being innovating and taking big steps to change things up.",
"my only con was the work life balance i was put on nights my entire time with amazon in order to continue to help the lower performing shifts and this would lead to having a daily schedule of wake up go straight to work usually forget to eat leave after 14 hours go to the gym eat maybe sleep wake up after 4 5 hours do it again based on location your leadership team s culture may vary drastically in california i had an extremely supportive senior and general manager team once i came to new jersey the senior team was pretty blind to their own operation super focused on creating reports continuous innovation projects in excel sheets versus just letting us go out there and do the actual improvements there was an extremely slow uptake on just do it recommendations and it was highly evident that this mentality was built into the department as well so again take it with a grain of salt ",
" Nov 9, 2017",
"Former Employee - Operations Manager",
"3",
"I worked at Amazon full-time\u00c2\u00a0(More than 3 years)",
"ive been with amazon for 3 years before i finally resigned i started directly out of college as an area manager and promoted once a year to operation manager and was starting to get tapped for senior operations manager the progression at this company is unbelievable and growth potential is nearly limitless you get to work with a highly dedicated team that strives to raise the bar dont get me wrong you may come across individuals who also do the minimum and cut corners to get results but that is the case anywhere i was able to lead teams of up to 700 associates and indirectly oversee 1500 associates during the holiday season all at the age of 25 i was able to influence shift wide department wide building wide and even region wide procedures processes decisions for the company ",
"https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036.htm/Reviews/Employee-Review-Amazon-RVW17818923.htm",
"empReview_17818923",
"Take it with a grain of salt"
]
},
"19": {
"ID": "19",
"summary": " company is on an unstoppable growth trajectory amazons business model is incredible is riding a number of secular trends ecommerce cloud ai and the stock is a winner employees are making more money than expected leadership principles drive a high performance culture that focuses on customers it feels great to work on products that customers love you get to work on very difficult problems with smart people once you establish yourself as a high performer you have a high level of job security and internal mobility teams are constantly hiring and building really innovative things and you are encouraged to move around and explore teams tend to be lean and you will be asked to learn a lot quickly ownership is highly valued office environment is really desirable located in a great downtown seattle neighborhood many people walk to work bring dogs to the office and restaurants and bars are very accessible amazon veterans tend to be incredibly talented individuals and other companies realize it being successful at amazon is well respected in the industry . 
work life balance can be a challenge work demands are high and teams are often too lean you have to set your own boundaries even with kind managers overachievers will feel under water frugality as a core value goes overboard if amazon doesn t have to give it you it won t no perks no free food or drinks bad coffee unsubsidized cafeterias mediocre hardware for non technical people there doesn t seem to be a morale budget and you will have few official team outings compensation policies are not employee friendly 401k matching is subpar once your signing cash bonus is fully vested your entire compensation will be base salary and stock base salary is capped at 160k across the company stock vests twice a year if you re below a director so your compensation is very lumpy stock price appreciation is taken into consideration in your total compensation targets ie if the value of previously offered shares increases the company will count that as a raise and might not grant you additional stock bonuses despite strong performance ",
"null": [
"None",
" work life balance can be a challenge work demands are high and teams are often too lean you have to set your own boundaries even with kind managers overachievers will feel under water frugality as a core value goes overboard if amazon doesn t have to give it you it won t no perks no free food or drinks bad coffee unsubsidized cafeterias mediocre hardware for non technical people there doesn t seem to be a morale budget and you will have few official team outings compensation policies are not employee friendly 401k matching is subpar once your signing cash bonus is fully vested your entire compensation will be base salary and stock base salary is capped at 160k across the company stock vests twice a year if you re below a director so your compensation is very lumpy stock price appreciation is taken into consideration in your total compensation targets ie if the value of previously offered shares increases the company will count that as a raise and might not grant you additional stock bonuses despite strong performance ",
" Oct 29, 2016",
"Current Employee - Product Manager",
"4",
"I have been working at Amazon full-time\u00c2\u00a0(More than 3 years)",
" company is on an unstoppable growth trajectory amazons business model is incredible is riding a number of secular trends ecommerce cloud ai and the stock is a winner employees are making more money than expected leadership principles drive a high performance culture that focuses on customers it feels great to work on products that customers love you get to work on very difficult problems with smart people once you establish yourself as a high performer you have a high level of job security and internal mobility teams are constantly hiring and building really innovative things and you are encouraged to move around and explore teams tend to be lean and you will be asked to learn a lot quickly ownership is highly valued office environment is really desirable located in a great downtown seattle neighborhood many people walk to work bring dogs to the office and restaurants and bars are very accessible amazon veterans tend to be incredibly talented individuals and other companies realize it being successful at amazon is well respected in the industry ",
"https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036.htm/Reviews/Employee-Review-Amazon-RVW12494284.htm",
"empReview_12494284",
"Incredible growth opportunity with downsides"
]
}
}
Any idea how I can get the right format?
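Don’t specify the field names. Let DictReader take them from the header row, then pull out the ones you want from the full records:
import csv
import json
with open(csvfile, "r", encoding="utf-8") as infile:
    reader = csv.DictReader(infile)  # field names come from the header row
    with open(jsonfile, "w", encoding="utf-8") as outfile:
        # keep only the ID and summary fields of every record
        outfile.write(json.dumps([{"ID": row["ID"], "summary": row["summary"]} for row in reader], indent=2))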
Gives:
[
{
"ID": "3",
"summary": "Take it with a grain of salt"
},
{
"ID": "19",
"summary": "Incredible growth opportunity with downsides"
}
]
If you don't need anything other than the summary, store only the summary in data:
data = {}  # maps each ID to its summary
with open(csvfile) as cs:
    reader = csv.DictReader(cs, fieldnames)
    first_row = next(reader)  # fieldnames are passed explicitly, so skip the header line
    for row in reader:
        csvid = row["ID"]
        data[csvid] = row['summary']  # just store the summary
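The two answers above differ in whether fieldnames is passed to DictReader, which changes how the header line is treated. The following is a small sketch of that difference using an in-memory CSV; the sample text and variable names are made up for illustration and stand in for the real file.

```python
import csv
import io

# Hypothetical two-row CSV standing in for the real file.
sample = (
    "ID,summary\n"
    "3,Take it with a grain of salt\n"
    "19,Incredible growth opportunity with downsides\n"
)

# Without fieldnames, the header row supplies the keys and is not yielded as data.
by_header = {row["ID"]: row["summary"] for row in csv.DictReader(io.StringIO(sample))}

# With explicit fieldnames, the header row comes back as an ordinary record,
# so it has to be skipped before collecting the data rows.
reader = csv.DictReader(io.StringIO(sample), fieldnames=["ID", "summary"])
header_row = next(reader)  # {'ID': 'ID', 'summary': 'summary'}
by_fieldnames = {row["ID"]: row["summary"] for row in reader}

print(by_header == by_fieldnames)
```

Either way the result is the same mapping from ID to summary; the only difference is who consumes the header line.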
I find pandas much easier to work with.
First, create a dataframe from your CSV file (assuming the file is well formed).
import pandas as pd
df = pd.read_csv('WordFinal.csv')
Then you can iterate the dataframe.
records = []
for index, row in df.iterrows():
    rec = {
        row['ID']: {
            'ID': row['ID'],
            'summary': row['summary']
            # ... etc
        }
    }
    records.append(rec)
Your records list will then hold the data, ready to be saved as JSON.

spaCy sentence segmentation failing on quotes

I am parsing some news data with spaCy and am noticing a consistent failure regarding sentence segmentation where there is a quote. Has anyone else solved this issue?
Here is a reproducible example - note sentence 4 in the output below. spaCy fails to split at the start of the quote, and this is consistent through other news articles I'm working with.
Thanks a lot.
Example:
Raw data:
u'body': u'\n LONDON Nov 4 Britons hurt by lower incomes and rising food prices after the financial crisis have cut back on fruit and vegetables and turned instead to fatty, sugary, processed food, an academic study showed on Monday.Britain has seen food prices rise much more sharply than most other developed economies between 2005 and 2012, while wage growth has been low and unemployment has risen.The net effect has been that Britons are spending 8.5 percent less in real terms on food purchased at home than before the recession - with the trend even greater for pensioners and families with young children.The research is likely to be politically sensitive at a time when Britain\'s Conservative-led government is under pressure from the opposition Labour Party, over declining standards of living and sharply rising demand at food banks which hand out free food to the poorest Britons. People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less nutritious and higher in saturated fat and sugar."Various measures of nutritional quality declined over this period, with bigger decreases for pensioner households and households with young children," said the Institute for Fiscal Studies, an economics research body.OBESITY Families with children were prone to switching to more sugary food, while pensioners favoured food high in saturated fat, the study showed. Both groups often have lower incomes.While the economy is starting to show signs of growth after suffering the biggest hit to economic growth since records began during the 2008-09 recession, households\' disposable incomes are no higher than a decade ago. However, the IFS said a lower-quality diet was not an inevitable consequence of having less money, and that some households had been able to eat as healthily as before while spending less. 
More research was needed to see why this was not the case for other households, the researchers added.The study looked at data on more than 15,000 households\' shopping habits collected by market research company Kantar Worldpanel between 2005 and 2012.The figures do not include meals purchased or provided away from home, for example in restaurants or at schools, which in England provide free lunches for poorer pupils.The study was released alongside a piece of longer-term research from the IFS, which showed the English now consume 15-30 percent fewer calories than in 1980, despite higher obesity rates probably due to less physical activity.This contrasts with the United States, where calorie consumption has risen as well as obesity. The IFS said it was were researching further into trends in Britons\' physical activity over the period.',
Code to split:
from __future__ import unicode_literals
import spacy
nlp = spacy.load('en')
doc1 = nlp(article_to_json['body'].decode('utf-8'), parse=True)
for number, sent in enumerate(doc1.sents):
    print number, sent, "\n"
Output:
0 LONDON Nov 4 Britons hurt by lower incomes and rising food
prices after the financial crisis have cut back on fruit and
vegetables and turned instead to fatty, sugary, processed food, an
academic study showed on Monday.
1 Britain has seen food prices rise much more sharply than most other
developed economies between 2005 and 2012, while wage growth has been
low and unemployment has risen.
2 The net effect has been that Britons are spending 8.5 percent less
in real terms on food purchased at home than before the recession -
with the trend even greater for pensioners and families with young
children.
3 The research is likely to be politically sensitive at a time when
Britain's Conservative-led government is under pressure from the
opposition Labour Party, over declining standards of living and
sharply rising demand at food banks which hand out free food to the
poorest Britons.
4 People have economised by buying less food, measured in number of calories, but also on its quality, picking products that are less
nutritious and higher in saturated fat and sugar."Various measures of
nutritional quality declined over this period, with bigger decreases
for pensioner households and households with young children," said the
Institute for Fiscal Studies, an economics research body.
5 OBESITY Families with children
were prone to switching to more sugary food, while pensioners favoured
food high in saturated fat, the study showed.
6 Both groups often have lower incomes.
7 While the economy is starting to show signs of growth after
suffering the biggest hit to economic growth since records began
during the 2008-09 recession, households' disposable incomes are no
higher than a decade ago.
8 However, the IFS said a lower-quality diet was not an inevitable
consequence of having less money, and that some households had been
able to eat as healthily as before while spending less.
9 More research was needed to see why this was not the case for other
households, the researchers added.
10 The study looked at data on more than 15,000 households' shopping
habits collected by market research company Kantar Worldpanel between
2005 and 2012.The figures do not include meals purchased or provided
away from home, for example in restaurants or at schools, which in
England provide free lunches for poorer pupils.
11 The study was released alongside a piece of longer-term research
from the IFS, which showed the English now consume 15-30 percent fewer
calories than in 1980, despite higher obesity rates probably due to
less physical activity.
12 This contrasts with the United States, where calorie consumption
has risen as well as obesity.
13 The IFS said it was were researching further into trends in
Britons' physical activity over the period.
I googled the original news article to try to figure out why your data looks like it does (missing whitespace between sentences where I wouldn't expect it in a formal news article), and it looks like the original problem is that no whitespace is inserted between HTML paragraphs. If you can fix that problem with how the article is extracted from the original HTML (insert whitespace when you run into <p> or </p>), you won't have this problem with spacy or other tools.
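A minimal sketch of that idea, assuming you still have the article's raw HTML: it uses the standard-library HTMLParser and emits a space at every paragraph boundary. The ParagraphText class and the html_to_text helper are names invented for this illustration, and the sample markup is hypothetical.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect text and emit a space at every <p> / </p> boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag == "p":
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of html with paragraphs separated by single spaces."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each boundary becomes exactly one space.
    return " ".join("".join(parser.parts).split())


# Hypothetical markup with no whitespace between the paragraphs.
sample = '<p>Higher in saturated fat and sugar.</p><p>"Quality declined," said the IFS.</p>'
print(html_to_text(sample))
```

With the space restored at the paragraph boundary, the quote starts its own whitespace-separated sentence, which is what the sentence splitter expects.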
The models available in standard tools will often be trained on news data and it's reasonable to expect them to work well for data like this, but they expect whitespace between sentences. Unless you retrain the models with data including missing whitespace between sentences (or preprocess your data as suggested in a comment), you're going to have these kinds of problems.
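If you cannot go back to the HTML, one possible preprocessing step is a regular expression that inserts a space after sentence-final punctuation that is glued to the next sentence. This is only a heuristic sketch, not necessarily what the comment mentioned above proposed; the add_missing_spaces name and the pattern are assumptions, and the pattern will misfire on cases such as abbreviations or a closing quote directly followed by text.

```python
import re

# A lowercase letter or digit, then ., ! or ?, directly followed by an
# uppercase letter or an opening double quote: insert a space in between.
_MISSING_SPACE = re.compile(r'(?<=[a-z0-9][.!?])(?=["A-Z])')


def add_missing_spaces(text):
    """Insert a space where a sentence boundary has no whitespace (heuristic)."""
    return _MISSING_SPACE.sub(" ", text)


print(add_missing_spaces('higher in saturated fat and sugar."Various measures declined'))
print(add_missing_spaces("between 2005 and 2012.The figures do not include meals"))
print(add_missing_spaces("spending 8.5 percent less"))  # decimal numbers are left alone
```

Run this on the raw body before passing it to spaCy, and spot-check the result, since a heuristic like this can also split where it should not.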

Retrieve info between paragraph tags with feedparser

I've been reading through the documentation for feedparser and haven't been able to find a solution to this: I would like to retrieve only the string between <p></p>. An example of an excerpt from a feed I'd like to retrieve this from is:
<img alt="Dawsons" height="259" src="http://i.cbc.ca/1.2703554.1405073659!/fileImage/httpImage/image.jpg_gen/derivatives/16x9_460/dawsons.jpg" title="Kathy Dawson and her daughter Emily Dawson, 18, now have a complaint before the Alberta Human Rights Commission over a sexual education course Emily had to take last year. " width="460" /> <p>The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.</p>
Note: this is from the RSS feed at http://www.cbc.ca/cmlink/rss-topstories
which I retrieved with
for item in cbc.entries:
    print item.summary
I know I could easily write something to manually parse through and return only what I want but is there a way feedparser can do it for me?
I don't see anything in the docs about parsing by tag, but BeautifulSoup can get the text:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://www.cbc.ca/cmlink/rss-topstories")
soup = BeautifulSoup(r.content)
print [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
[u"Search teams are returning to the home of Kathy and Alvin Liknes today for another sweep of the property, close to two weeks after the couple and their grandson Nathan O'Brien were discovered missing in Calgary.", u"Israel widened its air assault against the Gaza Strip's Hamas militants on Saturday, hitting targets that included a mosque the Israeli military said was being used to conceal rockets. Meanwhile, there are reports Hamas has launched rockets at Tel Aviv.", u'The Sunni militant group ISIS, which wants to create an Islamic state spanning Iraq and Syria, has issued a recruitment video using the image and words of a dead Ontario man who had become a jihadist and joined the fighting in Syria.', u'A Hamilton-area man\u2019s dashcam may have saved him a pricey car insurance payout \u2013 and maybe even from falling victim to an insurance scam, an industry expert says.', u'Tommy Ramone, a co-founder of the seminal punk band the Ramones and the last surviving member of the original group, has died, a business associate said Saturday.', u"During high-stake police interrogations and on seemingly meaningless online dating profiles, some people find themselves lying. So, how can you tell if someone isn't telling you the truth?", u"Israeli strikes in Gaza have led to sleepless nights and anxious Palestinian children, CBC's Derek Stoffel reports from a refugee camp in Gaza City.", u'Saskatchewan Premier Brad Wall has been a vocal proponent of abolishing the Senate. With the Prime Minister now under pressure to fill vacancies in the upper chamber, Wall argues that not appointing new senators might be the way to get rid of the institution.', u"Bassist Charlie Haden, who helped change the shape of jazz more than a half-century ago as a member of Ornette Coleman's groundbreaking quartet and liberated the bass from its traditional rhythm section role, has died. 
He was 76.", u"Tracy Morgan has sued Wal-Mart over last month's highway crash that seriously injured him and killed a fellow comedian.", u'Buying pot is normally a subtle affair, but not for Mike Boyer, who camped out to become the first person to legally purchase marijuana in Washington state.', u"Monika Platek, CBC's lead producer for social media during the World Cup, looks at some of the standout moments so far from the 2014 World Cup", u'Our weekly round-up of remarkable photos includes scenes from Brazil, Spain, Germany, India and elsewhere around the world.', u'The European Union said on Saturday that it has extended sanctions to cover 11 leaders of the pro-Moscow rebellion in eastern Ukraine.', u'The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.']
You could combine both:
import feedparser
d = feedparser.parse("http://www.cbc.ca/cmlink/rss-topstories")
soup = BeautifulSoup("".join([item.summary for item in d.entries]))
print [''.join(s.findAll(text=True)) for s in soup.findAll('p')]
[u"Search teams are returning to the home of Kathy and Alvin Liknes today for another sweep of the property, close to two weeks after the couple and their grandson Nathan O'Brien were discovered missing in Calgary.", u"Israel widened its air assault against the Gaza Strip's Hamas militants on Saturday, hitting targets that included a mosque the Israeli military said was being used to conceal rockets. Meanwhile, there are reports Hamas has launched rockets at Tel Aviv.", u'The Sunni militant group ISIS, which wants to create an Islamic state spanning Iraq and Syria, has issued a recruitment video using the image and words of a dead Ontario man who had become a jihadist and joined the fighting in Syria.', u'A Hamilton-area man\u2019s dashcam may have saved him a pricey car insurance payout \u2013 and maybe even from falling victim to an insurance scam, an industry expert says.', u'Tommy Ramone, a co-founder of the seminal punk band the Ramones and the last surviving member of the original group, has died, a business associate said Saturday.', u"During high-stake police interrogations and on seemingly meaningless online dating profiles, some people find themselves lying. So, how can you tell if someone isn't telling you the truth?", u"Israeli strikes in Gaza have led to sleepless nights and anxious Palestinian children, CBC's Derek Stoffel reports from a refugee camp in Gaza City.", u'Saskatchewan Premier Brad Wall has been a vocal proponent of abolishing the Senate. With the Prime Minister now under pressure to fill vacancies in the upper chamber, Wall argues that not appointing new senators might be the way to get rid of the institution.', u"Bassist Charlie Haden, who helped change the shape of jazz more than a half-century ago as a member of Ornette Coleman's groundbreaking quartet and liberated the bass from its traditional rhythm section role, has died. 
He was 76.", u"Tracy Morgan has sued Wal-Mart over last month's highway crash that seriously injured him and killed a fellow comedian.", u'Buying pot is normally a subtle affair, but not for Mike Boyer, who camped out to become the first person to legally purchase marijuana in Washington state.', u"Monika Platek, CBC's lead producer for social media during the World Cup, looks at some of the standout moments so far from the 2014 World Cup", u'Our weekly round-up of remarkable photos includes scenes from Brazil, Spain, Germany, India and elsewhere around the world.', u'The European Union said on Saturday that it has extended sanctions to cover 11 leaders of the pro-Moscow rebellion in eastern Ukraine.', u'The Edmonton Public School Board has said it will tell teachers not to use an anti-abortion centre to teach part of its sexual education curriculum, after a McNally high school student filed a human rights complaint over what she was taught.']
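I just import re and run findall on each entry's summary:
for entry in yourfeed.entries:
    justtheParagraphs = re.findall("<p>(.*?)</p>", entry.summary)  # list of the text inside each <p>...</p>
I hope this is a sensible example. Because the pattern has a single capture group, findall already returns the captured strings, so there is no .group(1) to call on its result. You can use re.search(...).group(1) if you only want the first match, but I usually find myself wanting every "<p>(.*?)</p>" match.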
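As a minimal sketch of the regex approach, here are findall and search applied to a shortened version of the summary string from the question. The string is hard-coded so the example runs without feedparser or network access, and the variable names are illustrative only.

```python
import re

# Shortened copy of the feed summary from the question (hard-coded for illustration).
summary = (
    '<img alt="Dawsons" height="259" width="460" /> '
    "<p>The Edmonton Public School Board has said it will tell teachers "
    "not to use an anti-abortion centre.</p>"
)

# One capture group: findall returns a list of the captured strings.
paragraphs = re.findall(r"<p>(.*?)</p>", summary, flags=re.DOTALL)

# search returns a match object (or None), so .group(1) applies here.
first = re.search(r"<p>(.*?)</p>", summary, flags=re.DOTALL)
first_paragraph = first.group(1) if first else None

print(paragraphs)
```

The re.DOTALL flag is an addition so that a paragraph containing a line break still matches. A regex like this is fine for simple, flat markup; for nested or irregular HTML a real parser is the safer choice.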

Reading data from CSV using pandas

I am trying to read data from a csv using pandas, like so :
import pandas as p
loadData = lambda f: np.genfromtxt(open(f,'r'), delimiter=',')
print "loading data.."
traindata = list(np.array(p.read_csv('FinalCSVFin.csv', delimiter=";"))[:,2])
I wish for this to give me a list of the 2nd column of the FinalCSVFin.csv. However, it is returning the error :
---------------------------------------------------------------------------
CParserError Traceback (most recent call last)
<ipython-input-7-de5ad26b44d2> in <module>()
7
8 print "loading data.."
CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 16
An extract of the CSV :
url;urlid;boilerplate;label;alexarank;;;;
http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html;4042;"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest provider of computer services looks to Silicon Valley for input gleaning many ideas from its Almaden research center in San Jose California Holographic conversations projected from mobile phones lead this year s list The predictions also include air breathing batteries computer programs that can tell when and where traffic jams will take place environmental information generated by sensors in cars and phones and cities powered by the heat thrown off by computer servers These are all stretch goals and that s good said Paul Saffo managing director of foresight at the investment advisory firm Discern in San Francisco In an era when pessimism is the new black a little dose of technological optimism is not a bad thing For IBM it s not just idle speculation The company is one of the few big corporations investing in long range research projects and it counts on innovation to fuel growth Saffo said Not all of its predictions pan out though IBM was overly optimistic about the spread of speech 
technology for instance When the ideas do lead to products they can have broad implications for society as well as IBM s bottom line he said Research Spending They have continued to do research when all the other grand research organizations are gone said Saffo who is also a consulting associate professor at Stanford University IBM invested 5 8 billion in research and development last year 6 1 percent of revenue While that s down from about 10 percent in the early 1990s the company spends a bigger share on research than its computing rivals Hewlett Packard Co the top maker of personal computers spent 2 4 percent last year At Almaden scientists work on projects that don t always fit in with IBM s computer business The lab s research includes efforts to develop an electric car battery that runs 500 miles on one charge a filtration system for desalination and a program that shows changes in geographic data IBM rose 9 cents to 146 04 at 11 02 a m in New York Stock Exchange composite trading The stock had gained 11 percent this year before today Citizen Science The list is meant to give a window into the company s innovation engine said Josephine Cheng a vice president at IBM s Almaden lab All this demonstrates a real culture of innovation at IBM and willingness to devote itself to solving some of the world s biggest problems she said Many of the predictions are based on projects that IBM has in the works One of this year s ideas that sensors in cars wallets and personal devices will give scientists better data about the environment is an expansion of the company s citizen science initiative Earlier this year IBM teamed up with the California State Water Resources Control Board and the City of San Jose Environmental Services to help gather information about waterways Researchers from Almaden created an application that lets smartphone users snap photos of streams and creeks and report back on conditions The hope is that these casual observations will help local and 
state officials who don t have the resources to do the work themselves Traffic Predictors IBM also sees data helping shorten commutes in the next five years Computer programs will use algorithms and real time traffic information to predict which roads will have backups and how to avoid getting stuck Batteries may last 10 times longer in 2015 than today IBM says Rather than using the current lithium ion technology new models could rely on energy dense metals that only need to interact with the air to recharge Some electronic devices might ditch batteries altogether and use something similar to kinetic wristwatches which only need to be shaken to generate a charge The final prediction involves recycling the heat generated by computers and data centers Almost half of the power used by data centers is currently spent keeping the computers cool IBM scientists say it would be better to harness that heat to warm houses and offices In IBM s first list of predictions compiled at the end of 2006 researchers said instantaneous speech translation would become the norm That hasn t happened yet While some programs can quickly translate electronic documents and instant messages and other apps can perform limited speech translation there s nothing widely available that acts like the universal translator in Star Trek Second Life The company also predicted that online immersive environments such as Second Life would become more widespread While immersive video games are as popular as ever Second Life s growth has slowed Internet users are flocking instead to the more 2 D environments of Facebook Inc and Twitter Inc Meanwhile a 2007 prediction that mobile phones will act as a wallet ticket broker concierge bank and shopping assistant is coming true thanks to the explosion of smartphone applications Consumers can pay bills through their banking apps buy movie tickets and get instant feedback on potential purchases all with a few taps on their phones The nice thing about the list is 
that it provokes thought Saffo said If everything came true they wouldn t be doing their job To contact the reporter on this story Ryan Flinn in San Francisco at rflinn bloomberg net To contact the editor responsible for this story Tom Giles at tgiles5 bloomberg net by 2015, your mobile phone will project a 3-d image of anyone who calls and your laptop will be powered by kinetic energy. at least that\u2019s what international business machines corp. sees in its crystal ball."",""url"":""bloomberg news 2010 12 23 ibm predicts holographic calls air breathing batteries by 2015 html""}";0;345;;;;
http://www.popsci.com/technology/article/2012-07/electronic-futuristic-starting-gun-eliminates-advantages-races;8471;"{""title"":""The Fully Electronic Futuristic Starting Gun That Eliminates Advantages in Races the fully electronic, futuristic starting gun that eliminates advantages in races the fully electronic, futuristic starting gun that eliminates advantages in races"",""body"":""And that can be carried on a plane without the hassle too The Omega E Gun Starting Pistol Omega It s easy to take for granted just how insanely close some Olympic races are and how much the minutiae of it all can matter The perfect example is the traditional starting gun Seems easy You pull a trigger and the race starts Boom What people don t consider When a conventional gun goes off the sound travels to the ears of the closest runner a fraction of a second sooner than the others That s just enough to matter and why the latest starting pistol has traded in the mechanical boom for orchestrated electronic noise Omega has been the watch company tasked as the official timekeeper of the Olympic Games since 1932 At the 2010 Vancouver games they debuted their new starting gun which is a far cry from the iconic revolvers associated with early games it s clearly electronic but still more than a button that s pressed to get the show rolling About as far away as you can get probably while still clearly being a starting gun Pull the trigger once and off the Olympians go If it s pressed twice consecutively it signals a false start Working through a speaker system is what eliminates any kind of advantage for athletes It s not a big advantage being close to a gun but the sound of the bullet traveling one meter every three milliseconds could contribute to a win Powder pistols have been connected to a speaker system before but even then runners could react to the sound of the real pistol firing rather than wait for the speaker sounds to reach them This year s setup will have speakers placed 
equidistant from runners forcing the sound to reach each competitor at exactly the same time It wouldn t be an enormous difference Omega Timing board member Peter H\u00fcrzeler said in an email but when you think about reaction times being measured in tiny fractions of a second placing a speaker behind each lane has eliminated any sort of advantage for any athlete They all hear the start commands and signal at exactly the same moment There s also an ulterior reason for its look In a post September 11th world a gun on its way to a major event is going to raise more than a few TSA eyebrows even if it s a realistic looking fake Rather than deal with that the e gun can be transported while still maintaining the general look of a starting gun But there s still nothing like hearing a starting gun go off at the start of a race more than signaling the runners there s probably some Pavlovian response after more than a century of Olympic games that make people want to hear the real thing not a whiny electronic noise Everyone in the stands at home thankfully will still be getting that The sound is programmable and can be synthesized to sound like almost anything H\u00fcrzeler says but we program it to sound like a pistol it s a way to use the best possible starting technology but to keep a rich tradition alive and that can be carried on a plane without the hassle, too technology,gadgets,london 2012,london olympics,olympics,omega,starting guns,summer olympics,timing,popular science,popsci"",""url"":""popsci technology article 2012 07 electronic futuristic starting gun eliminates advantages races""}";1;5304;;;;
http://www.menshealth.com/health/flu-fighting-fruits?cm_mmc=Facebook-_-MensHealth-_-Content-Health-_-FightFluWithFruit;1164;"{""title"":""Fruits that Fight the Flu fruits that fight the flu | cold & flu | men's health"",""body"":""Apples The most popular source of antioxidants in our diet one apple has an antioxidant effect equivalent to 1 500 mg of vitamin C Apples are loaded with protective flavonoids which may prevent heart disease and cancer Next Papayas With 250 percent of the RDA of vitamin C a papaya can help kick a cold right out of your system The beta carotene and vitamins C and E in papayas reduce inflammation throughout the body lessening the effects of asthma Next Cranberries Cranberries have more antioxidants than other common fruits and veggies One serving has five times the amount in broccoli Cranberries are a natural probiotic enhancing good bacteria levels in the gut and protecting it from foodborne illnesses Next Grapefruit Loaded with vitamin C grapefruit also contains natural compounds called limonoids which can lower cholesterol The red varieties are a potent source of the cancer fighting substance lycopene Next Bananas One of the top food sources of vitamin B6 bananas help reduce fatigue depression stress and insomnia Bananas are high in magnesium which keeps bones strong and potassium which helps prevent heart disease and high blood pressure Next everything you need to know about cold and flu so you don\u2019t get sick this season, at men\u2019s health.com cold, flu, infection, sore throat, sneeze, immunity, germs, allergies, stay healthy, sick, contagious, medicines, cold medicine"",""url"":""menshealth health flu fighting fruits cm mmc Facebook Mens Health Content Health Fight Flu With Fruit""}";1;2663;;;;
What have I done incorrectly here ?
1) A better way
import pandas as pd
separator = ";"
df = pd.read_csv("FinalCSVFin.csv", sep=separator)
2) Your code
You define a function that reads a file with NumPy's genfromtxt, and then use pandas to read your file anyway. I suggest the latter: just use the read_csv method in pandas (as described in 1).
3) Suggestions
Here are the points you can change to get your code working.
You implement a function to read data using np.genfromtxt. The problems are an inconsistent delimiter ("," in the function, ";" in the read_csv call) and the lack of a dtype in genfromtxt. I edited your function as follows:
loadData = lambda f, s: np.genfromtxt(open(f,'r'), dtype=None, delimiter=s)
This gives you an array of tuple-like records (a structured NumPy array when the columns have different types). If your file (i.e. FinalCSVFin.csv) uses ";" as delimiter, call this function as follows:
values = loadData("FinalCSVFin.csv", ";")
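When read_csv reports something like "Expected 5 fields in line 3, saw 16", it can help to count how many fields each line actually produces for a given delimiter. The following is a diagnostic sketch using only the standard-library csv module, not part of the answer above; the field_counts helper and the sample text are hypothetical.

```python
import csv
import io


def field_counts(text, delimiter):
    """Return the number of fields on each row of text for the given delimiter."""
    return [len(row) for row in csv.reader(io.StringIO(text), delimiter=delimiter)]


# Hypothetical miniature of the file: ';' separates the columns, and the quoted
# third column contains commas and doubled quotes.
sample = (
    "url;urlid;boilerplate;label\n"
    'http://example.com/a;1;"{""title"":""a, b, c"",""body"":""x, y""}";0\n'
)

print(field_counts(sample, ";"))  # every row has the same count with the right delimiter
print(field_counts(sample, ","))  # counts differ from row to row with the wrong one
```

Rows whose count deviates from the header's are the ones to inspect by hand, for example for a stray delimiter or an unbalanced quote.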
