BB data extraction - limited set of data - python

I am trying to extract some data from the BB Finance API via RapidAPI and am having some difficulty extracting the data.
I have saved responses to some files (txt, csv, xlsx, json) that I have used to see if I can get the data that I need, but cannot find a way to get it.
This is an example:
"title":"Bloomberg Markets"
"modules":[26 items
0:{6 items
"id":"hero_1"
"title":NULL
"skipDedup":false
"type":"single_story"
"tracking":{2 items
"id":"36ba25c6-8f14-472a-8068-75b40b7ec2dc"
"title":"Hero"
}
"stories":[1 item
0:{23 items
"id":"2022-02-15/stocks-set-to-rise-on-prospect-of-easing-tension-markets-wrap"
"internalID":"R7D425T0AFB401"
"title":"Rally Falters on Return of Russia, Inflation Risks: Markets Wrap"
"summary":""
"autoGeneratedSummary":"The global stock rally stalled on Wednesday as traders struggled
to evaluate the risk of geopolitical tension in Ukraine and the impact of rising
inflation on central bank policies."
"abstract":[2 items
0:"Treasuries, gold steady; Italian bond yields at two-year high"
1:"Biden cautious on Ukraine as Russia says some troops withdrawn"
I am interested in:
stories: title,
stories: summary,
stories: autoGeneratedSummary &
stories: abstract
but cannot find a way to just extract that info. From there I was thinking of bringing it into a DataFrame so that I can work with it.
Is there a way to do so?
thanks
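One possible approach, sketched below: load the saved JSON response into a dict and flatten the nested modules -> stories structure with pandas.json_normalize. The file name bb_response.json is hypothetical; the key names are taken from the snippet above.

import json
import pandas as pd

# Load a saved API response (file name is hypothetical; adjust to your file).
with open("bb_response.json") as f:
    payload = json.load(f)

# Keep only modules that actually contain stories, then flatten to one row per story.
modules_with_stories = [m for m in payload["modules"] if m.get("stories")]
stories = pd.json_normalize(modules_with_stories, record_path="stories")

# Select just the four fields of interest; "abstract" stays a list of strings.
df = stories[["title", "summary", "autoGeneratedSummary", "abstract"]]
print(df.head())

If one bullet per row is easier to work with, df.explode("abstract") will unpack the abstract lists afterwards.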


Using regex to capture substring within a pandas df

I’m trying to extract specific substrings from larger phrases contained in my Pandas dataframe. I have rows formatted like so:
Appointment of DAVID MERRIGAN of Hammonds Plains, Nova Scotia, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of CARLA R. CONKIN of Fort Steele, British Columbia, to be Vice-Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of JUDY A. WHITE, Q.C., of Conne River, Newfoundland and Labrador, to be Chairman of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
Appointment of GRETA SITTICHINLI of Inuvik, Northwest Territories, to be a member of the Inuvialuit Arbitration Board, to hold office during pleasure for a term of three years.
and I've been able to capture the capitalized names (e.g. DAVID MERRIGAN) with the regex below, but I'm struggling to capture the locations, i.e. the 'of' phrase following the capitalized name that ends at the second comma. I've tried isolating the rest of the string that follows the name with the following code, but it doesn't seem to work; I keep getting -1 as a response.
df_appointments['Name'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)')
df_appointments['Location'] = df_appointments['Precis'].str.find(r'\b[A-Z]+(?:\s+[A-Z]+)\b\s([^\n\r]*)')
Any help showing me how to isolate the location substring with regex (after that I can figure out how to get the position, etc) would be tremendously appreciated. Thank you.
The following pattern works for your sample set:
rgx = r'(?:\w\s)+([A-Z\s\.,]+)(?:\sof\s)([A-Za-z\s]+,\s[A-Za-z\s]+)'
It uses capture groups and non-capture groups to isolate only the names and locations from the strings. (As an aside, str.find performs a literal substring search rather than a regex match, which is why your attempts returned -1; str.extract is the regex-aware method.) Rather than requiring two patterns, and having to perform two searches, you can then do the following to extract that information into two new columns:
df[['name', 'location']] = df['precis'].str.extract(rgx)
This then produces:
df
                precis                  name                          location
0    Appointment of...        DAVID MERRIGAN      Hammonds Plains, Nova Scotia
1    Appointment of...       CARLA R. CONKIN     Fort Steele, British Columbia
2    Appointment of...  JUDY A. WHITE, Q.C.,  Conne River, Newfoundland and...
3    Appointment of...     GRETA SITTICHINLI     Inuvik, Northwest Territories
Depending on the exact format of all of your precis values, you might have to tweak the pattern to suit perfectly, but hopefully it gets you going...
# Final Answer
import pandas as pd

data = pd.read_csv(r"C:\Users\yueheng.li\Desktop\Up\Question_20220824\Data.csv")

# Split each precis on the literal substring 'of', at most twice:
# part 1 is 'Appointment ', part 2 holds the name, part 3 starts with the location.
# (Note this breaks if a name itself contains 'of', e.g. 'Hoffman'.)
data[['Field_Part1', 'Field_Part2', 'Field_Part3']] = data['Precis'].str.split('of', n=2, expand=True)

# The location is the first two comma-separated pieces of part 3.
data['Address_part1'] = data['Field_Part3'].str.split(',').str[0]
data['Address_part2'] = data['Field_Part3'].str.split(',').str[1]
data['Address'] = data['Address_part1'] + ',' + data['Address_part2']

# Drop the intermediate helper columns.
data.drop(['Field_Part1', 'Field_Part2', 'Field_Part3', 'Address_part1', 'Address_part2'], axis=1, inplace=True)

# Output below
data
Easy way to understand.
Thanks,
Leon

Context-Based Paragraph Splitting of an Article or long Text

I want to chunk long text into paragraphs that are context-based. Right now I just split the text into sentences and chunk them every 250 words, calling that a paragraph, but that is obviously not a good way to make a paragraph: it is "dumb", related information beyond the 250-word cutoff gets left out, and the result isn't really a paragraph, just whatever sentences fit before the 250 words are used up. I want context-based paragraph splitting, so it is "smart" and actually produces paragraphs.
The code below is what I have now:
import re
newtext = '''
Kendrick Lamar Duckworth is an American rapper, songwriter, and record producer. He is often cited as one of the most influential rappers of his generation. Aside from his solo career, he is also a member of the hip hop supergroup Black Hippy alongside his former Top Dawg Entertainment (TDE) labelmates Ab-Soul, Jay Rock, and Schoolboy Q. Raised in Compton, California, Lamar embarked on his musical career as a teenager under the stage name K.Dot, releasing a mixtape titled Y.H.N.I.C. (Hub City Threat Minor of the Year) that garnered local attention and led to his signing with indie record label TDE. He began to gain recognition in 2010 after his first retail release, Overly Dedicated. The following year, he independently released his first studio album, Section.80, which included his debut single "HiiiPoWeR". By that time, he had amassed a large online following and collaborated with several prominent rappers. He subsequently secured a record deal with Dr. Dre's Aftermath Entertainment, under the aegis of Interscope Records. Lamar's major-label debut album, Good Kid, M.A.A.D City, was released in 2012, garnering him widespread critical recognition and mainstream success. His third album To Pimp a Butterfly (2015), which incorporated elements of funk, soul, jazz, and spoken word, predominantly centred around the Black-American experience. It became his first number-one album on the US Billboard 200 and was an enormous critical success. His fourth album, Damn (2017), saw continued acclaim, becoming the first non-classical and non-jazz album to be awarded the Pulitzer Prize for Music. It also yielded his first number-one single, "Humble", on the US Billboard Hot 100. Lamar curated the soundtrack to the superhero film Black Panther (2018) and in 2022, released his fifth and last album with TDE, Mr. Morale & the Big Steppers, which received critical acclaim. Lamar has certified sales of over 70 million records in the United States alone, and all of his albums have been certified platinum or higher by the Recording Industry Association of America (RIAA). He has received several accolades in his career, including 14 Grammy Awards, two American Music Awards, six Billboard Music Awards, 11 MTV Video Music Awards, a Pulitzer Prize, a Brit Award, and an Academy Award nomination. In 2012, MTV named him the Hottest MC in the Game on their annual list. Time named him one of the 100 most influential people in the world in 2016. In 2015, he received the California State Senate's Generational Icon Award. Three of his studio albums were included on Rolling Stone's 2020 list of the 500 Greatest Albums of All Time. Kendrick Lamar Duckworth was born in Compton, California on June 17, 1987, the son of a couple from Chicago. Although not in a gang himself, he grew up around gang members, with his closest friends being Westside Piru Bloods and his father, Kenny Duckworth, being a Gangster Disciple. His first name was given to him by his mother in honor of singer-songwriter Eddie Kendricks of The Temptations. He grew up on welfare and in Section 8 housing. In 1995, at the age of eight, Lamar witnessed his idols Tupac Shakur and Dr. Dre filming the music video for their hit single "California Love", which proved to be a significant moment in his life. As a child, Lamar attended McNair Elementary and Vanguard Learning Center in the Compton Unified School District. He has admitted to being quiet and shy in school, his mother even confirming he was a "loner" until the age of seven. 
As a teenager, he graduated from Centennial High School in Compton, where he was a straight-A student. Kendrick Lamar has stated that Tupac Shakur, the Notorious B.I.G., Jay-Z, Nas and Eminem are his top five favorite rappers. Tupac Shakur is his biggest influence, and has influenced his music as well as his day-to-day lifestyle. In a 2011 interview with Rolling Stone, Lamar mentioned Mos Def and Snoop Dogg as rappers that he listened to and took influence from during his early years. He also cites now late rapper DMX as an influence: "[DMX] really [got me started] on music," explained Lamar in an interview with Philadelphia's Power 99. "That first album [It's Dark and Hell Is Hot] is classic, [so he had an influence on me]." He has also stated Eazy-E as an influence in a post by Complex saying: "I Wouldn't Be Here Today If It Wasn't for Eazy-E." In a September 2012 interview, Lamar stated rapper Eminem "influenced a lot of my style" and has since credited Eminem for his own aggression, on records such as "Backseat Freestyle". Lamar also gave Lil Wayne's work in Hot Boys credit for influencing his style and praised his longevity. He has said that he also grew up listening to Rakim, Dr. Dre, and Tha Dogg Pound. In January 2013, when asked to name three rappers that have played a role in his style, Lamar said: "It's probably more of a west coast influence. A little bit of Kurupt, [Tupac], with some of the content of Ice Cube." In a November 2013 interview with GQ, when asked "The Four MC's That Made Kendrick Lamar?", he answered Tupac Shakur, Dr. Dre, Snoop Dogg and Mobb Deep, namely Prodigy. Lamar professed to having been influenced by jazz trumpeter Miles Davis and Parliament-Funkadelic during the recording of To Pimp a Butterfly. Lamar has been branded as the "new king of hip hop" numerous times. Forbes said, on Lamar's placement as hip hop's "king", "Kendrick Lamar may or may not be the greatest rapper alive right now. He is certainly in the very short lists of artists in the conversation." Lamar frequently refers to himself as the "greatest rapper alive" and once called himself "The King of New York." On the topic of his music genre, Lamar has said: "You really can't categorize my music, it's human music." Lamar's projects are usually concept albums. Critics found Good Kid, M.A.A.D City heavily influenced by West Coast hip hop and 90s gangsta rap. His third studio album, To Pimp a Butterfly, incorporates elements of funk, jazz, soul and spoken word poetry. Called a "radio-friendly but overtly political rapper" by Pitchfork, Lamar has been a branded "master of storytelling" and his lyrics have been described as "katana-blade sharp" and his flow limber and dexterous. Lamar's writing usually includes references to racism, black empowerment and social injustice, being compared to a State of Union address by The Guardian. His writing has also been called "confessional" and controversial. The New York Times has called Lamar's musical style anti-flamboyant, interior and complex and labelled him as a technical rapper. Billboard described his lyricism as "Shakespearean".
'''
#1100 words
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'  # note: [A-z] also matches punctuation between 'Z' and 'a'; use [A-Za-z]
for sens in re.findall(regex, newtext):
    newtext = newtext.replace(f'{sens}', f'{sens}<eos>')
sentences = newtext.split('<eos>')

current_chunk = 0
chunks = []
for sentence in sentences:
    if len(chunks) == current_chunk + 1:
        if len(chunks[current_chunk]) + len(sentence.split(" ")) <= 250:
            chunks[current_chunk].extend(sentence.split(" "))
        else:
            current_chunk += 1
            chunks.append(sentence.split(" "))
    else:
        chunks.append(sentence.split(" "))
for chunk_id in range(len(chunks)):
    chunks[chunk_id] = " ".join(chunks[chunk_id])
print(chunks)     # printing all split "paragraphs"
print(chunks[0])  # printing 1st "paragraph"
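One way to make the splitting context-aware is to embed each sentence and start a new paragraph whenever the similarity between neighbouring sentences drops. Below is a minimal sketch of that idea; the sentence-transformers package, the all-MiniLM-L6-v2 model, and the 0.35 threshold are all assumptions chosen to illustrate the approach, not a definitive implementation:

# Sketch of similarity-based paragraph splitting.
# Assumes: pip install nltk sentence-transformers
# and a one-time nltk.download("punkt") for the sentence tokenizer.
from nltk.tokenize import sent_tokenize
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

def split_into_paragraphs(text, threshold=0.35):
    """Close the current paragraph when adjacent sentences are dissimilar."""
    sentences = sent_tokenize(text)
    # Unit-length embeddings, so the dot product of neighbours is their cosine similarity.
    embeddings = model.encode(sentences, normalize_embeddings=True)
    sims = (embeddings[:-1] * embeddings[1:]).sum(axis=1)
    paragraphs, current = [], [sentences[0]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim < threshold:  # topic shift -> start a new paragraph
            paragraphs.append(" ".join(current))
            current = []
        current.append(sentence)
    paragraphs.append(" ".join(current))
    return paragraphs

for i, paragraph in enumerate(split_into_paragraphs(newtext)):
    print(i, paragraph[:80], "...")

A fixed word budget can still be layered on top (merging adjacent paragraphs until roughly 250 words) if downstream code needs a size cap.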

Print texts that have cosine similarity score less than 0.90

I want to create a deduplication process for my database.
I want to measure cosine similarity scores, with Python's scikit-learn library, between new texts and texts that are already in the database.
I want to add only documents that have a cosine similarity score of less than 0.90.
This is my code:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
list_of_texts_in_database = ["More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.",
"UK ministers say sanctions could target companies and individuals linked to the Russian government.",
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
"He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.",
"Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.",
"The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control",
"A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control",
"That could lead to a major escalation in conflict."]
list_of_new_texts = ["This is a totaly new document that needs to be added into the database one way or another.",
"Boris Johnson also says the UK could limit Russian firm ability to raise capital on London's market.",
"Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.",
"UK minister say sanctions could target companies and individuals linked to the Russian government.",
"That could lead to a major escalation in conflict."]
vectorizer = TfidfVectorizer(lowercase=True, analyzer='word', stop_words = None, ngram_range=(1, 1))
list_of_texts_in_database_tfidf = vectorizer.fit_transform(list_of_texts_in_database)
list_of_new_texts_tfidf = vectorizer.transform(list_of_new_texts)
cosineSimilarities = cosine_similarity(list_of_new_texts_tfidf, list_of_texts_in_database_tfidf)
print(cosineSimilarities)
This code works well, but I do not know how to map the results back (how to get the texts that have a similarity score of less than 0.90).
My suggestion would be as follows: you only add those texts with a score less than (or equal to) 0.9.
import numpy as np
idx = np.where((cosineSimilarities <= 0.9).all(axis=1))
Then you have the indices of the new texts in list_of_new_texts that do not have a corresponding text with a score > 0.9 in the existing list list_of_texts_in_database.
To combine them, you can do the following (although somebody else might have a cleaner method for this...):
print(
list_of_texts_in_database + list(np.array(list_of_new_texts)[idx[0]])
)
Output:
['More now on the UK prime minister’s plan to impose sanctions against Russia, after it sent troops into eastern Ukraine.',
'UK ministers say sanctions could target companies and individuals linked to the Russian government.',
"Boris Johnson also says the UK could limit Russian firms ability to raise capital on London's markets.",
'He has suggested Western allies are looking at stopping Russian companies trading in pounds and dollars.',
'Other measures Western nations could impose include restricting exports to Russia, or excluding it from the Swift financial messaging service.',
'The rebels and Ukrainian military have been locked for years in a bitter stalemate, along a frontline called the line of control',
'A big question in the coming days, is going to be whether Russia also recognises as independent some of the Donetsk and Luhansk regions that are still under Ukrainian government control',
'That could lead to a major escalation in conflict.',
'This is a totaly new document that needs to be added into the database one way or another.',
'Other measure Western nation can impose include restricting export to Russia, or excluding from the Swift financial messaging services.',
'UK minister say sanctions could target companies and individuals linked to the Russian government.']
Why don't you work within a DataFrame?
import pandas as pd
# Pair the first five database texts with the five new texts, row by row;
# iterating the sparse TF-IDF matrices yields one 1-row matrix per text.
d = {'old_text': list_of_texts_in_database[:5],
     'new_text': list_of_new_texts,
     'old_emb': list(list_of_texts_in_database_tfidf[:5]),
     'new_emb': list(list_of_new_texts_tfidf)}
df = pd.DataFrame(data=d)
df['score'] = df.apply(lambda row: cosine_similarity(row['old_emb'], row['new_emb'])[0][0], axis=1)
# Keep the rows below the 0.9 threshold, i.e. the texts that are safe to add.
df = df.loc[df.score < 0.9]
df.head()
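As a side note, since cosineSimilarities from the question is already the full new-by-old similarity matrix, the same filter can be written without pairing rows at all; a small sketch built on the question's variables:

import numpy as np

# For each new text, take its highest similarity against any database text;
# keep the text only if even that best match stays below the 0.9 threshold.
max_sim = cosineSimilarities.max(axis=1)
texts_to_add = [text for text, sim in zip(list_of_new_texts, max_sim) if sim < 0.9]
print(texts_to_add)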

How to extract acronyms and abbreviations from pandas dataframe?

I have a pandas dataframe with a text column. I want to extract all unique acronyms and abbreviations from that text column.
So far I have a function to extract all acronyms and abbreviations from a given text:
import re

def extract_acronyms_abbreviations(text):
    eaa = {}
    for match in re.finditer(r"\((.*?)\)", text):
        start_index = match.start()
        abbr = match.group(1)
        size = len(abbr)
        # Take as many words before the parentheses as the acronym has letters.
        words = text[:start_index].split()[-size:]
        definition = " ".join(words)
        eaa[abbr] = definition
    return eaa

extract_acronyms_abbreviations(a)
{'FHH': 'family health history', 'NP': 'nurse practitioner'}
I want to apply/extract all unique acronyms and abbreviations from the text column.
Sample Data:
s = """The MLCommons Association, an open engineering consortium dedicated to improving machine learning for everyone, today announced the general availability of the People's Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). This trail-blazing and permissively licensed datasets advance innovation in machine learning research and commercial applications. Also today, the MLCommons Association is issuing a call for participation in the new DataPerf benchmark suite, which measures and encourages innovation in data-centric AI."""
k = """The MLCommons Association is a firm proponent of Data-Centric AI (DCAI), the discipline of systematically engineering the data for AI systems by developing efficient software tools and engineering practices to make dataset creation and curation easier. Our open datasets and tools like DataPerf concretely support the DCAI movement and drive machine learning innovation."""
j = """The key global provider of sustainable packaging solutions has now taken a significant step towards reaching these ambitions by signing two 10-year virtual Power Purchase Agreements (VPPA) with global renewable energy developer BayWa r.e covering its operations in Europe. The agreements form the largest solar VPPA for the packaging industry in Europe, as well as the first major solar VPPA by a Finnish company."""
a = """Although family health history (FHH) is commonly accepted as an important risk factor for common, chronic diseases, it is rarely considered by a nurse practitioner (NP)."""
import pandas as pd
data = {"text":[s,k,j,a,s,k,j]}
df = pd.DataFrame(data)
Desired output
{'MSWC': 'Multilingual Spoken Words Corpus',
'DCAI': 'proponent of Data-Centric AI',
'VPPA': 'virtual Power Purchase Agreements',
'NP': 'nurse practitioner',
'FHH': 'family health history'}
Let's say df['text'] contains the text data you want to work with.
df["acronyms"] = df["text"].apply(extract_acronyms_abbreviations)
# This creates a new column containing the dictionary returned by your function.
Now create a master dict like:
master_dict = dict()
for d in df["acronyms"].values:
    master_dict.update(d)
print(master_dict)
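Equivalently, the per-row dictionaries can be merged in a single comprehension (same result, just more compact):

# Merge every row's {abbr: definition} dict into one; later rows win on duplicate keys.
master_dict = {
    abbr: definition
    for row in df["acronyms"]
    for abbr, definition in row.items()
}
print(master_dict)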

How to count unique words in python with function?

I would like to count unique words with a function. By unique words I mean the distinct words in the text, which is why I used a set here. I put the error below. Does anyone know how to fix this?
Here's my code:
def unique_words(corpus_text_train):
    words = re.findall(r'\w+', corpus_text_train)
    uw = len(set(words))
    return uw

unique = unique_words(test_list_of_str)
unique
I got this error
TypeError: expected string or bytes-like object
Here's my bag of words model:
def BOW_model_relative(df):
    corpus_text_train = []
    for i in range(0, len(df)):  # iterate over the rows in the dataframe
        corpus = df['text'][i]
        #corpus = re.findall(r'\w+', corpus)
        corpus = re.sub(r'[^\w\s]', '', corpus)
        corpus = corpus.lower()
        corpus = corpus.split()
        corpus = ' '.join(corpus)
        corpus_text_train.append(corpus)
    word2count = {}
    for x in corpus_text_train:
        words = word_tokenize(x)
        for word in words:
            if word not in word2count.keys():
                word2count[word] = 1
            else:
                word2count[word] += 1
    total = 0
    for key in word2count.keys():
        total += word2count[key]
    for key in word2count.keys():
        word2count[key] = word2count[key] / total
    return word2count, corpus_text_train

test_dict, test_list_of_str = BOW_model_relative(df)
#test_data = pd.DataFrame(test)
print(test_dict)
Here's my csv data
df = pd.read_csv('test.csv')
,text,title,authors,label
0,"On Saturday, September 17 at 8:30 pm EST, an explosion rocked West 23 Street in Manhattan, in the neighborhood commonly referred to as Chelsea, injuring 29 people, smashing windows and initiating street closures. There were no fatalities. Officials maintain that a homemade bomb, which had been placed in a dumpster, created the explosion. The explosive device was removed by the police at 2:25 am and was sent to a lab in Quantico, Virginia for analysis. A second device, which has been described as a “pressure cooker” device similar to the device used for the Boston Marathon bombing in 2013, was found on West 27th Street between the Avenues of the Americas and Seventh Avenue. By Sunday morning, all 29 people had been released from the hospital. The Chelsea incident came on the heels of an incident Saturday morning in Seaside Heights, New Jersey where a bomb exploded in a trash can along a route where thousands of runners were present to run a 5K Marine Corps charity race. There were no casualties. By Sunday afternoon, law enforcement had learned that the NY and NJ explosives were traced to the same person.
Given that we are now living in a world where acts of terrorism are increasingly more prevalent, when a bomb goes off, our first thought usually goes to the possibility of terrorism. After all, in the last year alone, we have had several significant incidents with a massive number of casualties and injuries in Paris, San Bernardino California, Orlando Florida and Nice, to name a few. And of course, last week we remembered the 15th anniversary of the September 11, 2001 attacks where close to 3,000 people were killed at the hands of terrorists. However, we also live in a world where political correctness is the order of the day and the fear of being labeled a racist supersedes our natural instincts towards self-preservation which, of course, includes identifying the evil-doers. Isn’t that how crimes are solved? Law enforcement tries to identify and locate the perpetrators of the crime or the “bad guys.” Unfortunately, our leadership – who ostensibly wants to protect us – finds their hands and their tongues tied. They are not allowed to be specific about their potential hypotheses for fear of offending anyone.
New York City Mayor Bill de Blasio – who famously ended “stop-and-frisk” profiling in his city – was extremely cautious when making his first remarks following the Chelsea neighborhood explosion. “There is no specific and credible threat to New York City from any terror organization,” de Blasio said late Saturday at the news conference. “We believe at this point in this time this was an intentional act. I want to assure all New Yorkers that the NYPD and … agencies are at full alert”, he said. Isn’t “an intentional act” terrorism? We may not know whether it is from an international terrorist group such as ISIS, or a homegrown terrorist organization or a deranged individual or group of individuals. It is still terrorism. It is not an accident. James O’Neill, the New York City Police Commissioner had already ruled out the possibility that the explosion was caused by a natural gas leak at the time the Mayor made his comments. New York’s Governor Andrew Cuomo was a little more direct than de Blasio saying that there was no evidence of international terrorism and that no specific groups had claimed responsibility. However, he did say that it is a question of how the word “terrorism” is defined. “A bomb exploding in New York is obviously an act of terrorism.” Cuomo hit the nail on the head, but why did need to clarify and caveat before making his “obvious” assessment?
The two candidates for president Hillary Clinton and Donald Trump also weighed in on the Chelsea explosion. Clinton was very generic in her response saying that “we need to do everything we can to support our first responders – also to pray for the victims” and that “we need to let this investigation unfold.” Trump was more direct. “I must tell you that just before I got off the plane a bomb went off in New York and nobody knows what’s going on,” he said. “But boy we are living in a time—we better get very tough folks. We better get very, very tough. It’s a terrible thing that’s going on in our world, in our country and we are going to get tough and smart and vigilant.”
The answer from Kohelet neglects characters such as , and ", which in the OP's case would count "people" and "people," as two unique words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the , and ", you could do the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to add more characters to be replaced.

s = 'aa aa bb cc'

def unique_words(corpus_text_train):
    splitted = corpus_text_train.split()
    return len(set(splitted))

unique_words(s)
Out[14]: 3
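For completeness, the TypeError in the question comes from passing a list of strings to re.findall, which expects a single string (BOW_model_relative returns corpus_text_train as a list). A minimal fix, assuming you want the unique-word count across all documents, is to join the list first:

import re

def unique_words(corpus_text_train):
    # Accept either a single string or a list of document strings.
    if isinstance(corpus_text_train, list):
        corpus_text_train = " ".join(corpus_text_train)
    words = re.findall(r'\w+', corpus_text_train)
    return len(set(words))

unique = unique_words(test_list_of_str)  # no longer raises TypeError
print(unique)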
