How to access specific text from json data? [python] - python

I need to access the text attribute from this JSON data so I could end up having:
{'description': {'tags': ['outdoor', 'building', 'street', 'city', 'busy', 'people', 'filled', 'traffic', 'many', 'table', 'car', 'group', 'walking', 'bunch', 'crowded', 'large', 'night', 'light', 'standing', 'man', 'tall', 'umbrella', 'riding', 'sign', 'crowd'], 'captions': [{'text': 'a group of people on a city street filled with traffic at night', 'confidence': 0.8241405091548035}]}, 'requestId': '12fd327f-9b9c-4820-9feb-357a776211d3', 'metadata': {'width': 1826, 'height': 2436, 'format': 'Jpeg'}}
text = "The Text"
I have tried parsed['captions']['text'], but this didn't work. Please let me know if you can help. Thanks!

Two problems here. First, captions is under description, and second, text is a key of a dictionary inside a list (first and only item):
>>> import pprint
>>> pprint.pprint(parsed)
{'description': {'captions': [{'confidence': 0.8241405091548035,
'text': 'a group of people on a city street filled with traffic at night'}],
...
So, you could extract the text like this:
>>> parsed['description']['captions'][0]['text']
'a group of people on a city street filled with traffic at night'
Another option could be to use a 3rd-party library that simplifies traversing such JSON structures, for example plucky (full disclosure: I'm the author). With plucky, you can say:
>>> from plucky import pluckable
>>> pluckable(parsed).description.captions.text
['a group of people on a city street filled with traffic at night']
and not worry about dictionaries inside lists.
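If there may be more than one caption, or some keys may be missing, the same lookup can be written defensively. The following is only a minimal sketch, assuming parsed has the shape shown in the question; the helper name caption_texts is made up for illustration and the sample data is shortened:

```python
def caption_texts(parsed):
    """Return the 'text' of every caption, or an empty list if keys are missing."""
    description = parsed.get('description', {})
    captions = description.get('captions', [])
    # captions is a list of dicts, so collect 'text' from each element
    return [caption['text'] for caption in captions if 'text' in caption]


# Shortened copy of the data from the question
parsed = {
    'description': {
        'tags': ['outdoor', 'building'],
        'captions': [{'text': 'a group of people on a city street filled with traffic at night',
                      'confidence': 0.8241405091548035}],
    },
    'requestId': '12fd327f-9b9c-4820-9feb-357a776211d3',
}

print(caption_texts(parsed))
```

Returning a list keeps the caller from having to care whether there are zero, one, or several captions.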

You can use Python's json library here, provided the string is valid JSON (keys and strings in double quotes), like below -
import json
your_json_string = '{"description": {"tags": ["outdoor", "building", "street", "city", "busy", "people", "filled", "traffic", "many", "table", "car", "group", "walking", "bunch", "crowded", "large", "night", "light", "standing", "man", "tall", "umbrella", "riding", "sign", "crowd"], "captions": [{"text": "a group of people on a city street filled with traffic at night", "confidence": 0.8241405091548035}]}, "requestId": "12fd327f-9b9c-4820-9feb-357a776211d3", "metadata": {"width": 1826, "height": 2436, "format": "Jpeg"}}'
data_dict = json.loads(your_json_string)
print(data_dict['description']['captions'][0]['text'])
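Note that json.loads only accepts valid JSON, which requires double quotes. If what you actually have is the text of a Python dict (single quotes, as printed in the question), one possible alternative is ast.literal_eval from the standard library. A small sketch with a shortened, made-up sample string:

```python
import ast
import json

# Text of a Python dict: single quotes, so it is not valid JSON
dict_repr = "{'description': {'captions': [{'text': 'a caption', 'confidence': 0.82}]}}"

try:
    json.loads(dict_repr)
    parsed_as_json = True
except json.JSONDecodeError:
    # json rejects single-quoted keys and strings
    parsed_as_json = False

# literal_eval safely evaluates Python literals (dicts, lists, strings, numbers)
data_dict = ast.literal_eval(dict_repr)
print(parsed_as_json)
print(data_dict['description']['captions'][0]['text'])
```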

Related

Nested Python Object to CSV

I looked up "nested dict" and "nested list" but neither method works.
I have a python object with the following structure:
[{
'id': 'productID1', 'name': 'productname A',
'option': {
'size': {
'type': 'list',
'name': 'size',
'choices': [
{'value': 'M'},
]}},
'variant': [{
'id': 'variantID1',
'choices':
{'size': 'M'},
'attributes':
{'currency': 'USD', 'price': 1}}]
}]
what i need to output is a csv file in the following, flattened structure:
id, productname, variantid, size, currency, price
productID1, productname A, variantID1, M, USD, 1
productID1, productname A, variantID2, L, USD, 2
productID2, productname A, variantID3, XL, USD, 3
I tried this solution: Python: Writing Nested Dictionary to CSV
or this one: From Nested Dictionary to CSV File
I got rid of the [] around and within the data and used the code snippet from the second one, adapted to my needs. In real life I can't get rid of the [] because that's simply the format I get when calling the API.
with open('productdata.csv', 'w', newline='', encoding='utf-8') as output:
    writer = csv.writer(output, delimiter=';', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    for key in sorted(data):
        value = data[key]
        if len(value) > 0:
            writer.writerow([key, value])
        else:
            for i in value:
                writer.writerow([key, i, value])
but the output is like this:
"id";"productID1"
"name";"productname A"
"option";"{'size': {'type': 'list', 'name': 'size', 'choices': {'value': 'M'}}}"
"variant";"{'id': 'variantID1', 'choices': {'size': 'M'}, 'attributes': {'currency': 'USD', 'price': 1}}"
anyone can help me out, please?
thanks in advance
list indices must be integers not strings
The following presents a visual example of a python list:
0 carrot.
1 broccoli.
2 asparagus.
3 cauliflower.
4 corn.
5 cucumber.
6 eggplant.
7 bell pepper
0, 1, 2 are all "indices".
"carrot", "broccoli", etc... are all said to be "values"
Essentially, a python list is a machine which has integer inputs and arbitrary outputs.
Think of a python list as a black-box:
A number, such as 5, goes into the box.
you turn a crank handle attached to the box.
Maybe the string "cucumber" comes out of the box
You got an error: TypeError: list indices must be integers or slices, not str
There are various solutions.
Convert Strings into Integers
Convert the string into an integer.
listy_the_list = ["carrot", "broccoli", "asparagus", "cauliflower"]
string_index = "2"
integer_index = int(string_index)
element = listy_the_list[integer_index]
So that works as long as your string indices look like integers (e.g. "456" or "7").
The integer class constructor, int(), accepts only integer literals.
For example, x = int("3.5") or x = int("three") will produce a ValueError.
Leading and trailing white-space is tolerated, so int("3 ") returns 3 and an explicit int(string_index.strip()) is not required.
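One possible way to guard the conversion is to catch the ValueError that int() raises for strings that are not integer literals. A minimal sketch; the helper name to_index is hypothetical:

```python
def to_index(text):
    """Convert a string such as "2" to an int, or return None if it is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


vegetables = ["carrot", "broccoli", "asparagus", "cauliflower"]

index = to_index("2")
if index is not None:
    print(vegetables[index])

# A string that is not an integer literal yields None instead of an exception
print(to_index("two"))
```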
Use a Container which Allows Keys to be Strings
Long ago, before electronic computers existed, there were various types of containers in the world:
cookie jars
muffin tins
cardboard boxes
glass jars
steel cans.
back-packs
duffel bags
closets/wardrobes
brief-cases
In computer programming there are also various types of "containers"
You do not have to use a list as your container, if you do not want to.
There are containers where the keys (AKA indices) are allowed to be strings, instead of integers.
In Python, the standard container which is like a list, but where the keys/indices can be strings, is a dictionary
thisdict = {
"make": "Ford",
"model": "Mustang",
"year": 1964
}
thisdict["make"] == "Ford"
If you want to index into a container using strings, instead of integers, then use a dict, instead of a list
The following is an example of a Python dict which has state names as input and state abbreviations as output:
us_state_abbrev = {
'Alabama': 'AL',
'Alaska': 'AK',
'American Samoa': 'AS',
'Arizona': 'AZ',
'Arkansas': 'AR',
'California': 'CA',
'Colorado': 'CO',
'Connecticut': 'CT',
'Delaware': 'DE',
'District of Columbia': 'DC',
'Florida': 'FL',
'Georgia': 'GA',
'Guam': 'GU',
'Hawaii': 'HI',
'Idaho': 'ID',
'Illinois': 'IL',
'Indiana': 'IN',
'Iowa': 'IA',
'Kansas': 'KS',
'Kentucky': 'KY',
'Louisiana': 'LA',
'Maine': 'ME',
'Maryland': 'MD',
'Massachusetts': 'MA',
'Michigan': 'MI',
'Minnesota': 'MN',
'Mississippi': 'MS',
'Missouri': 'MO',
'Montana': 'MT',
'Nebraska': 'NE',
'Nevada': 'NV',
'New Hampshire': 'NH',
'New Jersey': 'NJ',
'New Mexico': 'NM',
'New York': 'NY',
'North Carolina': 'NC',
'North Dakota': 'ND',
'Northern Mariana Islands':'MP',
'Ohio': 'OH',
'Oklahoma': 'OK',
'Oregon': 'OR',
'Pennsylvania': 'PA',
'Puerto Rico': 'PR',
'Rhode Island': 'RI',
'South Carolina': 'SC',
'South Dakota': 'SD',
'Tennessee': 'TN',
'Texas': 'TX',
'Utah': 'UT',
'Vermont': 'VT',
'Virgin Islands': 'VI',
'Virginia': 'VA',
'Washington': 'WA',
'West Virginia': 'WV',
'Wisconsin': 'WI',
'Wyoming': 'WY'
}
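A short sketch of looking values up in such a dict. To keep it self-contained it uses a three-entry subset of the mapping above, and 'Atlantis' is a made-up key that is deliberately missing:

```python
us_state_abbrev = {'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ'}

# Index with a string key, just like indexing a list with an integer
print(us_state_abbrev['Alaska'])

# .get() returns a default instead of raising KeyError for unknown keys
print(us_state_abbrev.get('Atlantis', 'unknown'))

# Iterate over key/value pairs
for name, abbrev in us_state_abbrev.items():
    print(name, abbrev)
```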
I could actually iterate over this list and create my own sublist, e.g. a list of variants:
data = [{
'id': 'productID1', 'name': 'productname A',
'option': {
'size': {
'type': 'list',
'name': 'size',
'choices': [
{'value': 'M'},
]}},
'variant': [{
'id': 'variantID1',
'choices':
{'size': 'M'},
'attributes':
{'currency': 'USD', 'price': 1}}]
},
{'id': 'productID2', 'name': 'productname B',
'option': {
'size': {
'type': 'list',
'name': 'size',
'choices': [
{'value': 'XL', 'salue':'XXL'},
]}},
'variant': [{
'id': 'variantID2',
'choices':
{'size': 'XL', 'size2':'XXL'},
'attributes':
{'currency': 'USD', 'price': 2}}]
}
]
new_list = {}
for item in data:
    new_list.update(id=item['id'])
    new_list.update(name=item['name'])
    for variant in item['variant']:
        new_list.update(varid=variant['id'])
        for vchoice in variant['choices']:
            new_list.update(vsize=variant['choices'][vchoice])
        for attribute in variant['attributes']:
            new_list.update(vprice=variant['attributes'][attribute])
    for option in item['option']['size']['choices']:
        new_list.update(osize=option['value'])
print(new_list)
but the output is always the last item of the iteration, because i always overwrite new_list with update().
{'id': 'productID2', 'name': 'productname B', 'varid': 'variantID2', 'vsize': 'XXL', 'vprice': 2, 'osize': 'XL'}
here's the final solution which worked for me:
data = [{
'id': 'productID1', 'name': 'productname A',
'variant': [{
'id': 'variantID1',
'choices':
{'size': 'M'},
'attributes':
{'currency': 'USD', 'price': 1}},
{'id':'variantID2',
'choices':
{'size': 'L'},
'attributes':
{'currency':'USD', 'price':2}}
]
},
{
'id': 'productID2', 'name': 'productname B',
'variant': [{
'id': 'variantID3',
'choices':
{'size': 'XL'},
'attributes':
{'currency': 'USD', 'price': 3}},
{'id':'variantID4',
'choices':
{'size': 'XXL'},
'attributes':
{'currency':'USD', 'price':4}}
]
}
]
import csv
# Collect one flat dict per variant
products = []
for item in data:
    for variant in item['variant']:
        dic = {}
        dic.update(ProductID=item['id'])
        dic.update(Name=item['name'].title())
        dic.update(ID=variant['id'])
        dic.update(size=variant['choices']['size'])
        dic.update(Price=variant['attributes']['price'])
        products.append(dic)
# Use the keys of the first row as the CSV header
keys = products[0].keys()
with open('productdata.csv', 'w', newline='', encoding='utf-8') as output_file:
    dict_writer = csv.DictWriter(output_file, keys, delimiter=';', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
    dict_writer.writeheader()
    dict_writer.writerows(products)
with the following output:
"ProductID";"Name";"ID";"size";"Price"
"productID1";"Productname A";"variantID1";"M";1
"productID1";"Productname A";"variantID2";"L";2
"productID2";"Productname B";"variantID3";"XL";3
"productID2";"Productname B";"variantID4";"XXL";4
which is exactly what i wanted.

Search for multiple words in a list using python

I'm currently working on my first python project. The goal is to be able to summarise a webpage's information by searching for and printing sentences that contain a specific word from a word list I generate. For example, the following (large) list contains 'business key terms' I generated by using cewl on business websites;
business_list = ['business', 'marketing', 'market', 'price', 'management', 'terms', 'product', 'research', 'organisation', 'external', 'operations', 'organisations', 'tools', 'people', 'sales', 'growth', 'quality', 'resources', 'revenue', 'account', 'value', 'process', 'level', 'stakeholders', 'structure', 'company', 'accounts', 'development', 'personal', 'corporate', 'functions', 'products', 'activity', 'demand', 'share', 'services', 'communication', 'period', 'example', 'total', 'decision', 'companies', 'service', 'working', 'businesses', 'amount', 'number', 'scale', 'means', 'needs', 'customers', 'competition', 'brand', 'image', 'strategies', 'consumer', 'based', 'policy', 'increase', 'could', 'industry', 'manufacture', 'assets', 'social', 'sector', 'strategy', 'markets', 'information', 'benefits', 'selling', 'decisions', 'performance', 'training', 'customer', 'purchase', 'person', 'rates', 'examples', 'strategic', 'determine', 'matrix', 'focus', 'goals', 'individual', 'potential', 'managers', 'important', 'achieve', 'influence', 'impact', 'definition', 'employees', 'knowledge', 'economies', 'skills', 'buying', 'competitive', 'specific', 'ability', 'provide', 'activities', 'improve', 'productivity', 'action', 'power', 'capital', 'related', 'target', 'critical', 'stage', 'opportunities', 'section', 'system', 'review', 'effective', 'stock', 'technology', 'relationship', 'plans', 'opportunity', 'leader', 'niche', 'success', 'stages', 'manager', 'venture', 'trends', 'media', 'state', 'negotiation', 'network', 'successful', 'teams', 'offer', 'generate', 'contract', 'systems', 'manage', 'relevant', 'published', 'criteria', 'sellers', 'offers', 'seller', 'campaigns', 'economy', 'buyers', 'everyone', 'medium', 'valuable', 'model', 'enterprise', 'partnerships', 'buyer', 'compensation', 'partners', 'leaders', 'build', 'commission', 'engage', 'clients', 'partner', 'quota', 'focused', 'modern', 'career', 'executive', 'qualified', 'tactics', 'supplier', 'investors', 
'entrepreneurs', 'financing', 'commercial', 'finances', 'entrepreneurial', 'entrepreneur', 'reports', 'interview', 'ansoff']
And the following program allows me to copy all the text from a URL i specify and organises it into a list, in which the elements are separated by sentence;
from bs4 import BeautifulSoup
import urllib.request as ul
url = input("Enter URL: ")
html = ul.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
for script in soup(["script", "style"]):
    script.decompose()
strips = list(soup.stripped_strings)
# Joining list to form single text
text = " ".join(strips)
text = text.lower()
# Replacing substitutes of '.'
for i in range(len(text)):
    if text[i] in "?!:;":
        text = text.replace(text[i], ".")
# Splitting text by sentences
sentences = text.split(".")
My current objective is for the program to print all sentences that contain one (or more) of the key terms above, however I've only been successful with single words at a time;
# Word to search for in the text
word_search = input("Enter word: ")
word_search = word_search.lower()
sentences_with_word = []
for x in sentences:
    if x.count(word_search) > 0:
        sentences_with_word.append(x)
# Separating sentences into separate lines
sentence_text = "\n\n".join(sentences_with_word)
print(sentence_text)
Could somebody demonstrate how this could be achieved for an entire list at once? Thanks.
Edit
As suggested by MachineLearner, here is an example of the output for a single word. If I use wikipedia's page on marketing for the URL and choose the word 'marketing' as the input for 'word_search', this is a segment of the output generated (although the entire output is almost 600 lines long);
marketing mix the marketing mix is a foundational tool used to guide decision making in marketing
the marketing mix represents the basic tools which marketers can use to bring their products or services to market
they are the foundation of managerial marketing and the marketing plan typically devotes a section to the marketing mix
the 4ps [ edit ] the traditional marketing mix refers to four broad levels of marketing decision
Use a double loop to check multiple words contained in a list:
output = []
for sentence in sentences:
    for word in words:  # words is the list of key terms, e.g. business_list
        if sentence.count(word) > 0:
            output.append(sentence)
            # Do not forget to break out of the inner loop, otherwise
            # the same sentence is appended once for every
            # key term it contains
            break
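The double loop matches substrings, so a term like 'market' would also match 'supermarkets'. If whole-word matches are wanted, one possible variant is a single compiled regular expression with word boundaries. This is only a sketch; the function name sentences_with_terms and the sample sentences are made up for illustration:

```python
import re


def sentences_with_terms(sentences, terms):
    """Return the sentences that contain at least one term as a whole word."""
    if not terms:
        return []
    # Build one alternation pattern; re.escape protects special characters
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [s for s in sentences if pattern.search(s)]


sentences = [
    "the marketing mix is a foundational tool",
    "supermarkets sell vegetables",
    "price and product are part of the mix",
]
terms = ["marketing", "market", "price"]

matches = sentences_with_terms(sentences, terms)
print("\n\n".join(matches))
```

Each sentence appears at most once in the result, however many terms it contains, so no break is needed.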

Trying to generate a list from a JSON object (TypeError list indices must be integers or slices, not str)

I have retrieved a JSON object from an API. The JSON object looks like this:
{'copyright': 'Copyright (c) 2020 The New York Times Company. All Rights '
'Reserved.',
'response': {'docs': [{'_id': 'nyt://article/e3e5e5e5-1b32-5e2b-aea7-cf20c558dbd3',
'abstract': 'LEAD: RESEARCHERS at the Brookhaven '
'National Laboratory are employing a novel '
'model to study skin cancer in humans: '
'they are exposing tiny tropical fish to '
'ultraviolet radiation.',
'byline': {'organization': None,
'original': 'By Eric Schmitt',
'person': [{'firstname': 'Eric',
'lastname': 'Schmitt',
'middlename': None,
'organization': '',
'qualifier': None,
'rank': 1,
'role': 'reported',
'title': None}]},
'document_type': 'article',
'headline': {'content_kicker': None,
'kicker': None,
'main': 'Tiny Fish Help Solve Cancer '
'Riddle',
'name': None,
'print_headline': 'Tiny Fish Help Solve '
'Cancer Riddle',
'seo': None,
'sub': None},
'keywords': [{'major': 'N',
'name': 'organizations',
'rank': 1,
'value': 'Brookhaven National '
'Laboratory'},
{'major': 'N',
'name': 'subject',
'rank': 2,
'value': 'Ozone'},
{'major': 'N',
'name': 'subject',
'rank': 3,
'value': 'Radiation'},
{'major': 'N',
'name': 'subject',
'rank': 4,
'value': 'Cancer'},
{'major': 'N',
'name': 'subject',
'rank': 5,
'value': 'Research'},
{'major': 'N',
'name': 'subject',
'rank': 6,
'value': 'Fish and Other Marine Life'}],
'lead_paragraph': 'RESEARCHERS at the Brookhaven '
'National Laboratory are employing a '
'novel model to study skin cancer in '
'humans: they are exposing tiny '
'tropical fish to ultraviolet '
'radiation.',
'multimedia': [],
'news_desk': 'Science Desk',
'print_page': '3',
'print_section': 'C',
'pub_date': '1989-12-26T05:00:00+0000',
'section_name': 'Science',
'snippet': '',
'source': 'The New York Times',
'type_of_material': 'News',
'uri': 'nyt://article/e3e5e5e5-1b32-5e2b-aea7-cf20c558dbd3',
'web_url': 'https://www.nytimes.com/1989/12/26/science/tiny-fish-help-solve-cancer-riddle.html',
'word_count': 870},
{'_id': 'nyt://article/32a2431d-623a-525b-a21d-d401be865818',
'abstract': 'LEAD: Clouds, even the ones formed by '
...and continues like that, too long to show all of it here.
Now, when I want to list just one headline, I use:
pprint(articles['response']['docs'][0]['headline']['print_headline'])
And I get the output
'Tiny Fish Help Solve Cancer Riddle'
The problem is when I want to pick out all of the headlines from this JSON object, and make a list of them. I tried:
index = 0
for headline in articles:
    headlineslist = ['response']['docs'][index]['headline']['print_headline'].split("''")
    index = index + 1
headlineslist
But I get the error TypeError: list indices must be integers or slices, not str
In other words, it worked when I "listed" just one headline, at index [0], but not when I try to repeat the process over each index. How do I iterate through each index to get a list of outputs like the first one?
To iterate over the document list you can just do the following:
for doc in articles['response']['docs']:
    print(doc['headline']['print_headline'])
This would print all headlines.
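Since the question asks for a list rather than printed output, the same iteration can be written as a list comprehension. A minimal sketch; the articles dict below is a cut-down stand-in for the API response, and its second headline is invented for the example:

```python
# Cut-down stand-in for the API response; the second headline is made up
articles = {
    'response': {
        'docs': [
            {'headline': {'print_headline': 'Tiny Fish Help Solve Cancer Riddle'}},
            {'headline': {'print_headline': 'A Second, Made-Up Headline'}},
        ]
    }
}

# articles['response']['docs'] is the list; each doc is a dict
headlines = [doc['headline']['print_headline'] for doc in articles['response']['docs']]
print(headlines)
```

The original error came from indexing the list literal ['response'] with the string 'docs'; the lookup has to start from articles itself.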

Get value from data-set field sublist

I have a dataset (that pulls its data from a dict) that I am attempting to clean and republish. Within this dataset, there is a field with a sublist that I would like to extract specific data from.
Here's the data:
[{'id': 'oH58h122Jpv47pqXhL9p_Q', 'alias': 'original-pizza-brooklyn-4', 'name': 'Original Pizza', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/HVT0Vr_Vh52R_niODyPzCQ/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/original-pizza-brooklyn-4?adjust_creative=IelPnWlrTpzPtN2YRie19A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=IelPnWlrTpzPtN2YRie19A', 'review_count': 102, 'categories': [{'alias': 'pizza', 'title': 'Pizza'}], 'rating': 4.0, 'coordinates': {'latitude': 40.63781, 'longitude': -73.8963799}, 'transactions': [], 'price': '$', 'location': {'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}, 'phone': '+17185313559', 'display_phone': '(718) 531-3559', 'distance': 319.98144420799355},
Here's how the data is presented within the csv/spreadsheet:
location
{'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}
Is there a way to pull location.city for example?
The below code simply adds a few fields and exports it to a csv.
def data_set(data):
    df = pd.DataFrame(data)
    df['zip'] = get_zip()
    df['region'] = get_region()
    newdf = df.filter(['name', 'phone', 'location', 'zip', 'region', 'coordinates', 'rating', 'review_count',
                       'categories', 'url'], axis=1)
    # Check the same file that is written below
    if not os.path.isfile('data.csv'):
        newdf.to_csv('data.csv', header='column_names')
    else:  # else it exists so append without writing the header
        newdf.to_csv('data.csv', mode='a', header=False)
If that doesn't make sense, please let me know. Thanks in advance!
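One possible approach is to copy the nested value into a top-level key before the records are turned into a DataFrame. The sketch below is plain Python and makes a few assumptions: the helper name add_city is made up, the records are shortened, and a location that arrives as text (for example after being read back from the CSV) is assumed to be the printed form of a Python dict:

```python
import ast


def add_city(records):
    """Return copies of the records with a top-level 'city' taken from 'location'."""
    result = []
    for record in records:
        location = record.get('location', {})
        if isinstance(location, str):
            # Read back from a CSV, the dict arrives as text; parse it safely
            location = ast.literal_eval(location)
        flat = dict(record)  # shallow copy so the input is not modified
        flat['city'] = location.get('city')
        result.append(flat)
    return result


# Shortened records: one with a real dict, one with its text form
data = [
    {'name': 'Original Pizza',
     'location': {'address1': '9514 Ave L', 'city': 'Brooklyn', 'zip_code': '11236'}},
    {'name': 'Original Pizza',
     'location': "{'address1': '9514 Ave L', 'city': 'Brooklyn', 'zip_code': '11236'}"},
]

print([row['city'] for row in add_city(data)])
```

The returned list can be passed to pd.DataFrame in data_set, with 'city' added to the list of columns given to df.filter.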

Splitting multiple Dictionaries within a Pandas Column

I'm trying to split a dictionary with a list within a pandas column but it isn't working for me...
The column looks like so when called;
df.topics[3]
Output
"[{'urlkey': 'webdesign', 'name': 'Web Design', 'id': 659}, {'urlkey': 'productdesign', 'name': 'Product Design', 'id': 2993}, {'urlkey': 'internetpro', 'name': 'Internet Professionals', 'id': 10102}, {'urlkey': 'web', 'name': 'Web Technology', 'id': 10209}, {'urlkey': 'software-product-management', 'name': 'Software Product Management', 'id': 42278}, {'urlkey': 'new-product-development-software-tech', 'name': 'New Product Development: Software & Tech', 'id': 62946}, {'urlkey': 'product-management', 'name': 'Product Management', 'id': 93740}, {'urlkey': 'internet-startups', 'name': 'Internet Startups', 'id': 128595}]"
I want to only be left with the 'name' and 'id' to put into separate columns of topic_1, topic_2, and so forth.
Appreciate any help.
You can give this a try.
import json
df.topics.apply(lambda x : {x['id']:x['name'] for x in json.loads(x.replace("'",'"'))} )
Your output for the row you gave is :
{659: 'Web Design',
2993: 'Product Design',
10102: 'Internet Professionals',
10209: 'Web Technology',
42278: 'Software Product Management',
62946: 'New Product Development: Software & Tech',
93740: 'Product Management',
128595: 'Internet Startups'}
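You could also try a simple loop:
import ast
# df.topics[3] is shown as a string in the question, so parse it into a list first
dt = ast.literal_eval(df.topics[3])
li = []
for x in range(len(dt)):
    t = {dt[x]['id']: dt[x]['name']}
    li.append(t)
print(li)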
Output is-
[{659: 'Web Design'},
{2993: 'Product Design'},
{10102: 'Internet Professionals'},
{10209: 'Web Technology'},
{42278: 'Software Product Management'},
{62946: 'New Product Development: Software & Tech'},
{93740: 'Product Management'},
{128595: 'Internet Startups'}]
First we take the value of df.topics[3] as dt, which is a list of dictionaries. Then we create a temporary list li to collect the results. The loop runs once for each element of dt; on each pass, dt[x]['id'] and dt[x]['name'] give the id and name of one topic (659 and 'Web Design' for x = 0), and { : } combines them into a one-item dictionary t, which is appended to li.
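Neither answer produces the separate topic_1, topic_2, ... columns the question asks for. A possible sketch of that step follows; the function name topic_columns and the column naming scheme (topic_1_name, topic_1_id, ...) are assumptions, and the sample string is a shortened copy of the row from the question:

```python
import ast


def topic_columns(raw):
    """Turn the string form of a topics list into a flat dict of numbered columns."""
    # Parse the text into a list of dicts; accept an already-parsed list too
    topics = ast.literal_eval(raw) if isinstance(raw, str) else raw
    row = {}
    for number, topic in enumerate(topics, start=1):
        row[f'topic_{number}_name'] = topic['name']
        row[f'topic_{number}_id'] = topic['id']
    return row


# Shortened copy of the value shown for df.topics[3]
raw = ("[{'urlkey': 'webdesign', 'name': 'Web Design', 'id': 659}, "
       "{'urlkey': 'productdesign', 'name': 'Product Design', 'id': 2993}]")

print(topic_columns(raw))
```

With pandas, something like pd.DataFrame([topic_columns(value) for value in df.topics]) should then give one column per key, with missing values in rows that have fewer topics.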
