Below is the code I use to collect some data via IBM's API. However, I have problems saving the output to a CSV table with Python.
These are the columns that I want (and their values):
emotion__document__emotion__anger emotion__document__emotion__joy
emotion__document__emotion__sadness emotion__document__emotion__fear
emotion__document__emotion__disgust sentiment__document__score
sentiment__document__label language entities__relevance
entities__text entities__type entities__count concepts__relevance
concepts__text concepts__dbpedia_resource usage__text_characters
usage__features usage__text_units retrieved_url
This is my code that I use to collect the data:
response = natural_language_understanding.analyze(
    url=url,
    features=[
        Features.Emotion(),
        Features.Sentiment(),
        Features.Concepts(limit=1),
        Features.Entities(limit=1)
    ]
)
data = json.load(response)
rows_list = []
cols = []
for ind, row in enumerate(data):
    if ind == 0:
        cols.append(["usage__{}".format(i) for i in row["usage"].keys()])
        cols.append(["emotion__document__emotion__{}".format(i) for i in row["emotion"]["document"]["emotion"].keys()])
        cols.append(["sentiment__document__{}".format(i) for i in row["sentiment"]["document"].keys()])
        cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
        cols.append(["entities__{}".format(i) for i in row["entities"].keys()])
        cols.append(["retrieved_url"])
    d = OrderedDict()
    d.update(row["usage"])
    d.update(row["emotion"]["document"]["emotion"])
    d.update(row["sentiment"]["document"])
    d.update(row["concepts"])
    d.update(row["entities"])
    d.update({"retrieved_url": row["retrieved_url"]})
    rows_list.append(d)
df = pd.DataFrame(rows_list)
df.columns = [i for subitem in cols for i in subitem]
df.to_csv("featuresoutput.csv", index=False)
Changing
cols.append(["concepts__{}".format(i) for i in row["concepts"][0].keys()])
cols.append(["entities__{}".format(i) for i in row["entities"][0].keys()])
did not solve the problem.
If you get it from an API, the response will be in JSON format. You can write it out to a CSV like this:
import csv, json

response = ...  # the JSON string you get from the API

attributes = ["emotion__document__emotion__anger", "emotion__document__emotion__joy", .....]  # the attributes you want

data = json.loads(response)

with open('output.csv', 'w') as f:
    writer = csv.writer(f, delimiter=',')
    for attribute in attributes:
        writer.writerow(data[attribute][0])
Make sure data is a dict, not a string; json.loads in Python 3.6 should return a dict. Print out a few rows to see how your required data is stored.
This line assigns a string to data:
data=(json.dumps(datas, indent=2))
So here you iterate over the characters of a string:
for ind,row in enumerate(data):
In this case row will be a string, not a dictionary, so for example row["usage"] would raise exactly that kind of error.
Maybe you wanted to iterate over datas?
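To make the pitfall concrete, here is a minimal sketch (the datas value is made up):

```python
import json

# Made-up stand-in for the API output.
datas = [{"usage": {"text_characters": 42}}]

# json.dumps returns a *string*; iterating over it yields single characters.
data = json.dumps(datas, indent=2)
print(type(data))   # <class 'str'>

# Iterate over the original list (or json.loads the string back) to get dicts.
for row in json.loads(data):
    print(row["usage"])   # {'text_characters': 42}
```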
Update
The code has a few other issues, such as:
cols.append(["concepts__{}".format(i) for i in row["concepts"].keys()])
In this case, you would want row["concepts"][0].keys() to get the keys of the first element, because row["concepts"] is an array.
I'm not very familiar with pandas, but I would suggest you look at json_normalize, which is included in pandas and can help flatten the JSON structure. One issue you might face is the concepts and entities fields, which contain arrays of documents; that means you would have to repeat the same document at least max(len(concepts), len(entities)) times.
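As a hedged sketch of that suggestion (the sample document below is made up, but its shape follows the column names in the question), json_normalize with sep="__" reproduces the double-underscore column names:

```python
import pandas as pd

# Made-up document shaped like the NLU response discussed above.
doc = {
    "usage": {"text_characters": 1188, "text_units": 1},
    "sentiment": {"document": {"score": 0.53, "label": "positive"}},
    "retrieved_url": "http://example.com/",
}

# sep="__" joins nested keys with double underscores, matching the
# column names listed in the question.
flat = pd.json_normalize(doc, sep="__")
print(sorted(flat.columns))
# ['retrieved_url', 'sentiment__document__label',
#  'sentiment__document__score', 'usage__text_characters',
#  'usage__text_units']
```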
Related
My csv file looks like this:
Name,Surname,Fathers_name
Prakash,Patel,sudeep
Rohini,Dalal,raghav
Geeta,vakil,umesh
I want to create a dictionary of lists which should be like this:
dict = {Name: [Pakash,Rohini,Geeta], Surname: [Patel,Dalal,vakil], Fathers_name: [sudeep,raghav,umesh]}
This is my code:
with open(ram_details, 'r') as csv_file:
    csv_content = csv.reader(csv_file, delimiter=',')
    header = next(csv_content)
    if header != None:
        for row in csv_content:
            dict['Name'].append(row[0])
It is throwing an error that the key does not exist. Also, is there a better way to get the desired output? Can someone help me with this?
Your code looks fine and should work; still, if you are getting into any trouble, you can always use defaultdict.
import csv
from collections import defaultdict

# dict = {'Name':[],'Surname':[],'FatherName':[]}
d = defaultdict(list)

with open('123.csv', 'r') as csv_file:
    csv_content = csv.reader(csv_file, delimiter=',')
    header = next(csv_content)
    if header != None:
        for row in csv_content:
            # dict['Name'].append(row[0])
            # dict['Surname'].append(row[1])
            # dict['FatherName'].append(row[2])
            d['Name'].append(row[0])
            d['Surname'].append(row[1])
            d['FatherName'].append(row[2])
Please don't name a variable after a built-in function or type (such as dict).
The problem is that you haven't initialized a dictionary object yet, so you are trying to add a key and a value to an object that is not yet known to be a dict. In any case you need to do the following:
result = dict() # <-- this is missing
result[key] = value
Since you want to create a dictionary and want to append to it directly you can also use python's defaultdict.
A working example would be:
import csv
from collections import defaultdict
from pprint import pprint

with open('details.csv', 'r') as csv_file:
    csv_content = csv.reader(csv_file, delimiter=',')
    headers = list(map(str.strip, next(csv_content)))
    result = defaultdict(list)
    if headers != None:
        for row in csv_content:
            for header, element in zip(headers, row):
                result[header].append(element)

pprint(result)
Which leads to the output:
defaultdict(<class 'list'>,
            {'Fathers_name': ['sudeep', 'raghav', 'umesh'],
             'Name': ['Prakash', 'Rohini ', 'Geeta '],
             'Surname': ['Patel ', 'Dalal ', 'vakil ']})
Note 1) my csv file had some extra trailing spaces, which can be removed using strip(), as I did for the headers.
Note 2) I am using the zip function to iterate over the elements and headers at the same time (this saves me from indexing the row).
A possible alternative is using pandas' to_dict method (docs).
You may try to use pandas to achieve that:
import pandas as pd
f = pd.read_csv('todict.csv')
d = f.to_dict(orient='list')
Or, if you like a one-liner:
f = pd.read_csv('todict.csv').to_dict(orient='list')
First you read your csv file into a pandas DataFrame (I saved your sample to a file named todict.csv). Then you use the DataFrame's to_dict method to convert it to a dictionary, specifying that you want lists as your dictionary values, as explained in the documentation.
I am trying to convert a dictionary to CSV so that it is readable (each value under its respective key).
import csv
import json
from urllib.request import urlopen
x = 0
id_num = [848649491, 883560475, 431495539, 883481767, 851341658, 42842466, 173114302, 900616370, 1042383097, 859872672]
for bilangan in id_num:
    with urlopen("https://shopee.com.my/api/v2/item/get?itemid=" + str(bilangan) + "&shopid=1883827") as response:
        source = response.read()
    data = json.loads(source)
    # print(json.dumps(data, indent=2))
    data_list = {x: {'title': productName(), 'price': price(), 'description': description(), 'preorder': checkPreorder(),
                     'estimate delivery': estimateDelivery(), 'variation': variation(), 'category': categories(),
                     'brand': brand(), 'image': image_link()}}
    # print(data_list[x])
    x += 1
I store the counter in x, so it loops from 0 to 1, 2, and so on. I have tried many things but still cannot find a way to make it look like this, or close to this:
https://i.stack.imgur.com/WoOpe.jpg
Use DictWriter from the csv module.
Demo:
import csv

data_list = {'x': {'title': 'productName()', 'price': 'price()', 'description': 'description()', 'preorder': 'checkPreorder()',
                   'estimate delivery': 'estimateDelivery()', 'variation': 'variation()', 'category': 'categories()',
                   'brand': 'brand()', 'image': 'image_link()'}}

with open(filename, "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=data_list["x"].keys())
    writer.writeheader()
    writer.writerow(data_list["x"])
I think maybe you just want to merge some cells, as Excel does?
If so, I think this is not possible in CSV, because the CSV format does not carry cell-style information the way Excel does.
Some possible solutions:
use openpyxl to generate an Excel file instead of a CSV; then you can merge cells with the worksheet.merge_cells() function.
do not try to merge cells; just repeat the title, price and the other fields on each line, so the data looks like:
first line: {'title': 'test_title', 'price': 22, 'image': 'image_link_1'}
second line: {'title': 'test_title', 'price': 22, 'image': 'image_link_2'}
do not try to merge cells, but set the repeated title, price and other fields to a blank string, so they do not show in your csv file.
use a line break to control the format; that will merge multiple lines with the same title into a single line.
Hope that helps.
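A minimal sketch of the blank-string option above, using only the csv module (the rows and field names are made up):

```python
import csv
import io

# Two image rows for the same product (made-up data).
rows = [
    {"title": "test_title", "price": 22, "image": "image_link_1"},
    {"title": "test_title", "price": 22, "image": "image_link_2"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "price", "image"])
writer.writeheader()

prev_key = None
for row in rows:
    key = (row["title"], row["price"])
    if key == prev_key:
        # Blank out the repeated fields so the CSV *looks* merged.
        row = {**row, "title": "", "price": ""}
    prev_key = key
    writer.writerow(row)

print(buf.getvalue())
```

The same loop works unchanged when writing to a real file opened with newline="".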
If I were you, I would have done this a bit differently. I do not like that you are calling so many functions when this website offers a beautiful JSON response back :) Moreover, I would use the pandas library so that I have total control over my data; I am not a CSV lover. This is a silly prototype:
import requests
import pandas as pd

# Create our dictionary with our items lists
data_list = {'title': [], 'price': [], 'description': [], 'preorder': [],
             'estimate delivery': [], 'variation': [], 'categories': [],
             'brand': [], 'image': []}

# API url
url = 'https://shopee.com.my/api/v2/item/get'

id_nums = [848649491, 883560475, 431495539, 883481767, 851341658,
           42842466, 173114302, 900616370, 1042383097, 859872672]

shop_id = 1883827

# Loop through id_nums and return the goodies
for id_num in id_nums:
    params = {
        'itemid': id_num,  # take values from id_nums
        'shopid': shop_id}
    r = requests.get(url, params=params)
    # Check if we got something :)
    if r.ok:
        data_json = r.json()
        # This web site returns a beautiful JSON we can slice :)
        product = data_json['item']
        # Let's populate our data_list with the items we got. We could simply
        # create one function to do this, but for now this will do
        data_list['title'].append(product['name'])
        data_list['price'].append(product['price'])
        data_list['description'].append(product['description'])
        data_list['preorder'].append(product['is_pre_order'])
        data_list['estimate delivery'].append(product['estimated_days'])
        data_list['variation'].append(product['tier_variations'])
        data_list['categories'].append([c['display_name'] for c in product['categories']])
        data_list['brand'].append(product['brand'])
        data_list['image'].append(product['image'])
    else:
        # Do something if we hit a connection error or something;
        # maybe retry or ignore
        pass

# Putting the dictionary into a DataFrame and ordering the columns :)
df = pd.DataFrame(data_list)
df = df[['title', 'price', 'description', 'preorder', 'estimate delivery',
         'variation', 'categories', 'brand', 'image']]

# df.to ...? There are dozens of different ways to store your data
# that are arguably better than CSV, e.g. MongoDB, HDF5 or compressed pickle
df.to_csv('my_data.csv', sep=';', encoding='utf-8', index=False)
I am using python 3.6 and trying to download json file (350 MB) as pandas dataframe using the code below. However, I get the following error:
data_json_str = "[" + ",".join(data) + "]"
TypeError: sequence item 0: expected str instance, bytes found
How can I fix the error?
import pandas as pd

# read the entire file into a python array
with open('C:/Users/Alberto/nutrients.json', 'rb') as f:
    data = f.readlines()

# remove the trailing "\n" from each line
data = map(lambda x: x.rstrip(), data)

# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual business JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"

# now, load it into pandas
data_df = pd.read_json(data_json_str)
From your code, it looks like you're loading a JSON file which has JSON data on each separate line. read_json supports a lines argument for data like this:
data_df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
Note
Remove lines=True if you have a single JSON object instead of individual JSON objects on each line.
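A tiny self-contained illustration of lines=True (the two JSON-lines records below are made up):

```python
import pandas as pd
from io import StringIO

# Two JSON objects, one per line -- the layout read_json(lines=True) expects.
jl = '{"name": "apple", "kcal": 52}\n{"name": "banana", "kcal": 89}\n'

df = pd.read_json(StringIO(jl), lines=True)
print(df["kcal"].tolist())   # [52, 89]
```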
Using the json module you can parse the json into a python object, then create a dataframe from that:
import json
import pandas as pd

with open('C:/Users/Alberto/nutrients.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
If you open the file in binary mode ('rb'), you will get bytes. How about opening it in text mode instead:
with open('C:/Users/Alberto/nutrients.json', 'r') as f:
Also, as noted in this answer, you can use pandas directly:
df = pd.read_json('C:/Users/Alberto/nutrients.json', lines=True)
If you want to convert it into an array of JSON objects, I think this will do what you want:
import json

data = []
with open('nutrients.json', errors='ignore') as f:
    for line in f:
        data.append(json.loads(line))

print(data[0])
The easiest way to read a json file using pandas is:
pd.read_json("sample.json", lines=True, orient='columns')
To deal with nested json like this:
[[{"Value1": 1}, {"value2": 2}], [{"value3": 3}, {"value4": 4}], .....]
use Python basics:
value1 = df['column_name'][0][0].get('Value1')
Please see the code below.
#call the pandas library
import pandas as pd
#set the file location as URL or filepath of the json file
url = 'https://www.something.com/data.json'
#load the json data from the file to a pandas dataframe
df = pd.read_json(url, orient='columns')
#display the top 10 rows from the dataframe (this is to test only)
df.head(10)
Please review the code and modify it based on your needs. I have added comments to explain each line of code. Hope this helps!
I want to know the best way to reverse the lines of a big csv file (50000+ lines) in python 2.7 and rewrite it, keeping the first line where it is.
input:
A;B;C
1;2;3
4;5;6
output
A;B;C
4;5;6
1;2;3
I need to know how to do it in an efficient way in python 2.7.
Thank you guys,
menchopez
Read the csv file using the csv module, and open the output also with the csv module; now you're working with lists as rows.
Use next to write the title line as-is. Now that the first line is consumed, turn the rest of the data into a list to read it fully, and apply writerows to the reversed list:
import csv

with open("in.csv") as fr, open("out.csv", "wb") as fw:
    cr = csv.reader(fr, delimiter=";")
    cw = csv.writer(fw, delimiter=";")
    cw.writerow(next(cr))  # write title as-is
    cw.writerows(reversed(list(cr)))
writerows is the fastest way of doing it, because it involves no explicit Python loop.
Python 3 users have to open the output file using open("out.csv","w",newline="") instead.
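The same idea, sketched on an in-memory file so it runs as-is under Python 3 (sample data taken from the question):

```python
import csv
import io

src = io.StringIO("A;B;C\n1;2;3\n4;5;6\n")
dst = io.StringIO()

cr = csv.reader(src, delimiter=";")
cw = csv.writer(dst, delimiter=";", lineterminator="\n")

cw.writerow(next(cr))             # header stays first
cw.writerows(reversed(list(cr)))  # data rows come out reversed

print(dst.getvalue())
# A;B;C
# 4;5;6
# 1;2;3
```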
If you can use external libraries, the pandas library is good for large files:
import pandas as pd

# load the csv and use row 0 as headers
df = pd.read_csv("filepath.csv", header=0)

# reverse the data (iloc returns a new frame, so assign it back)
df = df.iloc[::-1]
If you cannot use external libraries:
import csv

with open("filepath.csv") as csvFile:
    reader = csv.reader(csvFile)
    # get data
    data = [row for row in reader]

# get headers and remove them from data
headers = data.pop(0)

# reverse the data
data_reversed = data[::-1]

# put the header row back in front of the reversed data
output_data = [headers] + data_reversed
Read as follows:
rows = []
first = True
for row in reader:
    if first:
        first = False
        first_row = row
        continue
    rows.append(row)
Write as follows:
rows.append(first_row)
writer.writerows(rows[::-1])
I have a csv file with rows of data. The first row is headers for the columns.
I'd like to sort the data by some parameter (specifically, the first column), but of course keep the header where it is.
When I do the following, the header disappears completely and is not included in the output file.
Can anyone please advise how to keep the header but skip it and sort the rest of the rows?
(fwiw, the first column is a mix of numbers and letters).
Thanks!
Here's my code:
import csv
import operator

sankey = open('rawforsankey.csv', "rb")
raw_reader = csv.reader(sankey)
raw_data = []
for row in raw_reader:
    raw_data.append(row)
raw_data_sorted = sorted(raw_data, key=operator.itemgetter(0))

myfiletest = open('newfiletest.csv', 'wb')
wr = csv.writer(myfiletest, quoting=csv.QUOTE_ALL)
wr.writerows(raw_data_sorted)

sankey.close()
myfiletest.close()
EDIT: I should mention I tried this variation in the code:
raw_data_sorted = sorted(raw_data[1:], key=operator.itemgetter(0))
but got the same result
You sorted all the data, including the header, which means it is still there, just perhaps somewhere in the middle of your output.
This is how you'd sort a CSV on one column, preserving the header:
import csv
import operator

with open('rawforsankey.csv', "rb") as sankey:
    raw_reader = csv.reader(sankey)
    header = next(raw_reader, None)
    sorted_data = sorted(raw_reader, key=operator.itemgetter(0))

with open('newfiletest.csv', 'wb') as myfiletest:
    wr = csv.writer(myfiletest, quoting=csv.QUOTE_ALL)
    if header:
        wr.writerow(header)
    wr.writerows(sorted_data)
Just remember that the sort is lexicographic, since all columns are strings: 10 sorts before 9, for example. Use a more specific sort key if your data is numeric.
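A quick illustration of the lexicographic pitfall and a numeric sort key (the sample rows are made up):

```python
import operator

rows = [["10", "x"], ["9", "y"], ["2", "z"]]

# Lexicographic: compares strings character by character, so "10" < "2" < "9".
print(sorted(rows, key=operator.itemgetter(0)))
# [['10', 'x'], ['2', 'z'], ['9', 'y']]

# Numeric: convert the first column before comparing.
print(sorted(rows, key=lambda row: int(row[0])))
# [['2', 'z'], ['9', 'y'], ['10', 'x']]
```

Since the question says the first column mixes numbers and letters, int() would fail there; a key that handles both cases would be needed.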