Parsing a python string into a JSON file

I need to pass a string returned by a Python function to json.dump(), but I can't find a way to remove the quotation marks around the string after it is written to the JSON file.
I have this:
"[{'number1':1, 'number2': 2, 'number3': 3, 'word1': 'word'},{'number1':2, 'number2': 2, 'number3': 3, 'word1': 'word'}]"
And I need this:
[{'number1':1, 'number2': 2, 'number3': 3, 'word1': 'word'},{'number1':2, 'number2': 2, 'number3': 3, 'word1': 'word'}]
I tried to strip them using str.strip('""'), but the quotation marks stay. I also tried a few things from other threads about similar problems, but none of them worked.
Is there a way to accomplish this, or is it impossible?
Code:
def batchStr(batchIds, count1, count2, status):
    seg1 = "{'batchId':"
    seg2 = ", 'source1count': "
    seg3 = ", 'source2count': "
    seg4 = ", 'status':"
    seg5 = "},"
    str1 = "["
    for id in range(len(batchIds)):
        str2 = str(seg1)+str(batchIds[id])+str(seg2)+str(count1[id])+str(seg3)+str(count2[id])+str(seg4)+str(status[id])+str(seg5)
        str1 = str(str1) + str(str2)
    str1 = str1[:-1]+"]"
    return str1
I want it to output to the JSON like that:
[{'batchId':1, 'source1count': 100, 'source2count': 100, 'status':success},...]
But it outputs:
"[{'batchId':1, 'source1count': 100, 'source2count': 100, 'status':success},...]"

The quotation marks you are seeing are probably just Python telling you that the value being displayed is a string (this happens in the REPL or a Jupyter notebook).
Regardless, here is a cleaned-up version of your code that outputs what you are looking for:
import json
def batchStr(batchIds, count1, count2, status):
    json_array = [
        dict(
            batchId=bid,
            source1count=c1,
            source2count=c2,
            status=s
        )
        for bid, c1, c2, s in
        zip(batchIds, count1, count2, status)
    ]
    return json.dumps(json_array)
print(batchStr([1,2,3], [100]*3, [200]*3, ["success"]*3))
Outputs:
[{"batchId": 1, "source1count": 100, "source2count": 200, "status": "success"}, {"batchId": 2, "source1count": 100, "source2count": 200, "status": "success"}, {"batchId": 3, "source1count": 100, "source2count": 200, "status": "success"}]
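One likely cause of the quotation marks in the file is double encoding: if the function already returns a string and that string is then passed to json.dump, the whole text is written as a single JSON string. A minimal sketch of the difference, assuming a hypothetical `rows` list and using an in-memory io.StringIO buffer as a stand-in for the output file:

```python
import io
import json

rows = [{"batchId": 1, "status": "success"}]  # hypothetical sample data

# Case 1: dumping a string that is already JSON text wraps it in quotes
# and escapes the inner quotes.
already_encoded = json.dumps(rows)
buf_double = io.StringIO()  # stands in for an open file
json.dump(already_encoded, buf_double)

# Case 2: dumping the list itself writes a plain JSON array.
buf_single = io.StringIO()
json.dump(rows, buf_single)

print(buf_double.getvalue())
print(buf_single.getvalue())
```

If this is what is happening, returning the list from the function and calling json.dump once on that list avoids the problem.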

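You should first convert the string representation of your list into an actual list, for example with eval(your_string) (ast.literal_eval is the safer choice if the string is not fully trusted).
Then you can call json.dumps on the resulting list and write the result to the file.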
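As a sketch of that approach, the example below uses ast.literal_eval rather than eval and writes to an io.StringIO buffer, which is only a stand-in for a real file opened with open(..., "w"). The sample string is a shortened version of the one in the question. Note that every value must be a valid Python literal, so an unquoted word such as success would be rejected.

```python
import ast
import io
import json

text = "[{'number1': 1, 'number2': 2, 'number3': 3, 'word1': 'word'}]"

# literal_eval only accepts Python literals (lists, dicts, strings, numbers, ...),
# so unlike eval it cannot execute arbitrary code.
data = ast.literal_eval(text)

buf = io.StringIO()  # stand-in for an open output file
json.dump(data, buf)
print(buf.getvalue())
```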

str.strip removes the quotation marks only when they are actually part of the string's contents; the backslash in '\"' is optional, since '\"' and '"' are the same one-character string.
s = '"[{\'number1\':1, \'number2\': 2, \'number3\': 3, \'word1\': \'word\'},{\'number1\':2, \'number2\': 2, \'number3\': 3, \'word1\': \'word\'}]"'
solution = s.strip('"')
print(solution)
Alternatively, you can use a regular expression (the re module).

Related

Count total number of modal verbs in text

I am trying to create a custom collection of words as shown in the following Categories:
Modal     Tentative    Certainty    Generalizing
Can       Anyhow       Undoubtedly  Generally
May       anytime      Ofcourse     Overall
Might     anything     Definitely   On the Whole
Must      hazy         No doubt     In general
Shall     hope         Doubtless    All in all
ought to  hoped        Never        Basically
will      uncertain    always       Essentially
need      undecidable  absolute     Most
Be to     occasional   assure       Every
Have to   somebody     certain      Some
Would     someone      clear        Often
Should    something    clearly      Rarely
Could     sort         inevitable   None
Used to   sorta        forever      Always
I am reading text from a CSV file row by row:
import nltk
import numpy as np
import pandas as pd
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize
count = defaultdict(int)
header_list = ["modal","Tentative","Certainity","Generalization"]
categorydf = pd.read_csv('Custom-Dictionary1.csv', names=header_list)
def analyze(file):
    df = pd.read_csv(file)
    modals = str(categorydf['modal'])
    tentative = str(categorydf['Tentative'])
    certainity = str(categorydf['Certainity'])
    generalization = str(categorydf['Generalization'])
    for text in df["Text"]:
        tokenize_text = text.split()
        for w in tokenize_text:
            if w in modals:
                count[w] += 1
analyze("test1.csv")
print(sum(count.values()))
print(count)
I want to count the Modal/Tentative/Certainty words from the table above that occur in each row of test1.csv, but I have not been able to. The code above generates word frequencies instead:
19
defaultdict(<class 'int'>, {'to': 7, 'an': 1, 'will': 2, 'a': 7, 'all': 2})
Note that 'an' and 'a' are not in the table. What I want is the number of modal verbs, that is, the total count of modal verbs present in one row of the test1.csv text.
test1.csv:
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
"They convey the content of a communication."
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
I am stuck and not getting anything. How can I proceed?

I've solved your task for the CSV format in your question; it could of course be adapted to XML input if needed.
The first solution is fairly elaborate because it uses NumPy, but it runs very fast and is suitable for large data, even gigabytes.
It sorts the table of words, sorts the words of each text to count them, and then looks them up in the sorted table with a binary search, so it runs in O(n log n) time.
For each text it prints the original line, then a Found line listing each word found in the table in sorted order as (Count, Modality, (TableRow, TableCol)), then a Non-Found line listing the words not in the table with their counts (the number of occurrences of the word in the text).
A much simpler (but slower) solution follows the first one.
import io, pandas as pd, numpy as np
# Instead of io.StringIO(...) provide filename.
tab = pd.read_csv(io.StringIO("""
Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
tabc = np.array(tab.columns.values.tolist(), dtype = np.str_)
taba = tab.values.astype(np.str_)
tabw = np.char.lower(taba.ravel())
tabi = np.zeros([tabw.size, 2], dtype = np.int64)
tabi[:, 0], tabi[:, 1] = [e.ravel() for e in np.split(np.mgrid[:taba.shape[0], :taba.shape[1]], 2, axis = 0)]
t = np.argsort(tabw)
tabw, tabi = tabw[t], tabi[t, :]
texts = pd.read_csv(io.StringIO("""
Text
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
""")).values[:, 0].astype(np.str_)
for i, (a, text) in enumerate(zip(map(np.array, np.char.split(texts)), texts)):
    vs, cs = np.unique(np.char.lower(a), return_counts = True)
    ps = np.searchsorted(tabw, vs)
    unc = np.zeros_like(a, dtype = np.bool_)
    psm = ps < tabi.shape[0]
    psm[psm] = tabw[ps[psm]] == vs[psm]
    print(
        i, ': Text:', text,
        '\nFound:',
        ', '.join([f'"{vs[i]}": ({cs[i]}, {tabc[tabi[ps[i], 1]]}, ({tabi[ps[i], 0]}, {tabi[ps[i], 1]}))'
            for i in np.flatnonzero(psm).tolist()]),
        '\nNon-Found:',
        ', '.join([f'"{vs[i]}": {cs[i]}'
            for i in np.flatnonzero(~psm).tolist()]),
        '\n',
    )
Outputs:
0 : Text: When LIWC was first developed, the goal was to devise an efficient will system
Found: "will": (1, Modal, (6, 0))
Non-Found: "an": 1, "developed,": 1, "devise": 1, "efficient": 1, "first": 1, "goal": 1, "liwc": 1, "system": 1, "the": 1, "to": 1, "was": 2, "when": 1
1 : Text: Within a few years, it became clear that there are two very broad categories of words
Found: "clear": (1, Certainty, (10, 2))
Non-Found: "a": 1, "are": 1, "became": 1, "broad": 1, "categories": 1, "few": 1, "it": 1, "of": 1, "that": 1, "there": 1, "two": 1, "very": 1, "within": 1, "words": 1, "years,": 1
2 : Text: Content words are generally nouns, regular verbs, and many adjectives and adverbs.
Found: "generally": (1, Generalizing, (0, 3))
Non-Found: "adjectives": 1, "adverbs.": 1, "and": 2, "are": 1, "content": 1, "many": 1, "nouns,": 1, "regular": 1, "verbs,": 1, "words": 1
3 : Text: They convey the content of a communication.
Found:
Non-Found: "a": 1, "communication.": 1, "content": 1, "convey": 1, "of": 1, "the": 1, "they": 1
4 : Text: To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”
Found:
Non-Found: "a": 1, "and": 2, "are:": 1, "back": 1, "content": 1, "dark": 1, "go": 1, "night”": 1, "phrase": 1, "stormy": 1, "the": 2, "to": 2, "was": 1, "words": 1, "“dark,”": 1, "“it": 1, "“night.”": 1, "“stormy,”": 1
The second solution is implemented in pure Python for simplicity; only the standard modules io and csv are used.
import io, csv
# Instead of io.StringIO(...) just read from filename.
tab = csv.DictReader(io.StringIO("""Modal,Tentative,Certainty,Generalizing
Can,Anyhow,Undoubtedly,Generally
May,anytime,Ofcourse,Overall
Might,anything,Definitely,On the Whole
Must,hazy,No doubt,In general
Shall,hope,Doubtless,All in all
ought to,hoped,Never,Basically
will,uncertain,always,Essentially
need,undecidable,absolute,Most
Be to,occasional,assure,Every
Have to,somebody,certain,Some
Would,someone,clear,Often
Should,something,clearly,Rarely
Could,sort,inevitable,None
Used to,sorta,forever,Always
"""))
texts = csv.DictReader(io.StringIO("""
"When LIWC was first developed, the goal was to devise an efficient will system"
"Within a few years, it became clear that there are two very broad categories of words"
"Content words are generally nouns, regular verbs, and many adjectives and adverbs."
They convey the content of a communication.
"To go back to the phrase “It was a dark and stormy night” the content words are: “dark,” “stormy,” and “night.”"
"""), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
It outputs:
'will': (1, Modal)
'clear': (1, Certainty)
'generally': (1, Generalizing)
I read the CSV content from StringIO for convenience, so that the code is self-contained and needs no extra files. In your case you will want to read the files directly, which can be done as in the following code:
import io, csv
tab = csv.DictReader(open('table.csv', 'r', encoding = 'utf-8-sig'))
texts = csv.DictReader(open('texts.csv', 'r', encoding = 'utf-8-sig'), fieldnames = ['Text'])
tabi = dict(sorted([(v.lower(), k) for e in tab for k, v in e.items()]))
texts = [e['Text'] for e in texts]
for text in texts:
    cnt, mod = {}, {}
    for word in text.lower().split():
        if word in tabi:
            cnt[word], mod[word] = cnt.get(word, 0) + 1, tabi[word]
    print(', '.join([f"'{word}': ({cnt[word]}, {mod[word]})" for word, _ in sorted(cnt.items(), key = lambda e: e[0])]))
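The question asked for a total per category for each row rather than per-word counts. A minimal sketch of one way to get that is shown below. It is not part of the answer above: the `categories` dict is a hypothetical hand-typed subset of the table (in practice it would be built from the CSV), and it swaps whitespace splitting for regular-expression word-boundary matching so that trailing punctuation and multi-word entries such as "ought to" are handled.

```python
import re
from collections import Counter

# Hypothetical subset of the category table; in practice load it from the CSV.
categories = {
    "Modal": ["can", "may", "will", "ought to", "have to"],
    "Certainty": ["clear", "clearly", "always"],
    "Generalizing": ["generally", "in general", "often"],
}

def count_categories(text, categories):
    """Return a Counter mapping category name -> number of matches in text."""
    counts = Counter()
    lowered = text.lower()
    for name, phrases in categories.items():
        for phrase in phrases:
            # \b keeps "clear" from matching inside "clearly".
            pattern = r"\b" + re.escape(phrase) + r"\b"
            counts[name] += len(re.findall(pattern, lowered))
    return counts

rows = [
    "When LIWC was first developed, the goal was to devise an efficient will system",
    "Within a few years, it became clear that there are two very broad categories of words",
    "Content words are generally nouns, regular verbs, and many adjectives and adverbs.",
]
for row in rows:
    print(dict(count_categories(row, categories)))
```

Each printed dict holds one total per category for that row, which is the "number of modal verbs per row" figure the question was after.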

index word in dictionary

I have a text file and want to store each word of it in a dictionary, and then print the index positions at which each word occurs in the file.
The code I have only gives me the number of times each word occurs. How can I change this?
I have already converted the words to lowercase.
dicti = {}
for eachword in wordsintxt:
    freq = dicti.get(eachword, None)
    if freq is None:
        dicti[eachword] = 1
    else:
        dicti[eachword] = freq + 1
print(dicti)

Change your code to keep the indices themselves, rather than merely count them:
dicti = {}
for index, eachword in enumerate(wordsintxt):
    if eachword not in dicti:
        dicti[eachword] = []
    # Record every position, including the first occurrence.
    dicti[eachword].append(index)
If you still need the word frequency, that's easy to recover:
freq = len(dicti[word])
Update per OP comment
Without enumerate, simply provide that functionality yourself:
for index in range(len(wordsintxt)):
    eachword = wordsintxt[index]
I'm not sure why you'd want to do that; the operation is idiomatic and common enough that Python developers created enumerate for exactly that purpose.

You can use this:
wordsintxt = ["hello", "world", "the", "a", "Hello", "my", "name", "is", "the"]
words_data = {}
for i, word in enumerate(wordsintxt):
    word = word.lower()
    words_data[word] = words_data.get(word, {'freq': 0, 'indexes': []})
    words_data[word]['freq'] += 1
    words_data[word]['indexes'].append(i)
for k, v in words_data.items():
    print(k, '\t', v)
Which prints:
hello {'freq': 2, 'indexes': [0, 4]}
world {'freq': 1, 'indexes': [1]}
the {'freq': 2, 'indexes': [2, 8]}
a {'freq': 1, 'indexes': [3]}
my {'freq': 1, 'indexes': [5]}
name {'freq': 1, 'indexes': [6]}
is {'freq': 1, 'indexes': [7]}
You can avoid checking whether the key already exists in the dictionary, and then handling that case separately, by using data[key] = data.get(key, STARTING_VALUE).
Greetings!

Use collections.defaultdict with enumerate, just append all the indexes you retrieve from enumerate
from collections import defaultdict
with open('test.txt') as f:
    content = f.read()
words = content.split()
dd = defaultdict(list)
for i, v in enumerate(words):
    dd[v.lower()].append(i)
print(dd)
# defaultdict(<class 'list'>, {'i': [0, 6, 35, 54, 57], 'have': [1, 36, 58],... 'lowercase.': [62]})

How to Convert Dict to string

number_pad = {"b":2,"a":2,"c":2,"x":4,"y":4,"z":4}
My print statement is
print(get_number,('00bx'))
How can I get output like this: 0024?
I tried this:
get_number = ""
for key, val in number_pad.items():
    get_number = get_number + str(val)
Is there any way to relate letters to numbers?

You can use dict.get(...) for your task:
number_pad = {"b":2,"a":2,"c":2,"x":4,"y":4,"z":4}
text = '00bx'
re_text = ""
for t in text:
    re_text += str(number_pad.get(t, t))
print(re_text) # output: 0024
Or you can condense it to this:
re_text = "".join(str(number_pad.get(t, t)) for t in text)
print(re_text) # output: 0024

Is this essentially what you are after?
str(number_pad['b']) + str(number_pad['x'])

Maybe you can try the map function with a lambda expression.
number_pad = {'b': 2, 'a': 2, 'c': 2, 'x': 4, 'y': 4, 'z': 4}
print(''.join(map(lambda c: str(number_pad[c]) if c in number_pad.keys() else c, '00bx')))
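Another option, not mentioned in the answers above, is the built-in str.translate. The sketch below assumes every key in number_pad is a single character (which str.maketrans requires) and wraps the lookup in a hypothetical get_number function named after the one in the question.

```python
number_pad = {"b": 2, "a": 2, "c": 2, "x": 4, "y": 4, "z": 4}

# str.maketrans needs single-character keys; the digits are converted to
# strings so that each letter is replaced by its digit character.
table = str.maketrans({letter: str(digit) for letter, digit in number_pad.items()})

def get_number(text):
    """Replace every letter found in number_pad; leave other characters alone."""
    return text.translate(table)

print(get_number("00bx"))  # 0024
```

The translation table is built once, so repeated calls only pay for the translate step.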

Pretty print JSON dumps

I use this code to pretty print a dict into JSON:
import json
d = {'a': 'blah', 'b': 'foo', 'c': [1,2,3]}
print(json.dumps(d, indent = 2, separators=(',', ': ')))
Output:
{
  "a": "blah",
  "c": [
    1,
    2,
    3
  ],
  "b": "foo"
}
This is a little bit too much (newline for each list element!).
Which syntax should I use to have this:
{
  "a": "blah",
  "c": [1, 2, 3],
  "b": "foo"
}
instead?

I ended up using jsbeautifier:
import json
import jsbeautifier
opts = jsbeautifier.default_options()
opts.indent_size = 2
print(jsbeautifier.beautify(json.dumps(d), opts))
Output:
{
  "a": "blah",
  "c": [1, 2, 3],
  "b": "foo"
}

After years, I found a solution with the built-in pprint module:
import pprint
d = {'a': 'blah', 'b': 'foo', 'c': [1,2,3]}
pprint.pprint(d) # default width=80 so this will be printed in a single line
pprint.pprint(d, width=20) # here it will be wrapped exactly as expected
Output:
{'a': 'blah',
'b': 'foo',
'c': [1, 2, 3]}

Another alternative is print(json.dumps(d, indent=None, separators=(',\n', ': ')))
The output will be:
{"a": "blah",
"c": [1,
2,
3],
"b": "foo"}
Note that although the official docs at https://docs.python.org/2.7/library/json.html#basic-usage say the default is separators=None, that actually means "use the default of separators=(', ', ': ')". Note also that the comma separator doesn't distinguish between key/value pairs and list elements.

I couldn't get jsbeautifier to do much, so I used regular expressions. I had a JSON pattern like
'{\n "string": [\n 4,\n 1.0,\n 6,\n 1.0,\n 8,\n 1.0,\n 9,\n 1.0\n ],\n...'
that I wanted as
'{\n "string": [ 4, 1.0, 6, 1.0, 8, 1.0, 9, 1.0],\n'
so, with json and re imported:
t = json.dumps(apriori, indent=4)
t = re.sub(r'\[\n {7}', '[', t)
t = re.sub(r'(?<!\]),\n {7}', ',', t)
t = re.sub(r'\n {4}\]', ']', t)
outfile.write(t)
So instead of a single "dump(apriori, t, indent=4)" call, I had those five lines.

This has been bugging me for a while as well. I found a one-liner I'm almost happy with:
print(json.dumps(eval(str(d).replace('[', '"[').replace(']', ']"').replace('(', '"(').replace(')', ')"')), indent=2).replace('\"\\"[', '[').replace(']\\"\"', ']').replace('\"\\"(', '(').replace(')\\"\"', ')'))
This essentially converts all lists and tuples to strings, then uses json.dumps with indent to format the dict. Then you just need to remove the quotes and you're done!
Note: I convert the dict to a string so that all lists/tuples are converted no matter how deeply nested the dict is.
PS. I hope the Python Police won't come after me for using eval... (use with care)

Perhaps not quite as efficient, but consider a simpler case (somewhat tested in Python 3, but probably would work in Python 2 also):
def dictJSONdumps( obj, levels, indentlevels = 0 ):
    import json
    if isinstance( obj, dict ):
        res = []
        for ix in sorted( obj, key=lambda x: str( x )):
            temp = ' ' * indentlevels + json.dumps( ix, ensure_ascii=False ) + ': '
            if levels:
                temp += dictJSONdumps( obj[ ix ], levels-1, indentlevels+1 )
            else:
                temp += json.dumps( obj[ ix ], ensure_ascii=False )
            res.append( temp )
        return '{\n' + ',\n'.join( res ) + '\n}'
    else:
        return json.dumps( obj, ensure_ascii=False )
This might give you some ideas, short of writing your own serializer completely. I used my own favorite indent technique, and hard-coded ensure_ascii, but you could add parameters and pass them along, or hard-code your own, etc.
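For the one-level case in the original question, a smaller variant of the same idea may be enough. This is only a sketch under stated assumptions: the function name dumps_compact_values is hypothetical, the top-level object is a dict with string keys, and every value is serialised on a single line by json.dumps.

```python
import json

def dumps_compact_values(obj, indent=2):
    """Pretty-print a dict one key per line, keeping each value on a single line."""
    pad = " " * indent
    lines = [
        f"{pad}{json.dumps(key)}: {json.dumps(value)}"
        for key, value in obj.items()
    ]
    return "{\n" + ",\n".join(lines) + "\n}"

d = {"a": "blah", "b": "foo", "c": [1, 2, 3]}
print(dumps_compact_values(d))
```

The result is still valid JSON, so json.loads on the returned text gives back the original dict.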

check for null fields in json python

A very naive question, but is there a more robust or better way to do the following?
It actually has nothing to do with JSON as such.
Let's say I have a list (read from a file):
string_list = [ "foo",1,None, "null","[]","bar"]
Now, "null" and "[]" are essentially equivalent to null, but different data structures represent "None" differently, right?
So rather than writing a regex for all these rules, is there a better way to convert "null", "[]" etc. to None?
Thanks

Define a set of values that should be replaced with None and use list comprehension to "replace" them:
>>> string_list = [ "foo",1,None, "null","[]","bar"]
>>> none_items = {"null", "[]"} # or set(("null", "[]"))
>>> [None if item in none_items else item for item in string_list]
['foo', 1, None, None, None, 'bar']
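Or, use map() (wrapped in list() on Python 3, where map returns an iterator):
>>> list(map(lambda x: None if x in none_items else x, string_list))
['foo', 1, None, None, None, 'bar']
A set is used because membership tests on it are O(1) on average.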

You could try:
string_list = [ "foo",1,None, "null","[]","bar"]
nones = [ "null", "[]" ]
print([None if s in nones else s for s in string_list])
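If the list of "empty" spellings is open-ended, one alternative to enumerating them is to let the json module decide. The sketch below is an assumption-laden variant, not something from the answers above: the helper name normalise is hypothetical, and it treats a string as empty when it parses as JSON null or as an empty list, object, or string.

```python
import json

def normalise(item):
    """Map strings that encode an empty JSON value (null, [], {}, "") to None."""
    if not isinstance(item, str):
        return item
    try:
        parsed = json.loads(item)
    except ValueError:  # not JSON text at all, e.g. "foo"
        return item
    # Collapse only null and empty containers/strings; keep everything else as-is.
    if parsed is None or parsed in ([], {}, ""):
        return None
    return item

string_list = ["foo", 1, None, "null", "[]", "bar"]
print([normalise(x) for x in string_list])
```

Strings that happen to be valid JSON but are not empty, such as "0" or "[1]", are returned unchanged.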

1) You shouldn't be converting anything to None.
2) The first thing you want to do is convert to json. The json module will convert null to None, so you don't have to worry about null. And empty json strings, arrays, and objects, will be converted to empty python strings, lists, and dicts, so you won't be dealing with strings at all.
3) Then if you want to filter out the empty objects, you can do things like this:
import json
my_data = json.loads("""
[
"hello",
"",
[],
{},
[1, 2, 3],
{"a": 1, "b": 2}
]
""")
print(my_data)
print([x for x in my_data if x])
--output:--
['hello', '', [], {}, [1, 2, 3], {'a': 1, 'b': 2}]
['hello', [1, 2, 3], {'a': 1, 'b': 2}]
Empty objects (and the number 0) evaluate to False, so a filter like the one above would also drop a 0.
