I stream data via Server-Sent Events and get about 500,000 datasets, but instead of getting one JSON document I get this (an example of 2 of the 500,000 datasets; this is what it looks like when opened in gedit, all quotation marks are \" and all new lines are \n):
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
...
My goal is to get this into a database. I thought I would put this into a dictionary, create a pandas DataFrame from it, and from there get it into a database. But this turns out to be quite cumbersome. I ended up with something like this:
c1 = data_json[1:-1]
c2 = c1.replace('{data:{', '{\"data\":{')
c3 = c2.replace('}data:{', ', ')
c4 = '{' + c3 + '}'
but even here I have some problems, since I have to add \n\n for the new lines. And as soon as I change c3 to c2.replace('}\n\ndata:{', ', ') I get Process finished with exit code 137 (interrupted by signal 9: SIGKILL). Coming from .NET I could handle this quite easily with a deserializer, and I am wondering if there is a similar way to deserialize the data.
I get the data via sseclient and would be able to store it as bytes instead of a string if that would help, just FYI.
Any suggestions?
Juggling with replaces is of course a convoluted path - the language has parsers for this kind of escaping built in. The simpler of these would be passing the string that contains JSON through an eval call. But eval is seldom needed and should be avoided in most cases as "not elegant", if not outright unsafe (though being unsafe really only applies when you have no control over the input data - and even then, ast.literal_eval instead of plain eval can mitigate that). Anyway, there are other problems with the format that will prevent eval from working outright - the missing quotes around the outermost data:, for example.
Rants aside, if your file content is actually:
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
It has two problems: "under-quoting" of the outermost data and "over-escaping" of the inner data.
In an interactive Python session, using the "raw string" marker, I can input your example line as it would be read from a file:
In [263]: a = r"""data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n"""
In [264]: print(a)
data:{\"data\":[\"Kendrick\",\"Lamar\"]}\n\ndata:{\"data\":[\"David\",\"Bowie\"]}\n\n
So, on to removing one level of backslashes - Python has a "unicode_escape" text encoding, but it only works on bytes objects. We therefore resort to the "latin1" encoding, as it provides a byte-for-byte conversion of the unicode literal in "a" to bytes, and then apply "unicode_escape" to remove the "\":
In [266]: b = a.encode("latin1").decode("unicode_escape")
In [267]: print(b, "\n", repr(b))
data:{"data":["Kendrick","Lamar"]}
data:{"data":["David","Bowie"]}
'data:{"data":["Kendrick","Lamar"]}\n\ndata:{"data":["David","Bowie"]}\n\n'
now it is easy to parse:
We split the resulting string at "\n\n" and have one list with one record
(those you are calling "dataset") per element. Then we resort to string
manipulation to get rid of the starting "data:" and finally, json.loads can work on the remaining part.
so:
import json
raw_data = open("mystrangefile.pseudo_json").read()
# undo the extra level of escaping, as shown above
data = raw_data.encode("latin1").decode("unicode_escape")
# drop the leading "data:" of each record; skip the empty trailing split
records = [json.loads(record.split(":", 1)[-1])
           for record in data.split("\n\n") if record.strip()]
And "records" now should contain well behaved Python objects dictionaries, you can put in a database. (Unless Pandas can provide automatic mapping of the columns to a databas, it seems to be an uneeded step - a raw connection.executemany(""" INSERT ...""", records) with a proper open DB connection should suffice.
Also, on a sidenote, you mentioned that you could handle this easily with a .NET deserializer: that is only true if your files were not as broken as what you have shown us - no standard deserializer could know how to handle such a specific data format out of the box. But if you actually are more proficient in another language/technology, you could write just a converter from the broken input to a properly encoded file, and use that as an intermediate step.
I'm not completely sure if I understood the format in which you get the string correctly, so please correct me if I'm wrong here:
data_json = 'data:{\\"data\\":[\\"Kendrick\\",\\"Lamar\\"]}\\n\\ndata:{\\"data\\":[\\"David\\",\\"Bowie\\"]}\\n\\n'
Your first line seems to strip the first and last character, which I don't see a reason for. Are there any additional characters you are stripping away here?
The two following substring replacements seem to have no effect as the substrings are not present in the initial string (if I got it correctly in the first place).
And finally, in the last line you are wrapping your result with { and }, which is not correct for a list in JSON. It should be [...].
I can't really tell why you would get a SIGKILL here, though. It does not throw any errors for me, it just does not do what you want it to do. Maybe you're running out of memory with all the 500k examples?
However, this would be a working solution (again, given that I got the initial string correctly):
c1 = data_json.replace('\\n\\n', '') # removing escaped newlines
c2 = c1.replace('data:', ',') # replacing the additional 'data:' with json delimiter ','
c3 = c2.replace('\\', '') # removing artificial escapes
c4 = c3[1:] # removing the leading ',' introduced in c2
c5 = '[' + c4 + ']' # wrapping as list
Now you should be able to json.loads(c5) or whatever you need to do with that string.
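For example:

import json

parsed = json.loads(c5)
print(parsed)
# [{'data': ['Kendrick', 'Lamar']}, {'data': ['David', 'Bowie']}]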
I am trying to get some data into a list of dictionaries.
The data comes from a CSV file, so it's all strings.
The keys in the file all have double quotes, but since these are all strings anyway, I want to remove the quotes so they look like this in the dictionary:
{'key':value}
instead of this
{'"key"':value}
I tried simply using string = string[1:-1], but this doesn't work...
Here is my code:
csvDelimiter = ","
tsvDelimiter = "\t"
dataOutput = []
dataFile = open("browser-ww-monthly-201305-201405.csv","r")
for line in dataFile:
    line = line[:-1]        # Removes \n after every line
    data = line.split(csvDelimiter)
    for i in data:
        if type(i) == str:  # Doesn't work, I also tried if isinstance(i, str)
                            # but that didn't work either.
            print i
            i = i[1:-1]
            print i
    dataOutput.append({data[0] : data[1]})
dataFile.close()
print "Data output:\n"
print dataOutput
All the prints I get from print i are good, without double quotes, but when I append data to dataOutput, the quotes are back!
Any idea how to make them disappear forever?
Strip it. For example:
data[0].strip('"')
However, when reading CSV files, the best approach is to use the built-in csv module. It takes care of this for you.
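For instance (a minimal sketch, assuming a two-column file like the one in the question):

import csv

dataOutput = []
with open("browser-ww-monthly-201305-201405.csv", "r") as dataFile:
    for row in csv.reader(dataFile):
        # the csv module handles the quoting, so row[0] comes back without the extra '"'
        dataOutput.append({row[0]: row[1]})

print dataOutput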
As noted in the comments, when dealing with CSV files you truly ought to use Python's built-in csv module (linking to Python 2 docs since it seems that's what you're using).
Another thing to note is that when you do:
data = line.split(csvDelimiter)
every item in the returned list will be a string. There's no point in doing a type check in the loop (though if there were a reason to, you would use isinstance). I don't know what "didn't work" about it, though it's possible you were dealing with unicode strings. On Python 2 you can usually use isinstance(..., basestring), where basestring is a base class for both str and unicode. On Python 3 just use str, unless you know you're dealing with bytes.
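For illustration, on Python 2 (the variable names are just examples):

s = "Firefox"
u = u"Firefox"
print isinstance(s, basestring), isinstance(u, basestring)  # True True
print isinstance(u, str)                                    # False: unicode is not str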
You said: "I tried simply using string = string[1:-1], but this doesn't work...". It seems to work fine for me:
In [101]: s="'word'"
In [102]: s[1:-1]
Out[102]: 'word'
I want to split a string using "},{" as the delimiter. I have tried various things but none of them work.
string="2,1,6,4,5,1},{8,1,4,9,6,6,7,0},{6,1,2,3,9},{2,3,5,4,3 "
Split it into something like this:
2,1,6,4,5,1
8,1,4,9,6,6,7,0
6,1,2,3,9
2,3,5,4,3
string.split("},{") works at the Python console but if I write a Python script in which do this operation it does not work.
You need to assign the result of string.split("},{") to a new string. For example:
string2 = string.split("},{")
I think that is the reason you think it works at the console but not in scripts. In the console it just prints out the return value, but in the script you want to make sure you use the returned value.
You need to return the string back to the caller. Assigning to the string parameter doesn't change the caller's variable, so those changes are lost.
def convert2list(string):
    string = string.strip()
    string = string[2:len(string)-2].split("},{")
    # Return to caller.
    return string

# Grab return value.
converted = convert2list("{1,2},{3,4}")
You could do it in steps:
Split at commas to get "{...}" strings.
Remove leading and trailing curly braces.
It might not be the most Pythonic or efficient, but it's general and doable.
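A rough sketch of that idea (the exact split point is a judgment call - here I split at "}," so each group stays together, using the fully braced input that comes up in the comment below):

string = "{{2,4,5},{1,9,4,8,6,6,7},{1,2,3},{2,3}}"
parts = string.split("},")              # step 1: split into brace-wrapped pieces
parts = [p.strip("{}") for p in parts]  # step 2: remove leading and trailing braces
print parts   # ['2,4,5', '1,9,4,8,6,6,7', '1,2,3', '2,3']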
I was taking the input from the console in the form of arguments to the script....
So when I was passing the input as {{2,4,5},{1,9,4,8,6,6,7},{1,2,3},{2,3}} it was not coming through properly in arg[1], so the split was basically splitting on an empty string ...
If I run the below code from a script file (in Python 2.7):
string="2,1,6,4,5,1},{8,1,4,9,6,6,7,0},{6,1,2,3,9},{2,3,5,4,3 "
print string.split("},{")
Then the output I got is:
['2,1,6,4,5,1', '8,1,4,9,6,6,7,0', '6,1,2,3,9', '2,3,5,4,3 ']
And the below code also works fine:
string="2,1,6,4,5,1},{8,1,4,9,6,6,7,0},{6,1,2,3,9},{2,3,5,4,3 "
def convert2list(string):
    string = string.strip()
    string = string[:len(string)].split("},{")
    print string

convert2list(string)
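Running it prints the same list:
['2,1,6,4,5,1', '8,1,4,9,6,6,7,0', '6,1,2,3,9', '2,3,5,4,3']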
Use This:
This will split the string using },{ as the delimiter and print the items one per line.
string = "2,1,6,4,5,1},{8,1,4,9,6,6,7,0},{6,1,2,3,9},{2,3,5,4,3"
for each in string.split('},{'):
    print each
Output:
2,1,6,4,5,1
8,1,4,9,6,6,7,0
6,1,2,3,9
2,3,5,4,3
If you just want to print the resulting list, you can use this simple print statement.
string = "2,1,6,4,5,1},{8,1,4,9,6,6,7,0},{6,1,2,3,9},{2,3,5,4,3"
print string.split('},{')
Output:
['2,1,6,4,5,1', '8,1,4,9,6,6,7,0', '6,1,2,3,9', '2,3,5,4,3']
Quite simply, you have to use the split() method with "},{" as the delimiter, then print the elements (because string will then be a list), like the following:
string.split("},{")
for i in range(0,len(string)):
print(string[i])