Copy and convert text in postgresql column - python

Let's say I have some JSON stored in postgresql like so:
{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
This is an inverted index showing the position of each word, which spells out
the time is here the time is now
I want to put the text from the second line in a separate column. I can convert the inverted index with Python like so:
def convert_index(inverted_index):
    unraveled = {}
    for key, values in inverted_index.items():
        for value in values:
            unraveled[value] = key
    sorted_unraveled = dict(sorted(unraveled.items()))
    result = " ".join(sorted_unraveled.values())
    result = result.replace("\n", "")
    return result
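For the sample index above, the function reconstructs the sentence:
inverted = {"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
print(convert_index(inverted))
# the time is here the time is now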
But I would love to do this within PostgreSQL, so I am not reading text from one column, running a script somewhere else, and then adding text to a separate column. Does anybody know of a way to go about that? Can I use some kind of script?

You need to get the keys with jsonb_each(), unpack the arrays with jsonb_array_elements(), and then aggregate the keys in the proper order:
with my_table(json_col) as (
    values
        ('{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}'::jsonb)
)
select string_agg(key, ' ' order by ord::int)
from my_table
cross join jsonb_each(json_col)
cross join jsonb_array_elements(value) as e(ord)
Test it in Db<>fiddle.
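If you want to fill a separate text column in place, the same aggregation can drive an UPDATE. Below is a minimal sketch using psycopg2; the table name my_table, its primary key id, the source column json_col, and the target column text_col are all hypothetical, so adjust them to your schema.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute("""
        update my_table as t
        set text_col = sub.txt
        from (
            select id, string_agg(key, ' ' order by ord::int) as txt
            from my_table
            cross join jsonb_each(json_col)
            cross join jsonb_array_elements(value) as e(ord)
            group by id
        ) as sub
        where t.id = sub.id
    """)  # committed automatically when the "with conn" block exits cleanly
conn.close()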

Related

What's the efficient way to filter a python dictionary based on whether an element in a value list exists?

I have a dictionary (table) defined like this:
table = {"id": [1, 2, 3]}, {"file": ['good1.txt', 'bad2.txt', 'good3.txt']}
and I have a list of bad candidates that should be removed:
to_exclude = ['bad0.txt', 'bad1.txt', 'bad2.txt']
I hope to filter the table based on whether the file in a row of my table can be found inside to_exclude.
filtered = {"id": [1, 2]}, {"file": ['good1.txt', 'good3.txt']}
I guess I could use a for loop to check the entries one by one, but I was wondering what's the most Pythonic and efficient way to solve this problem.
Could someone provide some guidance on this? Thanks.
I'm assuming you miswrote your data structure: as written, that literal is a tuple of two dictionaries (and a set of two dictionaries would be impossible, since dictionaries are not hashable). I'm hoping your actual data is:
data = {"id": [1, 2, 3], "file": [.......]}
a dictionary with two keys.
So for me, the simplest would be:
# Create a set for faster membership testing
to_exclude_set = set(to_exclude)
# Create (id, file) pairs for the pairs we want to keep
pairs = [(id, file) for id, file in zip(data["id"], data["file"])
         if file not in to_exclude_set]
# Recreate the data structure
result = {'id': [id for id, _ in pairs],
          'file': [file for _, file in pairs]}
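For the sample data in the question, this gives:
data = {"id": [1, 2, 3], "file": ['good1.txt', 'bad2.txt', 'good3.txt']}
to_exclude = ['bad0.txt', 'bad1.txt', 'bad2.txt']

to_exclude_set = set(to_exclude)
pairs = [(id, file) for id, file in zip(data["id"], data["file"])
         if file not in to_exclude_set]
result = {'id': [id for id, _ in pairs],
          'file': [file for _, file in pairs]}
print(result)  # {'id': [1, 3], 'file': ['good1.txt', 'good3.txt']}
Note that the surviving ids keep their original values (1 and 3); renumbering them to [1, 2], as in your expected output, would be a separate step.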

How to convert a dataframe from a GridDB container to a list of lists?

I am using the GridDB Python Client, and I have a container that stores my database. I need the DataFrame object converted to a list of lists. The read_sql_query function offered by pandas returns a DataFrame, but I need the result as a list of lists instead. The first element in the list of lists is the header of the DataFrame (the column names), and the other elements are the rows of the DataFrame. Is there a way I could do that? Please help.
Here is the code for the container and the part where the program reads SQL queries:
#...
import griddb_python as griddb
import pandas as pd
from pprint import pprint

factory = griddb.StoreFactory.get_instance()

# Initialize container
try:
    gridstore = factory.get_store(host="127.0.0.1", port="8080",
                                  cluster_name="Cls36", username="root",
                                  password="")
    conInfo = griddb.ContainerInfo("Fresher_Students",
                                   [["id", griddb.Type.INTEGER],
                                    ["First Name", griddb.Type.STRING],
                                    ["Last Name", griddb.Type.STRING],
                                    ["Gender", griddb.Type.STRING],
                                    ["Course", griddb.Type.STRING]],
                                   griddb.ContainerType.COLLECTION, True)
    cont = gridstore.put_container(conInfo)
    cont.create_index("id", griddb.IndexType.DEFAULT)

    data = pd.read_csv("fresher_students.csv")
    # Add data
    for i in range(len(data)):
        ret = cont.put(data.iloc[i, :])
    print("Data added successfully")
except griddb.GSException as e:
    print(e)

sql_statement = ('SELECT * FROM Fresher_Students')
sql_query = pd.read_sql_query(sql_statement, cont)

def convert_to_lol(query):
    # Code goes here
    # ...

LOL = convert_to_lol(sql_query.head())  # Not Laughing Out Loud, it's List of Lists
pprint(LOL)
#...
I want to get something that looks like this:
[["id", "First Name", "Last Name", "Gender", "Course"],
[0, "Catherine", "Ghua", "F", "EEE"],
[1, "Jake", "Jonathan", "M", "BMS"],
[2, "Paul", "Smith", "M", "BFA"],
[3, "Laura", "Williams", "F", "MBBS"],
[4, "Felix", "Johnson", "M", "BSW"],
[5, "Vivian", "Davis", "F", "BFD"]]
[UPDATED]
The easiest way I know about (for any DataFrame):
df = pd.DataFrame({'id':[2, 3 ,4], 'age':[24, 42, 13]})
[df.columns.tolist()] + df.reset_index().values.tolist()
output:
[['id', 'age'], [0, 2, 24], [1, 3, 42], [2, 4, 13]]
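Applied to the question's code, convert_to_lol could be filled in the same way; this is just a sketch assuming sql_query is the DataFrame returned by read_sql_query above. Keep reset_index() only if you also want the row index included in each sublist.
def convert_to_lol(df):
    return [df.columns.tolist()] + df.values.tolist()

LOL = convert_to_lol(sql_query.head())
pprint(LOL)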

Python converting nested dictionary with list containing float elements as individual elements

I'm collecting values from different arrays and a nested dictionary containing list values, as below. The lists contain millions of rows; I tried pandas DataFrame concatenation, but I ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [1.0001]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 1.0001
user_4, 2, 1.3434
user_5, 4, 0.44224
I need to eventually convert this to parquet file, I'm using spark dataframe to convert this to parquet, but the schema is appearing as array(double)). But I need it as just double. Any input is appreciated.
The for loop below works, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append((array1_str[i], array2_int[i], index[0]))
You can modify your original list comprehension, by indexing to item zero:
final_out = [
    (array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
    for i in range(len(array2_int))
]

large nested lists versus dictionaries

Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here, but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, and each element of the list is a sublist of each rows data? E.g:
[[23/05/2017 16:42:17,'','', 1.25545, 1.74733, 12],[23/05/2017 16:42:20,'','', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:
{'23/05/2017 16:42:17': [ , , 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600,000 lines aren't that many; your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the cells lexicographically won't sort them by datetime. You'd need to use datetime.strptime() first:
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
list with binary search
If you're looking for timestamps ranges, a dict won't help you much. A binary search (e.g. with bisect) on a list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
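If you need a range of timestamps rather than a single point, the same sorted list works with bisect_left and bisect_right; a sketch, reusing the data and timestamps built above:
start = datetime(2017, 5, 15)
end = datetime(2017, 5, 18)
lo = bisect.bisect_left(timestamps, start)
hi = bisect.bisect_right(timestamps, end)
for row in data[lo:hi]:
    print(row)
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
# ['2017-05-17 22:05:33', 1.68628, 1.90881, 13]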
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (sqlite, Postgresql, ...) and use the corresponding queries.
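A rough sketch of that route with the sqlite3 module from the standard library; the database file, table, and column names are made up, and the CSV is assumed to be laid out like the sample rows above:
import csv
import sqlite3

conn = sqlite3.connect("readings.db")
conn.execute("""create table if not exists readings
                (ts text primary key, val_a real, val_b real, qty integer)""")

with open("data.csv", newline="") as f:
    for row in csv.reader(f):
        # columns 1 and 2 are the blank fields in the sample layout
        conn.execute("insert or ignore into readings values (?, ?, ?, ?)",
                     (row[0].strip(), float(row[3]), float(row[4]), int(row[5])))
conn.commit()

print(conn.execute("select * from readings where ts = ?",
                   ("23/05/2017 16:42:20",)).fetchone())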
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
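A minimal sketch of the pandas route, assuming a CSV laid out like the sample rows above; the file name and column names are made up:
import pandas as pd

cols = ["timestamp", "blank1", "blank2", "val_a", "val_b", "qty"]
df = pd.read_csv("data.csv", header=None, names=cols,
                 parse_dates=["timestamp"], dayfirst=True)

# exact lookup on the timestamp column only
match = df[df["timestamp"] == pd.Timestamp("2017-05-23 16:42:20")]
print(match)

# basic calculations on the float columns
df["ratio"] = df["val_a"] / df["val_b"]
print(df["val_a"].mean(), df["val_b"].max())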

how to convert a text to dict in order to iterate it in python?

What I am trying to do is to iterate over it. I have this in one column of a table in my database:
[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
I assign this to a variable order, so I have:
order = [{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
Then I want to iterate over it. I am trying the following:
for o in order.items():
    product = o['item']
    ...
It doesn't work. How can I convert it?
for order in orders:
    ord = order.shopping_cart_details  # [{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
    temp = {'order_id': order.id, 'movies': ord['item'], 'created': order.created}
    full_results.append(temp)
I get "string indices must be integers".
Assuming the order is a string:
order = '[{"item": 5, "quantity": 2},{"item": 6, "quantity": 1}]'
def doSomething():
    import json
    ord = json.loads(order)
    values = [v['item'] for v in ord]  # if you want a single item, use ord.pop()['item']
Could you check the return data type after you retrieve the data from the database for the following value:
[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
Use type(order) to determine the type of the return value. It might be that the returned data is in string format. In that case, when you store the list into the database, you may consider using json.dumps() to convert the data first; later, when you retrieve the data, you can use json.loads() to get back the original data type. You may also consider changing the database column type to blob.
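A quick round trip showing that approach with the question's data:
import json

order = [{"item": 5, "quantity": 2}, {"item": 6, "quantity": 1}]

stored = json.dumps(order)     # this string is what gets written to the column
restored = json.loads(stored)  # and this is the list you get back after reading

for o in restored:
    print(o["item"])           # 5, then 6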
Or, to make it simpler, you can use the eval function:
order = '[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]'
order = eval(order)
for o in order:
    product = o['item']
    print product
