I have a dictionary (table) defined like this:
table = {"id": [1, 2, 3]}, {"file": ['good1.txt', 'bad2.txt', 'good3.txt']}
and I have a list of bad candidates that should be removed:
to_exclude = ['bad0.txt', 'bad1.txt', 'bad2.txt']
I want to filter the table based on whether the file in a row of my table can be found inside to_exclude.
filtered = {"id": [1, 3]}, {"file": ['good1.txt', 'good3.txt']}
I guess I could use a for loop to check the entries one by one, but I was wondering what the most Pythonic and efficient way to solve this is.
Could someone provide some guidance on this? Thanks.
I'm assuming you miswrote your data structure. As written, that's a tuple of two dictionaries, which can't be what you meant (and a set of two dictionaries would be outright impossible, since dictionaries are not hashable). I'm hoping your actual data is:
data = {"id": [1, 2, 3], "file": [.......]}
a dictionary with two keys.
So for me, the simplest would be:
# Create a set for faster testing
to_exclude_set = set(to_exclude)
# Create (id, file) pairs for the rows we want to keep
pairs = [(id, file) for id, file in zip(data["id"], data["file"])
         if file not in to_exclude_set]

# Recreate the data structure
result = {'id': [id for id, _ in pairs],
          'file': [file for _, file in pairs]}
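With the example data, result comes out as {'id': [1, 3], 'file': ['good1.txt', 'good3.txt']}: the row whose file appears in to_exclude is dropped, and ids 1 and 3 survive. Converting to_exclude to a set first matters because membership tests on a set are O(1) on average, instead of a scan of the whole list for every row.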
I am using the GridDB Python Client, and I have a container that stores my database. I need the DataFrame returned by pandas' read_sql_query converted to a list of lists: the first element should hold the header (the column names) and the remaining elements should hold the rows of the DataFrame. Is there a way I could do that? Please help.
Here is the code for the container and the part where the program reads SQL queries:
#...
import griddb_python as griddb
import pandas as pd
from pprint import pprint

factory = griddb.StoreFactory.get_instance()

# Initialize container
try:
    gridstore = factory.get_store(host="127.0.0.1", port="8080",
                                  cluster_name="Cls36", username="root",
                                  password="")
    conInfo = griddb.ContainerInfo("Fresher_Students",
                                   [["id", griddb.Type.INTEGER],
                                    ["First Name", griddb.Type.STRING],
                                    ["Last Name", griddb.Type.STRING],
                                    ["Gender", griddb.Type.STRING],
                                    ["Course", griddb.Type.STRING]],
                                   griddb.ContainerType.COLLECTION, True)
    cont = gridstore.put_container(conInfo)
    cont.create_index("id", griddb.IndexType.DEFAULT)

    data = pd.read_csv("fresher_students.csv")
    # Add data
    for i in range(len(data)):
        ret = cont.put(data.iloc[i, :])
    print("Data added successfully")
except griddb.GSException as e:
    print(e)

sql_statement = 'SELECT * FROM Fresher_Students'
sql_query = pd.read_sql_query(sql_statement, cont)

def convert_to_lol(query):
    # Code goes here
    pass

LOL = convert_to_lol(sql_query.head())  # Not Laughing Out Loud, it's List of Lists
pprint(LOL)
#...
I want to get something that looks like this:
[["id", "First Name", "Last Name", "Gender", "Course"],
[0, "Catherine", "Ghua", "F", "EEE"],
[1, "Jake", "Jonathan", "M", "BMS"],
[2, "Paul", "Smith", "M", "BFA"],
[3, "Laura", "Williams", "F", "MBBS"],
[4, "Felix", "Johnson", "M", "BSW"],
[5, "Vivian", "Davis", "F", "BFD"]]
[UPDATED]
The easiest way I know about (for any DataFrame):
df = pd.DataFrame({'id':[2, 3 ,4], 'age':[24, 42, 13]})
[df.columns.tolist()] + df.reset_index().values.tolist()
output:
[['id', 'age'], [0, 2, 24], [1, 3, 42], [2, 4, 13]]
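Note that reset_index() prepends the DataFrame's index as an extra value in every data row, while df.columns doesn't include it, so the header row above has one entry fewer than the rows. If you don't want that index column, drop reset_index():
[df.columns.tolist()] + df.values.tolist()
# [['id', 'age'], [2, 24], [3, 42], [4, 13]]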
I'm collecting values from different arrays and a nested dictionary containing list values, as below. The lists contain millions of rows; I tried pandas DataFrame concatenation but ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [2.0033]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 2.0033
user_4, 2, 1.3434
user_5, 4, 0.44224
I eventually need to convert this to a Parquet file. I'm using a Spark DataFrame for the conversion, but the schema comes out as array<double> when I need just double. Any input is appreciated.
The for loop below works, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append([array1_str[i], array2_int[i], index[0]])
You can modify your original list comprehension by indexing item zero:
final_out = [
(array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
for i in range(len(array2_int))
]
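With the sample inputs above, each tuple now holds a plain float, e.g. ('user_1', 3, 2.3434), so when you build the Spark DataFrame from final_out the third column should be inferred as double rather than array<double>.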
Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here, but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, with each element of the list being a sublist of one row's data? E.g.:
[['23/05/2017 16:42:17', '', '', 1.25545, 1.74733, 12], ['23/05/2017 16:42:20', '', '', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:
{'23/05/2017 16:42:17': ['', '', 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600,000 lines isn't that many; your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the cells lexicographically won't sort them by datetime. You'd need to use datetime.strptime() first:
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
list with binary search
If you're looking for timestamp ranges, a dict won't help you much. A binary search (e.g. with bisect) on a sorted list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
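If you need every row between two timestamps, the same sorted timestamps list supports a range lookup; a small sketch, with bounds made up for illustration:
# All rows with a timestamp in [start, end]
start = datetime(2017, 5, 10)
end = datetime(2017, 5, 20)
lo = bisect.bisect_left(timestamps, start)
hi = bisect.bisect_right(timestamps, end)
print(data[lo:hi])
# rows from '2017-05-11 07:22:57' through '2017-05-18 20:24:23'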
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (SQLite, PostgreSQL, ...) and use the corresponding queries.
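For instance, a minimal sqlite3 sketch, assuming the six-field CSV layout from the question (file, table and column names are made up):
import csv
import sqlite3

conn = sqlite3.connect("readings.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT PRIMARY KEY, a REAL, b REAL, n INTEGER)")
with open("data.csv", newline="") as f:
    # Keep the timestamp and the three numeric fields; skip the two blank columns
    rows = ((r[0], float(r[3]), float(r[4]), int(r[5])) for r in csv.reader(f))
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)
conn.commit()
print(conn.execute("SELECT a, b FROM readings WHERE ts = ?",
                   ("23/05/2017 16:42:17",)).fetchone())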
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
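A short sketch of the same workflow in pandas, again with made-up file and column names; parse_dates with dayfirst=True handles the DD/MM/YYYY format:
import pandas as pd

df = pd.read_csv("data.csv", header=None,
                 names=["ts", "blank1", "blank2", "a", "b", "n"],
                 parse_dates=["ts"], dayfirst=True)
df = df.set_index("ts").sort_index()
print(df.loc["2017-05-23 16:42:17"])     # exact timestamp lookup
print(df.loc["2017-05-23", "a"].mean())  # basic calculation over a date range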
What I am trying to do is iterate. I have this value stored in one column of a table in my database:
[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
I assign this to a variable order, so I have:
order = [{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
Then I want to iterate over it. I am trying the following:
for o in order.items():
    product = o['item']
    ...
It doesn't work. How can I convert it?
for order in orders:
    ord = order.shopping_cart_details  # [{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
    temp = {'order_id': order.id, 'movies': ord['item'], 'created': order.created}
    full_results.append(temp)
I get "string indices must be integers".
Assuming order is a string:
order = '[{"item": 5, "quantity": 2},{"item": 6, "quantity": 1}]'

def doSomething():
    import json
    ord = json.loads(order)
    values = [v['item'] for v in ord]  # for a single item: values = ord.pop()['item']
    return values
Could you check the return data type after you retrieve the data from the database for the following value:
[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]
Use type(order) to determine the type of the return value. It might be that the data comes back as a string. In that case, when you store the list in the database, consider using json.dumps() to serialize it first; when you retrieve it, use json.loads() to get back the original data type. You might also consider changing the database column type to BLOB.
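For example, a minimal round-trip sketch:
import json

order = [{"item": 5, "quantity": 2}, {"item": 6, "quantity": 1}]

stored = json.dumps(order)     # the text you write to the database column
restored = json.loads(stored)  # a real list of dicts again
for o in restored:
    print(o["item"], o["quantity"])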
Or, to make it simpler, you can use the eval() function.
order = '[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]'
order = eval(order)
for o in order:
product = o['item']
print product
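Note that eval() executes arbitrary code, so it's risky on data you don't fully control. ast.literal_eval() from the standard library only accepts Python literals (including u"..." strings) and is a safer drop-in here:
import ast

order = '[{u"item": 5, u"quantity": 2},{u"item": 6, u"quantity": 1}]'
for o in ast.literal_eval(order):
    print(o['item'])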