Please could I solicit some general advice regarding Python lists. I know I shouldn't ask 'open' questions on here but I am worried about setting off on completely the wrong path.
My problem is that I have .csv files that are approximately 600,000 lines long each. Each row of the .csv has 6 fields, of which the first field is a date-time stamp in the format DD/MM/YYYY HH:MM:SS. The next two fields are blank and the last three fields contain float and integer values, so for example:
23/05/2017 16:42:17, , , 1.25545, 1.74733, 12
23/05/2017 16:42:20, , , 1.93741, 1.52387, 14
23/05/2017 16:42:23, , , 1.54875, 1.46258, 11
etc
No two values in column 1 (date-time stamp) will ever be the same.
I need to write a program that will do a few basic operations with the data, such as:
read all of the data into a dictionary, list, set (?) etc as appropriate.
search through the date time stamp column for a particular value.
read through the list and do basic calculations on the floats in columns 4 and 5.
write a new list based on the searches/calculations.
My question is - how should I 'handle' the data and am I likely to run into problems due to the length of the dataset?
For example, should I import all of the data into a list, and each element of the list is a sublist of each rows data? E.g:
[['23/05/2017 16:42:17', '', '', 1.25545, 1.74733, 12], ['23/05/2017 16:42:20', '', '', 1.93741, 1.52387, 14], ...]
Or would it be better to make each date-time stamp the 'key' in a dictionary and make the dictionary 'value' a list with all the other values, e.g:
{'23/05/2017 16:42:17': ['', '', 1.25545, 1.74733, 12], ...}
etc
If I use the list approach, is there a way to get Python to 'search' in only the first column for a particular time stamp rather than making it search through 600,000 rows times 6 columns when we know that only the first column contains timestamps?
I apologize if my query is a little vague, but would appreciate any guidance that anyone can offer.
600,000 lines aren't that many; your script should run fine with either a list or a dict.
As a test, let's use:
data = [["2017-05-02 17:28:24", 0.85260, 1.16218, 7],
["2017-05-04 05:40:07", 0.72118, 0.47710, 15],
["2017-05-07 19:27:53", 1.79476, 0.47496, 14],
["2017-05-09 01:57:10", 0.44123, 0.13711, 16],
["2017-05-11 07:22:57", 0.17481, 0.69468, 0],
["2017-05-12 10:11:01", 0.27553, 0.47834, 4],
["2017-05-15 05:20:36", 0.01719, 0.51249, 7],
["2017-05-17 14:01:13", 0.35977, 0.50052, 7],
["2017-05-17 22:05:33", 1.68628, 1.90881, 13],
["2017-05-18 14:44:14", 0.32217, 0.96715, 14],
["2017-05-18 20:24:23", 0.90819, 0.36773, 5],
["2017-05-21 12:15:20", 0.49456, 1.12508, 5],
["2017-05-22 07:46:18", 0.59015, 1.04352, 6],
["2017-05-26 01:49:38", 0.44455, 0.26669, 13],
["2017-05-26 18:55:24", 1.33678, 1.24181, 7]]
dict
If you're looking for exact timestamps, a lookup will be much faster with a dict than with a list. You have to know exactly what you're looking for though: "23/05/2017 16:42:17" has a completely different hash than "23/05/2017 16:42:18".
data_as_dict = {l[0]: l[1:] for l in data}
print(data_as_dict)
# {'2017-05-21 12:15:20': [0.49456, 1.12508, 5], '2017-05-18 14:44:14': [0.32217, 0.96715, 14], '2017-05-04 05:40:07': [0.72118, 0.4771, 15], '2017-05-26 01:49:38': [0.44455, 0.26669, 13], '2017-05-17 14:01:13': [0.35977, 0.50052, 7], '2017-05-15 05:20:36': [0.01719, 0.51249, 7], '2017-05-26 18:55:24': [1.33678, 1.24181, 7], '2017-05-07 19:27:53': [1.79476, 0.47496, 14], '2017-05-17 22:05:33': [1.68628, 1.90881, 13], '2017-05-02 17:28:24': [0.8526, 1.16218, 7], '2017-05-22 07:46:18': [0.59015, 1.04352, 6], '2017-05-11 07:22:57': [0.17481, 0.69468, 0], '2017-05-18 20:24:23': [0.90819, 0.36773, 5], '2017-05-12 10:11:01': [0.27553, 0.47834, 4], '2017-05-09 01:57:10': [0.44123, 0.13711, 16]}
print(data_as_dict.get('2017-05-17 14:01:13'))
# [0.35977, 0.50052, 7]
print(data_as_dict.get('2017-05-17 14:01:10'))
# None
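Note that your DD/MM/YYYY HH:MM:SS format isn't very convenient: sorting the strings lexicographically won't sort them by datetime. You'd need to parse them with datetime.strptime() first. The test data here is written as YYYY-MM-DD, hence the format string below; for your files it would be '%d/%m/%Y %H:%M:%S'.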
from datetime import datetime
data_as_dict = {datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S'): l[1:] for l in data}
print(data_as_dict.get(datetime(2017,5,17,14,1,13)))
# [0.35977, 0.50052, 7]
print(data_as_dict.get(datetime(2017,5,17,14,1,10)))
# None
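For the CSV layout described in the question, loading the rows into such a dict could look roughly like the sketch below. It assumes the DD/MM/YYYY HH:MM:SS format from the question; load_rows is a hypothetical helper name, and the in-memory sample stands in for a real file.

```python
import csv
import io
from datetime import datetime

# Stand-in for an open CSV file; the three rows come from the question.
sample = io.StringIO(
    "23/05/2017 16:42:17, , , 1.25545, 1.74733, 12\n"
    "23/05/2017 16:42:20, , , 1.93741, 1.52387, 14\n"
    "23/05/2017 16:42:23, , , 1.54875, 1.46258, 11\n"
)

def load_rows(lines):
    """Map each timestamp to (float, float, int), skipping the two blank columns."""
    rows = {}
    for fields in csv.reader(lines, skipinitialspace=True):
        stamp = datetime.strptime(fields[0].strip(), "%d/%m/%Y %H:%M:%S")
        rows[stamp] = (float(fields[3]), float(fields[4]), int(fields[5]))
    return rows

rows = load_rows(sample)
print(rows.get(datetime(2017, 5, 23, 16, 42, 20)))
# (1.93741, 1.52387, 14)
```

With a real file, the same helper could be fed a file object opened with open(path, newline=""), where path is whatever your file is called.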
list with binary search
If you're looking for timestamp ranges, a dict won't help you much. A binary search (e.g. with bisect) on a sorted list of timestamps should be very fast.
import bisect
timestamps = [datetime.strptime(l[0], '%Y-%m-%d %H:%M:%S') for l in data]
i = bisect.bisect(timestamps, datetime(2017,5,17,14,1,10))
print(data[i-1])
# ['2017-05-15 05:20:36', 0.01719, 0.51249, 7]
print(data[i])
# ['2017-05-17 14:01:13', 0.35977, 0.50052, 7]
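To pull out a whole range rather than a single neighbour, two binary searches are enough. This is only a sketch: indices_in_range is a hypothetical helper, and the short timestamps list is a sorted excerpt of the test data above.

```python
import bisect
from datetime import datetime

# A few sorted timestamps taken from the test data above.
timestamps = [
    datetime(2017, 5, 15, 5, 20, 36),
    datetime(2017, 5, 17, 14, 1, 13),
    datetime(2017, 5, 17, 22, 5, 33),
    datetime(2017, 5, 18, 14, 44, 14),
    datetime(2017, 5, 18, 20, 24, 23),
    datetime(2017, 5, 21, 12, 15, 20),
]

def indices_in_range(sorted_stamps, start, end):
    """Return (lo, hi) so that sorted_stamps[lo:hi] lies within [start, end]."""
    lo = bisect.bisect_left(sorted_stamps, start)   # first index with stamp >= start
    hi = bisect.bisect_right(sorted_stamps, end)    # first index with stamp > end
    return lo, hi

lo, hi = indices_in_range(timestamps, datetime(2017, 5, 17), datetime(2017, 5, 19))
print(timestamps[lo:hi])  # the four readings from 17 and 18 May
```

The same lo and hi can be used to slice the full data list, as long as it is kept in the same order as timestamps.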
Database
Before reinventing the wheel, you might want to dump all your CSVs into a small database (SQLite, PostgreSQL, ...) and use the corresponding queries.
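As a rough illustration with the sqlite3 module from the standard library (the table name readings and the column names stamp, a, b, n are made up for this sketch, and an in-memory database is used instead of a file):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, just for the sketch
conn.execute(
    "CREATE TABLE readings (stamp TEXT PRIMARY KEY, a REAL, b REAL, n INTEGER)"
)
rows = [
    ("2017-05-17 14:01:13", 0.35977, 0.50052, 7),
    ("2017-05-17 22:05:33", 1.68628, 1.90881, 13),
    ("2017-05-18 14:44:14", 0.32217, 0.96715, 14),
]
conn.executemany("INSERT INTO readings VALUES (?, ?, ?, ?)", rows)

# ISO-formatted strings sort chronologically, so BETWEEN works on TEXT columns.
found = conn.execute(
    "SELECT stamp, a + b FROM readings WHERE stamp BETWEEN ? AND ? ORDER BY stamp",
    ("2017-05-17 00:00:00", "2017-05-17 23:59:59"),
).fetchall()
print(found)
conn.close()
```

The primary key on stamp gives an index for free, so both exact lookups and range queries avoid scanning every row.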
Pandas
If you don't want the added complexity of a database but are ready to invest some time learning a new syntax, you should use pandas.DataFrame. It does exactly what you want, and then some.
Related
Let's say I have some JSON stored in postgresql like so:
{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}
This is an inverted index showing the position of each word, which spells out
the time is here the time is now
I want to put the text from the second example in a separate column. I can convert the inverted text with python like so:
def convert_index(inverted_index):
    unraveled = {}
    for key, values in inverted_index.items():
        for value in values:
            unraveled[value] = key
    sorted_unraveled = dict(sorted(unraveled.items()))
    result = " ".join(sorted_unraveled.values())
    result = result.replace("\n", "")
    return result
But I would love to do this within postgresql so I am not reading text from one column, running a script somewhere else, then adding text in a separate column. Anybody know of a way to go about that? Can I use some kind of script?
You need to get keys with jsonb_each() and unpack arrays with jsonb_array_elements() then aggregate the keys with proper order:
with my_table(json_col) as (
values
('{"the": [0, 4], "time": [1, 5], "is": [2, 6], "here": [3], "now": [7]}'::jsonb)
)
select string_agg(key, ' ' order by ord::int)
from my_table
cross join jsonb_each(json_col)
cross join jsonb_array_elements(value) as e(ord)
Test it in Db<>fiddle.
I'm collecting values from different arrays and a nested dictionary containing list values, like below. The lists contain millions of rows. I tried pandas DataFrame concatenation but ran out of memory, so I resorted to a for loop.
array1_str = ['user_1', 'user_2', 'user_3','user_4' , 'user_5']
array2_int = [3,3,1,2,4]
nested_dict_w_list = {'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}}
final_out = [[array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]] for i in range(len(array2_int))]
I'm getting the output as
user_1, 3, [2.3434]
user_2, 3, [2.3434]
user_3, 1, [1.0001]
user_4, 2, [1.3434]
user_5, 4, [0.44224]
But I want the output as
user_1, 3, 2.3434
user_2, 3, 2.3434
user_3, 1, 1.0001
user_4, 2, 1.3434
user_5, 4, 0.44224
I eventually need to convert this to a parquet file. I'm using a Spark DataFrame to do the conversion, but the schema appears as array(double) and I need it to be just double. Any input is appreciated.
The for loop below works, but is there a more efficient and elegant solution?
final_output = []
for i in range(len(array2_int)):
    index = nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]]
    final_output.append((array1_str[i], array2_int[i], index[0]))
You can modify your original list comprehension, by indexing to item zero:
final_out = [
    (array1_str[i], array2_int[i], nested_dict_w_list['outer_dict']['inner_dict'][array2_int[i]][0])
    for i in range(len(array2_int))
]
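A variation on the same idea, offered as a sketch rather than a benchmark-backed recommendation: iterate with zip instead of an index and look the inner list up once. The data is copied from the question (with the missing closing bracket restored), and inner is a name introduced here for illustration.

```python
array1_str = ['user_1', 'user_2', 'user_3', 'user_4', 'user_5']
array2_int = [3, 3, 1, 2, 4]
nested_dict_w_list = {
    'outer_dict': {'inner_dict': [[1.0001], [2.0033], [1.3434], [2.3434], [0.44224]]}
}

# Look the inner list up once instead of on every iteration.
inner = nested_dict_w_list['outer_dict']['inner_dict']

# zip pairs the two arrays; inner[idx][0] unwraps the one-element list,
# so the third field is a plain float rather than a list.
final_out = [(name, idx, inner[idx][0]) for name, idx in zip(array1_str, array2_int)]

for row in final_out:
    print(row)
```

Because the third field is now a scalar, a schema inferred from these tuples should see a double rather than an array of doubles.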
For the following array;
[[[11, 22, 33]]],[[[32, 12, 3]]], I wanted to extract the 1st row and it should output 11,22,33. However, using the following code, I got [[11, 22, 33]]. How can I remove the double bracket?
df = pd.DataFrame([
[[[11, 22, 33]]],
[[[32, 12, 3]]]
], index=[1, 2], columns=['ColA'])
df[df.index == 1].ColA.item()
Expected output should be in the form of 11,22,33; without the bracket
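Use .astype(str) and str.replace with the regex alternation operator (|), then use iat to get the first value. Pass regex=True explicitly, since recent pandas versions treat the pattern as a literal string by default:
df['ColA'].astype(str).str.replace(r'\[|\]', '', regex=True).iat[0]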
Output
'11, 22, 33'
Note that the type of your value changes from list to string.
Or using native python functions str and replace:
str(df['ColA'].iat[0]).replace('[', '').replace(']', '')
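If you would rather not go through the string representation at all, one alternative is to flatten the nested list and join the items yourself. This is a sketch: flatten is a hypothetical helper, and cell stands for the value you would get from df['ColA'].iat[0].

```python
def flatten(value):
    """Yield the scalar items of an arbitrarily nested list, left to right."""
    for item in value:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

cell = [[11, 22, 33]]  # the value stored in ColA for index 1
text = ",".join(str(x) for x in flatten(cell))
print(text)  # 11,22,33
```

This keeps control over the separator, so you get 11,22,33 without the spaces that str() puts after each comma.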
Let's assume a very simple data structure. In the below example, IDs are unique. "date" and "id" are strings, and "amount" is an integer.
data = [[date1, id1, amount1], [date2, id2, amount2], etc.]
If date1 == date2 and id1 == id2, I'd like to merge the two entries into one and basically add up amount1 and amount2 so that data becomes:
data = [[date1, id1, amount1 + amount2], etc.]
There are many duplicates.
As data is very big (over 100,000 entries), I'd like to do this as efficiently as possible. What I did was create a new "common" field that is basically date + id combined into one string, with metadata allowing me to split it later (date + id + "_" + str(len(date))).
In terms of complexity, I have four loops:
Parse and load data from external source (it doesn't come in lists) | O(n)
Loop over data and create and store "common" string (date + id + metadata) - I call this "prepared data" where "common" is my encoded field | O(n)
Use the Counter() object to dedupe "prepared data" | O(n)
Decode "common" | O(n)
I don't care about memory here, I only care about speed. I could make a nested loop and avoid steps 2, 3 and 4 but that would be a time-complexity disaster (O(n²)).
What is the fastest way to do this?
Consider a defaultdict for aggregating data by a unique key:
Given
Some random data
import random
import collections as ct
random.seed(123)
# Random data
dates = ["2018-04-24", "2018-05-04", "2018-07-06"]
ids = "A B C D".split()
amounts = lambda: random.randrange(1, 100)
ch = random.choice
data = [[ch(dates), ch(ids), amounts()] for _ in range(10)]
data
Output
[['2018-04-24', 'C', 12],
['2018-05-04', 'C', 14],
['2018-04-24', 'D', 69],
['2018-07-06', 'C', 44],
['2018-04-24', 'B', 18],
['2018-05-04', 'C', 90],
['2018-04-24', 'B', 1],
['2018-05-04', 'A', 77],
['2018-05-04', 'A', 1],
['2018-05-04', 'D', 14]]
Code
dd = ct.defaultdict(int)
for date, id_, amt in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key] += amt
dd
Output
defaultdict(int,
{'2018-04-24B_10': 19,
'2018-04-24C_10': 12,
'2018-04-24D_10': 69,
'2018-05-04A_10': 78,
'2018-05-04C_10': 104,
'2018-05-04D_10': 14,
'2018-07-06C_10': 44})
Details
A defaultdict is a dictionary that calls a default factory (a specified function) for any missing keys. In this case, every date + id combination becomes a unique key in the dict. If the key already exists, the amount is added to its value; otherwise int() supplies 0 to initialize the new entry before the amount is added.
For illustration, you can visualize the aggregated values using a list as the default factory.
dd = ct.defaultdict(list)
for date, id_, val in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key].append(val)
dd
Output
defaultdict(list,
{'2018-04-24B_10': [18, 1],
'2018-04-24C_10': [12],
'2018-04-24D_10': [69],
'2018-05-04A_10': [77, 1],
'2018-05-04C_10': [14, 90],
'2018-05-04D_10': [14],
'2018-07-06C_10': [44]})
We see three occurrences of duplicate keys where the values were appropriately summed. Regarding efficiency, notice:
keys are made with format(), which should be a bit better than string concatenation and calling str()
every key and value is computed in the same iteration
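If the encoded string key is only there to make date and id hashable together, a tuple key may be simpler and removes the decode step. The sketch below reuses the ten sample rows shown above; totals and merged are names introduced here for illustration.

```python
from collections import defaultdict

# The ten sample rows shown in the output above.
data = [
    ['2018-04-24', 'C', 12], ['2018-05-04', 'C', 14],
    ['2018-04-24', 'D', 69], ['2018-07-06', 'C', 44],
    ['2018-04-24', 'B', 18], ['2018-05-04', 'C', 90],
    ['2018-04-24', 'B', 1],  ['2018-05-04', 'A', 77],
    ['2018-05-04', 'A', 1],  ['2018-05-04', 'D', 14],
]

totals = defaultdict(int)
for date, id_, amt in data:
    totals[(date, id_)] += amt  # tuples are hashable, so no encoding is needed

# Back to the original [date, id, amount] layout, sorted for readability.
merged = [[date, id_, amt] for (date, id_), amt in sorted(totals.items())]
print(merged)
```

The totals match the defaultdict(int) output above, and the whole thing is still a single O(n) pass plus the optional sort.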
Using pandas makes this really easy:
import pandas as pd
df = pd.DataFrame(data, columns=['date', 'id', 'amount'])
df.groupby(['date','id']).sum().reset_index()
For more control you can use agg instead of sum():
df.groupby(['date','id']).agg({'amount':'sum'})
Depending on what you are doing with the data, it may be easier/faster to go this way just because so much of pandas is built on compiled C extensions and optimized routines that make it super easy to transform and manipulate.
You could import the data into a structure that prevents duplicates and then convert it to a list.
data = {
    date1: {
        id1: amount1,
        id2: amount2,
    },
    date2: {
        id3: amount3,
        id4: amount4,
        ....
    }
}
The program's skeleton:
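import collections
# Nested defaultdict so repeated (date, id) pairs are summed rather than overwritten
ddata = collections.defaultdict(lambda: collections.defaultdict(int))
for date, id_, amount in DATASOURCE:
    ddata[date][id_] += amount
data = [[d, i, a] for d, subd in ddata.items() for i, a in subd.items()]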
I have imported a csv as a multi-indexed Dataframe. Here's a mockup of the data:
df = pd.read_csv("coursedata2.csv", index_col=[0,2])
print (df)
                                         COURSE
ID    Course List
12345 Interior Environments           DESN10000
      Rendering & Present Skills      DESN20065
      Lighting                        DESN20025
22345 Drawing Techniques              DESN10016
      Colour Theory                   DESN14049
      Finishes & Sustainable Issues   DESN12758
      Lighting                        DESN20025
32345 Window Treatments&Soft Furnish  DESN27370
42345 Introduction to CADD            INFO16859
      Principles of Drafting          DESN10065
      Drawing Techniques              DESN10016
      The Fundamentals of Design      DESN15436
      Colour Theory                   DESN14049
      Interior Environments           DESN10000
      Drafting                        DESN10123
      Textiles and Applications       DESN10199
      Finishes & Sustainable Issues   DESN12758
[17 rows x 1 columns]
I can easily slice it by label using .xs -- eg:
selected = df.xs(12345, level='ID')
print(selected)
                               COURSE
Course List
Interior Environments       DESN10000
Rendering & Present Skills  DESN20065
Lighting                    DESN20025
[3 rows x 1 columns]
But what I want to do is step through the dataframe and perform an operation on each block of courses, by ID. The ID values in the real data are fairly random integers, sorted in ascending order.
df.index shows:
df.index
MultiIndex(levels=[[12345, 22345, 32345, 42345], [u'Colour Theory', u'Colour Theory ', u'Drafting', u'Drawing Techniques', u'Finishes & Sustainable Issues', u'Interior Environments', u'Introduction to CADD', u'Lighting', u'Principles of Drafting', u'Rendering & Present Skills', u'Textiles and Applications', u'The Fundamentals of Design', u'Window Treatments&Soft Furnish']],
labels=[[0, 0, 0, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3], [5, 9, 7, 3, 1, 4, 7, 12, 6, 8, 3, 11, 0, 5, 2, 10, 4]],
names=[u'ID', u'Course List'])
It seems to me that I should be able to use the first index labels to increment through the Dataframe. Ie. Get all the courses for label 0 then 1 then 2 then 3,... but it looks like .xs will not slice by label.
Am I missing something?
So there may be more efficient ways to do this, depending on what you're trying to do to the data. However, there are two approaches which immediately come to mind:
for id_label in df.index.levels[0]:
    some_func(df.xs(id_label, level='ID'))
and
for id_label in df.index.levels[0]:
    df.xs(id_label, level='ID').apply(some_func, axis=1)
depending on whether you want to operate on the group as a whole or on each row within it.
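Another option that may be worth trying is groupby on the index level, which hands you each block directly. The sketch below uses a small stand-in for the course data with the same two index levels, and some_func is a placeholder for whatever you want to do per ID.

```python
import pandas as pd

# A small stand-in for the course data, with the same two index levels.
index = pd.MultiIndex.from_tuples(
    [
        (12345, 'Interior Environments'),
        (12345, 'Lighting'),
        (22345, 'Drawing Techniques'),
        (32345, 'Window Treatments&Soft Furnish'),
    ],
    names=['ID', 'Course List'],
)
df = pd.DataFrame(
    {'COURSE': ['DESN10000', 'DESN20025', 'DESN10016', 'DESN27370']}, index=index
)

def some_func(block):
    """Placeholder per-ID operation: collect the course codes of one block."""
    return list(block['COURSE'])

# groupby(level='ID') yields one (ID, sub-frame) pair per distinct ID.
courses_by_id = {int(id_): some_func(block) for id_, block in df.groupby(level='ID')}
print(courses_by_id)
```

Unlike looping over df.index.levels[0], this only visits IDs that actually occur in the frame, which matters if the frame has been filtered since the index was built.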