Ways to categorise data? - python

I want to allocate downloaded data (CSV) into, for simplicity, say 3 categories. Has anyone got any tips, similar projects I could look at, or Python tools I should look at?
3 categories are...
Shares: Include the following a,b,c
Bonds: Include the following d,e,f
Cash: g
My downloaded data may have any combination of the above investments with any value.
https://docs.google.com/spreadsheets/d/1GU7jVLA-YzqRTxyLMdbymdJ6b1RtB09bpOjIDX6eJok/edit?usp=sharing
Those are 2 basic examples of what the data will be downloaded as and what I want it to be converted to.
The real data will have 10-15 investments and approx. 4 categories. I just want to know if it is possible to sort like this? It gets tricky as we have longer investment names, and some are similar but sorted into different categories.
If someone could point me in the right direction, i.e. do I need a dictionary, or some basic framework or code to look at, that would be awesome.
Keen to learn but don't know where to start, cheers - this is my first proper coding project.
I'm not too fussed about the formatting of the output; as long as it clearly categorises the info and sums each category I'm happy :)

You don't need a framework; just the builtins will do (as usual in Python).
from collections import defaultdict

# Input data "rows". These would probably be loaded from a file.
raw_data = [
    ('a', 1000.00),
    ('b', 2000.00),
    ('d', 3000.00),
    ('e', 4000.00),
    ('g', 5000.00),
    ('g', 10000.00),
    ('c', 5000.00),
    ('d', 2000.00),
    ('a', 4000.00),
    ('e', 5000.00),
]

# Category definitions, mapping a category name to the row "types" (first column).
categories = {
    'Shares': {'a', 'b', 'c'},
    'Bonds': {'d', 'e', 'f'},
    'Cash': {'g'},
}

# Build an inverse map that makes lookups faster later.
# This will look like e.g. {"a": "Shares", "b": "Shares", ...}
category_map = {}
for category, members in categories.items():
    for member in members:
        category_map[member] = category

# Initialize an empty defaultdict to group the rows with.
rows_per_category = defaultdict(list)

# Iterate through the raw data...
for row in raw_data:
    row_type = row[0]                  # Grab the first column per row,
    category = category_map[row_type]  # map it through the category map (this raises a KeyError if the type has no category),
    rows_per_category[category].append(row)  # and put it in the defaultdict.

# Iterate through the now collated rows in sorted-by-category order:
for category, rows in sorted(rows_per_category.items()):
    # Sum the second column (value) for the total.
    total = sum(row[1] for row in rows)
    # Print header.
    print("###", category)
    # Print each row.
    for row in rows:
        print(row)
    # Print the total and an empty line.
    print("=== Total", total)
    print()
This will output something like
### Bonds
('d', 3000.0)
('e', 4000.0)
('d', 2000.0)
('e', 5000.0)
=== Total 14000.0
### Cash
('g', 5000.0)
('g', 10000.0)
=== Total 15000.0
### Shares
('a', 1000.0)
('b', 2000.0)
('c', 5000.0)
('a', 4000.0)
=== Total 12000.0
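Since your data comes from a downloaded CSV, here is a minimal sketch of how raw_data could be loaded with the csv module from the standard library, assuming a two-column file of investment name and value ("data.csv" is a placeholder filename):
import csv

raw_data = []
with open("data.csv", newline="") as f:    # "data.csv" is a placeholder filename
    reader = csv.reader(f)
    # next(reader)  # uncomment to skip a header row, if your file has one
    for name, value in reader:
        raw_data.append((name, float(value)))  # convert the value column to a number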

Related

Python Pandas identify changes over time

I am working with a large data set containing portfolio holdings of clients per date (i.e. in each time period, I have a number of stock investments for each person). My goal is to try and identify 'buys' and 'sells'. A buy happens when a new stock appears in a person's portfolio (compared to the previous period). A sell happens when a stock disappears in a person's portfolio (compared to the previous period). Is there an easy/efficient way to do this in Python? I can only think of a cumbersome way via for-loops.
Suppose we have the following dataframe, which can be constructed with the following code:
import pandas as pd

df = pd.DataFrame({
    'Date_ID': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3],
    'Person': ['a', 'a', 'b', 'b', 'a', 'a', 'a', 'a', 'b', 'b', 'a', 'a', 'a', 'b'],
    'Stock': ['x1', 'x2', 'x2', 'x3', 'x1', 'x2', 'x3', 'x4', 'x2', 'x3', 'x1', 'x2', 'x3', 'x3']})
I would like to create 'bought' and 'sell' columns which identify stocks that have been added or are going to be removed from the portfolio. The bought column equals True if the stock newly appears in the person's portfolio (compared to the previous date). The sell column equals True if the stock disappears from the person's portfolio on the next date.
How to accomplish this (or something similar to identify trades efficiently) in Python?
You can group your dataframe by 'Person' first, because people are completely independent from each other. After that, for each person, group by 'Date_ID', and for each stock in a group determine whether it is present in the previous and next groups:
def get_person_indicators(df):
    """`df` here contains info for 1 person only."""
    g = df.groupby('Date_ID')['Stock']

    prev_stocks = g.agg(set).shift()
    was_bought = g.transform(lambda s: ~s.isin(prev_stocks[s.name])
                                       if not pd.isnull(prev_stocks[s.name])
                                       else False)

    next_stocks = g.agg(set).shift(-1)
    will_sell = g.transform(lambda s: ~s.isin(next_stocks[s.name])
                                      if not pd.isnull(next_stocks[s.name])
                                      else False)

    return pd.DataFrame({'was_bought': was_bought, 'will_sell': will_sell})

result = pd.concat([df, df.groupby('Person').apply(get_person_indicators)],
                   axis=1)
Note:
For better memory usage you can change the dtype of the 'Stock' column from str to Categorical:
df['Stock'] = df['Stock'].astype('category')

Creating & using categorical data type with pandas

I'm having trouble changing the type of my variable to a categorical data type.
My variable is called "Energy class" and contains the following values:
A++, A+, A, B, C, D, E, F, G.
I want to change the type to a category and order the categories in that same order.
Hence: A++ = 1, A+ = 2, A = 3, B = 4 , etc.
I will also have to perform the same manipulation with another variable, "Condition of the building", which contains the following values: "Very good", "Good", "To be restored".
I tried using the pandas set_categories() method, but it didn't work. There is very little information on how to use it in the documentation.
Anyone knows how to deal with this?
Thank you
You can use map:
energy_class = {'A++':1, 'A+':2,...}
df['Energy class'] = df['Energy class'].map(energy_class)
A bit fancier, when you have an ordered list of the classes:
energy_classes = ['A++', 'A+', ...]
df['Energy class'] = df['Energy class'].map({c: i for i, c in enumerate(energy_classes, 1)})
You can use an ordered pd.Categorical:
df['Energy class'] = pd.Categorical(
    df['Energy class'],
    categories=['A++', 'A+', 'A', 'B', 'C', 'D', 'E', 'F', 'G'],
    ordered=True)
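If you also need the numeric ranks from the question (A++ = 1, A+ = 2, ...), the category codes of an ordered Categorical give them directly; a small sketch, where the +1 shifts pandas' 0-based codes to the 1-based numbering and 'Energy class rank' is just a hypothetical column name:
df['Energy class rank'] = df['Energy class'].cat.codes + 1  # hypothetical new column
The same pattern works for "Condition of the building" with its own ordered list of categories.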

Reading items from a csv and updating the same items in another csv

I'm working on a method to read data from input.csv and update the stock column in output.csv based on the product's id.
These are the steps I'm working on right now:
1. Read product info from input.csv into input_data = [], which will return a list of OrderedDict.
input_data currently looks like this:
[OrderedDict([('id', '1'), ('name', 'a'), ('stock', '33')]),
 OrderedDict([('id', '2'), ('name', 'b'), ('stock', '66')]),
 OrderedDict([('id', '3'), ('name', 'c'), ('stock', '99')])]
2. Read current product info from output.csv into output_data = [], which has the same schema as input_data
3. Iterate through input_data and update the stock column in output_data based on stock info in input_data. What's the best way to do this?
-> An important mention: there might be some IDs which exist in input_data but do not exist in output_data. I would like to update the stocks for IDs common to input_data and output_data, and the "new" IDs would most likely be written to a new csv.
I was thinking of something like (this isn't real code):
for p in input_data:
    # check if p['id'] exists in the list of output_data IDs (I might have to create a list of IDs in output_data for this as well, in order to check it against input_data IDs)
    # if p['id'] exists in output_data, write the stock to the corresponding product in output_data
    # else, append p to another_csv
I know this looks pretty messy; what I'm asking for is a logical way to approach this task without wasting too much compute time. The files in question will probably be around 100,000 rows long, so performance and speed will be an issue.
If my data from input_data and output_data are lists of OrderedDict, what is the best way to check the id in input_data and write the stock to the product with the exact same id in output_data?
While Python might not be your best option, I wouldn't use lists of OrderedDict for this task. This is simply because finding and changing an entry within output_data would be O(n), which turns the whole script into O(n**2).
I would save the two files in dicts (or OrderedDicts if you care about order), like this (and reduce the complexity of the whole thing to O(n)):
input_data = {
    '1': ['a', '33'],
    '2': ['b', '66'],
    '3': ['c', '99']
}
output_data = {
    '1': ['a', '31'],
    '3': ['c', '95']
}

# iterate through all keys in input_data and update output_data;
# if a key does not exist in output_data, create it in a different dict
new_data = {}
for key in input_data:
    if key not in output_data:
        new_data[key] = input_data[key]
        # for optimisation's sake you could append data into the new file here
        # and not save into a new dict
    else:
        output_data[key][1] = input_data[key][1]
        # for optimisation's sake you could append data into a new output file here
        # and rename/move the new output file into the old output file after the script finishes
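To connect this with the actual files, here is a minimal sketch, assuming input.csv and output.csv both have id, name and stock columns as in the question ("new.csv" is a hypothetical name for the file of unmatched IDs):
import csv

def read_rows(path):
    # Load a CSV into a dict keyed by id for O(1) lookups.
    with open(path, newline="") as f:
        return {row["id"]: row for row in csv.DictReader(f)}

input_data = read_rows("input.csv")
output_data = read_rows("output.csv")

new_rows = []
for key, row in input_data.items():
    if key in output_data:
        output_data[key]["stock"] = row["stock"]  # update the stock of an existing product
    else:
        new_rows.append(row)                      # id only exists in input.csv

fieldnames = ["id", "name", "stock"]
with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(output_data.values())

with open("new.csv", "w", newline="") as f:       # "new.csv" is a placeholder filename
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(new_rows)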

What is the fastest way to dedupe multivariate data?

Let's assume a very simple data structure. In the below example, IDs are unique. "date" and "id" are strings, and "amount" is an integer.
data = [[date1, id1, amount1], [date2, id2, amount2], etc.]
If date1 == date2 and id1 == id2, I'd like to merge the two entries into one and basically add up amount1 and amount2 so that data becomes:
data = [[date1, id1, amount1 + amount2], etc.]
There are many duplicates.
As data is very big (over 100,000 entries), I'd like to do this as efficiently as possible. What I did was create a new "common" field that is basically date + id combined into one string, with metadata allowing me to split it later (date + id + "_" + str(len(date))).
In terms of complexity, I have four loops:
Parse and load data from external source (it doesn't come in lists) | O(n)
Loop over data and create and store "common" string (date + id + metadata) - I call this "prepared data" where "common" is my encoded field | O(n)
Use the Counter() object to dedupe "prepared data" | O(n)
Decode "common" | O(n)
I don't care about memory here, I only care about speed. I could make a nested loop and avoid steps 2, 3 and 4 but that would be a time-complexity disaster (O(n²)).
What is the fastest way to do this?
Consider a defaultdict for aggregating data by a unique key:
Given
Some random data
import random
import collections as ct
random.seed(123)
# Random data
dates = ["2018-04-24", "2018-05-04", "2018-07-06"]
ids = "A B C D".split()
amounts = lambda: random.randrange(1, 100)
ch = random.choice
data = [[ch(dates), ch(ids), amounts()] for _ in range(10)]
data
Output
[['2018-04-24', 'C', 12],
['2018-05-04', 'C', 14],
['2018-04-24', 'D', 69],
['2018-07-06', 'C', 44],
['2018-04-24', 'B', 18],
['2018-05-04', 'C', 90],
['2018-04-24', 'B', 1],
['2018-05-04', 'A', 77],
['2018-05-04', 'A', 1],
['2018-05-04', 'D', 14]]
Code
dd = ct.defaultdict(int)
for date, id_, amt in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key] += amt
dd
Output
defaultdict(int,
            {'2018-04-24B_10': 19,
             '2018-04-24C_10': 12,
             '2018-04-24D_10': 69,
             '2018-05-04A_10': 78,
             '2018-05-04C_10': 104,
             '2018-05-04D_10': 14,
             '2018-07-06C_10': 44})
Details
A defaultdict is a dictionary that calls a default factory (a specified function) for any missing keys. In this case, every date + id combination is uniquely added to the dict. The amounts are added to the values if existing keys are found. Otherwise an integer (0) initializes a new entry in the dict.
For illustration, you can visualize the aggregated values using a list as the default factory.
dd = ct.defaultdict(list)
for date, id_, val in data:
    key = "{}{}_{}".format(date, id_, len(date))
    dd[key].append(val)
dd
Output
defaultdict(list,
            {'2018-04-24B_10': [18, 1],
             '2018-04-24C_10': [12],
             '2018-04-24D_10': [69],
             '2018-05-04A_10': [77, 1],
             '2018-05-04C_10': [14, 90],
             '2018-05-04D_10': [14],
             '2018-07-06C_10': [44]})
We see three occurrences of duplicate keys where the values were appropriately summed. Regarding efficiency, notice:
keys are made with format(), which should be a bit better than string concatenation and calling str()
every key and value is computed in the same iteration
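As a side note beyond the original approach, the encode/decode steps can be avoided entirely by using the (date, id) tuple itself as the dictionary key, since tuples of strings are hashable; a minimal sketch reusing the same data and ct alias from above:
dd = ct.defaultdict(int)
for date, id_, amt in data:
    dd[(date, id_)] += amt  # the tuple is the key, no string encoding or decoding needed

# back to the original list-of-lists layout
deduped = [[date, id_, amt] for (date, id_), amt in dd.items()]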
Using pandas makes this really easy:
import pandas as pd
df = pd.DataFrame(data, columns=['date', 'id', 'amount'])
df.groupby(['date','id']).sum().reset_index()
For more control you can use agg instead of sum():
df.groupby(['date','id']).agg({'amount':'sum'})
Depending on what you are doing with the data, it may be easier/faster to go this way just because so much of pandas is built on compiled C extensions and optimized routines that make it super easy to transform and manipulate.
You could import the data into a structure that prevents duplicates and then convert it to a list.
data = {
    date1: {
        id1: amount1,
        id2: amount2,
    },
    date2: {
        id3: amount3,
        id4: amount4,
        ....
    },
}
The program's skeleton:
import collections

ddata = collections.defaultdict(dict)
for date, id, amount in DATASOURCE:
    ddata[date][id] = ddata[date].get(id, 0) + amount  # sum amounts for duplicate (date, id) pairs
data = [[d, i, a] for d, subd in ddata.items() for i, a in subd.items()]

What data structure container can be sorted by date

Is there any data structure that can be sorted by date in Python 3?
('2015-08-01', 10,10)
('2015-08-03', 11,11)
.. and so on ..
I know I can use a pandas dataframe, but I'd like to know if there are other, more lightweight alternatives.
Since the date is a string in YYYY-MM-DD format, it's already sortable in the way you'd expect. And since the dates are the first item, you don't even need to provide a key function.
data = [('2015-08-03', 11,11), ('2015-08-01', 10,10)]
data.sort()
print(data)
Result:
[('2015-08-01', 10, 10), ('2015-08-03', 11, 11)]
If the date wasn't the first item, you could do this:
import operator
data = [('a', '2015-08-03', 11,11), ('b', '2015-08-01', 10,10)]
data.sort(key=operator.itemgetter(1))
print(data)
Result:
[('b', '2015-08-01', 10, 10), ('a', '2015-08-03', 11, 11)]
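If the date strings were not in ISO YYYY-MM-DD format (say DD/MM/YYYY instead), plain string sorting would no longer match chronological order; a minimal sketch of sorting by a parsed date instead, using datetime from the standard library (the data here is hypothetical):
from datetime import datetime

data = [('03/08/2015', 11, 11), ('01/08/2015', 10, 10)]  # hypothetical DD/MM/YYYY data
data.sort(key=lambda row: datetime.strptime(row[0], '%d/%m/%Y'))
print(data)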
