I am a beginner/intermediate Python user, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. Then it does the same with the next string...
##### Reading CSV file values and looking for variant IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case a line contains "rs" followed by something else; rs\d+ looks for "rs" plus numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now we save the results found in a dict: key=index and value=variant ID
if rs.empty == False:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found,
    # because if more than one ID variant is found in the same row (i.e. rs# and NM_#)
    # this code would pick up the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where the substring 'rs' has been found need to be deleted to avoid repetition.
        # This is done in df_draft.
        df_draft = df_draft.drop(index)

## Same thing with the other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NM.empty == False:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NP.empty == False:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)

# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if RCV.empty == False:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering if there is a more elegant way of writing this simple but long code.
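One way to fold the four nearly identical blocks into a single loop over the patterns could look like the sketch below (untested against your data; it assumes every column of df_draft holds strings, as your code already does, and the dict names are just labels I chose):

# Patterns for each kind of variant ID; the keys are only labels for the result dicts.
patterns = {"rs": r"rs\d+", "NM": r"NM_\d+", "NP": r"NP_\d+", "RCV": r"RCV\d+"}
results = {}  # e.g. results["rs"] == {row_index: variant ID, ...}

for name, pattern in patterns.items():
    hits = (df_draft[df_draft.apply(lambda col: col.str.contains(pattern, na=False))]
            .dropna(how='all').dropna(axis=1, how='all'))
    if not hits.empty:
        results[name] = dict(zip(hits.index.to_list(), hits.stack().values))
        # Drop the matched rows so the next pattern cannot pick them up again.
        df_draft = df_draft.drop(hits.index)

print(results)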
I have this issue where all the rows in my DataFrame contain more than one item. I would like to iterate through the whole DataFrame and append each row item to a new list, but I'm unsure how to do this as of now.
IPs
0 [172.16.254.1, 192.168.1.15, 255.255.255.0]
1 [192.0.2.1, 255.255.255.0, 192.0.2.1]
2 [172.16.254.1]
3 [0.0.0.0]
This is my current output, and I would like to take each item per row in the DataFrame and append it to a list:
curled_ips_list = []
ip_addresses_found = []
ip_address_format = (r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
with open(website_file_path, 'r', encoding='utf-8-sig') as curled_ips_file:
    found_ips_reader = pd.read_csv(curled_ips_file, names=['IPs'], delimiter='\n', quoting=csv.QUOTE_NONE, engine='c')
    found_ips_reader = pd.Series(found_ips_reader['IPs'])
    curled_ips_list = found_ips_reader[found_ips_reader.str.contains(ip_address_format)]
    curled_ips_list = curled_ips_list.str.findall(ip_address_format)
    curled_ips_list = pd.DataFrame(curled_ips_list)
    curled_ips_file.close()
I'm not receiving any error messages as of yet, but I'm unsure how to go about it.
Since you have not mentioned the output you need, I am assuming you need the following.
# Load your IPs into a DataFrame like the one you have shown above.
iplist = df['IPs']
[ip for sublist in iplist for ip in sublist]
['172.16.254.1',
'192.168.1.15',
'255.255.255.0',
'192.0.2.1',
'255.255.255.0',
'192.0.2.1',
'172.16.254.1',
'0.0.0.0']
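If you would rather stay in pandas, an equivalent one-liner (assuming the DataFrame is called df as above and each cell in the 'IPs' column holds a list) is:

# explode() turns each list element into its own row, then tolist() flattens to a plain list
flat_ips = df['IPs'].explode().tolist()
# ['172.16.254.1', '192.168.1.15', '255.255.255.0', ...]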
I have two CSV files which contain all the products in the database. Currently the files are being compared using Excel formulas, which is a long process (approx. 130,000 rows in each file).
I have written a script in Python which works well with small sample data; however, it isn't practical in the real world.
CSV layout is:
ID, Product Title, Cost, Price1, Price2, Price3, Status
import csv
data_old = []
data_new = []
with open(file_path_old) as f1:
    data = csv.reader(f1, delimiter=",")
    next(data)
    for row in data:
        data_old.append(row)
    f1.close()

with open(file_path_new) as f2:
    data = csv.reader(f2, delimiter=",")
    for row in data:
        data_new.append(row)
    f2.close()

for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # If match check rest of data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...
The issue with the above is that, because it is a nested for loop, it goes through each row of the second dataset 130,000 times (slow is an understatement).
What I'm trying to achieve is a list of all the products which have had a change in the title, cost, any of the 3 prices, or status, as well as a boolean flag to show which data has changed from the previous week's data.
Desired Output CSV Format:
ID, Old Title, New Title, Changed, Old Cost, New Cost, Changed....
123, ABC, ABC, False, £12, £13, True....
SOLUTION:
import pandas as pd
# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")
# Join data together in single data table
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])
# Displays data side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]
# Output to CSV
df_swap.to_csv(output_file)
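If you also want the boolean Changed flags from the desired output, a possible extension of the join above (a sketch; column names are whatever pd.read_csv picked up from your header) is:

# Add a 'Changed' block comparing every Old/New column pair.
# Note: products present in only one file show NaN on the other side and will count as changed.
for col in old.columns[1:]:
    df_join[('Changed', col)] = df_join[('Old', col)] != df_join[('New', col)]

df_join.to_csv(output_file)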
Just use pandas
import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')
Then you can do whatever (just read the doc). For example, to compare the titles:
old['Title'] == new['Title'] gives you an array of booleans for every row in your file.
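Note that comparing the columns directly only lines rows up by position, so it assumes both files list the same products in the same order. If that is not guaranteed, a small sketch that aligns the two files on the ID column first (column names taken from the layout in the question):

import pandas as pd

old = pd.read_csv(file_path_old)
new = pd.read_csv(file_path_new)

# Align rows on the product ID instead of on row position
merged = pd.merge(old, new, on='ID', suffixes=('_old', '_new'))

# Boolean Series: True where the title changed between the two files
title_changed = merged['Product Title_old'] != merged['Product Title_new']
print(merged.loc[title_changed, ['ID', 'Product Title_old', 'Product Title_new']])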
Do you care about new and removed products? If not, then you can get O(n) performance by using a dictionary.
Pick one CSV file and shove it into a dictionary keyed by id. Use lookups into the dictionary to find products that changed.
Note that I simplified your data down to one column for brevity.
data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo')
]

data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]

changed_products = []
new_product_map = {id: product for (id, product) in data_new}

for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)
print('Changed products: ', changed_products)
You can shorten this even more using a list comprehension
new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)
The diff algorithm below can also track insertions and deletions. You can use it if your CSV files are sorted by id.
You can sort the data in O(n log n) time after loading it if the CSV files have no sensible order. Proceed with the diff after sorting.
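A quick sketch of that sort, assuming the rows are (id, value) tuples as in the snippets above:

# Sort both row lists by their first element (the id) before running the diff
data_old.sort(key=lambda row: row[0])
data_new.sort(key=lambda row: row[0])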
Either way, this will be faster than the O(n^2) loops in your original post:
data_old = # same setup as before
data_new = # ditto
old_index = 0
new_index = 0
new_products = []
deleted_products = []
changed_products = []
while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]
    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)
        old_index += 1
        new_index += 1

if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d new items were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]
print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
PS: The suggestion to use pandas is a great one. Use it if possible.
I have a CSV file with about 700 rows and 3 columns, containing label, rgb and string information, e.g.:
str; rgb; label; color
bones; "['255','255','255']"; 2; (241,214,145)
Aorta; "['255','0','0']"; 17; (216,101,79)
VenaCava; "['0','0','255']"; 16; (0,151,206)
I'd like to create a simple method to convert one unique input to one unique output.
One solution would be to map all ROIDisplayColor entries to the corresponding label entries in a dictionary, e.g. rgb2label:
with open("c:\my_file.csv") as csv_file:
rgb2label, label2rgb = {}, {} # rgb2str, label2str, str2label...
for row in csv.reader(csv_file):
rgb2label[row[1]] = row[2]
label2rgb[row[2]] = row[1]
This could simply be used as follows:
>>> rgb2label[ "['255','255','255']"]
'2'
>>> label2rgb['2']
"['255','255','255']"
The application is simple but requires a separate dictionary for every relation (rgb2label, rgb2str, str2rgb, str2label, etc...).
Does a more compact solution with the same ease of use exist?
Here you're limiting yourself to one-to-one dictionaries, so you end up with loads of them (4^2=16 here).
You could instead use one-to-many dictionaries, so you'll have only 4:
rgb, label = {}, {}  # one lookup dict per key column
for row in csv.reader(csv_file):
    rgb[row[1]] = row
    label[row[2]] = row
That you would use like this:
>>> rgb[ "['255','255','255']"][2]
'2'
>>> label['2'][1]
"['255','255','255']"
You could make this clearer by turning your row into a dict as well:
rgb, label = {}, {}
for row in csv.reader(csv_file):
    # use different names for the unpacked values so they don't shadow the dicts
    name, rgb_value, label_value, color = row
    d = {"rgb": rgb_value, "label": label_value}
    rgb[rgb_value] = d
    label[label_value] = d
That you would use like this:
>>> rgb[ "['255','255','255']"]["label"]
'2'
>>> label['2']["rgb"]
"['255','255','255']"
['Date,Open,High,Low,Close,Volume,Adj Close',
'2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
'2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
'2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
'2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
'2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
'2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
'2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
'2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']
I need to make a namedtuple for each of the lines in this list of lines; the fields would basically be the words in the first line, 'Date,Open,High,Low,Close,Volume,Adj Close'. I will then be making some calculations and will need to add 2 more fields at the end of each namedtuple. Any help on how I can do this?
from collections import namedtuple
data = ['Date,Open,High,Low,Close,Volume,Adj Close',
'2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
'2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
'2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
'2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
'2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
'2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
'2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
'2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']
def convert_to_named_tuples(data):
    # get the names for the named tuple
    field_names = data[0].split(",")
    # these are your two extra custom fields
    field_names.append("extra1")
    field_names.append("extra2")
    # field names can't have spaces in them (they have to be valid python identifiers,
    # and "Adj Close" isn't)
    field_names = [field_name.replace(" ", "_") for field_name in field_names]
    # you can do this as many times as you like..
    # personally I'd do it manually once at the start and just check you're getting
    # the field names you expect here...
    ShareData = namedtuple("ShareData", field_names)
    # unpack the data into the named tuples
    share_data_list = []
    for row in data[1:]:
        fields = row.split(",")
        fields += [None, None]
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# check it works..
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
Actually this is better, I think, since it converts the fields into the right types. On the downside it won't take arbitrary data...
from collections import namedtuple
from datetime import datetime
data = [...same as before...]
field_names = ["Date","Open","High","Low","Close","Volume", "AdjClose", "Extra1", "Extra2"]
ShareData = namedtuple("ShareData", field_names)
def convert_to_named_tuples(data):
    share_data_list = []
    for row in data[1:]:
        row = row.split(",")
        fields = (datetime.strptime(row[0], "%Y-%m-%d"),  # date
                  float(row[1]), float(row[2]),
                  float(row[3]), float(row[4]),
                  int(row[5]),    # volume
                  float(row[6]),  # adj close
                  None, None)     # extras
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# test
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
But I agree with the other posts... why use a namedtuple when you can use a class definition?
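For instance, a minimal class-based sketch using dataclasses (reusing the data list from the question; the field names and the two extras are just placeholders chosen here):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ShareData:
    date: datetime
    open: float
    high: float
    low: float
    close: float
    volume: int
    adj_close: float
    extra1: Optional[float] = None  # filled in by later calculations
    extra2: Optional[float] = None

rows = []
for line in data[1:]:
    d, op, hi, lo, cl, vol, adj = line.split(",")
    rows.append(ShareData(datetime.strptime(d, "%Y-%m-%d"),
                          float(op), float(hi), float(lo), float(cl),
                          int(vol), float(adj)))

rows[0].extra1 = rows[0].close - rows[0].open  # fields are mutable, unlike a namedtuple
print(rows[0])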
Any special reason why you want to use namedtuples? If you want to add fields later, maybe you should use a dictionary. If you really want to go the namedtuple way though, you could use a placeholder like:
from collections import namedtuple
field_names = data[0].replace(" ", "_").lower().split(",")
field_names += ['placeholder_1', 'placeholder_2']
Entry = namedtuple('Entry', field_names)
list_of_named_tuples = []
mock_data = [None, None]
for row in data[1:]:
    row_data = row.split(",") + mock_data
    list_of_named_tuples.append(Entry(*row_data))
If, instead, you want to parse your data into a list of dictionaries (more pythonic IMO) you should do:
field_names = data[0].split(",")
list_of_dicts = [dict(zip(field_names, row.split(','))) for row in data[1:]]
EDIT: Note that even though you may use dictionaries instead of namedtuples for the small dataset from your example, doing so with large amounts of data will translate into a higher memory footprint for your program.
Why don't you use a dictionary for the data? Adding additional keys is then easy:
dataList = []
keys = myData[0].split(',')
for row in myData[1:]:  # skip the header row
    tempdict = dict()
    for index, value in enumerate(row.split(',')):
        tempdict[keys[index]] = value
    # if your additional values are going to be determined here then
    # you can do whatever calculations you need and add them;
    # otherwise you can work with this list elsewhere
    dataList.append(tempdict)