I am a beginner/intermediate Python user, and when I write elaborate code (at least for me), I always try to rewrite it to reduce the number of lines where possible.
Here is the code I have written.
It basically reads all values of one data frame looking for a specific string; if the string is found, it saves the index and value in a dictionary and drops the rows where that string was found. Then it does the same with the next string...
##### Reading CSV file values and looking for variant IDs ######
# Find Variant ID (rs000000) in CSV
# \d+ is necessary in case a line contains "rs" followed by something else; rs\d+ looks for "rs" plus numbers
rs = df_draft[df_draft.apply(lambda x: x.str.contains(r"rs\d+"))].dropna(how='all').dropna(axis=1, how='all')
# Now we save the results found in a dict: key=index and value=variant ID
if rs.empty == False:
    ind = rs.index.to_list()
    vals = list(rs.stack().values)
    row2rs = dict(zip(ind, vals))
    print(row2rs)
    # We need to remove the rows where rs has been found,
    # because if more than one ID variant is found in the same row (i.e. rs# and NM_#)
    # this code would pick up the same variant more than once.
    for index, rs in row2rs.items():
        # Rows where the substring 'rs' has been found need to be deleted to avoid repetition.
        # This is done in df_draft.
        df_draft = df_draft.drop(index)

## Same thing with the other ID variants
# Here with Variant ID (NM_0000000) in CSV
NM = df_draft[df_draft.apply(lambda x: x.str.contains(r"NM_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NM.empty == False:
    ind = NM.index.to_list()
    vals = list(NM.stack().values)
    row2NM = dict(zip(ind, vals))
    print(row2NM)
    for index, NM in row2NM.items():
        df_draft = df_draft.drop(index)

# Here with Variant ID (NP_0000000) in CSV
NP = df_draft[df_draft.apply(lambda x: x.str.contains(r"NP_\d+"))].dropna(how='all').dropna(axis=1, how='all')
if NP.empty == False:
    ind = NP.index.to_list()
    vals = list(NP.stack().values)
    row2NP = dict(zip(ind, vals))
    print(row2NP)
    for index, NP in row2NP.items():
        df_draft = df_draft.drop(index)

# Here with ClinVar field (RCV#) in CSV
RCV = df_draft[df_draft.apply(lambda x: x.str.contains(r"RCV\d+"))].dropna(how='all').dropna(axis=1, how='all')
if RCV.empty == False:
    ind = RCV.index.to_list()
    vals = list(RCV.stack().values)
    row2RCV = dict(zip(ind, vals))
    print(row2RCV)
    for index, RCV in row2RCV.items():
        df_draft = df_draft.drop(index)
I was wondering if there is a more elegant way of writing this simple but long code.
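One way to fold the four nearly identical blocks into a single loop over the patterns could look like the sketch below (untested against your data; it assumes every column of df_draft holds strings, as your code already does, and the dict names are just labels I chose):

# Patterns for each kind of variant ID; the keys are only labels for the result dicts.
patterns = {"rs": r"rs\d+", "NM": r"NM_\d+", "NP": r"NP_\d+", "RCV": r"RCV\d+"}
results = {}  # e.g. results["rs"] == {row_index: variant ID, ...}

for name, pattern in patterns.items():
    hits = (df_draft[df_draft.apply(lambda col: col.str.contains(pattern, na=False))]
            .dropna(how='all').dropna(axis=1, how='all'))
    if not hits.empty:
        results[name] = dict(zip(hits.index.to_list(), hits.stack().values))
        # Drop the matched rows so the next pattern cannot pick them up again.
        df_draft = df_draft.drop(hits.index)

print(results)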
I have this issue where all the rows in my DataFrame contain more than one item. I would like to iterate through the whole DataFrame and append each row item to a new list, but I'm unsure how to do this as of now.
IPs
0 [172.16.254.1, 192.168.1.15, 255.255.255.0]
1 [192.0.2.1, 255.255.255.0, 192.0.2.1]
2 [172.16.254.1]
3 [0.0.0.0]
This is my current output, and I would like to take each item per row in the DataFrame and append it to a list:
curled_ips_list = []
ip_addresses_found = []
ip_address_format = (r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b')
with open(website_file_path, 'r', encoding='utf-8-sig') as curled_ips_file:
    found_ips_reader = pd.read_csv(curled_ips_file, names=['IPs'], delimiter='\n', quoting=csv.QUOTE_NONE, engine='c')
    found_ips_reader = pd.Series(found_ips_reader['IPs'])
    curled_ips_list = found_ips_reader[found_ips_reader.str.contains(ip_address_format)]
    curled_ips_list = curled_ips_list.str.findall(ip_address_format)
    curled_ips_list = pd.DataFrame(curled_ips_list)
    curled_ips_file.close()
I'm not receiving any error messages as of yet, but I'm unsure how to go about it.
Since you have not mentioned the output you need, I am assuming you need the following.
# Load your IPs into a DataFrame like the one you have shown above.
iplist = df['IPs']
[ip for sublist in iplist for ip in sublist]
['172.16.254.1',
'192.168.1.15',
'255.255.255.0',
'192.0.2.1',
'255.255.255.0',
'192.0.2.1',
'172.16.254.1',
'0.0.0.0']
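If you would rather stay in pandas, an equivalent one-liner (assuming the DataFrame is called df as above and each cell in the 'IPs' column holds a list) is:

# explode() turns each list element into its own row, then tolist() flattens to a plain list
flat_ips = df['IPs'].explode().tolist()
# ['172.16.254.1', '192.168.1.15', '255.255.255.0', ...]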
I have two CSV files which contain all the products in the database. Currently the files are being compared using Excel formulas, which is a long process (approx. 130,000 rows in each file).
I have written a script in Python which works well with small sample data; however, it isn't practical in the real world.
CSV layout is:
ID, Product Title, Cost, Price1, Price2, Price3, Status
import csv
data_old = []
data_new = []
with open(file_path_old) as f1:
    data = csv.reader(f1, delimiter=",")
    next(data)
    for row in data:
        data_old.append(row)
    f1.close()

with open(file_path_new) as f2:
    data = csv.reader(f2, delimiter=",")
    for row in data:
        data_new.append(row)
    f2.close()

for d1 in data_new:
    for d2 in data_old:
        if d2[0] == d1[0]:
            # If match check rest of data in the same row
            if d2[1] != d1[1]:
                ...
            if d2[2] != d1[2]:
                ...
The issue with the above is that, because it is a nested for loop, it goes through each row of the second dataset 130,000 times (slow is an understatement).
What I'm trying to achieve is a list of all the products which have had a change in the title, cost, any of the 3 prices, or status, as well as a boolean flag to show which data has changed from the previous week's data.
Desired Output CSV Format:
ID, Old Title, New Title, Changed, Old Cost, New Cost, Changed....
123, ABC, ABC, False, £12, £13, True....
SOLUTION:
import pandas as pd
# Read CSVs
old = pd.read_csv(old_file, sep=",")
new = pd.read_csv(new_file, sep=",")
# Join data together in single data table
df_join = pd.concat([old.set_index('PARTNO'), new.set_index('PARTNO')],
                    axis='columns', keys=['Old', 'New'])
# Displays data side by side
df_swap = df_join.swaplevel(axis='columns')[old.columns[1:]]
# Output to CSV
df_swap.to_csv(output_file)
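If you also want the boolean Changed flags from the desired output, a possible extension of the join above (a sketch; column names are whatever pd.read_csv picked up from your header) is:

# Add a 'Changed' block comparing every Old/New column pair.
# Note: products present in only one file show NaN on the other side and will count as changed.
for col in old.columns[1:]:
    df_join[('Changed', col)] = df_join[('Old', col)] != df_join[('New', col)]

df_join.to_csv(output_file)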
Just use pandas
import pandas as pd
old = pd.read_csv(file_path_old, sep=',')
new = pd.read_csv(file_path_new, sep=',')
Then you can do whatever (just read the doc). For example, to compare the titles:
old['Title'] == new['Title'] gives you an array of booleans for every row in your file.
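Note that comparing the columns directly only lines rows up by position, so it assumes both files list the same products in the same order. If that is not guaranteed, a small sketch that aligns the two files on the ID column first (column names taken from the layout in the question):

import pandas as pd

old = pd.read_csv(file_path_old)
new = pd.read_csv(file_path_new)

# Align rows on the product ID instead of on row position
merged = pd.merge(old, new, on='ID', suffixes=('_old', '_new'))

# Boolean Series: True where the title changed between the two files
title_changed = merged['Product Title_old'] != merged['Product Title_new']
print(merged.loc[title_changed, ['ID', 'Product Title_old', 'Product Title_new']])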
Do you care about new and removed products? If not, then you can get O(n) performance by using a dictionary.
Pick one CSV file and shove it into a dictionary keyed by id. Use lookups into the dictionary to find products that changed.
Note that I simplified your data down to one column for brevity.
data_old = [
    (1, 'alpha'),
    (2, 'bravo'),
    (3, 'delta'),
    (5, 'echo')
]

data_new = [
    (1, 'alpha'),
    (2, 'zulu'),
    (4, 'foxtrot'),
    (6, 'mike'),
    (7, 'lima'),
]

changed_products = []
new_product_map = {id: product for (id, product) in data_new}

for id, old_product in data_old:
    if id in new_product_map and new_product_map[id] != old_product:
        changed_products.append(id)
print('Changed products: ', changed_products)
You can shorten this even more using a list comprehension
new_product_map = {id: product for (id, product) in data_new}
changed_products = [id for (id, old_product) in data_old if id in new_product_map and new_product_map[id] != old_product]
print('Changed products: ', changed_products)
The diff algorithm below can also track insertions and deletions. You can use it if your CSV files are sorted by id.
You can sort the data in O(n log n) time after loading it if the CSV files have no sensible order. Proceed with the diff after sorting.
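A quick sketch of that sort, assuming the rows are (id, value) tuples as in the snippets above:

# Sort both row lists by their first element (the id) before running the diff
data_old.sort(key=lambda row: row[0])
data_new.sort(key=lambda row: row[0])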
Either way, this will be faster than the O(n^2) loops in your original post:
data_old = # same setup as before
data_new = # ditto
old_index = 0
new_index = 0
new_products = []
deleted_products = []
changed_products = []
while old_index < len(data_old) and new_index < len(data_new):
    (old_id, old_product) = data_old[old_index]
    (new_id, new_product) = data_new[new_index]
    if old_id < new_id:
        print('Product removed : %d' % old_id)
        deleted_products.append(old_id)
        old_index += 1
    elif new_id < old_id:
        print('Product added : %d' % new_id)
        new_products.append(new_id)
        new_index += 1
    else:
        if old_product != new_product:
            print('Product %d changed from %s to %s' % (old_id, old_product, new_product))
            changed_products.append(old_id)
        else:
            print('Product %d did not change' % old_id)
        old_index += 1
        new_index += 1

if old_index != len(data_old):
    num_deleted = len(data_old) - old_index
    print('The last %d old items were deleted' % num_deleted)
    deleted_products += [id for (id, _) in data_old[old_index:]]
elif new_index != len(data_new):
    num_added = len(data_new) - new_index
    print('The last %d new items were completely new' % num_added)
    new_products += [id for (id, _) in data_new[new_index:]]
print('New products: ', new_products)
print('Changed products: ', changed_products)
print('Deleted products: ', deleted_products)
PS: The suggestion to use pandas is a great one. Use it if possible.
I have a CSV file with about 700 rows and 3 columns, containing label, rgb and string information, e.g.:
str; rgb; label; color
bones; "['255','255','255']"; 2; (241,214,145)
Aorta; "['255','0','0']"; 17; (216,101,79)
VenaCava; "['0','0','255']"; 16; (0,151,206)
I'd like to create a simple method to convert one unique input to one unique output.
One solution would be to map all ROIDisplayColor entries to the corresponding label entries in a dictionary, e.g. rgb2label:
with open("c:\my_file.csv") as csv_file:
rgb2label, label2rgb = {}, {} # rgb2str, label2str, str2label...
for row in csv.reader(csv_file):
rgb2label[row[1]] = row[2]
label2rgb[row[2]] = row[1]
This could simply be used as follows:
>>> rgb2label[ "['255','255','255']"]
'2'
>>> label2rgb['2']
"['255','255','255']"
The application is simple but requires a separate dictionary for every relation (rgb2label, rgb2str, str2rgb, str2label, etc...).
Does a more compact solution with the same ease of use exist?
Here you're limiting yourself to one-to-one dictionaries, so you end up with loads of them (4^2=16 here).
You could instead use one-to-many dictionaries, so you'll have only 4:
rgb, label = {}, {}  # one lookup dict per key column
for row in csv.reader(csv_file):
    rgb[row[1]] = row
    label[row[2]] = row
That you would use like this:
>>> rgb[ "['255','255','255']"][2]
'2'
>>> label['2'][1]
"['255','255','255']"
You could make this clearer by turning your row into a dict as well:
rgb, label = {}, {}
for row in csv.reader(csv_file):
    # use different names for the unpacked values so they don't shadow the dicts
    name, rgb_value, label_value, color = row
    d = {"rgb": rgb_value, "label": label_value}
    rgb[rgb_value] = d
    label[label_value] = d
That you would use like this:
>>> rgb[ "['255','255','255']"]["label"]
'2'
>>> label['2']["rgb"]
"['255','255','255']"
['Date,Open,High,Low,Close,Volume,Adj Close',
'2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
'2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
'2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
'2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
'2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
'2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
'2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
'2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']
I need to make a namedtuple for each of the lines in this list of lines; the fields would basically be the words in the first line, 'Date,Open,High,Low,Close,Volume,Adj Close'. I will then be making some calculations and will need to add 2 more fields at the end of each namedtuple. Any help on how I can do this?
from collections import namedtuple
data = ['Date,Open,High,Low,Close,Volume,Adj Close',
'2014-02-12,1189.00,1190.00,1181.38,1186.69,1724500,1186.69',
'2014-02-11,1180.17,1191.87,1172.21,1190.18,2050800,1190.18',
'2014-02-10,1171.80,1182.40,1169.02,1172.93,1945200,1172.93',
'2014-02-07,1167.63,1177.90,1160.56,1177.44,2636200,1177.44',
'2014-02-06,1151.13,1160.16,1147.55,1159.96,1946600,1159.96',
'2014-02-05,1143.38,1150.77,1128.02,1143.20,2394500,1143.20',
'2014-02-04,1137.99,1155.00,1137.01,1138.16,2811900,1138.16',
'2014-02-03,1179.20,1181.72,1132.01,1133.43,4569100,1133.43']
def convert_to_named_tuples(data):
    # get the names for the named tuple
    field_names = data[0].split(",")
    # these are your two extra custom fields
    field_names.append("extra1")
    field_names.append("extra2")
    # field names can't have spaces in them (they have to be valid python identifiers,
    # and "Adj Close" isn't)
    field_names = [field_name.replace(" ", "_") for field_name in field_names]
    # you can do this as many times as you like..
    # personally I'd do it manually once at the start and just check you're getting
    # the field names you expect here...
    ShareData = namedtuple("ShareData", field_names)
    # unpack the data into the named tuples
    share_data_list = []
    for row in data[1:]:
        fields = row.split(",")
        fields += [None, None]
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# check it works..
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
Actually this is better, I think, since it converts the fields into the right types. On the downside it won't take arbitrary data...
from collections import namedtuple
from datetime import datetime
data = [...same as before...]
field_names = ["Date","Open","High","Low","Close","Volume", "AdjClose", "Extra1", "Extra2"]
ShareData = namedtuple("ShareData", field_names)
def convert_to_named_tuples(data):
    share_data_list = []
    for row in data[1:]:
        row = row.split(",")
        fields = (datetime.strptime(row[0], "%Y-%m-%d"),  # date
                  float(row[1]), float(row[2]),
                  float(row[3]), float(row[4]),
                  int(row[5]),    # volume
                  float(row[6]),  # adj close
                  None, None)     # extras
        share_data = ShareData(*fields)
        share_data_list.append(share_data)
    return share_data_list

# test
share_data_list = convert_to_named_tuples(data)
for share_data in share_data_list:
    print(share_data)
But I agree with the other posts... why use a namedtuple when you can use a class definition?
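For instance, a minimal class-based sketch using dataclasses (reusing the data list from the question; the field names and the two extras are just placeholders chosen here):

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class ShareData:
    date: datetime
    open: float
    high: float
    low: float
    close: float
    volume: int
    adj_close: float
    extra1: Optional[float] = None  # filled in by later calculations
    extra2: Optional[float] = None

rows = []
for line in data[1:]:
    d, op, hi, lo, cl, vol, adj = line.split(",")
    rows.append(ShareData(datetime.strptime(d, "%Y-%m-%d"),
                          float(op), float(hi), float(lo), float(cl),
                          int(vol), float(adj)))

rows[0].extra1 = rows[0].close - rows[0].open  # fields are mutable, unlike a namedtuple
print(rows[0])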
Any special reason why you want to use namedtuples? If you want to add fields later, maybe you should use a dictionary. If you really want to go the namedtuple way though, you could use a placeholder like:
from collections import namedtuple
field_names = data[0].replace(" ", "_").lower().split(",")
field_names += ['placeholder_1', 'placeholder_2']
Entry = namedtuple('Entry', field_names)
list_of_named_tuples = []
mock_data = [None, None]
for row in data[1:]:
    row_data = row.split(",") + mock_data
    list_of_named_tuples.append(Entry(*row_data))
If, instead, you want to parse your data into a list of dictionaries (more pythonic IMO) you should do:
field_names = data[0].split(",")
list_of_dicts = [dict(zip(field_names, row.split(','))) for row in data[1:]]
EDIT: Note that even though you may use dictionaries instead of namedtuples for the small dataset from your example, doing so with large amounts of data will translate into a higher memory footprint for your program.
Why don't you use a dictionary for the data? Adding additional keys is then easy:
dataList = []
keys = myData[0].split(',')
for row in myData[1:]:  # skip the header row
    tempdict = dict()
    for index, value in enumerate(row.split(',')):
        tempdict[keys[index]] = value
    # if your additional values are going to be determined here then
    # you can do whatever calculations you need and add them;
    # otherwise you can work with this list elsewhere
    dataList.append(tempdict)