I have a set of CSV data, and I need to calculate the total quantity and total profit in Visual Studio Code. Some code has already been provided to me, and I need to build the calculation on top of it. Only the profit in column N (row[13]) and the quantity in column O (row[14]) should be part of the calculation.
the data in CSV
This is the code provided to me:
import csv
from pathlib import Path

fp = Path.cwd() / "superstore_transaction.csv"
with fp.open(mode="r", encoding="UTF-8", newline="") as file:
    reader = csv.reader(file)
    next(reader)  # skip the header row
    cluster1 = []
    cluster2 = []
    cluster3 = []
    for row in reader:
        # row[4] is the cluster name, row[13] the profit, row[14] the quantity
        if row[4] == "Cluster 1":
            cluster1.append([row[13], row[14]])
        elif row[4] == "Cluster 2":
            cluster2.append([row[13], row[14]])
        else:
            cluster3.append([row[13], row[14]])
I tried using a for loop, but it doesn't work. I think I am just confused by the code that was already provided, and I am limited to a small set of code constructs that I can use to calculate the total profit.
It looks like the profit column has percentage values, so you need to remove the % symbol before the values can be converted to int or float for the calculation.
Here is an example for cluster1; do the same for the others, either manually or in a loop:
cluster1_profit = sum(float(sub_c1[0].replace('%', '')) for sub_c1 in cluster1)
cluster1_quantity = sum(int(sub_c1[1]) for sub_c1 in cluster1)
Here I simply summed the profits and quantities, so adjust this depending on how you want to calculate your total.
There are easier ways to do this with pandas or numpy.
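If you want to stay with plain Python, the same idea can be wrapped in a small helper and applied to all three clusters in one loop. This is only a sketch: the function name `cluster_totals` and the sample rows are made up for illustration, and it assumes every profit value is a number optionally followed by `%`.

```python
def cluster_totals(rows):
    """Return (total_profit, total_quantity) for a list of [profit, quantity] string pairs."""
    total_profit = 0.0
    total_quantity = 0
    for profit_text, quantity_text in rows:
        # Strip the percent sign before converting the profit to a float
        total_profit += float(profit_text.replace("%", "").strip())
        total_quantity += int(quantity_text)
    return total_profit, total_quantity


# Hypothetical sample rows standing in for cluster1, cluster2 and cluster3
clusters = {
    "Cluster 1": [["10%", "2"], ["5.5%", "3"]],
    "Cluster 2": [["-2%", "1"]],
    "Cluster 3": [],
}

totals = {name: cluster_totals(rows) for name, rows in clusters.items()}
for name, (profit, quantity) in totals.items():
    print(name, "profit:", profit, "quantity:", quantity)
```

With the real data you would pass `cluster1`, `cluster2` and `cluster3` from the provided code instead of the sample lists.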
I have a data set of customers and products, and I would like to know which combinations of products are most popular among customers and display that in a table (like a traditional mileage chart, or some other neat way).
Example dataset:
Example output:
I am able to tell that the most popular combination of products is P1 with P2 and the least popular is P1 with P3. My actual dataset is of course much larger in terms of customers and products.
I'd also be keen to hear any ideas on better output visualisations, especially as I can't figure out how best to display popular 3-way or 4-way combinations.
Thank you
I have a full code example that may work for what you are doing... or at least give you some ideas on how to move forward.
This script uses openpyxl to read the data from the first sheet. It builds a dictionary that maps each customer to a string of that customer's product combination. The combinations are then counted and the counts are written to a second sheet (see image).
Results:
The Code:
from openpyxl import load_workbook
from collections import Counter

# Load the workbook, the sheet to read from and the sheet to write to
data_wb = load_workbook(filename=r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_ws = data_wb['Sheet1']
results_ws = data_wb['Sheet2']

# Find the last used row in each sheet
data_max_rows = data_ws.max_row
results_max_rows = results_ws.max_row

# Collect each customer's products into one "_"-joined string
customer_dict = {}
for row in data_ws.iter_rows(min_row=2, max_col=2, max_row=data_max_rows):
    # Assumed layout: column A holds the customer, column B the product
    row_list = [cell.value for cell in row]
    customer_cell = row_list[0]
    product_cell = row_list[1]

    if customer_cell not in customer_dict:
        # First product seen for this customer
        customer_dict[customer_cell] = product_cell
    else:
        # Append to this customer's own string, not the last one processed
        customer_dict[customer_cell] += "_" + product_cell

# Count how often each combination string occurs
count_dict = Counter(customer_dict.values())

# Column titles
results_ws.cell(1, 1).value = "Combination"
results_ws.cell(1, 2).value = "Occurrences"

# Write the counts into the results sheet
count = 2
for key, value in count_dict.items():
    results_ws.cell(count, 1).value = key
    results_ws.cell(count, 2).value = value
    count += 1

data_wb.save(filename=r"C:\\Users\---Place your loc here---\SO_excel.xlsx")
data_wb.close()
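For the mileage-chart style output and the 3-way or 4-way question, one possible approach is to count every product combination of a given size per customer with `itertools.combinations`. The sketch below is not part of the script above: the `purchases` list is hypothetical sample data, and `combo_counts` is a helper name invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical (customer, product) rows; the real data would come from the sheet
purchases = [
    ("C1", "P1"), ("C1", "P2"),
    ("C2", "P1"), ("C2", "P2"), ("C2", "P3"),
    ("C3", "P2"), ("C3", "P3"),
]

# Collect the set of products bought by each customer
products_by_customer = {}
for customer, product in purchases:
    products_by_customer.setdefault(customer, set()).add(product)


def combo_counts(products_by_customer, size):
    """Count how many customers bought each combination of `size` products."""
    counts = Counter()
    for products in products_by_customer.values():
        # Sorting makes (P1, P2) and (P2, P1) the same key
        counts.update(combinations(sorted(products), size))
    return counts


pair_counts = combo_counts(products_by_customer, 2)
triple_counts = combo_counts(products_by_customer, 3)

for combo, count in pair_counts.most_common():
    print(" + ".join(combo), count)
```

Because the keys are sorted tuples, the pair counts can be laid out as a triangular table (one product per row and column), while larger combinations are probably easier to read as a ranked list from `most_common()`.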
I'm sorry if this question has already been asked, but I believe my challenge is specific enough. I'm not looking for complete answers, just guidelines on how I can proceed.
I have a raw dataset of monitored participants. The data includes things like income and savings, and the participants were tracked for 6 months (Jan to Jun). The data is stored in a single Excel file with a column specifying the month, which means that each participant's name appears 6 times in the file, once for each month. Each participant has a unique ID.
I want to transform this data into a more workable shape, and I wanted to learn to do it with Python. But I feel stuck and rusty because it's been ages since I've coded, and I'm only used to the code I use on a regular basis (printing grouped averages, etc.). Here are the steps I want to follow:
a. Start by creating a column which contains a unique list of participants that have been tracked using the ID. Each participant has to be cited once only;
b. Each participant is recorded with an activity and sub-activity type in the original file, which will need to be added to the new dataset as well;
c. For the month of January for example, I want to create a 'january_income' column in which the income from january has been dragged from the raw dataset, and so on for each variable and each month.
Can anyone provide guidelines on how I may proceed? As I said, it doesn't have to be specific code; it can be methods or steps along with the functions I can use.
Thanks a lot in advance.
N.B: I use Spyder as a working environment.
Your question is not specific. But you can try and adjust the code below:
import csv

"""
Convert your Excel file to CSV format first.
This sample assumes that you have a CSV file whose first row is the header (field names).
"""

# Create a small sample input file for the demonstration
with open('test.csv', 'w') as fp:
    fp.write("""ID,Name,Income,Savings,Month
1,"Sample Name",1000,100,1
""")

def format(infile='infile.csv', outfile='outfile.csv'):
    months = ['January', 'February', 'March']  # Add specific months
    target_fields = ['Income', 'Savings']  # Add your desired fields
    timestamp_field = 'Month'  # The field which indicates the month of the row
    ID_field = 'ID'  # The field which holds the unique identifier of the participant
    # The fields which are specific to each participant; these are copied unchanged
    part_specific_fields = [ID_field, 'Name']
    target_combined_fields = [f'{month}_{field}' for field in target_fields for month in months]
    total_fields = part_specific_fields + target_combined_fields
    temp = {}
    with open(infile, 'r', newline='') as fpi, open(outfile, 'w', newline='') as fpo:
        reader = csv.DictReader(fpi)
        for row in reader:
            ID = int(row[ID_field])
            if ID not in temp:
                temp[ID] = {}
                for other_field in part_specific_fields:
                    # Insert the constant columns that should not be touched
                    temp[ID][other_field] = row[other_field]
            month_pos = int(row[timestamp_field]) - 1  # subtract 1 for 0 indexing
            month = months[month_pos]  # Month name in plain English
            for field in target_fields:
                temp[ID][f'{month}_{field}'] = row[field]
        # All the processing is complete; now write the data.
        # Months missing for a participant are written as empty cells.
        writer = csv.DictWriter(fpo, fieldnames=total_fields)
        writer.writeheader()
        for row in temp.values():
            writer.writerow(row)
    # The file has been written successfully; return the mapped dictionary
    return temp

print(format('test.csv'))
First, you have to convert your .xls file to .csv format.
Then process each row and map its values to the specific <month>_<field> keys.
Finally, write the processed data to the outfile.csv file.
Thanks for the notes. First of all, I'm sorry if my post was not specific, and thanks for introducing me to the community. Since my initial post, I've made some effort to work on my data, and with my current knowledge of the language, all I could come up with was the filtering code shown below. This gives me a column for each variable of each month, but I'm stuck on two things. First, I had to repeat this code for each month and change the months in the labels. I wouldn't have minded that approach if I didn't have to face another problem: it doesn't take into account the fact that some participants were not tracked in certain months. Even if the data is sorted by ID number, the columns are mismatched because their lengths vary with the number of participants tracked in each month. Now I'm looking to improve this code by adding a line that would resolve my second issue (at this point I don't mind if the code is long, but if any optimization can be made, I'm open to it as well):
import os
import pandas as pd

os.chdir("XXXXXXX")
economique = pd.read_csv('data_economique.csv')
# JANUARY
ID_jan = economique.query("mois_de_suivi == 'Janvier'")["ID"]
nom_jan = economique.query("mois_de_suivi == 'Janvier'")["nom"]
sexe_jan = economique.query("mois_de_suivi == 'Janvier'")["sexe"]
district_jan = economique.query("mois_de_suivi == 'Janvier'")["district"]
activite_jan = economique.query("mois_de_suivi == 'Janvier'")["activite"]
CA_jan = economique.query("mois_de_suivi == 'Janvier'")["chiffre_affaire"]
charges_jan = economique.query("mois_de_suivi == 'Janvier'")["charges"]
resultat_jan = economique.query("mois_de_suivi == 'Janvier'")["benefice"]
remb_attendu_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_attendu"]
remb_effectue_jan = economique.query("mois_de_suivi == 'Janvier'")["remb_effectue"]
remb_differe_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_remb_differe"]
epargne_jan = economique.query("mois_de_suivi == 'Janvier'")["calcul_epargne"]
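One way to avoid both the repetition and the misaligned columns is to build one record per ID and pre-fill every monthly column with an empty value, so that untracked months leave a gap instead of shifting the data. The sketch below uses only the standard library; the sample rows, the month spellings and the column suffixes are hypothetical, and only a few of the column names from the code above are used. In pandas, `DataFrame.pivot` with `index='ID'` and `columns='mois_de_suivi'` may give a similar result.

```python
import csv
import io

# Hypothetical sample using a few of the column names from the question
raw_csv = """ID,nom,mois_de_suivi,chiffre_affaire,charges
1,Alice,Janvier,100,40
1,Alice,Fevrier,120,50
2,Bob,Fevrier,80,30
"""

months = {"Janvier": "jan", "Fevrier": "fev"}    # month name -> column suffix
fixed_fields = ["ID", "nom"]                     # constant per participant
monthly_fields = ["chiffre_affaire", "charges"]  # one column per month

wide = {}
for row in csv.DictReader(io.StringIO(raw_csv)):
    record = wide.get(row["ID"])
    if record is None:
        # Start every participant with all monthly columns set to None,
        # so untracked months stay empty instead of shifting the columns
        record = {field: row[field] for field in fixed_fields}
        for field in monthly_fields:
            for suffix in months.values():
                record[f"{field}_{suffix}"] = None
        wide[row["ID"]] = record
    suffix = months[row["mois_de_suivi"]]
    for field in monthly_fields:
        record[f"{field}_{suffix}"] = float(row[field])

rows = list(wide.values())
for record in rows:
    print(record)
```

Every record ends up with the same set of keys, so the rows can be written out with `csv.DictWriter` or passed to `pd.DataFrame(rows)` without any length mismatch.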
I have a list of items (each row has the following: item number, lot number, description, total quantity). If a certain lot number exists twice in my list, I add the quantities of both rows together. "data" is my original list. "max_item" is the maximum number of times an item occurs in "data". I created a new list (one_lot_per_row_list) and have appended my updated rows to it, but I also need to add the rows from "data" that did not have duplicate lots. Or I need to remove the row that was not updated from "data" (data[i+1+j]) in my code below. I'm not sure if the best approach is to create a new list or to remove rows from my original. Hopefully this makes sense! All help is very much appreciated!
Example list below -- The final 2 rows have the same Internal Lot number. I would like to add their Total Available quantities together, and then remove the row that was not updated.
Part  Internal Lot Number  Description  Total available  Expiration Date  Location
0001  QLN03867             P            2                3/31/2025        FRZ06 Half 1
0002  QLN03923             A            15               4/30/2023        F01-S01-05
0002  QLN03469             A            3                9/30/2022        F01-S03-02
0003  QLN03924             G            15               9/30/2022        F01-S01-05
0003  QLN03470             G            2                9/30/2022        F01-S01-02
0004  QLN03466             U            4                10/31/2022       F01-S03-02
0005  QLN03925             C            10               4/30/2023        F01-S01-02
0005  QLN03471             C            2                9/30/2022        F01-S01-02
0006  QLN03468             R            5                7/31/2021        F01-S03-02
0007  QLN03994             I            2                4/13/2025        F01-S03-03
0007  QLN03994             I            1                4/13/2025        F01-S03-02
data = []
for row in csv_reader:
    azpn = row[0]
    azln = row[1]
    description = row[2]
    location = row[5]
    date = datetime.strptime(row[4], '%m/%d/%Y')
    total_available = int(row[3])
    data.append([azpn, azln, description, total_available, date, location])

one_lot_per_row_list = []
i = 0
j = 1
for i in range(len(data) - max_item):
    # if the lot number of row i is equal to the lot number of row i + j
    for j in range(max_item):
        if data[i][1] == data[i+1+j][1]:
            # add total available of data[i] to row data[i+1+j]
            data[i][3] += data[i+1+j][3]
            # append the new row to one_lot_per_row_list
            one_lot_per_row_list.append(data[i])
        j += 1
    i += 1
You could pursue your approach or go for a more elegant method as follows:
Sort by lot number
Group by lot number
Use the reduce function to merge the items in each group.
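The three steps above could be sketched as follows with `itertools.groupby` and `functools.reduce`. The sample rows are taken from the end of the list in the question (with the date left as a string for brevity), and `merge_rows`, `LOT` and `QUANTITY` are names invented for this example.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Rows: [part, lot, description, total_available, expiration, location]
data = [
    ["0006", "QLN03468", "R", 5, "7/31/2021", "F01-S03-02"],
    ["0007", "QLN03994", "I", 2, "4/13/2025", "F01-S03-03"],
    ["0007", "QLN03994", "I", 1, "4/13/2025", "F01-S03-02"],
]

LOT = 1        # index of the lot number
QUANTITY = 3   # index of the total available quantity


def merge_rows(first, second):
    """Combine two rows of the same lot, keeping the first row's other fields."""
    merged = list(first)
    merged[QUANTITY] = first[QUANTITY] + second[QUANTITY]
    return merged


# groupby only merges adjacent rows, so sort by lot number first
sorted_rows = sorted(data, key=itemgetter(LOT))
one_lot_per_row_list = [
    reduce(merge_rows, group)
    for _, group in groupby(sorted_rows, key=itemgetter(LOT))
]

for row in one_lot_per_row_list:
    print(row)
```

This builds a new list and leaves `data` untouched, which sidesteps the question of removing rows from the original while iterating over it. Note that the merged row keeps the location of the first row in each lot.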
IIUC, you can do this very easily via pandas. The alternative is itertools groupby.
Here's one way via pandas:
Group by lot number and transform the Total available column into the per-lot sum.
Drop the duplicates based on subset = ['Internal Lot Number', 'Total available'].
Finally, save the CSV file via to_csv.
import pandas as pd
df = pd.read_csv('your csv file path here')
# Column names follow the header shown in the question
lot_totals = df.groupby('Internal Lot Number')['Total available'].transform('sum')
df.assign(**{'Total available': lot_totals}).drop_duplicates(
    ['Internal Lot Number', 'Total available']).to_csv('output csv file path here')
I am working on an assignment, but I am stuck and I do not know how to proceed.
I need to group the data by the categories named in the first line of the txt file and calculate averages over every numerical value. The program has to keep working flawlessly when I add new lines to the txt file.
Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No
Music/Automotive/Game;US;3249;5;Mon;0,01;0,01;No
This is the text file. I tried to make different categories out of the lines, but I do not know if I did it correctly, or how to let Python know that it has to calculate over all the numbers from one group.
with open('bijlage2.txt') as bestand:
    maak_er_lists_van = [(line.strip()).split(';') for line in bestand]
    keys = maak_er_lists_van[0]
    lijst = list(zip([keys]*len(maak_er_lists_van[1:]),
                     maak_er_lists_van[1:]))
    x = [zip(i[0], i[1]) for i in lijst]
    maak_dict = [dict(i) for i in x]
    for i in maak_dict:
        categorieen = [i['Category'], i['currency'], i['sellerRating'],
                       i['Duration'], i['endDay'], i['ClosePrice'], i['OpenPrice'],
                       i['Competitive?']]
        categorieen = list(map(int, categorieen))
This is what I have so far. I am a Python beginner so the whole text file thing is new to me. Can somebody help me or explain what I have to do so that I can work further on this project? Many thanks in advance!
Here's how I would do it. I had to use locale.atof() because where I am, '.' is used as the decimal point rather than ','. You may have to change the locale setting as indicated in the comments.
The csv module is used to read the file, and the averages are computed in a two-step process. First the values for each category are summed, and then afterwards, the average value of each one is calculated based on the number of values read.
import csv
import locale
from pprint import pprint, pformat

#locale.setlocale(locale.LC_ALL, '')  # empty string for platform's default settings
# Following used for testing to force ',' to be considered as a decimal point.
locale.setlocale(locale.LC_ALL, 'French_France.1252')

avg_names = 'sellerRating', 'Duration', 'ClosePrice', 'OpenPrice'
averages = {avg_name: 0 for avg_name in avg_names}  # Initialize.

# Find total of each category of interest.
num_values = 0
with open('bijlage2.txt', newline='') as bestand:
    csvreader = csv.DictReader(bestand, delimiter=';')
    for row in csvreader:
        num_values += 1
        for avg_name in avg_names:
            averages[avg_name] += locale.atof(row[avg_name])

# Calculate average of each summed value.
for avg_name, total in averages.items():
    averages[avg_name] = total / num_values

print('raw results:')
pprint(averages)
print()  # Formatted output
print('Averages:')
for avg_name in avg_names:
    rounded = locale.format_string('%.2f', round(averages[avg_name], 2),
                                   grouping=True)
    print('  {:<13} {:>10}'.format(avg_name, rounded))
Output:
raw results:
{'ClosePrice': 0.01, 'Duration': 5.0, 'OpenPrice': 0.01, 'sellerRating': 3249.0}
Averages:
sellerRating 3 249,00
Duration 5,00
ClosePrice 0,01
OpenPrice 0,01
Your way of reading the file and creating a dictionary with the categories and values is fine, imo. Your list maak_dict contains one dictionary for every line. To calculate an average for one category, you could do something like this:
def calc_average(categ):
    # The values are read as strings with a decimal comma, so convert them first
    values = [float(i[categ].replace(',', '.')) for i in maak_dict]
    average = sum(values)/len(values)
    return average
This assumes that you want to calculate the arithmetic mean. categ has to be a string naming a numeric column.
After that, you can create a new dictionary that contains all the averages:
new_dict = {}
# Only the numeric columns can be averaged
for category in ['sellerRating', 'Duration', 'ClosePrice', 'OpenPrice']:
    avg = calc_average(category)
    new_dict[category] = avg
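If the averages are needed per value of the Category column (one set of averages for Music/Movie/Game, another for Music/Automotive/Game, and so on), the sums and counts can be kept per group. The following is a rough sketch rather than a drop-in solution: it reads from an in-memory sample with made-up numbers instead of bijlage2.txt, and `to_float` and `numeric_fields` are names chosen for the example.

```python
import csv
import io
from collections import defaultdict

# Sample in the same format as the question's file (the values are made up)
sample = """Category;currency;sellerRating;Duration;endDay;ClosePrice;OpenPrice;Competitive?
Music/Movie/Game;US;3249;5;Mon;0,01;0,01;No
Music/Movie/Game;US;3251;7;Mon;0,03;0,01;No
Music/Automotive/Game;US;100;5;Mon;1,50;0,50;No
"""

numeric_fields = ["sellerRating", "Duration", "ClosePrice", "OpenPrice"]


def to_float(text):
    """Convert a decimal-comma string such as '0,01' to a float."""
    return float(text.replace(",", "."))


# sums[category][field] accumulates the total; counts[category] the row count
sums = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for row in csv.DictReader(io.StringIO(sample), delimiter=";"):
    counts[row["Category"]] += 1
    for field in numeric_fields:
        sums[row["Category"]][field] += to_float(row[field])

averages = {
    category: {field: sums[category][field] / counts[category]
               for field in numeric_fields}
    for category in sums
}

for category, values in averages.items():
    print(category, values)
```

Because nothing about the categories is hard-coded, new lines (including new category names) are picked up automatically; to use the real file, replace `io.StringIO(sample)` with the opened file object.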
I have two CSV files with two columns each (10 years of daily data):
time,value
19800101,0.15
.
.
.
I used the following to read the data into lists a and b:
import csv
a = []
# Text mode with newline='' is what the csv module expects in Python 3
with open('data.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        a.append([row[0], row[1]])
List b is read the same way. I want to get the mean of each month in list a, and if it falls below 0.01, remove all daily values belonging to that month and output a new list. I also want the corresponding daily values to be removed from list b to produce a new list for it. Both lists a and b are of equal length with the same time steps.
Any suggestions would be appreciated.
For example:
a = [0.14,1.12........] # daily values (say have 2-years = 730 values)
b = [0.11,0.005,......] # daily values (say have 2-years = 730 values)
If March and April have a monthly mean of less than 0.01 in list a, then I would get the following lists, with the daily values for these months removed:
a_new = [0.14,1.12,.....] # daily values (669 values)
b_new = [0.11,0.005,....] # daily values (669 values)
Well, this may not be the best-looking or most efficient solution, but let me know how it works.
import csv
import numpy

time = []
data_a = []
data_b = []
#--------------------Read in a--------------------
with open('data_a.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        time.append(row[0])
        data_a.append(float(row[1]))
#--------------------Read in b--------------------
with open('data_b.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        data_b.append(float(row[1]))
data_a = numpy.array(data_a)
data_b = numpy.array(data_b)
monthly = numpy.zeros(data_a.shape)
#-----------------Get month means-----------------
for ii in range(len(time)):
    tt = time[ii]
    if ii == 0:
        month_old = tt[4:6]
        index_start = ii
    else:
        #----------------new month----------------
        month = tt[4:6]
        if month != month_old:
            month_mean = numpy.mean(data_a[index_start:ii])
            print('mean for month', month_old, month_mean)
            monthly[index_start:ii] = month_mean
            month_old = month
            index_start = ii
    #----------------Get the last month----------------
    if ii == len(time)-1:
        month_mean = numpy.mean(data_a[index_start:])
        print('mean for month', month_old, month_mean)
        monthly[index_start:] = month_mean
#-------------------Filter data-------------------
# [0] takes the index array out of the tuple so the results stay 1-D
index = numpy.where(monthly >= 0.01)[0]
data_a_filtered = numpy.take(data_a, index)
data_b_filtered = numpy.take(data_b, index)
time_filtered = numpy.take(time, index)
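For comparison, here is a shorter sketch of the same filter without numpy. It differs from the code above in one respect: it groups by the first six characters of the timestamp (year and month) instead of detecting month changes, so it does not rely on the rows being in date order. The function name `filter_low_months` and the sample values are made up for the example.

```python
from collections import defaultdict


def filter_low_months(times, a, b, threshold=0.01):
    """Drop every day whose month has a mean of `a` below `threshold`.

    `times` holds YYYYMMDD strings; the same days are dropped from `a` and `b`.
    """
    # Group the values of a by year and month (the first six characters)
    by_month = defaultdict(list)
    for stamp, value in zip(times, a):
        by_month[stamp[:6]].append(value)
    # Months whose mean reaches the threshold are kept
    keep = {month for month, values in by_month.items()
            if sum(values) / len(values) >= threshold}
    kept = [(t, x, y) for t, x, y in zip(times, a, b) if t[:6] in keep]
    times_new = [t for t, _, _ in kept]
    a_new = [x for _, x, _ in kept]
    b_new = [y for _, _, y in kept]
    return times_new, a_new, b_new


# Hypothetical sample: January 1980 passes, February 1980 does not
times = ["19800101", "19800102", "19800201", "19800202"]
a = [0.14, 1.12, 0.001, 0.002]
b = [0.11, 0.005, 0.3, 0.4]

times_new, a_new, b_new = filter_low_months(times, a, b)
print(times_new, a_new, b_new)
```

The three returned lists stay aligned with each other, which is the property the question asks for.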