I have a huge CSV table with thousands of rows of data, and I want to make a table of the number of times two elements occur together, divided by how many times that element appears on its own.
For example, Bitcoin appears in 8 of these rows, 2 of them together with API, and API never appears without Bitcoin. So the value for API appearing with Bitcoin is 1 (API always occurs with Bitcoin), while the value for Bitcoin appearing with API is 2/8 = 1/4.
I want something that looks like this in the end.
How can I do it with Python or any other tool?
This is a sample of the file:
This, I think, does do the job. I typed your spreadsheet into a csv by hand (would have been nice to be able to cut and paste), and the results seem reasonable.
import itertools
import csv
import numpy as np

words = {}
for row in open('input.csv'):
    parts = row.rstrip().split(',')
    for a, b in itertools.combinations(parts, 2):
        if a not in words:
            words[a] = [b]
        else:
            words[a].append(b)
        if b not in words:
            words[b] = [a]
        else:
            words[b].append(a)

print(words)

size = len(words)
keys = list(words.keys())
track = np.zeros((size, size))
for i, k in enumerate(keys):
    track[i, i] = len(words[k])
    for j in words[k]:
        track[i, keys.index(j)] += 1
        track[keys.index(j), i] += 1

print(keys)

# Scale to [0,1].
for row in range(track.shape[0]):
    track[row, :] /= track[row, row]

# Create a csv with the results.
fout = open('corresp.csv', 'w')
print(','.join([' '] + keys), file=fout)
for row in range(track.shape[0]):
    print(keys[row], file=fout, end=',')
    print(','.join(f"{track[row,i]}" for i in range(track.shape[1])), file=fout)
Here are the first few lines of the result:
,API,Backend Development,Bitcoin,Docker,Article Rewriting,Article writing,Blockchain,Content Writing,Ghostwriting,Android,Ethereum,PHP,React.js,C Programming,C++ Programming,ASIC,Digital ASIC Coding,Embedded Software,Article Writing,Blog,Copy Typing,Affiliate Marketing,Brand Marketing,Bulk Marketing,Sales,BlockChain,Business Strategy,Non-fungible Tokens,Technical Writing,.NET,Arduino,Software Architecture,Bluetooth Low Energy (BLE),C# Programming,Ada programming,Programming,Haskell,Rust,Algorithm,Java,Mathematics,Machine Learning (ML),Matlab and Mathematica,Data Entry,HTML,Circuit Designs,Embedded Systems,Electronics,Microcontroller, C++ Programming,Python
API,1.0,0.14285714285714285,0.5714285714285714,0.14285714285714285,0.0,0.0,0.2857142857142857,0.0,0.0,0.0,0.14285714285714285,0.0,0.14285714285714285,0.2857142857142857,0.2857142857142857,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Backend Development,0.6666666666666666,1.0,0.6666666666666666,0.6666666666666666,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Bitcoin,0.21052631578947367,0.05263157894736842,1.0,0.05263157894736842,0.0,0.0,0.2631578947368421,0.0,0.0,0.05263157894736842,0.10526315789473684,0.10526315789473684,0.05263157894736842,0.15789473684210525,0.21052631578947367,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.05263157894736842,0.0,0.0,0.05263157894736842,0.0,0.0,0.0,0.0,0.05263157894736842,0.05263157894736842,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Docker,0.6666666666666666,0.6666666666666666,0.6666666666666666,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I had a look at this by creating a pivot table in Excel for every combination of columns there is: AB, AC, AD, BC, BD, CD. Putting the unique entries from the first column (e.g. A) in the rows, the unique entries from the second (e.g. B) in the columns, and column A in the values area, I find all matches and the count of all matches.
This is a clunky method, but I note from the Python-based method that has been submitted that my answer is essentially no more or less clunky than that!
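For what it's worth, the same pivot-table idea can be sketched in pandas with pd.crosstab, one table per pair of columns. This is only a rough sketch and assumes the CSV loads cleanly with one tag per column (input.csv is a placeholder file name):
import itertools
import pandas as pd

# Placeholder file name; assumes every row has the same number of columns.
df = pd.read_csv('input.csv', header=None)

# One "pivot table" per pair of columns: unique values of the first column as rows,
# unique values of the second as columns, and the match counts in the cells.
for col_a, col_b in itertools.combinations(df.columns, 2):
    print(f"Columns {col_a} x {col_b}:")
    print(pd.crosstab(df[col_a], df[col_b]))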
I would like to change index [3] of every item by 10% of its current value.
Nothing I have found online has produced any results. The closest I got is below, but I couldn't see how to implement it.
for line in finalList:
    price = line[3]
    line[3] = float(line[3]) * 1.1
    line[3] = round(line[3], 2)
    print(line[3])
My products.csv looks like so:
Hardware,Hammer,10,10.99
Hardware,Wrench,12,5.75
Food,Beans,32,1.99
Paper,Plates,100,2.59
My code:
import csv

def getProductsData():
    productsfile = open('products.csv', 'r')
    # read products file
    productsreader = csv.reader(productsfile)
    products = []
    # break each line apart (unsure of meaning here)
    for row in productsreader:
        for column in row:
            print(column, end='|')
        # create and append list of the above data within a larger list
        products.append(row)
        # loop through the list and display its contents
        print()
    # add 10% to price
    # add another product to your list of lists
    products.append(['Toddler', 'Millie', '2017', '8.25'])
    print(products)
    # write the contents to a new file
    updatedfile = open('updated_products.csv', 'w')
    for line in products:
        updatedfile.write(','.join(line))
        updatedfile.write('\n')
    updatedfile.close()

getProductsData()
To add 10% to price, you would first need to make sure the column type is numeric and then you can carry out the calculation.
df:
Hardware,Hammer,10,10.99
Hardware,Wrench,12,5.75
Food,Beans,32,1.99
Paper,Plates,100,2.59
Make sure the column is numeric:
df[3]=pd.to_numeric(df[3])
Overwrite column with 10% added to values, and round to 2 decimal places:
df[3]=round(df[3]+(df[3]/10),2)
df:
0 1 2 3
0 Hardware Hammer 10 12.09
1 Hardware Wrench 12 6.32
2 Food Beans 32 2.19
3 Paper Plates 100 2.85
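Putting those two steps into a complete, minimal sketch (assuming the headerless four-column products.csv shown above, and writing the result to updated_products.csv as in the question):
import pandas as pd

# Read the headerless CSV; the price is in column 3.
df = pd.read_csv('products.csv', header=None)

# Make sure the price column is numeric, add 10%, and round to 2 decimal places.
df[3] = pd.to_numeric(df[3])
df[3] = round(df[3] + (df[3] / 10), 2)

# Write the updated rows back out without the index or a header row.
df.to_csv('updated_products.csv', index=False, header=False)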
Overview: I am working with pandas dataframes of census information; while they only have two columns, they are several hundred thousand rows long. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, several place values are blank even though their rows have a census block ID. I found that in many instances, a census block ID that is missing a place value is located within the same city as the surrounding blocks that do have one, especially when the bookend place values are the same. As shown above with index 5844 through 5847, those two blank blocks are located within the same general area as the surrounding blocks but simply seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame({'BLOCKID': [60014099004021, 60014100001000, 60014100001001, 60014100001002, 60014301012019, 60014301013000, 60014301013001, 60014301013002, 60014301013003, 60014301013004, 60014301013005, 60014301013006],
                                             'PLACEFP': [53000, '', '', 53000, 11964, '', '', '', '', '', '', 11964]})
for i in current_state_blockid_df.index:
    if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
        # Get the value before the blank
        prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
        next_place_fp = ''
        _n = 1
        # Find the end of the blank section
        while next_place_fp == '':
            next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
            if next_place_fp == '':
                _n += 1
        # If the blanks are likely in the same city, assign them the city's place value
        if prior_place_fp == next_place_fp:
            for _i in range(_n):
                current_state_blockid_df.loc[i + _i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands of rows of data. I have considered using something like a ThreadPoolExecutor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One way to speed it up slightly would be to eliminate the check for where the gap ends and simply fill it in with whatever the previous place value was before the blanks. While that may end up being my go-to, there's still a chance it's too slow, and ideally I'd like it to fill in only when the before and after values match, eliminating the possibility of a block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_SUBS']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
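For runs of several consecutive blanks, one possible variant is to forward-fill and back-fill the known values and only fill where the two agree. This is just a sketch, assuming df is the dataframe above and that blanks are stored as empty strings as in the example:
import numpy as np

# Treat empty strings as missing, then carry the last/next known place value across each gap.
place = df['PLACEFP'].replace('', np.nan)
prior = place.ffill()
subsequent = place.bfill()

# Fill a blank only when the value before the gap matches the value after it.
mask = place.isna() & (prior == subsequent)
df.loc[mask, 'PLACEFP'] = prior[mask]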
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for row in df.itertuples():
    # logic goes here; the row index is available as row.Index
Using your dataframe as defined
def fix_df(current_state_blockid_df):
    df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
    df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']

    sections = {}
    last_i = 0
    grouping = []
    for i in df_with_blanks.index:
        if i - 1 == last_i:
            grouping.append(i)
            last_i = i
        else:
            last_i = i
            if len(grouping) > 0:
                sections[min(grouping)] = {'indexes': grouping}
                grouping = []
            grouping.append(i)

    if len(grouping) > 0:
        sections[min(grouping)] = {'indexes': grouping}

    for i in sections.keys():
        sections[i]['place'] = current_state_blockid_df.loc[i - 1, 'PLACEFP']

    l = []
    for i in sections:
        for x in sections[i]['indexes']:
            l.append(sections[i]['place'])

    df_with_blanks['PLACEFP'] = l

    final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
    return final_df

df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
I have the following code, which reads a CSV file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
import pandas as pd

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})

ids = raw_data["Patient ID"].drop_duplicates()
index = 0

for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which have a specific patient ID. If you have a lot of patients, you will need to run this query a lot of times. A simpler way would be to not filter on the patient ID and just take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case, since you want the number of rows, len is what you are looking for instead of size (size gives the number of cells in the dataframe).
Finally, another source of error in your current code is that you were not accumulating the count across patient IDs: you needed to increment illnesses.at[index, "Count_of_Patientes_Having"] rather than set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
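If "Count_of_Patientes_Having" is meant to be the number of distinct patients with the disease rather than the number of matching rows, a vectorized mask plus nunique() avoids the per-patient loop entirely. A minimal sketch, assuming the raw_data, data and pd names from the question:
rows = []
for ctr in data:
    # Boolean mask of the rows whose label string contains this disease name.
    mask = raw_data['Finding Labels'].str.contains(ctr)
    rows.append({'Finding_Label': ctr,
                 'Count_of_Times_Being_Shown_In_An_Image': int(mask.sum()),
                 'Count_of_Patientes_Having': raw_data.loc[mask, 'Patient ID'].nunique()})
illnesses = pd.DataFrame(rows)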
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) with the list of illnesses to line up Finding Labels with each illness in the same row, then filter to the rows where the longer string contains the shorter item. Finally, run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd
alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
"Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
"Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(
                                 np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                 np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2