Mapping data frame descriptions based on values of multiple columns - python

I need to generate a mapping dataframe with each unique code and the description I want prioritised, but I need to do it based on a set of prioritisation options. So for example the starting dataframe might look like this:
Filename TB Period Company Code Desc. Amount
0 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 98 100
1 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1000 7 200
2 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1000 ZX -100
3 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1000 29 -200
4 3 - Foxtrot... Prior TB FOXTROT FOXTROT__1001 BA 100
5 3 - Foxtrot... Opening TB FOXTROT FOXTROT__1001 9 200
6 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 ARC -100
7 3 - Foxtrot... Closing TB FOXTROT FOXTROT__1001 86 -200
The options I have for prioritisation of descriptions are:
First, search for viable options by Period: for example Closing first, then Opening if nothing is found, then Prior.
If multiple descriptions exist in the prioritised period, take either the longest string or the first instance.
So for example, if I wanted prioritisation of Closing, then Opening, then Prior, with longest string, I should get a mapping dataframe that looks like this:
Code New Desc.
FOXTROT__1000 29
FOXTROT__1001 ARC
Just for context, I have a fairly simple way to do all this in tkinter, but it's dependent on generating a GUI of inconsistent codes and comboboxes of their descriptions, which is then used to generate a mapping dataframe.
The issue is that for large volumes (>1,000 and up to 30,000 inconsistent codes) it becomes impractical to generate a GUI, so for large volumes I need a way to auto-generate the mapping dataframe directly from the initial data, circumventing tkinter entirely.

import numpy as np
import pandas as pd

# Create a new column which ranks each row by the hierarchy of its Period value
df['NewFilterColumn'] = np.where(df['Period'] == 'Closing', 1,
                         np.where(df['Period'] == 'Opening', 2,
                          np.where(df['Period'] == 'Prior', 3, None)))

# Sort so the highest-priority period comes first within each code
df = df.sort_values(by=['NewFilterColumn', 'Code', 'Desc.'], ascending=True)
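One possible way to finish this off without tkinter (a sketch only, assuming the dataframe is called df, has the columns shown above, and that Period holds the plain values 'Closing'/'Opening'/'Prior' as in the np.where above): rank each row by period priority and description length, sort, then keep the first row per code.

priority = {'Closing': 1, 'Opening': 2, 'Prior': 3}    # lower number = higher priority
ranked = df.assign(PeriodRank=df['Period'].map(priority),
                   DescLen=df['Desc.'].astype(str).str.len())

# Best period first; within it, longest description first
ranked = ranked.sort_values(['Code', 'PeriodRank', 'DescLen'],
                            ascending=[True, True, False])

# Keep the top row per code and rename to the mapping layout
mapping = (ranked.drop_duplicates(subset='Code', keep='first')[['Code', 'Desc.']]
                 .rename(columns={'Desc.': 'New Desc.'})
                 .reset_index(drop=True))
print(mapping)

For the "first instance" option, drop the DescLen sort key and rely on the original row order instead.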

Related

plotting categorical data by row from a dataframe

I have some data showing a machine's performance. One of the columns records when the pipe it makes fails a particular quality check, causing the machine to automatically cut the pipe. Depending on the machine and the way it's set up, this happens around 1% of the time, and I am trying to make a plot that shows the failure rate against time - my theory is that the longer some of the tools have been in use, the more failures they produce.
Here is an example of the excel file the machine makes every 24 hours.
The column "Cut Event" is the one I am interested in. In the snip, the "/" symbol indicates no cut was made; when a cut is made, the cell in that column gives "speed", "ovality" or "thickness" as the reason (in German). What I want to do is go through the dataframe and only capture rows that have a failure, i.e. not a forward slash.
Here is what I have from reading through SO and other tutorials. The machine "speaks" German, by the way, hence the longer words:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#fig = plt.gcf()

df = pd.read_excel("W03 tool with cuts and dates.xlsx", dtype=object)
df = df[['Time', 'Cut_Event']]

# Relabel the German event names; assign via df.loc (rather than chained indexing)
# so the values are actually written back and no SettingWithCopyWarning is raised
df.loc[df['Cut_Event'] == 'Geschwindigkeitsschwankung', 'Cut_Event'] = 'Speed Cut Event'
df.loc[df['Cut_Event'] == 'Kugelfehler', 'Cut_Event'] = 'Kugel Cut Event'
df.loc[df['Cut_Event'] == '/', 'Cut_Event'] = 'No Cut Event'
print(df)
What I am stuck on is passing these events over to be plotted. My Python learning so far has covered plotting every value in a particular column of a numerical dataframe, rather than specific events of categorical data, and I am getting errors as a result. I tried seaborn but got nowhere.
All help genuinely appreciated.
edit: Adding the dataset
Datum WKZ_code Time Rad_t1 Not Important Cut_Event
10 Sep W03 00:00:00 100 250 /
10 Sep W03 00:00:01 100 250 /
10 Sep W03 00:00:02 100 250 /
10 Sep W03 00:00:03 100 250 /
10 Sep W03 00:00:04 100 250 /
10 Sep W03 00:00:00 100 250 Speed Cut
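One possible way to finish the filtering and plotting step (a sketch only, reusing the Time / Cut_Event columns and the relabelled values above; the hourly binning is just an illustrative choice):

failures = df[df['Cut_Event'] != 'No Cut Event']   # keep only rows that record a real cut

# Quick view: how many failures of each type occurred
failures['Cut_Event'].value_counts().plot(kind='bar')
plt.ylabel('Number of cut events')
plt.show()

# Failures over time: parse the timestamps, bin them hourly and count events per bin
failures = failures.assign(Time=pd.to_datetime(failures['Time'].astype(str)))
per_hour = failures.set_index('Time').resample('H').size()
per_hour.plot()
plt.ylabel('Cut events per hour')
plt.show()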

Pandas Dataframe as Parameters within Pyomo optimisation model

I'm new to Pyomo and trying to utilise data from my pandas dataframe as parameters within the optimisation model. The dataframe looks like this:
Ticker Margin Avg. Volume M_ratio V_ratio
Index
0 ES1 6600.00 1250970 0.126036 0.212996
1 TY1 1150.00 1232311 0.021961 0.209819
2 FV1 700.00 488906 0.013367 0.083244
3 TU1 570.00 293885 0.010885 0.050038
4 ED3 500.00 137802 0.009548 0.023463
5 NQ1 7500.00 427061 0.143223 0.072713
6 FDAX1 24074.12 98838 0.459728 0.016829
7 FESX1 2641.28 832836 0.050439 0.141803
8 FGBL1 2502.75 546878 0.047793 0.093114
9 FGBM1 1042.10 330517 0.019900 0.056275
10 FGBS1 262.97 232801 0.005022 0.039638
11 F2MX1 4822.81 398 0.092098 0.000068
The model I'm constructing aims to find the maximum contracts one may have in all assets based on balance and a number of constraints.
I need to iterate through the rows in order to add all the relevant data to model.utilisation
from pyomo.environ import *

model = ConcreteModel()
model.Vw = Param()  # <- V_ratio from df
model.M = Param()   # <- Margin from df
model.L = Var(domain=NonNegativeReals)
model.utilisation = Objective(expr=model.M * model.L, sense=maximize)
Effectively it needs to take in the Margin for each ticker and determine how many contracts you can hold relative to the balance, i.e.
(ES1 Margin * model.L) + (TY1 Margin * model.L) and so on throughout the dataframe.
I've tested the logic by plugging in dummy data and it seems to work, but it isn't practical to write in each piece of data by hand and add it to the utilisation objective, as I have hundreds of rows in my dataframe.
Apologies if there are some glaring errors; I'm very new to Pyomo.
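One way this could be wired up (a sketch only, not a tested answer: it assumes the dataframe is called df, has the columns shown above, and that a balance value is available):

from pyomo.environ import (ConcreteModel, Set, Param, Var, Objective,
                           Constraint, NonNegativeIntegers, maximize)

balance = 1_000_000  # hypothetical available balance

model = ConcreteModel()
model.T = Set(initialize=df['Ticker'].tolist())

# Indexed parameters built straight from the dataframe columns
model.M = Param(model.T, initialize=dict(zip(df['Ticker'], df['Margin'])))
model.Vw = Param(model.T, initialize=dict(zip(df['Ticker'], df['V_ratio'])))

# One contract-count variable per ticker
model.L = Var(model.T, domain=NonNegativeIntegers)

# Maximise total margin utilisation across all tickers...
model.utilisation = Objective(expr=sum(model.M[t] * model.L[t] for t in model.T),
                              sense=maximize)

# ...subject to the total not exceeding the available balance
model.balance_limit = Constraint(expr=sum(model.M[t] * model.L[t] for t in model.T) <= balance)

Any further constraints (for example on V_ratio) can be added the same way using model.Vw[t].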

Pandas very slow query

I have the following code, which reads a CSV file and then analyzes it. A patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
import pandas as pd

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')
data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]
illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()
index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of the raw_data dataframe:
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
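To illustrate the difference with a toy frame:

demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(len(demo))   # 3 -> number of rows
print(demo.size)   # 6 -> rows * columns, i.e. the number of cells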
Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
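A further vectorized possibility (my own sketch, assuming the labels are exact names separated by "|" as in the sample above, so substring matching isn't actually needed):

dummies = raw_data['Finding Labels'].str.get_dummies(sep='|')         # one 0/1 column per label
image_counts = dummies.sum()                                           # rows (images) per illness
patient_counts = dummies.groupby(raw_data['Patient ID']).max().sum()   # distinct patients per illness
illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': image_counts,
                          'Count_of_Patientes_Having': patient_counts})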
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. This is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized operations.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses, lining up Finding Labels with each illness in the same row, then filter to the rows where the longer string contains the shorter item. Finally, run a couple of groupby() calls to return the count and the distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(
                                 np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                 np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })
print(raw_data.head(10))
# Patient ID Finding Labels
# 0 r xPNPneumothoraxXYm
# 1 python ScSInfiltration9Ud
# 2 stata tJhInfiltrationJtG
# 3 r thLPneumoniaWdr
# 4 stata thYAtelectasis6iW
# 5 sas 2WLPneumonia1if
# 6 julia OPEConsolidationKq0
# 7 sas UFFCardiomegaly7wZ
# 8 stata 9NQHerniaMl4
# 9 python NB8HerniapWK
Output (after running above process)
print(illnesses)
# Count_of_Times_Being_Shown_In_An_Image Count_of_Patients_Having
# ills
# Atelectasis 3 1
# Cardiomegaly 2 1
# Consolidation 1 1
# Effusion 1 1
# Emphysema 1 1
# Fibrosis 2 2
# Hernia 4 3
# Infiltration 2 2
# Mass 1 1
# Nodule 2 2
# Pleural_Thickening 1 1
# Pneumonia 3 3
# Pneumothorax 2 2

Tabulate according to terminal width in python?

I have some tabular data with some long fields. Pandas will cut off some of the long fields like this:
shortname title \
0 shc Shakespeare His Contemporaries
1 folger-shakespeare Folger Shakespeare Library Digital Texts
2 perseus-c-greek Perseus Canonical Greek
3 stanford-1880s Adult British Fiction of the 1880s, Assembled ...
4 reuters-21578 Reuters-21578
5 ecco-tcp Eighteenth Century Collections Online / Text C...
centuries
0 16th, 17th
1 16th, 17th
2 NaN
3 NaN
4 NaN
5 18th
and if I use tabulate.tabulate(), it looks like this:
- ------------------ -------------------------------------------------------------------------- ----------
0 shc Shakespeare His Contemporaries 16th, 17th
1 folger-shakespeare Folger Shakespeare Library Digital Texts 16th, 17th
2 perseus-c-greek Perseus Canonical Greek nan
3 stanford-1880s Adult British Fiction of the 1880s, Assembled by the Stanford Literary Lab nan
4 reuters-21578 Reuters-21578 nan
5 ecco-tcp Eighteenth Century Collections Online / Text Creation Partnership ECCO-TCP 18th
- ------------------ -------------------------------------------------------------------------- ----------
In the first case, the width is set to around 80, I'm guessing, and doesn't expand to fill the terminal window. I would like the columns "shortname," "title," and "centuries" to be on the same line, so this doesn't work.
In the second case, the width is set to the width of the data, but that won't work if there's a very long title, and if the user has a smaller terminal window, it will wrap really strangely.
So what I'm looking for is a (preferably easy) way in Python to pretty-print tabular data according to the user's terminal width, or at least allow me to specify the user's terminal width, which I will get elsewhere, like tabulate(data, 120) for 120 columns. Is there a way to do that?
I figured it out with a little poking around the pandas docs. This is what I'm doing now:
table = df[fields]
width = pandas.util.terminal.get_terminal_size() # find the width of the user's terminal window
pandas.set_option('display.width', width[0]) # set that as the max width in Pandas
print(table)
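Note that pandas.util.terminal is not available in every pandas version; a version-independent sketch of the same idea using only the standard library (shutil) would be:

import shutil
import pandas

width = shutil.get_terminal_size().columns   # falls back to 80 columns if it can't be detected
pandas.set_option('display.width', width)    # let pandas use the full terminal width
print(df[fields])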

Python csv multiprocessing load to dictionary or list

I'm moving my algorithms from MATLAB to Python, and I'm stuck on parallel processing.
I need to process a very large number of CSVs (1 to 1M files) with a large number of rows (10k to 10M) and 5 independent data columns.
I already have code that does this, but with only one processor, loading the CSVs into a dictionary in RAM takes about 30 minutes (~1k CSVs of ~100k rows each).
The file names are in a list loaded from a CSV (this is already done):
Amp Freq Offset PW FileName
3 10000.0 1.5 1e-08 FlexOut_20140814_221948.csv
3 10000.0 1.5 1.1e-08 FlexOut_20140814_222000.csv
3 10000.0 1.5 1.2e-08 FlexOut_20140814_222012.csv
...
And the CSV in the form: (Example: FlexOut_20140815_013804.csv)
# TDC characterization output file , compress
# TDC time : Fri Aug 15 01:38:04 2014
#- Event index number
#- Channel from 0 to 15
#- Pulse width [ps] (1 ns precision)
#- Time stamp rising edge [ps] (500 ps precision)
#- Time stamp falling edge [ps] (500 ps precision)
##Event Channel Pwidth TSrise TSfall
0 6 1003500 42955273671237500 42955273672241000
1 6 1003500 42955273771239000 42955273772242500
2 6 1003500 42955273871241000 42955273872244500
...
I'm looking for something like MATLAB's 'parfor' that takes each name from the list, opens the file and puts the data into a list of dictionaries.
It's a list because the files have an order (PW), but in the examples I've found it seems more complicated to do this, so first I will try to put it in a dictionary and afterwards I will arrange the data into a list.
Now I'm starting with the multiprocessing examples on the web:
Writing to dictionary of objects in parallel
I will post updates when I have a piece of "working" code.
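A minimal sketch of that idea with multiprocessing.Pool (the file-list CSV name and the column names here are assumptions based on the snippets above):

import pandas as pd
from multiprocessing import Pool

COLUMNS = ['Event', 'Channel', 'Pwidth', 'TSrise', 'TSfall']

def load_one(filename):
    # the '#' comment lines cover the header block; data columns are whitespace-separated
    df = pd.read_csv(filename, comment='#', sep=r'\s+', names=COLUMNS)
    return filename, df

if __name__ == '__main__':
    file_list = pd.read_csv('file_list.csv')   # hypothetical name of the CSV holding the file names
    names = file_list['FileName'].tolist()

    with Pool() as pool:                        # one worker process per CPU by default
        results = pool.map(load_one, names)     # preserves the order of `names`

    data = dict(results)                        # filename -> DataFrame

Since pool.map preserves the input order, the results can also be kept as a list directly, already ordered by PW if the file list is.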
