I have a data bank of PDF files that I've downloaded through web scraping. I can extract the tables from these PDF files and visualise them in a Jupyter notebook like this:
import os
import camelot.io as camelot

n = 1
arr = os.listdir(r'D:\Test')  # arr is the list of PDF file names
for item in arr:
    tables = camelot.read_pdf(os.path.join(r'D:\Test', item), pages='all', split_text=True)
    print(f'DATENBLATT {n}: {item}\n')
    n += 1
    for tabs in tables:
        print(tabs.df, "\n==============================================================================\n")
This way I get the results for the two PDF files in the data bank shown below (PDF 1, PDF 2).
Now I would like to ask how I can get only specific data from tables that contain, for example, "Voltage" and "Current" info. More specifically, I would like to extract user-defined or targeted info and make charts with these values instead of printing the tables as a whole.
Thanks in advance.
DATENBLATT 1: HY-Energy-Plus-Peak-Pack-HYP-00-2972-R2.pdf
0 1
0 Part Number HYP-00-2972
1 Voltage Nominal 51.8V
2 Voltage Range Min/Max 43.4V/58.1V
3 Charge Current 160A maximum \nDe-rated by BMS message over CA...
4 Discharge Current 300A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 5.76kWh/111.4Ah
6 Maximum Energy Density 164Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 300.5mm
9 Weight 37kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported Information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack Behaviour
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages, Pack Current, ...
2 Interlock to control external protection devic...
3 Actively controlled dissipative balancing
4 BMS implements a single master and multi-slave...
5 Zivan, Victron, Delta-Q, TC-Charger, SPE. For ...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 10
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
DATENBLATT 2: HY-Energy-Standard-Pack-HYP-00-2889-R2.pdf
0 1
0 Part Number HYP-00-2889
1 Voltage Nominal 44.4V
2 Voltage Range Min/Max 37.2V/49.8V
3 Charge Current 132A maximum \nDe-rated by BMS message over CA...
4 Discharge Current 132A maximum \nDe-rated by BMS message over CA...
5 Maximum Capacity 4.94kWh/111Ah
6 Maximum Energy Density 152Wh/kg
7 Useable capacity Limited to 90% by BMS to improve cell life
8 Dimensions W: 243 x L: 352 x H: 265mm
9 Weight 32kg
10 Mounting Fixtures 4x M8 mounting points for easy secure mounting
11
==============================================================================
0 \
0 Communication Protocol
1 Reported Information
2 Pack Protection Mechanism
3 Balancing Method
4 Multi-Pack Behaviour
5 Compatible Chargers as standard
6 Charger Control
7 Auxiliary Connectors
8 Power connectors
9
1
0 CAN bus at user selectable baud rate (propriet...
1 Cell Temperatures and Voltages, Pack Current, ...
2 Interlock to control external protection devic...
3 Actively controlled dissipative balancing
4 BMS implements a single master and multi-slave...
5 Zivan, Delta-Q, TC-Charger, SPE, Victron, Bass...
6 Direct current control based on cell voltage/t...
7 Binder 720-Series 8-way male & female
8 4x Amphenol SurLok Plus 8mm \nWhen using batte...
9
==============================================================================
0 \
0 Max no of packs in series
1 Max Number of Parallel Packs
2 External System Requirements
3
1
0 12
1 127
2 External Protection Device (e.g. Contactor) co...
3
==============================================================================
You can define a list of strings of interest, then select only the tables which contain at least one of these strings.

import os
import camelot.io as camelot

n = 1
# define your strings of interest
interesting_strings = ["voltage", "current"]

arr = os.listdir(r'D:\Test')  # arr is the list of PDF file names
for item in arr:
    tables = camelot.read_pdf(os.path.join(r'D:\Test', item), pages='all', split_text=True)
    print(f'DATENBLATT {n}: {item}\n')
    n += 1
    for tabs in tables:
        # select only tables which contain at least one of the interesting strings
        if any(s in tabs.df.to_string().lower() for s in interesting_strings):
            print(tabs.df, "\n==============================================================================\n")
If you want to search for interesting strings only in specific places (for example, in the first column), you can use pandas DataFrame indexing such as iloc:

any(s in tabs.df.iloc[:, 0].to_string().lower() for s in interesting_strings)
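As a rough follow-up sketch (my own extension, not guaranteed against your exact PDFs): pull only the matching rows into one tidy DataFrame, parse the leading number out of the value column, and chart the values per datasheet with matplotlib. The column positions and the regex assume two-column tables like the ones shown above.

import os
import re
import camelot.io as camelot
import matplotlib.pyplot as plt
import pandas as pd

interesting_strings = ["voltage", "current"]
records = []
for item in os.listdir(r'D:\Test'):
    tables = camelot.read_pdf(os.path.join(r'D:\Test', item), pages='all', split_text=True)
    for tabs in tables:
        df = tabs.df
        # keep only rows whose first column mentions one of the keywords
        mask = df[0].str.lower().str.contains('|'.join(interesting_strings), na=False)
        for _, row in df[mask].iterrows():
            # take the first numeric token, e.g. 51.8 out of "51.8V"
            m = re.search(r'\d+(?:\.\d+)?', str(row[1]))
            if m:
                records.append({'file': item, 'parameter': row[0], 'value': float(m.group())})

result = pd.DataFrame(records)
# one bar group per parameter, one bar per datasheet
result.pivot(index='parameter', columns='file', values='value').plot.bar()
plt.tight_layout()
plt.show()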
For my espresso machine I am programming a GUI in Python with PyQt5.
I want to save different "profiles" which I can recall depending on the bean I am using.
There is firmware for the machine (the "Decent Espresso Machine"), written in Tcl, which does what I want; I will attach two images and a snippet of one of its profile files (line 1 of this file contains the different steps).
I am not sure which parser I should use. I always want to save different values for the same parameters (in C I would say the same struct), so it feels wrong to use something like configparser, where the sections are always independent.
Can somebody give me a hint which lib I should use? The parameters are always the same, but different recipes may contain a different number of steps.
I would like to have something like this pseudocode:
recipe = open('recipe1.txt')
currentstep = 0
for step in recipe.steps:
    pressure = step.pressure
    flow = step.flow
    ...
    brew()
    currentstep = currentstep + 1
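Not from the thread, but one minimal way to get exactly that access pattern with only the standard library: since every step has the same fields while recipes differ in the number of steps, a list of step objects serialises naturally to JSON. The field names below are just examples taken from the Tcl snippet further down.

import json
from dataclasses import dataclass, asdict

@dataclass
class Step:
    name: str
    pressure: float
    flow: float
    temperature: float
    seconds: float

def save_recipe(path, steps):
    # one JSON array, one object per step
    with open(path, 'w') as f:
        json.dump([asdict(s) for s in steps], f, indent=2)

def load_recipe(path):
    with open(path) as f:
        return [Step(**d) for d in json.load(f)]

save_recipe('recipe1.json', [
    Step('infuse', 1.0, 6.0, 90.0, 20.0),
    Step('rise and hold', 9.0, 0.0, 90.0, 10.0),
])
for step in load_recipe('recipe1.json'):
    pressure = step.pressure
    flow = step.flow
    # brew() would be called here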
[Screenshots: "Steps" and "Overview" views of the profile editor]
This is in the Tcl file:
advanced_shot {{exit_if 1 flow 6.0 volume 100 transition fast exit_flow_under 0 temperature 90.0 name infuse pressure 1 sensor coffee pump flow exit_type pressure_over exit_flow_over 6 exit_pressure_over 3.0 exit_pressure_under 0 seconds 20.0} {exit_if 0 volume 100 transition fast exit_flow_under 0 temperature 90.0 name {rise and hold} pressure 9.0 sensor coffee pump pressure exit_flow_over 6 exit_pressure_over 11 seconds 10.0 exit_pressure_under 0} {exit_if 1 volume 100 transition smooth exit_flow_under 0 temperature 90.0 name decline pressure 4.0 sensor coffee pump pressure exit_type pressure_under exit_flow_over 1.2 exit_pressure_over 11 seconds 20.0 exit_pressure_under 4.0} {exit_if 1 flow 1.2 volume 100 transition smooth exit_flow_under 0 temperature 90.0 name {pressure limit} pressure 4.0 sensor coffee pump pressure exit_type flow_over exit_flow_over 1.0 exit_pressure_over 11 exit_pressure_under 0 seconds 10.0} {exit_if 0 flow 1.0 volume 100 transition smooth exit_flow_under 0 temperature 90.0 name {flow limit} pressure 3.0 sensor coffee pump flow exit_flow_over 6 exit_pressure_over 11 seconds 30.0 exit_pressure_under 0}}
author Decent
beverage_type espresso
espresso_decline_time 30
espresso_hold_time 15
espresso_pressure 6.0
espresso_temperature 90.0
final_desired_shot_volume 32
final_desired_shot_volume_advanced 0
final_desired_shot_weight 32
final_desired_shot_weight_advanced 36
flow_profile_decline 1.2
flow_profile_decline_time 17
flow_profile_hold 2
flow_profile_hold_time 8
flow_profile_minimum_pressure 4
flow_profile_preinfusion 4
flow_profile_preinfusion_time 5
preinfusion_flow_rate 4
preinfusion_guarantee 1
preinfusion_stop_pressure 4.0
preinfusion_time 20
pressure_end 4.0
profile_hide 0
profile_language en
profile_notes {An advanced spring lever profile by John Weiss that addresses a problem with simple spring lever profiles, by using both pressure and flow control. The last two steps keep pressure/flow under control as the puck erodes, if the shot has not finished by the end of step 3. Please consider this as a starting point for tweaking.}
profile_title {Advanced spring lever}
settings_profile_type settings_2c
tank_desired_water_temperature 0
water_temperature 80
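For reading the existing Decent profiles, note that the file above is itself a Tcl list, and the Tcl interpreter bundled with Python's tkinter can split it without a hand-written parser. A hedged sketch (the file name is hypothetical):

import tkinter

tcl = tkinter.Tcl()

with open('recipe1.txt') as f:
    tokens = tcl.tk.splitlist(f.read())

# the top level alternates key, value: advanced_shot, author, beverage_type, ...
profile = dict(zip(tokens[0::2], tokens[1::2]))

# advanced_shot is a Tcl list of steps; each step is again a key/value list
steps = []
for raw_step in tcl.tk.splitlist(profile['advanced_shot']):
    fields = tcl.tk.splitlist(raw_step)
    steps.append(dict(zip(fields[0::2], fields[1::2])))

for step in steps:
    print(step['name'], step['pressure'], step.get('flow'))

From there, each step dict maps straight onto the pseudocode above; for writing your own profiles, a format you control (e.g. the JSON sketch earlier) is probably easier than emitting Tcl.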
I have a data file containing different foetal ultrasound measurements. The measurements are collected at different points during pregnancy, like so:
PregnancyID MotherID gestationalAgeInWeeks abdomCirc
0 0 14 150
0 0 21 200
1 1 20 294
1 1 25 315
1 1 30 350
2 2 8 170
2 2 9 180
2 2 18 NaN
As you can see from the table above, I have multiple measurements per pregnancy (between 1 and 26 observations each).
I want to summarise the ultrasound measurements somehow such that I can replace the multiple measurements with a fixed amount of features per pregnancy. So I thought of creating 3 new features, one for each trimester of pregnancy that would hold the maximum measurement recorded during that trimester:
abdomCirc1st: this feature would hold the maximum value of all abdominal circumference measurements measured between 0 to 13 Weeks
abdomCirc2nd: this feature would hold the maximum value of all abdominal circumference measurements measured between 14 to 26 Weeks
abdomCirc3rd: this feature would hold the maximum value of all abdominal circumference measurements measured between 27 to 40 Weeks
So my final dataset would look like this:
PregnancyID MotherID abdomCirc1st abdomCirc2nd abdomCirc3rd
0 0 NaN 200 NaN
1 1 NaN 315 350
2 2 180 NaN NaN
The reason for using the maximum here is that a larger abdominal circumference is associated with the adverse outcome I am trying to predict.
But I am quite confused about how to go about this. I have used the groupby function previously to derive certain statistical features from the multiple measurements; however, this is a more complex task.
What I want to do is the following:
Group all abdominal circumference measurements that belong to the same pregnancy into 3 trimesters, based on the gestationalAgeInWeeks value
Compute the maximum value of all abdominal circumference measurements within each trimester, and assign this value to the relevant feature: abdomCirc1st, abdomCirc2nd or abdomCirc3rd.
I think I have to do something along the lines of:
df["abdomCirc1st"] = df.groupby(['MotherID', 'PregnancyID', 'gestationalAgeInWeeks'])["abdomCirc"].transform('max')
But this code does not check what trimester the measurement was taken in (gestationalAgeInWeeks). I would appreciate some help with this task.
You can try this. It's a bit of a complicated query, but it seems to work:

(df.groupby(['MotherID', 'PregnancyID'])
   .apply(lambda d: d.assign(tm=(d['gestationalAgeInWeeks'] + 13 - 1) // 13)
                     .groupby('tm')['abdomCirc']
                     .max())
   .unstack()
)
produces
tm                        1      2      3
MotherID PregnancyID
0        0              NaN  200.0    NaN
1        1              NaN  315.0  350.0
2        2            180.0    NaN    NaN
Let's unpick this a bit. First we group by MotherID, PregnancyID. Then we apply a function to each grouped dataframe (d).
For each d, we create a 'trimester' column tm via assign (I assume I got the math right here, but correct it if it is wrong!), then we group by tm and take the max. For each sub-dataframe d we thus obtain a Series mapping tm to max(abdomCirc).
Then unstack() moves tm into the column names.
You may want to rename these columns later, but I did not bother.
Solution 2
Come to think of it, you can simplify the above a bit:

(df.assign(tm=(df['gestationalAgeInWeeks'] + 13 - 1) // 13)
   .drop(columns='gestationalAgeInWeeks')
   .groupby(['MotherID', 'PregnancyID', 'tm'])
   .agg('max')
   .unstack()
)

Similar idea, same output.
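An alternative sketch (my own, not part of either solution above): pd.cut makes the trimester boundaries explicit instead of deriving them by integer division, and pivot_table gives you the named columns directly.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PregnancyID': [0, 0, 1, 1, 1, 2, 2, 2],
    'MotherID':    [0, 0, 1, 1, 1, 2, 2, 2],
    'gestationalAgeInWeeks': [14, 21, 20, 25, 30, 8, 9, 18],
    'abdomCirc':   [150, 200, 294, 315, 350, 170, 180, np.nan],
})

# bin weeks 1-13, 14-26, 27-40 into named trimesters
df['tm'] = pd.cut(df['gestationalAgeInWeeks'],
                  bins=[0, 13, 26, 40],
                  labels=['abdomCirc1st', 'abdomCirc2nd', 'abdomCirc3rd'])

out = df.pivot_table(index=['MotherID', 'PregnancyID'],
                     columns='tm', values='abdomCirc',
                     aggfunc='max', observed=False)
print(out)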
There is a handy DataFrame method called query. This should do the job for now:
abdomCirc1st = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks <= 13')['abdomCirc'].max()
abdomCirc2nd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 14 and gestationalAgeInWeeks <= 26')['abdomCirc'].max()
abdomCirc3rd = df.query('MotherID == 0 and PregnancyID == 0 and gestationalAgeInWeeks >= 27 and gestationalAgeInWeeks <= 40')['abdomCirc'].max()
If you want something more automatic (without manually changing the MotherID and PregnancyID values for each different group of rows), you have to combine it with groupby (as you did on your own).
Check this as well: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html
In this particular issue, I have an imaginary city divided into squares - basically an M x N grid of squares covering the city. M and N can be relatively big, so I have cases with more than 40,000 square cells overall.
I have a number of customers Z distributed in this grid; some cells will contain many customers while others will be empty. I would like to find a way to place the minimum number of shops (only one per cell) needed to serve all customers, with the restriction that every customer must be "in reach" of one shop and no customer may be left out.
As an additional couple of twists, I have these constraints/issues:
There is a maximum distance that a customer can travel - if the shop is in a cell too far away, then the customer cannot be associated with that shop. Edit: it's not really a distance, it's a measure of how easy it is for a customer to reach a shop, so I can't use circles...
While respecting condition (1) above, there may well be multiple shops within reach of the same customer. In this case, the closest shop should win.
At the moment I'm trying to ignore the issue of costs - many customers mean bigger shops and larger costs - but maybe at some point I'll think about that too. The problem is, I have no idea of the name of the problem I'm looking at, nor of possible algorithmic solutions for it: can this be solved as a Linear Programming problem?
I normally code in Python, so any suggestions on a possible algorithmic approach and/or some code/libraries to solve it would be very much appreciated.
Thank you in advance.
Edit: as a follow-up, I kind of found out I could solve this as a MINLP "uncapacitated facility location problem", but all the information I have found is way too complex: I don't care to know which customer is served by which shop, I only care to know if and where a shop is built. I have a secondary way - as post-processing - to associate a customer with the most appropriate shop.
All the code I found sets up a monstrous linear system with one constraint per customer per shop (as "explained" here: https://en.m.wikipedia.org/wiki/Facility_location_problem#Uncapacitated_facility_location), so in a situation like mine I could easily end up with a linear system with millions of rows and columns, which with integer/binary variables will take about the age of the universe to solve.
There must be an easier way to handle this...
I think this can be formulated as a set covering problem (https://en.wikipedia.org/wiki/Set_cover_problem).
You say:
in a situation like mine I could easily end up with a linear system
with millions of rows and columns, which with integer/binary variables
will take about the age of the universe to solve
So let's see if that is even remotely true.
Step 1: generate some data
I generated a grid of 200 x 200, yielding 40,000 cells. I placed 500 customers at random. This looks like:
---- 22 PARAMETER cloc customer locations
x y
cust1 35 75
cust2 169 84
cust3 111 18
cust4 61 163
cust5 59 102
...
cust497 196 148
cust498 115 136
cust499 63 101
cust500 92 87
Step 2: calculate reach of each customer
The next step is to determine for each customer c the allowed locations (i,j) within reach. I created a large, sparse boolean matrix reach(c,i,j) for this, using the rule: if the Manhattan distance satisfies

|i - cloc(c,'x')| + |j - cloc(c,'y')| <= 10

then the store at (i,j) can service customer c. (Zeros are not stored.) This data structure has about 106k elements.
Step 3: Form MIP model
We form a simple MIP model:

    minimize    numStores = sum((i,j), placeStore(i,j))
    subject to  sum((i,j) with reach(c,i,j), placeStore(i,j)) >= 1   for each customer c
                placeStore(i,j) in {0, 1}

The inequality constraint says: we need at least one store within reach of each customer. This is a very simple model to formulate and to implement.
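The model above was built in GAMS; for reference, here is a minimal sketch of the same set covering formulation in Python with PuLP, on a smaller toy grid (the Manhattan-distance reach rule is as above; the sizes and everything else are assumptions):

import random
import pulp

N = 50                     # toy grid (the experiment above used 200 x 200)
MAX_DIST = 10              # Manhattan-distance reach
random.seed(0)
customers = [(random.randrange(N), random.randrange(N)) for _ in range(60)]

# reach[c] = all cells (i, j) that can serve customer c
reach = {c: [(i, j) for i in range(N) for j in range(N)
             if abs(i - c[0]) + abs(j - c[1]) <= MAX_DIST]
         for c in customers}

cells = [(i, j) for i in range(N) for j in range(N)]
prob = pulp.LpProblem("min_stores", pulp.LpMinimize)
place = pulp.LpVariable.dicts("place", cells, cat="Binary")

# objective: number of stores placed
prob += pulp.lpSum(place[c] for c in cells)
# each customer needs at least one store within reach
for c, cover in reach.items():
    prob += pulp.lpSum(place[cell] for cell in cover) >= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("stores needed:", int(pulp.value(prob.objective)))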
Step 4: Solve
This is a large but easy MIP. It has 40,000 binary variables. It solves very fast. On my laptop it took less than 1 second with a commercial solver (3 seconds with open-source solver CBC).
The solution looks like:
---- 47 VARIABLE numStores.L = 113 number of stores
---- 47 VARIABLE placeStore.L store locations
j1 j6 j7 j8 j9 j15 j16 j17 j18
i4 1
i18 1
i40 1
i70 1
i79 1
i80 1
i107 1
i118 1
i136 1
i157 1
i167 1
i193 1
+ j21 j23 j26 j28 j29 j31 j32 j36 j38
i10 1
i28 1
i54 1
i72 1
i96 1
i113 1
i147 1
i158 1
i179 1
i184 1
i198 1
+ j39 j44 j45 j46 j49 j50 j56 j58 j59
i5 1
i18 1
i39 1
i62 1
i85 1
i102 1
i104 1
i133 1
i166 1
i195 1
+ j62 j66 j67 j68 j69 j73 j74 j76 j80
i11 1
i16 1
i36 1
i61 1
i76 1
i105 1
i112 1
i117 1
i128 1
i146 1
i190 1
+ j82 j84 j85 j88 j90 j92 j95 j96 j97
i17 1
i26 1
i35 1
i48 1
i68 1
i79 1
i97 1
i136 1
i156 1
i170 1
i183 1
i191 1
+ j98 j102 j107 j111 j112 j114 j115 j116 j118
i4 1
i22 1
i36 1
i56 1
i63 1
i68 1
i88 1
i100 1
i101 1
i111 1
i129 1
i140 1
+ j119 j121 j126 j127 j132 j133 j134 j136 j139
i11 1
i30 1
i53 1
i72 1
i111 1
i129 1
i144 1
i159 1
i183 1
i191 1
+ j140 j147 j149 j150 j152 j153 j154 j156 j158
i14 1
i35 1
i48 1
i83 1
i98 1
i117 1
i158 1
i174 1
i194 1
+ j161 j162 j163 j164 j166 j170 j172 j174 j175
i5 1
i32 1
i42 1
i61 1
i69 1
i103 1
i143 1
i145 1
i158 1
i192 1
i198 1
+ j176 j178 j179 j180 j182 j183 j184 j188 j191
i6 1
i13 1
i23 1
i47 1
i61 1
i81 1
i93 1
i103 1
i125 1
i182 1
i193 1
+ j192 j193 j196
i73 1
i120 1
i138 1
i167 1
I think we have debunked your statement that a MIP model is not a feasible approach to this problem.
Note that the age of the universe is 13.7 billion years or 4.3e17 seconds. So we have achieved a speed-up of about 1e17. This is a record for me.
Note that this model does not find the optimal locations for the stores, but only a configuration that minimizes the number of stores needed to service all customers. It is optimal in that sense. But the solution will not minimize the distances between customers and stores.
For a given set of players, player positions, player cost, a budget and a set of constraints, how can I find the "optimal" solution? For example:

ID  Pos  Cost  Pts
 1    1    13   10
 2    1     5   13
 3    2    10   15
 4    2    10    8
 5    3    12   12
 6    3     7   14
and a budget of 30 (total cost cannot exceed 30), with a limit of 1 player per position.
The real problem I'm trying to solve is: I have estimated points per player in fantasy football. Now given the constraints in fantasy football, i.e.
a budget of 100
1 goalkeeper
a max of 5 defenders, min of 3 defenders
a max of 5 midfielders, min of 3 midfielders
a max of 3 strikers, min of 1 striker
Given these constraints, how do I find the maximum pts?
What libraries and tools are available for something like this? I could imagine doing this in Excel Solver, but given my dataset of over 1000 players it wouldn't work.
I started writing some custom code, but quickly realized there must be some ready-made solutions for this.
Scikit-optimize is a good starting point
https://scikit-optimize.github.io/
This is pretty much the knapsack problem (https://en.wikipedia.org/wiki/Knapsack_problem) with the additional constraint that players in the same position can't go together; this has been discussed here before:
Knapsack with items to consider constraint
As stated there, the problem is NP-hard.
You might have a look at the itertools module to reduce the runtime of your computation.
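To make the integer-programming formulation concrete, here is a hedged sketch with PuLP on the toy data from the question (positions 1-3, budget 30, at most one player per position; the real game's min/max squad constraints would be added the same way, e.g. 3 <= defenders <= 5):

import pulp

players = [
    {"id": 1, "pos": 1, "cost": 13, "pts": 10},
    {"id": 2, "pos": 1, "cost": 5,  "pts": 13},
    {"id": 3, "pos": 2, "cost": 10, "pts": 15},
    {"id": 4, "pos": 2, "cost": 10, "pts": 8},
    {"id": 5, "pos": 3, "cost": 12, "pts": 12},
    {"id": 6, "pos": 3, "cost": 7,  "pts": 14},
]
BUDGET = 30

prob = pulp.LpProblem("fantasy", pulp.LpMaximize)
pick = {p["id"]: pulp.LpVariable(f"pick_{p['id']}", cat="Binary") for p in players}

# maximise total expected points
prob += pulp.lpSum(pick[p["id"]] * p["pts"] for p in players)
# stay within the budget
prob += pulp.lpSum(pick[p["id"]] * p["cost"] for p in players) <= BUDGET
# at most one player per position
for pos in {p["pos"] for p in players}:
    prob += pulp.lpSum(pick[p["id"]] for p in players if p["pos"] == pos) <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([p["id"] for p in players if pick[p["id"]].value() == 1],
      pulp.value(prob.objective))

For 1000+ players an exact solver like this is usually far faster than enumerating combinations by hand.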
I have two data frames, each with about 250K rows. I am trying to do a fuzzy lookup between the two data frames' columns. After the lookup I need the indexes of the good matches that meet a threshold.
Some details follow.
My df1:
Name State Zip combined_1
0 Auto MN 10 Auto,MN,10
1 Rtla VI 253 Rtla,VI,253
2 Huka CO 56218 Huka,CO,56218
3 kann PR 214 Kann,PR,214
4 Himm NJ 65216 Himm,NJ,65216
5 Elko NY 65418 Elko,NY,65418
6 Tasm MA 13 Tasm,MA,13
7 Hspt OH 43218 Hspt,OH,43218
The other data frame, which I am trying to look up against:
Name State Zip combined_2
0 Kilo NC 69521 Kilo,NC,69521
1 Kjhl FL 3369 Kjhl,FL,3369
2 Rtla VI 25301 Rtla,VI,25301
3 Illt GA 30024 Illt,GA,30024
4 Huka CO 56218 Huka,CO,56218
5 Haja OH 96766 Haja,OH,96766
6 Auto MN 010 Auto,MN,010
7 Clps TX 44155 Clps,TX,44155
If you look closely, the fuzzy lookup should give good matches of indexes 0 and 2 in df1 to indexes 6 and 4 in df2.
So, I did this:
from fuzzywuzzy import fuzz

# save df1 indexes
df_1index = []
# save df2 indexes
df2_indexes = []
# save fuzzy ratios
fazz_rat = []

for index, details in enumerate(df1['combined_1']):
    for ind, information in enumerate(df2['combined_2']):
        fuzmatch = fuzz.ratio(str(details), str(information))
        if fuzmatch >= 94:
            df_1index.append(index)
            df2_indexes.append(ind)
            fazz_rat.append(fuzmatch)
As I expected, I am getting the results for this example case:

df_1index
>> [0, 2]
df2_indexes
>> [6, 4]
But running this against 250K x 250K rows across the two data frames takes far too long.
How can I speed up this lookup process? Is there a pandas or Python way to improve performance for what I want?
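One commonly used speed-up (my own suggestion, not from the thread): rapidfuzz implements the same scorers as fuzzywuzzy in C++, and its process helpers score one query against all choices in a single call, removing the slow Python inner loop. A sketch on a few of the toy rows:

import pandas as pd
from rapidfuzz import fuzz, process

df1 = pd.DataFrame({'combined_1': ['Auto,MN,10', 'Rtla,VI,253', 'Huka,CO,56218']})
df2 = pd.DataFrame({'combined_2': ['Kilo,NC,69521', 'Rtla,VI,25301',
                                   'Huka,CO,56218', 'Auto,MN,010']})

choices = df2['combined_2'].astype(str).tolist()
matches = []
for idx1, text in enumerate(df1['combined_1'].astype(str)):
    # extract() returns (choice, score, index) tuples above the cutoff
    for _choice, score, idx2 in process.extract(text, choices,
                                                scorer=fuzz.ratio,
                                                score_cutoff=94):
        matches.append((idx1, idx2, score))
print(matches)

For the full 250K x 250K job, rapidfuzz also offers process.cdist, which computes the whole score matrix in parallel (workers=-1); pre-filtering on cheap keys such as State or Zip also shrinks the candidate set dramatically.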