I want to extract the data inside a rectangular box in a PDF file to a CSV file, keeping the corresponding rows and columns.
I tried the Camelot, PyPDF2, and Tabula libraries, but none of them gave me the desired result in a CSV file. Could anyone help me here?
Below is the data inside the rectangular box in the PDF file; a link to the input PDF file is attached as well:
[input PDF file][2]
Below is the code I have tried:
import PyPDF2

# PdfFileReader, numPages, getPage and extractText are the legacy (pre-3.0) PyPDF2 names
pdf_file_obj = open('Rectangle_Box_PDF_2021_v2.pdf', 'rb')
pdf_read = PyPDF2.PdfFileReader(pdf_file_obj)
print("The total number of pages : " + str(pdf_read.numPages))
page_obj = pdf_read.getPage(0)
# Extract the text of the first page
pdf_list = [page_obj.extractText()]
print(pdf_list)
# Split the page text into lines and flatten the result into a single list
list1 = []
for i in range(0, len(pdf_list)):
    list1.append(pdf_list[i].split('\n'))
flatList = sum(list1, [])
print(flatList)
[2]: https://drive.google.com/file/d/1m1mwO6V9UMuXTddXdkAf0Bx88l9zudcB/view?usp=share_link
This really is a poor-quality file. The data is so badly built, with so many errors,
that any way of handling it will need manual keyboard input.
Even the QR code was refused by several readers:
upi://pay?cu=INR
&pa=flipkartinternet#hsbc
&pn=MAHENDRA MAURYA
&gstIn=07AHTPM2207K1Z2
&am=0
&tr=OD124716518958108000
&tn=payOD124716518958108000
&invoiceNo=PZT2204190054Y22YD01
&InvoiceDate=2022-04-19T00:55:16+05:30
&invoiceValue=899.000
&transactionMethod=FLIPKART_FINANCE
&gstBrkUp={GST:96.320|CGST:0|SGST:0|IGST:96.320}
Here it is exported to XLSX. No slight on Aspose (rubbish in, rubbish out), but it is not a good result for exporting to CSV.
The best you may expect as plain text (with one comma per line) will be:
Product
Description
Qty
Gross
Amount
Discount
Taxable
Value
IGST
Total
Sadow 40 Meters CAT 6 Ethernet Cable Lan
Network CAT6 Internet Modem RJ45 Patch Cord
40 m LAN Cable Grey | sadow Grey cat6 40mtr |
IMEI/SrNo: [[]]
HSN: 85177090 | IGST: 12%
1
899.00
-0.00
802.68
96.32
899.00
Shipping and Handling
Charges
1
0.00
0
0.00
0.00
0.00
TOTAL QTY: 1
TOTAL PRICE: 899.00
All values are in INR
u
or ze
gna ure
A better bet is possibly to export as HTML and extract from that, or to import into a spreadsheet for CSV export.
Here is the source PDF exported to XLSX using Adobe.
However, best of all was exporting via xpdf pdftotext and importing into Excel to save as CSV:
Product,Description,Qty,Gross,Discount,Taxable,IGST,Total
,,,Amount,,Value,,
Sadow 40 Meters CAT6 Ethernet Cable Lan,,,,,,,
Network CAT6 Internet Modem RJ45 Patch Cord,HSN: 85177090 | IGST: 12%,1,899.00,-0.00,802.68,96.32,899.00
40 m LAN Cable Grey | sadow Grey cat6 40mtr |,,,,,,,
IMEI/SrNo: [[]],,,,,,,
,Shipping and Handling,1,0.00,0,0.00,0.00,0.00
,Charges,,,,,,
TOTAL QTY: 1,,,,,,TOTAL,PRICE: 899.00
,,,,,,,All values are in INR
,,,,,,u orze,gna ure
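The pdftotext route can also be scripted rather than going through Excel. The following is only a rough sketch, assuming the `pdftotext` command-line tool is installed and on the PATH and that its `-layout` mode keeps the columns separated by at least two spaces (the exact options used for the output above are not stated). The function names `layout_text_to_rows` and `pdf_to_csv` are made up for illustration.

```python
import csv
import re
import subprocess


def layout_text_to_rows(text):
    """Split layout-preserved text into rows, treating 2+ spaces as a column gap."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:  # skip blank lines
            rows.append(re.split(r"\s{2,}", stripped))
    return rows


def pdf_to_csv(pdf_path, csv_path):
    """Run pdftotext in layout mode and write the split rows to a CSV file."""
    # "-" as the output name makes pdftotext write the text to stdout
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows(layout_text_to_rows(result.stdout))
```

Because leading whitespace is stripped, empty leading cells are lost, so the columns will not line up as neatly as in the hand-made CSV above. With a file this badly built, some manual clean-up is still to be expected.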
I have a huge txt file from which I want to exclude every page number, all tabular data, and all headings. The only differentiator I can think of is that the text I need to keep is at least two lines long.
The data looks like this (as an example):
1 C o mp a n y
2 C o mb in ed ma na g emen t
r ep o r t
Total equity and liabilities
6,130.3
100.0%
5,930.0
100.0%
200.3
Additionally, there is body text, which I want to keep:
The total assets of ZALANDO SE rose by 3.4% primarily due to a further increase in financial
assets. The assets of ZALANDO SE mainly consist of financial and current assets, specifically
securities and cash, shares in affiliated companies as well as inventories and receivables.
Equity and liabilities comprise equity and current and non-current liabilities and provisions.
I did try to write:
myvariable = textstring.replace(\n.*\n," ") but it does not do anything.
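For what it is worth, `str.replace` matches literal text rather than regular expressions, so a pattern like the one above cannot work there. A minimal sketch of the "at least two lines long" idea follows, under the assumption that body text can be recognised as a run of consecutive long lines. The function name `keep_body_text` and both thresholds (`min_lines=2`, `min_chars=40`) are illustrative guesses that would need tuning on the real file.

```python
def keep_body_text(text, min_lines=2, min_chars=40):
    """Keep only runs of at least `min_lines` consecutive long lines."""
    kept = []
    run = []
    # The trailing empty string acts as a sentinel that flushes the last run
    for line in text.splitlines() + [""]:
        stripped = line.strip()
        if len(stripped) >= min_chars:
            run.append(stripped)
        else:
            # A short line ends the current run; keep it only if long enough
            if len(run) >= min_lines:
                kept.append(" ".join(run))
            run = []
    return "\n\n".join(kept)
```

Short lines such as page numbers, table cells, and the spaced-out headings fall below `min_chars` and are dropped, while each multi-line paragraph is joined into a single line.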
The title I chose is probably not a good one, but I will explain the problem clearly. I am looking for the most efficient approach, because the number of files is very large and processing them may take a long time.
I have a folder which contains a lot of files (300K). The file names follow a pattern like this:
09060083_1542296310_2_CON_ENT-Floor-Practice_2015-09-25-false_MRB3738.txt
Only one part of this file name matters to me:
09060083, which I can extract easily.
I also have a data frame, which looks like this:
    Clinic Number  6month
1   09060083       1
2   494383         4
13  494383         4
14  494383         1
17  494382         9
21  494382         4
25  494383         4
28  494383         4
29  994381         5
30  994383         10
The clinic number is the same as characters 1 to 8 of the file name. Now I want to move some of the files to other folders based on some criteria.
The folder names are based on the 6month column in the data frame, so I have 10 folders named 1, 2, 3, ..., 10.
My simple method is to extract characters 1 to 8 of the file name, compare them with the Clinic Number column in the data frame, and, if they match, move the file to the folder named after the 6month value of that row.
But I guess it will take a long time. I am looking for the most efficient way to do it. My approach is awful because it needs to loop through the whole data frame for every single file.
Thanks in advance
You can find duplicate clinic entries and then move the corresponding files to the respective folder.
For example, if your df looks like this:
Clinic_Num 6month Filename
09060083 1 09060083_blah
494383 4 494383_blah1
494383 4 494383_blah2
494383 1 494383_blah3
Select all duplicate rows by:
df_to_be_moved = df[df.duplicated(subset='Clinic_Num')]
Now your df_to_be_moved will look like:
Clinic_Num 6month Filename
494383 4 494383_blah2
494383 1 494383_blah3
Now you can select rows based on the destination folder, get the list of file paths for that folder, and move them.
import os
import shutil

# Raw string so the backslashes are not treated as escape sequences
BASE_PATH = r"C:\Users\M193053\Documents"
for idx in range(1, 11):  # folder names 1 to 10
    folder_name = os.path.join(BASE_PATH, "folder_" + str(idx))
    os.makedirs(folder_name, exist_ok=True)
    # File names whose 6month value matches this folder
    matches = df_to_be_moved[df_to_be_moved['6month'] == idx].Filename.tolist()
    matches = [os.path.join(BASE_PATH, filename) for filename in matches]
    for file in matches:
        shutil.move(file, folder_name)
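If the main concern is avoiding a scan of the whole data frame for every file, another option is to build a dictionary from clinic number to folder once and then make a single pass over the directory. This is only a sketch under some assumptions: the clinic number is taken as everything before the first underscore in the file name, clinic numbers are handled as strings so leading zeros survive, and when a clinic number appears with several 6month values the first one wins. The names `build_lookup`, `plan_moves`, and `move_files` are hypothetical.

```python
import os
import shutil


def build_lookup(rows):
    """Map clinic number -> folder name from (clinic_number, six_month) pairs."""
    lookup = {}
    for clinic, six_month in rows:
        # setdefault keeps the first value seen for a repeated clinic number
        lookup.setdefault(str(clinic), str(six_month))
    return lookup


def plan_moves(filenames, lookup):
    """Return (filename, folder) pairs for files whose clinic number is known."""
    moves = []
    for name in filenames:
        clinic = name.split("_", 1)[0]  # text before the first underscore
        folder = lookup.get(clinic)     # constant-time dictionary lookup
        if folder is not None:
            moves.append((name, folder))
    return moves


def move_files(src_dir, dest_root, lookup):
    """Move every matching file in src_dir into dest_root/<folder>."""
    with os.scandir(src_dir) as entries:
        names = [entry.name for entry in entries if entry.is_file()]
    for name, folder in plan_moves(names, lookup):
        target = os.path.join(dest_root, folder)
        os.makedirs(target, exist_ok=True)
        shutil.move(os.path.join(src_dir, name), os.path.join(target, name))
```

With a pandas data frame, the pairs could come from something like `zip(df["Clinic Number"], df["6month"])`, provided the clinic numbers were read as strings. Each file then costs one dictionary lookup instead of a pass over the data frame.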
Hi guys.
I've got a bit of a unique issue trying to merge two big data files together. Both files have one column with the same data (patent number); all other columns are different.
The idea is to join them so that the patent number columns align and the other data is readable and connected.
The first few lines of the .dat file look like:
IL 1 Chicago 10030271 0 3930271
PA 1 Bedford 10156902 0 3930272
MO 1 St. Louis 10112031 0 3930273
IL 1 Chicago 10030276 0 3930276
And the .asc:
02 US corporation No change 11151713 TRANSCO PROD INC 58419
02 US corporation No change 11151720 SECURE TELECOM INC 502530
02 US corporation No change 11151725 SOA SYSTEMS INC 520365
02 US corporation No change 11151738 REVTEK INC 473150
The .dat file is too large to open fully in Excel, so I don't think reorganizing it there is an option (or rather, I don't know whether it is possible with any of the macros I've found online so far).
This feels like quite a newbie question, but does anyone know how I could link these data sets together (preferably using Python) with the patent number as the unique identifier?
You will want to write a program that reads in the data from the two files you would like to merge. Open each file and parse the data on each line. From there you can write the data to a new file in any order you like. This can be done with Python file I/O.
Pseudocode:
def filehandler(filename1, filename2):
    # Open both input files; the with block closes them automatically
    with open(filename1, "r") as fd1, open(filename2, "r") as fd2:
        for line1 in fd1:  # the loop ends when there is nothing more to read
            # Each line of the first file is split on whitespace into a list of fields
            line1_fields = line1.split()
            # Parse the lines of fd2 the same way, then write the combined fields to a new file
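To make the idea a little more concrete, here is a minimal sketch of a dictionary-based join. It assumes that one of the two files fits in memory as a dictionary and that you can write a small function to pull the patent number out of a line. Which field actually holds the patent number is not stated in the question, so the key functions are left as parameters. The name `merge_on_key` and its arguments are made up for illustration.

```python
def merge_on_key(small_lines, big_lines, small_key, big_key, sep="\t"):
    """Yield each line of big_lines joined with its match from small_lines.

    small_key and big_key are functions that take a line (without the
    trailing newline) and return the value to join on.
    """
    # Index the smaller file by its key; a later duplicate overwrites an earlier one
    index = {}
    for line in small_lines:
        line = line.rstrip("\n")
        index[small_key(line)] = line

    # Stream the larger file so it never has to be held in memory at once
    for line in big_lines:
        line = line.rstrip("\n")
        match = index.get(big_key(line))
        if match is not None:
            yield line + sep + match
```

Open file objects can be passed directly as `small_lines` and `big_lines`, and each yielded string can be written to the output file followed by a newline. Note that a plain `line.split()` will break on values such as "St. Louis", so the key functions should follow the files' real delimiter or fixed column positions.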
I'm moving my algorithms from MATLAB to Python, and I am stuck on parallel processing.
I need to process a very large number of CSV files (1 to 1M), each with a large number of rows (10k to 10M) and 5 independent data columns.
I already have code that does this, but with only one processor; loading the CSVs into a dictionary in RAM takes about 30 min (~1k CSVs of ~100k rows).
The file names are in a list loaded from a CSV (this is already done):
Amp Freq Offset PW FileName
3 10000.0 1.5 1e-08 FlexOut_20140814_221948.csv
3 10000.0 1.5 1.1e-08 FlexOut_20140814_222000.csv
3 10000.0 1.5 1.2e-08 FlexOut_20140814_222012.csv
...
And the CSV files have this form (example: FlexOut_20140815_013804.csv):
# TDC characterization output file , compress
# TDC time : Fri Aug 15 01:38:04 2014
#- Event index number
#- Channel from 0 to 15
#- Pulse width [ps] (1 ns precision)
#- Time stamp rising edge [ps] (500 ps precision)
#- Time stamp falling edge [ps] (500 ps precision)
##Event Channel Pwidth TSrise TSfall
0 6 1003500 42955273671237500 42955273672241000
1 6 1003500 42955273771239000 42955273772242500
2 6 1003500 42955273871241000 42955273872244500
...
I'm looking for something like MATLAB's 'parfor' that takes each name from the list, opens the file, and puts the data in a list of dictionaries.
It's a list because the files have an order (PW), but in the examples I've found this seems more complicated to do, so first I will try to put the data in a dictionary and afterwards arrange it in a list.
Now I'm starting with the multiprocessing examples on the web:
Writing to dictionary of objects in parallel
I will post updates when I have a piece of "working" code.
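As a starting point, this is roughly what I have in mind. It is an untested sketch that assumes whitespace-separated integer columns and `#` comment lines as in the example file. It uses `concurrent.futures.ProcessPoolExecutor`, whose `map` returns results in the same order as the inputs, so the PW ordering of the file list should be preserved. The names `parse_lines`, `load_file`, and `load_all` and the `chunksize` value are my own choices.

```python
from concurrent.futures import ProcessPoolExecutor

# Column names taken from the "##Event ..." header line of the example file
COLUMNS = ("Event", "Channel", "Pwidth", "TSrise", "TSfall")


def parse_lines(lines):
    """Turn the data lines of one file into a dictionary of integer lists."""
    data = {name: [] for name in COLUMNS}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the commented header
        for name, value in zip(COLUMNS, line.split()):
            data[name].append(int(value))
    return data


def load_file(path):
    """Read and parse a single CSV file (runs inside a worker process)."""
    with open(path) as handle:
        return parse_lines(handle)


def load_all(paths, workers=None):
    """Parse all files in parallel; the result order matches the input order."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(load_file, paths, chunksize=16))
```

On platforms that start worker processes by spawning, the call to `load_all` has to sit under an `if __name__ == "__main__":` guard. Whether this beats the single-process version will also depend on how much time goes into sending the parsed data back to the parent process.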