I have a PDF file that I need to convert into a CSV file. An example of my PDF is at this link: https://online.flippingbook.com/view/352975479/ The code I used is:
import re
import parse
import pdfplumber
import pandas as pd
from collections import namedtuple

file = "Battery Voltage.pdf"
lines = []
total_check = 0

with pdfplumber.open(file) as pdf:
    pages = pdf.pages
    for page in pdf.pages:
        text = page.extract_text()
        for line in text.split('\n'):
            print(line)
With the above script I am not getting proper output: for the Time column, "AM" ends up on the next line. The output I am getting looks like this:
It may help to see how the surface of a PDF is put onto the screen: a single string of plain text is placed part by part on the display. (Here I highlight where the first AM is to be placed.)
As a side issue, that first AM in the file is, I think at first glance, encoded as this block:
BT
/F1 12 Tf
1 0 0 1 224.20265 754.6322 Tm
[<001D001E>] TJ
ET
Where in that area 1D = A and 1E = M
So if you wish to extract each LINE as it is displayed, by far the simplest way is to use a tool such as pdftotext, which outputs each row of text exactly as seen on the page.
With an approach such as tabular comma-separated values, you can expect each AM to be given its own row, which by logic should be " ",AM," "," ", though some extractors will report it as nan,AM,nan,nan.
As text it looks like this, from just one command line:
pdftotext -layout "Battery Voltage.pdf"
That will output "Battery Voltage.txt" in the same working folder.
Then placing that in a spreadsheet becomes:
Now we can export in a couple of clicks as "proper output" CSV, along with all the oddities that CSV entails:
,,Battery Vo,ltage,
Sr No,DateT,Ime,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off
,AM,,,
2,01/11/2022,00:23:10,47.15,Off
,AM,,,
3,01/11/2022,00:38:10,47.15,Off
,AM,,,
4,01/11/2022,00:58:10,47.15,Off
,AM,,,
5,01/11/2022,01:18:10,47.15,Off
,AM,,,
6,01/11/2022,01:33:10,47.15,Off
,AM,,,
7,01/11/2022,01:48:10,47.15,Off
,AM,,,
8,01/11/2022,02:03:10,47.15,Off
,AM,,,
9,01/11/2022,02:18:10,47.15,Off
,AM,,,
10,01/11/2022,02:37:12,47.15,Off
,AM,,,
So, if the edits were not done before CSV generation, it is simpler to post-process in an editor, such as this HTML page (no need for more apps):
,,Battery,Voltage,
Sr No,Date,Time,Voltage (v),Ignition
1,01/11/2022,00:08:10,47.15,Off,AM,,,
2,01/11/2022,00:23:10,47.15,Off,AM,,,
3,01/11/2022,00:38:10,47.15,Off,AM,,,
4,01/11/2022,00:58:10,47.15,Off,AM,,,
5,01/11/2022,01:18:10,47.15,Off,AM,,,
6,01/11/2022,01:33:10,47.15,Off,AM,,,
7,01/11/2022,01:48:10,47.15,Off,AM,,,
8,01/11/2022,02:03:10,47.15,Off,AM,,,
9,01/11/2022,02:18:10,47.15,Off,AM,,,
10,01/11/2022,02:37:12,47.15,Off,AM,,,
Then on re-import it looks more human-generated.
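If you would rather script that cleanup than edit by hand, a minimal Python sketch could be the following (`merge_am_rows` and the file names are assumptions; it re-attaches each stray `,AM,,,` row to the data row above it, matching the corrected CSV shown above):

```python
def merge_am_rows(src_path, dst_path):
    """Re-attach each stray ',AM,,,' / ',PM,,,' continuation row to the data row above it."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        previous = None
        for line in src:
            line = line.rstrip("\n")
            if line.startswith(",AM") or line.startswith(",PM"):
                if previous is not None:
                    previous += line  # join the orphaned AM/PM cell onto its row
                continue
            if previous is not None:
                dst.write(previous + "\n")
            previous = line
        if previous is not None:
            dst.write(previous + "\n")
```

A call such as `merge_am_rows("battery_raw.csv", "battery_clean.csv")` would then produce the merged rows in one pass.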
In the discussion it was confirmed that all that is desired is a means to a structured list, and a first parse using
pdftotext -layout -nopgbrk -x 0 -y 60 -W 800 -H 800 -fixed 6 "Battery Voltage.pdf" &type "battery voltage.txt"|findstr "O">battery.txt
will output regular data columns ready for framing, to which you can add a fixed header line, split, or otherwise work with the cleaned data.
1 01-11-2022 00:08:10 47.15 Off
2 01-11-2022 00:23:10 47.15 Off
3 01-11-2022 00:38:10 47.15 Off
4 01-11-2022 00:58:10 47.15 Off
5 01-11-2022 01:18:10 47.15 Off
...
32357 24-11-2022 17:48:43 45.40 On
32358 24-11-2022 17:48:52 44.51 On
32359 24-11-2022 17:48:55 44.51 On
32360 24-11-2022 17:48:58 44.51 On
32361 24-11-2022 17:48:58 44.51 On
At this stage we can use plain text handling to produce CSV or add JSON brackets:
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do echo csv is "%%a,%%b,%%c,%%d,%%e">output.txt
...
csv is "32357,24-11-2022,17:48:43,45.40,On"
csv is "32358,24-11-2022,17:48:52,44.51,On"
csv is "32359,24-11-2022,17:48:55,44.51,On"
csv is "32360,24-11-2022,17:48:58,44.51,On"
csv is "32361,24-11-2022,17:48:58,44.51,On"
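The same conversion can be sketched in Python for anyone not on Windows (the file names and `columns_to_csv` are assumptions; it reads the whitespace-separated columns that `pdftotext -layout` produced and writes commas):

```python
def columns_to_csv(src_path, dst_path):
    """Convert whitespace-separated five-column lines (pdftotext -layout output)
    into comma-separated rows; anything else (headers, blank lines) is skipped."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            fields = line.split()
            if len(fields) == 5:
                dst.write(",".join(fields) + "\n")
```

Running `columns_to_csv("battery.txt", "battery.csv")` would give the same five-field rows as the batch loop above.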
So the request is for JSON (not my forte, so you may need to improve on my code, as I don't know what Mongo expects).
Here I drop a PDF onto a battery.bat:
{"line_id": 1, "created": {"date": "01-11-2022", "time": "00:08:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 2, "created": {"date": "01-11-2022", "time": "00:23:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 3, "created": {"date": "01-11-2022", "time": "00:38:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 4, "created": {"date": "01-11-2022", "time": "00:58:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 5, "created": {"date": "01-11-2022", "time": "01:18:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 6, "created": {"date": "01-11-2022", "time": "01:33:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 7, "created": {"date": "01-11-2022", "time": "01:48:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 8, "created": {"date": "01-11-2022", "time": "02:03:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 9, "created": {"date": "01-11-2022", "time": "02:18:10"}, "Voltage": "47.15", "State": "Off"}
{"line_id": 10, "created": {"date": "01-11-2022", "time": "02:37:12"}, "Voltage": "47.15", "State": "Off"}
It is a bit slow running in a pure console, so let's run it blind by adding #. It will still take time, as we are working in plain text, so expect a significant delay for 32,000+ lines: about 2½ minutes on my kit.
pdftotext -layout -nopgbrk -x 0 -y 60 -W 700 -H 800 -fixed 8 "%~1" battery.txt
echo Heading however you wish it for json perhaps just opener [ but note only one redirect chevron >"%~dpn1.txt"
for /f "tokens=1,2,3,4,5 delims= " %%a In ('Findstr /C:"O" battery.txt') do #echo "%%a": { "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },>>"%~dpn1.txt"
REM another json style could be { "Line_Id": %%a, "Date": "%%b", "Time": "%%c", "Voltage": %%d, "Ignition": "%%e" },
REM another for an array can simply be [%%a,"%%b","%%c",%%d,"%%e" ],
echo Tailing however you wish it for json perhaps just final closer ] but note double chevron >>"%~dpn1.txt"
To see progress change #echo { to #echo %%a&echo {
Thus, after a minute or so it finishes; however, showing progress tends to add an extra minute for all that display activity before the window closes as the sign of completion.
For cases like these, build a parser that converts the unusable data into something you can use.
The logic below converts that exact file to a CSV, but it will only work with that specific file's contents.
Note that for this specific file you can ignore the AM/PM as the time is in 24h format.
import pdfplumber

file = "Battery Voltage.pdf"

skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]

with open("output.csv", "w") as outfile:
    header = "serialnumber;date;time;voltage;ignition\n"
    outfile.write(header)
    with pdfplumber.open(file) as pdf:
        for page in pdf.pages:
            for line in page.extract_text().split('\n'):
                if line.strip() in skiplines:
                    continue
                outfile.write(";".join(line.split()) + "\n")
EDIT
So, JSON files in Python are basically just a list of dict items (yes, that's an oversimplification).
The only thing you need to change is the way you actually process the lines. The actual meat of the logic doesn't change...
import pdfplumber
import json

file = "Battery Voltage.pdf"

skiplines = [
    "Battery Voltage",
    "AM",
    "PM",
    "Sr No DateTIme Voltage (v) Ignition",
    ""
]

result = []
with pdfplumber.open(file) as pdf:
    for page in pdf.pages:
        for line in page.extract_text().split("\n"):
            if line.strip() in skiplines:
                continue
            serialnumber, date, time, voltage, ignition = line.split()
            result.append(
                {
                    "serialnumber": serialnumber,
                    "date": date,
                    "time": time,
                    "voltage": voltage,
                    "ignition": ignition,
                }
            )

with open("output.json", "w") as outfile:
    json.dump(result, outfile)
I am aware that a lot of questions have already been asked on this topic, but none of them worked for my specific case.
I want to import a text file in Python and be able to access each value separately. My text file looks like this (it's separated by tabs):
example dataset
For example, the data '1086: CampNou' is written in one cell. I am mainly interested in getting access to the values presented here. Does anybody have a clue how to do this?
1086: CampNou 2084: Hospi 2090: Sants 2094: BCN-S 2096: BCN-N 2101: UNI 2105: B23 Total
1086: CampNou 0 15,6508 12,5812 30,3729 50,2963 0 56,0408 164,942
2084: Hospi 15,7804 0 19,3732 37,1791 54,1852 27,4028 59,9297 213,85
2090: Sants 12,8067 22,1304 0 30,6268 56,7759 29,9935 62,5204 214,854
2096: BCN-N 51,135 54,8545 57,3742 46,0102 0 45,6746 56,8001 311,849
2101: UNI 0 28,9589 31,4786 37,5029 31,6773 0 50,2681 179,886
2105: B23 51,1242 38,5838 57,3634 75,1552 56,7478 40,2728 0 319,247
Total 130,846 160,178 178,171 256,847 249,683 143,344 285,559 1404,63
You can use pandas to open and manipulate your data.
import pandas as pd
df = pd.read_csv("mytext.txt", sep="\t")
Since the file is tab-separated, passing sep="\t" should read your file properly.
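As a self-contained sketch (the sample below is a trimmed, assumed version of the question's matrix; the real file would be read with `pd.read_csv("mytext.txt", ...)`), note that the numbers use decimal commas, so `decimal=","` lets pandas parse them as floats, and `index_col=0` turns the first column into row labels:

```python
import io
import pandas as pd

# A trimmed, tab-separated sample in the same shape as the question's matrix.
sample = (
    "\t1086: CampNou\t2084: Hospi\tTotal\n"
    "1086: CampNou\t0\t15,6508\t164,942\n"
    "2084: Hospi\t15,7804\t0\t213,85\n"
)

# sep="\t" matches the tab delimiter, decimal="," parses the decimal commas,
# index_col=0 makes the first column the row index.
df = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0, decimal=",")
value = df.loc["1086: CampNou", "2084: Hospi"]  # a float, 15.6508
```

You can then look up any cell with `df.loc[row_label, column_label]`.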
def read_file(filename):
    """Returns content of file"""
    file = open(filename, 'r')
    content = file.read()
    file.close()
    return content

content = read_file("the_file.txt")  # or whatever your text file is called
items = content.split('\t')  # the file is tab-separated, so split on tabs
Then your values will be in the list items: ['', '1086: CampNou', '2084: Hospi', '2090: Sants', ...]
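If you want label-based access (row, then column) instead of one flat list, a sketch using the standard csv module could look like this (`read_matrix` and the file name are hypothetical, and it assumes the file really is tab-separated with an empty corner cell in the header row):

```python
import csv

def read_matrix(path):
    """Return {row_label: {column_label: value}} from a tab-separated matrix
    whose header row starts with an empty cell."""
    with open(path, newline="") as fh:
        rows = list(csv.reader(fh, delimiter="\t"))
    header = rows[0][1:]  # skip the empty corner cell
    return {row[0]: dict(zip(header, row[1:])) for row in rows[1:]}
```

With that, `read_matrix("the_file.txt")["1086: CampNou"]["2084: Hospi"]` returns the single cell you asked about.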
I have a file of nested JSON data. I am trying to get "some_object" and write a CSV file with the objects (I think they are called objects: "some_object": "some_value"). I would like one row for each group of nested items. This is my code:
import csv
import json

path = 'E:/Uni Arbeit/Prof Hayo/Sascha/Bill data/97/bills/hr/hr4242'

outputfile = open('TaxLaw1981.csv', 'w', newline='')
outputwriter = csv.writer(outputfile)

with open(path + "/" + "/data.json", "r") as f:
    data = json.load(f)
    for act in data['actions']:
        a = act.get('acted_at')
        b = act.get('text')
        c = act.get('type')
        outputwriter.writerow([a, b, c])

outputfile.close()
The problem I have is that it only writes the last group of data to the CSV; however, when I run
with open(path + "/" + "/data.json", "r") as f:
    data = json.load(f)
    for act in data['actions']:
        a = act.get('acted_at')
        b = act.get('text')
        c = act.get('type')
        print(a)
all of my "a" values print out.
Suggestions?
You need to flush your output writer to push each row out to the file; otherwise the rows sit in the write buffer and may never appear if the program stops before the file is closed. writerow's output only reaches disk when you close the file, unless you flush the data first.
for act in data['actions']:
    a = act.get('acted_at')
    b = act.get('text')
    c = act.get('type')
    outputwriter.writerow([a, b, c])
    outputfile.flush()
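A variant of the loop above that avoids manual flushing is to open the CSV in a with block, which closes (and therefore flushes) the file automatically; the function and file names here are illustrative, while the field names come from the question:

```python
import csv
import json

def dump_actions(json_path, csv_path):
    """Write acted_at / text / type of every action as one CSV row."""
    with open(json_path) as f:
        data = json.load(f)
    # The with block closes (and therefore flushes) the CSV automatically.
    with open(csv_path, "w", newline="") as outputfile:
        outputwriter = csv.writer(outputfile)
        for act in data["actions"]:
            outputwriter.writerow([act.get("acted_at"), act.get("text"), act.get("type")])
```

This way every row is guaranteed to be on disk once the with block exits, even if an exception occurs mid-loop.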
The code you posted above works 100% with the file you have.
The file (for anyone interested) is available with rsync -avz --delete --delete-excluded --exclude **/text-versions/ govtrack.us::govtrackdata/congress/97/bills/hr/hr4242 .
And the output to the csv file is (omitting some lines in the middle)
1981-07-23,Referred to House Committee on Ways and Means.,referral
1981-07-23,"Consideration and Mark-up Session Held by Committee on Ways and Means Prior to Introduction (Jun 10, 81 through Jul 23, 81).",action
1981-07-23,"Hearings Held by House Committee on Ways and Means Prior to Introduction (Feb 24, 25, Mar 3, 4, 5, 24, 25, 26, 27, 30, 31, Apr 1, 2, 3, 7, 81).",action
...
...
...
1981-08-12,Measure Signed in Senate.,action
1981-08-12,Presented to President.,topresident
1981-08-13,Signed by President.,signed
1981-08-13,Became Public Law No: 97-34.,enacted
You should post the full error message you get when you execute (probably due to an encoding error) so someone can understand why your code is failing.