Writing pdf data into csv - python

I am a beginner in Python and I am stuck on a critical part of my script. The goal is to get fillable PDF form data into a single CSV file. I have built a script to do so with pdfminer, but I am now stuck on the part where the script should write all of the output data into my CSV. Instead, only one data line is written to the CSV. Any clue, please?
Here is my code:
import sys, os
import six
import csv
import numpy as np
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
fp = open("C:\\Users\\Sufyan\\Desktop\\test\\ccc.pdf", "rb")
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog["AcroForm"])["Fields"]
for i in fields:
    field = resolve1(i)
    name, value = field.get("T"), field.get("V")
    filehehe = "{0}:{1}".format(name, value)
    values = resolve1(value)
    names = resolve1(name)
    print(values)

with open('test.csv', 'wb') as f:
    for i in names:
        f.write(values)

Clues:
There is no i inside this loop:
with open('test.csv', 'wb') as f:
    for i in names:
        f.write(values)
For the same for loop: what does names contain at that point? How many values is that? 1? 2? Why?
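A minimal sketch of one way to fix it, assuming you want one CSV row per form field: collect the name/value pairs inside the loop and write them all afterwards with csv.writer. The path and header names below are just placeholders, and depending on the PDF the resolved names and values may come back as bytes:

import csv
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

fp = open("C:\\Users\\Sufyan\\Desktop\\test\\ccc.pdf", "rb")
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog["AcroForm"])["Fields"]

# Collect every field inside the loop instead of keeping only the last one.
rows = []
for i in fields:
    field = resolve1(i)
    name = resolve1(field.get("T"))
    value = resolve1(field.get("V"))
    rows.append([name, value])

# Write all rows in one go; newline='' avoids blank lines on Windows.
with open('test.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'value'])  # header (placeholder column names)
    writer.writerows(rows)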

Related

How to read an XML directory and import data into a MySQL table in Python

Currently I am using the following code to read XML files and extract data.
import pandas as pd
import numpy as np
import xml.etree.cElementTree as et
import datetime
tree = et.parse(r'/data/dump_xml/myfile1.xml')
root = tree.getroot()

NAME = []
for name in root.iter('name'):
    NAME.append(name.text)

UPDATE = []
for update in root.iter('lastupdate'):
    UPDATE.append(update.text)
updated = datetime.datetime.fromtimestamp(int(UPDATE[0]))
lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')

ParaValue = []
for parameterevalue in root.iter('value'):
    ParaValue.append(parameterevalue.text)

print(ParaValue[0])
print(ParaValue[1])
print(lastupdate, NAME[0], ParaValue[0])
print(lastupdate, NAME[1], ParaValue[1])
print(lastupdate,NAME[1],ParaValue[1])
From one file I get the following output:
2022-05-23 11:25:01 traffic_in 1.5012356187e+05
2022-05-23 11:25:01 traffic_out 1.7723777592e+05
But I have a set of XML files in /data/dump_xml/ and I need to get all the results like the above, together with the file name. I need to export all of them as a DataFrame. Can someone help me do this for the whole directory?
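A possible approach, sketched under the assumption that every file in /data/dump_xml/ has the same structure as myfile1.xml (the parse_file helper and the column names are made up for illustration): wrap the parsing in a function, loop over the directory with glob, and build the DataFrame from the collected rows.

import os
import glob
import datetime
import pandas as pd
import xml.etree.cElementTree as et

def parse_file(path):
    """Parse one XML dump and yield (file, lastupdate, name, value) rows."""
    root = et.parse(path).getroot()
    names = [n.text for n in root.iter('name')]
    values = [v.text for v in root.iter('value')]
    updates = [u.text for u in root.iter('lastupdate')]
    updated = datetime.datetime.fromtimestamp(int(updates[0]))
    lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')
    for name, value in zip(names, values):
        yield os.path.basename(path), lastupdate, name, value

rows = []
for path in sorted(glob.glob('/data/dump_xml/*.xml')):
    rows.extend(parse_file(path))

df = pd.DataFrame(rows, columns=['file', 'lastupdate', 'name', 'value'])
print(df)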

How do I fix my code so that it is automated?

I have the below code that takes my standardized .txt file and converts it into a JSON file perfectly. The only problem is that sometimes I have over 300 files, and doing this manually (i.e. changing the number at the end of the file name and running the script) is too much and takes too long. I want to automate this. The files, as you can see, reside in one folder/directory, and I am placing the JSON files in a different folder/directory, keeping the naming convention standardized, except that each file ends with .json instead of .txt; the prefixes/file names stay the same. An example would be: CRAZY_CAT_FINAL1.TXT, CRAZY_CAT_FINAL2.TXT, and so on, all the way to file 300. How can I automate this, keep the file naming convention in place, and read and write the files in different folders/directories? I have tried, but can't seem to get this to iterate. Any help would be greatly appreciated.
import glob
import time
from glob import glob
import pandas as pd
import numpy as np
import csv
import json
csvfile = open(r'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL1.txt', 'r')
jsonfile = open(r'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL1.json', 'w')
reader = csv.DictReader(csvfile)
out = json.dumps([row for row in reader])
jsonfile.write(out)
****************************************************************************
I also have this code using the Python library requests. How do I make this code upload multiple JSON files with a standard naming convention? The files end with a number...
import requests

# function to post to api
def postData(xactData):
    url = 'http link'
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(xactData)),
        'Request-Timeout': '60000'
    }
    return requests.post(url, headers=headers, data=xactData)

# read data
f = open(r'filepath/file/file.json', 'r')
data = f.read()
print(data)

# post data
result = postData(data)
print(result)
Use f-strings?
for i in range(1, 301):
    csvfile = open(rf'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL{i}.txt', 'r')
    jsonfile = open(rf'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL{i}.json', 'w')
import time
from glob import glob
import csv
import json
import os

INPATH = r'C:\Users\...\...\...\Dog'
OUTPATH = r'C:\Users\...\...\...\Rat'

for csvname in glob(INPATH + r'\*.txt'):
    jsonname = OUTPATH + '/' + os.path.basename(csvname[:-3] + 'json')
    reader = csv.DictReader(open(csvname, 'r'))
    json.dump(list(reader), open(jsonname, 'w'))
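For the second half of the question (uploading every generated JSON file), the same glob idea can drive the requests call; a sketch, reusing the placeholder OUTPATH and 'http link' endpoint from the code above:

from glob import glob
import requests

OUTPATH = r'C:\Users\...\...\...\Rat'   # placeholder output folder from above
url = 'http link'                       # placeholder endpoint from the question

for jsonname in glob(OUTPATH + r'\*.json'):
    with open(jsonname, 'r') as f:
        payload = f.read()
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(payload)),
        'Request-Timeout': '60000',
    }
    response = requests.post(url, headers=headers, data=payload)
    print(jsonname, response.status_code)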

I want to convert .docx to .dotx

I have populated some mail merge fields in a .docx file and now I want my script to convert the saved .docx file to a .dotx file. I am using Python 3.6.
from __future__ import print_function
from mailmerge import MailMerge
from datetime import date
from docx import Document
from docx.opc.constants import CONTENT_TYPE as CT
import csv
import sys
import os
import numpy as np
import pandas as pd
# . . .
for i in range(0, numTemplates):
    theTemplateName = templateNameCol[i]
    theTemplateFileLocation = templateFileLocationCol[i]
    template = theTemplateFileLocation

    document = MailMerge(template)
    print(document.get_merge_fields())

    theOffice = officeCol[i]
    theAddress = addressCol[i]
    theSuite = suiteCol[i]
    theCity = cityCol[i]
    theState = stateCol[i]
    theZip = zipCol[i]
    thePhoneNum = phoneNumCol[i]
    theFaxNum = faxNumCol[i]

    document.merge(
        Address=theAddress
    )

    document.write(r'\Users\me\mailmergeproject\test-output' + str(i) + r'.docx')
    # do conversion here
Here at the bottom is where I want to do the conversion. As you can see, I've written a file and it's just sitting in a folder right now.
Here is the code snippet for converting the .docx file to .dotx.
You have to change the content type while saving the document:
pip install python-docx
import docx

document = docx.Document('foo.docx')
document_part = document.part
document_part._content_type = (
    'application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.template.main+xml'
)
document.save('bar.dotx')
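Tying that back to the mail-merge loop above, the conversion step could look roughly like this at the "# do conversion here" point. Note that _content_type is a private attribute of python-docx, so this is a best-effort sketch rather than a supported API, and the output path is the one from the question:

import docx

# Reopen the file that was just written, switch its content type to the
# Word template type, and save it again with a .dotx extension.
docx_path = r'\Users\me\mailmergeproject\test-output' + str(i) + r'.docx'
dotx_path = docx_path[:-5] + '.dotx'

converted = docx.Document(docx_path)
converted.part._content_type = (
    'application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.template.main+xml'
)
converted.save(dotx_path)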

How to download a CSV file from the World Bank's dataset

I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and give you a csv reader for whatever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

sio = StringIO.StringIO()
sio.write(remoteCSV.read())
# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.

z = ZipFile(sio, 'r')
# We now create a ZipFile object pointed to by 'z' and we can do a few things here:

print z.namelist()
# A list with the names of all the files in the zip you just downloaded

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)

for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata and data
Extracting metadata and data
Converting to a Data Package
The script is Python-based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank
Just a suggestion rather than a solution: you can use pd.read_csv to read any CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
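The answers above are written for Python 2 (urllib2, StringIO, print statements). On Python 3 the same download-and-unzip idea could look roughly like this, using urllib.request and io.BytesIO; which zip member holds the data is an assumption, so inspect namelist() first:

import csv
import io
import zipfile
from urllib.request import urlopen

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"

# Download the zip into memory and open it with ZipFile.
package = io.BytesIO(urlopen(baseUrl).read())
z = zipfile.ZipFile(package)
print(z.namelist())

# z.open() returns a binary file, so wrap it for the csv module.
with z.open(z.namelist()[1]) as raw:
    text = io.TextIOWrapper(raw, encoding='utf-8')
    for row in csv.reader(text):
        print(row)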

Python: Import Data from Open Office calc with lxml

How can I import data, for example from cell A1?
When I use etree.parse() I get an error, because I don't have an XML file.
It's a zip file:
import zipfile
from lxml import etree
z = zipfile.ZipFile('mydocument.ods')
data = z.read('content.xml')
data = etree.XML(data)
etree.dump(data)
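To actually get at a single cell such as A1 rather than dumping the whole tree, the OpenDocument namespaces are needed. A sketch along those lines, assuming the first table-cell of the first row is A1 (merged or repeated cells would need extra handling):

import zipfile
from lxml import etree

ns = {
    'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0',
    'text': 'urn:oasis:names:tc:opendocument:xmlns:text:1.0',
}

z = zipfile.ZipFile('mydocument.ods')
root = etree.XML(z.read('content.xml'))

# First cell of the first row of the first sheet, i.e. A1 in the simple case.
cell = root.find('.//table:table/table:table-row/table:table-cell', ns)
if cell is not None:
    # The displayed text sits in <text:p> children; join everything below the cell.
    print(''.join(cell.itertext()))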
