I have the code below that takes my standardized .txt file and converts it into a JSON file perfectly. The only problem is that I sometimes have over 300 files, and doing this manually (i.e. changing the number at the end of the file name and re-running the script) is too much and takes too long. I want to automate this. The files, as you can see, reside in one folder/directory, and I am placing the JSON files in a different folder/directory, keeping the naming convention the same except that the files end with .json instead of .txt. An example would be: CRAZY_CAT_FINAL1.TXT, CRAZY_CAT_FINAL2.TXT, and so on, all the way to file 300. How can I automate this while keeping the file naming convention in place, reading from one folder/directory and writing to another? I have tried, but can't seem to get this to iterate. Any help would be greatly appreciated.
import csv
import json

with open(r'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL1.txt', 'r') as csvfile, \
     open(r'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL1.json', 'w') as jsonfile:
    reader = csv.DictReader(csvfile)
    json.dump([row for row in reader], jsonfile)
****************************************************************************
I also have this code using the Python library requests. How do I change it so that it uploads multiple JSON files that follow a standard naming convention? The files end with a number...
import requests

# function to post to the API
def postData(xactData):
    url = 'http link'
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(xactData)),
        'Request-Timeout': '60000'
    }
    return requests.post(url, headers=headers, data=xactData)

# read data
with open(r'filepath/file/file.json', 'r') as f:
    data = f.read()
print(data)

# post data
result = postData(data)
print(result)
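One way to extend this to the numbered files, as a sketch; the range of 300, the Rat output folder, and the CRAZY_CAT_FINAL naming convention are assumptions carried over from the first question, and postData is the function defined above:

for i in range(1, 301):
    # open each numbered JSON file and post its contents
    with open(rf'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL{i}.json', 'r') as f:
        payload = f.read()
    result = postData(payload)       # postData as defined in the question
    print(i, result.status_code)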
Use f-strings?
# note the rf'' prefix: a raw f-string, since \U in a plain f-string is an invalid escape
for i in range(1, 301):
    with open(rf'C:\Users\...\...\...\Dog\CRAZY_CAT_FINAL{i}.txt', 'r') as csvfile, \
         open(rf'C:\Users\...\...\...\Rat\CRAZY_CAT_FINAL{i}.json', 'w') as jsonfile:
        json.dump(list(csv.DictReader(csvfile)), jsonfile)
import csv
import json
import os
from glob import glob

INPATH = r'C:\Users\...\...\...\Dog'
OUTPATH = r'C:\Users\...\...\...\Rat'

for csvname in glob(INPATH + r'\*.txt'):
    # swap the .txt extension for .json, keeping the file name itself
    jsonname = os.path.join(OUTPATH, os.path.basename(csvname)[:-3] + 'json')
    with open(csvname, 'r') as csvfile, open(jsonname, 'w') as jsonfile:
        json.dump(list(csv.DictReader(csvfile)), jsonfile)
I'm new to programming and am trying to fix this script to upload data, but I can't get it to work and I don't know why. I am modifying a script I already had working with a single file, but I need to upload over 3,000 files with a specific extension (.json) from a specific directory. I am getting an error on the f.read(open(files, 'r')) line; the error says f has no read attribute. Not sure what I am doing incorrectly. I've researched it and still can't fix it. Any help would be appreciated.
import requests
import time
import csv
import json
import glob
from glob import glob

# function to post data
def postData(xactData):
    url = 'api address'
    headers = {
        'Content-Type': 'application/json',
        'Content-Length': str(len(xactData)),
        'Request-Timeout': '2000000000'
    }
    return requests.post(url, headers=headers, data=xactData)

# read data
f = r'path to file'

# iterate over all files with extension .json in the path
for files in glob(f + '/*.json'):
    my_data = f.read(open(files, 'r'))
    print(files)     # print the files that have been uploaded to the api
    print(my_data)   # print the data uploaded to the api
    # post the data to the database
    result = postData(my_data)
    print(result)
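A likely fix, as a sketch: glob() returns plain path strings, and f here is also just a string, so neither of them has a .read() method. Open each path and call .read() on the file object that open() returns (this reuses the postData function and the placeholder directory from the question):

from glob import glob

folder = r'path to file'  # same placeholder directory as above

for path in glob(folder + '/*.json'):
    with open(path, 'r') as fp:   # fp is an actual file object
        my_data = fp.read()       # .read() belongs to the file object, not the path string
    print(path)       # the file being uploaded to the api
    print(my_data)    # the data uploaded to the api
    result = postData(my_data)    # postData as defined in the question
    print(result)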
I have over 200 scraped files in JSON format and I want to analyse them. I can open them individually, but would like to loop through them to save time, as I will be doing this a lot.
I can open each file, but want to be able to do this in a loop of some form, e.g.
with codecs.open('c:\\project\\input*.json', 'r', 'utf-8') as f:
where '*' is a number...
import codecs, json, csv, re

# read a json file downloaded with twitterscraper
with codecs.open('c:\\project\\input1.json', 'r', 'utf-8') as f:
    tweets = json.load(f, encoding='utf-b')
Just put your files into a folder and then loop through the files in the folder like so.
import codecs
import json
import csv
import re
import os

files = []
for file in os.listdir("/mydir"):
    if file.endswith(".json"):
        files.append(os.path.join("/mydir", file))

for file in files:
    with codecs.open(file, 'r', 'utf-8') as f:
        tweets = json.load(f)  # codecs.open already decodes the file as utf-8
Add, and use, glob to iterate over files matching a certain pattern:
import glob
import codecs
import json
# ... more packages here

for file in glob.glob('c:\\project\\input*.json'):
    with codecs.open(file, 'r', 'utf-8') as f:
        tweets = json.load(f)
        # ... whatever you do next with `tweets`
BTW: `utf-b` instead of `utf-8`?
I have a script that gets all of the .zip files from a folder, then one by one, opens the zip file, loads the content of the JSON file inside and imports this to MongoDB.
The error I am getting is the JSON object must be str, bytes or bytearray, not 'TextIOWrapper'
The code is:
import json
import logging
import logging.handlers
import os
from logging.config import fileConfig
from pymongo import MongoClient

def import_json():
    try:
        client = MongoClient('5.57.62.97', 27017)
        db = client['vuln_sets']
        coll = db['vulnerabilities']
        basepath = os.path.dirname(__file__)
        filepath = os.path.abspath(os.path.join(basepath, ".."))
        archive_filepath = filepath + '/vuln_files/'
        filedir = os.chdir(archive_filepath)
        for item in os.listdir(filedir):
            if item.endswith('.json'):
                file_name = os.path.abspath(item)
                fp = open(file_name, 'r')
                json_data = json.loads(fp)
                for vuln in json_data:
                    print(vuln)
                    coll.insert(vuln)
                os.remove(file_name)
    except Exception as e:
        logging.exception(e)
I can get this working for a single file but not for multiple files; i.e. for one file I wrote:
from zipfile import ZipFile
import json
import pymongo

archive = ZipFile("vulners_collections/cve.zip")
archived_file = archive.open(archive.namelist()[0])
archive_content = archived_file.read()
archived_file.close()

connection = pymongo.MongoClient("mongodb://localhost")
db = connection.vulnerability
vuln1 = db.vulnerability_collection

vulners_objects = json.loads(archive_content)
for item in vulners_objects:
    vuln1.insert(item)
From my comment above:
I have no experience with glob, but from skimming the docs I get the impression your archive_files is a simple list of file paths as strings, correct? You cannot call methods like .open on a string (hence your error), so try changing your code to this:
...
archive_filepath = filepath + '/vuln_files/'
archive_files = glob.glob(archive_filepath + "/*.zip")
for file in archive_files:
    # read the JSON file inside each zip, as in your single-file version
    with ZipFile(file) as archive:
        file_content = archive.read(archive.namelist()[0])
    vuln_content = json.loads(file_content)
    for item in vuln_content:
        coll.insert(item)
...
file is NOT a file object, just a simple string, so you can't call methods on it that strings don't support.
You are redefining your iterator by setting it to the result of the namelist method. You need a for loop within the for loop to go through the contents of the zip file, with a new iterator variable, as in the sketch below.
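A minimal sketch of that nested loop, assuming each zip can hold several JSON members (the glob pattern is a placeholder, and coll is the collection from the question):

import glob
import json
from zipfile import ZipFile

for zip_path in glob.glob('/vuln_files/*.zip'):  # outer loop: one zip archive per iteration
    with ZipFile(zip_path) as archive:
        for member in archive.namelist():        # inner loop: each file inside the zip
            if member.endswith('.json'):
                vulns = json.loads(archive.read(member))
                for vuln in vulns:
                    coll.insert(vuln)            # coll as defined in the question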
Isn't file.close wrong? The correct call is file.close().
You can use json.load() to load from a file object directly, instead of json.loads():
fp = open(file_name, 'r')
json_data = json.load(fp)
fp.close()
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL into your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python does not work in this situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will get the zip downloaded, open it, and get you a csv object for whatever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)

# We create a StringIO object so that we can work on the result of the
# request (a string) as though it is a file.
sio = StringIO.StringIO()
sio.write(remoteCSV.read())

# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
z = ZipFile(sio, 'r')

# A list with the names of all the files in the zip you just downloaded:
print z.namelist()

# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
import os
import urllib
import zipfile
from StringIO import StringIO

package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)

for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
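For instance, a minimal sketch of reading one of the extracted files back with the csv module (the filename here is just the example member named above; the real names come from zip.namelist()):

import csv

with open('ny.gdp.pcap.cd_Indicator_en_csv_v2.csv') as fp:
    for row in csv.reader(fp):
        print(row)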
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is Python based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You can also read our analysis of data from the World Bank:
https://datahub.io/awesome/world-bank
Just a suggestion rather than a solution: you can use pd.read_csv to read a CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')
As described here, it is possible to send multiple files with one request:
Uploading multiple files in a single request using python requests module
However, I have a problem generating these multiple file handles from a list.
So let's say I want to make a request like this:
sendfiles = {'file1': open('file1.txt', 'rb'), 'file2': open('file2.txt', 'rb')}
r = requests.post('http://httpbin.org/post', files=sendfiles)
How can I generate sendfiles from the list myfiles?
myfiles = ["file1.txt", "file20.txt", "file50.txt", "file100.txt", ...]
Use a dictionary comprehension, using os.path.splitext() to remove those extensions from the filenames:
import os.path
sendfiles = {os.path.splitext(fname)[0]: open(fname, 'rb') for fname in myfiles}
Note that a list of 2-item tuples will do too:
sendfiles = [(os.path.splitext(fname)[0], open(fname, 'rb')) for fname in myfiles]
Beware: using the files parameter to send a multipart-encoded POST will read all of those files into memory first. Use the requests-toolbelt project to build a streaming POST body instead:
from requests_toolbelt import MultipartEncoder
import requests
import os.path

m = MultipartEncoder(fields={
    os.path.splitext(fname)[0]: open(fname, 'rb') for fname in myfiles})
r = requests.post('http://httpbin.org/post', data=m,
                  headers={'Content-Type': m.content_type})
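One caveat worth noting: the comprehensions above open file handles that nothing ever closes. A small sketch of one way around that, keeping a dictionary of the handles so they can be closed once the request is done (a sketch, not part of the original answer):

import os.path
import requests

# keep references to the open handles so they can be closed afterwards
handles = {os.path.splitext(fname)[0]: open(fname, 'rb') for fname in myfiles}
try:
    r = requests.post('http://httpbin.org/post', files=handles)
finally:
    for fh in handles.values():
        fh.close()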