I would like to know how to read several json files from a single folder (without specifying the files names, just that they are json files).
Also, is it possible to turn them into a pandas DataFrame?
Can you give me a basic example?
One option is listing all files in a directory with os.listdir and then finding only those that end in '.json':
import os, json
import pandas as pd
path_to_json = 'somedir/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
print(json_files) # for me this prints ['foo.json']
Now you can use pandas DataFrame.from_dict to read in the json (a python dictionary at this point) to a pandas dataframe:
montreal_json = pd.DataFrame.from_dict(many_jsons[0])
print(montreal_json['features'][0]['geometry'])
Prints:
{'type': 'Point', 'coordinates': [-73.6051013, 45.5115944]}
In this case I had appended some jsons to a list many_jsons. The first json in my list is actually a geojson with some geo data on Montreal. I'm familiar with the content already so I print out the 'geometry' which gives me the lon/lat of Montreal.
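For reference, a minimal sketch of how a list like many_jsons could have been built, reusing the path_to_json and json_files variables from above:

many_jsons = []
for js in json_files:
    with open(os.path.join(path_to_json, js)) as json_file:
        many_jsons.append(json.load(json_file))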
The following code sums up everything above:
import os, json
import pandas as pd
# this finds our json files
path_to_json = 'json/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
# here I define my pandas Dataframe with the columns I want to get from the json
jsons_data = pd.DataFrame(columns=['country', 'city', 'long/lat'])
# we need both the json and an index number so use enumerate()
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        # here you need to know the layout of your json and each json has to have
        # the same structure (obviously not the structure I have here)
        country = json_text['features'][0]['properties']['country']
        city = json_text['features'][0]['properties']['name']
        lonlat = json_text['features'][0]['geometry']['coordinates']
        # here I push a list of data into a pandas DataFrame at row given by 'index'
        jsons_data.loc[index] = [country, city, lonlat]
# now that we have the pertinent json data in our DataFrame let's look at it
print(jsons_data)
for me this prints:
  country           city                   long/lat
0  Canada  Montreal city  [-73.6051013, 45.5115944]
1  Canada        Toronto  [-79.3849008, 43.6529206]
It may be helpful to know that for this code I had two geojsons in a directory named 'json'. Each json had the following structure:
{"features":
[{"properties":
{"osm_key":"boundary","extent":
[-73.9729016,45.7047897,-73.4734865,45.4100756],
"name":"Montreal city","state":"Quebec","osm_id":1634158,
"osm_type":"R","osm_value":"administrative","country":"Canada"},
"type":"Feature","geometry":
{"type":"Point","coordinates":
[-73.6051013,45.5115944]}}],
"type":"FeatureCollection"}
Iterating a (flat) directory is easy with the glob module
from glob import glob
for f_name in glob('foo/*.json'):
    ...
As for reading JSON directly into pandas, see here.
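For instance, a minimal sketch combining glob with pandas.read_json (this assumes each file contains a JSON document that read_json can parse on its own; 'foo/*.json' is just an illustrative pattern):

from glob import glob
import pandas as pd

# read each file into its own DataFrame, then stack them
frames = [pd.read_json(f_name) for f_name in glob('foo/*.json')]
df = pd.concat(frames, ignore_index=True)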
Loads all files that end with *.json from a specific directory into a dict:
import os,json
path_to_json = '/lala/'
for file_name in [file for file in os.listdir(path_to_json) if file.endswith('.json')]:
    with open(path_to_json + file_name) as json_file:
        data = json.load(json_file)
        print(data)
Try it yourself:
https://repl.it/#SmaMa/loadjsonfilesfromfolderintodict
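If you want to keep every file's contents rather than only the last one read, a minimal sketch that collects them into a dict keyed by file name (same path_to_json as above):

import os, json

path_to_json = '/lala/'
all_data = {}
for file_name in [f for f in os.listdir(path_to_json) if f.endswith('.json')]:
    with open(os.path.join(path_to_json, file_name)) as json_file:
        all_data[file_name] = json.load(json_file)
print(all_data)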
To read the json files,
import os
import glob
import json
contents = []
json_dir_name = '/path/to/json/dir'
json_pattern = os.path.join(json_dir_name, '*.json')
file_list = glob.glob(json_pattern)
for file in file_list:
    with open(file) as f:
        contents.append(json.load(f))
If turning into a pandas dataframe, use the pandas API.
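For example, once contents is a list of parsed dicts, a minimal sketch (assuming the dicts share a reasonably consistent layout):

import pandas as pd

# json_normalize flattens nested keys into columns; pd.DataFrame(contents) also works for flat dicts
df = pd.json_normalize(contents)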
More generally, you can use a generator:
import glob
import json

def data_generator(my_path_regex):
    for filename in glob.glob(my_path_regex):
        # assumes each line of the file is its own JSON object (JSON Lines)
        for json_line in open(filename, 'r'):
            yield json.loads(json_line)

my_arr = [_json for _json in data_generator(my_path_regex)]
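The generator output can also be fed straight to pandas, a small sketch (the 'foo/*.json' pattern is illustrative):

import pandas as pd

df = pd.DataFrame(list(data_generator('foo/*.json')))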
I am using glob with pandas. Check out the code below (note that lines=True assumes each file is in JSON Lines format; drop it for regular JSON files):
import pandas as pd
from glob import glob
df = pd.concat([pd.read_json(f_name, lines=True) for f_name in glob('foo/*.json')])
A simple and very easy-to-understand answer.
import os
import glob
import pandas as pd
path_to_json = r'\path\here'
# collect all files from the folder that end with .json
json_files = glob.glob(os.path.join(path_to_json, '*.json'))
# convert all files to dataframe
df = pd.concat((pd.read_json(f) for f in json_files))
print(df.head())
I feel a solution using pathlib is missing :)
from pathlib import Path
file_list = list(Path("/path/to/json/dir").glob("*.json"))
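Building on file_list above, a minimal sketch that concatenates every file into one DataFrame (assuming each file holds a regular JSON document that pandas.read_json can parse):

import pandas as pd

df = pd.concat((pd.read_json(p) for p in file_list), ignore_index=True)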
One more option is to read it as a PySpark DataFrame and then convert it to a Pandas DataFrame (if really necessary; depending on the operation, I'd suggest keeping it as a PySpark DataFrame). Spark natively handles a directory of JSON files as the input path, with no need for extra libraries to read or iterate over each file:
# pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.json('/some_dir_with_json/*.json')
Next, in order to convert it into a Pandas DataFrame, you can do:
df = spark_df.toPandas()
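As noted above, Spark can also take the directory itself as the path instead of a glob pattern:

spark_df = spark.read.json('/some_dir_with_json')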
Related
I have a few csv files in my Azure File share which I am accessing as text with the following code:
from azure.storage.file import FileService
storageAccount='...'
accountKey='...'
file_service = FileService(account_name=storageAccount, account_key=accountKey)
share_name = '...'
directory_name = '...'
file_name = 'Name.csv'
file = file_service.get_file_to_text(share_name, directory_name, file_name)
print(file.content)
The contents of the csv files are displayed, but I need to load them as a dataframe, which I am not able to do. Can anyone please tell me how to read file.content as a pandas dataframe?
After reproducing from my end, I was able to read a csv file into a dataframe from the contents of the file with the below code.
from io import StringIO
import pandas as pd

generator = file_service.list_directories_and_files('fileshare/')
for file_or_dir in generator:
    print(file_or_dir.name)
    file = file_service.get_file_to_text('fileshare', '', file_or_dir.name)
    df = pd.read_csv(StringIO(file.content), sep=',')
    print(df)
Currently I am using the following code to read xml files and extract data.
import pandas as pd
import numpy as np
import xml.etree.cElementTree as et
import datetime
tree=et.parse(r'/data/dump_xml/myfile1.xml')
root=tree.getroot()
NAME = []
for name in root.iter('name'):
    NAME.append(name.text)
UPDATE = []
for update in root.iter('lastupdate'):
    UPDATE.append(update.text)
updated = datetime.datetime.fromtimestamp(int(UPDATE[0]))
lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')
ParaValue = []
for parameterevalue in root.iter('value'):
    ParaValue.append(parameterevalue.text)
print(ParaValue[0])
print(ParaValue[1])
print(lastupdate,NAME[0],ParaValue[0])
print(lastupdate,NAME[1],ParaValue[1])
From one file I could get the following output as a result:
2022-05-23 11:25:01 traffic_in 1.5012356187e+05
2022-05-23 11:25:01 traffic_out 1.7723777592e+05
But I have a set of xml files in /data/dump_xml/ and I need to get all the results as below, with the file name as well. I need to export all of those as a dataframe. Can someone help me do this for the whole directory?
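A minimal sketch of one possible approach, assuming every file in /data/dump_xml/ has the same structure as myfile1.xml (the column names here are illustrative):

import os
import glob
import datetime
import xml.etree.cElementTree as et
import pandas as pd

rows = []
for xml_path in glob.glob('/data/dump_xml/*.xml'):
    tree = et.parse(xml_path)
    root = tree.getroot()
    names = [n.text for n in root.iter('name')]
    values = [v.text for v in root.iter('value')]
    updated = datetime.datetime.fromtimestamp(int(next(root.iter('lastupdate')).text))
    lastupdate = updated.strftime('%Y-%m-%d %H:%M:%S')
    # one row per name/value pair, tagged with the source file name
    for name, value in zip(names, values):
        rows.append({'file': os.path.basename(xml_path),
                     'lastupdate': lastupdate,
                     'name': name,
                     'value': value})

df = pd.DataFrame(rows)
print(df)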
I have 2 questions
How to convert and extract a JSON file into an Excel file in Python
How to combine all json files into one file?
Now, I have 30 json files. I would like to extract them all into an Excel file (in a readable format).
Lastly, I need to combine all of the results into one Excel file, so I'm curious how to do that too.
Converting JSON into Excel:
import pandas as pd
df = pd.read_json('./file1.json')
df.to_excel('./file1.xlsx')
Combining multiple Excel files (two files are combined in the example):
import pandas as pd

excl_list_path = ["./file1.xlsx", "./file2.xlsx"]
excl_list = []
for file in excl_list_path:
    excl_list.append(pd.read_excel(file))

# DataFrame.append was removed in pandas 2.0, so concatenate the frames instead
excl_merged = pd.concat(excl_list, ignore_index=True)
excl_merged.to_excel('file1-file2-merged.xlsx', index=False)
Note: your specific JSON file structure matters for these examples.
And I have the perfect function just for that
import json
from io import StringIO
import pandas as pd

def save_to_excel(json_data, filename):
    # read_json expects a path or file-like object, so serialise the dict first
    df = pd.read_json(StringIO(json.dumps(json_data)), typ='series').to_frame().T
    df.to_excel(filename)

json_data = {"a": "data A", "b": "data B"}
save_to_excel(json_data, "json_data.xlsx")
More info here
You can try to use this library,
https://pypi.org/project/tablib/0.9.3/
It provides a lot of features that can help you with this.
How to combine all json files into one file?
ans:
import json
import glob
import pprint as pp #Pretty printer
combined = []
for json_file in glob.glob("*.json"):  # assuming that your json files and the .py file are in the same directory
    with open(json_file, "rb") as infile:
        combined.append(json.load(infile))
pp.pprint(combined)
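To actually combine them into one file, a small follow-up sketch that writes the combined list back out as a single JSON file (combined.json is just an example name):

with open("combined.json", "w") as outfile:
    json.dump(combined, outfile, indent=2)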
I'm trying to write to an existing Parquet file stored on the local filesystem. But when writing multiple times, the previous data gets overwritten instead of appended to.
from datetime import datetime
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def append_to_parquet_table(dataframe, filename):
    full_path = os.path.join('.', filename)
    table = pa.Table.from_pandas(dataframe)
    writer = pq.ParquetWriter(full_path, table.schema)
    writer.write_table(table=table)

def save(passed):
    data = {'number': [1234],
            'verified': [passed],
            'date': datetime.now().strftime("%Y-%m-%d %H:%M:%S")}
    data_df = pd.DataFrame(data)
    append_to_parquet_table(data_df, 'results.parquet')

save(True)
save(False)
Why is the first data set being "updated" instead of a new one written?
I'm trying to write to an existing Parquet file stored on the local filesystem.
This isn't supported by the file format. Parquet files are immutable after being written.
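A minimal sketch of one workaround, assuming you can write all of the data within a single process: keep one ParquetWriter open and write several tables (row groups) into the same file before closing it.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df1 = pd.DataFrame({'number': [1234], 'verified': [True]})
df2 = pd.DataFrame({'number': [5678], 'verified': [False]})

table1 = pa.Table.from_pandas(df1)
table2 = pa.Table.from_pandas(df2)

# each write_table call adds another row group to the same file
with pq.ParquetWriter('results.parquet', table1.schema) as writer:
    writer.write_table(table1)
    writer.write_table(table2)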
I want to make a pdf composed of ranges from all Excel workbooks located in a given folder (folderwithallfiles). All workbooks have the same structure, so the range reference is the same for every workbook.
I thought I got it with the script below, but it does not work.
import win32com.client as win32
import glob
import os
xlfiles = sorted(glob.glob("*.xlsx"))
#print "Reading %d files..."%len(xlfiles)
cwd = "C:\\Users\\user\folderwithallfiles"
#cwd = os.getcwd()
path_to_pdf = r'C:\\Users\\user\folderwithallfiles\multitest.pdf'
excel = win32.gencache.EnsureDispatch('Excel.Application')
for xlfile in xlfiles:
    wb = excel.Workbooks.Open(cwd + "\\" + xlfile)
    ws = wb.Sheets('sheet 1')
    ws.Range("A1:Q59").Select()
    wb.ActiveSheet.ExportAsFixedFormat(0, path_to_pdf)
Please check whether the below code works. I have written it on the fly. Let me know if you find issues in it.
import pandas as pd
import glob
import pdfkit as pdf

all_data = pd.DataFrame()
for f in glob.glob(r"filepath\file*.xlsx"):
    df = pd.read_excel(f)
    all_data = pd.concat([all_data, df], ignore_index=True)

all_data.to_html(r"filepath\all_data.html")
pdf.from_file(r"filepath\all_data.html", r"filepath\all_data.pdf")