Merge multiple JSON files into one file by using Python (stream twitter) - python

I've pulled data from Twitter. Currently, the data is spread across multiple files and I have not been able to merge it into one single file.
Note: all files are in JSON format.
The code I have used is here and here.
It has been suggested to work with glob to compile the JSON files.
I wrote this code, following some tutorials about merging JSON files with Python:
from glob import glob
import json
import pandas as pd

with open('Desktop/json/finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):  # reads all JSON files in the directory
        with open(fname) as j:
            f.write(str(j.read()))
            f.write('\n')
I successfully merged all the files into finalmerge.json.
Then I used this, as suggested in several threads:
df_lines = pd.read_json('finalmerge.json', lines=True)
df_lines
1000000*23 columns
What should I do to put each feature into a separate column?
I'm not sure what's wrong with the JSON files; I checked the merged file and found it is not valid JSON. What should I do to turn this into a data frame?
The reason I am asking is that I have very basic Python knowledge, and all the answers to similar questions that I have found are far more complicated than I can understand. Please help this new Python user convert multiple JSON files into one JSON file.

I think the problem is that your files are not really JSON (or rather, they are structured as JSONL). You have two ways of proceeding:
you could read every file as a text file and merge them line by line
you could convert them to JSON (add a square bracket at the beginning and end of the file, and a comma after every JSON element).
Try following this question and let me know if it solves your problem: Loading JSONL file as JSON objects
You can also try to edit your code this way:
with open('finalmerge.json', 'w') as f:
    for fname in glob('Desktop/json/*.json'):
        with open(fname) as j:
            f.write(str(j.read()))
            f.write('\n')
Every line will be a different JSON element.
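If the goal is the data frame from the question, here is a minimal sketch that skips the intermediate file entirely, assuming the same Desktop/json/*.json paths and that each file holds one JSON object per line:

import json
from glob import glob

import pandas as pd

records = []
for fname in glob('Desktop/json/*.json'):
    with open(fname) as j:
        for line in j:            # one JSON element per line (JSONL)
            line = line.strip()
            if line:              # skip blank lines
                records.append(json.loads(line))

df = pd.DataFrame(records)        # one column per top-level key

pd.DataFrame spreads each dict's top-level keys into columns, which should give you one feature per column.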

Related

How to automate the process of converting the list of .dat files, with their dictionaries (in separate .dct format), to pandas data frames?

The following code converts .dat files into data frames with the use of each file's dictionary in .dct format. It works well, but my problem is that I was unable to automate this process; creating a loop that takes the pairs of these files from lists is a little bit tricky, at least for me. I could really use some help with that.
try:
    from statadict import parse_stata_dict
except ImportError:
    !pip install statadict

import pandas as pd
from statadict import parse_stata_dict

dict_file = '2015_2017_FemPregSetup.dct'
data_file = '2015_2017_FemPregData.dat'

stata_dict = parse_stata_dict(dict_file)
stata_dict

nsfg = pd.read_fwf(data_file,
                   names=stata_dict.names,
                   colspecs=stata_dict.colspecs)
# nsfg is now a pandas DataFrame
These are the lists of files that I would like to convert into data frames. Every .dat file has its own dictionary file:
dat_name = ['2002FemResp.dat',
            '2002Male.dat'...
dct_name = ['2002FemResp.dct',
            '2002Male.dct'...
Assuming both lists have the same length and you want to save each data frame as a CSV, you could try:
c = 0
for dat, dct in zip(dat_name, dct_name):
    c += 1
    stata_dict = parse_stata_dict(dct)
    pd.read_fwf(dat,
                names=stata_dict.names,
                colspecs=stata_dict.colspecs).to_csv(r'path_name\file_name_{}.csv'.format(c))
    # don't forget the '.csv'!
Also consider that if you are not using Windows you need to use '/' rather than '\' in your path (or you can use os.path.join() to avoid this issue, as in the sketch below).
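A variant of the same loop using enumerate and os.path.join, assuming the dat_name/dct_name lists from the question and a hypothetical output folder out_dir:

import os

import pandas as pd
from statadict import parse_stata_dict

out_dir = 'converted'                  # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

for i, (dat, dct) in enumerate(zip(dat_name, dct_name), start=1):
    stata_dict = parse_stata_dict(dct)
    frame = pd.read_fwf(dat,
                        names=stata_dict.names,
                        colspecs=stata_dict.colspecs)
    # os.path.join builds a path that works on Windows and Unix alike
    frame.to_csv(os.path.join(out_dir, 'file_name_{}.csv'.format(i)))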

How to filter out usable data from csv files using python?

Please help me extract important data from a .csv file using Python. I got the .csv file from 'citrine'.
I want to extract the element names and atomic percentages in the form "Al2.5B0.02C0.025Co14.7Cr16.0Mo3.0Ni57.48Ti5.0W1.25Zr0.03".
ORIGINAL
[{""element"":""Al"",""idealAtomicPercent"":{""value"":""5.4""}},{""element"":""B"",""idealAtomicPercent"":{""value"":""0.02""}},{""element"":""C"",""idealAtomicPercent"":{""value"":""0.13""}},{""element"":""Co"",""idealAtomicPercent"":{""value"":""7.5""}},{""element"":""Cr"",""idealAtomicPercent"":{""value"":""6.1""}},{""element"":""Mo"",""idealAtomicPercent"":{""value"":""2.0""}},{""element"":""Nb"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ni"",""idealAtomicPercent"":{""value"":""61.0""}},{""element"":""Re"",""idealAtomicPercent"":{""value"":""0.5""}},{""element"":""Ta"",""idealAtomicPercent"":{""value"":""9.0""}},{""element"":""Ti"",""idealAtomicPercent"":{""value"":""1.0""}},{""element"":""W"",""idealAtomicPercent"":{""value"":""5.8""}},{""element"":""Zr"",""idealAtomicPercent"":{""value"":""0.13""}}]
Original CSV
Expected output
Without having the file structure it is hard to tell.
Try to load the file using:
import csv

with open(file_path) as file:
    reader = csv.DictReader(...)
You will have to figure out the arguments for the function; they depend on the file.
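Once the composition field is out of the CSV, the doubled quotes in the ORIGINAL sample above are just CSV escaping; after csv.DictReader has undone them, the field is plain JSON. A minimal sketch of building the requested string, using a shortened, hypothetical sample of that field:

import json

# Shortened sample of one composition field, as it looks after CSV parsing
field = ('[{"element":"Al","idealAtomicPercent":{"value":"5.4"}},'
         '{"element":"B","idealAtomicPercent":{"value":"0.02"}}]')

composition = json.loads(field)
formula = ''.join(e['element'] + e['idealAtomicPercent']['value']
                  for e in composition)
print(formula)  # Al5.4B0.02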

How to create an array of several .txt files

I am trying to create some plots from some data in my research lab. Each data file is saved into tab-delimited text files.
My goal is to write a script that can read the text file, add the columns within each file to an array, and then eventually slice through the array at different points to create my plots.
My problem is that I'm struggling with just starting the script. Rather than hardcode each txt file to be added to the same array, is there a way to loop over each file in my directory to add the necessary files to the array, then slice through them?
I apologize if my question is not clear; I am new to Python and it is quite a steep learning curve for me. I can try to clear up any confusion if what I am asking doesn't make sense.
I am also using Canopy to write my script if this matters.
You could do something like:
from csv import DictReader  # CSV reader can be used as a TSV reader
from glob import iglob

readers = []
for path in iglob("*.txt"):
    reader = DictReader(open(path), delimiter='\t')
    readers.append(reader)
glob.iglob("*.txt") returns an iterator over all files with extension .txt in the current working directory.
csv.DictReader reads a CSV file as an iterator of dicts. A tab-delimited text file is the same thing with a different delimiter.
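If what you actually need is a numeric array you can slice, numpy may fit better than dicts; here is a minimal sketch, assuming every file is tab-delimited, purely numeric, shares the same columns, and has one header line to skip:

from glob import iglob

import numpy as np

# Stack every file's rows into one 2-D array
arrays = [np.loadtxt(path, delimiter='\t', skiprows=1) for path in iglob('*.txt')]
data = np.vstack(arrays)

first_column = data[:, 0]   # slice a column across all files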

How to create a hierarchical csv file?

I have invoice data for N invoices in Excel and I want to create a CSV of that file so that it can be imported whenever needed. How can I achieve this?
Here is a screenshot:
Assuming you have a folder "excel" full of Excel files within your project directory, and another folder "csv" where you intend to put the generated CSV files, you can easily batch-convert all the Excel files in the "excel" directory to "csv" using Pandas.
It is assumed that you already have Pandas installed on your system; otherwise, install it via pip install pandas. The commented snippet below illustrates the process:
import os

import pandas as pd

# Folders as described above: Excel sources in "excel", CSV output in "csv"
excelDir = "excel"
csvDir = "csv"

# Loop through excelDir; at each iteration, check whether the current file
# is an Excel file, and if it is, convert it to CSV and save it:
for fileName in os.listdir(excelDir):
    # Do we have an Excel file?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # If we do, we do the conversion using pandas
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # Read in the Excel file...
        dFrame = pd.read_excel(targetXLFile)
        # ...then save the data to CSV
        dFrame.to_csv(targetCSVFile)
Hope this does the trick for you. Cheers and good luck.
Instead of putting the total output into one CSV, you could follow these steps (a sketch is appended below):
Convert your Excel content to CSV files or CSV objects.
Tag each object with its invoice id and save it into a dictionary, with a structure like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
Write a custom function that reads a csv-object and gives you name, product-id, qty, etc.
Hope this helps.
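A minimal sketch of that dictionary approach, assuming one worksheet per invoice in a hypothetical invoices.xlsx and hypothetical column names name, product-id, and qty:

import pandas as pd

# sheet_name=None returns a dict of DataFrames keyed by worksheet name;
# here each worksheet is assumed to hold one invoice
invoices = pd.read_excel('invoices.xlsx', sheet_name=None)

def get_line_items(invoice_id):
    # Return the name, product-id and qty columns for one invoice
    frame = invoices[invoice_id]
    return frame[['name', 'product-id', 'qty']]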

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to read_csv I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', and I even tried the csv reader; nothing works with this file.
The csv file is totally fine; I checked it and I see nothing wrong with it.
In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, not a plain-text CSV, which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to convert the Excel file to a CSV file (note: if you have multiple sheets, each sheet should be converted to its own CSV file) and then read those.
Alternatively, you can use read_excel, or a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for more information.
Use read_excel instead of read_csv for Excel files:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")
I encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution that bypasses pandas' read functions entirely: the pickle module.
pickle is part of Python's standard library, so there is nothing extra to install.
First, write your data with the code below:
import pickle

with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle

with open(path, 'rb') as infile:
    data = pickle.load(infile)
Note that if you want to read your saved data with a different Python version than the one you used to save it, you can specify that in the writing step with protocol=x, where x corresponds to the version (2 or 3) you intend to use for reading.
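For example, a minimal sketch of writing with an explicit protocol, using the same hypothetical path and variable names as above:

import pickle

# protocol=2 keeps the file readable from Python 2 as well
with open(path, 'wb') as output:
    pickle.dump(variable_to_save, output, protocol=2)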
I hope this can be of any use.
