How to read bz2 files into dataframes using pyspark? - python

I can read a JSON file into a DataFrame in PySpark using:
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.json("path to json file")
However, when I try to read a bz2 file (a compressed CSV) into a DataFrame, it gives me an error. I am using:
spark = SparkSession.builder.appName('GetDetails').getOrCreate()
df = spark.read.load("path to bz2 file")
Could you please help correct me?

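The method spark.read.load() has an optional parameter format, which defaults to 'parquet', so Spark tries to parse your bz2 file as Parquet and fails.
For your code to work, pass the format explicitly:
df = spark.read.load("data.json.bz2", format="json")
spark.read.json also works directly on compressed JSON files, e.g.:
df = spark.read.json("data.json.bz2")
Since your file is a compressed CSV, use format="csv" (or spark.read.csv) instead; Spark decompresses .bz2 files transparently. Here header=True assumes the first line holds the column names:
df = spark.read.csv("data.csv.bz2", header=True)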
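Because the right reader depends on what the archive contains, it can help to look at the first few decompressed lines before choosing a format. The following is a minimal sketch using only Python's standard library, not Spark; the helper name peek_bz2_lines, the file name sample.csv.bz2 and the sample rows are made up for illustration.

```python
import bz2
import csv
import tempfile
from pathlib import Path


def peek_bz2_lines(path, n=3):
    """Return the first n decompressed text lines of a .bz2 file."""
    lines = []
    with bz2.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        for line in handle:
            lines.append(line.rstrip("\r\n"))
            if len(lines) >= n:
                break
    return lines


# Build a tiny bz2-compressed CSV (made-up data) so the sketch is self-contained.
sample_path = Path(tempfile.mkdtemp()) / "sample.csv.bz2"
with bz2.open(sample_path, mode="wt", encoding="utf-8", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerows([["id", "name"], [1, "alice"], [2, "bob"]])

first_lines = peek_bz2_lines(sample_path)
print(first_lines)
```

If the lines look like comma-separated values, format="csv" is the one to pass to Spark; if each line is a JSON object, format="json" fits.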

Related

How to convert CSV to parquet file without RLE_DICTIONARY encoding?

I've already tested three ways of converting a CSV file to a Parquet file. You can find them below. All three created the Parquet file. I've tried to view the contents of the Parquet file using "APACHE PARQUET VIEWER" on Windows, and I always got the following error message:
"encoding RLE_DICTIONARY is not supported"
Is there any way to avoid this?
Maybe a way to use another type of encoding?...
Below the code:
1º Using pandas:
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet")
2º Using pyarrow:
from pyarrow import csv, parquet
table = csv.read_csv("filename.csv")
parquet.write_table(table, "filename.parquet")
3º Using dask:
from dask.dataframe import read_csv
dask_df = read_csv("filename.csv", dtype={'column_xpto': 'float64'})
dask_df.to_parquet("filename.parquet")
You should set use_dictionary to False (with the pyarrow engine, pandas forwards this keyword to pyarrow's write_table):
import pandas as pd
df = pd.read_csv("filename.csv")
df.to_parquet("filename.parquet", use_dictionary=False)

Converting .CIF files to a dataset (csv, xls, etc)

How are you all? Hope you're doing well!
So, get this. I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from the Cambridge Structural Database would do wonders for me.
So far, I was able to convert them using ToposPro, but not to a format that I can use with Pandas readTo.
So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.
Also, sorry for any mistakes in my writing; English isn't my main language. And thanks for your help!
To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from biopython to first parse the .CIF file into a dictionary. Then use pandas.DataFrame.from_dict to create a DataFrame from that dictionary. Finally, call pandas.DataFrame.transpose to turn rows into columns (we pass orient='index' to from_dict so that entries of different lengths are padded with missing values).
You need to install biopython by executing this line in your (Windows) terminal :
pip install biopython
Then, you can use the code below to read a specific .CIF file :
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
>>> display(df)
Now, if you need to read the whole MOF collection (~10k files) as a single DataFrame, you can use this:
from pathlib import Path
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from time import time
mof_collection = r"path_to_the_MOF_collection"
start = time()
list_of_cif = []
for file in Path(mof_collection).glob('*.cif'):
    dico = MMCIF2Dict(file)
    temp = pd.DataFrame.from_dict(dico, orient='index')
    temp = temp.transpose()
    temp.insert(0, 'Filename', Path(file).stem)  # to get the .CIF filename
    list_of_cif.append(temp)
df = pd.concat(list_of_cif)
end = time()
print(f'The DataFrame of the MOF Collection was created in {end-start} seconds.')
df
I'm sure you're aware that the .CIF files may have a different number of columns, so feel free to concat (or not) the MOF collection. And last but not least, if you want to get a .csv and/or an .xlsx file of your DataFrame, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:
df.to_csv('your_output_filename.csv', index=False)
df.to_excel('your_output_filename.xlsx', index=False)
EDIT :
To read the structure of a .CIF file as a DataFrame, you can use the as_dataframe() method from pymatgen:
from pymatgen.io.cif import CifParser
parser = CifParser("abavij_P1.cif")
structure = parser.get_structures()[0]
structure.as_dataframe()
In case you need to check if a .CIF file has a valid structure, you can use :
if len(structure) == 0:
    print('The .CIF file has no structure')
Or:
try:
    structure = parser.get_structures()[0]
except Exception:
    print('The .CIF file has no structure')

Convert JSON file to CSV file with properly formatted rows and columns in Excel

I'm currently working on a script that converts a JSON file to CSV. The script works, but I need the output to be properly formatted, with rows and columns, when the converted file is opened in Excel. What do I need to add or modify in my script?
import pandas as pd
df = pd.read_json (r'/home/admin/myfile.json')
df.to_csv (r'/home/admin/xml/myfileSample.csv', index = None, sep=":")
Taking your code as a reference, you can try:
df.to_csv(r'/home/admin/xml/myfileSample.csv', encoding='utf-8', header=True, index=None, sep=":")
This could be useful.
import pandas as pd
df_json=pd.read_json("input_file.json")
df_json.head()
df_json.to_csv("output_file.csv",index=False)
Your code is fine. Just change to_csv to to_excel, give the output file an .xlsx extension, and drop the sep argument (to_excel does not accept one):
import pandas as pd
df = pd.read_json(r'/home/admin/myfile.json')
df.to_excel(r'/home/admin/xml/myfileSample.xlsx', index=False)
Learn more about the to_excel function of pandas here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
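One possible reason the CSV does not split into columns in Excel is the ':' separator, since Excel usually expects a comma (or a semicolon in some locales). As a rough sketch of the conversion without pandas, assuming the JSON is a list of flat objects, the standard library can do the same job; the function name json_records_to_csv and the sample records are hypothetical.

```python
import csv
import io
import json


def json_records_to_csv(json_text, delimiter=","):
    """Convert a JSON array of flat objects to CSV text."""
    records = json.loads(json_text)

    # Collect every key that appears, in first-seen order, as the header row.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record become empty cells
    return buffer.getvalue()


sample = '[{"name": "alice", "age": 30}, {"name": "bob", "city": "Oslo"}]'
csv_text = json_records_to_csv(sample)
print(csv_text)
```

Passing delimiter=";" instead is worth trying if Excel on your machine still shows everything in one column.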

How to read the data with special characters in xlsx file using pandas dataframe?

I want to read the xlsx file in the pandas data frame and perform some operations on the data. I am able to read the file with the command:
df = pd.read_excel('file.xlsx')
but when I try to perform some operations on the data, I get the following error:
ValueError: could not convert string to float: 'disc abc r14jt mt cxp902 5 r2eu fail'
How can I resolve this problem? I already tried encoding='utf-8', but I still get the error.
Actually, I have one xlsx file, 'original.xlsx'. I filter some data from that file and save it as 'file.xlsx' with the command below:
original.to_excel("file.xlsx",index=False,header=['a','b','c'],engine='xlsxwriter')
Now, when I try to read 'file.xlsx' and perform some operations on it, I get that error. Is there an issue in the way I am saving the file or reading it?
xl_file = pd.ExcelFile(file_name)
dfs = {sheet_name: xl_file.parse(sheet_name)
       for sheet_name in xl_file.sheet_names}
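The error means that a column expected to be numeric contains text. A small sketch like the following, using a hypothetical helper find_non_numeric and made-up sample values, can show which entries fail the conversion.

```python
def find_non_numeric(values):
    """Return (position, value) pairs that float() cannot convert."""
    bad = []
    for position, value in enumerate(values):
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((position, value))
    return bad


# Made-up column mixing numeric strings, free text and an empty cell.
column = ["1.5", "2", "disc abc r14jt mt cxp902 5 r2eu fail", "", "7e3"]
bad_entries = find_non_numeric(column)
print(bad_entries)
```

With a DataFrame, the same check could be run on one column at a time, for example on df['a'].tolist().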
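You can try (note that the encoding argument is only accepted by older pandas versions; current versions of read_excel reject it):
import pandas as pd
df = pd.read_excel('file.xlsx', encoding='latin1')
If a float column is written with a decimal comma, e.g. a = "3.300,144", you should do the following:
a = a.replace(".", "")
a = a.replace(",", ".")
float(a)
Output:
3300.144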
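The same replacement can be wrapped in a small helper. This is only a sketch, assuming every value uses '.' as the thousands separator and ',' as the decimal mark; the name parse_decimal_comma is made up for illustration.

```python
def parse_decimal_comma(text):
    """Convert a number written like '3.300,144' to a float."""
    # Drop thousands separators first, then turn the decimal comma into a dot.
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return float(cleaned)


value = parse_decimal_comma("3.300,144")
print(value)  # 3300.144
```

Values that already use a dot as the decimal mark would be misread by this helper, so it should only be applied to columns known to follow the comma convention.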

How to write pandas dataframe into xlsb file in python pandas

I need to write pandas dataframe (df_new, in my case) into an xlsb file which has some formulas. I am stuck on the code below and do not know what to do next:
with open_workbook('NiSource SLA_xlsb.xlsb') as wb:
    with wb.get_sheet("SL Dump") as sheet:
Can anyone suggest how to write a dataframe into an xlsb file?
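You could try reading the xlsb file as a dataframe and then concatenating the two. Note that pandas can read .xlsb (through the pyxlsb engine) but cannot write it, so the result is saved as .xlsx, and formulas in the original workbook are not carried over:
import pandas as pd
# Reading .xlsb requires the pyxlsb package
originaldf = pd.read_excel('./NiSource SLA_xlsb.xlsb', engine='pyxlsb')
# Append df_new (the dataframe from the question) below the existing rows
existingdf = pd.concat([originaldf, df_new]).reset_index(drop=True)
existingdf.to_excel(r'PATH\filename.xlsx')
Change the path to wherever you want the output to go and change filename to the name you want for the output. Let me know if this works.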
