Converting .CIF files to a dataset (CSV, XLS, etc.) - python

How are you all? Hope you're doing well!
So, get this: I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from Cambridge's Structural Database would do wonders for me.
So far, I was able to convert them using ToposPro, but not to a format that I can read with one of pandas' read functions.
So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.
Also, sorry for any mishap with my writing; English isn't my first language. And thanks for your help!

To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from Biopython to first parse the .CIF file and return a dictionary. Then, use pandas.DataFrame.from_dict to create a DataFrame from that dictionary. Finally, call pandas.DataFrame.transpose to turn the rows into columns (since we'll pass orient='index' to from_dict so it can deal with "missing" values).
You need to install Biopython by executing this line in your (Windows) terminal:
pip install biopython
Then, you can use the code below to read a specific .CIF file:
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
display(df)
Now, if you need to read the whole MOF collection (~10k files) as a DataFrame, you can use this:
from pathlib import Path
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from time import time

mof_collection = r"path_to_the_MOF_collection"
start = time()
list_of_cif = []
for file in Path(mof_collection).glob('*.cif'):
    dico = MMCIF2Dict(file)
    temp = pd.DataFrame.from_dict(dico, orient='index')
    temp = temp.transpose()
    temp.insert(0, 'Filename', file.stem)  # keep the source .CIF filename
    list_of_cif.append(temp)
df = pd.concat(list_of_cif)
end = time()
print(f'The DataFrame of the MOF Collection was created in {end - start} seconds.')
df
I'm sure you're aware that the .CIF files may have different numbers of columns, so feel free to concat (or not) the MOF collection; the short sketch after the export lines shows how concat handles mismatched columns. And last but not least, if you want a .csv and/or an .xlsx file of your DataFrame, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:
df.to_csv('your_output_filename.csv', index=False)
df.to_excel('your_output_filename.xlsx', index=False)
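On that mismatched-columns point, here is a minimal sketch of how pd.concat behaves with frames that don't share all their columns (the CIF tags used as column names are only for illustration):
import pandas as pd

# two tiny frames with partially overlapping columns (illustrative CIF tags)
a = pd.DataFrame({'_cell_length_a': [10.1], '_symmetry_cell_setting': ['triclinic']})
b = pd.DataFrame({'_cell_length_a': [12.3], '_cell_length_b': [7.9]})

# concat aligns on column names and fills anything missing with NaN
combined = pd.concat([a, b], ignore_index=True)
print(combined)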
EDIT:
To read the structure of a .CIF file as a DataFrame, you can use the as_dataframe() method from pymatgen:
from pymatgen.io.cif import CifParser
parser = CifParser("abavij_P1.cif")
structure = parser.get_structures()[0]
structure.as_dataframe()
In case you need to check whether a .CIF file has a valid structure, you can use:
if len(structure) == 0:
    print('The .CIF file has no structure')
Or:
try:
    structure = parser.get_structures()[0]
except Exception:
    print('The .CIF file has no structure')
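And if you want the per-site tables for the whole collection in one file, a minimal sketch built on the same CifParser/as_dataframe calls shown above (the folder path, the 'Filename' column, and the output name are assumptions for illustration):
from pathlib import Path
import pandas as pd
from pymatgen.io.cif import CifParser

frames = []
for file in Path(r"path_to_the_MOF_collection").glob('*.cif'):
    try:
        structure = CifParser(str(file)).get_structures()[0]
    except Exception:
        continue  # skip files without a parsable structure
    temp = structure.as_dataframe()
    temp.insert(0, 'Filename', file.stem)  # tag each site with its source file
    frames.append(temp)

sites_df = pd.concat(frames, ignore_index=True)
sites_df.to_csv('mof_sites.csv', index=False)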

Related

How to automate the process of converting a list of .dat files, with their dictionaries (in separate .dct format), to pandas data frames?

The following code converts .dat files into data frames with the use of their dictionary files in .dct format. It works well. But my problem is that I was unable to automate this process; creating a loop that takes the pairs of these files from lists is a little bit tricky, at least for me. I could really use some help with that.
try:
    from statadict import parse_stata_dict
except ImportError:
    !pip install statadict

import pandas as pd
from statadict import parse_stata_dict

dict_file = '2015_2017_FemPregSetup.dct'
data_file = '2015_2017_FemPregData.dat'

stata_dict = parse_stata_dict(dict_file)
stata_dict

nsfg = pd.read_fwf(data_file,
                   names=stata_dict.names,
                   colspecs=stata_dict.colspecs)
# nsfg is now a pandas DataFrame
These are the lists of files that I would like to convert into data frames. Every .dat file has its own dictionary file:
dat_name = ['2002FemResp.dat',
'2002Male.dat'...
dct_name = ['2002FemResp.dct',
'2002Male.dct'...
Assuming both lists have the same length and you will want to save the CSV dataframes, you could try:
c = 0
for dat, dct in zip(dat_name, dct_name):
    c += 1
    stata_dict = parse_stata_dict(dct)
    pd.read_fwf(dat,
                names=stata_dict.names,
                colspecs=stata_dict.colspecs).to_csv(r'path_name\file_name_{}.csv'.format(c))
    # don't forget the '.csv'!
Also consider that if you are not using Windows you need to use '/' rather than '\' in your path (or you can use os.path.join() to avoid this issue entirely).
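For the record, a minimal sketch of that os.path.join() variant (the 'path_name' output directory is just a stand-in):
import os

out_dir = 'path_name'  # hypothetical output directory
for c, (dat, dct) in enumerate(zip(dat_name, dct_name), start=1):
    stata_dict = parse_stata_dict(dct)
    out_file = os.path.join(out_dir, 'file_name_{}.csv'.format(c))  # portable on any OS
    pd.read_fwf(dat, names=stata_dict.names, colspecs=stata_dict.colspecs).to_csv(out_file)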

I need to remove an entire JSON blob/object if it contains the file name/path from an external file which contains known duplicates

How would I strip out strings that start with: {"filename": "\\network\test\etc\file0001.tif and end with }]}]}
The lengths of the objects differ depending on the size and content of the files.
I'm starting to figure out dataframes/pandas in Python, and I don't understand general JSON structure yet.
import pandas as pd
df = pd.read_json('Filelist.json')
# ColA in the index = "filename" (need help here)
dups = pd.read_csv('Deleted_Duplicates.csv')
df_final = df.loc[~df.ColA.isin(dups.Duplicates), :]
df_final.to_json('Filelist_NoDupes.json')
I would expect I could ignore which column the filename is in, using the external list to strip out entire rows/objects and output the new file.
You will need to figure out the proper escaping because you didn't provide a working example to test, but it will be something like this:
df_final = df.loc[~df.ColA.str.match(pat='\{"filename"\: "\\\\network\\test\\etc\\file0001.tif.*\}\]\}\]\}'), :]
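If hand-escaping gets unwieldy, an alternative is to let re.escape build the literal part of the pattern for you; a minimal sketch, assuming the column really holds those serialized strings:
import re

prefix = r'{"filename": "\\network\test\etc\file0001.tif'  # the literal text the rows start with
suffix = '}]}]}'                                           # the literal text the rows end with
pattern = re.escape(prefix) + '.*' + re.escape(suffix)     # escape the specials, keep .* as a wildcard
df_final = df.loc[~df.ColA.str.match(pattern), :]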

Applying the same operations on multiple .csv files in pandas

I have six .csv files. Their overall size is approximately 4 GB. I need to clean each one and do some data analysis tasks on each. These operations are the same for all the frames.
This is my code for reading them.
#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")
Each time I run the kernel, I activate one of the files to be read.
I am looking for a more elegant way to do this. I thought about doing a for loop: making a list of file names and then reading them one after the other. But I don't want to merge them together, so I think another approach must exist. I have been searching for it, but it seems all the questions lead to concatenating the files read at the end.
Use for and format like this; I use this every single day:
number_of_files = 6
for i in range(1, number_of_files + 1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))
    # your code here; do the analysis, then the loop will return and read the next dataframe
You could use a list to hold all of the dataframes:
number_of_files = 6
dfs = []
for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))  # f-strings need Python >= 3.6; use .format() otherwise
Then to get a certain dataframe use:
df1 = dfs[0]
Edit:
As you are trying to keep from loading all of these in memory, I'd resort to streaming them. Try changing the for loop to something like this:
import csv

readers = []
for file_num in range(1, number_of_files + 1):
    # keep the file handle open so the reader can stream from it later
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", 'r', newline='')
    readers.append(csv.reader(f))
Then just use a for loop over readers[n], or next(readers[n]), to read each line into memory.
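For instance, a quick usage sketch of the streaming readers above:
header = next(readers[0])     # first line of the first file (the column names)
first_row = next(readers[0])  # the first data row
for row in readers[0]:        # stream the remaining rows one at a time
    pass  # do your per-row cleaning here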
P.S.
You may need multi-threading to iterate through each one at the same time.
Loading/editing/saving - using the csv module
OK, so I've done a lot of research: Python's csv module does load one line at a time, most likely because of the mode we open the file in (explained here).
If you don't want to use pandas (chunking may honestly be the answer; just implement that into seralouk's answer if so), then yes! The below is, in my mind, the best approach; we just need to change a couple of things.
import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(str(file_num).zfill(2)), 'r', newline='') as f, \
         open(filename.format(str(file_num).zfill(2) + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" (THIS IS PER-LINE, REMEMBER)
            # then save the cleaned row to the new file
            w.writerow(row)
Note:
You may want to consider using a DictReader and/or DictWriter; I prefer them over the regular reader/writer as I find them easier to understand (a sketch follows below).
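A minimal sketch of the DictReader/DictWriter variant for one file (the example column in the comment is a guess; adapt it to your data):
import csv

with open('yellow_tripdata_2018-01.csv', 'r', newline='') as f, \
     open('yellow_tripdata_2018-01-new.csv', 'w', newline='') as nf:
    reader = csv.DictReader(f)                                 # rows come back as dicts keyed by header
    writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)  # reuse the same columns
    writer.writeheader()
    for row in reader:
        # clean individual fields by name, e.g. row['passenger_count'] (hypothetical column)
        writer.writerow(row)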
Pandas approach - using chunks
PLEASE READ this answer if you'd like to steer away from my csv approach and stick with pandas :) It literally seems like it's the same issue as yours, and the answer is what you're asking for.
Basically, pandas allows you to partially load a file as chunks, execute any alterations, and then write those chunks to a new file. The below is largely from that answer, but I did do some more reading up myself in the docs.
import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"
for file_num in range(1, number_of_files + 1):
    in_name = filename.format(str(file_num).zfill(2))
    out_name = filename.format(str(file_num).zfill(2) + "-new")
    for i, chunk in enumerate(pd.read_csv(in_name, chunksize=chunksize)):
        # do your data cleaning on the chunk here
        # append mode builds the new file chunk by chunk; write the header only once
        chunk.to_csv(out_name, mode='a', index=False, header=(i == 0))
For more info on chunking the data, see here as well; it's good reading for anyone, such as yourself, getting headaches over these memory issues.
Use glob.glob to get all files with similar names:
import glob

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f)
This will match yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv or even yellow_tripdata_*.csv to match all csv files that start with yellow_tripdata.
Note that this also only loads one file at a time.
Use os.listdir() to make a list of files you can loop through?
import os

samplefiles = os.listdir(filepath)
for filename in samplefiles:
    df = pd.read_csv(os.path.join(filepath, filename))  # join with the directory so the path resolves
where filepath is the directory containing multiple csv's?
Or a loop that changes the filename:
for i in range(1, 7):
    df = pd.read_csv("yellow_tripdata_2018-0%s.csv" % i)
# import libraries
import pandas as pd
import glob

# store the folder path in a variable (a raw string can't end with a backslash)
project_folder = r"C:\file_path"

# save all file paths in a variable
all_file_paths = glob.glob(project_folder + "/*.csv")

# use a list comprehension to read every file into a list of dataframes
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_file_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)

How to tag records with the filename when importing multiple csv files into a pandas dataframe?

I have a set of csv files I need to import into a pandas dataframe.
I have imported the file paths as a list, FP, and I am using the following code to read the data:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
This is working great, but unfortunately there are no datetime stamps or file-identifying attributes in the files. I need to know which file each record came from.
I tried adding this line, but it just returned the filename of the final file read:
for i in FP:
    df = pd.read_csv(i, index_col=None, header=0).append(df)
    df['filename'] = i
I can imagine some messy multi-step solutions, but wondered if there was something more elegant I could do within my existing loop.
I'd do it this way:
df = pd.concat([pd.read_csv(f, header=None).assign(filename=f) for f in FP],
               ignore_index=True)
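If you'd rather store just the file's name than its full path, a small variation on the same one-liner (using pathlib; same 'filename' column):
from pathlib import Path
import pandas as pd

# Path(f).name keeps only the final path component as the tag
df = pd.concat([pd.read_csv(f, header=None).assign(filename=Path(f).name) for f in FP],
               ignore_index=True)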

How to create a hierarchical csv file?

I have the following data for N invoices in Excel, and I want to create a CSV of that file so that it can be imported whenever needed. So how can I achieve this?
Here is a screenshot: (invoice spreadsheet screenshot not reproduced here)
Assuming you have a folder "excel" full of Excel files within your project directory, and another folder "csv" where you intend to put the generated CSV files, you can quite easily batch-convert all the Excel files in the "excel" directory into "csv" using pandas.
It is assumed that you already have pandas installed on your system; otherwise, you can install it via: pip install pandas. The fairly commented snippet below illustrates the process:
# IMPORT PANDAS AND OS
import pandas as pd
import os

# THE TWO DIRECTORIES DESCRIBED ABOVE
excelDir = "excel"
csvDir = "csv"

# OUR GOAL IS:
# LOOP THROUGH THE FOLDER: excelDir.....
# AT EACH ITERATION IN THE LOOP, CHECK IF THE CURRENT FILE IS AN EXCEL FILE;
# IF IT IS, SIMPLY CONVERT IT TO CSV AND SAVE IT:
for fileName in os.listdir(excelDir):
    # DO WE HAVE AN EXCEL FILE?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # IF WE DO, THEN WE DO THE CONVERSION USING PANDAS...
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, os.path.splitext(fileName)[0] + ".csv")
        # NOW, WE READ "IN" THE EXCEL FILE
        dFrame = pd.read_excel(targetXLFile)
        # ONCE WE'RE DONE READING, WE CAN SIMPLY SAVE THE DATA TO CSV
        dFrame.to_csv(targetCSVFile, index=False)
Hope this does the trick for you.
Cheers and good luck!
Instead of putting the total output into one CSV, you could go with the following steps (see the sketch after this list):
1. Convert your Excel content to CSV files or CSV objects.
2. Tag each object with its invoice id and save it into a dictionary. Your dictionary data structure could be like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
3. Write a custom function which reads your csv-object and gives you name, product-id, qty, etc.
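A minimal sketch of that dictionary-of-CSV-objects idea, assuming the spreadsheet has an 'invoice_id' column plus the fields mentioned above (the filename and all column names are guesses, since the screenshot isn't available):
from io import StringIO
import pandas as pd

# read the invoice spreadsheet (hypothetical filename and columns)
invoices = pd.read_excel("invoices.xlsx")

# step 2: one CSV object (string) per invoice id, keyed by that id
csv_by_invoice = {
    invoice_id: group.to_csv(index=False)
    for invoice_id, group in invoices.groupby("invoice_id")
}

# step 3: a custom function that reads a csv-object back and gives you the fields
def get_invoice_rows(invoice_id):
    return pd.read_csv(StringIO(csv_by_invoice[invoice_id]))

# e.g. get_invoice_rows("INV-001")[["name", "product_id", "qty"]]  # hypothetical id/columns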
Hope this helps.
