I have a rather large fixed-width file (~30M rows, 4 GB). When I tried to create a DataFrame with pandas read_fwf(), it loaded only a portion of the file. Has anyone had a similar issue with this parser not reading the entire contents of a file?
import pandas as pd
file_name = r"C:\....\file.txt"
fwidths = [3,7,9,11,51,51]
df = pd.read_fwf(file_name, widths = fwidths, names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
print(df.shape) #<30M
If I naively read the file into one column using read_csv(), the whole file is read into memory and there is no data loss.
import pandas as pd
file_name = r"C:\....\file.txt"
df = pd.read_csv(file_name, delimiter = "|", names = ['col0']) #arbitrary delimiter (the file doesn't include pipes)
print(df.shape) #~30M
Of course, without seeing the contents or format of the file it could be related to something on my end, but wanted to see if anyone else has had any issues with this in the past. I did a sanity check and tested a couple of the rows deep in the file and they all seem to be formatted correctly (further verified when I was able to pull this into an Oracle DB with Talend using the same specs).
Let me know if anyone has any ideas, it would be great to run everything via Python and not go back and forth when I begin to develop analytics.
A few lines of the input file would be useful to see what the data looks like. Nevertheless, I generated a random file in a format similar (I think) to yours and applied pd.read_fwf to it. This is the code for generating and reading it:
from random import random
import pandas as pd
file_name = r"/tmp/file.txt"
lines_no = int(30e6)
with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i%int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                  .format(float(i) / float(lines_no)*100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random()*10) for v in range(6)])+"\n")
print("File created. Now read it using pd.read_fwf ...")
fwidths = [11,11,11,11,11,11]
df = pd.read_fwf(file_name, widths = fwidths,
                 names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
#print(df)
print(df.shape) #<30M
So in this case, it seems to be working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf, but it works, at least for me and my setup.
The result is : (30000000, 6)
Example file created:
7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
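If the row count still comes up short on your real file, one way to narrow it down is to compare the raw line count with the number of rows read_fwf() returns. This is only a minimal sketch, assuming the file has no header row; the helper names count_lines and count_fwf_rows and the tiny demo file are made up for illustration:

```python
import os
import tempfile

import pandas as pd


def count_lines(path):
    """Count physical lines without loading the whole file into memory."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def count_fwf_rows(path, widths, names, chunksize=100_000):
    """Count the rows parsed by read_fwf, reading `chunksize` rows at a time."""
    total = 0
    # dtype=str skips type inference; chunksize makes read_fwf return an iterator
    for chunk in pd.read_fwf(path, widths=widths, names=names,
                             dtype=str, chunksize=chunksize):
        total += len(chunk)
    return total


# Tiny demo file in the same layout as the generated example (hypothetical data)
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as f:
    for i in range(5):
        f.write(" ".join("{:<10.8f}".format(i + 0.5) for _ in range(6)) + "\n")

names = ["col0", "col1", "col2", "col3", "col4", "col5"]
n_lines = count_lines(demo_path)
n_rows = count_fwf_rows(demo_path, [11] * 6, names, chunksize=2)
print(n_lines, n_rows)
```

If the two numbers differ on the real file, lowering chunksize and printing the running total inside the loop helps locate where rows start to go missing.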
How are you all? Hope you're doing well!
So, get this. I need to convert some .CIF files (found here: https://www.ccdc.cam.ac.uk/support-and-resources/downloads/ - MOF Collection) to a format that I can use with pandas, such as CSV or XLS. I'm researching the use of MOFs for hydrogen storage, and this collection from the Cambridge Structural Database would do wonders for me.
So far, I was able to convert them using ToposPro, but not to a format that I can read with pandas.
So, do any of you know of a way to do this? I've also read about pymatgen and matminer, but I've never used them before.
Also, sorry for any mishap with my writing, English isn't my main language. And thanks for your help!
To read a .CIF file as a pandas DataFrame, you can use the Bio.PDB.MMCIF2Dict module from biopython to first parse the .CIF file into a dictionary. Then use pandas.DataFrame.from_dict to create a DataFrame from that dictionary. Finally, call pandas.DataFrame.transpose to turn the rows into columns (we pass orient='index' to from_dict so that entries of different lengths are padded with missing values).
You need to install biopython by executing this line in your (Windows) terminal :
pip install biopython
Then, you can use the code below to read a specific .CIF file :
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
dico = MMCIF2Dict(r"path_to_the_MOF_collection\abavij_P1.cif")
df = pd.DataFrame.from_dict(dico, orient='index')
df = df.transpose()
>>> display(df)
Now, if you need to read the whole MOF collection (~10k files) as a DataFrame, you can use this:
from pathlib import Path
import pandas as pd
from Bio.PDB.MMCIF2Dict import MMCIF2Dict
from time import time
mof_collection = r"path_to_the_MOF_collection"
start = time()
list_of_cif = []
for file in Path(mof_collection).glob('*.cif'):
    dico = MMCIF2Dict(file)
    temp = pd.DataFrame.from_dict(dico, orient='index')
    temp = temp.transpose()
    temp.insert(0, 'Filename', Path(file).stem) #to get the .CIF filename
    list_of_cif.append(temp)
df = pd.concat(list_of_cif)
end = time()
print(f'The DataFrame of the MOF Collection was created in {end-start} seconds.')
df
>>> output
I'm sure you're aware that the .CIF files may have a different number of columns. So, feel free to concat (or not) the MOF collection. And last but not least, if you want to get a .csv and/or an .xlsx file of your DataFrame, you can use either pandas.DataFrame.to_csv or pandas.DataFrame.to_excel:
df.to_csv('your_output_filename.csv', index=False)
df.to_excel('your_output_filename.xlsx', index=False)
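To illustrate what concatenating files with different columns does, here is a small pandas-only sketch. The two frames stand in for two parsed .CIF files; the values in them are made up:

```python
import pandas as pd

# Two one-row frames with partly different columns (hypothetical values)
a = pd.DataFrame({"Filename": ["file_a"], "_cell_length_a": ["10.1"]})
b = pd.DataFrame({"Filename": ["file_b"], "_cell_length_b": ["12.3"]})

# concat keeps the union of the columns; cells with no value become NaN
combined = pd.concat([a, b], ignore_index=True)
print(combined)
```

With thousands of files the union of columns can get very wide, which is one reason you may prefer to keep the per-file frames separate.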
EDIT :
To read the structure of a .CIF file as a DataFrame, you can use pymatgen's as_dataframe() method:
from pymatgen.io.cif import CifParser
parser = CifParser("abavij_P1.cif")
structure = parser.get_structures()[0]
structure.as_dataframe()
>>> output
In case you need to check if a .CIF file has a valid structure, you can use :
if len(structure)==0:
    print('The .CIF file has no structure')
Or:
try:
    structure = parser.get_structures()[0]
except Exception:
    print('The .CIF file has no structure')
I am trying to load a flat file into a python pandas data frame.
Using Python 3.8.3 and pandas version 1.0.5
The read_csv code is like this:
import pandas as pd
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str,
                 encoding='UTF-8',
                 memory_map=True,
                 low_memory=True, engine='c')
print('nb entries:', df["ID"].size)
This gives me a number of entries.
However, this does not match the number of entries I get with the following code:
num_lines = sum(1 for line in open(myfile, encoding='UTF-8'))
print('nb lines:', num_lines)
I don't get an error message.
I tried several options (with/without encoding, with/without low memory, with or without memory map, with or without warn_bad_lines, with the c engine or the default one), but I always got the same erroneous results.
By changing the nrows parameter I identified where in the file the problem seems to be. I copied the lines of interest into a test file and re-ran the code on it. This time I got the correct result.
Now I realize that my machine is a little short on memory, so maybe some allocation is failing silently. Would there be a way to test for that? I tried running the script without any other applications open, but I got the same erroneous results.
How should I troubleshoot this type of problem?
Something like this could be used to read the file in chunks
import pandas as pd
import numpy as np
n_rows = sum(1 for _ in open("./test.csv", encoding='UTF-8')) - 1
chunk_size = 300
n_chunks = int(np.ceil(n_rows / chunk_size))
read_lines = 0
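for chunk_idx in range(n_chunks):
    # skip the data rows already read, but keep line 0 so header=0 still finds the header
    df = pd.read_csv("./test.csv", header=0, skiprows=range(1, chunk_idx*chunk_size + 1), nrows=chunk_size)
    read_lines += len(df)
print(read_lines)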
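One thing worth ruling out when the row count is lower than the line count and no error is raised is a stray double quote in the data. This may or may not be what is happening in your file; the sample below is made up just to show the effect and how quoting=csv.QUOTE_NONE changes it:

```python
import csv
import io

import pandas as pd

# Five physical lines; lines 2 to 4 are bracketed by stray double quotes (made-up data)
raw = 'a|1\n"b|2\nc|3\nd"|4\ne|5\n'
n_lines = sum(1 for _ in io.StringIO(raw))

# Default quoting: the text between the quotes is parsed as ONE field, newlines included
default = pd.read_csv(io.StringIO(raw), sep="|", header=None,
                      names=["ID", "val"], dtype=str)

# QUOTE_NONE treats '"' as an ordinary character, so every line becomes a row
literal = pd.read_csv(io.StringIO(raw), sep="|", header=None,
                      names=["ID", "val"], dtype=str, quoting=csv.QUOTE_NONE)

print(n_lines, len(default), len(literal))
```

Here the default read silently returns fewer rows than there are lines, while the QUOTE_NONE read matches the line count. Copying only a few lines into a test file can hide the problem if the opening and closing quotes end up in different places.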
Trying to whip this out in Python. Long story short, I have a CSV file that contains column data I need to inject into another file that is pipe delimited. My understanding is that Python can't replace values in place, so I have to rewrite the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2| replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n=0
        for value in result:
            fields[2] = result[n]
            n=n+1
        row.append(line)
    for value in row
        fixed_file.write(row)
I would highly suggest you use the pandas package here, it makes handling tabular data very easy and it would help you a lot in this case. Once you have installed pandas import it with:
import pandas as pd
To read the files simply use:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter = "|")
After that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both columns have the same number of rows then this should do the trick (read_csv treats the first line of each file as column names by default, so this assumes "iwantthisvalue3" and "iwanttoreplacethisvalue3" are header names):
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the DataFrame (the table that we just updated) to a file. Since you want a .txt file with "|" as the delimiter, this is the line to do that (you can customize how it is saved in a lot of ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage you to read up (or watch some videos) on pandas if you're planning to work with tabular data; it is an awesome library with great documentation and functionality.
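If the files have no header row (the sample lines in the question look like plain data), a version using only the standard csv module might look like the sketch below. It assumes both files have their rows in the same order; the function name replace_third_column and the demo file contents are made up for illustration:

```python
import csv
import os
import tempfile


def replace_third_column(data_path, source_path, fixed_path):
    """Copy source_path to fixed_path, taking field 3 from data_path line by line."""
    # Collect the third field of every line in the comma-separated data file
    with open(data_path, newline="") as data_file:
        new_values = [row[2] for row in csv.reader(data_file)]

    with open(source_path, newline="") as source_file, \
         open(fixed_path, "w", newline="") as fixed_file:
        reader = csv.reader(source_file, delimiter="|")
        writer = csv.writer(fixed_file, delimiter="|", lineterminator="\n")
        # zip pairs line i of the source with value i of the data file
        # (it stops at the shorter of the two)
        for fields, value in zip(reader, new_values):
            fields[2] = value
            writer.writerow(fields)


# Small demo with temporary files (hypothetical contents)
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "generatedfixed.csv")
source_path = os.path.join(workdir, "data.txt")
fixed_path = os.path.join(workdir, "data_fixed.txt")

with open(data_path, "w") as f:
    f.write("a1,a2,new3\nb1,b2,new3b\n")
with open(source_path, "w") as f:
    f.write("a1|a2|old3|a4|a5\nb1|b2|old3b|b4|b5\n")

replace_third_column(data_path, source_path, fixed_path)

with open(fixed_path) as f:
    fixed_text = f.read()
print(fixed_text)
```

This writes a new file rather than editing the source in place, which matches the "rewrite the whole file" approach from the question.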
I have some data in an Excel file. I would like to analyze them using Python. I started by creating a CSV file using this guide.
Thus I have created a CSV (Comma delimited) file filled with the following data:
I wrote a few lines of code in Python using Spyder:
import pandas
colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pandas.read_csv('Dane_2.csv', names = colnames)
GDP = data.GDP.tolist()
print(GDP)
The output is nothing like what I expected:
It can be easily seen that the output differs a lot from the figures in GDP column. I will appreciate any tips or hints which will help to deal with my problem.
It looks like the GDP column contains the decimal part of the first column in the .csv file together with the first digits of the second column. Either something is wrong with the .csv you created or, more probably, you need to specify the separator in the pandas.read_csv call. You can also pass header=None to make it explicit that the file has no header row, so the first line is kept as data and colnames supplies the column names (pandas already behaves this way when names is given).
Try this:
import pandas
colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pandas.read_csv('Dane_2.csv', names = colnames, header=None, sep=';')
GDP = data.GDP.tolist()
print(GDP)
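If the numbers in the file are written with a decimal comma (which would explain the split values, but is only a guess since the file contents are not shown), read_csv also needs decimal=','. A small sketch on made-up sample data:

```python
import io

import pandas as pd

# Made-up sample in the suspected layout: ';' between columns, ',' as the decimal mark
sample = "1,5;5,25;101,5;250000\n2,5;5,75;102,5;255000\n"

colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pd.read_csv(io.StringIO(sample), names=colnames, sep=';', decimal=',')
gdp = data.GDP.tolist()
print(gdp)
```

Without decimal=',' the same columns would be read as strings such as "1,5" instead of floats.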
I have a set of csv files I need to import into a pandas dataframe.
I have imported the filepaths as a list, FP, and I am using the following code to read the data:
for i in FP:
    df = pd.read_csv(i,index_col=None, header=0).append(df)
This is working great, but unfortunately there are no datetimestamps or file identifying attributes in the files. I need to know which file each record came from.
I tried adding this line, but this just returned the filename of the final file read:
for i in FP:
    df = pd.read_csv(i,index_col=None, header=0).append(df)
    df['filename'] = i
I can imagine some messy multi-step solutions, but wondered if there was something more elegant I could do within my existing loop.
I'd do it this way:
df = pd.concat([pd.read_csv(f, header=0).assign(filename=f) for f in FP],
               ignore_index=True)
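Here is a self-contained sketch of the same idea that you can run to see the result. The two temporary files and their contents are made up, and os.path.basename is used only to keep the filename column short:

```python
import os
import tempfile

import pandas as pd

# Create two small CSV files with a header row (hypothetical contents)
workdir = tempfile.mkdtemp()
FP = []
for name, text in [("jan.csv", "x,y\n1,2\n3,4\n"), ("feb.csv", "x,y\n5,6\n")]:
    path = os.path.join(workdir, name)
    with open(path, "w") as f:
        f.write(text)
    FP.append(path)

# assign() adds the filename column to each frame BEFORE the frames are combined,
# so every record keeps the name of the file it came from
df = pd.concat(
    [pd.read_csv(f, header=0).assign(filename=os.path.basename(f)) for f in FP],
    ignore_index=True,
)
print(df)
```

Tagging each frame before concatenation is what the loop in the question was missing: assigning df['filename'] after appending overwrites the column for all rows read so far.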