Handling a huge file with Python and PyTables

Simple problem, but maybe a tricky answer:
The problem is how to handle a huge .txt file with PyTables.
I have a big .txt file with millions of short lines, for example:
line 1 23458739
line 2 47395736
...........
...........
The content of this .txt file must be saved into a PyTables table. That part is easy: nothing else needs to be done with the info in the .txt file, just copy it into PyTables, and now we have a table with, for example, 10 columns and millions of rows.
The problem comes up because, while the 10 columns x millions of lines can be generated directly in the table from the .txt content, new columns must also be created depending on the data on each line of the .txt file. So how can this be handled efficiently?
Solution 1: first copy the whole text file, line by line, into the table (millions of rows), then iterate over every row of the table (millions again) and, depending on the values, generate the new columns the table needs.
Solution 2: read the .txt file line by line, do whatever is needed, calculate the new values, and only then send all the info to the table.
Solution 3: ... any other, faster and more efficient solution?

I think the basic problem here is one of conceptual model. PyTables' Tables only handle regular (or structured) data. However, the data that you have is irregular or unstructured, in that the structure is determined as you read the data. Said another way, PyTables needs the column description to be known completely by the time that create_table() is called. There is no way around this.
Since in your problem statement any line may add a new column, you have no choice but to do this in two full passes through the data: (1) read through the data and determine the columns, and (2) write the data to the table. In pseudocode:
import tables as tb

cols = {}

# pass 1: discover the columns
d = open('data.txt')
for line in d:
    for col in line:              # however you extract column names from a line
        if col not in cols:
            cols[col] = ...       # map the name to an appropriate tb.Col type

# pass 2: write the table
d.seek(0)
f = tb.open_file(...)
t = f.create_table(..., description=cols)
for line in d:
    row = line_to_row(line)
    t.append(row)

d.close()
f.close()
Obviously, if you knew the table structure ahead of time you could skip the first loop and this would be much faster.
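For concreteness, here is a minimal sketch of what the result of the first pass could look like, assuming (purely for illustration) that three columns were discovered and mapped to PyTables Col types, and that the table is written to a hypothetical data.h5 file:
import tables as tb

# hypothetical description produced by the discovery pass: name -> Col type
cols = {
    'col0': tb.Int64Col(pos=0),
    'col1': tb.Float64Col(pos=1),
    'col2': tb.StringCol(16, pos=2),
}

with tb.open_file('data.h5', 'w') as f:
    t = f.create_table('/', 'data', description=cols)
    # rows are appended as tuples in the positional order defined above
    t.append([(23458739, 1.5, b'line 1')])
    t.flush()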

Related

Pandas: How to append a single row to a pickled and zipped dataframe efficiently?

I'm in a situation where I have to add a single row to the end of a dataframe very frequently. Initially I used plain-text .csv files, so appending a line at the end of the file was trivial and didn't require loading the dataframe into RAM.
line_to_add = '1,2,3\n'
with open('path/to/file.csv', 'a') as file_handle:
    file_handle.write(line_to_add)
For memory and disk space reasons I would like to save my dataframe as a pickled and zipped file, but if I do that I lose the ability to easily append to the end of the file. Is this doable without having to load the dataframe into RAM every time?
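To make the cost concrete, appending a single row to a pickled and zipped frame currently means a full read-modify-write round trip, something like this sketch (the path and the 3-column frame are made up for illustration):
import pandas as pd

# load the whole frame, append one row, and rewrite the compressed file,
# which is exactly the per-append cost I would like to avoid
df = pd.read_pickle('path/to/file.pkl.gz', compression='gzip')
new_row = pd.DataFrame([[1, 2, 3]], columns=df.columns)
df = pd.concat([df, new_row], ignore_index=True)
df.to_pickle('path/to/file.pkl.gz', compression='gzip')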

Ignore carriage returns (u000D) with read_csv in python pandas

I get sent, on a regular basis, a csv containing 100+ columns and millions of rows. These csv files always contain a certain set of columns, Core_cols = [col_1, col_2, col_3], and a variable number of other columns, Var_col = [a, b, c, d, e]. The core columns are always there, and there could be 0-200 of the variable columns. Sometimes one of the variable columns will contain a carriage return. I know which columns this can happen in, bad_cols = [a, b, c].
When I import the csv with pd.read_csv, these carriage returns create corrupt rows in the resulting dataframe. I can't re-make the csv without these columns.
How do I either:
Ignore these columns and the carriage return contained within? or
Replace the carriage returns with blanks in the csv?
My current code looks something like this:
df = pd.read_csv('data.csv', dtype=str)
I've tried things like removing the columns after the import, but the damage seems to have already been done by that point. I can't find the code now, but when testing one fix the error said something like "invalid character u000D in data". I don't control the source of the data, so I can't make edits there.
Pandas supports multiline CSV files if the file is properly escaped and quoted. If you cannot read a CSV file in Python using the pandas or csv modules, nor open it in MS Excel, then it's probably a non-compliant "CSV" file.
I recommend manually editing a sample of the CSV file and getting it to the point where it opens in Excel. Then recreate those steps programmatically in Python to normalize and process the large file.
Use this code to create a sample CSV file by copying the first ~100 lines into a new file.
with open('bigfile.csv', "r") as csvin, open('test.csv', "w") as csvout:
    line = csvin.readline()
    count = 0
    while line and count < 100:
        csvout.write(line)
        count += 1
        line = csvin.readline()
Now you have a small test file to work with. If the original CSV file has millions of rows and "bad" rows are found much later in the file then you need to add some logic to find the "bad" lines.
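Once the sample parses cleanly, one possible way to neutralize the stray carriage returns in the big file is to stream it once and drop every '\r' before handing the result to pandas. This is only a sketch: it assumes every '\r' in the file is unwanted and that the real line breaks survive as '\n', and the clean.csv name is made up:
import pandas as pd

# stream the file once, dropping carriage returns, and write a cleaned copy
# that pandas can parse normally; newline='' keeps '\r' visible, untranslated
with open('bigfile.csv', 'r', newline='') as src, \
        open('clean.csv', 'w', newline='') as dst:
    for chunk in src:
        dst.write(chunk.replace('\r', ''))

df = pd.read_csv('clean.csv', dtype=str)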

How to edit and save one element of a CSV file in python

I would like to edit one item of my CSV file and then save the file. Let's say I have a CSV file that reads:
A,B,C,D
1,Today,3,4
5,Tomorrow,7,8
and when I do something in the program, I want it to alter the CSV file such that it reads:
A,B,C,D
1,Yesterday,3,4
5,Tomorrow,7,8
I've tried two approaches:
First, I tried using pandas.read_csv to open the file in memory, alter it using iloc, and then save the CSV file using csv.writer. Predictably, this resulted in the format of the CSV file being drastically altered (a seemingly arbitrary number of spaces between adjacent entries in a row), which meant that subsequent attempts to edit the CSV file would fail.
Example:
A B C D
1 Yesterday 3 4
5 Tomorrow 7 8
The second attempt was similar to the first in that I opened the CSV file and edited it, but I then converted it into a string before using csv.writer. The issue with this method is that it saves all of the entries in one row in the CSV file, and I cannot figure out how to tell the program to start a new row.
Example:
A,B,C,D,1,Yesterday,3,4,5,Tomorrow,7,8
Is there a proper way to edit one entry in a particular row of a CSV file in python?
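For reference, one way that keeps the comma-separated layout intact is to let pandas do both the reading and the writing; a sketch (assuming the file is named data.csv) would be:
import pandas as pd

# read the file, change a single cell, and write it back as plain CSV
df = pd.read_csv('data.csv')
df.loc[0, 'B'] = 'Yesterday'        # row 0, column 'B'
df.to_csv('data.csv', index=False)  # commas preserved, no extra index column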

Looping through .xlsx files using pandas, only does first file

My ultimate goal is to merge the contents of a folder full of .xlsx files into one big file.
I thought the below code would suffice, but it only does the first file, and I can't figure out why it stops there. The files are small (~6 KB), so it shouldn't be a matter of waiting. If I print f_list, it shows the complete list of files. So, where am I going wrong? To be clear, there is no error returned, it just does not do the entire for loop. I feel like there should be a simple fix, but being new to Python and coding, I'm having trouble seeing it.
I'm doing this with Anaconda on Windows 8.
import pandas as pd
import glob

f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")  # creates my file list
all_data = pd.DataFrame()  # creates my DataFrame
for f in f_list:  # basic for loop to go through file list but doesn't
    df = pd.read_excel(f)  # reads .xlsx file
    all_data = all_data.append(df)  # appends file contents to DataFrame
all_data.to_excel("output.xlsx")  # creates new .xlsx
Edit with new information:
After trying some of the suggested changes, I noticed the output claims the files are empty, except for one of them, which is slightly larger than the others. If I put them into the DataFrame, it claims the DataFrame is empty. If I put them into the dict, it claims there are no values associated. Could this have something to do with the file size? Many, if not most, of these files have 3-5 rows with 5 columns. The one it does see has 12 rows.
I strongly recommend reading the DataFrames into a dict:
sheets = {f: pd.read_excel(f) for f in f_list}
For one thing this is very easy to debug: just inspect the dict in the REPL.
Another is that you can then concat these into one DataFrame efficiently in one pass:
pd.concat(sheets.values())
Note: This is significantly faster than append, which has to allocate a temporary DataFrame at each append-call.
A separate possible issue is that your glob may not be picking up all the files; you should check that it is by printing f_list.
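Putting those pieces together, a sketch of the whole merge (reusing the glob pattern and output name from the question) might look like:
import glob
import pandas as pd

# read every workbook into a dict keyed by filename, then concatenate once
f_list = glob.glob("C:\\Users\\me\\dt\\xx\\*.xlsx")
sheets = {f: pd.read_excel(f) for f in f_list}

all_data = pd.concat(sheets.values(), ignore_index=True)
all_data.to_excel("output.xlsx", index=False)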

How do I enumerate a csv file with Python?

I have some data in my csv file but what I'm trying to do now is add in another column that has the row number in it. I've tried writing it normally but I need to be able to append to my file, so it needs to be able to pick up where it left off. I thought maybe the best thing to do was to overwrite the whole column again, but for that I think you need to read it again, which proved harder than I thought.
This depends - is the row number meaningful to the data in the row itself, or is this solely for your reporting so you can count how many lines appear as rows?
import csv

my_file = r'some.csv'
with open(my_file) as csv_file:
    reader = csv.DictReader(csv_file)
    row_index = 0
    for row in reader:
        print(str(row_index), row['col_1'], row['col_2'])
        row_index += 1
This will read the CSV file and print out an incremented index per-row.
Otherwise this sounds like a script that checks for a column, 'row_count' for example, removes it, re-creates it, and populates all fields of 'row_count' with an incremented integer.
Picking up where it last left off: manually adding these row numbers creates no meaningful relationship between the index count and the rows of the file itself. If you want meaningful persistence for this new column (i.e. a column a relational database could then use to link to another table), it should be implemented earlier, such as in the schema of whatever system these CSV files are being exported from.
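That said, if the row numbers are still wanted in the file itself, a sketch of "picking up where it left off" (assuming a header row; the new data here is made up) is to count the rows already present and append new rows numbered from there:
import csv

my_file = r'some.csv'

# count the data rows already in the file (minus the header row)
with open(my_file, newline='') as fh:
    existing = sum(1 for _ in csv.reader(fh)) - 1

# append new rows, each prefixed with the next row number
new_rows = [['spam', 'eggs'], ['ham', 'toast']]  # hypothetical new data
with open(my_file, 'a', newline='') as fh:
    writer = csv.writer(fh)
    for offset, row in enumerate(new_rows):
        writer.writerow([existing + offset] + row)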
