I’m trying to read an unknown large csv file with pandas.
I came across some errors so I added the following arguments:
df = pd.read_csv(csv_file, engine="python", error_bad_lines=False, warn_bad_lines=True)
It works well, skipping offending lines, and errors are printed to the terminal correctly, such as:
Skipping line 31175: field larger than field limit (131072)
However, I’d like to save all errors to a variable instead of printing them.
How can I do it?
Note that I have a big program here and can't change the output of all logs from file=sys.stdout to something else. I need a case specific solution.
Thanks!
Use the on_bad_lines argument instead (available in pandas 1.4+). When given a callable, it is invoked with each bad line, so you can collect them in a variable instead of printing them:
badlines_list = []

def badlines_collect(bad_line: list[str]) -> None:
    badlines_list.append(bad_line)
    return None

df = pd.read_csv(csv_file, engine="python", on_bad_lines=badlines_collect)
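For instance, with a small in-memory CSV (the data here is illustrative), the collected rows end up in badlines_list as lists of strings:

```python
import io
import pandas as pd

csv_data = "a,b\n1,2\n3,4,5\n6,7\n"  # the second data row has one field too many

badlines_list = []

def badlines_collect(bad_line: list[str]) -> None:
    # Called once per offending row; returning None drops the row from the result
    badlines_list.append(bad_line)
    return None

df = pd.read_csv(io.StringIO(csv_data), engine="python", on_bad_lines=badlines_collect)
print(badlines_list)  # [['3', '4', '5']]
print(len(df))        # 2 good rows survive
```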
I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex library.
I tried to convert the CSV to an HDF5 file, to make it readable for the vaex library:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?
I'm not familiar with vaex, so I can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes actually written" as a huge number, 18_446_744_073_709_551_615. That is 2^64 - 1, i.e., -1 cast to an unsigned 64-bit integer, which indicates the underlying write failed rather than that anything was actually written. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question "is there an alternative way to convert a CSV to HDF5?": why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV. (You can also use loadtxt(), if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading an 84 GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternatively, you can use csv.DictReader(), which reads one line at a time; you avoid memory issues, but loading the HDF5 file could be very slow.
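As a sketch of the skip_header/max_rows idea (a StringIO stands in for the real CSV, and the chunk size is illustrative), you can walk the file in slices:

```python
import io
import numpy as np

# A tiny stand-in for a large CSV: a header row plus 10 data rows
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)) + "\n"

chunk = 4          # rows per slice (use something like 1_000_000 for a real file)
total_rows = 10
pieces = []
for start in range(0, total_rows, chunk):
    arr = np.genfromtxt(io.StringIO(csv_text), delimiter=',',
                        skip_header=1 + start,   # 1 header row + rows already read
                        max_rows=min(chunk, total_rows - start))
    pieces.append(arr.reshape(-1, 2))            # a 1-row slice comes back 1-D

combined = np.vstack(pieces)
print(combined.shape)  # (10, 2)
```

Each slice could be appended to an HDF5 dataset instead of a Python list, so only one chunk is in memory at a time.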
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
import numpy as np
import h5py

csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name + '.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name + '.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB and has 2M rows. I discovered the code above doesn't work; however, it's not because of the number of rows. The problem is the columns (field names). It turns out the data isn't as clean: there are 109 field names on row 1, but some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered that many rows only have values for the first 56 fields; in other words, fields 57-111 are not very useful. One solution is to add the usecols=() parameter. The code below reflects this modification and works with this test file. (I have not tried testing with your large file airline.csv. Given its size, you will likely need to read and load incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name + '.csv', delimiter=',',
                        dtype=None, names=True, encoding='bytes',
                        usecols=range(56))
with h5py.File(csv_name + '.h5', 'w') as h5f:
    h5f.create_dataset(csv_name, data=rec_arr)
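For the incremental case mentioned above, one hedged sketch (the chunk contents here are synthetic, and the file name is made up) is to create the dataset with maxshape=(None, ncols) and resize() it as each chunk arrives:

```python
import numpy as np
import h5py

with h5py.File('airline_incremental.h5', 'w') as h5f:
    dset = None
    for chunk_no in range(3):
        # Stand-in for one chunk of rows read from the CSV
        arr = np.arange(chunk_no * 8, chunk_no * 8 + 8, dtype='f8').reshape(4, 2)
        if dset is None:
            # maxshape=(None, ...) makes the first axis unlimited
            dset = h5f.create_dataset('airline', data=arr,
                                      maxshape=(None, arr.shape[1]))
        else:
            dset.resize(dset.shape[0] + arr.shape[0], axis=0)
            dset[-arr.shape[0]:] = arr

with h5py.File('airline_incremental.h5', 'r') as h5f:
    out = h5f['airline'][:]
print(out.shape)  # (12, 2)
```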
I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs: the schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of like a database).
So how do you get around this? The best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has 100+ columns and that's annoying, but that is also kind of the price to pay when using a format such as CSV.
Another thing I noticed is the encoding: using pure pandas.read_csv fails at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact, if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like this (pseudo code, but I hope close to the real thing):
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000,
                                       encoding="ISO-8859-1", dtype=dtype)):
    # Assert or check or do whatever needs doing to ensure column types are as they should be

    # Pass the data to vaex (this does not take extra RAM):
    df_vaex = vaex.from_pandas(df_tmp)
    # Export this chunk into HDF5
    df_vaex.export_hdf5(f'chunk_{i}.hdf5')

# When the above loop finishes, just concat and export the data to a single
# file if needed (gives some performance benefit)
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen a potentially much better/faster way of doing this with vaex, but it is not released yet (I saw it in the code repo on GitHub), so I will not go into it. But if you can install from source and want me to elaborate further, feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In the last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so you can just export to hdf5/arrow directly; it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats
I have a file named "sample name_TIC.txt". The first three columns in this file are useful: Scan, Time, and TIC. After those, it has 456 columns that are not useful. To do other data processing, I need those not-useful columns to go away. So I wrote a bit of code to start:
os.chdir(main_folder)
mydir = os.getcwd()
nameslist = ['Scan', 'Time', 'TIC']
for path, subdirs, files in os.walk(mydir):
    for file in files:
        if file.endswith('TIC.txt'):
            myfile = os.path.join(path, file)
            TIC_df = pd.read_csv(myfile, sep="\t", skiprows=1,
                                 usecols=[0, 1, 2], names=nameslist)
Normally, the for loop is set into a function that is iterated over a very large set of folders with a lot of samples, hence the os.walk stuff, but we can ignore that right now. This code will be completed to save a new .txt file with only the 3 relevant columns.
The problem comes in the last line, the pd.read_csv line. This results in a dataframe with an index column that comprises the data from the first 456 columns, while the last 3 columns of the .txt are given the names in nameslist and are callable as columns in pandas (i.e., using .iloc). This is not a multi-index; it is a single index containing all the data and whitespace of those first columns.
In this example code, sep="\t" because that's how Excel can successfully import it. But I've also tried:
sep="\s"
delimiter=r"\s+" rather than a sep argument
including header=None
not including the usecols argument (I made an error here and did not call the proper result from this code edit; this is actually the correct solution. See the edit below or the answer.)
setting index_col=False
How can I get pd.read_csv to take the first 3 columns and ignore the rest?
Thanks.
EDIT: In my end-of-day foolishness, I made an error when changing the target df to the example TIC_df. In the original code set I took this from, it was named mz207_df. My call was still referencing the old df name.
Changing the last line of code to:
TIC_df = pd.read_csv(myfile, sep=r"\s+", skiprows=1, usecols=[0, 1, 2], names=nameslist)
successfully resolved my problem. Using sep="\t" also worked. Sorry for wasting people's time. I will post this with an answer as well in case someone needs to learn about usecols like I did.
Answering here to make sure the problem gets flagged as answered, in case someone else searches for it.
I made an error when calling the result from the code which included the usecols=[0,1,2] argument, and I was calling an older dataframe. The following line of code successfully generated the desired dataframe.
TIC_df = pd.read_csv(myfile, sep=r"\s+", skiprows=1, usecols=[0, 1, 2], names=nameslist)
Using sep="\t" also generated the correct dataframe, but I default to \s+ to accommodate different and variable formatting from analytical machine outputs.
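To illustrate usecols on a toy version of such a file (columns beyond the first three are made-up junk):

```python
import io
import pandas as pd

text = (
    "Scan Time TIC x1 x2 x3\n"   # header row, skipped below
    "1 0.1 100 9 9 9\n"
    "2 0.2 200 9 9 9\n"
)
nameslist = ['Scan', 'Time', 'TIC']
df = pd.read_csv(io.StringIO(text), sep=r"\s+", skiprows=1,
                 usecols=[0, 1, 2], names=nameslist)
print(df.columns.tolist())  # ['Scan', 'Time', 'TIC']
print(df['TIC'].tolist())   # [100, 200]
```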
I am trying to read this huge text file: https://www.dropbox.com/s/3ikikw8bxde6y1i/TCAD_SPECIAL%20EXPORT_2019_20200409.zip?dl=0 (if you download the zip, the file is Special_ARB.txt; downloading is not necessary for my question, imo).
I am running this code (adding error_bad_lines=False) to ignore lines with more-than-expected fields, which works well:
pd.read_csv(r'~/Special_ARB.txt', sep="|",
header=None,encoding='cp1252',error_bad_lines=False)
The problem is that read_csv() crashed when a line had only 1 field, with the following error:
Too many columns specified: expected 77 and found 1
Is there a way to tell python/pandas to ignore this error? It is not letting me know which line it is. There are more than a million rows so I can't just find it on my own.
I tried a for loop to read line by line and figure it out from there, but data is so large that python crashed.
The number of columns is 77, which is correctly identified by pandas when running the code, so I don't think that's the issue.
Thanks,
Errors and Exceptions
Python Try Except
try:
    pd.read_csv(r'~/Special_ARB.txt', sep="|", header=None,
                encoding='cp1252', error_bad_lines=False)
except <your error description>:
    <do this>
This should work for in-memory datasets; for large datasets you can use chunking: https://stackoverflow.com/a/59331754/9379924
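A minimal sketch of the chunked approach from that link (data inlined for illustration): reading the pipe-delimited file in pieces keeps memory bounded and narrows down where a parse failure occurs:

```python
import io
import pandas as pd

text = "1|a|x\n2|b|y\n3|c|z\n"   # stand-in for the pipe-delimited file
total = 0
for chunk in pd.read_csv(io.StringIO(text), sep="|", header=None, chunksize=2):
    total += len(chunk)
print(total)  # 3 rows processed across chunks
```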
I've been working on some dataframes with Python. I load them in using read_csv(filename, index_col=0) and it's all fine. The files also open fine in Excel. I also opened them in Notepad, and they seem alright; below is an example line:
851,1.218108787,0.636454978,0.269719611,-0.849476404,-0.143909689,0.050626813,-0.094248374,-0.3096134,-0.131347142,0.671271112,0.167593329,0.439417259,-0.198164647,-0.031552824,-0.215189948,-0.1791156,0.092648696,-0.107840318,-0.162596466,0.019324121,0.040572892,-0.008307331,-0.077819297,-0.023809355,-0.148229913,-0.041082835,0.138234498,-0.070986117,0.024788437,-0.050982962,0.24689969,0
The first column is as I understand it an index column. Then there's a bunch of Principal Components, and at the end is a 1/0.
When I try to load the file into WEKA, however, it gives me a nasty error and urges me to use the converter, saying:
Reason:
32 Problem encountered on line: 2
When I attempt to use the converter with the default settings, it states a new error:
Couldn't read object file_name.csv invalid stream header: 2C636F6D
Could anyone help with any of this? I can't provide the entire data file, but if requested I can try to cut out a few rows and paste only those, if the error still occurs. Are there any flags I need to specify when saving a file to CSV in Python? At the moment I just use .to_csv('x.csv').
I think the index column is what's preventing WEKA from reading the file. When you write it using pandas.to_csv(), set index=False:
df.to_csv('x.csv', index=False)
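A quick check of what index=False changes in the written file (toy dataframe; the column names are illustrative):

```python
import io
import pandas as pd

df = pd.DataFrame({'PC1': [1.218, 0.636], 'label': [0, 1]})

with_index = io.StringIO()
df.to_csv(with_index)                 # default: leading unnamed index column
without_index = io.StringIO()
df.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])     # ,PC1,label
print(without_index.getvalue().splitlines()[0])  # PC1,label
```

The leading empty header on the first variant is the unnamed index column that trips up consumers expecting plain columns.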
I'm trying to parse firewall logs with Python & pandas, but I'm having issues getting the correct separator to work.
My current log data:
num�date�time�orig�type�action�alert�i/f_name�i/f_dir�product�log_sys_message�origin_id�ProductFamily�src�dst�proto�message_info�service�s_port�rule�rule_uid�rule_name�service_id�xlatesrc�xlatedst�NAT_rulenum�NAT_addtnl_rulenum�xlatedport�xlatesport�ICMP�ICMP Type�ICMP Code�rule_guid�hit�policy�first_hit_time�last_hit_time�log_id�description�status�version�comment�update_service�TCP packet out of state�tcp_flags�sys_message:�inzone�outzone�Protection Name�Severity�Confidence Level�protection_id�SmartDefense Profile�Performance Impact�Industry Reference�Protection Type�Update Version�Attack Info�attack�capture_uuid�FollowUp�Total logs�Suppressed logs
0�24Oct2017�23:59:00�10.100.255.190�control� ��daemon�inbound�VPN-1 & FireWall-1�Log file has been switched to: 2017-10-24_235900.log�cteafmfw1�Network��������������������������������������������������
and the code:
import pandas as pd
file = pd.read_csv('2017-10-25_235900.log-export.csv', sep='\xff',
header=0, index_col=False)
print(file)
When I run this, I can see that the separator is not processed. I've also tried assigning it to a variable with the value chr(255), as was proposed for a similar issue, but I cannot seem to get this separator processed at all.
I know that I can preprocess the file and replace the separator, but as there are tons of data with this separator already, it would be nice to know whether it is even possible to get this working.
For others wondering the same:
Adding encoding='latin-1' to the read_csv params solved this.
Thanks @COLDSPEED
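To sanity-check the combination, here is a tiny reproduction (one made-up log line encoded as latin-1; the 0xFF byte round-trips under that encoding, so sep='\xff' matches):

```python
import io
import pandas as pd

raw = "num\xffdate\xfftime\n0\xff24Oct2017\xff23:59:00\n".encode('latin-1')
df = pd.read_csv(io.BytesIO(raw), sep='\xff', encoding='latin-1',
                 header=0, index_col=False, engine='python')
print(df.columns.tolist())  # ['num', 'date', 'time']
```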