how to retrieve all lines with errors in pandas - python

For example, I can use
pd.read_csv('file.csv')
to load a csv file.
By default, it fails when there are any parsing errors. I understand that one can use error_bad_lines=False to skip the rows with errors.
But my question is:
How can I get all the lines where errors occur? That way, I can potentially fix the problem not only for this particular file.csv but also for other related files processed in a batch: file1.csv, file2.csv, file3.csv, ...

One easy way would be to prepend a row index number to each row. This can easily be done with awk or Python before loading the data. You could even do it in memory using StringIO or your own custom file-like object in Python that "magically" prepends the row numbers.
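A minimal sketch of the in-memory variant, assuming a plain comma-delimited file with a single header row (the helper name read_and_report_bad_lines and the line_no column are made up for illustration): prepend a physical line number to every row, read leniently, and the skipped lines show up as gaps.

import io
import pandas as pd

def read_and_report_bad_lines(path):
    # Prepend a physical line number to every data row, read leniently,
    # and report the line numbers pandas had to skip.
    with open(path) as f:
        header = next(f).rstrip('\n')
        body = [line.rstrip('\n') for line in f]

    numbered = io.StringIO(
        '\n'.join(['line_no,' + header]
                  + [f'{i},{row}' for i, row in enumerate(body, start=2)])
    )
    # error_bad_lines=False was renamed to on_bad_lines='skip' in pandas 1.3+
    df = pd.read_csv(numbered, error_bad_lines=False, warn_bad_lines=False)

    kept = set(df['line_no'])
    bad = sorted(set(range(2, 2 + len(body))) - kept)
    return df, bad

# df, bad_lines = read_and_report_bad_lines('file.csv')
# bad_lines holds the physical line numbers (header = line 1) that failed to parse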

Related

Merging datasets using dask proves unsuccessful

I am trying to merge a number of large data sets using Dask in Python to avoid loading issues, and I want to save the merged result as a .csv file. The task proves harder than imagined:
I put together a toy example with just two data sets.
The code I then use is the following:
import dask.dataframe as dd
import glob
import os

os.chdir('C:/Users/Me/Working directory')
file_list = glob.glob("*.txt")

dfs = []
for file in file_list:
    ddf = dd.read_table(file, sep=';')
    dfs.append(ddf)

dd_all = dd.concat(dfs)
If I use dd_all.to_csv('*.csv') I simply get the two original data sets written back out as separate files.
If I use dd_all.to_csv('name.csv') I get an error saying the file does not exist:
(FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Me\\Working directory\\name.csv\\1.part')
I can check with dd_all.compute() that the merged data set has in fact been created.
You are misunderstanding how Dask works - the behaviour you see is as expected. In order to write from multiple workers in parallel, each worker must be able to write to a separate file; there is no way to know the length of the first chunk before writing it has finished, for example. Writing to a single file is therefore necessarily a sequential operation.
The default behaviour, therefore, is to write one output file for each input partition, and this is what you see. Since Dask can read these back in parallel, it does raise the question of why you would want to create one output file at all.
For the second method, without the "*" character, Dask assumes you are supplying a directory, not a file, and tries to write two files within that directory, which doesn't exist.
If you really wanted to write a single file, you could do one of the following (see the sketch after this list):
use the repartition method to make a single output partition and then call to_csv
write the separate files and concatenate them after the fact (taking care of the header line)
iterate over the partitions of your dataframe in sequence, appending to the same file.
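A minimal sketch of the first option, assuming a reasonably recent Dask release (single_file=True is only available in newer versions of to_csv; the output name merged.csv is illustrative):

import dask.dataframe as dd

# Collapse to one partition so the whole result is written as one piece,
# then ask to_csv for a single output file instead of one file per partition.
single = dd_all.repartition(npartitions=1)
single.to_csv('merged.csv', single_file=True, index=False)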

Pandas vs JSON library to read a JSON file in Python

It seems that I can use either pandas or the json module to read a json file, i.e.
import pandas as pd
pd_example = pd.read_json('some_json_file.json')
or, equivalently,
import json
json_example = json.load(open('some_json_file.json'))
So my question is: what's the difference and which one should I use? Is one way recommended over the other? Are there certain situations where one is better than the other? Thanks.
It depends.
When you have a single JSON structure inside a json file, use read_json because it loads the JSON directly into a DataFrame. With json.load, you have to load it into a Python dictionary/list first and then into a DataFrame - an unnecessary two-step process.
Of course, this is under the assumption that the structure is directly parsable into a DataFrame. For non-trivial structures (usually complex nested lists of dicts), you may want to use json_normalize instead.
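A toy illustration of the difference, assuming pandas 1.0+ where json_normalize is a top-level function (the nested records here are made up):

import pandas as pd

nested = [{"id": 1, "user": {"name": "a", "score": 10}},
          {"id": 2, "user": {"name": "b", "score": 20}}]

# A direct DataFrame leaves 'user' as a column of dicts:
print(pd.DataFrame(nested))

# json_normalize flattens the nested dicts into 'user.name', 'user.score' columns:
print(pd.json_normalize(nested))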
On the other hand, with a JSON Lines file the story is different. In my experience, loading a JSON Lines file with pd.read_json(..., lines=True) is actually slightly slower on large data (tested once on ~50k+ records), and to make matters worse, it cannot handle rows with errors - the entire read operation fails. In contrast, you can call json.loads on each line of your file inside a try-except block for some robust code, which actually ends up being a bit faster. Go figure.
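A minimal sketch of that per-line approach (the .jsonl filename is illustrative): malformed lines are skipped and recorded instead of aborting the whole read.

import json
import pandas as pd

records, bad_lines = [], []
with open('some_json_lines_file.jsonl') as f:
    for line_no, line in enumerate(f, start=1):
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(line_no)   # keep track of the rows that failed

df = pd.DataFrame(records)
print('skipped {} malformed lines'.format(len(bad_lines)))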
Use whatever fits the situation.

split an excel sheet every n rows using python

I have an Excel file with more than 1 million rows. I need to split it every n rows and save each part in a new file. I am very new to Python. Any help is much appreciated.
As suggested by OhAuth, you can save the Excel document to a csv file. That would be a good start for processing your data.
For the processing itself you can use Python's built-in csv library, which requires no installation since it ships with Python.
If you want something more "powerful", you might want to look into pandas. However, that requires installing the module.
If you want to use neither the csv module nor pandas because you do not want to read into the docs, you could also do something like:
with open("myCSVfile", "r") as f:
    for row in f:
        single_row = row.split(",")  # replace the "," with the delimiter you chose to separate your columns
        print(single_row)
        # -> ['value1', 'value2', 'value3', ...]  it returns a list, and lists are well documented
        # and easy to work with, so further processing won't be difficult
However, I strongly recommend looking into the modules, since they handle csv data better and more efficiently, and in the long run will save you time and trouble.
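For the pandas route, a minimal sketch (assuming the sheet has already been exported to data.csv; the chunk size and file names are illustrative): read the csv in chunks of n rows and write each chunk to its own file.

import pandas as pd

n = 100_000  # rows per output file
for i, chunk in enumerate(pd.read_csv('data.csv', chunksize=n)):
    chunk.to_csv('part_{}.csv'.format(i), index=False)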

Python Pandas to_pickle cannot pickle large dataframes

I have a dataframe "DF" with 500,000 rows. Here are the data types per column:
ID int64
time datetime64[ns]
data object
each entry in the "data" column is an array with size = [5,500]
When I try to save this dataframe using
DF.to_pickle("my_filename.pkl")
it returns the following error:
12 """
13 with open(path, 'wb') as f:
---> 14 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
OSError: [Errno 22] Invalid argument
I also tried this method, but I get the same error:
import pickle
with open('my_filename.pkl', 'wb') as f:
pickle.dump(DF, f)
I tried saving just 10 rows of this dataframe:
DF.head(10).to_pickle('test_save.pkl')
and there is no error at all. So it can save a small DF but not a large one.
I am using Python 3 and IPython Notebook 3 on a Mac.
Please help me solve this problem. I really need to save this DF to a pickle file, and I cannot find a solution on the internet.
Until there is a fix somewhere on the pickle/pandas side of things, I'd say a better option is to use an alternative IO backend. HDF is suitable for large datasets (GBs), so you don't need to add any extra split/combine logic.
df.to_hdf('my_filename.hdf','mydata',mode='w')
df = pd.read_hdf('my_filename.hdf','mydata')
Probably not the answer you were hoping for, but this is what I did...
Split the dataframe into smaller chunks using np.array_split (although numpy functions are not guaranteed to work on dataframes, it does now, although there used to be a bug for it).
Then pickle the smaller dataframes.
When you unpickle them, use pandas.concat (or DataFrame.append) to glue everything back together.
I agree it is a fudge and suboptimal. If anyone can suggest a "proper" answer I'd be interested in seeing it, but I think it is simply that dataframes are not supposed to get above a certain size.
Split a large pandas dataframe
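A minimal sketch of that chunk-and-pickle workaround (the number of chunks and the file names are illustrative):

import numpy as np
import pandas as pd

# Split into smaller frames and pickle each one.
chunks = np.array_split(DF, 10)
for i, chunk in enumerate(chunks):
    chunk.to_pickle('my_filename_{}.pkl'.format(i))

# ...later, glue everything back together.
DF_restored = pd.concat(
    pd.read_pickle('my_filename_{}.pkl'.format(i)) for i in range(10)
)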
Try to use compression. It worked for me.
data_df.to_pickle('data_df.pickle.gzde', compression='gzip')
I ran into this same issue and traced the cause to a memory issue. According to this resource, it is usually not actually caused by the memory itself but by moving too many resources into swap space. I was able to save the large pandas dataframe by disabling swap altogether with the command (provided in that link):
swapoff -a

Problems converting a large file to a panel using read_csv into python pandas

I have been trying to load a large-ish file (~480MB, 5,250,000 records of daily stock price data - dt, o, h, l, c, v, val, adj, fv, sym, code - for about 4,500 instruments) into pandas using read_csv. It runs fine and creates the DataFrame. However, on conversion to a Panel, the values for several stocks are way off, and nowhere close to the values in the original csv file.
I then attempted to use the chunksize parameter in read_csv, and used a for loop to:
reader = read_csv("bigfile.csv", index_col=[0, 9], parse_dates=True,
                  names=['n1', 'n2', ..., 'nn'], chunksize=100000)
new_df = DataFrame(reader.get_chunk(1))
for chunk in reader:
    new_df = concat([new_df, chunk])
This reads in the data, but:
I get the same erroneous values when converting to a Panel (see edit below)
It takes far longer than the plain read_csv (no iterator)
Any ideas how to get around this?
Edit:
Changed the question to reflect the problem - the DataFrame is fine; conversion to a Panel is the problem. The error appears even after splitting the input csv file, merging, and then converting to a Panel. If I maintain a multi-index DataFrame, there is no problem and the values are represented correctly.
Some bugs have been fixed in the DataFrame to Panel code. Please try with the latest pandas version (preferably upcoming 0.10) and let us know if you're still having issues.
If you know a few specific values that are off, you might just examine those lines directly in your csv file. You should also check out the docs on csv, particularly the parts on dialects and the Sniffer class. You might be able to find some setting that will correctly detect how the file is delimited.
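For example, a quick check with the standard library's Sniffer (the filename is illustrative):

import csv

with open('bigfile.csv', newline='') as f:
    sample = f.read(4096)
    dialect = csv.Sniffer().sniff(sample)          # guess delimiter and quoting
    print('delimiter:', repr(dialect.delimiter))
    print('has header:', csv.Sniffer().has_header(sample))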
If you find that the errors go away when you look only at specific lines, that probably means that there is an erroneous/missing line break somewhere that is throwing things off.
Finally, if you can't seem to find patterns of correct/incorrect lines, you might try (randomly or otherwise) selecting a subset of the lines in your csv file and see whether the error is occurring because of the size of the file (I'd guess this would be unlikely, but I'm not sure).
