Filtering chunks by a string - Python

I have a CSV file with over 60 million rows. I am only interested in a subset of these and would like to put them in a DataFrame.
Here is the code I am using:
iter_csv = pd.read_csv('/Users/xxxx/Documents/globqa-pgurlbymrkt-Report.csv', iterator=True, chunksize=1000)
df = pd.concat([chunk[chunk['Site Market (evar13)'].str.contains("Canada", na=False)] for chunk in iter_csv])
based on the answer here: pandas: filter lines on load in read_csv
I get the following error:
AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas
Can't seem to figure out what's wrong and would appreciate guidance here.

Try verifying that the data is actually a string first.
What does the last chunk return that you are expecting to call .contains() on?
It may be that the data is missing, and if so the value wouldn't be a string.
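If missing values or mixed types are the culprit, one work-around is to cast the column to string before filtering. A minimal sketch, assuming the column name from the question ('report.csv' stands in for the real path):

import pandas as pd

# Cast the column to str so .str.contains works even on mixed dtypes;
# NaN becomes the string 'nan', so na=False is kept only as a safeguard.
iter_csv = pd.read_csv('report.csv', iterator=True, chunksize=1000)
df = pd.concat(
    chunk[chunk['Site Market (evar13)'].astype(str).str.contains('Canada', na=False)]
    for chunk in iter_csv
)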

Related

How can I read a document with pandas (Python) that doesn't look like the average one?

I am trying to get the values from the columns of a file. The doc looks like this:
[image: the data I want to read]
All the examples I have found use pd.read_csv or pd.DataFrame, but the data usually has a clear header and nothing on top of it (my file has about 10 lines at the top that I don't really need for what I am doing).
Also, I think maybe there is something wrong because I tried to run:
data = pd.read_csv('tdump_BIL_100_-96_17010100.txt',header=15)
and I get:
[image: pd.read_csv output]
which is just the row in one column, so there is no separation apparently, and therefore no way of getting the columns I need.
So my question is if there is a way to get the data from this file with pandas and how to get it.
If the number of unneeded rows is fixed, skip the initial rows, indicate that no header is present, and specify that values are separated by whitespace:
df = pd.read_csv(filename, skiprows=15, header=None, sep=r'\s+')
See read_csv for documentation.
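As a follow-up, the resulting columns are just the integers 0..N-1; they can be renamed afterwards if that is more convenient (a sketch with generated placeholder names):

import pandas as pd

# Read past the 15 metadata lines, then attach generic column names.
df = pd.read_csv('tdump_BIL_100_-96_17010100.txt', skiprows=15, header=None, sep=r'\s+')
df.columns = [f'col{i}' for i in range(df.shape[1])]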

Dropping a problematic column from a dask dataframe

I have a dask dataframe with one problematic column that (I believe) is the source of a particular error that is thrown every time I try to do anything with the dataframe (be it head, or to_csv, or even subsetting on a different column). The error probably stems from a data type mismatch and shows up like this:
ValueError: invalid literal for int() with base 10: 'FIPS'
So I decided to drop that column ('FIPS') using
df = df.drop('FIPS', axis=1)
Now when I do df.columns, I don't see 'FIPS' any longer which I take to mean that it has indeed been dropped. But when I try to write a different column to a file
df.column_a.to_csv('example.csv')
I keep getting the same error
ValueError: invalid literal for int() with base 10: 'FIPS'
I assume it has something to do with dask's lazy evaluation, as a result of which it delays the drop, but any work-around would be very helpful.
Basically, I just need to extract a single column (column_a) from df.
Try converting to a pandas DataFrame after the drop:
df = df.compute()
and only then write to CSV.
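Putting the suggestion together, a minimal sketch ('data.csv' stands in for the real input; the drop and column names are the ones from the question):

import dask.dataframe as dd

# Drop the bad column lazily, materialize to pandas, then write one column.
df = dd.read_csv('data.csv')
df = df.drop('FIPS', axis=1)
pdf = df.compute()                      # returns a plain pandas DataFrame
pdf['column_a'].to_csv('example.csv')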

Python read_csv - ParserError: Error tokenizing data

I understand why I get this error when trying df = pd.read_csv(file):
ParserError: Error tokenizing data. C error: Expected 14 fields in line 7, saw 30
When it reads in the CSV, it sees 14 strings/columns in the first row and, based on that first row, calls those the headers (which is what I want).
However, those columns extend further down the rows (specifically when it gets to row 7).
I can find solutions that read it in by skipping rows 1-6, but I don't want that. I still want the whole CSV to be read, but instead of the header being 14 columns, how can I tell it to make the header 30 columns? If there is no text/string for a column, just leave it as "", or null, or some random numbering. In other words, I don't care what it's named; I just need the placeholder so it can parse after row 6.
I'm wondering is there a way to read in the csv, and explicitly say there are 30 columns but have not found a solution.
I can suggest a couple of solutions that should work.
1) Set header=None and give column names via the names argument of read_csv (see the sketch after this answer).
df = pd.read_csv(file, header=None, names=['field1', 'field2', ..., 'field30'])
PS. This will work if your CSV doesn't have a header already.
2) Alternatively, if your CSV already has a header row, you can try the command below:
df = pd.read_csv(file, usecols=[0, 1, 2, ..., 29])
Let me know if this works out for you.
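A minimal sketch of option 1), assuming at most 30 fields per row ('data.csv' and the generated names are placeholders). Passing 30 explicit names makes the parser expect 30 fields, and shorter rows are padded with NaN instead of raising a ParserError:

import pandas as pd

# skiprows=1 drops the original 14-field header row.
col_names = [f'field{i}' for i in range(30)]
df = pd.read_csv('data.csv', header=None, names=col_names, skiprows=1)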
What about trying the following? Note that error_bad_lines=False will cause the offending lines to be skipped:
data = pd.read_csv('File_path', error_bad_lines=False)
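Note that in pandas 1.3 and later, error_bad_lines is deprecated in favor of on_bad_lines; the equivalent call there is:

import pandas as pd

# on_bad_lines='skip' drops malformed lines instead of raising ParserError
data = pd.read_csv('File_path', on_bad_lines='skip')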
Just a few more collected answers...
It might be an issue with the delimiters in the first row of your data.
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,
df = pandas.read_csv('File_path', sep='delimiter', header=None)
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Here the documentation says: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." However, I have not had good luck with this, including instances with obvious delimiters.
This might be a delimiter issue, as many CSV files are created with a tab separator, so try reading the file with the tab character (\t) as the separator:
data = pd.read_csv("File_path", sep='\t')
OR
pandas.read_csv('File_path', header=None, sep=', ')
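If the delimiter is genuinely unknown, one option is to let Python's csv.Sniffer guess it from a sample before handing it to read_csv (a sketch; 'File_path' as above):

import csv
import pandas as pd

# Sniff the delimiter from the first few KB, then parse with it.
with open('File_path', newline='') as f:
    dialect = csv.Sniffer().sniff(f.read(4096))
data = pd.read_csv('File_path', sep=dialect.delimiter)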

First row of data has become a column in Pandas table

The first row in my pandas data table has turned into a column. I've tried various renaming and restructuring methods and it hasn't been working. It's something really trivial, but unfortunately I need some help.
The line "0" is supposed to come down as the first data row "Bachelor". Could someone please point me to the proper way of getting this done?
I think the problem is that your CSV has no header, so it's possible to create default range column names:
df_degree = pd.read_csv(file, header=None)
Or it's possible to define custom column names:
df_degree = pd.read_csv(file, names=['col1','col2'])

Saving Pandas dataframe indices to file

I am trying to save just the indices of a dataframe to a file. Here is what I've tried:
A)
np.savetxt("file_name", df.index.values)
returns:
TypeError: Mismatch between array dtype ('object') and format specifier ('%.18e')
B)
df.index.values.tofile("file_name")
returns:
IOError: cannot write object arrays to a file in binary mode
C)
with open("file_name", "w") as f:
    f.write("\n".join(df_1_indexes.values.tolist()))
Can someone please explain why A) and B) failed? I'm at a loss.
The error in A) is likely because you have strings or some other type of object in your index. The default format specifier in np.savetxt appears to assume the data is float-like. I got around this by setting fmt='%s', though that may not be a reliable solution.
B) doesn't yield any errors for me, using some basic examples of Index and MultiIndex. Your error is probably due to the specific type of elements in your index.
Note that there is an easier and more reliable way to save just the index. You can set the columns parameter of to_csv as an empty list, which will suppress all columns from the output:
df.to_csv('file_name', columns=[], header=False)
If your index has a name and you want the name to appear in the output (similar to how column names appear), remove header=False from the code above.
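A quick self-contained sketch of that approach (the frame and file name are made up):

import pandas as pd

# columns=[] suppresses all data columns, so only the index is written.
df = pd.DataFrame({'a': [1, 2, 3]}, index=['x', 'y', 'z'])
df.to_csv('file_name', columns=[], header=False)   # file contains x, y and z, one per line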
In addition to @root's answer, if you've already deleted df after saving its index, you can do this:
saved_index = df.index
pd.Series(saved_index, index=saved_index).to_csv('file_name', header=False, index=False)
