Every time I import this one CSV ('leads.csv') I get the following warning:
/usr/local/lib/python2.7/site-packages/pandas/io/parsers.py:1130: DtypeWarning: Columns (11,12,13,14,17,19,20,21) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I import many CSVs for this one analysis, of which 'leads.csv' is only one, and it's the only file with the problem. When I look at those columns in a spreadsheet application, the values are all consistent.
For example, Column 11 (Column K when using Excel) is a simple Boolean field: every row is populated, and it consistently contains exactly 'FALSE' or exactly 'TRUE'. The other fields the warning references hold consistently formatted string values with only letters and numbers. In most of these columns there are at least some blanks.
Anyway, given all of this, I don't understand why this message keeps happening. It doesn't seem to matter much as I can use the data anyway. But here are my questions:
1) How would you go about identifying any rows/records that are causing this warning? (I've put a rough attempt below.)
2) Using the low_memory=False option seems to be pretty unpopular in many of the posts I read. Do I need to declare the datatype of each field in this case, or should I just ignore the warning?
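Here is the rough attempt I mentioned in question 1: read the file and then check which Python types actually ended up in the flagged columns. 'column_11' and 'column_12' are just placeholders for the real header names; I'm not sure this is the right way to go about it, hence the question.

import pandas as pd

df = pd.read_csv('leads.csv')

# List the Python types actually present in each flagged column.
for col in ['column_11', 'column_12']:   # placeholders for the real headers
    print(col, df[col].map(type).value_counts().to_dict())

# Pull out the rows whose type is in the minority, e.g. floats (NaN from blanks)
# mixed in with strings.
mask = df['column_11'].map(type) == float
print(df[mask].head())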
Related
I am pulling data from an API, converting it into a Python object, and attempting to convert it to a dataframe. However, when I unpack the Python object, I get more rows in the dataframe than there are in the object.
You can see in my dataframe how on 2023-02-03, there are multiple rows. One row seems to be giving me the correct data while the other row is giving me random data. I am not sure where the extra row is coming from. I'm wondering if it has something to do with the null values or whether I am not unpacking the Python Object correctly.
My code
I double checked the raw data from the JSON response and don't see the extra values there. On Oura's UI, I checked the raw data and didn't notice anything there either.
Here's an example of what my desired output would look like:
[screenshot of the desired output]
Can anyone identify what I might be doing wrong?
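As a sanity check, here is roughly how I've been isolating the duplicated dates. 'day' is just a stand-in for my date column's name, and the values are made up:

import pandas as pd

# Hypothetical stand-in for the dataframe built from the API response.
df = pd.DataFrame({
    'day': ['2023-02-02', '2023-02-03', '2023-02-03'],
    'score': [71, 80, 3],
})

# Show every row that shares a 'day' with another row -- the extra rows
# should stand out here, which narrows down where they get introduced.
dupes = df[df.duplicated(subset='day', keep=False)]
print(dupes.sort_values('day'))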
More of a conceptual question.
When I import files into Python without specifying the data types -- just straight up df = pd.read_csv("blah.csv") or df = pd.read_excel("blah.xls") -- pandas guesses the data types of the columns.
No issues here.
However, sometimes when I am working with one of the columns -- say, an object column that I know for certain was guessed correctly -- my .str functions don't work as intended, or I get an error. Yet if I specify that column's data type as str after importing, everything works as intended.
I also noticed that if I cast one of the object columns to str after importing, the size of the object increases. So I am guessing pandas' object dtype is different from a "string" dtype? What causes this discrepancy?
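Here is a small self-contained example of the kind of thing I mean (the exact memory figures will differ on your machine):

import pandas as pd

# A column read as object can hold a mixture of Python types.
s = pd.Series([1, 'a', None])
print(s.dtype)                       # object

# After casting, every element really is a Python str.
s_str = s.astype(str)
print(s_str.map(type).unique())      # [<class 'str'>]

# deep=True counts the underlying Python objects, which is where
# the size difference shows up.
print(s.memory_usage(deep=True), s_str.memory_usage(deep=True))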
I am trying to calculate the mean of one column from an Excel file.
I delete all the null values and the '-' entries in the column called 'TFD' and form a new dataframe by selecting three columns. I want to calculate the mean from the new dataframe with groupby. But I get an error, "No numeric types to aggregate", and I don't know why it happens or how to fix it.
import pandas as pd

sheet = pd.read_excel(file)   # file is the path to the Excel workbook
sheet_copy = sheet
sheet_copy = sheet_copy[(~sheet_copy['TFD'].isin(['-'])) & (~sheet_copy['TFD'].isnull())]
sheet_copy = sheet_copy[['Participant ID', 'Paragraph', 'TFD']]
means = sheet_copy['TFD'].groupby([sheet_copy['Participant ID'], sheet_copy['Paragraph']]).mean()
From your spreadsheet snippet above it looks as though your columns are stored in a Text format, which leads me to believe they arrive in your dataframe as strings -- and crucially that includes 'TFD', the column you are aggregating. That is precisely where the "No numeric types to aggregate" exception comes from: the groupby finds no numeric column to take a mean of.
Following this, here are some good examples of group by with the mean clause from the pandas documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.mean.html
If you had the dataset to hand I would have tried it out myself and provided a snippet of the code used.
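That said, here is roughly what I would try (untested without your file, column names copied from your snippet): coerce TFD to a numeric dtype before the groupby.

import pandas as pd

sheet = pd.read_excel('your_file.xlsx')   # placeholder path

# Drop the '-' placeholders and genuine nulls, keep the three columns of interest.
sheet_copy = sheet[(~sheet['TFD'].isin(['-'])) & (~sheet['TFD'].isnull())]
sheet_copy = sheet_copy[['Participant ID', 'Paragraph', 'TFD']].copy()

# TFD arrives as text because of the '-' entries, so convert it to a number first.
sheet_copy['TFD'] = pd.to_numeric(sheet_copy['TFD'], errors='coerce')

means = sheet_copy.groupby(['Participant ID', 'Paragraph'])['TFD'].mean()
print(means)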
I have a fairly large csv file filled with data obtained from a machine for material testing (compression test). The headers of the data are Time, Force, Stroke and they are repeated 10 times because of the sample size, so the last set of headers is Time.10, Force.10, Stroke.10.
Because of the nature of the experiment, not all columns are equally long (some are approx. 2000 rows longer than others). When I import the file into my IDE (Spyder or Jupyter) using pandas, all the cells within the rows that are empty in the csv file are labeled as NaN.
The problem is... I can't do any mathematical operations within or between columns that have NaN values, as they are treated as str. I have tried the most recommended solutions on pretty much all forums: .fillna(), dropna(), replace() and interpolate(). The mentioned methods work, but only visually; e.g. df.fillna(0) replaces all NaN values with 0, but when I then try to find the max value in the column, I still get an error saying there are strings in my column (TypeError: '>' not supported between instances of 'float' and 'str').

The problem is caused 100% by the NaN values that result from the empty cells in the csv file: I imported a csv file in which all columns were the same length (with no empty cells) and there were no problems. If anyone has a solution to this problem (it doesn't need to be within pandas, just within Python) that I've been stuck on for over 2 weeks, I would be grateful.
Try read_csv() with na_filter=False.
This should at least prevent "empty" source cells from being set to NaN.
But note that:
- such "empty" cells will contain an empty string,
- the type of each column containing at least one such cell is object (not a number),
- so that (for the time being) they can't take part in any numeric operations.
So probably, after read_csv() you should:
- replace such empty strings with e.g. 0 (or whatever numeric value),
- call to_numeric(...) to change the type of each column from object to whatever numeric type is appropriate in each case (see the sketch below).
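A rough sketch of those two steps, using one of the headers mentioned in the question (the file name is a placeholder):

import pandas as pd

# Keep "empty" source cells as empty strings instead of NaN.
df = pd.read_csv('compression_test.csv', na_filter=False)

# Replace the empty strings with a numeric value, here 0.
df = df.replace('', 0)

# Convert each column from object to a numeric dtype.
df = df.apply(pd.to_numeric)

print(df.dtypes)
print(df['Force'].max())   # numeric operations work again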
What exactly happens when Pandas issues this warning? Should I worry about it?
In [1]: read_csv(path_to_my_file)
/Users/josh/anaconda/envs/py3k/lib/python3.3/site-packages/pandas/io/parsers.py:1139:
DtypeWarning: Columns (4,13,29,51,56,57,58,63,87,96) have mixed types. Specify dtype option on import or set low_memory=False.
data = self._reader.read(nrows)
I assume that this means that Pandas is unable to infer the type from values on those columns. But if that is the case, what type does Pandas end up using for those columns?
Also, can the type always be recovered after the fact? (after getting the warning), or are there cases where I may not be able to recover the original info correctly, and I should pre-specify the type?
Finally, how exactly does low_memory=False fix the problem?
Revisiting mbatchkarov's link, low_memory is not deprecated.
It is now documented:
low_memory : boolean, default True
Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no
mixed types either set False, or specify the type with the dtype
parameter. Note that the entire file is read into a single DataFrame
regardless, use the chunksize or iterator parameter to return the data
in chunks. (Only valid with C parser)
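For illustration, the two behaviours the docstring contrasts look roughly like this (the file name is a placeholder):

import pandas as pd

# Whole-file inference: more memory, but no chunk-by-chunk type guessing.
df = pd.read_csv('data.csv', low_memory=False)

# Chunked reading is a separate feature and returns an iterator of DataFrames.
for chunk in pd.read_csv('data.csv', chunksize=100000):
    print(chunk.shape)   # stand-in for real per-chunk processing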
I have asked what "resulting in mixed type inference" means, and chris-b1 answered:
It is deterministic -- types are consistently inferred based on what's
in the data. That said, the internal chunksize is not a fixed number
of rows, but instead bytes, so whether you get a mixed dtype warning
or not can feel a bit random.
So, what type does Pandas end up using for those columns?
This is answered by the following self-contained example:
from io import StringIO
import pandas as pd

df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))
DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
type(df.loc[524287,'0'])
Out[50]: int
type(df.loc[524288,'0'])
Out[51]: str
The first part of the csv data was seen as containing only ints, so it was converted to int;
the second part also had a string, so every entry there was kept as a string.
Can the type always be recovered after the fact? (after getting the warning)?
I guess re-exporting to csv and re-reading with low_memory=False should do the job.
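Something along these lines (a sketch, continuing the example above):

from io import StringIO
import pandas as pd

# The same mixed-type frame as in the example above.
df = pd.read_csv(StringIO('\n'.join([str(x) for x in range(1000000)] + ['a string'])))

# Round-trip: write it out, then re-read it in one pass so inference sees the whole column.
df.to_csv('roundtrip.csv', index=False)
df2 = pd.read_csv('roundtrip.csv', low_memory=False)
print(df2['0'].dtype)   # object -- now every value in the column is a string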
How exactly does low_memory=False fix the problem?
It reads all of the file before deciding the type, therefore needing more memory.
low_memory is apparently kind of deprecated, so I wouldn't bother with it.
The warning means that some of the values in a column have one dtype (e.g. str), and some have a different dtype (e.g. float). I believe pandas uses the lowest common super type, which in the example I used would be object.
You should check your data, or post some of it here. In particular, look for missing values or inconsistently formatted int/float values. If you are certain your data is correct, then use the dtype parameter to help pandas out.
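For example, if you know certain columns hold identifiers that merely look numeric, something like this keeps them as strings from the start ('my_file.csv' and the column names are placeholders):

import pandas as pd

# Spell out the ambiguous columns so no guessing takes place at parse time.
df = pd.read_csv('my_file.csv', dtype={'user_id': str, 'postal_code': str})
print(df.dtypes)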