I import pandas as pd, run the code below, and get the following result.
Code:
traindataset = pd.read_csv('/Users/train.csv')
print traindataset.dtypes
print traindataset.shape
print traindataset.iloc[25,3]
traindataset.dropna(how='any')
print traindataset.iloc[25,3]
print traindataset.shape
Output
TripType int64
VisitNumber int64
Weekday object
Upc float64
ScanCount int64
DepartmentDescription object
FinelineNumber float64
dtype: object
(647054, 7)
nan
nan
(647054, 7)
[Finished in 2.2s]
From the result, the dropna line doesn't seem to work: the row count doesn't change and there is still NaN in the dataframe. How does that happen? I'm going crazy right now.
You need to read the documentation (emphasis added):
Return object with labels on given axis omitted
dropna returns a new DataFrame. If you want it to modify the existing DataFrame, all you have to do is read further in the documentation:
inplace : boolean, default False
If True, do operation inplace and return None.
So to modify it in place, do traindataset.dropna(how='any', inplace=True).
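Applied to the code in the question, either of these works (a minimal sketch):
traindataset = traindataset.dropna(how='any')   # assign the result back
# or
traindataset.dropna(how='any', inplace=True)    # modify in place
print(traindataset.shape)                       # the row count drops once NaN rows are removed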
pd.DataFrame.dropna uses inplace=False by default. This is the norm with most Pandas operations; exceptions do exist, e.g. update.
Therefore, you must either assign the result back to your variable, or explicitly set inplace=True:
df = df.dropna(how='any') # assign back
df.dropna(how='any', inplace=True) # set inplace parameter
Stylistically, the former is often preferred because it supports method chaining, and the latter usually does not yield any significant performance benefit.
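For example, assigning back lets you keep chaining (a minimal sketch):
df = df.dropna(how='any').reset_index(drop=True)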
Alternatively, you can also use the notnull() method to select the rows that are not null.
For example, if you want to select non-null values from the columns country and variety of the dataframe reviews:
answer=reviews.loc[(reviews.country.notnull()) & (reviews.variety.notnull())]
But here we are just selecting the relevant rows; to actually remove null values you should use the dropna() method.
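For comparison, the equivalent removal with dropna() restricted to those two columns would be (a sketch):
answer = reviews.dropna(subset=['country', 'variety'])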
This is my first post. I just spent a few hours debugging this exact issue and I would like to share how I fixed this issue.
I was converting my entire dataframe to strings and then placing the values back into the dataframe, using code similar to what is displayed below (please note, the code below only converts the values of one column to strings):
row_counter = 0
for ind, row in dataf.iterrows():
    # convert the cell to a string and write it back into the dataframe
    cell_value = str(row['column_header'])
    dataf.loc[row_counter, 'column_header'] = cell_value
    row_counter += 1
After converting the entire dataframe to strings, I then used the dropna() function. The values that were previously NaN (considered a null value by pandas) had been converted to the string 'nan', so nothing was dropped.
In conclusion, drop blank values FIRST, before you start manipulating data in the CSV and converting its data types.
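A minimal sketch of the pitfall: once a missing value has been passed through str(), dropna() no longer sees it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'column_header': [1.0, np.nan]})
df['column_header'] = df['column_header'].astype(str)   # NaN becomes the string 'nan'
print(df.dropna(how='any').shape)                       # (2, 1) -- nothing is dropped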
I have a data frame like this:
df_merge_filtered=dfmerge[['category','title','body','start','end','relatedEntities.id','location.position.lat','location.position.lon']]
riskc=df_merge_filtered['category'].str.split(',', expand=True).melt()['value']\
.str.split(':', expand=True).rename(columns={0:'Level1', 1:'Level2'})
This code breaks when there are 0 records in the df_merge_filtered dataframe.
Error
Can only use .str accessor with string values!
I want to create the dummy dataframe structure even if there are no records to be processed.
How can I fix this?
With no rows of data, the melt() call and the ['value'] selection produce an empty column with the default float64 dtype. You can force this back to a string column so the code runs, but without any ':' to split on you still don't get the right result.
In[] df_merge_filtered = pd.DataFrame(columns=['category','title','body','start','end','relatedEntities.id','location.position.lat','location.position.lon'])
In[] df_merge_filtered['category'].str.split(',', expand=True).melt()['value']
Out[]: Series([], Name: value, dtype: float64) # float64 dtype on empty result!
If we force the string dtype, then expanding on ':' gives a 0-row DataFrame. The rename would give you the 2 columns you want, but there is nothing to rename:
df_merge_filtered['category'].str.split(',', expand=True).melt() \
['value'].astype('string') \
.str.split(':', expand=True)
#.rename(columns={0:'Level1', 1:'Level2'})
Out[]:
Empty DataFrame
Columns: []
Index: []
Maybe you could detect the 0-row case and handle it separately?
if df_merge_filtered.shape[0] == 0:
    riskc = pd.DataFrame(columns=['Level1', 'Level2'])
else:
    # original code
    riskc = df_merge_filtered['category'].str.split(',', expand=True).melt()['value'] \
        .str.split(':', expand=True).rename(columns={0: 'Level1', 1: 'Level2'})
I'll ask as a sanity check: do you know that you always have the same number of commas (otherwise you'll have missing values after the first split) and only one colon per item (otherwise you'll have multiple columns with missing values after the second split)?
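To illustrate the first point, a small sketch: with an uneven number of commas, the expanded split pads the missing positions with None:
import pandas as pd

s = pd.Series(['a:1,b:2', 'c:3'])
print(s.str.split(',', expand=True))   # the second row gets None in column 1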
I'm using Pandas 0.20.3 with Python 3.x. I want to add one column to a pandas data frame from another pandas data frame. Both data frames contain 51 rows, so I used the following code:
class_df['phone']=group['phone'].values
I got the following error message:
ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series
class_df.dtypes gives me:
Group_ID object
YEAR object
Terget object
phone object
age object
and type(group['phone']) returns pandas.core.series.Series
Can you suggest what changes I need to make to remove this error?
The first 5 rows of group['phone'] are given below:
0 [735015372, 72151508105, 7217511580, 721150431...
1 []
2 [735152771, 7351515043, 7115380870, 7115427...
3 [7111332015, 73140214, 737443075, 7110815115...
4 [718218718, 718221342, 73551401, 71811507...
Name: phone, dtype: object
In most cases, this error comes up when you end up with an empty dataframe. The approach that worked best for me was to check whether the dataframe is empty before using apply():
if len(df) != 0:
    df['indicator'] = df.apply(assign_indicator, axis=1)
You have a column of ragged lists. Your only option is to assign a list of lists, not an array of lists (which is what .values gives).
class_df['phone'] = group['phone'].tolist()
The error in the question's title,
"ValueError: Cannot set a frame with no defined index and a value that cannot be converted to a Series"
may also occur if, for whatever reason, the table does not have any rows.
Instead of using an if-statement, you can set the result_type argument of the apply() function to "reduce":
df['new_column'] = df.apply(func, axis=1, result_type='reduce')
The data assigned to a column of a DataFrame must be a one-dimensional array. For example, consider a num_arr to be added to a DataFrame:
num_arr.shape
(1, 126)
For this num_arr to be added as a DataFrame column, it should be reshaped:
num_arr = num_arr.reshape(-1, )
num_arr.shape
(126,)
Now I can set this array as a DataFrame column:
df = pd.DataFrame()
df['numbers'] = num_arr
I have a very large csv which I need to read in. To make this fast and save RAM, I am using read_csv and setting the dtype of some columns to np.uint32. The problem is that some rows have missing values and pandas uses a float to represent those.
Is it possible to simply skip rows with missing values? I know I could do this after reading in the whole file, but that means I couldn't set the dtype until then and so would use too much RAM.
Is it possible to convert missing values to some other value I choose while reading in the data?
It would be handy if you could fill NaN with, say, 0 during the read itself. Perhaps a feature request on Pandas's GitHub is in order...
Using a converter function
However, for the time being, you can define your own function to do that and pass it to the converters argument in read_csv:
def conv(val):
    # read_csv hands the converter the raw field as a string, so test for an
    # empty field rather than np.nan (NaN never compares equal to itself)
    if val == '':
        return 0  # or whatever else you want to represent your NaN with
    return val

df = pd.read_csv(file, converters={colWithNaN: conv}, dtype=...)
Note that converters takes a dict, so you need to specify it for each column that has NaN to be dealt with. It can get a little tiresome if a lot of columns are affected. You can specify either column names or numbers as keys.
Also note that this might slow down your read_csv performance, depending on how the converters function is handled. Further, if you just have one column that needs NaNs handled during read, you can skip a proper function definition and use a lambda function instead:
df = pd.read_csv(file, converters={colWithNaN: lambda x: 0 if x == '' else x}, dtype=...)
Reading in chunks
You could also read the file in small chunks that you stitch together to get your final output. You can do a bunch of things this way. Here is an illustrative example:
result = pd.DataFrame()
df = pd.read_csv(file, chunksize=1000)
for chunk in df:
    chunk.dropna(axis=0, inplace=True)  # Dropping all rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    result = result.append(chunk)
del df, chunk
Note that this method does not strictly duplicate data. There is a moment when the data in chunk exists twice, right after the result.append statement, but only chunksize rows are repeated, which is a fair bargain. This method may also work out to be faster than using a converter function.
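DataFrame.append was removed in newer pandas versions; an equivalent sketch collects the chunks in a list and concatenates them once at the end:
pieces = []
for chunk in pd.read_csv(file, chunksize=1000):
    chunk = chunk.dropna(axis=0)                                # drop rows with any NaN value
    chunk[colToConvert] = chunk[colToConvert].astype(np.uint32)
    pieces.append(chunk)
result = pd.concat(pieces, ignore_index=True)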
There is no feature in Pandas that does that. You can implement it in regular Python like this:
import csv
import pandas as pd
def filter_records(records):
    """Given an iterable of dicts, converts values to int.
    Discards any record which has an empty field."""
    for record in records:
        for k, v in record.items():
            if v == '':
                break
            record[k] = int(v)
        else:  # this executes whenever break did not
            yield record

with open('t.csv') as infile:
    records = csv.DictReader(infile)
    df = pd.DataFrame.from_records(filter_records(records))
Pandas uses the csv module internally anyway. If the performance of the above turns out to be a problem, you could probably speed it up with Cython (which Pandas also uses).
If you show some data, people on SO could help.
pd.read_csv('FILE', keep_default_na=False)
For starters try these:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
na_values : str or list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘nan’.
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.
na_filter : boolean, default True
Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file
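For instance, a minimal sketch combining these options (the column name 'col' is made up): read everything as strings with keep_default_na=False so missing fields stay as empty strings, drop those rows, then downcast:
import numpy as np
import pandas as pd

df = pd.read_csv('FILE', dtype=str, keep_default_na=False)
df = df[df['col'] != '']                  # 'col' is a placeholder column name
df['col'] = df['col'].astype(np.uint32)   # now safe to downcast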
city state neighborhoods categories
Dravosburg PA [asas,dfd] ['Nightlife']
Dravosburg PA [adad] ['Auto_Repair','Automotive']
I have the above dataframe and I want to convert each element of the lists into a column, e.g.:
city state asas dfd adad Nightlife Auto_Repair Automotive
Dravosburg PA 1 1 0 1 1 0
I am using the following code to do this:
def list2columns(df):
    """
    to convert list in the columns
    of a dataframe
    """
    columns=['categories','neighborhoods']
    for col in columns:
        for i in range(len(df)):
            for element in eval(df.loc[i,"categories"]):
                if len(element)!=0:
                    if element not in df.columns:
                        df.loc[:,element]=0
                    else:
                        df.loc[i,element]=1
How can I do this in a more efficient way?
And why is there still the warning below when I am already using df.loc?
SettingWithCopyWarning: A value is trying to be set on a copy of a slice
from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead
Since you're using eval(), I assume each column holds a string representation of a list, rather than a list itself. Also, unlike your example above, I'm assuming there are quotes around the items in the lists in your neighborhoods column (df.loc[0, 'neighborhoods'] == "['asas','dfd']"), because otherwise your eval() would fail.
If this is all correct, you could try something like this:
def list2columns(df):
    """
    to convert list in the columns of a dataframe
    """
    columns = ['categories','neighborhoods']
    new_cols = set()  # all new columns added
    for col in columns:
        for i in range(len(df[col])):
            # get the list of columns to set
            set_cols = eval(df[col].iloc[i])
            # set the values of these columns to 1 in the current row, one at a time
            # (if this causes new columns to be added, other rows will get NaNs)
            for element in set_cols:
                df.loc[i, element] = 1
            # remember which new columns have been added
            new_cols.update(set_cols)
    # convert any un-set values in the new columns to 0
    df[list(new_cols)].fillna(value=0, inplace=True)
    # if that doesn't work, this may:
    # df.update(df[list(new_cols)].fillna(value=0))
I can only speculate on an answer to your second question, about the SettingWithCopy warning.
It's possible (but unlikely) that using df.iloc instead of df.loc will help, since that is intended to select by row number (in your case, df.loc[i, col] only works because you haven't set an index, so pandas uses the default index, which matches the row number).
Another possibility is that the df that is passed in to your function is already a slice from a larger dataframe, and that is causing the SettingWithCopy warning.
I've also found that using df.loc with mixed indexing modes (logical selectors for rows and column names for columns) produces the SettingWithCopy warning; it's possible that your slice selectors are causing similar problems.
Hopefully the simpler and more direct indexing in the code above will solve any of these problems. But please report back (and provide code to generate df) if you are still seeing that warning.
Use this instead
def list2columns(df):
    """
    to convert list in the columns
    of a dataframe
    """
    df = df.copy()
    columns=['categories','neighborhoods']
    for col in columns:
        for i in range(len(df)):
            for element in eval(df.loc[i,"categories"]):
                if len(element)!=0:
                    if element not in df.columns:
                        df.loc[:,element]=0
                    else:
                        df.loc[i,element]=1
    return df
I have a Pandas Dataframe with different dtypes for the different columns. E.g. df.dtypes returns the following.
Date datetime64[ns]
FundID int64
FundName object
CumPos int64
MTMPrice float64
PricingMechanism object
Various of these columns have missing values in them. Doing group operations with NaN values in place causes problems. Getting rid of them with the .fillna() method is the obvious choice. The problem is that the obvious choice for strings is .fillna(""), while .fillna(0) is the correct choice for ints and floats. Using either method on the whole DataFrame throws an exception. Are there any elegant solutions besides doing them individually (I have about 30 columns)? I have a lot of code depending on the DataFrame and would prefer not to retype the columns, as that is likely to break some other logic.
Can do:
df.FundID.fillna(0)
df.FundName.fillna("")
etc
You can iterate through them and use an if statement!
for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if dt == int or dt == float:
        df[col].fillna(0)
    else:
        df[col].fillna("")
When you iterate through a pandas DataFrame, you will get the names of each of the columns, so to access those columns, you use df[col]. This way you don't need to do it manually and the script can just go through each column and check its dtype!
You can grab the float64 and object columns using:
In [11]: float_cols = df.blocks['float64'].columns
In [12]: object_cols = df.blocks['object'].columns
and int columns won't have NaNs, otherwise they would have been upcast to float.
Now you can apply the respective fillnas, one cheeky way:
In [13]: d1 = dict((col, '') for col in object_cols)
In [14]: d2 = dict((col, 0) for col in float_cols)
In [15]: df.fillna(value=dict(d1, **d2))
A compact version example:
# replace NaN with '' for columns of type 'object'
df=df.select_dtypes(include='object').fillna('')
However, after the above operation, the dataframe will only contain the 'object' type columns. To keep all columns, use the solution proposed by @Ryan Saxe.
@Ryan Saxe's answer is accurate. To get it to work on my data I also had to pass the fill values (0 and "") and set inplace=True. See the code below:
for col in df:
    # get dtype for column
    dt = df[col].dtype
    # check if it is a number
    if dt == int or dt == float:
        df[col].fillna(0, inplace=True)
    else:
        df[col].fillna("", inplace=True)
Similar to @Guddi: a bit verbose, but still more concise than @Ryan's answer, and it keeps all columns:
df[df.select_dtypes("object").columns] = df.select_dtypes("object").fillna("")
Rather than running the conversion one column at a time, which is inefficient, here is a way to grab all of the int or float columns and change them in one shot.
int_float_cols = df.select_dtypes(include=['int', 'float']).columns
df[int_float_cols] = df[int_float_cols].fillna(value=0)
It is obvious how to adapt this to handle object columns, as sketched below.
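A sketch (the same pattern applied to the object columns):
obj_cols = df.select_dtypes(include=['object']).columns
df[obj_cols] = df[obj_cols].fillna(value='')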
I'm aware that in older Pandas versions no NAs were allowed in integer columns, so grabbing the "ints" is not strictly necessary and may accidentally promote ints to floats. However, in our use case it is better to be safe than sorry.
I ran into this because the ordinary approach, df.fillna(0), corrupted all of the datetime variables.