I currently have a CSV that contains many rows (around 200k) with many columns. I want a time series train/test split: my dataset contains many unique items, and for each item the first 80% of its rows (chronologically) should end up in the training data. I wrote the following code to do so:
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date', ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I appending them incorrectly?
Your immediate issue is that you are not re-assigning the result inside the for loop; DataFrame.append returns a new DataFrame rather than modifying in place:
train = train.append(trainvals)
test = test.append(testvals)
However, growing a large object like a DataFrame inside a loop is memory-inefficient. Instead, consider iterating over groupby with a list comprehension to build a list of dictionaries holding each item's train and test splits, then call pd.concat once to bind each set together. A small helper function keeps the processing organized:
def split_dfs(df):
    # Sort one item's rows chronologically, then cut at the 80% mark
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]
    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(group) for g, group in df.groupby('itemid')]
train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
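As a quick sanity check (hypothetical, but using the frames built above), the overall train share should land near 0.8:
print(len(train_df) / (len(train_df) + len(test_df)))  # ~0.8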
You can avoid the explicit loop by computing each item's 80% Date cutoff with a grouped quantile and splitting on that:
cutoff = df.groupby('itemid')['Date'].transform(lambda x: x.quantile(0.8))
train = df.loc[df['Date'] <= cutoff]
test = df.loc[~df.index.isin(train.index), :] # all rows not in train
Note this could have unexpected behavior if df.index is not unique.
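If that is a concern, resetting the index first is a one-line guard (assuming the original index carries nothing you need):
df = df.reset_index(drop=True)  # guarantees unique row labels before splitting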
I have the following piece of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
1. Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
2. Stacking these DataFrames below each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis=1, ignore_index=True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea of what I should try?
First of all, avoid using pd.concat to build the main DataFrame inside a for loop, as it gets quadratically slower: each iteration copies every row accumulated so far. The problem you are facing is that pd.concat should receive as input a list of DataFrames (or Series), however you are passing [df_final, concat], which, in essence, is a list containing 2 elements: one DataFrame and one list of Series. Ultimately, since you want to stack the per-batch DataFrames vertically, the final axis should be 0 and not 1.
Therefore, I suggest you to do the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    # put the four per-client Series side by side for this batch
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis=1, ignore_index=True)
    df_final.append(concat)
# a single concat at the end stacks all batches vertically
df_final = pd.concat(df_final, axis=0, ignore_index=True)
return df_final
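One caveat: with axis=1, ignore_index=True relabels the columns 0 through 3, while you wanted named columns. If the names matter, drop ignore_index on that inner concat and set them explicitly (a small sketch reusing the four Series above):
concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis=1)
concat.columns = ['unique_request', 'unique_ua', 'reply_length_avg', 'response4xx']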
Note that pd.concat receives a list of DataFrames, not a list with another list nested inside it! This approach is also way faster, since the pd.concat inside the for loop only handles one batch at a time and appending to a plain Python list is cheap :)
I hope it helps!
I want to convert a particular categorical variable into dummy variables using pd.get_dummies() for both test and train data, so instead of doing it for each separately, I used a for loop. However, the following code does not work, and .head() returns the unchanged dataset.
combine = [train_data, test_data]
for dataset in combine:
    dummy_col = pd.get_dummies(dataset['targeted_sex'])
    dataset = pd.concat([dataset, dummy_col], axis=1)
    dataset.drop('targeted_sex', axis=1, inplace=True)
train_data.head() # does not change
Even if I use an iterator which traverses the index like this, it still doesn't work.
for i in range(len(combine)):
Can I get some help? Also, Pandas get_dummies() doesn't provide an inplace option.
For referencing purposes, I would use a dict:
Create a dictionary of train and test:
combine = {'train_data': train_data, 'test_data': test_data}
Use this code which uses a dict comprehension:
new_combine = {k: pd.concat([dataset, pd.get_dummies(dataset['targeted_sex'])], axis=1)
                  .drop('targeted_sex', axis=1) for k, dataset in combine.items()}
Print test and train now by referencing the keys:
print(new_combine['train_data']) #same for test
You need to print dataset.head() inside the loop instead of train_data.head(): the line dataset = pd.concat(...) rebinds the loop variable to a new DataFrame and never touches train_data itself.
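If you prefer your range-based loop, reassign into the list and then rebind the originals (a minimal sketch of that fix):
for i in range(len(combine)):
    dummy_col = pd.get_dummies(combine[i]['targeted_sex'])
    combine[i] = pd.concat([combine[i], dummy_col], axis=1).drop('targeted_sex', axis=1)
train_data, test_data = combine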
You can use this function:
def dummy_df(df, todummy_list):
    # df: dataframe
    # todummy_list: list of column names which will be dummies
    for x in todummy_list:
        dummies = pd.get_dummies(df[x], prefix=x, dummy_na=False)
        df = df.drop(x, axis=1)
        df = pd.concat([df, dummies], axis=1)
    return df
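Usage (hypothetical column list; note you must reassign, since the function returns a new frame):
train_data = dummy_df(train_data, ['targeted_sex'])
test_data = dummy_df(test_data, ['targeted_sex'])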
I have two DataFrames that are nearly identical in structure, and I want to perform data transformation/cleaning on them simultaneously. To do this, I created a list that contains both of these DFs and looped through the list.
ex:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
combined = [train, test]
for dataset in combined:
    dataset = dataset.drop(['Age'], axis=1)
    print(dataset.head())
The final print statement in the for loop works fine -- the 'Age' column is dropped. However, if I immediately call train.head(), then the dropped column is still present in the DataFrame. It's almost as though two copies of "train" and "test" are being created: the ones inside the "combined" list and the ones outside. Is there something I need to do to make these changes persist?
This seems like it should be so simple, and it's driving me nuts!
You are rebinding the dataset variable to a brand-new DataFrame on each pass through the loop, and the drop is performed on that new object. So you are indeed, as you say, working on copies of train and test. What you want is to drop that column in place, rather than re-assigning:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
combined = [train, test]
for dataset in combined:
    dataset.drop(['Age'], axis=1, inplace=True)
    # print(dataset.head())
Note that another solution would be simply to skip that column when you load the files:
train = pd.read_csv('train.csv', usecols=lambda x: x!='Age')
test = pd.read_csv('test.csv', usecols=lambda x: x!='Age')
In addition to @sacul's answer, a common way to modify the values of a list in place is to reassign by index while enumerating:
lst = [1, 2, 3, 4]  # any list
for i, elem in enumerate(lst):
    lst[i] = elem + 1  # can be any method here
lst
Out[24]: [2, 3, 4, 5]
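Applied to the question's DataFrames, the same pattern looks like this (a sketch; rebinding the list slot is what makes the change stick):
for i, dataset in enumerate(combined):
    combined[i] = dataset.drop(['Age'], axis=1)
train, test = combined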
How do I compare the column names of 2 different Pandas DataFrames? I want to compare train and test DataFrames, where some columns are missing in the test DataFrame.
pandas.Index objects, including dataframe columns, have useful set-like methods, such as intersection and difference.
For example, given dataframes train and test:
train_cols = train.columns
test_cols = test.columns
common_cols = train_cols.intersection(test_cols)    # columns present in both
train_not_test = train_cols.difference(test_cols)   # columns only in train
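If the end goal is to make test line up with train, those Index results feed straight into reindex (a sketch; the fill_value of 0 is an assumption about what the missing columns should hold):
test_aligned = test.reindex(columns=train.columns, fill_value=0)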
data = pd.read_csv("file.csv")
As = data.groupby('A')
for name, group in As:
    current_column = group.iloc[:, i]
    current_column.iloc[0] = np.NAN
The problem: 'data' stays the same after this loop, even though I'm trying to set values to np.NAN.
As @ohduran suggested:
data = pd.read_csv("file.csv")
As = data.groupby('A')
new_data = pd.DataFrame()
for name, group in As:
    # edit grouped data
    # e.g. group.loc[:, 'column'] = np.nan
    new_data = new_data.append(group)
.groupby() does not change the initial DataFrame; each group you iterate over is a copy. You might want to apply your edits to each group and accumulate the results in a different DataFrame with that for loop, as above.
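If all you need is to blank the first row of some column within each group, you can also write back into data directly through the grouped row labels (a sketch; 'col' stands in for whichever column your i points at):
data.loc[data.groupby('A').head(1).index, 'col'] = np.nan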