I'm sorry if my title is confusing, but I wasn't sure how to describe the situation I'm trying to understand. Basically, I stumbled upon this question while working with the train_test_split procedure from the sklearn module.
So let me show you an example of what has been confusing me for a couple of hours already.
Let's create a simple dataframe with 3 columns:
'Letter' - a letter of the alphabet;
'Number' - the ordinal number of the letter;
'Type' - the parity of the number (odd or even).
import pandas as pd
data = [['A', 1, 'Odd'], ['B', 2, 'Even'], ['C', 3, 'Odd'],
['D', 4, 'Even'], ['E', 5, 'Odd'], ['F', 6, 'Even'], ['G', 7, 'Odd']]
df = pd.DataFrame(data, columns=['Letter', 'Number', 'Type'])
Using train_test_split, we can create 4 subsets to work with:
from sklearn.model_selection import train_test_split
target = df['Type']
features = df.drop('Type', axis=1)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12)
Now, if we want to see the rows of features_train with odd numbers, we can write the following code:
features_odds = features_train[target_train == 'Odd']
features_odds
And we get exactly what we wanted: the new dataframe contains only the rows with odd numbers.
How does that work? How can features_train get the info from target_train when they are two separate objects?
I think there should be an easy answer, but for some reason I am not able to grasp the mechanics of this right now.
I have also tried a different approach (not using train_test_split), and it works just the same:
target_dummy = df['Type']
features_dummy = df.drop('Type', axis=1)
features_dumb_odds = features_dummy[target_dummy == 'Odd']
features_dumb_odds
I would appreciate any help in understanding this!
target_train == 'Odd' is a Series of boolean values. As a Series it also has an index, and that index is used to align with features_train, the object you index into; the two are compatible.
As a first step of exploration, start with print(target_train == 'Odd').
It's good to think about how the pieces fit together. In this case, the boolean Series and the object you index into need to have exactly the same index, otherwise pandas raises an exception.
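To see the alignment in action, here is a minimal, self-contained sketch (the values and index labels are made up for illustration). The key fact is that train_test_split keeps the original index labels on every piece it returns, which is why target_train still lines up row-for-row with features_train:
import pandas as pd

features = pd.DataFrame({'Number': [1, 2, 3, 4]}, index=[10, 11, 12, 13])
target = pd.Series(['Odd', 'Even', 'Odd', 'Even'], index=[10, 11, 12, 13])

mask = target == 'Odd'   # boolean Series carrying the index [10, 11, 12, 13]
print(features[mask])    # aligned by index label: keeps rows 10 and 12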
I am building an XGBoost model. I did my train-test split on a dataframe with 91 columns. I want to use the model on a new dataframe whose columns differ from the training set's, so I removed the extra columns and added the ones that were present in the train dataset but not in the new one.
However, I still cannot use the model because the new set does not have the same number of columns, yet when I compute the list of column differences, the list is empty.
Do you have an idea of how I could fix this problem?
Thanks in advance for your time!
You can try it like this:
import pandas as pd
X_PAU = pd.DataFrame({'test1': ['A', 'A'], 'test2': [0, 0]})
print(len(X_PAU.columns))
X = pd.DataFrame({'test1': ['A', 'A']})
print(len(X.columns))
# Your check: a set difference is one-directional, so this is empty
print(set(X.columns) - set(X_PAU.columns))  # X has no columns that X_PAU lacks

# Check the other direction instead:
print(X_PAU.columns.difference(X.columns).tolist())  # prints the missing column names
print(len(X_PAU.columns.difference(X.columns).tolist()))  # prints how many are missing
Output:
2
1
set()
['test2']
1
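Once you know which columns are missing, one way to make the new frame match the training schema is pandas' reindex. A minimal sketch, assuming X_PAU carries the training columns and X is the new data:
# Align X to the training columns: extra columns are dropped,
# missing ones are added as NaN, and the order is made identical.
X_aligned = X.reindex(columns=X_PAU.columns)
print(X_aligned)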
I am a beginner in Data Science and I am trying to pivot this data frame using Pandas:
So it becomes something like this (the labels should become the columns and the file paths the rows):
I tried this code which gave me an error:
EDIT:
I have tried Marcel's suggestion; this is the output it gave:
The "label" column is a group or class of file paths. I want to convert it in such a way it fits this function: tf.Keras.preprocessing.image.flow_from_dataframe in categorical
Thanks in advance to all for helping me out.
I did not understand your question very well, but if you just want to convert columns to rows, you can do
train_df.T
which means transpose.
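For illustration, a quick sketch of .T on a toy frame (the column names here are made up):
import pandas as pd

train_df = pd.DataFrame({'labels': ['a', 'b'], 'pathes': [1, 2]})
print(train_df.T)  # index and columns swap places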
I think you are looking for something like this:
import pandas as pd

df = pd.DataFrame({
    'labels': ['a', 'a', 'a', 'b', 'b'],
    'pathes': [1, 2, 3, 4, 5]
})

labels = df['labels'].unique()
new_cols = []
for label in labels:
    # keep only the paths for this label, drop the gaps,
    # and name the resulting column after the label
    col = df['pathes'].where(df['labels'] == label).dropna().reset_index(drop=True)
    new_cols.append(col.rename(label))
df_final = pd.concat(new_cols, axis=1)
print(df_final)
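If you prefer a built-in reshape, here is a sketch of the same result with groupby and pivot (using the same toy frame as above):
# Number each path within its label, then pivot so every label
# becomes a column and its paths fill the rows.
df_final = (df.assign(row=df.groupby('labels').cumcount())
              .pivot(index='row', columns='labels', values='pathes'))
print(df_final)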
I've found what was wrong: I misunderstood y_col and x_col in tf.keras.preprocessing.image.ImageDataGenerator.flow_from_dataframe. Thanks to all of you for your contributions. Your answers are all correct in different ways. Thanks again Marcel h and user16714199!
I have a set of predictor variables (X) and an outcome variable (y) from my df. There are hundreds of variables in my df, so below I only care about a few of them.
X = df[['a', 'b', 'c']]
y = df['d']
I then want to delete all of the rows with missing data for any of my "X" variables, so I ran this:
for i in X:
    df = df[df[i].notna()]
This leaves me with a modified df that has no missing values in the columns of interest. However, X and y are still built from the old df, so I cannot use them as inputs to my model. I know I could just re-run the code that created them in the first place to "refresh" them, but that seems inefficient, and I cannot think of a better way. Thoughts appreciated!
You can use df.dropna:
X = X.dropna()
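If you also need y to stay in sync with X, it may be cleaner to drop the incomplete rows from df once and re-slice both pieces. A sketch, assuming the column names from the question:
# Drop rows with a missing value in any predictor column, then
# rebuild X and y from the cleaned frame so their rows stay aligned.
cols = ['a', 'b', 'c']
df = df.dropna(subset=cols)
X = df[cols]
y = df['d']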
I have the following error: builtins.AssertionError: 12 columns passed, passed data had 6 columns. The last 6 columns will vary data-wise, so I'm happy to have None in the areas where the data is missing. However, I can't seem to find a simple way to do this; I'm pretty sure there must be an option for it, but I can't see it in the docs or any Google searches.
Any help would be appreciated. I would like to reiterate that I know what is causing the problem and I know data is missing from columns. I would like to ignore the missing data and am happy to have None or NaN in the output csv.
I imagine you have fixed headers, so my solution would be to extend each row respectively:
import pandas as pd
import numpy as np
columns = ('Person', 'Title', 'AnotherPerson', 'AnotherPerson2', 'AnotherPerson3', 'AnotherPerson4', 'Date', 'Group')
mandatory = len(columns)
# each inner list is one row; rows may be shorter than the header
data = [[1, 2, 3], [1, 2], [1, 2, 3, 4]]
# turn each row into {position: value} so missing positions can be looked up
data = list(map(lambda x: dict(enumerate(x)), data))
# rebuild every row at full width, filling absent positions with NaN
data = [[item.get(i, np.nan) for i in range(mandatory)] for item in data]
df = pd.DataFrame(data=data, columns=columns)
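A shorter sketch that should produce the same frame, letting pandas do the padding (same columns tuple and row data as above):
raw = [[1, 2, 3], [1, 2], [1, 2, 3, 4]]
# pd.DataFrame pads ragged rows to the longest one; reindex then
# widens every row to the full header width, filling with NaN.
df = pd.DataFrame(raw).reindex(columns=range(mandatory))
df.columns = columns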
I am trying to pass values to stats.friedmanchisquare from a dataframe df, that has shape (11,17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, this line of code becomes unwieldy when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form and leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster, and as it turns out, converting to a numpy array beforehand is about twice as fast as using df in its original dataframe format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with multiple arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
import numpy as np
from scipy.stats import friedmanchisquare

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))