I'm trying to build a multiple regression model with qualitative data.
To do that, I need to build a new data frame with one column per unique value, marking 1 where the row had that value.
Example:
import pandas as pd

d = {'City': ['Tokyo','Tokyo','Lisbon','Tokyo','Madrid','Lisbon','Madrid','London','Tokyo','London','Tokyo'],
     'Card': ['Visa','Visa','Visa','Master Card','Bitcoin','Master Card','Bitcoin','Visa','Master Card','Visa','Bitcoin'],
     'Client Number': [1,2,3,4,5,6,7,8,9,10,11],
    }
d = pd.DataFrame(data=d).set_index('Client Number')
And get a result equal to the output shown below.
Let us try get_dummies:
df = pd.get_dummies(d, prefix='', prefix_sep='')
Out[202]:
Lisbon London Madrid Tokyo Bitcoin Master Card Visa
Client Number
1 0 0 0 1 0 0 1
2 0 0 0 1 0 0 1
3 1 0 0 0 0 0 1
4 0 0 0 1 0 1 0
5 0 0 1 0 1 0 0
6 1 0 0 0 0 1 0
7 0 0 1 0 1 0 0
8 0 1 0 0 0 0 1
9 0 0 0 1 0 1 0
10 0 1 0 0 0 0 1
11 0 0 0 1 1 0 0
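An aside for the regression use case (not part of the original answer): with a full set of dummies, each category's columns sum to 1, which is collinear with an intercept term. Passing drop_first=True drops one level per category and avoids the dummy-variable trap:
# Drop one dummy per category to avoid perfect multicollinearity
# when fitting a regression with an intercept.
df = pd.get_dummies(d, prefix='', prefix_sep='', drop_first=True)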
I'm having trouble describing exactly what I want to achieve. I've tried looking here on Stack Overflow for others with the same problem, but I'm unable to find any. So I will try to describe exactly what I want and give you some sample setup code.
I would like a function that gives me a new column/pd.Series. This new column holds boolean True values (or ints) based on a certain condition.
The condition is as follows: there are N columns (8 in the example), each with the same name but ending in a different number, i.e. column_1, column_2, etc. The function I need is twofold:
If N is given, check for each column whether its row and the rows of the next N columns are also True/1.
If N is not given, check for each column whether the rows of all following columns are also True/1, using the numbers as IDs to locate the columns.
import numpy as np
import pandas as pd

def get_df_series(df: pd.DataFrame, columns_ids: list, n: int = 8) -> pd.DataFrame:
    for i in columns_ids:
        # missing code here .. I don't know if this would be the way to go
        pass
    return df

def create_dataframe(numbers: list) -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    # create a column for each number, with the number as ID and random boolean values as ints
    for i in numbers:
        df[f'column_{i}'] = np.random.randint(2, size=20)
    return df

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5, 6, 7, 8]
    df = create_dataframe(numbers=numbers)
    df = get_df_series(df=df, columns_ids=numbers, n=3)
I have some experience with Pandas dataframes and know how to build IF/ELSE logic with np.select, for example.
(function) select(condlist: Sequence[ArrayLike], choicelist: Sequence[ArrayLike], default: ArrayLike = ...) -> NDArray
The problem I'm running into is that I don't know how to write the conditional when I don't know how many columns lie ahead. For example, if I want to know for column_5 whether the next 3 are also true, I can hardcode this, but I have columns up to ID 20 and would love not to hardcode everything from column_1 to column_20 when I want to know whether all conditions in all those columns are true.
Now, I don't even know if this is possible, so any feedback would be appreciated. Even just a hint on where to look for a way to do this.
EDIT: What I forgot to mention is that there will be unrelated columns in between that obviously cannot be taken into account. For example: main_column_1, main_column_2, main_column_3, side_column_1, side_column_2, right_column_1, main_column_3, main_column_4, etc.
The answer Corralien gave is correct, but I should have made my question clearer.
I need to be able to, say, look at main_column and, for that one, look ahead N columns of the same type: main_column. (A sketch addressing this follows the answer below.)
Try:
n = 3
out = (df.rolling(n, min_periods=1, axis=1).sum()
.shift(-n+1, fill_value=0, axis=1).eq(n).astype(int)
.rename(columns=lambda x: 'result_' + x.split('_')[1]))
Output:
>>> out
result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8
0 1 1 1 1 1 1 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 1 1 1 0 0 0 0
9 0 0 0 0 0 1 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 1 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 1 1 0 0 0
14 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0
18 0 0 1 0 0 0 0 0
19 0 0 0 0 0 0 0 0
Input:
>>> df
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8
0 1 1 1 1 1 1 1 1
1 0 1 0 0 0 1 1 0
2 1 1 0 1 0 1 1 0
3 1 0 1 0 0 0 0 0
4 1 0 0 1 1 1 0 1
5 1 1 0 1 0 1 1 0
6 1 0 1 0 0 0 0 1
7 0 0 1 0 0 0 0 0
8 0 1 1 1 1 1 0 0
9 1 0 1 1 0 1 1 1
10 0 0 1 1 0 0 1 1
11 1 0 1 0 1 1 1 0
12 0 1 1 0 1 0 1 0
13 0 0 0 1 1 1 1 0
14 0 0 1 1 0 1 1 1
15 1 0 0 1 0 1 0 0
16 1 0 0 0 0 0 0 1
17 0 0 1 1 1 0 0 1
18 0 0 1 1 1 0 0 1
19 0 0 1 0 0 0 1 0
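To address the EDIT (mixed main_column_*/side_column_* names), one option is to select the relevant columns first and run the exact same pipeline. A minimal sketch, assuming df holds a frame with the naming scheme from the EDIT:
n = 3
# Keep only the main_column_* columns, then apply the same rolling lookahead.
main = df.filter(regex=r'^main_column_\d+$')
out = (main.rolling(n, min_periods=1, axis=1).sum()
           .shift(-n + 1, fill_value=0, axis=1).eq(n).astype(int)
           .rename(columns=lambda x: 'result_' + x.split('_')[-1]))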
I have a dataframe where some cells contain lists of multiple values. How can I create new columns based on the unique values in those lists? The lists can contain values already included in previous observations, and they can also be empty. How do I create new columns (one-hot encoding) based on those values?
CHECK EDIT - the data is within quotation marks (strings, not lists):
import pandas as pd

data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
my_new_pd = pd.DataFrame(data)
0 ["Spain", "Germany", "England", "Japan"]
1 ["Spain", "Germany"]
2 ["Morocco"]
3 []
4 ["Japan", ""]
5 []
Name: tokens, dtype: object
I want something like
   tokens_Spain  tokens_Germany  tokens_England  tokens_Japan  tokens_Morocco
0             1               1               1             1               0
1             1               1               0             0               0
2             0               0               0             0               1
3             0               0               0             0               0
4             0               0               1             1               0
5             0               0               0             0               0
Method one: sklearn's MultiLabelBinarizer, since you already have a list-type column in your df:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
yourdf=pd.DataFrame(mlb.fit_transform(df['tokens']),columns=mlb.classes_, index=df.index)
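Note: per the question's edit, the tokens column holds string representations of lists rather than lists. In that case the strings would need to be parsed first; a small sketch, assuming that format:
import ast
# Parse strings like '["Spain", "Germany"]' (and '[]') into real Python
# lists before handing the column to MultiLabelBinarizer or explode.
df['tokens'] = df['tokens'].apply(ast.literal_eval)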
Method two: explode first, then build the dummies:
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
tokens_A tokens_B tokens_C tokens_D tokens_Z
0 1 1 1 1 0
1 1 1 0 0 0
2 0 0 0 0 1
3 0 0 0 0 0
4 0 0 0 1 1
5 0 0 0 0 0
Method three: kind of like "explode", but spreading each list across columns instead of rows:
pd.get_dummies(pd.DataFrame(df.tokens.tolist()), prefix='tokens', prefix_sep='_').sum(level=0, axis=1)
tokens_A tokens_D tokens_Z tokens_B tokens_C
0 1 1 0 1 1
1 1 0 0 1 0
2 0 0 1 0 0
3 0 0 0 0 0
4 0 1 1 0 0
5 0 0 0 0 0
Update, with the question's actual data:
df['tokens'].explode().str.get_dummies().sum(level=0).add_prefix('tokens_')
tokens_England tokens_Germany tokens_Japan tokens_Morocco tokens_Spain
0 1 1 1 0 1
1 0 1 0 0 1
2 0 0 0 1 0
3 0 0 0 0 0
4 1 0 1 0 0
5 0 0 0 0 0
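A version note, not part of the original answer: recent pandas releases removed the level= keyword on aggregations, so method two would be written with an explicit groupby instead:
# Equivalent to .sum(level=0): collapse the duplicate index labels
# produced by explode back into one row per original index.
df['tokens'].explode().str.get_dummies().groupby(level=0).sum().add_prefix('tokens_')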
Seems like an easy question, but I'm running into an odd error. I have a large dataframe with 24+ columns that each contain 1s or 0s. I want to concatenate the fields to create a binary key that will act as a signature.
However, when the number of columns exceeds 12, the whole process falls apart.
a = np.zeros(shape=(3,12))
df = pd.DataFrame(a)
df = df.astype(int) # This converts each 0.0 into just 0
df[2]=1 # Changes one column to all 1s
#result
0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
Concatenating the columns...
df['new'] = df.astype(str).sum(1).astype(int).astype(str)  # Concatenate
df['new'] = df['new'].apply('{0:0>12}'.format)  # Pad with leading zeros
# result
0 1 2 3 4 5 6 7 8 9 10 11 new
0 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
1 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
2 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
This is good. However, if I increase the number of columns to 13, I get...
a = np.zeros(shape=(3,13))
# ...same intermediate steps as above...
0 1 2 3 4 5 6 7 8 9 10 11 12 new
0 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
1 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
2 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
Why am I getting -2147483648? I was expecting 0010000000000
Any help is appreciated!
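A hedged note on what is likely happening: -2147483648 is the minimum of a 32-bit integer, which suggests the 13-digit number produced by astype(int) overflows int32 (the default integer size on some platforms, e.g. Windows). One way around it is to skip the integer round-trip entirely:
# Each cell is already a single character after astype(str), so summing
# concatenates the digits directly: no overflow, no zero-padding needed.
df['new'] = df.astype(str).sum(axis=1)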
I have a huge data set (sample below), and I need to compute the co-occurrence matrix of the skill column. I read about co-occurrence matrices, and CountVectorizer from scikit-learn shed some light, so I wrote the code below, but I am confused about how to read the results. If anyone can help, please find the sample data and my attempt below.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df1 = pd.DataFrame([["1000074", "6284 6295"],
                    ["75634786", "4044 4714 5789 6076 6077 6079 6082 6168 6229"],
                    ["75635714", "4092 4420 4430 4437 4651"]],
                   columns=['people_id', 'skills_id'])
count_vect = CountVectorizer(ngram_range=(1, 1), lowercase=False)
X_counts = count_vect.fit_transform(df1['skills_id'])
Xc = (X_counts.T * X_counts)  # sparse token-by-token co-occurrence product
Xc.setdiag(0)                 # zero out each token's co-occurrence with itself
print(Xc.todense())
I am pretty new to this idea of a co-occurrence matrix over numbers; word-to-word co-occurrence I can understand, but how do I read and understand this result?
Well, you may think of it just like a word-to-word co-occurrence matrix. Here, assuming your skill column is the second column with the 4-digit codes, it first collects all unique possible values:
>>> count_vect.get_feature_names()
['4044',
'4092',
'4420',
'4430',
'4437',
'4651',
'4714',
'5789',
'6076',
'6077',
'6079',
'6082',
'6168',
'6229',
'6284',
'6295']
That's an array of size 16, representing the 16 distinct tokens found in your skill column. Indeed, sklearn.feature_extraction.text.CountVectorizer() finds the tokens by splitting your strings on spaces.
The final matrix you see with print(Xc.todense()) is just the co-occurrence matrix for these 16 tokens; that's why it has shape (16, 16).
To make it clearer (please forgive the column alignment formatting), you could look at:
>>> pd.DataFrame(Xc.todense(),
                 columns=count_vect.get_feature_names(),
                 index=count_vect.get_feature_names())
4044 4092 4420 ...
4044 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 0
4092 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0
4420 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0
4430 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
4437 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0
4651 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
4714 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0
5789 1 0 0 0 0 0 1 0 1 1 1 1 1 1 0 0
6076 1 0 0 0 0 0 1 1 0 1 1 1 1 1 0 0
6077 1 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0
6079 1 0 0 0 0 0 1 1 1 1 0 1 1 1 0 0
6082 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 0
6168 1 0 0 0 0 0 1 1 1 1 1 1 0 1 0 0
6229 1 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0
6284 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6295 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
tl;dr In this case, since you input strings, whether they are numbers (e.g. "23") or nouns (e.g. "cat") changes nothing. The matrix entries indicate whether a given token is found together with another one (strictly, they are co-occurrence counts; they only look binary here because no skill repeats within a row). The default tokenizer for CountVectorizer() just splits on spaces.
What exactly would you have expected differently with numbers ?
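One aside, not from the original answer: if a skill could repeat within a single row, the entries of Xc could exceed 1. Passing binary=True makes the per-row counts 0/1, so Xc then counts the number of rows in which two skills co-occur:
# binary=True clips each row's token counts to 0/1 before the product,
# so Xc[i, j] = number of rows containing both token i and token j.
count_vect = CountVectorizer(ngram_range=(1, 1), lowercase=False, binary=True)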
I'm basically trying to one hot encode a column with values like this:
tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...
First I got the set of all tickers (about 467 of them):
all_tickers = list()
for tickers in df.tickers:
    for ticker in tickers:
        all_tickers.append(ticker)
all_tickers = set(all_tickers)
Then I implemented One Hot Encoding this way:
for i in range(len(df.index)):
    for ticker in all_tickers:
        if ticker in df.iloc[i]['tickers']:
            df.at[i+1, ticker] = 1
        else:
            df.at[i+1, ticker] = 0
The problem is that the script runs incredibly slowly when processing 5000+ rows.
How can I improve my algorithm?
I think you need str.join with str.get_dummies:
df = df['tickers'].str.join('|').str.get_dummies()
Or:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = pd.DataFrame(mlb.fit_transform(df['tickers']),columns=mlb.classes_, index=df.index)
print (df)
AAPL ABT ADBE AMGN AMZN BABA BAY CVS DIS ECL EMR FAST GE \
1 0 0 0 0 0 0 0 0 1 0 0 0 0
2 1 0 0 0 1 1 1 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 0
4 0 1 1 1 0 0 0 1 0 0 0 0 0
5 0 1 0 0 0 0 0 1 1 1 1 1 1
GOOGL MCDO PEP
1 0 0 0
2 0 0 0
3 0 1 1
4 0 0 0
5 1 0 0
You can use apply(pd.Series) and then get_dummies(). Note that this can emit the same ticker under more than one positional column (see the duplicated DIS and CVS columns below, and the collapse sketch after the output):
df = pd.DataFrame({"tickers":[["DIS"], ["AAPL","AMZN","BABA","BAY"],
["MCDO","PEP"], ["ABT","ADBE","AMGN","CVS"],
["ABT","CVS","DIS","ECL","EMR","FAST","GE","GOOGL"]]})
pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")
AAPL ABT DIS MCDO ADBE AMZN CVS PEP AMGN BABA DIS BAY CVS ECL \
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 1 0 0 0 0 1 0 0 0 1 0 1 0 0
2 0 0 0 1 0 0 0 1 0 0 0 0 0 0
3 0 1 0 0 1 0 0 0 1 0 0 0 1 0
4 0 1 0 0 0 0 1 0 0 0 1 0 0 1
EMR FAST GE GOOGL
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 1 1 1 1
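A sketch to collapse those duplicate labels, assuming the frame from this answer (the column grouping is an assumption, not part of the original):
# Take the max across columns that share the same ticker label,
# merging e.g. the two DIS columns into one.
out = pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")
out = out.groupby(level=0, axis=1).max()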