I'm trying to fill an empty DataFrame with one row of zeroes. This is apparently a lot easier said than done. This is my attempt:
Array = pd.DataFrame(columns=["curTime", "HC", "AC", "HG", "HF1", "HF2", "HF3", "HF4", "HF5", "HF6",
"HF7", "HF8", "HF9", "HF10", "HF11", "HF12", "HD1", "HD2", "HD3", "HD4", "HD5", "HD6",
"AG", "AF1", "AF2", "AF3", "AF4", "AF5", "AF6", "AF7", "AF8", "AF9", "AF10", "AF11", "AF12",
"AD1", "AD2", "AD3", "AD4", "AD5", "AD6"])
appendArray = [[0] * len(Array.columns)]
Array = Array.append(appendArray, ignore_index = True)
This however creates a row that stacks another 41 columns to the right of my existing 41 columns, and fills them with zeroes, while the original 41 columns get a "NaN" value.
How do I most easily do this?
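You can use pd.Series within the append. Since appendArray is a nested list, pass its inner list (appendArray[0]) so that the 41 values line up with the 41 column names:
Array.append(pd.Series(appendArray[0], index=Array.columns), ignore_index=True)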
Out[780]:
curTime HC AC HG HF1 HF2 HF3 HF4 HF5 HF6 ... AF9 AF10 AF11 \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0
AF12 AD1 AD2 AD3 AD4 AD5 AD6
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
[2 rows x 41 columns]
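On recent pandas versions DataFrame.append is no longer available (it was deprecated in 1.4 and removed in 2.0), so the same row can be added with pd.concat instead. The extra columns in the question appear because a bare list of lists gets integer column labels (0, 1, 2, ...) that match none of the existing names. A minimal sketch, assuming a shortened column list for readability:

```python
import pandas as pd

# Shortened, illustrative column list; the question uses 41 columns.
columns = ["curTime", "HC", "AC", "HG"]
Array = pd.DataFrame(columns=columns)

# Give the zero row the same column labels so the values align
# with the existing columns instead of creating new ones.
zero_row = pd.DataFrame([[0] * len(Array.columns)], columns=Array.columns)
Array = pd.concat([Array, zero_row], ignore_index=True)

print(Array)
```

If the frame is empty anyway, pd.DataFrame(0, index=[0], columns=columns) builds the same single row of zeroes in one step.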
I have a dataframe consisting of online reviews. I have assigned topics (topic 1-5; and 0 meaning no topic is assigned) and labels (positive or negative) in each instance. I want to create a dummy variable for each topic and label. This is what my data looks like...
reviewId  topic  label
01        2      negative
02        2      positive
03        0      negative
04        5      negative
05        1      positive
What should I do to make my data look like this? (1 meaning assigned, 0 meaning not assigned)
reviewId  topic  label     T1pos  T1neg  T2pos  T2neg  T3pos  T3neg  T4pos  T4neg  T5pos  T5neg
01        2      negative  0      0      0      1      0      0      0      0      0      0
02        2      positive  0      0      1      0      0      0      0      0      0      0
03        0      negative  0      0      0      0      0      0      0      0      0      0
04        5      negative  0      0      0      0      0      0      0      0      0      1
05        1      positive  1      0      0      0      0      0      0      0      0      0
You can create your own encoding by converting the two columns to a power of two and taking its binary representation:
import numpy as np
import pandas as pd
# I used 'p' as 'pos' and 'n' as 'neg' to save space
MAX_TOPIC = df['topic'].max()
mi = pd.MultiIndex.from_product([range(1, MAX_TOPIC+1), ['p', 'n']])
mi = [f'T{t}{l}' for t, l in mi]
# >> 2 to remove T0n and T0p
num = np.array(2**(df['topic']*2+df['label'].eq('negative'))) >> 2
# test each of the MAX_TOPIC*2 bits of num to get one 0/1 column per bit
hot = ((num[:, None] & (1 << np.arange(MAX_TOPIC*2))) > 0).astype(int)
out = pd.concat([df, pd.DataFrame(hot, columns=mi, index=df.index)], axis=1)
Output:
>>> out
reviewId topic label T1p T1n T2p T2n T3p T3n T4p T4n T5p T5n
0 1 2 negative 0 0 0 1 0 0 0 0 0 0
1 2 2 positive 0 0 1 0 0 0 0 0 0 0
2 3 0 negative 0 0 0 0 0 0 0 0 0 0
3 4 5 negative 0 0 0 0 0 0 0 0 0 1
4 5 1 positive 1 0 0 0 0 0 0 0 0 0
>>> num
array([ 8, 4, 0, 512, 1])
The binary representation comes from the question "Convert integer to binary array with suitable padding".
Someone can probably come up with a more elegant solution, but this works:
import numpy as np
import pandas as pd
# recreate your DataFrame:
df = pd.DataFrame({
    'reviewid': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'label': ['neg', 'pos', 'neg', 'neg', 'pos']})
# Add dummy columns initialized to 0:
dummies = [
    f'T{t}{lab}' for t in sorted(df.topic.unique()) if t != 0
    for lab in sorted(df.label.unique())]
dummy_df = pd.DataFrame(
    np.zeros((len(df), len(dummies)), dtype=int),
    columns=dummies,
    index=df.index)
df = pd.concat([df, dummy_df], axis=1)
# Fill in the dummy columns
for i, (t, lab) in enumerate(zip(df.topic, df.label)):
    if t != 0:
        df.loc[i, f'T{t}{lab}'] = 1
df # view result
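For comparison, here is a loop-free sketch of the same idea using pd.get_dummies on a combined key. It is only one possible variant, not the method from either answer; the names key and wanted are mine, and the fixed topic range 1 to 5 is taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "reviewId": ["01", "02", "03", "04", "05"],
    "topic": [2, 2, 0, 5, 1],
    "label": ["negative", "positive", "negative", "negative", "positive"],
})

# One combined key per row, e.g. "T2neg"; rows with topic 0 get NaN (no key).
key = "T" + df["topic"].astype(str) + df["label"].str[:3]
key = key.where(df["topic"] != 0)

# Every column the question asks for, including combinations absent from the data.
wanted = [f"T{t}{lab}" for t in range(1, 6) for lab in ("pos", "neg")]

# get_dummies skips NaN keys; reindex adds the missing columns as zeros.
dummies = pd.get_dummies(key).reindex(columns=wanted, fill_value=0).astype(int)
out = pd.concat([df, dummies], axis=1)

print(out)
```

Listing the wanted columns explicitly keeps T3 and T4 in the result even though no review in the sample uses them.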
I am using pd.crosstab to count presence/absence data. In the first column, I have several presence counts (represented by 1's); in the second column I have just one 'presence'. However, when I run crosstab on this data, that single presence in the second column isn't counted. Could anyone shed some light on why this is happening and what I'm doing wrong?
Python v. 3.8.5
Pandas v. 1.2.3
System: MacOS Monterey v. 12.5.1
Column1:
>>> mbx_final['Cmpd1640']
OV745_1A 0
OV745_1B 0
OV745_1C 1
OV745_1D 1
OV745_1E 0
OV745_4A 1
OV745_4B 1
OV745_4C 0
OV22_12A 1
OV22_12B 1
OV22_12C 1
OV22_12D 0
OV22_12E 0
OV22_12F 0
OV22_13A 0
OV22_13B 0
OV22_13C 0
OV86_6A 1
OV86_6D 1
OV86_6E 1
OV86_6F 1
OV86_6G 1
OV86_6H 1
OV86_6I 1
OV86_6J 1
OV86_6K 0
OV86_6L 1
OV86_8A 1
OV86_8B 1
OV86_8C 1
OB1B 1
OB1C 1
SK3A 0
SK3B 0
SK3C 0
SK7A 1
SK7B 0
Column2:
>>> mgx_final['Otu2409']
OV745_1A 0
OV745_1B 0
OV745_1C 0
OV745_1D 0
OV745_1E 0
OV745_4A 0
OV745_4B 0
OV745_4C 0
OV22_12A 0
OV22_12B 0
OV22_12C 0
OV22_12D 0
OV22_12E 0
OV22_12F 0
OV22_13A 0
OV22_13B 0
OV22_13C 0
OV86_6A 0
OV86_6D 0
OV86_6E 0
OV86_6F 0
OV86_6G 0
OV86_6H 0
OV86_6I 0
OV86_6J 0
OV86_6K 0
OV86_6L 0
OV86_8A 0
OV86_8B 0
OV86_8C 0
OB1A 1
OB1C 0
SK3A 0
SK3B 0
SK3C 0
SK7A 0
SK7B 0
Crosstab command:
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True)
Results:
>>> contingency_tab
Otu2409 0 All
Cmpd1640
0 15 15
1 21 21
All 36 36
I would expect to see a result like this:
>>> contingency_tab
Otu2409 0 1 All
Cmpd1640
0 15 0 15
1 21 1 22
All 36 1 37
What am I doing wrong?
You can use the dropna parameter, which is by default set to True. Setting it to False will include columns whose entries are all NaN.
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'],mgx_final['Otu2409'],margins=True, dropna=False)
You can read more on the official documentation here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html
Edit 1:
I've replicated your dataset and code and run the following:
df_in = pd.read_excel("Book1.xlsx", index_col="index")
mbx_final = df_in[["Cmpd1640"]]
mgx_final = df_in[["Otu2409"]]
contingency_tab = pd.crosstab(mbx_final['Cmpd1640'], mgx_final['Otu2409'], margins=True)
display(contingency_tab)
And I get your expected output:
There might be something wrong with how you're displaying the crosstab function output.
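One more thing worth checking, visible in the listings in the question: the first column has the labels OB1B and OB1C, while the second has OB1A and OB1C. pd.crosstab aligns two Series on their index and keeps only the labels they share, so a row that exists in just one of them is left out. Since the only 1 in Otu2409 sits on OB1A, this could explain both the missing column and the total of 36 instead of 37. A small sketch with made-up sample labels:

```python
import pandas as pd

# Made-up labels: "s3" exists only in a, "s4" only in b.
a = pd.Series([0, 1, 1], index=["s1", "s2", "s3"], name="Cmpd")
b = pd.Series([0, 0, 1], index=["s1", "s2", "s4"], name="Otu")

# Only the shared labels s1 and s2 are counted, so the single 1 in b disappears.
tab = pd.crosstab(a, b)
print(tab)

# Labels that are present in just one of the two Series.
mismatched = a.index.symmetric_difference(b.index)
print(list(mismatched))
```

On the real data, mbx_final.index.symmetric_difference(mgx_final.index) would list any labels that do not match between the two frames.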
I am trying to create a target variable y based on two conditions. I have X values that are binary and X2 values that are also binary. Whenever X changes from 1 to 0, y should be 1, but only if the next change that follows is X2 going from 0 to 1. If the X change is instead followed by X going from 0 to 1, we don't make the change in the first place. I attached a picture from Excel.
I also did the following to account for the change in X
df['X-prev']=df['X'].shift(1)
df['Change-X']=np.where(df['X-prev']+df['X']==1,1,0)
# this is the data frame
X=[1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2=[0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df=pd.DataFrame()
df['X']=X
df['X2']=X2
However, this is not enough, as I need to know which change came first after the X change. I attached a picture of the example.
Thanks a lot for all the contributions.
Keep the rows that match your transitions, (X=1, X+1=0) and (X2=1, X2-1=0), then merge all selected rows into one series in which a value of 0 means 'start a cycle' and 1 means 'end a cycle'.
This series can contain consecutive starts or consecutive ends, so you need to filter it again to keep only (0, 1) cycles. After that, reindex the filtered series by your original dataframe index and back fill, so that every row between a start and its end gets 1.
x1 = df['X'].sub(df['X'].shift(-1)).eq(1)
x2 = df['X2'].sub(df['X2'].shift(1)).eq(1)
sr1 = pd.Series(0, df.index[x1])
sr2 = pd.Series(1, df.index[x2])
sr = pd.concat([sr2, sr1]).sort_index()
df['Y'] = sr[sr.lt(sr.shift(-1)) | sr.gt(sr.shift(1))] \
              .reindex(df.index).bfill().fillna(0).astype(int)
>>> df
X X2 Y
0 1 0 0
1 1 0 0 # start here: (X=1, X+1=0) but never ended before another start
2 0 0 0
3 0 0 0
4 1 0 0 # start here: (X=1, X+1=0)
5 0 0 1 # <- fill with 1
6 0 0 1 # <- fill with 1
7 0 0 1 # <- fill with 1
8 0 0 1 # <- fill with 1
9 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
10 0 1 0
11 0 1 0
12 0 1 0
13 0 1 0
14 0 0 0
15 0 0 0
16 0 1 0 # end here: (X2=1, X2-1=0) but never started before
17 0 0 0
18 0 0 0
19 0 0 0
20 1 0 0
21 1 0 0 # start here: (X=1, X+1=0)
22 0 0 1 # <- fill with 1
23 0 0 1 # <- fill with 1
24 0 0 1 # <- fill with 1
25 0 0 1 # <- fill with 1
26 0 0 1 # <- fill with 1
27 0 1 1 # end here: (X2=1, X2-1=0) so fill back rows with 1
28 0 1 0
29 0 1 0
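To see why the second filter is needed, it can help to look at the intermediate series. The sketch below reuses the X and X2 lists from the question; the names starts, ends, events and kept are mine and only serve the illustration.

```python
import pandas as pd

X = [1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0]
X2 = [0,0,0,0,0,0,0,0,0,1,1,1,1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,1]
df = pd.DataFrame({"X": X, "X2": X2})

# Rows where X is 1 and the next X is 0 (a cycle may start here).
starts = df.index[df["X"].sub(df["X"].shift(-1)).eq(1)]
# Rows where X2 is 1 and the previous X2 was 0 (a cycle may end here).
ends = df.index[df["X2"].sub(df["X2"].shift(1)).eq(1)]

# All events in row order: 0 = start, 1 = end.
events = pd.concat([pd.Series(0, index=starts), pd.Series(1, index=ends)]).sort_index()

# Keep a start only if the next event is an end,
# and an end only if the previous event is a start.
kept = events[events.lt(events.shift(-1)) | events.gt(events.shift(1))]

df["Y"] = kept.reindex(df.index).bfill().fillna(0).astype(int)

print(events.to_dict())
print(kept.to_dict())
```

Here events contains two starts in a row (rows 1 and 4) and two ends in a row (rows 9 and 16); the filter drops the start at row 1 and the end at row 16, leaving the pairs (4, 9) and (21, 27).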
I have a dataset called "data" with categorical values I'd like to encode with mean (likelihood/target) encoding rather than label encoding.
My dataset looks like:
data.head()
ID X0 X1 X10 X100 X101 X102 X103 X104 X105 ... X90 X91 X92 X93 X94 X95 X96 X97 X98 X99
0 0 k v 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6 k t 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
2 7 az w 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 9 az t 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
4 13 az v 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
5 rows × 377 columns
I've tried:
# Select categorical features
cat_features = data.dtypes == 'object'
# Define function
def mean_encoding(df, cols, target):
    for c in cols:
        means = df.groupby(c)[target].mean()
        df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)
which raises:
KeyError: False
I've also tried:
# Define function
def mean_encoding(df, target):
    for c in df.columns:
        if df[c].dtype == 'object':
            means = df.groupby(c)[target].mean()
            df[c].map(means)
    return df
which raises:
KeyError: 'Columns not found: 87.68, 87.43, 94.38, 72.11, 73.7, 74.0,
74.28, 76.26,...
I've concatenated the train and test datasets into one called "data" and saved the train target before dropping it from the dataset:
target = train.y
split = len(train)
data = pd.concat(objs=[train, test])
data = data.drop('y', axis=1)
data.shape
Help would be appreciated. Thanks.
I think you are not selecting the categorical columns correctly. With cat_features = data.dtypes == 'object' you don't get column names; you get a boolean Series telling whether each column has dtype object. Iterating over it yields True/False values, which results in KeyError: False.
You can select the categorical columns with
mycolumns = data.columns
numerical_columns = data._get_numeric_data().columns
cat_features= list(set(mycolumns) - set(numerical_columns))
or
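cat_features = data.select_dtypes(include=['object']).columns
The rest of your code stays the same, except that the result of map has to be assigned back to the column; otherwise the DataFrame is returned unchanged:
# Define function
def mean_encoding(df, cols, target):
    for c in cols:
        # Note: groupby(c)[target] expects target to be the name of a column in df
        means = df.groupby(c)[target].mean()
        df[c] = df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)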
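The second error in the question (KeyError: 'Columns not found: 87.68, ...') likely comes from a different spot: target there is a Series of y values, not a column name, and y has already been dropped from data, so df.groupby(c)[target] tries to look the values up as column labels. A hedged sketch of one way around it, grouping the saved target by the training rows only; the toy values and the extra split parameter are my assumptions, not part of the original code:

```python
import pandas as pd

# Toy stand-ins for the asker's train/test frames (values are made up).
train = pd.DataFrame({"ID": [0, 6, 7, 9],
                      "X0": ["k", "k", "az", "az"],
                      "y": [80.0, 90.0, 70.0, 74.0]})
test = pd.DataFrame({"ID": [13, 14], "X0": ["az", "k"]})

target = train["y"]
split = len(train)
data = pd.concat([train.drop(columns="y"), test], ignore_index=True)

def mean_encoding(df, cols, target, split):
    """Map each category to its mean target, learned on the first `split` rows."""
    df = df.copy()
    train_part = df.iloc[:split]
    for c in cols:
        # Group the target values by the category of the matching training row.
        means = target.groupby(train_part[c].to_numpy()).mean()
        df[c] = df[c].map(means)
    return df

# Treat every non-numeric column as categorical.
cat_features = [c for c in data.columns
                if not pd.api.types.is_numeric_dtype(data[c])]
encoded = mean_encoding(data, cat_features, target, split)

print(encoded)
```

Categories that occur only in the test rows would come out as NaN with this approach and need a fallback such as the overall target mean.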
I need to extract some data from .dat file which I usually do with
import numpy as np
file = np.loadtxt('blablabla.dat')
Here my data are not separated by a specific delimiter but have predefined length (digits) and some lines don't have any values for some columns.
Here is a sample to be clear:
3 0 36 0 0 0 0 0 0 0 99.
-2 0 0 0 0 0 0 0 0 0 99.
2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
-5 0 0 0 0 0 0 0 0 0 99.
99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5
My little code above gets the error:
# Convert each value according to its column and store
ValueError: Wrong number of columns at line 3
Does someone have an idea about how to collect this kind of data?
numpy.genfromtxt seems to be what you want; it lets you specify field widths for each column and treats missing data as NaNs.
For this case:
import numpy as np
data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5])
If you want to keep information in the string part of the file, you could read twice and specify the usecols parameter:
import numpy as np
number_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(0,1,2,3,4,5,6,7,8,9,11))
string_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=10, dtype=str)
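To check the field widths before running this on the real file, a small in-memory test can help. The sketch below uses a made-up fixed-width sample with four fields of widths 3, 4, 5 and 5 (not the layout of the asker's file) and reads it through io.StringIO:

```python
import io
import numpy as np

# Made-up records; the third field is text or empty, as in the question.
records = [("3", "0", "", "99."),
           ("-2", "0", ".LA.", "3."),
           ("99", "0", ".S..", "3.5")]

# Right-align every field to a fixed width of 3, 4, 5 and 5 characters.
text = "\n".join(f"{a:>3}{b:>4}{c:>5}{d:>5}" for a, b, c, d in records)

# A list of integers as delimiter means "fixed field widths".
data = np.genfromtxt(io.StringIO(text), delimiter=[3, 4, 5, 5])

print(data)
```

With the default float dtype, both the empty field and the text field come back as NaN, so the numeric columns stay aligned on every row.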
What you essentially need is the list of empty "column" positions that serve as delimiters.
This will get you started:
In [108]: table = ''' 3 0 36 0 0 0 0 0 0 0 99.
.....: -2 0 0 0 0 0 0 0 0 0 99.
.....: 2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
.....: 5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
.....: -5 0 0 0 0 0 0 0 0 0 99.
.....: 99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5'''.split('\n')
In [110]: max_row_len = max(len(row) for row in table)
In [117]: spaces = reduce(lambda res, row: res.intersection(idx for idx, c in enumerate(row) if c == ' '), table, set(range(max_row_len)))
This code starts from the set of all character positions in the longest row, and reduce keeps only the positions that hold a space in every row. (On Python 3, reduce has to be imported from functools first.)
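A possible next step, sketched for Python 3, is to turn those all-space positions into field boundaries and slice each row. The sample rows are made up (built with fixed-width formatting so the alignment is exact), and the names spans and fields are mine:

```python
from functools import reduce
from itertools import groupby

# Made-up records rendered as fixed-width rows of equal length.
records = [("3", "0", "", "99."),
           ("-2", "0", ".LA.", "3."),
           ("99", "0", ".S..", "3.5")]
table = [f"{a:>3} {b:>3} {c:<4} {d:>4}" for a, b, c, d in records]

max_row_len = max(len(row) for row in table)

# Positions that contain a space in every row.
spaces = reduce(
    lambda res, row: res.intersection(i for i, ch in enumerate(row) if ch == " "),
    table,
    set(range(max_row_len)),
)

# Group the remaining positions into contiguous runs: one run per field.
data_positions = [i for i in range(max_row_len) if i not in spaces]
spans = []
for _, group in groupby(enumerate(data_positions), key=lambda p: p[1] - p[0]):
    run = [pos for _, pos in group]
    spans.append((run[0], run[-1] + 1))

# Slice every row at the detected boundaries.
fields = [[row[start:stop].strip() for start, stop in spans] for row in table]

print(spans)
print(fields)
```

Note that this only finds a boundary where every row has a space, so two adjacent fields that are both always filled to the edge would be merged into one.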