I have a dataframe column that contains 10 different digits. Using pd.get_dummies I got 10 new columns whose names are numbers. I then tried to rename these columns with df = df.rename(columns={'0':'topic0'}), but it failed. How can I rename these columns from numbers to strings?
Use DataFrame.add_prefix:
df = pd.DataFrame({'col':[1,5,7,8,3,6,5,8,9,10]})
df1 = pd.get_dummies(df['col']).add_prefix('topic')
print (df1)
topic1 topic3 topic5 topic6 topic7 topic8 topic9 topic10
0 1 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
2 0 0 0 0 1 0 0 0
3 0 0 0 0 0 1 0 0
4 0 1 0 0 0 0 0 0
5 0 0 0 1 0 0 0 0
6 0 0 1 0 0 0 0 0
7 0 0 0 0 0 1 0 0
8 0 0 0 0 0 0 1 0
9 0 0 0 0 0 0 0 1
With the example dataframe you can do:
d = {0: [1, 2], 1: [3, 4]}
df = pd.DataFrame(data=d)
You can do for example:
df.rename(index=str, columns={0: "a", 1: "c"})
And then use this method to rename the other columns.
Compactly:
# rename() returns a new frame, so assign the result back
df = df.rename(columns={x: "topic" + str(x) for x in range(3)})
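The rename in the question most likely failed because get_dummies keeps the original integer values as column labels, so the string key '0' never matches the integer label 0. A minimal sketch of the mismatch and of a callable-based rename, using a made-up sample frame (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"col": [0, 1, 2, 1, 0]})
dummies = pd.get_dummies(df["col"])

# The labels are integers, so a string key such as '0' matches nothing
# and rename() leaves the columns unchanged without raising an error.
unchanged = dummies.rename(columns={"0": "topic0"})

# A callable is applied to every label, whatever its type.
renamed = dummies.rename(columns=lambda c: f"topic{c}")
print(renamed.columns.tolist())
```

Passing a callable avoids having to list every label by hand, which helps when the number of dummy columns is not known in advance.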
Related
I would like to split the column of a dataframe as follows.
Here is the main dataframe.
import pandas as pd
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
df_az
Then, I applied this code to split the column.
out_az = (df_az.stack().apply(pd.Series).rename(columns=lambda x: f'a combination').unstack().swaplevel(0,1,axis=1).sort_index(axis=1))
out_az = pd.concat([out_az], axis=1)
out_az.head()
However, the result is as follows.
Meanwhile, the expected result is:
Could anyone help me with what to change in the code, please? Thank you in advance.
You can apply np.ravel:
>>> pd.DataFrame.from_records(df_az['AZ Combination'].apply(np.ravel))
0 1 2 3 4 5
0 0 0 0 0 0 0
1 0 0 0 0 0 1
Convert the column to a list and reshape it into a 2D array, so that it can be passed to the DataFrame constructor.
Then set the column names; to avoid duplicated column names, a counter is added:
storage_AZ = [[[0,0,0],[0,0,0]],
[[0,0,0],[0,0,1]],
[[0,0,0],[0,1,0]],
[[0,0,0],[1,0,0]],
[[0,0,0],[1,0,1]]]
df_az = pd.DataFrame(list(zip(storage_AZ)),columns =['AZ Combination'])
N = 3
L = ['a combination','z combination']
df = pd.DataFrame(np.array(df_az['AZ Combination'].tolist()).reshape(df_az.shape[0],-1))
df.columns = [f'{L[a]}_{b}' for a, b in zip(df.columns // N, df.columns % N)]
print(df)
a combination_0 a combination_1 a combination_2 z combination_0 \
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
If you need a MultiIndex:
df = pd.concat({'AZ Combination':df}, axis=1)
print(df)
AZ Combination \
a combination_0 a combination_1 a combination_2 z combination_0
0 0 0 0 0
1 0 0 0 0
2 0 0 0 0
3 0 0 0 1
4 0 0 0 1
z combination_1 z combination_2
0 0 0
1 0 1
2 1 0
3 0 0
4 0 1
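If a two-level header with one top-level label per group is closer to the expected result, the column labels can be built with pd.MultiIndex.from_product instead of string concatenation. This is only a sketch under that assumption; the group labels are taken from the answer above, and the shortened sample list is for illustration:

```python
import numpy as np
import pandas as pd

storage_AZ = [[[0, 0, 0], [0, 0, 0]],
              [[0, 0, 0], [0, 0, 1]],
              [[0, 0, 0], [0, 1, 0]]]
df_az = pd.DataFrame(list(zip(storage_AZ)), columns=["AZ Combination"])

# Stack the nested lists into a (rows, groups, items) array.
arr = np.array(df_az["AZ Combination"].tolist())
n_rows, n_groups, n_items = arr.shape

# One top-level label per group, a counter per position inside the group.
groups = ["a combination", "z combination"]
columns = pd.MultiIndex.from_product([groups, range(n_items)])

# Flatten the last two axes so each row holds all items of both groups.
out = pd.DataFrame(arr.reshape(n_rows, -1), columns=columns, index=df_az.index)
print(out)
```

With this layout, out["z combination"] selects the three columns of the second group directly.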
I'm having trouble describing exactly what I want to achieve. I've tried looking here on stack to find others with the same problem, but was unable to find any. So I will try to describe exactly what I want and give you a sample setup code.
I would like to have a function that gives me a new column/pd.Series. This new column has boolean TRUE values (or int's) that are based on a certain condition.
The condition being as follows. There are N number of columns (example is 8), each with the same name but ending with one new number. IE, column_1, column_2 etc. The function I need is twofold:
If N is given, look for/through each column row and see if it and the next N columns row are also TRUE/1 ..
If N is NOT given, look for each column row and if all next columns rows are also TRUE/1, with the numbers as ID's to look at the column.
import numpy as np
import pandas as pd

def get_df_series(df: pd.DataFrame, columns_ids: list, n: int = 8) -> pd.DataFrame:
    for i in columns_ids:
        # missing code here .. i dont know if this would be the way to go
        pass
    return df

def create_dataframe(numbers: list) -> pd.DataFrame:
    df = pd.DataFrame()  # empty df
    # create a column for each number with the number as ID and with random boolean values as int's
    for i in numbers:
        df[f'column_{i}'] = np.random.randint(2, size=20)
    return df

if __name__ == "__main__":
    numbers = [1, 2, 3, 4, 5, 6, 7, 8]
    df = create_dataframe(numbers=numbers)
    df = get_df_series(df=df, columns_ids=numbers, n=3)
I have some experience with Pandas dataframes and know how to create IF/ELSE things with np.select for example.
(function) select(condlist: Sequence[ArrayLike], choicelist: Sequence[ArrayLike], default: ArrayLike = ...) -> NDArray
The problem I'm running into is that I don't know how to make a conditional statement if I don't know how many columns are ahead. For example, if I want to know for column_5 if the next 3 are also true, I can hardcode this, but I have columns up to id 20 and would love to not have to hardcode everything from column_1 to column_20 if I want to know if all conditions in all those columns are true.
Now the problem is that I don't know if this is even possible, so any feedback would be appreciated, even just a hint on where to look for a way to do this.
EDIT: What I forgot to mention is that there will be other columns in between that obviously cannot be taken into account. For example, there will be main_column_1, main_column_2, main_column_3, side_column_1, side_column_2, right_column_1, main_column_3, main_column_4 etc...
The answer Corralien gave is correct, but I should have made my question clearer.
I need to be able to, say, look at main_column and, for that one, look ahead N columns of the same type: main_column.
Try:
n = 3
out = (df.rolling(n, min_periods=1, axis=1).sum()
.shift(-n+1, fill_value=0, axis=1).eq(n).astype(int)
.rename(columns=lambda x: 'result_' + x.split('_')[1]))
Output:
>>> out
result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8
0 1 1 1 1 1 1 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0 0
5 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 0 0
8 0 1 1 1 0 0 0 0
9 0 0 0 0 0 1 0 0
10 0 0 0 0 0 0 0 0
11 0 0 0 0 1 0 0 0
12 0 0 0 0 0 0 0 0
13 0 0 0 1 1 0 0 0
14 0 0 0 0 0 1 0 0
15 0 0 0 0 0 0 0 0
16 0 0 0 0 0 0 0 0
17 0 0 1 0 0 0 0 0
18 0 0 1 0 0 0 0 0
19 0 0 0 0 0 0 0 0
Input:
>>> df
column_1 column_2 column_3 column_4 column_5 column_6 column_7 column_8
0 1 1 1 1 1 1 1 1
1 0 1 0 0 0 1 1 0
2 1 1 0 1 0 1 1 0
3 1 0 1 0 0 0 0 0
4 1 0 0 1 1 1 0 1
5 1 1 0 1 0 1 1 0
6 1 0 1 0 0 0 0 1
7 0 0 1 0 0 0 0 0
8 0 1 1 1 1 1 0 0
9 1 0 1 1 0 1 1 1
10 0 0 1 1 0 0 1 1
11 1 0 1 0 1 1 1 0
12 0 1 1 0 1 0 1 0
13 0 0 0 1 1 1 1 0
14 0 0 1 1 0 1 1 1
15 1 0 0 1 0 1 0 0
16 1 0 0 0 0 0 0 1
17 0 0 1 1 1 0 0 1
18 0 0 1 1 1 0 0 1
19 0 0 1 0 0 0 1 0
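For the EDIT in the question, where unrelated columns sit between the ones of interest, one option is to select the columns by prefix first and then check each window of n consecutive columns. The sketch below is an assumption about what is wanted: the helper name look_ahead, the prefix argument, and the sample data are all hypothetical, and the window counts the starting column itself plus the next n-1 columns, as in the answer above.

```python
import numpy as np
import pandas as pd

def look_ahead(df: pd.DataFrame, prefix: str, n: int) -> pd.DataFrame:
    """Flag cells where a column and the next n-1 columns sharing `prefix` are all 1."""
    # Keep only columns of the requested type, in their original order.
    cols = [c for c in df.columns if c.startswith(prefix)]
    values = df[cols].to_numpy()
    result = np.zeros_like(values)
    # A window starting at position j needs columns j .. j+n-1 to exist;
    # positions too close to the end stay 0.
    for j in range(len(cols) - n + 1):
        result[:, j] = values[:, j:j + n].all(axis=1)
    names = ["result_" + c[len(prefix):] for c in cols]
    return pd.DataFrame(result, index=df.index, columns=names)

# Hypothetical frame with an unrelated column in between.
df = pd.DataFrame({
    "main_column_1": [1, 1, 0],
    "main_column_2": [1, 1, 1],
    "side_column_1": [0, 0, 0],
    "main_column_3": [1, 0, 1],
    "main_column_4": [1, 1, 1],
})
out = look_ahead(df, prefix="main_column_", n=3)
print(out)
```

The loop runs over column positions only, so the row-wise work stays vectorized in NumPy.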
I have generated a pandas dataframe using the code below, where an example value in the sequence column looks like '0-0-0-1-0-0-2-0-0-0-0'.
I split the sequence string into different columns:
df = DataFrame(data, columns = ['id', 'sequence'])
print(df.sequence.str.split("-", expand=True))
0 1 2 3 4 5 6 7 8 9
0 0 0 0 1 0 0 2 0 0
1 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 1 0 0 2 0 0 0 0
4 0 0 0 3 0 0 0 0 0
5 0 0 0 0 0 4 0 0 0
If I want to count the number of occurrences of 0 in each column, how do I do it?
I tried something like print df[df.education == '9th'].count(), based on "Python Pandas Counting the Occurrences of a Specific value", but my columns don't have proper names and I am not sure how to do it.
Can you please help?
Have you tried this?
df.rename(columns={"A": "NewName", "B": "NewName"})
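Renaming is not strictly required for the counting itself. A minimal sketch, assuming the split cells are strings (which is what str.split produces) and using made-up sample data:

```python
import pandas as pd

# Hypothetical data shaped like the question's 'sequence' column.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "sequence": ["0-0-1-0", "0-2-0-0", "3-0-0-0"],
})

parts = df["sequence"].str.split("-", expand=True)

# After splitting, the cells are strings, so compare against '0' rather than 0.
# The comparison yields booleans; summing them counts the matches per column.
zero_counts = (parts == "0").sum()
print(zero_counts)
```

The result is a Series indexed by the integer column labels, so zero_counts[3] gives the count for the fourth split position.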
I have a dataframe where some cells contain lists of multiple values. How can I create new columns based on the unique values of those lists? The lists can contain values already seen in previous observations, and they can also be empty. How do I create new columns (one-hot encoding) based on those values?
CHECK EDIT - Data is within quotation marks:
data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
'["Spain", "Germany"]',
'["Morocco"]',
'[]',
'["Japan"]',
'[]']}
my_new_pd = pd.DataFrame(data)
0 ["Spain", "Germany", "England", "Japan"]
1 ["Spain", "Germany"]
2 ["Morocco"]
3 []
4 ["Japan", ""]
5 []
Name: tokens, dtype: object
I want something like
   tokens_Spain  tokens_Germany  tokens_England  tokens_Japan  tokens_Morocco
0             1               1               1             1               0
1             1               1               0             0               0
2             0               0               0             0               1
3             0               0               0             0               0
4             0               0               1             1               0
5             0               0               0             0               0
Method one, from sklearn, since you already have a list-type column in your df:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
yourdf=pd.DataFrame(mlb.fit_transform(df['tokens']),columns=mlb.classes_, index=df.index)
Method two: explode first, then get the dummies
df['tokens'].explode().str.get_dummies().groupby(level=0).sum().add_prefix('tokens_')
tokens_A tokens_B tokens_C tokens_D tokens_Z
0 1 1 1 1 0
1 1 1 0 0 0
2 0 0 0 0 1
3 0 0 0 0 0
4 0 0 0 1 1
5 0 0 0 0 0
Method three kind of like "explode" on the axis = 0
pd.get_dummies(pd.DataFrame(df.tokens.tolist()),prefix='tokens',prefix_sep='_').T.groupby(level=0, sort=False).sum().T
tokens_A tokens_D tokens_Z tokens_B tokens_C
0 1 1 0 1 1
1 1 0 0 1 0
2 0 0 1 0 0
3 0 0 0 0 0
4 0 1 1 0 0
5 0 0 0 0 0
Update
df['tokens'].explode().str.get_dummies().groupby(level=0).sum().add_prefix('tokens_')
tokens_England tokens_Germany tokens_Japan tokens_Morocco tokens_Spain
0 1 1 1 0 1
1 0 1 0 0 1
2 0 0 0 1 0
3 0 0 0 0 0
4 1 0 1 0 0
5 0 0 0 0 0
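Since the question's edit says the data is within quotation marks, each cell is a string that only looks like a list, and it has to be parsed before exploding. A minimal sketch, assuming the strings are valid JSON as in the sample (json.loads is a choice made here, not something the answer above uses):

```python
import json
import pandas as pd

data = {"tokens": ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
df = pd.DataFrame(data)

# Turn each list-like string into a real Python list.
parsed = df["tokens"].apply(json.loads)

# One row per token (empty lists become NaN), one-hot encode,
# then collapse back to one row per original index.
encoded = (parsed.explode()
                 .str.get_dummies()
                 .groupby(level=0).sum()
                 .add_prefix("tokens_"))
print(encoded)
```

If the strings use single quotes or other Python literal syntax, ast.literal_eval would be a closer fit than json.loads.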
I have a dataframe with the following column:
df = pd.DataFrame({"A": [1,2,1,2,2,2,0,1,0]})
and I want:
df2 = pd.DataFrame({"0": [0,0,0,0,0,0,1,0,1],"1": [1,0,1,0,0,0,0,1,0],"2": [0,1,0,1,1,1,0,0,0]})
Is there an elegant way of doing this with a one-liner?
NOTE
I can do this using df['0'] = df['A'].apply(find_zeros)
I don't mind if 'A' is included in the final result.
Use get_dummies:
df2 = pd.get_dummies(df.A)
print (df2)
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
In [50]: df.A.astype(str).str.get_dummies()
Out[50]:
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
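Two details may matter when reproducing the outputs above: pandas 2.0 and later return boolean dummy columns by default, and pd.get_dummies keeps the integer values as column labels while the desired df2 uses string labels. A short sketch that requests integers and converts the labels, assuming that 0/1 integers and string names are what is wanted:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 2, 2, 2, 0, 1, 0]})

# dtype=int requests 0/1 integers instead of the boolean default.
df2 = pd.get_dummies(df["A"], dtype=int)

# get_dummies keeps the original integer values as labels; convert them to strings.
df2.columns = df2.columns.astype(str)
print(df2)
```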