I have a DataFrame df; you can create it by running:
import pandas as pd
data = [10,20,30,40,50,60]
df = pd.DataFrame(data, columns=['Numbers'])
df
Now I want to check whether each value in an existing list is already a column of df; if a value is missing, create a new column named after it and fill it with 0:
columns_list=["3","5","8","9","12"]
for i in columns_list:
    if i not in df.columns.to_list():
        df[i] = 0
How can I write this in one line? I have tried:
[df[i]=0 for i in columns_list if i not in df.columns.to_list()]
However, the IDE returns:
SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?
Can anyone help?
Try:
columns_list = ["3", "5", "8", "9", "12"]
df = df.reindex(
    set(list(df.columns) + columns_list),
    axis=1,
    fill_value=0,
)
Note that set does not preserve order, so the column order of the result is arbitrary; if order matters, use dict.fromkeys(list(df.columns) + columns_list) instead of set(...).
You can also use the dictionary unpacking operator.
columns_list = ["3","5","8","9","12"]
df = df.assign(**{col: 0 for col in columns_list if col not in df.columns})
With df.assign, you unpack a dictionary built from every column in columns_list that is not already present in df, assigning 0 as the value of each new column.
If you really want it in one single line, inline the column list as well:
df = df.assign(**{col: 0 for col in ["3","5","8","9","12"] if col not in df.columns})
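Applied to the question's df, the assign one-liner can be checked end to end (a minimal sketch of the expected result):

```python
import pandas as pd

df = pd.DataFrame([10, 20, 30, 40, 50, 60], columns=['Numbers'])

# Add every column from the list that df doesn't already have, filled with 0
df = df.assign(**{col: 0 for col in ["3", "5", "8", "9", "12"] if col not in df.columns})

print(list(df.columns))  # ['Numbers', '3', '5', '8', '9', '12']
```

Since dictionaries preserve insertion order, the new columns are appended in the order they appear in the list.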
import numpy as np
import pandas as pd
# Some example data
df = pd.DataFrame(
np.random.randint(10, size=(5, 6)),
columns=map(str, range(6))
)
# 0 1 2 3 4 5
# 0 9 4 8 7 3 6
# 1 6 9 0 5 3 4
# 2 7 9 0 9 0 3
# 3 4 4 6 4 6 4
# 4 6 9 7 1 5 5
columns_list=["3","5","8","9","12"]
# Figure out which columns in your list do not appear in your dataframe
# by creating a new Index and using pd.Index.difference:
df[pd.Index(columns_list).difference(df.columns, sort=False)] = 0
# 0 1 2 3 4 5 8 9 12
# 0 9 4 8 7 3 6 0 0 0
# 1 6 9 0 5 3 4 0 0 0
# 2 7 9 0 9 0 3 0 0 0
# 3 4 4 6 4 6 4 0 0 0
# 4 6 9 7 1 5 5 0 0 0
I have a pandas DataFrame that has a little over 100 columns.
About 50 of those columns I want to sort ascending, and then there is one column (a date_time column) that I want to sort descending.
How do I go about achieving this? I know I can do something like...
df = df.sort_values(by = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time'], ascending=[True, True, True, True,... False])
... but I am trying to avoid having to type 'True' 50 times.
Just wondering if there is a shorthand way of doing this.
Thanks.
Dan
You can use:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=[True]*49+[False])
Or, for a programmatic variant that doesn't require knowing the position of the False, using numpy:
cols = ['column_001', 'column_003', 'column_009', 'column_017',... 'date_time']
df.sort_values(by=cols, ascending=np.array(cols)!='date_time')
It should go something like this:
to_be_reserved = "COLUMN_TO_BE_RESERVED"
df = df.sort_values(by=[col for col in df.columns if col != to_be_reserved], ignore_index=True)
# Caution: this sorts the reserved column independently of the others,
# so its values lose their alignment with the rest of each row.
df[to_be_reserved] = df[to_be_reserved].sort_values(ascending=False, ignore_index=True)
You can also use filter if your 49 columns have a regular pattern:
# if you have a column name pattern
cols = df.filter(regex=('^(column_|date_time)')).columns.tolist()
ascending_false = ['date_time']
ascending = [c not in ascending_false for c in cols]
df.sort_values(by=cols, ascending=ascending)
Example:
>>> df
column_0 column_1 date_time value other_value another_value
0 4 2 6 6 1 1
1 4 4 0 6 0 2
2 3 2 6 9 0 7
3 9 2 1 7 4 7
4 6 9 2 4 4 1
>>> df.sort_values(by=cols, ascending=ascending)
column_0 column_1 date_time value other_value another_value
2 3 2 6 9 0 7
0 4 2 6 6 1 1
1 4 4 0 6 0 2
4 6 9 2 4 4 1
3 9 2 1 7 4 7
I'm trying to sort the column headers of the last 3 columns only. In the code below, sort_index works on the whole DataFrame, but not when I select just the last 3 columns.
Note: I can't hard-code the sorting because I don't know the column headers beforehand.
import pandas as pd

df = pd.DataFrame({
    'Z': [1, 1, 1, 1, 1],
    'B': ['A', 'A', 'A', 'A', 'A'],
    'C': ['B', 'A', 'A', 'A', 'A'],
    'A': [5, 6, 6, 5, 5],
})
# sorts all cols
df = df.sort_index(axis=1)
# aim: sort by the last 3 cols only
# df.iloc[:, 1:3] = df.iloc[:, 1:3].sort_index(axis=1)
Intended out:
   Z  A  B  C
0  1  5  A  B
1  1  6  A  A
2  1  6  A  A
3  1  5  A  A
4  1  5  A  A
Try with reindex:
out = df.reindex(columns=df.columns[[0]].tolist()+sorted(df.columns[1:].tolist()))
Out[66]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
Method two: insert
newdf = df.iloc[:,1:].sort_index(axis=1)
newdf.insert(loc=0, column='Z', value=df.Z)
newdf
Out[74]:
Z A B C
0 1 5 A B
1 1 6 A A
2 1 6 A A
3 1 5 A A
4 1 5 A A
I want an efficient way to solve this problem below because my code seems inefficient.
First of all, let me provide a dummy dataset.
import numpy as np
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
df1 = pd.DataFrame({'a0': [1,2,2,1,3], 'a1': [2,3,3,2,4], 'a2': [3,4,4,3,5], 'a3': [4,5,5,4,6], 'a4': [5,6,6,5,7]})
df2 = pd.DataFrame({'b0': [3,6,6,3,8], 'b1': [6,8,8,6,9], 'b2': [8,9,9,8,7], 'b3': [9,7,7,9,2], 'b4': [7,2,2,7,1]})
My actual dataset has more than 100,000 rows and 15 columns. Now, what I want to do is pretty complicated to explain, but here we go.
Goal: I want to create a new df using the two dfs above.
Find the global min and max in df1. Since each row's values are sorted, column 'a0' always holds the row minimum and 'a4' the row maximum, so I take the min of 'a0' and the max of 'a4'.
Min = df1['a0'].min()
Max = df1['a4'].max()
Min
Max
Then I create a DataFrame filled with 0s, whose columns span range(Min, Max + 1); in this case, 1 through 7.
column = []
for i in np.arange(Min, Max + 1):
    column.append(i)
newdf = pd.DataFrame(0, index=df1.index, columns=column)
The third step is to find the places where the values from df2 will go:
I loop through each value in df1 and match it with the column of the same name in the new df, in the same row.
For example, looking at row 0 and going through each column, the values are [1, 2, 3, 4, 5]. So in row 0 of newdf, columns 1, 2, 3, 4, 5 will be filled with the corresponding values from df2.
Lastly, each corresponding value of df2 (same position) is written into the place found in that step.
So, the very first row of the new df will look like this:
output = {'1' : [3], '2' : [6], '3' : [8], '4' : [9], '5' : [7], '6' : [0], '7' : [0]}
output = pd.DataFrame(output)
Column 6 and 7 will not be updated because we didn't have 6 and 7 in the very first row of df1.
Here is my code for this process:
for rowidx in range(len(df1)):
    for columnidx in range(len(df1.columns)):
        new_column = df1[str(df1.columns[columnidx])][rowidx]
        newdf.loc[newdf.index[rowidx], new_column] = df2['b' + df1.columns[columnidx][1:]][rowidx]
I think this does the job, but as I said, my actual dataset is huge (about 3,000,000 rows), and the Min-to-Max range is 282, which means 282 columns in the new DataFrame.
So the code above runs forever. Is there a faster way to do this? I have read a little about map-reduce, but I don't know whether it would apply here.
The idea is to create default column names in both DataFrames, then concat the stacked Series, append the first column (0) to the index, drop the second index level, and finally use DataFrame.unstack:
df1.columns = range(len(df1.columns))
df2.columns = range(len(df2.columns))

newdf = (pd.concat([df1.stack(), df2.stack()], axis=1)
           .set_index(0, append=True)
           .reset_index(level=1, drop=True)[1]
           .unstack(fill_value=0)
           .rename_axis(None, axis=1))
print(newdf)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Other solutions:
comp =[pd.Series(a, index=df1.loc[i]) for i, a in enumerate(df2.values)]
df = pd.concat(comp, axis=1).T.fillna(0).astype(int)
print (df)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
Or:
comp = [dict(zip(x, y)) for x, y in zip(df1.values, df2.values)]
c = pd.DataFrame(comp).fillna(0).astype(int)
print (c)
1 2 3 4 5 6 7
0 3 6 8 9 7 0 0
1 0 6 8 9 7 2 0
2 0 6 8 9 7 2 0
3 3 6 8 9 7 0 0
4 0 0 8 9 7 2 1
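For the sizes mentioned in the question, a fully vectorized variant with NumPy advanced indexing may also be worth sketching. The names out/rows/cols are mine, and duplicate values within a row resolve to the last write, just as in the original loop:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a0': [1,2,2,1,3], 'a1': [2,3,3,2,4], 'a2': [3,4,4,3,5],
                    'a3': [4,5,5,4,6], 'a4': [5,6,6,5,7]})
df2 = pd.DataFrame({'b0': [3,6,6,3,8], 'b1': [6,8,8,6,9], 'b2': [8,9,9,8,7],
                    'b3': [9,7,7,9,2], 'b4': [7,2,2,7,1]})

lo, hi = df1['a0'].min(), df1['a4'].max()            # global min and max
out = np.zeros((len(df1), hi - lo + 1), dtype=int)

rows = np.repeat(np.arange(len(df1)), df1.shape[1])  # row index for every cell
cols = df1.to_numpy().ravel() - lo                   # column position for every value
out[rows, cols] = df2.to_numpy().ravel()             # scatter df2's values into place

newdf = pd.DataFrame(out, index=df1.index, columns=range(lo, hi + 1))
print(newdf)
```

This replaces the two Python-level loops with a single scatter assignment, which scales much better to millions of rows.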
I have a table, and I want to sum the values of all columns belonging to the same class h.* so that my final table has one column per class.
Is it possible to aggregate by a string prefix of the column name?
Thank you for any suggestions!
Use a lambda function that selects the first 3 characters of each column name, with parameter axis=1, or index the column names in a similar way, and aggregate with sum:
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
Or:
df1 = df.set_index('object')
df2 = df1.groupby(df1.columns.str[:3], axis=1).sum().reset_index()
Sample:
np.random.seed(123)
cols = ['object', 'h.1.1','h.1.2','h.1.3','h.1.4','h.1.5',
'h.2.1','h.2.2','h.2.3','h.2.4','h.3.1','h.3.2','h.3.3']
df = pd.DataFrame(np.random.randint(10, size=(4, 13)), columns=cols)
print (df)
object h.1.1 h.1.2 h.1.3 h.1.4 h.1.5 h.2.1 h.2.2 h.2.3 h.2.4 \
0 2 2 6 1 3 9 6 1 0 1
1 9 3 4 0 0 4 1 7 3 2
2 4 8 0 7 9 3 4 6 1 5
3 8 3 5 0 2 6 2 4 4 6
h.3.1 h.3.2 h.3.3
0 9 0 0
1 4 7 2
2 6 2 1
3 3 0 6
df1 = df.set_index('object')
df2 = df1.groupby(lambda x: x[:3], axis=1).sum().reset_index()
print (df2)
object h.1 h.2 h.3
0 2 21 8 9
1 9 11 13 13
2 4 27 16 9
3 8 16 16 9
The solution above works great but breaks once the class index in h.X goes beyond a single digit. I'd recommend the following:
Sample Data:
cols = ['h.%d.%d' %(i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols, index=['p_%d'%p for p in range(4)])
Proposed Solution:
new_df = df.groupby(df.columns.str.split('.').str[1], axis=1).sum()
new_df.columns = 'h.' + new_df.columns  # the columns are originally numbered 1, 2, 3, ...; this brings them back to h.1, h.2, h.3
Alternative Solution:
Going through multiindices might be more convoluted, but may be useful while manipulating this data elsewhere.
df.columns = df.columns.str.split('.', expand=True) # Transform into a multiindex
new_df = df.sum(axis = 1, level=[0,1])
new_df.columns = new_df.columns.get_level_values(0) + '.' + new_df.columns.get_level_values(1) # Rename columns
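Note that in recent pandas, groupby(..., axis=1) is deprecated (since 2.1) and DataFrame.sum(level=...) has been removed (in 2.0); a transpose-based sketch of the same aggregation that avoids both, on sample data like the above:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
cols = ['h.%d.%d' % (i, j) for i in range(1, 11) for j in range(1, 11)]
df = pd.DataFrame(np.random.randint(10, size=(4, len(cols))), columns=cols)

# Group the transposed frame by the class number, then transpose back.
# Casting the key to int keeps h.10 from sorting between h.1 and h.2.
key = df.columns.str.split('.').str[1].astype(int)
new_df = df.T.groupby(key).sum().T
new_df.columns = 'h.' + new_df.columns.astype(str)  # 1 -> 'h.1', ...
print(new_df.columns.tolist())
```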
I don't have much experience working with pandas. I have a pandas DataFrame as shown below.
df = pd.DataFrame({'A': [1, 2, 1],
                   'start': [1, 3, 4],
                   'stop': [3, 4, 8]})
I would like to build a new DataFrame by iterating through the rows and appending to the result. For example, row 1 of the input (A = 1, start = 1, stop = 3) should generate the sequence [1, 2, 3], with each value paired with its A in a column named seq:
A seq
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
So far, I've only managed to identify which function to use to iterate through the rows of the DataFrame.
Here's one way with apply:
import numpy as np

(df.set_index('A')
   .apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
   .stack()
   .to_frame('seq')
   .reset_index(level=1, drop=True)
   .astype('int')
)
Out:
seq
A
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
If you want to use loops:
In [1164]: data = []
In [1165]: for _, x in df.iterrows():
...: data += [[x.A, y] for y in range(x.start, x.stop+1)]
...:
In [1166]: pd.DataFrame(data, columns=['A', 'seq'])
Out[1166]:
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
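A vectorized alternative that avoids Python-level row iteration almost entirely, built on repeat-style expansion (the names lengths and out are mine):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1], 'start': [1, 3, 4], 'stop': [3, 4, 8]})

lengths = (df['stop'] - df['start'] + 1).to_numpy()  # rows each range expands to
out = pd.DataFrame({
    # repeat each A value once per element of its range
    'A': df['A'].to_numpy().repeat(lengths),
    # concatenate the per-row ranges into one flat seq column
    'seq': np.concatenate([np.arange(s, e + 1) for s, e in zip(df['start'], df['stop'])]),
})
print(out)
```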
To add to the answers above, here's a method that defines a function for turning the input DataFrame shown into the form the poster wants:
def gen_df_permutations(perm_def_df):
    m_list = []
    for i in perm_def_df.index:
        row = perm_def_df.loc[i]
        for n in range(row.start, row.stop + 1):
            m_list.append([row.A, n])
    return m_list
Call it, referencing the specification dataframe:
gen_df_permutations(df)
Or call it wrapped in the DataFrame constructor to return a final DataFrame:
pd.DataFrame(gen_df_permutations(df),columns=['A','seq'])
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
N.B. the first column there is the DataFrame index, which can be removed or ignored as requirements allow.