I have a pandas dataframe like this:
Column 1  Column 2
1         a
2         a
3         b
4         c
5         d
I want to relabel the values in Column 2 as follows:
Column 1  Column 2
1         row1
2         row1
3         row2
4         row3
5         row4
I have tried hard-coded approaches, like renaming each value individually, but in practice I have lots of rows, so hard coding is not feasible. Is there a function or something in Python that can do this task for me?
Let's try Series.factorize:

df['Column2'] = (pd.Series(df['Column2'].factorize()[0])
                    .add(1).astype(str).radd('row'))
print(df)
Column1 Column2
0 1 row1
1 2 row1
2 3 row2
3 4 row3
4 5 row4
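For completeness, a minimal self-contained sketch (the Column1/Column2 names without spaces follow the answer's output; the input frame is reconstructed from the question):

import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3, 4, 5],
                   'Column2': ['a', 'a', 'b', 'c', 'd']})

# factorize numbers each distinct value 0, 1, 2, ... in order of appearance
codes = df['Column2'].factorize()[0]        # array([0, 0, 1, 2, 3])
df['Column2'] = 'row' + pd.Series(codes + 1, index=df.index).astype(str)
print(df)    # Column2 becomes row1, row1, row2, row3, row4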
I have a list of Pandas dataframes where I want to add together the rows that have the same index. For example, say we have two dataframes where their indexes are unordered:
Column1 Column2
Item1 1 4
Item3 2 5
Item2 3 6
Column1 Column2
Item1 1 3
Item2 2 4
Is there a way to add these two dataframes together by index to get the following result with Item3 included? Because a simple df1 + df2 will return the first two lines correctly, but Item3 will end up having NaNs. Having the results become floats is fine.
# What I want to calculate
Column1 Column2
Item1 2 7
Item2 5 10
Item3 2 5
# What actually calculates
Column1 Column2
Item1 2.0 7.0
Item2 5.0 10.0
Item3 NaN NaN
You can try add with fill_value=0 (here the item labels are assumed to sit in a regular column, hence the set_index/reset_index round trip):

df_final = (df1.set_index('item')
               .add(df2.set_index('item'), fill_value=0)
               .reset_index())
# float to int via strings (restoring NaN afterwards)
df_final = df_final.astype(str).replace(r'\.0', '', regex=True).replace(['nan', 'None'], np.nan)
print(df_final)
output
item col1 col2
0 it1 2 7
1 it2 5 10
2 it3 2 5
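If, as in the question, the item labels already sit in the index, the set_index/reset_index round trip can be dropped; a minimal sketch:

import pandas as pd

df1 = pd.DataFrame({'Column1': [1, 2, 3], 'Column2': [4, 5, 6]},
                   index=['Item1', 'Item3', 'Item2'])
df2 = pd.DataFrame({'Column1': [1, 2], 'Column2': [3, 4]},
                   index=['Item1', 'Item2'])

# add aligns on the index; fill_value=0 stands in for the Item3 row
# missing from df2, so no NaNs appear in the result
print(df1.add(df2, fill_value=0).sort_index())
#        Column1  Column2
# Item1      2.0      7.0
# Item2      5.0     10.0
# Item3      2.0      5.0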
I have a table with multiple rows, as follows:
table1  col1  col2  col2
row1    1     2     3
row2    3     4     6
row3    4     5     7
row4    5     4     6
row5    6     2     3
row6    7     4     6
I want to change it like this:
table1  col1  col2  col2
row1    1     2     3
row2    3     4     6

table1  col1  col2  col2
row3    4     5     7
row4    5     4     6

table1  col1  col2  col2
row5    6     2     3
row6    7     4     6
Namely, I just want to insert a header row to separate them, because they belong to different subtables.
I have tried using np.insert to insert a row of header values:

pd.DataFrame(np.insert(df.values, 0, values=["table1", "col1", "col2", "col2"], axis=0))

but in a specific column of a DataFrame, all values must have the same type.
I have also tried xlwings (its insert function) and openpyxl (its insert_rows function) to insert a row, but it seems they can only insert blank rows, not rows with specific values.
After constructing this table, I will use it to set some styles.
In Excel I would just copy and paste; is there a flexible way to do this here? Or maybe inserting is not a good approach, and I should instead split and recombine the tables (with subtitles, keeping the format)?
Addition:
[data link][1]
[1]: https://cowtransfer.com/s/a160ccec698a48 (you need to input code 454008)
You can try:

s = pd.DataFrame([df.columns] * int(len(df) / 2), columns=df.columns)
s.index = pd.Series(s.index + 1).cumsum()
df = pd.concat([df, s]).sort_index().iloc[:-1].reset_index(drop=True)
output of df:
table1 col1 col2 col2
0 row1 1 2 3
1 row2 3 4 6
2 table1 col1 col2 col2
3 row3 4 5 7
4 row4 5 4 6
5 table1 col1 col2 col2
6 row5 6 2 3
7 row6 7 4 6
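To make the index trick concrete, here is a self-contained sketch with the question's six rows (kind='stable' is spelled out so that, on tied index values, data rows stay ahead of their header):

import pandas as pd

data = [['row1', 1, 2, 3], ['row2', 3, 4, 6], ['row3', 4, 5, 7],
        ['row4', 5, 4, 6], ['row5', 6, 2, 3], ['row6', 7, 4, 6]]
df = pd.DataFrame(data, columns=['table1', 'col1', 'col2', 'col2'])

# one header row for every two data rows
s = pd.DataFrame([df.columns] * (len(df) // 2), columns=df.columns)
# cumsum of 1, 2, 3 gives 1, 3, 6, so each header sorts in right
# after two data rows (the trailing header is dropped below)
s.index = pd.Series(s.index + 1).cumsum()
out = (pd.concat([df, s])
         .sort_index(kind='stable')
         .iloc[:-1]
         .reset_index(drop=True))
print(out)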
Update:
You can try:

s = pd.DataFrame([df.columns] * int(len(df) / 16), columns=df.columns)
s.index = [x + 1 for x in range(16, 16 * (len(s) + 1), 16)]
df = pd.concat([df, s]).sort_index().reset_index(drop=True)
# If needed, remove the last row with:
# df = df.iloc[:-1]
Sample dataframe used by me for testing:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['table1'] = [f'row {x}' for x in range(65)]
df['col1'] = np.random.randint(1, 10, 65)
df['col2'] = np.random.randint(1, 10, 65)
df['col3'] = np.random.randint(1, 10, 65)
df.columns = [*df.columns[:-1], 'col2']   # duplicate the col2 label, as in the question
So I have a dataframe and I would like to be able to compare each value with other values in its row and column at the same time. For example, I have something like this
Col1 Col2 Col3 NumCol
Row1 1 4 7 16
Row2 2 5 8 13
Row3 3 6 9 30
NumRow 28 14 10
For each value that isn't in the NumRow or NumCol, I would like to compare the NumCol and NumRow values in the same column/row as it.
I would like it to return the value of the first instance where NumCol is larger than NumRow in each row.
So the result would be this:
Row1 4
Row2 8
Row3 3
I have no clue how to even begin this, but is there a way to do it elegantly, without for loops over the whole dataframe to find these values?
First we flatten the dataframe (here df is your original dataframe):

df2 = (df.fillna('NumRow')
         .set_index('NumCol')
         .transpose()
         .set_index('NumRow')
         .stack()
         .reset_index(name='value'))
df2
output
NumRow NumCol value
0 28 16.0 1
1 28 13.0 2
2 28 30.0 3
3 14 16.0 4
4 14 13.0 5
5 14 30.0 6
6 10 16.0 7
7 10 13.0 8
8 10 30.0 9
Now, for each row of the new dataframe df2, we have the corresponding number from NumRow, the corresponding number from NumCol, and the number from the 'body' of the original dataframe df.
Next we apply the condition, group by NumCol, and within each group find the first row where the condition is satisfied, reporting the corresponding value:

df3 = (df2.assign(cond=df2['NumCol'] > df2['NumRow'])
          .groupby('NumCol')
          .apply(lambda d: d[d['cond']].iloc[0])['value'])
df3.index = df3.index.map(dict(zip(df['NumCol'], df.index)))
df3.sort_index()
Output
NumCol
Row1 4
Row2 8
Row3 3
Name: value, dtype: int64
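If you prefer to skip the reshape, a NumPy-only sketch of the same comparison (the frame construction is an assumption about the exact layout, with NumRow in the index and NumCol as a column):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Col1': [1, 2, 3, 28],
                   'Col2': [4, 5, 6, 14],
                   'Col3': [7, 8, 9, 10],
                   'NumCol': [16, 13, 30, np.nan]},
                  index=['Row1', 'Row2', 'Row3', 'NumRow'])

body = df.loc[df.index != 'NumRow', ['Col1', 'Col2', 'Col3']]
numcol = df.loc[df.index != 'NumRow', 'NumCol'].to_numpy()
numrow = df.loc['NumRow', ['Col1', 'Col2', 'Col3']].to_numpy()

# (3,1) against (3,) broadcasts to a (3,3) mask: each row's NumCol
# versus each column's NumRow; argmax picks the first True per row
# (this assumes every row has at least one qualifying column)
mask = numcol[:, None] > numrow
first = mask.argmax(axis=1)
result = pd.Series(body.to_numpy()[np.arange(len(body)), first],
                   index=body.index)
print(result)    # Row1 4, Row2 8, Row3 3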
I'd like to populate one dataframe (df2) based on the column names of df2 matching values within a column in another dataframe (df1). Here is a simplified example:
names = list('abcd')
data = list('aadc')
df1 = pd.DataFrame(data,columns=['data'])
df2 = pd.DataFrame(np.empty([4,4]),columns=names)
df1:
data
0 a
1 a
2 d
3 c
df2:
a b c d
0 0.00 0.00 0.00 0.00
1 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00
3 0.00 0.00 0.00 0.00
I'd like to update df2 so that the first row returns a number (let's say 1 for now) under column a, and 0 for the other columns. The second row of df2 would return the same; the third row would return 0 for columns a/b/c and 1 for column d; the fourth row would return 0 for columns a/b/d and 1 for column c.
Thanks very much for the help!
You can do numpy broadcasting here:

df2[:] = (df1['data'].values[:, None] == df2.columns.values).astype(int)

Or use get_dummies (fill_value=0 covers columns, such as b, that never appear in the data):

df2[:] = pd.get_dummies(df1['data']).reindex(df2.columns, axis=1, fill_value=0)
Output:
a b c d
0 1 0 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
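As a quick end-to-end check with the names from the question (note that df2 was allocated with np.empty, i.e. as floats, so the 0/1 values come back as 0.0/1.0 unless you cast):

import numpy as np
import pandas as pd

names = list('abcd')
data = list('aadc')
df1 = pd.DataFrame(data, columns=['data'])
df2 = pd.DataFrame(np.empty([4, 4]), columns=names)

# (4,1) column of labels == (4,) array of column names -> (4,4) boolean grid
df2[:] = (df1['data'].values[:, None] == df2.columns.values).astype(int)
print(df2.astype(int))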
I have a Pandas DataFrame where some of the values are missing (denoted by ?). Is there any easy way of deleting all rows where at least one column has the value ??
Normally, I would do boolean indexing but I have many columns. One way is as follows:
for index, row in df.iterrows():
    for col in df.columns:
        if '?' in row[col]:
            # delete row
But this seems unPythonic...
Any ideas?
Or just replace it with NaN and use dropna:
df.replace({'?':np.nan}).dropna()
Out[126]:
col1 col2 col3 col4
row4 24 12 52 17
Option 1a
boolean indexing and any
df
col1 col2 col3 col4
row1 65 24 47 ?
row2 33 48 ? 89
row3 ? 34 67 ?
row4 24 12 52 17
(df.astype(str) == '?').any(axis=1)
row1 True
row2 True
row3 True
row4 False
dtype: bool
df = df[~(df.astype(str) == '?').any(axis=1)]
df
col1 col2 col3 col4
row4 24 12 52 17
Here, the astype(str) check is to prevent a TypeError: Could not compare ['?'] with block values from being raised if you have a mixture of string and numeric columns in your dataframe.
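A tiny illustration of that failure mode, as a sketch (the mixed frame below is made up; on older pandas the raw comparison could raise, and astype(str) makes the elementwise check uniform):

import pandas as pd

# one purely numeric column, one object column holding the '?' marker
df = pd.DataFrame({'col1': [65, 33, 24], 'col2': [24, '?', 17]})

mask = (df.astype(str) == '?').any(axis=1)
print(df[~mask])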
Option 1b
Direct comparison with values
(df.values == '?').any(1)
array([ True, True, True, False], dtype=bool)
df = df[~(df.values == '?').any(1)]
df
col1 col2 col3 col4
row4 24 12 52 17
Option 2
df.replace and df.notnull
df.replace('?', np.nan).notnull().all(axis=1)
row1 False
row2 False
row3 False
row4 True
dtype: bool
df = df[df.replace('?', np.nan).notnull().all(axis=1)]
col1 col2 col3 col4
row4 24 12 52 17
Which avoids the astype(str) call. Alternatively, you might do as Wen suggested and just drop them:
df.replace('?', np.nan).dropna()
You can use boolean indexing with all to check that the values do not contain ?.
If the types are mixed (numeric columns with strings):
df = pd.DataFrame({'B':[4,5,'?',5,5,4],
'C':[7,'?',9,4,2,3],
'D':[1,3,5,7,'?',0],
'E':[5,3,'?',9,2,4]})
print (df)
B C D E
0 4 7 1 5
1 5 ? 3 3
2 ? 9 5 ?
3 5 4 7 9
4 5 2 ? 2
5 4 3 0 4
df = df[(df.astype(str) != '?').all(axis=1)].astype(int)
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4
Or compare with numpy array created by values:
df = df[(df.values != '?').all(axis=1)]
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4
If all values are strings, the solution can be simplified:
df = pd.DataFrame({'B':[4,5,'?',5,5,4],
'C':[7,'?',9,4,2,3],
'D':[1,3,5,7,'?',0],
'E':[5,3,'?',9,2,4]}).astype(str)
df = df[(df != '?').all(axis=1)].astype(int)
print (df)
B C D E
0 4 7 1 5
3 5 4 7 9
5 4 3 0 4