I have 2 dataframes with the same column names, for example:
col1 col2 col3
1 2 3
and
col1 col2 col3
4 5 6
1 7 8
I have appended them, so now the new dataframe is like below:
col1 col2 col3
1 2 3
4 5 6
1 7 8
The problem is that I need the rows that have the same value in col1 to come one after the other, like this:
col1 col2 col3
1 2 3
1 7 8
4 5 6
How can I sort the dataframe by col1 to get this result (without changing the dataframe type)?
Use DataFrame.sort_values:
df = pd.concat([df1, df2]).sort_values('col1', ignore_index=True)
If you want rows from dataframe 1 to come before rows from dataframe 2 when their col1 values are tied, use the stable 'mergesort' algorithm. The default algorithm (quicksort) is not stable, so tied rows may come out in an arbitrary order.
df.sort_values(by='col1', axis=0, ascending=True, inplace=True, kind='mergesort')
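As a minimal sketch of the stable-sort approach, the snippet below rebuilds the two frames from the question, concatenates them, and sorts on col1 with kind='mergesort'. The variable name result is only illustrative.

```python
import pandas as pd

# Frames from the question.
df1 = pd.DataFrame({"col1": [1], "col2": [2], "col3": [3]})
df2 = pd.DataFrame({"col1": [4, 1], "col2": [5, 7], "col3": [6, 8]})

# Concatenate first so df1's rows precede df2's, then sort with a stable
# algorithm so rows with equal col1 keep that relative order.
result = pd.concat([df1, df2]).sort_values(
    "col1", kind="mergesort", ignore_index=True
)
print(result)
```

The result is still a DataFrame, with the two col1 == 1 rows adjacent and the df1 row first.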
You can sort a DataFrame by any column, for example:
df.sort_values(by=['col_1', 'col_2'], ascending=[True, False], inplace=True)
After that you may want to reset the row index, since the original labels will be out of order:
df.reset_index(drop=True, inplace=True)
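A small sketch of the two steps together, using made-up data (the values are hypothetical; only the column names col_1 and col_2 come from the snippet above). It chains the calls instead of using inplace=True, which leaves the original frame untouched.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"col_1": [2, 1, 2, 1], "col_2": [10, 20, 30, 40]})

# Sort by col_1 ascending, break ties by col_2 descending,
# then replace the shuffled index with a fresh 0..n-1 range.
sorted_df = (
    df.sort_values(by=["col_1", "col_2"], ascending=[True, False])
      .reset_index(drop=True)
)
print(sorted_df)
```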
How can I merge two data frames when the column has a slight offset than the column I am merging to?
df1 =
col1 col2
1 a
2 b
3 c
df2 =
col1 col3
1.01 d
2 e
2.95 f
The merged dataframe would end up like this, even though the values in col1 are slightly different:
df_merge =
col1 col2 col3
1 a d
2 b e
3 c f
I have seen scenarios like this where "col1" is a string, but I'm wondering whether it's possible to do this with something like pandas.merge() when there is a slight numerical offset (e.g. +/- 0.05).
Let's use merge_asof with the tolerance parameter:
pd.merge_asof(
df1.astype({'col1': 'float'}).sort_values('col1'),
df2.sort_values('col1'),
on='col1',
direction='nearest',
tolerance=.05
)
col1 col2 col3
0 1.0 a d
1 2.0 b e
2 3.0 c f
PS: if the dataframes are already sorted on col1 then there is no need to sort again.
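To see what the tolerance does, here is a hedged, self-contained sketch of the same call. The extra df1 row (4, "x") is a hypothetical addition, not part of the question; it has no df2 key within 0.05, so its col3 should come back as NaN.

```python
import pandas as pd

# Frames from the question, plus one hypothetical row (4, "x") in df1.
df1 = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": ["a", "b", "c", "x"]})
df2 = pd.DataFrame({"col1": [1.01, 2, 2.95], "col3": ["d", "e", "f"]})

# merge_asof needs both keys sorted and of the same dtype,
# hence the cast of df1's integer key to float.
merged = pd.merge_asof(
    df1.astype({"col1": "float"}).sort_values("col1"),
    df2.sort_values("col1"),
    on="col1",
    direction="nearest",
    tolerance=0.05,
)
print(merged)
```

Note that the key column in the result carries the left frame's values (1.0, 2.0, 3.0, 4.0), not the slightly offset values from df2.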
I have a data frame
index col1 col2 col3
0 1 3 5
1 12 7 21
... ... ... ...
I want to delete some rows, with the criteria being that the values in col1 and col2 show up in a certain list.
Let the list be [(12,7),(100,34),...].
In this case, the row with index 1 would be deleted.
Build a MultiIndex from both columns with DataFrame.set_index, test it against the list with Index.isin, invert the resulting mask with ~, and filter with boolean indexing:
L = [(12,7),(100,34)]
df = df[~df.set_index(['col1','col2']).index.isin(L)]
print(df)
col1 col2 col3
0 1 3 5
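A runnable sketch of the same idea follows. The sample rows are partly hypothetical: the row (12, 3, 0) is added only to show that the match is on the (col1, col2) pair, so a row sharing just col1 with a listed tuple is kept.

```python
import pandas as pd

# Sample data; the (12, 3, 0) row is a hypothetical addition.
df = pd.DataFrame(
    {"col1": [1, 12, 12, 100], "col2": [3, 7, 3, 34], "col3": [5, 21, 0, 9]}
)
L = [(12, 7), (100, 34)]

# True where the (col1, col2) pair of a row appears in L.
mask = df.set_index(["col1", "col2"]).index.isin(L)

# Keep only the rows whose pair is NOT in the list.
filtered = df[~mask]
print(filtered)
```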
Let's take this dataframe :
pd.DataFrame(dict(Col1=["a","c"],Col2=["b","d"],Col3=[1,3],Col4=[2,4]))
Col1 Col2 Col3 Col4
0 a b 1 2
1 c d 3 4
I would like to have one row per value in column Col1 and column Col2 (n=2 and r=2, so the expected dataframe has 2*2 = 4 rows).
Expected result :
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
How could I do this, please?
Pandas melt does the job here; the rest just has to do with repositioning and renaming the columns appropriately.
Use pandas melt to transform the dataframe, using Col3 and Col4 as the index variables. melt typically converts from wide to long.
Next step - reindex the columns, with variable and value as lead columns.
Finally, rename the columns appropriately.
(df.melt(id_vars=['Col3','Col4'])
.reindex(['variable','value','Col3','Col4'],axis=1)
.rename({'variable':'Ind','value':'Value'},axis=1)
)
Ind Value Col3 Col4
0 Col1 a 1 2
1 Col1 c 3 4
2 Col2 b 1 2
3 Col2 d 3 4
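As a possible shortcut, melt also accepts var_name and value_name, which name the two new columns directly and make the separate rename step unnecessary. The sketch below is one way to write it; long_df is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(dict(Col1=["a", "c"], Col2=["b", "d"], Col3=[1, 3], Col4=[2, 4]))

# Name the melted columns up front, then put them first by column selection.
long_df = df.melt(
    id_vars=["Col3", "Col4"], var_name="Ind", value_name="Value"
)[["Ind", "Value", "Col3", "Col4"]]
print(long_df)
```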
I have two data frames:
df1
col1 col2
8 A
12 C
20 D
df2
col1 col3
7 F
15 G
I want to merge these two data frames on col1 so that each row of df1 is matched with the row of df2 whose col1 value is closest.
The final data frame should look like this:
df
col1 col2 col3
8 A F
12 C G
20 D NA
I can do this with a for loop that compares the numbers, but the execution time will be huge.
Is there a Pythonic way to do it that reduces the runtime? Maybe some pandas shortcut?
Use merge_asof with direction='nearest' and the tolerance parameter (both frames must already be sorted on col1):
df = pd.merge_asof(df1, df2, on='col1', direction='nearest', tolerance=3)
print(df)
col1 col2 col3
0 8 A F
1 12 C G
2 20 D NaN
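The role of tolerance is easiest to see by running the merge with and without it. This is a sketch on the frames from the question; the names with_tol and no_tol are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [8, 12, 20], "col2": ["A", "C", "D"]})
df2 = pd.DataFrame({"col1": [7, 15], "col3": ["F", "G"]})

# With tolerance=3, a df2 row only matches if its key is within 3 of df1's key.
with_tol = pd.merge_asof(df1, df2, on="col1", direction="nearest", tolerance=3)

# Without tolerance, every df1 row takes its nearest df2 row, however far away.
no_tol = pd.merge_asof(df1, df2, on="col1", direction="nearest")

print(with_tol)
print(no_tol)
```

Without the tolerance, the row with col1 == 20 picks up G from the key 15, which is not what the question asks for.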
I have a pandas dataframe:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
And I want to add a new row summing over two columns [Col1,Col2] like:
Col1 Col2 Col3
0 1 2 3
1 2 3 4
Total 3 5 NaN
Ignoring Col3. What should I do? Thanks in advance.
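DataFrame.append was removed in pandas 2.0, so on current versions use pd.concat together with pandas.DataFrame.sum. Summing only Col1 and Col2 leaves Col3 as NaN in the new row:
totals = df[['Col1', 'Col2']].sum()
df2 = pd.concat([df, totals.to_frame().T], ignore_index=True)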
You can use pd.DataFrame.loc. Note the final column will be converted to float since NaN is considered float:
import numpy as np
df.loc['Total'] = [df['Col1'].sum(), df['Col2'].sum(), np.nan]
df[['Col1', 'Col2']] = df[['Col1', 'Col2']].astype(int)
print(df)
Col1 Col2 Col3
0 1 2 3.0
1 2 3 4.0
Total 3 5 NaN
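If the new row should carry the label Total, as in the expected output, one option is to name the sums Series before concatenating. This is a sketch, not the only way to do it; total and out are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"Col1": [1, 2], "Col2": [2, 3], "Col3": [3, 4]})

# Sum only the columns of interest; the Series name becomes the row label.
total = df[["Col1", "Col2"]].sum().rename("Total")

# Col3 is absent from the total row, so concat fills it with NaN
# (which turns Col3 into a float column).
out = pd.concat([df, total.to_frame().T])
print(out)
```

Because only Col1 and Col2 are summed, those columns stay integer and no cast back with astype(int) is needed.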