Issue concatenating two dataframes on top of each other - python

I have two dataframes
df_train_1 with shape (70652, 4)
and
df_test_1 with shape (24581, 4)
I am trying to concat them with df_train_1 on top.
I have tried the following two methods:
df_combined = df_train_1.append(df_test_1)
df_combined = pd.concat([df_train_1, df_test_1])
When I call df_combined.title[0] I get both [0] values, one from each original dataframe. Can someone point me in the direction of how to avoid this, please?

If you look at the example in the pandas documentation: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
A B
0 1 2
1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
A B
0 1 2
1 3 4
0 5 6
1 7 8
You will see that the original indexes are simply stacked, so the labels 0 and 1 each appear twice.
So like the comment suggested, use ignore_index=True to reset the index to numeric order:
df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
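Note that DataFrame.append has since been deprecated (and removed in pandas 2.0), so the pd.concat call from the question plus ignore_index=True is the more durable fix. A minimal sketch, with small hypothetical stand-in frames for df_train_1 and df_test_1 (only the title column is assumed from the question):
import pandas as pd

# Hypothetical stand-ins for df_train_1 / df_test_1 from the question
df_train_1 = pd.DataFrame({'title': ['train title 0', 'train title 1']})
df_test_1 = pd.DataFrame({'title': ['test title 0', 'test title 1']})

# ignore_index=True rebuilds the index as 0..n-1, so no label appears twice
df_combined = pd.concat([df_train_1, df_test_1], ignore_index=True)

print(df_combined.title[0])   # a single value now, not one row per original frame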

Related

Keep/select columns with the n highest values in last row of a Pandas dataframe

So I have a dataframe as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.array([[1, 2, 3], [4, 3, 6], [7, 2, 9]]),
                  columns=['a', 'b', 'c'])
df
Output:
   a  b  c
0  1  2  3
1  4  3  6
2  7  2  9
I want to select or keep the two columns with the highest values in the last row. What is the best way to approach this?
So in fact I just want to select or keep column 'a' due to value 7 and column 'c' due to value 9.
Try:
df = df[df.iloc[-1].nlargest(2).index]
Output:
c a
0 3 1
1 6 4
2 9 7
If you want to keep the original column order as well, you can use Index.intersection() together with .nlargest(), as follows:
df[df.columns.intersection(df.iloc[-1].nlargest(2).index, sort=False)]
Result:
a c
0 1 3
1 4 6
2 7 9
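For clarity, the one-liner can be unpacked step by step; a minimal sketch using the question's df:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.array([[1, 2, 3], [4, 3, 6], [7, 2, 9]]),
                  columns=['a', 'b', 'c'])

last_row = df.iloc[-1]          # Series with the last row: a=7, b=2, c=9
top_two = last_row.nlargest(2)  # the two largest values: c=9, a=7
print(top_two.index.tolist())   # ['c', 'a'] -> the column labels to keep
print(df[top_two.index])        # same result as df[df.iloc[-1].nlargest(2).index]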

Select rows of pandas dataframe in order of a given list with repetitions and keep the original index

After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first so we can set_index after the merging:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
    df.assign(idx=df.index)
      .merge(df2, on='A')
      .sort_values('order')
      .set_index('idx')
      .drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
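Putting that answer together as one self-contained snippet (df and list_of_values exactly as in the question):
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
list_of_values = [3, 4, 6, 4, 3, 8]

# 'order' remembers the position of each requested value in the list
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})

dfn = (
    df.assign(idx=df.index)       # keep the original index as a column
      .merge(df2, on='A')         # inner merge drops values not in df (e.g. 8)
      .sort_values('order')       # restore the order of list_of_values
      .set_index('idx')           # bring back the original index
      .drop('order', axis=1)
      .rename_axis(None)
)
print(dfn)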
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is:
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3

Sort all columns of a pandas DataFrame independently using sort_values()

I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
        'b': [7, 9, 1, 4],
        'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with the DataFrame constructor:
import numpy as np

df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print(df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
sort_values sorts the entire data frame by the column order you pass to it. In your first example you are sorting the entire data frame by ['a', 'b', 'c']: first by 'a', then by 'b' to break ties, and finally by 'c'.
Notice how, after sorting by 'a', each row moves as a whole and stays intact. This is the expected result.
With the lambda, apply passes each column to it one at a time, so sort_values operates on a single column in isolation; that's why this second approach sorts the columns as you would expect. In this case the rows are no longer kept together.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
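For completeness, the apply approach from the question also handles the descending case directly, without dropping down to numpy; a minimal sketch:
import pandas as pd

data = {'a': [5, 2, 3, 6],
        'b': [7, 9, 1, 4],
        'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)

# Each column is sorted on its own; .values discards the per-column index,
# so the sorted values are re-aligned on the frame's 0..n-1 index.
desc = df.apply(lambda x: x.sort_values(ascending=False).values)
print(desc)
#    a  b  c
# 0  6  9  5
# 1  5  7  4
# 2  3  4  2
# 3  2  1  1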

Adding a new column to a pandas dataframe with different number of rows

I'm not sure if pandas is made to do this... But I'd like to add a new column to my dataframe that has more rows than the existing columns.
Minimal example:
import pandas as pd
df = pd.DataFrame()
df['a'] = [0, 1]
df['b'] = [0, 1, 2]
Could someone please explain if this is possible? I'm using a dataframe to store long lists of data and they all have different lengths that I don't necessarily know at the start.
Absolutely possible. Use pd.concat
Demonstration
df1 = pd.DataFrame([[1, 2, 3]])
df2 = pd.DataFrame([[4, 5, 6, 7, 8, 9]])
pd.concat([df1, df2])
df1 looks like
0 1 2
0 1 2 3
df2 looks like
0 1 2 3 4 5
0 4 5 6 7 8 9
pd.concat looks like
0 1 2 3 4 5
0 1 2 3 NaN NaN NaN
0 4 5 6 7.0 8.0 9.0
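Applied to the question's own columns of unequal length, the same idea works along axis=1; a minimal sketch (the shorter column gets padded with NaN):
import pandas as pd

# The question's columns as Series of different lengths
a = pd.Series([0, 1], name='a')
b = pd.Series([0, 1, 2], name='b')

# Concatenating along columns aligns on the index and fills the gap with NaN
df = pd.concat([a, b], axis=1)
print(df)
#      a  b
# 0  0.0  0
# 1  1.0  1
# 2  NaN  2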

In pandas, how can I reset index without adding a new column?

In [37]: df = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]])
In [38]: df2 = pd.concat([df, df])
In [39]: df2.reset_index()
Out[39]:
index 0 1 2 3
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 0 1 2 3 4
4 1 2 3 4 5
5 2 3 4 5 6
How can I reset_index without it adding a new index column?
You can use the drop=True option in reset_index().
An often encountered issue is that reset_index() returns a copy, so it has to be assigned back to a variable (or to itself) to actually modify the dataframe. Alternatively, you can use the inplace= parameter to drop the old index in place.
df = df.reset_index(drop=True)
# or
df.reset_index(drop=True, inplace=True)
df
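Putting it together with the df2 from the question; a minimal sketch:
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]])
df2 = pd.concat([df, df])

# drop=True discards the old index instead of inserting it as a new column
df2 = df2.reset_index(drop=True)
print(df2)
#    0  1  2  3
# 0  1  2  3  4
# 1  2  3  4  5
# 2  3  4  5  6
# 3  1  2  3  4
# 4  2  3  4  5
# 5  3  4  5  6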
