In [37]: df = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]])
In [38]: df2 = pd.concat([df, df])
In [39]: df2.reset_index()
Out[39]:
index 0 1 2 3
0 0 1 2 3 4
1 1 2 3 4 5
2 2 3 4 5 6
3 0 1 2 3 4
4 1 2 3 4 5
5 2 3 4 5 6
How can I reset_index without adding a new column index?
You can use the drop=True option in reset_index(). See here.
An often encountered issue is that reset_index() returns a copy, so it will have to be assigned to another variable (or itself) to modify the the dataframe. You can use the inplace= parameter so remove the old indexes in-place.
df = df.reset_index(drop=True)
# or
df.reset_index(drop=True, inplace=True)
df
Related
I have two dataframes
df_train_1 with shape (70652, 4)
and
df_test_1 with shape (24581, 4)
I am trying to concat them with df_train_1 ontop.
I have tried the following two methods:
df_combined = df_train_1.append(df_test_1)
df_combined = pd.concat([df_train_1, df_test_1])
when I call df_combined.title[0] I get the both [0] values from the original dataframe. Can someone point me in the direction of how to avoid this please
If you look at the example of the documentation of pandas: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df
A B
0 1 2
1 3 4
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
A B
0 1 2
1 3 4
0 5 6
1 7 8
You will see that the index will be stacked.
So like the comment suggested, use ignore_index=True to reset the index to numeric order:
df.append(df2, ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
After looking here and here and in the documentation, I still cannot find a way to select rows from a DataFrame according to all these criteria:
Return rows in an order given from a list of values from a given column
Return repeated rows (associated with repeated values in the list)
Preserve the original indices
Ignore values of the list not present in the DataFrame
As an example, let
df = pd.DataFrame({'A': [5, 6, 3, 4], 'B': [1, 2, 3, 5]})
df
A B
0 5 1
1 6 2
2 3 3
3 4 5
and let
list_of_values = [3, 4, 6, 4, 3, 8]
Then I would like to get the following DataFrame:
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
How can I accomplish that? Zero's answer looks promising as it is the only one I found which preserves the original index, but it does not work with repetitions. Any ideas about how to modify/generalize it?
We have to preserve the index by assigning it as a column first so we can set_index after the mering:
list_of_values = [3, 4, 6, 4, 3, 8]
df2 = pd.DataFrame({'A': list_of_values, 'order': range(len(list_of_values))})
dfn = (
df.assign(idx=df.index)
.merge(df2, on='A')
.sort_values('order')
.set_index('idx')
.drop('order', axis=1)
)
A B
idx
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
If you want to get rid of the index name (idx), use rename_axis:
dfn = dfn.rename_axis(None)
A B
2 3 3
3 4 5
1 6 2
3 4 5
2 3 3
Here's a way to do that using merge:
list_df = pd.DataFrame({"A": list_of_values, "order": range(len(list_of_values))})
pd.merge(list_df, df, on="A").sort_values("order").drop("order", axis=1)
The output is:
A B
0 3 3
2 4 5
4 6 2
3 4 5
1 3 3
Given this dataframe:
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
I can use groupby to show the size of the group of combinations:
df.groupby(['A','B']).size()
A B
1 2 1
3 1
4 6 1
How can I combine the unique values of B into a list and also display the size of A like this?
A B
1 2,3 2
4 6 1
Use:
df['B'].astype(str).groupby(df['A']).agg([','.join,'size'])
Out[134]:
join size
A
1 2,3 2
4 6 1
Group only on A and use .agg specifying a dictionary to use for each column.
df.groupby('A').agg({'B': list, 'A': 'size'}).rename(columns={'A': 'Size'})
B Size
A
1 [2, 3] 2
4 [6] 1
I have a large dataframe. When it was created 'None' was used as the value where a number could not be calculated (instead of 'nan')
How can I delete all rows that have 'None' in any of it's columns? I though I could use df.dropna and set the value of na, but I can't seem to be able to.
Thanks
I think this is a good representation of the dataframe:
temp = pd.DataFrame(data=[['str1','str2',2,3,5,6,76,8],['str3','str4',2,3,'None',6,76,8]])
Setup
Borrowed #MaxU's df
df = pd.DataFrame([
[1, 2, 3],
[4, None, 6],
[None, 7, 8],
[9, 10, 11]
], dtype=object)
Solution
You can just use pd.DataFrame.dropna as is
df.dropna()
0 1 2
0 1 2 3
3 9 10 11
Supposing you have None strings like in this df
df = pd.DataFrame([
[1, 2, 3],
[4, 'None', 6],
['None', 7, 8],
[9, 10, 11]
], dtype=object)
Then combine dropna with mask
df.mask(df.eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
You can ensure that the entire dataframe is object when you compare with.
df.mask(df.astype(object).eq('None')).dropna()
0 1 2
0 1 2 3
3 9 10 11
Thanks for all your help. In the end I was able to get
df = df.replace(to_replace='None', value=np.nan).dropna()
to work. I'm not sure why your suggestions didn't work for me.
UPDATE:
In [70]: temp[temp.astype(str).ne('None').all(1)]
Out[70]:
0 1 2 3 4 5 6 7
0 str1 str2 2 3 5 6 76 8
Old answer:
In [35]: x
Out[35]:
a b c
0 1 2 3
1 4 None 6
2 None 7 8
3 9 10 11
In [36]: x = x[~x.astype(str).eq('None').any(1)]
In [37]: x
Out[37]:
a b c
0 1 2 3
3 9 10 11
or bit nicer variant from #roganjosh:
In [47]: x = x[x.astype(str).ne('None').all(1)]
In [48]: x
Out[48]:
a b c
0 1 2 3
3 9 10 11
im a bit late to the party, but this is prob the simplest method:
df.dropna(axis=0, how='any')
Parameters:
axis='index/column' how='any/all'
axis '0' is for dropping rows (most common), and '1' will drop columns instead.
and the parameter how will drop if there are 'any' None types in the row/ column,
or if they are all None types (how='all')
if still None is not removed , we can do
df = df.replace(to_replace='None', value=np.nan).dropna()
the above solution worked partially still the None was converted to NaN but not removed (thanks to the above answer as it helped to move further)
so then i added one more line of code that is take the particular column
df['column'] = df['column'].apply(lambda x : str(x))
this changed the NaN to nan
now remove the nan
df = df[df['column'] != 'nan']
I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with DataFrame constructor:
df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
Answer with descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
sort_values will sort the entire data frame by the columns order you pass to it. In your first example you are sorting the entire data frame with ['a', 'b', 'c']. This will sort first by 'a', then by 'b' and finally by 'c'.
Notice how, after sorting by a, the rows maintain the same. This is the expected result.
Using lambda you are passing each column to it, this means sort_values will apply to a single column, and that's why this second approach sorts the columns as you would expect. In this case, the rows change.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5