Removing duplicated rows in a pandas dataframe without considering order [duplicate] - python

This question already has answers here:
(pandas) Drop duplicates based on subset where order doesn't matter
(2 answers)
Pandas: remove duplicates that exist in any order
(3 answers)
Closed 10 months ago.
I have a dataframe of the form:
import pandas as pd
df_1 = pd.DataFrame({
'A': [0, 0, 1, 1, 1, 2],
'B': [0, 1, 0, 1, 2, 1],
'C': ['a', 'a', 'b', 'b', 'c', 'c']
})
I want to drop the rows where the pair of values from columns 'A' and 'B' is a duplicate of an earlier pair, regardless of order, so that (0, 1) and (1, 0) count as the same pair.
So what I want is:
df_1 = pd.DataFrame({
'A': [0, 0, 1, 1],
'B': [0, 1, 1, 2],
'C': ['a', 'a', 'b', 'c']
})
My idea was to add a column holding the sorted pair as a string and then use the dataframe's drop_duplicates function, but since I'm working with a very large dataframe this solution is very expensive.
Do you have any suggestions? Thanks for the answers.
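One way to avoid building string keys is to sort each (A, B) pair with NumPy and flag duplicates on the sorted values. This is a minimal sketch, assuming both columns are numeric; the names pairs, is_dup, and result are illustrative and not part of the original question.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({
    'A': [0, 0, 1, 1, 1, 2],
    'B': [0, 1, 0, 1, 2, 1],
    'C': ['a', 'a', 'b', 'b', 'c', 'c'],
})

# Sort each row of the (A, B) block so that (1, 0) and (0, 1) look identical.
pairs = np.sort(df_1[['A', 'B']].to_numpy(), axis=1)

# Flag rows whose sorted pair has already appeared earlier in the frame.
is_dup = pd.DataFrame(pairs, index=df_1.index).duplicated()

# Keep the first occurrence of each unordered pair; original values are untouched.
result = df_1[~is_dup]
print(result)
```

The sort runs on a plain NumPy array rather than row by row in Python, which is usually what matters on a large dataframe.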

Related

How to use pd.json_normalize() on a column formatted as string representation of a list?

I have a large file that is read into a DataFrame which has a column 'features' which is a string representation of a list. The elements in this "list" are sometimes strings, sometimes numbers, as shown below, but the lists in reality at times may be very long depending on the data source.
df = pd.DataFrame(["['a', 'b', 1, 2, 3, 'c', -5]",
"['a', 'b', 1, 2, 4, 'd', 3]",
"['a', 'b', 1, 2, 3, 'c', -5]"],
columns=['features'])
df
features
0 ['a', 'b', 1, 2, 3, 'c', -5]
1 ['a', 'b', 1, 2, 4, 'd', 3]
2 ['a', 'b', 1, 2, 3, 'c', -5]
# Looking at first two characters in first row for example--
df.features[0][0:2]
"['"
I am trying to use pd.json_normalize() to get the column into a "flat table" so it is easier to perform operations on various elements in the features column, (not all of them, but different sets of them depending on the operation being done). However, I can't seem to figure out how to get this to work.
How can I use json_normalize() properly here?
Above, you are creating each item as a string. What you should be doing is creating each item as an actual list.
import pandas as pd
df = pd.DataFrame({'features' : [['a', 'b', 1, 2, 3, 'c', -5],
['a', 'b', 1, 2, 4, 'd', 3],
['a', 'b', 1, 2, 3, 'c', -5]]})
will give you
features
0 [a, b, 1, 2, 3, c, -5]
1 [a, b, 1, 2, 4, d, 3]
2 [a, b, 1, 2, 3, c, -5]
Notice the missing quotes around the characters?
so you want df.features[0][0:2]
you get
['a', 'b']
Now how are you getting the data for your dataframe?
or if you have to get your dataframe like that,
df = pd.DataFrame(["['a', 'b', 1, 2, 3, 'c', -5]",
"['a', 'b', 1, 2, 4, 'd', 3]",
"['a', 'b', 1, 2, 3, 'c', -5]"],
columns=['features'])
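# regex=False treats the brackets as literal characters; note that every element, including the numbers, ends up as a string
df.features = df.features.str.replace(']','',regex=False).str.replace('[','',regex=False).str.replace(' ','',regex=False).str.replace("'",'',regex=False).str.split(',')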
then df.features[0][0:2]
will give you
['a', 'b']
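If the numbers should stay numbers, the strings can be parsed as Python literals instead of being split by hand. This is a sketch using ast.literal_eval from the standard library, assuming every cell is a valid Python list literal; the names parsed and flat are illustrative. It does not use json_normalize, since the cells are lists rather than nested JSON objects.

```python
import ast

import pandas as pd

df = pd.DataFrame(["['a', 'b', 1, 2, 3, 'c', -5]",
                   "['a', 'b', 1, 2, 4, 'd', 3]",
                   "['a', 'b', 1, 2, 3, 'c', -5]"],
                  columns=['features'])

# Safely evaluate each string as a Python literal, giving real lists
# whose elements keep their original types (str or int).
parsed = df['features'].apply(ast.literal_eval)

# Spread the list elements over separate columns to get a flat table.
flat = pd.DataFrame(parsed.tolist(), index=df.index)
print(flat)
```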

Selecting different rows from different GroupBy groups

As opposed to GroupBy.nth, which selects the same index for each group, I would like to take specific indices from each group. For example, if my GroupBy object consisted of four groups and I would like the 1st, 5th, 10th, and 15th from each respectively, then I would like to be able to pass x = [0, 4, 9, 14] and get those rows.
This is kind of a strange thing to want; is there a reason?
In any case, to do what you want, try this:
df = pd.DataFrame([['a', 1], ['a', 2],
                   ['b', 3], ['b', 4], ['b', 5],
                   ['c', 6], ['c', 7]],
                  columns=['group', 'value'])
def index_getter(which):
    def get(series):
        return series.iloc[which[series.name]]
    return get
which = {'a': 0, 'b': 2, 'c': 1}
df.groupby('group')['value'].apply(index_getter(which))
Which results in:
group
a 1
b 5
c 7
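The same selection can also be sketched without apply, by comparing each row's position inside its group with the position wanted for that group. This assumes the same example data and which mapping as above; pos, wanted, and result are illustrative names.

```python
import pandas as pd

df = pd.DataFrame([['a', 1], ['a', 2],
                   ['b', 3], ['b', 4], ['b', 5],
                   ['c', 6], ['c', 7]],
                  columns=['group', 'value'])
which = {'a': 0, 'b': 2, 'c': 1}

# Position of each row within its own group (0, 1, 2, ...).
pos = df.groupby('group').cumcount()

# Position requested for the group that each row belongs to.
wanted = df['group'].map(which)

# Keep only the rows whose in-group position matches the requested one.
result = df[pos == wanted]
print(result)
```

Unlike the apply version, this keeps whole rows and the original index.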

what is meaning of axis=1 in pandas sort_values function? [duplicate]

This question already has answers here:
What does axis in pandas mean?
(27 answers)
Closed 4 years ago.
I have the following code snippet.
df = pd.DataFrame({'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
'col2' : [2, 1, 9, 8, 7, 4],
'col3': [0, 1, 9, 4, 2, 3]})
print(df)
sorted=df.sort_values(by=1,axis=1)
print(sorted)
[Images in the original post: the original dataframe and the output of df.sort_values().]
Can anyone explain what is happening here?
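The parameter axis=1 means the sort reorders the columns, while axis=0 (the default) reorders the rows. With axis=1, by refers to an index label rather than a column name, so by=1 reorders the columns according to the values in the row labeled 1 (the second row here, since the default index starts at 0).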
Some good examples here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
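A small sketch of this behavior, using a made-up all-numeric dataframe so that the values in a row can be compared with one another (the data here is illustrative and not from the question):

```python
import pandas as pd

df = pd.DataFrame({'col1': [3, 9],
                   'col2': [1, 8],
                   'col3': [2, 7]})

# axis=1 with by=0: reorder the columns by the values found in row 0 (3, 1, 2).
by_row0 = df.sort_values(by=0, axis=1)

# axis=1 with by=1: reorder the columns by the values found in row 1 (9, 8, 7).
by_row1 = df.sort_values(by=1, axis=1)

print(by_row0.columns.tolist())
print(by_row1.columns.tolist())
```

In both cases the rows stay where they are; only the column order changes.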

Convert pandas DataFrame to lists, with column names as list names [duplicate]

This question already has answers here:
How do I create variable variables?
(17 answers)
Convert a Pandas DataFrame to a dictionary
(11 answers)
Closed 5 years ago.
I want to convert a data frame so that each column becomes one list, named by the column name. To give an example, suppose I have the following data frame
import pandas as pd
d = {'A' : [1, 2, 3, 4],
'B' : ['x', 'y', 'z', 'w']}
df = pd.DataFrame(d)
print(df)
A B
0 1 x
1 2 y
2 3 z
3 4 w
The desired output is to have
A = [1,2,3,4]
B = ['x', 'y', 'z', 'w']
I know how to get the column names as well as the values into list format separately or together in the following ways:
test1 = df.T.reset_index().values.tolist()
print(test1)
[['A', 1, 2, 3, 4], ['B', 'x', 'y', 'z', 'w']]
test2 = df.T.values.tolist()
print(test2)
[[1, 2, 3, 4], ['x', 'y', 'z', 'w']]
test3 = df.columns.values.tolist()
print(test3)
['A', 'B']
But I cannot figure out if/how I can give the lists created in test2 the names of the corresponding columns.
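In line with the linked question on converting a DataFrame to a dictionary, one option is to keep the lists in a dictionary keyed by column name rather than creating variables dynamically. This is a minimal sketch; the name lists is illustrative.

```python
import pandas as pd

d = {'A': [1, 2, 3, 4],
     'B': ['x', 'y', 'z', 'w']}
df = pd.DataFrame(d)

# orient='list' maps each column name to a plain Python list of its values.
lists = df.to_dict(orient='list')

# If separate names are really needed, they can be bound explicitly.
A, B = lists['A'], lists['B']
print(A)
print(B)
```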

Get DataFrame selection's row positions

Instead of the index labels, I'd like to obtain the row positions, so I can use the result later with df.iloc[row_positions].
This is the example:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
print(df[df['a']>=2].index)
# Int64Index([2, 7], dtype='int64')
# How do I convert the index list [2, 7] to [1, 2] (the row position)
# I managed to do this for 1 index element, but how can I do this for the entire selection/index list?
df.index.get_loc(2)
Update
I could use a list comprehension to apply the selected result on the get_loc function, but perhaps there's some Pandas-built-in function.
You can use where from NumPy:
import numpy as np
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
np.where( df.a>=2)
which returns the row positions:
(array([1, 2], dtype=int64),)
@ssm's answer is what I would normally use. However, to answer your specific query of how to select multiple rows, try this:
df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
indices = df[df['a']>=2].index
print(df.loc[indices])
The .ix indexer originally used here has since been removed from pandas; .loc is the label-based equivalent for this selection.
[EDIT to answer the specific query]
How do I convert the index list [2, 7] to [1, 2] (the row position)
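# Reset the index before filtering so the surviving index values are the row positions;
# resetting after filtering would give [0, 1] instead of [1, 2].
df.reset_index()[(df['a']>=2).values].index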
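Two other ways to get the positions can be sketched as follows: np.flatnonzero on the boolean mask, and Index.get_indexer on the selected labels. The second assumes the index labels are unique. The names mask, positions, via_labels, and selected are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c']}, index=[10, 2, 7])
mask = df['a'] >= 2

# Positions of the True entries in the boolean mask.
positions = np.flatnonzero(mask.to_numpy())

# Translate the selected index labels back to positions (requires a unique index).
via_labels = df.index.get_indexer(df[mask].index)

# Either result can be passed to iloc later on.
selected = df.iloc[positions]
print(positions, via_labels)
print(selected)
```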
