One-liner to identify duplicates using pandas? [duplicate] - python

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 1 year ago.
While preparing for data analyst interview questions, I came across "find all duplicate emails (not unique emails) in a one-liner using pandas."
The best I've got is not a single line but rather three:
# initialize dataframe
import pandas as pd
d = {'email':['a','b','c','a','b']}
df = pd.DataFrame(d)
# select emails having duplicate entries
results = pd.DataFrame(df.value_counts())
results.columns = ['count']
results[results['count'] > 1]
>>>
       count
email
b          2
a          2
Could the block after the second comment be condensed into a one-liner, avoiding the temporary variable results?

Just use duplicated:
>>> df[df.duplicated()]
email
3 a
4 b
Or if you want a list:
>>> df[df["email"].duplicated()]["email"].tolist()
['a', 'b']
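If you really want the value_counts route from the question in a single line, the filtering step can be folded in with a callable indexer (a sketch reusing the toy dataframe from the question):

```python
import pandas as pd

df = pd.DataFrame({'email': ['a', 'b', 'c', 'a', 'b']})

# one line: count each email, then keep only those seen more than once
dupes = df['email'].value_counts().loc[lambda s: s > 1]
print(sorted(dupes.index.tolist()))  # → ['a', 'b']
```

The lambda passed to .loc receives the value_counts Series, so no temporary variable is needed.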

extract value from pandas dataframe [duplicate]

This question already has answers here:
Extract int from string in Pandas
(8 answers)
Closed 1 year ago.
Below is the dataframe
import pandas as pd
import numpy as np
d = {'col1': ['Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90',
'Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90']}
df = pd.DataFrame(data=d)
and I need to extract the "1621992600749" and "1621992600799" values.
I have tried it multiple ways, e.g. using the split function
new = df["col1"].str.split("||", n = 1, expand = True)
but it doesn't give the expected results. Any thoughts would be helpful.
You can use str.extract with a regex:
df['col1'].str.extract(r'(\d+)')
# output
               0
0  1621992600749
1  1621992600799
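The original split attempt fails because | is a regex metacharacter, so "||" is read as an empty alternation. A sketch of the same split approach with the delimiter escaped (the timestamp is the second field):

```python
import pandas as pd

d = {'col1': ['Get URI||1621992600749||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90',
              'Get URI||1621992600799||com.particlenews.newsbreak||https://graph.fb.com||2021-05-26 01:30:00||1.3.0-QA-1100||90']}
df = pd.DataFrame(data=d)

# escape the pipes (or pass regex=False in pandas >= 1.4); field 1 is the timestamp
timestamps = df['col1'].str.split(r'\|\|', expand=True)[1]
print(timestamps.tolist())  # → ['1621992600749', '1621992600799']
```

Unlike extract(r'(\d+)'), this keeps working even if a digit run appears earlier in the string.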

How can I concat multiple dataframes in Python? [duplicate]

This question already has answers here:
Append multiple pandas data frames at once
(5 answers)
How do I create variable variables?
(17 answers)
Closed 4 years ago.
I have multiple (more than 100) dataframes. How can I concat all of them?
The problem is that I have too many dataframes to write them into a list manually, like this:
>>> cluster_1 = pd.DataFrame([['a', 1], ['b', 2]],
... columns=['letter', 'number'])
>>> cluster_1
letter number
0 a 1
1 b 2
>>> cluster_2 = pd.DataFrame([['c', 3], ['d', 4]],
... columns=['letter', 'number'])
>>> cluster_2
letter number
0 c 3
1 d 4
>>> pd.concat([cluster_1, cluster_2])
letter number
0 a 1
1 b 2
0 c 3
1 d 4
The names of my N dataframes are cluster_1, cluster_2, cluster_3,..., cluster_N. The number N can be very high.
How can I concat N dataframes?
I think you can just put the dataframes into a list and then concat the list. Reading a file in chunks with pandas (e.g. read_csv with chunksize) hands you frames the same way, and I do this there as well.
pdList = [df1, df2, ...] # List of your dataframes
new_df = pd.concat(pdList)
To build pdList automatically, assuming your dataframe names always start with "cluster_":
pdList = []
pdList.extend(value for name, value in locals().items() if name.startswith('cluster_'))
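Scraping locals() works, but it is fragile: any unrelated variable whose name starts with cluster_ would slip in. If you control the code that creates the frames, a cleaner sketch is to collect them into a dict (or list) as they are built; the toy frames here are made up for illustration:

```python
import pandas as pd

# build the frames straight into a dict keyed by cluster number
clusters = {i: pd.DataFrame({'letter': ['a', 'b'], 'number': [i, i]})
            for i in range(1, 4)}

# concat accepts any iterable of dataframes
result = pd.concat(clusters.values(), ignore_index=True)
print(len(result))  # → 6
```

This scales to any N without ever naming cluster_1 … cluster_N individually.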
Generally it goes like:
frames = [df1, df2, df3]
result = pd.concat(frames)
Note: the original indexes are kept (as in the output above); pass ignore_index=True if you want them reset.
Read more details on different types of merging here.
For a large number of data frames:
If you have hundreds of data frames, then depending on whether they live on disk or in memory, you can still build the list ("frames" in the snippet above) with a for loop. If they are on disk, the easy route is to save all the dataframes into a single folder and then read every file from that folder.
If you are generating the dataframes in memory, maybe try saving them to .pkl files first.
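A minimal runnable sketch of the folder approach described above; the temp folder and the frame contents are made up for illustration:

```python
import glob
import os
import tempfile

import pandas as pd

# write each frame into a shared folder as it is produced
folder = tempfile.mkdtemp()
for i in range(1, 4):
    frame = pd.DataFrame({'letter': ['a'], 'number': [i]})
    frame.to_csv(os.path.join(folder, f'cluster_{i}.csv'), index=False)

# then read every file back and concatenate in one go
frames = [pd.read_csv(f) for f in sorted(glob.glob(os.path.join(folder, '*.csv')))]
result = pd.concat(frames, ignore_index=True)
print(result['number'].tolist())  # → [1, 2, 3]
```

Swap tempfile.mkdtemp() for your real output folder; the pattern is the same.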
Use:
pd.concat(<your list of dataframes>)
And if you want a regular index:
pd.concat(<your list of dataframes>, ignore_index=True)

Python/Pandas - Query a MultiIndex Column [duplicate]

This question already has answers here:
Select columns using pandas dataframe.query()
(5 answers)
Closed 4 years ago.
I'm trying to use query on a MultiIndex column. It works on a MultiIndex row, but not the column. Is there a reason for this? The documentation shows examples like the first one below, but it doesn't indicate that it won't work for a MultiIndex column.
I know there are other ways to do this, but I'm specifically trying to do it with the query function.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((4,4)))
df.index = pd.MultiIndex.from_product([[1,2],['A','B']])
df.index.names = ['RowInd1', 'RowInd2']
# This works
print(df.query('RowInd2 in ["A"]'))
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
df.columns.names = ['ColInd1', 'ColInd2']
# query on index works, but not on the multiindexed column
print(df.query('index < 2'))
print(df.query('ColInd2 in ["A"]'))
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
You can use query with ilevel_0 for the (row) index, and IndexSlice with loc for the columns:
df.query('ilevel_0>2')
Out[327]:
ColInd1         1                   2
ColInd2         A         B         A         B
3        0.652576  0.639522  0.520870  0.446931
df.loc[:,pd.IndexSlice[:,'A']]
Out[328]:
ColInd1         1         2
ColInd2         A         A
0        0.092394  0.427668
1        0.326748  0.383632
2        0.717328  0.354294
3        0.652576  0.520870
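As an alternative to IndexSlice, DataFrame.xs can select on a named column level directly; a sketch on a dataframe built the same way as in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((4, 4)))
df.columns = pd.MultiIndex.from_product([[1, 2], ['A', 'B']])
df.columns.names = ['ColInd1', 'ColInd2']

# take the 'A' slice of the ColInd2 level across all columns
sub = df.xs('A', axis=1, level='ColInd2')
print(sub.shape)  # → (4, 2)
```

By default xs drops the selected level, so the result's columns are just the ColInd1 values 1 and 2.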

Correct way to set value on a slice in pandas [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 6 years ago.
I have a pandas dataframe, data. It has the columns ['name', 'A', 'B'].
What I want to do (and works) is:
d2 = data[data['name'] == 'fred']  # This gives me multiple rows
d2['A'] = 0
This will set the column A on the fred rows to 0.
I've also done:
indexes = d2.index
data['A'][indexes] = 0
However, both give me the same warning:
/Users/brianp/work/cyan/venv/lib/python2.7/site-packages/pandas/core/indexing.py:128: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
How does pandas WANT me to do this?
This is a very common warning from pandas. It means you are writing to a copy of a slice, not the original data, so the assignment may never reach the original dataframe; this is the chained-assignment problem. Please read this post, which has a detailed discussion of SettingWithCopyWarning. In your case, try:
data.loc[data['name'] == 'fred', 'A'] = 0
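A minimal runnable sketch of that fix, with made-up data matching the column names in the question:

```python
import pandas as pd

data = pd.DataFrame({'name': ['fred', 'mary', 'fred'],
                     'A': [1, 2, 3],
                     'B': [4, 5, 6]})

# one .loc call selects rows and column together, so pandas writes in place
data.loc[data['name'] == 'fred', 'A'] = 0
print(data['A'].tolist())  # → [0, 2, 0]
```

Because the row mask and the column label appear in a single .loc indexer, there is no intermediate copy and no warning.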

Python data frames - how to select all columns that have a specific substring in their name [duplicate]

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 7 years ago.
In Python I have a data frame (df) that contains columns with the following names: A_OPEN, A_CLOSE, B_OPEN, B_CLOSE, C_OPEN, C_CLOSE, D_ etc.
How can I easily select only the columns that contain _CLOSE in their name? A, B, C, D, E, F etc. can have any value, so I do not want to use the specific column names.
In SQL this would be done with the LIKE operator: df[like'%_CLOSE%']
What's the Python way?
You could use a list comprehension, e.g.:
df[[x for x in df.columns if "_CLOSE" in x]]
Example:
df = pd.DataFrame(
    columns=['_CLOSE_A', '_CLOSE_B', 'C'],
    data=[[2, 3, 4], [3, 4, 5]]
)
Then,
>>> print(df[[x for x in df.columns if "_CLOSE" in x]])
   _CLOSE_A  _CLOSE_B
0         2         3
1         3         4
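pandas also has this built in: DataFrame.filter with like= does the substring match without a comprehension, which is the closest analogue to SQL's LIKE (a sketch on the same example frame):

```python
import pandas as pd

df = pd.DataFrame(columns=['_CLOSE_A', '_CLOSE_B', 'C'],
                  data=[[2, 3, 4], [3, 4, 5]])

# keep every column whose name contains '_CLOSE'
closes = df.filter(like='_CLOSE')
print(list(closes.columns))  # → ['_CLOSE_A', '_CLOSE_B']
```

filter also accepts regex= if you need a pattern rather than a plain substring.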
