I have a pandas DataFrame that contains two columns: trace numbers [col_1] and ID numbers [col_2]. Trace numbers can be duplicated, as can ID numbers; however, each trace and each ID should correspond to only one specific value in the adjacent column.
The two columns are the same length, but they have different unique-value counts, which should be the same, as shown below:
in[1]: Trace | ID
1 | 5054
2 | 8291
3 | 9323
4 | 9323
... |
100 | 8928
in[2]: print('unique traces: ', df['Trace'].nunique())
print('unique IDs: ', df['ID'].nunique())
out[2]: unique traces: 100
unique IDs: 99
In the code above, the same ID number (9323) is represented by two Trace numbers (3 & 4) - how can I isolate these instances? Thanks for looking!
By using the duplicated() function, you can do the following:
df[df['ID'].duplicated(keep=False)]
By setting keep to False, we get all the duplicates (instead of excluding the first or the last one).
Which returns:
Trace ID
2 3 9323
3 4 9323
You can use groupby and filter:
df.groupby('ID').filter(lambda x: x.Trace.nunique() > 1)
Output:
Trace ID
2 3 9323.0
3 4 9323.0
# This should show you the rows with a non-unique Trace or ID.
df.groupby('ID').filter(lambda x: len(x)>1)
Out[85]:
Trace ID
2 3 9323
3 4 9323
df.groupby('Trace').filter(lambda x: len(x)>1)
Out[86]:
Empty DataFrame
Columns: [Trace, ID]
Index: []
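To flag violations in either direction in a single pass, here is a small sketch (the frame is mine, built from the question's sample rows) using transform('nunique'):
import pandas as pd

df = pd.DataFrame({'Trace': [1, 2, 3, 4], 'ID': [5054, 8291, 9323, 9323]})

# True where one ID maps to several Traces, or one Trace to several IDs
mask = (df.groupby('ID')['Trace'].transform('nunique') > 1) | \
       (df.groupby('Trace')['ID'].transform('nunique') > 1)
print(df[mask])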
Related
I have a column in a dataset that contains strings and digits (Column 2). I need to extract the numbers with 10 or more digits into a new column (Column 3), as in the output below. Any idea how to do this?
| Column1 | Column2 |
|---------|---------|
| A | ghjy 123456677777 rttt 123.987 rtdggd |
| ABC | 90999888877 asrteg 12.98 tggff 12300004 |
| B | thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333 |
| DLT | 1.2897 thhhsskkkk 456633388899000022 |
| XYZ | tteerr 12.34 |
Expected output:
|Column3|
|-------|
|123456677777|
|90999888877|
|126666555533333|
|456633388899000022|
| |
I tried a few approaches (regex, lambda functions, apply, map), but each one takes the entire column as one string. I didn't want to split the strings, because the real dataset has so many words and digits in it.
You could try:
df['Column3'] = df['Column2'].str.extract(r'(\d{10,})')
print(df)
Column1 Column2 Column3
0 A ghjy 123456677777 rttt 123.987 rtdggd 123456677777
1 ABC 90999888877 asrteg 12.98 tggff 12300004 90999888877
2 B thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333 126666555533333
3 DLT 1.2897 thhhsskkkk 456633388899000022 456633388899000022
4 XYZ tteerr 12.34 NaN
To allow for multiple matches per string, you could do:
df['Column3'] = df['Column2'].str.findall(r'(\d{10,})').apply(', '.join)
Maybe this works (see the sketch after this list):
1. Take the values of Column 2
2. Split each value into tokens
3. Loop over the tokens
4. Check whether a token is numeric and its length is 10 or greater
5. Keep the token if the previous check is true
6. Assign the kept values to Column 3
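A minimal sketch of those steps, assuming whitespace-separated tokens; the helper name long_numbers is mine, and the frame mirrors the question's sample:
import pandas as pd

df = pd.DataFrame({
    'Column1': ['A', 'ABC', 'B', 'DLT', 'XYZ'],
    'Column2': ['ghjy 123456677777 rttt 123.987 rtdggd',
                '90999888877 asrteg 12.98 tggff 12300004',
                'thdhdjdj 123 jsjsjjsjl tehshshs 126666555533333',
                '1.2897 thhhsskkkk 456633388899000022',
                'tteerr 12.34'],
})

def long_numbers(text):
    # Keep purely numeric tokens with 10 or more digits
    return ', '.join(tok for tok in text.split() if tok.isdigit() and len(tok) >= 10)

df['Column3'] = df['Column2'].apply(long_numbers)
print(df[['Column1', 'Column3']])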
I have two columns in a larger dataframe that represent the ID of the record within my database and a hash of PII data that does not have to be unique. What I am trying to achieve is a window-like function that ranks each PII hash based on the ID in ascending order (See example below). However, I am running into an issue with the groupby().rank() method chain because these values are both strings. Is there some transformation that I would need to make to achieve this?
id | sha256_cpn | rank
2bce | 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb | 2
1bce | 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb | 1
3bce | 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb | 3
Here is the error:
DataError: No numeric types to aggregate
Here is my code:
# id = object
# sha256_cpn = object
df['rank'] = df.sort_values(['sha256_cpn', 'id']).groupby('sha256_cpn')['id'].rank(method="first")
Let's try groupby on sha256_cpn and transform id using Series.factorize:
df['rank'] = df.groupby('sha256_cpn')['id']\
.transform(lambda s: s.factorize(sort=True)[0] + 1)
Another approach with sort_values then groupby + cumcount:
df['rank'] = df.sort_values(['sha256_cpn', 'id'])\
.groupby('sha256_cpn').cumcount().add(1)
id sha256_cpn rank
0 2bce 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb 2
1 1bce 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb 1
2 3bce 1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb 3
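Putting it together with the sample rows from the question (a quick, self-contained check; both columns stay as strings):
import pandas as pd

sha = '1005a9eaf26b44bfd70b6430f1e86fd14add9b042d4383b6f6fcb6549e5360cb'
df = pd.DataFrame({'id': ['2bce', '1bce', '3bce'], 'sha256_cpn': [sha] * 3})

# factorize(sort=True) maps the sorted unique ids within each group to 0, 1, 2, ...
df['rank'] = df.groupby('sha256_cpn')['id'] \
               .transform(lambda s: s.factorize(sort=True)[0] + 1)
print(df)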
In a pandas DataFrame, I would like to find the index of the row whose value in a given column is closest to (but below) a specified value. Specifically, say I am given the number 40 and the DataFrame df:
| | x |
|---:|----:|
| 0 | 11 |
| 1 | 15 |
| 2 | 17 |
| 3 | 25 |
| 4 | 54 |
I want to find the index of the row such that df["x"] is lower but as close as possible to 40. Here, the answer would be 3, because df.loc[3, 'x'] = 25 is smaller than the given number 40 but closest to it.
My dataframe has other columns, but I can assume that the column "x" is increasing.
For an exact match, I did (correct me if there is a better method):
matches = df[df.x == number].index.tolist()
if matches:
    result = matches[0]
But for the general case, I do not know how to do it in a "vectorized" way.
Filter the rows below 40 with Series.lt in boolean indexing, then get the maximal index value with Series.idxmax:
a = df.loc[df['x'].lt(40), 'x'].idxmax()
print (a)
3
To improve performance, it is possible to use numpy.where with np.max; this solution works with the default index:
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
If the index is not the default RangeIndex:
df = pd.DataFrame({'x':[11,15,17,25,54]}, index=list('abcde'))
a = np.max(np.where(df['x'].lt(40))[0])
print (a)
3
print (df.index[a])
d
How about this:
import pandas as pd

data = {'x': [0, 1, 2, 3, 4, 20, 50]}
df = pd.DataFrame(data)
# get the rows that satisfy the condition (strictly below 40)
sub_df = df[df['x'] < 40]
# get the index of the maximum remaining value
idx = sub_df['x'].idxmax()
print(idx)
Use Series.where to mask values greater than or equal to n, then use Series.idxmax to obtain the closest one:
n=40
val = df['x'].where(df['x'].lt(n)).idxmax()
print(val)
3
We could also use Series.mask:
df['x'].mask(df['x'].ge(40)).idxmax()
Or a callable with loc[]:
df['x'].loc[lambda x: x.lt(40)].idxmax()
#alternative
#df.loc[lambda col: col['x'].lt(40),'x'].idxmax()
If the index is not the default RangeIndex:
i = df.loc[lambda col: col['x'].lt(40),'x'].reset_index(drop=True).idxmax()
df.index[i]
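Since the question notes that "x" is increasing, numpy.searchsorted is one more option (my addition, not part of the answers above); it finds the insertion point with a binary search:
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [11, 15, 17, 25, 54]})
n = 40

# Position where n would be inserted to keep x sorted; the row just before it
# holds the largest value strictly below n (assumes at least one such value exists).
pos = np.searchsorted(df['x'].to_numpy(), n, side='left') - 1
print(df.index[pos])  # 3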
I see how to rename columns, but I want to rename an index (row name) that I have in a data frame.
I had a table with 350 rows in it; I then added a total to the bottom and removed every row except that last one.
-------------------------------------------------
| | A | B | C |
-------------------------------------------------
| TOTAL | 1243 | 423 | 23 |
-------------------------------------------------
So I have the row called 'Total', and then several columns. I want to rename the word 'Total' to something else.
Is this even possible?
Many thanks
You could use a dictionary structure with rename(), for example,
In [1]: import pandas as pd
df = pd.Series([1, 2, 3])
df
Out[1]: 0 1
1 2
2 3
dtype: int64
In [2]: df.rename({1: 3, 2: 'total'})
Out[2]: 0 1
3 2
total 3
dtype: int64
If what you want is to set the name of the index itself (rather than an individual row label), it's as easy as this...
df.index.name = 'Name'
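For the asker's exact case, renaming the single 'TOTAL' row label, a short sketch with rename and its index keyword (the new label 'GRAND_TOTAL' is just an example):
import pandas as pd

df = pd.DataFrame({'A': [1243], 'B': [423], 'C': [23]}, index=['TOTAL'])
df = df.rename(index={'TOTAL': 'GRAND_TOTAL'})
print(df)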
I have a df with about 50 columns:
Product ID | Cat1 | Cat2 |Cat3 | ... other columns ...
8937456 0 5 10
8497534 25 3 0
8754392 4 15 7
Each Cat column holds how many units of that product fell into that category. Now I want to add a column "Category" denoting the majority category for each product (ignoring the other columns and only considering the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 |Cat3 | Category | ... other columns ...
8937456 0 5 10 3
8497534 25 3 0 1
8754392 4 15 7 2
I think I need to use max and apply or map?
I found these on Stack Overflow, but they don't address the category assignment. In Excel, I renamed the columns from Cat1 to 1 and used INDEX(MATCH(MAX)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max
Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, use the column names explicitly, then use idxmax, and finally replace the string 'Cat' with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax and pandas' idxmax both give us the ID of the max element along an axis.
If we know that the Cat columns sit at positions 1 through 3, we can slice the dataframe with df.iloc[:, 1:4] instead of df[['Cat1', 'Cat2', 'Cat3']].
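A quick sketch tying this together with the sample data from the question (the astype(int) cast is my addition, so the new column is numeric like the goal table):
import pandas as pd

df = pd.DataFrame({'Product ID': [8937456, 8497534, 8754392],
                   'Cat1': [0, 25, 4],
                   'Cat2': [5, 3, 15],
                   'Cat3': [10, 0, 7]})

# Label of the max Cat column per row, with 'Cat' stripped to leave the number
df['Category'] = df.iloc[:, 1:4].idxmax(axis=1).str.replace('Cat', '').astype(int)
print(df)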