How to select top N columns in a dataframe with a criteria - python

Here is my dataframe, it has high dimensionality (big number of columns) more than 10000 columns
The columns in my data are split into 3 categories
columns start with "Basic"
columns end with "_T"
and everything else
a sample of my dataframe is this
RowID Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1 2 8 4 3 1 5 6 7
2 8 3 5 0 9 7 0 5
I want to have in my dataframe all "Basic" & "_T" columns and only TOP N (variable could be 3, 5, 10, 100, etc) of other columns
I have this code to give me top N for all columns. but what I am looking for just the top N for columns are not "Basic" or "_T"
and by Top I mean the greatest values
Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
How can I achieve that?

Step 1: You can use .filter() with regex to filter the columns with the following 2 conditions:
start with "Basic", or
end with "_T"
The regex used is r'(?:^Basic)|(?:_T$)' where:
(?: ) is a non-capturing group of regex. It is for a temporary grouping.
^ is the start of text anchor to indicate start position of text
Basic matches with the text Basic (together with ^, this Basic must be at the beginning of column label)
| is the regex meta-character for or
_T matches the text _T
$ is the end of text anchor to indicate end of text position (together with _T, _T$ indicate _T at the end of column name.
We name these columns as cols_Basic_T
Step 2: Then, use Index.difference() to find others columns. We name these other columns as cols_others.
Step 3: Then, we apply the similar code you used to give you top N for all columns on these selected columns col_others.
Full set of codes:
## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns
## Step 2
cols_others = df.columns.difference(cols_Basic_T)
## Step 3
#Top = 20
Top = 3 # use fewer columns here for smaller sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
# To keep the original column sequence
df_others = df_others[df.columns.intersection(cols_others)]
Results:
cols_Basic_T
print(cols_Basic_T)
Index(['Basic1011', 'Basic2837', 'Car92_T', 'Basic383_T'], dtype='object')
cols_others
print(cols_others)
Index(['Brat82', 'Jot112', 'Lemon836', 'Manf3953', 'RowID'], dtype='object')
df_others
print(df_others)
## With Top 3 shown as non-zeros. Other non-Top3 masked as zeros
RowID Lemon836 Manf3953 Brat82 Jot112
0 0 4 0 5 7
1 0 0 9 7 5

Try something like this, you may have to play around with column selection on outset to be sure you're filtering correctly.
# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()
# the tilda takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]
#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
#then merge dfn with df_ranked

Related

seach for substring with minimum characters match pandas

I have 1st dataFrame with column 'X' as :
X
A468593-3
A697269-2
A561044-2
A239882 04
2nd dataFrame with column 'Y' as :
Y
000A561044
000A872220
I would like to match the part of substrings from both columns with minimum no. of characters(example 7 chars only alphanumeric to be considered for matching. all special chars to be excluded).
so, my output DataFrame should be like this
X
A561044-2
Any possible solution would highly appreciate.
Thanks in advance.
IIUC and assuming that the first three values of Y start with 0, you can slice Y by [3:] to remove the first three zero values. Then, you can join these values by |. Finally, you can create your mask using contains that checks whether a series contains a specified value (in your case you would have something like 'A|B' and check whether a value contains 'A' or 'B'). Then, this mask can be used to query your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains(f'({"|".join(df2["Y"].str[3:])})')
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with three zeros, you can use this function to reduce your columns and remove all first n zeros.
def remove_first_numerics(s):
counter = 0
while s[counter].isnumeric():
counter +=1
return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object

I am looking for a partial sting in each row of my pandas Dataframe, I created a Dataframe where the rows are not all equal, so I cannot select by col

I read in a file and created a Dataframe from that file, the problem is that not all of the information that I read was separated properly and was not the same length. I have a df that has 1600 columns but I do not need them all I specifically need the information that is 3 columns to the left of a specific particular sting in one of the previous columns. For Example:
In the 1st row column number 1000, it has a value of ['HFOBR'] and then I need the column value that is 3 to the left.
In the 2nd row the column number with ['PQOBR'] might be 799 but I still need the value that is 3 to the left.
In the 3rd row the column number might be 400 with ['BBSOBR'] but I still need the lave 3 to the left.
And so on I really am trying to search each row for the partial sting OBR and then take the value of 3 to the left of it and put that value in a new df with a column of its own.
Here you will find a snapshot of the dataframe
Here you will see the code I used to create the dataframe in the first place where I read in an HL7 file and tried to convert it to a Dataframe, and each of the HL7 messages are not the same length whish is casing part of the problem I am having
message = []
parsed_msg = []
with open(filename) as msgs:
start = False
for line in msgs.readlines():
if line[:3] == 'MSH':
if start:
parsed_msg = hl7.parse_batch(msg)
#print(parsed_msg)
start = False
message += parsed_msg
msg = line
start = True
else:
msg += line
df = pd.DataFrame(message)
Sample data:
df = pd.DataFrame([["HFOBR", "foo", "a", "b", "c"], ["foo", "PQOBR", "a", "b", "c"]])
df
0 1 2 3 4
0 HFOBR foo a b c
1 foo PQOBR a b c
Define a function to find the value three columns to the left of the first column containing a string with "OBR"
import numpy as np
def find_left_value(row):
obr_col_idx = np.where(row.str.contains("OBR"))[0]
left_col_idx = obr_col_idx + 3
return row[left_col_idx].iloc[0]
Apply this function to your dataframe:
df['result'] = df.apply(find_left_value, axis=1)
Resulting dataframe:
0 1 2 3 4 result
0 HFOBR foo a b c b
1 foo PQOBR a b c c
FYI: making sample data like this that people can test answers on will help you 1) define your problem more clearly, and 2) get answers.

How do I maintain my column pattern during a df split?

I need to maintain the pattern of a column ('Item Type') when I split my dataframe. For example, this is my data:
What I'm trying to achieve is for example: if I split after 10 rows, then I want to still include the 11th row since it is part of the pattern. The pattern here is one 'Product', x number of 'SKU' followed by y number of 'Rule'. Any split inside this pattern should include the whole pattern.
My current code:
import pandas as pd
import numpy as np
df = pd.read_csv("bracelet_no_variants.csv")
l=[i*1000 for i in range(len(df)//1000+1)]+[len(df)]
for i in range(len(l)-1):
temp=df.iloc[l[i]:l[i+1]]
temp.to_csv('bracelet_no_variants_'+str(l[i+1])+'.csv')
Would I have to add an if/else statement maybe?
Here is a general solution that given a number of rows, will find the next row with 'Product' and then include all rows up-to that point.
For example, given n=7:
n = 7
df_after = df.iloc[n:]
new_idx = df_after.loc[df_after['Item Type'] == 'Product'].index[0]
res = df.loc[:new_idx].iloc[:-1]
Will give:
Item Type
1 Product
2 SKU
3 SKU
4 SKU
5 SKU
6 SKU
7 Rule
8 Rule
9 Rule
10 Rule
11 Rule
This code should work independently of the index values, i.e., the index can be anything as long as there are no duplicates.

Pandas - how to filter dataframe by regex comparisons on mutliple column values

I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
For the first one:
df = df.drop(df[(df.property.startswith('propA')) & (df.value is not True)].index)
and the other one:
df = df.drop(df[(df.property.startswith('propB')) & (df.value < 4)].index)

Python pandas an efficient way to see which column contains a value and use its coordinates as an offset

As part of trying to learn pandas I'm trying to reshape a spreadsheet. After removing non zero values I need to get some data from a single column.
For the sample columns below, I want to find the most effective way of finding the row and column index of the cell that contains the value date and get the value next to it. (e.g. here it would be 38477.
In practice this would be a much bigger DataFrame and the date row could change and it may not always be in the first column.
What is the best way to find out where date is in the array and return the value in the adjacent cell?
Thanks
<bound method DataFrame.head of 0 1 2 4 5 7 8 10 \
1 some title
2 date 38477
5 cat1 cat2 cat3 cat4
6 a b c d e f g
8 Z 167.9404 151.1389 346.197 434.3589 336.7873 80.52901 269.1486
9 X 220.683 56.0029 73.73679 428.8939 483.7445 251.1877 243.7918
10 C 433.0189 390.1931 251.6636 418.6703 12.21859 113.093 136.28
12 V 226.0135 418.1141 310.2038 153.9018 425.7491 73.08073 277.5065
13 W 295.146 173.2747 2.187459 401.6453 51.47293 175.387 397.2021
14 S 306.9325 157.2772 464.1394 216.248 478.3903 173.948 328.9304
15 A 19.86611 73.11554 320.078 199.7598 467.8272 234.0331 141.5544
This really just reformats a lot of the iteration you are doing to make it clearer and take advantage of pandas ability to easily select, etc.
First, we need a dummy dataframe (with date in the last row and explicitly ordered the way you have in your setup)
import pandas as pd
df = pd.DataFrame({"A": [1,2,3,4,np.NaN],
"B":[5, 3, np.NaN, 3, "date"],
"C":[np.NaN,2, 1,3, 634]})[["A","B","C"]]
A clear way to do it is to find the row and then enumerate over the row to find date:
row = df[df.apply(lambda x: (x == "date").any(), axis=1)].values[0] # will be an array
for i, val in enumerate(row):
if val == "date":
print row[i + 1]
break
If your spreadsheet only has a few non-numeric columns, you could go by column, check for date and get a row and column index (this may be faster because it searches by column rather than by row, though I'm not sure)
# gives you column labels, which are `True` if at least one entry has `date` in it
# have to check `kind` otherwise you get an error.
col_result = df.apply(lambda x: x.dtype.kind == "O" and (x == "date").any())
# select only columns where True (this should be one entry) and get their index (for the label)
column = col_result[col_result].index[0]
col_index = df.columns.get_loc(column)
# will be True if it contains date
row_selector = df.icol(col_index) == "date"
print df[row_selector].icol(col_index + 1).values

Categories

Resources