How do I maintain my column pattern during a df split? - python

I need to maintain the pattern of a column ('Item Type') when I split my dataframe. For example, my data's 'Item Type' column follows the pattern: one 'Product', then x rows of 'SKU', followed by y rows of 'Rule'.
What I'm trying to achieve, for example: if I split after 10 rows, I still want to include the 11th row, since it is part of the same pattern. Any split that falls inside a pattern should include the whole pattern.
My current code:
import pandas as pd
import numpy as np

df = pd.read_csv("bracelet_no_variants.csv")
# chunk boundaries every 1000 rows, plus the end of the frame
l = [i * 1000 for i in range(len(df) // 1000 + 1)] + [len(df)]
for i in range(len(l) - 1):
    temp = df.iloc[l[i]:l[i + 1]]
    temp.to_csv('bracelet_no_variants_' + str(l[i + 1]) + '.csv')
Would I have to add an if/else statement maybe?

Here is a general solution: given a number of rows n, it finds the next row whose 'Item Type' is 'Product' and includes all rows up to (but not including) that row.
For example, given n=7:
n = 7
# everything after the first n rows
df_after = df.iloc[n:]
# index label of the next 'Product' row, i.e. the start of the next pattern
new_idx = df_after.loc[df_after['Item Type'] == 'Product'].index[0]
# keep everything up to, but not including, that row
res = df.loc[:new_idx].iloc[:-1]
Will give:
Item Type
1 Product
2 SKU
3 SKU
4 SKU
5 SKU
6 SKU
7 Rule
8 Rule
9 Rule
10 Rule
11 Rule
This code should work independently of the index values, i.e., the index can be anything as long as there are no duplicates.
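Putting the two pieces together, here is a minimal sketch of a chunked CSV export that respects the pattern boundary (the helper name split_on_pattern, the chunk size of 1000, and the output file naming are assumptions based on the code in the question):

import pandas as pd

def split_on_pattern(df, chunk_size):
    # Yield slices of at least chunk_size rows, each extended up to the next
    # 'Product' row so that a pattern is never cut in half.
    start = 0
    while start < len(df):
        end = start + chunk_size
        if end < len(df):
            rest = df.iloc[end:]
            next_products = rest.loc[rest['Item Type'] == 'Product'].index
            # extend the chunk up to (not including) the next 'Product' row
            end = df.index.get_loc(next_products[0]) if len(next_products) else len(df)
        else:
            end = len(df)
        yield df.iloc[start:end]
        start = end

df = pd.read_csv("bracelet_no_variants.csv")
for i, chunk in enumerate(split_on_pattern(df, 1000), start=1):
    chunk.to_csv(f"bracelet_no_variants_{i}.csv", index=False)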

Related

Search for substring with minimum characters match in pandas

I have 1st dataFrame with column 'X' as :
X
A468593-3
A697269-2
A561044-2
A239882 04
2nd dataFrame with column 'Y' as :
Y
000A561044
000A872220
I would like to match substrings between the two columns with a minimum number of characters (for example, 7 characters; only alphanumeric characters should be considered for matching, all special characters excluded).
so, my output DataFrame should be like this
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC, and assuming that every value of Y starts with three zeros, you can slice Y with [3:] to remove those leading zeros. Then you can join these values with |. Finally, you can build a mask using str.contains, which checks whether a series contains a given pattern (here something like 'A|B', i.e. whether a value contains 'A' or 'B'). This mask can then be used to filter the other data frame.
Code:
import pandas as pd

df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})

# strip the leading "000" from Y, join the values into an alternation
# pattern ("A561044|A872220"), and keep the rows of X containing any of them
pattern = "|".join(df2["Y"].str[3:])
mask = df1["X"].str.contains(pattern)
df1.loc[mask]
Output:
X
2 A561044-2
If you have values in Y that do not start with exactly three zeros, you can use this function instead to remove all leading numeric characters:
def remove_first_numerics(s):
    # drop every leading character that is a digit
    counter = 0
    while s[counter].isnumeric():
        counter += 1
    return s[counter:]

df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(lambda s: remove_first_numerics(s))
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
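If the goal is simply to strip leading digits, a vectorized alternative is pandas' built-in str.lstrip with a character set (a sketch; it gives the same output as above for these examples):

import pandas as pd

df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
# lstrip with a character set removes any of the listed characters from the left
df_test["A"].str.lstrip("0123456789")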

Count groups of consecutive values in pandas

I have a dataframe with 0s and 1s and I would like to count the groups of 1s (ignoring the 0s) with a Pandas solution (not itertools, not a Python iteration).
Other SO posts suggest methods based on shift()/diff()/cumsum(), which seem not to work when the leading sequence in the dataframe starts with 0.
df = pandas.Series([0,1,1,1,0,0,1,0,1,1,0,1,1]) # should give 4
df = pandas.Series([1,1,0,0,1,0,1,1,0,1,1]) # should also give 4
df = pandas.Series([1,1,1,1,1,0,1]) # should give 2
Any idea?
If you only have 0/1, you can use:
s = pd.Series([0,1,1,1,0,0,1,0,1,1,0,1,1])
count = s.diff().fillna(s).eq(1).sum()
output: 4 (and 4 and 2 for the other two examples)
The fillna ensures that a Series starting with 1 has its first group counted as well.
A faster alternative:
Use diff, count the 1s, and correct the result with the first item:
count = s.diff().eq(1).sum()+(s.iloc[0]==1)
Let us identify the different groups of 1s using cumsum, then use nunique to count the number of unique groups:
m = df.eq(0)
# the cumulative count of zeros labels each run of 1s with a distinct value;
# masking out the zeros and counting unique labels gives the number of groups
m.cumsum()[~m].nunique()
Result
case 1: 4
case 2: 4
case 3: 2
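For completeness, a quick sketch checking all three approaches against the sample Series from the question (expected counts: 4, 4, 2):

import pandas as pd

samples = [
    (pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]), 4),
    (pd.Series([1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1]), 4),
    (pd.Series([1, 1, 1, 1, 1, 0, 1]), 2),
]

for s, expected in samples:
    a = s.diff().fillna(s).eq(1).sum()            # diff + fillna
    b = s.diff().eq(1).sum() + (s.iloc[0] == 1)   # faster alternative
    m = s.eq(0)
    c = m.cumsum()[~m].nunique()                  # cumsum + nunique
    assert a == b == c == expected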

How to select top N columns in a dataframe with a criteria

Here is my dataframe; it has high dimensionality (a large number of columns), more than 10,000 columns.
The columns in my data are split into 3 categories
columns start with "Basic"
columns end with "_T"
and everything else
A sample of my dataframe is this:
RowID Basic1011 Basic2837 Lemon836 Car92_T Manf3953 Brat82 Basic383_T Jot112 ...
1 2 8 4 3 1 5 6 7
2 8 3 5 0 9 7 0 5
I want to keep all "Basic" & "_T" columns in my dataframe, and only the TOP N (the variable could be 3, 5, 10, 100, etc.) of the other columns.
I have this code that gives me the top N for all columns, but what I am looking for is the top N only for the columns that are not "Basic" or "_T".
By "top" I mean the greatest values.
Top = 20
df = df.where(df.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
How can I achieve that?
Step 1: You can use .filter() with regex to filter the columns with the following 2 conditions:
start with "Basic", or
end with "_T"
The regex used is r'(?:^Basic)|(?:_T$)' where:
(?: ) is a non-capturing group of regex. It is for a temporary grouping.
^ is the start of text anchor to indicate start position of text
Basic matches with the text Basic (together with ^, this Basic must be at the beginning of column label)
| is the regex meta-character for or
_T matches the text _T
$ is the end-of-text anchor to indicate the end position of the text (together with _T, the pattern _T$ requires _T at the end of the column label).
We name these columns as cols_Basic_T
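As a quick check, a small sketch with Python's re module on the sample column labels shows that the pattern keeps exactly the "Basic..." and "..._T" columns:

import re

pattern = re.compile(r'(?:^Basic)|(?:_T$)')
cols = ['RowID', 'Basic1011', 'Basic2837', 'Lemon836', 'Car92_T',
        'Manf3953', 'Brat82', 'Basic383_T', 'Jot112']
print([c for c in cols if pattern.search(c)])
# ['Basic1011', 'Basic2837', 'Car92_T', 'Basic383_T']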
Step 2: Then, use Index.difference() to find the other columns. We name these cols_others.
Step 3: Then, we apply code similar to what you used for the top N of all columns, but only to the selected columns cols_others.
Full code:
## Step 1
cols_Basic_T = df.filter(regex=r'(?:^Basic)|(?:_T$)').columns
## Step 2
cols_others = df.columns.difference(cols_Basic_T)
## Step 3
#Top = 20
Top = 3 # use a smaller Top for the small sample data here
df_others = df[cols_others].where(df[cols_others].apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
# To keep the original column sequence
df_others = df_others[df.columns.intersection(cols_others)]
Results:
cols_Basic_T
print(cols_Basic_T)
Index(['Basic1011', 'Basic2837', 'Car92_T', 'Basic383_T'], dtype='object')
cols_others
print(cols_others)
Index(['Brat82', 'Jot112', 'Lemon836', 'Manf3953', 'RowID'], dtype='object')
df_others
print(df_others)
## With Top 3 shown as non-zeros. Other non-Top3 masked as zeros
RowID Lemon836 Manf3953 Brat82 Jot112
0 0 4 0 5 7
1 0 0 9 7 5
Try something like this; you may have to play around with the column selection at the outset to be sure you're filtering correctly.
# this gives you column names with Basic or _T anywhere in the column name.
unwanted = df.filter(regex='Basic|_T').columns.tolist()
# the tilde takes the opposite of the criteria, so no Basic or _T
dfn = df[df.columns[~df.columns.isin(unwanted)]]
#apply your filter
Top = 2
df_ranked = dfn.where(dfn.apply(lambda x: x.eq(x.nlargest(Top)), axis=1), 0)
#then merge dfn with df_ranked
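Continuing from the snippet above, the merge left as a comment could look like this (a sketch that puts the untouched Basic/_T columns back next to the ranked ones and restores the original column order):

import pandas as pd

# recombine the Basic/_T columns with the top-N-masked others,
# then reorder to match the original dataframe
result = pd.concat([df[unwanted], df_ranked], axis=1)[df.columns]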

Pandas - how to filter dataframe by regex comparisons on multiple column values

I have a dataframe like the following, where everything is formatted as a string:
df
property value count
0 propAb True 10
1 propAA False 10
2 propAB blah 10
3 propBb 3 8
4 propBA 4 7
5 propCa 100 4
I am trying to find a way to filter the dataframe by applying a series of regex-style rules to both the property and value columns together.
For example, some sample rules may be like the following:
"if property starts with 'propA' and value is not 'True', drop the row".
Another rule may be something more mathematical, like:
"if property starts with 'propB' and value < 4, drop the row".
Is there a way to accomplish something like this without having to iterate over all rows each time for every rule I want to apply?
You still have to apply each rule (how else?), but let pandas handle the rows. Also, instead of removing the rows that you do not like, keep the rows that you do. Here's an example of how the first two rules can be applied:
rule1 = df.property.str.startswith('propA') & (df.value != 'True')
df = df[~rule1] # Keep everything that does NOT match
rule2 = df.property.str.startswith('propB') & (df.value < 4)
df = df[~rule2] # Keep everything that does NOT match
By the way, the second rule will not work because value is not a numeric column.
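If the numeric rules are needed, the value column can be converted first, for example with pd.to_numeric (a sketch, reusing df from the question; non-numeric strings such as 'True' or 'blah' become NaN and therefore never match a numeric comparison):

import pandas as pd

# convert the string column to numbers; non-numeric entries become NaN
value_num = pd.to_numeric(df.value, errors='coerce')
rule2 = df.property.str.startswith('propB') & (value_num < 4)
df = df[~rule2]  # keep everything that does NOT match rule 2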
For the first one:
df = df.drop(df[df.property.str.startswith('propA') & (df.value != 'True')].index)
and the other one (again, this only works once value has been converted to a numeric type):
df = df.drop(df[df.property.str.startswith('propB') & (df.value < 4)].index)

Grouping by everything except for one index column in pandas

My data analysis repeatedly falls back on a simple but iffy motif, namely "groupby everything except". Take this multi-index example, df:
accuracy velocity
name condition trial
john a 1 -1.403105 0.419850
2 -0.879487 0.141615
b 1 0.880945 1.951347
2 0.103741 0.015548
hans a 1 1.425816 2.556959
2 -0.117703 0.595807
b 1 -1.136137 0.001417
2 0.082444 -1.184703
What I want to do now, for instance, is averaging over all available trials while retaining info about names and conditions. This is easily achieved:
average = df.groupby(level=('name', 'condition')).mean()
Under real-world conditions, however, there's a lot more metadata stored in the multi-index. The index easily spans 8-10 columns per row. So the pattern above becomes quite unwieldy. Ultimately, I'm looking for a "discard" operation; I want to perform an operation that throws out or reduces a single index column. In the case above, that's trial number.
Should I just bite the bullet or is there a more idiomatic way of going about this? This might well be an anti-pattern! I want to build a decent intuition when it comes to the "true pandas way"... Thanks in advance.
You could define a helper-function for this:
def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
Demo:
import numpy as np
import pandas as pd

levels = ('name', 'condition', 'trial')
names = ('john', 'hans')
conditions = list('ab')
trials = range(1, 3)

idx = pd.MultiIndex.from_product(
    [names, conditions, trials], names=levels)
df = pd.DataFrame(np.random.randn(len(idx), 2),
                  index=idx, columns=('accuracy', 'velocity'))

def allbut(*names):
    names = set(names)
    return [item for item in levels if item not in names]
In [40]: df.groupby(level=allbut('condition')).mean()
Out[40]:
accuracy velocity
trial name
1 hans 0.086303 0.131395
john 0.454824 -0.259495
2 hans -0.234961 -0.626495
john 0.614730 -0.144183
You can remove more than one level too:
In [53]: df.groupby(level=allbut('name', 'trial')).mean()
Out[53]:
accuracy velocity
condition
a -0.597178 -0.370377
b -0.126996 -0.037003
In the documentation of groupby, there is an example of how to group by all but one specified column of a multiindex. It uses the .difference method of the index names:
df.groupby(level=df.index.names.difference(['name']))
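Applied to the example above (a sketch that drops the 'trial' level, as in the original question), this becomes:

# group by every index level except 'trial', then average over trials
average = df.groupby(level=df.index.names.difference(['trial'])).mean()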
