Boolean filtering using AND - Pandas - python

I just can't seem to get the proper output when filtering the df below using boolean operators. I want to remove rows where ID is <= 2 AND String is equal to A, B, or C. Instead, it seems to be removing strings that are not equal to A, B, or C.
import pandas as pd

df = pd.DataFrame({
    'String': ['A','F','B','C','D','A','X','C','B','D','A','Y','A','C','A','D','C','B'],
    'ID': [4,2,3,4,5,6,4,2,3,4,5,6,4,2,3,4,5,6],
})
df = df[~(df['ID'] <= 2) & (df['String'].isin(['A','B','C']))]
Intended output:
String ID
0 A 4
1 F 2
2 B 3
3 C 4
4 D 5
5 A 6
6 X 4
#7 C 2 Remove
8 B 3
9 D 4
10 A 5
11 Y 6
12 A 4
#13 C 2 Remove
14 A 3
15 D 4
16 C 5
17 B 6
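The culprit is operator precedence: the ~ negates only (df['ID'] <= 2), so the filter keeps rows where ID > 2 AND String is in {A, B, C}, which is exactly why the other strings vanish. A minimal sketch of a fix, reusing the df above, negates the combined condition instead:
# Build the removal condition first, then negate the whole mask:
mask = (df['ID'] <= 2) & (df['String'].isin(['A','B','C']))
df = df[~mask]

# Equivalent form via De Morgan's law:
# df = df[(df['ID'] > 2) | ~df['String'].isin(['A','B','C'])]
Either form drops only rows 7 and 13, matching the intended output.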

Related

keeping first column value .melt func

I want to use the DataFrame.melt function from the pandas library to reshape the data below from columns into rows while keeping the first column's value. I have also tried .pivot, but it does not work well. Please look at the example below and please help:
ID Alphabet Unspecified: 1 Unspecified: 2
0 1 A G L
1 2 B NaN NaN
2 3 C H NaN
3 4 D I M
4 5 E J NaN
5 6 F K O
Into this:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Try (assuming ID is unique and sorted):
df = (
    pd.melt(df, "ID")
    .sort_values("ID", kind="stable")
    .drop(columns="variable")
    .dropna()
    .reset_index(drop=True)
    .rename(columns={"value": "Alphabet"})
)
print(df)
Prints:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Don't melt; stack instead. This directly drops the NaNs and keeps the order per row:
out = (df
    .set_index('ID')
    .stack().droplevel(1)
    .reset_index(name='Alphabet')
)
Output:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index='ID',
     names_to='Alphabet',
     names_pattern=['.+'],
     sort_by_appearance=True)
 .dropna()
)
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
6 3 C
7 3 H
9 4 D
10 4 I
11 4 M
12 5 E
13 5 J
15 6 F
16 6 K
17 6 O
In the code above, names_pattern accepts a list of regular expressions to match the desired columns; all the matches are collated into a single column, named Alphabet via names_to.
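The surviving rows keep their pre-drop positions as index labels; if you want the sequential 0..13 index from the target output, append a reset (a small presentational assumption about what you want):
out = (df
       .pivot_longer(index='ID',
                     names_to='Alphabet',
                     names_pattern=['.+'],
                     sort_by_appearance=True)
       .dropna()
       .reset_index(drop=True))  # renumber rows 0..n-1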

Construct a df such that every number within a range gets value 'A' assigned when knowing the start and end of the range values that belong to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
import numpy as np

pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: as commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
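A related alternative, sketched here assuming pandas 0.25+ where DataFrame.explode is available: build each row's range as a list, then explode to one row per number:
out = (
    df.assign(Number=[list(range(s, e + 1)) for s, e in zip(df['Start'], df['End'])])
      [['Name', 'Number']]
      .explode('Number')
      .reset_index(drop=True)
)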
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course, we can rename the columns and reset the index at the end for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12

Convert Outline format in CSV to Two Columns

I have data in a CSV file in the following format (one column in a dataframe). It is essentially like an outline in a Word document, where the letters shown here are the main headers and the numbered items are subheaders:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to load the data into a dataframe, and I'm trying to reformat it with for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten with C 3 later in the loop (resulting in two instances of C 3 when only one is desired, and losing A 3 altogether). What's the best way to do this?
Apologies for poor formatting, new to the site.
Use:
# if the CSV has no header, use the names parameter
df = pd.read_csv(file, names=['col'])
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
df = df[df['a'] != df['col']]
print(df)
a col
1 A 1
2 A 2
3 A 3
5 B 1
6 B 2
8 C 1
9 C 2
10 C 3
11 C 4
Details:
Check which values are numeric:
print(df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Replace the True positions with NaN using mask, then forward fill the missing values:
print(df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add the new column at the first position with DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print(df)
a col
0 A A
1 A 1
2 A 2
3 A 3
4 B B
5 B 1
6 B 2
7 C C
8 C 1
9 C 2
10 C 3
11 C 4
And last, remove the rows where both columns hold the same value, by boolean indexing:
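In code, this is the same final step as in the solution above:
# Keep only rows where the filled header differs from the original value:
df = df[df['a'] != df['col']]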

Get all dataframes based on a certain value in a dataframe column

I have a DataFrame that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([['d',5,6],['a',6,6],['index',5,8],['b',3,1],['b',5,6],['index',6,7],
                   ['e',2,3],['c',5,6],['index',5,8]], columns=['A','B','C'])
I want to select all the rows surrounding each 'index' row and create several dataframes.
I want to obtain them all as:
dataframe1:
A B C
1 a 6 6
2 index 5 8
3 b 3 1
dataframe 2
A B C
4 b 5 6
5 index 6 7
6 e 2 3
dataframe3:
A B C
7 c 5 6
8 index 5 8
9 4 3 1
dataframe4 :
A B C
11 5 2 3
12 index 4 2
13 1 2 5
index_list = df.index[df['A'] == 'index'].tolist()  # positions where df['A'] == 'index'
new_df = []  # empty list to collect the sub-dataframes
for i in index_list:
    new_df.append(df.iloc[i-1:i+2])  # row before, the 'index' row, and row after
This creates a list of dataframes. You can access them as new_df[0], new_df[1], and so on, or use a loop to print them out:
for i in range(len(new_df)):
print(f'{new_df[i]}\n')
A B C
1 a 6 6
2 index 5 8
3 b 3 1
A B C
4 b 5 6
5 index 6 7
6 e 2 3
A B C
7 c 5 6
8 index 5 8
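One caveat, as a sketch: if an 'index' row is the very first row of the frame, i - 1 becomes -1 and df.iloc[-1:i+2] returns an empty slice. Clamping the lower bound avoids that:
for i in index_list:
    # max(...) keeps the window inside the frame when 'index' is row 0
    new_df.append(df.iloc[max(i - 1, 0):i + 2])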

Filling missing data in df.loc filtered conditions?

I have the following problem with filling NaN in a filtered df.
Let's take this df:
condition value
0 A 1
1 B 8
2 B np.nan
3 A np.nan
4 C 3
5 C np.nan
6 A 2
7 B 5
8 C 4
9 A np.nan
10 B np.nan
11 C np.nan
How can I fill np.nan with the most recent value for the same condition, so that I get the following result?
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
I've failed with the following code (ValueError: Cannot index with multidimensional key):
conditions = set(df['condition'].tolist())
for c in conditions:
    filter = df.loc[df['condition'] == c]
    df.loc[filter, 'value'] = df.loc[filter, 'value'].fillna(method='ffill')
THX & BR from Vienna
If your values are actual NaN, you simply need to do a groupby on condition, and then call ffill (which is essentially a wrapper for fillna(method='ffill')):
df.groupby('condition').ffill()
Which returns:
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
If your values are literal strings that say np.nan, as in your example, then replace them first:
df.replace('np.nan', np.nan, inplace=True)
df.groupby('condition').ffill()
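One caveat, assuming a recent pandas version: groupby(...).ffill() no longer returns the grouping column, and fillna(method='ffill') is deprecated in favor of .ffill(). A sketch that keeps the frame's layout intact fills the value column per group and assigns it back:
import numpy as np

# Convert the literal 'np.nan' strings first, if present:
df = df.replace('np.nan', np.nan)

# Forward fill within each condition group, leaving the condition column in place:
df['value'] = df.groupby('condition')['value'].ffill()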
