Boolean filtering using AND - Pandas - python

I just can't seem to get the proper output when filtering the df below using boolean operators. I want to remove rows where ID is <= 2 AND String is equal to A, B, or C. Instead, it seems to be removing strings that are not equal to A, B, or C.
import pandas as pd

df = pd.DataFrame({
    'String': ['A','F','B','C','D','A','X','C','B','D','A','Y','A','C','A','D','C','B'],
    'ID': [4,2,3,4,5,6,4,2,3,4,5,6,4,2,3,4,5,6],
})
df = df[~(df['ID'] <= 2) & (df['String'].isin(['A','B','C']))]
Intended output:
String ID
0 A 4
1 F 2
2 B 3
3 C 4
4 D 5
5 A 6
6 X 4
#7 C 2 Remove
8 B 3
9 D 4
10 A 5
11 Y 6
12 A 4
#13 C 2 Remove
14 A 3
15 D 4
16 C 5
17 B 6
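The culprit is operator precedence: the ~ negates only (df['ID'] <= 2), so the filter keeps rows where ID > 2 AND String is in {A, B, C}, which is exactly why the other strings vanish. A minimal sketch of a fix, reusing the df above, negates the combined condition instead:
# Build the removal condition first, then negate the whole mask:
mask = (df['ID'] <= 2) & (df['String'].isin(['A','B','C']))
df = df[~mask]

# Equivalent form via De Morgan's law:
# df = df[(df['ID'] > 2) | ~df['String'].isin(['A','B','C'])]
Either form drops only rows 7 and 13, matching the intended output.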

Related

keeping first column value .melt func

I want to use the DataFrame.melt function from the pandas library to reshape the data below from columns into rows while keeping the first column's value. I have also tried .pivot, but it does not work well. Please look at the example below and please help:
ID Alphabet Unspecified: 1 Unspecified: 2
0 1 A G L
1 2 B NaN NaN
2 3 C H NaN
3 4 D I M
4 5 E J NaN
5 6 F K O
Into this:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Try (assuming ID is unique and sorted):
df = (
    pd.melt(df, "ID")
    .sort_values("ID", kind="stable")
    .drop(columns="variable")
    .dropna()
    .reset_index(drop=True)
    .rename(columns={"value": "Alphabet"})
)
print(df)
Prints:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
Don't melt; stack instead. This directly drops the NaNs and keeps the order per row:
out = (df
    .set_index('ID')
    .stack().droplevel(1)
    .reset_index(name='Alphabet')
)
Output:
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
4 3 C
5 3 H
6 4 D
7 4 I
8 4 M
9 5 E
10 5 J
11 6 F
12 6 K
13 6 O
One option is with pivot_longer from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index='ID',
     names_to='Alphabet',
     names_pattern=['.+'],
     sort_by_appearance=True)
 .dropna()
)
ID Alphabet
0 1 A
1 1 G
2 1 L
3 2 B
6 3 C
7 3 H
9 4 D
10 4 I
11 4 M
12 5 E
13 5 J
15 6 F
16 6 K
17 6 O
In the code above, names_pattern accepts a list of regular expressions to match the desired columns; all the matches are collated into a single column, named Alphabet via names_to.
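The surviving rows keep their pre-drop positions as index labels; if you want the sequential 0..13 index from the target output, append a reset (a small presentational assumption about what you want):
out = (df
       .pivot_longer(index='ID',
                     names_to='Alphabet',
                     names_pattern=['.+'],
                     sort_by_appearance=True)
       .dropna()
       .reset_index(drop=True))  # renumber rows 0..n-1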

Construct a df such that every number within a range gets value 'A' assigned when knowing the start and end of the range values that belong to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
import numpy as np

pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: as commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
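A related alternative, sketched here assuming pandas 0.25+ where DataFrame.explode is available: build each row's range as a list, then explode to one row per number:
out = (
    df.assign(Number=[list(range(s, e + 1)) for s, e in zip(df['Start'], df['End'])])
      [['Name', 'Number']]
      .explode('Number')
      .reset_index(drop=True)
)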
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course, we can rename the columns and reset the index at the end for a nicer display.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12

Convert Outline format in CSV to Two Columns

I have data in a CSV file in the following format (one column in a dataframe). It is essentially like an outline in a Word document, where the letters shown here are the main headers and the numbered items are subheaders:
A
1
2
3
B
1
2
C
1
2
3
4
I want to convert this to the following format (two columns in a dataframe):
A 1
A 2
A 3
B 1
B 2
C 1
C 2
C 3
C 4
I'm using pandas read_csv to load the data into a dataframe, and I'm trying to reformat it with for loops, but I'm having difficulty because the data repeats and gets overwritten. For example, A 3 gets overwritten with C 3 later in the loop (resulting in two instances of C 3 when only one is desired, and losing A 3 altogether). What's the best way to do this?
Apologies for poor formatting, new to the site.
Use:
# if the CSV has no header, use the names parameter
df = pd.read_csv(file, names=['col'])
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
df = df[df['a'] != df['col']]
print(df)
a col
1 A 1
2 A 2
3 A 3
5 B 1
6 B 2
8 C 1
9 C 2
10 C 3
11 C 4
Details:
Check which values are numeric:
print(df['col'].str.isnumeric())
0 False
1 True
2 True
3 True
4 False
5 True
6 True
7 False
8 True
9 True
10 True
11 True
Name: col, dtype: bool
Replace the True positions with NaN using mask, then forward fill the missing values:
print(df['col'].mask(df['col'].str.isnumeric()).ffill())
0 A
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 C
10 C
11 C
Name: col, dtype: object
Add the new column at the first position with DataFrame.insert:
df.insert(0, 'a', df['col'].mask(df['col'].str.isnumeric()).ffill())
print(df)
a col
0 A A
1 A 1
2 A 2
3 A 3
4 B B
5 B 1
6 B 2
7 C C
8 C 1
9 C 2
10 C 3
11 C 4
And last, remove the rows where both columns hold the same value, by boolean indexing:
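In code, this is the same final step as in the solution above:
# Keep only rows where the filled header differs from the original value:
df = df[df['a'] != df['col']]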

Get all dataframes based on a certain value in a dataframe column

I have a DataFrame that looks something like this:
import numpy as np
import pandas as pd
df = pd.DataFrame([['d',5,6],['a',6,6],['index',5,8],['b',3,1],['b',5,6],['index',6,7],
                   ['e',2,3],['c',5,6],['index',5,8]], columns=['A','B','C'])
I want to select all the rows surrounding each 'index' row and create several dataframes.
I want to obtain them all as:
dataframe1:
A B C
1 a 6 6
2 index 5 8
3 b 3 1
dataframe 2
A B C
4 b 5 6
5 index 6 7
6 e 2 3
dataframe3:
A B C
7 c 5 6
8 index 5 8
9 4 3 1
dataframe4 :
A B C
11 5 2 3
12 index 4 2
13 1 2 5
index_list = df.index[df['A'] == 'index'].tolist()  # positions where df['A'] == 'index'
new_df = []  # empty list to collect the sub-dataframes
for i in index_list:
    new_df.append(df.iloc[i-1:i+2])  # row before, the 'index' row, and row after
This creates a list of dataframes. You can access them as new_df[0], new_df[1], and so on, or use a loop to print them out:
for i in range(len(new_df)):
print(f'{new_df[i]}\n')
A B C
1 a 6 6
2 index 5 8
3 b 3 1
A B C
4 b 5 6
5 index 6 7
6 e 2 3
A B C
7 c 5 6
8 index 5 8
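One caveat, as a sketch: if an 'index' row is the very first row of the frame, i - 1 becomes -1 and df.iloc[-1:i+2] returns an empty slice. Clamping the lower bound avoids that:
for i in index_list:
    # max(...) keeps the window inside the frame when 'index' is row 0
    new_df.append(df.iloc[max(i - 1, 0):i + 2])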

Filling missing data in df.loc filtered conditions?

I have the following problem with filling NaN in a filtered df.
Let's take this df:
condition value
0 A 1
1 B 8
2 B np.nan
3 A np.nan
4 C 3
5 C np.nan
6 A 2
7 B 5
8 C 4
9 A np.nan
10 B np.nan
11 C np.nan
How can I fill np.nan with the most recent value for the same condition, so that I get the following result?
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
I've failed with the following code (ValueError: Cannot index with multidimensional key):
conditions = set(df['condition'].tolist())
for c in conditions:
    filter = df.loc[df['condition'] == c]
    df.loc[filter, 'value'] = df.loc[filter, 'value'].fillna(method='ffill')
THX & BR from Vienna
If your values are actual NaN, you simply need to do a groupby on condition, and then call ffill (which is essentially a wrapper for fillna(method='ffill')):
df.groupby('condition').ffill()
Which returns:
condition value
0 A 1
1 B 8
2 B 8
3 A 1
4 C 3
5 C 3
6 A 2
7 B 5
8 C 4
9 A 2
10 B 5
11 C 4
If your values are literal strings that say np.nan, as in your example, then replace them first:
df.replace('np.nan', np.nan, inplace=True)
df.groupby('condition').ffill()
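One caveat, assuming a recent pandas version: groupby(...).ffill() no longer returns the grouping column, and fillna(method='ffill') is deprecated in favor of .ffill(). A sketch that keeps the frame's layout intact fills the value column per group and assigns it back:
import numpy as np

# Convert the literal 'np.nan' strings first, if present:
df = df.replace('np.nan', np.nan)

# Forward fill within each condition group, leaving the condition column in place:
df['value'] = df.groupby('condition')['value'].ffill()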
