Separate rows in a dataframe based on custom value? - python

I have a df with two columns a and b.
import pandas as pd
raw_data = {'a': ['2019145236792', 'abc_def date_1220', '2020124832852', 'jhi_abc this_1219_abc'],
'b': ['tom','john','mark','jim']}
df = pd.DataFrame(raw_data, columns=['a', 'b'])
df
a b
0 2019145236792 tom
1 abc_def date_1220 john
2 2020124832852 mark
3 jhi_abc this_1219_abc jim
I want to separate only the rows whose value contains 20. The position of 20 won't change.
e.g. 2020124832852 and abc_def date_1220
Expected output:
a b
0 abc_def date_1220 john
1 2020124832852 mark

Use boolean indexing: compare with Series.eq after positional indexing via .str, chained by | (bitwise OR) with a second mask that uses Series.str.extract to get the values after date_:
m1 = df['a'].str[2:4].eq('20')
m2 = df['a'].str.extract('date_(.*)', expand=False).str[2:4].eq('20')
df = df[m1 | m2]
print (df)
a b
1 abc_def date_1220 john
2 2020124832852 mark
EDIT:
m2 = df['a'].str.split('_', n=2).str[2].str[2:4].eq('20')
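For reference, a minimal sketch of what this split-based mask evaluates to on the sample df from the question:
import pandas as pd

df = pd.DataFrame({'a': ['2019145236792', 'abc_def date_1220',
                         '2020124832852', 'jhi_abc this_1219_abc'],
                   'b': ['tom', 'john', 'mark', 'jim']})
# split on '_' at most twice, take the third piece, then look at characters 2:4
parts = df['a'].str.split('_', n=2).str[2]   # '1220' for 'abc_def date_1220'; NaN where there is no third piece
m2 = parts.str[2:4].eq('20')
print(m2.tolist())                           # [False, True, False, False]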

You can use a list comprehension to get the wanted rows, but you have to specify the required positions:
import re
req_pos = {2, 15}
df[[any(e.start() in req_pos for e in re.finditer('20', s)) for s in df.a]]
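To see which start positions of '20' each sample value actually has (and hence why req_pos = {2, 15} picks the two wanted rows), a quick check:
import re

for s in df.a:
    print(s, [e.start() for e in re.finditer('20', s)])
# 2019145236792          [0]
# abc_def date_1220      [15]
# 2020124832852          [0, 2]
# jhi_abc this_1219_abc  []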

Related

Python: sample from dataframe, storing the non-sampled

I have a pandas DataFrame.
Say I want to sample two persons from each group; I use the following code to get a new dataframe:
sample_df = df.groupby("category").apply(lambda group_df: group_df.sample(2, random_state=1234))
I would like to create a dataframe where the non-sampled persons are stored.
The sample_df still has the indices of the original df so I probably have to do something with that, but I'm not sure what...
Thanks in advance!
First, add group_keys=False to groupby to avoid category being added to a MultiIndex:
df = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'category':list('aaabbb')
})
sample_df = (df.groupby("category", group_keys=False)
.apply(lambda group_df: group_df.sample(2, random_state=1234)))
print(sample_df)
A B category
0 a 4 a
1 b 5 a
3 d 5 b
4 e 5 b
Then filter the original index values with boolean indexing using Index.isin and an inverted mask via ~:
non_sample_df = df[~df.index.isin(sample_df.index)]
print(non_sample_df)
A B category
2 c 4 a
5 f 4 b
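An equivalent one-liner, assuming the sampled index labels are unique in df, is to drop them directly:
non_sample_df = df.drop(sample_df.index)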

Accessing a Non-Numerical Index in a DataFrame [duplicate]

I'm simply trying to access named pandas columns by an integer.
You can select a row by location using df.ix[3].
But how do I select a column by integer?
My dataframe:
df=pandas.DataFrame({'a':np.random.rand(5), 'b':np.random.rand(5)})
Two approaches that come to mind:
>>> df
A B C D
0 0.424634 1.716633 0.282734 2.086944
1 -1.325816 2.056277 2.583704 -0.776403
2 1.457809 -0.407279 -1.560583 -1.316246
3 -0.757134 -1.321025 1.325853 -2.513373
4 1.366180 -1.265185 -2.184617 0.881514
>>> df.iloc[:, 2]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
>>> df[df.columns[2]]
0 0.282734
1 2.583704
2 -1.560583
3 1.325853
4 -2.184617
Name: C
Edit: The original answer suggested the use of df.ix[:,2] but this function is now deprecated. Users should switch to df.iloc[:,2].
You can also use df.icol(n) to access a column by integer.
Update: icol is deprecated and the same functionality can be achieved by:
df.iloc[:, n] # to access the column at the nth position
You can use label-based slicing with .loc or position-based slicing with .iloc to do column slicing, including column ranges:
In [50]: import pandas as pd
In [51]: import numpy as np
In [52]: df = pd.DataFrame(np.random.rand(4,4), columns = list('abcd'))
In [53]: df
Out[53]:
a b c d
0 0.806811 0.187630 0.978159 0.317261
1 0.738792 0.862661 0.580592 0.010177
2 0.224633 0.342579 0.214512 0.375147
3 0.875262 0.151867 0.071244 0.893735
In [54]: df.loc[:, ["a", "b", "d"]] ### Selective columns based slicing
Out[54]:
a b d
0 0.806811 0.187630 0.317261
1 0.738792 0.862661 0.010177
2 0.224633 0.342579 0.375147
3 0.875262 0.151867 0.893735
In [55]: df.loc[:, "a":"c"] ### Selective label based column ranges slicing
Out[55]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
In [56]: df.iloc[:, 0:3] ### Selective index based column ranges slicing
Out[56]:
a b c
0 0.806811 0.187630 0.978159
1 0.738792 0.862661 0.580592
2 0.224633 0.342579 0.214512
3 0.875262 0.151867 0.071244
You can access multiple columns by passing a list of column indices to dataFrame.ix.
For example:
>>> df = pandas.DataFrame({
'a': np.random.rand(5),
'b': np.random.rand(5),
'c': np.random.rand(5),
'd': np.random.rand(5)
})
>>> df
a b c d
0 0.705718 0.414073 0.007040 0.889579
1 0.198005 0.520747 0.827818 0.366271
2 0.974552 0.667484 0.056246 0.524306
3 0.512126 0.775926 0.837896 0.955200
4 0.793203 0.686405 0.401596 0.544421
>>> df.ix[:,[1,3]]
b d
0 0.414073 0.889579
1 0.520747 0.366271
2 0.667484 0.524306
3 0.775926 0.955200
4 0.686405 0.544421
The method .transpose() converts columns to rows and rows to columns, hence you could even write
df.transpose().ix[3]
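Note that .ix has since been removed from pandas (deprecated in 0.20, removed in 1.0), so in current versions the equivalents would roughly be:
df.iloc[:, [1, 3]]        # positional column selection instead of df.ix[:, [1, 3]]
df.transpose().iloc[3]    # instead of df.transpose().ix[3]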
Most people have answered how to take columns starting from an index. But there might be scenarios where you need to pick columns from in between, or at specific indices, in which case you can use the solution below.
Say that you have columns A, B and C. If you need to select only columns A and C, you can use the code below.
df = df.iloc[:, [0,2]]
where 0, 2 specifies that you want only the 1st and 3rd columns.
You can use the method take. For example, to select first and last columns:
df.take([0, -1], axis=1)

How to check not in on multiple dataframes pandas?

I have the following dataframes.
import pandas as pd
d={'P':['A','B','C'],
'Q':[5,6,7]
}
df=pd.DataFrame(data=d)
print(df)
d={'P':['A','C','D'],
'Q':[5,7,8]
}
df1=pd.DataFrame(data=d)
print(df1)
d={'P':['B','E','F'],
'Q':[5,7,8]
}
df3=pd.DataFrame(data=d)
print(df3)
The code to check which values of one dataframe's column are not present in another is:
df.loc[~df['P'].isin(df1['P'])]
How can I check the same against multiple dataframes?
How can I find the values of column P in df3 that are not in column P of df or df1?
Expected Output:
P Q
0 E 7
1 F 8
You can chain 2 conditions with & for bitwise AND:
cond1 = ~df3['P'].isin(df1['P'])
cond2 = ~df3['P'].isin(df['P'])
df = df3.loc[cond1 & cond2]
print (df)
P Q
1 E 7
2 F 8
Or combine the values of both columns, either with numpy.concatenate or by joining lists with +:
import numpy as np

df = df3.loc[~df3['P'].isin(np.concatenate([df1['P'], df['P']]))]
#another solution
#df = df3.loc[~df3['P'].isin(df1['P'].tolist() + df['P'].tolist())]
What about this? (However, jezrael has already given an expert answer :))
You can simply define the conditions, and then combine them logically, like:
con1 = df3['P'].isin(df['P'])
con2 = df3['P'].isin(df1['P'])
df = df3[~ (con1 | con2)]
>>> df
P Q
1 E 7
2 F 8

Subsetting DataFrame based on column names of another DataFrame

I have two DataFrames and I want to subset df2 based on the column names that intersect with the column names of df1. In R this is easy.
R code:
df1 <- data.frame(a=rnorm(5), b=rnorm(5))
df2 <- data.frame(a=rnorm(5), b=rnorm(5), c=rnorm(5))
df2[names(df2) %in% names(df1)]
a b
1 -0.8173361 0.6450052
2 -0.8046676 0.6441492
3 -0.3545996 -1.6545289
4 1.3364769 -0.4340254
5 -0.6013046 1.6118360
However, I'm not sure how to do this in pandas.
pandas attempt:
df1 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,))})
df2 = pd.DataFrame({'a': np.random.standard_normal((5,)), 'b': np.random.standard_normal((5,)), 'c': np.random.standard_normal((5,))})
df2[df2.columns in df1.columns]
This results in TypeError: unhashable type: 'Index'. What's the right way to do this?
If you need a true intersection, since .columns yields an Index object which supports basic set operations, you can use &, e.g.
df2[df1.columns & df2.columns]
or equivalently with Index.intersection, which is preferred in recent pandas versions where the & set operation on Index objects has been deprecated:
df2[df1.columns.intersection(df2.columns)]
However if you are guaranteed that df1 is just a column subset of df2 you can directly use
df2[df1.columns]
or if assigning,
df2.loc[:, df1.columns]
Demo
>>> df2[df1.columns & df2.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
>>> df2.loc[:, df1.columns]
a b
0 1.952230 -0.641574
1 0.804606 -1.509773
2 -0.360106 0.939992
3 0.471858 -0.025248
4 -0.663493 2.031343
The equivalent would be:
df2[df1.columns.intersection(df2.columns)]
Out:
a b
0 -0.019703 0.379820
1 0.040658 0.243309
2 1.103032 0.066454
3 -0.921378 1.016017
4 0.188666 -0.626612
With this, you will not get a KeyError if a column in df1 does not exist in df2.
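A minimal sketch of that difference (the column names here are made up for illustration): if df1 had a column that df2 lacks, plain label selection would raise a KeyError, while the intersection simply ignores it:
import pandas as pd

df1 = pd.DataFrame({'a': [1], 'b': [2], 'z': [3]})   # 'z' does not exist in df2
df2 = pd.DataFrame({'a': [4], 'b': [5], 'c': [6]})

# df2[df1.columns]                                   # KeyError because of 'z'
print(df2[df1.columns.intersection(df2.columns)])    # selects only 'a' and 'b'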

Pandas merge creates unwanted duplicate entries

I'm new to Pandas and I want to merge two datasets that have similar columns. Each column will have some values not found in the other, in addition to many identical values. There are some duplicates in each column that I'd like to keep. My desired output is shown below. Adding how='inner' or 'outer' does not yield the desired result.
import pandas as pd
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
print(pd.merge(df1,df2))
output:
A
0 2
1 2
2 2
3 2
4 3
5 4
6 5
desired/expected output:
A
0 2
1 2
2 3
3 4
4 5
Please let me know how/if I can achieve the desired output using merge, thank you!
EDIT
To clarify why I'm confused about this behavior: if I simply add another column, it doesn't produce four 2's but only two, so I would expect my first example to also have just the two 2's. Why does the behavior seem to change? What is pandas doing?
import pandas as pd
df1 = df2 = pd.DataFrame(
{'A': [2,2,3,4,5], 'B': ['red','orange','yellow','green','blue']}
)
print(pd.merge(df1,df2))
output:
A B
0 2 red
1 2 orange
2 3 yellow
3 4 green
4 5 blue
However, based on the first example I would expect:
A B
0 2 red
1 2 orange
2 2 red
3 2 orange
4 3 yellow
5 4 green
6 5 blue
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1).reset_index()
df2 = pd.DataFrame(dict2).reset_index()
df = df1.merge(df2, on = 'A')
df = pd.DataFrame(df[df.index_x==df.index_y]['A'], columns=['A']).reset_index(drop=True)
print(df)
Output:
A
0 2
1 2
2 3
3 4
4 5
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df = df1.merge(df2).drop('index', axis=1)
The idea is to merge based on the matching indices as well as the matching 'A' column values.
Previously, because merge combines every pair of matching rows, the first 2 in df1 was matched to both the first and second 2 in df2, and the second 2 in df1 was likewise matched to both.
If you try this, you will see what I am talking about.
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1['index'] = [i for i in range(len(df1))]
df2 = pd.DataFrame(dict2)
df2['index'] = [i for i in range(len(df2))]
df1.merge(df2, on = 'A')
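The result of that merge would look roughly like this (output sketched for illustration), showing how each 2 in df1 matches both 2s in df2:
   A  index_x  index_y
0  2        0        0
1  2        0        1
2  2        1        0
3  2        1        1
4  3        2        2
5  4        3        3
6  5        4        4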
Did you try df.drop_duplicates()?
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df=pd.merge(df1,df2)
df_new=df.drop_duplicates()
print(df)
print(df_new)
It seems to give the results that you want.
The duplicates are caused by duplicate entries in the target table's columns you're joining on (df2['A']). We can remove duplicates while making the join without permanently altering df2:
df1 = df2 = pd.DataFrame({'A': [2,2,3,4,5]})
join_cols = ['A']
merged = pd.merge(df1, df2[df2.duplicated(subset=join_cols, keep='first') == False], on=join_cols)
Note that we define join_cols so that the columns we join on and the columns we use to detect duplicates are the same.
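Printing the result should then (assuming the setup above) give exactly the output the question asks for:
print(merged)
   A
0  2
1  2
2  3
3  4
4  5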
I unfortunately stumbled upon a similar problem, which I see is now old.
I solved it by using drop_duplicates in a different way: applying it to the two original tables, even though there were no duplicates in them. This is an example (I apologize, I am not a professional programmer):
import pandas as pd
dict1 = {'A':[2,2,3,4,5]}
dict2 = {'A':[2,2,3,4,5]}
df1 = pd.DataFrame(dict1)
df1=df1.drop_duplicates()
df2 = pd.DataFrame(dict2)
df2=df2.drop_duplicates()
df=pd.merge(df1,df2)
print('df1:')
print( df1 )
print('df2:')
print( df2 )
print('df:')
print( df )
