I have a dataframe as shown below. It has 3 columns named "TTN_163_2.5_-40", "TTN_163_2.7_-40" and "TTN_163_3.6_-40".
I need to select all columns whose name contains '2.5', '2.7' or '3.6'.
I also have some column names that contain 1.6, 1.62 and 1.656, and I need to select
these separately. When I write df_psrr_funct_1V6.filter(regex='1\.6|^xvalues$') I get all the columns corresponding to 1.6, 1.62 and 1.656, which I don't want. How can I select each one uniquely?
I tried df_psrr_funct = df_psrr_funct.filter(regex='2.5'), but it does not capture the first column (xvalues).
Sample dataframe
xvalues TTN_163_2.5_-40 TTN_163_2.7_-40 TTN_163_3.6_-40
23.0279 -58.7591 -58.5892 -60.0966
30.5284 -58.6903 -57.3153 -59.9111
How can I do this?
Expand the regex with | (or). ^ anchors the start of the string and $ the end, so ^xvalues$ extracts the column named exactly xvalues and avoids matching column names that merely contain the substring, like xvalues 1 or aaa xvalues:
df_psrr_funct = df_psrr_funct.filter(regex='2\.5|^xvalues$')
print (df_psrr_funct)
xvalues TTN_163_2.5_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
EDIT: If you need to match values between underscores, use:
print (df_psrr_funct)
xvalues TTN_163_1.6_-40 TTN_163_1.62_-40 TTN_163_1.656_-40
0 23.0279 -58.7591 -58.5892 -60.0966
1 30.5284 -58.6903 -57.3153 -59.9111
df_psrr_funct = df_psrr_funct.filter(regex='_1\.6_|^xvalues$')
print (df_psrr_funct)
xvalues TTN_163_1.6_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
Another approach (^\D+$ matches column names containing no digits, such as xvalues):
df_psrr_funct.filter(regex = '^\D+$|2.5')
xvalues TTN_163_2.5_-40
0 23.0279 -58.7591
1 30.5284 -58.6903
Using regex for this doesn't make much sense... just do
columns_with_2point5 = [c for c in df.columns if "2.5" in c]
only_cool_cols = df[['xvalues'] + columns_with_2point5]
Don't overcomplicate it...
If you don't need the first column, you can just use filter with like instead of one of the regex solutions (see the first comment from #BeRT2me).
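For reference, a minimal sketch of that like variant, on a toy frame shaped like the question's (like does a plain substring match, so the dot needs no escaping):

```python
import pandas as pd

# Toy frame mirroring the question's column names (values are illustrative)
df = pd.DataFrame({
    'xvalues': [23.0279, 30.5284],
    'TTN_163_2.5_-40': [-58.7591, -58.6903],
    'TTN_163_2.7_-40': [-58.5892, -57.3153],
})

# like= keeps every column whose name contains the substring '2.5';
# note this drops xvalues, hence "if you don't need the first column"
subset = df.filter(like='2.5')
print(subset.columns.tolist())  # ['TTN_163_2.5_-40']
```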
Related
How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract calls but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
                   "'FRH01';'29300'",
                   "'FRT02';'29310'",
                   "'FRH03';'29340'",
                   "'FRH05';'29350'",
                   "'FRG02';'29360'"],
                  columns=['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    # strip the surrounding single quotes to match the desired output
    p1.append(row[0].strip("'"))
    p2.append(row[1].strip("'"))

df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2
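As a side note, the same result can be had without an explicit loop by passing expand=True to str.split; a sketch on a subset of the question's data:

```python
import pandas as pd

df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'"], columns=['postcode'])

# expand=True returns a DataFrame with one column per split part
df2 = df['postcode'].str.split(';', expand=True)
df2.columns = ['postcode1', 'postcode2']
# strip the surrounding single quotes from every cell
df2 = df2.apply(lambda s: s.str.strip("'"))
print(df2)
```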
I am trying to concatenate rows column-wise in groups of 4.
I have 11 values: the first 4 values should form one concatenated row, rows 5 to 8 another, and the last 3 rows a final one, even though that last group has fewer than four values.
df_in = pd.DataFrame({'Column_IN': ['text 1','text 2','text 3','text 4','text 5','text 6','text 7','text 8','text 9','text 10','text 11']})
and my expected output is as follows
df_out = pd.DataFrame({'Column_OUT': ['text 1&text 2&text 3&text 4','text 5&text 6&text 7&text 8','text 9&text 10&text 11']})
I have tried to get my desired output df_out as below:
df_2 = df_in.iloc[:-7].agg('&'.join).to_frame()
What modification is required to get the expected output?
Try using groupby and agg:
>>> df_in.groupby(df_in.index // 4).agg('&'.join)
Column_IN
0 text 1&text 2&text 3&text 4
1 text 5&text 6&text 7&text 8
2 text 9&text 10&text 11
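The grouping key is just integer division of the (default Range) index; a quick sketch of how index // 4 buckets the rows:

```python
import pandas as pd

df_in = pd.DataFrame({'Column_IN': [f'text {i}' for i in range(1, 12)]})

# rows 0-3 map to group 0, rows 4-7 to group 1, rows 8-10 to group 2
keys = df_in.index // 4
print(keys.tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2]

df_out = df_in.groupby(keys).agg('&'.join)
print(df_out['Column_IN'].iloc[0])  # text 1&text 2&text 3&text 4
```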
I have a dataframe(df) like this:
company_urls
0 [https://www.linkedin.com/company/gulf-capital...
1 [https://www.linkedin.com/company/gulf-capital...
2 [https://www.linkedin.com/company/fajr-capital...
3 [https://www.linkedin.com/company/goldman-sach...
And df.company_urls[0] is
['https://www.linkedin.com/company/gulf-capital/about/',
'https://www.linkedin.com/company/the-abraaj-group/about/',
'https://www.linkedin.com/company/abu-dhabi-investment-company/about/',
'https://www.linkedin.com/company/national-bank-of-dubai/about/',
'https://www.linkedin.com/company/efg-hermes/about/']
So I have to create new columns like this:
company_urls company_url1 company_url2 company_url3 ...
0 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/the-abraaj-group/about/...
1 [https://www.linkedin.com/company/gulf-capital... https://www.linkedin.com/company/gulf-capital/about/ https://www.linkedin.com/company/gulf-related/about/...
2 [https://www.linkedin.com/company/fajr-capital... https://www.linkedin.com/company/fajr-capital/about/...
3 [https://www.linkedin.com/company/goldman-sach... https://www.linkedin.com/company/goldman-sachs/about/...
How do I do that?
I created this function for my own use, and I think it will work for your needs:
a) Specify the df name
b) Specify the column you want to split
c) Specify the delimiter
import numpy as np

def composition_split(dat, col, divider=','):  # set your delimiter here
    """
    Splits the column of interest depending on how many delimiters it has
    and creates all the columns needed for the split.
    """
    x1 = dat[col].astype(str).apply(lambda x: x.count(divider)).max()
    x2 = ["company_url_" + str(i) for i in np.arange(0, x1 + 1, 1)]
    dat[x2] = dat[col].str.split(divider, expand=True)
    return dat
Basically this will create as many columns as needed, depending on the delimiter you specify. For example, if a URL contains the delimiter 3 times, the split yields 4 parts, so 4 new columns are created.
your_new_df = composition_split(df,'col_to_split',',') # for example
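Note that if the cells already hold Python lists (as df.company_urls[0] in the question suggests) rather than delimiter-joined strings, str.split is not needed; a sketch of the list-column case, with made-up short URLs:

```python
import pandas as pd

# Toy data: each cell is a list of URLs, with differing lengths
df = pd.DataFrame({'company_urls': [
    ['https://example.com/a/about/', 'https://example.com/b/about/'],
    ['https://example.com/c/about/'],
]})

# One column per list position; shorter lists are padded with NaN
expanded = pd.DataFrame(df['company_urls'].tolist(), index=df.index)
expanded.columns = [f'company_url{i + 1}' for i in range(expanded.shape[1])]
df = df.join(expanded)
print(df.columns.tolist())  # ['company_urls', 'company_url1', 'company_url2']
```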
I want to populate one column with whichever of several strings is contained in another column (if any of them is contained in that column).
Right now I can do it by repeating the line of code for every different string, but I'm looking for a more efficient way. I have about a dozen strings in total.
df.loc[df['column1'].str.contains('g/mL'),'units'] = 'g/mL'
df.loc[df['column1'].str.contains('mPa.s'),'units'] = 'mPa.s'
df.loc[df['column1'].str.contains('mN/m'),'units'] = 'mN/m'
I don't know how to make it check
df.loc[df['column1'].str.contains('g/mL|mPa.s|mN/m'),'units'] = ...
and then set 'units' to whichever string is contained.
Use str.extract:
# example dataframe
df = pd.DataFrame({'column1':['this is test g/mL', 'this is test2 mPa.s', 'this is test3 mN/m']})
column1
0 this is test g/mL
1 this is test2 mPa.s
2 this is test3 mN/m
df['units'] = df['column1'].str.extract('(g/mL|mPa.s|mN/m)')
column1 units
0 this is test g/mL g/mL
1 this is test2 mPa.s mPa.s
2 this is test3 mN/m mN/m
Use a loop with str.contains:
L = ['g/mL', 'mPa.s', 'mN/m']
for val in L:
    df.loc[df['column1'].str.contains(val), 'units'] = val
Or use Series.str.extract with a list of all possible values:
L = ['g/mL', 'mPa.s', 'mN/m']
df['units'] = df['column1'].str.extract('(' + '|'.join(L) + ')')
Actually, according to the docs you can do exactly that using the regex=True parameter:
df.loc[df['column1'].str.contains('g/mL|mPa.s|mN/m', regex=True),'units'] = ...
Let's say there is column like below.
df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
                   'D-line E-station 8-min F-line G-station 5-min',
                   'G-line H-station 1-min I-station 6-min J-station 8-min'],
                  columns=['station'])
A, B, C are just arbitrary characters, and there are a whole bunch of rows like this.
station
0 A-line B-station 9-min C-station 3-min
1 D-line E-station 8-min F-line G-station 5-min
2 G-line H-station 1-min I-station 6-min J-stati...
How can I make columns like below?
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station null null null
1 D-line E-station null null F-line G-station
2 G-line H-station I-station J-station null null
StationX-Y means station (line number)-(order of station):
Station1-1 is the first station of the first line (Line1),
Station1-2 is the second station of the first line (Line1),
Station2-1 is the first station of the second line (Line2).
I tried to split by delimiter, but it doesn't work since every row has a different number of lines and stations.
What I probably need is to split columns based on the characters they contain. For example, I could store the first '-line' in Line1 and the first '-station' in Station1-1.
Does anybody have any ideas how to do this?
Any small thoughts help!
Thank you!
First create a Series with Series.str.split and DataFrame.stack:
s = df['station'].str.split(expand=True).stack()
Then remove the values ending with min by boolean indexing with Series.str.endswith:
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a','b'))
Then create counters for the lines and for the stations per row, with filtering and GroupBy.cumcount:
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                  .groupby(level=0)
                  .cumcount()
                  .add(1)
                  .astype(str))
df1['Line'] = df1['Line'].ffill()
df1['station'] = (df1[df1['data'].str.endswith('station')]
                     .groupby(['a', 'Line'])
                     .cumcount()
                     .add(1)
                     .astype(str))
Join them into one Series and replace the missing values (the line rows) with df1['Line'] using Series.fillna:
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])
Reshape with DataFrame.set_index and DataFrame.unstack:
df1 = (df1.set_index('station', append=True)['data']
          .reset_index(level=1, drop=True)
          .unstack())
Rename the column names last - renaming earlier would break the column sorting:
df1 = df1.rename(columns = lambda x: 'Station' + x if '-' in x else 'Line' + x)
Remove the columns and index names:
df1.columns.name = None
df1.index.name = None
print (df1)
Line1 Station1-1 Station1-2 Station1-3 Line2 Station2-1
0 A-line B-station C-station NaN NaN NaN
1 D-line E-station NaN NaN F-line G-station
2 G-line H-station I-station J-station NaN NaN
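Assembled end to end, the steps above run as one script and reproduce that output:

```python
import pandas as pd

df = pd.DataFrame(['A-line B-station 9-min C-station 3-min',
                   'D-line E-station 8-min F-line G-station 5-min',
                   'G-line H-station 1-min I-station 6-min J-station 8-min'],
                  columns=['station'])

# split into tokens and drop the '9-min'-style duration tokens
s = df['station'].str.split(expand=True).stack()
df1 = s[~s.str.endswith('min')].to_frame('data').rename_axis(('a', 'b'))

# number the lines within each row, then forward-fill onto their stations
df1['Line'] = (df1[df1['data'].str.endswith('line')]
                  .groupby(level=0)
                  .cumcount()
                  .add(1)
                  .astype(str))
df1['Line'] = df1['Line'].ffill()

# number the stations within each (row, line) pair
df1['station'] = (df1[df1['data'].str.endswith('station')]
                     .groupby(['a', 'Line'])
                     .cumcount()
                     .add(1)
                     .astype(str))
df1['station'] = (df1['Line'] + '-' + df1['station']).fillna(df1['Line'])

# pivot the counter labels into columns, then rename them
df1 = (df1.set_index('station', append=True)['data']
          .reset_index(level=1, drop=True)
          .unstack())
df1 = df1.rename(columns=lambda x: 'Station' + x if '-' in x else 'Line' + x)
df1.columns.name = None
print(df1)
```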