We have the following DataFrame:
# raw_df
print(raw_df.to_dict())
{'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'}, 'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'}}
We are trying to split these two columns into four columns. The Edge column should split after the first %, and the Grade column should split before the second capital letter. The output should look like:
output_df
edge_1 edge_2 grade_1 grade_2
-1.9% -2.2% D+ D
+5.8% -9.4% B F
+3.5% -7.2% B- F
We have raw_df[['t1_grade', 't2_grade']] = raw_df['Grade'].str.extractall(r'([A-Z])').unstack() to split the Grade column; however, the + and - are dropped there, which is a problem. We are also not sure how to split the Edge column after the first % appears.
We can use str.extract as follows:
df["edge_1"] = df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)')
df["edge_2"] = df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$')
df["grade_1"] = df["Grade"].str.extract(r'^([A-Z][+-]?)')
df["grade_2"] = df["Grade"].str.extract(r'([A-Z][+-]?)$')
The strategy here is to extract the first/last percentage/grade from the two current columns using regex.
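Applied to the sample data, this yields the desired output. A quick runnable sketch, rebuilding raw_df from the dict shown at the top:

```python
import pandas as pd

raw_df = pd.DataFrame({
    'Edge': {1: '-1.9%-2.2%', 2: '+5.8%-9.4%', 3: '+3.5%-7.2%'},
    'Grade': {1: 'D+D', 2: 'BF', 3: 'B-F'},
})

# Anchor each pattern to the start (^) or end ($) of the string
raw_df["edge_1"] = raw_df["Edge"].str.extract(r'^([+-]?\d+(?:\.\d+)?%)')
raw_df["edge_2"] = raw_df["Edge"].str.extract(r'([+-]?\d+(?:\.\d+)?%)$')
raw_df["grade_1"] = raw_df["Grade"].str.extract(r'^([A-Z][+-]?)')
raw_df["grade_2"] = raw_df["Grade"].str.extract(r'([A-Z][+-]?)$')

print(raw_df[['edge_1', 'edge_2', 'grade_1', 'grade_2']])
```

Note that `[+-]?` after the letter in the grade patterns is what preserves the + and - that extractall(r'([A-Z])') dropped.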
Looks like you already have your solution, but here is another idea for splitting Edge without regex:
strip the trailing '%'
split by '%' with expand=True
add back '%'
df[['edge_1', 'edge_2']] = (
df['Edge'].str.rstrip('%').str.split('%', expand=True).add('%')
)
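For instance, on the sample Edge values (a small self-contained sketch):

```python
import pandas as pd

s = pd.Series(['-1.9%-2.2%', '+5.8%-9.4%'])

# '-1.9%-2.2%' -> '-1.9%-2.2' -> ['-1.9', '-2.2'] -> ['-1.9%', '-2.2%']
out = s.str.rstrip('%').str.split('%', expand=True).add('%')
print(out)
```

This works because every value ends in '%', so stripping the trailing one leaves exactly one interior '%' to split on.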
I want to check whether the Sentence column contains any keyword from another column (case-insensitive).
I also ran into a problem when importing the file from CSV: the keyword lists were read as plain strings, so when I tried str.join('|') it inserted a | between every single character.
Sentence = ["Clear is very good","Fill- low light, compact","stripping topsoil"]
Keyword =[['Clearing', 'grubbing','clear','grub'],['Borrow,', 'Fill', 'and', 'Compaction'],['Fall']]
df = pd.DataFrame({'Sentence': Sentence, 'Keyword': Keyword})
My expect output will be
df['Match'] = [True,True,False]
You can try DataFrame.apply over rows with axis=1:
import re
df['Match'] = df.apply(lambda row: bool(re.search('|'.join(row['Keyword']), row['Sentence'], re.IGNORECASE)), axis=1)
print(df)
Sentence Keyword Match
0 Clear is very good [Clearing, grubbing, clear, grub] True
1 Fill- low light, compact [Borrow,, Fill, and, Compaction] True
2 stripping topsoil [Fall] False
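One caveat: if a keyword contains regex metacharacters such as + or (, the joined pattern can break or match the wrong thing. Wrapping each keyword with re.escape keeps every keyword literal; a sketch of the same approach with escaping added:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Sentence': ["Clear is very good", "Fill- low light, compact", "stripping topsoil"],
    'Keyword': [['Clearing', 'grubbing', 'clear', 'grub'],
                ['Borrow,', 'Fill', 'and', 'Compaction'],
                ['Fall']],
})

# re.escape each keyword so characters like '+' or '(' are matched literally
df['Match'] = df.apply(
    lambda row: bool(re.search('|'.join(map(re.escape, row['Keyword'])),
                               row['Sentence'], re.IGNORECASE)),
    axis=1)
print(df['Match'].tolist())  # [True, True, False]
```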
I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as string1, and I would like to loop through all 4 columns in the dataframe, defined as df_MasterData:
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID" and delete everything after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and found similar solutions, but none of them worked.
Can anyone shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reassign the output to your original df.
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
Selecting column 0 of the split keeps the part before string1, which is the part you want to retain.
If you want to replace these columns in your target df, simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
You can first get the position of the string using str.find:
indexes = df_MasterData[i].str.find(string1)
# str.find returns the position where string1 starts in each row
# (add len(string1) to the result if you want to keep "&FolderCTID" itself)
Since .str[:n] slicing only accepts a scalar, slice each value with its own index:
df_MasterData[i] = [s[:j] for s, j in zip(df_MasterData[i], indexes)]
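A runnable sketch of this position-based approach on a toy column (the example.com URLs are shortened placeholders, not the real data):

```python
import pandas as pd

string1 = "&FolderCTID"
df = pd.DataFrame({'Column_A': [
    'https://example.com/a?RootFolder=x&FolderCTID=0x0120&View=1',
    'https://example.com/b?RootFolder=y&FolderCTID=0x0999&View=2',
]})

# Find where string1 starts in each row, then cut each value at that point.
# Assumes string1 is present in every row (str.find returns -1 otherwise,
# which would silently drop the last character instead).
indexes = df['Column_A'].str.find(string1)
df['Column_A'] = [s[:j] for s, j in zip(df['Column_A'], indexes)]
print(df['Column_A'].tolist())
```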
I have a dataframe which has some id's. I want to check the pattern of those column values.
Here is how the column looks like-
id: {ASDH12HK,GHST67KH,AGSH90IL,THKI86LK}
I want to write code that can distinguish characters and numerics in the pattern above and display an output like 'SSSS99SS', where 'S' represents a character and '9' represents a numeric. This is a large dataset, so I can't predefine the positions of the characters and numerics; I want the code to calculate them. I am new to Python, so any leads will be helpful!
You can try something like:
my_string = "ASDH12HK"
def decode_pattern(my_string):
    my_string = ''.join('9' if s.isdigit() else s for s in my_string)
    my_string = ''.join('S' if s.isalpha() else s for s in my_string)
    return my_string
decode_pattern(my_string)
Output:
'SSSS99SS'
You can apply this to the column in your dataframe as well:
import pandas as pd
df = pd.DataFrame(['ASDH12HK','GHST67KH','AGSH90IL','THKI86LK', 'SOMEPATTERN123'], columns=['id'])
df['pattern'] = df['id'].map(decode_pattern)
df
Output:
id pattern
0 ASDH12HK SSSS99SS
1 GHST67KH SSSS99SS
2 AGSH90IL SSSS99SS
3 THKI86LK SSSS99SS
4 SOMEPATTERN123 SSSSSSSSSSS999
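If you prefer to avoid the per-character loop, the same idea can be sketched fully vectorized with two regex replaces (an alternative, not the answer's original code):

```python
import pandas as pd

df = pd.DataFrame(['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK'], columns=['id'])

# Replace every digit with '9', then every letter with 'S', in two vectorized passes
df['pattern'] = (df['id']
                 .str.replace(r'\d', '9', regex=True)
                 .str.replace(r'[A-Za-z]', 'S', regex=True))
print(df['pattern'].tolist())  # ['SSSS99SS', 'SSSS99SS', 'SSSS99SS', 'SSSS99SS']
```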
You can use a regular expression:
import re
st = "SSSS99SSSS"
a = re.match("[A-Za-z]{4}[0-9]{2}[A-Za-z]{4}", st)
It will return a match if the string starts with 4 letters followed by 2 digits and then 4 more letters.
So you can use this on your df to filter the rows.
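For example, to keep only rows whose id follows the 4-letters/2-digits/2-letters shape of the question's ids (a sketch; the pattern here uses {2} trailing letters to match those ids, and str.fullmatch requires the whole value to match, unlike re.match, which only anchors the start):

```python
import pandas as pd

df = pd.DataFrame(['ASDH12HK', 'GHST67KH', 'BAD-ID-1', 'THKI86LK'], columns=['id'])

# str.fullmatch keeps only values matching the pattern end to end
mask = df['id'].str.fullmatch(r'[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}')
print(df[mask]['id'].tolist())  # ['ASDH12HK', 'GHST67KH', 'THKI86LK']
```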
You can use the function findall() from the re module:
import re
text = "ASDH12HK,GHST67KH,AGSH90IL,THKI86LK"
result = re.findall("[A-Za-z]{4}[0-9]{2}[A-Za-z]{2}", text)
print(result)
# ['ASDH12HK', 'GHST67KH', 'AGSH90IL', 'THKI86LK']
I use Python 3 and need to split a price column that mixes price_value and price_unit together in a dataframe. The data looks like 20dollar/m2/month or 1.8dollar/m2/day, and I want to split on the word dollar into this format:
price_value price_unit
20 dollar/m2/month
1.8 dollar/m2/day
I have tried with the following code:
Option 1:
df['price_value'] = df['price'].apply(lambda row: row.split('dollar')[0])
df['price_unit'] = df['price'].apply(lambda row: row.split('dollar')[-1])
Option 2:
df['price_value'], df['price_unit'] = df['price'].str.split('dollar', 1).str
But I get:
price_value price_unit
20 /m2/month
1.8 /m2/day
How can I split them correctly? Thanks.
You may use str.extract with a r'(?P<price_value>.*?)(?P<price_unit>dollar.*)' regex:
>>> import pandas as pd
>>> df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price'])
>>> df['price'].str.extract(r'(?P<price_value>.*?)(?P<price_unit>dollar.*)')
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day
Details
(?P<price_value>.*?) - Group "price_value": any 0+ chars other than line break chars as few as possible
(?P<price_unit>dollar.*) - Group "price_unit": dollar and any 0+ chars other than line break chars as many as possible.
I assume that you do not have any line breaks in the input, but if you happen to have any, prepend the pattern with the inline DOTALL modifier, (?s): r'(?s)(?P<price_value>.*?)(?P<price_unit>dollar.*)'
To add the newly extracted columns to the existing data frame, you may also use
df[['price_value', 'price_unit']] = df['price'].str.extract(r'(.*?)(dollar.*)')
Here, named capturing groups are not necessary since you define the column names beforehand.
You could do:
df = pd.DataFrame(data=['20dollar/m2/month', '1.8dollar/m2/day'], columns=['price_unit'])
# split by capture group
result = df['price_unit'].str.split('(dollar.*$)', expand=True).drop(2, axis=1)
# rename columns
result.columns = ['price_value', 'price_unit']
print(result)
Output
price_value price_unit
0 20 dollar/m2/month
1 1.8 dollar/m2/day
I have two files with 2 columns each. I need to take one column from one and one column from the other, and create a new file with 2 columns.
while i < 500020:
    columns = datas.readline()
    columns2 = datas2.readline()
    columns = columns.split(" ")
    columns2 = columns2.split(" ")
    colum.write(" {1} {0}".format(columns2[1], columns[1]))
    i = i + 1
My output is like this:
181.053131
0.0005301
168.785828
0.3596852
I want to show them on same line, EX:
181.053131 0.0005301
168.785828 0.3596852
You need to remove the newline from columns2[1]:
columns2 = datas2.readline().rstrip('\n')
otherwise you'll always insert those newlines into your output.
I'd also remove the newline from columns and use an explicit newline when writing:
columns = datas.readline().rstrip('\n')
and
colum.write(" {1} {0}\n".format(columns2[1], columns[1]))
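Putting both fixes together, a minimal runnable sketch (using io.StringIO objects as stand-ins for the two input files and the output file):

```python
import io

# Stand-ins for the two input files; each line is "index value"
datas = io.StringIO("0 181.053131\n1 168.785828\n")
datas2 = io.StringIO("0 0.0005301\n1 0.3596852\n")
colum = io.StringIO()

for line1, line2 in zip(datas, datas2):
    # Strip the trailing newline BEFORE splitting, then split on the space
    columns = line1.rstrip('\n').split(" ")
    columns2 = line2.rstrip('\n').split(" ")
    # Write one explicit '\n' per output row
    colum.write(" {1} {0}\n".format(columns2[1], columns[1]))

print(colum.getvalue())
```

Iterating the two files in lockstep with zip also avoids the manual counter from the original loop.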