change string in a big list with another string - python

I have a list called t_pre_eks_tfberita.
In the row whose first element is "Label", I want to replace every string containing "BUKAN HOAX (1)" with "BUKAN HOAX", and every string containing "HOAX (1)" with "HOAX".
But I get an error when I use this code:
for i in range(len(t_pre_eks_tfberita)):
    if t_pre_eks_tfberita[i][0] == "Label":
        j = 1
        while j in range(len(t_pre_eks_tfberita[i])):
            cek = re.search("BUKAN", t_pre_eks_tfberita[i][j])
            if cek:
                t_pre_eks_tfberita[i][j] = "BUKANHOAX"
            else:
                t_pre_eks_tfberita[i][j] = "HOAX"
            j += 1

dfr_eks_tfberita = pd.DataFrame(list(map(list, zip(*t_pre_eks_tfberita))))
new_header = dfr_eks_tfberita.iloc[0]    # grab the first row for the header
dfr_eks_tfberita = dfr_eks_tfberita[1:]  # take the data less the header row
dfr_eks_tfberita.columns = new_header
for i in range(len(new_header)):
    if new_header[i] != 'Label' and new_header[i] != 'Isi_Dokumen':
        dfr_eks_tfberita[new_header[i]] = dfr_eks_tfberita[new_header[i]].astype('int')
dfr_eks_tfberita
When I run it, I get an error like this.
Any solution for this problem?

Using re is overkill here.
You need to traverse the df values and just check whether they contain "BUKAN HOAX (1)" or "HOAX (1)":
if "HOAX (1)" in t_pre_eks_tfberita[i][j]:
    dosomething()
But you can actually do it inside the DataFrame using pandas' own functions like iterrows().
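A minimal sketch of that plain substring check on a small nested list (the list layout is assumed from the question):

```python
# one row whose first element is the "Label" marker, as in the question
t = [["Label", "BUKAN HOAX (1)", "HOAX (1)"]]

for row in t:
    if row[0] == "Label":
        for j in range(1, len(row)):
            # plain substring membership, no regex needed;
            # check the longer pattern first so "BUKAN HOAX (1)" is not
            # accidentally matched by the shorter "HOAX (1)" test
            if "BUKAN HOAX (1)" in row[j]:
                row[j] = "BUKAN HOAX"
            elif "HOAX (1)" in row[j]:
                row[j] = "HOAX"

print(t)  # [['Label', 'BUKAN HOAX', 'HOAX']]
```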

IIUC, try pandas.Series.str.replace together with str.strip (note the raw string and regex=True, which newer pandas versions require for pattern replacement):
import pandas as pd

# Sample input
s = pd.Series(['HOAX', 'HOAX (1)', 'BUKAN HOAX', 'BUKAN HOAX (1000)'])
new_s = s.str.replace(r'\(\d+\)', '', regex=True).str.strip()
print(new_s)
Output:
0          HOAX
1          HOAX
2    BUKAN HOAX
3    BUKAN HOAX
dtype: object
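The same replacement works directly on a DataFrame column; a sketch with an illustrative column name:

```python
import pandas as pd

df = pd.DataFrame({'Label': ['HOAX (1)', 'BUKAN HOAX (1)', 'HOAX']})
# strip any trailing "(number)" suffix, then trim leftover whitespace
df['Label'] = df['Label'].str.replace(r'\(\d+\)', '', regex=True).str.strip()
print(df['Label'].tolist())  # ['HOAX', 'BUKAN HOAX', 'HOAX']
```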

Related

convert pandas series (with strings) to python list

It's probably a silly thing, but I can't seem to correctly convert a pandas series originally read from an Excel sheet to a list.
dfCI is created by importing data from an Excel sheet and looks like this:
tab      var          val
MsrData  sortfield    DetailID
MsrData  strow        4
MsrData  inputneeded  "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']

# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where(dfMSR[col].isnull(), invalid, dfMSR[col])
However, the second set of (single) quotes, added when I converted cols from series to list, makes all the columns a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind, given the structure of cols, is:
list(eval(cols[0]))  # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe, and I would go with the list comprehension as @MayankPorwal suggested.
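A safer middle ground is ast.literal_eval, which only accepts Python literals and so avoids the code-execution risk of eval; a sketch on the same string:

```python
import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
# literal_eval parses the comma-separated string literals as a tuple,
# which list() then converts to the desired list of column names
cols = list(ast.literal_eval(col))
print(cols)  # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
```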

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like the one below, spanning 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as "string1", and I would like to loop through all 4 columns in the dataframe, defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID", delete it and all text after it
    string1 = "&FolderCTID"
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched and googled and found similar solutions, but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows in columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
first, declare a variable with your target columns.
Then use stack() and str.split to get your target output.
Finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA', 'ColumnB', 'ColumnC', 'ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
(Column 0 of the expanded split holds the part before the marker, which is what you want to keep.)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
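A minimal sketch of the stack/split/unstack idea on toy data (column names and cell values are illustrative), keeping the part before the marker:

```python
import pandas as pd

df = pd.DataFrame({
    'Column_A': ['keep_a&FolderCTID=junk1'],
    'Column_B': ['keep_b&FolderCTID=junk2'],
})
string1 = "&FolderCTID"

# stack -> one long Series, split each value on the marker,
# keep column 0 (the part before it), then unstack the column
# level back to the original layout
out = df[['Column_A', 'Column_B']].stack().str.split(string1, expand=True)[0].unstack(1)
print(out)
```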
You should first get the index of the string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this marks the end of the matched string in each row;
# if you don't want to keep string1 in the result, drop the offset:
indexes = df_MasterData[i].str.find(string1)
Now slice each row at its own index (str slicing doesn't accept a per-row Series, so zip the two together):
df_MasterData[i] = [s[:idx] for s, idx in zip(df_MasterData[i], indexes)]

Python pandas empty df but columns has elements

I have a really irritating problem in my script and have no idea what's wrong. When I try to filter my dataframe and then add rows to a new one, which I want to export to Excel, this happens:
the file exports as an empty DF, and print shows me that "report" is empty, but when I try to print report.Name, report.Value etc., I get normal and proper output with elements. Also, I can only export one column to Excel, not the entire DF, which looks empty... What can cause this strange behaviour?
So this is my script:
df = pd.read_excel('testfile2.xlsx')
report = pd.DataFrame(columns=['Type', 'Name', 'Value'])
for index, row in df.iterrows():
    if type(row[0]) == str:
        type_name = row[0].split(" ")
        if type_name[0] == 'const':
            selected_index = index
            report['Type'].loc[index] = type_name[1]
            report['Name'].loc[index] = type_name[2]
            report['Value'].loc[index] = row[1]
        else:
            for elements in type_name:
                report['Value'].loc[selected_index] += " " + elements
    elif type(row[0]) == float:
        df = df.drop(index=index)
print(report)       # output - Empty DataFrame
print(report.Name)  # output - over 500 elements
You are trying to manipulate a series that does not exist, which leads to the described behaviour.
Doing what you did, just with a much simpler example, I get the same result:
report = pd.DataFrame(columns=['Type', 'Name', 'Value'])
report['Type'].loc[0] = "A"
report['Name'].loc[0] = "B"
report['Value'].loc[0] = "C"
print(report)       # empty df
print(report.Name)  # prints "B" in a series
Easy solution: Just add the whole row instead of the three single values:
report = pd.DataFrame(columns=['Type','Name','Value'])
report.loc[0] = ["A", "B", "C"]
or in your code:
report.loc[index] = [type_name[1], type_name[2], row[1]]
If you want to do it the same way you are doing it at the moment you first need to add an empty series with the given index to your DataFrame before you can manipulate it:
report.loc[index] = pd.Series([])
report['Type'].loc[index] = type_name[1]
report['Name'].loc[index] = type_name[2]
report['Value'].loc[index] = row[1]
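Row-by-row .loc assignment is also slow for large frames; a common alternative (a sketch, with illustrative data) is to collect rows in a plain list and build the DataFrame once at the end:

```python
import pandas as pd

# accumulate plain dicts while iterating, one per parsed row
rows = []
for type_, name, value in [('const', 'x', 1), ('const', 'y', 2)]:
    rows.append({'Type': type_, 'Name': name, 'Value': value})

# a single constructor call builds the whole frame
report = pd.DataFrame(rows, columns=['Type', 'Name', 'Value'])
print(len(report))  # 2
```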

How to find a string in data frame which has the text with underscore in between

In a data frame I have values, for example, as below:
Rhymes (column name)
Johny johny.yes_papa eating
sugar
No papa.open_mouth_ha ha ha
The output should be the list of all the strings which have underscores, like:
yes_papa,
open_mouth_ha
I am trying the below code, but I am only getting column names in the dataframe:
df[df["Rhymes"].str.contains("_&_&_", na=False)]
Should I be using regular expressions to get the exact match?
The method pd.Series.str.contains returns a Boolean series; it doesn't return the strings you want.
You can instead use a custom function with str.split, apply it to your series, drop nulls and convert back to a dataframe:
df = pd.DataFrame({'Rhymes': ['Johny johny.yes_papa eating', 'sugar',
                              'No papa.open_mouth_ha ha ha']})

def get_underscores(x):
    return next((i for i in x.replace('.', ' ').split() if '_' in i), None)

res = df['Rhymes'].apply(get_underscores).dropna().to_frame()
print(res)
          Rhymes
0       yes_papa
2  open_mouth_ha
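An alternative, if regular expressions are acceptable, is str.findall, which collects every underscore-containing word per row rather than just the first one (a sketch; the pattern matches runs of word characters around an underscore, so it also trims the leading "johny."):

```python
import pandas as pd

df = pd.DataFrame({'Rhymes': ['Johny johny.yes_papa eating', 'sugar',
                              'No papa.open_mouth_ha ha ha']})
# \w+_\w+ : word characters joined by at least one underscore
matches = df['Rhymes'].str.findall(r'\w+_\w+')
print(matches.tolist())  # [['yes_papa'], [], ['open_mouth_ha']]
```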
For a string, it should work like this:
string = "Johny johny yes_papa eating sugar No papa open_mouth_ha ha ha"

def find_underscore(string):
    a = []
    for i in string.split():
        if '_' in i:
            a.append(i)
    return a

For a column of a data frame:
new_list = []
for index, row in df.iterrows():
    print(find_underscore(row["column_name"]))
    new_list.append(find_underscore(row["column_name"]))
df["new_column"] = new_list
Try this to get rows containing an underscore:
df[df["Rhymes"].str.contains("_")]
Or this to get just values:
df.loc[df["Rhymes"].str.contains("_"), "Rhymes"].values

Is there a way to use str.count() function with a LIST of values instead of a single string?

I am trying to count the number of times that any string from a list_of_strings appears in a csv file cell.
For example, the following would work fine.
import pandas as pd

data_path = "SurveryResponses.csv"
df = pd.read_csv(data_path)

totalCount = 0
for row in df['rowName']:
    if type(row) == str:
        print(row.count('word_of_interest'))
However, I would like to be able to enter a list of strings (['str1', 'str2', 'str3']) rather than just one 'word_of_interest', such that if any of those strings appear, the count value will increase by one.
Is there a way to do this?
Perhaps something along the lines of:
totalCount = 0
words_of_interest = ['cat', 'dog', 'foo', 'bar']
for row in df['rowName']:
    if type(row) == str:
        if sum([word in row for word in words_of_interest]) > 0:
            totalCount += 1
Use the str accessor:
df['rowName'].str.count('word_of_interest')
If you need to convert the column to string first, use astype:
df['rowName'].astype(str).str.count('word_of_interest')
Assuming list_of_strings = ['str1', 'str2', 'str3'], you can try the following:
if any(map(lambda x: x in row, list_of_strings)):
    totalCount += 1
You can use this method to count from an external list
strings = ['string1','string2','string3']
sum([1 if sr in strings else 0 for sr in df.rowName])
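Note that `sr in strings` is an exact-match test on the whole cell value; if you instead want rows containing any of the targets as a substring, reversing the containment works (a sketch with illustrative data):

```python
import pandas as pd

strings = ['cat', 'dog', 'foo']
rows = pd.Series(['my cat', 'dogs bark', 'nothing'])
# count rows containing at least one of the target substrings
count = sum(any(s in row for s in strings) for row in rows)
print(count)  # 2
```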
Here is an example:
import io
filedata = """animal,amount
"['cat','dog']",2
"['cat','horse']",2"""
df = pd.read_csv(io.StringIO(filedata))
Returns this dataframe:
            animal  amount
0    ['cat','dog']       2
1  ['cat','horse']       2
Search for the word cat (looping through all columns as series):
search = "cat"
# sums True for each series and then wraps a sum around all sums
# sum([2, 0]) in this case
sum([sum(df[cols].astype(str).str.contains(search)) for cols in df.columns])
Returns 2
