convert pandas series (with strings) to python list - python

It's probably a silly thing but I can't seem to correctly convert a pandas series originally got from an excel sheet to a list.
dfCI is created by importing data from an excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
dfMSR[col] = np.where((dfMSR[col].isnull()), invalid, dfMSR[col])
However the second set of (single) quotes added when I converted cols from series to list, makes all the columns a single value so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?

Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']

Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe and I would go with list-comprehension as #MayankPorwal suggested.

Related

How to use wide_to_long (Pandas)

I have this code which I thought would reformat the dataframe so that the columns with the same column name would be replaced by their duplicates.
# Function that splits dataframe into two separate dataframes, one with all unique
# columns and one with all duplicates
def sub_dataframes(dataframe):
# Extract common prefix -> remove trailing digits
columns = dataframe.columns.str.replace(r'\d*$', '', regex=True).to_series().value_counts()
# Split columns
unq_cols = columns[columns == 1].index
dup_cols = dataframe.columns[~dataframe.columns.isin(unq_cols)] # All columns from
dataframe that is not in unq_cols
return dataframe[unq_cols], dataframe[dup_cols]
unq_df = sub_dataframes(df)[0]
dup_df = sub_dataframes(df)[1]
print("Unique columns:\n\n{}\n\nDuplicate
columns:\n\n{}".format(unq_df.columns.tolist(), dup_df.columns.tolist()))
Output:
Unique columns:
['total_tracks', 'popularity']
Duplicate columns:
['t_dur0', 't_dur1', 't_dur2', 't_dance0', 't_dance1', 't_dance2', 't_energy0', 't_energy1', 't_energy2',
't_key0', 't_key1', 't_key2', 't_speech0', 't_speech1', 't_speech2', 't_acous0', 't_acous1', 't_acous2',
't_ins0', 't_ins1', 't_ins2', 't_live0', 't_live1', 't_live2', 't_val0', 't_val1', 't_val2', 't_tempo0',
't_tempo1', 't_tempo2']
Then I tried to use wide_to_long to combine columns with the same name:
cols = unq_df.columns.tolist()
temp = pd.wide_to_long(dataset.reset_index(), stubnames=['t_dur','t_dance', 't_energy', 't_key', 't_mode',
't_speech', 't_acous', 't_ins', 't_live', 't_val',
't_tempo'], i=['index'] + cols, j='temp', sep='t_')
.reset_index().groupby(cols, as_index=False).mean()
temp
Which gave me this output:
I tried to look at this question, but the dataframe that's returned has "Nothing to show". What am I doing wrong here? How do I fix this?
EDIT
Here is an example of how I've done it "by-hand", but I am trying to do it more efficiently using the already defined built-in functions.
The desired output is the dataframe that is shown last.

Python remove everything after specific string and loop through all rows in multiple columns in a dataframe

I have a file full of URL paths like below spanning across 4 columns in a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string which I defined it as "string1" and I would like to loop through all 4 columns in the dataframe defined as "df_MasterData":
string1 = "&FolderCTID"
import pandas as pd
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
# Objective: Replace "&FolderCTID", delete all string after
string1 = "&FolderCTID"
# Method 1
df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
# Method 2
df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
# Method 3
df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I did search and google and found similar solutions which were used but none of them work.
Can any guru shed some light on this? Any assistance is appreciated.
Added below is a few example rows in column A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how i would do it,
first declare a variable with your target columns.
Then use stack() and str.split to get your target output.
finally, unstack and reapply the output to your original df.
cols_to_slice = ['ColumnA','ColumnB','ColumnC','ColumnD']
string1 = "&FolderCTID"
df[cols_to_slice].stack().str.split(string1,expand=True)[1].unstack(1)
if you want to replace these columns in your target df then simply do -
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1,expand=True)[1].unstack(1)
You should first get the index of string using
indexes = len(string1) + df_MasterData[i].str.find(string1)
# This selected the final location of this string
# if you don't want to add string in result just use below one
indexes = len(string1) + df_MasterData[i].str.find(string1)
Now do
df_MasterData[i] = df_MasterData[i].str[:indexes]

Python pandas empty df but columns has elements

I have really irritating thing in my script and don't have idea what's wrong. When I try to filter my dataframe and then add rows to newone which I want to export to excel this happen.
File exports as empty DF, also print shows me that "report" is empty but when I try to print report.Name, report.Value etc. I got normal and proper output with elements. Also I can only export one column to excel not entire DF which looks like empty.... What can cause that strange accident?
So this is my script:
df = pd.read_excel('testfile2.xlsx')
report = pd.DataFrame(columns=['Type','Name','Value'])
for index, row in df.iterrows():
if type(row[0]) == str:
type_name = row[0].split(" ")
if type_name[0] == 'const':
selected_index = index
report['Type'].loc[index] = type_name[1]
report['Name'].loc[index] = type_name[2]
report['Value'].loc[index] = row[1]
else:
for elements in type_name:
report['Value'].loc[selected_index] += " " + elements
elif type(row[0]) == float:
df = df.drop(index=index)
print(report) #output - Empty DataFrame
print(report.Name) output - over 500 elements
You are trying to manipulate a series that does not exist which leads to the described behaviour.
Doing what you did just with a way more simple example i get the same result:
report = pd.DataFrame(columns=['Type','Name','Value'])
report['Type'].loc[0] = "A"
report['Name'].loc[0] = "B"
report['Value'].loc[0] = "C"
print(report) #empty df
print(report.Name) # prints "B" in a series
Easy solution: Just add the whole row instead of the three single values:
report = pd.DataFrame(columns=['Type','Name','Value'])
report.loc[0] = ["A", "B", "C"]
or in your code:
report.loc[index] = [type_name[1], type_name[2], row[1]]
If you want to do it the same way you are doing it at the moment you first need to add an empty series with the given index to your DataFrame before you can manipulate it:
report.loc[index] = pd.Series([])
report['Type'].loc[index] = type_name[1]
report['Name'].loc[index] = type_name[2]
report['Value'].loc[index] = row[1]

change string in a big list with another string

i have some list as t_pre_eks_tfberita
i want to replace the string in row with "Label" string that contain "BUKAN HOAX (1)" to "BUKAN HOAX" and change a string that contain "HOAX (1)" as "HOAX".
but i found error when i use this code.
for i in range (len(t_pre_eks_tfberita)):
if(t_pre_eks_tfberita[i][0]=="Label"):
j=1
while j in range (len(t_pre_eks_tfberita[i])):
cek = re.search("BUKAN",t_pre_eks_tfberita[i][j])
if(cek):
t_pre_eks_tfberita[i][j] = "BUKANHOAX"
else:
t_pre_eks_tfberita[i][j] = "HOAX"
j+=1
dfr_eks_tfberita = pd.DataFrame(list(map(list, zip(*t_pre_eks_tfberita))))
new_header = dfr_eks_tfberita.iloc[0] #grab the first row for the header
dfr_eks_tfberita = dfr_eks_tfberita[1:] #take the data less the header row
dfr_eks_tfberita.columns = new_header
for i in range(len(new_header)):
if new_header[i] != 'Label' and new_header[i] != 'Isi_Dokumen':
dfr_eks_tfberita[new_header[i]] = dfr_eks_tfberita[new_header[i]].astype('int')
dfr_eks_tfberita
when i run it, i found error like this.
any solution for this problem?
using re is in overkill here.
you need to traverse the df values, and just check if "BUKAN HOAX (1)" or "HOAX (1)".
if "HOAX (1)" in t_pre_eks_tfberita[i][j]:
dosomething()
but you can actually do it inside the DF using pandas own function like iterrows().
IIUC, try pandas.Series.str.replace with strip:
import pandas as pd
s = pd.Series(['HOAX', 'HOAX (1)', 'BUKAN HOAX', 'BUKAN HOAX (1000)'])
# Sample input
new_s = s.str.replace('\(\d+\)', '').str.strip()
print(new_s)
Output:
0 HOAX
1 HOAX
2 BUKAN HOAX
3 BUKAN HOAX
dtype: object

Python append to specific column in data frame

I have a data frame df with 3 columns and a loop creating strings from a text file depending on the column-names of the loop:
exampletext = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
Columnnames = ("Nr1", "Nr2", "Nr3")
df1= pd.DataFrame(columns = Columnnames)
for i in range(0,len(Columnnames)):
solution = exampletext.find(Columnnames[i])
lsolution= len(Columnnames[i])
Solutionwords = exampletext[solution+lsolution:solution+lsolution+10]
Now I want to append the solutionwords at the end of the dataframe df1 in the correct field, e.g. when looking for Nr1 I want to append the solutionwords to column named Nr1.
I tried working with append and creating a list, but this will just append at the end of the list. I need the data frame to seperate the words depending on the word I was looking for. Thank you for any help!
edit for desired output and readability:
Desired Output should be a data frame and look like the following:
Nr1 | Nr2 | Nr3
thisword1 | thisword2 | thisword3
I've assumed that your word for the cell value always follows your column name and is separated by a space. In which case, I'd probably try and achieve this by adding your values to a dictionary and then creating a dataframe from it after it contains the data you want, like this:
example_text = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
column_names = ("Nr1", "Nr2", "Nr3")
d = dict()
split_text = example_text.split(' ')
for i, text in enumerate(split_text):
if text in column_names:
d[text] = split_text[i+1]
df = pd.DataFrame(d, index=[0])
which will give you:
>>> df
Nr1 Nr2 Nr3
0 thisword1 thisword2 thisword3

Categories

Resources