I have a data frame df1 with 3 columns and a loop that extracts strings from a text file based on the column names:
exampletext = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
Columnnames = ("Nr1", "Nr2", "Nr3")
df1= pd.DataFrame(columns = Columnnames)
for i in range(len(Columnnames)):
    solution = exampletext.find(Columnnames[i])
    lsolution = len(Columnnames[i])
    Solutionwords = exampletext[solution+lsolution:solution+lsolution+10]
Now I want to append the Solutionwords to the correct field at the end of the dataframe df1, e.g. when looking for Nr1 I want to append the Solutionwords to the column named Nr1.
I tried working with append and creating a list, but this just appends at the end of the list. I need the data frame to separate the words depending on the word I was looking for. Thank you for any help!
edit for desired output and readability:
Desired Output should be a data frame and look like the following:
Nr1 | Nr2 | Nr3
thisword1 | thisword2 | thisword3
I've assumed that the word for each cell value always follows its column name and is separated by a space. In that case, I'd add the values to a dictionary and then create a dataframe from it once it contains the data you want, like this:
import pandas as pd

example_text = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
column_names = ("Nr1", "Nr2", "Nr3")

d = dict()
split_text = example_text.split(' ')
for i, text in enumerate(split_text):
    if text in column_names:
        d[text] = split_text[i+1]  # the word right after the column name
df = pd.DataFrame(d, index=[0])
which will give you:
>>> df
Nr1 Nr2 Nr3
0 thisword1 thisword2 thisword3
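If the text is less regular than single space-separated tokens, a regex variant of the same idea may be more robust. This is a minimal sketch, assuming each value is a single non-whitespace token right after its column name:
import re
import pandas as pd

example_text = "Nr1 thisword1 and Nr2 thisword2 and Nr3 thisword3"
column_names = ("Nr1", "Nr2", "Nr3")

# capture the token immediately following each column name
d = {}
for name in column_names:
    m = re.search(re.escape(name) + r"\s+(\S+)", example_text)
    if m:
        d[name] = m.group(1)
df = pd.DataFrame(d, index=[0])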
I've joined or concatenated two series into a dataframe. However, one of the issues I'm now facing is that I have no column headings on the actual data, which would help me do a sort.
hist_a = pd.crosstab(category_a, category, normalize=True)
hist_b = pd.crosstab(category_b, category, normalize=True)
counts_a = pd.Series(np.diag(hist_a), index=[hist_a.index])
counts_b = pd.Series(np.diag(hist_b), index=[hist_b.index])
df_plots = pd.concat([counts_a, counts_b], axis=1).fillna(0)
The data looks like the following:
0 1
category
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
and I'd like to do a sort, but there are no proper column headings
df_plots = df_plots.sort_values(by=['0?'])
But the dataframe seems to be in two parts. How could I restructure it to have 'proper' column names such as '0' or 'plot a', rather than columns only indexable by an integer, which is hard to work with? The desired output is:
category plot a plot b
0017817703277 0.000516 5.384341e-04
0017817703284 0.000516 5.384341e-04
0017817731348 0.000216 2.856169e-04
0017817731355 0.000216 2.856169e-04
Just rename the columns of the dataframe, for example:
df = pd.DataFrame({0:[1,23]})
df = df.rename(columns={0:'new name'})
If you have a lot of columns, you can rename all of them at once:
df = pd.DataFrame({0:[1,23]})
rename_dict = {key: f'Col {key}' for key in df.keys()}
df = df.rename(columns=rename_dict)
You can also give each series its name at construction, so you avoid renaming afterwards (note that index=hist_a.index, without the extra brackets, avoids wrapping the index in a one-level MultiIndex):
counts_a = pd.Series(np.diag(hist_a), index=hist_a.index, name='counts_a')
counts_b = pd.Series(np.diag(hist_b), index=hist_b.index, name='counts_b')
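For instance, a minimal sketch with made-up data (the names 'plot a'/'plot b' and the values are placeholders):
import pandas as pd

s_a = pd.Series([0.1, 0.2], index=['x', 'y'], name='plot a')
s_b = pd.Series([0.3, 0.4], index=['x', 'y'], name='plot b')

# concat picks up each Series' name as its column label
df_plots = pd.concat([s_a, s_b], axis=1).fillna(0)
df_plots = df_plots.sort_values(by='plot a')  # sorting by name now works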
I have a data frame like as shown below
df = pd.DataFrame({'subject_id':[11,11,11,12,12,12],
'test_date':['02/03/2012 10:24:21','05/01/2019 10:41:21','12/13/2011 11:14:21','10/11/1992 11:14:21','02/23/2002 10:24:21','07/19/2005 10:24:21'],
'original_enc':['A742','B963','C354','D563','J323','G578']})
hash_file = pd.DataFrame({'source_enc':['A742','B963','C354','D563','J323','G578'],
'hash_id':[1,2,3,4,5,6]})
cols = ["subject_id","test_date","enc_id","previous_enc_id"]
test_df = pd.DataFrame(columns=cols)
test_df.head()
I would like to do two things here
Map original_enc to their corresponding hash_id and store it in enc_id
Find the previous hash_id for each subject based on their current hash_id and store it in previous_enc_id
I tried the below
test_df['subject_id'] = df['subject_id']
test_df['test_date'] = df['test_date']
test_df['enc_id'] = df['original_enc'].map(hash_file.set_index('source_enc')['hash_id'])
test_df = test_df.sort_values(['subject_id','test_date'],ascending=True)
test_df['previous_enc_id'] = test_df.groupby(['subject_id','test_date'])['enc_id'].shift(1)
However, I don't get the expected output for the previous_enc_id column as it is all NA.
I expect previous_enc_id to hold, for each row, the subject's previous enc_id, with NA in the first row of every subject because that's their first encounter: there is no earlier record to look back to.
Use only one column for groupby:
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()
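Putting it together with the question's data, a minimal end-to-end sketch (the set_index mapping and the to_datetime conversion are my additions; the dates must be parsed, since sorting them as strings would not be chronological):
import pandas as pd

# map original_enc -> hash_id via a Series, so .map works as intended
mapping = hash_file.set_index('source_enc')['hash_id']

test_df = df[['subject_id', 'test_date']].copy()
test_df['enc_id'] = df['original_enc'].map(mapping)
test_df['test_date'] = pd.to_datetime(test_df['test_date'])
test_df = test_df.sort_values(['subject_id', 'test_date'])

# shift within each subject only; the first encounter gets NaN
test_df['previous_enc_id'] = test_df.groupby('subject_id')['enc_id'].shift()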
It's probably a silly thing, but I can't seem to correctly convert a pandas series, originally read from an Excel sheet, to a list.
dfCI is created by importing data from an excel sheet and looks like this:
tab var val
MsrData sortfield DetailID
MsrData strow 4
MsrData inputneeded "MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided","BiMonthlyTest"
# get list of cols for which input is needed
cols = dfCI[((dfCI['var'] == 'inputneeded') & (dfCI['tab'] == 'MsrData'))]['val'].values.tolist()
print(cols)
>> ['"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"']
# replace null text with text
invalid = 'Input Needed'
for col in cols:
    dfMSR[col] = np.where(dfMSR[col].isnull(), invalid, dfMSR[col])
However, the extra set of (single) quotes added when I converted the series to a list makes all the column names a single value, so that
col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
The desired output for cols is
cols = ["MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"]
What am I doing wrong?
Once you've got col, you can convert it to your expected output:
In [1109]: col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
In [1114]: cols = [i.strip() for i in col.replace('"', '').split(',')]
In [1115]: cols
Out[1115]: ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Another possible solution that comes to mind given the structure of cols is:
list(eval(cols[0])) # ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']
Although this is valid, it's less safe, and I would go with the list comprehension that #MayankPorwal suggested.
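If you want the convenience of eval without executing arbitrary code, ast.literal_eval from the standard library is a safer middle ground; it only parses Python literals, which works here because the string is a valid tuple literal:
import ast

col = '"MeasDescriptionTest", "SiteLocTest", "SavingsCalcsProvided", "BiMonthlyTest"'
cols = list(ast.literal_eval(col))
# ['MeasDescriptionTest', 'SiteLocTest', 'SavingsCalcsProvided', 'BiMonthlyTest']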
I have a dataframe with multiple 'leg' column names, leg/1 through leg/24, but each leg has multiple strings attached to it, like leg/1/a1 and leg/1/a2.
For example, I have a dataframe with these columns:
leg/1/a1 leg/1/a2 leg/2/a1 leg/3/a2
I need all legs in the dataframe to have the same set of columns as leg/1. For example, my required pandas dataframe column names should be:
leg/1/a1 leg/1/a2 leg/2/a1 leg/2/a2 leg/3/a1 leg/3/a2
This should be the full set of columns in the output dataframe.
For that purpose, I first collected the leg/1 details in a list:
legs=['leg/1/a1','leg/1/a2']
I created this list to match against the dataframe's column names.
After that, I collected all the column names that start with leg:
cols = [col for col in df.columns if 'leg' in col]
but the problem is that I am unable to do the matching. Any help would be appreciated.
column_list = ['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2']  # replace with df.columns
col_end_list = set(e.split('/')[-1] for e in column_list)  # get all of a1, a2, ..., an

# loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # new column, filled with 0
Code to reproduce on a blank df:
import pandas as pd

df = pd.DataFrame([], columns=['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2'])
column_list = df.columns
col_end_list = set(e.split('/')[-1] for e in column_list)  # get all of a1, a2, ..., an

# loop through leg/1/a1 to leg/24/an
for i in range(1, 25):
    for c in col_end_list:
        check_str = 'leg/' + str(i) + '/' + c
        if check_str not in column_list:  # if the column doesn't exist, add it
            df[check_str] = 0  # new column, filled with 0
>>> df.columns
>>> Index(['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2', 'leg/2/a2', 'leg/3/a1',
'leg/4/a1', 'leg/4/a2', 'leg/5/a1', 'leg/5/a2', 'leg/6/a1', 'leg/6/a2',
'leg/7/a1', 'leg/7/a2', 'leg/8/a1', 'leg/8/a2', 'leg/9/a1', 'leg/9/a2',
'leg/10/a1', 'leg/10/a2', 'leg/11/a1', 'leg/11/a2', 'leg/12/a1',
'leg/12/a2', 'leg/13/a1', 'leg/13/a2', 'leg/14/a1', 'leg/14/a2',
'leg/15/a1', 'leg/15/a2', 'leg/16/a1', 'leg/16/a2', 'leg/17/a1',
'leg/17/a2', 'leg/18/a1', 'leg/18/a2', 'leg/19/a1', 'leg/19/a2',
'leg/20/a1', 'leg/20/a2', 'leg/21/a1', 'leg/21/a2', 'leg/22/a1',
'leg/22/a2', 'leg/23/a1', 'leg/23/a2', 'leg/24/a1', 'leg/24/a2'],
dtype='object')
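A more pandas-idiomatic sketch of the same idea, assuming the target grid is leg/1 through leg/24 crossed with the suffixes observed in the existing columns: build the complete column list and reindex, which adds all missing columns (filled with 0) in one call. Note this also reorders the columns into the generated order.
import pandas as pd

df = pd.DataFrame([], columns=['leg/1/a1', 'leg/1/a2', 'leg/2/a1', 'leg/3/a2'])

suffixes = sorted({c.split('/')[-1] for c in df.columns})
full_cols = [f'leg/{i}/{s}' for i in range(1, 25) for s in suffixes]

# reindex adds any missing columns, filled with 0
df = df.reindex(columns=full_cols, fill_value=0)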
I have a file full of URL paths like the one below, spanning 4 columns of a dataframe that I am trying to clean:
Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]
I want to remove everything after a specific string, which I defined as string1, looping through all 4 columns of the dataframe df_MasterData:
import pandas as pd

string1 = "&FolderCTID"
df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
for i in cols:
    # Objective: find "&FolderCTID" and delete everything after it
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]
I searched on Google and found similar solutions, but none of them work. Can any guru shed some light on this? Any assistance is appreciated.
Added below are a few example rows from columns A and B for these URLs:
Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']
Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it:
First, declare a variable with your target columns. Then use stack() and str.split to get your target output. Finally, unstack and reassign the output to your original df.
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"

# column 0 of the expanded split holds everything before string1
df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
If you want to replace these columns in your target df, then simply do:
df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
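Equivalently, Method 1 from the question already works per column; a minimal sketch applying it to all four columns at once:
df[cols_to_slice] = df[cols_to_slice].apply(
    lambda col: col.str.split(string1).str[0]
)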
You should first get, for each row, the index at which to cut the string:
indexes = len(string1) + df_MasterData[i].str.find(string1)
# this points just past the end of string1, so string1 is kept in the result;
# if you don't want string1 included in the result, drop the len(string1) term:
indexes = df_MasterData[i].str.find(string1)
Now slice each value up to its index. Note that .str[:n] only accepts a scalar, so slice row-wise, guarding rows where string1 was not found (find returns -1 there):
df_MasterData[i] = [s[:idx] if idx >= 0 else s
                    for s, idx in zip(df_MasterData[i], indexes)]
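Putting it together across all four columns, a minimal sketch that drops string1 and everything after it, leaving cells without string1 untouched:
string1 = "&FolderCTID"
for i in ['Column_A', 'Column_B', 'Column_C', 'Column_D']:
    starts = df_MasterData[i].str.find(string1)
    df_MasterData[i] = [s[:idx] if idx >= 0 else s
                        for s, idx in zip(df_MasterData[i], starts)]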