Creating Automation to Create Multiple Columns in a DataFrame - python

I am new to coding in this area and need help creating x number of columns. I have a DataFrame that is currently being updated, and I need a way to show only whatever columns the user picks from the DataFrame, but in between those columns I want a column to say 'Keep'. So far I have been able to have the code select what the user wants; I am just having trouble creating an automated way to make the 'Keep' columns show up without adding them in between myself.
name_of_cols =['id','start_date', 'end_date', 'name', 'job_title', 'Keep']
All but 'Keep' are part of the DataFrame beforehand.
def clean_df(df, list_col):
    df2 = df.copy()
    df2 = df2.drop_duplicates(list_col)
    df3 = df2.copy()
    df3 = df3[['id', 'start_date', 'end_date', 'name', 'job_title']].reset_index(drop=True)
    col_list = df3.columns.tolist()
    conditions = [df3 == name_of_cols,
                  df3 != name_of_cols]
    results = ['Keep', 'No Keep']
    df3['Keep'] = np.select(conditions, results)
    return df3[name_of_cols]

df3_new = clean_df(df, name_of_cols)
This creates the list I need, but when I try to add 'Keep' I get:
KeyError: Index(['Keep'], dtype='object')
I am assuming this is because 'Keep' is not a part of the original DataFrame.
I have code that defines all of this, so defining the DataFrames is not an issue.

From what I can tell, as far as your code goes, it might be a syntactical error.
results = ['Keep', 'Don't Keep']
df3_new['keep'] = np.select(conditions, results)
return df3[name_of_cols]
It seems like you have an unintended apostrophe where you have Don't Keep. I might suggest using double quotation marks ("Don't Keep") to eliminate this issue, but I don't know if this is the solution you are looking for. (I don't know a whole lot about DataFrames.)
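Beyond the stray apostrophe, the KeyError itself seems to come from `return df3[name_of_cols]` asking for a 'Keep' column that was never created on df3. A minimal sketch (with made-up rows matching the question's column names) that creates the column before reordering:

```python
import pandas as pd

# Made-up rows using the question's column names
df = pd.DataFrame({
    'id': [1, 1, 2],
    'start_date': ['2020-01-01', '2020-01-01', '2020-02-01'],
    'end_date': ['2020-06-01', '2020-06-01', '2020-07-01'],
    'name': ['Ann', 'Ann', 'Bob'],
    'job_title': ['Dev', 'Dev', 'QA'],
    'extra': [0, 0, 1],
})
name_of_cols = ['id', 'start_date', 'end_date', 'name', 'job_title', 'Keep']

def clean_df(df, cols):
    # only deduplicate on columns that actually exist in the frame
    real_cols = [c for c in cols if c in df.columns]
    df2 = df.drop_duplicates(real_cols).copy()
    df2['Keep'] = 'Keep'   # create the column *before* selecting it
    return df2[cols].reset_index(drop=True)

df_new = clean_df(df, name_of_cols)
print(df_new.columns.tolist())
```

Because the column exists by the time `df2[cols]` runs, the KeyError goes away; the duplicate first two rows also collapse to one.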

Related

Python sort table on multiple columns

I am busy making a system that can sort some things from an Excel document; I have added a part of the document here: shorturl.at/DKNP7
It has the following inputs: Day, time, sort, number, gourmet/fondue, sort_exclusive
I want to have this sorted as follows: it must contain the sum of each of the different types.
I have some code, but I doubt it is efficient; the start of the code is included below.
df = pd.read_excel('Example_excel.xlsm', sheet_name="INVOER")
gourmet = df[['Day', 'Time', 'Sort', 'number', 'Gourmet/Fondue', 'sort exclusive']]
gourmet1 = gourmet.dropna(subset=['Sort'], inplace=False) #if 'Sort' is not filled in it is dropped.
gourmet1.to_excel('test.xlsx', index=False, sheet_name='gourmet')
Maybe it is needed to split it into 2 parts, where one part is 'exclusief' with 'sort_exclusive' and another part is 'populair' and 'deluxe' from the 'Sort' column.
Looking forward to your reply!
One of the things I have tried is to split it:
gourmet_pop_del = gourmet1.groupby(['Day','Sort', 'Gourmet/Fondue' ])['number'].sum()
gourmet_pop_del = gourmet_pop_del.reset_index()
gourmet_pop_del.sort_values(by=['Day', 'Sort','Gourmet/Fondue'], inplace=True)
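Since the linked spreadsheet isn't available, here is a small self-contained sketch (with made-up rows) suggesting that the groupby/sum attempt above does produce per-type totals:

```python
import pandas as pd

# Made-up rows standing in for the linked Excel sheet
gourmet1 = pd.DataFrame({
    'Day': ['Mon', 'Mon', 'Tue', 'Mon'],
    'Sort': ['populair', 'populair', 'deluxe', 'deluxe'],
    'Gourmet/Fondue': ['Gourmet', 'Gourmet', 'Fondue', 'Fondue'],
    'number': [2, 3, 1, 4],
})

# Sum 'number' per (Day, Sort, Gourmet/Fondue) combination, then sort
gourmet_pop_del = (
    gourmet1.groupby(['Day', 'Sort', 'Gourmet/Fondue'], as_index=False)['number']
    .sum()
    .sort_values(['Day', 'Sort', 'Gourmet/Fondue'])
)
print(gourmet_pop_del)
```

`as_index=False` keeps the grouping columns as ordinary columns, which replaces the separate `reset_index()` step in the attempt above.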

How to search through pandas data frame row by row and extract variables

I am trying to search through a pandas DataFrame row by row and see if 3 variables are in the name of the file. If they are in the name of the file, more variables are extracted from that same row. For instance, I am checking to see if the concentration, substrate and the number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment nothing is being saved for Actual_Frame_Rate and I don't know why. I have attached the pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I am unsure why this helped, but I fixed it by combining it all into one string and matching that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am unsure why I can't use the other method.
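A likely explanation for why the first version misbehaved: in `a and b and c in path_ext`, only the last term is a membership test; the first two strings are merely truth-tested, and any non-empty string is truthy. A small sketch with made-up values contrasting the two forms:

```python
# Made-up values standing in for a spreadsheet row and the file path
path_ext = 'water(5%)-glass-10'
conc, substrate, droplets = '5', 'metal', '10'   # note: 'metal' is NOT in the path

# The original condition: only `droplets in path_ext` is a membership test;
# `conc` and `substrate` are just truth-tested (non-empty => True)
buggy = bool(conc and substrate and droplets in path_ext)

# What was intended: a membership test for every piece
correct = all(part in path_ext for part in (conc, substrate, droplets))

print(buggy, correct)   # buggy says True even though 'metal' never appears
```

Building one combined string, as in the accepted fix, sidesteps this because the whole string is then genuinely tested against the path.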

Extract and match items dealing with multiple data-frames using Python

I have two dataframes which can be created using the code shown below
df1 = pd.DataFrame({'home': [1, np.nan, 2, np.nan, 3, 4],
                    'PERSONAL INFORMATION': ['Study Number', 'Study ID',
                                             'Age when interview done',
                                             'Derived using date of birth',
                                             'Gender', 'ethnicity'],
                    'VARIABLE': ['studyid', 'dummy', 'age_interview',
                                 'dummy', 'gender', 'Chinese'],
                    'Remarks': [2000000001, 20005000001, 4265453, 0, 4135376, 2345678]})
df2 = pd.DataFrame({'level_0': ['studyid', 'age_interview', 'gender', 'dobyear',
                                'ethderived', 'smoke', 'alcohol'],
                    '0': ['tmp001', 56, 'Female', 1950, 'Chinese', 'No', 'Yes']})
Aim
1) My objective is to take the values from the 'level_0' column of df2 and look for them in the 'VARIABLE' column of df1 to fetch their 'Remarks' column value, provided it satisfies the below condition:
a) The 'home' column of df1 should contain digits as part of its value (e.g. 1, 2, 3, 4, B1.5, C1.9, D1.2 are all valid values for the 'home' column)
2) My objective is the same as above, but here I would like to take the values from the '0' column of df2 and look for them in the 'PERSONAL INFORMATION' column of df1 to fetch their 'Remarks' value, provided it satisfies the below condition:
a) The 'VARIABLE' column of df1 should contain 'dummy' as a value
For the above two scenarios I have written the below code, but I feel that it is quite lengthy/inefficient. There should be an easier way to do this.
Scenario - 1
qconc_id = []
missed_items = []
col_list = []
for i in df7.index:
    ques = df7['level_0'][i]
    col_list.append(ques)
    try:
        qindex = (int(df[df['VARIABLE'] == ques].index[0]),
                  df.columns.get_loc('VARIABLE'))
        pos_qindex = qindex[0]
        ques_value = df['home'][pos_qindex]
        result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        while result is None:
            pos_qindex = pos_qindex - 1
            ques_value = df['home'][pos_qindex]
            result = re.match(r"[A-Z]?[\d]?[\.]?[\d]+", ques_value)
        qconc_id.append(df['Remarks'][pos_qindex])
    except:
        missed_items.append(ques)
Scenario - 2
aconc_id = []
missed_items = []
ans_list = []
for i in df7.index:
    ans = df7['0'][i]
    print("ans is ", ans)
    ans_list.append(ans)
    idx = 0
    try:
        aindex = df[df['PERSONAL INFORMATION'].str.contains(ans, case=False, regex=False)].index
        print(aindex)
        pos_aindex = aindex[idx]
        while ((df['VARIABLE'][pos_aindex] != 'dummy') and
               (df['PERSONAL INFORMATION'].str.contains('Yes|No', regex=True)[pos_aindex] == False)):
            pos_aindex = aindex[idx + 1]
        print("The value is ", df['Remarks'][pos_aindex])
        aconc_id.append(df['Remarks'][pos_aindex])
    except:
        print("Goes to Exception")
        aconc_id.append('0')
        missed_items.append(ans)
Please note these two things
a) I have used a while loop because the values might repeat. For example, we might have a matching value 'No', but df1['VARIABLE'] may not be 'dummy'. So I increase the index values in both scenarios to find whether the next occurrence of 'No' has the 'dummy' value in the VARIABLE column. The same applies to scenario 1 as well.
b) How can I handle scenarios where "No" finds a match in "Notes" or "Nocase"? As you can see from my code, I am using regex, but I am still encountering errors here.
As you can see, I am making some modifications to the code and writing it twice. How can I make it elegant and efficient? I am sure there must be a very easy and simple way to do this.
Any suggestions/ideas on alternative approach w.r.t to changing the data format of source data or using merge/join approach is also welcome.
I expect the output, the 'Remarks' values, to be stored in a list. Please find the screenshot of what I have done.
You should avoid explicit loops in pandas as much as possible, because they will not be vectorized (optimized, in pandas and numpy wording). Here you could merge your dataframes:
Scenario 1:
# extract values where df2.level_0 == df1.VARIABLE
tmp = pd.merge(pd.DataFrame(df2.level_0),
               df1.loc[:, ['home', 'VARIABLE', 'Remarks']],
               left_on=['level_0'], right_on=['VARIABLE'])
# drop lines where home does not contain a digit
tmp.drop(tmp.loc[~tmp.home.astype(np.str_).str.contains(r'\d')].index,
         inplace=True)
# extract the Remarks column into a list
lst = tmp.Remarks.tolist()
With your example data I get [2000000001, 4265453, 4135376]
Scenario 2:
tmp = pd.merge(pd.DataFrame(df2['0']),
               df1.loc[:, ['PERSONAL INFORMATION', 'VARIABLE', 'Remarks']],
               left_on=['0'], right_on=['PERSONAL INFORMATION'])
tmp.drop(tmp.loc[tmp['VARIABLE'] != 'dummy'].index, inplace=True)
lst.extend(tmp.Remarks.tolist())
With your example data I get no additional values because from the first step, tmp is an empty dataframe.

Pandas formatting column within DataFrame and adding timedelta Index error

I'm trying to use pandas to do some analysis on some messaging data and am running into a few problems trying to prep the data. It is coming from a database I don't have control of, and therefore I need to do a little pruning and formatting before analyzing it.
Here is where I'm at so far:
#select all the messages in the database. Be careful if you get the whole test database, it may have 5000000 messages.
full_set_data = pd.read_sql("Select * from message",con=engine)
After I make this change to the timestamp, and set it as index, I'm no longer able to call to_csv.
#convert timestamp to a timedelta and set as index
#full_set_data[['timestamp']] = full_set_data[['timestamp']].astype(np.timedelta64)
indexed = full_set_data.set_index('timestamp')
indexed.to_csv('indexed.csv')
#extract the data columns I really care about since there are a bunch I don't need
datacolumns = indexed[['address','subaddress','rx_or_tx', 'wordcount'] + [col for col in indexed.columns if ('DATA' in col)]]
Here I need to format the DATA columns, I get a "SettingWithCopyWarning".
#now need to format the DATA columns to something useful by removing the upper 4 bytes
for col in datacolumns.columns:
    if 'DATA' in col:
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
datacolumns.to_csv('data_col.csv')
#now group the data by "interaction key"
groups = datacolumns.groupby(['address','subaddress','rx_or_tx'])
I need to figure out how to get all the messages from a given group; get_group() requires that I know key values ahead of time.
key_group = groups.get_group((1,1,1))
#foreach group in groups:
#do analysis
I have tried everything I could think of to fix the problems I'm running into, but I can't seem to get around them. I'm sure it's from me misunderstanding/misusing pandas, as I'm still figuring it out.
I'm looking to solve these issues:
1) Can't save to csv after I add index of timestamp as timedelta64
2) How do I apply a function to a set of columns to remove SettingWithCopyWarning when reformatting DATA columns.
3) How to grab the rows for each group without having to use get_group() since I don't know the keys ahead of time.
Thanks for any insight and help so I can better understand how to properly use Pandas.
Firstly, you can set the index column(s) and parse dates while querying the DB:
indexed = pd.read_sql_query("Select * from message", con=engine,
                            parse_dates='timestamp', index_col='timestamp')
Note I've used pd.read_sql_query here rather than pd.read_sql, which is deprecated, I think.
The SettingWithCopy warning is due to the fact that datacolumns is a view of indexed, i.e. a subset of its rows/columns, not an object in its own right. Check out this part of the docs: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
One way to get around this is to define
datacolumns = indexed[<cols>].copy()
Another would to do
indexed = indexed[<cols>]
which effectively removes the columns you don't want, if you're happy that you won't need them again. You can then manipulate indexed at your leisure.
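As a runnable sketch of the `.copy()` suggestion (with a made-up two-row stand-in for the message table):

```python
import pandas as pd

# Made-up stand-in for the message table
indexed = pd.DataFrame({'address': [1, 2],
                        'DATA01': ['0xdeadbeef', '0x0001ffff'],
                        'junk': [0, 0]})

# .copy() makes datacolumns a real object rather than a view of `indexed`,
# so the writes below do not trigger SettingWithCopyWarning
datacolumns = indexed[['address', 'DATA01']].copy()
for col in datacolumns.columns:
    if 'DATA' in col:
        # keep only the lower 2 bytes of each hex word
        datacolumns[col] = datacolumns[col].apply(lambda x: int(x, 16) & 0x0000ffff)
print(datacolumns)
```

Note that `indexed` itself is left untouched; only the copy is reformatted.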
As for the groupby, you could introduce a column of tuples which would be the group keys:
indexed['interaction_key'] = list(zip(indexed['address'], indexed['subaddress'], indexed['rx_or_tx']))
indexed.groupby('interaction_key').apply(
    lambda df: some_function(df.interaction_key, ...))
I'm not sure if it's all exactly what you want but let me know and I can edit.
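On point 3 specifically, a groupby object is itself iterable, yielding `(key, sub-DataFrame)` pairs, so `get_group()` and advance knowledge of the keys aren't needed; a minimal sketch with made-up message data:

```python
import pandas as pd

# Made-up stand-in for the pruned message table
datacolumns = pd.DataFrame({
    'address':    [1, 1, 2],
    'subaddress': [1, 1, 5],
    'rx_or_tx':   [1, 1, 0],
    'wordcount':  [4, 8, 2],
})

groups = datacolumns.groupby(['address', 'subaddress', 'rx_or_tx'])

# Each iteration yields the key tuple and the rows belonging to that group
for key, frame in groups:
    print(key, len(frame))
```

Each `frame` is a regular DataFrame holding that group's rows, ready for per-group analysis.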

DataFrame change doesn't save when iterating

I am trying to read a certain DF from file and add to it two more columns containing, say, the year and the week from other columns in the DF. When I apply the code to generate a single new column, everything works great. But when there are a few columns to be created, the change does not apply. Specifically, the new columns are created, but their values are not what they are supposed to be.
I know that this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works on a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all the data values, but I don't understand why the change does not apply. After all, before the second iteration begins, the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' in the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column created but filled with 'No Week' values...
Many thanks to @EdChum! Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work; more precisely, it does work, but only on the last updated column. Using the loc, iloc or ix accessors to set the data works. In the case of the above code, to make it work, one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to @EdChum.
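The accepted fix can be sketched end-to-end like this (made-up dates; a boolean mask plus a single `.loc` write per column avoids the chained indexing entirely):

```python
import pandas as pd

# Made-up dates; 'No Fill' marks missing values as in the question
tbl = pd.DataFrame({'Col1': ['2021-01-04', 'No Fill', '2021-01-11'],
                    'Col2': ['No Fill', '2021-02-01', '2021-02-08']})

for col in ['Col1', 'Col2']:
    week_col = col + '_WEEK'
    tbl[week_col] = 'No Week'
    good = tbl[col] != 'No Fill'          # boolean mask instead of an index list
    weeks = pd.to_datetime(tbl.loc[good, col]).apply(
        lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
    tbl.loc[good, week_col] = weeks       # one .loc write, no chained indexing
print(tbl)
```

Because each write goes through a single `.loc` call, the values set for Col1_WEEK survive the second loop iteration instead of being partially lost.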
