Getting next value of groupby after explode with pandas - python

I am trying to get the first non-null value of the list inside each row of the Emails column and write it to Emails_final1, then get the next value of that list, if there is one, into Emails_final2; otherwise write the Emails 2 value to Emails_final2, provided it is not blank and does not equal the email already chosen, and leave Emails_final2 blank otherwise. Lastly, if a value from Emails 2 was already written to Emails_final1, then Emails_final2 should be None. I have tried many different ways to achieve this to no avail; here is what I have so far, including pseudo-code:
My Current Code:
import pandas as pd

df = pd.DataFrame({'Emails': [['jjj#gmail.com', 'jp#gmail.com', 'jc#gmail.com'],
                              [None, 'www#gmail.com'],
                              [None, None, None]],
                   'Emails 2': ['sss#gmail.com', 'zzz#gmail.com', 'ccc#gmail.com'],
                   'num_specimen_seen': [10, 2, 3]},
                  index=['falcon', 'dog', 'cat'])
df['Emails_final1'] = df['Emails'].explode().groupby(level=0).first()
#pseudo code
df['Emails_final2'] = df['Emails'].explode().groupby(level=0).next()  # I know next() doesn't exist, but I want it to try the next value of 'Emails' before falling back to the 'Emails 2' values.
Desired Output:
        Emails_final1  Emails_final2
falcon  jjj#gmail.com  jp#gmail.com
dog     www#gmail.com  zzz#gmail.com
cat     ccc#gmail.com  None
Any direction of how to approach a problem like this would be appreciated.

It looks a bit messy, but it works. Basically, we keep a boolean mask from the first step (filling "Emails_final1") and reuse it in the second step when filling "Emails_final2".
To fill the second column, the idea is to use groupby + nth(1) to get the second element of each list. If it differs from the email already chosen for "Emails_final1", keep it (as in the first row); if it is missing or a duplicate, fall back to the "Emails 2" column, unless that value was already consumed by "Emails_final1" (as in the third row), in which case Emails_final2 stays None:
exp_g = df['Emails'].explode().groupby(level=0)

# First non-null email from each list, falling back to 'Emails 2' where the list had none
df['Emails_final1'] = exp_g.first()
msk = df['Emails_final1'].notna()  # True where the list itself supplied Emails_final1
df['Emails_final1'] = df['Emails_final1'].fillna(df['Emails 2'])

# Second element of each list; swap in 'Emails 2' when it is missing or a duplicate of
# Emails_final1, but only for rows where the list supplied Emails_final1 (msk is True)
df['Emails_final2'] = exp_g.nth(1)
df['Emails_final2'] = df['Emails_final2'].mask(
    lambda x: ((x == df['Emails_final1']) | x.isna()) & msk, df['Emails 2'])
The relevant columns are:
        Emails_final1  Emails_final2
falcon  jjj#gmail.com  jp#gmail.com
dog     www#gmail.com  zzz#gmail.com
cat     ccc#gmail.com  None

Related

How do I search a pandas dataframe to get the row with a cell matching a specified value?

I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and to search for values within a row, including
# df_selection_names.shape[0] is the number of items the user selected,
# so this loops over all of them
for i in range(df_selection_names.shape[0]):
    # get the cell value in the 'name' column of row i using at()
    sel = df_selection_names.at[i, 'name']
    # find the rows of df_everything whose 'name' equals sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be way off, but I'm feeling pretty confused about everything at this point!
https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin
df_everything[df_everything['name'].isin(df_selection_names["name"])]
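If the goal is then to print each name/suggestion pair, a minimal sketch (assuming the columns really are named name and suggestion, as described in the question) is to iterate over the filtered rows:
matches = df_everything[df_everything['name'].isin(df_selection_names['name'])]

# one line per matching row, e.g. "suggestion1 is suggested for name1"
for _, row in matches.iterrows():
    print(f"{row['suggestion']} is suggested for {row['name']}")
This avoids the per-item loop and the at() lookups entirely, which is where the ValueErrors were coming from.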

Python: Lambda function with multiple conditions based on multiple previous rows

I am trying to define a lambda function that assigns True or False to a row based on various conditions.
There is a column with a Timestamp, and what I want is that, if within the last 10 seconds (based on the timestamp of the current row x) some specific values occurred in other columns of the dataset, the current row x gets a True or False tag.
So basically I have to check whether, in the previous n rows, i.e. between Timestamp(x) - 10 seconds and Timestamp(x), value a occurred in column A and value b occurred in column B.
I already looked at the shift() function with freq = 10 seconds, and another attempt looked like this:
data['Timestamp'][(data['Timestamp']-pd.Timedelta(seconds=10)):data['Timestamp']]
But I wasn't able to proceed with either of the two options.
Is it possible to start an additional select within a lambda function? If yes, what could that look like?
P.S.: Working with regular for-loops instead of the lambda function is not an option due to the overall setup of the application/code.
Thanks for your help and input!
Perhaps you're looking for something like this, if I understood correctly:
def create_tag(current_timestamp, df, cols_vals):
    # Rows at or before the current timestamp
    mask = (df['Timestamp'] <= current_timestamp)
    # ... and no older than the current timestamp - 10s
    mask = mask & (df['Timestamp'] >= current_timestamp - pd.to_timedelta('10s'))
    # Filter the dataframe with the mask
    filtered = df[mask]
    # Check that every requested value appears in its column within the window
    present = all(value in filtered[column_name].values for column_name, value in cols_vals.items())
    return present

data['Tag'] = data['Timestamp'].apply(lambda x: create_tag(x, data, {'column A': 'a', 'column B': 'b'}))
The idea behind this code is, for each timestamp that you have, we're going to apply the create_tag function. This takes the current timestamp, the whole dataframe as well as a dictionary containing column names as keys and the respective values you're looking for as values.
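As a quick check, here is a minimal sketch that calls create_tag from above on made-up data; the column names 'column A'/'column B' and the values 'a'/'b' are just the placeholders used in this thread:
import pandas as pd

# Toy data purely for illustration
data = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-01-01 12:00:00',
                                 '2023-01-01 12:00:05',
                                 '2023-01-01 12:00:30']),
    'column A': ['a', 'x', 'a'],
    'column B': ['b', 'b', 'y'],
})

data['Tag'] = data['Timestamp'].apply(
    lambda x: create_tag(x, data, {'column A': 'a', 'column B': 'b'}))
print(data['Tag'].tolist())
# [True, True, False] -- the first two rows see both 'a' and 'b' within their
# trailing 10-second window, the last row does not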

Associating numbers with different words python

So I have a dataframe (df9) that has several columns, one of which is "ASSET_CLASS", and I also have a variable called "terms". ASSET_CLASS is made up of different names, whereas "terms" is numbers. I want to create a new column in the dataframe that outputs a different number per row based on the corresponding asset class and number of terms. For example, if Asset_Class is 'A' in a row and terms is between 30 and 60 for that row, I want my new column to output the number 5 for that row. Or if Asset_Class is 'A' and terms is between 0 and 30, the new column gives 3 for that row. Or if Asset_Class is 'B' and terms is between 30 and 60, then the output in the new column for that row is 8. Does anyone have a good idea of how to do this? I was thinking maybe if/else statements, but I'm not sure.
Use numpy.select. The following code should add a new column called 'newcol' to your data frame, matching your first two cases and putting a value of -1 everywhere not covered by an explicitly defined case.
ac = df9.ASSET_CLASS
t = df9.terms
condlist = [(ac=='A') & (t>=30) & (t<60), (ac=='A') & (t>=0) & (t<30)]
choicelist = [5, 3]
df9['newcol'] = np.select(condlist, choicelist, default=-1)
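For completeness, the third rule from the question (Asset Class 'B' with terms between 30 and 60 mapping to 8) is just one more condition/choice pair. Whether the boundaries are inclusive is an assumption, since the question doesn't say; numpy is assumed imported as np, as above:
condlist = [
    (ac == 'A') & (t >= 30) & (t < 60),
    (ac == 'A') & (t >= 0)  & (t < 30),
    (ac == 'B') & (t >= 30) & (t < 60),
]
choicelist = [5, 3, 8]
df9['newcol'] = np.select(condlist, choicelist, default=-1)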

Getting all column values from google sheet using Gspread and Python

So I have a problem with gspread for Python 3.
When I do something like:
x = worksheet.cell(1,1).value
print(x)
Then I get the value of cell (1, 1), which in my case is:
Nice
But when I do:
x = worksheet.col_values(1)
print(x)
Then I get all the results, as in
'Nice', 'Cool','','','','','','','','','','','','','',''
And all the empty cells as well, which I don't understand, since I am asking just for values. Why do I get all the '' empty strings, and why are the other values quoted? I would expect something like:
Nice
Cool
when I call for the values of a column and those are the only values. Does anyone know how to get such results?
According to the documentation at https://github.com/burnash/gspread it should work, but it does not.
You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Note that the above will remove blank rows between items, which might cause misalignment if you try to work with multiple columns where row number is important. Since the result is a list, individual items are accessed with:
for item in x:
    print(item)
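If the row positions do matter, a small variation on the same idea keeps the original row number next to each non-empty value instead of discarding it (this sketch assumes the sheet's first row is row 1):
# Pair each non-empty value with its 1-based row number in the sheet
values_with_rows = [(row, item)
                    for row, item in enumerate(worksheet.col_values(1), start=1)
                    if item]
for row, item in values_with_rows:
    print(row, item)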
Looking again at the gspread documentation, I was able to create a dataframe and then obtain the column values:
gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to set your sheet's sharing settings to "Anyone with the link", to be able to access the sheet from e.g. Google Colab.
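Once the data is in a DataFrame, a single column can then be pulled out with ordinary pandas indexing; the column name below is a placeholder for whichever header your sheet actually uses:
# Values of one column, without the header row (get_all_records already used row 1 as headers)
column_values = dataframe['Name'].tolist()
print(column_values)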
You can also create a while loop and do something like this.
Let's say you want columns E to G; you can start the loop at x = 5 and end it at x = 7. Just make sure that you transpose the dataframe at the end before printing it.
columns = []
x = 5
while x < 8:
    data = sheet.col_values(x)[1:]
    x += 1
    columns.append(data)
df = pd.DataFrame(columns).T
print(df)
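The transpose is needed because pd.DataFrame(columns) treats each list returned by col_values as one row of the frame; .T flips the result so every sheet column ends up as a DataFrame column again.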

DataFrame change doesn't save when iterating

I am trying to read a certain DF from file and add two more columns to it, containing, say, the year and the week taken from other columns in the DF. When I apply the code to generate a single new column, all works great. But when several columns have to be created, the change does not apply. Specifically, the new columns are created but their values are not what they are supposed to be.
I know that this happens because I first set all new values to a certain initial string and then change some of them, but I don't understand why it works on a single column and is "nulled" for multiple columns, leaving only the latest column changed... Help please?
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    bad_ind = list(np.where(tbl[date_cols[i]] == 'No Fill')[0])
    tbl_ind = list(range(len(tbl)))
    for j in range(len(bad_ind)):
        tbl_ind.remove(bad_ind[j])
    tmp = pd.to_datetime(tbl[date_cols[i]][tbl_ind])
    tbl[tmp_col_name][tbl_ind] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
If I try the following lines, disregarding possible "empty data values", everything works...
tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']
for i in range(len(date_cols)):
    tmp_col_name = date_cols[i] + '_WEEK'
    tbl[tmp_col_name] = 'No Week'
    tmp = pd.to_datetime(tbl[date_cols[i]])
    tbl[tmp_col_name] = tmp.apply(lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
It has to do with not changing all the data values, but I don't understand why the change does not apply. After all, before the second iteration begins, the DF seems to be updated, and then tbl[tmp_col_name] = 'No Week' in the second iteration "deletes" the changes made in the first iteration, but only partially: it leaves the new column in place but filled with 'No Week' values...
Many thanks to #EdChum! Performing chained indexing may or may not work. In the case of creating multiple new columns and then filling in only some of their values, it doesn't work; more precisely, it works only on the last updated column. Using the loc, iloc or ix accessors to set the data does work. In the case of the above code, to make it work, one needs to cast tbl_ind into an np.array, using tbl[col_name[j]].iloc[np.array(tbl_ind)] = tmp.apply(lambda x: x.year)
Many thanks and credit for the answer to #EdChum.
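For reference, here is a minimal sketch of that advice applied to the original loop, using a single positional assignment through iloc so nothing goes through chained indexing; this is just one way to restructure it, not the exact code from the discussion ('file' is the same path variable as above):
import numpy as np
import pandas as pd

tbl = pd.read_csv(file).fillna('No Fill')
date_cols = ['Col1', 'Col2']

for col in date_cols:
    week_col = col + '_WEEK'
    tbl[week_col] = 'No Week'
    good = np.where(tbl[col] != 'No Fill')[0]  # positions that hold real dates
    weeks = pd.to_datetime(tbl[col].iloc[good]).apply(
        lambda x: str(x.isocalendar()[0]) + '+' + str(x.isocalendar()[1]))
    # Single, non-chained assignment: rows by position, column by position
    tbl.iloc[good, tbl.columns.get_loc(week_col)] = weeks.values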
