Consider this simple pandas DataFrame with columns 'record', 'start', and 'param'. There can be multiple rows with the same 'record' value, and all rows sharing a 'record' value also share the same 'start' value; only 'param' can differ between them:
pd.DataFrame({'record':[1,2,3,4,4,5,6,7,7,7,8], 'start':[0,5,7,13,13,19,27,38,38,38,54], 'param':['t','t','t','u','v','t','t','t','u','v','t']})
I'd like to make a column 'end' that takes the value of 'start' in the row with the next unique value of 'record'. The values of column 'end' should be:
[5,7,13,19,19,27,38,54,54,54,NaN]
I'm able to do this using a for loop, but I know this is not preferred when using pandas:
max_end = 100
for idx, row in df.iterrows():
    try:
        n = 1
        next_row = df.iloc[idx + n]
        while next_row['start'] == row['start']:
            n = n + 1
            next_row = df.iloc[idx + n]
        end = next_row['start']
    except IndexError:  # ran past the last row
        end = max_end
    df.at[idx, 'end'] = end
Is there an easy way to achieve this without a for loop?
I have no doubt there is a smarter solution but here is mine.
df1['end'] = df1.drop_duplicates(subset=['record', 'start'])['start'].shift(-1).reindex(index=df1.index, method='ffill')
-=EDIT=-
Added subset into drop_duplicates to account for the question amendment.
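For clarity, here is the same pipeline broken into steps (a sketch assuming df1 is the example frame from the question, with its default RangeIndex, which reindex(method='ffill') requires to be monotonic):
firsts = df1.drop_duplicates(subset=['record', 'start'])  # one row per unique record
next_start = firsts['start'].shift(-1)  # start of the next unique record
df1['end'] = next_start.reindex(index=df1.index, method='ffill')  # spread back over the duplicate rows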
This solution is equivalent to @Quixotic22's, although more explicit.
import pandas as pd

df = pd.DataFrame({
    'record': [1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8],
    'start': [0, 5, 7, 13, 13, 19, 27, 38, 38, 38, 54],
    'param': ['t', 't', 't', 'u', 'v', 't', 't', 't', 'u', 'v', 't']
})

max_end = 100
df["end"] = None  # create a new, empty column
loc = df["record"].shift(1) != df["record"]  # True on the first row of each record
df.loc[loc, "end"] = df.loc[loc, "start"].shift(-1)  # start of the next unique record
df["end"] = df["end"].ffill()  # forward-fill the remaining duplicate rows
df.loc[df.index[-1], "end"] = max_end  # override the last value
df
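On the example frame this yields end values 5, 7, 13, 19, 19, 27, 38, 54, 54, 54, 100, i.e. the expected column with the trailing NaN replaced by max_end.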
Related
I would like to iterate through a dataframe's rows and concatenate selected rows to a different dataframe, basically building up a new dataframe from some of the rows.
For example:
IPCSection and IPCClass DataFrames:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            pdList = [finalpatentclasses, pd.DataFrame(secrow), pd.DataFrame(clrow)]
            finalpatentclasses = pd.concat(pdList, axis=0, ignore_index=True)
display(finalpatentclasses)
The output is:
I want the NaN values to disappear and all the data to line up under the correct columns. I tried axis=1, but that messes up the column names. append does not work either: all values end up placed diagonally in the table, again with NaN values.
Alright, I have figured it out. The idea is to create a new one-row DataFrame, concatenate all the data for that row into it, and then concat it with the final dataframe.
Here is the code:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
finalpatentclasses = pd.DataFrame(columns=allcolumns)
for isec, secrow in IPCSection.iterrows():
    for icl, clrow in IPCClass.iterrows():
        if secrow[0] in clrow[0]:
            newrow = pd.DataFrame(columns=allcolumns)
            values = np.concatenate((secrow.values, clrow.values), axis=0)
            newrow.loc[len(newrow.index)] = values
            finalpatentclasses = pd.concat([finalpatentclasses, newrow], axis=0)
finalpatentclasses.reset_index(drop=True, inplace=True)
display(finalpatentclasses)
Update: the code below is more efficient:
allcolumns = np.concatenate((IPCSection.columns, IPCClass.columns), axis=0)
newList = []
for secrow in IPCSection.itertuples():
    for clrow in IPCClass.itertuples():
        if secrow[1] in clrow[1]:
            values = [secrow[1], secrow[2], clrow[1], clrow[2]]
            newList.append(values)
finalpatentclasses = pd.DataFrame(newList, columns=allcolumns)
display(finalpatentclasses)
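Since the inner condition is just a containment test between two columns, the nested loops can also be replaced by a cross join plus a filter. A sketch, assuming pandas >= 1.2 (for how='cross') and that IPCSection and IPCClass have no overlapping column names (otherwise merge appends _x/_y suffixes):
merged = IPCSection.merge(IPCClass, how='cross')  # every section paired with every class
sec_col, cl_col = IPCSection.columns[0], IPCClass.columns[0]
keep = [sec in cl for sec, cl in zip(merged[sec_col], merged[cl_col])]  # same containment test
finalpatentclasses = merged[keep].reset_index(drop=True)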
I would like to populate a dataframe using a for loop.
One of the columns is a list.
This list is empty at the beginning, and at each iteration an element is added to or removed from it.
When I print the list at each iteration I get the right results, but when I print my dataframe, I get the same list on every row.
If you have a look at my code, the list I am updating is list_employe. The magic should happen in the last 3 rows, but it does not.
Does anyone have an idea why the list is updated correctly, yet the dataframe records only the last update on every row?
list_employe = []
total_employe = 0
rows = []
shiftday = example['SHIFT_DATE'].dt.strftime('%Y-%m-%d').unique().tolist()
for i in shiftday:
    shift_day = example[example['SHIFT_DATE'] == i]
    list_employe_shift = example[example['SHIFT_DATE'] == i]['EMPLOYEE_CODE_POS_UPPER'].unique().tolist()
    new_employe = 0
    end_employe = 0
    for k in list_employe_shift:
        shift_days_emp = shift_day[shift_day['EMPLOYEE_CODE_POS_UPPER'] == k]
        days = shift_days_emp.iloc[0]['last_day']
        #print(days)
        if k in list_employe:
            if days > 1:
                end_employe = end_employe + 1
                total_employe = total_employe - 1
                list_employe.remove(k)
        else:
            new_employe = new_employe + 1
            total_employe = total_employe + 1
            list_employe.extend([k])
    day = i
    total_emp = total_employe
    new_emp = new_employe
    end_emp = end_employe
    rows.append([day, total_emp, new_emp, end_emp, list_employe])
    print(list_employe)
df = pd.DataFrame(rows, columns=["day", "total_employe", "new_employe", "end_employe", "list_employe"])
The list list_employe is always the same object that you append to the list rows. To solve the problem, change the 3rd line from the bottom to rows.append([day, total_emp, new_emp, end_emp, list(list_employe)]), which creates a new copy of the list at each iteration.
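A minimal sketch of the underlying Python behaviour, independent of pandas:
rows = []
shared = []
for k in ['a', 'b']:
    shared.append(k)
    rows.append([k, shared])  # every row stores the *same* list object
print(rows)  # [['a', ['a', 'b']], ['b', ['a', 'b']]] -- both rows show the final state

rows = []
shared = []
for k in ['a', 'b']:
    shared.append(k)
    rows.append([k, list(shared)])  # copy the current state instead
print(rows)  # [['a', ['a']], ['b', ['a', 'b']]]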
I have a dataset in this format:
and it needs to be grouped by the DocumentId and PersonId columns and sorted by StartDate, which I am doing like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])
Now, if there is a row in a group with DocumentCode RT and a non-empty EndDate, all other rows in that group need to be filled with that end date, so the result dataset should be the following:
I could not figure out a way to do that. I think I could iterate over each groupby subset, but how will I find the EndDate value and replace it for each row in that subset?
Based on the suggestions to use bfill(), I tried the following:
df["EndDate"] = (
df.sort_values(by=["StartDate"])
.groupby(["DocumentId", "PersonId"])["EndDate"]
.bfill()
)
The above works fine, but how can I add the condition that DocumentCode must be RT?
You can calculate the value to use for filling NaN inside the function passed to apply.
def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is a row in this group with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and its EndDate is not empty
        if pd.notnull(end_date):
            # fill all other rows with that end date
            df = df.fillna({"EndDate": end_date})
    return df

df = pd.read_csv(path).sort_values(by=["StartDate"])
df = df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
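As a quick sanity check, a minimal sketch with made-up rows (only the column names come from the question; the values are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "DocumentId": [1, 1, 1],
    "PersonId": [10, 10, 10],
    "DocumentCode": ["AB", "RT", "CD"],
    "StartDate": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "EndDate": [None, "2020-12-31", None],
})
out = (
    df.sort_values(by=["StartDate"])
    .groupby(["DocumentId", "PersonId"])
    .apply(fill_end_date)
    .reset_index(drop=True)
)
# every row in the group now carries the RT row's EndDate, 2020-12-31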
You could find the empty cells, replace them with np.nan, and then backfill:
df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x == '' else x)
df['EndDate'] = df['EndDate'].bfill()
Alternatively, you could iterate through the df from the last row to the first and fill in the EndDate where necessary:
d = df.loc[df.shape[0] - 1, 'EndDate']  # initial condition
for i in range(df.shape[0] - 1, -1, -1):
    if df.loc[i, 'DocumentCode'] == 'RT':
        d = df.loc[i, 'EndDate']
    else:
        df.loc[i, 'EndDate'] = d
I have a dataframe df_headlines with a date column and a score column whose values are -1, 0, or 1.
I want to group by the date column, count how many times -1, 0, and 1 appear per date, and then use whichever has the highest count as the daily_score.
I started with a groupby:
df_group = df_headlines.groupby('date')
This returns a groupby object and I'm not sure how to work with this given what I want to do above:
Can I iterate through it using the following?
for index, row in df_group.iterrows():
    daily_pos = []
    daily_neg = []
    daily_neu = []
As Ch3steR hinted at in a comment, you can iterate through your groups in the following way:
for name, group in df_headlines.groupby('date'):
    daily_pos = len(group[group['score'] == 1])
    daily_neg = len(group[group['score'] == -1])
    daily_neu = len(group[group['score'] == 0])
    print(name, daily_pos, daily_neg, daily_neu)
For each iteration, the variable name will contain a value from the date column (e.g. 4/13/20, 4/14/20, 5/13/20), and the variable group will contain a dataframe of all rows for the date contained in the name variable.
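If you would rather collect those counts into a summary frame with the daily_score attached, one way is named aggregation (a sketch, assuming the score column holds only -1, 0, and 1):
summary = (
    df_headlines.groupby('date')['score']
    .agg(daily_pos=lambda s: (s == 1).sum(),
         daily_neg=lambda s: (s == -1).sum(),
         daily_neu=lambda s: (s == 0).sum(),
         daily_score=lambda s: s.value_counts().idxmax())
    .reset_index()
)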
Try:
df_headlines.groupby("date")["score"].agg(lambda s: s.value_counts().idxmax())
No loop required: this returns the most common score within each group.
I'm trying to add rows and columns to pandas incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across this datastore, I'd like to be able to incrementally update a dataframe, where in some cases, either names or days will be missing.
import random
import pandas as pd

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            # note: DataFrame.append was removed in pandas 2.0; this assumes an older pandas
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
       2016_1  2016_2  2016_3
name
Bill     15.0     NaN     NaN
Bill      NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
       2016_1  2016_2  2016_3
name
Bill     15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum(min_count=1)
(min_count=1 keeps a column as NaN when a name has no value at all, instead of summing it to 0, matching the desired output.)
Try this:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function (if you have only one value per column and all the others are null) and groupby on df.
First, use reset_index to get back your name column.
Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all the values of the column are null but one, x.mean() will return that value (actually, you can use almost any aggregator, since there is only one value, and that is the one returned).
It would be easier to have the names as columns and the date as the index instead. Plus, you can work with plain lists inside the loop and create the pd.DataFrame afterwards, e.g.:
year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # index label
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))
You can append the entries whose names do not already exist, and then do an update to refresh the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            df.update(new_df)
    #df.set_index('name', inplace=True, drop=True)
    print(df)
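The concat step adds only rows whose name is not yet in the index, while update fills in values for names that already exist, so each name ends up on a single row and the blocky duplicates disappear.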