Iterate over a pandas data frame or groupby object - python

My dataframe df_headlines has a date column and a score column holding -1, 0, or 1 for each headline.
I want to group by the date column, count how many times -1, 0, and 1 appear for each date, and then use whichever value has the highest count as that day's daily_score.
I started with a groupby:
df_group = df_headlines.groupby('date')
This returns a groupby object and I'm not sure how to work with it given what I want to do above. Can I iterate through it using the following?
for index, row in df_group.iterrows():
    daily_pos = []
    daily_neg = []
    daily_neu = []

As Ch3steR hinted at in a comment, you can iterate through your groups in the following way:
for name, group in df_headlines.groupby('date'):
    daily_pos = len(group[group['score'] == 1])
    daily_neg = len(group[group['score'] == -1])
    daily_neu = len(group[group['score'] == 0])
    print(name, daily_pos, daily_neg, daily_neu)
For each iteration, the variable name will contain a value from the date column (e.g. 4/13/20, 4/14/20, 5/13/20), and the variable group will contain a dataframe of all rows for the date contained in the name variable.
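To go from those counts to the daily_score the question asks for (the most frequent score per date), a minimal sketch, assuming df_headlines has the date and score columns described above:
import pandas as pd

daily_scores = {}
for name, group in df_headlines.groupby('date'):
    # idxmax() on value_counts() returns the score with the highest count
    daily_scores[name] = group['score'].value_counts().idxmax()

daily_score = pd.Series(daily_scores, name='daily_score')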

Try:
df_headlines.groupby("date")["score"].agg(lambda s: s.value_counts().idxmax())
No loop required - within each group, value_counts().idxmax() gives you the most common score.
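For example, with a small df_headlines invented purely for illustration:
import pandas as pd

df_headlines = pd.DataFrame({
    'date': ['4/13/20', '4/13/20', '4/13/20', '4/14/20', '4/14/20'],
    'score': [1, 1, -1, 0, 0],
})
daily_score = df_headlines.groupby("date")["score"].agg(lambda s: s.value_counts().idxmax())
# date
# 4/13/20    1
# 4/14/20    0
# Name: score, dtype: int64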

Related

Pandas assign value based on next row(s)

Consider this simple pandas DataFrame with columns 'record', 'start', and 'param'. There can be multiple rows with the same record value, and each unique record value corresponds to the same start value. However, the 'param' value can be different for the same 'record' and 'start' combination:
pd.DataFrame({'record': [1,2,3,4,4,5,6,7,7,7,8],
              'start': [0,5,7,13,13,19,27,38,38,38,54],
              'param': ['t','t','t','u','v','t','t','t','u','v','t']})
I'd like to make a column 'end' that takes the value of 'start' in the row with the next unique value of 'record'. The values of column 'end' should be:
[5,7,13,19,19,27,38,54,54,54,NaN]
I'm able to do this using a for loop, but I know this is not preferred when using pandas:
max_end = 100
for idx, row in df.iterrows():
    try:
        n = 1
        next_row = df.iloc[idx + n]
        while next_row['start'] == row['start']:
            n = n + 1
            next_row = df.iloc[idx + n]
        end = next_row['start']
    except IndexError:  # ran past the last row
        end = max_end
    df.at[idx, 'end'] = end
Is there an easy way to achieve this without a for loop?
I have no doubt there is a smarter solution, but here is mine.
df['end'] = df.drop_duplicates(subset=['record', 'start'])['start'].shift(-1).reindex(index=df.index, method='ffill')
EDIT: added subset to drop_duplicates to account for the question amendment.
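To see why this works, it helps to look at the intermediate steps; a sketch using the question's data:
firsts = df.drop_duplicates(subset=['record', 'start'])['start']
# keeps one row per unique (record, start): starts 0, 5, 7, 13, 19, 27, 38, 54
nexts = firsts.shift(-1)
# each kept row now holds the next unique start; the last becomes NaN
df['end'] = nexts.reindex(index=df.index, method='ffill')
# ffill copies each value onto the duplicate rows that were dropped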
This solution is equivalent to @Quixotic22's, although more explicit.
df = pd.DataFrame({
    'record': [1,2,3,4,4,5,6,7,7,7,8],
    'start': [0,5,7,13,13,19,27,38,38,38,54],
    'param': ['t','t','t','u','v','t','t','t','u','v','t']
})
max_end = 100
df["end"] = None  # create new column with empty values
loc = df["record"].shift(1) != df["record"]  # True on the first row of each new record
df.loc[loc, "end"] = df.loc[loc, "start"].shift(-1)  # assign the next record's start
df["end"] = df["end"].ffill()  # fill the remaining missing values forward
df.loc[df.index[-1], "end"] = max_end  # override last value
df
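Running this on the example data reproduces the expected values, with max_end standing in for the final NaN:
print(df['end'].tolist())
# expected: 5, 7, 13, 19, 19, 27, 38, 54, 54, 54, 100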

How to create a dataframe in the for loop?

I want to create a dataframe that consists of values obtained inside the for loop.
columns = ['BIN', 'Date_of_registration', 'Tax', 'TaxName', 'KBK',
           'KBKName', 'Paynum', 'Paytype', 'EntryType', 'Writeoffdate', 'Summa']
df = pd.DataFrame(columns=columns)
I have this for loop:
for elements in tree.findall('{http://xmlns.kztc-cits/sign}payment'):
    print("hello")
    tax = elements.find('{http://xmlns.kztc-cits/sign}TaxOrgCode').text
    tax_name_ru = elements.find('{http://xmlns.kztc-cits/sign}NameTaxRu').text
    kbk = elements.find('{http://xmlns.kztc-cits/sign}KBK').text
    kbk_name_ru = elements.find('{http://xmlns.kztc-cits/sign}KBKNameRu').text
    paynum = elements.find('{http://xmlns.kztc-cits/sign}PayNum').text
    paytype = elements.find('{http://xmlns.kztc-cits/sign}PayType').text
    entry_type = elements.find('{http://xmlns.kztc-cits/sign}EntryType').text
    writeoffdate = elements.find('{http://xmlns.kztc-cits/sign}WriteOffDate').text
    summa = elements.find('{http://xmlns.kztc-cits/sign}Summa').text
    print(tax, tax_name_ru, kbk, kbk_name_ru, paynum, paytype, entry_type, writeoffdate, summa)
How can I append the acquired values to the dataframe created outside the for loop?
A simple way, if you only need the dataframe after the loop is completed, is to append the data to a list of lists and then convert it to a dataframe. Caveat: responsibility is on you to make sure the list ordering matches the columns, so if you change your columns in the future you have to reposition the list.
list_of_rows = []
for elements in tree.findall('{http://xmlns.kztc-cits/sign}payment'):
    # ... extract tax, tax_name_ru, etc. exactly as in your loop, then:
    list_of_rows.append([tax, tax_name_ru, kbk, kbk_name_ru, paynum,
                         paytype, entry_type, writeoffdate, summa])
# note: the loop extracts nine fields, so `columns` must list the matching
# nine names - BIN and Date_of_registration are never parsed above
df = pd.DataFrame(columns=columns, data=list_of_rows)
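If you'd rather not depend on positional ordering, a variant of the same idea is to append dicts keyed by column name (the field-to-column mapping below is my assumption, not from the question):
list_of_rows = []
for elements in tree.findall('{http://xmlns.kztc-cits/sign}payment'):
    # extract the fields as in the question's loop, then:
    list_of_rows.append({'Tax': tax, 'TaxName': tax_name_ru, 'KBK': kbk,
                         'KBKName': kbk_name_ru, 'Paynum': paynum,
                         'Paytype': paytype, 'EntryType': entry_type,
                         'Writeoffdate': writeoffdate, 'Summa': summa})
# keys missing from a dict (e.g. BIN) simply become NaN in those columns
df = pd.DataFrame(list_of_rows, columns=columns)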

Pandas - Incrementally add to DataFrame

I'm trying to add rows and columns to pandas incrementally. I have a lot of data stored across multiple datastores and a heuristic to determine a value. As I navigate across this datastore, I'd like to be able to incrementally update a dataframe, where in some cases, either names or days will be missing.
import random
import pandas as pd

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            # note: DataFrame.append was removed in pandas 2.0; it is kept
            # here to illustrate the problem
            df = df.append({col: value, 'name': name}, ignore_index=True)
    df.set_index('name', inplace=True, drop=True)
    print(df.loc['Bill'])
This produces the following results:
        2016_1  2016_2  2016_3
name
Bill      15.0     NaN     NaN
Bill       NaN    12.0     NaN
I've created a heatmap of the data and it's blocky due to duplicate names, so the output I'm looking for is:
        2016_1  2016_2  2016_3
name
Bill      15.0    12.0     NaN
How can I combine these rows?
Is there a more efficient means of creating this dataframe?
Try this:
df.groupby('name')[df.columns.values].sum()
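One caveat: by default sum() turns all-NaN groups into 0. To keep NaN where no value existed, as in the desired output above, pass min_count:
df.groupby('name')[df.columns.values].sum(min_count=1)
# a cell stays NaN unless at least one real value was summed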
Or try:
df.pivot_table(index='name', aggfunc='sum', dropna=False)
After you run your foo() function, you can use any aggregation function on a groupby of df (if you have only one value per column and all the others are null). First, use reset_index to get back your name column. Then use groupby and apply. Here I propose a custom function which checks that there is only one value per column, and raises a ValueError if not.
df.reset_index(inplace=True)

def aggdata(x):
    if all([i <= 1 for i in x.count()]):
        return x.mean()
    else:
        raise ValueError

ddf = df.groupby('name').apply(aggdata)
If all but one of the values in a column are null, x.mean() will return that value (actually, you can use almost any aggregator: since there is only one value, that is the one returned).
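Since each column holds at most one real value per name, groupby('name').first() is a simpler equivalent (a sketch, assuming the same df):
ddf = df.groupby('name').first()
# first() returns the first non-null entry per column, collapsing the
# sparse rows exactly like the mean() aggregator above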
It would be easier to have the names as columns and the dates as the index instead. Plus, you can work with plain lists inside the loop and create the pd.DataFrame afterwards, e.g.:
import random
import numpy as np
import pandas as pd

year = 2016
names = ['Bill', 'Bob', 'Ryan']
index = []
valueBill = []
valueBob = []
valueRyan = []
for day in range(1, 4):
    if random.choice([True, False]):  # sometimes a name will be missing
        valueBill.append(random.randrange(0, 20))
        valueBob.append(random.randrange(0, 90))
        valueRyan.append(random.randrange(0, 200))
        index.append('{}-0{}'.format(year, day))  # date-like label, e.g. 2016-01
    else:
        valueBill.append(np.nan)
        valueBob.append(np.nan)
        valueRyan.append(np.nan)
        index.append(np.nan)

df = pd.DataFrame({})
for name, value in zip(names, [valueBill, valueBob, valueRyan]):
    df[name] = value
df = df.set_index(pd.to_datetime(index))  # set_index returns a copy, so assign it back
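Equivalently, the frame can be built in a single call from the same lists:
df = pd.DataFrame(dict(zip(names, [valueBill, valueBob, valueRyan])),
                  index=pd.to_datetime(index))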
You can append the entries with new names if they do not already exist, and then do an update to update the existing entries.
import pandas as pd
import random

def foo():
    df = pd.DataFrame()
    year = 2016
    names = ['Bill', 'Bob', 'Ryan']
    for day in range(1, 4, 1):
        for name in names:
            if random.choice([True, False]):  # sometimes a name will be missing
                continue
            value = random.randrange(0, 20, 1)  # random value from heuristic
            col = '{}_{}'.format(year, day)  # column name
            new_df = pd.DataFrame({col: value, 'name': name}, index=[1]).set_index('name')
            # append only the rows whose name is not yet in df ...
            df = pd.concat([df, new_df[~new_df.index.isin(df.index)].dropna()])
            # ... and update the rows that already exist
            df.update(new_df)
    print(df)
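The key is that DataFrame.update aligns on index and columns and only overwrites with non-NaN values; a tiny sketch of that behavior (data invented for illustration):
import pandas as pd

a = pd.DataFrame({'x': [1.0, None]}, index=['Bill', 'Bob'])
b = pd.DataFrame({'x': [None, 2.0]}, index=['Bill', 'Bob'])
a.update(b)  # only Bob's non-NaN value is copied; Bill keeps 1.0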

Check for each row in several columns whether the requirement is met, and append the result for each row - python

I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
A row should be flagged when, compared with some other row:
the cust_num is equal in both rows,
the Title is equal in both rows, and
the second_date in one row <= the end_date in the other row.
If all these requirements are met, the value True should be appended to a new column in the original row.
Because I'm working with a big dataset, I'm looking for an efficient way to do this. In this case only the first record should get a True value. I have looked at apply with lambda and the groupby function in python but couldn't find a way to make these work.
Try this (spontaneously I cannot come up with a faster method):
import pandas as pd

df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new col"] = False
for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(df.loc[indices]["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    if sub_df.loc[i, "second_date"] <= sub_df.loc[i, "end_date"]:
                        df.loc[i, "new col"] = True
                        break
First you need to make all date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now create a set of all unique customer numbers and iterate through it. For each customer number, get a list of all row indices with that customer number. If this list is longer than 1, you have several rows with the same customer number, so create a sub-dataframe containing just those rows. Then iterate through the set of titles. For each title, check whether it appears more than once in the sub-dataframe (len > 1). If it does, iterate through those rows and write True into the additional column of the first row where the date condition is met.
This should work. Also, from reading the comments, I am assuming that all cust_num values are unique.
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})

# parse day-first dates explicitly; without a format, pandas would read
# '01-09-2017' as January 9th
df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df['Value'] = False
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Let me know if this code works and whether you hit any errors.

How to add "order within group" column in pandas?

Take the following dataframe:
import pandas as pd
df = pd.DataFrame({'group_name': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'timestamp': [4, 6, 1000, 5, 8, 100],
                   'condition': [True, True, False, True, False, True]})
I want to add two columns:
The row's order within its group
The rolling sum of the condition column within each group
I know I can do it with a custom apply, but I'm wondering if anyone has any fun ideas? (Also this is slow when there are many groups.) Here's one solution:
def range_within_group(input_df):
    df_to_return = input_df.copy()
    df_to_return = df_to_return.sort_values('timestamp')  # .sort() was removed; use sort_values
    df_to_return['order_within_group'] = range(len(df_to_return))
    df_to_return['rolling_sum_of_condition'] = df_to_return.condition.cumsum()
    return df_to_return

df.groupby('group_name').apply(range_within_group).reset_index(drop=True)
GroupBy.cumcount does:
Number each item in each group from 0 to the length of that group - 1.
so simply:
>>> gr = df.sort_values('timestamp').groupby('group_name')
>>> df['order_within_group'] = gr.cumcount()
>>> df['rolling_sum_of_condition'] = gr['condition'].cumsum()
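On the example frame this gives (each group's rows already happen to be in timestamp order, and the results align back to df by index):
>>> df
  group_name  timestamp  condition  order_within_group  rolling_sum_of_condition
0          A          4       True                   0                         1
1          A          6       True                   1                         2
2          A       1000      False                   2                         2
3          B          5       True                   0                         1
4          B          8      False                   1                         1
5          B        100       True                   2                         2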
