Get a certain row where there is a value in pandas - Python

I want to build a list of DataFrames: each one should contain the rows up to and including the point where the value in column "a" changes from 0 to something nonzero.
The main dataframe:
a b c d e
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
5.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
10.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
15.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
20.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
The first DataFrame in the list I want:
a b c d e
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
5.0 0.0 0.0 0.0 0.0
In the next iteration the list should contain the first slice plus the rows up to the next change:
a b c d e
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
5.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
10.0 0.0 0.0 0.0 0.0
I have thousands of rows of data. I want to train on the first slice, then extend the list to the second slice, and so on.
The code I tried:
df1 = pd.read_csv(".csv")
df1['a'] = df1['a'].astype(int)
ar = []
i = 0
while True:
    if df1.iloc[i].Yield > 0:
        ar.append(df.iloc[:i])
        i = df1.iloc[i].Yield
    i += 1
    if i > len(df):
        break

Try:
result = [df.iloc[:i+1, :] for i in df[df.a.ne(0)].index]
NOTE: change the dtype of a if required; you can use a generator expression here if you only want to iterate once.
With generator expression:
result = (df.iloc[:i+1, :] for i in df[df.a.ne(0)].index)
next(result) # this will yield the next element.
Or, if you want to process each dataframe:
for i in df[df.a.ne(0)].index:
    temp_df = df.iloc[:i+1, :].copy(deep=True)
    # do the processing with temp_df
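As a minimal, self-contained sketch of this slicing idea (toy data assumed, not the asker's CSV):

```python
import pandas as pd

# toy stand-in for the CSV: "a" is zero except at the change points
df = pd.DataFrame({'a': [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 10.0],
                   'b': [0.0] * 8})

# one cumulative slice per nonzero value in "a"
result = [df.iloc[:i + 1, :] for i in df[df.a.ne(0)].index]

print(len(result))     # 2  (two change points -> two slices)
print(len(result[0]))  # 4  (first slice ends at the 5.0 row)
print(len(result[1]))  # 8  (second slice extends to the 10.0 row)
```

Each slice contains everything from the start of the frame up to and including one nonzero row, which matches the cumulative lists shown in the question.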

You could try this (df is your df1 after csv load):
import numpy as np
import pandas as pd
# group indexes
df['group'] = df['a'].replace(0, np.nan)
df['group'] = pd.Categorical(df['group'].bfill())  # backfill() is a deprecated alias of bfill()
df['group'] = df['group'].cat.codes
# create list of indexes you want to slice
l_indexes = list()
for e in df['group'].unique().tolist():
    l_indexes.append(df.index[df['group'] <= e])
# execute your code, based on the slices
for indexes in l_indexes:
    print(df.loc[indexes])

My Python while loop is not terminating and I don't know why

It seems like the while loop should terminate once start == 1, but it keeps going. It also seems it's not actually printing the values; it just prints 0.
Given a positive integer n, the following rules will always create a
sequence that ends with 1, called the hailstone sequence:
If n is even, divide it by 2
If n is odd, multiply it by 3 and add 1 (i.e. 3n + 1)
Continue until n is 1
Write a program that reads an
integer as input and prints the hailstone sequence starting with the
integer entered. Format the output so that ten integers, each
separated by a tab character (\t), are printed per line.
The output format can be achieved as follows: print(n, end='\t')
Ex: If the input is:
25
the output is:
25 76 38 19 58 29 88 44 22 11
34 17 52 26 13 40 20 10 5 16
8 4 2 1
My code:
''' Type your code here. '''
start = int()
while True:
    print(start, end='\t')
    if start % 2 == 0:
        start = start/2
        print(start, end='\t')
    elif start % 2 == 1:
        start = (start * 3) + 1
        print(start, end='\t')
    if start == 1:
        print(start, end='\t')
        break
print(start, end='\t')
Program errors displayed here
Program generated too much output.
Output restricted to 50000 characters.
Check program for any unterminated loops generating output.
Program output displayed here
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... (0.0 repeats until the output limit is reached)
Your loop isn't terminating because start = int() sets start to 0 without ever reading input, and since 0 % 2 == 0 is true and 0/2 = 0, you become stuck in an infinite loop. You could fix this by reading the input and raising an exception if start is <= 0, like this:
start = int(input())
if start <= 0:
    raise Exception('Start must be strictly positive')
while True:
    print(start, end='\t')
    if not start % 2:
        start //= 2
    else:
        start = 3*start + 1
    if start == 1:
        print(start, end='\t')  # print the final 1 before stopping
        break
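The exercise also asks for ten integers per line, which the loop above does not handle. A minimal sketch that separates the sequence from the formatting (assuming a positive input, here hard-coded to the example value 25):

```python
def hailstone_sequence(n):
    """Return the hailstone sequence starting at n, ending with 1."""
    seq = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        seq.append(n)
    return seq

# print ten tab-separated values per line, as the exercise asks
for i, value in enumerate(hailstone_sequence(25), start=1):
    print(value, end='\n' if i % 10 == 0 else '\t')
print()
```

Building the list first and formatting afterwards avoids mixing the termination check with the print logic, which is what made the original loop hard to debug.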

Python pandas: converting column values into other columns

I have dataframe which looks like below:
df:
Review_Text Noun Thumbups
Would be nice to be able to import files from ... [My, Tracks, app, phone, Google, Drive, import... 1.0
No Offline Maps! It used to have offline maps ... [Offline, Maps, menu, option, video, exchange,... 18.0
Great application. Designed with very well tho... [application, application] 16.0
Great App. Nice and simple but accurate. Wish ... [Great, App, Nice, Exported] 0.0
Save For Offline - This does not work. The rou... [Save, Offline, route, filesystem] 12.0
Since latest update app will not run. Subscrip... [update, app, Subscription, March, application] 9.0
Great app. Love it! And all the things it does... [Great, app, Thank, work] 1.0
I have paid for subscription but keeps telling... [subscription, trial, period] 0.0
Error: The route cannot be save for no locatio... [Error, route, i, GPS] 0.0
When try to restore my tracks it says "unable ... [try, file, locally-1] 0.0
Was a good app but since the update it only re... [app, update, metre] 2.0
Based on the 'Noun' column values, I want to create other columns. For example, all values of the Noun column from the first row become columns, and those columns hold the value of the 'Thumbups' column. If a column name is already present in the dataframe, the 'Thumbups' value is added to the existing value of that column.
I was trying to implement this using pivot_table:
pd.pivot_table(latest_review,columns='Noun',values='Thumbups')
But got following error:
TypeError: unhashable type: 'list'
Could anyone help me in fixing the issue?
Use Series.str.join with Series.str.get_dummies to create dummy columns, then multiply by column Thumbups with DataFrame.mul:
df1 = df['Noun'].str.join('|').str.get_dummies().mul(df['Thumbups'], axis=0)
print (df1)
App Drive Error Exported GPS Google Great Maps March My Nice \
0 0.0 10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Offline Save Subscription Thank Tracks app application exchange \
0 0.0 0.0 0.0 0.0 10.0 10.0 0.0 0.0
1 180.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0
2 0.0 0.0 0.0 0.0 0.0 0.0 160.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 120.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 90.0 0.0 0.0 90.0 90.0 0.0
6 0.0 0.0 0.0 10.0 0.0 10.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
file filesystem i import locally-1 menu metre option period \
0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 180.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
phone route subscription trial try update video work
0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
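The get_dummies mechanics can be checked on a toy frame (hypothetical two-review data, not the real review set):

```python
import pandas as pd

# hypothetical two-review frame standing in for the real data
df = pd.DataFrame({
    'Noun': [['app', 'Maps'], ['app']],
    'Thumbups': [1.0, 18.0],
})

# join each list into 'app|Maps', expand into 0/1 dummy columns,
# then scale each row by its Thumbups value
df1 = df['Noun'].str.join('|').str.get_dummies().mul(df['Thumbups'], axis=0)
print(df1)
```

Row 0 gets 1.0 in both the app and Maps columns; row 1 gets 18.0 in app only. Summing df1 over rows would then accumulate Thumbups for nouns that repeat across reviews.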
Alternatively, unpack the Noun lists into one row per noun and then pivot:
rows = []
# unpack the Noun column's list values and store them in rows
_ = df.apply(lambda row: [rows.append([row['Review_Text'], row['Thumbups'], nn])
                          for nn in row.Noun], axis=1)
# create a new dataframe with the unpacked values
df_new = pd.DataFrame(rows, columns=['Review_Text', 'Thumbups', 'Noun'])
# now do the pivot operation on df_new
pivot_df = df_new.pivot(index='Review_Text', columns='Noun')
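A newer alternative (pandas >= 0.25) is DataFrame.explode, which does the same unpacking in one call; a minimal sketch on toy data:

```python
import pandas as pd

# toy stand-in for the review data: list-valued 'Noun' plus 'Thumbups'
df = pd.DataFrame({
    'Noun': [['app', 'Maps'], ['app'], ['Maps', 'route']],
    'Thumbups': [1.0, 18.0, 12.0],
})

# one row per (review, noun) pair, then sum Thumbups over repeated nouns
totals = df.explode('Noun').groupby('Noun')['Thumbups'].sum()
wide = totals.to_frame().T  # nouns become columns of a single-row frame
print(wide)
```

Here 'app' appears in two reviews (1.0 + 18.0 = 19.0) and 'Maps' in two (1.0 + 12.0 = 13.0), so repeated nouns accumulate their Thumbups as the question asks.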

Concat dataframes/series with axis=1 in a loop

I have a dataframe of email senders as follows.
I am trying to get as output a dataframe with the number of emails sent by each person per month.
I want the index to be the month end and the columns to be the persons.
I am able to build this, but with two issues:
First, I am using multiple pd.concat statements (all the df_temps), which is ugly and does not scale.
Is there a way to put this in a for loop, or some other way to loop over, say, the first n persons?
Second, while it puts all the data together correctly, there is a discontinuity in the index.
The second-last row is 1999-01-31 and the last one is 2000-01-31.
Is there an option or a way to get NaN for the in-between months?
Code below:
import pandas as pd
df_in = pd.DataFrame({
'sender':['Able Boy','Able Boy','Able Boy','Mark L. Taylor','Mark L. Taylor',
'Mark L. Taylor','scott kirk','scott kirk','scott kirk','scott kirk',
'Able Boy','Able Boy','james h. madison','james h. madison','james h. madison',
'james joyce','scott kirk','james joyce','james joyce','james joyce',
'james h. madison','Able Boy'],
'receiver':['Toni Z. Zapata','Mark Angel','scott kirk','paul a boyd','michelle fam',
'debbie bradford','Mark Angel','Johnny C. Cash','Able Boy','Mark L. Taylor',
'jenny chang','julie s. smith', 'scott kirk', 'tiffany r.','Able Boy',
'Mark Angel','Able Boy','julie s. smith','jenny chang','debbie bradford',
'Able Boy','Toni Z. Zapata'],
'time':[911929000000,911929000000,910228000000,911497000000,911497000000,
911932000000,914261000000,914267000000,914269000000,914276000000,
914932000000,915901000000,916001000000,916001000000,916001000000,
947943000000,947943000000,947943000000,947943000000,947943000000,
916001000000,911929100000],
'email_ID':['<A34E5R>','<A34E5R>','<B34E5R>','<C34E5R>','<C34E5R>',
'<C36E5R>','<C36E5A>','<C36E5B>','<C36E5C>','<C36E5D>',
'<D000A0>','<D000A1>','<D000A2>','<D000A2>','<D000A2>',
'<D000A3>','<D000A3>','<D000A3>','<D000A3>','<D000A3>',
'<D000A4>','<A34E5S>']
})
df_in['time'] = pd.to_datetime(df_in['time'],unit='ms')
df_1 = df_in.copy()
df_1['number'] = 1
df_2 = df_1.drop_duplicates(subset="email_ID",keep="first",inplace=False)\
.reset_index()
df_3 = df_2.drop(columns=['index','receiver','email_ID'],inplace=False)
df_6 = df_3.groupby(['sender',pd.Grouper(key='time',freq='M')]).sum()
df_6_squeezed = df_6.squeeze()
df_grp_1 = df_3.groupby(['sender']).count()
df_grp_1.sort_values(by=['number'],ascending=False,inplace=True)
toppers = list(df_grp_1.index.array)
df_temp_1 = df_6_squeezed[toppers[0]]
df_temp_2 = df_6_squeezed[toppers[1]]
df_temp_3 = df_6_squeezed[toppers[2]]
df_temp_4 = df_6_squeezed[toppers[3]]
df_temp_5 = df_6_squeezed[toppers[4]]
df_temp_1.rename(toppers[0],inplace=True)
df_temp_2.rename(toppers[1],inplace=True)
df_temp_3.rename(toppers[2],inplace=True)
df_temp_4.rename(toppers[3],inplace=True)
df_temp_5.rename(toppers[4],inplace=True)
df_concat_1 = pd.concat([df_temp_1,df_temp_2],axis=1,sort=False)
df_concat_2 = pd.concat([df_concat_1,df_temp_3],axis=1,sort=False)
df_concat_3 = pd.concat([df_concat_2,df_temp_4],axis=1,sort=False)
df_concat_4 = pd.concat([df_concat_3,df_temp_5],axis=1,sort=False)
print("\nCONCAT (df_concat_4):")
print(df_concat_4)
print(type(df_concat_4))
Consider pivot_table after calculating month_end (see @Root's answer), and use reindex to fill in the missing months. In pandas, grouping aggregations such as counting senders per month usually require neither looping nor temporary helper data frames.
from pandas.tseries.offsets import MonthEnd
df_in['month_end'] = (df_in['time'] + MonthEnd(0)).dt.normalize()
agg_df = (df_in.pivot_table(index='month_end', columns='sender', values='time', aggfunc='count')
               .reindex(pd.date_range('1998-01-01', '2000-01-31', freq='M'), axis='index')
               .fillna(0))
Output
print(agg_df)
# sender Able Boy Mark L. Taylor james h. madison james joyce scott kirk
# month_end
# 1998-01-31 0.0 0.0 0.0 0.0 0.0
# 1998-02-28 0.0 0.0 0.0 0.0 0.0
# 1998-03-31 0.0 0.0 0.0 0.0 0.0
# 1998-04-30 0.0 0.0 0.0 0.0 0.0
# 1998-05-31 0.0 0.0 0.0 0.0 0.0
# 1998-06-30 0.0 0.0 0.0 0.0 0.0
# 1998-07-31 0.0 0.0 0.0 0.0 0.0
# 1998-08-31 0.0 0.0 0.0 0.0 0.0
# 1998-09-30 0.0 0.0 0.0 0.0 0.0
# 1998-10-31 0.0 0.0 0.0 0.0 0.0
# 1998-11-30 4.0 3.0 0.0 0.0 0.0
# 1998-12-31 1.0 0.0 0.0 0.0 4.0
# 1999-01-31 1.0 0.0 4.0 0.0 0.0
# 1999-02-28 0.0 0.0 0.0 0.0 0.0
# 1999-03-31 0.0 0.0 0.0 0.0 0.0
# 1999-04-30 0.0 0.0 0.0 0.0 0.0
# 1999-05-31 0.0 0.0 0.0 0.0 0.0
# 1999-06-30 0.0 0.0 0.0 0.0 0.0
# 1999-07-31 0.0 0.0 0.0 0.0 0.0
# 1999-08-31 0.0 0.0 0.0 0.0 0.0
# 1999-09-30 0.0 0.0 0.0 0.0 0.0
# 1999-10-31 0.0 0.0 0.0 0.0 0.0
# 1999-11-30 0.0 0.0 0.0 0.0 0.0
# 1999-12-31 0.0 0.0 0.0 0.0 0.0
# 2000-01-31 0.0 0.0 0.0 4.0 1.0
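An equivalent loop-free sketch uses groupby/unstack with monthly periods, then reindexes over the full range so the gap months appear as zero rows (toy data assumed, not the full df_in):

```python
import pandas as pd

# toy stand-in for df_in: one timestamp per email
df = pd.DataFrame({
    'sender': ['a', 'a', 'b', 'a'],
    'time': pd.to_datetime(['1998-11-03', '1998-11-20',
                            '1998-12-05', '1999-02-14']),
})

# count emails per (month, sender), senders as columns
counts = (df.assign(month=df['time'].dt.to_period('M'))
            .groupby(['month', 'sender'])
            .size()
            .unstack('sender', fill_value=0))

# reindex over the full monthly range so gap months become zero rows
full = pd.period_range(counts.index.min(), counts.index.max(), freq='M')
counts = counts.reindex(full, fill_value=0)
print(counts)
```

The toy data has no emails in 1999-01, so that month shows up as an all-zero row rather than being silently skipped, which addresses the discontinuity in the question.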

Can't Re-Order Columns Data

My dataframe's columns are not in sequence. len(df.columns) shows the data has 3586 columns. How can I re-order the columns into numeric sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it is still not working, because it sorts the names lexicographically, so V10 comes before V2.
Thank you.
First get all columns that do not match the pattern V + number by filtering with str.contains. Then sort the remaining columns (obtained via Index.difference) by their numeric suffix, concatenate the two lists, and pass the result to DataFrame.reindex. This places the non-matching columns first, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
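The numeric-sort idea can be checked on a toy set of columns (a hypothetical small frame, not the 3586-column data):

```python
import pandas as pd

# hypothetical small frame; the real data has 3586 columns
df = pd.DataFrame(columns=['ID', 'V1', 'V10', 'V100', 'V2'])

# non-V columns stay first; V columns sort by their numeric suffix
non_v = [c for c in df.columns if not (c.startswith('V') and c[1:].isdigit())]
v_cols = sorted((c for c in df.columns if c not in non_v),
                key=lambda c: int(c[1:]))
df = df.reindex(non_v + v_cols, axis=1)

print(list(df.columns))  # ['ID', 'V1', 'V2', 'V10', 'V100']
```

Sorting by int(c[1:]) rather than by the string itself is what puts V2 before V10, which plain sorted(df.columns) cannot do.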

How to drop columns of Pandas DataFrame with zero values in the last row

I have a Pandas Dataframe which tells me monthly sales of items in shops
df.head():
ID month sold
0 150983 0 1.0
1 56520 0 13.0
2 56520 1 7.0
3 56520 2 13.0
4 56520 3 8.0
I want to remove all IDs where there were no sales last month. I.e. month == 33 & sold == 0. Doing the following
unwanted_df = df[((df['month'] == 33) & (df['sold'] == 0.0))]
I just get 46 rows, which is far too few. But never mind; I would like to have the data in a different format anyway. A pivoted version of the above table is just what I want:
pivoted_df = df.pivot(index='month', columns = 'ID', values = 'sold').fillna(0)
pivoted_df.head()
ID 0 2 3 5 6 7 8 10 11 12 ... 214182 214185 214187 214190 214191 214192 214193 214195 214197 214199
month
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Question. How to remove columns with the value 0 in the last row in pivoted_df?
You can do this with one line:
pivoted_df = pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1, :] == 0], axis=1)
I want to remove all IDs where there were no sales last month
You can first calculate the IDs satisfying your condition:
id_selected = df.loc[(df['month'] == 33) & (df['sold'] == 0), 'ID']
Then filter these from your dataframe via a Boolean mask:
df = df[~df['ID'].isin(id_selected)]
Finally, use pd.pivot_table with your filtered dataframe.
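The one-line drop can be checked on a toy pivot (hypothetical data; column B stands for an ID with no sales in the last month):

```python
import pandas as pd

# hypothetical pivoted frame: rows are months, columns are item IDs
pivoted_df = pd.DataFrame(
    {'A': [1.0, 2.0, 3.0], 'B': [1.0, 0.0, 0.0], 'C': [0.0, 0.0, 4.0]},
    index=pd.Index([0, 1, 2], name='month'))

# drop every ID column whose last-row (latest-month) value is zero
trimmed = pivoted_df.drop(pivoted_df.columns[pivoted_df.iloc[-1, :] == 0],
                          axis=1)

print(list(trimmed.columns))  # ['A', 'C']
```

Only A and C sold in the last month, so B is removed, matching the "no sales last month" criterion from the question.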
