I have spent quite sometime on this now. The count of values in column 'Lead Type' displayed by using pivot_tables is incorrect when I count the same values by using filter in the excel file.
import pandas as pd
import openpyxl
import warnings
import numpy as np
with warnings.catch_warnings(record=True):
warnings.simplefilter("always")
datam = pd.read_excel(r"C:\Users\Complinity\Desktop\Marketing\CRM\Reports\CRM-Data 1 -31 July 2022.xlsx",engine="openpyxl", skiprows=[0,1])
datam = pd.DataFrame(datam, columns= ['Source','Medium','Keywords','Lead Type'])
datam_pivot_table = datam.pivot_table(index="Source",columns = "Lead Type", aggfunc="count", values="Medium",margins = True)
datam_pivot_table
Also when I used groupby function on the same dataframe to achieve the same result, my answers are in accordance with excel values.
new = datam.groupby(['Source', 'Lead Type']).agg({'Lead Type': ['count']}).reset_index()
new.columns = ['Source', 'Lead Type', 'Lead Count']
print (new)
new['Lead Count'].sum()
So why is pivot_table not working? Is it a cache issue or something else?
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a python program. Somewhere in my program I have scores collected for a match stored in a list, say x = [1,0,4]. I have found where in the data these scores exist using pandas and I can print "found" or "not found". However I want my code to print out to which name these scores correspond to. In this case the program should output "charlie" since charlie has all these values [1,0,4]. how can I do that?
I will have a large set of data so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easiest if you directly set name as the index column when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want because it does element-wise compare and returns a row of True/False. So you need to aggregate those rows with .all(axis=1) which finds rows where all comparison results were True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
You need to use DataFrame.loc which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': id, 'match1': match1, 'match2':match2, 'match3' :match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] ==x[2])])
# Loop that prints out the name based on the index of mydf
# Assuming there are more than one name, it will print all. if there is only one name, it will print only that)
for i in range(0,len(mydf)):
print(mydf['name'].iloc[i])
you can use this
here data is your Data frame ,you can change accordingly your data frame name,
and
considering [1,0,4] is int type
data = data[(data['match1']== 1)&(data['match2']==0)&(data['match3']== 4 ).index
print(data[0])
if data is object type then use this
data = data[(data['match1']== "1")&(data['match2']=="0")&(data['match3']== "4" ).index
print(data[0])
I am a very beginner in programming and trying to learn to code. so please bear with my bad coding. I am using pandas to find a string from a column (Combinations column in the below code ) in the data frame and print the entire row containing the string . Find the code below. Basically I need to find all the instances where the string occurs , and print the entire row .find my code below . I am not able to figure out how to find that particular instance of the column and print it .
import pandas as pd
data = pd.read_csv("signallervalues.csv",index_col=False)
data.head()
data['col1'] = data['col1'].astype(str)
data['col2'] = data['col2'].astype(str)
data['col3'] = data['col3'].astype(str)
data['col4'] = data['col4'].astype(str)
data['col5']= data['col5'].astype(str)
data.head()
combinations= data['Col1']+data['col2'] + data['col3'] + data['col4'] + data['col5']
data['combinations']= combinations
print(data.head())
list_of_combinations = data['combinations'].to_list()
print(list_of_combinations)
for i in list_of_combinations:
if data['combinations'].str.contains(i).any():
print(i+ 'data occurs in row' )
# I need to print the row containing the string here
else:
print(i +'is occuring only once')
my data frame looks like this
import pandas as pd
data=pd.DataFrame()
# recreating your data (more or less)
data['signaller']= pd.Series(['ciao', 'ciao', 'ciao'])
data['col6']= pd.Series(['-1-11-11', '11', '-1-11-11'])
list_of_combinations=['11', '-1-11-11']
data.reset_index(inplace=True)
# group by the values of column 6 and counting how many times they occur
g=data.groupby('col6')['index']
count= pd.DataFrame(g.count())
count=count.rename(columns={'index':'occurences'})
count.reset_index(inplace=True)
# create a df that keeps only the rows in the list 'list_of_combinations'
count[~count['col6'].isin(list_of_combinations)== False]
My result
I have dataframe like this.
import pandas as pd
#create dataframe
df= pd.DataFrame({"Date":range(0,22),
"Country":["USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA",],
"Number":[0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]
"Number is Corrected":[0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
But this dataframe is have a problem. Some numbers are wrong.
Previous number always has to be smaller than next number(6,4,6,,7,8,7...50,75,60,45,100)
I don't use df.sort because it's not about sorting it's about correction.
Edit: I added corrected numbers in "number is corrected" column.
guessing from your 'Number corrected' list, you could probably use this:
import pandas as pd
#create dataframe
df= pd.DataFrame({"Date":range(0,22),
"Country":["USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA","USA",],
"Number":[0,0,0,0,0,1,1,3,5,6,4,6,7,8,7,10,25,50,75,60,45,100]})
# "Number is Corrected":[0,0,0,0,0,1,1,3,5,6,6,6,7,7,7,10,25,50,50,60,60,100]})
def correction():
df['Number is Corrected'] = df['Number']
cache = 0
for num, content in enumerate(df['Number is Corrected'], start=0):
if(df['Number is Corrected'][num] < cache):
df['Number is Corrected'][num] = cache
else:
cache = df['Number is Corrected'][num]
print(df)
if __name__ == "__main__":
correction()
but there is some inconsistency, like your conversation with jezrael. Evtl. you'll need to update the logic of the code, if it gets clearer, what the output you wished. Good luck.
I wrote an function which only depends on a dataframe. The functions output is also a dataframe. I would like make different dataframes according a condition and save them as different datasets with different names. However I couldnt save them as dataframes with different names. Instead i manually do the process. Is there a code which would do the same. It would be much beneficial.
import os
import numpy as np
import pandas as pd
data1 = pd.read_csv('C:/Users/Oz/Desktop/vintage/vintage1.csv', encoding='latin-1')
product_list= data1['product_types'].unique()
def vintage_table(df):
df['Disbursement_Date']=pd.to_datetime(df.Disbursement_Date)
df['Closing_Date']=pd.to_datetime(df.Closing_Date)
df['NPL_date']=pd.to_datetime(df.NPL_date, errors='ignore')
df['NPL_date_period']=df.loc[df.NPL_date > '2015-01-01', 'NPL_date'].apply(lambda x: x.strftime('%Y-%m'))
df['Dis_date_period'] = df.Disbursement_Date.apply(lambda x: x.strftime('%Y-%m'))
df['diff']=((df.NPL_date-df.Disbursement_Date) / np.timedelta64(3, 'M')).round(0)
df=df.groupby(['Dis_date_period','NPL_date_period']).agg({'Dis_amount' : 'sum', 'NPL_amount' : 'sum', 'diff' : 'mean'})
df.reset_index(level=0, inplace=True)
df['Vintage_Ratio']=df['NPL_amount']/df['Dis_amount']
table=pd.pivot_table(df,values='Vintage_Ratio',index='Dis_date_period',columns=['diff'],).fillna(0)
return
The above is the function
#for e in product_list:
# sub = data1[data1['product_types'] == e]
# print(sub)
consumer = data1[data1['product_types'] == product_list[0]]
mortgage = data1[data1['product_types'] == product_list[1]]
vehicle = data1[data1['product_types'] == product_list[2]]
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vehicle)
I would like to improve this part is there a better way to do the same process?
You could have your vintage_table() function return a dataframe instead of just modifying one dataframe over and over and that way you could do this in the second code block:
table_con = vintage_table(consumer)
table_mor = vintage_table(mortgage)
table_veh = vintage_table(vechicle)