Subtracting DataFrames resulting in unexpected numbers - python

I'm trying to subtract one data frame from another which all results should result in a 0 or blank based on the data in each my current excel files but will result in 0, 1, 2, or blank in the future. While some do result in a 0 or blank I'm also getting a -1 and 1. Any help that can be provided will be appreciated.
The two Excel sheets are identical except for number changes in second column.
Example
ExternalId TotalInteractions
name1 1
name2 2
name3 2
name4 1
Both sheets will look like the example and the output will look the same. I just need the difference between the two sheets
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = df1['ExternalId']
df4 = df1['TotalInteractions']
df5 = df2['TotalInteractions']
df6 = df4.sub(df5)
frames = (df3, df6)
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()
I managed to create a partial answer to getting the unexpected numbers. My problem now is that NewInter has more names than PrevInter does. Which results in a blank in TotalInteractions next to the new ExternalId. Any idea how to make it if it there is a blank to accept the value from NewInter?
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
df4 = df3['TotalInteractions_x']
df5 = df3['TotalInteractions_y']
df6 = df3['ExternalId']
df7 = df4 - df5
frames = [df6,df7]
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()

Figured out the issues. First part needed to be merged in order for the subtraction to work as the dataframes are not the same size. Also had to add in fill_value = 0 so it would take information from the new file.
def GCList():
df1 = pd.read_excel('NewInter.xlsx')
df2 = pd.read_excel('PrevInter.xlsx')
df3 = pd.merge(df1, df2, on = 'ExternalId', how = 'outer')
df4 = df3['TotalInteractions_x']
df5 = df3['TotalInteractions_y']
df6 = df3['ExternalId']
df7 = df4.sub(df5, fill_value = 0)
frames = [df6,df7]
df = pd.concat(frames, axis = 1)
df.to_excel('GCList.xlsx')
GCList()

Related

Why i concat() 3 df and still NaN valaue?

my code:
dfs = [df_uk_rfmt, df_uk_clv, df_uk_prod_pen]
final_df = pd.concat(dfs, axis = 1)
final_df.head()
And my new df looks like this:
but when I using Microsoft Query, Some NaN has value for example on CustomerID 12748 on this pic:
PS. All df index = CustomerID
My purpose is to join 3 data frames with full outer.
Thank you so much for your help.
Before defining dfs you need to make sure you do not have MultiIndex. So, do this:
df_uk_rfmt = df_uk_rfmt.reset_index()
df_uk_clv = df_uk_clv.reset_index()
df_uk_prod_pen = df_uk_prod_pen.reset_index()
Then
dfs = [df_uk_rfmt, df_uk_clv, df_uk_prod_pen]
final_df = pd.concat(dfs, axis = 1)
final_df.head()

Create dataframe conditionally to other dataframe elements

Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'],'A': [63.63,64.08,64.19,65.11,65.36,65.25,65.36], 'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36], 'C':[63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name':['A','B','C'],'Notice': ['05.05.1982','07.05.1982','12.05.1982']})
The idea is to create df3 such that this dataframe takes the value of A until A's notice date (found in df2) is reached, then df3 switches to the values of B until B's notice date is reached and so on. When we are during notice date, it should take the mean between the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date':['03.05.1982','04.05.1982','05.05.1982','06.05.1982','07.05.1982','10.05.1982','11.05.1982'], 'Result':[63.63,64.08,(64.19+64.19)/2,65.08,(65.33+65.41)/2,65.36,65.44]})
My idea was to first create a temporary dataframe with same dimensions as df1 and to fill it with 1's when the index date is prior to notice and 0's after. Doing a rolling mean with window 1 would give for each column a series of 1 until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p,df_t):
return np.where(df_p.index > df_t[df_t.Name==df_p.name]['Notice'][0], 0, 1)
df1['date'] = pd.to_datetime(df1['date'])
df2['notice'] = pd.to_datetime(df2['notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] =df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
df3['Result'] = df1.apply(lambda row: row[row['t']],axis =1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges in both dataframes and then use iloc to substitute the specific values
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0
#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name ='Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp,df2])
for i in range(len(df2)-1):
startDate = df2.iloc[i]['Notice']
endDate = df2.iloc[i+1]['Notice']
name = df2.iloc[i+1]['Name']
indices = [df1.date.between(startDate, endDate, inclusive=True)][0]
df3.loc[indices,'Result'] += df1[indices][name]
df3.loc[indices,'count'] += 1
df3.Result = df3.apply(lambda x : x.Result/x['count'], axis = 1)

Python every three rows to columns using pandas

I have a text file with data that repeoates every 3 rows. Lets say it is hash, directory, sub directory. The data looks like the following:
a3s2d1f32a1sdf321asdf
Dir_321321
Dir2_asdf
s21a3s21d3f21as32d1f
Dir_65465
Dir2_werq
asd21231asdfa3s21d
Dir_76541
Dir2_wbzxc
....
I have created a python script that takes the data and every 3 rows creates columns:
import pandas as pd
df1 = pd.read_csv('RogTest/RogTest.txt', delimiter = "\t", header=None)
df2 = df1[df1.index % 3 == 0]
df2 = df2.reset_index(drop=True)
df3 = df1[df1.index % 3 == 1]
df3 = df3.reset_index(drop=True)
df4 = df1[df1.index % 3 == 2]
df4 = df4.reset_index(drop=True)
df5 = pd.concat([df2, df3], axis=1)
df6 = pd.concat([df5, df4], axis=1)
#Rename columns
df6.columns = ['Hash', 'Dir_1', 'Dir_2']
#Write to csv
df6.to_csv('RogTest/RogTest.csv', index=False, header=True)
This works fine but I am curious if there is a more efficient way to do this aka less code?
You can use:
df_final = pd.DataFrame(np.reshape(df.values,(3, df.shape[0]/3)))
df_final.columns = ['Hash', 'Dir_1', 'Dir_2']
Output:
Hash Dir_1 Dir_2
0 a3s2d1f32a1sdf321asdf Dir_321321 Dir2_asdf
1 s21a3s21d3f21as32d1f Dir_65465 Dir2_werq
2 asd21231asdfa3s21d Dir_76541 Dir2_wbzxc

Pandas extract the first year from dataframe

I have this large data frame and I need to when certain resource are available for the first time. Let me explain it from my code.
df1 = df[df['Resource_ID'] == 1348]
df1 = df1[['Format', 'Range_Start', 'Number']]
df1["Range_Start"] = df1["Range_Start"].str[:7]
df1 = df1.groupby(['Format', 'Range_Start'], as_index=True).last()
pd.options.display.float_format = '{:,.0f}'.format
df1 = df1.unstack()
df1.columns = df1.columns.droplevel()
df2 = df1[1:4].sum(axis=0)
df2.name = 'sum'
df2 = df1.append(df2)
df3 = df2.T[['entry', 'sum']].copy()
df3.index = pd.to_datetime(df3.index)
Now print(df3.first('1D')) gives the following output:
Format entry sum
Range_Start
2011-07-01 97 72
I can now see that Resource_ID 1348 first occurs on 2011-07-01, how do I extract only the Year from this information?
This is my sample input csv data:
Access_Stat_ID,Resource_ID,Range_Start,Range_End,Name,Format,Number,Matched_URL
1,15,"2009-03-01 00:00:00","2009-03-31 23:59:59","Mar 2009","entry",3,""
203,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","entry",18,""
204,13,"2009-04-01 00:00:00","2009-04-30 23:59:59","Apr 2009","pdf",7,""
It seems need:
first_year = df3.index.year[0]

"Can't convert float Nan to int" but no Nan?

I have a dataframe and attempt to make the following operation:
data['SD_rates']=np.array([int((data['actual value'][i]-data['means'][i])/data['std'][i]) for i in range (len(data['means']))])
It breaks with the following message:
"Can't convert float Nan to int"
It is an error I understand but tested the df with data.isnull() and no column involved includes NaN (I controlled it manually by sending data.to_csv).
I even filled data['std'] with fillna(-1, inplace=True) but still, it breaks. I don't understand why, since there is no division by 0 (i also controlled that there were no zeros in this column, so no original 0 and Null/Nan filled with -1), and actual values and means are fillna(0) for missing values, and anyway the substraction can't produce a nan (data range in [0-10]).
What could be wrong? (as i said, the data right before triggering the operation is correct...). Thanks
Here is a code snippet:
One of my hypotheses is that in some way, groupby might generate NaN, that I can't get rid off when calculating my means (but I believed that it was ignored by pandas automatically...) and that are not filled with 0 or -1 (I chose -1 for standard deviation deliberately to avoid dividing by 0).
def stats_setting(data):
print('Stats settings')
print(data.columns)
print(data.dtypes)
#sys.exit()
data['marks']=np.log1p(data['marks'].astype(float))
data['students']=np.log1p(data['students'].astype(float))#Rossman9 think this has to be tested
#were filled with fillna before)
#First Part: by studentType and Assortment
types_DoM_select=['Type','Type2','Category']
#First Block:types_DoM students grouped by categories
#wonder if can do a groupby of groupb
print("types_DoM_marks_means")
types_DoM_marks_means = data.groupby(types_DoM_select)['marks'].mean()
types_DoM_marks_means.name = 'types_DoM_marks_means'
types_DoM_marks_means = types_DoM_marks_means.reset_index()
data = pd.merge(data, types_DoM_marks_means, on = types_DoM_select, how='left')
print("types_DoM_students_means")
types_DoM_students_means = data.groupby(types_DoM_select)['students'].mean() #.students won't work. Why?
types_DoM_students_means.name = 'types_DoM_students_means'
types_DoM_students_means=types_DoM_students_means.reset_index()
data = pd.merge(data, types_DoM_students_means, on = types_DoM_select, how='left')
print("types_DoM_marks_medians")
types_DoM_marks_medians = data.groupby(types_DoM_select)['marks'].median()
types_DoM_marks_medians.name = 'types_DoM_marks_medians'
types_DoM_marks_medians = types_DoM_marks_medians.reset_index()
data = pd.merge(data, types_DoM_marks_medians, on = types_DoM_select, how='left')
print("types_DoM_students_medians")
types_DoM_students_medians = data.groupby(types_DoM_select)['students'].median() #.students won't work. Why?
types_DoM_students_medians.name = 'types_DoM_students_medians'
types_DoM_students_medians=types_DoM_students_medians.reset_index()
data = pd.merge(data, types_DoM_students_medians, on = types_DoM_select, how='left')
print("types_DoM_marks_std")
types_DoM_marks_std = data.groupby(types_DoM_select)['marks'].std()
types_DoM_marks_std.name = 'types_DoM_marks_std'
types_DoM_marks_std = types_DoM_marks_std.reset_index()
data = pd.merge(data, types_DoM_marks_std, on = types_DoM_select, how='left')
print("types_DoM_students_std")
types_DoM_students_std = data.groupby(types_DoM_select)['students'].std()
types_DoM_students_std.name = 'types_DoM_students_std'
types_DoM_students_std = types_DoM_students_std.reset_index()
data = pd.merge(data, types_DoM_students_std, on = types_DoM_select, how='left')
data['types_DoM_marks_means'].fillna(-1, inplace=True)
data['types_DoM_students_means'].fillna(-1, inplace=True)
data['types_DoM_marks_medians'].fillna(-1, inplace=True)
data['types_DoM_students_medians'].fillna(-1, inplace=True)
data['types_DoM_marks_std'].fillna(-1, inplace=True)
data['types_DoM_students_std'].fillna(-1, inplace=True)
#Second Part: by specific student
student_DoM_select=['Type','Type2','Category']
#First Block:student_DoM
#wonder if can do a groupby of groupb
print("student_DoM_marks_means")
student_DoM_marks_means = data.groupby(student_DoM_select)['marks'].mean()
student_DoM_marks_means.name = 'student_DoM_marks_means'
student_DoM_marks_means = student_DoM_marks_means.reset_index()
data = pd.merge(data, student_DoM_marks_means, on = student_DoM_select, how='left')
print("student_DoM_students_means")
student_DoM_students_means = data.groupby(student_DoM_select)['students'].mean() #.students won't work. Why?
student_DoM_students_means.name = 'student_DoM_students_means'
student_DoM_students_means=student_DoM_students_means.reset_index()
data = pd.merge(data, student_DoM_students_means, on = student_DoM_select, how='left')
print("student_DoM_marks_medians")
student_DoM_marks_medians = data.groupby(student_DoM_select)['marks'].median()
student_DoM_marks_medians.name = 'student_DoM_marks_medians'
student_DoM_marks_medians = student_DoM_marks_medians.reset_index()
data = pd.merge(data, student_DoM_marks_medians, on = student_DoM_select, how='left')
print("student_DoM_students_medians")
student_DoM_students_medians = data.groupby(student_DoM_select)['students'].median() #.students won't work. Why?
student_DoM_students_medians.name = 'student_DoM_students_medians'
student_DoM_students_medians=student_DoM_students_medians.reset_index()
data = pd.merge(data, student_DoM_students_medians, on = student_DoM_select, how='left')
# May I use data['marks','students','marksMean','studentsMean','marksMedian','studentsMedian']=data['marks','students','marksMean','studentsMean','marksMedian','studentsMedian'].astype(int) to spare memory?
print("student_DoM_marks_std")
student_DoM_marks_std = data.groupby(student_DoM_select)['marks'].std()
student_DoM_marks_std.name = 'student_DoM_marks_std'
student_DoM_marks_std = student_DoM_marks_std.reset_index()
data = pd.merge(data, student_DoM_marks_std, on = student_DoM_select, how='left')
print("student_DoM_students_std")
student_DoM_students_std = data.groupby(student_DoM_select)['students'].std()
student_DoM_students_std.name = 'student_DoM_students_std'
student_DoM_students_std = student_DoM_students_std.reset_index()
data = pd.merge(data, student_DoM_students_std, on = student_DoM_select, how='left')
data['student_DoM_marks_means'].fillna(0, inplace=True)
data['student_DoM_students_means'].fillna(0, inplace=True)
data['student_DoM_marks_medians'].fillna(0, inplace=True)
data['student_DoM_students_medians'].fillna(0, inplace=True)
data['student_DoM_marks_std'].fillna(0, inplace=True)
data['student_DoM_students_std'].fillna(0, inplace=True)
#Third Part: Exceptional students
#I think int is better here as it helps defining categories but can't use it.#
#print(data.isnull().sum())
#print(data['types_DoM_marks_std'][data['types_DoM_marks_std']==0].sum())
#data.to_csv('ex')
#print(data.columns)
#Original version:#int raises the "can't convert Nan float to int. While there were no Nan as I verified in the data just before sending it to the
data['Except_student_IP2_DoM_marks_means']=np.array([int((data['student_IP2_DoM_marks_means'][i]-data['types_IP2_DoM_marks_means'][i])/data['types_IP2_DoM_students_std'][i]) for i in range (len(data['year']))])
data['Except_student_IP2_DoM_marks_medians']=np.array([int((data['student_IP2_DoM_marks_medians'][i]-data['types_IP2_DoM_marks_means'][i])/data['types_IP2_DoM_students_std'][i]) for i in range (len(data['year']))])
#Second version: raises no error but final data (returned) is filled with these stupid NaN
data['Except_student_P2M_DoM_marks_means']=np.array([np.round((data['student_DoM_marks_means'][i]-data['types_DoM_marks_means'][i])/data['types_DoM_marks_std'][i],0) for i in range (len(data['year']))])
data['Except_student_P2M_DoM_marks_medians']=np.array([np.round((data['student_DoM_marks_medians'][i]-data['types_DoM_marks_medians'][i])/data['types_DoM_marks_std'][i],0) for i in range (len(data['year']))])
#End
return data
Most likely you are correct that there are no Nans in your data frame, however you are creating them in your calculations. See the following:
In [15]: import pandas as pd
In [16]: df = pd.DataFrame([[1, 2], [0, 0]], columns=['actual value', 'col2'])
df['means'] = df.mean(axis=1)
df['std'] = df.std(axis=1)
In [17]: df
Out[17]:
actual value col2 means std
0 1 2 1.5 0.5
1 0 0 0.0 0.0
So the data frame doesn't have any Nans, but what about the calculations?
In [21]: [(df['actual value'][i]-df['means'][i])/df['std'][i] for i in range (len(df['means']))]
Out[21]: [-1.0, nan]
Now when you call int on that you get an error on the resulting list.
Finally, I would suggest (if possible) performing the operations directly in the underlying arrays rather then using a for loop, as it will be much faster.
In [25]: (df['actual value']-df['means'])/df['std']
Out[25]:
0 -1
1 NaN
dtype: float64
This may not be possible depending on what return value of a 0 division is desired though.

Categories

Resources