Append a dataframe with a loop - python

Morning,
I have 3 Excel files that I have imported with pandas. I am trying to create a DataFrame that takes the 'Ticker' column from each import, adds the title of the Excel file as a 'Sector' column, and appends the results to each other to create a new DataFrame. This new DataFrame will then be exported to Excel.
AA = ['Aero&Def', 'REITs', 'Auto&Parts']
File = 'FTSEASX_' + AA[0] + '_Price.xlsx'
xlsx = pd.ExcelFile('C:/Users/Ben/' + File)
df = pd.read_excel(xlsx, 'Price_Data')
df = df[df.Identifier.notnull()]
df.fillna(0)
a = []
b = []
for i in df['Ticker']:
    a.append(i)
    b.append(AA[0])
raw_data = {'Ticker': a, 'Sector': b}
df2 = pd.DataFrame(raw_data, columns=['Ticker', 'Sector'])
del AA[0]
for j in AA:
    File = 'FTSEASX_' + j + '_Price.xlsx'
    xlsx = pd.ExcelFile('C:/Users/Ben/' + File)
    df3 = pd.read_excel(xlsx, 'Price_Data')
    df3 = df3[df3.Identifier.notnull()]
    df3.fillna(0)
    a = []
    b = []
    for i in df3['Ticker']:
        a.append(i)
        b.append(j)
    raw_data = {'Ticker': a, 'Sector': b}
    df4 = pd.DataFrame(raw_data, columns=['Ticker', 'Sector'])
    df5 = df2.append(df4)
I am currently getting the output below, but obviously the 2nd import, titled 'REITs', is not getting captured.
Ticker Sector
0 AVON-GB Aero&Def
1 BA-GB Aero&Def
2 COB-GB Aero&Def
3 MGGT-GB Aero&Def
4 SNR-GB Aero&Def
5 ULE-GB Aero&Def
6 QQ-GB Aero&Def
7 RR-GB Aero&Def
8 CHG-GB Aero&Def
0 GKN-GB Auto&Parts
How would I go about achieving this? Or is there a better, more Pythonic way of achieving it?

I would do it this way:
import pandas as pd
AA = ['Aero&Def','REITs', 'Auto&Parts']
# assuming that ['Ticker','Sector','Identifier'] columns are in 'B,D,E' Excel columns
xl_cols='B,D,E'
dfs = [pd.read_excel('FTSEASX_{0}_Price.xlsx'.format(f),
                     'Price_Data',
                     parse_cols=xl_cols,
                     ).query('Identifier == Identifier')
       for f in AA]
df = pd.concat(dfs, ignore_index=True)
print(df[['Ticker', 'Sector']])
Explanation:
.query('Identifier == Identifier') - gives you only those rows where Identifier is NOT NULL (using the fact that NaN == NaN is always False)
PS You don't want to loop through your data frames when working with Pandas...
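In newer pandas the `parse_cols` argument has been renamed `usecols` and `DataFrame.append` has been removed, so a modernized sketch of the same loop-free idea may be useful. The tiny in-memory frames below are stand-ins for the Excel imports (the 'BLND-GB' ticker is hypothetical); with real files each would come from pd.read_excel('FTSEASX_' + sector + '_Price.xlsx', sheet_name='Price_Data').

```python
import pandas as pd

# Stand-ins for the three Excel imports from the question.
frames = {
    'Aero&Def':   pd.DataFrame({'Identifier': ['x', None],
                                'Ticker': ['AVON-GB', 'BAD-ROW']}),
    'REITs':      pd.DataFrame({'Identifier': ['y'], 'Ticker': ['BLND-GB']}),
    'Auto&Parts': pd.DataFrame({'Identifier': ['z'], 'Ticker': ['GKN-GB']}),
}

# Drop rows with a null Identifier, tag each frame with its sector,
# and stack everything in one pass -- no manual row loops needed.
result = pd.concat(
    [df.dropna(subset=['Identifier']).assign(Sector=sector)
     for sector, df in frames.items()],
    ignore_index=True,
)[['Ticker', 'Sector']]
```

The final DataFrame can then be written out once with result.to_excel(..., index=False).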

Apply function to data frame and make output a separate df pandas

I have a data frame
cat input.csv
dwelling,wall,weather,occ,height,temp
5,2,Ldn,Pen,154.7,23.4
5,4,Ldn,Pen,172.4,28.7
3,4,Ldn,Pen,183.5,21.2
3,4,Ldn,Pen,190.2,30.3
To which I'm trying to apply the following function:
input_df = pd.read_csv('input.csv')
def folder_column(row):
    if row['dwelling'] == 5 and row['wall'] == 2:
        return 'folder1'
    elif row['dwelling'] == 3 and row['wall'] == 4:
        return 'folder2'
    else:
        return 0
I want to run the function on the input dataset and store the output in a separate data frame using something like this:
temp_df = pd.DataFrame()
temp_df = input_df['archetype_folder'] = input_df.apply(folder_column, axis=1)
But when I do this I only get the newly created 'archetype_folder' column in temp_df, when I would like all the original columns from input_df. Can anyone help? Note that I don't want to add the new column 'archetype_folder' to the original input_df. I've also tried this:
temp_df = input_df
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
But when I run the second command both input_df and temp_df end up with the new column?
Any help is appreciated!
Use DataFrame.copy():
temp_df = input_df.copy()
temp_df['archetype_folder'] = temp_df.apply(folder_column, axis=1)
You need to create a copy of the original DataFrame, then assign the return values of your function to it. Consider the following simple example:
import pandas as pd
def is_odd(row):
    return row.value % 2 == 1
df1 = pd.DataFrame({"value":[1,2,3],"name":["uno","dos","tres"]})
df2 = df1.copy()
df2["odd"] = df1.apply(is_odd,axis=1)
print(df1)
print("=====")
print(df2)
gives output
value name
0 1 uno
1 2 dos
2 3 tres
=====
value name odd
0 1 uno True
1 2 dos False
2 3 tres True
You don't need apply. Use .loc to be more efficient.
temp_df = input_df.copy()
m1 = (input_df['dwelling'] == 5) & (input_df['wall'] == 2)
m2 = (input_df['dwelling'] == 3) & (input_df['wall'] == 4)
temp_df.loc[m1, 'archetype_folder'] = 'folder1'
temp_df.loc[m2, 'archetype_folder'] = 'folder2'
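A runnable version of this on the question's sample rows, with a fillna at the end standing in for the `else: return 0` branch (rows that match neither mask come back as NaN from the .loc assignments):

```python
import pandas as pd

# Sample rows from the question's input.csv (other columns omitted).
input_df = pd.DataFrame({'dwelling': [5, 5, 3, 3],
                         'wall':     [2, 4, 4, 4]})

temp_df = input_df.copy()
m1 = (input_df['dwelling'] == 5) & (input_df['wall'] == 2)
m2 = (input_df['dwelling'] == 3) & (input_df['wall'] == 4)
temp_df.loc[m1, 'archetype_folder'] = 'folder1'
temp_df.loc[m2, 'archetype_folder'] = 'folder2'
# Rows matching neither condition are NaN; replace them with 0
# to mirror folder_column's else branch.
temp_df['archetype_folder'] = temp_df['archetype_folder'].fillna(0)
```

Because temp_df is a copy, input_df keeps only its original columns.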

Pandas create two new columns based on 2 existing columns

I have a dataframe like the below:
dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}

               Email Ticket_Category  Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com           Tier1                   5           1345.45
1  joblogs@gmail.com           Tier2                   2          10295.88
What I'm trying to do is to create 2 new columns "Tier1_Quantity_Purchased" and "Tier2_Quantity_Purchased" based on the existing dataframe, and sum the total of "Total_Price_Paid" as below:
dummy_dict_desired = {'Email': ['joblogs@gmail.com'],
                      'Tier1_Quantity_Purchased': [5],
                      'Tier2_Quantity_Purchased': [2],
                      'Total_Price_Paid': [11641.33]}

               Email  Tier1_Quantity_Purchased  Tier2_Quantity_Purchased  Total_Price_Paid
0  joblogs@gmail.com                         5                         2          11641.33
Any help would be greatly appreciated. I know there is an easy way to do this, I just can't figure out how without writing some silly for loop!
What you want to do is to pivot your table, and then add a column with aggregated data from the original table.
df = pd.DataFrame(dummy_dict_existing)
pivot_df = df.pivot(index='Email', columns='Ticket_Category', values='Quantity_Purchased')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
Ticket_Category    Tier1  Tier2     total
Email
joblogs@gmail.com      5      2  11641.33
For more details on pivoting, take a look at How can I pivot a dataframe?
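One caveat: DataFrame.pivot raises a ValueError if an Email/Ticket_Category pair appears more than once. pivot_table, which aggregates duplicates, is a drop-in alternative for that case; a small sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                   'Ticket_Category': ['Tier1', 'Tier2'],
                   'Quantity_Purchased': [5, 2],
                   'Total_Price_Paid': [1345.45, 10295.88]})

# pivot_table sums repeated Email/Ticket_Category pairs instead of raising.
pivot_df = df.pivot_table(index='Email', columns='Ticket_Category',
                          values='Quantity_Purchased', aggfunc='sum')
pivot_df['total'] = df.groupby('Email')['Total_Price_Paid'].sum()
```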
import pandas as pd
dummy_dict_existing = {'Email': ['joblogs@gmail.com', 'joblogs@gmail.com'],
                       'Ticket_Category': ['Tier1', 'Tier2'],
                       'Quantity_Purchased': [5, 2],
                       'Total_Price_Paid': [1345.45, 10295.88]}
df = pd.DataFrame(dummy_dict_existing)
df2 = df[['Ticket_Category', 'Quantity_Purchased']]
df_transposed = df2.T
df_transposed.columns = ['Tier1_purchased', 'Tier2_purchased']
df_transposed = df_transposed.iloc[1:]
df_transposed = df_transposed.reset_index()
df_transposed = df_transposed[['Tier1_purchased', 'Tier2_purchased']]
df = df.groupby('Email')[['Total_Price_Paid']].sum()
df = df.reset_index()
df.join(df_transposed)
output

Comparing two spreadsheets, removing the duplicates and exporting the result to a csv in python

I'm trying to compare two Excel spreadsheets, remove the names that appear in both spreadsheets from the first spreadsheet, and then export the result to a CSV file using Python. I am new, but here's what I have so far:
import pandas as pd
data_1 = pd.read_excel (r'names1.xlsx')
bit_data = pd.DataFrame(data_1, columns= ['Full_Name'])
bit_col = len(bit_data)
data_2 = pd.read_excel (r'force_names.xlsx')
force_data = pd.DataFrame(data_2, columns= ['FullName'])
force_col = len(force_data)
for bit_num in range(bit_col):
    for force_num in range(force_col):
        if bit_data.iloc[bit_num, 0] == force_data.iloc[force_num, 0]:
            data_1 = data_1.drop(data_1.index[[bit_num]])

data_1.to_csv(r"/Users/name/Desktop/Reports/Names.csv")
When I run it, it gets rid of some duplicates but not all. Any advice anyone has would be greatly appreciated.
Use pandas merge to get all unique names, with no duplicates. If you want to drop any names that are in both files (I'm not sure if that's what you're asking), you can do so. See this toy example:
row1list = ['G. Anderson']
row2list = ['Z. Ebra']
df1 = pd.DataFrame([row1list, row2list], columns=['FullName'])
row1list = ['G. Anderson']
row2list = ['C. Obra']
df2 = pd.DataFrame([row1list, row2list], columns=['FullName'])
df3 = df1.merge(df2, on='FullName', how='outer', indicator=True)
print(df3)
# FullName _merge
# 0 G. Anderson both
# 1 Z. Ebra left_only
# 2 C. Obra right_only
df3 = df3.loc[df3['_merge'] != 'both']
del df3['_merge']
print(df3)
# FullName
# 1 Z. Ebra
# 2 C. Obra
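If the aim is just to drop from the first sheet every name that also appears in the second, Series.isin does it in one vectorized step, avoiding both the nested loops and the stale-index problem of calling drop inside them. A sketch on the same toy names (note the two files use different column names, Full_Name vs FullName, as in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Full_Name': ['G. Anderson', 'Z. Ebra']})
df2 = pd.DataFrame({'FullName': ['G. Anderson', 'C. Obra']})

# Keep only rows of df1 whose name does not occur anywhere in df2.
kept = df1[~df1['Full_Name'].isin(df2['FullName'])]
```

kept can then be written out with kept.to_csv(..., index=False).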

Subtracting DataFrames resulting in unexpected numbers

I'm trying to subtract one data frame from another. With my current Excel files every result should be 0 or blank, but in the future results may be 0, 1, 2, or blank. While some results are 0 or blank, I'm also getting -1 and 1. Any help that can be provided will be appreciated.
The two Excel sheets are identical except for number changes in second column.
Example
ExternalId TotalInteractions
name1 1
name2 2
name3 2
name4 1
Both sheets will look like the example and the output will look the same. I just need the difference between the two sheets
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = df1['ExternalId']
    df4 = df1['TotalInteractions']
    df5 = df2['TotalInteractions']
    df6 = df4.sub(df5)
    frames = (df3, df6)
    df = pd.concat(frames, axis=1)
    df.to_excel('GCList.xlsx')

GCList()
I managed to create a partial answer to the unexpected numbers. My problem now is that NewInter has more names than PrevInter, which results in a blank in TotalInteractions next to each new ExternalId. Any idea how to make it accept the value from NewInter when there is a blank?
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = pd.merge(df1, df2, on='ExternalId', how='outer')
    df4 = df3['TotalInteractions_x']
    df5 = df3['TotalInteractions_y']
    df6 = df3['ExternalId']
    df7 = df4 - df5
    frames = [df6, df7]
    df = pd.concat(frames, axis=1)
    df.to_excel('GCList.xlsx')

GCList()
Figured out the issues. The first part needed a merge for the subtraction to work, as the dataframes are not the same size. I also had to add fill_value = 0 so it would take information from the new file.
def GCList():
    df1 = pd.read_excel('NewInter.xlsx')
    df2 = pd.read_excel('PrevInter.xlsx')
    df3 = pd.merge(df1, df2, on='ExternalId', how='outer')
    df4 = df3['TotalInteractions_x']
    df5 = df3['TotalInteractions_y']
    df6 = df3['ExternalId']
    df7 = df4.sub(df5, fill_value=0)
    frames = [df6, df7]
    df = pd.concat(frames, axis=1)
    df.to_excel('GCList.xlsx')

GCList()
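The effect of fill_value is easiest to see on two small Series of different lengths: without it, labels present on only one side subtract to NaN; with fill_value=0 the missing side is treated as 0. The names and counts below are hypothetical stand-ins for the TotalInteractions columns:

```python
import pandas as pd

# name5 exists only in the new file, like the asker's new ExternalIds.
new = pd.Series([3, 5, 2], index=['name1', 'name2', 'name5'])
prev = pd.Series([1, 5], index=['name1', 'name2'])

plain = new.sub(prev)                 # name5 has no counterpart -> NaN
filled = new.sub(prev, fill_value=0)  # missing counterpart treated as 0
```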

Python every three rows to columns using pandas

I have a text file with data that repeats every 3 rows. Let's say it is hash, directory, sub-directory. The data looks like the following:
a3s2d1f32a1sdf321asdf
Dir_321321
Dir2_asdf
s21a3s21d3f21as32d1f
Dir_65465
Dir2_werq
asd21231asdfa3s21d
Dir_76541
Dir2_wbzxc
....
I have created a python script that takes the data and every 3 rows creates columns:
import pandas as pd
df1 = pd.read_csv('RogTest/RogTest.txt', delimiter = "\t", header=None)
df2 = df1[df1.index % 3 == 0]
df2 = df2.reset_index(drop=True)
df3 = df1[df1.index % 3 == 1]
df3 = df3.reset_index(drop=True)
df4 = df1[df1.index % 3 == 2]
df4 = df4.reset_index(drop=True)
df5 = pd.concat([df2, df3], axis=1)
df6 = pd.concat([df5, df4], axis=1)
#Rename columns
df6.columns = ['Hash', 'Dir_1', 'Dir_2']
#Write to csv
df6.to_csv('RogTest/RogTest.csv', index=False, header=True)
This works fine, but I am curious whether there is a more efficient way to do this, i.e. less code?
You can use:
import numpy as np

# (-1, 3) means "3 columns, infer the row count"; the shape must be
# rows-by-columns, and integer, for the reshape to work.
df_final = pd.DataFrame(np.reshape(df.values, (-1, 3)))
df_final.columns = ['Hash', 'Dir_1', 'Dir_2']
Output:
Hash Dir_1 Dir_2
0 a3s2d1f32a1sdf321asdf Dir_321321 Dir2_asdf
1 s21a3s21d3f21as32d1f Dir_65465 Dir2_werq
2 asd21231asdfa3s21d Dir_76541 Dir2_wbzxc
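A self-contained check of the three-column reshape; using (-1, 3) lets NumPy infer the row count from the data length, so it keeps working however many 3-line records the file contains:

```python
import numpy as np
import pandas as pd

# Two records of three lines each, as in the question's text file.
rows = ['a3s2d1f32a1sdf321asdf', 'Dir_321321', 'Dir2_asdf',
        's21a3s21d3f21as32d1f', 'Dir_65465', 'Dir2_werq']
df = pd.DataFrame(rows)

# Reshape the flat column into one row per record.
df_final = pd.DataFrame(df.values.reshape(-1, 3),
                        columns=['Hash', 'Dir_1', 'Dir_2'])
```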
