Comparing 2 revisions of Excel files in Python pandas

I am very new to pandas, so this might be a silly question to some of you.
I am looking to compare 2 Excel files and output the changes or the new entries.
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the lengths match exactly. Otherwise I get a pandas ValueError: can only compare identically labeled objects.
BONUS
It would be awesome if I could also produce output for discontinued products.

You could create a master index of all products, build two old/new dataframes that both use the full master index, and then use df.compare() to compare the two dataframes:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)

I have solved the problem. Even though WCeconomics kindly took the time to type the code out, it did not help me get the output I wanted, likely due to me being a noob with pandas.
This is how I solved it, so that it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
wb.save('avp_changes.xls')
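For the BONUS part (discontinued products), the same outer merge already has everything needed: a product that only exists in the old file ends up with NaN in the _new columns, and a brand-new product has NaN in the _old columns. A minimal sketch, assuming the merged frame from the solution above:
# sketch: assumes 'merged' from above (outer merge on 'Product' with _old/_new suffixes)
discontinued = merged.loc[merged['Price_new'].isna(), ['Product', 'Product Description_old', 'Price_old']]
new_entries = merged.loc[merged['Price_old'].isna(), ['Product', 'Product Description_new', 'Price_new']]
print(discontinued)
print(new_entries)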

Related

Load webscraping results into pandas dataframe and export to excel

I'm trying to write a short Python snippet that loops through different webpages structured in the same way (i.e. same number of columns / rows), loads all the information into a pandas dataframe, and finally exports it to Excel.
I managed to write all the code that gathers what should be the column headers (in the dt HTML tag) and the rows (in the dd HTML tag), but I'm having issues placing all this info into a pandas dataframe.
for row in rows:
    QA_link = row.find('td', class_='views-field views-field-nothing-1').find('a', href=True)['href']
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_table = soup_QA.find('dl', class_='dl-horizontal SingleRulebookRecord')
    if boolInitialiseTable:
        QA_hdr = [str.replace(link.string, ':', '') for link in QA_table.findAll('dt')]
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.DataFrame()
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()], ignore_index=True, axis=0)
        boolInitialiseTable = False
        df.columns = QA_hdr
    else:
        QA_details = [str(link.string) for link in QA_table.findAll('dd')]
        df = pd.concat([df, pd.DataFrame(QA_details).transpose()])
Where rows contains all the different web pages that need to be accessed to gather the info I need to put in the pandas dataframe.
So from the HTML table like content of:
<dl class="dl-horizontal SingleRulebookRecord">
<dt>Question ID:</dt>
<dd>2020_5469 </dd>
<dt>Topic:</dt>
<dd>Weather</dd>
<dt>Date</dt>
<dd>06/06/2020</dd>
</dl>
I would like to get a pandas dataframe with:
Question ID    Topic      Date
2020_5469      Weather    06/06/2020
Finally df.to_excel('results.xlsx') should do the job of exporting everything into Excel.
I feel that all this transposing in the code is not the correct way of doing it. In addition, the type of the fields of the table is object and not string as I would expect - but maybe this is not a problem.
I would do it like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
links = ['https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5469',
'https://www.eba.europa.eu/single-rule-book-qa/qna/view/publicId/2020_5128']
dfs = []
for QA_link in links:
    req_QA = requests.get(QA_link)
    soup_QA = BeautifulSoup(req_QA.text, 'html.parser')
    QA_hdr = [link.get_text() for link in soup_QA.findAll('dt')]
    QA_details = [[link.get_text() for link in soup_QA.findAll('dd')]]
    dfs.append(pd.DataFrame(QA_details, columns=QA_hdr))
df_all = pd.concat(dfs, axis=0).reset_index(drop=True)
# check for NaN values (columns not shared between urls)
print(df_all[df_all.columns[df_all.isna().any()]].T)
0 1
Name of institution / submitter: BearingPoint Switzerland AG NaN
Country of incorporation / residence: Switzerland NaN
Answer prepared by: Answer prepared by the EBA. NaN
Subparagraph: NaN (f)
df_all.iloc[:,:5].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Question ID: 2 non-null object
1 Legal Act: 2 non-null object
2 Topic: 2 non-null object
3 Article: 2 non-null object
4 Paragraph: 2 non-null object
dtypes: object(5)
memory usage: 208.0+ bytes
Notice that QA_details is a nested list: each inner list fills a new row; it's just that you only have one here. For example, here's how it works if you have two nested lists:
lst = [[1,2],[3,4]]
df = pd.DataFrame(lst, columns=['A','B'])
print(df)
A B
0 1 2
1 3 4
As for the reason why the Dtype is given as object, see e.g. this SO post. But all your cells will in fact contain strings, which we can easily check. E.g.:
cols = df_all.columns[df_all.notna().all()]
print(all([isinstance(i, str) for i in df_all.loc[0, cols]]))
# True
Finally, yes, df.to_excel('results.xlsx') will work to export the df to Excel. Perhaps use df.to_excel('results.xlsx', index=False) to avoid exporting the index.

Extract values within the quotes signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract approaches but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    p1.append(row[0])
    p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

How to append two dataframe objects containing same column data but different column names?

I want to append an expense df to a revenue df but can't properly do so. Can anyone suggest how I can do this?
'''
import pandas as pd
import lxml
from lxml import html
import requests
import numpy as np
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/'+ symbol +'/financials'
df=pd.read_html(url)
revenue = pd.concat(df[0:1]) # the revenue dataframe obj
revenue = revenue.dropna(axis='columns') # drop naN column
header = revenue.iloc[:0] # revenue df header row
expense = pd.concat(df[1:2]) # the expense dataframe obj
expense = expense.dropna(axis='columns') # drop naN column
statement = revenue.append(expense) #results in a dataframe with an added column (Unnamed:0)
revenue = pd.concat(df[0:1]) has the columns:
Fiscal year is January-December. All values CAD millions. | 2015 | 2016 | 2017 | 2018 | 2019
expense = pd.concat(df[1:2]) has the columns:
Unnamed: 0 | 2015 | 2016 | 2017 | 2018 | 2019
'''
How can I append the expense dataframe to the revenue dataframe so that I am left with a single dataframe object?
Thanks,
Rename columns.
df = df.rename(columns={'old_name': 'new_name',})
Then append with merge(), join(), or concat().
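A minimal sketch of that idea, using the column names from the question (the toy data here is made up):
import pandas as pd

# toy stand-ins for the scraped revenue/expense tables (values are made up)
revenue = pd.DataFrame({'Fiscal year is January-December. All values CAD millions.': ['Sales'], '2019': [100]})
expense = pd.DataFrame({'Unnamed: 0': ['Cost of goods'], '2019': [40]})

# give both frames the same name for their first column, then stack them
revenue = revenue.rename(columns={'Fiscal year is January-December. All values CAD millions.': 'LineItem'})
expense = expense.rename(columns={'Unnamed: 0': 'LineItem'})
statement = pd.concat([revenue, expense], ignore_index=True)
print(statement)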
I managed to append the dataframes with the following code. Thanks David for putting me on the right track. I admit this is not the best way to do this because, in a runtime environment, I don't know the value of the text to rename and I've hard-coded it here. Ideally, it would be best to reference a placeholder at df.iloc[:0,0] instead, but I'm having a tough time getting that to work.
df=pd.read_html(url)
revenue = pd.concat(df[0:1])
revenue = revenue.dropna(axis='columns')
revenue.rename({'Fiscal year is January-December. All values CAD millions.':'LineItem'},axis=1,inplace=True)
header = revenue.iloc[:0]
expense = pd.concat(df[1:2])
expense = expense.dropna(axis='columns')
expense.rename({'Unnamed: 0':'LineItem'}, axis=1, inplace=True)
statement = revenue.append(expense,ignore_index=True)
With the df = pd.read_html(url) construct, a list of DataFrames is returned when scraping MarketWatch financials. The function below returns a single dataframe of all balance sheet elements. The same code applies to quarterly and annual income and cash flow statements.
def getBalanceSheet(url):
    df = pd.read_html(url)
    count = sum([1 for Listitem in df if 'Unnamed: 0' in Listitem])
    statement = pd.concat(df[0:1])
    statement = statement.dropna(axis='columns')
    if 'q' in url:  # quarterly
        statement.rename({'All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    else:
        statement.rename({'Fiscal year is January-December. All values CAD millions.': 'LineItem'}, axis=1, inplace=True)
    for rowidx in range(count):
        df_name = 'df_' + str(int(rowidx))
        df_name = pd.concat(df[rowidx+1:rowidx+2])
        df_name = df_name.dropna(axis='columns')
        df_name.rename({'Unnamed: 0': 'LineItem'}, axis=1, inplace=True)
        statement = statement.append(df_name, ignore_index=True)
    return statement
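A hypothetical usage sketch, reusing the URL pattern from the question:
# hypothetical usage; the URL pattern is taken from the question above
symbol = 'MFC'
url = 'https://www.marketwatch.com/investing/stock/' + symbol + '/financials'
statement = getBalanceSheet(url)
print(statement.head())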

groupby and sum two columns and set as one column in pandas

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
I want to group by the columns ['Home', 'Away'] under the single name Team. Then I would like to sum HomePoint and AwayPoint into a column named Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying a different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.
A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
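If you want the exact two-column Team / Points layout from the question, here is a small follow-up sketch (assuming the home and away Series defined above):
points = home.add(away, fill_value=0).astype(int)
result = points.rename_axis('Team').reset_index(name='Points')
print(result)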
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()
You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of column pairs (Home/HomePoint and Away/AwayPoint)
n = len(data.columns) // 2
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1)
          for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

How to loop over multiple DataFrames and produce multiple csv?

Making the change from R to Python, I am having some difficulty writing multiple CSVs using pandas from a list of multiple DataFrames:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
sample_frac, head, arrange, mutate, group_by, summarize,
DelayFunction)
diamonds = [diamonds, diamonds, diamonds]
path = "/user/me/"
def extractDiomands(path, diamonds):
    for each in diamonds:
        df = DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
        df = pd.DataFrame(df) # not sure if that is required
        df.to_csv(os.path.join('.csv', each))
extractDiomands(path,diamonds)
That, however, generates an error. I appreciate any suggestions!
Welcome to Python! First I'll load a couple libraries and download an example dataset.
import os
import pandas as pd
example_data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(example_data.head(5))
first few rows of our example data:
admit gre gpa rank
0 0 380 3.61 3
1 1 660 3.67 3
2 1 800 4.00 1
3 1 640 3.19 4
4 0 520 2.93 4
Now here's what I think you want done:
# spawn a few datasets to loop through
df_1, df_2, df_3 = example_data.head(20), example_data.tail(20), example_data.head(10)
list_of_datasets = [df_1, df_2, df_3]
output_path = 'scratch'
# in Python you can loop through collections of items directly; it's pretty cool.
# with enumerate(), you get the index and the item from the sequence at each step
for index, dataset in enumerate(list_of_datasets):
    # Filter to keep just a couple columns
    keep_columns = ['gre', 'admit']
    dataset = dataset[keep_columns]
    # Export to CSV
    filepath = os.path.join(output_path, 'dataset_'+str(index)+'.csv')
    dataset.to_csv(filepath)
At the end, my folder 'scratch' has three new CSVs called dataset_0.csv, dataset_1.csv, and dataset_2.csv.
