Python: Compare 2 columns and write data to Excel sheets - python

I need to compare two columns together: "EMAIL" and "LOCATION".
I'm using Email because it's more accurate than name for this issue.
My objective is to find the total number of locations each person worked at, use that total to select which sheet (tab) the data will be written to, and copy the original data over to the new sheet.
I need the original data copied over with all the duplicate
locations, which is where this problem stumps me.
[Image: full Excel sheet] (had to use images because the post was flagged as spam)
The Excel sheet (SAMPLE) I'm reading in as a data frame:
[Image: Excel sample spreadsheet]
Example:
TOMAPPLES#EXAMPLE.COM worked at WENDYS,FRANKS HUT, and WALMART - That
sums up to 3 different locations, which I would add to a new sheet
called SHEET: 3 Different Locations
SJONES22#GMAIL.COM worked at LONDONS TENT and YOUTUBE - That's 2 different locations, which I would add to a new sheet called SHEET:
2 Different Locations
MONTYJ#EXAMPLE.COM worked only at WALMART - This user would be added
to SHEET: 1 Location
Outcome: data copied to the new sheets
[Image: Sheet 2 - different locations]
[Image: Sheet 3 - different locations]
[Image: Sheet 4 - different locations]
Thanks for taking the time to look at my problem =)

Hi, check if the lines below work for you:
import pandas as pd
df = pd.read_excel('sample.xlsx')
df1 = df.groupby(['Name','Location','Job']).count().reset_index()
df2 = (df.groupby(['Name','Location','Job','Email'])
         .agg({'Location':'count','Email':'count'})
         .rename(columns={'Location':'Location Count','Email':'Email Count'})
         .reset_index())
print(df1)
print('\n\n')
print(df2)
Below is the output; change the columns to check more variations.
df1
Name Location Job Email
0 Monty Jakarta Manager 1
1 Monty Mumbai Manager 1
2 Sahara Jonesh Paris Cook 2
3 Tom App Jakarta Buser 1
4 Tom App Paris Buser 2
df2 all columns
Name Location ... Location Count Email Count
0 Monty Jakarta ... 1 1
1 Monty Mumbai ... 1 1
2 Sahara Jonesh Paris ... 2 2
3 Tom App Jakarta ... 1 1
4 Tom App Paris ... 2 2
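The answer above produces the counts; to go on to the asker's actual goal (write each person's original rows, duplicates included, to a sheet named after their number of distinct locations), here is a rough sketch on made-up rows mirroring the question's example. The column and sheet names are assumptions, and the ExcelWriter step needs openpyxl, so it is left commented out:

```python
import pandas as pd

# Sample rows mirroring the question (emails keep the '#' as posted);
# the duplicate WALMART row shows that duplicates are preserved
df = pd.DataFrame({
    'EMAIL': ['TOMAPPLES#EXAMPLE.COM'] * 4 + ['SJONES22#GMAIL.COM'] * 2 + ['MONTYJ#EXAMPLE.COM'],
    'LOCATION': ['WENDYS', 'FRANKS HUT', 'WALMART', 'WALMART',
                 'LONDONS TENT', 'YOUTUBE', 'WALMART'],
})

# Count *unique* locations per email, then tag every original row with it
counts = df.groupby('EMAIL')['LOCATION'].nunique()
df['N_LOCATIONS'] = df['EMAIL'].map(counts)

# Split the original rows (duplicates included) by that count
sheets = {n: g.drop(columns='N_LOCATIONS') for n, g in df.groupby('N_LOCATIONS')}

# Writing each group to its own tab needs openpyxl:
# with pd.ExcelWriter('out.xlsx') as writer:
#     for n, g in sheets.items():
#         g.to_excel(writer, sheet_name=f'{n} Different Locations', index=False)
```

Each value in `sheets` is the unmodified slice of the original data, so the duplicate locations survive the copy.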

Related

Check if each column values exist in another dataframe column where another column value is the column header

companies.xlsx
company To
1 amazon hi#test.de
2 google bye#test.com
3 amazon hi#tld.com
4 starbucks hi#test.de
5 greyhound bye#tuz.de
emails.xlsx
hi#test.de bye#test.com hi#tld.com ...
1 amazon google microsoft
2 starbucks amazon tesla
3 Grey Hound greyhound
4 ferrari
So I have the 2 Excel sheets above and read both of them:
file1 = pd.ExcelFile('data/companies.xlsx')
file2 = pd.ExcelFile('data/emails.xlsx')
df_companies = file1.parse('sheet1')
df_emails = file2.parse('sheet1')
What I'm trying to accomplish is:
check if df_companies['To'] is an existing header in df_emails
if the header exists in df_emails, search the appropriate column of that header for df_companies['company']
if the company is found, add a column to df_companies and fill in '1', if not fill in '0'
E.g.: company amazon has the To email hi#test.de in companies.xlsx. In emails.xlsx the header hi#test.de exists and amazon is found in that column - so it's a '1'.
Does anyone know how to accomplish this?
Here's one approach. Convert df_emails to a dictionary and map it to df_companies. Then, compare the mapped column with df_companies['company'].
df_companies['check'] = df_companies['To'].map(df_emails.to_dict(orient='list')).fillna('')
df_companies['check'] = df_companies.apply(lambda x: x['company'] in x['check'], axis=1).astype(int)
company To check
1 amazon hi#test.de 1
2 google bye#test.com 1
3 amazon hi#tld.com 0
4 starbucks hi#test.de 1
5 greyhound bye#tuz.de 0
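For completeness, here are the same two lines run end-to-end on frames rebuilt from the question's tables (the '#' in the addresses is kept as posted):

```python
import pandas as pd

df_companies = pd.DataFrame({
    'company': ['amazon', 'google', 'amazon', 'starbucks', 'greyhound'],
    'To': ['hi#test.de', 'bye#test.com', 'hi#tld.com', 'hi#test.de', 'bye#tuz.de'],
})
df_emails = pd.DataFrame({
    'hi#test.de': ['amazon', 'starbucks', 'Grey Hound', 'ferrari'],
    'bye#test.com': ['google', 'amazon', 'greyhound', None],
    'hi#tld.com': ['microsoft', 'tesla', None, None],
})

# Map each 'To' address to the company list under that header; addresses
# with no matching header map to NaN, filled with '' so `in` still works
df_companies['check'] = df_companies['To'].map(df_emails.to_dict(orient='list')).fillna('')
df_companies['check'] = df_companies.apply(lambda x: x['company'] in x['check'], axis=1).astype(int)
```

Note how bye#tuz.de, which has no column in df_emails, falls through to 0 via the fillna('').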

How to read multiple files from a directory and append them in python?

I am working with files in a folder where I need a better way to loop through the files and append a column to make a master file. For two files I was reading them in as two dataframes and appending a series. However, now I've run into a situation with more than 100 files.
file 1 is as below:
Num Department Product Salesman Location rating1
1 Electronics TV 3 Bigmart, Delhi 5
2 Electronics TV 1 Bigmart, Mumbai 4
3 Electronics TV 2 Bigmart, Bihar 3
4 Electronics TV 2 Bigmart, Chandigarh 5
5 Electronics Camera 2 Bigmart, Jharkhand 5
similary file 2:
Num Department Product Salesman Location rating2
1 Electronics TV 3 Bigmart, Delhi 2
2 Electronics TV 1 Bigmart, Mumbai 4
3 Electronics TV 2 Bigmart, Bihar 4
4 Electronics TV 2 Bigmart, Chandigarh 5
5 Electronics Camera 2 Bigmart, Jharkhand 3
What I am trying to achieve is to read the rating column from every other file and append each one as a new column. Expected:
Num Department Product Salesman Location rating1 rating2
1 Electronics TV 3 Bigmart, Delhi 5 2
2 Electronics TV 1 Bigmart, Mumbai 4 4
3 Electronics TV 2 Bigmart, Bihar 3 5
4 Electronics TV 2 Bigmart, Chandigarh 5 5
5 Electronics Camera 2 Bigmart, Jharkhand 5 3
I modified some of the code posted here. The following code worked:
def read_folder(folder):
    files = [i for i in os.listdir(folder) if 'xlsx' in i]
    df = pd.read_excel(folder + '/{}'.format(files[0]))
    for f in files[1:]:
        df2 = pd.read_excel(folder + '/{}'.format(f))
        df = df.merge(df2.iloc[:, 5], left_index=True, right_index=True)
    return df
This method reads a folder and returns everything in a pandas dataframe:
import pandas as pd
import os
def read_folder(csv_folder):
    files = os.listdir(csv_folder)
    df = []
    for f in files:
        print(f)
        csv_file = csv_folder + "/" + f
        df.append(pd.read_csv(csv_file))
    df_full = pd.concat(df, ignore_index=True)
    return df, df_full
As I understand your last comment, you need to add the rating columns and create one file. After reading all the files you can do the operation below.
final_df = df[0]
i = 1
for d in df[1:]:
    final_df["rating_" + str(i)] = d["rating"]
    i = i + 1
This version of read_folder() returns a list of data frames. It also adds a helper column (for ratings).
import pandas as pd
from pathlib import Path
def read_folder(csv_folder):
    '''Input is a folder with csv files; return a list of data frames.'''
    csv_folder = Path(csv_folder).absolute()
    csv_files = [f for f in csv_folder.iterdir() if f.name.endswith('csv')]
    # the assign() method adds a helper column
    dfs = [
        pd.read_csv(csv_file).assign(rating_src=f'rating-{idx}')
        for idx, csv_file in enumerate(csv_files, 1)
    ]
    return dfs
Now assemble the data frames into the desired shape:
dfs = read_folder(csv_folder)
dfs = (pd.concat(dfs)
       .set_index(['Num', 'Department', 'Product', 'Salesman', 'Location', 'rating_src'])
       .squeeze()
       .unstack(level='rating_src')
       .reset_index()
)
dfs.columns.name = ''
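As a compact alternative under the same assumptions (every file shares the key columns in the same row order, with the rating column last), the ratings can also be gathered with pd.concat along axis 1; sketched here on in-memory frames standing in for the files:

```python
import pandas as pd

# Two frames standing in for two files read with pd.read_excel
f1 = pd.DataFrame({'Num': [1, 2], 'Location': ['Delhi', 'Mumbai'], 'rating1': [5, 4]})
f2 = pd.DataFrame({'Num': [1, 2], 'Location': ['Delhi', 'Mumbai'], 'rating2': [2, 4]})
frames = [f1, f2]

# Keep the shared columns from the first frame once,
# then append every file's last (rating) column side by side
master = pd.concat([frames[0].iloc[:, :-1]] + [f.iloc[:, -1] for f in frames], axis=1)
```

This aligns on the row index, so it only works when the files really do list the same rows in the same order.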

Grouping and exporting excel rows using python

This is using Python.
I have an excel sheet that in its most basic form looks like this
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
New York Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Stockholm Cup a 5
Stockholm Cup b 3
Madrid Cup a 5
New York Plate a 8
I want to group the locations together so that all the New Yorks are together, all the Madrids, etc., and export them to separate Excel sheets called New York, Madrid, Stockholm, with the same info on the rows. So basically just a copy and paste of each row into a new Excel sheet named after that row. Then I want to add all Cups together as one, and all Plates as one, on the second page of each. It would make sense to do this before exporting the data, right?
End result: 3 named Excel sheets, each containing only its own data, plus some easy math on the second page.
The real Excel sheet has 15000 rows, 50 locations and 100 items. These change, so it would have to be done procedurally; New York might be Toronto next time.
So far I have been able to group them with pandas, but every attempt after that has failed.
I'm new to pandas, so I thought this one would be relatively easy to do.
import pandas as pd
stock_report_excel = "small_stores_blocked_stock_value.xlsx"
df_soh = pd.read_excel(stock_report_excel, sheet_name='SOH')
df_stores = df_soh.groupby(['Site Name'])
# guess: loop to add each group to its sheet
# add the summed items to sheet 2
# export
While it's not very clear what your desired purpose is, I think a pandas MultiIndex DataFrame may be helpful for you. I've written some simple code below that I hope can guide you further.
import pandas as pd
sites=pd.Series(['New York','Stockholm','Madrid','New York','New York','Madrid','Stockholm','Stockholm','Stockholm','Madrid','New York'])
col2=pd.Series(['Cup','Plate','Cup','Cup','Plate','Cup','Plate','Cup','Cup','Cup','Plate'])
col3=pd.Series(['a','b','a','b','a','b','a','a','b','a','a'])
col4=pd.Series([3,5,2,5,8,9,2,5,3,5,8])
data=pd.DataFrame({'sites':sites,'col2':col2,'col3':col3,'col4':col4})
# You can of course replace all the code above with the pandas read functions.
data1 = data.set_index(['sites','col2','col3']) # Set as MultiIndex DataFrame.
data1.loc[('New York'),:] # This will give you all the 'New York' data
data1.loc[('New York','Cup'),:] # This will give you all the 'New York' & 'Cup' data.
# Retrieving all the 'Cup' data is a bit tricky, see the following
idx=pd.IndexSlice
data1.loc[idx[:,'Cup'],:]
Output as follows.
# data
sites col2 col3 col4
0 New York Cup a 3
1 Stockholm Plate b 5
2 Madrid Cup a 2
3 New York Cup b 5
4 New York Plate a 8
5 Madrid Cup b 9
6 Stockholm Plate a 2
7 Stockholm Cup a 5
8 Stockholm Cup b 3
9 Madrid Cup a 5
10 New York Plate a 8
# data1
col4
sites col2 col3
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Cup a 5
b 3
Madrid Cup a 5
New York Plate a 8
# data1.loc[('New York'),:]
col4
col2 col3
Cup a 3
b 5
Plate a 8
a 8
# data1.loc[('New York','Cup'),:]
col4
col3
a 3
b 5
# data1.loc[idx[:,'Cup'],:]
col4
sites col2 col3
New York Cup a 3
Madrid Cup a 2
New York Cup b 5
Madrid Cup b 9
Stockholm Cup a 5
b 3
Madrid Cup a 5
If you do not want to see any warnings and want to keep high performance, you can always use idx with fully explicit slices:
data1.loc[idx['New York',:,:],:]
data1.loc[idx['New York','Cup',:],:]
data1.loc[idx[:,'Cup',:],:]
Your next step is to write these data selections into separate sheets. I am not very familiar with that because I always write data into text files. For example, writing one of them to a csv file is as simple as data1.loc[idx['New York','Cup',:],:].to_csv('result.csv',index=False). I recommend you search for your desired functions.
Hope this is helpful. Good luck!
Answer to the problem
import pandas as pd
import os
file = "yourfile.xlsx"
extension = os.path.splitext(file)[1]
filename = os.path.splitext(file)[0]
abpath = os.path.dirname(os.path.abspath(file))
df=pd.read_excel(file, sheet_name="sheetname")
colpick = "column to extract"
cols=list(set(df[colpick].values))
def sendtofile(cols):
    for i in cols:
        df[df[colpick] == i].to_excel("{}/exported/{}.xlsx".format(abpath, i), sheet_name=i, index=False)
    return
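The code above writes one workbook per location but skips the "easy math on the second page" part of the question. Here is a hedged sketch of that step on toy data; the column names ('Site Name', 'Item', 'Qty') are assumptions, and the ExcelWriter calls need openpyxl, so they are left commented out:

```python
import pandas as pd

# Toy stand-in for the real 15000-row sheet; column names are assumptions
df = pd.DataFrame({
    'Site Name': ['New York', 'Stockholm', 'Madrid', 'New York'],
    'Item': ['Cup', 'Plate', 'Cup', 'Plate'],
    'Qty': [3, 5, 2, 8],
})

summaries = {}
for site, detail in df.groupby('Site Name'):
    # second page: every item summed across that location's rows
    summaries[site] = detail.groupby('Item', as_index=False)['Qty'].sum()
    # one workbook per location: detail on sheet 1, totals on sheet 2
    # (uncomment when openpyxl is installed)
    # with pd.ExcelWriter(f'{site}.xlsx') as writer:
    #     detail.to_excel(writer, sheet_name=site, index=False)
    #     summaries[site].to_excel(writer, sheet_name='Totals', index=False)
```

Because everything is driven by groupby over the location column, new locations in next month's file need no code changes.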

using python and xlrd to combine/merge 2 different spreadsheets

I have 2 separate excel spreadsheets
spreadsheet 1 is as such:
ID tin name date
1 21043 Bob 8/1/2019
2 45667 Jim 7/1/2018
3 69780 Sal 4/24/2017
The 2nd spreadsheet is as such:
ID tin job
1 21043 02
2 76544 02
3 45667 04
I am trying to figure out how to match the 2 spreadsheets and make 1 list as such:
ID tin name date job
1 21043 Bob 8/1/2019 02
2 45667 Jim 7/1/2018 04
3 69780 Sal 4/24/2017
4 76544 02
The common denominator is the "tin", but I have to merge the ones that duplicate, and then add the ones from both sheets that don't duplicate.
I am new to Python and VERY new to xlrd, so I cannot seem to even figure out the best terms to use to google an example.
I found some information on a next(iter statement, but after countless attempts I could not figure out a useful way to use it to combine them.
Is there an easy way, or am I "up a creek"?
Thank you,
Bob
You can use pandas for this. Pandas uses xlrd and other excel readers under the hood.
You will do something like this:
import pandas
df1 = pandas.read_excel('file1.xls', sheet_name='...')
df2 = pandas.read_excel('file2.xls', sheet_name='...')
df1.merge(df2, how='outer')
You may need some variation of this depending on your column names; see pandas merge.
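One such variation, sketched on frames rebuilt from the question's tables: merging on='tin' (and leaving out the per-sheet ID columns so they don't collide) reproduces the asker's desired outcome, with NaN gaps where one sheet has no match:

```python
import pandas as pd

# The two sheets rebuilt in memory (normally read with pandas.read_excel as above)
df1 = pd.DataFrame({'tin': [21043, 45667, 69780],
                    'name': ['Bob', 'Jim', 'Sal'],
                    'date': ['8/1/2019', '7/1/2018', '4/24/2017']})
df2 = pd.DataFrame({'tin': [21043, 76544, 45667],
                    'job': ['02', '02', '04']})

# Outer merge on the shared tin: matching rows combine,
# unmatched rows from either side are kept
merged = df1.merge(df2, on='tin', how='outer')
```

Sal (tin 69780) keeps an empty job and tin 76544 appears with an empty name, matching the wanted list.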

Use Excel sheet to create dictionary in order to replace values

I have an excel file with product names. First row is the category (A1: Water, A2: Sparkling, A3:Still, B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4:Lemonade etc.), each cell below is a different product. I want to keep this list in a viewable format (not comma separated etc.) as this is very easy for anybody to update the product names (I have a second person running the script without understanding the script)
If it helps I can also have the excel file in a CSV format and I can also move the categories from the top row to the first column
I would like to replace the cells of a dataframe (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the excel it would not be replaced (ex. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna (or DataFrame.stack) to build a helper Series, and then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
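To see the whole thing run, here is the map/fillna variant on frames rebuilt from the question. The Snacks column is an assumption, inferred from the expected output, since the question only lists the Water and Soft Drinks categories:

```python
import pandas as pd

# Category sheet as a frame: each header is a category, the cells below are products.
# The Snacks column is assumed from the expected output.
df1 = pd.DataFrame({'Water': ['Sparkling', 'Still', None],
                    'Soft Drinks': ['Coca Cola', 'Orange Juice', 'Lemonade'],
                    'Snacks': ['Chips', None, None]})
df = pd.DataFrame({'Product': ['Coca Cola', 'Cookie', 'Still', 'Chips'],
                   'Quantity': [1234, 4, 333, 88]})

# Helper Series: product -> category
s = df1.melt().dropna().set_index('value')['variable']
# Unknown products (Cookie) fall through unchanged via fillna
df['Product'] = df['Product'].map(s).fillna(df['Product'])
```

Since the helper Series is built straight from the sheet, the second person can keep editing the readable Excel list without touching the script.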
