companies.xlsx
company To
1 amazon hi#test.de
2 google bye#test.com
3 amazon hi#tld.com
4 starbucks hi#test.de
5 greyhound bye#tuz.de
emails.xlsx
hi#test.de bye#test.com hi#tld.com ...
1 amazon google microsoft
2 starbucks amazon tesla
3 Grey Hound greyhound
4 ferrari
So I have the two Excel sheets above and read both of them:
file1 = pd.ExcelFile('data/companies.xlsx')
file2 = pd.ExcelFile('data/emails.xlsx')
df_companies = file1.parse('sheet1')
df_emails = file2.parse('sheet1')
What I'm trying to accomplish is:
check whether each value in df_companies['To'] exists as a header in df_emails
if the header exists in df_emails, search that header's column for df_companies['company']
if the company is found, add a column to df_companies and fill in '1'; if not, fill in '0'
E.g.: the company amazon has the To email hi#test.de in companies.xlsx. In emails.xlsx the header hi#test.de exists and amazon is found in that column, so it's a '1'.
Does anyone know how to accomplish this?
Here's one approach: convert df_emails to a dictionary of lists and map it onto df_companies['To']. Then check whether each mapped column contains df_companies['company'].
# map each 'To' value to that header's column as a list; missing headers become ''
df_companies['check'] = df_companies['To'].map(df_emails.to_dict(orient='list')).fillna('')
# row-wise membership test, cast from boolean to 0/1
df_companies['check'] = df_companies.apply(lambda x: x['company'] in x['check'], axis=1).astype(int)
company To check
1 amazon hi#test.de 1
2 google bye#test.com 1
3 amazon hi#tld.com 0
4 starbucks hi#test.de 1
5 greyhound bye#tuz.de 0
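If the email sheet is large, a set-based lookup avoids scanning whole lists per row; a minimal sketch, assuming the same frame names as above:
# one set of companies per email header
lookup = {col: set(df_emails[col].dropna()) for col in df_emails.columns}
df_companies['check'] = [
    int(company in lookup.get(to, set()))   # 0 when the header is missing
    for company, to in zip(df_companies['company'], df_companies['To'])
]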
I am working with files in a folder and need a better way to loop through them and append a column to build a master file. For two files I was reading them as two DataFrames and appending a Series; however, I have now run into a situation with more than 100 files.
file 1 is as below:
Num Department Product Salesman Location rating1
1 Electronics TV 3 Bigmart, Delhi 5
2 Electronics TV 1 Bigmart, Mumbai 4
3 Electronics TV 2 Bigmart, Bihar 3
4 Electronics TV 2 Bigmart, Chandigarh 5
5 Electronics Camera 2 Bigmart, Jharkhand 5
Similarly, file 2:
Num Department Product Salesman Location rating2
1 Electronics TV 3 Bigmart, Delhi 2
2 Electronics TV 1 Bigmart, Mumbai 4
3 Electronics TV 2 Bigmart, Bihar 4
4 Electronics TV 2 Bigmart, Chandigarh 5
5 Electronics Camera 2 Bigmart, Jharkhand 3
What I am trying to achieve is to read the rating column from every other file and append it as a new column. Expected:
Num Department Product Salesman Location rating1 rating2
1 Electronics TV 3 Bigmart, Delhi 5 2
2 Electronics TV 1 Bigmart, Mumbai 4 4
3 Electronics TV 2 Bigmart, Bihar 3 5
4 Electronics TV 2 Bigmart, Chandigarh 5 5
5 Electronics Camera 2 Bigmart, Jharkhand 5 3
I modified some of the code posted here. The following code worked:
import os
import pandas as pd

def read_folder(folder):
    files = [i for i in os.listdir(folder) if 'xlsx' in i]
    df = pd.read_excel(os.path.join(folder, files[0]))
    for f in files[1:]:
        df2 = pd.read_excel(os.path.join(folder, f))
        # column 5 holds the rating in each file
        df = df.merge(df2.iloc[:, 5], left_index=True, right_index=True)
    return df
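For example, the call might look like this (the folder and output names are assumptions):
master = read_folder('data')
master.to_excel('master.xlsx', index=False)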
This method reads a folder and returns the files both as a list of DataFrames and as one concatenated DataFrame:
import pandas as pd
import os

def read_folder(csv_folder):
    files = os.listdir(csv_folder)
    dfs = []
    for f in files:
        print(f)
        csv_file = os.path.join(csv_folder, f)
        dfs.append(pd.read_csv(csv_file))
    # one concatenated frame in case you need all rows stacked together
    df_full = pd.concat(dfs, ignore_index=True)
    return dfs, df_full
As I understand your last comment, you need to add the rating columns and create one file. After reading all the files you can do the operation below.
final_df = dfs[0]
for i, d in enumerate(dfs[1:], start=1):
    # an f-string is needed here; "rating_" + i would raise a TypeError on the int
    final_df[f"rating_{i}"] = d["rating"]
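To write the combined result out as the master file (the file name is just an example):
final_df.to_csv('master.csv', index=False)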
This version of read_folder() returns a list of DataFrames. It also adds a helper column for the ratings.
import pandas as pd
from pathlib import Path
def read_folder(csv_folder):
    '''Input is a folder with csv files; return a list of data frames.'''
    csv_folder = Path(csv_folder).absolute()
    csv_files = [f for f in csv_folder.iterdir() if f.suffix == '.csv']
    # the assign() method adds a helper column tagging each file's rows
    dfs = [
        pd.read_csv(csv_file).assign(rating_src=f'rating-{idx}')
        for idx, csv_file in enumerate(csv_files, 1)
    ]
    return dfs
Now assemble the data frames into the desired shape:
dfs = read_folder(csv_folder)
dfs = (pd.concat(dfs)
       .set_index(['Num', 'Department', 'Product', 'Salesman', 'Location', 'rating_src'])
       .squeeze()                    # the single remaining column becomes a Series
       .unstack(level='rating_src')  # one rating column per source file
       .reset_index()
)
dfs.columns.name = ''
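Note that the rating columns come out named after the helper values (rating-1, rating-2, ...). If you need them to match the expected output exactly (rating1, rating2), a small rename sketch, assuming the frame above:
dfs = dfs.rename(columns=lambda c: c.replace('rating-', 'rating'))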
This is using Python. I have an Excel sheet that in its most basic form looks like this:
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
New York Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Stockholm Cup a 5
Stockholm Cup b 3
Madrid Cup a 5
New York Plate a 8
I want to group the locations together so that all the New York rows are together, all the Madrid rows, and so on, and export them to separate Excel files called new york, madrid, and stockholm, with the same information on the rows. So basically a copy and paste of each row into a new Excel file named after that row's location. Then I want to add all cups together as one and all plates together as one on the second sheet of each file. It would make sense to do this before exporting the data, right?
End result: three named Excel files, each containing only its own data, and some easy math on the second sheet.
The real sheet has 15,000 rows, 50 locations, and 100 items. These change, so it has to be done procedurally; New York might be Toronto next time.
So far I have been able to group them with pandas, but every attempt after that has failed.
I'm new to pandas, so I thought this one would be relatively easy to do.
import pandas as pd

stock_report_excel = "small_stores_blocked_stock_value.xlsx"
df_soh = pd.read_excel(stock_report_excel, sheet_name='SOH')
df_stores = df_soh.groupby(['Site Name'])
# guess: loop to add each group to its own sheet
# adding of the items to sheet 2
# exporting
While it is not very clear what your desired purpose is, I think a pandas MultiIndex DataFrame may be helpful for you. I wrote some simple code below that I hope can guide you further.
import pandas as pd
sites=pd.Series(['New York','Stockholm','Madrid','New York','New York','Madrid','Stockholm','Stockholm','Stockholm','Madrid','New York'])
col2=pd.Series(['Cup','Plate','Cup','Cup','Plate','Cup','Plate','Cup','Cup','Cup','Plate'])
col3=pd.Series(['a','b','a','b','a','b','a','a','b','a','a'])
col4=pd.Series([3,5,2,5,8,9,2,5,3,5,8])
data=pd.DataFrame({'sites':sites,'col2':col2,'col3':col3,'col4':col4})
# You can of course replace all the code above with pandas read functions.
data1 = data.set_index(['sites','col2','col3']) # Set as MultiIndex DataFrame.
data1.loc[('New York'),:] # This will give you all the 'New York' data
data1.loc[('New York','Cup'),:] # This will give you all the 'New York' & 'Cup' data.
# Retrieving all the 'Cup' data is a bit tricky, see the following
idx=pd.IndexSlice
data1.loc[idx[:,'Cup'],:]
The output is as follows.
# data
sites col2 col3 col4
0 New York Cup a 3
1 Stockholm Plate b 5
2 Madrid Cup a 2
3 New York Cup b 5
4 New York Plate a 8
5 Madrid Cup b 9
6 Stockholm Plate a 2
7 Stockholm Cup a 5
8 Stockholm Cup b 3
9 Madrid Cup a 5
10 New York Plate a 8
# data1
col4
sites col2 col3
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Cup a 5
b 3
Madrid Cup a 5
New York Plate a 8
# data1.loc[('New York'),:]
col4
col2 col3
Cup a 3
b 5
Plate a 8
a 8
# data1.loc[('New York','Cup'),:]
col4
col3
a 3
b 5
# data1.loc[idx[:,'Cup'],:]
col4
sites col2 col3
New York Cup a 3
Madrid Cup a 2
New York Cup b 5
Madrid Cup b 9
Stockholm Cup a 5
b 3
Madrid Cup a 5
If you do not want to see any warnings and want to keep high performance, you can always use idx with explicit slices for every level:
data1.loc[idx['New York', :, :], :]
data1.loc[idx['New York', 'Cup', :], :]
data1.loc[idx[:, 'Cup', :], :]
Your next step is to write these data selections into separate sheets. I am not very familiar with that because I always write data to text files. For example, writing one of them to a csv file is as simple as data1.loc[idx['New York','Cup',:],:].to_csv('result.csv'). I recommend searching for the functions you need.
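The same selection can also go straight to an Excel sheet; a minimal sketch (the file and sheet names here are made up):
data1.loc[idx['New York', 'Cup', :], :].to_excel('result.xlsx', sheet_name='NY Cups')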
Hope this is helpful. Good luck!
Answer to the problem
import pandas as pd
import os

file = "yourfile.xlsx"
extension = os.path.splitext(file)[1]
filename = os.path.splitext(file)[0]
abpath = os.path.dirname(os.path.abspath(file))
df = pd.read_excel(file, sheet_name="sheetname")
colpick = "column to extract"
cols = list(set(df[colpick].values))

def sendtofile(cols):
    # one workbook per unique value in colpick, written to an 'exported' subfolder
    os.makedirs(os.path.join(abpath, 'exported'), exist_ok=True)
    for i in cols:
        df[df[colpick] == i].to_excel("{}/exported/{}.xlsx".format(abpath, i), sheet_name=i, index=False)
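The question also asks for the cup/plate totals on a second page of each workbook. A sketch with pd.ExcelWriter, where the 'Item' and 'Quantity' column names are assumptions since the real headers were not shown:
for i in cols:
    sub = df[df[colpick] == i]
    with pd.ExcelWriter("{}/exported/{}.xlsx".format(abpath, i)) as writer:
        # page 1: the raw rows for this location
        sub.to_excel(writer, sheet_name=i, index=False)
        # page 2: total quantity per item ('Item' and 'Quantity' are hypothetical)
        sub.groupby('Item')['Quantity'].sum().to_excel(writer, sheet_name='totals')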
I have 2 separate excel spreadsheets
spreadsheet 1 is as such:
ID tin name date
1 21043 Bob 8/1/2019
2 45667 Jim 7/1/2018
3 69780 Sal 4/24/2017
The 2nd spreadsheet is as such:
ID tin job
1 21043 02
2 76544 02
3 45667 04
I am trying to figure out how to match the 2 spreadsheets and make 1 list as such:
ID tin name date job
1 21043 Bob 8/1/2019 02
2 45667 Jim 7/1/2018 04
3 69780 Sal 4/24/2017
4 76544 02
The common denominator is the "tin", but I have to merge the rows that duplicate and then add the ones from both sheets that don't.
I am new to Python and VERY new to xlrd, so I cannot seem to even figure out the best terms to use to google an example.
I found some information on a next(iter statement, but after countless attempts I could not figure out a useful way to use it to combine them.
Is there an easy way, or am I "up a creek"?
Thank you,
Bob
You can use pandas for this. pandas uses xlrd and other Excel readers under the hood.
You will do something like this:
import pandas

df1 = pandas.read_excel('file1.xls', sheet_name='...')
df2 = pandas.read_excel('file2.xls', sheet_name='...')
merged = df1.merge(df2, how='outer')
You may need some variation of this depending on your column names; see pandas merge.
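Since the expected output lines up on tin alone and renumbers ID, here is a sketch that reproduces the table above, assuming the column names shown:
merged = (df1.drop(columns='ID')
             .merge(df2.drop(columns='ID'), on='tin', how='outer')
             .sort_values('tin')
             .reset_index(drop=True))
# rebuild a fresh 1-based ID column, as in the expected result
merged.insert(0, 'ID', merged.index + 1)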
I have an Excel file with product names. The first row is the category and each cell below it is a different product (A1: Water, A2: Sparkling, A3: Still; B1: Soft Drinks, B2: Coca Cola, B3: Orange Juice, B4: Lemonade, etc.). I want to keep this list in a viewable format (not comma-separated, etc.), as this makes it very easy for anybody to update the product names (a second person runs the script without understanding it).
If it helps, I can also have the Excel file in CSV format, and I can also move the categories from the top row to the first column.
I would like to replace the cells of a DataFrame (df) with the product categories. For example, Coca Cola would become Soft Drinks. If the product is not in the Excel file, it would not be replaced (e.g. Cookie).
print(df)
Product Quantity
0 Coca Cola 1234
1 Cookie 4
2 Still 333
3 Chips 88
Expected Outcome:
print (df1)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
Use DataFrame.melt with DataFrame.dropna, or DataFrame.stack, to build a helper Series, then use Series.replace:
s = df1.melt().dropna().set_index('value')['variable']
Alternative:
s = df1.stack().reset_index(name='v').set_index('v')['level_1']
df['Product'] = df['Product'].replace(s)
#if performance is important
#df['Product'] = df['Product'].map(s).fillna(df['Product'])
print (df)
Product Quantity
0 Soft Drinks 1234
1 Cookie 4
2 Water 333
3 Snacks 88
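For reference, here is how the helper Series behaves on a tiny mock-up of the category sheet described in the question (the Snacks column is inferred from the expected output; the real df1 would come from pd.read_excel):
import pandas as pd

df1 = pd.DataFrame({
    'Water': ['Sparkling', 'Still'],
    'Soft Drinks': ['Coca Cola', 'Orange Juice'],
    'Snacks': ['Chips', None],
})
# melt turns the sheet into (variable=category, value=product) pairs
s = df1.melt().dropna().set_index('value')['variable']
# s now maps e.g. 'Coca Cola' -> 'Soft Drinks' and 'Still' -> 'Water'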