Grouping and exporting excel rows using python

Grouping and exporting excel rows using python - python

This is using Python.
I have an excel sheet that in its most basic form looks like this
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
New York Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Stockholm Cup a 5
Stockholm Cup b 3
Madrid Cup a 5
New York Plate a 8
I want to group the locations together so that all the new yorks are together and madrids etc and export them to separate excel sheets called new york, madrid, stockholm. With the same info on the rows. So basically just a copy and paste of the row into a new excel sheet named after that row. Then I want to add all cups together as one and all plates as one on the second page of each one. Would make sense to do this before exporting the data right?
End result 3 excel sheets named, containing only their data, and some easy math on the second sheet.
The real excel sheet is dealing with 15000 rows 50 locations and 100 items. These change so would have to be a procedural way. New york might be Toronto next time.
So far I have been able to group them by pandas but every attempt after that as failed.
New to pandas so I thought this one would be relatively easy to do.
import pandas as pd
stock_report_excel = "small_stores_blocked_stock_value.xlsx"
df_soh = pd.read_excel(stock_report_excel, sheet_name='SOH')
df_stores = df_soh.groupby(['Site Name'])
guess loop to add to sheet
adding of the items to sheet 2
exporting

While not very clear what is your desired purpose, I guess Pandas MultiIndex DataFrame may be helpful for you. I write some simple codes below and wish could guide you further.
import pandas as pd
sites=pd.Series(['New York','Stockholm','Madrid','New York','New York','Madrid','Stockholm','Stockholm','Stockholm','Madrid','New York'])
col2=pd.Series(['Cup','Plate','Cup','Cup','Plate','Cup','Plate','Cup','Cup','Cup','Plate'])
col3=pd.Series(['a','b','a','b','a','b','a','a','b','a','a'])
col4=pd.Series([3,5,2,5,8,9,2,5,3,5,8])
data=pd.DataFrame({'sites':sites,'col2':col2,'col3':col3,'col4':col4})
# You can of course replce all the codes above with Pandas read related functions.
data1 = data.set_index(['sites','col2','col3']) # Set as MultiIndex DataFrame.
data1.loc[('New York'),:] # This will give you all the 'New York' data
data1.loc[('New York','Cup'),:] # This will give you all the 'New York' & 'Cup' data.
# Retrieving all the 'Cup' data is a bit tricky, see the following
idx=pd.IndexSlice
data1.loc[idx[:,'Cup'],:]
Output as follows.
# data
sites col2 col3 col4
0 New York Cup a 3
1 Stockholm Plate b 5
2 Madrid Cup a 2
3 New York Cup b 5
4 New York Plate a 8
5 Madrid Cup b 9
6 Stockholm Plate a 2
7 Stockholm Cup a 5
8 Stockholm Cup b 3
9 Madrid Cup a 5
10 New York Plate a 8
# data1
col4
sites col2 col3
New York Cup a 3
Stockholm Plate b 5
Madrid Cup a 2
New York Cup b 5
Plate a 8
Madrid Cup b 9
Stockholm Plate a 2
Cup a 5
b 3
Madrid Cup a 5
New York Plate a 8
# data1.loc[('New York'),:]
col4
col2 col3
Cup a 3
b 5
Plate a 8
a 8
# data1.loc[('New York','Cup'),:]
col4
col3
a 3
b 5
# data1.loc[idx[:,'Cup'],:]
col4
sites col2 col3
New York Cup a 3
Madrid Cup a 2
New York Cup b 5
Madrid Cup b 9
Stockholm Cup a 5
b 3
Madrid Cup a 5
If you do not want to see any warnings and want to keep high performance, you can all use idx and explicit coding, which are:
data1.loc[idx['New York',:,:],:]
data1.loc[idx['New York','Cup',:],:]
data1.loc[idx['','Cup',:],:]
Your next step is to write these data selections into a separate sheet. I am not very familiar with that because I always write data into text files. For example, write one of them into a csv file is as simple as data1.loc[idx['New York','Cup',:],:].to_csv('result.csv',index=False). I recommend you search your desired functions.
Hope this is helpful. Good luck!

Answer to the problem
import pandas as pd
import os
file = "yourfile.xlsx"
extension = os.path.splitext(file)[1]
filename = os.path.splitext(file)[0]
abpath = os.path.dirname(os.path.abspath(file))
df=pd.read_excel(file, sheet_name="sheetname")
colpick = "column to extract"
cols=list(set(df[colpick].values))
def sendtofile(cols):
for i in cols:
df[df[colpick] == i].to_excel("{}/exported/{}.xlsx".format(abpath, i), sheet_name=i, index=False)
return

Related

Pythom:Compare 2 columns and write data to excel sheets

I need to compare two columns together: "EMAIL" and "LOCATION".
I'm using Email because it's more accurate than name for this issue.
My objective is to find total number of locations each person worked
at, sum up the total of locations to select which sheet the data
will been written to and copy the original data over to the new
sheet(tab).
I need the original data copied over with all the duplicate
locations, which is where this problem stumps me.
Full Excel Sheet
Had to use images because it flagged post as spam
The Excel sheet (SAMPLE) I'm reading in as a data frame:
Excel Sample Spreadsheet
Example:
TOMAPPLES#EXAMPLE.COM worked at WENDYS,FRANKS HUT, and WALMART - That
sums up to 3 different locations, which I would add to a new sheet
called SHEET: 3 Different Locations
SJONES22#GMAIL.COM worked at LONDONS TENT and YOUTUBE - That's 2 different locations, which I would add to a new sheet called SHEET:
2 Different Locations
MONTYJ#EXAMPLE.COM worked only at WALMART - This user would be added
to SHEET: 1 Location
Outcome:
data copied to new sheets
Sheet 2
Sheet 2: different locations
Sheet 3
Sheet 3: different locations
Sheet 4
Sheet 4: different locations
Thanks for taking your time looking at my problem =)

Hi Check below lines if work for you..
import pandas as pd
df = pd.read_excel('sample.xlsx')
df1 = df.groupby(['Name','Location','Job']).count().reset_index()
# this is long line
df2 = df.groupby(['Name','Location','Job','Email']).agg({'Location':'count','Email':'count'}).rename(columns={'Location':'Location Count','Email':'Email Count'}).reset_index()
print(df1)
print('\n\n')
print(df2)
below is the output change columns to check more variations
df1
Name Location Job Email
0 Monty Jakarta Manager 1
1 Monty Mumbai Manager 1
2 Sahara Jonesh Paris Cook 2
3 Tom App Jakarta Buser 1
4 Tom App Paris Buser 2
df2 all columns
Name Location ... Location Count Email Count
0 Monty Jakarta ... 1 1
1 Monty Mumbai ... 1 1
2 Sahara Jonesh Paris ... 2 2
3 Tom App Jakarta ... 1 1
4 Tom App Paris ... 2 2

Pandas, Dataframe, conditional sum of column for each row

I am new to python and trying to move some of my work from excel to python, and wanted an excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
I my case, I have 6 columns, a unique Trade ID, an Issuer, a Trade date, a release date, a trader, and a quantity. I wanted to get a column which show the sum of available quantity for release at each row. Something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in excel, I just pull the SUMIFS formulas down the whole column and it will work, I am not sure how I can do it in python.
Many thanks!

What you could do is a df.where
so for example you could say
Qdf = df.where(df["Quantity"]>=5)
and then do you sum, Idk what you want to do since I have 0 knowledge about excell but I hope this helps

Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

I have a dataframe as below:
import pandas as pd
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
'Region':['Americas','NaN','NaN','Asia','Europe','NaN','NaN'],
'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
'Animal':['Bison','NaN','Golden Eagle','Tiger','Lion','Lion','NaN'],
'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
I want to group by Country and Flower and forward fill or backward fill the columns Region and Animal where there are missing values. However the column Game should remain intact
I have tried this but it didn't work:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
also :
df.groupby(['Country','Flower'])['Animal', 'Region'].isna().bfill()
I want to know how to go about with this.
while this works but it removes the Games column:
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Animal', 'Region'].bfill().ffill()
And if i do a transform there is a mismatch in the length. Also please note that this is sample dataframe where I had added "NaN" as a string in the original frame it is as np.nan.

If you change your dataframe code to actually include np.nans, then the code you provided actually works. Although nans appear as normal text 'Nan', you can't create a dataframe writing that text by hand because that will be interpreted as a string, not an actual missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],
'Region':['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],
'Flower':['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],
'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion','NaN'],
'Game':['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
Then, this:
df['Region'] = df.groupby(['Country','Flower'])['Region'].transform(lambda x: x.ffill())
actually yields this:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 NaN USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 NaN UK Dandelion cricket Europe

First you need to know 'NaN' is not NaN
df=df.replace({'NaN':np.nan})
df.groupby(['Country','Flower'])['Region'].ffill()
Out[109]:
0 Americas
1 Americas
2 NaN# since here only have single row , that why stay NaN
3 Asia
4 Europe
5 Europe
6 Europe
Name: Region, dtype: object
Second if you need to chain two iid function in pandas you need apply
df.update(df.groupby(['Country','Flower'])['Animal', 'Region'].apply(lambda x : x.bfill().ffill()))
df
Out[119]:
Animal Country Flower Game Region
0 Bison USA Rose Baseball Americas
1 Bison USA Rose Baseball Americas
2 Golden Eagle MEX Lily soccer NaN
3 Tiger IND Orchid hockey Asia
4 Lion UK Dandelion cricket Europe
5 Lion UK Dandelion cricket Europe
6 Lion UK Dandelion cricket Europe

As Mex and Lily are only rows and moreover their region value is nan, fillna function not able to find appropriate group value.
If we catch the exception while fillna group mode then those value where there is no group will be left as it is. Then apply ffill and bfill to cover those value which doesn't have appropriate group
df_stack = pd.DataFrame({'Country':['USA','USA','MEX','IND','UK','UK','UK'],'Region': ['Americas',np.nan,np.nan,'Asia','Europe',np.nan,np.nan],'Flower': ['Rose','Rose','Lily','Orchid','Dandelion','Dandelion','Dandelion'],'Animal':['Bison',np.nan,'Golden Eagle','Tiger','Lion','Lion',np.nan],'Game': ['Baseball','Baseball','soccer','hockey','cricket','cricket','cricket']})
print("-------Before imputation------")
print(df_stack)
def fillna_Region(grp):
try:
return grp.fillna(grp.mode()[0])
except BaseException as e:
print('Error as no correspindg group: ' + str(e))
df_stack["Region"] =
df_stack["Region"].fillna(df_stack.groupby(['Country','Flower']) ['Region'].transform(lambda grp : fillna_Region(grp)))
df_stack["Animal"] =
df_stack["Animal"].fillna(df_stack.groupby(['Country','Flower']) ['Animal'].transform(lambda grp : fillna_Region(grp)))
df_stack = df_stack.ffill(axis = 0)
df_stack = df_stack.bfill(axis =0)
print("-------After imputation------")
print(df_stack)

Pandas read_html returned column with NaN values in Python

I am trying to parse table located here using Pandas read.html function. I was able to parse the table. However, the column capacity returned with NaN . I am not sure, what could be the reason.I would like to parse entire table and use it for further research. So any help is appreciated. Below is my code so far..
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)

Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
To replace anything under square brackets use:
df.Capacity = df.Capacity.str.replace(r"\[.*\]","")
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.

Pandas is only able to get the superscript (for whatever reason) rather than the actual value, if you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc (if they have footnotes) and NaN otherwise.
You may want to look into alternatives of fetching the data, or scraping the data yourself using BeautifulSoup, since Pandas is looking and therefore returning the wrong data.

Answer Posted by #anky_91 was correct. I wanted to try another approach without using Regex. Below was my solution without using Regex.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
Solution was to takeout "r" presented by #anky_91 in line 1 and line 4
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object

How to create Boolean indicator matrix in PYTHON

I have the following dataset:
user artist sex country
0 1 red hot chili peppers f Germany
1 1 the black dahlia murder f Germany
2 1 goldfrapp f Germany
3 2 dropkick murphys f Germany
4 2 le tigre f Germany
.
.
289950 19718 bob dylan f Canada
289951 19718 pixies f Canada
289952 19718 the clash f Canada
I want to create a Boolean indicator matrix using a dataframe, where there is one row for each user and one column for each artist. For each row(user) if there is artist return 1 else return 0.
Just to mention, there are 1004 unique artists and 15000 unique users—it’s a large data set.
I have created an empty matrix using the following:
pd.DataFrame(index=user, columns=artist)
I am having difficulty populating the dataframe correctly.

There is a method in pandas called notnull
Suppose your dataframe is named df, you should use:
df['has_artist'] = df['artist'].notnull()
This will add a column of boolean named has_artist to your dataframe
If you want to have 0 and 1 do instead:
df['has_artist'] = df['artist'].notnull().astype(int)
You can also store it in a different variable and not alter your dataframe.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Grouping and exporting excel rows using python - python

Related

Pythom:Compare 2 columns and write data to excel sheets

Pandas, Dataframe, conditional sum of column for each row

Forward fill or back fill NaN values in Pandas columns based on grouping of other columns

Pandas read_html returned column with NaN values in Python

How to create Boolean indicator matrix in PYTHON

Categories

Resources