Probably a simple answer but I am new to coding and this is my first project.
I have managed to sum together the necessary information from individual spreadsheets and would now like to write an 'End of Month' spreadsheet to sum all individual data.
Here's what I have so far:
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
for file in path.glob("*.xlsx"):
    df = pd.read_excel(file)
    client_total = df.groupby("Nominal")["Amount"].sum()
    print(client_total)
This returns
Nominal
1118 379
1135 2367
1158 811
Name: Amount, dtype: int64
Nominal
1118 1147.85
1135 422.66
1158 990.68
Name: Amount, dtype: float64
Nominal
1118 736.38
1135 477.40
1158 470.16
Name: Amount, dtype: float64
Please let me know how I can merge these three separate results into one easy-to-read monthly total.
Many thanks.
Create a list of Series called out, then use concat and sum by index. (Older pandas allowed sum(level=0); the level argument has since been removed, so use groupby(level=0).sum() instead.)
import pandas as pd
from pathlib import Path

out = []
path = Path("Spreadsheets")
for file in path.glob("*.xlsx"):
    df = pd.read_excel(file)
    client_total = df.groupby("Nominal")["Amount"].sum()
    out.append(client_total)

# concat stacks the per-file Series; grouping by the index (Nominal) sums them
df = pd.concat(out).groupby(level=0).sum()
print(df)
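If the files all share the same columns, you can also skip the per-file grouping entirely: concatenate the raw rows first and group once. A minimal sketch, assuming every file has Nominal and Amount columns:
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
# read every sheet, stack the raw rows, then group once
all_rows = pd.concat(pd.read_excel(f) for f in path.glob("*.xlsx"))
month_total = all_rows.groupby("Nominal")["Amount"].sum()
print(month_total)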
Assuming you have three Series (df1, df2, df3) like the ones printed above, you can simply use the add method, which aligns on the index:
df_sum = df1.add(df2)
df_sum = df_sum.add(df3)
print(df_sum)
Nominal
1118 2263.23
1135 3267.06
1158 2271.84
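One caveat: add aligns on the index, so a Nominal present in one file but missing from another becomes NaN. Passing fill_value=0 treats the missing entry as zero (a sketch under the same df1/df2/df3 assumption):
df_sum = df1.add(df2, fill_value=0).add(df3, fill_value=0)
print(df_sum)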
Hopefully, this can help you (note the running total needs to be a Series, not a DataFrame, so that add aligns on the Nominal index):
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
# accumulator Series indexed by the known Nominal codes, starting at zero
df_sum = pd.Series(0.0, index=[1118, 1135, 1158], name="Amount")
df_sum.index.name = "Nominal"
for file in path.glob("*.xlsx"):
    df = pd.read_excel(file)
    client_total = df.groupby("Nominal")["Amount"].sum()
    print(client_total)
    df_sum = df_sum.add(client_total, fill_value=0)
print(df_sum)
Would this work?
import pandas as pd
from pathlib import Path

path = Path("Spreadsheets")
dfs = []
for file in path.glob("*.xlsx"):
    df = pd.read_excel(file)
    client_total = df.groupby("Nominal")["Amount"].sum()
    dfs.append(client_total)

# accumulate into 'total'; reusing 'df' as the loop variable would shadow the running sum
total = dfs[0]
for s in dfs[1:]:
    total = total.add(s, fill_value=0)
print(total)
Hi, I want to get the values from the first column of an Excel file as an array.
I have already written this code:
import os
import pandas as pd

for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        print(df.iloc[:, 1])
What I get as output now:
0 172081
1 163314
2 173547
3 284221
4 283170
...
3582 163304
3583 160560
3584 166961
3585 161098
3586 162499
Name: Organization CRD#, Length: 3587, dtype: int64
What I wish to get:
172081
163314
173547
284221
283170
...
163304
160560
166961
161098
162499
Can somebody help? Thanks :D
You will just need to use pandas' tolist method:
import os
import pandas as pd

for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        print(df.iloc[:, 1].tolist())
Output (note that tolist prints the whole column as a single Python list):
[172081, 163314, 173547, 284221, 283170, ..., 163304, 160560, 166961, 161098, 162499]
As an array:
df.iloc[:,1].values
As a list:
df.iloc[:,1].values.tolist()
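If you want the plain one-value-per-line output shown in the question, iterating over the column is the simplest route (a small sketch using the same ./python_files layout):
import os
import pandas as pd

for file in os.listdir("./python_files"):
    if file.endswith(".xlsx"):
        df = pd.read_excel(os.path.join("./python_files", file))
        # print each value on its own line, without the pandas index
        for value in df.iloc[:, 1]:
            print(value)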
Trying to create a heatmap with this data, and there are a few problems I can't solve. On the x-axis I want the Location, and on the y-axis I want the Passengers. Those axes should not have duplicates; for the x-axis (Location) it's easy to use drop_duplicates(), but for the y-axis (Passengers) it doesn't work that well. The main problem is that the Passengers column has multiple entries in a single cell. Is there a good way to solve this? Edit: I also need to get rid of the empty cells.
import numpy as np
from pandas import DataFrame
import seaborn as sns
import pandas as pd
from collections.abc import Iterable
%matplotlib inline

file = "vacation.csv"
df = pd.read_csv(file)
example = df.filter(['Location', 'Passengers'])
print(example)
#x_axis = df.filter(['Location']).drop_duplicates()  # drop duplicates
Output:
Location Passengers
0 Paris []
1 Paris []
2 Stockholm []
3 Berlin ['Peter']
4 Berlin ['Maria, Debra, Kim']
... ... ...
2238 Helsinki ['Peter, Maria']
2239 Berlin ['Debra']
2240 Berlin ['Debra']
2241 Helsinki ['Debra']
2242 Paris ['Peter', 'Debra', 'Kim', 'Maria']
[2243 rows x 2 columns]
You can convert the lists to columns as follows, but check whether it's valid for your case:
import pandas as pd
import numpy as np

def keep_one(row):
    # drop duplicate names while preserving order
    unique = {}
    for val in row:
        unique[val] = None
    return list(unique.keys())

df['passengers_col'] = df['passengers_col'].apply(keep_one)

# expand each list into its own column and collect the unique names
keys = np.unique(df['passengers_col'].apply(pd.Series).dropna()).astype('str').tolist()
cols_val = df['passengers_col'].apply(pd.Series).to_numpy().tolist()
new_cols = pd.DataFrame(data=cols_val, columns=keys)

# encode values: 1 where the name appears in its column, 0 otherwise
for k in new_cols.keys():
    new_cols[k] = (new_cols[k] == k).astype(int)
Then you can use pd.concat to merge all columns with the new_cols DataFrame.
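A one-line sketch of that concat, assuming df and new_cols share the same row order:
# attach the encoded columns next to the original ones (row order must match)
df = pd.concat([df, new_cols], axis=1)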
I'm not sure if I understood correctly, but maybe this approach could help you:
import pandas as pd
import numpy as np
import seaborn as sns

# columns must be an ordered list, not a set, or the names may end up swapped
data = pd.DataFrame([["Paris", []], ["Paris", []],
                     ["Stockholm", []], ["Berlin", ['Peter']],
                     ["Berlin", ['Maria', 'Debra', 'Kim']], ["Helsinki", ['Peter', 'Maria']],
                     ["Berlin", ['Debra']], ["Berlin", ['Debra']],
                     ["Helsinki", ['Debra']], ["Paris", ['Peter', 'Debra', 'Kim', 'Maria']]],
                    columns=["Location", "Passengers"])

# summing lists per Location concatenates them into one list per city
data = data.groupby("Location").sum()

# one column per distinct passenger name, then count appearances
cols = np.unique(np.sum(data["Passengers"]))
for col in cols:
    data[col] = 0
for idx in data.index:
    for col in data.loc[idx, "Passengers"]:
        data.loc[idx, col] += 1

sns.heatmap(data.iloc[:, 1:])
Probably you could improve performance by removing the loops if your dataset is big (see the sketch below).
It outputs a heatmap with Location on one axis, passenger names on the other, and appearance counts as the cell values (original output image not included).
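As a possible loop-free variant (a sketch, assuming raw is the original two-column frame before the groupby and the passenger entries are proper Python lists), explode the list column and cross-tabulate:
import pandas as pd
import seaborn as sns

# one row per (Location, passenger) pair; empty lists become NaN and are dropped
flat = raw.explode("Passengers").dropna(subset=["Passengers"])
counts = pd.crosstab(flat["Location"], flat["Passengers"])
sns.heatmap(counts)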
I have 7 csv files of 7 stocks. Each file shares the same format, of columns and rows.
I have tried different ways to merge these files into one dataframe (a loop, glob, etc.) but still haven't succeeded. I want to keep the "Date" column as the index for the dataframe, and the "High" columns of each file next to each other. Then the "High" columns are renamed based on the stock names.
import pandas as pd
FDX = pd.read_csv("../Data/FDX.csv")
GOOGL = pd.read_csv("../Data/GOOGL.csv")
IBM = pd.read_csv("../Data/IBM.csv")
KO = pd.read_csv("../Data/KO.csv")
MS = pd.read_csv("../Data/MS.csv")
NOK = pd.read_csv("../Data/NOK.csv")
XOM = pd.read_csv("../Data/XOM.csv")
stocks = pd.DataFrame({"FDX": FDX["High"],
"GOOGL": GOOGL["High"],
"IBM": IBM["High"],
"KO": KO["High"],
"MS": MS["High"],
"NOK": NOK["High"],
"XOM": XOM["High"]
})
stocks.head()
The code I wrote has errors. Is there any way to do it?
Thank you for your answers!
If they all have the same date range this would work:
MergeList = [[GOOGL, 'GOOGL'], [IBM, 'IBM'], [KO, 'KO'], [MS, 'MS'], [NOK, 'NOK'], [XOM, 'XOM']]
NewList = []
for df_t, col_name in MergeList:
    df_t = df_t[['Date', 'High']]
    df_t.columns = ['Date', col_name]
    NewList.append(df_t)

# start from FDX, trimmed and renamed the same way as the others
Merge = FDX[['Date', 'High']].rename(columns={'High': 'FDX'})
for df_t in NewList:
    Merge = pd.merge(Merge, df_t, on='Date')
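A more compact alternative (a sketch, assuming each CSV has Date and High columns and the ../Data/ layout from the question): read each file with Date as the index and let concat align the columns:
import pandas as pd

tickers = ["FDX", "GOOGL", "IBM", "KO", "MS", "NOK", "XOM"]
# one 'High' Series per ticker, indexed by Date
highs = {t: pd.read_csv(f"../Data/{t}.csv", index_col="Date")["High"] for t in tickers}
stocks = pd.concat(highs, axis=1)  # columns are named after the dict keys
stocks.head()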
If this question is unclear, I am very open to constructive criticism.
I have an excel table with about 50 rows of data, with the first column in each row being a date. I need to access all the data for only one date, and that date appears only about 1-5 times. It is the most recent date so I've already organized the table by date with the most recent being at the top.
So my goal is to store that date in a variable and then have Python look only for that variable (that date) and take only the columns corresponding to that variable. I need to use this code on hundreds of other Excel files as well, so it would need to arbitrarily take the most recent date (always at the top, though).
My current code below simply takes the first 5 rows because I know that's how many times this date occurs.
import os
from numpy import genfromtxt
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        broken_df = pd.read_csv(file_path)
        df3 = broken_df['DATE']
        df4 = broken_df['TRADE ID']
        df5 = broken_df['AVAILABLE STOCK']
        df6 = broken_df['AMOUNT']
        df7 = broken_df['SALE PRICE']
        print(df3)
        #print(df3.head(6))
        print(df4.head(6))
        print(df5.head(6))
        print(df6.head(6))
        print(df7.head(6))
This is a relatively simple filtering operation. You state that you want to "take only the columns" that are the latest date, so I assume that an acceptable result will be a filtered DataFrame containing just the rows for that date.
Here's a simple CSV that is similar to your structure:
DATE,TRADE ID,AVAILABLE STOCK
10/11/2016,123,123
10/11/2016,123,123
10/10/2016,123,123
10/9/2016,123,123
10/11/2016,123,123
Note that I mixed up the dates a little bit, because it's hacky and error-prone to just assume that the latest dates will be at the top. The following script will filter it appropriately:
import pandas as pd

df = pd.read_csv('data.csv')
# convert the DATE column to datetimes
df['DATE'] = pd.to_datetime(df['DATE'])
# find the latest datetime
latest_date = df['DATE'].max()
# boolean filtering keeps only the rows that equal the latest date
latest_rows = df[df['DATE'] == latest_date]
print(latest_rows)
# now you can perform your operations on latest_rows
In my example, this will print:
DATE TRADE ID AVAILABLE STOCK
0 2016-10-11 123 123
1 2016-10-11 123 123
4 2016-10-11 123 123
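To run the same filter over every file in the folder, as the question's loop does, a combined sketch (paths taken from the question):
import os
import pandas as pd

path = 'Z:\\folderwithcsvfile'
for filename in os.listdir(path):
    file_path = os.path.join(path, filename)
    if os.path.isfile(file_path):
        df = pd.read_csv(file_path)
        df['DATE'] = pd.to_datetime(df['DATE'])
        # keep only the rows carrying the most recent date in this file
        latest_rows = df[df['DATE'] == df['DATE'].max()]
        print(latest_rows)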
So I am doing some merges using Pandas with a name-map, because the two files I want to merge don't have exactly the same names to merge on easily. My Pdata sheet has dates from 2014 to 2016, but I want to filter it down to only contain dates from 1/1/2015 to 31/12/2016.
Below is the code that I currently have, and I am not sure how to (or whether I can) filter on date before the merge.
import pandas as pd

path = 'C:/Users/Rukgo/Desktop/Match thing/'
name_map = pd.read_excel(path + 'name_map.xls', sheet_name=0)
Tdata = pd.read_excel(path + '2015_TXNs.xls', sheet_name=0)
pdata = pd.read_excel(path + 'Pipeline.xls', sheet_name=0)
#pdata = pdata[(1/1/2015 <= pdata.date) & (pdata.date <= 31/12/2015)]
merged = pd.merge(Tdata, name_map, how="left", on="Local Customer")
merged.to_excel(path + "results.xls")
mdata = pd.read_excel(path + 'results.xls', sheet_name=0)
final_merge = pd.merge(mdata, pdata, how='right', on='Client')
final_merge = final_merge[final_merge.Amount_USD != 0]
final_merge.to_excel(path + "Final Results.xls")
So I had a commented-out section that ended up being quite close to the actual code that I needed:
pdata = pdata[(pdata['date'] >= '20150101') & (pdata['date'] <= '20151231')]
That ended up working perfectly, though it hard-codes the dates.
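A slightly tidier version of the same filter (a sketch, assuming the column is named date and parses cleanly as datetimes) uses Series.between:
import pandas as pd

pdata['date'] = pd.to_datetime(pdata['date'])
# between is inclusive on both ends by default
pdata = pdata[pdata['date'].between('2015-01-01', '2015-12-31')]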