Subtract numbers from two dataframes based on a string in Python

I am an absolute beginner. I have two pivot tables stored in two different sheets of the same Excel file.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']
The columns of df2 are sub-categories of the columns of df1.
Each sheet has a pivot table:
df1: [1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2: [1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I want to subtract the value of each column in sheet 2 whose name ends with F (e.g. 1CF, 1EF, 1FF, and so on) from the respective column in sheet 1 (i.e. 1C, 1E, 1F, and so on).
My result should look like 1C = 1C - 1CF = 1037, and it should be stored in a new sheet (here: Sheet 3).
My Python code:
# imports
import pandas as pd
import numpy as np
# the Excel file to work with
file = "Stratification_worksheet.xlsx"
# loading the spreadsheet
data = pd.ExcelFile(file)
# sheet names
print(data.sheet_names)
# loading the sheet "Auftrag" into df
df = data.parse("Auftrag")
print(df)
# prefix lists for the two ordercode families
L1 = ["PMC11", "PMP11", "PMP21", "PMC21", "PMP23"]
L2 = ["PTP33B", "PTP31B", "PTC31B"]
m1 = df["ordercode"].str.startswith(tuple(L1))
m2 = df["ordercode"].str.startswith(tuple(L2))
# creating a new column pressurerange and slicing the pressure range from the ordercode
a = df["ordercode"].str.slice(10, 12)
b = df["ordercode"].str.slice(11, 13)
df["pressurerange"] = np.select([m1, m2], [a, b], default=np.nan)
print(df)
# creating a new column pressureunit and slicing the pressure unit from the ordercode
c = df["ordercode"].str.slice(12, 13)
d = df["ordercode"].str.slice(14, 15)
df["pressureunit"] = np.select([m1, m2], [c, d], default=np.nan)
print(df)
# creating a temporary column combining pressurerange and pressureunit
df["pressuresensor"] = df["pressurerange"] + df["pressureunit"]
print(df)
# pivot tables
print(df.pivot_table(values="total", columns="pressurerange", aggfunc={"total": np.sum}))
print(df.pivot_table(values="total", columns="pressuresensor", aggfunc={"total": np.sum}))
# creating new worksheets
df1 = df.pivot_table(values="total", columns="pressurerange", aggfunc={"total": np.sum})
df2 = df.pivot_table(values="total", columns="pressuresensor", aggfunc={"total": np.sum})
# append the pivot tables to the existing workbook
with pd.ExcelWriter(file, engine="openpyxl", mode="a") as writer:
    df1.to_excel(writer, sheet_name="pressurerangepivot")
    df2.to_excel(writer, sheet_name="pressuresensorpivot")
"""now we have classified the ordercode based on the pressurerange and pressureunit and we have the sum under each category"""
# check the columns
print(list(df))
print(list(df1))
print(list(df2))
I tried suffix = "F" with df3 = df1.iloc[:, :] - df2.iloc[:, :].endswith(suffix, 1, 2), but it raises an error.
df3 = df1['1C'] - df2['1CF']
This gives exactly the right value, but I don't know how to do it for the entire dataframe with simple code.

df2 = df2.filter(regex=".*F$")               # keep only the 'F' columns of sheet 2
df2.columns = [i[:-1] for i in df2.columns]  # drop the trailing 'F' so the columns align for subtraction
result = df1 - df2                           # subtract values column-wise
result[result.isnull()] = df1                # keep the sheet-1 value where there is no 'F' sub-category
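To also store the result in a new sheet as asked, here is a minimal end-to-end sketch, assuming the workbook and sheet names from the code above:
import pandas as pd

file = "Stratification_worksheet.xlsx"
df1 = pd.read_excel(file, sheet_name="pressurerangepivot", index_col=0)
df2 = pd.read_excel(file, sheet_name="pressuresensorpivot", index_col=0)

f_cols = df2.filter(regex=".*F$")                  # only the 'F' sub-category columns
f_cols.columns = [c[:-1] for c in f_cols.columns]  # '1CF' -> '1C', '1EF' -> '1E', ...

result = df1 - f_cols          # aligned, column-wise subtraction
result[result.isnull()] = df1  # categories without an 'F' sub-category stay unchanged

# append the result as a new sheet (mode="a" appends to the existing workbook)
with pd.ExcelWriter(file, engine="openpyxl", mode="a") as writer:
    result.to_excel(writer, sheet_name="Sheet3")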

Related

dataframe transform partial row data on column

I have one dataframe whose format is given in the image below.
Every three data columns represent one type of data: there is one column for the ticker, the next three columns are one type of data, and columns 5-7 are a second type of data.
Now I want to transform this so that each type of data becomes its own column, with the groups appended as rows.
Expected output is:
Is there any way to do this transformation in pandas using any API? I am doing it in a very basic way, creating a new dataframe for each group and then appending it.
Here is one way to do it:
use pd.melt to unpivot the table, then split what used to be column names (and are now rows) on "/" to separate them into two columns (txt, year)
create the new row key by combining ticker and year, then use pivot to get the desired result set
df2 = df.melt(id_vars='ticker', var_name='col')  # line missed in the earlier solution, updated
df2[['txt', 'year']] = df2['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
This is based on the code that you posted in the comment. I unfortunately missed a line when posting the solution; it's added now.
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022", "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')  # the missed line from the earlier posting
df3[['txt', 'year']] = df3['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
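As an aside, because the column names follow a fixed stub/year pattern, pd.wide_to_long can do the same reshape in one call. A small sketch on the mock df2 above (the stub names data1 and data2 come from the example):
import pandas as pd

# reshape the wide "data1/2020", "data2/2020", ... columns into long form
out = (pd.wide_to_long(df2, stubnames=['data1', 'data2'],
                       i='ticker', j='year', sep='/')
         .reset_index())
out['ticker2'] = out['ticker'] + '/' + out['year'].astype(str)
print(out[['ticker2', 'data1', 'data2']])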

How to extract multiples tables from one PDF file using Pandas and tabula-py

Can someone help me extract multiple tables from ONE PDF file? I have 5 pages; every page has a table with the same header columns, for example:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did:
df = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
But I got a messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula
file_path = "filePath.pdf"
# read my file page by page
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
This gives me a dataframe for each page, but I don't know how to regroup them into one single dataframe, and I'd like to avoid repeating the same line of code.
According to the tabula documentation, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
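If you also want a single clean row index after concatenating the per-page tables, pd.concat accepts ignore_index=True. A minimal sketch:
import pandas as pd
import tabula

# one call for all pages; multiple_tables=True returns a list of DataFrames
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
df = pd.concat(tables, ignore_index=True)  # renumber the rows 0..n-1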

How to concatenate sum on apply function and print dataframe as a table format within a file

I am trying to concatenate the 'count' value into the top row of my dataframe.
Here is an example of my starting data:
Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5
df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
df_new = df.groupby(['Name', 'IP'])['Count'].apply(lambda x:x.astype(int).sum())
If I print df_new this produces the following output:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
................Excel,15
Fred,200.200.200,MsWord,6
................Python,6
As you can see, the count has been calculated correctly: for Tom it has added 5 to 10 and produced 15. However, this is displayed on every row of the group.
Is there any way to get the output as follows - so the count is only on the first line of the group:
Name,IP,Application,Count
Tom,100.100.100,MsWord,15
.................Excel
Fred,200.200.200,MsWord,6
.................Python
Is there any way to write df_new to a file in this nice format?
I would like the output to appear like a table, almost like an Excel sheet with merged cells.
I have tried df_new.to_csv('path'), but this removes the nice formatting I see when I print df_new to the console.
It is a bit of a challenge to have a DataFrame carry summary rows. Generally, a DataFrame lends itself to results that do not depend on position; something like "only on the first row of a group" works against that. It can be done, but it is better to separate those concerns.
import pandas as pd
from io import StringIO
data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")
#df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
#df_new = df.groupby(['Name', 'IP', 'Application'])['Count'].apply(lambda x:x.astype(int).sum())
df = pd.read_csv(data)
new_df = df.groupby(['Name', 'IP']).sum(numeric_only=True)
# reset the two index levels resulting from the groupby(), then set them again
new_df.reset_index(inplace=True)
df.set_index(['Name', 'IP'], inplace=True)
new_df.set_index(['Name', 'IP'], inplace=True)
print(df)
Application Count
Name IP
Tom 100.100.100 MsWord 5
100.100.100 Excel 10
Fred 200.200.200 Python 1
200.200.200 MsWord 5
print(new_df)
Count
Name IP
Fred 200.200.200 6
Tom 100.100.100 15
print(new_df.join(df, lsuffix='_lsuffix', rsuffix='_rsuffix'))
Count_lsuffix Application Count_rsuffix
Name IP
Fred 200.200.200 6 Python 1
200.200.200 6 MsWord 5
Tom 100.100.100 15 MsWord 5
100.100.100 15 Excel 10
From here, you can use the multiindex to access the sum of the groups.
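To get the asked-for "merged cell" look into a plain-text file, one option is to rely on to_string(), which by default prints repeated MultiIndex labels only once per group. A sketch; report.txt is an assumed output name:
# join the group sums back onto the detail rows, as above
report = new_df.join(df, lsuffix='_sum', rsuffix='_detail')

# to_string() sparsifies the repeated MultiIndex labels, giving the
# merged-cell appearance seen on the console
with open('report.txt', 'w') as fh:
    fh.write(report.to_string())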

Pandas creating a consolidate report from excel

I have a file with the details below. I am trying to use pandas to get only the first 5 languages and the sum of their code lines into an Excel sheet.
files language blank comment code
61 Java 1031 533 3959
10 Maven 73 66 1213
12 JSON 0 0 800
32 XML 16 74 421
7 HTML 14 16 161
1 Markdown 23 0 39
1 CSS 0 0 1
Below is my code
import pandas as pd

df = pd.read_csv("myfile_cloc.csv", nrows=20)
#df = df.iloc[1:]
top_five = df.head(5)
print(top_five)
print(top_five['language'])
print(top_five['code'].sum())
d = {'Languages (CLOC) (Top 5 Only)': "", 'LOC (CLOC)Only Code': 0}
newdf = pd.DataFrame(data=d)
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
# load the existing workbook and append the consolidated info
with pd.ExcelWriter("myfile_cloc.xlsx", engine='openpyxl', mode='a') as writer:
    newdf.to_excel(writer, sheet_name='top_five', index=False)
I need a suggestion on these lines:
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
so that the expected output can be:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
Java,Maven,JSON,XML,HTML 6554
Presently I am getting this error:
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Try this. One way to solve it is to pass an index to the DataFrame constructor:
a = df.head()
df = pd.DataFrame({"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                   "LOC (CLOC)Only Code": a['code'].sum()}, index=range(1))
Another way to solve it: use from_records and pass a list of dicts to DataFrame.
df = pd.DataFrame.from_records([{"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                                 "LOC (CLOC)Only Code": a['code'].sum()}])
Output:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
0 Java,Maven,JSON,XML,HTML 6554
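To then write this one-row frame into the workbook as in the question, a short sketch (assuming myfile_cloc.xlsx already exists, since mode="a" appends to an existing file):
import pandas as pd

# df is the one-row summary frame built above
with pd.ExcelWriter("myfile_cloc.xlsx", engine="openpyxl", mode="a") as writer:
    df.to_excel(writer, sheet_name="top_five", index=False)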
import pandas as pd
sheet1 = pd.read_csv("/home/mycomputer/Desktop/practise/sorting_practise.csv")
sheet1.head()
sortby_blank = sheet1.sort_values('blank', ascending=False)
sortby_blank['blank'].head(5).sum()
values = sortby_blank['blank'].head(5).sum()
"/home/mycomputer/Desktop/practise/sorting_practise.csv" ---> the file path
'blank' ---> the column you want to sort by
Use .tail() if you need the bottom values.
The "values" variable will have the answer you are looking for.

Pandas dataframe output formatting

I'm importing a trade list and trying to consolidate it into a position file with summed quantities and average prices. I'm grouping based on (ticker, type, expiration and strike). Two questions:
The output has the index group (ticker, type, expiration, strike) in the first column. How can I change this so that each index level is output to its own column, so the output CSV is formatted the same way as the input data?
I currently force the stock trades to have placeholder values ("1") because leaving the cells blank causes an error, but this adds bad data since "1" is not meaningful. Is there a way to preserve "" without causing a problem?
Dataframe:
GM stock 1 1 32 100
AAPL call 201612 120 3.5 1000
AAPL call 201612 120 3.25 1000
AAPL call 201611 120 2.5 2000
AAPL put 201612 115 2.5 500
AAPL stock 1 1 117 100
Code:
import pandas as pd
import numpy as np
df = pd.read_csv(input_file, index_col=['ticker', 'type', 'expiration', 'strike'], names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])
df_output = df.groupby(df.index).agg({'price': np.mean, 'quantity': np.sum})
df_output.to_csv(output_file, sep=',')
csv output comes out in this format:
(ticker, type, expiration, strike), price, quantity
desired format:
ticker, type, expiration, strike, price, quantity
For the first question, you should use groupby(df.index_col) instead of groupby(df.index).
For the second, I am not sure why you couldn't preserve ""; is that column numeric?
I mocked up some data like below:
import pandas as pd
import numpy as np
d = [
    {'ticker':'A', 'type':'M', 'strike':'', 'price':32},
    {'ticker':'B', 'type':'F', 'strike':100, 'price':3.5},
    {'ticker':'C', 'type':'F', 'strike':'', 'price':2.5}
]
df = pd.DataFrame(d)
print(df)
#dgroup = df.groupby(['ticker', 'type']).agg({'price':np.mean})
df.index_col = ['ticker', 'type', 'strike']  # ad-hoc attribute listing the key columns
dgroup = df.groupby(df.index_col).agg({'price':np.mean})
#dgroup = df.groupby(df.index).agg({'price':np.mean})
print(dgroup)
print(type(dgroup))
dgroup.to_csv('check.csv')
output in check.csv:
ticker,type,strike,price
A,M,,32.0
B,F,100,3.5
C,F,,2.5
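For completeness, a sketch that addresses both questions at once (input_file and output_file are the question's placeholders): as_index=False keeps each grouping key in its own column, and keep_default_na=False preserves empty cells as "" instead of NaN, so no placeholder "1" values are needed:
import pandas as pd

df = pd.read_csv(input_file,
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'],
                 keep_default_na=False)  # keep blank cells as "" instead of NaN

# as_index=False leaves the keys as ordinary columns in the result
df_output = (df.groupby(['ticker', 'type', 'expiration', 'strike'], as_index=False)
               .agg({'price': 'mean', 'quantity': 'sum'}))

df_output.to_csv(output_file, index=False)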
