Subtract numbers from 2 dataframes based on string in Python
I am an absolute beginner. I have two pivot tables stored in two different sheets of the same Excel file.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']
The columns of df2 are sub-categories of the columns of df1.
Each sheet has a pivot table:
df1:[1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2:[1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I want to subtract the value of every column in sheet 2 whose name ends with F (e.g. 1CF, 1EF, 1FF, and so on) from the respective column in sheet 1 (i.e. 1C, 1E, 1F, and so on).
My result should look like 1C = 1C - 1CF = 1037, and it should be stored in a new sheet (here: Sheet 3).
My Python code:
# importing pandas and numpy
import pandas as pd
import numpy as np
# the Excel file to read from and append to
file = "Stratification_worksheet.xlsx"
# loading the spreadsheet
data = pd.ExcelFile(file)
# sheet names
print(data.sheet_names)
# loading the sheet "Auftrag" into df
df = data.parse("Auftrag")
print(df)
# order-code prefixes; they determine where the pressure fields sit in the code
L1 = ["PMC11", "PMP11", "PMP21", "PMC21", "PMP23"]
L2 = ["PTP33B", "PTP31B", "PTC31B"]
m1 = df["ordercode"].str.startswith(tuple(L1))
m2 = df["ordercode"].str.startswith(tuple(L2))
# creating a new column pressurerange by slicing the pressure range from the order code
a = df["ordercode"].str.slice(10, 12)
b = df["ordercode"].str.slice(11, 13)
df["pressurerange"] = np.select([m1, m2], [a, b], default=np.nan)
print(df)
# creating a new column pressureunit by slicing the pressure unit from the order code
c = df["ordercode"].str.slice(12, 13)
d = df["ordercode"].str.slice(14, 15)
df["pressureunit"] = np.select([m1, m2], [c, d], default=np.nan)
print(df)
# creating a temporary column combining pressure range and pressure unit
df["pressuresensor"] = df["pressurerange"] + df["pressureunit"]
print(df)
# pivot tables
print(df.pivot_table(values="total", columns="pressurerange", aggfunc="sum"))
print(df.pivot_table(values="total", columns="pressuresensor", aggfunc="sum"))
# creating the new worksheets; mode="a" appends sheets to the existing workbook
df1 = df.pivot_table(values="total", columns="pressurerange", aggfunc="sum")
df2 = df.pivot_table(values="total", columns="pressuresensor", aggfunc="sum")
with pd.ExcelWriter(file, engine="openpyxl", mode="a") as writer:
    df1.to_excel(writer, sheet_name="pressurerangepivot")
    df2.to_excel(writer, sheet_name="pressuresensorpivot")
# now we have classified the ordercode based on the pressurerange and the pressureunit,
# and we have the sum under each category
# check the columns
print(list(df))
print(list(df1))
print(list(df2))
I tried suffix = "F" with df3 = df1.iloc[:, :] - df2.iloc[:, :].endswith(suffix, 1, 2), but it shows an error (a DataFrame has no endswith method).
df3 = df1['1C'] - df2['1CF']
gives exactly the right value, but I don't know how to do this for the entire dataframe with simple code.
df2 = df2.filter(regex=".*F$")               # leave only the 'F' columns from sheet 2
df2.columns = [i[:-1] for i in df2.columns]  # remove the trailing 'F' for column-wise subtraction
result = df1 - df2                           # subtract values
result[result.isnull()] = df1                # keep df1's value where there is no "F" sub-category
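Putting the pieces together, here is a minimal end-to-end sketch of that approach, assuming the two pivot sheets were written as pressurerangepivot and pressuresensorpivot by the code above; fillna replaces the boolean-mask assignment and does the same thing:

import pandas as pd

file = "Stratification_worksheet.xlsx"
df1 = pd.read_excel(file, sheet_name="pressurerangepivot", index_col=0)
df2 = pd.read_excel(file, sheet_name="pressuresensorpivot", index_col=0)

# keep only the sub-category columns ending in "F", then rename 1CF -> 1C, 1EF -> 1E, ...
f_cols = df2.filter(regex="F$")
f_cols.columns = [c[:-1] for c in f_cols.columns]

# column-wise subtraction; categories without an "F" sub-category keep df1's value
result = (df1 - f_cols).fillna(df1)

# append the result as a new sheet (Sheet 3)
with pd.ExcelWriter(file, engine="openpyxl", mode="a") as writer:
    result.to_excel(writer, sheet_name="Sheet3")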
Related
dataframe transform partial row data on column
I have one dataframe whose format is shown in an image in the original post [image not reproduced here]. Every three columns represent one type of data: in the given example, one column holds the ticker, the next three columns hold one type of data, and columns 5-7 hold a second type. Now I want to transform this so that each type of data is appended by group. The expected output is shown in the answer below. Is there any way to do this transformation in pandas using any API? I am currently doing it in a very basic way, creating a new dataframe for one group and then appending it.
Here is one way to do it. Use pd.melt to unstack the table, then split what used to be columns (and are now rows) on "/" to separate them into two columns (txt, year). Create the new row value by combining ticker and year, then use pivot to get the desired result set:

df2 = df.melt(id_vars='ticker', var_name='col')  # line missed in earlier solution, updated
df2[['txt', 'year']] = df.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()

Result set:

txt    ticker2     data1     data2
0    AAPL/2020  0.824676  0.616524
1    AAPL/2021  0.018540  0.046365
2    AAPL/2022  0.222349  0.729845
3     AMZ/2020  0.122288  0.087217
4     AMZ/2021  0.012168  0.734674
5     AMZ/2022  0.923501  0.437676
6    APPL/2020  0.886927  0.520650
7    APPL/2021  0.725515  0.543404
8    APPL/2022  0.211378  0.464898
9     GGL/2020  0.777676  0.052658
10    GGL/2021  0.297292  0.213876
11    GGL/2022  0.894150  0.185207
12   MICO/2020  0.898251  0.882252
13   MICO/2021  0.141342  0.105316
14   MICO/2022  0.440459  0.811005

Based on the code that you posted in a comment (I unfortunately missed a line in the earlier posting of the solution; it is added now):

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022",
                            "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')  # missed line in earlier posting
df3[['txt', 'year']] = df2.melt(id_vars='ticker', var_name='col')['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()

txt    ticker2  data1  data2
0    APPL/2020     26      9
1    APPL/2021     75     59
2    APPL/2022     20     44
3    MICO/2020     79     90
4    MICO/2021     63     30
5    MICO/2022     73     91
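For the same reshaping, pd.wide_to_long is a compact alternative; a minimal sketch, assuming the same "stub/year" column naming as above:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                  columns=["data1/2020", "data1/2021", "data1/2022",
                           "data2/2020", "data2/2021", "data2/2022"])
df.insert(0, 'ticker', ['APPL', 'MICO'])

# wide_to_long splits "data1/2020" into stub "data1" and suffix 2020 via sep="/"
long_df = pd.wide_to_long(df, stubnames=['data1', 'data2'],
                          i='ticker', j='year', sep='/').reset_index()
long_df['ticker2'] = long_df['ticker'] + '/' + long_df['year'].astype(str)
print(long_df[['ticker2', 'data1', 'data2']])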
How to extract multiples tables from one PDF file using Pandas and tabula-py
Can someone help me extract multiple tables from ONE pdf file? I have 5 pages, and every page has a table with the same header columns, for example:

student  Score  Rang
Alex     50     23
Julia    80     12
Mariana  94     4

I want to extract all these tables into one dataframe. First I did:

df = tabula.read_pdf(file_path, pages='all', multiple_tables=True)

But I got a messy output that looks like this:

[student Score Rang Alex 50 23 Julia 80 12 Mariana 94 4, student Score Rang Maxim 43 34 Nourah 93 5]

So I edited my code like this:

import pandas as pd
import tabula

file_path = "filePath.pdf"
# read my file page by page
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)

This gives me a dataframe for each table, but I don't know how to regroup them into one single dataframe, or any other solution that avoids repeating the same line of code.
According to the documentation of tabula-py, read_pdf returns a list when passed the multiple_tables=True option. Thus, you can use pandas.concat on its output to concatenate the dataframes:

df = pd.concat(tabula.read_pdf(file_path, pages='all', multiple_tables=True))
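A minimal sketch of the same idea, with ignore_index=True added so the stacked rows are renumbered instead of repeating each page's own 0..n index (the file name is the one assumed in the question):

import pandas as pd
import tabula  # tabula-py

file_path = "filePath.pdf"
# each page's table arrives as its own DataFrame in a list
tables = tabula.read_pdf(file_path, pages="all", multiple_tables=True)
# stack them; ignore_index renumbers the rows 0..n-1
df = pd.concat(tables, ignore_index=True)
print(df.shape)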
How to concatenate sum on apply function and print dataframe as a table format within a file
I am trying to concatenate the 'count' value into the top row of my dataframe. Here is an example of my starting data:

Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5

df = pd.DataFrame(data, columns=['Name', 'IP', 'Application', 'Count'])
df_new = df.groupby(['Name', 'IP'])['Count'].apply(lambda x: x.astype(int).sum())

If I print df_new, this produces the following output:

Name,IP,Application,Count
Tom,100.100.100,MsWord,15
................Excel,15
Fred,200.200.200,MsWord,6
................Python,6

As you can see, the count has been calculated correctly: for Tom it has added 5 to 10 and got 15. However, this is displayed on every row of the group. Is there any way to get the output as follows, so the count appears only on the first line of each group?

Name,IP,Application,Count
Tom,100.100.100,MsWord,15
.................Excel
Fred,200.200.200,MsWord,6
.................Python

And is there any way to write df_new to a file in this nice format? I would like the output to appear like a table, almost like an Excel sheet with merged cells. I have tried df_new.to_csv('path'), but this removes the nice formatting I see when I print to the console.
It is a bit of a challenge to make a DataFrame provide summary rows. Generally, the DataFrame lends itself to results that are not dependent on position, such as the last item in a group. It can be done, but it is better to separate those concerns.

import pandas as pd
from io import StringIO

data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")

df = pd.read_csv(data)
new_df = df.groupby(['Name', 'IP']).sum(numeric_only=True)
# reset the two index levels resulting from the groupby()
new_df.reset_index(inplace=True)
df.set_index(['Name', 'IP'], inplace=True)
new_df.set_index(['Name', 'IP'], inplace=True)

print(df)

                  Application  Count
Name IP
Tom  100.100.100      MsWord      5
     100.100.100       Excel     10
Fred 200.200.200      Python      1
     200.200.200      MsWord      5

print(new_df)

                  Count
Name IP
Fred 200.200.200      6
Tom  100.100.100     15

print(new_df.join(df, lsuffix='_lsuffix', rsuffix='_rsuffix'))

                  Count_lsuffix Application  Count_rsuffix
Name IP
Fred 200.200.200              6      Python              1
     200.200.200              6      MsWord              5
Tom  100.100.100             15      MsWord              5
     100.100.100             15       Excel             10

From here, you can use the MultiIndex to access the sum of each group.
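If the goal is literally the asker's desired layout (the group total only on the first row of each group, blanks elsewhere), a minimal sketch using groupby().transform, assuming the same CSV data; the string conversion is only for display/CSV purposes:

import pandas as pd
from io import StringIO

data = StringIO("""Name,IP,Application,Count
Tom,100.100.100,MsWord,5
Tom,100.100.100,Excel,10
Fred,200.200.200,Python,1
Fred,200.200.200,MsWord,5""")

df = pd.read_csv(data)
# put the group total on every row of the group
df['Count'] = df.groupby(['Name', 'IP'])['Count'].transform('sum')
# blank out repeated values after the first row of each group
out = df.astype(str)
dup = df.duplicated(subset=['Name', 'IP'])
out.loc[dup, ['Name', 'IP', 'Count']] = ''
out.to_csv('report.csv', index=False)
print(out)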
Pandas creating a consolidate report from excel
I have an excel file with the detail below. I am trying to use pandas to get only the first 5 languages and the sum of their code lines into an excel file.

     language  blank  comment  code
61   Java      1031   533      3959
10   Maven     73     66       1213
12   JSON      0      0        800
32   XML       16     74       421
7    HTML      14     16       161
1    Markdown  23     0        39
1    CSS       0      0        1

Below is my code:

import pandas as pd
from openpyxl import load_workbook

df = pd.read_csv("myfile_cloc.csv", nrows=20)
#df = df.iloc[1:]
top_five = df.head(5)
print(top_five)
print(top_five['language'])
print(top_five['code'].sum())
d = {'Languages (CLOC) (Top 5 Only)': "", 'LOC (CLOC)Only Code': 0}
newdf = pd.DataFrame(data=d)
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
# load the excel file to append the consolidated info
writer = pd.ExcelWriter("myfile_cloc.xlsx", engine='openpyxl')
book = load_workbook('myfile_cloc.xlsx')
writer.book = book
newdf.to_excel(writer, sheet_name='top_five', index=False)
writer.save()

I need a suggestion on these lines:

newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()

so that the expected output can be:

Languages (CLOC) (Top 5 Only)    LOC (CLOC)Only Code
Java,Maven,JSON,XML,HTML         6554

Presently I am getting this error:

raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index
Try this. One way to solve it is to use the index argument:

a = df.head()
df = pd.DataFrame({"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                   "LOC (CLOC)Only Code": a['code'].sum()},
                  index=range(1))

Another way is to use from_records and pass a list of dicts to the DataFrame:

df = pd.DataFrame.from_records([{"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                                 "LOC (CLOC)Only Code": a['code'].sum()}])

Output:

  Languages (CLOC) (Top 5 Only)  LOC (CLOC)Only Code
0      Java,Maven,JSON,XML,HTML                 6554
import pandas as pd

sheet1 = pd.read_csv("/home/mycomputer/Desktop/practise/sorting_practise.csv")
sheet1.head()
sortby_blank = sheet1.sort_values('blank', ascending=False)
sortby_blank['blank'].head(5).sum()
values = sortby_blank['blank'].head(5).sum()

Here /home/nptel/Desktop/practise/sorting_practise.csv is the file path and blank is the column you want to sort by. Use .tail() if you need the bottom values. The values variable will hold the answer you are looking for.
Pandas dataframe output formatting
I'm importing a trade list and trying to consolidate it into a position file with summed quantities and average prices. I'm grouping based on (ticker, type, expiration, strike). Two questions:

1. The output has the index group (ticker, type, expiration, strike) in the first column. How can I change this so that each index column outputs to its own column, so the output csv is formatted the same way as the input data?
2. I currently force the stock trades to have values ("1") because leaving the cells blank causes an error, but this adds bad data, since "1" is not meaningful. Is there a way to preserve "" without causing a problem?

Dataframe:

GM    stock  1       1    32    100
AAPL  call   201612  120  3.5   1000
AAPL  call   201612  120  3.25  1000
AAPL  call   201611  120  2.5   2000
AAPL  put    201612  115  2.5   500
AAPL  stock  1       1    117   100

Code:

import pandas as pd
import numpy as np

df = pd.read_csv(input_file,
                 index_col=['ticker', 'type', 'expiration', 'strike'],
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])
df_output = df.groupby(df.index).agg({'price': np.mean, 'quantity': np.sum})
df_output.to_csv(output_file, sep=',')

The csv output comes out in this format:

(ticker, type, expiration, strike), price, quantity

Desired format:

ticker, type, expiration, strike, price, quantity
For the first question, you should use groupby(df.index_col) instead of groupby(df.index). For the second, I am not sure why you couldn't preserve ""; is that column numeric? I mocked some data like below:

import pandas as pd
import numpy as np

d = [
    {'ticker': 'A', 'type': 'M', 'strike': '', 'price': 32},
    {'ticker': 'B', 'type': 'F', 'strike': 100, 'price': 3.5},
    {'ticker': 'C', 'type': 'F', 'strike': '', 'price': 2.5}
]
df = pd.DataFrame(d)
print(df)

#dgroup = df.groupby(['ticker', 'type']).agg({'price': np.mean})
df.index_col = ['ticker', 'type', 'strike']
dgroup = df.groupby(df.index_col).agg({'price': np.mean})
#dgroup = df.groupby(df.index).agg({'price': np.mean})
print(dgroup)
print(type(dgroup))
dgroup.to_csv('check.csv')

Output in check.csv:

ticker,type,strike,price
A,M,,32.0
B,F,100,3.5
C,F,,2.5
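A more direct route to the desired csv layout is to group by the column names with as_index=False, so the group keys stay as regular columns; a minimal sketch addressing the first question only, assuming the input has the same six columns as in the question (the file names here are hypothetical, and named aggregation requires pandas >= 0.25):

import pandas as pd

# hypothetical file names for illustration
df = pd.read_csv("trades.csv",
                 names=['ticker', 'type', 'expiration', 'strike', 'price', 'quantity'])

# as_index=False keeps ticker/type/expiration/strike as ordinary columns,
# so to_csv writes each of them in its own column
positions = (df.groupby(['ticker', 'type', 'expiration', 'strike'], as_index=False)
               .agg(price=('price', 'mean'), quantity=('quantity', 'sum')))
positions.to_csv("positions.csv", index=False)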