Pandas: creating a consolidated report from Excel - Python

I have an Excel file with the details below. I am trying to use pandas to get only the first 5 languages and the sum of their code counts into an Excel file.
files  language  blank  comment  code
   61  Java       1031      533  3959
   10  Maven        73       66  1213
   12  JSON          0        0   800
   32  XML          16       74   421
    7  HTML         14       16   161
    1  Markdown     23        0    39
    1  CSS           0        0     1
Below is my code:
import pandas as pd
from openpyxl import load_workbook

df = pd.read_csv("myfile_cloc.csv", nrows=20)
#df = df.iloc[1:]
top_five = df.head(5)
print(top_five)
print(top_five['language'])
print(top_five['code'].sum())

d = {'Languages (CLOC) (Top 5 Only)': "", 'LOC (CLOC)Only Code': 0}
newdf = pd.DataFrame(data=d)
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()

# Load the existing Excel file to append the consolidated info
writer = pd.ExcelWriter("myfile_cloc.xlsx", engine='openpyxl')
book = load_workbook('myfile_cloc.xlsx')
writer.book = book
newdf.to_excel(writer, sheet_name='top_five', index=False)
writer.save()
I need suggestions for these lines:
newdf['Languages (CLOC) (Top 5 Only)'] = str(top_five['language'])
newdf['LOC (CLOC)Only Code'] = top_five['code'].sum()
so that the expected output is:
Languages (CLOC) (Top 5 Only)    LOC (CLOC)Only Code
Java,Maven,JSON,XML,HTML         6554
Presently I am getting this error:
raise ValueError('If using all scalar values, you must pass'
ValueError: If using all scalar values, you must pass an index

Try this. One way to solve it is to pass the index argument:
a = df.head()
df = pd.DataFrame({"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                   "LOC (CLOC)Only Code": a['code'].sum()},
                  index=range(1))
Another way to solve this: use from_records and pass a list of dicts to DataFrame.
df = pd.DataFrame.from_records([{"Languages (CLOC) (Top 5 Only)": ','.join(a['language'].unique()),
                                 "LOC (CLOC)Only Code": a['code'].sum()}])
Output:
Languages (CLOC) (Top 5 Only) LOC (CLOC)Only Code
0 Java,Maven,JSON,XML,HTML 6554
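If you then want to append this one-row frame to the existing workbook, a minimal sketch (assuming a pandas version recent enough that ExcelWriter accepts mode='a' with the openpyxl engine, which replaces the manual load_workbook step from the question):
import pandas as pd

# append a new sheet to the existing workbook instead of overwriting it
with pd.ExcelWriter("myfile_cloc.xlsx", engine="openpyxl", mode="a") as writer:
    df.to_excel(writer, sheet_name="top_five", index=False)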

import pandas as pd

sheet1 = pd.read_csv("/home/mycomputer/Desktop/practise/sorting_practise.csv")
sheet1.head()
sortby_blank = sheet1.sort_values('blank', ascending=False)
values = sortby_blank['blank'].head(5).sum()
/home/mycomputer/Desktop/practise/sorting_practise.csv ---> path to your file
blank ---> the column you want to sort by
Use .tail() if you need the bottom values.
The values variable will hold the answer you are looking for.

Related

Filter csv table to have just 2 columns. Python pandas

I got a .csv file with lines like this:
result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
I need to make them look like this:
time value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
I will need that for my anomaly detection code so I don't have to manually delete columns and so on, at least not all of them. I can't do it with the program that works with the machine that collects the wattage info.
I tried this, but it doesn't work well enough:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index='_time', columns='_field', values='_value')
df.interpolate(method='linear')  # not necessary
It gives this output:
0
9 83.908
10 80.342
11 79.178
12 75.621
13 72.826
... ...
73522 10.726
73523 5.241
Here is the canonical way to project down to a subset of columns in the pandas ecosystem.
df = df[['_time', '_value']]
You can simply use the keyword argument usecols of pandas.read_csv:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])
NB: If you need to read the entire data of your .csv and only then select a subset of columns, the pandas core developers suggest using pandas.DataFrame.loc. Otherwise, with the df = df[subset_of_cols] syntax, the moment you start doing operations on the (new?) sub-dataframe you'll get a warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So, in your case you can use :
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]]  # instead of df[["_time", "_value"]]
Another option is pandas.DataFrame.copy:
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()
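As an aside, if you are on a recent pandas (2.0 or newer, an assumption about your environment), enabling copy-on-write removes this whole class of chained-assignment pitfalls, and plain bracket selection then behaves safely:
import pandas as pd

# assumption: pandas >= 2.0, where the copy-on-write option is available
pd.set_option("mode.copy_on_write", True)

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]]  # behaves like an independent frame under copy-on-write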
.read_csv has a usecols parameter to specify which columns you want in the DataFrame.
df = pd.read_csv(f, header=0, usecols=['_time', '_value'])
print(df)
_time _value
0 2022-10-24T12:12:35Z 44.61
1 2022-10-24T12:12:40Z 17.33
2 2022-10-24T12:12:45Z 41.20
3 2022-10-24T12:12:51Z 33.49
4 2022-10-24T12:12:56Z 55.68
5 2022-10-24T12:12:57Z 55.68
6 2022-10-24T12:13:02Z 25.92
7 2022-10-24T12:13:08Z 5.71

Transform partial row data into columns in a dataframe

I have one dataframe whose format is given in the image below.
Every row groups columns in threes, each triple representing one type of data. In the given example there is one column for the ticker, the next three columns are one type of data, and columns 5-7 are a second type of data.
Now I want to transform this so that each type of data becomes its own column, with the groups appended as rows.
Expected output is:
Is there any way to do this transformation in pandas using any API? I am doing it a very basic way, creating a new dataframe for one group and then appending it.
Here is one way to do it:
Use pd.melt to unpivot the table, then split what used to be column names (and are now row values) on "/" to separate them into two columns (txt, year).
Create the new row key by combining ticker and year, then use pivot to get the desired result set.
df2 = df.melt(id_vars='ticker', var_name='col')  # line missed in the earlier solution, updated
df2[['txt', 'year']] = df2['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in a comment: I missed a line, unfortunately, when posting the solution. It's added now.
import numpy as np

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022",
                            "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')  # the missed line from the earlier posting
df3[['txt', 'year']] = df3['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
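As an aside, pandas' wide_to_long was designed for exactly this stubname/suffix column layout and can replace the melt/split/pivot chain (a sketch using the same df2 built above; sep='/' tells it how the year suffix is attached to the stub names):
# reshape columns like 'data1/2020' into long form: one row per (ticker, year)
out = pd.wide_to_long(df2, stubnames=['data1', 'data2'],
                      i='ticker', j='year', sep='/').reset_index()
print(out)  # columns: ticker, year, data1, data2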

How to extract multiple tables from one PDF file using Pandas and tabula-py

Can someone help me extract multiple tables from ONE PDF file? I have 5 pages; every page has a table with the same header columns, for example:
Example table on every page:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one dataframe. First I did:
df = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
But I got a messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula

file_path = "filePath.pdf"
# read my file page by page
df1 = tabula.read_pdf(file_path, pages=1, multiple_tables=True)
df2 = tabula.read_pdf(file_path, pages=2, multiple_tables=True)
df3 = tabula.read_pdf(file_path, pages=3, multiple_tables=True)
df4 = tabula.read_pdf(file_path, pages=4, multiple_tables=True)
df5 = tabula.read_pdf(file_path, pages=5, multiple_tables=True)
It gives me a dataframe for each page, but I don't know how to regroup them into one single dataframe, nor any other solution that avoids repeating the same line of code.
According to the tabula documentation, read_pdf returns a list when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
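If you also want a clean sequential row index after stacking the per-page tables, a small variation (a sketch; pd.concat's ignore_index flag renumbers the rows instead of repeating each page's 0-based index):
import pandas as pd
import tabula

# one DataFrame per detected table, collected in a single list
tables = tabula.read_pdf(file_path, pages='all', multiple_tables=True)
# stack them and renumber the rows 0..n-1
df = pd.concat(tables, ignore_index=True)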

Subtract numbers from 2 dataframes based on String in Python

I am an absolute beginner. Here I have two pivot tables stored in two different sheets of the same Excel file.
df1:
['1C', '1E', '1F', '1H', '1K', '1M', '1N', '1P', '1Q', '1R', '1S', '1U', '1W', '2C', '2E', '2F', '2H', '2K', '2M', '2P', '2S', '2U', '2W']
df2:
['1CA', '1CB', '1CC', '1CF', '1CJ', '1CS', '1CU', '1EA', '1EB', '1EC', '1EF', '1EJ', '1ES', '1FA', '1FB', '1FC', '1FF', '1FJ', '1FS', '1FT', '1FU', '1HA', '1HB', '1HC', '1HF', '1HJ', '1HS', '1HT', '1HU', '1KA', '1KB', '1KC', '1KF', '1KJ', '1KS', '1KU', '1MA', '1MB', '1MC', '1MF', '1MJ', '1MS', '1MU', '1NA', '1NB', '1NC', '1NF', '1NJ', '1PA', '1PB', '1PC', '1PF', '1PJ', '1PS', '1PT', '1PU', '1QA', '1QB', '1QC', '1QF', '1QJ', '1RA', '1RB', '1RC', '1RF', '1RJ', '1SA', '1SB', '1SC', '1SF', '1SJ', '1SS', '1ST', '1SU', '1UA', '1UB', '1UC', '1UF', '1UJ', '1US', '1UU', '1WA', '1WB', '1WC', '1WF', '1WJ', '1WS', '1WU', '2CA', '2CB', '2CC', '2CJ', '2CS', '2EA', '2EB', '2EJ', '2FA', '2FB', '2FC', '2FJ', '2FU', '2HB', '2HC', '2HF', '2HJ', '2HU', '2KA', '2KB', '2KC', '2KF', '2KJ', '2KU', '2MA', '2MB', '2MC', '2MF', '2MJ', '2MS', '2MT', '2PA', '2PB', '2PC', '2PF', '2PJ', '2PU', '2SA', '2SB', '2SC', '2SF', '2SJ', '2UA', '2UB', '2UJ', '2WB', '2WC', '2WF', '2WJ']
df2 contains the sub-categories of df1.
Each sheet has a pivot table:
df1:[1 rows x 23 columns]
1C 1E 1F 1H 1K ... 2M 2P 2S 2U 2W
total 1057 334 3609 3762 1393 ... 328 1611 1426 87 118
df2:[1 rows x 137 columns]
1CA 1CB 1CC 1CF 1CJ 1CS ... 2UB 2UJ 2WB 2WC 2WF 2WJ
total 11 381 111 20 527 2 ... 47 34 79 2 1 36
I want to subtract the values of the columns whose names end with F in sheet 2 (e.g. 1CF, 1EF, 1FF, and so on) from the respective columns in sheet 1 (i.e. 1C, 1E, 1F, and so on).
My result should be like "1C = 1C - 1CF = 1037", and it should be stored in a new sheet (here: Sheet 3).
My Python code:
#importing pandas
import pandas as pd
import numpy as np
from openpyxl import load_workbook
#Assigning the worksheet to file
file="Stratification_worksheet.xlsx"
#Loading the spreadsheet
data= pd.ExcelFile(file)
#sheetname
print(data.sheet_names)
#loading the sheetname to df1
df=data.parse("Auftrag")
print(df)
# creating tuples
L1=["PMC11","PMP11","PMP21","PMC21","PMP23"]
L2=["PTP33B","PTP31B","PTC31B"]
m1=df["ordercode"].str.startswith(tuple(L1))
m2=df["ordercode"].str.startswith(tuple(L2))
#creating a new column preessurerange and slicing the pressure range from order code
a=df["ordercode"].str.slice(10,12)
b=df["ordercode"].str.slice(11,13)
df["pressurerange"]= np.select([m1,m2],[a,b], default =np.nan)
print(df)
#creating a new coloumn Presssureunit and slicing the preesure unit from ordercode
c=df["ordercode"].str.slice(12,13)
d=df["ordercode"].str.slice(14,15)
df["pressureunit"]= np.select([m1,m2],[c,d], default =np.nan)
print(df)
#creating a tempcolumn to store pressurerange and pressure unit
df["pressuresensor"]=df["pressurerange"] + df["pressureunit"]
print(df)
#pivottable
print(df.pivot_table(values="total",columns="pressurerange",aggfunc={"total":np.sum}))
print(df.pivot_table(values="total",columns="pressuresensor",aggfunc={"total":np.sum}))
#creating new worksheet
df1=df.pivot_table(values="total",columns="pressurerange",aggfunc={"total":np.sum})
df2=df.pivot_table(values="total",columns="pressuresensor",aggfunc={"total":np.sum})
book=load_workbook(file)
writer=pd.ExcelWriter(file,engine="openpyxl")
writer.book = book
df1.to_excel(writer,sheet_name="pressurerangepivot")
df2.to_excel(writer,sheet_name="pressuresensorpivot")
writer.save()
writer.close()
"""now we have classified the ordercode based on the pressurerange and pressureunit and we have the sum under each category"""
#check the columns
print(list(df))
print(list(df1))
print(list(df2))
I tried suffix = "F" with df3 = df1.iloc[:, :] - df2.iloc[:, :].endswith(suffix, 1, 2), but it's showing an error.
df3 = df1['1C'] - df2['1CF']
This gives exactly the right value, but I don't know how to do it for the entire dataframe with simple code.
df2 = df2.filter(regex=".*F$")               # keep only the 'F' columns from sheet 2
df2.columns = [i[:-1] for i in df2.columns]  # drop the trailing 'F' so the columns align for subtraction
result = df1 - df2                           # subtract the values
result[result.isnull()] = df1                # keep df1's value where there is no matching 'F' column
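A quick toy check with the sample values from the question (hypothetical one-row frames; 1C = 1057 and 1CF = 20, so the result should be 1037):
import pandas as pd

df1 = pd.DataFrame({'1C': [1057], '1E': [334]}, index=['total'])
df2 = pd.DataFrame({'1CF': [20], '1EF': [12]}, index=['total'])

f_cols = df2.filter(regex=".*F$")                  # only the 'F' sub-category columns
f_cols.columns = [c[:-1] for c in f_cols.columns]  # '1CF' -> '1C', '1EF' -> '1E'

result = df1 - f_cols
result[result.isnull()] = df1
print(result)  # 1C -> 1037, 1E -> 322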

Name a column added to a pandas dataframe

I have the following csv file, which I process as follows:
import pandas as pd
df = pd.read_csv('file.csv', sep=',', header=None)
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes
00196436-12bc-4024-b623-25bac586d314 A know
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi
002882ca-48bb-4161-a75a-cf0ec984d650 A fd
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible
004d9025-86f0-4f8c-9720-01e3385c5e77 A 2015
Now I want to add a new column:
df['val'] = None
for img in images:
    id, ext = img.rsplit('.', 1)
    idx = df[df[0] == id].index.values
    df.loc[df.index[idx], 'val'] = id
When I write df to a new file as follows:
df.to_csv('new_file.csv', sep=',', encoding='utf-8')
I noticed that the column is correctly added and filled, but it remains without a name, and it's supposed to be named val.
id ocr raw_value
00037625-4706-4dfe-a7b3-de8c47e3a28d A 3 4
000a7b30-4c4f-4756-a757-f688ccc55d5d A /c 3
000b08e3-4129-4fd2-8ec0-23d00fe38a45 A yes 1
00196436-12bc-4024-b623-25bac586d314 A know 8
001b8c43-3e73-43c1-ba4f-df5edb10dfac A hi 9
002882ca-48bb-4161-a75a-cf0ec984d650 A fd 10
003b2890-3727-4c79-955a-f74ec6945ed7 A Sensible 14
How do I set the name of the last column added?
EDIT1:
print(df.head())
0 1 2 3
0 id ocr raw_value manual_raw_value
1 00037625-4706-4dfe-a7b3-de8c47e3a28d ABBYY 03 03
2 000a7b30-4c4f-4756-a757-f688ccc55d5d ABBYY y/c y/c
3 000b08e3-4129-4fd2-8ec0-23d00fe38a45 ABBYY armoire armoire
4 00196436-12bc-4024-b623-25bac586d314 ABBYY point point
val
0 None
1 93
2 yic
3 armoire
4 point
You need only read_csv, because sep=',' is the default and can be omitted, and header=None should be used only if the csv has no header:
df = pd.read_csv('file.csv')
The problem is that your first row was not parsed as column names but as the first data row.
df = pd.read_csv('file.csv', sep=',', header=0, index_col=0)
should allow you to simplify the next portion to
df['val'] = None
for img in images:
    image_id, ext = img.rsplit('.', 1)
    df.loc[image_id, 'val'] = image_id
If you don't need the image_id as index afterwards, use df.reset_index(inplace=True)
One easy way:
Before to_csv:
df.columns.values[3] = "val"
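Mutating columns.values in place is fragile, though; a safer sketch (an alternative, not from the original answer) rebuilds the column list so only the last column is renamed:
df.columns = [*df.columns[:-1], 'val']  # rename just the last column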
