What I need to do:
Open Excel Spreadsheet in Python/Pandas
Create df with [name, balance]
Example:
name              balance
Jones Ministry    45,408.83
Smith Ministry    38,596.20
Doe Ministry      28,596.20
What I have done so far...
import pandas as pd
import openpyxl as op
from openpyxl import load_workbook
from pathlib import Path
Then...
# Excel File
src_file = Path.cwd() / 'lm_balance.xlsx'
df = load_workbook(filename=src_file)
I viewed all the sheet names by...
df.sheetnames
And created a dataframe with the 'name' column
balance_df = pd.DataFrame(df.sheetnames)
My spreadsheet has one sheet per ministry. I now need to loop through each sheet and add the 'Ending Fund Balance' and its corresponding value.
The "Ending Fund Balance" is at a different row in each sheet, but it is always the final row. The value is always in column 'G'.
How do I go about doing this?
I have read through examples in:
Automate the Boring Stuff
Openpyxl documentation
PBPython.com examples
Stack Overflow questions
I appreciate your help!
Working samples on GitHub: JohnMillstead/Balance_Study
To get a cell value, first set data_only=True on load_workbook; otherwise you could end up getting the cell formula instead. To get the last row of a worksheet, you can use ws.max_row. Combine that with the already created dataframe and apply, for each worksheet name, a function that gets the value from column G of that worksheet's last row (wb[x][f'G{wb[x].max_row}']).
import pandas as pd
from openpyxl import load_workbook

src_file = 'test_balance.xlsx'

# data_only=True returns cached cell values instead of formulas
wb = load_workbook(filename=src_file, data_only=True)

# One row per worksheet; the sheet name becomes the 'name' column
df = pd.DataFrame(data=wb.sheetnames, columns=["name"])

# For each sheet, read the value in column G of its last row
df["balance"] = df.name.apply(lambda x: wb[x][f'G{wb[x].max_row}'].value)
print(df)
Output from df
name balance
0 Jones Ministry 15100.08
1 Smith Ministry 45408.83
2 Stark Ministry 1561.75
3 Doe Ministry 7625.75
4 Bright Ministry 3078.30
5 Lincoln Ministry 6644.59
6 Martinez Ministry 11500.54
7 Patton Ministry 9782.65
8 Rich Ministry 8429.88
9 Seitz Ministry 2974.58
10 Bhiri Ministry 622.83
11 Pignatelli Ministry 34992.05
12 Cortez Ministry -283.48
13 Little Ministry 13755.80
14 Johnson Ministry -2035.31
Related
I am trying to import a dataset from a text file, which looks like this.
id book author
1 Cricket World Cup: The Indian Challenge Ashis Ray
2 My Journey Dr. A.P.J. Abdul Kalam
3 Making of New India Dr. Bibek Debroy
4 Whispers of Time Dr. Krishna Saksena
When I import it with:
df = pd.read_csv('book.txt', sep=' ')
the fields are split on every single space, so the multi-word book titles and author names are scattered across many columns. And when I use:
df = pd.read_csv('book.txt')
everything ends up in one single column. Is there a way to get a DataFrame with separate id, book, and author columns?
Any help on this will be appreciated. Thank you
Try with tab as a separator:
df = pd.read_csv('book.txt', sep='\t')
I have a Python program which receives incoming files. The incoming files are based on different countries. Sample files are below -
File 1 (USA) -
country state city population
USA IL Chicago 2000000
USA TX Dallas 1000000
USA CO Denver 5000000
File 2 (Non USA) -
country  state  city     population
UK              London   2000000
UK              Bristol  1000000
UK              Glasgow  5000000
Then I have a mapping file which needs to be merged with incoming files. Mapping file look like this
Country  state  Continent
UK              Europe
Egypt           Africa
USA      TX     North America
USA      IL     North America
USA      CO     North America
Now the requirement is that I need to join the incoming file with the mapping file on the state column if it is a USA file, and on the country column if it is a non-USA file. For example -
If its a USA file -
result_file = pd.merge(input_file, mapping_file, on="state", how="left")
If its a non USA file -
result_file = pd.merge(input_file, mapping_file, on="country", how="left")
How can I write a condition that identifies the type of the incoming file and performs the merge accordingly?
Thanks in advance
To get unified code for both cases: after reading the files, add a column named country_state, combining country and state, to both the DataFrame of fileX (df) and the DataFrame of the mapping file (dfmap), and then merge on that column.
For example:
import pandas as pd

df = pd.read_csv('fileX.txt')            # assumed for fileX
dfmap = pd.read_csv('mapping_file.txt')  # assumed for mapping file

# Replace NaN values with '' (fillna returns a new DataFrame,
# so the result must be assigned back)
df = df.fillna('')
dfmap = dfmap.fillna('')

# Build the combined merge key
if 'state' in df.columns:
    df['country_state'] = df['country'] + df['state']
else:
    df['country_state'] = df['country']
dfmap['country_state'] = dfmap['country'] + dfmap['state']

result_file = pd.merge(df, dfmap, on="country_state", how="left")
Then you can drop the columns you do not need
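For example, a minimal sketch (assuming only the helper column needs to go):

result_file = result_file.drop(columns=['country_state'])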
A modification that instead adds a state column if one does not exist, and merges on country and state directly, without the country_state column shown in the previous code:
import pandas as pd

df = pd.read_csv('file1.txt')
dfmap = pd.read_csv('file_map.txt')

# Replace NaN values with '' and assign the result back
df = df.fillna('')
dfmap = dfmap.fillna('')

# Add an empty state column if one does not exist
if 'state' not in df.columns:
    df['state'] = ''

result_file = pd.merge(df, dfmap, on=["country", "state"], how="left")
First, empty the state column for non-USA files (note that the sample data uses 'USA', not 'US', in the country column):
input_file.loc[input_file.country != 'USA', 'state'] = ''
Then, fill the mapping file's missing states the same way and merge on two columns:
mapping_file['state'] = mapping_file['state'].fillna('')
result_file = pd.merge(input_file, mapping_file, on=["country", "state"], how="left")
How are you loading the files? Is there any pattern in the file names that you can work with?
If they are in the same folder, you can list them with:
import os
list_of_files = os.listdir('my_directory/')
Or you could do a simple search in the country column looking for USA, and then apply the merge according to the situation.
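A minimal sketch of that last suggestion (the file names here are placeholders, and it assumes USA files always contain 'USA' in the country column):

import pandas as pd

input_file = pd.read_csv('incoming.csv')    # placeholder file name
mapping_file = pd.read_csv('mapping.csv')   # placeholder file name

# If any row has country == 'USA', treat it as a USA file and merge on state;
# otherwise merge on country
key = 'state' if (input_file['country'] == 'USA').any() else 'country'
result_file = pd.merge(input_file, mapping_file, on=key, how='left')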
I would like to import a .txt file into a Pandas DataFrame. My .txt file:
Ann Gosh 1234567892008-12-15Irvine CA45678A9Z5Steve Ryan
Yosh Dave 9876543212009-04-18St. Elf NY12345P8G0Brad Tuck
Clair Simon 3245674572008-12-29New Jersey NJ56789R9B3Dan John
The dataframe should look like this:
FirstN LastN SID Birth City States Postal TeacherFirstN TeacherLastN
Ann Gosh 123456789 2008-12-15 Irvine CA A9Z5 Steve Ryan
Yosh Dave 987654321 2009-04-18 St. Elf NY P8G0 Brad Tuck
Clair Simon 324567457 2008-12-29 New Jersey NJ R9B3 Dan John
I tried multiple ways, including this:
df = pd.read_csv('student.txt', sep='\s+', engine='python', header=None, index_col=False)
to import the raw file into the dataframe and then clean the data for each column, but it's too complicated. Could you please help me? (The Postal here is just the 4 characters before TeacherFirstN.)
You can start by setting names on your existing columns and then applying a regex on the data while creating the new columns.
To fix the "single space delimiter" issue in your output, you can define "at least 2 space characters", e.g. [\s]{2,}, as the delimiter, which fixes the parsing of St. Elf in the City names.
An example:
import pandas as pd
import re

df = pd.read_csv(
    'test.txt',
    sep=r'[\s]{2,}',   # split on runs of 2 or more whitespace characters
    engine='python',
    header=None,
    index_col=False,
    names=["FirstN", "LastN", "FULLSID", "TeacherData", "TeacherLastN"],
)

# Split FULLSID into the 9-digit SID, the birth date, and the city
sid_pattern = re.compile(r'(\d{9})(\d+-\d+-\d+)(.*)', re.IGNORECASE)
df['SID'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(1), axis=1)
df['Birth'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(2), axis=1)
df['City'] = df.apply(lambda row: sid_pattern.search(row.FULLSID).group(3), axis=1)

# Split TeacherData into the state, the postal code, and the teacher's first name
teacherdata_pattern = re.compile(r'(.{2})([\dA-Z]+\d)(.*)', re.IGNORECASE)
df['States'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(1), axis=1)
df['Postal'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(2)[-4:], axis=1)
df['TeacherFirstN'] = df.apply(lambda row: teacherdata_pattern.search(row.TeacherData).group(3), axis=1)

# Drop the intermediate columns
del df['FULLSID']
del df['TeacherData']
print(df)
Output:
FirstN LastN TeacherLastN SID Birth City States Postal TeacherFirstN
0 Ann Gosh Ryan 123456789 2008-12-15 Irvine CA A9Z5 Steve
1 Yosh Dave Tuck 987654321 2009-04-18 St. Elf NY P8G0 Brad
2 Clair Simon John 324567457 2008-12-29 New Jersey NJ R9B3 Dan
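The same splits can also be written more compactly with Series.str.extract, which applies the pattern once per column and returns each capture group as a column. A sketch, assuming the same df as above and run before the two del statements:

# Each capture group becomes one output column
df[['SID', 'Birth', 'City']] = df['FULLSID'].str.extract(r'(\d{9})(\d+-\d+-\d+)(.*)')

teacher = df['TeacherData'].str.extract(r'(.{2})([\dA-Z]+\d)(.*)')
df['States'] = teacher[0]
df['Postal'] = teacher[1].str[-4:]   # last 4 characters, as in the original
df['TeacherFirstN'] = teacher[2]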
I have data imported from SQL Server in a CSV file with headers.
I want to write code in Python 2.7 that reads the CSV file and re-writes it into a new CSV file in which the last 2 columns are masked with a placeholder like 'SECRET VALUE'.
Sample Input of CSV:
ID,Name,city,SSN,CreditCardNo
1,Joy,London,123-465-456,123456789087645
2,Sam,NewYork,765-465-457,98765434567345
3,Jhon,Paris,678-365-654,765654542345677
4,Eric,Delhi,456-888-999,123456789087645
Expected sample output:
ID,Name,city,SSN,CreditCardNo
1,Joy,London,SECRET VALUE,SECRET VALUE
2,Sam,NewYork,SECRET VALUE,SECRET VALUE
3,Jhon,Paris,SECRET VALUE,SECRET VALUE
4,Eric,Delhi,SECRET VALUE,SECRET VALUE
My attempt:
import sys
import csv
r = csv.reader(open('C:\\Users\\Praveen\\workspace\\sampleFiles\\test1.csv'))
lines = [l for l in r]
lines[2][2] = '30'
writer = csv.writer(open('C:\\Users\\Praveen\\workspace\\sampleFiles\\test4.csv', 'wb'))
writer.writerows(lines)
This changes one element only; I want the whole column to be masked.
I think you need read_csv first, then replace the values with iloc, and finally write to a file with DataFrame.to_csv:
import pandas as pd
from pandas.compat import StringIO
temp=u"""ID,Name,city,SSN,CreditCardNo
1,Joy,London,123-465-456,123456789087645
2,Sam,NewYork,765-465-457,98765434567345
3,Jhon,Paris,678-365-654,765654542345677
4,Eric,Delhi,456-888-999,123456789087645"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp))
print df
ID Name city SSN CreditCardNo
0 1 Joy London 123-465-456 123456789087645
1 2 Sam NewYork 765-465-457 98765434567345
2 3 Jhon Paris 678-365-654 765654542345677
3 4 Eric Delhi 456-888-999 123456789087645
df.iloc[:, -2:] = 'SECRET VALUE'
print df
ID Name city SSN CreditCardNo
0 1 Joy London SECRET VALUE SECRET VALUE
1 2 Sam NewYork SECRET VALUE SECRET VALUE
2 3 Jhon Paris SECRET VALUE SECRET VALUE
3 4 Eric Delhi SECRET VALUE SECRET VALUE
df.to_csv('file.csv', index=False)
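If you would rather stay with the csv module from your attempt, a minimal Python 2.7 sketch (file paths shortened here) that masks the last two columns of every data row:

import csv

with open('test1.csv', 'rb') as f:
    rows = list(csv.reader(f))

# Keep the header row; mask the last two fields of every data row
for row in rows[1:]:
    row[-2:] = ['SECRET VALUE', 'SECRET VALUE']

with open('test4.csv', 'wb') as f:
    csv.writer(f).writerows(rows)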
I have several CSV files that I have managed to merge. However, I need to add a blank row between the files as they merge, so I know where a different file starts. Tried everything. Please help.
import os
import glob
import pandas
def concatenate(indir="C:\\testing", outfile="C:\\done.csv"):
    os.chdir(indir)
    fileList = glob.glob("*.csv")
    dfList = []
    colnames = ["Creation Date", "Author", "Tweet", "Language", "Location", "Country", "Continent"]
    for filename in fileList:
        print(filename)
        df = pandas.read_csv(filename, header=None)
        ins = df.insert(len(df), '\n')
        dfList.append(ins)
    concatDf = pandas.concat(dfList, axis=0)
    concatDf.columns = colnames
    concatDf.to_csv(outfile, index=None)
Here is an example script. You can use the loc method with a non-existent key to enlarge the DataFrame and set the value of the new row.
The simplest solution seems to be to create a template DataFrame to use as a separator with the values set as desired. Then just insert it into the list of data frames to concatenate at appropriate positions.
Lastly, I removed the chdir, since glob can search in any path.
import glob
import pandas

def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    # Create a separator template: a single all-empty row
    separator = pandas.DataFrame(columns=column_names)
    separator.loc[0] = [""] * 7

    dataframes = []
    for file_name in file_list:
        print(file_name)
        if len(dataframes):
            # The list is not empty, so we need to add a separator
            dataframes.append(separator)
        dataframes.append(pandas.read_csv(file_name))

    concatenated = pandas.concat(dataframes, axis=0)
    concatenated.to_csv(output_file_name, index=False)
    print(concatenated)

concatenate("input", "out.csv")
An alternative, even shorter, way is to build the concatenated DataFrame iteratively, using the append method (note that DataFrame.append has since been deprecated and removed in pandas 2.0; pandas.concat is the replacement).
def concatenate(input_dir, output_file_name):
    file_list = glob.glob(input_dir + "/*.csv")
    column_names = ["Creation Date",
                    "Author",
                    "Tweet",
                    "Language",
                    "Location",
                    "Country",
                    "Continent"]
    concatenated = pandas.DataFrame(columns=column_names)
    for file_name in file_list:
        print(file_name)
        if len(concatenated):
            # The frame is not empty, so we need to add a separator row
            concatenated.loc[len(concatenated)] = [""] * 7
        concatenated = concatenated.append(pandas.read_csv(file_name))
    concatenated.to_csv(output_file_name, index=False)
    print(concatenated)
I tested the script with 3 input CSV files:
input/1.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
input/2.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
input/3.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2018-12-12,who,Hello,EN,Delhi,India,Asia
When I ran it, the following output was written to console:
Console Output (using concat)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
0
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
0
0 2018-12-12 who Hello EN Delhi India Asia
The console output of the shorter variant is slightly different (note the indices in the first column), however this has no effect on the generated CSV file.
Console Output (using append)
input\1.csv
input\2.csv
input\3.csv
Creation Date Author Tweet Language Location Country Continent
0 2015-12-17 foo Hello EN London UK Europe
1 2015-12-18 bar Bye EN Manchester UK Europe
2 2015-12-28 baz Hallo DE Frankfurt Germany Europe
3
0 2016-01-09 bar Tweeeeet EN New York USA America
1 2016-01-09 cat Miau FI Helsinki Finland Europe
6
0 2018-12-12 who Hello EN Delhi India Asia
Finally, this is what the generated output CSV file looks like:
out.csv
Creation Date,Author,Tweet,Language,Location,Country,Continent
2015-12-17,foo,Hello,EN,London,UK,Europe
2015-12-18,bar,Bye,EN,Manchester,UK,Europe
2015-12-28,baz,Hallo,DE,Frankfurt,Germany,Europe
,,,,,,
2016-01-09,bar,Tweeeeet,EN,New York,USA,America
2016-01-09,cat,Miau,FI,Helsinki,Finland,Europe
,,,,,,
2018-12-12,who,Hello,EN,Delhi,India,Asia