Slice last 4 characters from a csv column in Pandas - python

I'm pretty new at Python and I am trying to get the last four characters from every row of data from a column in a CSV file. For example, I have column E for a large number of license plates. I'm able to successfully grab that column from the CSV file and write it in the appropriate spot in xlsx file. I've tried slicing the column in different ways, but I can't seem to get it right. Either I only get the first 4 license plates in the list or don't get the last four characters from each license plate. Below is the code I have right now that only displays the first four from the list.
Any help is greatly appreciated. Thank you in advance.
import pandas as pd
df_csv = pd.read_csv('data.csv', usecols=['license_plate'])
df_xlsx = pd.read_excel('report.xlsx', header=None)
license_plate = df_csv['license_plate']
license_plate = license_plate[-4:]
writer = pd.ExcelWriter('report.xlsx', engine='openpyxl', mode='a', if_sheet_exists='overlay')
license_plate.to_excel(writer, sheet_name='Report', startcol=8, startrow=3, header=None, index=None)
writer.close()
I know there's something I have to change with this part of the code, but after looking at and trying different methods I can't seem to get the correct output. Below is the code I have and only shows the first 4 license plates in their entirety.
license_plate = df_csv['license_plate']
license_plate = license_plate[-4:]
An example of what I am trying to get: if I have a whole column of license plates and the first 3 are ['123456', 'ABCDEF', '1A2B3C', ...], the output should be written to the Excel sheet as '3456' for the first row, 'CDEF' for the 2nd row, '2B3C' for the 3rd row, and so on until the list is completed.

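Indexing the Series directly, as in license_plate[-4:], slices rows, so it returns the last four license plates rather than the last four characters of each one. Slicing through the .str accessor (the shorthand for str.slice) is applied to each entry in the license plate column and gives the desired result.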
Here's the updated version of your code:
import pandas as pd
df_csv = pd.read_csv('data.csv', usecols=['license_plate'])
# Keep only the last four characters of each license plate
df_csv['license_plate'] = df_csv['license_plate'].str[-4:]
# Append to the existing workbook, overlaying cells on the 'Report' sheet
writer = pd.ExcelWriter('report.xlsx', engine='openpyxl', mode='a', if_sheet_exists='overlay')
df_csv['license_plate'].to_excel(writer, sheet_name='Report', startcol=8, startrow=3, header=False, index=False)
writer.close()
The .str[-4:] slice (equivalent to str.slice(-4)) is applied to each entry in the license plate column, so only the final four characters of each plate are kept.
After that, the revised column is written to the Excel sheet.
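To make the difference concrete, here is a minimal sketch with made-up plates (the Series below is hypothetical sample data, not the asker's file). It contrasts slicing the Series itself, which selects rows, with slicing through the .str accessor, which works on each string.

```python
import pandas as pd

# Hypothetical sample data standing in for the 'license_plate' column.
plates = pd.Series(
    ["123456", "ABCDEF", "1A2B3C", "XYZ789", "QWERTY"],
    name="license_plate",
)

# Slicing the Series itself selects ROWS: the last four plates, unchanged.
last_four_rows = plates[-4:]

# Slicing through .str works on each string: last four characters per plate.
last_four_chars = plates.str[-4:]

# .str.slice(-4) is the method form of the same operation.
same_result = plates.str.slice(-4)

print(last_four_rows.tolist())
print(last_four_chars.tolist())
```

If the column happens to contain only digits, read_csv will likely parse it as integers, and .str then raises an AttributeError. Passing dtype=str to read_csv (or calling .astype(str) first) is one way around that.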

Related

How to preserve complicated excel header formats when manipulating data using Pandas Python?

I am parsing a large excel data file to another one, however the headers are very abnormal. I tried to use "read_excel skiprows" and that did not work. I also tried to include the header in
df = pd.read_excel(user_input, header= [1:3], sheet_name = 'PN Projection'), but then I get this error "ValueError: cannot join with no overlapping index names." To get around this I tried to name the columns by location and that did not work either.
When I run the code as shown below everything works fine, but past cell "U" the header titles come out as "unnamed1, 2, ...". I understand this is because pandas is considering the first row (which is empty) to be the header, but how do I fix this? Is there a way to preserve the headers without manually typing in the format for each cell? Any and all help is appreciated, thank you!
small section of the excel file header
the code I am trying to run
#!/usr/bin/env python
import sys
import os
import pandas as pd
#load source excel file
user_input = input("Enter the path of your source excel file (omit 'C:'): ")
#reads the source excel file
df = pd.read_excel(user_input, sheet_name = 'PN Projection')
#Filtering dataframe
#Filters out rows with 'EOL' in column 'item status' and 'xcvr' in 'description'
df = df[~(df['Item Status'] == 'EOL')]
df = df[~(df['Description'].str.contains("XCVR", na=False))]
#Filters in rows with "XC" or "spartan" in 'description' column
df = df[(df['Description'].str.contains("XC", na=False) | df['Description'].str.contains("Spartan", na=False))]
print(df)
#Saving to a new spreadsheet called Filtered Data
df.to_excel('filtered_data.xlsx', sheet_name='filtered_data')
If you do not need the top 2 rows, then:
df = pd.read_excel(user_input, sheet_name='PN Projection', skiprows=range(0, 2))
This has worked for me when handling several strangely formatted files. Let me know if this isn't what you're looking for, or if there are additional issues.
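A small sketch of what skiprows does to the header. It uses read_csv on an in-memory string purely so that it runs without an Excel file; the sample rows are invented, and the assumption is that read_excel interprets skiprows the same way.

```python
import io

import pandas as pd

# Hypothetical sheet contents: two decorative rows above the real header row.
raw = (
    "Report title,,\n"
    ",,\n"
    "Item Status,Description,Qty\n"
    "Active,XC7 Spartan,5\n"
    "EOL,XCVR module,2\n"
)

# Without skiprows, the first line is taken as the header,
# so the empty header cells show up as 'Unnamed: N' columns.
naive = pd.read_csv(io.StringIO(raw))

# Skipping rows 0 and 1 makes the third line the header.
clean = pd.read_csv(io.StringIO(raw), skiprows=range(0, 2))

print(list(naive.columns))
print(list(clean.columns))
```

If the top rows carry information that has to survive, skipping them is not enough; in that case they would have to be read separately and written back after filtering.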

Insert a single cell of string above header row in python pandas

I have my dataframe ready to be written to an excel file but I need to add a single cell of string above it. How do I do that?
You can save the dataframe starting from the second row and then use other tools to write the first cell of your excel file.
Note that writing from pandas to Excel overwrites the existing file, so the steps have to be done in this order (there are also ways to write to an existing Excel file without overwriting its data).
1. Save the dataframe, specifying startrow=1:
df.to_excel("filename.xlsx", startrow=1, index=False)
2. Write a cell value.
For example, using openpyxl (from a GeeksforGeeks tutorial):
from openpyxl import load_workbook
# load excel file
workbook = load_workbook(filename="filename.xlsx")
# open workbook
sheet = workbook.active
# modify the desired cell
sheet["A1"] = "A60983A Register"
# save the file
workbook.save(filename="filename.xlsx")
import pandas as pd
df = pd.DataFrame({'label':['first','second','first','first','second','second'],
'first_text':['how is your day','the weather is nice','i am feeling well','i go to school','this is good','that is new'],
'second_text':['today is warm','this is cute','i am feeling sick','i go to work','math is hard','you are old'],
'third_text':['i am a student','the weather is cold','she is cute','ii am at home','this is bad','this is trendy']})
df.loc[-1] = df.columns.values
df.sort_index(inplace=True)
df.reset_index(drop=True, inplace=True)
df.rename(columns=
{"label": "Register", 'first_text':'', 'second_text':'', 'third_text':''},
inplace=True)
Try this MRE, so you can change your data as well.

Adding in multiple dataframes into an existing Excel sheet starting on specific cell references

Good afternoon,
I'm working on a Python program that will take 3 separate dataframes and add them into an existing Excel file, overwriting the cell ranges in question but leaving the rest of the rows and columns unaltered.
Below is an example of the Excel file structure
| Keywords | Match type | col1a | col1b | col1c | col2a | col2b | col2c | col3a | col3b | col3c | counter |
| not to be removed | not to be removed | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | not to be removed |
| not to be removed | not to be removed | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | replaced data | not to be removed |
In this I need the first df starting in row 2 column 3, the second df in col 6 and the third df in column 9.
Currently with the code below I can get the data into the correct position but all the other data gets lost in the process. I think it may be possible to merge the Excel if opened as a dataframe and the newer data frames but no such luck so far.
My code is below, I am still fiddling with this and at the time of writing the old data has been opened but no action with it has been taken.
DF_LastMonthDL = pd.read_csv (LastMonthDL)
DF_Last3MonthsDL = pd.read_csv (Last3MonthsDL)
DF_LifeTimeDL = pd.read_csv (LifeTimeDL)
########################################################## Manipulating the dataframes
#Sorting the arrays to keep ordering consistent
DF_LifeTimeDL = DF_LifeTimeDL.sort_index(axis=0)
DF_LastMonthDL = DF_LastMonthDL.sort_index(axis=0)
DF_Last3MonthsDL = DF_Last3MonthsDL.sort_index(axis=0)
#Removing first cols as uneeded ¦ Keywords, Matchtype
DF_LifeTimeShrt = DF_LifeTimeDL[["Impressions", "Clicks", "CTR", "Spend(GBP)", "CPC(GBP)", "Orders", "Sales(GBP)","ACOS","ROAS"]]
DF_Last3MonthsShrt = DF_Last3MonthsDL[["Impressions", "Clicks", "CTR", "Spend(GBP)", "CPC(GBP)", "Orders", "Sales(GBP)","ACOS","ROAS"]]
DF_LastMonthShrt = DF_LastMonthDL[["Impressions", "Clicks", "CTR", "Spend(GBP)", "CPC(GBP)", "Orders", "Sales(GBP)","ACOS","ROAS"]]
oldData = pd.read_excel(r"oldData.xlsx")
########################################################## Exporting into excel in set positions
# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter('Temp.xlsx', engine='openpyxl')
# Position the dataframes in the worksheet
DF_LifeTimeShrt.to_excel(writer, sheet_name='LifeTime', startrow=2, startcol=2, header=True, index=False)
DF_Last3MonthsShrt.to_excel(writer, sheet_name='Sheet1', startrow=2, startcol=11, header=False, index=False)
DF_LastMonthShrt.to_excel(writer, sheet_name='Sheet1', startrow=2, startcol=20, header=False, index=False)
# Close the Pandas Excel writer and output the Excel file.
writer.save()
Any guidance on this would be greatly appreciated.
You can do this using openpyxl.load_workbook() and updating the cells, similar to what you are doing above. Assuming the initial part is all working correctly, you just need to change the last part as below.
import openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows
writer = openpyxl.load_workbook('Temp.xlsx')
ws=writer['LifeTime']
rows = dataframe_to_rows(DF_LifeTimeShrt, index=False, header=True)
for r_idx, row in enumerate(rows, 1):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx+2, column=c_idx+2, value=value)
ws=writer['Sheet1']
rows = dataframe_to_rows(DF_Last3MonthsShrt, index=False, header=True)
for r_idx, row in enumerate(rows, 1):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx+2, column=c_idx+2, value=value)
ws=writer['Sheet2']
rows = dataframe_to_rows(DF_LastMonthShrt, index=False, header=True)
for r_idx, row in enumerate(rows, 1):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx+2, column=c_idx+2, value=value)
# Save the workbook; the name of the file to write to has to be provided.
writer.save('Temp.xlsx')
EDIT - The advantage of load_workbook is that it updates only the cells you write to, without changing other cells or overwriting formatting such as colors that may be present. dataframe_to_rows gives you a way to get each DF row into an openpyxl-readable form. From there, I am reading each row and column (a cell) and setting its value with the value from the df. The disadvantage is that you need to go through the for loops (unlike, say, df.to_excel), but the advantage is that you can update a single cell value without disturbing anything else. Hope this explanation helps.
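The loops in the answer repeat the same pattern three times, so one option is to wrap it in a helper that takes the top-left cell as parameters. This is only a sketch: write_frame is a hypothetical helper name, the frame holds invented numbers, and an in-memory Workbook stands in for load_workbook('Temp.xlsx').

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils.dataframe import dataframe_to_rows


def write_frame(ws, frame, first_row, first_col, header=True):
    """Write frame into ws with its top-left cell at (first_row, first_col), 1-indexed."""
    rows = dataframe_to_rows(frame, index=False, header=header)
    for r_offset, row in enumerate(rows):
        for c_offset, value in enumerate(row):
            ws.cell(row=first_row + r_offset, column=first_col + c_offset, value=value)


# In-memory workbook standing in for the existing file.
wb = Workbook()
ws = wb.active
ws["A1"] = "Keywords"
ws["A2"] = "not to be removed"

# Hypothetical frame; the real ones come from the CSV exports.
frame = pd.DataFrame({"Impressions": [10, 20], "Clicks": [1, 2]})

# Header lands in row 1, starting at column 3 (column C).
write_frame(ws, frame, first_row=1, first_col=3)

print(ws["A2"].value, ws["C1"].value, ws["D3"].value)
```

Calling the helper once per dataframe with a different first_col would place the three blocks side by side, and cells outside those ranges (column A here) are left as they were.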

python pandas automatic excel lookup system

I know this is a lot of code and there is a lot to do, but I am really stuck and don't know how to continue now that I have the function that lets the program match identical files. I am pretty sure you know how the lookup from Excel works. This program does basically the same thing. I tried to comment the important parts and hope you can give me some help on how I can continue this project. Thank you very much!
import pandas as pd
import xlrd
File1 = pd.read_excel("Excel_test.xlsx", usecols=[0], header=None, index=False) #the two excel files with the columns that should be compared
File2 = pd.read_excel("Excel_test02.xlsx", usecols=[0], header=None, index=False)
fullFile1 = pd.read_excel("Excel_test.xlsx", header=None, index=False)#the full excel files
fullFile2 = pd.read_excel("Excel_test02.xlsx", header=None, index=False)
i = 0
writer = pd.ExcelWriter("output.xlsx")
def loadingTime(): #just a loader that shows the percentage of the matching process
    global i
    loading = (i / len(File1)) * 100
    loading = round(loading, 2)
    print(str(loading) + "%/100%")
def matcher():
    global i
    while(i < len(File1)):#goes in column that should be compared and goes on higher if there is a match found in second file
        for o in range(len(File2)):#runs through the column in second file
            matching = File1.iloc[i].str.lower() == File2.iloc[o].str.lower() #matches the column contents of the two files
            if matching.bool() == True:
                print("Match")
                """
                df.append(File1.iloc[i])#the whole row of the matched column should be appended in Dataframe with the arrangement of excel file
                df.append(File2.iloc[o])#the whole row of the matched column should be appended in Dataframe with the arrangement of excel file
                """
        i += 1
matcher()
df.to_excel(writer, "Sheet")
writer.save() #After the two files have been compared to each other, now a file containing both excel contents and is also arranged correctly

How to sort Excel sheet using Python

I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there any library to perform this ?
There are a couple ways to do this. The first option is to utilize xlrd, as you have this tagged. The biggest downside to this is that it doesn't natively write to XLSX format.
These examples use an excel document with this format:
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook
target_column = 0 # This example only has 1 column, and it is 0 indexed
# Note: xlrd 2.0 and later only reads .xls files; opening .xlsx needs an older xlrd
book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
data = [sheet.row_values(i) for i in range(sheet.nrows)]
labels = data[0] # Don't sort our headers
data = data[1:] # Data begins on the second row
data.sort(key=lambda x: x[target_column])
bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)
for idx, label in enumerate(labels):
    sheet.write(0, idx, label)
for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r+1, idx_c, value)
bk.save('result.xls') # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can utilize XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,sheet_name='Sheet1',columns=["Header Row"],index=False)
writer.save()
This outputs:
In the to_excel call, the index is set to False, so that the Pandas dataframe index isn't included in the excel document. The rest of the keywords should be self explanatory.
I just wanted to refresh the answer as the Pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_values, and the columns argument is replaced by by.
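For reference, a minimal sketch of sort_values on a made-up single-column frame (the data is invented; only the column name 'Header Row' comes from the answers above).

```python
import pandas as pd

# Hypothetical single-column sheet, mirroring the example workbook's layout.
df = pd.DataFrame({"Header Row": [3, 1, 2]})

# sort_values returns a new, sorted frame; 'by' names the column to sort on.
sorted_df = df.sort_values(by="Header Row")

# Descending order, with a fresh 0..n-1 index.
descending = df.sort_values(by="Header Row", ascending=False, ignore_index=True)

print(sorted_df["Header Row"].tolist())
print(descending["Header Row"].tolist())
```

Note also that ExcelWriter.save() was removed in pandas 2.0, so on recent versions the sorted frame would be written with df.to_excel('output.xlsx', index=False) or inside a with pd.ExcelWriter(...) block.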
