I receive an Excel file from someone every month and need to read the data. The format is not stable from one file to the next, and by saying "not stable" I mean:
Where the data starts changes: e.g. Section A may start on row 4, column D this time, but next time it may start at row 2, column E.
Under each section there are tags. The number of tags may change as well, but every time I only need the data in tag_2 and tag_3 (these two will always show up).
The only data I need is from tag_2 and tag_3, for each month (month1 - month8). I want to find a way, using Python, to first locate the section name, then find tag_2 and tag_3 under that section, then get the data for month1 to month8 (the number of months may change as well).
Please note that I do NOT want to locate the data by specifying fixed locations in Excel, since the locations change every time. How do I do this?
The end product should be a pandas dataframe with the monthly data for tag_2 and tag_3, plus a column that says which section the data comes from.
Thanks.
I think you can directly read it as a comma-separated text file. Based on what you need, you can look at tag_2 and tag_3 on each line.
with open(filename, "r") as fs:
    for line in fs:
        cell_list = line.split(",")
        # At this point you have all the elements on the line as a list;
        # you can check its size and implement your logic from there.
Assuming that the (presumably manually pasted) block of information is unlikely to end up in the very bottom-right corner of the Excel sheet, you could simply iterate over rows and columns (set maximum values for each to prevent long search times) until you find a familiar value (such as "Section A") and go from there.
Unless I misunderstood you, the rest of the format should be consistent between the months, so you can simply assume that "month_1" is always one cell up and two to the right of that initial spot.
I have not personally worked with excel sheets in python, so I cannot state whether the following is possible in python, but it definitely works in ExcelVBA:
You could just as well use the Range.find() method to find the value "Section A" and continue with the same process as above, perhaps writing any results to a txt file and calling your Python script from there if necessary.
I hope this helps a little.
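For what it's worth, here is a minimal Python sketch of that scan-for-anchor idea using openpyxl. The file name, section label, tag labels, and search window are all assumptions that would need adjusting to the real sheet:

import openpyxl
import pandas as pd

wb = openpyxl.load_workbook("report.xlsx", data_only=True)
ws = wb.active

def find_cell(value, max_row=100, max_col=30):
    # Scan a bounded window for the first cell holding `value`.
    for row in ws.iter_rows(min_row=1, max_row=max_row, max_col=max_col):
        for cell in row:
            if cell.value == value:
                return cell.row, cell.column
    return None

anchor = find_cell("Section A")
if anchor:
    r, c = anchor
    records = []
    # Walk the rows below the section header; the month values are
    # assumed to sit in the cells to the right of each tag label.
    for rr in range(r + 1, r + 20):
        label = ws.cell(row=rr, column=c).value
        if label in ("tag_2", "tag_3"):
            months = [ws.cell(row=rr, column=cc).value for cc in range(c + 1, c + 9)]
            rec = {"section": "Section A", "tag": label}
            rec.update({"month" + str(i + 1): v for i, v in enumerate(months)})
            records.append(rec)
    df = pd.DataFrame(records)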
So far I've only used Power Query to clean and automate files, and I want to step up my game and move to Python, but I'm having some issues and have no one to ask, so I'm coming to you for help. I'm completely new to Python, learning from YouTube videos and the Python for Data Analysis book, so please bear with me for a moment.
To learn, I've been working on a project using a sample CSV file. The file covers several dates and has multiple columns with different data. What I want to do is split the file into different CSVs based on the date in the column "DateFull", which holds dates in a yyyy-mm-dd 00:00:00 format, and name the new CSV files with the date.
Looking at YouTube videos, I came up with this piece of code:
import pandas as pd
df = pd.read_csv("sample_file.csv")
split_dates = df['DateFull'].unique()
for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date) + ".csv"
    df1.to_csv(split_file_name, index=False)
But when I run it, it errors out because it tries to use the whole value, including the 00:00:00 time part, as part of the filename, which is not acceptable. I've been looking into the split method to separate the DateFull column at the whitespace, but I don't know how to incorporate that into the code.
It's obvious that I don't have much idea of how the structure or logic of the code should look, but my plan was to use the df['DateFull'].str.split() command to create two new columns: one with just the date, and one with the 00:00:00 part. Then I would remove the last one along with the original DateFull, so the trimmed date column replaces it, and use that one to split the CSV.
I know I'm probably overcomplicating it and there's an easier way to do it, maybe just removing the time part from the original column. If that's possible, it would be amazing to know how. But I'd also like to know how to do it with my approach, since I would be practicing more methods, even though the resulting code will be redundant.
Any help would be greatly appreciated.
Thank you so much
You can find the documentation for the split() function here.
To do this with split():
str(date).split(" ")[0]
This splits on the whitespace and returns the first (0-indexed) value in the resulting list. With this change, your for loop would look like this:
for date in split_dates:
    df1 = df[df['DateFull'] == date]
    split_file_name = "Samplefile_" + str(date).split(" ")[0] + ".csv"
    df1.to_csv(split_file_name, index=False)
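Since you also wanted to practice the column-based plan from your post, here is a minimal sketch of that approach, with two interchangeable ways of building the trimmed date column (the column name comes from your post; everything else is standard pandas):

import pandas as pd

df = pd.read_csv("sample_file.csv")

# Option 1: the str.split() plan, kept as a new helper column
df['DateOnly'] = df['DateFull'].str.split(' ').str[0]

# Option 2: parse the values as datetimes and keep just the date part
# (overwrites Option 1; use either one)
df['DateOnly'] = pd.to_datetime(df['DateFull']).dt.date

# Write one file per date, dropping the helper column again
for date, group in df.groupby('DateOnly'):
    group.drop(columns=['DateOnly']).to_csv("Samplefile_" + str(date) + ".csv", index=False)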
So what I want to do is print only the rows where, for example, the price (or any other column "title" cell) is greater than or equal to, let's say, 50.
I haven't been able to find the answer elsewhere and couldn't do it myself with the API documentation.
I'm using Google Sheets API v4, and my goal, based on a sheet that contains information on mobile subscriptions, is to allow users to select what they want for price, GB, etc.
Here is what my sheets look like:
Also, here is an unofficial documentation page which I found great, even though it didn't contain the answer I need; maybe someone here will succeed with it?
I tried running the following code but it didn't work:
val_list = col5
d = wks.findall(>50) if cell.value >50 :
print (val_list)
I hope you will be able to help me. I'm new to Python.
I think you had the right idea, but it looks like findall is for strings or regex, not an arbitrary boolean condition. Also, some of the syntax is a bit off, but that's to be expected when you are just starting out.
Here is how I would approach this with just what I could find in your attached document. I doubt this is the fastest or cleanest way to do this, but I think it's at least conceptually clear:
# List of all values in the 4th (price) column
prices = wks.col_values(4)
# Remove non-numeric characters from the prices (skipping the header row)
prices = [p.replace('*', '') for p in prices[1:]]
# Get indices of rows with price >= 50
# (i+2 to account for 1-indexing and the removed header row)
indices = [i + 2 for i, p in enumerate(prices) if float(p) >= 50]
# Print these rows
for i in indices:
    row = wks.row_values(i)
    print(row)
Going forward with this project, you may want to put these row values into a dataframe rather than just printing them so you can do further analysis on this subset of the data.
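For example, a minimal sketch of that last step, assuming the first sheet row holds the column headers and every row has a value in each column (gspread drops trailing blank cells, so ragged rows would need padding):

import pandas as pd

header = wks.row_values(1)
rows = [wks.row_values(i) for i in indices]
df = pd.DataFrame(rows, columns=header)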
While compiling a pandas table to plot certain activity on a tool, I have encountered a rare error in the data that creates an extra 2 columns for certain entries. This means that one of my computed columns goes into the table 2 cells further along than the other and kills the plot.
I was hoping to find a way to pull the contents of a single cell in a row and swap it into the other cell beside it, which contains irrelevant information in the error case, but which is used for the plot of all the other pd data.
I've tried a couple of different ways to swap the data around but keep hitting errors.
My attempts to fix it include:
for rows in df['server']:
    if '%USERID' in line:
        df['server'] = df[7]  # both versions of this and below
        df['server'].replace(df['server'], df[7])
    else:
        pass

if '%USERID' in df['server']:  # Attempt to fix missing server name
    df['server'] = df[7]
else:
    pass

if '%USERID' in df['server']:
    return row['7'], row['server']
else:
    pass
I'd like the data from column '7' to be replicated in 'server', only in the case of the error - where the data in the cell contains a string starting with '%USERID'
Turns out I was over-thinking this one. I took a step back, worked the code a bit and solved it.
Rather than trying to write a one-size-fits-all bit of code for all the data, I built separate lists for the general data and the 2 exceptions I found by writing a nested loop, and created 3 data frames. These were easy enough to manipulate individually and finally concatenate together. All working fine now.
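For anyone who wants the single-frame version of the swap described in the question, a minimal sketch using a boolean mask (the 'server' and 7 column labels are taken from the question's code):

# Rows where the 'server' cell holds the misplaced '%USERID...' string
mask = df['server'].astype(str).str.startswith('%USERID')
# Copy column 7 into 'server' for just those rows
df.loc[mask, 'server'] = df.loc[mask, 7]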
G'day,
I posted this question, and had some excellent responses from @abarnert. I'm trying to remove particular rows from a CSV file. I've learned that CSV files won't allow particular rows to be deleted, so I'm going to rewrite the CSV whilst omitting the particular rows, then rename the new file as the old.
As per the above question in the link, I have tools being taken and returned from a toolbox. The CSV file I'm trying to rewrite is an ongoing 'log' of the tools currently checked out from the toolbox. Therefore, when a tool is returned, I need that tool to be removed from the log CSV file.
Here's what I have so far:
absent_past = frozenset(tuple(row) for row in csv.reader(open('Absent_Past.csv', 'r')))
absent_current = frozenset(tuple(row) for row in csv.reader(open('Absent_Current.csv', 'r')))
tools_returned = [",".join(row) for row in absent_past - absent_current]

with open('Log.csv') as f:
    check = csv.reader(f)
    for row in check:
        if row[1] not in tools_returned:
            csv.writer(open('Log_Current.csv', 'a+')).writerow(row)

os.remove('Log.csv')
os.rename('Log_Current.csv', 'Log.csv')
As you can (hopefully) see from above, it will open the Log.csv file, and if a tool has been returned (ie. the tool is listed in a row in tools_returned), it will not rewrite that entry into the new file. When all the non-returned tools have been written to the new file, the old file is deleted, with the new file being renamed as Log.csv from Log_Current.csv.
It's worth mentioning that the tools which have been taken are appended to Log_Current.csv before it is renamed. This part of the code works nicely :)
I've been instructed to avoid using CSV for this system, which I agree with. I would like to explore CSV handling in Python as much as I can at this point, however, as I know it will come in handy in the future. I will be looking to use the contextlib and shelve modules in the future.
Thanks!
EDIT: In the code above, I have if row[1]..., which I'm hoping means that it will only check the value of the first column in the row? Basically, a row will consist of something like Hammer, James, Taken, 09:15:25, but I only want to search the Log.csv file for Hammer, since tools_returned consists only of tool names, i.e. Hammer, Drill, Saw, etc. Is the row[1] approach correct for this?
At the moment, Log_Current.csv is being written with every row from Log.csv, regardless of whether the tool has been replaced or not. As such, I'm thinking that the if row[1] part of the code isn't working.
I figured I'd answer my own question, as I've now figured it out. The code posted above is correct, except for one minor error. When referring to the column number in a row, the first column is column 0, not column 1. As I was searching column '1' for the tool name, it was never going to work, as column '1' is actually the second column, which holds the name of the user.
Changing that line to if row[0] etc rewrites a new file with the current list of tools that are checked out, and omits any tools that have been replaced, as desired!
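For reference, here is a minimal sketch of the corrected rewrite with context managers so every file handle is closed, assuming (as stated in the edit) that each row in the Absent_*.csv files holds just a tool name:

import csv
import os

with open('Absent_Past.csv', newline='') as f:
    absent_past = frozenset(tuple(row) for row in csv.reader(f))
with open('Absent_Current.csv', newline='') as f:
    absent_current = frozenset(tuple(row) for row in csv.reader(f))
tools_returned = {row[0] for row in absent_past - absent_current}

# Append mode, since newly taken tools may already have been written
# to Log_Current.csv before this step (as described above)
with open('Log.csv', newline='') as src, open('Log_Current.csv', 'a', newline='') as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        if row[0] not in tools_returned:  # tool name is the first column
            writer.writerow(row)

os.remove('Log.csv')
os.rename('Log_Current.csv', 'Log.csv')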
I want to delete a record from a google spreadsheet using the gspread library.
Also, how can I get the number of rows/records in a Google spreadsheet? gspread provides .row_count(), which returns the total number of rows, including those that are blank, but I only want to count the rows which have data.
Since gspread version 0.5.0 (December 2016) you can remove a row with delete_row().
For example, to delete a row with index 42, you do:
worksheet.delete_row(42)
Can you specify exactly how you want to delete the rows/records? Are they in the middle of a sheet? Bottom? Top?
I had a situation where I wanted to wipe all data except the headers once I had processed it. To do this I just resized the worksheet twice.
#first row is data header to keep
worksheet.resize(rows=1)
worksheet.resize(rows=30)
This is a simple brute force solution for wiping a whole sheet without deleting the worksheet.
Count Rows with data
One way would be to download the data as a JSON object using get_all_records(), then check the length of that object. That method returns all rows above the last non-blank row: it will include blank rows if a non-blank row comes after them, but not trailing blanks.
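For example:

# The length of the returned list is the number of data rows
# (the header row is not included).
records = worksheet.get_all_records()
print(len(records))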
worksheet.delete_row(42) is deprecated (December 2021). Now you can achieve the same results using
worksheet.delete_rows(42)
The new function has the added functionality of being able to delete several rows at the same time. Note that the second argument is the (inclusive) index of the last row to delete, not a count, so
worksheet.delete_rows(42, 44)
will delete rows 42 through 44, i.e. three rows starting from row 42.
Beware that it starts counting rows from 1 (so not zero based numbering).
Reading the source code, it seems there is no method to directly remove rows: there are only methods to add them, plus the .resize() method to resize the worksheet.
When it comes to getting the rows number, there's a .row_count() method that should do the job for you.
Adding to @AsAP_Sherb's answer:
If you want to count how many rows there are, don't use get_all_records() - instead use worksheet.col_values(1), and count the length of that.
(instead of getting the entire table, you get only one column)
I think that would be more time-efficient (and will definitely be memory-efficient).
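A minimal sketch, assuming column 1 has a value in every data row:

# col_values() stops at the last non-empty cell in the column;
# subtract 1 if the first row is a header.
num_rows = len(worksheet.col_values(1))
print(num_rows)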