Identify groups and grouped rows in Excel file - python

I need to identify the different groups in Excel files and the rows inside these groups (more precisely, I need to get the value of the first cell of the main row under which the other rows are grouped).
Below is an example of the files' structure (I've minimized the groups, but when I receive these files they are expanded):
I know how to create new groups using openpyxl or xlwt, and I'm familiar with both openpyxl and xlrd, but I'm unable to find anything in their APIs to solve this requirement.
So, is it possible using Python, and if so, which part of the openpyxl or xlrd API should I use?

You should be able to do this using the worksheet's row_dimensions. This returns an object accessible like a dict where the keys are the row numbers of the sheet. outline_level will have a non-zero value for each depth of grouping, or 0 if the row is not part of a group.
So, if you had a sheet where rows 2 and 3 were a group, and rows 5 and 6 were another group, iterating through row_dimensions would look like this:
>>> for row in range(ws.min_row, ws.max_row + 1):
...     print(f"row {row} is in group {ws.row_dimensions[row].outline_level}")
...
row 1 is in group 0
row 2 is in group 1
row 3 is in group 1
row 4 is in group 0
row 5 is in group 1
row 6 is in group 1
I should point out that there's some weirdness with accessing the information. My original solution was this:
>>> for row_num, row_data in ws.row_dimensions.items():
...     print(f"row {row_num} is group {row_data.outline_level}")
...
row 2 is group 1
row 3 is group 1
row 4 is group 0
row 5 is group 1
row 6 is group 1
Notice that row 1 is missing. It wasn't part of row_dimensions until I manually accessed it as row_dimensions[1] and then it appeared. I don't know how to explain that, but the first approach is probably better as it specifically iterates from the first to last row.
The same process applies to column groups through column_dimensions, except that it must be keyed by column letter(s), e.g. ws.column_dimensions["A"].outline_level.
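Putting this together, here is a minimal sketch of how you might map each grouped row back to the first cell of its parent row. It builds a small workbook in memory instead of loading a file, and it assumes the summary ("main") row sits above its group, which depends on the workbook's outline settings:

```python
from openpyxl import Workbook

# Build a small workbook with two row groups (rows 2-3 and 5-6)
wb = Workbook()
ws = wb.active
for i, text in enumerate(["Group A", "a1", "a2", "Group B", "b1", "b2"], start=1):
    ws.cell(row=i, column=1, value=text)
for r in (2, 3, 5, 6):
    ws.row_dimensions[r].outline_level = 1

# Map each grouped row to the first cell of the nearest level-0 row above it
parents = {}
current_parent = None
for row in range(ws.min_row, ws.max_row + 1):
    if ws.row_dimensions[row].outline_level == 0:
        current_parent = ws.cell(row=row, column=1).value
    else:
        parents[row] = current_parent

print(parents)  # {2: 'Group A', 3: 'Group A', 5: 'Group B', 6: 'Group B'}
```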

Related

How to add rows to a specific location in a pandas DataFrame?

I am trying to add rows where there is a gap in month_count. For example, row 0 has month_count = 0 and row 1 has month_count = 7. How can I add 6 extra rows with month counts 1, 2, 3, 4, 5, 6? The same situation applies from row 3 to row 4: I would like to add 2 extra rows with month_count 10 and 11. What is the best way to go about this?
One way to do this would be to iterate over all of the rows and re-build the DataFrame with the missing rows inserted. Pandas does not support the direct insertion of rows at an index; however, you can hack together a solution using pd.concat():
def pandas_insert(df, idx, row_contents):
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    inserted = pd.concat([top, row_contents, bot], ignore_index=True)
    return inserted
Here row_contents should be a DataFrame with one (or more) rows. We use ignore_index=True so that the index of the new DataFrame is relabeled 0, 1, …, n-1.
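As a hedged sketch of how such a helper could fill the month_count gaps described in the question (the column name and sample values are assumptions based on the question, and the helper is repeated here so the example is self-contained):

```python
import pandas as pd

def pandas_insert(df, idx, row_contents):
    # Split the frame at idx and re-concatenate with the new rows in between
    top = df.iloc[:idx]
    bot = df.iloc[idx:]
    return pd.concat([top, row_contents, bot], ignore_index=True)

# Hypothetical data shaped like the question: gaps after 0 and after 9
df = pd.DataFrame({"month_count": [0, 7, 8, 9, 12]})

# Walk the frame, inserting one missing month at a time until no gaps remain
i = 1
while i < len(df):
    prev = df.loc[i - 1, "month_count"]
    if df.loc[i, "month_count"] - prev > 1:
        df = pandas_insert(df, i, pd.DataFrame({"month_count": [prev + 1]}))
    i += 1

print(df["month_count"].tolist())  # 0 through 12 with no gaps
```

Inserting one row per pass keeps the logic simple; for very large frames it would be cheaper to build all missing rows first and concatenate once.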

Adding new rows in pandas dataframe at specific index

I have read all the answers related to my question available on Stack Overflow, but my question is a little different. I have a very large dataframe, and a portion of it follows:
Input Dataframe is like
A B C D
0 foot 17/1: OGChan_2020011717711829281829281 , 7days ...
1 arm this will processed after ;;;
2 leg go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093
3 head xyziemen_2020011510691787006787006 en_2020011510749462801462801 ;;;
: : : :
In this dataframe, I first extract IDs from column B using a regular expression. Some rows of column B may contain IDs, some may not, and some rows may be blank. Following is the code:
import re
import pandas as pd

df = pd.read_excel("Book1.xlsx", "Sheet1")
ids = {}  # renamed from `dict` to avoid shadowing the built-in
for i in df.index:
    j = str(df['B'][i])
    a = re.findall(r'_\d{25}', j)
    if a:
        print(a)
        ids[i] = a
The regular expression matches an _ (underscore) followed by 25 digits, e.g. _2020011618188744093744093 and _2020011510749462801462801 in the df above.
Now I want to insert these IDs into column D. For example, if two IDs are found at row 0, the first ID should be inserted in row 0 of column D, the second ID in row 1 of column D, and all subsequent content of the dataframe should shift down. The desired output below, based on the input above, makes this clear.
A B .. D
0 foot 17/1: OGChan_2020011717711829281829281 ,7days _2020011717711829281829281
1 arm this will processed after
2 leg go_2020011625692400374400374 16/1: _2020011625692400374400374
Id Imerys_2020011618188744093744093
3 _2020011618188744093744093
4 head xyziemen_2020011510691787006787006 _2020011510691787006787006
en_2020011510749462801462801
5 _2020011510749462801462801
: : : :
In the above output, one ID is found at row 0, so column D of row 0 contains that ID. No ID is found at row 1, so column D of row 1 is empty. At row 2 there are two IDs, so the first is placed in column D of row 2 and the second in column D of row 3, shifting the previous content of row 3 down to row 4. I want the above output as my final output.
Hope I am clear. Thanks in advance
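A minimal sketch of one possible approach: build a new frame row by row, giving the first ID to the original row and appending an otherwise-blank row for each extra ID. The sample data is abbreviated from the question (column C omitted), so treat the column names and values as assumptions:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "A": ["foot", "arm", "leg", "head"],
    "B": ["17/1: OGChan_2020011717711829281829281 , 7days",
          "this will processed after",
          "go_2020011625692400374400374 16/1: Id Imerys_2020011618188744093744093",
          "xyziemen_2020011510691787006787006 en_2020011510749462801462801"],
})

rows = []
for _, row in df.iterrows():
    found = re.findall(r"_\d{25}", str(row["B"]))
    first = row.to_dict()
    first["D"] = found[0] if found else ""
    rows.append(first)
    # any additional IDs get their own new row, shifting later rows down
    for extra in found[1:]:
        rows.append({"A": "", "B": "", "D": extra})

out = pd.DataFrame(rows).reset_index(drop=True)
print(out)
```

Building a list of dicts and constructing the DataFrame once avoids repeated concatenation, which matters on a very large frame.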

Iterate through CSV rows with Pandas, Perform Selenium Action

I have a CSV file that was created using Pandas. Below is the output from the following code:
test = pd.read_csv('order.csv', header=0)
print(test.head())
3 16258878505032
0 3 16258876670024
1 3 16258876899400
2 3 16258876997704
The only data I need to be processed is the information in the 2nd column and the information on the 3rd column. This is purchase order data where the 2nd column represents a "quantity" and the 3rd column represents the "sku".
I need to take row 1, col 2 and inject it into an input field using selenium. I need row 1, col 3 and perform an action of selecting a sku on a webpage. Add the item to a cart and loop back through process row 2, row 3 etc.
I know how to write the selenium code to perform the web based actions, but not sure how to write the pandas/python code to iterate through the CSV file one row at a time and how to call those values out. My logic would be the following.
read order.csv
get quantity value and sku value for row (one row at the time)
visit website, inject quantity value
remain on website, select sku
add to cart
repeat loop until no more rows to process
Thanks for your help.
First, use the names parameter in read_csv to avoid converting the first row of data into column names:
test = pd.read_csv('order.csv', names=['quantity','sku'])
print (test)
quantity sku
0 3 16258878505032
1 3 16258876670024
2 3 16258876899400
3 3 16258876997704
Because you are working with selenium and the web, you can use DataFrame.iterrows or other loop-based solutions:
def func(x):
    q = x['quantity']
    sku = x['sku']
    print(q, sku)
    # add selenium code

test.apply(func, axis=1)
Or:
for i, row in test.iterrows():
    q = row['quantity']
    sku = row['sku']
    print(q, sku)
    # add selenium code

Finding rows with highest means in dataframe

I am trying to find the rows, in a very large dataframe, with the highest mean.
Reason: I scan something with laser trackers and use a "higher" point as a reference for where the scan starts. I am trying to find the placed object throughout my data.
I have calculated the mean of each row with:
base = df.mean(axis=1)
base.columns = ['index','Mean']
Here is an example of the mean for each row:
0 4.407498
1 4.463597
2 4.611886
3 4.710751
4 4.742491
5 4.580945
This seems to work fine, except that it adds an index column, and gives out columns with an index of type float64.
I then tried this to locate the rows with highest mean:
moy = base.loc[base.reset_index().groupby(['index'])['Mean'].idxmax()]
This gives this:
index Mean
0 0 4.407498
1 1 4.463597
2 2 4.611886
3 3 4.710751
4 4 4.742491
5 5 4.580945
But it only re-indexes (I now have 3 columns instead of two) and does nothing else. It still shows all rows.
Here is one way without using groupby. Note that base produced by df.mean(axis=1) is a Series, so sort_values takes no column name:
moy = base.sort_values().tail(1)
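Since the row means form a Series, an even shorter route is idxmax, which returns the index label of the row with the highest mean directly. The sample data below is illustrative, not from the question:

```python
import pandas as pd

# Small stand-in for the scan data: two measurement columns
df = pd.DataFrame({"x": [4.40, 4.46, 4.61], "y": [4.41, 4.47, 4.81]})

base = df.mean(axis=1)       # row means as a Series
highest_row = base.idxmax()  # index label of the row with the highest mean
print(highest_row, df.loc[highest_row].tolist())
```

From there, df.loc[highest_row] recovers the original row without any sorting or grouping.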
It looks as though your data is a string or single column with a space in between your two numbers. Suggest splitting the column into two and/or using something similar to below to set the index to your specific column of interest.
import pandas as pd
df = pd.read_csv('testdata.txt', names=["Index", "Mean"], delimiter=r"\s+")
df = df.set_index("Index")
print(df)

Python merge 'n' cells in excel based on condition

I have an excel where there are values in row 1 from column 1 to column 15. Each cell value in the end has a number.
I would like to create another row which merges cells based on the ending number and puts that corresponding text in the merged cell. But the row values still needs to maintain the order.
For example, A1=ABC3, B1=ABC5, C1=ABC4 and so on. Now, in row 2, I would like to merge the first 3 cells and place ABC3 in them. Next, in the same row 2, I need to merge the following 5 cells and place ABC5, then 4 merged cells for ABC4, and so on. Any thoughts on how to implement this?
This can be accomplished with the openpyxl module. If you're not familiar with it yet, then doing some of the tutorials would be a good start.
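For reference, here is a hedged sketch of how that might look with openpyxl's merge_cells. The trailing-number parsing and the sheet layout are assumptions taken from the example above (and would break for widths of 10 or more only if the regex below were not used):

```python
import re
from openpyxl import Workbook

wb = Workbook()
ws = wb.active

# Row 1 values; each ends with the number of cells its row-2 copy should span
values = ["ABC3", "ABC5", "ABC4"]
for i, v in enumerate(values, start=1):
    ws.cell(row=1, column=i, value=v)

col = 1
for v in values:
    width = int(re.search(r"\d+$", v).group())  # trailing number = merge width
    ws.merge_cells(start_row=2, start_column=col,
                   end_row=2, end_column=col + width - 1)
    ws.cell(row=2, column=col, value=v)  # write to the top-left of the merge
    col += width

print(sorted(str(r) for r in ws.merged_cells.ranges))  # ['A2:C2', 'D2:H2', 'I2:L2']
```

Only the top-left cell of a merged range accepts a value in openpyxl, which is why the text is written at each range's starting column.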
