End of merged cells in Excel with Python

I am using xlrd package to parse Excel spreadsheets.
I would like to get the end index of a merged cell.
      A   B   C
    +---+---+----+
  1 | 2 | 2 | 2  |
    +   +---+----+
  2 |   | 7 | 8  |
    +   +---+----+
  3 |   | 0 | 3  |
    +   +---+----+
  4 |   | 4 | 20 |
    +---+---+----+
  5 |   | 2 | 0  |
    +---+---+----+
Given the row index and the column index, I would like to know the end index of the merged cell (if the cell is merged).
In this example, for (row, col) = (0, 0), the end index is 3.

You can use the merged_cells attribute of the Sheet object: https://secure.simplistix.co.uk/svn/xlrd/trunk/xlrd/doc/xlrd.html?p=4966#sheet.Sheet.merged_cells-attribute
It returns the list of address ranges of the cells that have been merged.
If you only want the end index for vertically merged cells:
def is_merged(row, column):
    # Each merged range is (row_low, row_high, column_low, column_high),
    # where the high bounds are exclusive.
    for cell_range in sheet.merged_cells:
        row_low, row_high, column_low, column_high = cell_range
        if row_low <= row < row_high and column_low <= column < column_high:
            return (True, row_high - 1)  # last row of the merged block
    return False
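A minimal usage sketch (the file name is hypothetical; note that for .xls files, merged_cells is only populated when the workbook is opened with formatting_info=True):

import xlrd

book = xlrd.open_workbook("example.xls", formatting_info=True)
sheet = book.sheet_by_index(0)

print(sheet.merged_cells)  # e.g. [(0, 4, 0, 1)] for A1:A4 merged
print(is_merged(0, 0))     # (True, 3)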

Related

How do I "push down" the current columns to form the first row, and create new columns to replace that one?

I have a DataFrame whose current column labels are really the first row of data. I'd like to know how to set new column names and push the old labels down to become the first row.
For example:
| 4 | 3 | dog |
| --- | --- | --- |
| 1 | 2 | cat |
I want to change that DataFrame to be:
| number_1 | number_2 | animal |
| -------- | -------- | ------ |
| 4 | 3 | dog |
| 1 | 2 | cat |
What would be the best way to do this?
Let's create a new DataFrame with the old column row as the first row, followed by the remaining rows:
pd.DataFrame([df.columns, *df.values], columns=['num_1', 'num_2', 'animal'])
num_1 num_2 animal
0 4 3 dog
1 1 2 cat
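One caveat: the old column labels arrive as data, so if they were strings the numeric columns end up with object dtype. A small follow-up sketch, assuming you want integers back (num_1/num_2 as above):

out = pd.DataFrame([df.columns, *df.values], columns=['num_1', 'num_2', 'animal'])
out[['num_1', 'num_2']] = out[['num_1', 'num_2']].astype(int)  # restore numeric dtype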

Comparing two DataFrames and creating a third one where certain conditions are met

I am trying to compare two DataFrames that have the same column names and indexes (not numerical), and I need to obtain a third DataFrame with the biggest value at each position for columns with the same name.
Example
df1 =
|            | col_1 | col2 | col-3 |
| ---------- | ----- | ---- | ----- |
| rft_12312  | 4     | 7    | 4     |
| rft_321321 | 3     | 4    | 1     |
df2 =
|            | col_1 | col2 | col-3 |
| ---------- | ----- | ---- | ----- |
| rft_12312  | 7     | 3    | 4     |
| rft_321321 | 3     | 7    | 6     |
Required result:
|            | col_1 | col2 | col-3 |
| ---------- | ----- | ---- | ----- |
| rft_12312  | 7 (because df2's value at this [row, column] > df1's value) | 7 | 4 |
| rft_321321 | 3 (when they are equal it doesn't matter which frame the value comes from) | 7 | 6 |
I've already tried pd.update with filter_func defined as:
def filtration_function(val1, val2):
    if val1 >= val2:
        return val1
    else:
        return val2
but it is not working. I need the check for each column with the same name.
I also tried pd.compare, but it does not let me pick the right values.
Thank you in advance :)
I think one possibility would be to use combine. This method combines the two DataFrames column by column, passing each pair of columns to the given function; with an element-wise maximum it returns the biggest value at each position.
Example:
import pandas as pd
def filtration_function(col1, col2):
    # combine passes whole columns (Series), so return the element-wise maximum
    return col1.where(col1 >= col2, col2)

result = df1.combine(df2, filtration_function)
I think the where method can work too:
import pandas as pd
result = df1.where(df1 >= df2, df2)
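A quick check with the sample frames from the question (a self-contained sketch):

import pandas as pd

df1 = pd.DataFrame({'col_1': [4, 3], 'col2': [7, 4], 'col-3': [4, 1]},
                   index=['rft_12312', 'rft_321321'])
df2 = pd.DataFrame({'col_1': [7, 3], 'col2': [3, 7], 'col-3': [4, 6]},
                   index=['rft_12312', 'rft_321321'])

print(df1.where(df1 >= df2, df2))
#             col_1  col2  col-3
# rft_12312       7     7      4
# rft_321321      3     7      6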

How to apply a function on each group of data in a pandas group by

Suppose the data frame below:
|id |day | order |
|---|--- |-------|
| a | 2 | 6 |
| a | 4 | 0 |
| a | 7 | 4 |
| a | 8 | 8 |
| b | 11 | 10 |
| b | 15 | 15 |
I want to apply a function to the day and order columns of each group, grouping the rows by the id column.
The function is:
def mean_of_differences(my_list):
    return sum(my_list[i] - my_list[i-1] for i in range(1, len(my_list))) / len(my_list)
This function calculates the mean of the differences between each element and the next one. For example, for id=a, day would be (2+3+1) divided by 4. I know how to use a lambda, but I didn't find a way to implement this in a pandas group by. Also, each column should be sorted to get my desired output, so apparently it is not possible to sort by just one column before the group by.
The output should be like this:
|id |day| order |
|---|---|-------|
| a |1.5| 2 |
| b | 2 | 2.5 |
Does anyone know how to do this in a group by?
First, sort your data by day, then group by id, and finally compute your diff/mean.
df = df.sort_values('day') \
.groupby('id') \
.agg({'day': lambda x: x.diff().fillna(0).mean()}) \
.reset_index()
Output:
>>> df
id day
0 a 1.5
1 b 2.0
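The question also wants the order column, which needs its own sort within each group (sorting the whole frame by day won't order the order values). A sketch extending the same idea, starting from the original frame and assuming each column is sorted independently per group:

def sorted_diff_mean(x):
    # sort the group's values, then average the consecutive gaps
    x = x.sort_values()
    return x.diff().fillna(0).mean()

out = df.groupby('id').agg({'day': sorted_diff_mean, 'order': sorted_diff_mean}).reset_index()
print(out)
#   id  day  order
# 0  a  1.5    2.0
# 1  b  2.0    2.5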

Splitting a CSV into multiple CSVs depending on what is in column 1 using Python

So I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event, for example.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV, titled with that ID number, containing all the data up to the next ID number, so I end up with one CSV per event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is predicated only on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"

l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as those hold the AL* strings...
Now to filter the AL* you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # all positions where the value starts with AL
dfs = np.split(df, idx)  # split the data at those positions

for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # save each chunk under its ID
This gives you two CSV files named AL123.csv and AL321.csv, each with the AL* string as its first line.
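To run this against the real file instead of generated data, the same split works on a full 8-column frame, since only column one drives it. A sketch, with "events.csv" as a hypothetical file name:

import numpy as np
import pandas as pd

df = pd.read_csv("events.csv", header=None)               # hypothetical file; no header row
idx = df.index[df[0].astype(str).str.startswith("AL")]    # split on the first column only
for out in np.split(df, idx)[1:]:
    out.to_csv(str(out.iloc[0, 0]) + ".csv", index=False, header=False)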

Writing to excel file based on column value & string naming

The DF looks something like this and extends for thousands of rows (i.e. every combination of 'Type' & 'Name' possible):
| total | big | med | small| Type | Name |
|:-----:|:-----:|:-----:|:----:|:--------:|:--------:|
| 5 | 4 | 0 | 1 | Pig | John |
| 6 | 0 | 3 | 3 | Horse | Mike |
| 5 | 2 | 3 | 0 | Cow | Rick |
| 5 | 2 | 3 | 0 | Horse | Rick |
| 5 | 2 | 3 | 0 | Cow | John |
| 5 | 2 | 3 | 0 | Pig | Mike |
I would like to write code that writes files to Excel based on the 'Type' column value. In the example above there are 3 different Types, so I'd like one file each for Pig, Horse, and Cow.
I have been able to do this using two columns, but for some reason have not been able to do it with just one. See the code below.
for idx, df in data.groupby(['Type', 'Name']):
    table_1 = function_1(df)
    table_2 = function_2(df)
    with pd.ExcelWriter(f"{'STRING1' + '_' + ('_'.join(idx)) + '_' + 'STRING2'}.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)
Current result is:
STRING1_Pig_John_STRING2.xlsx (all the rows that have Pig and John)
What I would like is:
STRING1_Pig_STRING2.xlsx (all the rows that have Pig)
Do you have anything against boolean indexing? If not:
vals = df['Type'].unique().tolist()

with pd.ExcelWriter("blah.xlsx") as writer:
    for val in vals:
        ix = df[df['Type'] == val].index
        df.loc[ix].to_excel(writer, sheet_name=str(val), index=False)
EDIT:
If you want to stick to groupby, that would be:
with pd.ExcelWriter("blah.xlsx") as writer:
    for idx, df in data.groupby(['Type']):
        val = list(set(df.Type))[0]
        df.to_excel(writer, sheet_name=str(val), index=False)
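Both snippets above write a single workbook with one sheet per Type. If you actually want one file per Type with the naming pattern from the question, a sketch reusing the question's function_1/function_2:

for val, group in data.groupby('Type'):
    table_1 = function_1(group)
    table_2 = function_2(group)
    # one workbook per Type, e.g. STRING1_Pig_STRING2.xlsx
    with pd.ExcelWriter(f"STRING1_{val}_STRING2.xlsx") as writer:
        table_1.to_excel(writer, sheet_name='Table 1', index=False)
        table_2.to_excel(writer, sheet_name='Table 2', index=False)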
