Two different Excel files - match their rows having the same name - python

Using python pandas,
I am trying to write a condition in pandas that compares two columns from two different Excel files; the columns have the same name but different numerical values. Each column has 2000 rows to match.
The condition:
if File1(column1 value) - File2(column1 value) equals 0, then update the value with 1;
if File1(column1 value) - File2(column1 value) is less than or equal to 0.2, then keep File1's column1 value;
if File1(column1 value) - File2(column1 value) is greater than 0.2, then update the value with 0.
https://i.stack.imgur.com/Nx3WA.jpg

import pandas as pd

df1 = pd.read_excel('file_name1')  # get input from excel files
df2 = pd.read_excel('file_name2')
p1 = df1['p1'].values
p11 = df2['p11'].values
new_col = []  # we will store desired values here
for i in range(len(p1)):
    if p1[i] - p11[i] == 0:
        new_col.append(1)
    elif abs(p1[i] - p11[i]) > 0.2:
        new_col.append(0)
    else:
        new_col.append(p1[i])
df1['new_column'] = new_col  # we add new column with our values
You can also remove the old column with df.drop('column', axis=1).
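If the files are large, a vectorized alternative avoids the Python loop entirely. A minimal sketch with numpy.where, assuming the same column names 'p1' and 'p11' as above:
import numpy as np
import pandas as pd

df1 = pd.read_excel('file_name1')
df2 = pd.read_excel('file_name2')
diff = df1['p1'] - df2['p11']
# exact match -> 1, difference above 0.2 -> 0, otherwise keep File1's value
df1['new_column'] = np.where(diff == 0, 1, np.where(diff.abs() > 0.2, 0, df1['p1']))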

Related

Retaining rows that have percent overlapping ranges in Pandas

I have a dataframe with the columns:
[id, range_start, range_stop, score]
If two rows' ranges overlap by at least x percent, I retain the row with the higher score. However, I am confused about how to pull out rows that have no overlap with any other range. I am using a nested loop and recursion to condense overlapping ranges into a new dataframe, but this structure causes all rows to be retained when I am looking for the non-overlapping rows.
## This is my function to recursively select the highest scoring overlapping regions
def overlap_retention(df_overlap, threshold, df_nonoverlap=None):
    if df_nonoverlap != None:
        df_nonoverlap = pd.DataFrame()
    df_overlap = pd.DataFrame()
    for index, row in x.iterrows():
        rs = row['range_start']
        re = row['range_end']
        ## Silly nested loop to compare ranges between all rows
        for index2, row2 in x.drop(index).iterrows():
            rs2 = row2['range_start']
            re2 = row2['range_end']
            readRegion = [*range(rs, re, 1)]
            refRegion = [*range(rs2, re2, 1)]
            regionUnion = set(readRegion).intersection(set(refRegion))
            overlap_length = len(regionUnion)
            overlap_min = min(rs, rs2)
            overlap_max = max(re, re2)
            overlap_full_range = overlap_max - overlap_min
            overlap_percentage = (overlap_length / overlap_full_range) * 100
            ## Check if they overlap by x_percentage and retain the higher score
            if overlap_percentage > x_percentage:
                evalue = row['score']
                evalue_2 = row2['score']
                if evalue_2 > evalue:
                    df_overlap = df_overlap.append(row2)
                else:
                    df_overlap = df_overlap.append(row)
            #----------------------------------------------------------
            ## How to find non-overlapping rows without pulling everything?
            else:
                df_nonoverlap = df_nonoverlap.append(row)
    # ---------------------------------------------
    ### Recursion here to condense overlapped list further
    if len(df_overlap) > 1:
        overlap_retention(df_overlap, threshold, df_nonoverlap)
    else:
        return(df_nonoverlap)
An example input is below:
data = {'id':['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'range_start':[1, 12, 11, 1, 20, 10],
        'range_end':[4, 15, 15, 6, 23, 16],
        'score':[3, 1, 8, 2, 5, 1]}
input = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
The desired output can change based on the overlap threshold. In the example above id1 and id4 may both be retained or simply id1 depending on the overlap threshold:
data = {'id':['id1', 'id3', 'id5'],
        'range_start':[1, 11, 20],
        'range_end':[4, 15, 23],
        'score':[3, 8, 5]}
output = pd.DataFrame(data, columns=['id', 'range_start', 'range_end', 'score'])
You can make a cartesian join between all the ranges, then find length and % of the overlap for each pair, and filter it based on the x_overlap threshold.
After that, for each range we can find the overlapping range with the highest score (which could be the range itself, with the overlap of 100%):
# set min overlap parameter
x_overlap = 0.5
# cartesian join all ranges
z = df.assign(k=1).merge(
    df.assign(k=1), on='k', suffixes=['_1', '_2'])
# find lengths of overlaps
z['len_overlap'] = (
    z[['range_end_1', 'range_end_2']].min(axis=1) -
    z[['range_start_1', 'range_start_2']].max(axis=1)).clip(0)
# we're only interested in cases where ranges overlap, so the total
# range is the range between min(start1, start2) and max(end1, end2)
z['len_total'] = (
    z[['range_end_1', 'range_end_2']].max(axis=1) -
    z[['range_start_1', 'range_start_2']].min(axis=1)).clip(0)
# find % overlap and filter out pairs above threshold
# these include 'pairs' where a range is paired to itself
z['pct_overlap'] = z['len_overlap'] / z['len_total']
z = z[z['pct_overlap'] > x_overlap]
# for each range find an overlapping range with the highest score
# (could be the range itself)
z = z.sort_values('score_2').groupby('id_1')['id_2'].last()
# filter the inputs
df_out = df[df['id'].isin(z)]
df_out
Output:
id range_start range_end score
0 id1 1 4 3
2 id3 11 15 8
4 id5 20 23 5
P.S. Please note that it is not very clear what should happen with id4 in your example. Since you don't have it in your output, I assumed (hopefully correctly) that you're not interested in zero-length ranges in the output.
P.P.S. There is a new syntax for a cartesian join in pandas 1.2.0+, the how='cross' parameter of the merge method. In my answer I've used the version with a dummy variable k=1, which is more verbose but compatible with older versions.
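For reference, a sketch of that newer syntax (pandas 1.2.0+):
# cartesian self-join without the dummy k=1 column
z = df.merge(df, how='cross', suffixes=['_1', '_2'])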
I think you need a very clear definition of overlap. If you have [2;7], [6;10] and [7;8], which one overlaps with which one?
Avoid using input as a variable name, it shadows the function input() (to get input from the user)
If you want to select clear overlaps (only the start or the end differs), and you only have at most ONE overlap, here you go:
sorted_df = df.sort_values(by=["range_start"])
starts_earlier = sorted_df[sorted_df.range_end.shift(-1) == sorted_df.range_end]
sorted_df = df.sort_values(by=["range_end"])
ends_earlier = sorted_df[sorted_df.range_start.shift(-1) == sorted_df.range_start]
Then you can do df.drop(starts_earlier.index) and df.drop(ends_earlier.index) to remove the shorter ones.
df.shift() : https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shift.html
This code won't work for multiple overlapping segments. If you are interested in that, let me know.

Time Series Dataframe Groupby 3d Array - observation/row count - For LSTM

I have a time series with a structure like below: an identifier column and two value columns (floats).
The dataframe is called just df:
Date Id Value1 Value2
2014-10-01 A 1.1 1.2
2014-10-01 B 1.3 1.4
2014-10-02 A 1.5 1.6
2014-10-02 B 1.7 1.8
2014-10-03 A 3.2 4.8
2014-10-03 B 8.2 10.1
2014-10-04 A 6.1 7.2
2014-10-04 B 4.3 4.1
What I am trying to do is turn it into an array that is grouped by the identifier column with a rolling 3-observation window, so I would end up with this:
[[[1.1 1.2]
[1.5 1.6] '----> ID A 10/1 to 10/3'
[3.2 4.8]]
[[1.3 1.4]
[1.7 1.8] '----> ID B 10/1 to 10/3'
[8.2 10.1]]
[[1.5 1.6]
[3.2 4.8] '----> ID A 10/2 to 10/4'
[6.1 7.2]]
[[1.7 1.8]
[8.2 10.1] '----> ID B 10/2 to 10/4'
[4.3 4.1]]]
Of course, ignore the quoted parts in the array above, but you hopefully get the idea.
I have a larger dataset with more identifiers and may need to change the observation count, so I can't hard-code the row count. So far the direction I am leaning towards is taking the unique values of the ID column, iterating over them, and grabbing 3 values at a time by creating a temp df and iterating over that.
Seems there is probably a better and faster way to do this.
"pseudo code"
unique_ids = df.ID.unique().tolist()
for id in unique_ids:
    temp_df = df.loc[df['Id'] == id]
Though the part I am stuck on is the best way to iterate over the temp_df as well.
The end output would be used in an LSTM model; however, most other solutions are written without needing to handle the groupby aspect introduced by the 'Id' column.
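A more vectorized sketch of the rolling windows (assuming numpy >= 1.20 for sliding_window_view; note the windows come out grouped by Id rather than interleaved by date as shown above):
import numpy as np

step = 3
windows = []
for _, g in df.sort_values('Date').groupby('Id'):
    vals = g[['Value1', 'Value2']].to_numpy()  # shape (n_obs, n_features)
    if len(vals) >= step:
        w = np.lib.stride_tricks.sliding_window_view(vals, (step, vals.shape[1]))
        windows.append(w.reshape(-1, step, vals.shape[1]))
arr = np.concatenate(windows)  # shape (n_windows, step, n_features)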
Here is what I ended up doing for the solution. It's not the prettiest or easiest, but then again my question wasn't winning any beauty contests to begin with.
id_list = array_steps_df['Id'].unique().tolist()
# change number of steps as needed
step = 3
column_list = ['Value1', 'Value2']
master_list = []
for id in id_list:
    master_dict = {}
    for column in column_list:
        array_steps_id_df = array_steps_df.loc[array_steps_df['Id'] == id]
        array_steps_id_df = array_steps_id_df[[column]].values
        master_dict[column] = []
        for obs in range(len(array_steps_id_df) - step + 1):
            start_obs = obs + step
            master_dict[column].append(array_steps_id_df[obs:start_obs, ])
    master_list.append(master_dict)
for idx, dic in enumerate(master_list):
    # init arrays here
    if idx == 0:
        value1_array_init = master_list[0]['Value1']
        value2_array_init = master_list[1]['Value2']
    else:
        value1_array_init += master_list[idx]['Value1']
        value2_array_init += master_list[idx]['Value2']
value1_array = np.array(value1_array_init)
value2_array = np.array(value2_array_init)
all_array = np.hstack((value1_array, value2_array)).reshape((len(array_steps_df) - (step + 1),
                                                             len(column_list),
                                                             step)).transpose(0, 2, 1)
Fixed my mistake: I added a transpose at the end and redid the order of features and steps in the reshape.
Credit to this site for some extra help
https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
I ended up redoing this a bit to make it more dynamic for the columns and to keep the time series in order; I also added a target array to keep the predictions in order. For anyone that needs this, here is the function:
def data_to_array_steps(array_steps_df, time_steps, columns_to_array, id_column):
    """
    https://www.mikulskibartosz.name/how-to-turn-pandas-data-frame-into-time-series-input-for-rnn/
    :param array_steps_df: the dataframe from the csv
    :param time_steps: how many time steps
    :param columns_to_array: what columns to convert to the array
    :param id_column: what is to be used for the identifier
    :return: data grouped in a # observations by identifier and date
    """
    id_list = array_steps_df[id_column].unique().tolist()
    date_list = array_steps_df['date'].unique().tolist()
    master_list = []
    target_list = []
    missing_counter = 0
    total_counter = 0
    # grab date size = time steps at a time and iterate through all of them
    for date in range(len(date_list) - time_steps + 1):
        date_range_test = date_list[date:time_steps + date]
        date_range_df = array_steps_df.loc[(array_steps_df['date'] <= date_range_test[-1]) &
                                           (array_steps_df['date'] >= date_range_test[0])]
        # for each id do it separately so time series data doesn't get mixed up
        for identifier in id_list:
            # get id in here and then skip if not the required time steps/observations for the id
            date_range_id = date_range_df.loc[date_range_df[id_column] == identifier]
            master_dict = {}
            # if there aren't enough observations for the date range
            if len(date_range_id) != time_steps:
                # don't fully need the counter except in unusual circumstances when debugging; it causes no harm for now
                missing_counter += 1
            else:
                # add target each loop through for the last date in the date range for the id or ticker
                target = array_steps_df['target'].loc[(array_steps_df['date'] == date_range_test[-1]) &
                                                      (array_steps_df[id_column] == identifier)].iloc[0]
                target_list.append(target)
                total_counter += 1
                # loop through each column in dataframe
                for column in columns_to_array:
                    date_range_id_value = date_range_id[[column]].values
                    master_dict[column] = []
                    master_dict[column].append(date_range_id_value)
                master_list.append(master_dict)
    # redo columns to arrays, after they have been ordered and grouped by Id above
    array_list = []
    # for each column go through the values in the array, create an array for the column, then append to the list
    for column in columns_to_array:
        for idx, dic in enumerate(master_list):
            # init arrays here if the first value
            if idx == 0:
                value_array_init = master_list[0][column]
            else:
                value_array_init += master_list[idx][column]
        array_list.append(np.array(value_array_init))
    # for each value in the array list, horizontally stack each value
    all_array = np.hstack(array_list).reshape((total_counter,
                                               len(columns_to_array),
                                               time_steps)).transpose(0, 2, 1)
    target_array_all = np.array(target_list).reshape(len(target_list), 1)
    # should probably make this an if condition later after a few more tests
    print('check of length of arrays', len(all_array), len(target_array_all))
    return all_array, target_array_all
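A hypothetical call of the function above (the 'date' and 'target' column names are whatever the frame actually contains; the names used here are assumptions, not taken from the sample data earlier):
# hypothetical usage; df must contain 'date' and 'target' columns as the function expects
all_array, target_array_all = data_to_array_steps(df, time_steps=3,
                                                  columns_to_array=['Value1', 'Value2'],
                                                  id_column='Id')
# all_array: (n_samples, time_steps, n_features) for the LSTM
# target_array_all: (n_samples, 1)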

Python: How to iterate over rows and calculate value based on previous row

I have sales data till Jul-2020 and want to predict the next 3 months using a recovery rate.
This is the dataframe:
test = pd.DataFrame({'Country':['USA','USA','USA','USA','USA'],
                     'Month':[6,7,8,9,10],
                     'Sales':[100,200,0,0,0],
                     'Recovery':[0,1,1.5,2.5,3]
                     })
This is how it looks:
Now, I want to add a "Predicted" column, resulting in this dataframe:
The first value, 300 at row 3, is basically (200 * 1.5 / 1). This will be our base value going ahead, so the next value, i.e. 500, is basically (300 * 2.5 / 1.5), and so on.
How do I iterate over every row, starting from row 3 onwards? I tried using shift() but couldn't iterate over the rows.
You could do it like this:
import pandas as pd
test = pd.DataFrame({'Country':['USA','USA','USA','USA','USA'],
                     'Month':[6,7,8,9,10],
                     'Sales':[100,200,0,0,0],
                     'Recovery':[0,1,1.5,2.5,3]
                     })
test['Prediction'] = test['Sales']
for i in range(1, len(test)):
    # prevent division by zero
    if test.loc[i-1, 'Recovery'] != 0:
        test.loc[i, 'Prediction'] = test.loc[i-1, 'Prediction'] * test.loc[i, 'Recovery'] / test.loc[i-1, 'Recovery']
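As a quick sanity check, tracing the loop by hand on the example frame:
# row 2: 200 * 1.5 / 1.0 = 300
# row 3: 300 * 2.5 / 1.5 = 500
# row 4: 500 * 3.0 / 2.5 = 600
# so test['Prediction'] ends up as [100, 200, 300, 500, 600]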
The sequence you have is straight up just Recovery * base level (Sales = 200)
You can compute that sequence like this:
valid_sales = test.Sales > 0
prediction = (test.Recovery * test.Sales[valid_sales].iloc[-1]).rename("Predicted")
And then combine by index, insert column or concat:
pd.concat([test, prediction], axis=1)

How to save output in .csv after every loop without overwriting in Pandas?

I want to save my output to .csv. When I run my while loop and save the output, only the last iteration is saved; it does not save the values from all iterations.
Also, I want to skip rows with zero values when printing my output.
This is my code:
import pandas as pd  # pandas library
sample = pd.DataFrame(pd.read_csv("Sample.csv"))  # importing .csv as pandas DataFrame
i = 0
while (i <= 23):
    print('Value for', i)  # i value
    sample2 = (sample[sample['Hour'] == i])  # data for every hour
    sample3 = (sample2[(sample2['GHI']) == (sample2['GHI'].max(0))])  # max value from sample2 DataFrame
    sample3 = sample3.loc[sample3.ne(0).all(axis=1)]  # ignoring all rows having zero values
    print(sample3)  # print sample3
    sample3.to_csv('Output.csv')  # trying to save the output after every iteration
    i = i + 1
Another way of doing what you want to do is to get rid of your loop, like this:
sample_with_max_ghi = sample.assign(max_ghi=sample.groupby('Hour')['GHI'].transform('max'))
sample_filtered = sample_with_max_ghi[sample_with_max_ghi['GHI'] == sample_with_max_ghi['max_ghi']]
output_sample = sample_filtered.loc[sample_filtered.ne(0).all(axis=1)].drop('max_ghi', axis=1)
output_sample.to_csv('Output.csv')
Some explanations :
1.
sample_with_max_ghi = sample.assign(max_ghi=sample.groupby('Hour')['GHI'].transform('max'))
This line adds a new column to your dataframe containing the max of the GHI column for its Hour group
2.
sample_filtered = sample_with_max_ghi[sample_with_max_ghi['GHI'] == sample_with_max_ghi['max_ghi']]
This line filters only rows where the GHI value is actually the max of its Hour group
3.
output_sample = sample_filtered.loc[sample_filtered.ne(0).all(axis=1)].drop('max_ghi', axis=1)
And apply the last filter to get rid of the rows with 0 values
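A terser variant of the same idea (a sketch; note that idxmax keeps only one row per Hour even when the maximum GHI is tied, so it is not strictly identical to the filter above):
# pick the row with the maximum GHI within each Hour, then drop rows containing zeros
top_rows = sample.loc[sample.groupby('Hour')['GHI'].idxmax()]
top_rows = top_rows.loc[top_rows.ne(0).all(axis=1)]
top_rows.to_csv('Output.csv', index=False)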
Alternatively, while the loop is running, add the loop value to the csv file name at every iteration; this makes each file name unique and solves your problem, e.g.:
import pandas as pd  # pandas library
sample = pd.DataFrame(pd.read_csv("Sample.csv"))  # importing .csv as pandas DataFrame
i = 0
while (i <= 23):
    print('Value for', i)  # i value
    sample2 = (sample[sample['Hour'] == i])  # data for every hour
    sample3 = (sample2[(sample2['GHI']) == (sample2['GHI'].max(0))])  # max value from sample2 DataFrame
    sample3 = sample3.loc[sample3.ne(0).all(axis=1)]  # ignoring all rows having zero values
    print(sample3)  # print sample3
    sample3.to_csv(str(i) + 'Output.csv')  # save to a uniquely named output file after every iteration
    i = i + 1
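Another option is to keep a single output file and append to it inside the loop instead of overwriting it; a sketch using to_csv with mode='a' (the header is written only when the file does not exist yet):
import os
import pandas as pd

sample = pd.read_csv("Sample.csv")
out_path = 'Output.csv'
if os.path.exists(out_path):
    os.remove(out_path)  # start from a clean file

for i in range(24):
    hourly = sample[sample['Hour'] == i]
    best = hourly[hourly['GHI'] == hourly['GHI'].max()]
    best = best.loc[best.ne(0).all(axis=1)]  # skip rows containing zero values
    best.to_csv(out_path, mode='a', header=not os.path.exists(out_path), index=False)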

Pandas Parse DataFrame Field and Maintain ID Field

I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
Asterisks are the delimiter
Numbers immediately before asterisks identify the length of the following block
Three indicators
C indicates field names will follow
N indicates new field values will follow
O indicates old field values will follow
Numbers immediately after indicators (tough because they are next to numbers before asterisks) identify how many field names or values will follow
The parsing logic and code works on a single pandas series. Therefore, it is less important to understand that than it is to understand applying the logic/code to a dataframe.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame(
    {'field_name': field_names_list,
     'new_value': new_values_list,
     'old_value': old_values_list
     })
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
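One way to keep row_id would be to wrap the single-string logic above in a helper and concatenate the per-row results; a sketch (parse_record is a hypothetical name for such a wrapper returning the three lists):
import pandas as pd

frames = []
for row_id, record in zip(df['row_id'], df['string']):
    # parse_record is a hypothetical helper wrapping the slicing logic shown above
    field_names, new_values, old_values = parse_record(record)
    part = pd.DataFrame({'field_name': field_names,
                         'new_value': new_values,
                         'old_value': old_values})
    part.insert(0, 'row_id', row_id)  # carry the id alongside the parsed rows
    frames.append(part)

result = pd.concat(frames, ignore_index=True)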
