Pandas: tricky interval selection based on varying ranges

I'm relatively new to coding and have a tricky (to me) logic determination sequence that I assume can be simplified from what I currently have. I haven't been able to find anything similar that I can understand at my current level and adapt accordingly.
I have a data frame containing a list of ~200 wells. Each well has a perforated casing that exposes it to groundwater over a certain depth range (the open interval), and the length/depth of the open interval varies from well to well. Based on the open interval, I need to determine which layers within a groundwater model should be associated with each individual well (the location within the model is given by row/column values) and then append that information to an array for export in text format that will be fed back into the model (I've got that part). Furthermore, the number of layers could potentially increase, so ideally the code could adapt to any number of layers. I would always have some form of the example data set below; if the layers increase, the data set with those new values would be provided by the model. The well open intervals themselves will not change, but the layers each open interval falls within could.
Example dataset:
import pandas as pd
# NOTE: 'Row' and 'Column' were not shown in the original snippet but are referenced by the
# script and expected output below; HA323's values here are placeholders.
df = pd.DataFrame({'well_ID': ['GR800', 'HA009', 'HA219', 'HA323', 'HA463'],
                   'Top_open_int': [4450.0, 4530.0, 4390.0, 3900.0, 4140.0],  # top of open interval
                   'Bot_open_int': [4110.0, 3800.0, 4250.0, 3750.0, 3650.0],  # bottom of open interval
                   'Top_1': [4500.0, 4550.0, 4100.0, 4200.0, 4150.0],  # top of layer 1
                   'Bot_1': [4300.0, 4250.0, 3900.0, 4050.0, 3900.0],  # bottom of layer 1
                   'Bot_2': [4100.0, 3900.0, 3750.0, 3850.0, 3750.0],  # bottom of layer 2
                   'Bot_3': [3820.0, 3650.0, 3520.0, 3650.0, 3570.0],  # bottom of layer 3
                   'Bot_4': [3360.0, 3480.0, 3300.0, 3380.0, 3350.0],  # bottom of layer 4
                   'Row': [20, 45, 105, 60, 250],     # model row (HA323's value assumed)
                   'Column': [100, 10, 65, 30, 15]})  # model column (HA323's value assumed)
What I'm currently doing is something like below, where I'm writing out every possible boundary condition combination that could exist. If the number of layers increases, I have to add all the additional possible combinations to the script.
Current script approach:
# initiate an empty list
layers = []
# loop through every row, appending the appropriate layer entries when the interval matches
for well, row in df.iterrows():
    if row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_3']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
        layers.append((row['well_ID'], '3', row['Row'], row['Column']))
        layers.append((row['well_ID'], '4', row['Row'], row['Column']))
    elif row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_2']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
        layers.append((row['well_ID'], '3', row['Row'], row['Column']))
    elif row['Top_open_int'] >= row['Bot_1'] and row['Bot_open_int'] < row['Bot_1']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
        layers.append((row['well_ID'], '2', row['Row'], row['Column']))
    elif row['Top_open_int'] > row['Bot_1'] and row['Bot_open_int'] >= row['Bot_1']:
        layers.append((row['well_ID'], '1', row['Row'], row['Column']))
    # script continues for all possible combinations that the open interval could
    # potentially fall within. There doesn't seem to be a point in writing it all out here
If you run the code above you'll see the expected outcome. It would be a list of tuples like this, but for all the wells in the data set:
[('GR800', '1', 20, 100),
('GR800', '2', 20, 100),
('HA009', '1', 45, 10),
('HA009', '2', 45, 10),
('HA009', '3', 45, 10),
('HA219', '1', 105, 65),
('HA463', '1', 250, 15),
('HA463', '2', 250, 15),
('HA463', '3', 250, 15)]
Is there a way to simplify this approach and make it more robust so that it can adapt to changes in the number of layers?

I don't know if it helps but these 3 lines will do the work of that for loop:
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_3'])]
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_2'])]
df[(df['Top_open_int'] >= df['Bot_1']) & (df['Bot_open_int'] >= df['Bot_1'])]
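Separately, here is a more general sketch (not from the original question) that adapts to however many Bot_* layer columns are present. It assumes the rule implied by the expected output: the shallowest layer is the first one whose bottom falls below Top_open_int (so an interval starting above Top_1 still counts as layer 1), the deepest is the first one whose bottom falls below Bot_open_int, and everything in between is included. The helper name is mine, and it relies on the Row/Column columns shown in the example data frame:
import numpy as np

def assign_layers(df):
    # Layer-bottom columns ordered from layer 1 downward; extra layers
    # (Bot_5, Bot_6, ...) are picked up automatically.
    bot_cols = sorted((c for c in df.columns if c.startswith('Bot_') and c[4:].isdigit()),
                      key=lambda c: int(c[4:]))

    def first_layer_below(bottoms, elevation):
        # 1-based index of the first layer whose bottom is below `elevation`;
        # if none are, the elevation is below the model, so use the last layer.
        below = bottoms < elevation
        return int(np.argmax(below)) + 1 if below.any() else len(bottoms)

    layers = []
    for _, row in df.iterrows():
        bottoms = row[bot_cols].to_numpy(dtype=float)
        top_layer = first_layer_below(bottoms, row['Top_open_int'])
        bot_layer = first_layer_below(bottoms, row['Bot_open_int'])
        for k in range(top_layer, bot_layer + 1):
            layers.append((row['well_ID'], str(k), row['Row'], row['Column']))
    return layers

print(assign_layers(df))
With the example data this reproduces the tuples in the expected output above, plus entries for HA323, which the partial script never reaches.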

Related

Python csv file convert string to float

I have a csv file full of data, which is all type string. The file is called Identifiedλ.csv.
Here is some of the data from the csv file:
['Ref', 'Ion', 'ULevel', 'UConfig.', 'ULSJ', 'LLevel', 'LConfig.', 'LLSJ']
['12.132', 'Ne X', '4', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.132', 'Ne X', '3', '2p1', '2P3/2', '1', '1s1', '1S0']
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
What I would like to do is read the file and search for a number in the 'Ref' column, for example 12.846, and if the number I search for matches a number in the file, print the whole row containing that number.
e.g. something like:
csv_g = csv.reader(open('Identifiedλ.csv', 'r'), delimiter=",")
for row in csv_g:
    if 12.846 == (row[0]):
        print(row)
And it would return (hopefully)
['12.846', 'Fe XX', '58', '1s2.2s2.2p2.3d1', '4P5/2', '1', '1s2.2s2.2p3', '4S3/2']
However this returns nothing and I think it's because the 'Ref' column is type string and the number I search is type float. I'm trying to find a way to convert the string to float but am seemingly stuck.
I've tried:
df = pd.read_csv('Identifiedλ.csv', dtype = {'Ref': np.float64,})
and
array = b = np.asarray(array,
dtype = np.float64, order = 'C')
but am confused on how to incorporate this with the rest of the search.
Any tips would be most helpful! Thank you!
Python has a function to convert strings to floats. For example, the following evaluates to True:
float('3.14')==3.14
I would try this conversion while comparing the value in the first column.
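A minimal sketch of that suggestion (the filename and target value come from the question; skipping the header and guarding against non-numeric cells are my additions):
import csv

with open('Identifiedλ.csv', 'r') as f:
    reader = csv.reader(f, delimiter=",")
    next(reader)  # skip the header row ('Ref', 'Ion', ...)
    for row in reader:
        try:
            # convert the 'Ref' string to a float before comparing
            if float(row[0]) == 12.846:
                print(row)
        except ValueError:
            continue  # skip rows whose 'Ref' cell is not numeric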

Iterate over items in one column while referencing a tag in another column

Suppose I'm managing many stock brokerage accounts, and each account has different types of stock in it. I'm trying to write some code to perform a stress test.
What I'm trying to do is, I have 2 dataframes:
Account information (dataframe):
import pandas as pd

account = pd.DataFrame({'account': ['1', '1', '1', '2', '2'],
                        'Stock type': ['A', 'A', 'B', 'B', 'C'],
                        'share value': ['100', '150', '200', '175', '85']})
stress test scenario (dataframe):
test = pd.DataFrame({'stock type': ['A', 'B', 'C', 'D'],
                     'stress shock': ['0.8', '0.7', '0.75', '0.6']})
Given these 2 dataframes, I want to calculate for each account, what's the share value after the stress shock.
i.e. for account #1, after shock value = 100*0.8 + 150*0.8 + 200*0.7 = 340
I tried a basic for loop, but my Jupyter notebook soon crashes (out of memory) when I run it.
shocked = []
for i in range(len(account)):
    for j in range(len(test)):
        if account.loc[i, 'Stock type'] == test.loc[j, 'stock type']:
            shocked.append(account.loc[i, 'share value'] * test.loc[j, 'stress shock'])
We can first do a merge to get the data of the two dataframes together. Then we calculate the after shock value and finally get the sum of each account:
merge = account.merge(test, left_on='Stock type', right_on='stock type')
merge['after_stress_shock'] = pd.to_numeric(merge['share value']) * pd.to_numeric(merge['stress shock'])
merge.groupby('account')['after_stress_shock'].sum()
account
1 340.00
2 186.25
Name: after_stress_shock, dtype: float64
Note I used pandas.to_numeric since your values are in string type.
Create a Series to map "stock type" to "stress shock".
Then use pandas.groupby.apply with a lambda function for desired result:
stress_map = test.set_index('stock type')['stress shock'].astype(float)
account.groupby('account').apply(lambda x: (x['Stock type'].map(stress_map) * x['share value'].astype(float)).sum())
[output]
account
1 340.00
2 186.25
dtype: float64

Add Page Break before adding a Split with Flowable

I have an application that is using reportlab to build a document of tables. What I want to happen is when a flowable (in this case, always a Table) needs to split across pages, it should first add a page break. Thus, a Table should be allowed to split, but any table that is split should always start on a new page. There are multiple Tables in the same document, and if two can fit on the same page without splitting, there should not be a page break.
The closest I have gotten to this is to set allowSplitting to False when initializing the Document. However, the issue is that when a table exceeds the space it has to fit into, the build simply fails. What I'm looking for is for it to wrap to a new page instead of failing.
For instance, this will fail with an error about not having enough space:
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter, inch
from reportlab.platypus import SimpleDocTemplate, Table
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter, allowSplitting=False)
# container for the 'Flowable' objects
elements = []
data2 = []
data = [['00', '01', '02', '03', '04'],
        ['10', '11', '12', '13', '14'],
        ['20', '21', '22', '23', '24'],
        ['30', '31', '32', '33', '34']]
for i in range(100):
    data2.append(['AA', 'BB', 'CC', 'DD', 'EE'])
t1 = Table(data)
t2 = Table(data2)
elements.append(t1)
elements.append(t2)
doc.build(elements)
The first table (t1) fits fine, but t2 does not. If allowSplitting is left at its default, everything fits in the doc, but t1 and t2 start on the same page. Because t2 is longer than one page, I would like a page break to be added before it starts, and then for it to split across the following pages as needed.
One option is to make use of the document height and table height to calculate the correct placement of PageBreak() elements. Document height can be obtained from the SimpleDocTemplate object and the table height can be calculated with the wrap() method.
The example below inserts a PageBreak() if the available height is less than table height. It then recalculates the available height for the next table.
from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Table, PageBreak

doc = SimpleDocTemplate("simple_table_grid.pdf", pagesize=letter)

# Create multiple tables of various lengths.
tables = []
for rows in [10, 10, 30, 50, 30, 10]:
    data = [[0, 1, 2, 3, 4] for _ in range(rows)]
    tables.append(Table(data, style=[('BOX', (0, 0), (-1, -1), 2, (0, 0, 0))]))

# Insert PageBreak() elements at appropriate positions.
elements = []
available_height = doc.height
for table in tables:
    table_height = table.wrap(0, available_height)[1]
    if available_height < table_height:
        elements.extend([PageBreak(), table])
        if table_height < doc.height:
            available_height = doc.height - table_height
        else:
            available_height = table_height % doc.height
    else:
        elements.append(table)
        available_height = available_height - table_height

doc.build(elements)

Using Python's Higher Order Functions on a CSV

I have a csv containing ~45,000 rows, which equates to seven days' worth of data. It has been sorted by datetime, with the oldest record first.
This is a sample row once the csv has been passed into the csv module's DictReader:
{'end': '423', 'g': '2', 'endid': '17131', 'slat': '40.7', 'endname': 'Horchata', 'cid': '1', 'startname': 'Sriracha', 'startid': '521', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-01 00:00:00+00', 'elat': '40.7', 'dur': '801', 'meppy': '', 'birth_year': '1978'}
...and another:
{'end': '418', 'g': '1', 'endid': '17108', 'slat': '40.7', 'endname': 'Guacamole', 'cid': '1', 'startname': 'Cerveza', 'startid': '519', 'slon': '-73.9', 'usertype': 'Sub', 'stoptime': '2015-02-01 00:14:00+00', 'elong': '-73.9', 'starttime': '2015-02-02 00:00:00+00', 'elat': '40.7', 'dur': '980', 'meppy': '', 'birth_year': '1983'}
I recently wrote the code below. It runs through the csv (after it's been passed to DictReader). The code yields the first row of each new day, i.e. whenever the day changes, based on starttime:
import dateutil.parser

dayList = []

def first_ride(reader):
    for row in reader:
        starttime = dateutil.parser.parse(row['starttime'])
        if starttime.day not in dayList:
            dayList.append(starttime.day)
            yield row
        else:
            pass
My goal now is to produce a single list containing the value associated with birth_year from each of the seven records, i.e.:
[1992, 1967, 1988, 1977, 1989, 1953, 1949]
The catch is that I want to understand how to do it using Python's HOFs to the maximum extent possible (e.g. map/reduce, and likely filter), without the generator (currently used in my code), and without global variables. To eliminate the global variable, my guess is that each starttime's day will have to be compared to the one before, but not using the list as I currently have it set up. As a final FYI, I run Python 2.7.
I majorly appreciate any expertise donated.
You can just reduce the dayList into a list of birth_years:
reduce(lambda r, d: r + [d['birth_year']], dayList, [])
Or you can use a comprehension (preferred):
[d['birth_year'] for d in dayList]
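For the day-grouping itself, here is a sketch (not from the original answers) that uses reduce instead of the generator and avoids any global list. It assumes reader is the DictReader from the question and that the rows are sorted by starttime, and it works on both Python 2.7 and 3:
from functools import reduce  # reduce is also a builtin on Python 2.7
import dateutil.parser

def keep_first_of_day(acc, row):
    # acc is a (days_seen, first_rows) pair carried through the reduction
    days_seen, first_rows = acc
    day = dateutil.parser.parse(row['starttime']).day
    if day in days_seen:
        return acc
    return (days_seen | {day}, first_rows + [row])

days_seen, first_rows = reduce(keep_first_of_day, reader, (frozenset(), []))
birth_years = [int(r['birth_year']) for r in first_rows if r['birth_year']]
print(birth_years)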

Tab-delimited Python 3 .txt file reading

I am having a bit of trouble getting started on an assignment. We are issued a tab-delimited .txt file with 6 columns of data and around 50 lines of this data. I need help starting a list to store this data in for later recall. Eventually I will need to be able to list all the contents of any particular column and sort it, count it, etc. Any help would be appreciated.
Edit: I really haven't done much besides research on this kind of stuff. I know I'll be looking into csv, and I have done single-column .txt files before, but I'm not sure how to tackle this situation. How will I give names to the separate columns? How will I tell the program when one row ends and the next begins?
The dataframe structure in Pandas basically does exactly what you want. It's highly analogous to the data frame in R if you're familiar with that. It has built in options for subsetting, sorting, and otherwise manipulating tabular data.
It reads directly from csv and even automatically reads in column names. You'd call:
import pandas as pd

df = pd.read_csv(yourfilename,
                 sep='\t',  # makes it tab delimited
                 header=0)  # makes the first row the header row (header is 0-indexed)
Works in Python 3.
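A short sketch of the follow-up operations asked about (listing a column, sorting it, counting values); the file name 'data.txt' and column name 'col1' are placeholders:
import pandas as pd

df = pd.read_csv('data.txt', sep='\t', header=0)

col = df['col1']            # every value in one named column
print(col.sort_values())    # the column, sorted
print(col.value_counts())   # how many times each value appears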
Let's say you have a csv like the following.
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
1 2 3 4 5 6
You can read them into a dictionary like so:
>>> import csv
>>> reader = csv.DictReader(open('test.csv','r'), fieldnames= ['col1', 'col2', 'col3', 'col4', 'col5', 'col6'], dialect='excel-tab')
>>> for row in reader:
...     print(row)
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
{'col6': '6', 'col4': '4', 'col5': '5', 'col2': '2', 'col3': '3', 'col1': '1'}
But Pandas library might be better suited for this. http://pandas.pydata.org/pandas-docs/stable/io.html#csv-text-files
Sounds like a job better suited to a database. You could just use something like PostgreSQL's COPY FROM operation to import the CSV data into a table, then use Python + SQL for all your sorting, searching, and matching needs.
If you feel a real database is overkill, there are still options like SQLite and BerkeleyDB, which both have Python modules.
EDIT: BerkeleyDB is deprecated but anydbm is similar in concept.
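If SQLite seems like a good fit, a minimal sketch using only the standard library (the file name, table name, and six generic column names are assumptions):
import csv
import sqlite3

conn = sqlite3.connect(':memory:')  # use a file path instead to persist the data
conn.execute('CREATE TABLE data (c1, c2, c3, c4, c5, c6)')

with open('data.txt', newline='') as f:
    rows = csv.reader(f, delimiter='\t')
    next(rows)  # drop the header line, if the file has one
    conn.executemany('INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)', rows)

# sorting and counting is then plain SQL
for c1, count in conn.execute('SELECT c1, COUNT(*) FROM data GROUP BY c1 ORDER BY c1'):
    print(c1, count)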
I think using a db for 50 lines and 6 columns is overkill, so here's my idea:
from __future__ import print_function
import os
from operator import itemgetter

def get_records_from_file(path_to_file):
    """
    Read a tab-delimited file and return a
    list of dictionaries representing the data.
    """
    records = []
    with open(path_to_file, 'r') as f:
        # Use the first line to get names for columns
        fields = [e.lower() for e in f.readline().rstrip('\n').split('\t')]
        # Iterate over the rest of the lines and store records
        for line in f:
            record = {}
            for i, field in enumerate(line.rstrip('\n').split('\t')):
                record[fields[i]] = field
            records.append(record)
    return records

if __name__ == '__main__':
    path = os.path.join(os.getcwd(), 'so.txt')
    records = get_records_from_file(path)
    print('Number of records: {0}'.format(len(records)))
    s = sorted(records, key=itemgetter('id'))
    print('Sorted: {0}'.format(s))
For storing records for later use, look into Python's pickle library--that'll allow you to preserve them as Python objects.
Also, note I don't have Python 3 installed on the computer I'm using right now, but I'm pretty sure this'll work on Python 2 or 3.
