So I have a rather large file that is broken down like this:
Claim     CPT Code   TOTAL_ALLOWED   CPT_CODE   NEW_PRICE   ALLOWED_DIFFERENCE
6675647   90887      120             90887      153         difference
The thing is, in my data set the existing already-paid data is 47K lines long, yet we are paying only 20 CPT codes. How would I use Pandas/NumPy to have Python look at the CPT code, find its match, and compare the TOTAL_ALLOWED with the NEW_PRICE to determine what is ultimately owed?
I think I have it with this, but I'm having an issue with having Python iterate through my list:
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE'])*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15))
but so far, it's giving me an error that the rows don't match.
Any help is appreciated!
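There is a small formatting error: the closing parenthesis of np.where comes right after the condition, so the other two arguments end up outside the call. Try this: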
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE']*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15)))
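To illustrate the corrected call, here is a minimal sketch on a made-up three-row frame. The column names and the multiplier 15 come from the snippet above; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (invented for illustration); column names follow the snippet above.
df = pd.DataFrame({
    "LINE_TOTAL_ALLOWED": [150.0, 120.0, 300.0],
    "NEW_PRICE": [10.0, 10.0, 15.0],
})

expected = df["NEW_PRICE"] * 15  # 150, 150, 225

# All three arguments (condition, value if true, value if false)
# sit inside the np.where(...) call.
df["price_difference"] = np.where(
    df["LINE_TOTAL_ALLOWED"] == expected,
    0,
    df["LINE_TOTAL_ALLOWED"] - expected,
)
print(df)
```

Since the subtraction already yields 0 wherever the two sides are equal, a plain `df["LINE_TOTAL_ALLOWED"] - expected` should give the same column here; np.where only becomes necessary when the two branches differ in some other way.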
I did what Clegane mentioned:
final = df1.merge(df3, how='left' , left_on = 'CLAIM_ID' , right_on= 'QUANTITY')
df2 = df1.drop_duplicates(keep = 'first')
Then I dropped the duplicates. I first did this on only 20 lines of Excel; then, after I made sure it worked, I let it loose on my 945000-line .xlsx. Everything worked, and everything lined up. It was daunting...
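For readers who want the CPT lookup from the question spelled out, here is a rough sketch of the merge-based approach. It is not the exact code used above: the frame names `claims` and `prices` are hypothetical, the first claim row reuses the sample values from the question, and the remaining rows are invented.

```python
import pandas as pd

# Hypothetical paid-claims data: many lines, few distinct CPT codes.
claims = pd.DataFrame({
    "CLAIM_ID": [6675647, 6675648, 6675649],
    "CPT_CODE": ["90887", "90887", "90837"],
    "TOTAL_ALLOWED": [120.0, 153.0, 100.0],
})

# Hypothetical price table: one row per CPT code.
prices = pd.DataFrame({
    "CPT_CODE": ["90887", "90837"],
    "NEW_PRICE": [153.0, 110.0],
})

# A left merge keeps every claim line and attaches the matching new price.
# validate="many_to_one" raises if the price table contains duplicate codes,
# which would otherwise silently multiply claim rows.
merged = claims.merge(prices, how="left", on="CPT_CODE", validate="many_to_one")
merged["ALLOWED_DIFFERENCE"] = merged["NEW_PRICE"] - merged["TOTAL_ALLOWED"]
print(merged)
```

With a one-row-per-code price table the merge cannot create duplicate claim lines, so a later drop_duplicates step may not be needed.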
I am trying to search through a pandas dataframe row by row and see if 3 variables are in the name of the file. If they are in the name of the file, more variables are extracted from that same row. For instance, I am checking to see if the concentration, substrate and number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment nothing is being saved for the Actual_Frame_Rate and I don't know why. I have attached the pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I am unsure why this helped, but I fixed it by combining everything into one string and matching on that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am still unsure why the other method does not work.
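One likely reason the first version misbehaves is how Python evaluates `a and b and c in s`: `in` binds tighter than `and`, and a non-empty string is always truthy, so only the last value is actually tested against the file name. The sketch below uses invented stand-in values (not the real spreadsheet or path) to show the difference.

```python
# Hypothetical values standing in for one spreadsheet row and the file name.
concentration, substrate, droplets = "5", "glass", "-3"
path_ext = "water(10%)-steel-3.avi"

# This parses as: concentration and substrate and (droplets in path_ext).
# The first two operands are non-empty strings, hence truthy, so the result
# depends only on whether "-3" occurs in path_ext.
loose = bool(concentration and substrate and droplets in path_ext)

# Each value needs its own membership test.
strict = all(part in path_ext for part in (concentration, substrate, droplets))

print(loose)   # True
print(strict)  # False
```

Even the `strict` form can match too much (for example "5" is a substring of "15%"), which may be why building the full expected string, as in the fix above, turned out to be the more reliable check.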
I want to make a loop to create a plot for each corresponding column in 2 different csv files such that column 1 in csv A and column 1 in csv B are plotted together with the same timestamp (pulled from another csv). I do not think I will have trouble when I modify my code to create the loop, but I have to get matplotlib to work for the first column before trying to construct a loop.
I have already tried checking to make sure that the correct data is being passed into the function and that is in the correct order. For example, I printed the zipped array as a list (t_array, b_array) and checked my csv files to verify that the data was in the correct order. I have also tried modifying the axes, ticks, and zoom to no avail. I have tried checking the helper functions which I lifted from my other projects and they all work as expected.
def double_plot():
    before = read_file(r_before)
    after = read_file(r_after)
    time = read_file(timestamp)
    if len(before) == len(after):
        b_array = np.asarray(before[1])
        a_array = np.asarray(after[1])
        t_array = np.asarray(time[1])
        plt.plot(t_array, b_array)
        plt.plot(t_array, a_array)
        plt.show()
    else:
        print(len(before))
        print(len(after))
        print("dimension failure")
read_file() is a helper function that reads csv files and saves the columns to a dictionary, with the first column stored under key 1 and so on across the columns. I know I should probably change it to start indexing at 0, but this is a problem for later...
[Images: the plot I would like, and the plot my code is actually producing]
Thank you for your time. This is my first time posting so I apologize if something I did was incorrect. I did attempt to find the answer before posting.
Edits: data sample; read_file()
screenshot of excel
def read_file(read_file):
    data = {}
    with open(read_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            col_num = 0
            for col in row:
                col_num += 1
                if col_num in data:
                    data[col_num].append(col)
                else:
                    ls = col
                    ls = [ls]
                    data[col_num] = ls
    return data
edit again: ^ it's much better to use pandas, but I am leaving this here because it's funny after seeing it done with dataframes
The arrays I was using with the plot function contained strings rather than floats.
These links explain the problem along with multiple ways to fix it:
Matplotlib y axis values are not ordered
In Python, how do I convert all of the items in a list to floats?
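As a small illustration of the fix, the sketch below converts a list of strings (which is what csv.reader yields) into floats before plotting. The sample values are invented; in the code above the same conversion would apply to before[1], after[1] and time[1].

```python
import numpy as np

# Values as csv.reader returns them: strings, not numbers (invented sample).
raw = ["10.5", "9.75", "100.0", "2"]

as_strings = np.asarray(raw)               # stays a Unicode string array
as_floats = np.asarray(raw, dtype=float)   # parsed into float64

print(as_strings.dtype.kind)  # 'U' for Unicode strings
print(as_floats)
```

When matplotlib receives strings it treats them as category labels and places them in order of appearance, which is why the axis looks unordered; passing the float array gives a normal numeric axis.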
I want to shade every other column, excluding the 1st row/header, with grey. I read through the documentation for XlsxWriter and was unable to find any example of this. I also searched through the tag here and couldn't find anything.
Why not set it up as a conditional format?
http://xlsxwriter.readthedocs.org/example_conditional_format.html
You should just declare a condition like "if the cell's row number % 2 == 0".
I wanted to post the details on how I did this, and how I was able to do it dynamically. It's kinda hacky, but I'm new to Python and I just needed this to work for right now.
xlsW = pd.ExcelWriter(finalReportFileName)
rptMatchingDoe.to_excel(xlsW, sheet_name='Room Counts Not Matching', index=False)
workbook = xlsW.book
rptMatchingSheet = xlsW.sheets['Room Counts Not Matching']
formatShadeRows = workbook.add_format({'bg_color': '#a9c9ff',
                                       'font_color': 'black'})
# Shade every even-numbered row; row 1 (the header) is odd, so it stays unshaded.
rptMatchingSheet.conditional_format('A1:' + xlsAlpha[rptMatchingDoeColCount] + matchingCount,
                                    {'type': 'formula',
                                     'criteria': '=MOD(ROW(),2) = 0',
                                     'format': formatShadeRows})
# close() writes the file; ExcelWriter.save() is deprecated in newer pandas versions.
xlsW.close()
xlsAlpha is a list that contains the maximum number of columns my report could possibly have. My first three columns are always consistent, so I just set rptMatchingDoeColCount equal to 2, and then when I loop through the list to build my query I increment the count. The matchingCount variable is just a fetchone() result from a count(*) query on the view I'm pulling from in the database.
Eventually I think I will write a function to replace the hardcoded list assigned to xlsAlpha, so that it can handle a virtually unlimited number of columns.
If anyone has any suggestions on how I could improve this feel free to share.
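One possible replacement for the hardcoded xlsAlpha list is a small function that converts a zero-based column index to Excel column letters. The sketch below is a plain-Python version; the name `column_letters` is made up for illustration. XlsxWriter also ships a helper, `xlsxwriter.utility.xl_col_to_name`, that does the same job.

```python
def column_letters(index):
    """Convert a zero-based column index to Excel letters (0 -> 'A', 26 -> 'AA')."""
    if index < 0:
        raise ValueError("column index must be non-negative")
    letters = ""
    index += 1  # work 1-based: Excel column names have no zero digit
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Building a range string the way the snippet above does,
# with a hypothetical last column index of 2 and 100 rows.
cell_range = "A1:" + column_letters(2) + str(100)
print(cell_range)  # A1:C100
```

Alternatively, conditional_format() in XlsxWriter also accepts numeric (first_row, first_col, last_row, last_col) arguments, which avoids building the letter string at all.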
I'm having trouble with processing some csv data files for a project. Someone suggested using python/csv reader to help break down the files, which I've had some success with, but not in a way I can use.
This code is a little different from what I was trying before. I am essentially attempting to create an array. In the raw data format, the first 7 rows contain no data, and then each column contains 50 experiments, each with 4000 rows, for 200000 some rows total. What I want to do is take each column, and make it an individual csv file, with each experiment in its own column. So it would be an array of 50 columns and 4000 rows for each data type. The code here does break down the correct values, and I think the logic is okay, but the output is the opposite of what I want. I want the separators (the commas and spaces) without quotes, and I want the element values in quotes. Right now it is doing just the opposite for both: element values with no quotes, and the separators in quotes. I've spent several hours trying to figure out how to do this, to no avail.
import csv
ifile = open('00_follow_maverick.csv')
epistemicfile = open('00_follower_maverick_EP.csv', 'w')
reader = csv.reader(ifile)
colnum = 0
rownum = 0
y = 0
z = 8
for column in reader:
    rownum = 4000 * y + z
    for element in column:
        writer = csv.writer(epistemicfile)
        if y <= 50:
            y = y + 1
            writer.writerow([element])
            writer.writerow(',')
            rownum = x * y + z
        if y > 50:
            y = 0
            z = z + 1
            writer.writerow(' ')
            rownum = x * y + z
        if z >= 4008:
            break
What is going on: I am taking each row in the raw data file in iterations of 4000, so that I can separate them with commas for the 50 experiments. When y, the experiment indicator here, reaches 50, it resets back to experiment 0, and adds 1 to z, which tells it which row to look at, by the formula of 4000 * y + z. When it completes the rows for all 50 experiments, it is finished. The problem here is that I don't know how to get python to write the actual values in quotes, and my separators outside of quotes.
Any help will be most appreciated. Apologies if this seems a stupid question, I have no programming experience, this is my first attempt ever. Thank you.
Sorry, I'll try to make this more clear. The original csv file has several columns, each of which is a different set of data.
A miniature example of the raw file looks like:
column1 column2 column3
exp1data1time1 exp1data2time1 exp1data3time1
exp1data1time2 exp1data2time2 exp1data3time2
exp2data1time1 exp2data2time1 exp2data3time1
exp2data1time2 exp2data2time2 exp2data3time2
exp3data1time1 exp3data2time1 exp3data3time1
exp3data1time2 exp3data2time2 exp3data3time2
So, the actual version has 4000 rows instead of 2 for each new experiment. There are 40 columns in the actual version, but basically, the data type in the raw file matches the column number. I want to separate each data type or column into an individual csv file.
This would look like:
csv file1
exp1data1time1 exp2data1time1 exp3data1time1
exp1data1time2 exp2data1time2 exp3data1time2
csv file2
exp1data2time1 exp2data2time1 exp3data2time1
exp1data2time2 exp2data2time2 exp3data2time2
csv file3
exp1data3time1 exp2data3time1 exp3data3time1
exp1data3time2 exp2data3time2 exp3data3time2
So, I'd move the raw data in the file to a new column, and each data type to its own file. Right now I'm only going to do one file, until I can move the separate experiments to separate columns in the new file. So, in the code, the above would make the 4000 into 2. I hope this makes more sense, but if not, I will try again.
If I had a cat for each time I saw a bio or psych or chem database in this state:
"each column contains 50 experiments,
each with 4000 rows, for 200000 some
rows total. What I want to do is take
each column, and make it an individual
csv file, with each experiment in its
own column. So it would be an array of
50 columns and 4000 rows for each data
type"
I'd have way too farking many cats.
I didn't even look at your code because the re-mangling you are proposing is just another problem that will have to be solved. I don't fault you, you claim to be a novice and all your peers make the same sort of error. Beginning programmers who have yet to understand how to use arrays often wind up with variable declarations like:
integer response01, response02, response03, response04, ...
and then very, very redundant code when they try to see if every response is - say - 1. I think this is such a seductive error in bio-informatics because it actually models the paper notations they come from rather well. Unfortunately, the sheet-of-paper model isn't the best way to model data.
You should read and understand why database normalization was developed, codified and has come to dominate how people think about structured data. One Wikipedia article may not be sufficient. Using the example I excerpted let me try to explain how I think of it. Your data consists of observations; put the other way the primary datum is a singular observation. That observation has a context though: it is one of a set of 4000 observations, where each set belongs to one of 50 experiments. If you had to attach a context to each observation you'd wind up with an addressing scheme that looks like:
<experiment_number, observation_number, value>
In database jargon, that's a tuple, and it is capable of representing, with no ambiguity and perfect symmetry the entirety of your data. I'm not certain that I've understood the exact structure of your data, so perhaps it is something more like:
<experiment_number, protocol_number, observation_number, value>
where the protocol may be some form of variable treatment type - let's say pH. But note that I didn't call the protocol a pH and I don't record it as such in the database. What I would then need is an ancillary table showing the relevant parameters of the protocol, e.g.:
<protocol_number, acidity, temperature, pressure>
Now we've just built a "relation" that those database people like to talk about; we've also begun normalizing the data. If you need to know the pH for a given protocol, there is one and only one place to find it: in the proper row of the protocol table. Note that I've divorced the data that fit so nicely together on a data-sheet, and from the observation table alone I can't see the pH for a particular datum. But that's okay, because I can just look it up in my protocol table if needed. This is a "relational join", and if I needed to, I could coalesce all the various parameters from all the various tables and reconstitute the original datasheet in its original, unstructured glory.
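To make the idea of a relational join concrete, here is a minimal sketch using pandas. The column names follow the tuples above; every value is invented, and pandas is just one convenient way to demonstrate it (a SQL database would express the same thing with a JOIN).

```python
import pandas as pd

# Observation tuples:
# <experiment_number, protocol_number, observation_number, value>
observations = pd.DataFrame(
    [(1, 1, 1, 0.42), (1, 1, 2, 0.57), (2, 2, 1, 0.91)],
    columns=["experiment_number", "protocol_number", "observation_number", "value"],
)

# Ancillary protocol table:
# <protocol_number, acidity, temperature, pressure>
protocols = pd.DataFrame(
    [(1, 7.0, 298.0, 101.3), (2, 5.5, 310.0, 101.3)],
    columns=["protocol_number", "acidity", "temperature", "pressure"],
)

# The join looks up each observation's protocol parameters by protocol_number,
# reconstituting the wide "data-sheet" view on demand.
joined = observations.merge(protocols, on="protocol_number", how="left")
print(joined)
```

The acidity of protocol 1 is stored once in `protocols` yet appears on both of its observations after the join, which is the point of keeping it in a single place.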
I hope this answer is of some use to you. I'm certain that I don't even know what field of study your data is from, but these principles apply across domains from drug trials to purchase requisition processing. Please understand that I'm trying to inform, per your request, and there is zero condescension intended. I welcome further questions on the matter.
Normalization of the dataset
Thanks for giving the example. You have the context I described already, perhaps I can make it more clear.
column1 column2 column3
exp1data1time1 exp1data2time1 exp1data3time1
exp1data1time2 exp1data2time2 exp1data3time2
The columns are an artifice made by the last guy; that is, they carry no relevant information. When parsed into a normal form, your data looks just like my first proposed tuple:
<experiment_number, time, response_number, response>
where I suspect time may actually mean "subject_id" or "trial_number". It may very well look incongruous to you to conjoin all the different response values into the same dataset; indeed, based on your desired output, I suspect that it does. At first blush, the objection "but the subject's response to a question about epistemic properties of chairs has no connection to their meta-epistemic beliefs regarding color" seems reasonable, but it would be mistaken. The data are related because they have a common experimental subject, and self-correlation is an important concept in sociological analytics.
For example, you may find that respondent A gives the same responses as respondent B, except all of A's responses are biased one higher because of how the subject understood the criteria. This would make a very real difference in the absolute values of the data, but I hope you can see that the question "do A and B actually have different epistemic models?" is salient and valid. One method of data modeling allows this question to be answered easily, your desired method does not.
Working parsing code to follow shortly.
The normalizing code
#!/usr/bin/env python3
"""Parses a CSV file containing a particular data layout and normalizes it.

The raw data set is a CSV file of the form::

    column1            column2            column3
    exp01data01time01  exp01data02time01  exp01data03time01
    exp01data01time02  exp01data02time02  exp01data03time02

where there are 40 such columns and the literal column title
is added as context to the output row.

It is assumed that the columns are comma separated, but
the lexical form of the subcolumns is unspecified.

Output will consist of a single CSV output stream
on stdout of the form::

    exp01,data01,time01,column1

for varying actual values of each field.
"""

import csv
import sys


def split_subfields(s):
    """Returns a list of subfields of s.

    This function is expected to be rewritten to match the actual,
    unspecified lexical structure of s."""
    return [s[0:5], s[5:11], s[11:17]]


def normalise_data(reader, writer):
    """Writes one normalized row to writer for each datum read from reader."""
    # obtain the headings for use in normalization
    names = next(reader)

    # get the data rows, split them out by column, add the column name
    for row in reader:
        for column, datum in enumerate(row):
            fields = split_subfields(datum)
            fields.append(names[column])
            writer.writerow(fields)


def main():
    if len(sys.argv) != 2:
        print('usage: %s input.csv' % sys.argv[0], file=sys.stderr)
        sys.exit(1)

    in_file = sys.argv[1]

    # newline='' is what the csv module documentation recommends for input files
    with open(in_file, newline='') as f:
        reader = csv.reader(f)
        writer = csv.writer(sys.stdout)
        normalise_data(reader, writer)


if __name__ == '__main__':
    main()
Such that the command python epistem.py raw_data.csv > cooked_data.csv yields excerpted output looking like:
exp01,data01,time01,column1
...
exp01,data40,time01,column40
exp01,data01,time02,column1
exp01,data01,time03,column1
...
exp02,data40,time15,column40
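Once the data is in this normalized form, the layout requested in the question (one table per data type, one column per experiment) can be produced on demand. The sketch below uses pandas and assumes each normalized row also carries a numeric value field, which the excerpt above does not show; the column names and all values are invented for illustration.

```python
import pandas as pd

# Invented normalized rows: <experiment, data_type, time, value>.
rows = [
    ("exp1", "data1", "time1", 1.0),
    ("exp1", "data1", "time2", 2.0),
    ("exp2", "data1", "time1", 3.0),
    ("exp2", "data1", "time2", 4.0),
    ("exp1", "data2", "time1", 5.0),
    ("exp1", "data2", "time2", 6.0),
    ("exp2", "data2", "time1", 7.0),
    ("exp2", "data2", "time2", 8.0),
]
tidy = pd.DataFrame(rows, columns=["experiment", "data_type", "time", "value"])

# One wide table per data type: rows are time points, columns are experiments.
wide_tables = {
    data_type: group.pivot(index="time", columns="experiment", values="value")
    for data_type, group in tidy.groupby("data_type")
}

for data_type, table in wide_tables.items():
    print(data_type)
    print(table.to_csv())
```

Passing a file name to `table.to_csv(...)` instead of printing would write each data type to its own csv file, while the normalized rows remain the single source of truth.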