Python/Pandas to update prices already paid for certain codes

So I have a rather large file that is broken down like this:
Claim    CPT Code  TOTAL_ALLOWED  CPT_CODE  NEW_PRICE  ALLOWED_DIFFERENCE
6675647  90887     120            90887     153        difference
The thing is, for my data set, the existing already-paid data is 47K lines long, yet we are only paying 20 distinct CPT codes. How would I use Pandas/NumPy to have Python look at each CPT code, find its match, and compare the TOTAL_ALLOWED with the NEW_PRICE to determine what is ultimately owed?
I think I have it with this, but I'm having an issue with having Python iterate through my list:
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE'])*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15))
but so far, it's giving me an error that the rows don't match.
Any help is appreciated!

There is a small formatting error. Try this:
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE']*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15)))

I did what Clegane mentioned:
final = df1.merge(df3, how='left', left_on='CLAIM_ID', right_on='QUANTITY')
df2 = df1.drop_duplicates(keep='first')
Then I dropped the duplicates. I first did this on only 20 lines of Excel, then after I made sure it worked, I let it loose on my 945,000-line .xlsx. Everything worked, and everything lined up. It was daunting...
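For later readers, a minimal sketch of the merge-based approach described above (the file names and exact column set are assumptions, not from the original data):

import pandas as pd
import numpy as np

# Hypothetical inputs: 'claims.xlsx' holds the 47K paid lines,
# 'new_prices.xlsx' the 20-code fee schedule.
claims = pd.read_excel('claims.xlsx')      # columns: CLAIM_ID, CPT_CODE, LINE_TOTAL_ALLOWED
prices = pd.read_excel('new_prices.xlsx')  # columns: CPT_CODE, NEW_PRICE

# Attach the new price to every claim line with a matching CPT code.
merged = claims.merge(prices, on='CPT_CODE', how='left')

# Owed amount: zero when the allowed amount already matches the new price,
# otherwise the difference.
merged['ALLOWED_DIFFERENCE'] = np.where(
    merged['LINE_TOTAL_ALLOWED'] == merged['NEW_PRICE'],
    0,
    merged['LINE_TOTAL_ALLOWED'] - merged['NEW_PRICE'],
)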

Related

Python sort table on multiple columns

I am busy making a system that can sort some things from an Excel document; I have added a part of the document here: shorturl.at/DKNP7
It has the following inputs: Day, time, sort, number, gourmet/fondue, sort_exclusive.
I want to have this sorted as follows: it must contain the sum of each of the different types.
I have some code, but I doubt it is efficient; the start of the code is included below.
df = pd.read_excel('Example_excel.xlsm', sheet_name="INVOER")
gourmet = df[['Day', 'Time', 'Sort', 'number', 'Gourmet/Fondue', 'sort exclusive']]
gourmet1 = gourmet.dropna(subset=['Sort'], inplace=False) #if 'Sort' is not filled in it is dropped.
gourmet1.to_excel('test.xlsx', index=False, sheet_name='gourmet')
Maybe it needs to be split into two parts, where one part is 'exclusief' with 'sort exclusive' and another part is for 'populair' and 'deluxe' from the 'Sort' column (see the sketch after the code below).
Looking forward to your reply!
One of the things I have tried is to split it:
gourmet_pop_del = gourmet1.groupby(['Day', 'Sort', 'Gourmet/Fondue'])['number'].sum()
gourmet_pop_del = gourmet_pop_del.reset_index()
gourmet_pop_del.sort_values(by=['Day', 'Sort', 'Gourmet/Fondue'], inplace=True)
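A minimal sketch of the other half of the split described above, continuing from gourmet1 and gourmet_pop_del (the column name 'sort exclusive' is from the question; the sheet names are assumptions):

# The other half of the split: rows with a value in 'sort exclusive'.
gourmet_excl = gourmet1[gourmet1['sort exclusive'].notna()]
gourmet_excl = (gourmet_excl.groupby(['Day', 'sort exclusive'])['number']
                            .sum()
                            .reset_index()
                            .sort_values(by=['Day', 'sort exclusive']))

# Write both halves to separate sheets of one workbook.
with pd.ExcelWriter('test.xlsx') as writer:
    gourmet_pop_del.to_excel(writer, sheet_name='populair_deluxe', index=False)
    gourmet_excl.to_excel(writer, sheet_name='exclusief', index=False)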

Pandas Value Error (columns passed) No matter how many columns passed

I'm making a pandas df based on a list and a 3D list of lists. If you want to see the whole code, you can look here: https://github.com/Bigglesworth95/Fish-Food-Calculator/blob/main/foodDataScraper.py
but I will do my best to summarize below. (If you want to peruse the code and offer any recommendations, I will happily accept them; I'm a noob and I know I'm not good at this. :)
The lists I am using here are quite long. I don't know if that makes much of a difference, but I thought I would note it, since I won't be posting the full contents of the lists below for this reason.
The function to make the df is as follows:
def make_df():
    counter = 0
    nameLength = len(names)
    print('nameLength =', nameLength)
    for product in newTupledList:
        templist = []
        if counter <= nameLength:
            templist.append(names[counter])
            product.insert(0, templist)
            counter += 1
    df1 = pd.DataFrame(newTupledList, columns=['Name', 'Crude Protein', 'Crude Fat', 'Crude Fiber', 'Moisture', ...])
    return df1
newTupledList is a list that looks like this: [[['Crude Protein', '48%'], ['Crude Fat', '5.5%'], ['Crude Fiber', '0.5%'], ['Moisture', '6%'], ['Phosphorus', '0.1%']...]...]
Note that the first layer is all the products, the second is the individual product, and the third is all the nutritional values of all products, populated with data for the individual products and then a 0 for everything not relevant.
The length of names is 24; I don't know if that's relevant.
Now, the interesting issue here is that no matter how many columns I pass to the DataFrame, I get a ValueError. If I do nothing, I get a ValueError saying that I only passed 52 columns and needed 60. If I add 8 more columns, it says I passed 60 columns but needed 61. If I add one to that, it says I passed 61 columns but needed 60. And so on.
Has anyone ever seen anything like that happen before? What are some approaches I could take to debugging such a weird bug? Thanks.
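One way to corner a shifting column-count mismatch like this (a sketch, assuming newTupledList as described above) is to check whether every product row really has the same length before building the DataFrame:

from collections import Counter

# If every product row had the same length, this Counter would have one key.
# Several keys mean some products gained or lost entries (e.g. the name
# insert happening more than once across runs), which makes the required
# column count appear to shift.
row_lengths = Counter(len(product) for product in newTupledList)
print(row_lengths)

# List the first few rows whose length differs from the most common one.
expected = row_lengths.most_common(1)[0][0]
print([i for i, p in enumerate(newTupledList) if len(p) != expected][:10])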

Python - Looping through specific column in CSV file

I currently have my CSV file displaying in Python:
df = pd.read_csv(r"Desktop\Assignment\World Cup 2018.csv")
df.head()
Here I can see that my data has been opened and the unneeded columns have been removed. Now I want to use some variables named counterVal1 (and so on) to count the number of times a formation appears in a row.
for i in enumerate(df['home_formation']):
    if i == '4-2-3-1':
        counterVal1 += 1
    elif i == '4-1-4-1':
        counterVal2 += 1

performance = [counterVal1, counterVal2, counterVal3, counterVal4, counterVal5, counterVal6]
plt.bar(y_pos, performance, align='center', alpha=0.8)
However, I have printed off one of these values, and it appears that the data is not being searched through as shown above.
My question: How do I take in the data from the CSV file, take just the column related to the formation and loop through it?
You haven't really provided much here, so this is just guesswork. Based on your comment, I'm guessing this is what you want:
df['home_formation'].value_counts().plot(kind='bar')
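A likely reason the original loop never matched anything: enumerate() yields (index, value) tuples, so i is a tuple and never equals a formation string. A minimal sketch of the fixed loop (counter names taken from the question):

# Iterate the column values directly instead of enumerate's (index, value) tuples.
counterVal1 = counterVal2 = 0
for formation in df['home_formation']:
    if formation == '4-2-3-1':
        counterVal1 += 1
    elif formation == '4-1-4-1':
        counterVal2 += 1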

pandas groupby is returning two groups for the same unique id

I have a large pandas dataframe, where I am running group-by operations.
CHROM POS Data01 Data02 ......
1 ....................
1 ...................
2 ..................
2 ............
scaf_9 .............
scaf_9 ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM (i.e. scaf_9). But with very large data containing scaf_9, I am getting two groups for 2. There is no error message, and it is not affecting the computation; the issue is that when I write the data by group to a file, I get two groups for 2 (split unequally).
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and it works well with small data. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Among all the 2s, most are treated as integer objects and some (the later part, close to scaf_9) as string objects. Is this possible?
Sorry, I am only making assumptions here, and it is becoming impossible for me to know the origin of the problem.
Post edit:
I have also tried running sort_by(['CHROM']) before doing the groupby, but the problem still persists.
Any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas processes each group separately.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler: casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
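A small, self-contained reproduction of the underlying issue (made-up values): an int 2 and a str '2' print identically but are different group keys, so groupby returns two groups for the "same" id:

import pandas as pd

df = pd.DataFrame({'CHROM': [1, 1, 2, '2', 'scaf_9'],
                   'POS': [10, 20, 30, 40, 50]})
print(df.groupby('CHROM', sort=False).size())  # 2 and '2' are separate groups

# Casting the whole column to one type collapses them into a single group.
df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())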

Shade Every Other Row

I want to shade every other row, excluding the first row/header, with grey. I read through the documentation for XlsxWriter and was unable to find any example of this; I also searched through the tag here and couldn't find anything.
Why not set it up as a conditional format?
http://xlsxwriter.readthedocs.org/example_conditional_format.html
You should just declare a condition like "if the cell's row number % 2 == 0".
I wanted to post the details of how I did this, and how I was able to do it dynamically. It's kinda hacky, but I'm new to Python and I just needed this to work for right now.
xlsW = pd.ExcelWriter(finalReportFileName)
rptMatchingDoe.to_excel(xlsW, 'Room Counts Not Matching', index=False)
workbook = xlsW.book
rptMatchingSheet = xlsW.sheets['Room Counts Not Matching']
formatShadeRows = workbook.add_format({'bg_color': '#a9c9ff',
                                       'font_color': 'black'})
rptMatchingSheet.conditional_format('A1:' + xlsAlpha[rptMatchingDoeColCount] + matchingCount,
                                    {'type': 'formula',
                                     'criteria': '=MOD(ROW(),2) = 0',
                                     'format': formatShadeRows})
xlsW.save()
xlsAlpha is a list that contains the maximum number of columns my report could possibly have. My first three columns are always consistent, so I just set rptMatchingDoeColCount equal to 2, and then when I loop through the list to build my query, I increment the count. The matchingCount variable is just a fetchone() result from a count(*) query on the view I'm pulling from in the database.
Eventually I think I will write a function to replace the hardcoded list assigned to xlsAlpha, so that it can handle a virtually unlimited number of columns.
If anyone has any suggestions on how I could improve this feel free to share.
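One suggestion along those lines: xlsxwriter ships a column-letter helper, so the hardcoded xlsAlpha list can likely be dropped (a sketch reusing the variables from the code above):

from xlsxwriter.utility import xl_col_to_name

# xl_col_to_name(0) -> 'A', xl_col_to_name(26) -> 'AA', and so on,
# so no hardcoded alphabet list is needed to build the range string.
cell_range = 'A1:' + xl_col_to_name(rptMatchingDoeColCount) + str(matchingCount)
rptMatchingSheet.conditional_format(cell_range, {'type': 'formula',
                                                 'criteria': '=MOD(ROW(),2) = 0',
                                                 'format': formatShadeRows})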
