Pandas ValueError (columns passed) no matter how many columns passed - python

I'm making a pandas df based on a list and a 3D list of lists. If you want to see the whole code, you can look here: https://github.com/Bigglesworth95/Fish-Food-Calculator/blob/main/foodDataScraper.py
but I will do my best to summarize below. (If you want to peruse the code and offer any recommendations, I will happily accept, though. I'm a noob and I know I'm not good at this :)
The lists I am using here are quite long. I don't know if that makes much of a difference, but I thought I would note it, since I won't be posting the full contents of the lists below for this reason.
The function to make the df is as follows:
def make_df():
    counter = 0
    nameLength = len(names)
    print('nameLength =', nameLength)
    for product in newTupledList:
        templist = []
        if counter <= nameLength:
            templist.append(names[counter])
            product.insert(0, templist)
            counter += 1
    df1 = pd.DataFrame(newTupledList, columns=['Name', 'Crude Protein', 'Crude Fat', 'Crude Fiber', 'Moisture', ...])
    return df1
newTupledList is a list that looks like this: [[['Crude Protein', '48%'], ['Crude Fat', '5.5%'], ['Crude Fiber', '0.5%'], ['Moisture', '6%'], ['Phosphorus', '0.1%']...]...]
Note that the first layer is all the products, the second is the individual product, and the third is all the nutritional values of all products, populated with data for the individual products and then a 0 for everything not relevant.
The len of names is 24; I don't know if that's relevant.
Now, the interesting issue here is that, no matter how many columns I pass to the DataFrame, I get a ValueError. If I do nothing, I get a ValueError saying that I only passed 52 columns and needed 60. If I add 8 more columns, it says I passed 60 columns but needed 61. If I add one to that, it says I passed 61 columns but needed 60. And so on.
Has anyone ever seen anything like that happen before? What are some approaches I could take to debugging such a weird bug? Thanks.
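One thing worth checking first: a common cause of this error is rows of unequal length, and the shifting "passed X, need Y" numbers then depend on which rows changed. Note that each row here is the name list plus one entry per nutrient pair, so its length is the pair count plus one. A minimal debugging sketch, assuming newTupledList is already built as above:

from collections import Counter

# Every row must have the same number of fields for the DataFrame
# constructor; count the row lengths to see whether they vary.
print(Counter(len(row) for row in newTupledList))

If more than one length shows up, the fix belongs in the code that builds newTupledList (or in the loop that inserts the name field), not in the columns list.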

Related

New dataframe in Pandas based on specific values(a lot of them) from existing df

Good evening! I'm using pandas in a Jupyter Notebook. I have a huge dataframe representing the full history of posts of 26 channels in a messenger. It has a column "dialog_id" which represents the dialog in which the message was sent (so there can be only 26 unique values in the column, but there are more than 700k rows, and the df is sorted by time, not id, so it is kind of chaotic). I have to split this dataframe into 2 different ones (one will contain the full history of 13 channels, and the other will contain the history of the remaining 13 channels). I know the ids by which I have to split; they are random as well. For example, one is -1001232032465 and the other is -1001153765346.
The question is, how do I do it most elegantly and adequately?
I know I can do it somehow with df.loc[], but I don't want to write 13 lines of df.loc[]. I've tried to use logical operators for this, like:
df1.loc[(df["dialog_id"] == '-1001708255880') & (df["dialog_id"] == '-1001645788710')], but it doesn't work. I suppose I'm using them wrong. I expect a solution with any method creating a new df, with the use of logical operators. In verbal expression, I think it should sound like "put the row in a new df if the dialog_id is x, or dialog_id is y, or dialog_id is z, etc". Please help me!
The easiest way seems to be just setting up a query.
df = pd.DataFrame(dict(col_id=[1, 2, 3, 4], other=[5, 6, 7, 8]))
channel_groupA = [1, 2]
channel_groupB = [3, 4]
df_groupA = df.query(f'col_id == {channel_groupA}')
df_groupB = df.query(f'col_id == {channel_groupB}')
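An equivalent approach, and arguably more idiomatic for a membership test, is Series.isin, sketched here against the same toy frame. (As an aside, the original attempt fails because & requires both conditions to hold at once, and a single dialog_id can never equal two different values, so that mask is always False; | or a membership test is what's wanted.)

df_groupA = df[df['col_id'].isin(channel_groupA)]   # rows whose col_id is in group A
df_groupB = df[df['col_id'].isin(channel_groupB)]   # rows whose col_id is in group B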

Python/Pandas to update prices already paid for certain codes

So I have a rather large file that is broken down like this:
Claim    | CPT Code | TOTAL_ALLOWED | CPT_CODE | NEW_PRICE | ALLOWED_DIFFERENCE
6675647  | 90887    | 120           | 90887    | 153       | difference
The thing is, for my data set, the existing already-paid data is 47K lines long, yet we are paying only 20 distinct CPT codes. How would I use Pandas/NumPy to have Python look at each CPT code, find its match, and compare the TOTAL_ALLOWED with the NEW_PRICE to determine what is ultimately owed?
I think I have it with this, but I'm having an issue with having Python iterate through my list:
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE'])*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15))
but so far, it's giving me an error that the rows don't match.
Any help is appreciated!
There is a small formatting error. Try this:
df['price_difference'] = np.where(df['LINE_TOTAL_ALLOWED'] == ((df['NEW_PRICE']*15)), 0, df['LINE_TOTAL_ALLOWED'] - ((df['NEW_PRICE']*15)))
I did what Clegane mentioned:
final = df1.merge(df3, how='left', left_on='CLAIM_ID', right_on='QUANTITY')
df2 = df1.drop_duplicates(keep='first')
Then I dropped the duplicates. I first tried this on only 20 lines of the Excel file; after I made sure it worked, I let it loose on my 945,000-line .xlsx. Everything worked, and everything lined up. It was daunting...
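Putting the two answers together, here is a minimal end-to-end sketch of the merge-then-compare idea. The column names and the tiny frames are hypothetical stand-ins for the real file's headers and its 47K claim lines / 20-row price list:

import numpy as np
import pandas as pd

# Toy stand-ins: the claim lines on one side, the short price list on the other.
claims = pd.DataFrame({'CPT_CODE': [90887, 90834],
                       'LINE_TOTAL_ALLOWED': [120.0, 95.0]})
prices = pd.DataFrame({'CPT_CODE': [90887, 90834],
                       'NEW_PRICE': [10.2, 6.0]})

# Attach each claim line's new price by CPT code, then compute the difference.
merged = claims.merge(prices, on='CPT_CODE', how='left')
merged['price_difference'] = np.where(
    merged['LINE_TOTAL_ALLOWED'] == merged['NEW_PRICE'] * 15,
    0,
    merged['LINE_TOTAL_ALLOWED'] - merged['NEW_PRICE'] * 15,
)

The merge keeps every claim line and repeats the matching price, so both sides of the np.where comparison line up row for row, which avoids the "rows don't match" error you get when comparing differently sized frames.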

Pandas change range into int

In my df I have a salary_range column, which contains ranges like 100 000 - 150 000. I'd like to modify this column so it takes the first value as an int; in this example, I'd like to change "100 000 - 150 000" (string) to 100000 (int). Unfortunately, salary_range is full of NaN, and I don't really know how to use if/where statements in pandas.
I tried doing something like this: df['salary_range'] = np.where(df['salary_range']!='NaN',) but I don't know what I should write as the second argument of np.where. Obviously I can't just use str(salary_range), so I don't know how to do it.
You first need to take the subset where the value is not NaN. You can build that mask using the following code.
pd.notna(df['salary_range'])
The above function will return a series containing True/False values (pd.isna gives the opposite and would mark the NaN rows). Now you can select and update the non-NaN rows in one step with .loc; chained indexing like df[mask]['salary_range'] = ... only writes to a temporary copy, which is why pandas warns about it.
mask = pd.notna(df['salary_range'])
df.loc[mask, 'salary_range'] = (
    df.loc[mask, 'salary_range']
    .str.split('-').str[0]
    .str.replace(' ', '', regex=False)
    .astype(int)
)
This will only change the rows where the column is not null. Since you did not include the code, I can't help much without knowing more about the context. Hope this helps.
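As an aside, the .str accessor already propagates NaN, so the mask can be dropped entirely if a nullable integer column is acceptable. A sketch on a made-up two-row frame, using pandas' nullable Int64 dtype:

import numpy as np
import pandas as pd

df = pd.DataFrame({'salary_range': ['100 000 - 150 000', np.nan]})
# .str methods pass NaN through untouched; casting via float lets the
# NaN survive, and Int64 (capital I) is pandas' nullable integer dtype.
df['salary_range'] = (
    df['salary_range']
    .str.split('-').str[0]
    .str.replace(' ', '', regex=False)
    .astype(float)
    .astype('Int64')
)
print(df)   # 100000 in the first row, <NA> in the second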

How to populate arrays with values read in from csv via pandas?

I have created a DataFrame using pandas by reading a csv file. What I want to do is iterate down the rows, putting the values in the first column into one array and the values in the second column into a different array. This seems like it should be a fairly easy thing to do, so I think I am missing something; however, I can't find much online that doesn't get too complicated or that doesn't seem to do what I want. Stack questions like this one appear to be asking the same thing, but the answers are long and complicated. Is there no way to do this in a few lines of code? Here is what I have set up:
import pandas as pd
# available possible players
playerNames = []
df = pd.read_csv('Fantasy Week 1.csv')
What I anticipate I should be able to do would be something like:
for row in df.columns[1]:
    playerNames.append(row)
This however does not return the desired result.
Essentially, if df =
[1,2,3
4,5,6
7,8,9], I would want my array to be [1,4,7]
Do:
for row in df[df.columns[1]]:
    playerNames.append(row)
Or even better:
playerNames = df[df.columns[1]].tolist()
Note that df.columns is zero-based, so in this case, since you want the first column's values, do:
for row in df[df.columns[0]]:
    playerNames.append(row)
Or even better:
playerNames = df[df.columns[0]].tolist()
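A positional alternative that skips the column-label lookup altogether is .iloc; .to_numpy() is the route to take if an actual array rather than a Python list is wanted:

playerNames = df.iloc[:, 0].tolist()     # first column as a plain list
firstColumn = df.iloc[:, 0].to_numpy()   # or as a NumPy array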

How to modify cells in column conditionally in pandas?

I have a csv dataset which for whatever reason has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in the case where it ends with a *, otherwise keep it as-is.
I have tried a couple variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that it only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard; there should be some function to do this, but I don't know what it is.
Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
     name
0    John
1    Rose
2  Summer
3    Mark
If you only specify the first part of the indexer, you will be filtering by row label only and will get back a dataframe; you cannot assign a series to a dataframe, which is why you saw the ValueError.
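Since the goal is just to strip trailing asterisks, Series.str.rstrip is an even more direct route, and it is a no-op on names that don't end in one (note it removes all trailing asterisks, not only the last):

df['name'] = df['name'].str.rstrip('*')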
