I have a csv dataset which for whatever reason has an extra asterisk (*) at the end of some names. I am trying to remove them, but I'm having trouble. I just want to replace the name in the case where it ends with a *, otherwise keep it as-is.
I have tried a couple variations of the following, but with little success.
import pandas as pd
people = pd.read_csv("people.csv")
people.loc[people["name"].str[-1] == "*"] = people["name"].str[:-1]
Here I am getting the following error:
ValueError: Must have equal len keys and value when setting with an iterable
I understand why this is wrong, but I'm not sure how else to reference the values I want to change.
I could instead do something like:
starred = people.loc[people["name"].str[-1] == "*"]
starred["name"] = starred["name"].str[:-1]
I get a warning here, but this kind of works. The problem is that it only contains the previously starred people, not all of them.
I'm kind of new to this, so apologies if this is simple. I feel like it shouldn't be too hard; there should be some function to do this, but I don't know what it is.
Your syntax for pd.DataFrame.loc needs to include a column label:
df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})
df.loc[df['name'].str[-1] == '*', 'name'] = df['name'].str[:-1]
print(df)
     name
0    John
1    Rose
2  Summer
3    Mark
If you only specify the first part of the indexer, you filter by row label alone and get back a dataframe; you cannot assign a series to that dataframe, which is why you see the ValueError.
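Since the asker suspected a built-in exists: pandas does ship `Series.str.rstrip`, which strips the given trailing characters and leaves other names untouched, so the boolean mask can be skipped entirely (note it removes repeated trailing asterisks too, not just one). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'name': ['John*', 'Rose', 'Summer', 'Mark*']})

# str.rstrip('*') removes every trailing '*' with no mask needed;
# names without an asterisk pass through unchanged.
df['name'] = df['name'].str.rstrip('*')
print(df['name'].tolist())  # → ['John', 'Rose', 'Summer', 'Mark']
```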
Related
Coming from Excel background, I find indexes so confusing in code.
Typically I'll make something an index that I feel should be one, then lose the functionality I would have had when it was a column.
I have a df with 4-digit years from 2015 to 2113 as the index. If I loop over the index, the values are of class int (which shouldn't matter for my purposes).
I then want to take a cut that's just 2020, so I do
df[df.index==2020] and it returns a blank df even though there is data to return.
If I do df.loc[2020] it says it can't do label indexing on ints.
I just want to slice the data by years (so I can say just give me 2020 onward, for example).
What am I doing wrong? Feel like I'm missing something fundamental.
I created a mock df to reproduce the problem for the question but that works fine.
If I do a for loop on the index of both the problem df and the example one they both return class int for each row
If I do example_df.index though it returns
Int64Index([2019, 2020, 2021], dtype='int64', name='Yr')
If I do the same on the problem df, it returns
Index(['2019', '2020', '2021'], dtype='object')
The above look like strings to me, but the loop says they are int?
The original problem index comes from Excel via set_index, so I can't reproduce an example here.
Any ideas?
On the problem df, the index's data type is indeed string:
Index(['2019', '2020', '2021'], dtype='object')
When you write
df[df.index==2020]
A blank result is expected because you are searching for the int 2020, not the string '2020'.
Then the code
df.loc[2020]
fails for the same reason: loc looks up the label 2020 as an int, and no such label exists in a string index.
So the code
df[df.index==2020]
is the right approach, but first you need to change the datatype of your index:
df.index = [int(i) for i in df.index]
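A note on that last step: `Index.astype` does the same conversion without a Python-level loop, and once the index is integer, the "2020 onward" slice the asker wanted is a plain comparison. A sketch on a mock frame:

```python
import pandas as pd

# Mock frame reproducing the problem: a numeric-looking string index.
df = pd.DataFrame({'val': [1, 2, 3]}, index=['2019', '2020', '2021'])
df.index.name = 'Yr'

df.index = df.index.astype(int)  # idiomatic equivalent of the list comprehension

print(df[df.index == 2020]['val'].tolist())  # → [2]
print(df[df.index >= 2020]['val'].tolist())  # → [2, 3] ("2020 onward")
```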
I'm making a pandas df based on a list and a 3D list of lists. If you want to see the whole code, you can look here: https://github.com/Bigglesworth95/Fish-Food-Calculator/blob/main/foodDataScraper.py
but I will do my best to summarize below. (If you want to peruse the code and offer any recommendations, I will happily accept tho. I'm a noob and I know I'm not good at this :)
The lists I am using here are quite long. IDK if that makes much of a difference but I thought I would note it since I won't be posting the full contents of the lists below for this reason.
The function to make the df is as follows:
def make_df():
    counter = 0
    nameLength = len(names)
    print('nameLength =', nameLength)
    for product in newTupledList:
        templist = []
        if counter <= nameLength:
            templist.append(names[counter])
        product.insert(0, templist)
        counter += 1
    df1 = pd.DataFrame(newTupledList, columns=['Name', 'Crude Protein', 'Crude Fat', 'Crude Fiber', 'Moisture'...])
    return df1
newTupledList is a list that looks like this: [[['Crude Protein', '48%'], ['Crude Fat', '5.5%'], ['Crude Fiber', '0.5%'], ['Moisture', '6%'], ['Phosphorus', '0.1%']...]...]
Note that the first layer is all the products, the second is the individual product, and the third is all the nutritional values of all products, populated with data for the individual products and then a 0 for everything not relevant.
Len of names is 24. IDK if it's relevant.
Now, the interesting issue here is that, no matter how many columns I pass to the DataFrame, I get a ValueError. If I do nothing, I get a ValueError saying that I only passed 52 columns and needed 60. If I add 8 more columns, it says I passed 60 columns but needed 61. If I add one to that, it says I passed 61 columns but needed 60. And so on.
Has anyone ever seen anything like that happen before? What are some approaches I could take to debugging such a weird bug? Thanks.
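One pattern that produces exactly this kind of moving-target ValueError is mutating the source list in place: `product.insert(0, templist)` grows every inner list each time make_df runs, so in a notebook the required column count changes between runs. That is a guess from the symptom, not a confirmed diagnosis. A sketch (with made-up stand-in data) that first checks the row lengths and then builds new rows instead of inserting into the originals:

```python
import pandas as pd
from collections import Counter

# Stand-in data shaped like newTupledList: one inner list of
# [label, value] pairs per product.
names = ['prod_a', 'prod_b']
newTupledList = [[['Crude Protein', '48%'], ['Crude Fat', '5.5%']],
                 [['Crude Protein', '30%'], ['Crude Fat', '7%']]]

# 1) Diagnose: pd.DataFrame needs every row to have the same length,
#    so print the distribution of row lengths first.
print(Counter(len(row) for row in newTupledList))

# 2) Build new rows instead of insert()-ing into the originals, so
#    re-running the function never changes the source list.
rows = [[name] + product for name, product in zip(names, newTupledList)]
df = pd.DataFrame(rows, columns=['Name', 'Crude Protein', 'Crude Fat'])
print(df.shape)  # → (2, 3)
```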
I have a pandas dataframe in which some rows didn't pull in correctly so that the values were pushed over into the next column over. Therefore I have a column that is mostly null, but has a few instances where there is a value that should go in the previous column. Below is an example of what it looks like.
I need to replace the 12345 and 45678 in the Approver column with JJones in the NeedtoDelete column.
I am not sure if a for loop, or a regular expression is the right way to go. I also came across the replace function, but I'm not sure how I would set that up in this scenario. Below is the code I have tried thus far (Q1Q2 is the df name):
for Q1Q2['Approver'] in Q1Q2:
    Replacement = Q1Q2.loc[Q1Q2['Need to Delete'].notnull()]
    Q1Q2.loc[Replacement] = Q1Q2['Approver']

Q1Q2.loc[Q1Q2['Need to Delete'].notnull(), ['Approver'] == Q1Q2['Need to Delete']]
If you could help me fix either attempt above, or point me in the right direction, it would be greatly appreciated. Thanks in advance!
You can use boolean indexing:
r = Q1Q2['Need to Delete'].notnull()
Q1Q2.loc[r, 'Approver'] = Q1Q2.loc[r, 'Need to Delete']
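For completeness, here is that fix runnable end to end on mock data (values invented to mirror the description in the question):

```python
import pandas as pd
import numpy as np

# Mock frame: two Approver cells were shifted into 'Need to Delete'.
Q1Q2 = pd.DataFrame({
    'Approver': ['12345', '45678', 'ASmith'],
    'Need to Delete': ['JJones', 'JJones', np.nan],
})

# Boolean mask of the rows that need repair, then an aligned assignment.
r = Q1Q2['Need to Delete'].notnull()
Q1Q2.loc[r, 'Approver'] = Q1Q2.loc[r, 'Need to Delete']

print(Q1Q2['Approver'].tolist())  # → ['JJones', 'JJones', 'ASmith']
```

An equivalent one-liner is `Q1Q2['Approver'] = Q1Q2['Need to Delete'].combine_first(Q1Q2['Approver'])`, which takes the non-null values from 'Need to Delete' and falls back to 'Approver' elsewhere.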
I have a dataframe df in a PySpark setting. I want to change a column, say it is called A, whose datatype is "string". I want to change its values according to their lengths. In particular, if in a row we have only a character, we want to concatenate 0 to the end. Otherwise, we take the default value. The name of the "modified" column must still be A. This is for a Jupyter Notebook using PySpark3.
This is what I have tried so far:
df = df.withColumn("A", when(size(df.col("A")) == 1, concat(df.col("A"), lit("0"))).otherwise(df.col("A")))
I also tried the same code deleting the "df.col"'s.
When I run this code, the software complains saying that the syntax is invalid, but I don't see the error.
df.withColumn("temp", when(length(df.A) == 1, concat(df.A, lit("0")))
                      .otherwise(df.A)).drop("A").withColumnRenamed("temp", "A")
What I understood after reading your question was that you were getting one extra column A.
So you want the old column A replaced by the new column A. So I created a temp column with your required logic, then dropped column A, then renamed the temp column to A.
Listen here child...
To choose a column from a DF in PySpark, you must not use the df.col(...) method, since that is the Scala/Java API. In PySpark, just reference the column on the DF itself: df.colName (or equivalently df['colName'], or pyspark.sql.functions.col('colName')).
To get the length of your string, use the "length" function. The "size" function is for iterables such as arrays and maps.
And for the grand solution... (drums drums drums)
df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
Por favor!
I want to make the columns of Salary_Data_split variables, depending of Sal_name (type : list) where:
Sal_name = ['Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']
and Salary_Data_split must contain Salary plus the existing columns listed in Sal_name, like:
Salary_Data_split = data[["Salary",'Success_S_1', 'Failure_S_1', 'Success_S_2', 'Failure_S_2','Success_S_4', 'Failure_S_4','Success_S_7', 'Failure_S_7','Success_S_8', 'Failure_S_8']]
I have tried this code but it doesn't work:
Salary_Data_split = data[["Salary", Sal_name]]
Please always include example data in your posts, and always include error messages; that way your question is a lot clearer. I am guessing data is your dataframe, which has a Salary column plus the columns listed in Sal_name, and you want to select them together into Salary_Data_split?
data['sal_Data_Split'] = [data['Salary'], data['Sal_name']]
This would put the columns Salary and Sal_name into a list, resulting in a nested list if data['Sal_name'] is itself a list. The way you assigned Salary_Data_split = data[["Salary", Sal_name]] in your original post nests the Sal_name list inside the outer list, so pandas looks for a single column whose label is that whole list instead of indexing the columns one by one. You also forgot the quotation marks around Sal_name, if a column of that name is what you meant.
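What the asker seems to want is one flat list of labels, which you get by concatenating ["Salary"] with Sal_name before indexing. A minimal sketch with invented column contents (and a shortened Sal_name):

```python
import pandas as pd

Sal_name = ['Success_S_1', 'Failure_S_1']  # shortened for the sketch
data = pd.DataFrame({'Salary': [100, 200],
                     'Success_S_1': [1, 0],
                     'Failure_S_1': [0, 1],
                     'Other': [9, 9]})

# ["Salary"] + Sal_name builds one flat list of column labels,
# which is what the double-bracket indexing expects.
Salary_Data_split = data[['Salary'] + Sal_name]
print(list(Salary_Data_split.columns))  # → ['Salary', 'Success_S_1', 'Failure_S_1']
```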