I have the following code:
print(df.drop(df[df['Quantity'] == 0].index).rename(columns={'Weight': 'Weight (oz.)'}))
I understand what the query is trying to do, but I'm lost as to why you need to add the .index portion.
What is .index doing in this particular code?
For context, here is what the dataframe looks like:
I looked at the pandas documentation for DataFrame.index:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html
but unfortunately it was too vague for me to make sense of it.
The DataFrame.index holds the label of each record in your dataframe. By default it is a RangeIndex (0, 1, 2, ...), so every row has its own label even if two rows contain the same data in every column. DataFrame.drop takes an index (a single label or list-like) and drops the rows whose labels match.
So from the code above,
df[df['Quantity'] == 0] gets the rows where Quantity == 0,
df[df['Quantity'] == 0].index gets the index labels of all rows that satisfy the predicate, and
df.drop(df[df['Quantity'] == 0].index) drops all rows whose labels were returned, as in the sketch below.
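For example, a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'Quantity': [0, 3, 0, 7], 'Weight': [1.0, 2.0, 3.0, 4.0]})
df[df['Quantity'] == 0]                 # the two rows where Quantity is 0
df[df['Quantity'] == 0].index           # just their labels: Index([0, 2])
df.drop(df[df['Quantity'] == 0].index)  # the dataframe without those rows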
Hope this helps!
I checked df.drop()'s documentation; it says that it drops by index. This code first finds the rows whose Quantity is 0, but because drop() works with index labels, it goes back to the dataframe and retrieves those rows' labels. That is what .index does.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
Why is it that df.loc[df['xx'] == 'yy', 'oo'].values turns out to be a blank list/array (array([], dtype=float64)) instead of returning a 0 in the dataframe? When I display df, that specific spot shows 0, which I changed myself from a NaN.
I want the loc spot to appear as 0, as it's supposed to.
The df.loc[df['xx'] == 'yy', 'oo'].values result contains the filtered data that matches this condition; if you want the count of matched rows, use the count method, as below:
df.loc[df['xx'] == 'yy', 'oo'].count()
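A minimal sketch (with made-up data) of why .values can come back empty: .loc returns only the rows that match the condition, so an empty array means nothing matched.

import pandas as pd

df = pd.DataFrame({'xx': ['yy', 'zz'], 'oo': [0, 5]})
df.loc[df['xx'] == 'yy', 'oo'].values   # array([0]) - one row matches
df.loc[df['xx'] == 'ww', 'oo'].values   # array([], dtype=int64) - nothing matches
df.loc[df['xx'] == 'yy', 'oo'].count()  # 1 - the number of matching rows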
I would like to know how to vectorize this logic:
create a new column (df['state']) that takes the value:
'startTrade' if 10SMA > 30SMA > 100SMA, but in the preceding row this condition was not true
AND the previous row's state was not 'startTrade'.
Subsequent rows need to be state 'inTrade' or something like that.
'exitTrade' if 10SMA < 30SMA and in the previous row the state was 'inTrade'.
I am coding this with a Python for-loop and it is running, but I think it would be very interesting to know how to refer to the previous row's result with lambda or any other way to vectorize it, using the dataframe philosophy and avoiding the Python loop.
Use the index attribute of the DataFrame:
df = pd.DataFrame(...)
for i in df.index:
    # guard the first row, which has no predecessor
    prev_state = df.at[i - 1, 'state'] if i > 0 else None
    if df.at[i, '10SMA'] > df.at[i, '30SMA'] > df.at[i, '100SMA'] and prev_state != 'startTrade':
        df.at[i, 'state'] = 'startTrade'
    elif df.at[i, '10SMA'] < df.at[i, '30SMA']:
        df.at[i, 'state'] = 'exitTrade'
    else:
        df.at[i, 'state'] = 'inTrade'
It seems that the right answer is to do the task in two steps: first use shift to get the previous row's value onto the current row. Then every row can be calculated in parallel, because each row "knows" the previous row's value. Thank you https://stackoverflow.com/users/523612/karl-knechtel, who understood the right answer even before I understood the question!
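A hedged sketch of that shift-based idea, using the column names from the question (the exact entry/exit rules may need adjusting to the full logic):

import numpy as np

cond = (df['10SMA'] > df['30SMA']) & (df['30SMA'] > df['100SMA'])
prev_cond = cond.shift(1, fill_value=False)  # the same condition, one row earlier

df['state'] = np.select(
    [cond & ~prev_cond,            # condition just turned true -> entry
     df['10SMA'] < df['30SMA']],   # exit condition
    ['startTrade', 'exitTrade'],
    default='inTrade',
)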
I have a dataframe filled with twitter data. The columns are:
row_id : Int
content : String
mentions : [String]
value : Int
So for every tweet I have its row id in the dataframe, the content of the tweet, the mentions used in it (for example: '#foo') as an array of strings, and a value that I calculated based on the content of the tweet.
An example of a row would be:
row_id : 12
content : 'Game of Thrones was awful'
mentions : ['#hbo', '#tv', '#dissapointment', '#whatever']
value: -0.71
So what I need is a way to do the following 3 things:
find all rows that contain the mention '#foo' in the mentions-field
find all rows that ONLY contain the mention '#foo' in the mentions-field
the above two, but checking against an array of strings instead of a single handle
If anyone could help me with this, or even just point me in the right direction, that'd be great.
Let's call your DataFrame df.
For the first task you use:
result = df[(pd.DataFrame(df['mentions'].tolist()) == '#foo').any(axis=1)]
Here, pd.DataFrame(df['mentions'].tolist()) creates a new DataFrame where each column holds one mention and each row corresponds to a tweet.
Then == '#foo' generates a boolean dataframe containing True wherever a mention equals '#foo'.
Finally, .any(axis=1) returns a boolean index whose elements are True if any element in the row is True.
I think with this help you can manage to solve the rest for yourself.
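For the record, hedged sketches of the other two tasks, assuming the mentions column holds Python lists as in the question:

# 2) rows whose mentions are ONLY '#foo'
only_foo = df[df['mentions'].apply(lambda m: set(m) == {'#foo'})]

# 3) the same checks against a whole set of handles (a hypothetical set)
handles = {'#foo', '#bar'}
any_of = df[df['mentions'].apply(lambda m: bool(set(m) & handles))]
only_these = df[df['mentions'].apply(lambda m: set(m) == handles)]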
I have a dataframe df in a PySpark setting. I want to change a column, say it is called A, whose datatype is "string". I want to change its values according to their lengths: in particular, if a row has only one character, we want to concatenate "0" to the end; otherwise, we take the default value. The name of the "modified" column must still be A. This is for a Jupyter Notebook using PySpark 3.
This is what I have tried so far:
df = df.withColumn("A", when(size(df.col("A")) == 1, concat(df.col("A"), lit("0"))).otherwise(df.col("A")))
I also tried the same code deleting the "df.col"'s.
When I run this code, the software complains that the syntax is invalid, but I don't see the error.
df.withColumn("temp", when(length(df.A) == 1, concat(df.A, lit("0"))).\
otherwise(df.A)).drop("A").withColumnRenamed('temp', 'A')
What I understood from your question is that you were getting one extra column A. You want the old column A replaced by the new column A, so I created a temp column with your required logic, then dropped column A, then renamed the temp column to A.
Listen here child...
To choose a column from a DF in PySpark, you must not use the "col" method on the DataFrame, since that belongs to the Scala/Java API. In PySpark, the correct way is just to access the column by name from the DF: df.colName.
To get the length of your string, use the "length" function. The "size" function is for iterables such as arrays and maps.
And for the grand solution... (drums drums drums)
df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
Please!
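A tiny usage sketch with made-up data, just to show the whole thing end to end:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when, length, concat, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("ab",)], ["A"])
df = df.withColumn("A", when(length(df.A) == 1, concat(df.A, lit("0"))).otherwise(df.A))
df.show()  # 'a' becomes 'a0'; 'ab' is unchanged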
Basically, I am trying to take the previous row's value for the combination of ['dealer', 'State', 'city']. If I have multiple rows with this combination, I will get the shifted value for the combination.
df['ShiftBY_D_S_C']= df.groupby(['dealer','State','city'])['dealer'].shift(1)
I then take this ShiftBY_D_S_C column and try to take the count for the ['ShiftBY_D_S_C', 'State', 'city'] combination.
df['NewColumn'] = (df.groupby(['ShiftBY_D_S_C','State','city'])['ShiftBY_D_S_C'].transform("count"))+1
The below table shows what I am trying to do, and it works well. But when all the rows in the ShiftBY_D_S_C column are null, it does not work, since there are only null values. Any suggestions?
I am trying to get NewColumn values like below when all the values in ShiftBY_D_S_C are NaN.
You could simply handle the special case that you describe with an if/else: groupby drops NaN group keys, so when every value in ShiftBY_D_S_C is NaN there are no groups to count.
if df['ShiftBY_D_S_C'].isna().all():
    df['NewColumn'] = 1
else:
    df['NewColumn'] = df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C'].transform('count') + 1
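An equivalent sketch without the explicit branch: transform leaves NaN for the rows whose group key is NaN (since groupby drops NaN keys), so filling those with 0 before the +1 gives the same result.

counts = df.groupby(['ShiftBY_D_S_C', 'State', 'city'])['ShiftBY_D_S_C'].transform('count')
df['NewColumn'] = counts.fillna(0).add(1).astype(int)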