Here is the part of the code I am having issues with:
for x in range(len(df['Days'])):
    if df['Days'][x] > 0 and df['Days'][x] <= 30:
        b = df['Days'][x]
b
The output I get is b = 14, which is the last value in the column for which the if statement holds. I am trying to get ALL the values of the column for which the if statement holds to be stored in b, rather than just the last value alone.
What you want to do is make a list instead and append b to it.
my_vals = []
for x in range(len(df['Days'])):
    if df['Days'][x] > 0 and df['Days'][x] <= 30:
        b = df['Days'][x]
        my_vals.append(b)
my_vals
In your code, you are overwriting b in every iteration, so it only stores the most recent value. In the future, when you need to store multiple values, use a collection type such as a list.
You can also use the filtering functionality of pandas and use
values = df.loc[(df['Days'] > 0) & (df['Days'] <= 30)]
If you want the values as a Series instead of a DataFrame use
values_series = values['Days']
If you want the values as a list instead of a Series use
values_list = list(values_series)
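For instance, with a small made-up DataFrame standing in for the asker's data (only the Days column is assumed):

```python
import pandas as pd

# Toy stand-in for the asker's DataFrame
df = pd.DataFrame({'Days': [14, 45, 7, 0, 30]})

# Boolean-mask filtering: keep rows where 0 < Days <= 30
values = df.loc[(df['Days'] > 0) & (df['Days'] <= 30)]

values_series = values['Days']      # as a Series
values_list = list(values_series)   # as a list
print(values_list)                  # [14, 7, 30]
```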
I'm looking up a certain row in my Pandas DataFrame by using the index - the row information is stored in variable p. As you can see p gives me a normal Pandas DataFrame. Now I want to save just the integer in in_reply_to_status_id as variable y but, in my code below, it gives me an object. Does anyone know if and how it would be possible to just store the integer (1243885949697888263 in this case) as y?
y is a Series; you can pick the first value (1243885949697888263 in this case) as follows:
print(y.array[0])
Did you try this?
y = df.at[i, 'in_reply_to_status_id']
That way you don't have to create p.
If you want to do it with p:
y = p.iloc[0].at['in_reply_to_status_id']
Or:
y = p.iat[0,1]
Or:
y = p.at[i, 'in_reply_to_status_id']
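A quick sketch with invented data (the real column layout and index will differ):

```python
import pandas as pd

# Hypothetical two-column frame; the index values stand in for the asker's row labels
df = pd.DataFrame(
    {'id': [10, 20], 'in_reply_to_status_id': [1243885949697888263, 0]},
    index=[5, 6])
i = 5
p = df.loc[[i]]  # one-row DataFrame, as in the question

y1 = df.at[i, 'in_reply_to_status_id']      # directly from df, no p needed
y2 = p.iloc[0].at['in_reply_to_status_id']  # via the one-row slice
y3 = p.iat[0, 1]                            # positional: row 0, column 1
print(y1, y2, y3)                           # each prints 1243885949697888263
```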
I'm trying to agg() a df while subsetting on one of its columns:
indi = pd.DataFrame({"PONDERA":[1,2,3,4], "ESTADO": [1,1,2,2]})
empleo = indi.agg(ocupados = (indi.PONDERA[indi["ESTADO"]==1], sum) )
but I'm getting the error 'Series' objects are mutable, thus they cannot be hashed.
I want to sum the values of "PONDERA" only when "ESTADO" == 1.
Expected output:
ocupados
0 3
I'm trying to imitate R function summarise(), so I want to do it in one step and agg some other columns too.
In R would be something like:
empleo <- indi %>%
summarise(poblacion = sum(PONDERA),
ocupados = sum(PONDERA[ESTADO == 1]))
Is this even the correct approach?
Thank you all in advance.
Generally, agg takes a function as its argument, not a Series itself. In your case, though, it's more beneficial to separate the filtering and the summation.
One of the options would be the following:
empleo = indi.query("ESTADO == 1")[["PONDERA"]].sum()
(With the double square brackets the result is a one-element pd.Series; use single brackets, ["PONDERA"], to get a plain number instead.)
Another option would be to use loc and filter the dataframe to when estado = 1, and sum the values of the column pondera:
indi.loc[indi.ESTADO==1, ['PONDERA']].sum()
Thanks to @Henry for the input.
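Both approaches give the same total on the example data:

```python
import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

# Single brackets around 'PONDERA' so each sum comes back as a plain number
via_query = indi.query("ESTADO == 1")["PONDERA"].sum()
via_loc = indi.loc[indi.ESTADO == 1, 'PONDERA'].sum()
print(via_query, via_loc)  # 3 3
```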
A bit fancy, but the output is exactly the format you want, and the syntax is similar to what you tried:
Use DataFrameGroupBy.agg() instead of DataFrame.agg():
empleo = (indi.loc[indi['ESTADO']==1]
.groupby('ESTADO')
.agg(ocupados=('PONDERA', 'sum'))
.reset_index(drop=True)
)
Result:
print(empleo) gives:
ocupados
0 3
Here are two different ways you can get the scalar value 3.
option1 = indi.loc[indi['ESTADO'].eq(1),'PONDERA'].sum()
option2 = indi['PONDERA'].where(indi['ESTADO'].eq(1)).sum()
However, your expected output shows this value in a dataframe. To do this, you can create a new dataframe with the desired column name "ocupados".
outputdf = pd.DataFrame({'ocupados':[option1]})
Based on the comment you provided, is this what you are looking for?
(indi.agg(poblacion = ("PONDERA", 'sum'),
ocupados = ('PONDERA',lambda x: x.where(indi['ESTADO'].eq(1)).sum())))
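Run on the example data, this produces one labelled row per named aggregation (DataFrame.agg with named tuples returns a DataFrame with a PONDERA column):

```python
import pandas as pd

indi = pd.DataFrame({"PONDERA": [1, 2, 3, 4], "ESTADO": [1, 1, 2, 2]})

empleo = indi.agg(
    poblacion=("PONDERA", 'sum'),
    ocupados=('PONDERA', lambda x: x.where(indi['ESTADO'].eq(1)).sum()))

print(empleo.loc['poblacion', 'PONDERA'])  # equals 10: sum over all rows
print(empleo.loc['ocupados', 'PONDERA'])   # equals 3: sum where ESTADO == 1
```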
I have some code which reads a json file and applies a lambda that removes values.
Code -
import pandas as pd
data = pd.read_json('filename.json',dtype='int64')
data = data[data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))]
The last statement filters out records from the dataframe where ColumnA holds anything other than None or a number of at most 2 digits (please correct me if I'm wrong).
Objective - Before applying the lambda, I want to print the records from the dataframe, so that I can see what kinds of values are getting removed.
P.S. I am new to python and working on some predesigned code
What you are doing here is filtering a DataFrame based on the values. Doing this involves two steps:
Creating a boolean array of True/False for those that you want/don't want
Indexing the original DataFrame with that boolean array to select only the values that you want.
In your code, you are doing both steps at once (and that's perfectly fine). But if you want to look at the values, it might be helpful to get the boolean array and look at that one.
data = data[data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))]
# can become
# step 1
my_values = data['ColumnA'].apply(lambda x: x == None or (x.isnumeric() and len(x) <= 2))
# step2
data = data[my_values]
# you can use my_values to inspect what is happening
print(my_values) # this will be a series of just True/False
print(data[my_values]) # this will print the kept values
print(data[~my_values]) # this will print the removed values
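A runnable sketch with made-up strings in ColumnA (the real JSON contents will differ; `x is None` is used here as the idiomatic form of `x == None`):

```python
import pandas as pd

data = pd.DataFrame({'ColumnA': ['12', '345', '7', 'abc', None]})

# Step 1: the boolean mask
my_values = data['ColumnA'].apply(
    lambda x: x is None or (x.isnumeric() and len(x) <= 2))

# Step 2: inspect what would be removed before filtering
print(data[~my_values])  # the rows holding '345' and 'abc'

data = data[my_values]   # keeps '12', '7' and the None row
```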
I want to check whether one date lies between two other dates (all in the same row). If it does, I want a new column to be filled with the sales value from that row; if not, the row should be dropped.
The code shall iterate over the entire dataframe.
This is my code:
for row in final:
    x = 0
    if pd.to_datetime(final['start_date'].iloc[x]) < pd.to_datetime(final['purchase_date'].iloc[x]) < pd.to_datetime(final['end_date'].iloc[x]):
        final['new_col'].iloc[x] = final['sales'].iloc[x]
    else:
        final.drop(final.iloc[x])
    x = x + 1
print(final['new_col'])
Instead of the values of final['sales'] I just get 0 back.
Does anyone know where the mistake is or any other efficient way to tackle this?
I would do something like this:
First, create the new column, testing both bounds of the date range:
import numpy as np
final['new_col'] = np.where(
    (pd.to_datetime(final['start_date']) < pd.to_datetime(final['purchase_date'])) &
    (pd.to_datetime(final['purchase_date']) < pd.to_datetime(final['end_date'])),
    final['sales'], np.nan)
Then, you just drop the NaNs:
final.dropna(inplace=True)
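A minimal end-to-end sketch with invented dates (column names taken from the question):

```python
import numpy as np
import pandas as pd

final = pd.DataFrame({
    'start_date':    ['2020-01-01', '2020-02-01'],
    'purchase_date': ['2020-01-15', '2020-03-20'],
    'end_date':      ['2020-01-31', '2020-03-01'],
    'sales':         [100, 200],
})

start = pd.to_datetime(final['start_date'])
purchase = pd.to_datetime(final['purchase_date'])
end = pd.to_datetime(final['end_date'])

# Vectorised: no explicit loop over rows is needed
final['new_col'] = np.where((start < purchase) & (purchase < end),
                            final['sales'], np.nan)
final.dropna(inplace=True)
print(final)  # only the first row survives, with new_col == 100.0
```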
I'm trying to code the following logic in pandas: for the first three rows of every group, I want to create a variable that holds 1 (1st row), 2 (2nd row), or 3 (3rd row). I'm doing it as below. In the code I'm not creating a new variable, because I don't know how to do that, so I'm replacing a variable that's already present in the data set. Though my code doesn't throw an error, it gives me very strange results.
def func(i):
    data.loc[data.groupby('ID').nth(i).index, 'date'] = i

func(1)
Any suggestions?
Thanks in Advance.
If you don't have a duplicated index, you can create a row id within each group, mask out ids larger than 3, and then assign the result back to the data frame:
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
This gives the first three rows of each ID the values 1, 2, 3; rows beyond the third will have NaN values.
data = pd.DataFrame({"ID":[1,1,1,1,2,2,3,3,3]})
data['date'] = (data.groupby('ID').cumcount() + 1)[lambda x: x <= 3]
data