Derive a new pandas column based on the length of strings in other columns
I want to count, for each row, the number of columns that have a value, and create a new column with that count. For example, if I have 3 columns and two of them have a value in a given row, the new column for that row should be 2.
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'J1': ['a', 'ab', ''], 'J2': ['22', '', '33']})
print(df)
The output should be like:
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
One way could be to check which cells are not equal (DataFrame.ne) to an empty string, and take the sum along the rows:
df['Count_of_cols_have_values'] = df.set_index('ID').ne('').sum(axis=1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
Or you can replace the empty strings with NaN and use count, which returns the number of non-NA values in each row:
import numpy as np

df['Count_of_cols_have_values'] = df.set_index('ID').replace('', np.nan).count(axis=1).values
  ID  J1  J2  Count_of_cols_have_values
0  1   a  22                          2
1  2  ab                              1
2  3      33                          1
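As a side note (not part of the original answers), since an empty string is falsy, you could also cast to bool and sum, assuming the columns only ever contain strings:

# empty strings become False, everything else True; summing counts the filled cells
df['Count_of_cols_have_values'] = df.set_index('ID').astype(bool).sum(axis=1).values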
Related
I have a dataframe with the structure:
ID  Split        Data
1   GT:RC:BC:CN  1:4:5:3
2   GT:RC:CN     1:7:0
3   GT:BC        4:2
I would like to create n new columns and populate with the data in the Data column, where n is the total number of unique fields split by a colon in the Split column (in this case, this would be 4 new columns: GT, RC, BC, CN). The new columns should be populated with the corresponding data in the Data column, so for ID 3, only column GT and BC should be populated. I have tried using string splitting, but that doesn't take into account the correct column to move the data to.
The output should look like this:
ID  Split        Data     GT  RC  BC  CN
 1  GT:RC:BC:CN  1:4:5:3   1   4   5   3
 2  GT:RC:CN     1:7:0     1   7       0
 3  GT:BC        4:2       4       2
You can use:
out = df.join(pd.concat([pd.Series(d.split(':'), index=s.split(':'))
                         for s, d in zip(df['Split'], df['Data'])], axis=1).T)
output:
ID Split Data GT RC BC CN
0 1 GT:RC:BC:CN 1:4:5:3 1 4 5 3
1 2 GT:RC:CN 1:7:0 1 7 NaN 0
2 3 GT:BC 4:2 4 NaN 2 NaN
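For reference, a self-contained sketch of the same approach (the DataFrame construction here is reconstructed from the question's table):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Split': ['GT:RC:BC:CN', 'GT:RC:CN', 'GT:BC'],
                   'Data': ['1:4:5:3', '1:7:0', '4:2']})

# one Series per row, indexed by the field names from Split; concatenating
# side by side aligns on the union of field names, transposing restores rows
wide = pd.concat([pd.Series(d.split(':'), index=s.split(':'))
                  for s, d in zip(df['Split'], df['Data'])], axis=1).T
out = df.join(wide)
print(out)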
I am having trouble with Pandas.
I try to compare each value of a row to another one.
In the attached link you will be able to see a slice of my dataframe.
For each date I have the daily variation of some stocks.
I want to compare each stock variation to the variation of the columns labelled 'CAC 40'.
If a value is greater, I want to turn it into 1, or 0 if it is lower.
This should return a dataframe filled only with 1 or 0 so I can then summarize by columns.
I have tried the apply method, but this doesn't work; it returns a pandas Series (attached below).
def compare_to_cac(row):
    for i in row:
        if row[i] >= row['CAC 40']:
            return 1
        else:
            return 0

data2 = data.apply(compare_to_cac, axis=1)
Can someone please help me out?
I worked with this data (column names are not important here, only the CAC 40 one is):
A B CAC 40
0 0 2 9
1 1 3 9
2 2 4 1
3 3 5 2
4 4 7 2
With just a for loop:
import numpy as np

for column in df.columns:
    if column == "CAC 40":
        continue
    condition = [df[column] > df["CAC 40"]]
    value = [1]
    df[column] = np.select(condition, value, default=0)
Which gives me as a result:
A B CAC 40
0 0 0 9
1 0 0 9
2 1 1 1
3 1 1 2
4 1 1 2
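If you would rather avoid the explicit loop over the columns, a vectorized sketch along the same lines (assuming every column other than 'CAC 40' should be compared) is:

# compare each remaining column to the 'CAC 40' Series row by row,
# then cast the Boolean result to 0/1
result = df.drop(columns="CAC 40").gt(df["CAC 40"], axis=0).astype(int)

You can then sum the resulting frame per column.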
I have a sorted dataframe with an ID, and a value column, which looks like:
ID value
A 10
A 10
A 10
B 15
B 15
C 10
C 10
...
How can I create a new dataframe that counts the "new" distinct values in terms of the number of different IDs, so that it basically goes over my dataframe and looks like:
Number of ID Number of distinct values
1 1
2 2
3 2
In that case above we have 3 different IDs, but ID A and C have the same value.
So the first row in the new dataframe:
Number of ID = 1, because we have seen 1 different ID so far
Number of distinct values = 1, because we have seen one distinct value so far
Second row:
Number of ID = 2, because we move on to row 4 in the old dataframe (we are only interested in new IDs)
Number of distinct values = 2, because the value changed to 15, which has not occurred so far
I think you need to process a new DataFrame created by DataFrame.drop_duplicates and then use factorize:
Replace duplicated values with NaN, forward fill them, and then call pd.factorize:
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
I changed the data for better testing:
print (df)
ID value
0 A 10
1 A 10
2 A 10
3 B 15
4 B 15
5 C 10
6 C 15
df1 = df.drop_duplicates(['ID','value']).copy()
df1['Number of ID'] = range(1, len(df1)+1)
df1['Number of distinct values'] = pd.factorize(df1['value'].mask(df1['value'].duplicated()).ffill())[0] + 1
print (df1)
ID value Number of ID Number of distinct values
0 A 10 1 1
3 B 15 2 2
5 C 10 3 2
6 C 15 4 2
A cumsum-based alternative works incorrectly if there are multiple values per ID, because a repeated value is counted as new again:
df = pd.DataFrame({'Number of ID': range(1, len(df1)+1),
                   'Number of distinct values': np.cumsum(pd.factorize(df1['value'])[0])+1})
print (df)
Number of ID Number of distinct values
0 1 1
1 2 2
2 3 2
3 4 3
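As a side note (not from the original answer), the running count of distinct values can also be written as a cumulative sum of "first occurrence" flags, which gives the same result as the factorize approach on the modified data above:

df1 = df.drop_duplicates(['ID', 'value']).copy()
df1['Number of ID'] = range(1, len(df1) + 1)
# True on the first occurrence of each value; the cumulative sum counts them
df1['Number of distinct values'] = (~df1['value'].duplicated()).cumsum()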
I have a dataframe like this
id a1 a2 a3 b1 b2
1 1 0 0 0 1
2 0 0 0 1 0
3 1 1 0 0 1
4 1 0 1 1 1
5 0 1 1 0 0
Now, I have to transpose columns beginning with prefix 'a' into rows and get counts for corresponding columns with prefix 'b'. The counts are basically the number of times that 'a' and 'b' co-occurred in an id. Co-occurrence is only if both the values are '1' for that id.
b1 b2
a1 1 3
a2 0 1
a3 1 1
In the above example, the a1, b2 pair co-occurred in 3 ids (ids 1, 3 and 4), hence the value is 3.
How to do this in Pandas?
Matrix multiplication with the @ operator (Python 3.5+):
df[['a1', 'a2','a3']].T # df[['b1','b2']]
Update: more generally
df.filter(like='a').T # df.filter(like='b')
Or
df.iloc[:,:3].T # df.iloc[:,3:]
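For completeness, a runnable sketch of the first variant (the frame is reconstructed from the question's table):

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'a1': [1, 0, 1, 1, 0],
                   'a2': [0, 0, 1, 0, 1],
                   'a3': [0, 0, 0, 1, 1],
                   'b1': [0, 1, 0, 1, 0],
                   'b2': [1, 0, 1, 1, 0]})

# each entry is the dot product of an 'a' column with a 'b' column,
# i.e. the number of rows where both are 1
counts = df[['a1', 'a2', 'a3']].T @ df[['b1', 'b2']]
print(counts)
#     b1  b2
# a1   1   3
# a2   0   1
# a3   1   1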
I have two data frames. df1 holds a 'grouped inventory' of items, grouped by the numerical values A, B and C. For each group there is a SUM column which should reflect the total price of all the items I have of that particular type. Initially I have set the SUM column to zero.
df2 is a list of items I have with A, B, C and the price of the item.
df1 (Initial Inventory):
A B C SUM
1 1 1 0
1 1 2 0
1 2 2 0
2 2 2 0
df2 (List of items):
A B C PRICE
2 2 2 30
1 1 2 100
1 1 2 110
1 1 2 105
So my code should convert df1 into:
df1 (expected output):
A B C SUM
1 1 1 0
1 1 2 315
1 2 2 0
2 2 2 30
Explanation: My list of items (df2) contains one item coded as 2,2,2 which has a value of 30 and contains three items coded as 1,1,2 which has values of 100 + 110 + 105 = 315. So I update the inventory table df1, to reflect that I have a total value of 30 for items coded 2,2,2 and total value of 315 for items coded 1,1,2. I have 0 in value for items coded 1,1,1 and 1,2,2 - since they aren't found in my items list.
What would be the most efficient way to do this?
I would rather not use loops since df1 is 720 rows and df2 is 10,000 rows.
You can try to merge on columns "A", "B", and "C" with how="left" (the key combinations in df2_sum below are a subset of those in df1, so a left merge keeps all rows of df1).
df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()
df1.merge(df2_sum, on=["A","B","C"], how="left").fillna(0)
A B C SUM PRICE
0 1 1 1 0 0.0
1 1 1 2 0 315.0
2 1 2 2 0 0.0
3 2 2 2 0 30.0
You can then add PRICE to your SUM column.
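Putting it together, a sketch of the full update (dropping the PRICE column again once it has been added into SUM; assigning the result back to df1 is an assumption about how you want to store it):

df2_sum = df2.groupby(["A", "B", "C"])["PRICE"].sum().reset_index()
# left merge keeps every inventory group; missing groups get PRICE 0
merged = df1.merge(df2_sum, on=["A", "B", "C"], how="left").fillna(0)
merged["SUM"] = merged["SUM"] + merged["PRICE"]
df1 = merged.drop(columns="PRICE")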