Comparing multiple columns of data using pandas data frame - python

I have a pandas data frame df1
Time sat1 sat2 sat3 sat4 val1 val2 val3 val4
10 2 4 2 4 0.1 -1.0 1 2.0
20 3 1 1 3 1.6 0 2.1 -0.7
30 12 8 8 16 0.5 1.1 0.6 2.0
40 2 1 2 12 1.0 1.2 0.4 3.7
I want to compare sat1, sat2 with sat3 and sat4 at every time instant.
If there is a match between these column pairs, I want to get the number of matched
elements and subtract the value columns corresponding to the matched elements.
Expected Output:
match_count Reslt_1 Reslt_2
2 val1-val3 val2-val4
2 val1-val4 val2-val3
1 NaN val2-val3
1 val1-val3 NaN (w.r.t. match found in sat1 or sat2)
These are sample data and the number of columns may increase. The values in sat1, sat2 toggle between sat3 and sat4, and that determines which value columns are subtracted.
How can I obtain the above expected output using pandas? I obtained the above dataframe
using the pandas concat function.
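A minimal sketch of one way to read this: match each of sat1/sat2 against whichever of sat3/sat4 holds the same value in that row, count the matches, and subtract the paired val columns. The frame below is re-typed from the question, and the matching rule is an assumption inferred from the expected output:
import pandas as pd
import numpy as np

# Sample frame re-typed from the question
df1 = pd.DataFrame({
    'Time': [10, 20, 30, 40],
    'sat1': [2, 3, 12, 2],   'sat2': [4, 1, 8, 1],
    'sat3': [2, 1, 8, 2],    'sat4': [4, 3, 16, 12],
    'val1': [0.1, 1.6, 0.5, 1.0], 'val2': [-1.0, 0.0, 1.1, 1.2],
    'val3': [1.0, 2.1, 0.6, 0.4], 'val4': [2.0, -0.7, 2.0, 3.7],
})

left, right = ['sat1', 'sat2'], ['sat3', 'sat4']

def match_row(row):
    # For each left sat column, find an unused right sat column with the same value,
    # subtract the paired val columns, and count how many matches were found.
    out = {'match_count': 0, 'Reslt_1': np.nan, 'Reslt_2': np.nan}
    used = set()
    for i, lcol in enumerate(left, start=1):
        for rcol in right:
            if rcol not in used and row[lcol] == row[rcol]:
                used.add(rcol)
                out['match_count'] += 1
                out['Reslt_%d' % i] = (row[lcol.replace('sat', 'val')]
                                       - row[rcol.replace('sat', 'val')])
                break
    return pd.Series(out)

result = df1.apply(match_row, axis=1)
print(result)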

Related

Replacing values on cell by multiplier by other row in Pandas

I have the following dataframe:
I want to check whether the value of a cell is 0 for any date. If it is, I want to replace it with the value in the previous day column multiplied by the proper multiplier.
For example, if Day 14 = 0, I want to multiply Day 7 by Mul 14 and store the new value in Day 14, and so on for the whole dataframe.
I have tried this code but it is not working:
if df['day 30'] == 0.00:
    df['day 30'] = df['day 14']*df['Mul 30']
And this is my expected output:
Thanks!
Here is a solution with a small example:
import pandas as pd
import numpy as np

df = pd.DataFrame([[0.8, 0.9, 0.7, 2, 6], [0.6, 0, 0, 2, 3], [0.2, 0, 0, 4, 2]],
                  columns=["Day 7", "Day 14", "Day 30", "Mul 14", "Mul 30"])
print(df)
# Where a day column is 0, replace it with the previous day column times its multiplier
df["Day 14"] = np.where(df["Day 14"] == 0, df["Day 7"] * df["Mul 14"], df["Day 14"])
df["Day 30"] = np.where(df["Day 30"] == 0, df["Day 14"] * df["Mul 30"], df["Day 30"])
print(df)
If you want, you can iterate over the list of day numbers instead of writing individual lines; see the sketch after the output below.
Result of the above code (the first table is the input, the second the result):
Day 7 Day 14 Day 30 Mul 14 Mul 30
0 0.8 0.9 0.7 2 6
1 0.6 0.0 0.0 2 3
2 0.2 0.0 0.0 4 2

Day 7 Day 14 Day 30 Mul 14 Mul 30
0 0.8 0.9 0.7 2 6
1 0.6 1.2 3.6 2 3
2 0.2 0.8 1.6 4 2
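The loop version mentioned above could look like this; a sketch assuming the columns follow the "Day N" / "Mul N" naming pattern:
days = [7, 14, 30]  # extend with the real day numbers as needed
for prev, cur in zip(days, days[1:]):
    # Same rule as above, applied pairwise along the day columns
    df[f"Day {cur}"] = np.where(df[f"Day {cur}"] == 0,
                                df[f"Day {prev}"] * df[f"Mul {cur}"],
                                df[f"Day {cur}"])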

Convert continuous numerical data to discrete numerical data in Pandas

I have a pandas dataframe df with a column having continuous numerical data.
A
0 1.5
1 15.0
2 12.8
3 23.2
4 9.6
I want to replace the continuous values with a numerical value based on the following rules:
0-10=10
10-20=50
20-100=80
The final dataframe obtained should be like this:
A
0 10
1 50
2 50
3 80
4 10
I tried pandas.cut(df, bins=[0,10,20,100], labels=[10,50,80]), but it returns a Categorical column. I need the output column to be numerical.
Add to_numeric to your code:
pd.to_numeric(pd.cut(df['A'], bins=[0,10,20,100], labels=[10,50,80]))
Out[54]:
0 10
1 50
2 50
3 80
4 10
Name: A, dtype: int64
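For reference, a self-contained version of the same idea (the column values are re-typed from the question): pd.cut assigns the bin labels and to_numeric converts the resulting Categorical column back to integers.
import pandas as pd

df = pd.DataFrame({'A': [1.5, 15.0, 12.8, 23.2, 9.6]})
# Bin edges follow the rules 0-10 -> 10, 10-20 -> 50, 20-100 -> 80
df['A'] = pd.to_numeric(pd.cut(df['A'], bins=[0, 10, 20, 100], labels=[10, 50, 80]))
print(df)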

How can I compute my data frame by slicing the index

I have the data as
A=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
B=[3,4,6,8,2,10,2,3,4]
A is my index and B holds the values corresponding to A. I have to group the first three index values, i.e. [0.1,0.3,0.5], and calculate the average of the corresponding B values, i.e. [3,4,6]; similarly the average of the second three values [8,2,10] corresponding to [0.7,0.9,1.1], and again of [2,3,4] corresponding to [1.3,1.5,1.7], and then prepare a table of these three averages. The final data frame should look like
A=[1,2,3]
B=[average 1, average 2, average 3]
If you need to aggregate the mean over every 3 values, build a helper array from the length of the DataFrame with integer division by 3 and pass it to GroupBy.mean:
import pandas as pd
import numpy as np

A=[0.1,0.3,0.5,0.7,0.9,1.1,1.3,1.5,1.7]
B=[3,4,6,8,2,10,2,3,4]
df = pd.DataFrame({'col':B}, index=A)
print (df)
col
0.1 3
0.3 4
0.5 6
0.7 8
0.9 2
1.1 10
1.3 2
1.5 3
1.7 4
df = df.groupby(np.arange(len(df)) // 3).mean()
df.index +=1
print (df)
col
1 4.333333
2 6.666667
3 3.000000
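If the literal A/B layout from the question is wanted, the grouped result can be reshaped afterwards; a small sketch (the column names 'A' and 'B' are taken from the question):
# Name the 1..3 index 'A', move it into a column, and rename 'col' to 'B'
out = df.rename(columns={'col': 'B'}).rename_axis('A').reset_index()
print(out)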

pandas unique values how to iterate as a starting point

Good Morning, (bad beginner)
I have the following pandas dataframe:
My goal is: the first time a new ID appears, set the VALUE column to 1000 * DELTA of that row. For all consecutive rows of that ID, the VALUE is the VALUE of the row above * the DELTA of the current row.
I tried by getting all unique ID values:
a=stocks2.ID.unique()
a.tolist()
It works, unfortunately I do not really know how to iterate in the way I described. Any kind of help or tip would be greatly appreciated!
A way to do it would be as follows. Example dataframe:
df = pd.DataFrame({'ID':[1,1,5,3,3], 'delta':[0.3,0.5,0.2,2,4]}).assign(value=[2,5,4,2,3])
print(df)
ID delta value
0 1 0.3 2
1 1 0.5 5
2 5 0.2 4
3 3 2.0 2
4 3 4.0 3
Fill value from the row above as:
df['value'] = df.shift(1).delta * df.shift(1).value
Group by ID to get the indices where each ID first appears:
w = df.groupby('ID', as_index=False).nth(0).index.values
And compute the values for value using the indices in w:
df.loc[w,'value'] = df.loc[w,'delta'] * 1000
Which gives for this example:
ID delta value
0 1 0.3 300.0
1 1 0.5 0.6
2 5 0.2 200.0
3 3 2.0 2000.0
4 3 4.0 4.0
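Note that the shift-based fill above multiplies by the original value of the row above, not the newly computed one. If each row should instead chain off the previously computed value within its ID, a cumulative product of delta per group is one way to express that; a sketch using the same example frame:
# Seed each ID group with 1000 * delta, then chain by multiplying successive deltas
df['value'] = 1000 * df.groupby('ID')['delta'].cumprod()
print(df)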

Python - Pivot and create histograms from Pandas column, with missing values

Having the following Data Frame:
name value count total_count
0 A 0 1 20
1 A 1 2 20
2 A 2 2 20
3 A 3 2 20
4 A 4 3 20
5 A 5 3 20
6 A 6 2 20
7 A 7 2 20
8 A 8 2 20
9 A 9 1 20
----------------------------------
10 B 0 10 75
11 B 5 30 75
12 B 6 20 75
13 B 8 10 75
14 B 9 5 75
I would like to pivot the data, grouping each row by the name value, then create columns based on the value & count columns aggregated into bins.
Explanation: I have 10 possible values, in the range 0-9, and not all values are present in each group. In the above example, group B is missing the values 1, 2, 3, 4, 7. I would like to create a histogram with 5 bins, ignore missing values, and calculate the percentage of count for each bin. The result should look like this:
name 0-1 2-3 4-5 6-7 8-9
0 A 0.150000 0.2 0.3 0.2 0.150000
1 B 0.133333 0.0 0.4 0.4 0.066667
For example, for bin 0-1 of group A, the calculation is the sum of count for the values 0 and 1 (1+2) divided by the total_count of group A:
name 0-1
0 A (1+2)/20 = 0.15
I was looking into the hist method and this StackOverflow question, but I am still struggling to figure out the right approach.
Use pd.cut to bin your feature, then use df.groupby().count() and the .unstack() method to get the dataframe you are looking for. During the groupby you can use any aggregation function (.sum(), .count(), etc.) to get the results you want. The code below is a worked example.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    data={'name': ['Group A', 'Group B']*5,
          'number': np.arange(0, 10),
          'value': np.arange(30, 40)})
df['number_bin'] = pd.cut(df['number'], bins=np.arange(0,10))
# Option 1: Sums
df.groupby(['number_bin','name'])['value'].sum().unstack(0)
# Option 2: Counts
df.groupby(['number_bin','name'])['value'].count().unstack(0)
The null values in the original data will not affect the result.
To get the exact result you could try this:
bins = range(10)
# Total count per name, used as the denominator
res = df.groupby('name')['count'].sum()
# First interval [-0.001, 1] covers values 0 and 1; the rest are (1,2], (2,3], ..., (8,9]
intervals = pd.cut(df.value, bins=bins, include_lowest=True)
# Fraction of counts per interval and name
df1 = (df.groupby([intervals, "name"])['count'].sum()/res).unstack(0)
df1.columns = df1.columns.astype(str)  # convert the cols to string
df1.columns = ['a','b','c','d','e','f','g','h','i']  # rename the cols
cols = ['a', 'b', 'd', 'f', 'h']
# Pair up neighbouring columns (b+c, d+e, f+g, h+i); 'a' already holds the 0-1 bin
df1 = df1.add(df1.iloc[:,1:].shift(-1, axis=1), fill_value=0)[cols]
print(df1)
You can manually rename the cols later.
# Output:
a b d f h
name
A 0.150000 0.2 0.3 0.200000 0.15
B 0.133333 NaN 0.4 0.266667 0.20
You can replace the NaN values using df1.fillna(0.0)
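For comparison, a sketch of a more direct route (not taken from the answers above): cut the value column straight into the five target ranges with custom edges, then divide the per-bin count sums by each group's total. The frame is re-typed from the question, and the edge values and labels are my own choice:
import pandas as pd

df = pd.DataFrame({
    'name':  ['A']*10 + ['B']*5,
    'value': list(range(10)) + [0, 5, 6, 8, 9],
    'count': [1, 2, 2, 2, 3, 3, 2, 2, 2, 1, 10, 30, 20, 10, 5],
})
df['total_count'] = df.groupby('name')['count'].transform('sum')

labels = ['0-1', '2-3', '4-5', '6-7', '8-9']
edges = [-0.5, 1.5, 3.5, 5.5, 7.5, 9.5]
df['bin'] = pd.cut(df['value'], bins=edges, labels=labels)

# Sum counts per (name, bin), spread the bins into columns, divide by each group's total
out = (df.groupby(['name', 'bin'], observed=False)['count'].sum()
         .unstack('bin')
         .div(df.groupby('name')['total_count'].first(), axis=0)
         .reset_index())
print(out)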
