I want to convert a dataframe which has tuples in its cells into a dataframe with a MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like:
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
I'm unsure about the efficiency, because this relies on the apply method, but you could concat the dataframe with itself, adding a column X set to 'A' in the first copy and 'B' in the second. Then you sort the resulting dataframe by its index, and use apply to change even rows to the first value of the tuple and odd rows to the second:
# tag the copies, append the tag as an index level, and interleave by sorting
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
    'X', append=True).sort_index().rename_axis(index=(None, None))
# even rows take the first tuple element, odd rows the second
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives the expected result:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
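For comparison, a more vectorized sketch of the same reshaping (my own addition, untested beyond the sample data; it assumes every cell holds a 2-tuple and uses the .str accessor, which indexes into tuples stored in object columns, not just strings):

import pandas as pd

test = pd.DataFrame({2: [(0, 2), (0, 4)], 3: [(826.0, 826.0), (4132.0, 4132.0)]})

first = test.apply(lambda col: col.str[0])   # first tuple elements
second = test.apply(lambda col: col.str[1])  # second tuple elements
out = pd.concat({'A': first, 'B': second}).swaplevel().sort_index()

pd.concat with a dict supplies the 'A'/'B' keys as an extra index level, and swaplevel moves it inside the original index.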
I have a data frame where I am trying to get the row with the minimum value. I take the absolute difference of two columns to make a third column, but when I try to get the first or second minimum value of that column I get an error. Is there a better method to get the row with the minimum value of the new column?
df2 = df[[2,3]]
df2[4] = np.absolute(df[2] - df[3])
#lowest = df.iloc[df[6].min()]
2 3 4
0 -111 -104 7
1 -130 110 240
2 -105 -112 7
3 -118 -100 18
4 -147 123 270
5 225 -278 503
6 102 -122 224
Desired result:
     2    3  4
2 -105 -112  7
Get the difference as a Series, apply Series.abs, and then compare with the minimal value using boolean indexing:
s = (df[2] - df[3]).abs()
df = df[s == s.min()]
If you want a new column for the difference:
df['diff'] = (df[2] - df[3]).abs()
df = df[df['diff'] == df['diff'].min()]
Another idea is to get the index of the minimal value with Series.idxmin and then select with DataFrame.loc; to get a one-row DataFrame the double brackets [[]] are necessary:
s = (df[2] - df[3]).abs()
df = df.loc[[s.idxmin()]]
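For a quick check with the sample data (note the tie: rows 0 and 2 both have an absolute difference of 7, so the two approaches give different results):

import pandas as pd

df = pd.DataFrame({2: [-111, -130, -105, -118, -147, 225, 102],
                   3: [-104, 110, -112, -100, 123, -278, -122]})
s = (df[2] - df[3]).abs()

print(df[s == s.min()])      # boolean indexing returns both tied rows, 0 and 2
print(df.loc[[s.idxmin()]])  # idxmin keeps only the first minimum, row 0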
EDIT:
For more dynamic code that converts columns to integers where possible, use:
def int_if_possible(x):
    try:
        return x.astype(int)
    except Exception:
        return x

df = df.apply(int_if_possible)
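A small demonstration with made-up data: a float column that casts cleanly becomes int, while a column that cannot be cast is returned unchanged:

import pandas as pd

df = pd.DataFrame({'a': [826.0, 4132.0], 'b': ['x', 'y']})
df = df.apply(int_if_possible)
print(df.dtypes)  # a becomes int64, b stays object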
I have a DataFrame with segments, timestamps and different value columns:
Segment Timestamp Value1 Value2 Value2_mean
0 2018-11... 180 156 135
0 170 140 135
0 135
1
1
...
I want to group this DataFrame by 'Segment', get the first Timestamp in each segment at which the interval condition below is met, and then the time interval in seconds for that segment. Because the function needs several columns, I think aggregate does not work.
value2_mean-std(value2) <= value1 <= value2_mean+std(value2)
It should look like this:
Segment Intervall[s]
0 10
1 19
2 6
3 ...
I tried something like this:
grouped = dataSeg.groupby(['Segment'])

def grouping(df):
    a = np.array(df['Value1'])
    b = np.array(df['Value2'])
    c = np.array(df['Value2_mean'])
    d = np.array(df['Timestamp'])
    for x in a:
        categories = np.logical_and(
            (c - np.std(b) <= x),
            (c + np.std(b) >= x))
        if np.any(categories):
            return d[categories] - d[0]

grouped.apply(grouping)
This does not work the way I want it to. Any suggestions would be appreciated!
Something like this? I didn't test it thoroughly.
import numpy as np

def calc(grp):
    # any row where |Value1 - Value2_mean| is below the std of Value2?
    if grp.Value1.sub(grp.Value2_mean).abs().lt(grp.Value2.std()).any():
        # interval between the segment's last and first timestamp
        return grp["Timestamp"].iloc[-1] - grp["Timestamp"].iloc[0]
    return np.nan

df.groupby("Segment").apply(calc)
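Assuming Timestamp is a datetime64 column (the question's timestamps are truncated, so this is a guess), the returned differences are Timedeltas, which you can convert to seconds to match the desired Intervall[s] column:

result = df.groupby("Segment").apply(calc).dt.total_seconds()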
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
it wouldn't work. Firstly, you're using Data rather than data. Secondly, even with that fixed, you'd be comparing a scalar against an array, which raises a ValueError because the truth value of a Series is ambiguous. Thirdly, you're assigning to the entire column on each iteration, overwriting it.
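To illustrate the second point, a minimal example (mine, not from the question) of why a scalar-vs-Series comparison fails inside an if:

import pandas as pd

s = pd.Series([3, 23, 34])
print(10 >= s)   # fine: elementwise comparison, returns a boolean Series
try:
    if 10 >= s:  # if needs a single truth value, not a Series
        pass
except ValueError as e:
    print(e)     # "The truth value of a Series is ambiguous. ..."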
You need to access the index label, which your loop didn't; you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, your code is almost fine: you simply capitalized your dataframe name as 'Data' instead of 'data' (though, as noted above, the scalar comparison would still be a problem).
However, for efficient code, EdChum has a great answer above. Or here is another vectorised method with code that is easy to remember:
import numpy as np

data['Z'] = np.where(data.X >= data.Y, 1, 0)  # 1 where X >= Y, else 0
Apologies if the problem is trivial, but as a Python newbie I wasn't able to find the right solution.
I have two dataframes, and I need to add a column to the first dataframe that is True if a certain value of the first dataframe falls between two values of the second dataframe, and False otherwise.
for example:
first_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2':[10,22,15,15,7,130,2]})
second_df = pd.DataFrame({'code1':[1,1,2,2,3,1,1],'code2_start':[5,20,11,11,5,110,220],'code2_end':[15,25,20,20,10,120,230]})
first_df
code1 code2
0 1 10
1 1 22
2 2 15
3 2 15
4 3 7
5 1 130
6 1 2
second_df
code1 code2_end code2_start
0 1 15 5
1 1 25 20
2 2 20 11
3 2 20 11
4 3 10 5
5 1 120 110
6 1 230 220
For each row in the first dataframe I should check whether the value in the code2 column falls within one of the possible ranges identified by the rows of second_df. For example:
in row 1 of first_df, code1=1 and code2=22.
Checking second_df, I have 4 rows with code1=1 (rows 0, 1, 5 and 6); the value code2=22 is in the interval identified by code2_start=20 and code2_end=25, so the function should return True.
Considering an example where the function should return False:
in row 5 of first_df, code1=1 and code2=130,
but there is no interval containing 130 where code1=1.
I have tried to use this function:

def check(first_df, second_df):
    for i in range(len(first_df)):
        return ((second_df.code2_start <= first_df.code2[i]) &
                (second_df.code2_end <= first_df.code2[i]) &
                (second_df.code1 == first_df.code1[i])).any()
and to vectorize it
first_df['output'] = np.vectorize(check)(first_df, second_df)
but obviously with no success.
I would be happy for any input you could provide.
thx.
A.
As a practical example:
first_df.code1[0] = 1
therefore I need to search second_df for all the instances where
second_df.code1 == first_df.code1[0]
0 True
1 True
2 False
3 False
4 False
5 True
6 True
For the instances 0, 1, 5 and 6, where the status is True, I need to check if the value
first_df.code2[0]
10
falls within one of the ranges identified by
second_df[second_df.code1 == first_df.code1[0]][['code2_start','code2_end']]
code2_start code2_end
0 5 15
1 20 25
5 110 120
6 220 230
Since the value of first_df.code2[0] is 10, it is between 5 and 15, the range identified by row 0, therefore my function should return True. In the case of first_df.code1[6] the value would still be 1, so the range table would be the same as above, but first_df.code2[6] is 2 in this case, and there is no interval containing 2, therefore the result should be False.
first_df['output'] = (second_df.code2_start <= first_df.code2) & (second_df.code2_end >= first_df.code2)
This works because an expression like second_df.code2_start <= first_df.code2 yields a boolean Series. If you then perform a logical AND on two of these boolean Series, you get a Series which has the value True where both Series were True, and False otherwise.
Here's an example:
>>> import pandas as pd
>>> a = pd.DataFrame([{1:2,2:4,3:6},{1:3,2:6,3:9},{1:4,2:8,3:10}])
>>> a['output'] = (a[2] <= a[3]) & (a[2] >= a[1])
>>> a
1 2 3 output
0 2 4 6 True
1 3 6 9 True
2 4 8 10 True
EDIT:
So based on your updated question and my new interpretation of your problem, I would do something like this:
import pandas as pd
# Define some data to work with
df_1 = pd.DataFrame([{'c1':1,'c2':5},{'c1':1,'c2':10},{'c1':1,'c2':20},{'c1':2,'c2':8}])
df_2 = pd.DataFrame([{'c1':1,'start':3,'end':6},{'c1':1,'start':7,'end':15},{'c1':2,'start':5,'end':15}])
# Function checks if c2 value is within any range matching c1 value
def checkRange(x, code_range):
    idx = code_range.c1 == x.c1
    code_range = code_range.loc[idx]
    check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
    return check.any()
# Apply the checkRange function to each row of the DataFrame
df_1['output'] = df_1.apply(lambda x: checkRange(x, df_2), axis=1)
What I do here is define a function called checkRange which takes as input x, a single row of df_1, and code_range, the entire df_2 DataFrame. It first finds the rows of code_range which have the same c1 value as the given row's x.c1; then the non-matching rows are discarded. This is done in the first two lines:
idx = code_range.c1 == x.c1
code_range = code_range.loc[idx]
Next, we get a boolean Series which tells us if x.c2 falls within any of the ranges given in the reduced code_range DataFrame:
check = (code_range.start <= x.c2) & (code_range.end >= x.c2)
Finally, since we only care that the x.c2 falls within one of the ranges, we return the value of check.any(). When we call any() on a boolean Series, it will return True if any of the values in the Series are True.
To call the checkRange function on each row of df_1, we can use apply(). I define a lambda expression in order to pass both the row and df_2 to the checkRange function. axis=1 means that the function will be called on each row (instead of each column) of the DataFrame.
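With the sample frames above, this should produce the following (a quick sanity check):

print(df_1)
#    c1  c2  output
# 0   1   5    True
# 1   1  10    True
# 2   1  20   False
# 3   2   8    True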