construct dataframe from a series of keys and a key:value dataframe - python

I have a pandas series of keys and would like to create a dataframe by selecting values from other dataframes.
eg.
data_df = pandas.DataFrame({'key':    ['a','b','c','d','e','f'],
                            'value1': [1.1,2,3,4,5,6],
                            'value2': [7.1,8,9,10,11,12]})
keys = pandas.Series(['a','b','a','c','e','f','a','b','c'])
data_df
# key value1 value2
#0 a 1.1 7.1
#1 b 2.0 8.0
#2 c 3.0 9.0
#3 d 4.0 10.0
#4 e 5.0 11.0
#5 f 6.0 12.0
I would like to get the result like this
result
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
One way I have done this successfully is:
def append_to_series(key):
    new_series = data_df[data_df['key'] == key].iloc[0]
    return new_series
result = pandas.DataFrame(keys.apply(append_to_series))
However, this function is very slow and not clean. Is there a way to do this more efficiently?

Convert the series into a dataframe with the column name key, then use pd.merge() to merge in value1 and value2:
keys = pd.DataFrame(['a','b','a','c','e','f','a','b','c'],columns=['key'])
res = pd.merge(keys,data_df,on=['key'],how='left')
print(res)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
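A small hedged addition, not in the original answer: with how='left', merge preserves the row order of keys, and the validate argument can guard against accidental duplicates in data_df['key']:
# raises MergeError if data_df['key'] contains duplicate keys
res = pd.merge(keys, data_df, on=['key'], how='left', validate='many_to_one')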

Create an index from the key column and then use DataFrame.reindex or DataFrame.loc.
Note: the values of the original key column must be unique.
df = data_df.set_index('key').reindex(keys.rename('key')).reset_index()
Or:
df = data_df.set_index('key').loc[keys].reset_index()
print(df)
key value1 value2
0 a 1.1 7.1
1 b 2.0 8.0
2 a 1.1 7.1
3 c 3.0 9.0
4 e 5.0 11.0
5 f 6.0 12.0
6 a 1.1 7.1
7 b 2.0 8.0
8 c 3.0 9.0
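One hedged difference worth knowing, not from the original answers: if keys contains a value absent from data_df['key'], the two approaches behave differently. A minimal sketch:
import pandas as pd

data_df = pd.DataFrame({'key': ['a', 'b', 'c'],
                        'value1': [1.1, 2.0, 3.0]})
keys = pd.Series(['a', 'z'])  # 'z' is not in data_df['key']

# reindex fills the unknown key with NaN:
print(data_df.set_index('key').reindex(keys.rename('key')).reset_index())
#   key  value1
# 0   a     1.1
# 1   z     NaN

# .loc instead raises a KeyError for 'z' on recent pandas versions:
# data_df.set_index('key').loc[keys]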

Related

How to delete rows based on change in variable in pandas dataframe

I've got a dataset with an insanely high sampling rate, and would like to remove excess data where the column value changes by less than a predefined amount down through the dataset. However, some intermediary points need to be kept in order not to lose all the data.
e.g.
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
5 3.7 4.2
6 3.8 4.6
7 4.4 5.4
8 5.1 6.0
9 6.0 7.0
10 7.0 10.0
Now I want to delete all the rows where the change in V from one row to another is less than dV, AND the change in t is below dt, but still keep datapoints such that there is data at roughly every interval dV or dt.
Let's say for dV = 1 and dt = 1, the wanted output would be:
t V
0 1.0 1.0
1 2.0 1.2
2 3.0 2.0
3 3.3 3.0
4 3.4 4.0
7 4.4 5.4
9 6.0 7.0
10 7.0 10.0
Meaning rows 5, 6 and 8 were deleted since their changes were within the thresholds, but row 7 remains since its change exceeds dt and dV in both directions.
The easy solution is iterating over the rows in the dataframe, but a faster (and more proper) solution is wanted.
EDIT:
The question was edited to reflect the point that intermediary points must be kept in order to not delete too much.
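Not in the original answers, but as a reference point: a minimal iterative sketch that implements the edited requirement, keeping a row whenever the accumulated change in t or V since the last kept row reaches dt or dV. On the sample data this reproduces the wanted output above (slow, since it loops):
import pandas as pd

def downsample(df, dt=1.0, dV=1.0):
    # always keep the first row, then keep a row once the accumulated
    # change since the last kept row reaches dt or dV
    keep = [df.index[0]]
    last_t = df['t'].iloc[0]
    last_V = df['V'].iloc[0]
    for idx, row in df.iloc[1:].iterrows():
        if abs(row['t'] - last_t) >= dt or abs(row['V'] - last_V) >= dV:
            keep.append(idx)
            last_t, last_V = row['t'], row['V']
    return df.loc[keep]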
Use DataFrame.diff with boolean indexing:
dV = 1
dt = 1
df = df[~(df['t'].diff().lt(dt) & df['V'].diff().lt(dV))]
print(df)
      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
9   6.0   7.0
10  7.0  10.0
Or:
dV = 1
dt = 1
df1 = df.diff()
df = df[df1['t'].fillna(dt).ge(dt) | df1['V'].fillna(dV).ge(dV)]
print(df)
      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
9   6.0   7.0
10  7.0  10.0
You might want to use the shift() method:
diff_df = df - df.shift()
and then filter rows with loc (note that & requires parentheses around each comparison):
df = df.loc[~((diff_df['V'] < 1.0) & (diff_df['t'] < 1.0))]
You can use loc for boolean indexing and do the comparison between the values between rows within each column using shift():
# Thresholds
dv = 1
dt = 1
# Filter out rows where both deltas are below their thresholds
print(df.loc[~((df.V.sub(df.V.shift()) < dv) & (df.t.sub(df.t.shift()) < dt))])
      t     V
0   1.0   1.0
1   2.0   1.2
2   3.0   2.0
3   3.3   3.0
4   3.4   4.0
9   6.0   7.0
10  7.0  10.0

How to add the values of one smaller DataFrame to part of another mixed type DataFrame, but only to rows after some arbitrary row index?

I have two .csv files, one contains what could be described as a header and a body. The header contains data like the total number of rows, datetime, what application generated the data, and what line the body starts on. The second file contains a single row.
>>> import pandas as pd
>>> df = pd.read_csv("data.csv", names=list('abcdef'))
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>> df2 = pd.read_csv("extra_data.csv")
>>> df2
a b c
0 6.0 5.0 4.0
>>> row = df2.loc[0]
>>>
I am having trouble modifying the 'a', 'b' and 'c' columns and then saving the DataFrame to a new .csv file.
I have tried adding the row by way of slicing and the addition operator but this did not work:
>>> df[5:,'a':'c'] += row
TypeError: '(slice(5, None, None), slice('a', 'c', None))' is an invalid key
>>>
I also tried the answer I found here, but this gave a similar error:
>>> df[5:,row.index] += row
TypeError: '(slice(5, None, None), Index(['a', 'b', 'c'], dtype='object'))' is an invalid key
>>>
I suspect the problem is coming from object dtypes so I tried converting a subframe to the float type:
>>> sub_section = df.loc[5:,['a','b','c']].astype(float)
>>> sub_section
a b c
5 0.0 1.0 2.0
6 0.0 1.0 2.0
7 0.0 1.0 2.0
8 0.0 1.0 2.0
9 0.0 1.0 2.0
10 0.0 1.0 2.0
11 0.0 1.0 2.0
>>> sub_section += row
>>> sub_section
a b c
5 6.0 6.0 6.0
6 6.0 6.0 6.0
7 6.0 6.0 6.0
8 6.0 6.0 6.0
9 6.0 6.0 6.0
10 6.0 6.0 6.0
11 6.0 6.0 6.0
>>> df
a b c d e f
0 data start row 5 NaN NaN NaN NaN
1 row count 7 NaN NaN NaN NaN
2 made by foo.exe NaN NaN NaN NaN
3 date 01-01-2000 NaN NaN NaN NaN
4 a b c d e f
5 0.0 1.0 2.0 3.0 4.0 5.0
6 0.0 1.0 2.0 3.0 4.0 5.0
7 0.0 1.0 2.0 3.0 4.0 5.0
8 0.0 1.0 2.0 3.0 4.0 5.0
9 0.0 1.0 2.0 3.0 4.0 5.0
10 0.0 1.0 2.0 3.0 4.0 5.0
11 0.0 1.0 2.0 3.0 4.0 5.0
>>>
Obviously, in this case df.loc[] is returning a copy, and then modifying the copy does nothing to the df.
How do I modify parts of a DataFrame (dtype=object) and then save the changes?
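A hedged sketch of one way to do this: plain [] indexing does not accept a row/column pair (hence the TypeError above), while .loc does, and assignment through .loc writes back into the original frame:
# convert the body slice to float, add the row (pandas aligns it on
# columns 'a', 'b', 'c'), then assign back through .loc so df changes
df.loc[5:, ['a', 'b', 'c']] = df.loc[5:, ['a', 'b', 'c']].astype(float) + row
# save without the artificial header so the file layout is preserved
# ("new_data.csv" is a hypothetical output name)
df.to_csv("new_data.csv", index=False, header=False)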

How to add the sum of some column at the end of dataframe

I have a pandas dataframe with 11 columns. I want to add the sum of all values of column 9 and column 10 to the end of the table. So far I have tried two methods:
Assigning the data to a cell with dataframe.iloc[rownumber, 8]. This results in an out-of-bounds error.
Creating a vector with some blanks ('') using the following code:
total = ['', '', '', '', '', '', '', '', dataframe['Column 9'].sum(), dataframe['Column 10'].sum(), '']
dataframe = dataframe.append(total)
The result was not nice, as it added the totals as a vertical vector at the end rather than a horizontal one. What can I do to solve the issue?
You need to use pandas.DataFrame.append with ignore_index=True, so use:
dataframe=dataframe.append(dataframe[['Column 9','Column 10']].sum(),ignore_index=True).fillna('')
Example:
import pandas as pd
import numpy as np
df=pd.DataFrame()
df['col1']=[1,2,3,4]
df['col2']=[2,3,4,5]
df['col3']=[5,6,7,8]
df['col4']=[5,6,7,8]
Using Append:
df=df.append(df[['col2','col3']].sum(),ignore_index=True)
print(df)
col1 col2 col3 col4
0 1.0 2.0 5.0 5.0
1 2.0 3.0 6.0 6.0
2 3.0 4.0 7.0 7.0
3 4.0 5.0 8.0 8.0
4 NaN 14.0 26.0 NaN
Without NaN values:
df=df.append(df[['col2','col3']].sum(),ignore_index=True).fillna('')
print(df)
col1 col2 col3 col4
0 1 2.0 5.0 5
1 2 3.0 6.0 6
2 3 4.0 7.0 7
3 4 5.0 8.0 8
4 14.0 26.0
Create a new DataFrame with the sums. This example DataFrame has columns 'a' and 'b'; df1 is the DataFrame that needs to be summed up and df3 is a one-line DataFrame containing only the sums:
data = [[df1.a.sum(),df1.b.sum()]]
df3 = pd.DataFrame(data,columns=['a','b'])
Then append it to end:
df1.append(df3)
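A hedged side note: DataFrame.append was removed in pandas 2.0, so on current versions the same operation is spelled with pd.concat:
# pandas >= 2.0 equivalent of df1.append(df3); ignore_index renumbers the rows
df1 = pd.concat([df1, df3], ignore_index=True)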
Simply try this (replace test with your dataframe name).
Row-wise sum (which you asked for):
test['Total'] = test[['col9','col10']].sum(axis=1)
print(test)
Column-wise sum:
test.loc['Total'] = test[['col9','col10']].sum()
test.fillna('',inplace=True)
print(test)
IIUC, this is what you need (change the numbers 8 & 9 to suit your needs):
df['total'] = df.iloc[:, [8, 9]].sum(axis=1)   # horizontal (row-wise) sum
df['total1'] = df.iloc[:, [8, 9]].sum().sum()  # grand total of columns 8 & 9, broadcast to every row
df.loc['total2'] = df.iloc[:, [8, 9]].sum()    # vertical sums of columns 8 & 9 appended as a row
Example
import numpy as np
import pandas as pd

a = np.arange(0, 11, 1)
b = np.random.randint(10, size=(5, 11))
df = pd.DataFrame(columns=a, data=b)
0 1 2 3 4 5 6 7 8 9 10
0 0 5 1 3 4 8 6 6 8 1 0
1 9 9 8 9 9 2 3 8 9 3 6
2 5 7 9 0 8 7 8 8 7 1 8
3 0 7 2 8 8 3 3 0 4 8 2
4 9 9 2 5 2 2 5 0 3 4 1
Output:
0 1 2 3 4 5 6 7 8 9 10 total total1
0 0.0 5.0 1.0 3.0 4.0 8.0 6.0 6.0 8.0 1.0 0.0 9.0 48.0
1 9.0 9.0 8.0 9.0 9.0 2.0 3.0 8.0 9.0 3.0 6.0 12.0 48.0
2 5.0 7.0 9.0 0.0 8.0 7.0 8.0 8.0 7.0 1.0 8.0 8.0 48.0
3 0.0 7.0 2.0 8.0 8.0 3.0 3.0 0.0 4.0 8.0 2.0 12.0 48.0
4 9.0 9.0 2.0 5.0 2.0 2.0 5.0 0.0 3.0 4.0 1.0 7.0 48.0
total2 NaN NaN NaN NaN NaN NaN NaN NaN 31.0 17.0 NaN NaN NaN

Negating column values and adding particular values in only some columns in a Pandas Dataframe

Given a Pandas dataframe df, I would like to be able to both negate the value in particular columns for all rows and also add another value. The value to be added is a fixed additive for each of the columns.
I believe I could copy df, say dfcopy = df.copy(), set all cell values in dfcopy to the particular number and then subtract df from dfcopy, but I am hoping for a simpler way.
I am thinking that I need to somehow modify
df.iloc[:, [0,3,4]]
So for example of how this should look:
A B C D E
0 1.0 3.0 1.0 2.0 7.0
1 2.0 1.0 8.0 5.0 3.0
2 1.0 1.0 1.0 1.0 6.0
Then negating only those values in columns (0,3,4) and then adding 10 (for example) we would have:
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Thanks.
You can first multiply by -1 with mul and then add 10 with add for those columns we select with iloc:
df.iloc[:, [0,3,4]] = df.iloc[:, [0,3,4]].mul(-1).add(10)
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
Or as anky_91 suggested in the comments:
df.iloc[:, [0,3,4]] = 10-df.iloc[:,[0,3,4]]
A B C D E
0 9.0 3.0 1.0 8.0 3.0
1 8.0 1.0 8.0 5.0 7.0
2 9.0 1.0 1.0 9.0 4.0
pandas is very intuitive in letting you perform these operations.
Negate:
df.iloc[:, [0,2,7,10,11]] = -df.iloc[:, [0,2,7,10,11]]
Add a constant c:
df.iloc[:, [0,2,7,10,11]] = df.iloc[:, [0,2,7,10,11]] + c
Or change to a constant value c:
df.iloc[:, [0,2,7,10,11]] = c
and any other arithmetic you can think of.
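For completeness, a minimal runnable setup for the example above (column positions 0, 3, 4 correspond to 'A', 'D', 'E'):
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 1.0],
                   'B': [3.0, 1.0, 1.0],
                   'C': [1.0, 8.0, 1.0],
                   'D': [2.0, 5.0, 1.0],
                   'E': [7.0, 3.0, 6.0]})
# negate columns A, D, E and add 10 in one step
df.iloc[:, [0, 3, 4]] = 10 - df.iloc[:, [0, 3, 4]]
print(df)  # matches the expected result shown above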

Get max row of pandas dataframe without a tiebreaker policy

I have a pandas dataframe similar to the following and I want to group it by "group" and for each group (a, b, c & d) get the max values based on the "value1" column.
In case of a tie, I want to get all rows that tied at the top.
So, for this example input:
group value1 value2
0 a 1.1 7.1
1 a 2.0 8.0
2 a 3.0 9.0
3 b 4.0 10.0
4 b 4.0 11.0
5 b 6.0 12.0
6 c 7.0 43.0
7 c 8.0 12.0
8 d 9.0 34.0
9 d 1.0 5.0
10 d 2.0 6.0
11 d 9.0 2.0
12 d 4.0 3.0
I would like to get this:
group value1 value2
2 a 3.0 9.0
5 b 6.0 12.0
7 c 8.0 12.0
8 d 9.0 34.0
11 d 9.0 2.0 # no tiebreaker policy
This is what I have so far but head(1) does not deliver. What has to go in there instead?
temp_df.sort_values('temp', ascending=False).groupby('Node').head(1)
Use apply with a lambda that compares each value against the group's max; this returns a boolean mask that you can use to filter the original df:
In [125]:
df[df.groupby('group')['value1'].apply(lambda x: x == x.max())]
Out[125]:
group value1 value2
2 a 3.0 9.0
5 b 6.0 12.0
7 c 8.0 12.0
8 d 9.0 34.0
11 d 9.0 2.0
Here is the mask:
In [126]:
df.groupby('group')['value1'].apply(lambda x: x == x.max())
Out[126]:
0 False
1 False
2 True
3 False
4 False
5 True
6 False
7 True
8 True
9 False
10 False
11 True
12 False
Name: value1, dtype: bool
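A hedged alternative not shown in the original answer: groupby().transform('max') broadcasts each group's maximum back onto the original index, avoiding apply and typically running faster:
# keep every row whose value1 equals its group's maximum (ties included)
df[df['value1'] == df.groupby('group')['value1'].transform('max')]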
