Original data:
Data Set
Day 100 200 300 400
1 2 4 6 8
2 3 5 7 9
3 4 6 8 10
Desired output:
Day Lookup Val Value
1 100 2
2 400 9
3 200 6
Basically, I am trying to figure out how to create the second dataframe, which queries the first based on the lookup value provided. Thanks in advance.
Below is one possible solution. For very large data frames it may be slow, but it should work fine with average-size data. I am assuming that the lookup info is loaded into some data frame. Please let me know if you have questions.
days = [1,2,3]
_100 = [2,3,4]
_200 = [4,5,6]
_300 = [6,7,8]
_400 = [8,9,10]
# this is your actual data
df = pd.DataFrame({"Day":days, "100": _100, "200": _200, "300": _300, "400": _400})
df.set_index("Day", inplace=True)
days = [1,2,3]
lookup = [100,400,200]
#this is your look up table
dfLookup = pd.DataFrame({"Day":days, "Lookup Val": lookup})
def get_value(row):
    row_id = row["Day"]
    lookup_id = str(row["Lookup Val"])
    dfRow = df.loc[row_id].copy()
    dfRow = pd.DataFrame(dfRow)
    dfRow.reset_index(inplace=True)
    dfRow.columns = ["lookup", "values"]
    dictLookUp = dict(zip(dfRow["lookup"], dfRow["values"]))
    value = dictLookUp[lookup_id]
    dfRow = None
    return value

dfLookup["Value"] = dfLookup.apply(get_value, axis=1)
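For larger frames, a vectorized alternative to the row-wise apply is to `stack()` the wide table into a Series keyed by (Day, column name) and select all lookup pairs at once. This is a sketch assuming the same string column labels and lookup frame as above:

```python
import pandas as pd

days = [1, 2, 3]
df = pd.DataFrame({"Day": days, "100": [2, 3, 4], "200": [4, 5, 6],
                   "300": [6, 7, 8], "400": [8, 9, 10]}).set_index("Day")
dfLookup = pd.DataFrame({"Day": days, "Lookup Val": [100, 400, 200]})

# stack() turns the wide table into a Series indexed by (Day, column name),
# so every (Day, Lookup Val) pair can be fetched in one vectorized selection
stacked = df.stack()
keys = list(zip(dfLookup["Day"], dfLookup["Lookup Val"].astype(str)))
dfLookup["Value"] = stacked.loc[keys].to_numpy()
```

Since the columns are stored as strings ("100", "200", ...), the lookup values are cast with `astype(str)` before pairing.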
Related
For a data frame with 2 columns, "series1" and "series2", I want to find, for each value in series1, the first index at which series2 exceeds that value by some threshold. I'm not sure how else to design the algorithm than to iterate through series1 and find the minimum qualifying index on each iteration.
Having 50M rows, it takes days to compute. Is there perhaps a different approach that I'm not seeing?
def pandas_version(df: pd.DataFrame, signal_series: pd.Series, threshold_max, threshold_min):
    results = []
    series1 = df[df.columns[0]]
    series2 = df[df.columns[1]]
    # select only values of interest
    selection = signal_series[signal_series >= threshold_max]
    series1 = series1.loc[selection.index]
    series2 = series2.loc[selection.index[0]:]
    for idx_dt, val in zip(series1.index, series1):
        window_series2 = series2.loc[idx_dt:]
        results.append(window_series2[window_series2 > val + threshold_min].index.min())
    return pd.Series(results, index=selection.index, copy=False)
s1 = [i for i in range(10)]
s2 = [2 * i for i in range(10)]
signal = s1  # or something else
df = pd.DataFrame({"series1": s1, "series2": s2})
signal_series = pd.Series(signal)
pandas_version(df, signal_series, 3, 2)
Output:
3 3
4 4
5 5
6 6
7 7
8 8
9 9
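If series2 happens to be sorted (non-decreasing), the per-row loop can be replaced with a single `np.searchsorted` call, which should scale far better than days of iteration. This is only a sketch under that assumption; it also assumes the selected rows are contiguous from the first selected index (as in the toy data above) and that a qualifying value always exists for each row:

```python
import numpy as np
import pandas as pd

# same toy data as above
s1 = list(range(10))
s2 = [2 * i for i in range(10)]
df = pd.DataFrame({"series1": s1, "series2": s2})
signal_series = pd.Series(s1)
threshold_max, threshold_min = 3, 2

selection = signal_series[signal_series >= threshold_max]
series1 = df["series1"].loc[selection.index]
series2 = df["series2"].loc[selection.index[0]:]

# for sorted series2, searchsorted finds for all targets at once the first
# position whose value strictly exceeds series1 + threshold_min
vals = series2.to_numpy()
targets = series1.to_numpy() + threshold_min
pos = np.searchsorted(vals, targets, side="right")
# the search window for the i-th row starts at that row itself, so clamp from below
pos = np.maximum(pos, np.arange(len(targets)))
result = pd.Series(series2.index.to_numpy()[pos], index=selection.index)
```

On the toy data this reproduces the output above without any Python-level loop.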
I have scraped a webpage table, and the table items are in a sequential 1D list, with repeated headers. I want to reconstitute the table into a DataFrame.
I have an algorithm to do this, but I'd like to know if there is a more pythonic/efficient way to achieve this? NB. I don't necessarily know how many columns there are in my table. Here's an example:
input = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
output = {}
it = iter(input)
val = next(it)
while val:
    if val in output:
        output[val].append(next(it))
    else:
        output[val] = [next(it)]
    val = next(it, None)
df = pd.DataFrame(output)
print(df)
with the result:
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
If your data is always "well behaved", then something like this should suffice:
import pandas as pd
data = ['A',1,'B',5,'C',9,
'A',2,'B',6,'C',10,
'A',3,'B',7,'C',11,
'A',4,'B',8,'C',12]
result = {}
for k, v in zip(data[::2], data[1::2]):
    result.setdefault(k, []).append(v)
df = pd.DataFrame(result)
You can also use numpy reshape:
import numpy as np
cols = sorted(set(data[::2]))
df = pd.DataFrame(np.reshape(data, (len(data) // (len(cols) * 2), len(cols) * 2)).T[1::2].T, columns=cols)
A B C
0 1 5 9
1 2 6 10
2 3 7 11
3 4 8 12
Explanation:
# get columns
cols = sorted(set(data[::2]))
# reshape the flat list into a 2D array, one table row per line
shape = (len(data) // (len(cols) * 2), len(cols) * 2)
np.reshape(data, shape)
# get only the values of the data
.T[1::2].T
# this transposes the array, slices every second row (the values), then transposes back
I am working with a database that looks like the below. For each fruit (just apple and pears below, for conciseness), we have:
1. yearly sales,
2. current sales,
3. monthly sales and
4. the standard deviation of sales.
Their ordering may vary, but it's always 4 values per fruit.
dataset = {'apple_yearly_avg': [57],
'apple_sales': [100],
'apple_monthly_avg':[80],
'apple_st_dev': [12],
'pears_monthly_avg': [33],
'pears_yearly_avg': [35],
'pears_sales': [40],
'pears_st_dev':[8]}
df = pd.DataFrame(dataset).T  # transpose
df = df.reset_index()  # clear index
df.columns = ['Description', 'Value']  # name the 2 columns
I would like to perform two sets of operations.
For the first set of operations, we isolate a fruit price, say 'pears', and subtract each average sales from current sales.
df_pear = df[df.loc[:, 'Description'].str.contains('pear')]
df_pear['temp'] = df_pear['Value'].where(df_pear.Description.str.contains('sales')).bfill()
df_pear['some_op'] = df_pear['Value'] - df_pear['temp']
The above works by creating a temporary column holding pear_sales of 40, backfilling it, and then using it to subtract values.
Question 1: is there a cleaner way to perform this operation without a temporary column? Also, I do get the common warning saying I should use '.loc[row_indexer, col_indexer]', even though the output still works.
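One way to drop the temporary column (a sketch mirroring the asker's code above, rebuilding df_pear from the dataset so it is self-contained) is to pull the sales figure out as a scalar before subtracting:

```python
import pandas as pd

dataset = {'apple_yearly_avg': [57], 'apple_sales': [100],
           'apple_monthly_avg': [80], 'apple_st_dev': [12],
           'pears_monthly_avg': [33], 'pears_yearly_avg': [35],
           'pears_sales': [40], 'pears_st_dev': [8]}
df = pd.DataFrame(dataset).T.reset_index()
df.columns = ['Description', 'Value']

# .copy() also silences the SettingWithCopyWarning mentioned in the question
df_pear = df[df['Description'].str.contains('pear')].copy()
# grab the scalar sales figure once instead of backfilling a temp column
sales = df_pear.loc[df_pear['Description'].str.contains('sales'), 'Value'].iloc[0]
df_pear['some_op'] = df_pear['Value'] - sales
```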
For the second set of operations, I need to add 5 rows equal to 'new_purchases' to the bottom of the dataframe, and then fill df_pear['some_op'] with sales * (1 + std_dev * some_multiplier).
df_pear['temp2'] = df_pear['Value'].where(df_pear['Description'].str.contains('st_dev')).bfill()
new_purchases = 5
for i in range(new_purchases):
    df_pear = df_pear.append(df_pear.iloc[-1])  # appends 5 copies of the last row
counter = 1
for i in range(len(df_pear) - 1, len(df_pear) - new_purchases, -1):  # backward loop from the bottom
    df_pear.some_op.iloc[i] = df_pear['temp'].iloc[0] * (1 + df_pear['temp2'].iloc[i] * counter)
    counter += 1
This 'backwards' loop achieves it, but again I'm worried about readability, since another temporary column is created and the indexing is rather ugly.
Thank you.
I think there is a cleaner way to perform both of your tasks, for each fruit in one go:
Add 2 columns, Fruit and Descr, the result of splitting of Description at the first "_":
df[['Fruit', 'Descr']] = df['Description'].str.split('_', n=1, expand=True)
To see the result you may print df now.
Define the following function to "reformat" the current group:
def reformat(grp):
    wrk = grp.set_index('Descr')
    sal = wrk.at['sales', 'Value']
    dev = wrk.at['st_dev', 'Value']
    avg = wrk.at['yearly_avg', 'Value']
    # Subtract the yearly average
    wrk['some_op'] = wrk.Value - avg
    # New rows
    wrk2 = pd.DataFrame([wrk.loc['st_dev']] * 5).assign(
        some_op=[sal * (1 + dev * i) for i in range(5, 0, -1)])
    return pd.concat([wrk, wrk2])  # Old and new rows
Apply this function to each group, grouped by Fruit, drop the Fruit column, and save the result back to df:
df = df.groupby('Fruit').apply(reformat)\
.reset_index(drop=True).drop(columns='Fruit')
Now, when you print(df), the result is:
Description Value some_op
0 apple_yearly_avg 57 0
1 apple_sales 100 43
2 apple_monthly_avg 80 23
3 apple_st_dev 12 -45
4 apple_st_dev 12 6100
5 apple_st_dev 12 4900
6 apple_st_dev 12 3700
7 apple_st_dev 12 2500
8 apple_st_dev 12 1300
9 pears_monthly_avg 33 -2
10 pears_sales 40 5
11 pears_yearly_avg 35 0
12 pears_st_dev 8 -27
13 pears_st_dev 8 1640
14 pears_st_dev 8 1320
15 pears_st_dev 8 1000
16 pears_st_dev 8 680
17 pears_st_dev 8 360
Edit
I'm in doubt whether Description should also be replicated to the new rows from the "st_dev" row. If you want some other content there, set it in the reformat function, after wrk2 is created.
My DataFrame collected from the dataset1.xlsx looks like this:
TimePoint Object 0 Object 1 Object 2 Object 3 Object 4 Object 0 Object 1 Object 2 Object 3 Object 4
0 10 4642.99 2000.71 4869.52 4023.69 3008.99 11188.15 2181.62 12493.47 10275.15 8787.99
1 20 4640.09 2005.17 4851.07 4039.73 3007.16 11129.38 2172.37 12438.31 10218.92 8723.45
Problem:
The data contains duplicate column names; I need to aggregate them to find each occurrence and then initialize the IDA and IAA values for each Object.
Based on these new values I need to calculate the Fc and EAPP values, so the final Excel output should look like this:
TimePoint Objects IDA IAA Fc (using IDA- (a * IAA)) EAPP (using Fc/ (Fc + (G *Fc)))
10 Object 0 4642.99 11188.15 3300.412 0.463177397
10 Object 1 2000.71 2181.62 -527.78758 1
10 Object 2 4869.52 12493.47 4869.52 1
10 Object 3 4023.69 10275.15 4023.69 1
10 Object 4 3008.99 8787.99 3008.99 1
20 Object 0 4640.09 11129.38 4640.09 1
20 Object 1 2005.17 2172.37 2005.17 1
20 Object 2 4851.07 12438.31 4851.07 1
20 Object 3 4039.73 10218.92 4039.73 1
20 Object 4 3007.16 8723.45 3007.16 1
I tried to solve this problem using the following python script:
def main():
    all_data = pd.DataFrame()
    a = 0.12
    G = 1.159
    for f in glob.glob("data/dataset1.xlsx"):
        df = pd.read_excel(f, 'Sheet1')  # , header=[1]
        all_data = all_data.append(df, ignore_index=True, sort=False)
    all_data.columns = all_data.columns.str.split('.').str[0]
    print(all_data)
    object_df = all_data.groupby(all_data.columns, axis=1)
    print(object_df)
    for k in object_df.groups.keys():
        if k != 'TimePoint':
            for row_index, row in object_df.get_group(k).iterrows():
                print(row)
                # This logic is not working to group by Object and then apply the following formula
                # TODO: Calculation for the newly added columns. Assumption: every time there will be
                # two occurrences of any Object, i.e. Object 0...4 in this example, but the Object
                # count can vary; sometimes only one Object can appear
                # IDA is the first occurrence value of the Object
                all_data['IDA'] = row[0]  # This is NOT correct
                # IAA is the second occurrence value of the Object
                all_data['IAA'] = row[1]
    all_data['Fc'] = all_data.IDA.fillna(0) - (a * all_data.IAA.fillna(0))
    all_data['EAPP'] = all_data.Fc.fillna(0) / (all_data.Fc.fillna(0) + (G * all_data.Fc.fillna(0)))
    # now save the data frame
    writer = pd.ExcelWriter('data/dataset1.xlsx')
    all_data.to_excel(writer, 'Sheet2', index=True)
    writer.save()

if __name__ == '__main__':
    main()
Please let me know how to assign the IDA and IAA values for each Object using groupby in pandas, referring to my code above.
I think melt might help you a lot
import pandas as pd
df = pd.read_clipboard()
# This part, breaking the df into 2, might be different based on how you're reading the dataframe into memory
df1 = df[df.columns[:6]]
df2 = df[['TimePoint'] + df.columns.tolist()[6:]]
tdf1 = df1.melt(['TimePoint']).assign(key=range(10))
tdf2 = df2.melt(['TimePoint']).assign(key=range(10)).drop(['TimePoint', 'variable'], axis=1)
df = tdf1.merge(tdf2, on='key', how='left').drop(['key'], axis=1).rename(columns={'value_x': 'IDA', 'value_y': 'IAA'})
a = 0.12
G = 1.159
df['Fc'] = df['IDA'] - a * df['IAA']
df['EAPP'] = df['Fc'].div(df['Fc']+(G*df['Fc']))
TimePoint variable IDA IAA Fc EAPP
0 10 Object_0 4642.99 11188.15 3300.4120 0.463177
1 20 Object_0 4640.09 11129.38 3304.5644 0.463177
2 10 Object_1 2000.71 2181.62 1738.9156 0.463177
3 20 Object_1 2005.17 2172.37 1744.4856 0.463177
4 10 Object_2 4869.52 12493.47 3370.3036 0.463177
5 20 Object_2 4851.07 12438.31 3358.4728 0.463177
6 10 Object_3 4023.69 10275.15 2790.6720 0.463177
7 20 Object_3 4039.73 10218.92 2813.4596 0.463177
8 10 Object_4 3008.99 8787.99 1954.4312 0.463177
9 20 Object_4 3007.16 8723.45 1960.3460 0.463177
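The hard-coded `key=range(10)` above assumes exactly 10 melted rows. A variation that derives everything from the frame's shape (a sketch, using a small toy frame with the ".1" suffixes pandas adds to duplicate headers on read) avoids that assumption:

```python
import pandas as pd

# toy frame mimicking the duplicate-header layout; pandas renames the
# second occurrence of each column to "Object 0.1", "Object 1.1", ...
df = pd.DataFrame([[10, 4642.99, 2000.71, 11188.15, 2181.62],
                   [20, 4640.09, 2005.17, 11129.38, 2172.37]],
                  columns=['TimePoint', 'Object 0', 'Object 1',
                           'Object 0.1', 'Object 1.1'])

n = (len(df.columns) - 1) // 2                          # objects per occurrence
df1 = df.iloc[:, :n + 1]                                # first occurrence -> IDA
df2 = df.iloc[:, [0] + list(range(n + 1, 2 * n + 1))]   # second occurrence -> IAA
df2.columns = df1.columns                               # strip the ".1" suffixes

tdf1 = df1.melt('TimePoint', var_name='Objects', value_name='IDA')
tdf2 = df2.melt('TimePoint', value_name='IAA')[['IAA']]
out = pd.concat([tdf1, tdf2], axis=1)                   # rows align positionally

a, G = 0.12, 1.159
out['Fc'] = out['IDA'] - a * out['IAA']
out['EAPP'] = out['Fc'] / (out['Fc'] + G * out['Fc'])
```

Note that EAPP reduces algebraically to 1 / (1 + G) for every row, which is why the output above shows a constant 0.463177.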
I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however a line or a few will be missing making it not possible to process and plot. Currently I have not been able to automatically fix it and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r + 1: rows[i + 1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, however I haven't managed to make them work.
I would appreciate any suggestions.
Does this meet what you want?
Q1: fill X using reindex, and the other columns using fillna.
Q2: passing separate StringIO objects to read_csv is easier (change unicode if you use Python 3).
# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')
# read csv as separated dataframes, using first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]
def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per block
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the others with zeros
    df = df.fillna(0)
    # revert the index to values
    return df.reset_index(drop=False)
dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
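For Python 3 the same split-and-pad idea works with io.StringIO directly (a sketch, inlining the sample file contents from the question rather than reading 'temp.csv'):

```python
from io import StringIO
import numpy as np
import pandas as pd

raw = """44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New
45,12,100,5
46,12,1500,6
47,12,2500,7
"""

# split on the 'New' separator and parse each block separately
chunks = raw.split('New')
dfs = [pd.read_csv(StringIO(chunk), header=None, index_col=0)
       for chunk in chunks]

def pad(df):
    df = df.reindex(np.arange(44, 48))  # the known X range
    df[1] = df[1].bfill().ffill()       # Y is constant within a block
    return df.fillna(0).reset_index()   # zero-fill the properties, restore X

dfs = [pad(df) for df in dfs]
```

The second block comes back with the missing `44,12` row restored as `44,12,0,0`.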
First Question
I've included print statements inside the function to explain how it works.
In [89]:
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        # print('---missing df---')
        # print(missing_df)
        missing_df[0] = missing
        # print('---missing df---')
        # print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        # print('---missing df---')
        # print(missing_df)
        df = pd.concat([df, missing_df])
        # print('---final df---')
        # print(df)
    return df
In [91]:
Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4