I'm wondering if there is a way to add columns based on the common field in pandas.
This is my original dataset
load mapping freq 99th energy
61 175.0k 5CN0-5CN1 1.20GHz 0.937662 18952.056063
19 175.0k 5CN0-5CN1 2.10GHz 0.391280 19051.052048
I want to add the following columns: 99th-1.20GHz, energy-1.20GHz, 99th-2.10GHz, and energy-2.10GHz, combining rows where load and mapping are the same.
This is the desired output
load mapping 99th-1.20GHz 99th-2.10GHz energy-1.20GHz energy-2.10GHz
175.0k 5CN0-5CN1 0.937662 0.39128 18952.05606 19051.05205
I suggest you use MultiIndex columns for this, e.g. via pd.pivot_table. You can flatten columns as a separate step, although your data will lose structure.
res = pd.pivot_table(df, index=['load', 'mapping'], columns='freq',
values=['99th', 'energy'], aggfunc='first')
Result:
99th energy
freq 1.20GHz 2.10GHz 1.20GHz 2.10GHz
load mapping
175.0k 5CN0-5CN1 0.937662 0.39128 18952.056063 19051.052048
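A minimal sketch of the flattening step mentioned above, using toy data matching the example (the `metric-freq` naming is just one choice for the flattened labels):

```python
import pandas as pd

df = pd.DataFrame({
    'load': ['175.0k', '175.0k'],
    'mapping': ['5CN0-5CN1'] * 2,
    'freq': ['1.20GHz', '2.10GHz'],
    '99th': [0.937662, 0.391280],
    'energy': [18952.056063, 19051.052048],
})

res = pd.pivot_table(df, index=['load', 'mapping'], columns='freq',
                     values=['99th', 'energy'], aggfunc='first')

# Flatten ('99th', '1.20GHz') -> '99th-1.20GHz', then restore load/mapping as columns
res.columns = [f'{metric}-{freq}' for metric, freq in res.columns]
res = res.reset_index()
print(res.columns.tolist())
# ['load', 'mapping', '99th-1.20GHz', '99th-2.10GHz', 'energy-1.20GHz', 'energy-2.10GHz']
```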
Related
I'm new to pandas and am trying to figure out how to add columns to a pivot table with the mean and standard deviation of the row data for the last three months of transaction data.
Here's the code that sets up the pivot table:
previousThreeMonths = [prev_month_for_analysis, prev_month2_for_analysis, prev_month3_for_analysis]
dfPreviousThreeMonths = df[df['Month'].isin(previousThreeMonths)]
ptHistoricalConsumption = dfPreviousThreeMonths.pivot_table(
                              index=['Customer Part #'],
                              columns=['Month'],
                              aggfunc={'Qty Shp': np.sum}
                          )
ptHistoricalConsumption['Mean'] = ptHistoricalConsumption.mean(numeric_only=True, axis=1)
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.std(numeric_only=True, axis=1)
ptHistoricalConsumption
The resulting pivot table looks like this:
The problem is that the standard deviation column is including the Mean in its calculations, whereas I just want it to use the raw data for the previous three months. For example, the Std Dev of part number 2225 should be 11.269, not 9.2.
I'm sure there's a better way to do this and I'm just missing something.
One way would be to remove the Mean column temporarily before calling .std():
ptHistoricalConsumption['Std Dev'] = ptHistoricalConsumption.drop('Mean', axis=1).std(numeric_only=True, axis=1)
That wouldn't remove it from the DataFrame permanently; it would just remove it from the copy fed to .std().
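A small sketch showing the difference, with made-up quantities standing in for the three months of data (not the asker's real numbers):

```python
import pandas as pd

# Toy data: one part number, three months of quantities
pt = pd.DataFrame({'2023-01': [10.0], '2023-02': [20.0], '2023-03': [33.0]},
                  index=['2225'])
pt['Mean'] = pt.mean(numeric_only=True, axis=1)

# Wrong: std over all columns includes the Mean column itself
std_with_mean = pt.std(numeric_only=True, axis=1)

# Right: drop Mean from the copy fed to .std(); pt itself keeps the column
pt['Std Dev'] = pt.drop('Mean', axis=1).std(numeric_only=True, axis=1)
```

Dropping the extra column gives the sample standard deviation of the raw three months only; including Mean pulls the result toward a smaller value, which is exactly the symptom described above.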
I have large spark dataframe 'df', (more than billion rows) made of
meta_info | date | comment
I also have a variable 'lst', where it stores all ids I'm interested in.
What would be the way to only retain rows where its id is included in lst?
df.where("meta_info".isin(lst)).show()
This is what I tried, but it fails because "meta_info" is a plain Python string, and strings don't have an isin method.
The first option is to rely on a join if your list is big:
data = [[value] for value in lst]
safelist = spark.createDataFrame(data=data, schema=["meta_info"])
filtered = df.join(safelist, on='meta_info')
filtered.show()
The other option is to filter your dataset directly; even then, it can be worth broadcasting your list to limit the data shipped to your executors:
import pyspark.sql.functions as F
blst = sc.broadcast(lst)
df.filter(F.col("meta_info").isin(blst.value)).show()
I would recommend comparing the two options on your dataset.
I have the following problem and had an idea to solve it, but it didn't work:
I have the data on DAX Call and Put Options for every trading day in a month. After transforming and some calculations I have the following DataFrame:
DaxOpt. The goal is now to get rid of every row (either a call or a put option) that does not have its respective pair. By pair I mean a call and a put option with the same 'EXERCISE_PRICE' and 'TAU', where 'TAU' is the time to maturity in years. The red boxes in the picture are examples of a pair. So the result should be either one DataFrame with only the pairs, or two DataFrames of call and put options whose rows are the respective pairs.
My idea was to create two new DataFrames, one containing only the call options and the other the put options, sort them by 'TAU' and 'EXERCISE_PRICE', and work my way through with pandas' isin function in order to get rid of the call or put options which do not have the respective pair.
DaxOptCall = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C']
DaxOptPut = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P']
The problem is that DaxOptCall and DaxOptPut have different dimensions, so the isin function is not directly applicable. I am trying to find the most efficient way, since the data I am using now is just a fraction of the real data.
Would appreciate any help or idea.
See if this works for you:
Once you have separated your df into two dfs by call/put flag, convert the column(s) that uniquely identify your pairs into index columns:
# Assuming your unique columns are TAU and EXERCISE_PRICE
df_call = df_call.set_index(["EXERCISE_PRICE", "TAU"])
df_put = df_put.set_index(["EXERCISE_PRICE", "TAU"])
Next, take the intersection of the indexes, which returns a pandas MultiIndex object:
mtx = df_call.index.intersection(df_put.index)
Then use the mtx object to extract the common elements from the dfs
df_call.loc[mtx]
df_put.loc[mtx]
You can merge these if you want them to be in the same df and reset the index to the original column.
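Putting the steps above together on toy data (the column names follow the question; the strikes, maturities, and prices are made up for illustration):

```python
import pandas as pd

# Toy option quotes: only the 13000 strike at TAU=0.25 has both a call and a put
DaxOpt = pd.DataFrame({
    'CALL_PUT_FLAG': ['C', 'P', 'C', 'P', 'C'],
    'EXERCISE_PRICE': [13000, 13000, 13500, 14000, 14500],
    'TAU': [0.25, 0.25, 0.25, 0.5, 0.25],
    'PRICE': [410.0, 390.0, 205.0, 320.0, 55.0],
})

df_call = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'C'].set_index(['EXERCISE_PRICE', 'TAU'])
df_put = DaxOpt[DaxOpt.CALL_PUT_FLAG == 'P'].set_index(['EXERCISE_PRICE', 'TAU'])

# Keep only (EXERCISE_PRICE, TAU) combinations present on both sides
mtx = df_call.index.intersection(df_put.index)
pairs_call = df_call.loc[mtx].reset_index()
pairs_put = df_put.loc[mtx].reset_index()
```

Unpaired rows (the lone calls at 13500 and 14500, and the lone put at 14000) drop out, leaving the one matched pair on both sides.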
When trying to merge two dataframes using pandas I receive this message: "ValueError: array is too big." I estimate the merged table will have about 5 billion rows, which is probably too much for my computer with 8GB of RAM (is this limited just by my RAM or is it built into the pandas system?).
I know that once I have the merged table I will calculate a new column and then filter the rows, looking for the maximum values within groups. Therefore the final output table will be only 2.5 million rows.
How can I break this problem up so that I can execute this merge method on smaller parts and build up the output table, without hitting my RAM limitations?
The method below works correctly for this small data, but fails on the larger, real data:
import pandas as pd
import numpy as np
# Create input tables
t1 = {'scenario':[0,0,1,1],
'letter':['a','b']*2,
'number1':[10,50,20,30]}
t2 = {'letter':['a','a','b','b'],
'number2':[2,5,4,7]}
table1 = pd.DataFrame(t1)
table2 = pd.DataFrame(t2)
# Merge the two, create the new column. This causes "...array is too big."
table3 = pd.merge(table1,table2,on='letter')
table3['calc'] = table3['number1']*table3['number2']
# Filter, bringing back the rows where 'calc' is maximum per scenario+letter
table3 = table3.loc[table3.groupby(['scenario','letter'])['calc'].idxmax()]
This is a follow up to two previous questions:
Does iterrows have performance issues?
What is a good way to avoid using iterrows in this example?
I answer my own Q below.
You can break up the first table using groupby (for instance, on 'scenario'). It could make sense to first create a new variable that gives you groups of exactly the size you want. Then iterate through those groups, doing the following for each: merge with the second table, filter, and append the much smaller result to your final output table.
As explained in "Does iterrows have performance issues?", iterating is slow, so use large groups to keep the work in pandas' most efficient vectorized methods. Pandas is relatively quick when it comes to merging.
Following on from where you create the input tables:
results = []
grouped = table1.groupby('scenario')
for _, group in grouped:
    temp = pd.merge(group, table2, on='letter')
    temp['calc'] = temp['number1'] * temp['number2']
    # keep only the max-'calc' row per letter within this scenario
    results.append(temp.loc[temp.groupby('letter')['calc'].idxmax()])
table3 = pd.concat(results, ignore_index=True)
Hi, I have a result set from psycopg2 like so:
(
(timestamp1, val11, val12, val13, val14),
(timestamp2, val21, val22, val23, val24),
(timestamp3, val31, val32, val33, val34),
(timestamp4, val41, val42, val43, val44),
)
I have to return the difference between the values of consecutive rows (except for the timestamp column): each row subtracts the previous row's values.
The first row would be
timestamp, 'NaN', 'NaN' ....
This has to then be returned as a generic object
I.e. something like an array of the following objects:
Group(timestamp=timestamp, rows=[val11, val12, val13, val14])
I was going to use Pandas to do the diff.
Something like below works ok on the values
df = DataFrame().from_records(data=results, columns=headers)
diffs = df.set_index('time', drop=False).diff()
But diff also operates on the timestamp column, and I can't get it to ignore that column while leaving the original timestamp column in place.
Also, I wasn't sure it was going to be efficient to get the data into my return format, as pandas advises against row access.
What would be a fast way to get the result set differences in my required output format?
Why did you set drop=False? That puts the timestamps in the index (where they will not be touched by diff) but also leaves a copy of the timestamps as a regular column, to be processed by diff.
I think this will do what you want:
diffs = df.set_index('time').diff().reset_index()
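For converting the result into your Group objects, itertuples is usually much faster than iterrows. A sketch, with a made-up namedtuple and toy columns standing in for your val11..val44 data:

```python
from collections import namedtuple

import pandas as pd

Group = namedtuple('Group', ['timestamp', 'rows'])

# Toy result set: a time column plus four value columns
df = pd.DataFrame({
    'time': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'v1': [1.0, 3.0, 6.0],
    'v2': [2.0, 2.0, 5.0],
    'v3': [0.0, 4.0, 4.0],
    'v4': [1.0, 1.0, 2.0],
})

diffs = df.set_index('time').diff().reset_index()

# itertuples avoids the per-row overhead of iterrows
groups = [Group(timestamp=row.time, rows=[row.v1, row.v2, row.v3, row.v4])
          for row in diffs.itertuples(index=False)]
```

The first Group's rows are NaN, matching the first-row requirement in the question.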
Since you mention psycopg2, take a look at the docs for pandas 0.14, released just a few days ago, which features improved SQL functionality, including new support for postgresql. You can read and write directly between the database and pandas DataFrames.
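A minimal sketch of that round trip, using SQLite from the standard library as a stand-in for PostgreSQL (the table and column names here are made up; with postgres you would pass an SQLAlchemy engine to the same functions):

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for the asker's PostgreSQL instance
conn = sqlite3.connect(':memory:')

# Write a DataFrame out as a table...
pd.DataFrame({'time': ['t1', 't2'], 'val': [1.0, 3.5]}).to_sql(
    'readings', conn, index=False)

# ...and read it straight back into a DataFrame
df = pd.read_sql('SELECT * FROM readings', conn)
```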