I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A | A | A | | | |
| A | A | B | | | |
| A | B | A | | | |
| A | B | B | | | |
| B | A | A | | | |
| B | A | B | | | |
| B | B | A | | | |
| B | B | B | | | |
In summary, and specifically in my case, Level1 has 2 categories, Level2 has 24, Level3 has 6, and there are also a Level4 (674) and a Level5 (9), with some minor variation depending on the specific higher-level values (Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
# Resulting df to be populated
df_result = pd.DataFrame([])
# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()
# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = df_result.append(df_means)
The thing is, after a little while this process gets really slow. Since I have around 2 * 24 * 6 * 674 groups and 84 combinations (9 elements taken 3 at a time), I am expecting more than 16 million df_means frames to be calculated.
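A quick back-of-the-envelope check of that count (assuming roughly 9 Level5 values, and therefore 84 combinations, per group):

import math
groups = 2 * 24 * 6 * 674   # approximate number of (Level1, Level2, Level3, Level4) groups
combos = math.comb(9, 3)    # 84 combinations of 9 Level5 values taken 3 at a time
print(groups * combos)      # 16,305,408 means to compute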
Is there any more efficient way to do this?
Thank you.
Let's say I have a table which would look like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a specific id. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B; it should be applied to both value_one and value_two, so the table afterwards should look like this:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far have not been able to achieve the described result; I would appreciate any tips or solutions.
You can consider dividing the original dataframe into two parts according to type and then using a SQL statement to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
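If you prefer to stay in the DataFrame API, the same join can be expressed without temp views. A sketch, assuming df and the SparkSession already exist as in the snippet above:

from pyspark.sql import functions as F

a = df.filter(F.col("type") == "A").select("id", F.col("value_one").alias("a_one"), F.col("value_two").alias("a_two"))
b = df.filter(F.col("type") == "B").select("id", F.col("value_one").alias("b_one"), F.col("value_two").alias("b_two"))

result_df = (
    a.join(b, "id")
     .select(
         "id",
         (F.col("a_one") / F.col("b_one")).alias("value_one"),
         F.lit("C").alias("type"),
         (F.col("a_two") / F.col("b_two")).alias("value_two"),
     )
)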
I'm really new to coding. I have two columns in Excel - one for ingredients and the other for the ratio.
Like this:
ingredients [methanol/ipa,ethanol/methanol,ethylacetate]
spec[90/10,70/30,100]
qty[5,6,10]
So this data is entered continuously. I want to get the total amount of each ingredient; e.g. from the first entry, methanol will be 5 x 90 and ipa will be 5 x 10.
I tried to split them based on / and use a for loop to iterate
import pandas as pd
solv={'EA':0,'M':0,'AL':0,'IPA':0}
data_xls1=pd.read_excel(r'C:\Users\IT123\Desktop\Solvent stock.xlsx',sheet_name='PLANT',index_col=None)
sz=range(len(data_xls1.index))
a=data_xls1.Solvent.str.split('/').tolist()
b=data_xls1.Spec.str.split('/').tolist()
print(a)
for i in sz:
    print(b[i][0:1])
    print(b[i][1:2])
I want to split the ingredients and spec columns, multiply by qty, and store the totals in the solv dictionary.
The error right now is: 'float' object is not subscriptable.
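As a guess at the cause: cells without a /, such as 100, are read as numbers, so .str.split returns NaN (a float) for them and b[i][0:1] then fails. Casting the column to string first would avoid that:

b = data_xls1.Spec.astype(str).str.split('/').tolist()  # numeric cells split to e.g. ['100'] (or ['100.0']) instead of NaN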
You have already found the key part, namely using the str.split function.
I would suggest that you bring the data to a long format like this:
| | Transaction | ingredients | spec | qty |
|---:|--------------:|:--------------|-------:|------:|
| 0 | 0 | methanol | 90 | 4.5 |
| 1 | 0 | ipa | 10 | 0.5 |
| 2 | 1 | ethanol | 70 | 4.2 |
| 3 | 1 | methanol | 30 | 1.8 |
| 4 | 2 | ethylacetate | 100 | 10 |
The following code produces that result:
import pandas as pd
d = {"ingredients":["methanol/ipa","ethanol/methanol","ethylacetate"],
"spec":["90/10","70/30","100"],
"qty":[5,6,10]
}
df = pd.DataFrame(d)
df.index = df.index.rename("Transaction") # Add sensible name to the index
#Each line represents a transaction with one or more ingredients
#The following lines split the cells on the delimiter; stack() moves them to long format.
ingredients = df.ingredients.str.split("/", expand = True).stack()
spec = df.spec.str.split("/", expand = True).stack()
Each of them will look like this:
| (Transaction, part) | spec |
|:--------------------|-----:|
| (0, 0)              |   90 |
| (0, 1)              |   10 |
| (1, 0)              |   70 |
| (1, 1)              |   30 |
| (2, 0)              |  100 |
Now we just need to put everything together:
df_new = pd.concat([ingredients, spec], axis = "columns")
df_new.columns = ["ingredients", "spec"]
#Switch from string to float
df_new.spec = df_new.spec.astype("float")
#Multiply by the quantity;
#pandas automatically aligns on Transaction (the index of both frames)
df_new["qty"] = df_new.spec * df.qty / 100
#If you are not comfortable working with a MultiIndex, just run this line:
df_new = df_new.reset_index(level = 0, drop = False).reset_index(drop = True)
The good thing about this format is that you can have multi-way splits of your ingredients; str.split will work without a problem, and summing up is straightforward.
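For instance, the per-ingredient totals the question asks for then reduce to a single groupby on the frame built above:

df_new.groupby("ingredients")["qty"].sum()
# ethanol          4.2
# ethylacetate    10.0
# ipa              0.5
# methanol         6.3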
I should have posted this first, but this is what my input Excel sheet looks like.
I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then within each partition iterate over the rows ordered by orderCol and apply some function to calculate a new column based on valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
tmp = <some value based on a condition between col_value and cached_value>
<update the cached_value using some logic>
return tmp
I understand I need to group by partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate over the rows and apply the logic I described, to get the desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply a UDF on the whole set of data:
# First, you create a struct with the order col and the value col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol')))
# Then you collect that new column into an array per partition
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data"))
# Finally, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
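For completeness, here is a sketch of what my_udf might look like. The logic in foo is only a placeholder since the real condition is not specified, and the returned array would still need something like posexplode to get back to one row per original record:

from pyspark.sql import functions as F, types as T

def foo(col_value, cached_value):
    # Placeholder logic only - the real condition from the question goes here
    return "C1" if cached_value is None else "C2"

@F.udf(T.ArrayType(T.StringType()))
def my_udf(rows):
    # rows is the collected array of (orderCol, valueCol) structs for one partitionCol
    cached = None
    results = []
    for row in sorted(rows, key=lambda r: r["orderCol"]):
        results.append(foo(row["valueCol"], cached))
        cached = row["valueCol"]  # update the cached value
    return results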
I have some data as follows:
+--------+------+
| Reason | Keys |
+--------+------+
| x | a |
| y | a |
| z | a |
| y | b |
| z | b |
| x | c |
| w | d |
| x | d |
| w | d |
+--------+------+
I want to get the Reason corresponding to the first occurrence of each Key. Like here, I should get Reasons x,y,x,w for Keys a,b,c,d respectively. After that, I want to compute the percentage of each Reason, as in a metric for how many times each Reason occurs. Thus x = 2/4 = 50%. And w,y = 25% each.
For the percentage, I think I can use something like value_counts(normalize=True) * 100, based on the previous step. What is a good way to proceed?
You are right about the second step, and the first step can be achieved with
summary = df.groupby("Keys").first()
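Putting both steps together, a sketch combining that with the value_counts idea from the question:

summary = df.groupby("Keys").first()                              # Reason of the first row per Key
percentages = summary["Reason"].value_counts(normalize=True) * 100
# expected: x 50.0, y 25.0, w 25.0 (order of ties may vary)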
You can use drop_duplicates:
df.drop_duplicates(['Reason'])
Out[207]:
Reason Keys
0 x a
1 y a
2 z a
6 w d
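Note that if the goal is the first row per Key (as the question states), the subset would presumably be ['Keys'] instead, and the percentage step can be chained on directly:

df.drop_duplicates(['Keys'])['Reason'].value_counts(normalize=True) * 100
# x    50.0
# y    25.0
# w    25.0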
So I have a dataframe with some values. This is my dataframe:
| in | x | y | z |
|----|---|---|---|
| 1  | a | a | b |
| 2  | a | b | b |
| 3  | a | b | c |
| 4  | b | b | c |
I would like to get the number of unique values in each row, and the number of values that are not equal to the value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
|----|---|---|-----|----------------|--------|
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be a more elegant way of doing this. My mind keeps circling around drop_duplicates (which does not work on a Series), converting to an array and using .unique(), df.iterrows() (which I want to avoid), and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
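If apply should be avoided for the unique count as well, DataFrame.nunique with axis=1 may be enough (it ignores NaN by default):

df['unique'] = df[['x','y','z']].nunique(axis=1)  # row-wise count of distinct non-NaN values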