I'm really new to coding. I have 2 columns in Excel - one for the ingredients and one for the ratio.
Like this:
ingredients [methanol/ipa,ethanol/methanol,ethylacetate]
spec[90/10,70/30,100]
qty[5,6,10]
This data is entered continuously. I want to get the total amount of each ingredient, e.g. from the first entry methanol will be 5 x 90 and ipa will be 5 x 10.
I tried to split them based on / and use a for loop to iterate
import pandas as pd

solv = {'EA': 0, 'M': 0, 'AL': 0, 'IPA': 0}
data_xls1 = pd.read_excel(r'C:\Users\IT123\Desktop\Solvent stock.xlsx', sheet_name='PLANT', index_col=None)
sz = range(len(data_xls1.index))
a = data_xls1.Solvent.str.split('/').tolist()
b = data_xls1.Spec.str.split('/').tolist()
print(a)
for i in sz:
    print(b[i][0:1])
    print(b[i][1:2])
I want to split the ingredients and spec columns, multiply them with qty, and store the result in the solv dictionary.
The error right now is: 'float' object is not subscriptable.
You have already found the key part, namely using the str.split function.
I would suggest that you bring the data to a long format like this:
| | Transaction | ingredients | spec | qty |
|---:|--------------:|:--------------|-------:|------:|
| 0 | 0 | methanol | 90 | 4.5 |
| 1 | 0 | ipa | 10 | 0.5 |
| 2 | 1 | ethanol | 70 | 4.2 |
| 3 | 1 | methanol | 30 | 1.8 |
| 4 | 2 | ethylacetate | 100 | 10 |
The following code produces that result:
import pandas as pd
d = {"ingredients":["methanol/ipa","ethanol/methanol","ethylacetate"],
"spec":["90/10","70/30","100"],
"qty":[5,6,10]
}
df = pd.DataFrame(d)
df.index = df.index.rename("Transaction")  # Add a sensible name to the index
# Each line represents a transaction with one or more ingredients.
# The following lines split the entries by the delimiter; stack() moves them to long format.
ingredients = df.ingredients.str.split("/", expand = True).stack()
spec = df.spec.str.split("/", expand = True).stack()
Each of them will look like this (shown here for spec):
| (Transaction, pos) | spec |
|:-------------------|-----:|
| (0, 0) | 90 |
| (0, 1) | 10 |
| (1, 0) | 70 |
| (1, 1) | 30 |
| (2, 0) | 100 |
Now we just need to put everything together:
df_new = pd.concat([ingredients, spec], axis = "columns")
df_new.columns = ["ingredients", "spec"]
#Switch from string to float
df_new.spec = df_new.spec.astype("float")
# Multiply by the quantity.
# Pandas automatically uses Transaction (the index of both frames) to align the values.
df_new["qty"] = df_new.spec * df.qty / 100
# If you are not comfortable working with a MultiIndex, just run this line:
df_new = df_new.reset_index(level = 0, drop = False).reset_index(drop = True)
The good thing about this format is that you can have multi-way splits for your ingredients (str.split will handle them without a problem), and summing up is straightforward, as the sketch below shows.
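A minimal sketch using the df_new built above; the total quantity per solvent is just a groupby and a sum:
totals = df_new.groupby("ingredients")["qty"].sum()
print(totals)
# ingredients
# ethanol          4.2
# ethylacetate    10.0
# ipa              0.5
# methanol         6.3
# Name: qty, dtype: float64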
I should have posted this first, but this is what my input Excel sheet looks like.
I want to remove rows in a dataframe which have partial overlaps in their start and end character indices.
Details:
I have two sentences and I have extracted some entities from them and organized them in a dataframe.
sentences :
| id | sentence |
| --- | --- |
| 1 | Today is a very sunny day and sun is shining |
| 2 | I bought the red balloon and playing with it |
My dataframe with the extracted entities looks like this:
| id | data | start_char_index | end_char_index | token_position |
| ---| -------------- | ---------------- | -------------- | -------------- |
| 1 | very sunny day | 11 | 26 | [4,5,6] |
| 1 | shining | 37 | 45 | [10] |
| 1 | sunny | 16 | 21 | [5] |
| 2 | the red balloon | 9 | 25 | [3,4,5] |
| 2 | playing | 29 | 37 | [7] |
| 2 | red | 13 | 16 | [4] |
P.S. Here token_position is the index of the specific token in the sentence (starting from 1).
Now, for id 1, we see that 'very sunny day' and 'sunny' partially overlap (their start and end character indices and token positions both overlap).
The same goes for id 2, where 'the red balloon' and 'red' overlap. I want to remove the rows 'sunny' and 'red', which are the smaller of the overlaps in the two ids.
I thought about grouping them on id and then removing those records by storing the start and end character indices (or token positions) in a dictionary, but with a lot of data rows and a lot of ids that would be very slow.
Also I read about IntervalTree but I could not get to use it for partial overlaps very efficiently.
So could you please suggest some solution for this?
The final output dataframe should look like this:
| id | data | start_char_index | end_char_index | token_position |
| ---| -------------- | ---------------- | -------------- | -------------- |
| 1 | very sunny day | 11 | 26 | [4,5,6] |
| 1 | shining | 37 | 45 | [10] |
| 2 | the red balloon | 9 | 25 | [3,4,5] |
| 2 | playing | 29 | 37 | [7] |
Thanks for the help in advance :)
Apart from Mortz's answer, I also tried pandas IntervalArray and overlaps, which was faster for me. Putting it here for anyone else who might find it useful (credits: https://stackoverflow.com/a/69336914/15941713):
import pandas as pd

def drop_subspan_duplicates(df):
    # Build an interval for every span in the group
    idx1 = pd.arrays.IntervalArray.from_arrays(df['start'], df['end'], closed='both')
    # For each row, record the index of the first row whose interval overlaps it
    df['wrd_id'] = df.apply(lambda x: df.index[idx1.overlaps(pd.Interval(x['start'], x['end'], closed='both'))][0], axis=1)
    # Keep only the first row of every overlap group
    df = df.drop_duplicates(['wrd_id'], keep='first')
    df.drop(['wrd_id'], axis=1, inplace=True)
    return df
output = data.groupby('id').apply(drop_subspan_duplicates)
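Note that drop_subspan_duplicates assumes columns named start and end; with the column names from the question, a rename before grouping should work (a sketch, assuming the frame is called data and the end column is end_char_index):
output = (data.rename(columns={'start_char_index': 'start', 'end_char_index': 'end'})
              .groupby('id')
              .apply(drop_subspan_duplicates))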
One can also refer to this answer for tackling the issue if one wishes to avoid dataframe operations
I am not sure a DataFrame is the best structure to solve this problem - but here is one approach
df = pd.DataFrame({'id': [1, 1, 1], 'start': [11, 20, 16], 'end': [18, 35, 17]})
# First we construct a range of numbers from the start and end index
df.loc[:, 'range'] = df.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
# Next, we "cumulate" these ranges and measure the number of unique elements in the cumulative range at each row
df['range_size'] = df['range'].cumsum().apply(lambda x: len(set(x)))
# Finally we check if every row adds anything to the cumulative range - if a new row adds nothing, then we can drop that row
df['range_size_shifted'] = df['range'].cumsum().apply(lambda x: len(set(x))).shift(1)
df['drop'] = df.apply(lambda x: False if pd.isna(x['range_size_shifted']) else not int(x['range_size'] - x['range_size_shifted']), axis=1)
print(df)
# id start end drop
#0 1 11 18 False
#1 1 20 35 False
#2 1 16 17 True
If you want to do this for each group separately -
for key, group in df.groupby('id'):
    group.loc[:, 'range'] = group.apply(lambda x: list(range(x['start'], x['end'])), axis=1)
    group['range_size'] = group['range'].cumsum().apply(lambda x: len(set(x)))
    group['range_size_shifted'] = group['range'].cumsum().apply(lambda x: len(set(x))).shift(1)
    group['drop'] = group.apply(lambda x: False if pd.isna(x['range_size_shifted']) else not int(x['range_size'] - x['range_size_shifted']), axis=1)
    print(group)
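If the end goal is a single filtered frame rather than printed groups, the same cumulative-coverage idea can be wrapped in a function and applied per group. A rough sketch, assuming the same start and end column names as above:
def filter_group(group):
    covered = set()
    keep = []
    for start, end in zip(group['start'], group['end']):
        span = set(range(start, end))
        keep.append(not span.issubset(covered))  # keep rows that add new coverage
        covered |= span
    return group[keep]

filtered = df.groupby('id', group_keys=False).apply(filter_group)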
I currently have a large CSV containing data for a number of events.
Column one contains a number of dates as well as some IDs for each event.
Basically, I want to write something in Python so that whenever there is an ID number (AL.....) it creates a new CSV, titled with that ID number, containing all the data before the next ID number, so I end up with a CSV for each event.
For info, the whole CSV contains 8 columns, but the division into individual CSVs is only predicated on column one.
Use Python to split a CSV file with multiple headers
I notice this question is quite similar, but in my case I have AL followed by a different string of numbers each time, and I also want to name the new CSVs after the ID numbers.
You can achieve this using pandas, so let's first generate some data:
import pandas as pd
import numpy as np
def date_string():
    return str(np.random.randint(1, 32)) + "/" + str(np.random.randint(1, 13)) + "/1997"
l = [date_string() for x in range(20)]
l[0] = "AL123"
l[10] = "AL321"
df = pd.DataFrame(l, columns=['idx'])
# -->
| | idx |
|---:|:-----------|
| 0 | AL123 |
| 1 | 24/3/1997 |
| 2 | 8/6/1997 |
| 3 | 6/9/1997 |
| 4 | 31/12/1997 |
| 5 | 11/6/1997 |
| 6 | 2/3/1997 |
| 7 | 31/8/1997 |
| 8 | 21/5/1997 |
| 9 | 30/1/1997 |
| 10 | AL321 |
| 11 | 8/4/1997 |
| 12 | 21/7/1997 |
| 13 | 9/10/1997 |
| 14 | 31/12/1997 |
| 15 | 15/2/1997 |
| 16 | 21/2/1997 |
| 17 | 3/3/1997 |
| 18 | 16/12/1997 |
| 19 | 16/2/1997 |
So the interesting positions are 0 and 10, as those hold the AL* strings.
Now, to find the AL* rows you can use:
idx = df.index[df['idx'].str.startswith('AL')]  # gets all indices where an AL* id sits
dfs = np.split(df, idx) # splits the data
for out in dfs[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)  # saves the data
This gives you two csv files named AL123.csv and AL321.csv with the first line being the AL* string.
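For the real file, the same idea should carry over once the CSV is read without treating any row as a header. A sketch, where the file name events.csv and the assumption that the IDs sit in the first column are mine:
import pandas as pd
import numpy as np

df = pd.read_csv("events.csv", header=None)              # keep every row as data
idx = df.index[df[0].astype(str).str.startswith("AL")]   # rows whose first column holds an id
for out in np.split(df, idx)[1:]:
    name = out.iloc[0, 0]
    out.to_csv(name + ".csv", index=False, header=False)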
I have a df and I want to create some new cols with it. How would I use the apply function to both pass in the row, and the entire df with it? I need the entire df to do some filtering, and the data is subject to the values in each row.
Or maybe I don't need to use apply, but that's the first thing that came to my mind. Thank you and all help is appreciated!
Ex of df:
+----+--------+--------+
| ID | Family | Amount |
+----+--------+--------+
| 1 | A | 2 |
| 2 | A | 10 |
| 3 | B | 4 |
| 4 | B | 7 |
+----+--------+--------+
Result:
+----+--------+--------+-----------+------------+
| ID | Family | Amount | Total_Fam | Id_Percent |
+----+--------+--------+-----------+------------+
| 1 | A | 2 | 12 | .166 |
| 2 | A | 10 | 12 | .833 |
| 3 | B | 4 | 11 | .363 |
| 4 | B | 7 | 11 | .636 |
+----+--------+--------+-----------+------------+
First, group by Family and use transform to get each family's total Amount; then you can directly divide Amount by the new column.
df['Total_Fam'] = df.groupby('Family')['Amount'].transform('sum')
df['Id_Percent'] = df['Amount']/df['Total_Fam']
df
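For reference, a minimal runnable version with the sample data from the question:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Family': ['A', 'A', 'B', 'B'],
                   'Amount': [2, 10, 4, 7]})
df['Total_Fam'] = df.groupby('Family')['Amount'].transform('sum')
df['Id_Percent'] = df['Amount'] / df['Total_Fam']
print(df)
#    ID Family  Amount  Total_Fam  Id_Percent
# 0   1      A       2         12    0.166667
# 1   2      A      10         12    0.833333
# 2   3      B       4         11    0.363636
# 3   4      B       7         11    0.636364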
Using apply on a single column passes each value individually. If you use apply on the entire DataFrame with axis=1, the function sees the whole row, so you can use all the columns. As you can see in the example below, df['new_2'] is built with a function that I apply to the whole dataset, so I do not need to pass the df to it separately.
import pandas as pd
import seaborn as sns
df = sns.load_dataset('iris')
df['new'] = df['species'].apply(lambda x: x[:2])
def sumIsMore(dataframe):
    x = dataframe['sepal_length']
    y = dataframe['sepal_width']
    return x + y >= 8.5
df['new_2'] = df.apply(sumIsMore, axis=1)
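To make the difference concrete, a short usage sketch on the same iris frame (column names as loaded by seaborn):
df['petal_sum'] = df.apply(lambda row: row['petal_length'] + row['petal_width'], axis=1)  # row-wise: every column of the row is available
df['species_upper'] = df['species'].apply(lambda s: s.upper())                            # column-wise: only the single value is available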
I have a multiindexed dataframe where the index levels have multiple categories, something like this:
| Level1 | Level2 | Level3 | Var1 | Var2 | Var3 |
|--------|--------|--------|------|------|------|
| A      | A      | A      |      |      |      |
| A      | A      | B      |      |      |      |
| A      | B      | A      |      |      |      |
| A      | B      | B      |      |      |      |
| B      | A      | A      |      |      |      |
| B      | A      | B      |      |      |      |
| B      | B      | A      |      |      |      |
| B      | B      | B      |      |      |      |
In summary, and specifically in my case, Level1 has 2 values, Level2 has 24, Level3 has 6, and there are also Level4 (674 values) and Level5 (9 values), with some minor variation depending on the specific higher-level values (Level1 == 1 actually has 24 Level2s, but Level1 == 2 has 23).
I need to generate all possible combinations of 3 at Level 5, then calculate their means for Vars 1-3.
I am trying something like this:
import itertools
import pandas as pd

# Resulting df to be populated
df_result = pd.DataFrame([])

# Retrieving values at Level1
lev1s = df.index.get_level_values("Level1").unique()

# Looping through each Level1 value
for lev1 in lev1s:
    # Filtering df based on Level1 value
    df_lev1 = df.query('Level1 == ' + str(lev1))
    # Repeating...
    lev2s = df_lev1.index.get_level_values("Level2").unique()
    for lev2 in lev2s:
        df_lev2 = df_lev1.query('Level2 == ' + str(lev2))
        # ... until Level3
        lev3s = df_lev2.index.get_level_values("Level3").unique()
        # Creating all combinations
        combs = itertools.combinations(lev3s, 3)
        # Looping through each combination
        for comb in combs:
            # Filtering values in combination
            df_comb = df_lev2.query('Level3 in ' + str(comb))
            # Calculating means using groupby (groupby might not be necessary,
            # but I don't believe it has much of an impact)
            df_means = df_comb.reset_index().groupby(['Level1', 'Level2']).mean()
            # Extending resulting dataframe
            df_result = df_result.append(df_means)
The thing is, after a little while, this process gets really slow. Since I have around 2 * 24 * 6 * 674 levels and 84 combinations (of 9 elements, 3 at a time), I am expecting more than 16 million df_means frames to be calculated.
Is there any more efficient way to do this?
Thank you.
I have a list of parts with each part consisting of part_number, width and length. I want to end up displaying this list as a grid using the various widths as column labels and the various lengths as row labels.
width1 width2 width3 width4
len1 no1 no2
len2 no3 no4 no5
len3 no6 no7 no8 no9
Note that parts are not available in all lengths and widths and that some cells in the grid will be empty.
I began by wrangling all this data into lists: one for column labels, one for row labels, and one for data, thinking I could use pandas to create a DataFrame.
columns = []
rows = []
li = []
for part in part_list:
    if part.width not in columns:
        columns.append(part.width)
    if part.length not in rows:
        rows.append(part.length)
    li.append([part.width, part.length, part.part_number])

data_dict = {
    'part_number': weld_stud.part_number,
    'diameter_pitch': weld_stud.thread,
    'length': weld_stud.fractional_length
}
grid_data.append(data_dict)
Then, using pandas, I did:
numpy_array = np.array(li)
df = pd.DataFrame(
data=numpy_array[1:,1:], # values
index=numpy_array[1:,0], # 1st column as index
columns=numpy_array[0,1:] # 1st row as the column names
)
This is obviously not outputting what I need, but I'm unclear where to go from here.
I re-edited this answer many times, so I hope this is the last time.
The answer consists of two parts:
1) Data preparation: I don't know your dataset, but I will guess
2) Data display
1a) Data preparation
I know one easy solution which may solve all your problems with data preparation, but one condition must be met: your widths and lengths have to be integers. If this is true, then the solution is easy.
Let's say the max width is 10 and the max length is 10 too.
You create an 11x11 grid in numpy (because from 0 to 10 is 11 cells):
import numpy as np

# chararray so the grid can hold short strings; itemsize sets the maximum string length
grid = np.chararray(shape=[11, 11], itemsize=8)
grid[:] = ''  # start with an empty grid (empty string fields)
for part in part_list:
    # If you have 2 parts with the same parameters you can concatenate the strings, so it's flexible
    grid[part.width, part.length] = str(part.part_number)
So basically you can use the array indexes as the actual width and length parameters.
1b) Data preparation
Your problem somehow stuck in my mind until I found a general solution (even though I still don't know your dataset). Based on the above example, if your parameters come in any form (int, float, string) you can do this:
# Let's say you have 3 types of widths and lengths and you index them.
# The key is your parameter, the value is its index position in the grid.
widths = {'7.2mm': 0, '9.6mm': 1, '11.4mm': 2}
lengths = {'2.2mm': 0, '4.8mm': 1, '16.8mm': 2}
header = [h for h in widths]  # useless in this example
side = [s for s in lengths]   # but can serve for the data display part
grid = np.chararray(shape=[3, 3], itemsize=8)
grid[:] = ''
for part in part_list:
    index_width = widths[part.width]
    index_length = lengths[part.length]
    grid[index_width, index_length] = str(part.part_number)
# This way, you don't need to care about missing part numbers and the grid can stay unfilled.
2) Data display
from prettytable import PrettyTable

def table(header, side, data):
    t = PrettyTable([''] + header)
    for i in range(len(data)):
        t.add_row([side[i]] + list(data[i]))
    print(t)
header = ['col1','col2','col3']
side = ['row1','row2','row3']
data = np.zeros(shape=[3,3])
>>> table(header, side, data)
+------+------+------+------+
| | col1 | col2 | col3 |
+------+------+------+------+
| row1 | 0.0 | 0.0 | 0.0 |
| row2 | 0.0 | 0.0 | 0.0 |
| row3 | 0.0 | 0.0 | 0.0 |
+------+------+------+------+
You can feed it with anything (like classes):
data2 = [['', '', PrettyTable], ['World', '', 5], ['', 'Hello', '']]
>>> table(header, side, data2)
+------+-------+-------+-----------------------------------+
| | col1 | col2 | col3 |
+------+-------+-------+-----------------------------------+
| row1 | | | <class 'prettytable.PrettyTable'> |
| row2 | World | | 5 |
| row3 | | Hello | |
+------+-------+-------+-----------------------------------+
EDIT: Based on sample data I was able to do what was required:
import numpy as np

widths = {'#10-24':0, '#10-32':1, '1/4-20':2, '5/16-18':3, '3/8-16':4, '1/2-13':5, '5/8-11':6, '3/4-10':7}  # map each width value to a column index
lenghts = {'5/8':0, '3/4':1, '7/8':2}  # map each length value to a row index
part_list = [{'part_number': 'FTC19-62', 'width': '#10-24', 'length': '5/8'},
             {'part_number': 'FTC19-75', 'width': '#10-32', 'length': '3/4'},
             {'part_number': 'FTC19-87', 'width': '#10-24', 'length': '7/8'}]
grid = np.chararray(shape=[len(lenghts), len(widths)]).astype('|S8')
grid[:, :] = ''
for part in part_list:
    index_width = widths[part['width']]
    index_lentgh = lenghts[part['length']]
    grid[index_lentgh, index_width] = str(part['part_number'])
header = sorted(widths, key=lambda k: widths[k])
side = sorted(lenghts, key=lambda k: lenghts[k])
from prettytable import PrettyTable
def table(header, side, data):
    t = PrettyTable([''] + header)
    for i in range(len(data)):
        t.add_row([side[i]] + list(data[i]))
    print(t)
table(header,side,grid)
+-----+----------+----------+--------+---------+--------+--------+--------+--------+
| | #10-24 | #10-32 | 1/4-20 | 5/16-18 | 3/8-16 | 1/2-13 | 5/8-11 | 3/4-10 |
+-----+----------+----------+--------+---------+--------+--------+--------+--------+
| 5/8 | FTC19-62 | | | | | | | |
| 3/4 | | FTC19-75 | | | | | | |
| 7/8 | FTC19-87 | | | | | | | |
+-----+----------+----------+--------+---------+--------+--------+--------+--------+