My end goal: I want to upload a new table file (be it Excel, PDF, txt, or CSV) into a master Excel spreadsheet. Then, pick out several columns from that master datasheet and graph them, grouped by their state: do these samples fall under category X or Y? And then plot them, looking at each sample.
Datawise, I have samples that fall under category X or category Y. They are either X or Y, and have names associated with them as well as sample counts (e.g. sample #abc falls under category X and has 30 counts with different values).
I am using pandas to open the data and manipulate the tables. Here is the section of my code that is giving me issues; for all other issues I have found workarounds, but this one I cannot. I have tried fillna(df_choice.column) and found that it doesn't replace the NaN. Tried reset_index and set_index; that doesn't really help. A sample output would have X (if I entered "X" first) or Y (if I entered "y" first) as NaN, yet the other three columns have no issues!
df_choice is a previous dataframe where I selected two columns from a master datasheet. In this case, it'd be columns X and Y.
df_new = {'X': [], 'Y': [], 'Property of X (units)': [], 'Property of Y (units)': []}  # set up dict
df_new = pd.DataFrame.from_dict(df_new)  # dict to df

for df_choice.column in df_choice.columns:  # loop over columns in previous dataset
    print(df_choice.column)
    state = input('Is this sample considered X or Y? Input X or Y, or quit to exit the loop')  # ask if column in previous dataset falls under X or Y
    if state == 'quit':
        break
    elif state == 'x':
        df_new['Property of X (units)'] = df_choice[df_choice.column]  # takes data from old dataframe into new
        df_new['X'] = 'df_choice.column'  # fills column X with column name from df_choice
    elif state == 'y':
        df_new['Y'] = 'df_choice.column'
        df_new['Property of Y (units)'] = df_choice[df_choice.column]
    else:
        print('Not a valid response')

df_new  # prints new df
What I see (only showing 4 rows, but imagine the X column being NaN in every row):
+-----+------------+----------------+----------------+
| X | Y | Property of X | Property of Y |
+-----+------------+----------------+----------------+
| NaN | Sample123 | 4 | 3 |
| NaN | Sample123 | 5 | 4 |
| NaN | Sample123 | 3 | 6 |
| NaN | Sample123 | 4 | 1 |
+-----+------------+----------------+----------------+
What I should get:
+-----------+------------+----------------+----------------+
| X | Y | Property of X | Property of Y |
+-----------+------------+----------------+----------------+
| SampleABC | Sample123 | 4 | 3 |
| SampleABC | Sample123 | 5 | 4 |
| SampleABC | Sample123 | 3 | 6 |
| SampleABC | Sample123 | 4 | 1 |
+-----------+------------+----------------+----------------+
Eventually, I assume I'd want something like df_new.melt() so I can graph them, grouping bars or boxplots by X or Y; a sketch of one way to do that reshape follows the table:
+-----------+-------+-----------+
| Sample | Type | Property |
+-----------+-------+-----------+
| SampleABC | X | 4 |
| SampleABC | X | 5 |
| SampleABC | X | 3 |
| SampleABC | X | 4 |
| Sample123 | Y | 3 |
| Sample123 | Y | 4 |
| Sample123 | Y | 6 |
| Sample123 | Y | 1 |
+-----------+-------+-----------+
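For reference, a minimal sketch of that reshape, assuming the four-column df_new above. A single melt() doesn't pair each sample name with its values, so this stacks the X half and the Y half with concat instead:

import pandas as pd

# Rename each half to a common Sample/Property schema, tag it with its
# Type, then stack the two halves on top of each other.
x_half = df_new[['X', 'Property of X (units)']].rename(
    columns={'X': 'Sample', 'Property of X (units)': 'Property'})
x_half['Type'] = 'X'
y_half = df_new[['Y', 'Property of Y (units)']].rename(
    columns={'Y': 'Sample', 'Property of Y (units)': 'Property'})
y_half['Type'] = 'Y'

df_long = pd.concat([x_half, y_half], ignore_index=True)[['Sample', 'Type', 'Property']]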
I am a month or so into self-taught coding, so I apologize if my code is inefficient or not very clever. I come across issues, look up how other people solve them, and see what works for me. No formal training; I am a materials scientist by training. I don't know a whole lot, and I figured the best way to learn is to get some fundamentals down and then make something genuinely useful to me.
You should pass the name of your column as a vector rather than a single string when inserting into an empty dataframe.
Think about it this way: you're creating a column in an empty dataframe by passing a single string to it. But how can pandas know what length the column should have?
The variable in your for-loop also has a somewhat confusing name: the dot in "df_choice.column" makes it look as if you're accessing a dataframe attribute.
Putting it together:
for colname in df_choice.columns:
    # ... (prompt and quit check unchanged) ...
    elif state == 'x':
        # takes data from old dataframe into new
        df_new['Property of X (units)'] = df_choice[colname]
        # fills column X with column name from df_choice
        df_new['X'] = np.repeat(colname, df_choice.shape[0])
    elif state == 'y':
        df_new['Y'] = np.repeat(colname, df_choice.shape[0])
        df_new['Property of Y (units)'] = df_choice[colname]
Notice that I replaced the line for your "Y" variable as well, just in case it comes up before "X".
To use np.repeat, import the library:
import numpy as np
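For completeness, a minimal runnable sketch of the whole loop with a made-up df_choice (the column names SampleABC and Sample123 are hypothetical stand-ins for whatever comes out of the master sheet; .lower() is a small tweak so "X" and "x" both work):

import numpy as np
import pandas as pd

# Hypothetical stand-in for the two columns selected from the master sheet.
df_choice = pd.DataFrame({'SampleABC': [4, 5, 3, 4],
                          'Sample123': [3, 4, 6, 1]})
df_new = pd.DataFrame({'X': [], 'Y': [],
                       'Property of X (units)': [], 'Property of Y (units)': []})

for colname in df_choice.columns:
    print(colname)
    state = input('Is this sample considered X or Y? Input X or Y, or quit to exit the loop ')
    if state == 'quit':
        break
    elif state.lower() == 'x':
        df_new['Property of X (units)'] = df_choice[colname]
        df_new['X'] = np.repeat(colname, df_choice.shape[0])
    elif state.lower() == 'y':
        df_new['Y'] = np.repeat(colname, df_choice.shape[0])
        df_new['Property of Y (units)'] = df_choice[colname]
    else:
        print('Not a valid response')

print(df_new)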
I am trying to output, in a new column, integer values (labels/classes) based on the labels of another column in my dataset. I actually did it by creating a new boolean column for each class (with a numerical column heading), which I can then use to build the new class column with numerical values. But I was trying to do it with a dictionary, which I think is a cleaner and faster way.
If I run code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
It raises KeyError: None.
Does anybody know why this occurs? I think that's strange, since the dictionary has been created and I can see it through my variables.
Thanks
Edit 1
@Fourier: I simply have this column:
| Item_Type |
| --------- |
| Nino      |
| Nino      |
| Nino      |
| Pasquale  |
| Franco    |
| Franco    |
and then I need the same column or a new one to display:
| Item_Type | New_column |
| --------- | ---------- |
| Nino      | 1          |
| Nino      | 1          |
| Nino      | 1          |
| Pasquale  | 2          |
| Franco    | 3          |
| Franco    | 3          |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
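Note that cat.codes are zero-based and follow the (sorted) category order. If the 1-based, order-of-appearance numbering from the example output matters, pd.factorize is a closer fit (a sketch):

df['New_column'] = pd.factorize(df['Item_Type'])[0] + 1  # Nino -> 1, Pasquale -> 2, Franco -> 3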
I'm really new to coding. I have columns in Excel: one for the ingredients, one for the ratio (spec), and one for the quantity.
Like this:
ingredients: [methanol/ipa, ethanol/methanol, ethylacetate]
spec:        [90/10, 70/30, 100]
qty:         [5, 6, 10]
This data is entered continuously. I want to get the total amount of each ingredient: e.g. from the first row, methanol will be 5 × 90 and ipa will be 5 × 10.
I tried to split them on / and use a for loop to iterate over the rows:
import pandas as pd

solv = {'EA': 0, 'M': 0, 'AL': 0, 'IPA': 0}
data_xls1 = pd.read_excel(r'C:\Users\IT123\Desktop\Solvent stock.xlsx',
                          sheet_name='PLANT', index_col=None)
sz = range(len(data_xls1.index))
a = data_xls1.Solvent.str.split('/', 0).tolist()
b = data_xls1.Spec.str.split('/', 0).tolist()
print(a)
for i in sz:
    print(b[i][0:1])
    print(b[i][1:2])
I want to split the ingredients and spec columns, multiply by qty, and store the totals in the solv dictionary.
The error right now is: 'float' object is not subscriptable.
You have already found the key part, namely the str.split function.
I would suggest that you bring the data to a long format like this:
| | Transaction | ingredients | spec | qty |
|---:|--------------:|:--------------|-------:|------:|
| 0 | 0 | methanol | 90 | 4.5 |
| 1 | 0 | ipa | 10 | 0.5 |
| 2 | 1 | ethanol | 70 | 4.2 |
| 3 | 1 | methanol | 30 | 1.8 |
| 4 | 2 | ethylacetate | 100 | 10 |
The following code produces that result:
import pandas as pd

d = {"ingredients": ["methanol/ipa", "ethanol/methanol", "ethylacetate"],
     "spec": ["90/10", "70/30", "100"],
     "qty": [5, 6, 10]}
df = pd.DataFrame(d)
df.index = df.index.rename("Transaction")  # add a sensible name to the index

# Each line represents a transaction with one or more ingredients.
# The following lines split each line on the delimiter; stack() moves them to long format.
ingredients = df.ingredients.str.split("/", expand=True).stack()
spec = df.spec.str.split("/", expand=True).stack()
Each of them will look like this (a Series with a two-level index of Transaction and split position; shown here for spec):
| (Transaction, n) | spec |
|:-----------------|-----:|
| (0, 0)           |   90 |
| (0, 1)           |   10 |
| (1, 0)           |   70 |
| (1, 1)           |   30 |
| (2, 0)           |  100 |
Now we just need to put everything together:
df_new = pd.concat([ingredients, spec], axis="columns")
df_new.columns = ["ingredients", "spec"]
# Switch from string to float
df_new.spec = df_new.spec.astype("float")
# Multiply by the quantity.
# Pandas automatically aligns on Transaction (the index of both frames).
df_new["qty"] = df_new.spec * df.qty / 100
# If you are not comfortable working with a MultiIndex, just run this line:
df_new = df_new.reset_index(level=0, drop=False).reset_index(drop=True)
The good thing about this format is that you can have multi-way splits for your ingredients: str.split will handle them without a problem, and summing up is straightforward.
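From the long format, the per-ingredient totals the question asks for are one groupby away (a sketch, assuming the df_new built above):

totals = df_new.groupby("ingredients")["qty"].sum()
print(totals)
# ethanol          4.2
# ethylacetate    10.0
# ipa              0.5
# methanol         6.3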
I should have posted this first, but this is what my input Excel sheet looks like.
I have some data as follows:
+--------+------+
| Reason | Keys |
+--------+------+
| x | a |
| y | a |
| z | a |
| y | b |
| z | b |
| x | c |
| w | d |
| x | d |
| w | d |
+--------+------+
I want to get the Reason corresponding to the first occurrence of each Key. Like here, I should get Reasons x,y,x,w for Keys a,b,c,d respectively. After that, I want to compute the percentage of each Reason, as in a metric for how many times each Reason occurs. Thus x = 2/4 = 50%. And w,y = 25% each.
For the percentage, I think I can use something like value_counts(normalize=True) * 100, based on the previous step. What is a good way to proceed?
You are right about the second step, and the first step can be achieved by
summary = df.groupby("Keys").first()
You can use drop_duplicates:
df.drop_duplicates(['Keys'])
Out[207]:
  Reason Keys
0      x    a
3      y    b
5      x    c
6      w    d
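Either route gives one row per Key; the percentage step from the question then applies directly (a sketch using the drop_duplicates result):

firsts = df.drop_duplicates(['Keys'])
pct = firsts['Reason'].value_counts(normalize=True) * 100
# x    50.0
# y    25.0
# w    25.0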
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
|-----------|--------|---|---|---|
| 0         | 1      | 1 | 2 | 3 |
| 0         | 2      | 4 | 5 | 6 |
| 1         | 1      | 1 | 2 | 3 |
| 1         | 2      | 7 | 8 | 9 |
I wish to create a dataframe based on results from this. A simple example: the new dataframe should contain one row per sample with the averages across its runs:
| sample_id | avg_x | avg_y | avg_z |
|-----------|-------|-------|-------|
| 0         | 2.5   | 3.5   | 4.5   |
| 1         | 4     | 5     | 6     |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed
    # in the initial df_samples dataframe.
    pivots.append(pivot)

# create the new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
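For reference, the plain averaging case can be written as a single groupby with no loop (a sketch, assuming df_samples as shown above; add_prefix just renames x to avg_x and so on):

df_avg = (df_samples.groupby('sample_id')[['x', 'y', 'z']]
          .mean()
          .add_prefix('avg_')
          .reset_index())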
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also grows in dimensions? I.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1                   | new_feature_2                                     |
|------|-------|-------|-------|---------------------------------|---------------------------------------------------|
| 0    | 2.5   | 3.5   | 4.5   | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1    | 4     | 5     | 6     | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
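One pattern that also covers the growing-dimensions case is groupby().apply() with a function that returns a Series of all per-sample features (a sketch; per_sample_features and the std reduction are hypothetical placeholders for the real f and g):

def per_sample_features(g):
    feats = g[['x', 'y', 'z']].mean().add_prefix('avg_')
    # Placeholder for f(...): any reduction over the raw runs of this sample.
    feats['new_feature_1'] = g[['x', 'y', 'z']].to_numpy().std()
    return feats

result = df_samples.groupby('sample_id').apply(per_sample_features).reset_index()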
Aside: I am looking for a good resource on working with large pandas dataframes, performantly constructing new ones, and performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I do not know about dataframes and tables. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that covers my two use cases above will be accepted with or without a resource recommendation.
So I have a dataframe with some values. This is my dataframe:
| in | x | y | z |
|----|---|---|---|
| 1  | a | a | b |
| 2  | a | b | b |
| 3  | a | b | c |
| 4  | b | b | c |
I would like to get number of unique values of each row, and number of values that are not equal to value in column x. The result should look like this:
| in | x | y | z   | count of not x | unique |
|----|---|---|-----|----------------|--------|
| 1  | a | a | b   | 1              | 2      |
| 2  | a | b | b   | 2              | 2      |
| 3  | a | b | c   | 2              | 3      |
| 4  | b | b | nan | 0              | 1      |
I could come up with some dirty solutions here, but there must be some elegant way of doing this. My mind is circling around drop_duplicates (which does not work on a Series); converting to an array and using .unique(); df.iterrows(), which I want to avoid; and .apply on each row.
Here are solutions using apply.
df['count of not x'] = df.apply(lambda x: (x[['y','z']] != x['x']).sum(), axis=1)
df['unique'] = df.apply(lambda x: x[['x','y','z']].nunique(), axis=1)
A non-apply solution for getting count of not x:
df['count of not x'] = (~df[['y','z']].isin(df['x'])).sum(1)
Can't think of anything great for unique. This uses apply, but may be faster, depending on the shape of the data.
df['unique'] = df[['x','y','z']].T.apply(lambda x: x.nunique())
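For the unique count, DataFrame.nunique also accepts an axis argument in modern pandas, which avoids the transpose:

df['unique'] = df[['x', 'y', 'z']].nunique(axis=1)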