I have a question similar to this one: question
Referring to the code from the above post:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
my_list = [1,2,3,4,5,7,8,9,11,23,56,78,3,3,5,7,9,12]
new_list = pd.Series(my_list)
df1 = pd.DataFrame({'Range1':new_list.value_counts().index, 'Range2':new_list.value_counts().values})
df1.sort_values(by=["Range1"],inplace=True)
df2 = df1.groupby(pd.cut(df1["Range1"], [0,1,2,3,4,5,6,7,8,9,10,11,df1['Range1'].max()])).sum()
objects = df2['Range2'].index
y_pos = np.arange(len(df2['Range2'].index))
but I want the following sequence on the x-axis:
Expected output:
(00,01] (01,02] (02,03] (03,04]......
Any help in getting the expected output would be appreciated.
It isn't straightforward, but it is doable. You will need to format the left and right intervals separately.
l = df2.index.categories.left.map("{:02d}".format)
r = df2.index.categories.right.map("{:02d}".format)
plt.bar(range(len(df2)), df2['Range2'].values, tick_label='('+l+', '+r+']')
plt.xticks(fontsize=6)
plt.show()
Where,
print('('+l+', '+r+']')
Index(['(00, 01]', '(01, 02]', '(02, 03]', '(03, 04]', '(04, 05]', '(05, 06]',
'(06, 07]', '(07, 08]', '(08, 09]', '(09, 10]', '(10, 11]', '(11, 78]'],
dtype='object')
You may have to change the brackets depending on whether your intervals are closed on the left, on the right, or neither.
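To see which brackets apply, you can check the intervals themselves: `pd.cut` defaults to intervals closed on the right, and passing `right=False` closes them on the left instead. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 11])

# Default: intervals closed on the right -> "(left, right]"
right_closed = pd.cut(s, [0, 5, 12])
# right=False: intervals closed on the left -> "[left, right)"
left_closed = pd.cut(s, [0, 5, 12], right=False)

# The categories' `closed` attribute tells you which brackets to format
print(right_closed.cat.categories.closed)  # 'right'
print(left_closed.cat.categories.closed)   # 'left'
```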
Here is an example dataset, found via a Google search, that is close to the datasets in my environment.
I'm trying to get output like this:
import pandas as pd
import numpy as np
data = {'Product':['Box','Bottles','Pen','Markers','Bottles','Pen','Markers','Bottles','Box','Markers','Markers','Pen'],
'State':['Alaska','California','Texas','North Carolina','California','Texas','Alaska','Texas','North Carolina','Alaska','California','Texas'],
'Sales':[14,24,31,12,13,7,9,31,18,16,18,14]}
df=pd.DataFrame(data, columns=['Product','State','Sales'])
df1=df.sort_values('State')
#df1['Total']=df1.groupby('State').count()
df1['line']=df1.groupby('State').cumcount()+1
print(df1.to_string(index=False))
The commented-out line throws this error:
ValueError: Columns must be same length as key
I also tried size(), but it gives NaN for all rows.
I hope someone can point me in the right direction.
Thanks in advance.
I think this should work for 'Total':
df1['Total']=df1.groupby('State')['Product'].transform(lambda x: x.count())
Try this:
df = pd.DataFrame(data).sort_values("State")
grp = df.groupby("State")
df["Total"] = grp["State"].transform("size")
df["line"] = grp.cumcount() + 1
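A minimal run on a made-up three-row subset shows why transform works where a bare count() does not: count()/size() collapse the frame to one row per group (hence the length mismatch in the ValueError), while transform broadcasts the group size back onto every original row.

```python
import pandas as pd

df = pd.DataFrame({"Product": ["Box", "Pen", "Pen"],
                   "State": ["Alaska", "Texas", "Texas"]})

# transform("size") returns a Series aligned to the original rows,
# so it can be assigned directly as a column
df["Total"] = df.groupby("State")["State"].transform("size")
df["line"] = df.groupby("State").cumcount() + 1
print(df)
# Total is [1, 2, 2]; line is [1, 1, 2]
```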
I managed to write my first function; however, I do not understand it :-)
I approached my real problem with a simplified one. See the following code:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
T1_T_in = [398,397,395]
T1_p_in = [29,29,29]
T1_mPkt_in = [2.2,3,3.5]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck[i],temp[i]))
        Q.append(H[i]*menge[i])
    return Q
t1Q=Power(T1_p_in,T1_T_in,T1_mPkt_in)
t3Q = Power(T3_p_in,T3_T_in,T3_mPkt_in)
print(t1Q)
print(t3Q)
It works. The real problem is different in that I read the data from an Excel file. I got an error message and (according to my learnings from this good homepage :-)) I added ".tolist()" in the function, and now it works. I do not understand why I need to convert it to a list. Can anybody explain it to me? Thank you for your help.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pyXSteam.XSteam import XSteam
steamTable = XSteam(XSteam.UNIT_SYSTEM_MKS)
pfad="XXX.xlsx"
df = pd.read_excel(pfad)
T1T_in = df.iloc[2:746,1]
T1p_in = df.iloc[2:746,2]
T1mPkt_in = df.iloc[2:746,3]
def Power(druck,temp,menge):
    H = []
    Q = []
    for i in range(len(druck)):
        H.append(steamTable.h_pt(druck.tolist()[i],temp.tolist()[i]))
        Q.append(H[i]*menge.tolist()[i])
    return Q
t1Q=Power(T1p_in,T1T_in,T1mPkt_in)
t1Q[0:10]
Your first example works because you pass the T1_mPkt_in variable into the menge parameter as a list:
T1_mPkt_in = [2.2,3,3.5]
Your second example does not work because you pass the T1mPkt_in variable into the menge parameter as a series, not a list:
T1mPkt_in = df.iloc[2:746,3]
If you print out the type of T1mPkt_in, you will get:
<class 'pandas.core.series.Series'>
In pandas, you can call .tolist() to convert a series back into a plain list, which can then be indexed by position. (A series sliced with iloc keeps its original labels, so menge[0] looks up the label 0, which may not exist, rather than the first element.)
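A small illustration, with a made-up Series standing in for the Excel column: after an iloc slice the Series keeps its original integer labels, so plain `[i]` indexing is label-based, while a list only has positions.

```python
import pandas as pd

# Labels 2..5, as a Series would look after an iloc slice like df.iloc[2:746, 3]
col = pd.Series([2.2, 3.0, 3.5, 4.0], index=[2, 3, 4, 5])

# col[2] finds the *label* 2, which happens to be the first element here;
# col[0] would raise a KeyError, because there is no label 0
print(col[2])           # 2.2

# After .tolist() only positions remain, so [0] is simply the first element
print(col.tolist()[0])  # 2.2
```

Positional access is also possible without converting, via `col.iloc[0]`.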
This is my code:
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a = pd.crosstab(df.HH, df.MM, margins=True)
print (a)
I would like to view results in a table format. Table borders or at least the same number of digits would solve the problem.
I want to see the table on console without any graphical window.
If you want a nice-looking crosstab, you can use seaborn.heatmap.
An example
>>> import numpy as np; np.random.seed(0)
>>> import seaborn as sns; sns.set()
>>> uniform_data = np.random.rand(10, 12)
>>> ax = sns.heatmap(uniform_data)
The result would look like this:
You can find many examples that show how to apply this, e.g.:
https://www.science-emergence.com/Codes/How-to-plot-a-confusion-matrix-with-matplotlib-and-seaborn/
https://gist.github.com/shaypal5/94c53d765083101efc0240d776a23823
Update
In order to simply display the crosstab in a formatted way, you can skip the print and use display (available in IPython/Jupyter) like this:
import pandas as pd
cols= ['DD','MM','YYYY','HH'] #names
DD,MM,YYYY,HH=[1,2,None,4,5,5],[1,1,1,2,2,3],[2014,2014,2014,2014,2014,2014],[20,20,20,18,18,18] #data
df = pd.DataFrame(list(zip(DD,MM,YYYY,HH)), columns =cols )
print (df)
a = pd.crosstab(df.HH, df.MM, margins=True)
display(a)
which will yield the same result as:
pd.crosstab(df.HH, df.MM, margins=True)
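Note that display comes from IPython, so it only helps inside a notebook. Since the question asks for a plain console without any graphical window, one possible alternative is DataFrame.to_string, which pads every column to the same width:

```python
import pandas as pd

MM = [1, 1, 1, 2, 2, 3]
HH = [20, 20, 20, 18, 18, 18]
df = pd.DataFrame({"MM": MM, "HH": HH})
a = pd.crosstab(df.HH, df.MM, margins=True)

# to_string() right-aligns every column, so the digits line up on a console
print(a.to_string())
```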
I have a pandas dataframe which has a column 'INTENSITY' and a numpy array of same length containing the error for each intensity. I would like to generate columns with randomly generated numbers in the error range.
So far I use two nested for loops to create the new columns but I feel like this is inefficient:
theor_err = [ sqrt(abs(x)) for x in theor_df[str(INTENSITY)] ]
theor_err = np.asarray(theor_err)
for nr_sample in range(2):
    sample = np.zeros(len(theor_df[str(INTENSITY)]))
    for i, error in enumerate(theor_err):
        sample[i] = theor_df[str(INTENSITY)][i] + random.uniform(-error, error)
    theor_df['gen_{}'.format(nr_sample)] = Series(sample, index=theor_df.index)
theor_df.head()
Is there a more efficient way of approaching a problem like this?
NumPy can handle the arrays for you, so you can do it like this:
import pandas as pd
import numpy as np
a=pd.DataFrame([10,20,15,30],columns=['INTENSITY'])
a['theor_err']=np.sqrt(np.abs(a.INTENSITY))
a['sample']=np.random.uniform(-a['theor_err'],a['theor_err'])
Suppose you want to generate 6 samples. You can try the code below; tune the number of samples by setting the value of k.
df = pd.DataFrame([[1],[2],[3],[4],[-5]], columns=["intensity"])
k = 6
sample_names = ["sample" + str(i+1) for i in range(k)]
df["err"] = np.sqrt(np.abs((df["intensity"])))
df[sample_names] = pd.DataFrame(
    df["err"].map(lambda x: np.random.uniform(-x, x, k)).values.tolist())
df.loc[:,sample_names] = df.loc[:,sample_names].add(df.intensity, axis=0)
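The same idea can also be collapsed into a single broadcasted draw: np.random.uniform accepts a size argument, so all k sample columns (the names sample1..sample6 are made up here) come from one (n, k) call. A sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"intensity": [1, 2, 3, 4, -5]})
k = 6
err = np.sqrt(np.abs(df["intensity"].to_numpy()))

# One (n, k) draw in [-1, 1), scaled per row by the error and shifted by the intensity
noise = np.random.uniform(-1.0, 1.0, size=(len(df), k)) * err[:, None]
samples = df["intensity"].to_numpy()[:, None] + noise

for i in range(k):
    df[f"sample{i+1}"] = samples[:, i]
```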
I want to execute SVM light and SVM rank, so I need to convert my data into SVM light format.
But I ran into a big problem.
My Python code is below:
import pandas as pd
import numpy as np
from sklearn.datasets import dump_svmlight_file
self.df = pd.DataFrame()
self.df['patent_id'] = patent_id_list
self.df['Target'] = class_list
self.df['backward_citation'] = backward_citation_list
self.df['uspc_originality'] = uspc_originality_list
self.df['science_linkage'] = science_linkage_list
self.df['sim_bc_structure'] = sim_bc_structure_list
self.df['claim_num'] = claim_num_list
self.qid = dataset_list
X = self.df[np.setdiff1d(self.df.columns, ['patent_id','Target'])]
y = self.df.Target
dump_svmlight_file(X,y,'test.dat',zero_based=False, query_id=self.qid,multilabel=False)
The output file "test.dat" looks like this:
But the real data looks like this:
I got the wrong indices.
Take the first instance for example: the value of column 1 is 7, the values of columns 2 to 4 are zeros, and the value of column 5 is 2.
So my expected result looks like this:
1 qid:1 1:7 5:2
But the column indices in the output file are totally wrong, and unfortunately I cannot figure out where the problem occurs. I have not been able to fix this for a long time.
Thank you for your help!
I changed the data structure: I used np.array to produce the array-like input, and it finally succeeded!
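For reference, a likely culprit in the original code is that np.setdiff1d returns a *sorted* array, so the feature columns were reordered alphabetically before being dumped. Dropping the unwanted columns instead preserves the remaining column order; a small sketch with made-up data:

```python
import pandas as pd
from sklearn.datasets import dump_svmlight_file

df = pd.DataFrame({"patent_id": ["p1", "p2"],
                   "Target": [1, 0],
                   "backward_citation": [7, 0],
                   "claim_num": [2, 5]})

# drop() keeps the remaining columns in their original order,
# unlike np.setdiff1d, which sorts the column names
X = df.drop(columns=["patent_id", "Target"]).to_numpy()
y = df["Target"].to_numpy()
dump_svmlight_file(X, y, "test.dat", zero_based=False)
```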
If you're interested in loading into a numpy array, try:
X = clicks_train[:,0:2]
y = clicks_train[:,2]
where 2 is the index of the target column.