Consider the dataframe below, df:
one two three four five six seven eight
0 0.1 1.1 2.2 3.3 3.6 4.1 0.0 0.0
1 0.1 2.1 2.3 3.2 3.7 4.3 0.0 0.0
2 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.1 1.2 2.5 3.7 4.4 0.0 0.0 0.0
4 1.7 2.1 0.0 0.0 0.0 0.0 0.0 0.0
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
7 2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.1 1.8 0.0 0.0 0.0 0.0 0.0 0.0
9 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to select all rows where any column's value is 3.2, but at the same time the selected rows should not contain the values 0.1 or 1.2.
I am able to get the first part with the query below:
df[df.values == 3.2]
but I cannot combine this with the second part of the query (the joint != condition).
I also get the following warning:
DeprecationWarning: elementwise != comparison failed; this will raise an error in the future.
on the larger dataset (but not on the smaller replica) when trying the following:
df[df.values != [0.1,1.2]]
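A likely cause of that warning (my reading of the shapes involved, not verified against the actual file): the list [0.1, 1.2] has length 2, which cannot broadcast against the 8 columns of df.values, so NumPy falls back to a failed elementwise comparison. A minimal reproduction sketch:

import numpy as np
a = np.zeros((3, 8))
print(a != [0.1, 1.2])  # shapes (3, 8) vs (2,): no broadcast, so NumPy warns
                        # and returns a single scalar instead of a boolean mask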
Edit: @pensen, here is the output; rows 1, 15, 27, and 35 contain the value 0.1, though per the condition they should have been filtered out.
contains = df.eq(3.2).any(axis=1)
not_contains = ~df.isin([0.1,1.2]).any(axis=1)
print(df[contains & not_contains])
0 1 2 3 4 5 6 7
1 0.1 2.1 3.2 0.0 0.0 0.0 0.0 0.0
15 0.1 1.1 2.2 3.2 3.3 3.6 3.7 0.0
27 0.1 2.1 2.3 3.2 3.6 3.7 4.3 0.0
31 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
35 0.1 1.7 2.1 3.2 3.6 3.7 4.3 0.0
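One possible reason rows with 0.1 survive the filter (an assumption, since the raw file isn't shown here): values parsed from the file may differ from 0.1 by floating-point noise that the display rounds away, so the exact isin comparison misses them. A tolerance-based sketch with np.isclose:

import numpy as np
a = df.to_numpy()
contains = np.isclose(a, 3.2).any(axis=1)
not_contains = ~(np.isclose(a, 0.1) | np.isclose(a, 1.2)).any(axis=1)
print(df[contains & not_contains])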
Here is the original dataset (rows 0 to 35) to replicate the above output:
0 1 2 3 4 5 6 7
0 4.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.1 2.1 3.2 0.0 0.0 0.0 0.0 0.0
2 0.1 2.4 2.5 0.0 0.0 0.0 0.0 0.0
3 2.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.1 2.1 4.1 0.0 0.0 0.0 0.0 0.0
7 4.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 1.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 2.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
11 1.1 4.1 0.0 0.0 0.0 0.0 0.0 0.0
12 0.1 2.2 3.3 3.6 0.0 0.0 0.0 0.0
13 0.1 1.8 3.3 0.0 0.0 0.0 0.0 0.0
14 0.1 1.2 1.3 2.5 3.7 4.2 0.0 0.0
15 0.1 1.1 2.2 3.2 3.3 3.6 3.7 0.0
16 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
17 1.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
18 1.3 2.5 0.0 0.0 0.0 0.0 0.0 0.0
19 0.1 1.2 2.5 3.7 4.4 0.0 0.0 0.0
20 1.2 4.4 0.0 0.0 0.0 0.0 0.0 0.0
21 4.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0
22 1.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 0.1 2.2 2.4 2.5 3.7 0.0 0.0 0.0
24 0.1 2.4 4.3 0.0 0.0 0.0 0.0 0.0
25 1.7 0.0 0.0 0.0 0.0 0.0 0.0 0.0
26 0.1 1.1 4.1 0.0 0.0 0.0 0.0 0.0
27 0.1 2.1 2.3 3.2 3.6 3.7 4.3 0.0
28 1.4 2.2 3.6 4.1 0.0 0.0 0.0 0.0
29 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0
30 1.2 4.4 0.0 0.0 0.0 0.0 0.0 0.0
31 3.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0
32 3.6 4.1 0.0 0.0 0.0 0.0 0.0 0.0
33 2.1 2.4 0.0 0.0 0.0 0.0 0.0 0.0
34 0.1 1.8 0.0 0.0 0.0 0.0 0.0 0.0
35 0.1 1.7 2.1 3.2 3.6 3.7 4.3 0.0
Here is the link to the actual dataset:
In short, you can do the following:
df.eq(3.2).any(axis=1) & ~df.isin([0.1, 1.2]).any(axis=1)
Or, more explicitly:
contains = df.eq(3.2).any(axis=1)
not_contains = ~df.isin([0.1,1.2]).any(axis=1)
print(df[contains & not_contains])
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
For performance, especially since you mentioned a large dataset, and if you are looking to exclude just two numbers, here's one approach using the underlying array data:
a = df.values
df_out = df.iloc[(a == 3.2).any(1) & (((a!=0.1) & (a!=1.2)).all(1))]
Sample run -
In [43]: a = df.values
In [44]: df.iloc[(a == 3.2).any(1) & (((a!=0.1) & (a!=1.2)).all(1))]
Out[44]:
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0 0 0 0
6 2.1 2.3 3.2 4.3 0 0 0 0
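If more than two values ever need excluding, the same array approach generalizes with np.isin (a sketch, reusing a = df.values from above):

import numpy as np
excl = [0.1, 1.2]  # extend this list as needed
df_out = df.iloc[(a == 3.2).any(1) & ~np.isin(a, excl).any(1)]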
You could just combine the conditions.
>>> df[(df == 3.2).any(1) & ~df.isin([0.1, 1.2]).any(1)]
one two three four five six seven eight
5 2.1 3.2 0.0 0.0 0.0 0.0 0.0 0.0
6 2.1 2.3 3.2 4.3 0.0 0.0 0.0 0.0
Here I have a dataset:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14876 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14877 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14878 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
The Y-axis represents seconds, and the X-axis represents binned size ranges. This is data for cloud particle concentration. So for every second, there are 42 pieces of data that represent the number of particles that exist within a certain size range.
As I already said, each column represents a certain size range; those ranges, for example, are:
(micrometers)
0 = 0.0 to 20.0
1 = 20.0 to 40.0
2 = 40.0 to 60.0
3 = 60.0 to 80.0
4 = 80.0 to 100.0
5 = 100.0 to 125.0
6 = 125.0 to 150.0
7 = 150.0 to 200.0
8 = 200.0 to 250.0
9 = 250.0 to 300.0
10 = 300.0 to 350.0
11 = 350.0 to 400.0
12 = 400.0 to 475.0
etc.
The reason I included so many is to show how the bins are spaced. The width of the bins increases, and the increase in width does not follow any sort of formula.
What I want to do is replace the index for each column on the X-axis with these binned size ranges, and create a filled contour plot very similar to this: [example plot image]
I am using a pandas dataframe to store the dataset and I am currently using pyplot to attempt some plotting using pcolormesh.
Edit: Here is my attempt at starting with this.
#reading dataset extracted using h5py into pandas dataframe
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import cm

df = pd.DataFrame(ds_arr)
df = df.replace(-999.0, 0)
#creating list for bin midpoints
strcols = [10.0, 30.0, 50.0, 70.0, 90.0, 112.5, 137.5, 175.0, 225.0, 275.0, 325.0, 375.0, 437.5, 512.5, 587.5, 662.5, 750.0, 850.0, 950.0, 1100.0, 1300.0, 1500.0, 1700.0, 2000.0, 2400.0, 2800.0, 3200.0, 3600.0, 4000.0, 4400.0, 4800.0, 5500.0, 6500.0, 7500.0, 8500.0, 9500.0, 11000.0, 13000.0, 15000.0, 17000.0, 19000.0, 22500.0]
#add new column to end of df
newdf = df
newdf['midpoints'] = strcols
#set the index to be the new column, and delete the name.
newdf.set_index('midpoints', drop=True, inplace=True)
newdf.index.name=None
#getting bins on the X-axis
newdf = newdf.T
print(newdf)
#check data type of indices
print('\ncolumns are:',type(newdf.columns),)
print('rows are:',type(newdf.index))
#creating figure
fig, ax = plt.subplots(figsize=(13, 5))
fig.tight_layout(pad = 6)
#setting colormap, .copy so you can modify it inplace without an error
cmap = cm.gnuplot2.copy()
cmap.set_bad(color='black')
#plotting data using pcolormesh
plot = ax.pcolormesh(newdf, norm = mpl.colors.LogNorm(), cmap = cmap)
plt.title('Number Concentration', pad=12.8)
plt.xlabel("Bins", rotation=0, labelpad=17.5)
plt.ylabel("Time(Seconds)", labelpad=8.5)
cb = plt.colorbar(plot, shrink=1, aspect=25, location='right')
cb.ax.set_title('#/m4', pad=10, fontsize=10.5)
plt.show()
The resulting dataframe, where I use the midpoints of my desired bins as the header labels, looks like this:
10.0 30.0 50.0 70.0 90.0 112.5 137.5 175.0 225.0 275.0 325.0 375.0 437.5 ... 4400.0 4800.0 5500.0 6500.0 7500.0 8500.0 9500.0 11000.0 13000.0 15000.0 17000.0 19000.0 22500.0
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
14876 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14877 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14878 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14879 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
14880 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
And this is the plot I generated: [generated plot image]
The problem here is that the tick marks on the X-axis do not match the column headers in my resulting dataset.
Here is the output when I check what type of data my indices are:
columns are: <class 'pandas.core.indexes.numeric.Float64Index'>
rows are: <class 'pandas.core.indexes.base.Index'>
To be clear, my goal is to give each column a bin width, where the range of size data on the X-axis starts at 0 and ends at the endpoint of my last bin. I want to be able to hardcode the bin widths for each column individually. I would also like to be able to display the bins logarithmically scaled, or anything similar.
How should I configure my dataframe to be able to output a plot similar to the example plot, with unevenly spaced yet logarithmically scaled binned data?
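One direction that may help (a sketch with made-up numbers, not the original poster's code): pcolormesh accepts explicit cell edges, so passing the real bin boundaries as the X coordinates makes the ticks line up with the columns, and the size axis can then be log-scaled. With the actual data, x_edges would be the 43 boundaries of the 42 size bins.

import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

# hypothetical uneven bin edges; the first edge must be > 0 for a log axis
x_edges = np.array([1.0, 20.0, 40.0, 60.0, 80.0, 100.0, 125.0])
y_edges = np.arange(5)                    # seconds 0..4
rng = np.random.default_rng(0)
data = rng.random((4, 6)) + 0.01          # shape (len(y_edges)-1, len(x_edges)-1)

fig, ax = plt.subplots()
mesh = ax.pcolormesh(x_edges, y_edges, data, norm=mpl.colors.LogNorm())
ax.set_xscale('log')                      # uneven bin widths render correctly
fig.colorbar(mesh, ax=ax)
plt.show()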
It seems like the while loop should terminate once start == 1, but it keeps going. It also seems it's not actually printing the values, just 0.
Given a positive integer n, the following rules will always create a
sequence that ends with 1, called the hailstone sequence:
If n is even, divide it by 2
If n is odd, multiply it by 3 and add 1 (i.e. 3n + 1)
Continue until n is 1
Write a program that reads an
integer as input and prints the hailstone sequence starting with the
integer entered. Format the output so that ten integers, each
separated by a tab character (\t), are printed per line.
The output format can be achieved as follows: print(n, end='\t')
Ex: If the input is:
25
the output is:
25 76 38 19 58 29 88 44 22 11
34 17 52 26 13 40 20 10 5 16
8 4 2 1
My code:
''' Type your code here. '''
start = int()
while True:
    print(start, end='\t')
    if start % 2 == 0:
        start = start/2
        print(start, end='\t')
    elif start % 2 == 1:
        start = (start *3)+1
        print(start, end='\t')
    if start == 1:
        print(start, end='\t')
        break
    print(start, end='\t')
Program errors displayed here
Program generated too much output.
Output restricted to 50000 characters.
Check program for any unterminated loops generating output.
Program output displayed here
0 0.0 0.0 0.0 0.0 0.0 0.0 ... (0.0 repeats until the output limit is reached)
Your loop isn't terminating because start is 0: start = int() returns 0 and never reads the input. Since 0 % 2 == 0 is true and 0/2 == 0, you become stuck in an infinite loop. You could fix this by reading the input and raising an exception if start <= 0, like this:
start = int(input())
if start <= 0:
    raise Exception('Start must be strictly positive')
while True:
    print(start, end='\t')
    if not start % 2:
        start //= 2
    elif start % 2:
        start = 3*start+1
    if start == 1:
        break
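Beyond that fix, the exercise also asks for ten numbers per line and for the sequence to end with the printed 1; a fuller sketch (my completion, not part of the answer above):

n = int(input())
if n <= 0:
    raise ValueError('n must be strictly positive')
count = 0
while True:
    print(n, end='\t')
    count += 1
    if count % 10 == 0:
        print()                  # new line after every ten numbers
    if n == 1:
        break
    n = n // 2 if n % 2 == 0 else 3 * n + 1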
I have a dataframe which looks like below:
df:
Review_Text Noun Thumbups
Would be nice to be able to import files from ... [My, Tracks, app, phone, Google, Drive, import... 1.0
No Offline Maps! It used to have offline maps ... [Offline, Maps, menu, option, video, exchange,... 18.0
Great application. Designed with very well tho... [application, application] 16.0
Great App. Nice and simple but accurate. Wish ... [Great, App, Nice, Exported] 0.0
Save For Offline - This does not work. The rou... [Save, Offline, route, filesystem] 12.0
Since latest update app will not run. Subscrip... [update, app, Subscription, March, application] 9.0
Great app. Love it! And all the things it does... [Great, app, Thank, work] 1.0
I have paid for subscription but keeps telling... [subscription, trial, period] 0.0
Error: The route cannot be save for no locatio... [Error, route, i, GPS] 0.0
When try to restore my tracks it says "unable ... [try, file, locally-1] 0.0
Was a good app but since the update it only re... [app, update, metre] 2.0
Based on the 'Noun' column values, I want to create other columns. For example, all values of the 'Noun' column from the first row become columns, and those columns contain the 'Thumbups' value for that row. If the column name is already present in the dataframe, then the 'Thumbups' value is added to the existing value of the column.
I was trying to implement this using pivot_table:
pd.pivot_table(latest_review,columns='Noun',values='Thumbups')
But I got the following error:
TypeError: unhashable type: 'list'
Could anyone help me in fixing the issue?
Use Series.str.join with Series.str.get_dummies to create dummies, and then multiply by the Thumbups column with DataFrame.mul:
df1 = df['Noun'].str.join('|').str.get_dummies().mul(df['Thumbups'], axis=0)
print (df1)
App Drive Error Exported GPS Google Great Maps March My Nice \
0 0.0 10.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 10.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Offline Save Subscription Thank Tracks app application exchange \
0 0.0 0.0 0.0 0.0 10.0 10.0 0.0 0.0
1 180.0 0.0 0.0 0.0 0.0 0.0 0.0 180.0
2 0.0 0.0 0.0 0.0 0.0 0.0 160.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 120.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 90.0 0.0 0.0 90.0 90.0 0.0
6 0.0 0.0 0.0 10.0 0.0 10.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
file filesystem i import locally-1 menu metre option period \
0 0.0 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 180.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN NaN
phone route subscription trial try update video work
0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 180.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 120.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0 0.0 90.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0
7 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 NaN NaN NaN NaN NaN NaN NaN NaN
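If a single accumulated total per noun across all reviews is wanted (as the "adds 'Thumbups' value into the existing value" requirement suggests), summing the result is one option:

totals = df1.sum()  # one accumulated Thumbups total per noun column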
rows = []
# unpack the Noun column's list values and store them in the rows list
_ = df.apply(lambda row: [rows.append([row['Review_Text'], row['Thumbups'], nn])
                          for nn in row.Noun], axis=1)
# create a new dataframe with the unpacked values
# (columns listed in the order the values were appended above)
df_new = pd.DataFrame(rows, columns=['Review_Text', 'Thumbups', 'Noun'])
# now do the pivot operation on df_new
pivot_df = df_new.pivot(index='Review_Text', columns='Noun')
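On pandas 0.25 or newer, DataFrame.explode can replace the manual unpacking; a possible variant of the same idea:

df_new = df.explode('Noun')  # one row per (review, noun) pair
pivot_df = df_new.pivot_table(index='Review_Text', columns='Noun',
                              values='Thumbups', aggfunc='sum')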
I have a dataframe whose columns are not in sequence. If I use len(df.columns), my data has 3586 columns. How do I re-order the columns into numeric sequence?
ID V1 V10 V100 V1000 V1001 V1002 ... V990 V991 V992 V993 V994
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 0.0
I used df = df.reindex(sorted(df.columns), axis=1) (based on this question: Re-ordering columns in pandas dataframe based on column name), but it is still not working.
Thank you.
First get all columns that do not match the pattern V + number by filtering with str.contains; then sort the remaining columns from Index.difference numerically; finally, concatenate the two lists and pass them to DataFrame.reindex. This puts all non-matching columns in the first positions, followed by the V + number columns in numeric order:
L1 = df.columns[~df.columns.str.contains(r'^V\d+$')].tolist()
L2 = sorted(df.columns.difference(L1), key=lambda x: float(x[1:]))
df = df.reindex(L1 + L2, axis=1)
print(df)
ID V1 V10 V100 V990 V991 V992 V993 V994 V1000 V1001 V1002
A 1 9.0 2.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
B 1 1.2 0.1 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
C 2 8.6 8.0 2.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
D 3 0.0 2.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
E 4 7.8 6.6 3.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
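As a shorter alternative (assuming the third-party natsort package is installed), a natural sort of the column names yields the same ordering:

from natsort import natsorted
df = df.reindex(natsorted(df.columns), axis=1)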
I have the following numpy matrix:
0 1 2 3 4 5 6 7 8 9
0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 5.0 0.0 9.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 7.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 5.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0
8 2.0 0.0 0.0 0.0 3.0 0.0 6.0 0.0 8.0 0.0
9 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I want to calculate the non-zero values average of every row and column separately. So my result should be something like this:
average_rows = [1.0,7.0,2.0,5.0,0.0,4.0,0.0,5.5,4.75,1.0,0.0]
average_cols = [3.5,1.0,4.33333,0.0,4.33333,0.0,4.0,6.0,6.5,0.0]
I can't figure out how to iterate over them, and I keep getting TypeError: unhashable type
Also, I'm not sure if iterating is the best solution. I also tried something like R[:,i] to grab each column and sum it using sum(R[:,i]), but I keep getting the same error.
It is better to use a 2D np.array instead of np.matrix.
import numpy as np
data = np.array([[1, 2, 0], [0, 0, 1], [0, 2, 4]], dtype='float')
data[data == 0] = np.nan
# replace all zeroes with `nan`'s to skip them
# [[ 1. 2. nan]
# [ nan nan 1.]
# [ nan 2. 4.]]
np.nanmean(data, axis=0)
# array([ 1. , 2. , 2.5])
np.nanmean(data, axis=1)
# array([ 1.5, 1. , 3. ])
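An alternative that avoids overwriting the data with nan (a sketch): divide the sums by the counts of non-zero entries. This also yields 0.0 instead of nan for all-zero rows and columns, matching the expected output above:

import numpy as np
data = np.array([[1, 2, 0], [0, 0, 1], [0, 2, 4]], dtype='float')
nonzero = data != 0
# np.maximum(..., 1) guards against dividing by zero for all-zero lines
average_cols = data.sum(axis=0) / np.maximum(nonzero.sum(axis=0), 1)
average_rows = data.sum(axis=1) / np.maximum(nonzero.sum(axis=1), 1)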