De-mean the data and convert to numpy array - python

I am trying to implement a basic matrix-factorization movie recommender system on the MovieLens 1M dataset, but I am stuck here. What I need to do is de-mean the data (normalize by each user's mean) and convert it from a DataFrame to a NumPy array.
Code Snippet:
import pandas as pd
import numpy as np
ratings_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/ratings.dat', 'r').readlines()]
#users_list = [i.strip().split("::") for i in open('/users/nickbecker/Downloads/ml-1m/users.dat', 'r').readlines()]
movies_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/movies.dat', 'r').readlines()]
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])
movies_df['MovieID'] = movies_df['MovieID'].apply(pd.to_numeric)
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()
R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
error:
Traceback (most recent call last):
File "S:\TIP\Code\MF_orig.py", line 17, in <module>
user_ratings_mean = np.mean(R, axis = 1)
File "<__array_function__ internals>", line 6, in mean
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\fromnumeric.py", line 3257, in mean
out=out, **kwargs)
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\_methods.py", line 151, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: can only concatenate str (not "int") to str
Edit:
Value of R is:
[['5' 0 0 ... 0 0 0]
['5' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
['4' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
ratings_df:
UserID MovieID Rating Timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
... ... ... ... ...
1000204 6040 1091 1 956716541
1000205 6040 1094 5 956704887
1000206 6040 562 5 956704746
1000207 6040 1096 4 956715648
1000208 6040 1097 4 956715569
movies_df:
MovieID Title Genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
3878 3948 Meet the Parents (2000) Comedy
3879 3949 Requiem for a Dream (2000) Drama
3880 3950 Tigerland (2000) Drama
3881 3951 Two Family House (2000) Drama
3882 3952 Contender, The (2000) Drama|Thriller
[3883 rows x 3 columns]
Dataset link:
http://files.grouplens.org/datasets/movielens/ml-1m.zip

The data is of object (string) dtype, and even passing the dtype argument to the pandas DataFrame constructor doesn't convert it to integers.
You have to convert it to int explicitly:
ratings_list = [[int(j) for j in i.strip().split("::") if j] for i in open('ratings.txt', 'r').readlines()]
And then proceed. I tried this and it works.
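Once R holds numeric values, the de-meaning step works as expected. A minimal sketch on toy numeric data:

```python
import numpy as np

# Toy ratings matrix: 3 users x 4 movies, with 0 for unrated entries.
R = np.array([[5, 0, 0, 4],
              [0, 3, 0, 0],
              [4, 4, 0, 0]], dtype=float)

# Each user's mean over all columns, subtracted row-wise via broadcasting.
user_ratings_mean = np.mean(R, axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
print(R_demeaned.mean(axis=1))  # each row now averages to 0
```

This fails with a TypeError when R still contains strings, which is exactly the error in the traceback above.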

Related

create a new column in pandas dataframe using if condition from another dataframe

I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new columns cii_score based on the following condition
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'] take the cii['score'] value for transactions['cii_score']
I have tried a list comprehension, but it did not work.
Any inputs on how to tackle this would be appreciated.
First, we set up your DataFrames. Note that I modified the dates in transactions in this short example to make it more interesting.
import pandas as pd
from io import StringIO
trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)
cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)
tr_df = pd.read_csv(trans_data, index_col = 0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col = 0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main thing is the following calculation: for each row index of tr_df find the index of the row in cii_df that satisfies the condition. The following calculates this match, each element of the list is equal to the appropriate row index of cii_df:
match = [[(f <= d) & (d <= e) for f, e in zip(cii_df['from_fy'], cii_df['to_fy'])].index(True) for d in tr_df['buy_date']]
match
produces
[0, 0, 1, 2, 2]
now we can merge on this (note this needs import numpy as np):
tr_df.merge(cii_df, left_on = np.array(match), right_index = True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
and the score column is what you asked for
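As an alternative sketch, assuming the fiscal-year ranges never overlap, pandas' IntervalIndex can express the between-dates lookup directly (same toy data as above):

```python
import pandas as pd

tr_df = pd.DataFrame({'buy_date': pd.to_datetime(
    ['2001-04-16', '2001-05-09', '2002-07-03', '2003-08-02', '2003-04-03'])})
cii_df = pd.DataFrame({
    'from_fy': pd.to_datetime(['2001-04-01', '2002-04-01', '2003-04-01']),
    'to_fy':   pd.to_datetime(['2002-03-31', '2003-03-31', '2004-03-31']),
    'score':   [100, 105, 109]})

# Index the scores by closed date intervals, then look up each buy_date:
# .loc on an IntervalIndex matches each value to the interval containing it.
intervals = pd.IntervalIndex.from_arrays(cii_df['from_fy'], cii_df['to_fy'],
                                         closed='both')
scores = cii_df.set_index(intervals)['score']
tr_df['cii_score'] = scores.loc[list(tr_df['buy_date'])].to_numpy()
```

This avoids building the Python-level match list, though it raises a KeyError if a buy_date falls outside every interval.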

Matplot Numpy ValueError: setting an array element with a sequence

I have the following code to try to create a 3D plot for the dataset YearlyKeywordsFrequency. I cannot figure out why this error is occurring.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits import mplot3d
myData = pd.read_csv('counted-JOSAIC.csv', delimiter=',', skiprows=0,usecols=range(0,5))
print(myData)
item_list = list(myData.columns) #Names of Columns
item_list = item_list[1:]
print(item_list)
myData = np.array(myData) #Convert to numpy
keywords = np.asarray(myData[:,0]) #Get the Keywords
print(keywords)
data = np.asarray(myData[:,1:]) #remove Keywords from data
print(data.shape)
print(data)
##################################################################################
###x=keyword
###y=year
###z=freq
y=range(len(keywords))
x=range(len(item_list))
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(x, y, data, 50, cmap='binary')
ax.set_yticklabels(keywords)
ax.set_xticklabels(item_list)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
plt.show()
This code gives the following results and error:
Kewords freq-2015 ... freq-2017 freq-2018
0 energy 526 ... 89 97
1 power 246 ... 170 125
2 wireless 194 ... 121 144
3 transmission 157 ... 77 106
4 optimal 153 ... 100 110
5 interference 136 ... 100 78
6 spectrum 132 ... 126 29
7 allocation 125 ... 143 101
8 harvesting 123 ... 5 11
9 node 114 ... 25 63
10 capacity 106 ... 92 67
11 cellular 102 ... 72 39
12 relay 98 ... 20 35
13 access 97 ... 138 98
14 control 94 ... 50 87
15 link 91 ... 62 105
16 radio 91 ... 78 55
17 localization 89 ... 11 3
18 receiver 84 ... 20 38
19 sensor 82 ... 4 21
20 optical 80 ... 6 50
21 simulation 79 ... 90 94
22 probability 79 ... 51 44
23 the 78 ... 59 64
24 mimo 78 ... 192 49
25 signal 76 ... 38 38
26 sensing 76 ... 33 0
27 throughput 73 ... 65 39
28 packet 73 ... 8 38
29 heterogeneous 71 ... 36 42
... ... ... ... ...
8348 rated 0 ... 0 1
8349 150 0 ... 0 1
8350 highdefinition 0 ... 0 1
8351 facilitated 0 ... 0 1
8352 750 0 ... 0 1
8353 240 0 ... 0 1
8354 supplied 0 ... 0 1
8355 robotic 0 ... 0 1
8356 confinement 0 ... 0 1
8357 jam 0 ... 0 1
8358 8x6 0 ... 0 1
8359 megahertz 0 ... 0 1
8360 rotations 0 ... 0 1
8361 sudden 0 ... 0 1
8362 fades 0 ... 0 1
8363 marine 0 ... 0 1
8364 habitat 0 ... 0 1
8365 probes 0 ... 0 1
8366 uowcs 0 ... 0 1
8367 uowc 0 ... 0 1
8368 manchestercoded 0 ... 0 1
8369 avalanche 0 ... 0 1
8370 apd 0 ... 0 1
8371 pin 0 ... 0 1
8372 shallow 0 ... 0 1
8373 harbor 0 ... 0 1
8374 waters 0 ... 0 1
8375 focal 0 ... 0 1
8376 lcd 0 ... 0 1
8377 display 0 ... 0 1
[8378 rows x 5 columns]
[' freq-2015', ' freq-2016', ' freq-2017', ' freq-2018']
['energy' 'power' 'wireless' ... 'focal' 'lcd' 'display']
(8378, 4)
[[526 747 89 97]
[246 457 170 125]
[194 248 121 144]
...
[0 0 0 1]
[0 0 0 1]
[0 0 0 1]]
Traceback (most recent call last):
File "<ipython-input-5-7d351bf710cc>", line 1, in <module>
runfile('C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal/PlotingCharts.py', wdir='C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal')
File "e:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "e:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/Haseeb/Desktop/Report 5/PYTHON word removal/PlotingCharts.py", line 111, in <module>
ax.contour3D(x, y, data, 50, cmap='binary')
File "e:\ProgramData\Anaconda3\lib\site-packages\mpl_toolkits\mplot3d\axes3d.py", line 2076, in contour
self.auto_scale_xyz(X, Y, Z, had_data)
File "e:\ProgramData\Anaconda3\lib\site-packages\mpl_toolkits\mplot3d\axes3d.py", line 494, in auto_scale_xyz
self.xy_dataLim.update_from_data_xy(np.array([x, y]).T, not had_data)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\transforms.py", line 913, in update_from_data_xy
path = Path(xy)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\path.py", line 127, in __init__
vertices = _to_unmasked_float_array(vertices)
File "e:\ProgramData\Anaconda3\lib\site-packages\matplotlib\cbook\__init__.py", line 1365, in _to_unmasked_float_array
return np.asarray(x, float)
File "e:\ProgramData\Anaconda3\lib\site-packages\numpy\core\numeric.py", line 492, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence
I am trying to make a chart something like this, but I can't understand why it gives an error: a 2D array is required and my array shape is (8378, 4), so what is the problem?
You must change your x and y types to arrays:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from mpl_toolkits import mplot3d
myData = pd.read_csv('counted-JOSAIC.csv', delimiter=',', skiprows=0,usecols=range(0,5))
item_list = list(myData.columns) #Names of Columns
item_list = item_list[1:]
print(item_list)
myData = np.array(myData) #Convert to numpy
keywords = np.asarray(myData[:,0]) #Get the Keywords
print(keywords)
data = np.asarray(myData[:,1:]) #remove Keywords from data
print(data.shape)
print(data)
##################################################################################
###x=keyword
###y=year
###z=freq
y=np.arange(len(keywords))
x=np.arange(len(item_list))
X, Y = np.meshgrid(x, y)
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.contour3D(X, Y, data, 50, cmap='binary')
ax.set_yticklabels(keywords)
ax.set_xticklabels(item_list)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z');
plt.show()
This gives:
Obviously you must change colors and x and y accordingly to get your desired output.
x, y, and data are not compatible.
type(data) gives <class 'numpy.ndarray'>
type(x) gives <class 'range'>
type(y) gives <class 'range'>
On the other hand, in this example, all of X, Y, and Z are of <class 'numpy.ndarray'> type. I think that you should try converting x and y to arrays.
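The shape requirement can be checked without plotting. A small sketch (NumPy only, with made-up grid sizes) showing how meshgrid turns the 1-D ranges into 2-D coordinate arrays matching the Z data:

```python
import numpy as np

# Z values on an 8 x 4 grid (rows = keywords, columns = years).
data = np.arange(32, dtype=float).reshape(8, 4)

y = np.arange(data.shape[0])  # keyword index, length 8
x = np.arange(data.shape[1])  # year index, length 4

# meshgrid broadcasts the 1-D ranges into 2-D arrays with data's shape,
# which is what contour3D expects for X and Y.
X, Y = np.meshgrid(x, y)
print(X.shape, Y.shape, data.shape)  # all (8, 4)
```

Note also that an array taken from a mixed-dtype DataFrame via np.array(myData) will have object dtype, so converting the Z values with data.astype(float) may be needed as well.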

How to get rows of a most recent day in the ascending order of time way when reading csv file?

I want to get the rows of the most recent day, in ascending order of time.
I get dataframe as follows:
label uId adId operTime siteId slotId contentId netType
0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
Since there are about 100 million rows in this csv file, it is impossible to load it all into my PC's memory.
So I want to get the rows of the most recent day, in ascending time order, while reading the csv file.
For examples, if the most recent day is on 2019-04-04, it will output as follows:
#this not a real data, just for examples.
label uId adId operTime siteId slotId contentId netType
0 0 u147336431 3887 2019-04-04 00:08:42.315 1 54 2427 2
1 0 u146933269 1462 2019-04-04 01:06:16.417 30 36 1343 6
2 0 u139536523 2084 2019-04-04 02:08:58.079 15 23 1536 7
3 0 u106663472 1460 2019-04-04 03:21:13.050 32 45 1352 2
4 0 u121642861 2295 2019-04-04 04:36:08.653 3 33 3267 4
Could anyone help me?
Thanks in advance.
I'm assuming you can't read the entire file into memory, and the file is in a random order. You can read the file in chunks and iterate through the chunks.
# read 500,000 rows of the file at a time
reader = pd.read_csv(
    'csv_file.csv',
    parse_dates=['operTime'],
    chunksize=500000,
    header=0
)
recent_day = pd.Timestamp(2019, 4, 4)
next_day = recent_day + pd.Timedelta(days=1)
df_list = []
for chunk in reader:
    # check if any rows match the date range
    date_rows = chunk.loc[
        (chunk['operTime'] >= recent_day) &
        (chunk['operTime'] < next_day)
    ]
    # append dataframe of matching rows to the list
    if not date_rows.empty:
        df_list.append(date_rows)
final_df = pd.concat(df_list)
final_df = final_df.sort_values('operTime')
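The chunked approach above assumes the most recent day is already known. If it isn't, a cheap first pass over the chunks can find it. A minimal sketch (using an in-memory CSV as a stand-in for the real file; the column name operTime is taken from the question):

```python
from io import StringIO
import pandas as pd

csv_text = """operTime,label
2019-03-30 15:01:55,0
2019-04-04 01:06:16,0
2019-03-27 08:10:41,0
"""

# First pass: track the latest calendar day seen, chunk by chunk,
# without ever holding the whole file in memory.
most_recent = None
for chunk in pd.read_csv(StringIO(csv_text), parse_dates=['operTime'], chunksize=2):
    chunk_max = chunk['operTime'].max().normalize()  # midnight of that day
    if most_recent is None or chunk_max > most_recent:
        most_recent = chunk_max
print(most_recent)  # 2019-04-04 00:00:00
```

A second pass can then filter with the date range shown above.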
Seconding what anky_91 said, sort_values() will be helpful here.
import pandas as pd
df = pd.read_csv('file.csv')
# >>> df
# label uId adId operTime siteId slotId contentId netType
# 0 0 u147333631 3887 2019-03-30 15:01:55.617 10 30 2137 1
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 2 0 u139816523 2084 2019-03-27 08:10:41.769 10 30 2336 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 4 0 u106642861 2295 2019-03-27 22:58:03.679 3 32 2567 4
sub_df = df[(df['operTime']>'2019-03-31') & (df['operTime']<'2019-04-01')]
# >>> sub_df
# label uId adId operTime siteId slotId contentId netType
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
final_df = sub_df.sort_values(by=['operTime'])
# >>> final_df
# label uId adId operTime siteId slotId contentId netType
# 3 0 u106546472 1460 2019-03-31 08:51:41.085 3 32 1371 4
# 1 0 u146930169 1462 2019-03-31 09:51:15.275 3 32 1373 1
I think you could also use a datetimeindex here; that might be necessary if the file is sufficiently large.
Like #anky_91 mentioned, you can use the sort_values function. Here is a short example of how it works:
df = pd.DataFrame({'Date': ['02/20/2015', '01/15/2016', '08/21/2015'],
                   'Symbol': ['A', 'A', 'A']})
df['Date'] = pd.to_datetime(df['Date'])  # sort chronologically, not as strings
df.sort_values(by='Date')
Out :
        Date Symbol
0 2015-02-20      A
2 2015-08-21      A
1 2016-01-15      A

Why I get different size on pandas dataframe after append or concat?

My code looks like this:
import pandas as pd

candle_data = pd.DataFrame()
for fileName in files:
    csv_data = pd.read_csv(fileName, header=None)
    candle_data = pd.concat([candle_data, csv_data])
    #candle_data = candle_data.append(csv_data)
print(candle_data)
print(candle_data.tail(3))
the result is:
0 1 2 3 4 5 6
0 2000.05.30 17:27 0.93020 0.93020 0.93020 0.93020 0
1 2000.05.30 17:35 0.93040 0.93050 0.93040 0.93050 0
2 2000.05.30 17:38 0.93040 0.93040 0.93030 0.93030 0
...
29781 2016.04.29 16:55 1.14512 1.14524 1.14503 1.14515 0
29782 2016.04.29 16:56 1.14515 1.14517 1.14491 1.14495 0
29783 2016.04.29 16:57 1.14494 1.14505 1.14482 1.14482 0
29784 2016.04.29 16:58 1.14477 1.14511 1.14457 1.14457 0
[5171932 rows x 7 columns]
0 1 2 3 4 5 6
29782 2016.04.29 16:56 1.14515 1.14517 1.14491 1.14495 0
29783 2016.04.29 16:57 1.14494 1.14505 1.14482 1.14482 0
29784 2016.04.29 16:58 1.14477 1.14511 1.14457 1.14457 0
Why did I get 5171932x7 as the dimension while printing the whole dataframe, but 29784 as the last row index?
What is the correct way to merge all rows of two dataframes?
I think there are duplicates in the index.
You can add the parameter ignore_index=True to concat if you don't have a meaningful index:
pd.concat([candle_data, csv_data], ignore_index=True)
Docs
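A small sketch of what happens: each per-file DataFrame carries its own default index starting at 0, so concat keeps the repeats unless told otherwise.

```python
import pandas as pd

# Two frames with overlapping default indexes, like two per-file CSV reads.
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})

kept = pd.concat([a, b])                           # index repeats: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # index: 0, 1, 2, 3
print(len(kept), list(kept.index), list(renumbered.index))
```

That is why the combined frame has 5,171,932 rows while the last printed label is only 29784: the labels restart for every file, and the row count is what len() and the shape report.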

Binning a data set using Pandas

Given a csv file of...
neg,,,,,,,
SAMPLE 1,,SAMPLE 2,,SAMPLE 3,,SAMPLE 4,
50.0261,2.17E+02,50.0224,3.31E+02,50.0007,5.38E+02,50.0199,2.39E+02
50.1057,2.65E+02,50.0435,3.92E+02,50.0657,5.52E+02,50.0465,3.37E+02
50.1514,2.90E+02,50.0781,3.88E+02,50.1115,5.75E+02,50.0584,2.58E+02
50.166,3.85E+02,50.1245,4.25E+02,50.1258,5.11E+02,50.0765,4.47E+02
50.1831,2.55E+02,50.1748,3.71E+02,50.1411,6.21E+02,50.1246,1.43E+02
50.2023,3.45E+02,50.2161,2.59E+02,50.1671,5.56E+02,50.1866,3.77E+02
50.223,4.02E+02,50.2381,4.33E+02,50.1968,6.31E+02,50.2276,3.41E+02
50.2631,1.89E+02,50.2826,4.63E+02,50.211,3.92E+02,50.2717,4.71E+02
50.2922,2.72E+02,50.3593,4.52E+02,50.2279,5.92E+02,50.376,3.09E+02
50.319,2.46E+02,50.4019,4.15E+02,50.2929,5.60E+02,50.3979,2.56E+02
50.3523,3.57E+02,50.423,3.31E+02,50.3659,4.84E+02,50.4237,3.28E+02
50.3968,4.67E+02,50.4402,1.76E+02,50.437,1.89E+02,50.4504,2.71E+02
50.4431,1.88E+02,50.479,4.85E+02,50.5137,6.63E+02,50.5078,2.54E+02
50.481,3.63E+02,50.5448,3.51E+02,50.5401,5.11E+02,50.5436,2.69E+02
50.506,3.73E+02,50.5872,4.03E+02,50.5593,6.56E+02,50.555,3.06E+02
50.5379,3.00E+02,50.6076,2.96E+02,50.6034,5.02E+02,50.6059,2.83E+02
50.5905,2.38E+02,50.6341,2.67E+02,50.6579,6.37E+02,50.6484,1.99E+02
50.6564,1.30E+02,50.662,3.53E+02,50.6888,7.37E+02,50.7945,4.84E+02
50.7428,2.38E+02,50.6952,4.21E+02,50.7132,6.71E+02,50.8044,4.41E+02
50.8052,3.67E+02,50.7397,1.99E+02,50.7421,6.29E+02,50.8213,1.69E+02
50.8459,2.80E+02,50.7685,3.73E+02,50.7872,5.30E+02,50.8401,3.88E+02
50.9021,3.56E+02,50.7757,4.54E+02,50.8251,4.13E+02,50.8472,3.61E+02
50.9425,3.89E+02,50.8027,7.20E+02,50.8418,5.73E+02,50.8893,1.18E+02
51.0117,2.29E+02,50.8206,2.93E+02,50.8775,4.34E+02,50.9285,2.64E+02
51.0244,5.19E+02,50.8364,4.80E+02,50.9101,4.25E+02,50.9591,1.64E+02
51.0319,3.62E+02,50.8619,2.90E+02,50.9222,5.11E+02,51.0034,2.70E+02
51.0439,4.24E+02,50.9098,3.22E+02,50.9675,4.33E+02,51.0577,2.88E+02
51.0961,3.59E+02,50.969,3.87E+02,51.0123,6.03E+02,51.0712,3.18E+02
51.1429,2.49E+02,51.0009,2.42E+02,51.0266,7.30E+02,51.1015,1.84E+02
51.1597,2.71E+02,51.0262,1.32E+02,51.0554,3.69E+02,51.1291,3.71E+02
51.177,2.84E+02,51.0778,1.58E+02,51.1113,4.50E+02,51.1378,3.54E+02
51.1924,2.00E+02,51.1313,4.07E+02,51.1464,3.86E+02,51.1871,1.55E+02
51.2055,2.25E+02,51.1844,2.08E+02,51.1826,7.06E+02,51.2511,2.05E+02
51.2302,3.81E+02,51.2197,5.49E+02,51.2284,7.00E+02,51.3036,2.60E+02
51.264,2.16E+02,51.2306,3.76E+02,51.271,3.83E+02,51.3432,1.99E+02
51.2919,2.29E+02,51.2468,2.87E+02,51.308,3.89E+02,51.3775,2.45E+02
51.3338,3.67E+02,51.2739,5.56E+02,51.3394,5.17E+02,51.3977,3.86E+02
51.3743,2.57E+02,51.3228,3.18E+02,51.3619,6.03E+02,51.4151,3.37E+02
51.3906,3.78E+02,51.3685,2.33E+02,51.3844,4.44E+02,51.4254,2.72E+02
51.4112,3.29E+02,51.3912,5.03E+02,51.4179,5.68E+02,51.4426,3.17E+02
51.4423,1.86E+02,51.4165,2.68E+02,51.4584,5.10E+02,51.4834,3.87E+02
51.537,3.48E+02,51.4645,3.76E+02,51.5179,5.75E+02,51.544,4.37E+02
51.637,4.51E+02,51.5078,2.76E+02,51.569,4.73E+02,51.5554,4.52E+02
51.665,2.27E+02,51.5388,2.51E+02,51.5894,4.57E+02,51.5958,1.96E+02
51.6925,5.60E+02,51.5486,2.79E+02,51.614,4.88E+02,51.6329,5.40E+02
51.7409,4.19E+02,51.5584,2.53E+02,51.6458,5.72E+02,51.6477,3.23E+02
51.7851,4.29E+02,51.5961,2.72E+02,51.7076,4.36E+02,51.6577,2.70E+02
51.8176,3.11E+02,51.6608,2.04E+02,51.776,5.59E+02,51.6699,3.89E+02
51.8764,3.94E+02,51.7093,5.14E+02,51.8157,6.66E+02,51.6788,2.83E+02
51.9135,3.26E+02,51.7396,1.88E+02,51.8514,4.26E+02,51.7201,3.91E+02
51.9592,2.66E+02,51.7931,2.72E+02,51.8791,5.61E+02,51.7546,3.41E+02
51.9954,2.97E+02,51.8428,5.96E+02,51.9129,5.14E+02,51.7646,2.27E+02
52.0751,2.24E+02,51.8923,3.94E+02,51.959,5.18E+02,51.7801,1.43E+02
52.1456,3.26E+02,51.9177,2.82E+02,52.0116,4.21E+02,51.8022,2.27E+02
52.1846,3.42E+02,51.9265,3.21E+02,52.0848,5.10E+02,51.83,2.66E+02
52.2284,2.66E+02,51.9413,3.56E+02,52.1412,6.20E+02,51.8698,1.74E+02
52.2666,5.32E+02,51.9616,2.19E+02,52.1722,5.72E+02,51.9084,2.89E+02
52.2936,4.24E+02,51.9845,1.53E+02,52.1821,5.18E+02,51.937,1.69E+02
52.3256,3.69E+02,52.0051,3.53E+02,52.2473,5.51E+02,51.9641,3.31E+02
52.3566,2.50E+02,52.0299,2.87E+02,52.3103,4.12E+02,52.0292,2.63E+02
52.4192,3.08E+02,52.0603,3.15E+02,52.35,8.76E+02,52.0633,3.94E+02
52.4757,2.99E+02,52.0988,3.45E+02,52.3807,6.95E+02,52.0797,2.88E+02
52.498,2.37E+02,52.1176,3.63E+02,52.4234,4.89E+02,52.1073,2.97E+02
52.57,2.58E+02,52.1698,3.11E+02,52.4451,4.54E+02,52.1546,3.41E+02
52.6178,4.29E+02,52.2352,3.96E+02,52.4627,5.38E+02,52.2219,3.68E+02
How can one split the samples using overlapping bins of 0.25 m/z, where the first column of each sample pair (SAMPLE n) contains an m/z value and the second contains the weight?
To load the file into a Pandas DataFrame I currently do:
import csv
import pandas as pd

def load_raw_data():
    raw_data = []
    with open("negsmaller.csv", "r") as rawfile:
        reader = csv.reader(rawfile, delimiter=",")
        next(reader)  # skip the leading "neg" line
        for row in reader:
            raw_data.append(row)
    return pd.DataFrame(raw_data).T

if __name__ == '__main__':
    raw_data = load_raw_data()
    print(raw_data)
Which returns
0 1 2 3 4 5 6 \
0 SAMPLE 1 50.0261 50.1057 50.1514 50.166 50.1831 50.2023
1 2.17E+02 2.65E+02 2.90E+02 3.85E+02 2.55E+02 3.45E+02
2 SAMPLE 2 50.0224 50.0435 50.0781 50.1245 50.1748 50.2161
3 3.31E+02 3.92E+02 3.88E+02 4.25E+02 3.71E+02 2.59E+02
4 SAMPLE 3 50.0007 50.0657 50.1115 50.1258 50.1411 50.1671
5 5.38E+02 5.52E+02 5.75E+02 5.11E+02 6.21E+02 5.56E+02
6 SAMPLE 4 50.0199 50.0465 50.0584 50.0765 50.1246 50.1866
7 2.39E+02 3.37E+02 2.58E+02 4.47E+02 1.43E+02 3.77E+02
7 8 9 ... 56 57 58 \
0 50.223 50.2631 50.2922 ... 52.2284 52.2666 52.2936
1 4.02E+02 1.89E+02 2.72E+02 ... 2.66E+02 5.32E+02 4.24E+02
2 50.2381 50.2826 50.3593 ... 51.9413 51.9616 51.9845
3 4.33E+02 4.63E+02 4.52E+02 ... 3.56E+02 2.19E+02 1.53E+02
4 50.1968 50.211 50.2279 ... 52.1412 52.1722 52.1821
5 6.31E+02 3.92E+02 5.92E+02 ... 6.20E+02 5.72E+02 5.18E+02
6 50.2276 50.2717 50.376 ... 51.8698 51.9084 51.937
7 3.41E+02 4.71E+02 3.09E+02 ... 1.74E+02 2.89E+02 1.69E+02
59 60 61 62 63 64 65
0 52.3256 52.3566 52.4192 52.4757 52.498 52.57 52.6178
1 3.69E+02 2.50E+02 3.08E+02 2.99E+02 2.37E+02 2.58E+02 4.29E+02
2 52.0051 52.0299 52.0603 52.0988 52.1176 52.1698 52.2352
3 3.53E+02 2.87E+02 3.15E+02 3.45E+02 3.63E+02 3.11E+02 3.96E+02
4 52.2473 52.3103 52.35 52.3807 52.4234 52.4451 52.4627
5 5.51E+02 4.12E+02 8.76E+02 6.95E+02 4.89E+02 4.54E+02 5.38E+02
6 51.9641 52.0292 52.0633 52.0797 52.1073 52.1546 52.2219
7 3.31E+02 2.63E+02 3.94E+02 2.88E+02 2.97E+02 3.41E+02 3.68E+02
[8 rows x 66 columns]
Process finished with exit code 0
My desired output: take each 0.25-wide bin, average the weight values in the column next to it, and report a single value per bin. So,
0.01 3
0.10 4
0.24 2
would become
0.25 3
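As a starting point, a sketch using simple non-overlapping 0.25-wide bins rather than overlapping ones (the column names and values here are made up to mirror the desired-output example): pd.cut labels each m/z value with its bin, and a groupby averages the weights per bin.

```python
import numpy as np
import pandas as pd

# Hypothetical single-sample data: m/z values and their weights.
df = pd.DataFrame({'mz': [50.01, 50.10, 50.24, 50.30],
                   'weight': [3, 4, 2, 5]})

# 0.25-wide bin edges starting at 50.0; right=False makes bins like [50.0, 50.25).
edges = np.arange(50.0, 50.75, 0.25)
df['bin'] = pd.cut(df['mz'], bins=edges, right=False)

# Mean weight per bin; observed=True keeps only bins that actually occur.
binned = df.groupby('bin', observed=True)['weight'].mean()
print(binned)
```

The first three rows fall in [50.0, 50.25) and average to 3, matching the desired output above; a sliding (overlapping) scheme would need the bin edges offset and the cut repeated per offset.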
