Add a numeric column to an SFrame in Glab

Add a numeric column to an SFrame in Glab - python

In Graphlab,
I have a csv file which contains some_ids like the following:
some_ids
10
92
85
352
...
65664
I imported the csv file into Glab as an sframe:
my_csv = gl.SFrame.read_csv('my_csv_file.csv')
I need to add another column in the Sframe which contain the row number and I call it 'item_id'. The output will look like the following:
item_id,some_ids
1 10
2 92
3 85
4 352
...
13373 65664
I do no prefer to create another csv whereas prefer to do this inside Glab. We can also use numpy() if needed. How can this be done please? Thanks

There's an inbuilt command for just that:
graphlab.SFrame.add_row_number(column_name, start)
You can find out more about it in the documentation here

Related

Formatting issues using Regex and Pandas

I don't exactly know how to describe the issue I'm having, so I'll just show it.
I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.
import re
import os
import pandas as pd
import numpy as np
os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')
NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range (0,6):
if x[i] in str(NHSQC):
data2=NHSQC.set_index('Column_1',drop=False)
data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
data.extend(list(data3))
a=[x[i]]
label.extend(a)
label.extend(a)
if y[i] in str(HNCA):
data2=HNCA.set_index('Column_1',drop=False)
data3=(data2.loc[str(y[i]),'Column_3'])
data.append(data3)
a=[y[i]]
label.extend(a)
else:
print('Not Found')
else:
print('Not Found')
data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
This script, does exactly what I want it to do, and it works as intended when I run my test data files, but fails when I run my actual data files. I don't know how to explain the issue, so I'll just show it. This is the output:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
As you can see, there is an issue because my regex for y isn't finding all the values. Furthermore, there is an issue with how many my x regex is finding (only 5 instead of the hundreds it should be). Initially I thought this was just a display thing (it wasn't displaying the hundreds of matches since it would take too long), and I also thought the ... in the middle of it printing my table was also for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, it fixes the issue.
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
I won't post the whole output, but as you can see, now it finds all the proper matches. Its also now displaying the entire table, instead of doing ... and only showing the top and bottom halves. I don't exactly understand where this issue is arising from though. Why is it displaying only the top and bottom half of my table, but if I copy and paste it to another file, it displays the entire thing. Why does regex not search through the entire table even if it isn't displayed (based on the fact it shows the top and bottom half, makes me think the entire table is there, but again its not showing it because its trying to simplify the display, but why would whats being displayed effect what regex is searching)?

Why is python only displaying the top and bottom portions of your table?
Python classes can define two "magic" methods:
__repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
__str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().
When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells python to call NHSCQ.__str__(), which falls back to NHSCQ.__repr__(), which you can read about here.
The developers of the pandas library implemented DataFrame.__repr__() in such a way that, depending on the values of certain global variables, will produce a string that does not fully represent the underlying data. The defaults truncate the DataFrame to show only the first 5 and last 5 rows with ellipses (...) telling you that there are bits missing. Thus, as you suspected, you are only calling re.findall on the first 5 and last 5 rows of the DataFrame.
What should you do instead?
Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into a (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?
For instance, you appear to be lining up Column_2 and Column_3 of rows from DataFrame NHSQC where the value of Column_1 matches the first regex in order with Column_3 of rows from DataFrame HNCA where the value of Column_1 matches the second regex, right?
df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]
Those lines will select the requisite rows and columns from the two DataFrames using Series.str.match on Column_1.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version with columns Column_1 as an identifier, variable as either the strings "Column_2" or "Column_3", and value, containing the thing you actually care about and are printing at the end of your program. You don't use the column name anymore, so it is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())
This just concatenates the two long DataFrames together, turns them into a numpy array, then prints them out.

KDD Cup 2009: Parsing and converting a dataset with a ".data" extension to .csv?

I'm trying to parse a ".data" dataset via python / pandas into 230 separate variables into a df and then a .csv export.
The data seems to be tabular, but also has a few other nuances. Here's the instructions on the format from KDD:
Data Format Instructions from KDD Cup 2009
The datasets use a format similar as that of the text export format from relational databases:
One header lines with the variables names
One line per instance
Separator tabulation between the values
There are missing values (consecutive tabulations)
The large matrix results from appending the various chunks downloaded in their order number. The header line is present only in the first chunk.
The target values (.labels files) have one example per line in the same order as the corresponding data files. Note that churn, appetency, and up-selling are three separate binary classification problems. The target values are +1 or -1. We refer to examples having +1 (resp. -1) target values as positive (resp. negative) examples.
The Matlab matrices are numeric. When loaded, the data matrix is called X. The categorical variables are mapped to integers. Missing values are replaced by NaN for the original numeric variables while they are mapped to 0 for categorical variables.
Here's a snippet of how the code looks when opening in a text editor:
Var1 Var2 Var3 Var4 Var5 Var6 Var7 Var8 Var9 Var10 Var11 Var12 Var13 Var14 Var15 Var16 Var17 Var18 Var19 Var20 Var21 Var22 Var23 Var24 Var25 Var26 Var27 Var28 Var29 Var30 Var31 Var32 Var33 Var34 Var35 Var36 Var37 Var38 Var39 Var40 Var41 Var42 Var43 Var44 Var45 Var46 Var47 Var48 Var49 Var50 Var51 Var52 Var53 Var54 Var55 Var56 Var57 Var58 Var59 Var60 Var61 Var62 Var63 Var64 Var65 Var66 Var67 Var68 Var69 Var70 Var71 Var72 Var73 Var74 Var75 Var76 Var77 Var78 Var79 Var80 Var81 Var82 Var83 Var84 Var85 Var86 Var87 Var88 Var89 Var90 Var91 Var92 Var93 Var94 Var95 Var96 Var97 Var98 Var99 Var100 Var101 Var102 Var103 Var104 Var105 Var106 Var107 Var108 Var109 Var110 Var111 Var112 Var113 Var114 Var115 Var116 Var117 Var118 Var119 Var120 Var121 Var122 Var123 Var124 Var125 Var126 Var127 Var128 Var129 Var130 Var131 Var132 Var133 Var134 Var135 Var136 Var137 Var138 Var139 Var140 Var141 Var142 Var143 Var144 Var145 Var146 Var147 Var148 Var149 Var150 Var151 Var152 Var153 Var154 Var155 Var156 Var157 Var158 Var159 Var160 Var161 Var162 Var163 Var164 Var165 Var166 Var167 Var168 Var169 Var170 Var171 Var172 Var173 Var174 Var175 Var176 Var177 Var178 Var179 Var180 Var181 Var182 Var183 Var184 Var185 Var186 Var187 Var188 Var189 Var190 Var191 Var192 Var193 Var194 Var195 Var196 Var197 Var198 Var199 Var200 Var201 Var202 Var203 Var204 Var205 Var206 Var207 Var208 Var209 Var210 Var211 Var212 Var213 Var214 Var215 Var216 Var217 Var218 Var219 Var220 Var221 Var222 Var223 Var224 Var225 Var226 Var227 Var228 Var229 Var230
1225 7 100 156 195 0 72 166.56 0 4259232 0 2.565264 9 106 7 959480 0 70399.2 15 10 32 40 383386.4 620 54 20646 0 756720 1123876 1915 0 9 0 8335680 16 1689774 0 0 xddq9ayfAo RO12 taul 1K8T PShj iJzviRg 17VONbZnAuZ90atz MF5EBmj WVvO 9_Y1 vm5R VpdQ haYg 7M47J5GA0pTYIFxg5uy kIsH uKAI L84s H4p93_uThXwSG XREFJCi 7WwzJJY OgPm cJvF FzaX ch2oGfM Al6ZaUT P6pu4Vl LM8l689qOp
I've found this StackOverflow post to be helpful with how to utilize pandas to convert the filetype, however the text parsing logic is completely different.
Any support on how to navigate this problem would be very helpful, as I'm looking to use this dataset to learn how apply predictive learning to CRM datasets.
Thank you!!!

The data provided for the competition is in .data file format. After checking the data you can see that the data is separated by tab. So we can directly use pandas to read the file.
import pandas as pd
temp = pd.read_csv('orange_small_train.data', sep='\t')
This will solve the problem.

How to add column numbers to each column in a large text file

I would like to add column numbers to 128 columns in a text file
E.g.
My file
12 13 14 15
20 21 23 14
34 56 67 89
Required output
1:12 2:13 3:14 4:15
1: 20 2:21 3: 23 4:14
1: 34 2:56 3:67 4:89
Can this be done using awk / python
I tried paste command for joining two files : one with the values other file with column numbers, manually typed. Since the file size is very large manual typing didnt work.
As of my knowledge I could find answers for adding only one column to the end of a text file.
Thanks for the suggestions

awk to the rescue!
$ awk '{for(i=1;i<=NF;i++) $i=i":"$i}1' file
should do.

add computed column to a csv file

I expect that this don't be a classic beginner question. However I read and spent days trying to save my csv data without success.
I have a function that uses an input parameter that I give manually. The function generates 3 columns that I saved in a CSV file. When I want to use the function with other inputs and save the new data allocated at right from the previous computed columns, the result is that pandas sort my CSV file in 3 single columns one below each other with the headings.
I'm using the next code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4. 41. 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558. 226.266513. 560.171830
1 41 42.446977. 317.768887. 555.005533
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a csv file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
Hzs = t.copy()
shifts = np.floor(Hzs / t_step).astype(np.int)
ms = np.zeros(shifts.size)
delta_inv = np.arange(N+1)
dot = delta_inv[N:0:-1]
lake = np.arange(1,N+1)
for i, shift in enumerate(shifts):
diffs = df1[coords] - df1[coords].shift(-shift)
sqdist = np.square(diffs).sum(axis=1)
ms[i] = sqdist.sum()
mock = np.divide(ms, dot)
msds = pd.DataFrame({'dot':dot, 'lake':lake, 'mock':mock})
return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a'

I looked for several day the way to write in a csv file containing several computed columns just right to the next one. Even the unpleasant comments of some guys! I finally found how to do this. If someone need something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
after the file has been already generated with the headers, I remove the index that I don't need and I only call the function using at the end:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat ([a,b],axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I got the CSV file desired and is possible to call the function the times that you want!

How to write dictionary values to a csv file using Python

I have a dictionary of class objects. I want to write the member values (timepoints, fitted, measured) of the class to a csv file using Python.
My Class:
class PlotReadingCurves:
def __init__(self, timepoints, fitted, measured):
self.timepoints = timepoints
self.fitted = fitted
self.measured = measured
obj = PlotReadingCurves(mTimePoints,mFitted,mMeasured)
PlotReadingCurvesList[csoId] = obj
Eg: timpoints : 1 2 3 4 5
fitted: 6 7 8 9 10
measured: 11 12 13 14
Expected results:
timepoints fitted measured fitted measured
1 6 11 .. ..
2 7 12
3 8 13
4 9 14
5 10 15

Try my mini wrapper library pyexcel. Although it is not as powerful as pandas, it is sufficient to write a dict to an excel file in a few lines of code:
>>> import pyexcel as pe
>>> your_dict = { "timepoints": [1,2,3], "fitted":[6,7,8]} # more columns omitted
>>> sheet = pe.Sheet(pe.utils.dict_to_array(your_dict))
>>> sheet.save_as("your_file_name.csv") # done
With pyexcel, you can easily write your data into other excel formats: xls, xlsx and even ods. The documentation can be found here

Try to use pandas, here is pandas's feature about your problem.
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
It's very convenient and powerful.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Add a numeric column to an SFrame in Glab - python

There's an inbuilt command for just that: graphlab.SFrame.add_row_number(column_name, start) You can find out more about it in the documentation here

Related

Formatting issues using Regex and Pandas

KDD Cup 2009: Parsing and converting a dataset with a ".data" extension to .csv?

How to add column numbers to each column in a large text file

add computed column to a csv file

How to write dictionary values to a csv file using Python

Categories

Resources