I have a dictionary of class objects. I want to write the member values (timepoints, fitted, measured) of the class to a csv file using Python.
My Class:
class PlotReadingCurves:
    def __init__(self, timepoints, fitted, measured):
        self.timepoints = timepoints
        self.fitted = fitted
        self.measured = measured
obj = PlotReadingCurves(mTimePoints,mFitted,mMeasured)
PlotReadingCurvesList[csoId] = obj
E.g. timepoints: 1 2 3 4 5
fitted: 6 7 8 9 10
measured: 11 12 13 14
Expected results:
timepoints fitted measured fitted measured
1 6 11 .. ..
2 7 12
3 8 13
4 9 14
5 10 15
Try my mini wrapper library pyexcel. Although it is not as powerful as pandas, it is sufficient to write a dict to an excel file in a few lines of code:
>>> import pyexcel as pe
>>> your_dict = { "timepoints": [1,2,3], "fitted":[6,7,8]} # more columns omitted
>>> sheet = pe.Sheet(pe.utils.dict_to_array(your_dict))
>>> sheet.save_as("your_file_name.csv") # done
With pyexcel, you can easily write your data into other excel formats: xls, xlsx and even ods. The documentation can be found here
Try pandas. This feature, quoted from its documentation, covers your problem:
Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
It's very convenient and powerful.
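To make the pandas suggestion concrete, here is a minimal sketch of one way to lay the objects out side by side and write them as CSV. The dictionary name `curves`, the ids `cso1` and `cso2`, the sample values, and the `fitted_<id>` / `measured_<id>` column naming are assumptions made for illustration; they are not from the question.

```python
import pandas as pd


class PlotReadingCurves:
    def __init__(self, timepoints, fitted, measured):
        self.timepoints = timepoints
        self.fitted = fitted
        self.measured = measured


# Hypothetical stand-in for PlotReadingCurvesList; ids and values are made up.
curves = {
    "cso1": PlotReadingCurves([1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12, 13, 14, 15]),
    "cso2": PlotReadingCurves([1, 2, 3, 4, 5], [16, 17, 18, 19, 20], [21, 22, 23, 24, 25]),
}

# Build one fitted/measured column pair per object, aligned on the timepoints.
columns = {}
for cso_id, obj in curves.items():
    index = pd.Index(obj.timepoints, name="timepoints")
    columns[f"fitted_{cso_id}"] = pd.Series(obj.fitted, index=index)
    columns[f"measured_{cso_id}"] = pd.Series(obj.measured, index=index)

table = pd.DataFrame(columns)
table.index.name = "timepoints"

# With no path, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = table.to_csv()
print(csv_text)
```

Because the columns are aligned on the timepoints index, objects with different timepoints would still line up, with empty cells where a curve has no value.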
I am loading a txt file containing complex numbers. The data are formatted in this way
How can I create two separate arrays, one for the real part and one for the imaginary part?
I tried to create a pandas DataFrame using e-01 as a separator, but that way I lose this information.
df = pd.read_fwf(r'c:\test\complex.txt', header=None)
df[['real','im']] = df[0].str.extract(r'\(([-.\de]+)([+-]\d\.[\de\-j]+)')
print(df)
0 real im
0 (9.486832980505137680e-01-3.162277660168379412... 9.486832980505137680e-01 -3.162277660168379412e-01j
1 (9.486832980505137680e-01+9.486832980505137680... 9.486832980505137680e-01 +9.486832980505137680e-01j
2 (-9.486832980505137680e-01+9.48683298050513768... -9.486832980505137680e-01 +9.486832980505137680e-01j
3 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
4 (-3.162277660168379412e-01+9.48683298050513768... -3.162277660168379412e-01 +9.486832980505137680e-01j
5 (9.486832980505137680e-01-3.162277660168379412... 9.486832980505137680e-01 -3.162277660168379412e-01j
6 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
7 (9.486832980505137680e-01-9.486832980505137680... 9.486832980505137680e-01 -9.486832980505137680e-01j
8 (9.486832980505137680e-01-9.486832980505137680... 9.486832980505137680e-01 -9.486832980505137680e-01j
9 (-3.162277660168379412e-01+3.16227766016837941... -3.162277660168379412e-01 +3.162277660168379412e-01j
10 (3.162277660168379412e-01-9.486832980505137680... 3.162277660168379412e-01 -9.486832980505137680e-01j
I never knew how annoyingly involved it is to read complex numbers with pandas. This is a slightly different solution from @Алексей's. I prefer to avoid regular expressions when they are not absolutely necessary.
import pandas as pd
# Read the file, pandas defaults to string type for contents
df = pd.read_csv('complex.txt', header=None, names=['string'])
# Convert string representation to complex.
# Use of `eval` is ugly but works (only use it on files you trust).
df['complex'] = df['string'].map(eval)
# Alternatively...
#df['complex'] = df['string'].map(lambda c: complex(c.strip('()')))
# Separate real and imaginary parts
df['real'] = df['complex'].map(lambda c: c.real)
df['imag'] = df['complex'].map(lambda c: c.imag)
df
is...
string complex \
0 (9.486832980505137680e-01-3.162277660168379412... 0.948683-0.316228j
1 (9.486832980505137680e-01+9.486832980505137680... 0.948683+0.948683j
2 (-9.486832980505137680e-01+9.48683298050513768... -0.948683+0.000000j
3 (-3.162277660168379412e-01+3.16227766016837941... -0.316228+0.316228j
4 (-3.162277660168379412e-01+9.48683298050513768... -0.316228+0.948683j
5 (9.486832980505137680e-01-3.162277660168379412... 0.948683-0.316228j
6 (3.162277660168379412e-01+3.162277660168379412... 0.316228+0.316228j
7 (9.486832980505137680e-01-9.486832980505137680... 0.948683-0.948683j
real imag
0 0.948683 -3.162278e-01
1 0.948683 9.486833e-01
2 -0.948683 9.486833e-01
3 -0.316228 3.162278e-01
4 -0.316228 9.486833e-01
5 0.948683 -3.162278e-01
6 0.316228 3.162278e-01
7 0.948683 -9.486833e-01
df.dtypes
prints out..
string object
complex complex128
real float64
imag float64
dtype: object
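If two plain arrays are all that is needed, the same idea can be sketched without a DataFrame. This assumes each line of the file holds one parenthesised complex number, as in the output above; the two sample strings below stand in for the file contents.

```python
import numpy as np

# Sample lines standing in for the contents of complex.txt (assumed format).
lines = [
    "(9.486832980505137680e-01-3.162277660168379412e-01j)",
    "(-3.162277660168379412e-01+9.486832980505137680e-01j)",
]

# The built-in complex() parses the text once the parentheses are stripped.
values = np.array([complex(s.strip().strip("()")) for s in lines])

real = values.real  # float64 array of real parts
imag = values.imag  # float64 array of imaginary parts
print(real, imag)
```

For a real file, the list could be filled with `open('complex.txt').read().splitlines()`.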
I am converting some of my web-scraping code from R to Python (I can't get geckodriver to work with R, but it's working with Python). Anyways, I am trying to understand how to parse and read HTML tables with Python. Quick background, here is my code for R:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
WebElem <- readHTMLTable(doc, stringsAsFactors = FALSE)[[7]]
I would parse the HTML page to the doc object. Then I would start with doc[[1]], and move through higher numbers until I saw the data I wanted. In this case I got to doc[[7]] and saw the data I wanted. I then would read that HTML table and assign it to the WebElem object. Eventually I would turn this into a dataframe and play with it.
So what I am doing in Python is this:
html = None
doc = None
html = driver.page_source
doc = BeautifulSoup(html)
Then I started to play with doc.get_text but I don't really know how to get just the data I want to see. The data I want to see is like a 10x10 matrix. When I used R, I would just use doc[[7]] and that matrix would almost be in a perfect structure for me to convert it to a dataframe. However, I just can't seem to do that with Python. Any advice would be much appreciated.
UPDATE:
I have been able to get the data I want using Python--I followed this blog for creating a dataframe with python: Python Web-Scraping. Here is the website that we are scraping in that blog: Most Popular Dog Breeds. In that blog post, you have to work your way through the elements, create a dict, loop through each row of the table and store the data in each column, and then you are able to create a dataframe.
With R, the only code I had to write was:
doc <- htmlParse(remDr$getPageSource()[[1]],ignoreBlanks=TRUE, replaceEntities = FALSE, trim=TRUE, encoding="UTF-8")
df <- as.data.frame(readHTMLTable(doc, stringsAsFactors = FALSE))
With just that, I have a pretty nice dataframe that I only need to adjust the column names and data types--it looks like this with just that code:
NULL.V1 NULL.V2 NULL.V3 NULL.V4
1 BREED 2015 2014 2013
2 Retrievers (Labrador) 1 1 1
3 German Shepherd Dogs 2 2 2
4 Retrievers (Golden) 3 3 3
5 Bulldogs 4 4 5
6 Beagles 5 5 4
7 French Bulldogs 6 9 11
8 Yorkshire Terriers 7 6 6
9 Poodles 8 7 8
10 Rottweilers 9 10 9
Is there not something available in Python to make this a bit simpler, or is this just simpler in R because R is more built for dataframes(at least that's how it seems to me, but I could be wrong)?
OK, after some hefty digging around I feel like I came to a good solution, matching that of R. If you are looking at the HTML provided in the link above, Dog Breeds, and you have the web driver running for that link, you can run the following code:
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
Then you are looking at a pretty nice dataframe after only a couple of lines of code (note that pd.read_html returns a list of dataframes, one per table):
In [145]: df
Out[145]:
[ 0 1 2 3
0 BREED 2015 2014 2013.0
1 Retrievers (Labrador) 1 1 1.0
2 German Shepherd Dogs 2 2 2.0
3 Retrievers (Golden) 3 3 3.0
4 Bulldogs 4 4 5.0
5 Beagles 5 5 4.0
I feel like this is much easier than working through the tags, creating a dict, and looping through each row of data as the blog suggests. It might not be the most correct way of doing things, I'm new to Python, but it gets the job done quickly. I hope this helps out some fellow web-scrapers.
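As a small, self-contained illustration of the same call, the sketch below feeds pd.read_html a hand-written table instead of a live page. The HTML snippet is made up to resemble the table above, and the example assumes an HTML parser that pandas can use (lxml, or BeautifulSoup with html5lib) is installed. Passing header=0 is one way to promote the first row to column names, which also keeps the year columns as integers.

```python
import io

import pandas as pd

# Made-up HTML resembling the dog breeds table; in practice this string
# would come from the element's outerHTML or from driver.page_source.
html = """
<table>
  <tr><td>BREED</td><td>2015</td><td>2014</td><td>2013</td></tr>
  <tr><td>Retrievers (Labrador)</td><td>1</td><td>1</td><td>1</td></tr>
  <tr><td>German Shepherd Dogs</td><td>2</td><td>2</td><td>2</td></tr>
  <tr><td>Retrievers (Golden)</td><td>3</td><td>3</td><td>3</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> it finds.
tables = pd.read_html(io.StringIO(html), header=0)
df = tables[0]
print(df)
```

When the whole page source is passed instead of a single element, indexing the returned list plays the same role as `[[7]]` in the R code.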
tbl = driver.find_element_by_xpath("//html/body/main/article/section[2]/div/article/table").get_attribute('outerHTML')
df = pd.read_html(tbl)
It worked pretty well.
First, read Selenium with Python; it will give you a basic idea of how Selenium works with Python.
Then, if you want to locate an element in Python, there are two ways:
Use the Selenium API; you can refer to Locating Elements
Use BeautifulSoup; there is good documentation you can read:
BeautifulSoup Documentation
I'm looking for an effective way to construct a Term Document Matrix in Python that can be used together with extra data.
I have some text data with a few other attributes. I would like to run some analyses on the text and I would like to be able to correlate features extracted from text (such as individual word tokens or LDA topics) with the other attributes.
My plan was to load the data as a pandas data frame so that each response represents a document. Unfortunately, I ran into an issue:
import pandas as pd
import nltk
pd.options.display.max_colwidth = 10000
txt_data = pd.read_csv("data_file.csv",sep="|")
txt = str(txt_data.comment)
len(txt)
Out[7]: 71581
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[10]: 45
txt_lines = []
f = open("txt_lines_only.txt")
for line in f:
    txt_lines.append(line)
txt = str(txt_lines)
len(txt)
Out[14]: 1668813
txt = nltk.word_tokenize(txt)
txt = nltk.Text(txt)
txt.count("the")
Out[17]: 10086
Note that in both cases, the text was processed so that everything except spaces, letters and ,.?! was removed (for simplicity).
As you can see, a pandas field converted into a string returns fewer matches, and the resulting string is also shorter.
Is there any way to improve the above code?
Also, str(x) creates 1 big string out of the comments while [str(x) for x in txt_data.comment] creates a list object which cannot be broken into a bag of words. What is the best way to produce a nltk.Text object that will retain document indices? In other words I'm looking for a way to create a Term Document Matrix, R's equivalent of TermDocumentMatrix() from tm package.
Many thanks.
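A likely explanation for the shorter string, sketched below under pandas' default display options: str() on a Series returns its printed representation, which is truncated for long Series, whereas joining the values keeps every comment. The sample Series is made up for illustration.

```python
import pandas as pd

# Made-up data: 200 identical comments, each containing "the" twice.
comments = pd.Series(["the cat saw the dog"] * 200, name="comment")

as_repr = str(comments)                   # printed form, rows elided with "..."
as_text = " ".join(comments.astype(str))  # every comment, joined into one string

print(len(as_repr), len(as_text))
print(as_repr.count("the"), as_text.count("the"))
```

So `" ".join(txt_data.comment.astype(str))` would be the counterpart of reading the text file line by line.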
The benefit of using a pandas DataFrame would be to apply the nltk functionality to each row like so:
import numpy as np
import pandas as pd
import nltk
from nltk import word_tokenize
word_file = "/usr/share/dict/words"
words = open(word_file).read().splitlines()[10:50]
random_word_list = [[' '.join(np.random.choice(words, size=1000, replace=True))] for i in range(50)]
df = pd.DataFrame(random_word_list, columns=['text'])
df.head()
text
0 Aaru Aaronic abandonable abandonedly abaction ...
1 abampere abampere abacus aback abalone abactor...
2 abaisance abalienate abandonedly abaff abacina...
3 Ababdeh abalone abac abaiser abandonable abact...
4 abandonable abandon aba abaiser abaft Abama ab...
len(df)
50
txt = df.text.apply(word_tokenize)
txt.head()
0 [Aaru, Aaronic, abandonable, abandonedly, abac...
1 [abampere, abampere, abacus, aback, abalone, a...
2 [abaisance, abalienate, abandonedly, abaff, ab...
3 [Ababdeh, abalone, abac, abaiser, abandonable,...
4 [abandonable, abandon, aba, abaiser, abaft, Ab...
txt.apply(len)
0 1000
1 1000
2 1000
3 1000
4 1000
....
44 1000
45 1000
46 1000
47 1000
48 1000
49 1000
Name: text, dtype: int64
As a result, you get the .count() for each row entry:
txt = txt.apply(lambda x: nltk.Text(x).count('abac'))
txt.head()
0 27
1 24
2 17
3 25
4 32
You can then sum the result using:
txt.sum()
1239
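For the term-document matrix part of the question, one possible sketch is to count tokens per row and let pandas assemble the counts, so the document index is kept and the matrix can be joined back to the other attributes. The sample data, the `score` column and the `w_` prefix are assumptions for illustration, and plain whitespace splitting stands in for nltk.word_tokenize to keep the example dependency-free.

```python
from collections import Counter

import pandas as pd

# Made-up stand-in for txt_data: one comment per row plus an extra attribute.
txt_data = pd.DataFrame({
    "comment": ["the cat sat on the mat", "the dog barked", "a cat and a dog"],
    "score": [3, 1, 2],
})

# One Counter per document; whitespace split is a stand-in for a real tokenizer.
counts = [Counter(text.lower().split()) for text in txt_data["comment"]]

# Rows are documents, columns are terms, missing terms become 0.
dtm = pd.DataFrame(counts, index=txt_data.index).fillna(0).astype(int)

the_total = int(dtm["the"].sum())

# Join term counts back to the other attributes, prefixed to avoid name clashes.
combined = txt_data.join(dtm.add_prefix("w_"))
print(combined)
```

This is a document-term layout (documents as rows); `dtm.T` gives the term-document orientation that R's TermDocumentMatrix() uses.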
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing, but creates a table which can not be appended to.
Thus, you can process each CSV one at a time, use append=True to build the hdf5 file. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc') # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)
for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
training_data.close()  # flush everything to disk
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
If you have a very large single CSV file, you may want to stream the conversion to hdf, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    chunk.to_hdf(
        "data.hdf", 'data', format='table', append=True)
    cnt += len(chunk)  # count actual rows; the last chunk may be shorter
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64GB CSV file and 450 Million coordinates (about 10 Minutes conversion).
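Since the question started from h5py and asks for a single float dataset, here is a hedged sketch of the same streaming idea using a resizable h5py dataset. The function name `csv_files_to_hdf5`, the dataset name `data`, and the assumption that the CSV files are header-less and purely numeric with a known column count are all mine, not from the question.

```python
import os
import tempfile

import h5py
import numpy as np
import pandas as pd


def csv_files_to_hdf5(csv_paths, h5_path, n_cols, chunk_rows=100_000):
    """Stream header-less numeric CSV files into one growable float dataset."""
    with h5py.File(h5_path, "w") as h5:
        # maxshape=(None, n_cols) lets the dataset grow along the first axis.
        dset = h5.create_dataset(
            "data", shape=(0, n_cols), maxshape=(None, n_cols),
            dtype="float64", chunks=True)
        for path in csv_paths:
            # Only chunk_rows rows are held in memory at any time.
            for chunk in pd.read_csv(path, header=None, chunksize=chunk_rows):
                block = chunk.to_numpy(dtype="float64")
                start = dset.shape[0]
                dset.resize(start + block.shape[0], axis=0)
                dset[start:] = block


# Tiny demonstration with two made-up CSV files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    first = os.path.join(tmp, "part1.csv")
    second = os.path.join(tmp, "part2.csv")
    np.savetxt(first, np.arange(6).reshape(3, 2), delimiter=",")
    np.savetxt(second, np.arange(6, 10).reshape(2, 2), delimiter=",")

    h5_path = os.path.join(tmp, "out.h5")
    csv_files_to_hdf5([first, second], h5_path, n_cols=2, chunk_rows=2)

    with h5py.File(h5_path, "r") as h5:
        result = h5["data"][:]

print(result)
```

Resizing once per chunk keeps the code simple; if the total row count is known in advance, creating the dataset at full size and writing slices avoids the resizes.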
I have a 200 MB CSV file and a 4 GB JSON file in compressed format (300 MB when compressed). I need to check whether a particular field in the JSON has a value that matches any of the values in column 0 of the CSV file. How can this be done quickly? I have to do this for multiple JSON files, with the CSV file staying the same. I hope using pandas would speed things up.
After reading the CSV file, the following data structure is formed:
Empty DataFrame
Columns: []
Index: [1335063, 1339033, 1344453, 1392603, 1520033, 5342858, 5361498, 5534501, 5542881, 5552665, 5618397, 5824472, 5867442, 5908134, 5908134, 6203501, 6208411, 6209921, 6211681, 6212831, 6213691, 6287061, 6293811, 6387151, 6415771, 6508691, 6649281, 6673261, 6716441, 6782181, 6821631, 7710551, 9413871, 11280941, 11285381, 11762751, 11769381, 11854271, 11964831, 11995871, 12240091, 12541201, 12553471, 12633891, 12648021, 12834201, 12899581, 13177041, 13282401, 13290581, 13292951, 13297681, 14536901, 14592891, 14665721, 14843571, 15120821, 15127231, 15531511, 15969981, 16648561, 16808911, 16809381, 17019781, 17021721, 17224241, 17234921, 17327321, 17923721, 17930901, 18577181, 18606681, 19448911, 19557541, 20272801, 20286621, 20295001, 20351761, 21052471, 21062651, 21106501, 21578741, 22279401, 22312931, 23078211, 23164911, 24937351, 24988721, 26171811, 26188561, 26224001, 26379241, 26380531, 26383571, 26386251, 26388621, 27509171, 27825771, 28282901, 28998561, ...]
Now the data to be read from the gzip file will be a JSON string, and I can convert it with read_json. But I don't see how to check whether the field 'id' in the JSON is present in the list shown here.
This should get you started:
import numpy as np
import pandas
magic_value = 11
df = pandas.DataFrame(np.random.randint(0, 13, size=(10, 2)))  # upper bound is exclusive
# 0 1
# 0 1 1
# 1 5 3
# 2 12 12
# 3 12 8
# 4 11 4
# 5 11 12
# 6 9 7
# 7 7 1
# 8 0 11
# 9 2 1
magic_value in df[0].values
# True
So just read in the JSON data with pandas.read_json, get the value you want (pandas indexing docs), and go to town.
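For repeated lookups against the same CSV, a set of the ids gives constant-time membership tests. The sketch below assumes the gzip file holds one JSON object per line with an `id` field; that layout, the in-memory sample data and the variable names are assumptions, since the question does not show the JSON structure.

```python
import gzip
import io
import json

import pandas as pd

# Made-up CSV content; column 0 holds the ids to match against.
csv_text = "1335063,a\n1339033,b\n1344453,c\n"
ids = set(pd.read_csv(io.StringIO(csv_text), header=None, usecols=[0])[0].tolist())

# Made-up gzip-compressed JSON-lines payload standing in for the real file.
records = [{"id": 1339033, "v": 1}, {"id": 42, "v": 2}, {"id": 1344453, "v": 3}]
payload = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))

# Stream the compressed data line by line instead of loading all 4 GB at once.
matches = []
with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)
        if record["id"] in ids:
            matches.append(record)

print(matches)
```

With real files, `gzip.open` takes the file path directly, and the `ids` set is built once and reused for every JSON file. If the JSON is loaded into a DataFrame instead, `df['id'].isin(ids)` performs the same check in one call.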