I have an Excel file containing a single column (the number of rows is not fixed). Using Python 3, I want to:
Import my Excel file/data into Python,
Read/select the data column (the first column),
Reshape this column into multiple columns with 10 rows in each column, and finally
Write the output to a new Excel file.
I have tried the following code:
import pandas as pd
import numpy as np
df = pd.read_excel('sample.xlsx')
first_column = pd.DataFrame(df.iloc[:,0])
arr = np.array(first_column)
newArr = arr.reshape(arr, (10, -1))
However, I am facing the following error:
newArr = arr.reshape(arr, (10, -1))
TypeError: only integer scalar arrays can be converted to a scalar index
I am looking for help to achieve this in Python 3.
My Excel File
1. To read a file in Python you need pandas.
To read the Excel file in Python, it would be better to first save the file as CSV and then read it in Python. You can save the Excel file as CSV using the Save As option in Excel.
>>> import pandas as pd
>>> df = pd.read_csv('fazool.csv')
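(If you would rather skip the CSV step, pandas can also read the .xlsx directly, as in the question's own code; a minimal sketch, assuming an Excel engine such as openpyxl is installed:)
>>> df = pd.read_excel('sample.xlsx')  # reads the first sheet by default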
Then, to print the head of the dataframe/table in Python:
>>> df.head()
kMEblue kMEgreen kMEturquoise kMEblack kMEbrown kMEred kMEyellow data$X count moduleColors
0 -0.762233 -0.115623 0.836647 -0.418418 -0.688068 -0.078625 0.316798 VWA5A 1 turquoise
1 -0.714720 -0.145856 0.802115 -0.420983 -0.670826 -0.039813 0.424616 EIF4G2 1 turquoise
2 -0.785788 -0.259762 0.777330 -0.301520 -0.585565 0.021812 0.412960 CFL1 1 turquoise
3 -0.736677 -0.296203 0.776179 -0.266430 -0.517727 0.109923 0.526707 NSUN2 1 turquoise
4 -0.697293 0.030126 0.772833 -0.621229 -0.733419 -0.341270 0.088465 ANXA2 1 turquoise
2. Selecting the first column of the dataframe:
>>> first_column_df = pd.DataFrame(df.iloc[:,0])
>>> first_column_df.head()
kMEblue
0 -0.762233
1 -0.714720
2 -0.785788
3 -0.736677
4 -0.697293
>>> first_column_df.columns # shows the column name
Index(['kMEblue'], dtype='object')
3. For reshaping this column into multiple columns, each having ten rows, you need numpy:
>>> import numpy as np
>>> n = 10 # number to be used as chunk size for the first column
>>> first_column_df_split = pd.concat([pd.Series(j, name='y' + str(i)) for i,j in enumerate(np.split( first_column_df['kMEblue'].to_numpy(), range(n, len(first_column_df['kMEblue']), n)))], axis=1)
>>> first_column_df_split.head()
y0 y1 y2 y3 y4 y5 ... y478 y479 y480 y481 y482 y483
0 -0.762233 -0.639253 -0.673571 -0.652639 -0.703227 -0.666183 ... 0.633533 0.628803 0.716792 0.783900 0.725757 0.791240
1 -0.714720 -0.680753 -0.696416 -0.686810 -0.636661 -0.613642 ... 0.678854 0.807758 0.736286 0.627988 0.853333 0.887149
2 -0.785788 -0.638530 -0.607706 -0.613452 -0.701420 -0.583315 ... 0.663671 0.649068 0.741015 0.847084 0.718821 0.786994
3 -0.736677 -0.728837 -0.665220 -0.613386 -0.596789 -0.614878 ... 0.722638 0.587891 0.658215 0.668980 0.794392 0.835687
4 -0.697293 -0.731756 -0.627547 -0.653920 -0.641218 -0.679153 ... 0.618696 0.740690 0.737382 0.679931 0.706449 0.919852
[5 rows x 484 columns]
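The same operation can be written over several lines for readability (an equivalent sketch of the one-liner above; np.split is used rather than reshape so the last chunk is simply shorter when the row count is not a multiple of n):
values = first_column_df['kMEblue'].to_numpy()
chunks = np.split(values, range(n, len(values), n))   # list of arrays, each of length <= n
first_column_df_split = pd.concat(
    [pd.Series(chunk, name='y' + str(i)) for i, chunk in enumerate(chunks)],
    axis=1)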
4. For writing this out to a file, you can use pandas DataFrame.to_csv():
>>> first_column_df_split.to_csv("first_column_split.csv")
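Since the question asks for an Excel file as output, DataFrame.to_excel can be used in the same way (a sketch, assuming an Excel writer engine such as openpyxl is installed):
>>> first_column_df_split.to_excel("first_column_split.xlsx", index=False)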
Adapted from here.
Related
I have a .dat file which looks something like the one below:
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a data frame and wanted to extract data from it, but I get an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below:
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open(df, 'r') as openfile:
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:,2]
            if "REC" in column_search:
                print("Rec found")
Any suggestions would be appreciated.
Since your post does not contain a clear question, I have to guess based on your code. I am assuming that what you want is to find all rows in the DataFrame where the column Mode contains the value REC.
Based on that, I prepared a small, self contained example that works on your data.
In your situation, the only line that you should use is the last one. Assuming that your DataFrame is created and filled correctly, your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation about indexing and selecting data in DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
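For example, printing the result of that last line on the DataFrame above shows only the rows whose Mode is REC:
rec_rows = df.loc[df.loc[:, 'Mode'] == "REC", :]
print(rec_rows)   # only the two rows with Mode == "REC"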
I have a csv that contains 12 cols and 4 rows of data.
As seen in the image.
I would like to divide each of those values by their area (of which I have created an array), then multiply by 100 to get a percentage, and have these values in a new column.
Image for array
So, for example, A2, A3, and A4 will all be divided by 52,600 and then multiplied by 100.
My current df looks like this dataframe
I interpreted your request for a new column to be a new column for each Sub_* in your spreadsheet, since there were 12 values in your numpy array.
Code edit: I see you wanted to apply the math to the 'Baseline' column as well, so I step through each column except the first (which is "Label", at index 0).
import numpy as np
import pandas as pd
df = pd.read_excel(r"d:\stack67477476.xlsx")  # raw string avoids backslash-escape issues in the path
area_arr = np.array([[52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]])
for ii, col in enumerate(df.columns):
    if ii == 0:
        continue
    df[col + "_Area"] = round(df[col] / area_arr[0][ii - 1] * 100, 2)
This is vectorized in one dimension (the 4 rows dimension) but loops through the 12 columns dimension. The output is as follows (don't quote me on this, I may have typed your inputs incorrectly):
df
Label Baseline Sub_A Sub_B Sub_C Sub_D Sub_E Sub_F Sub_G Sub_H Sub_I ... Sub_A_Area Sub_B_Area Sub_C_Area Sub_D_Area Sub_E_Area Sub_F_Area Sub_G_Area Sub_H_Area Sub_I_Area Sub_J_Area Sub_K_Area
0 0 0 15535 5128 8847 10784 5679 20481 8398 10012 5162 ... 103801.95 66580.11 212209.16 290673.85 100548.87 301857.04 449812.53 190053.15 103467.63 275527.43 380177.42
1 1 159506 149454 157456 155680 154327 154671 146863 150761 150446 155335 ... 998623.55 2044352.12 3734228.83 4159757.41 2738509.21 2164524.69 8075040.17 2855846.62 3113549.81 9387040.39 1963949.22
2 2 129087 111918 121515 122066 119557 123813 114746 123140 122156 125480 ... 747815.05 1577707.09 2927944.35 3222560.65 2192156.52 1691171.70 6595607.93 2318830.68 2515133.29 7608679.93 1653533.19
3 3 137562 102318 114509 124641 127442 130324 123331 130392 130715 134528 ... 683669.65 1486743.70 2989709.76 3435094.34 2307436.26 1817700.81 6984038.56 2481302.20 2696492.28 8123206.75 1881890.49
4 4 35901 26488 30836 33756 34549 34000 33269 34071 34151 35149 ... 176987.84 400363.54 809690.57 931239.89 601983.00 490331.61 1824906.27 648272.59 704529.97 2146473.78 531691.65
[5 rows x 25 columns]
Note that it's unclear why your numpy array is 2D; one assumes there is something deeper to that in the rest of your code. It would be clearer to avoid the extra set of brackets:
area_arr = np.array([52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538])
And simplify the divisor to just:
area_arr[ii - 1]  # instead of area_arr[0][ii - 1]
Or, for that matter, a simple list would be fine, since numpy isn't really needed here; a sketch combining both simplifications follows.
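A minimal sketch of that simplified version (it assumes the same column layout as above, with "Label" first):
areas = [52.6, 14.966, 7.702, 4.169, 3.71, 5.648, 6.785, 1.867, 5.268, 4.989, 1.659, 6.538]
for ii, col in enumerate(df.columns[1:]):   # skip the "Label" column
    df[col + "_Area"] = round(df[col] / areas[ii] * 100, 2)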
Apologies if we have miscommunicated on commas and decimal points, but the code still works if you change the numbers.
I have a file with some data that looks like
1 2 3 4
2 3 4 5
3 4 5 6
4 5 6 7
I can process this data and do math on it just fine:
import sys
import numpy as np
import pandas as pd
def main():
    if(len(sys.argv) != 2):
        print "Takes one filename as argument"
        sys.exit()
    file_name = sys.argv[1]
    data = pd.read_csv(file_name, sep=" ", header=None)
    data.columns = ["timestep", "mux", "muy", "muz"]
    t = data["timestep"].count()
    c = np.zeros(t)
    for i in range(0,t):
        for j in range(0,i+1):
            c[i-j] += data["mux"][i-j] * data["mux"][i]
            c[i-j] += data["muy"][i-j] * data["muy"][i]
            c[i-j] += data["muz"][i-j] * data["muz"][i]
    for i in range(t):
        print c[i]/(t-i)
The expected result for my sample input above is
42.5
62.0
84.5
110.0
This math is finding the time correlation function for my data, which is the time-average of all permutations of the pairs of products in each column.
I would like to generalize this program to
work on n number of columns (in the i/j loop for example), and
be able to read in the column names from the file, so as to not have them hard-coded in
Which numpy or pandas methods can I use to accomplish this?
We can reduce it to one loop by making use of array slicing and the sum ufunc to operate along the rows of the dataframe, which in the process makes it generic enough to cover any number of columns, like so:
a = data.values
t = data["timestep"].count()
c = np.zeros(t)
for i in range(t):
    c[:i+1] += (a[:i+1,1:]*a[i,1:]).sum(axis=1)
Explanation
1) a[:i+1, 1:] is the slice of all rows up to and including the i-th row and all columns from the second column onwards, i.e. mux, muy and so on.
2) Similarly, a[i, 1:] is the i-th row and all columns from the second column onwards.
To keep it the "pandas way", simply replace a[ with data.iloc[.
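A sketch of that pandas variant, generalized over the number of columns and printing the averaged result as in the original script:
t = data["timestep"].count()
c = np.zeros(t)
for i in range(t):
    c[:i+1] += (data.iloc[:i+1, 1:] * data.iloc[i, 1:]).sum(axis=1).to_numpy()
for i in range(t):
    print(c[i] / (t - i))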
I'm trying to create a for-loop that automatically runs through my parsed list of NASDAQ stocks and inserts their Quandl codes, to then be retrieved from Quandl's database, essentially creating a large data set of stocks to perform data analysis on. My code "appears" right, but when I print the query it only prints 'GOOG/NASDAQ_Ticker' and nothing else. Any help and/or suggestions will be most appreciated.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
import numpy
def nasdaq():
    nasdaq_list = pd.read_csv('C:\Users\NAME\Documents\DATASETS\NASDAQ.csv')
    nasdaq_list = nasdaq_list[[0]]
    print nasdaq_list
    for abbv in nasdaq_list:
        query = 'GOOG/NASDAQ_' + str(abbv)
        print query
        df = quandl.get(query, authtoken="authoken")
        print df.tail()[['Close', 'Volume']]
Iterating over a pd.DataFrame as you have done iterates by column. For example,
>>> df = pd.DataFrame(np.arange(9).reshape((3,3)))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
>>> for i in df[[0]]: print(i)
0
I would just get the first column as a Series with .ix (note that .ix has since been deprecated and removed from pandas; df.iloc[:, 0] is the modern equivalent),
>>> for i in df.ix[:,0]: print(i)
0
3
6
Note that in general if you want to iterate by row over a DataFrame you're looking for iterrows().
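For completeness, a minimal sketch of the corrected loop from the question (hypothetical; it assumes nasdaq_list is the DataFrame read from the CSV, with the tickers in its first column):
for abbv in nasdaq_list.iloc[:, 0]:
    query = 'GOOG/NASDAQ_' + str(abbv)
    print(query)                                  # one query string per ticker
    df = quandl.get(query, authtoken="authoken")  # token placeholder as in the question
    print(df.tail()[['Close', 'Volume']])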
I have a 100M-line CSV file (actually many separate CSV files) totaling 84 GB. I need to convert it to an HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't build the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into PyTables, but it doesn't look like the array class (which corresponds to an HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at once, so that won't work. Perhaps you can help me solve the problem correctly with other tools in PyTables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, 'data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, 'data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise, the format is 'fixed' by default, which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process the CSVs one at a time, using append=True to build up the HDF5 file, as sketched below. Then overwrite the DataFrame or use del df to allow the old DataFrame to be garbage collected.
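A sketch of that per-CSV loop (the file paths and the HDF5 key 'data' are assumptions):
import glob
import pandas as pd

for path in glob.glob('csv_parts/*.csv'):    # each of the separate CSV files
    df = pd.read_csv(path)
    df.to_hdf('combined.h5', 'data', format='table', append=True)
    del df                                   # let the old DataFrame be garbage collected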
Alternatively, instead of calling df.to_hdf, you could append to a HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc') # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print "loading {}...".format(filename)
    a = numpy.load(os.path.join('input', filename))
    print "writing to h5"
    training_input.append(a)
for filename in os.listdir('output'):
    print "loading {}...".format(filename)
    training_output.append(numpy.load(os.path.join('output', filename)))
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required, but you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
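As a rough sketch of how this applies to the CSV-to-float conversion in the question (the column count, chunk size and file names here are assumptions):
import numpy as np
import pandas as pd
import tables

with tables.open_file('data.h5', mode='w') as h5:
    atom = tables.Float64Atom()
    earr = h5.create_earray(h5.root, 'data', atom, (0, 4), 'CSV data',
                            expectedrows=100_000_000)
    for chunk in pd.read_csv('big.csv', header=None, chunksize=1_000_000):
        earr.append(chunk.to_numpy(dtype=np.float64))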
If you have a very large single CSV file, you may want to stream the conversion to HDF in chunks, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    chunk.to_hdf(
        "data.hdf", 'data', format='table', append=True)
    cnt += CHUNK_SIZE
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64 GB CSV file and 450 million coordinates (about 10 minutes for the conversion).