Making the change from R to Python, I am having difficulty writing multiple CSV files with pandas from a list of DataFrames:
import pandas
from dplython import (DplyFrame, X, diamonds, select, sift, sample_n,
                      sample_frac, head, arrange, mutate, group_by, summarize,
                      DelayFunction)
diamonds = [diamonds, diamonds, diamonds]
path = "/user/me/"
def extractDiomands(path, diamonds):
    for each in diamonds:
        df = DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
        df = pd.DataFrame(df) # not sure if that is required
        df.to_csv(os.path.join('.csv', each))
extractDiomands(path,diamonds)
That, however, generates an error. I'd appreciate any suggestions!
Welcome to Python! First I'll load a couple of libraries and download an example dataset.
import os
import pandas as pd
example_data = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv")
print(example_data.head(5))
First few rows of our example data:
   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4
Now here's what I think you want done:
# spawn a few datasets to loop through
df_1, df_2, df_3 = example_data.head(20), example_data.tail(20), example_data.head(10)
list_of_datasets = [df_1, df_2, df_3]
output_path = 'scratch'
# create the output folder if it does not exist yet, otherwise to_csv fails
os.makedirs(output_path, exist_ok=True)
# in Python you can loop through collections of items directly, it's pretty cool.
# with enumerate(), you get the index and the item from the sequence at each step
for index, dataset in enumerate(list_of_datasets):
    # Filter to keep just a couple columns
    keep_columns = ['gre', 'admit']
    dataset = dataset[keep_columns]
    # Export to CSV
    filepath = os.path.join(output_path, 'dataset_' + str(index) + '.csv')
    dataset.to_csv(filepath)
At the end, my folder 'scratch' has three new CSV files called dataset_0.csv, dataset_1.csv, and dataset_2.csv.
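For readers coming from the dplython pipeline in the question, the select-then-head step can also be written in plain pandas. The following is only a sketch: the toy frame stands in for the diamonds dataset, and the temporary output folder and the diamonds_N.csv naming pattern are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the diamonds data (illustrative values only).
toy = pd.DataFrame({
    "carat": [0.23, 0.21, 0.23, 0.29, 0.31, 0.24],
    "cut": ["Ideal", "Premium", "Good", "Premium", "Good", "Very Good"],
    "color": ["E", "E", "E", "I", "J", "J"],
    "price": [326, 326, 327, 334, 335, 336],
})
frames = [toy, toy, toy]

# Hypothetical output directory; a temporary folder keeps the sketch self-contained.
out_dir = tempfile.mkdtemp()

written = []
for index, frame in enumerate(frames):
    # Plain-pandas equivalent of:
    # DplyFrame(each) >> select(X.carat, X.cut, X.price) >> head(5)
    subset = frame[["carat", "cut", "price"]].head(5)
    path = os.path.join(out_dir, f"diamonds_{index}.csv")
    subset.to_csv(path, index=False)  # index=False omits the row-number column
    written.append(path)
```

Building the file name from the loop index, rather than from the DataFrame itself, is what avoids the error in the original loop, where a whole DataFrame was passed to os.path.join.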
Related
I am very new to pandas. It might be a silly question to some of you.
I am looking to compare 2 excel files and output the changes or the new entries
old.csv
Product Price Description
1 1.25 Product 1
2 2.25 Product 2
3 3.25 Product 3
new.csv
Product Price Description
1 1.25 Product 1 # Product 2 not in list
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
TRIED
import pandas as pd
import numpy as np
import requests
url = '<SomeUrl>/<PriceList>.xls'
resp = requests.get(url)
df = pd.DataFrame(pd.read_excel(resp.content))
df.to_csv('new.csv')
old = pd.read_csv('old.csv')
new = pd.read_csv('new.csv')
changes = new.loc[new['Price'] != old['Price']]
changes_csv = changes[['Product', 'Price', 'Description']]
print(changes_csv)
EXPECTING
3 3.50 Product 3 # Price update
4 4.25 Product 4 # New entry
I get the correct results if the lengths match exactly. Otherwise I get
ValueError: Can only compare identically-labeled Series objects
BONUS
It would be awesome if I could also produce output for discontinued products.
You could create a master index of all products, build the old and new DataFrames against that master index, and then use df.compare() to compare the two DataFrames:
import pandas as pd
df1 = pd.DataFrame([[1,1.25,'Product 1'],[2,2.25,'Product 2'],[3,3.25,'Product 3']], columns=['Product','Price','Description'])
df2 = pd.DataFrame([[1,1.25,'Product 1'],[3,3.5,'Product 2'],[4,4.25,'Product 3 Change']], columns=['Product','Price','Description'])
df1product = df1[['Product']]
df2product = df2[['Product']]
dfproducts = df1product.merge(df2product, on='Product', how='outer')
df1 = dfproducts.merge(df1, how='left', on='Product')
df1.set_index(df1['Product'], inplace=True)
df2 = dfproducts.merge(df2, how='left', on='Product')
df2.set_index(df2['Product'], inplace=True)
dfcompare = df1.compare(df2, align_axis=0)
I have solved the problem. Although #WCeconomics kindly took the time to type the code out, it did not give me the output I wanted, likely because I am a noob with pandas.
This is how I solved it, in case it is useful to the community.
import pandas as pd
import openpyxl # to write excel files
from openpyxl.utils.dataframe import dataframe_to_rows
old = pd.read_excel('old.xls')
new = pd.read_excel('new.xls')
# data for these is in the same format as in question, with 'Product Description' instead of 'Description'
merged = old.merge(new, on='Product', how='outer', suffixes=('_old', '_new'))
df = merged[['Product', 'Product Description_old', 'Price_old', 'Price_new']]
changes = df.loc[(df['Price_new'] > df['Price_old'])].dropna(how='any', axis=0)
wb = openpyxl.Workbook()
ws = wb.active
for r in dataframe_to_rows(changes, index=False, header=True):
    ws.append(r)
# openpyxl writes the .xlsx format, so use a matching extension
wb.save('avp_changes.xlsx')
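A variation on the merge above can also separate new entries, price changes, and discontinued products in one pass. This is a minimal sketch rather than the method used in either answer: it assumes the two price lists are already loaded as DataFrames shaped like the old.csv and new.csv examples in the question, and it relies on the indicator=True option of merge, which adds a _merge column with the values left_only, right_only, or both.

```python
import pandas as pd

# Data copied from the old.csv / new.csv examples in the question.
old = pd.DataFrame({
    "Product": [1, 2, 3],
    "Price": [1.25, 2.25, 3.25],
    "Description": ["Product 1", "Product 2", "Product 3"],
})
new = pd.DataFrame({
    "Product": [1, 3, 4],
    "Price": [1.25, 3.50, 4.25],
    "Description": ["Product 1", "Product 3", "Product 4"],
})

# The outer merge keeps every product from both lists; _merge records where each row came from.
merged = old.merge(new, on="Product", how="outer",
                   suffixes=("_old", "_new"), indicator=True)

added = merged[merged["_merge"] == "right_only"]        # only in the new list
discontinued = merged[merged["_merge"] == "left_only"]  # only in the old list
both = merged[merged["_merge"] == "both"]
changed = both[both["Price_old"] != both["Price_new"]]  # present in both, price differs

print(changed[["Product", "Price_old", "Price_new"]])
print(added[["Product", "Price_new", "Description_new"]])
print(discontinued[["Product", "Price_old", "Description_old"]])
```

Because the comparison happens after the merge, both price columns share one index, which sidesteps the "identically-labeled" error from comparing two frames of different lengths.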
I'm trying to count the duplicate rows in a CSV file. An example looks like the following:
head tail count
134; 135; 1
134; 136; 1
134; 137; 2
134; 135; 2
134; 136; 1
I want the duplicate rows (on the head and tail columns) to be found and their counts added together.
The result should look like the following:
head tail count
134; 135; 3
134; 136; 2
134; 137; 2
Another problem is that the CSV file is very big (60 GB; RAM is 64 GB, by the way). If I set chunksize to some number and iterate like this:
for df in pd.read_csv("*.csv", sep=";", chunksize=100000):
    # do the duplicate count
    ...
the count is only done within that chunk of the data and not globally.
So what we actually want is to do the count over the whole file, but the file is too big.
Thanks
hz
Use Counter from the collections module:
Input data:
>>> %cat data.csv
head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1
import pandas as pd
from collections import Counter
c = Counter()  # running totals, keyed by (head, tail)
for df in pd.read_csv('data.csv', sep=';', chunksize=2):
    c.update(df.groupby(['head', 'tail'])['count'].sum().to_dict())
Output result:
>>> c
Counter({(134, 135): 3, (134, 136): 2, (134, 137): 2})
Convert the Counter to a DataFrame:
df = pd.DataFrame.from_dict(c, orient='index', columns=['count'])
mi = pd.MultiIndex.from_tuples(df.index, names=['head', 'tail'])
df = df.set_index(mi).reset_index()
>>> df
head tail count
0 134 135 3
1 134 136 2
2 134 137 2
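The same chunk-by-chunk idea can be kept entirely inside pandas by summing each chunk's groupby result into a running Series. This is a sketch under assumptions: the sample rows are read from an in-memory string instead of the real 60 GB file, and chunksize=2 is only chosen so that the tiny example spans several chunks.

```python
import io

import pandas as pd

# In-memory stand-in for the large CSV file.
text = """head;tail;count
134;135;1
134;136;1
134;137;2
134;135;2
134;136;1
"""

total = None
for chunk in pd.read_csv(io.StringIO(text), sep=";", chunksize=2):
    # Partial sums for this chunk only.
    partial = chunk.groupby(["head", "tail"])["count"].sum()
    # Align on (head, tail); pairs missing on one side count as 0.
    total = partial if total is None else total.add(partial, fill_value=0)

# fill_value produces floats, so cast back to int before flattening the index.
result = total.astype(int).rename("count").reset_index()
print(result)
```

Only the running totals stay in memory, so the footprint depends on the number of distinct (head, tail) pairs rather than on the size of the file.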
One possibility would be to use DuckDB to perform the count and then export the result to a pandas DataFrame.
DuckDB is a vectorized, state-of-the-art DBMS for analytics that can run queries directly on the CSV file. It is also tightly integrated with pandas, so you can easily import and export data to and from DataFrames.
To install DuckDB, you can simply run:
pip install duckdb
The following code should work for your purposes:
import duckdb
rel = duckdb.from_csv_auto(temp_file_name)  # temp_file_name: path to your CSV file
count_query = ''' SELECT column0, COUNT(column0)
FROM my_name_for_rel
GROUP BY column0
HAVING COUNT(column0)>1'''
res = rel.query('my_name_for_rel', count_query)
data_frame = res.df()
Can someone help me extract multiple tables from one PDF file? I have 5 pages, and every page has a table with the same header columns, e.g.:
Example of the table on every page:
student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4
I want to extract all these tables into one DataFrame. First I did:
df = tabula.read_pdf(file_path,pages='all',multiple_tables=True)
But I got messy output that looks like this:
[student Score Rang
Alex 50 23
Julia 80 12
Mariana 94 4 ,student Score Rang
Maxim 43 34
Nourah 93 5]
So I edited my code like this:
import pandas as pd
import tabula
file_path = "filePath.pdf"
# read my file
df1 = tabula.read_pdf(file_path,pages=1,multiple_tables=True)
df2 = tabula.read_pdf(file_path,pages=2,multiple_tables=True)
df3 = tabula.read_pdf(file_path,pages=3,multiple_tables=True)
df4 = tabula.read_pdf(file_path,pages=4,multiple_tables=True)
df5 = tabula.read_pdf(file_path,pages=5,multiple_tables=True)
It gives me a DataFrame for each table, but I don't know how to regroup them into one single DataFrame, or how to avoid repeating the same line of code.
According to the tabula documentation, read_pdf returns a list of DataFrames when passed the multiple_tables=True option.
Thus, you can use pandas.concat on its output to concatenate the dataframes:
df = pd.concat(tabula.read_pdf(file_path,pages='all',multiple_tables=True))
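Since the concat step only needs a list of DataFrames, it can be illustrated without a PDF. In this sketch the two hand-built frames are stand-ins for what read_pdf would return for two pages, using the rows from the question; ignore_index=True is an optional extra that is not part of the one-liner above.

```python
import pandas as pd

# Stand-ins for the per-page tables that tabula.read_pdf would return as a list.
page_1 = pd.DataFrame({
    "student": ["Alex", "Julia", "Mariana"],
    "Score": [50, 80, 94],
    "Rang": [23, 12, 4],
})
page_2 = pd.DataFrame({
    "student": ["Maxim", "Nourah"],
    "Score": [43, 93],
    "Rang": [34, 5],
})
tables = [page_1, page_2]

# ignore_index=True renumbers the rows 0..n-1 instead of repeating each page's own index.
combined = pd.concat(tables, ignore_index=True)
print(combined)
```

Without ignore_index=True the combined frame keeps each page's original row labels, so labels such as 0 and 1 would appear more than once.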
I'm trying to create a for-loop that runs through my parsed list of NASDAQ stocks and inserts each ticker into a Quandl code, which is then retrieved from Quandl's database. The goal is to build a large data set of stocks to perform data analysis on. My code "appears" right, but when I print the query it only prints 'GOOG/NASDAQ_Ticker' and nothing else. Any help or suggestions would be much appreciated.
import quandl
import pandas as pd
import matplotlib.pyplot as plt
import numpy
def nasdaq():
    nasdaq_list = pd.read_csv(r'C:\Users\NAME\Documents\DATASETS\NASDAQ.csv')
    nasdaq_list = nasdaq_list[[0]]
    print(nasdaq_list)
    for abbv in nasdaq_list:
        query = 'GOOG/NASDAQ_' + str(abbv)
        print(query)
        df = quandl.get(query, authtoken="authoken")
        print(df.tail()[['Close', 'Volume']])
Iterating over a pd.DataFrame as you have done iterates by column. For example,
>>> df = pd.DataFrame(np.arange(9).reshape((3,3)))
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
>>> for i in df[[0]]: print(i)
0
I would just get the first column as a Series with .iloc (the older .ix indexer was removed in pandas 1.0),
>>> for i in df.iloc[:,0]: print(i)
0
3
6
Note that in general if you want to iterate by row over a DataFrame you're looking for iterrows().
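The three iteration behaviours mentioned in this answer can be compared side by side. The snippet below is a small sketch on the same 3x3 example frame; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape((3, 3)))

# Iterating over a DataFrame yields its column labels, not its rows.
labels = list(df[[0]])

# Selecting a single column gives a Series; iterating over it yields the values.
first_column_values = list(df.iloc[:, 0])

# iterrows() yields (index, row) pairs, where each row is a Series.
row_sums = [int(row.sum()) for _, row in df.iterrows()]

print(labels)
print(first_column_values)
print(row_sums)
```

This is why the loop in the question only ever built one query: it was iterating over the single column label rather than over the tickers stored in that column.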
I have a 100M line csv file (actually many separate csv files) totaling 84GB. I need to convert it to a HDF5 file with a single float dataset. I used h5py in testing without any problems, but now I can't do the final dataset without running out of memory.
How can I write to HDF5 without having to store the whole dataset in memory? I'm expecting actual code here, because it should be quite simple.
I was just looking into pytables, but it doesn't look like the array class (which corresponds to a HDF5 dataset) can be written to iteratively. Similarly, pandas has read_csv and to_hdf methods in its io_tools, but I can't load the whole dataset at one time so that won't work. Perhaps you can help me solve the problem correctly with other tools in pytables or pandas.
Use append=True in the call to to_hdf:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
df = pd.DataFrame(np.arange(10).reshape((5,2)), columns=['A', 'B'])
print(df)
# A B
# 0 0 1
# 1 2 3
# 2 4 5
# 3 6 7
# 4 8 9
# Save to HDF5
df.to_hdf(filename, key='data', mode='w', format='table')
del df # allow df to be garbage collected
# Append more data
df2 = pd.DataFrame(np.arange(10).reshape((5,2))*10, columns=['A', 'B'])
df2.to_hdf(filename, key='data', append=True)
print(pd.read_hdf(filename, 'data'))
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
Note that you need to use format='table' in the first call to df.to_hdf to make the table appendable. Otherwise the format defaults to 'fixed', which is faster for reading and writing but creates a table that cannot be appended to.
Thus, you can process the CSV files one at a time, using append=True to build up the HDF5 file. Then overwrite the DataFrame or use del df so that the old DataFrame can be garbage collected.
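Put together, the one-file-at-a-time loop might look like the sketch below. It makes several assumptions: the helper name csvs_to_hdf and its parameters are hypothetical, every column is read as float (matching the single float dataset in the question), the CSV files have a header row, and PyTables is installed, since pandas needs it for to_hdf.

```python
import os

import pandas as pd


def csvs_to_hdf(csv_paths, h5_path, key="data", chunksize=100_000):
    """Append every CSV in csv_paths to one table in h5_path, chunk by chunk.

    Hypothetical helper for illustration; returns the number of rows written.
    """
    if os.path.exists(h5_path):
        os.remove(h5_path)  # start clean so that reruns do not duplicate rows
    total_rows = 0
    for csv_path in csv_paths:
        # chunksize keeps only one slice of one file in memory at a time
        for chunk in pd.read_csv(csv_path, dtype=float, chunksize=chunksize):
            chunk.to_hdf(h5_path, key=key, format="table", append=True)
            total_rows += len(chunk)
    return total_rows
```

It could then be called with a list of paths, for example csvs_to_hdf(sorted(glob.glob("parts/*.csv")), "combined.h5"), where both names are placeholders.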
Alternatively, instead of calling df.to_hdf, you could append to an HDFStore:
import numpy as np
import pandas as pd
filename = '/tmp/test.h5'
store = pd.HDFStore(filename)
for i in range(2):
    df = pd.DataFrame(np.arange(10).reshape((5,2)) * 10**i, columns=['A', 'B'])
    store.append('data', df)
store.close()
store = pd.HDFStore(filename)
data = store['data']
print(data)
store.close()
yields
A B
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
0 0 10
1 20 30
2 40 50
3 60 70
4 80 90
This should be possible with PyTables. You'll need to use the EArray class though.
As an example, the following is a script I wrote to import chunked training data stored as .npy files into a single .h5 file.
import numpy
import tables
import os
training_data = tables.open_file('nn_training.h5', mode='w')
a = tables.Float64Atom()
bl_filter = tables.Filters(5, 'blosc') # fast compressor at a moderate setting
training_input = training_data.create_earray(training_data.root, 'X', a,
                                             (0, 1323), 'Training Input',
                                             bl_filter, 4000000)
training_output = training_data.create_earray(training_data.root, 'Y', a,
                                              (0, 27), 'Training Output',
                                              bl_filter, 4000000)
for filename in os.listdir('input'):
    print("loading {}...".format(filename))
    a = numpy.load(os.path.join('input', filename))
    print("writing to h5")
    training_input.append(a)
for filename in os.listdir('output'):
    print("loading {}...".format(filename))
    training_output.append(numpy.load(os.path.join('output', filename)))
training_data.close()  # flush pending writes and close the file
Take a look at the docs for detailed instructions, but very briefly, the create_earray function takes 1) a data root or parent node; 2) an array name; 3) a datatype atom; 4) a shape with a 0 in the dimension you want to expand; 5) a verbose descriptor; 6) a compression filter; and 7) an expected number of rows along the expandable dimension. Only the first two are required by the signature, although the array's type and shape still have to be described (with the atom and shape, or by passing an existing array via obj), so you'll probably use all seven in practice. The function accepts a few other optional arguments as well; again, see the docs for details.
Once the array is created, you can use its append method in the expected way.
If you have a very large single CSV file, you may want to stream the conversion to HDF, e.g.:
import numpy as np
import pandas as pd
from IPython.display import clear_output
CHUNK_SIZE = 5000000
filename = 'data.csv'
dtypes = {'latitude': float, 'longitude': float}
iter_csv = pd.read_csv(
    filename, iterator=True,
    dtype=dtypes, encoding='utf-8', chunksize=CHUNK_SIZE)
cnt = 0
for ix, chunk in enumerate(iter_csv):
    chunk.to_hdf(
        "data.hdf", key='data', format='table', append=True)
    cnt += len(chunk)  # the last chunk may be smaller than CHUNK_SIZE
    clear_output(wait=True)
    print(f"Processed {cnt:,.0f} coordinates..")
Tested with a 64 GB CSV file and 450 million coordinates (the conversion took about 10 minutes).