I'm trying to filter through a CSV and make a new CSV which is the exact same except for it gets rid of any rows that have a value of greater than 100 billion in the 'marketcap' column.
The code I've written so just spits out the same CSV as the original out over again and doesn't cut out any lines from the old CSV to the new CSV.
Code:
db = pd.read_csv('SF1_original.csv')
db = db[db['marketcap']<= 100000000000]
db.to_csv('new_SF1_original.csv')
Example of old CSV (It's long don't look through whole thing, just to give you an idea):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Example New CSV (Exact same when this line should have been cut):
,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0
I've seen two questions somewhat related to this on StackOverflow, but they haven't helped me much. This one uses CSV library instead of pandas (which is an option for me). This one is more helpful since it uses pandas but still hasn't been interacted with and isn't exactly the same as my use case.
You can get the indexes of the rows with "marketcap" over 100 billion rows like so:
df.loc[df["marketcap"] > 100000000000]["marketcap"].index
All that's left to do is drop them from the DataFrame:
df.drop(df.loc[df["marketcap"] > 100000000000]["marketcap"].index, inplace=True)
Reading from CSV and writing to the CSV is already correctly taken care of in your code.
I have some problems with pandas' HDFStore being far to slow and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally the user should be able to plot certain column from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory since pandas seems to be rather fast and I/O is slower (HDF is on different physical server, I think)
I use datetime index and automatically selected float or integer columns
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictonary with one function per column. This is incredibly slow as well.
For plotting, I have to read-in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in term of efficiency if the index is a datetime index? Does there exists a more efficient format in general (e.g. all columns the same, fixed dtype?)
Is there a faster way to aggregate instead of groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
You may simply load 0 rows of the DataFrame by specifying same start and stop attributes. And leave all internal index/column processing for pandas itself:
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>> I II III
>>> Alpha Int
>>> A 0 -0.472412 0.436486 0.354592
>>> 1 -0.095776 -0.598585 -0.847514
>>> B 0 0.107897 1.236039 -0.196927
>>> 1 -0.154014 0.821511 0.092220
Following works both for fixed an table formats:
with pd.HDFStore('test.h5') as store:
store.put('df', df, format='f')
meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>> I II III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>> codes=[[], []],
>>> names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for others question:
As long as your data is mostly homogeneous (almost float columns as you mentioned) and you are able to store it in single file without need to distribute data across machines - HDF is the first thing to try.
If you need to append/delete/query data - you must use table format. If you only need to write once and read many - fixed will improve performance.
As for datetime index, i think here we may use same idea as in 1 clause. If u are able to convert all data into single type it should increase your performance.
Nothing else that proposed in comment to your question comes to mind.
For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd
data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
hdf.put('/DATA/fixed_store', data, format='fixed')
hdf.put('/DATA/table_store', data, format='table', data_columns=True)
for key in hdf.keys():
try:
# column names of table store
print(hdf.get_node('{}/table'.format(key)).description._v_names)
except AttributeError:
try:
# column names of fixed store
print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
except AttributeError:
# e.g. a dataset created by h5py instead of pandas.
print('unknown node in HDF.')
Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs (fixed format, table format): (important points in bold)
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use epochms (or epochns) (milliseconds or nanoseconds since epoch) in place of datetimes. This way, you are just dealing with integer indices.
You may have a look at this answer if what you need is grouping by on large data.
An advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way, you'll get a higher number of (higher quality) answers, since each one is easier to tackle. And each will deal with a specific topic, making it easier to search for people that are looking for specific answers.