I'm going through over 1 million patent applications and have to fix the dates, in addition to other things that I will work on later. I'm reading the file into a Pandas data frame, then running the following function:
import numpy as np
import pandas as pd

def date_change():
    new_dates = {'m/y': []}
    for i, row in apps.iterrows():
        try:
            d = row['date'].rsplit('/')
            new_dates['m/y'].append('{}/19{}'.format(d[0], d[2]))
        except Exception as e:
            print('{} {}\n{}\n{}'.format(i, e, row, d))
            new_dates['m/y'].append(np.nan)
    # join() and drop() return new frames, so keep the result
    return apps.join(pd.DataFrame(new_dates)).drop(columns='date')
Is there a quicker way of executing this? Is Pandas even the correct library to be using with a dataset this large? I've been told PySpark is good for big data, but how much will it improve the speed?
So it seems like you are using a string to represent dates instead of a datetime object.
I'd suggest doing something like
df['date'] = pd.to_datetime(df['date'])
That way you don't need to iterate at all, as that function operates on the whole column.
Then you might want to check the following answer, which uses dt.strftime to format your column appropriately.
If you could show input and expected output, I could add the full solution here.
Besides, 1 million rows should typically be manageable for pandas (depending on the number of columns, of course).
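For instance, a rough sketch along those lines - assuming the 'date' column holds 'month/day/2-digit-year' strings and that, as in your loop, every year belongs to the 1900s (df being the frame you read the file into):
import pandas as pd

# Parse the whole column at once; unparseable values become NaT instead of
# raising, which plays the same role as the np.nan fallback in your loop.
parsed = pd.to_datetime(df['date'], format='%m/%d/%y', errors='coerce')

# Reformat as month/19yy, e.g. '11/17/68' -> '11/1968'.
df['m/y'] = parsed.dt.strftime('%m/19%y')
df = df.drop(columns='date')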
I work at a place where I use a LOT of Python (pandas), and the data keeps getting bigger and bigger. Last month I was working with a few hundred thousand rows, weeks after that with a few million rows, and now I am working with 42 million rows. Most of my work is to take a dataframe and, for each row, look up its "equivalent" in another dataframe and process the data; sometimes it's just a merge, but more often I need to run a function over the equivalent data. Back when it was a few hundred thousand rows, it was OK to just use apply and a simple filter, but now it is EXTREMELY SLOW. Recently I've switched to vaex, which is way faster than pandas in every aspect except apply, and after some searching I found that apply is a last resort that should only be used when there is no other option. So, is there another option? I really don't know.
Some code to explain how I was doing this entire time:
def get_secondary(row: pd.Series):
    cnae = row["cnae_fiscal"]
    cnpj = row["cnpj"]
    # cnaes is another dataframe
    secondary = cnaes[cnaes.cnpj == cnpj]
    return [cnae] + list(secondary["cnae"].values)

empresas["cnae_secundarios"] = empresas.apply(get_secondary, axis=1)
This isn't the only use case, as I said.
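A groupby/merge rewrite is the usual alternative to a per-row lookup like this; a rough sketch using the same column names (placeholder data, not verified against my real cases):
import pandas as pd

# Placeholder frames with the question's column names, just for illustration.
empresas = pd.DataFrame({"cnpj": [1, 2], "cnae_fiscal": [10, 20]})
cnaes = pd.DataFrame({"cnpj": [1, 1, 3], "cnae": [111, 112, 331]})

# Pre-aggregate the lookup table once instead of filtering it per row...
secondary = (cnaes.groupby("cnpj")["cnae"]
                  .apply(list)
                  .rename("cnae_sec_list"))

# ...then attach it with a single merge.
empresas = empresas.merge(secondary, left_on="cnpj", right_index=True, how="left")

# Prepend the primary cnae_fiscal; companies with no match get just [cnae_fiscal].
empresas["cnae_secundarios"] = [
    [cnae] + (sec if isinstance(sec, list) else [])
    for cnae, sec in zip(empresas["cnae_fiscal"], empresas["cnae_sec_list"])
]
empresas = empresas.drop(columns="cnae_sec_list")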
How can I parse phone numbers from a pandas data frame, ideally using phonenumbers library?
I am trying to use a port of Google's libphonenumber library on Python,
https://pypi.org/project/phonenumbers/.
I have a data frame with 3 million phone numbers from many countries. There is a column with the phone number and a column with the country/region code. I'm trying to use the parse function in the package. My goal is to parse each row using the corresponding country code, but I can't find a way of doing it efficiently.
I tried using apply but it didn't work. I get a "(0) Missing or invalid default region." error, meaning it won't pass the country code string.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), str(df.region_code)))
The line below works, but doesn't get me what I want, as the numbers I have come from about 120+ different countries.
df['phone_number_clean'] = df.phone_number.apply(lambda x:
    phonenumbers.parse(str(df.phone_number), "US"))
I tried doing this in a loop, but it is terribly slow. Took me more than an hour to parse 10,000 numbers, and I have about 300x that:
for i in range(n):
    df3['phone_number_std'][i] = phonenumbers.parse(
        str(df.phone_number[i]), str(df.region_code[i]))
Is there a method I'm missing that could run this faster? The apply function works acceptably well but I'm unable to pass the data frame element into it.
I'm still a beginner in Python, so perhaps this has an easy solution. But I would greatly appreciate your help.
Your initial solution using apply is actually pretty close - you don't say what doesn't work about it, but the syntax for a lambda function over multiple columns of a dataframe, rather than on the rows within a single column, is a bit different. Try this:
df['phone_number_clean'] = df.apply(lambda x:
                                    phonenumbers.parse(str(x.phone_number),
                                                       str(x.region_code)),
                                    axis='columns')
The differences:
You want to include multiple columns in your lambda function, so you want to apply your lambda function to the entire dataframe (i.e., df.apply) rather than to the Series (the single column) that is returned by df.phone_number.apply. (Print the output of df.phone_number to the console - what is returned is all the information that your lambda function will be given.)
The argument axis='columns' (or axis=1, which is equivalent, see the docs) actually slices the data frame by rows, so apply 'sees' one record at a time (i.e., [index0, phonenumber0, countrycode0], [index1, phonenumber1, countrycode1]...), as opposed to slicing the other direction, which would give it ([phonenumber0, phonenumber1, phonenumber2...]).
Your lambda function only knows about the placeholder x, which, in this case, is the Series [index0, phonenumber0, countrycode0], so you need to specify all the values relative to the x that it knows - i.e., x.phone_number, x.region_code.
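A tiny self-contained illustration of what that per-row x looks like (the demo frame and its values are made up, just mirroring the question's column names):
import pandas as pd

# Each x handed to the lambda by apply(axis='columns') is a Series holding
# one full row of the frame.
demo = pd.DataFrame({'phone_number': ['202-555-0101', '020 7946 0958'],
                     'region_code': ['US', 'GB']})
print(demo.apply(lambda x: (x.phone_number, x.region_code), axis='columns'))
# 0     (202-555-0101, US)
# 1    (020 7946 0958, GB)
# dtype: object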
Love the solution of @katelie, but here's my code. I added a try/except block to keep the format_number call from failing - it cannot handle strings that are too long.
import phonenumbers as phon

def formatE164(number):
    try:
        return phon.format_number(phon.parse(str(number), "NL"),
                                  phon.PhoneNumberFormat.E164)
    except Exception:
        # skip values that cannot be parsed (e.g. strings that are too long)
        return None

df['column'] = df['column'].apply(formatE164)
I have some problems with pandas' HDFStore being far too slow, and unfortunately I'm unable to put together a satisfying solution from other questions here.
Situation
I have a big DataFrame, containing mostly float and sometimes integer columns, which goes through multiple processing steps (renaming, removing bad entries, aggregating into 30-minute bins). Each row has a timestamp associated with it. I would like to save some intermediate steps to an HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally, the user should be able to plot certain columns from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore.
Concretely, the user should get a list of all columns of all dataframes stored in the HDF file, then select which columns they would like to see, after which I use matplotlib to present them the corresponding data.
Data
shape == (5730000, 339) does not seem large at all, which is why I'm confused... (I might get far more rows over time; the columns should stay fixed.)
In the first step I iteratively append rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
My approach
I do all manipulations in memory, since pandas seems to be rather fast and I/O is slower (the HDF file is on a different physical server, I think).
I use a datetime index and automatically selected float or integer columns.
I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictionary with one function per column. This is incredibly slow as well.
For plotting, I have to read in the entire dataframe and then get the columns: hdfstore.get('/name').columns
Question
How can I retrieve all columns without reading any data from the HDFStore?
What would be the most efficient way of storing my data? Is HDF the right option? Table or fixed?
Does it matter in terms of efficiency if the index is a datetime index? Does there exist a more efficient format in general (e.g. all columns the same, fixed dtype)?
Is there a faster way to aggregate than groupby (df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict))?
similar questions
How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
You may simply load 0 rows of the DataFrame by specifying the same start and stop values, and leave all the internal index/column processing to pandas itself:
idx = pd.MultiIndex.from_product([('A', 'B'), range(2)], names=('Alpha', 'Int'))
df = pd.DataFrame(np.random.randn(len(idx), 3), index=idx, columns=('I', 'II', 'III'))
df
>>>                    I        II       III
>>> Alpha Int
>>> A     0    -0.472412  0.436486  0.354592
>>>       1    -0.095776 -0.598585 -0.847514
>>> B     0     0.107897  1.236039 -0.196927
>>>       1    -0.154014  0.821511  0.092220
The following works for both fixed and table formats:
with pd.HDFStore('test.h5') as store:
    store.put('df', df, format='f')
    meta = store.select('df', start=1, stop=1)
meta
meta.index
meta.columns
>>>            I  II  III
>>> Alpha Int
>>>
>>> MultiIndex(levels=[[], []],
>>>            codes=[[], []],
>>>            names=['Alpha', 'Int'])
>>>
>>> Index(['I', 'II', 'III'], dtype='object')
As for the other questions:
As long as your data is mostly homogeneous (almost all float columns, as you mentioned) and you are able to store it in a single file without needing to distribute the data across machines, HDF is the first thing to try.
If you need to append/delete/query data, you must use the table format. If you only need to write once and read many times, the fixed format will improve performance.
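A tiny illustration of the practical difference (the file and key names are just placeholders):
import pandas as pd

demo = pd.DataFrame({'x': range(5)})

with pd.HDFStore('demo.h5') as store:
    store.put('tbl', demo, format='table')       # appendable and queryable
    store.append('tbl', demo)                    # fine
    subset = store.select('tbl', where='index < 2')

    store.put('fix', demo, format='fixed')       # fastest writes, read as a whole
    # store.append('fix', demo)                  # would raise: fixed stores are not appendable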
As for the datetime index, I think the same idea as in point 1 applies here: if you are able to convert all data into a single type, it should increase your performance.
Nothing else beyond what was proposed in the comments to your question comes to mind.
For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
Note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd
data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
hdf.put('/DATA/fixed_store', data, format='fixed')
hdf.put('/DATA/table_store', data, format='table', data_columns=True)
for key in hdf.keys():
try:
# column names of table store
print(hdf.get_node('{}/table'.format(key)).description._v_names)
except AttributeError:
try:
# column names of fixed store
print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
except AttributeError:
# e.g. a dataset created by h5py instead of pandas.
print('unknown node in HDF.')
Columns without reading any data:
store.get_storer('df').ncols # substitute 'df' with your key
# you can also access nrows and other useful fields
From the docs (fixed format, table format):
[fixed] These types of stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table stores.
[table] Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported.
You may try to use milliseconds or nanoseconds since the epoch in place of datetimes. This way, you are just dealing with integer indices.
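A minimal sketch of that idea (the small frame here is a stand-in for the real data):
import pandas as pd
import numpy as np

# Stand-in frame with a datetime index.
df = pd.DataFrame({'x': np.arange(4.0)},
                  index=pd.date_range('2019-01-01', periods=4, freq='30min'))

# Store the index as integer nanoseconds since the epoch...
df_int = df.copy()
df_int.index = df_int.index.astype('int64')      # ns since 1970-01-01

# ...and turn it back into datetimes after reading the data in again.
df_back = df_int.copy()
df_back.index = pd.to_datetime(df_back.index)    # integers interpreted as ns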
You may have a look at this answer if what you need is grouping on large data.
One piece of advice: if you have 4 questions to ask, it may be better to ask 4 separate questions on SO. This way you'll get a higher number of (higher-quality) answers, since each one is easier to tackle, and each will deal with a specific topic, making it easier for people looking for specific answers to find them.
A continuation of a previous post. Previously, I had help creating a new column in a dataframe using Pandas, where each value represents a factorized or unique value based on another column's value. I used this on a test case and it works, but I am having trouble doing the same for a much larger log and htm file. I have 12 log files (one for each month), and after combining them I get a 17 GB file to work with. I want to factorize each and every username in it. I have been looking into using Dask; however, I can't replicate the functionality of sort and factorize to do what I want on the Dask dataframe. Would it be better to try to use Dask, continue with Pandas, or try a MySQL database to manipulate a 17 GB file?
import pandas as pd
import numpy as np
#import dask.dataframe as pf

df = pd.read_csv('example2.csv', header=0, dtype='unicode')
df_count = df['fruit'].value_counts()

# sorting the column fruit
df.sort_values(['fruit'], ascending=True, inplace=True)
df.reset_index(drop=True, inplace=True)

f, u = pd.factorize(df.fruit.values)
n = np.core.defchararray.add('Fruit', f.astype(str))
df = df.assign(NewCol=n)
#print(df)
df.to_csv('output.csv')
Would it be better to try to use Dask, continue with Pandas or try with a MySQL database to manipulate a 17GB file?
The answer to this question depends on a great many things and is probably too general to get a good answer on Stack Overflow.
However, there are a few particular questions you bring up that are easier to answer
How do I factorize a column?
The easy way here is to categorize a column:
df = df.categorize(columns=['fruit'])
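From there, the integer codes behind the categories can play the role of pd.factorize's codes in the question; a rough sketch (the small frame is a stand-in for the real data, and I haven't benchmarked this):
import dask.dataframe as dd
import pandas as pd

# Stand-in dask dataframe with the question's 'fruit' column.
df = dd.from_pandas(pd.DataFrame({'fruit': ['apple', 'pear', 'apple', 'kiwi']}),
                    npartitions=2)

df = df.categorize(columns=['fruit'])
# cat.codes gives one integer per distinct fruit, like pd.factorize's first result.
df['NewCol'] = 'Fruit' + df['fruit'].cat.codes.astype(str)
print(df.compute())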
How do I sort unique values within a column
You can always set the column as the index, which will cause a sort. However, beware that sorting in a distributed setting can be quite expensive.
However, if you want to sort a column with a small number of options, then you might find the unique values, sort those in memory, and then join them back onto the dataframe. Something like the following might work:
unique_fruit = df.fruit.drop_duplicates().compute() # this is now a pandas series
unique_fruit = unique_fruit.sort_values()
numbers = pd.Series(unique_fruit.index, index=unique_fruit.values, name='fruit')
df = df.merge(numbers.to_frame(), left_on='fruit', right_index=True)
I have a fairly large csv file (700mb) which is assembled as follows:
qCode      Date        Value
A_EVENTS   11/17/2014  202901
A_EVENTS   11/4/2014   801
A_EVENTS   11/3/2014   2.02E+14
A_EVENTS   10/17/2014  203901
etc.
I am parsing this file to get specific values, and then using DF.loc to populate a pre-existing DataFrame, i.e. the code:
for line in fileParse:
    code = line[0]
    for point in fields:
        if point == code[code.find('_')+1:len(code)]:
            date = line[1]
            year, quarter = quarter_map(date)
            value = float(line[2])
            pos = line[0].find('_')
            ticker = line[0][0:pos]
            i = ticker + str(int(float(year))) + str(int(float(quarter)))
            df.loc[i, point] = value
        else:
            pass
The question I have: is .loc the most efficient way to add values to an existing DataFrame? This operation seems to take over 10 hours...
FYI, fields are the columns in the DF (the values I'm interested in) and the index (i) is a string...
Thanks
No, you should never build a dataframe row by row. Each time you do this the entire dataframe has to be copied (it's not extended in place), so you are using n + (n - 1) + (n - 2) + ... + 1, i.e. O(n^2), memory (which has to be garbage collected)... which is terrible, hence it's taking hours!
You want to use read_csv, and you have a few options:
read in the entire file in one go (this should be fine with 700mb even with just a few gig of ram).
pd.read_csv('your_file.csv')
read in the csv in chunks and then glue them together (in memory)... tbh I don't think this will actually use less memory than the above, but is often useful if you are doing some munging at this step.
pd.concat(pd.read_csv('foo.csv', chunksize=100000)) # not sure what optimum value is for chunksize
read the csv in chunks and save them into pytables (rather than in memory), if you have more data than memory (and you've already bought more memory) then use pytables/hdf5!
store = pd.HDFStore('store.h5')
for df in pd.read_csv('foo.csv', chunksize=100000):
    store.append('df', df)
If I understand correctly, I think it would be much faster to:
Import the whole csv into a dataframe using pandas.read_csv.
Select the rows of interest from the dataframe.
Append the rows to your other dataframe using df.append(other_df).
If you provide more information about what criteria you are using in step 2 I can provide code there as well.
A couple of options come to mind:
1) Parse the file as you are currently doing, but build a dict instead of appending to your dataframe. After you're done with that, convert the dict to a DataFrame and then use concat() to combine it with the existing DataFrame (see the sketch after this list).
2) Bring that csv into pandas using read_csv(), then filter/parse on what you want, then do a concat() with the existing dataframe.
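A rough sketch of option 1, reusing the fileParse, fields and quarter_map names from the question (untested, just to show the shape of it):
import pandas as pd

# Build a dict of {index label: {column: value}} instead of calling
# df.loc row by row, then convert it to a frame in one go.
rows = {}
for line in fileParse:
    code = line[0]
    for point in fields:
        if point == code[code.find('_') + 1:]:
            year, quarter = quarter_map(line[1])
            ticker = code[:code.find('_')]
            i = ticker + str(int(float(year))) + str(int(float(quarter)))
            rows.setdefault(i, {})[point] = float(line[2])

new_df = pd.DataFrame.from_dict(rows, orient='index')
# Combine with the pre-existing frame; df.update(new_df) is an alternative
# if the index labels already exist in df.
df = pd.concat([df, new_df])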