Convert columns to string in Pandas - python

I have the following DataFrame from a SQL query:
(Pdb) pp total_rows
ColumnID RespondentCount
0 -1 2
1 3030096843 1
2 3030096845 1
and I pivot it like this:
total_data = total_rows.pivot_table(cols=['ColumnID'])
which produces
(Pdb) pp total_data
ColumnID -1 3030096843 3030096845
RespondentCount 2 1 1
[1 rows x 3 columns]
When I convert this dataframe into a dictionary (using total_data.to_dict('records')[0]), I get
{3030096843: 1, 3030096845: 1, -1: 2}
but I want to make sure the 303 columns are cast as strings instead of integers so that I get this:
{'3030096843': 1, '3030096845': 1, -1: 2}

One way to convert to string is to use astype:
total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
However, perhaps you are looking for the to_json function, which will convert keys to valid json (and therefore your keys to strings):
In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])
In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'
In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'
Note: you can pass in a buffer/file to save this to, along with some other options...
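To see what the to_json route does to integer column labels like the ones in the question, here is a minimal sketch. The one-row frame is a hand-built stand-in for total_data (not the real query result), and the round trip through json.loads is only there to inspect the keys.

```python
import json
import pandas as pd

# Hand-built stand-in for the pivoted frame in the question (integer column labels).
total_data = pd.DataFrame(
    [[2, 1, 1]],
    columns=[-1, 3030096843, 3030096845],
    index=["RespondentCount"],
)

# JSON object keys are always strings, so every label is converted on the way out.
record = json.loads(total_data.to_json(orient="records"))[0]
print(record)
```

Note that this converts the -1 label as well, so it fits when all keys may become strings rather than only the long IDs.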

If you need to convert ALL columns to strings, you can simply use:
df = df.astype(str)
This is useful if you need everything except a few columns to be strings/objects, then go back and convert the other ones to whatever you need (integer in this case):
df[["D", "E"]] = df[["D", "E"]].astype(int)
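As a rough illustration of the "everything to string, then convert a few back" pattern described above, the following sketch uses a small made-up frame; the column names A, D and E are assumptions chosen to match the snippet.

```python
import pandas as pd

# Made-up data; only the column names mirror the snippet above.
df = pd.DataFrame({"A": [1, 2], "D": [3, 4], "E": [5, 6]})

# Step 1: every column becomes text.
as_text = df.astype(str)

# Step 2: selected columns go back to integers.
as_text[["D", "E"]] = as_text[["D", "E"]].astype(int)

first_a = as_text["A"].iloc[0]
d_total = int(as_text["D"].sum())
print(repr(first_a), d_total)
```

Column A stays textual while D and E support arithmetic again.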

pandas >= 1.0: It's time to stop using astype(str)!
Prior to pandas 1.0 (well, 0.25 actually) this was the de facto way of declaring a Series/column as a string:
# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')
From pandas 1.0 onwards, consider using "string" type instead.
# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype
Here's why, as quoted by the docs:
You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.
object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text
while excluding non-text but still object-dtype columns.
When reading code, the contents of an object dtype array is less clear than 'string'.
See also the section on Behavioral Differences between "string" and object.
Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than numpy, which is good because numpy types are not powerful enough. For example NumPy does not have any way of representing missing data in integer data (since type(NaN) == float). But pandas can using Nullable Integer columns.
Why should I stop using it?
Accidentally mixing dtypes
The first reason, as outlined in the docs is that you can accidentally store non-text data in object columns.
# pandas <= 0.25
pd.Series(['a', 'b', 1.23]) # whoops, this should have been "1.23"
0 a
1 b
2 1.23
dtype: object
pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23] # oops, pandas was storing this as float all the time.
# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")
0 a
1 b
2 1.23
dtype: string
pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23'] # it's a string and we just averted some potentially nasty bugs.
Challenging to differentiate strings and other python objects
Another obvious example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.
Consider,
# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
A B
0 a {}
1 b [1, 2, 3]
2 c 123
Up to pandas 0.25, there was virtually no way to distinguish that "A" and "B" do not have the same type of data.
# pandas <= 0.25
df.dtypes
A object
B object
dtype: object
df.select_dtypes(object)
A B
0 a {}
1 b [1, 2, 3]
2 c 123
From pandas 1.0, this becomes a lot simpler:
# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes
A string
B object
dtype: object
df.select_dtypes("string")
A
0 a
1 b
2 c
Readability
This is self-explanatory ;-)
OK, so should I stop using it right now?
...No. As of writing this answer (version 1.1), there are no performance benefits but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!
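The "accidentally mixing dtypes" point above can be checked directly by looking at the Python type of each stored element. This is a minimal sketch, assuming pandas 1.0 or later so that the "string" dtype is available.

```python
import pandas as pd

mixed = ["a", "b", 1.23]

as_object = pd.Series(mixed)                  # object dtype keeps the float as-is
as_string = pd.Series(mixed, dtype="string")  # dedicated string dtype converts it

object_types = [type(v).__name__ for v in as_object]
string_types = [type(v).__name__ for v in as_string]
print(object_types)
print(string_types)
```

The first list still reports a float for the last element, while the second reports only str.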

Here's another approach, particularly useful for converting multiple columns to strings at once rather than a single column:
In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
...: 'A': [20, 30.0, np.nan],
...: 'B': ["a45a", "a3", "b1"],
...: 'C': [10, 5, np.nan]})
...:
In [79]: df.dtypes ## Current datatype
Out[79]:
A float64
B object
C float64
dtype: object
## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str)
In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]:
A object
B object
C object
dtype: object
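One thing the example above does not show is what happens to the NaN cells. The sketch below is an illustration under stated assumptions: it uses map(str), which calls Python's str() on every cell and so turns NaN into the text 'nan', and contrasts it with the nullable "string" dtype (pandas 1.0 or later), which keeps the value missing. On the pandas 1.x and 2.x releases I am aware of, astype(str) behaves like map(str) here, but that is worth verifying on your own version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [20, 30.0, np.nan], "C": [10, 5, np.nan]})

# map(str) calls Python's str() on every cell, including NaN.
as_text = df["A"].map(str)

# The nullable "string" dtype keeps the missing value as <NA> instead.
as_nullable = df["A"].astype("string")

print(as_text.tolist())
print(as_nullable.isna().tolist())
```

If downstream code tests for missing values, the second form is usually the less surprising one.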

There are four ways to convert columns to string
1. astype(str)
df['column_name'] = df['column_name'].astype(str)
2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)
3. map(str)
df['column_name'] = df['column_name'].map(str)
4. apply(str)
df['column_name'] = df['column_name'].apply(str)
Let's see the performance of each approach:
#importing libraries
import numpy as np
import pandas as pd
import time
#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')
#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')
#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')
#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')
Output
time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds
map(str) and apply(str) take less time than the other two techniques.
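Single time.time() measurements like the ones above can be noisy. As a hedged alternative, the sketch below times the same four conversions with the standard library's timeit on a smaller column (100,000 rows, an arbitrary choice so that it runs quickly) and keeps the best of three runs. The numbers depend on the machine and the pandas version, so no particular ranking is implied.

```python
import timeit
import numpy as np
import pandas as pd

# Smaller column than in the answer above so the sketch runs quickly.
col = pd.Series(np.random.randint(1, 1000, size=100_000))

candidates = {
    "astype(str)": lambda: col.astype(str),
    "values.astype(str)": lambda: col.values.astype(str),
    "map(str)": lambda: col.map(str),
    "apply(str)": lambda: col.apply(str),
}

# Best of three repeats per method, in seconds.
timings = {
    name: min(timeit.repeat(func, number=1, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name:>20}: {seconds:.4f} s")
```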

I usually use this one:
df['Column'].map(str)

pandas version: 1.3.5
Updated answer
df['colname'] = df['colname'].astype(str) => this works by default. But if you have rebound the name str (for example, str = "myString") before calling astype(str), it won't work, because pandas then receives your string instead of the built-in type. In that case, pass the dtype by name as in the line below.
df['colname'] = df['colname'].astype('str')
===========
(Note: incorrect old explanation)
df['colname'] = df['colname'].astype('str') => converts dataframe column into a string type
df['colname'] = df['colname'].astype(str) => gives an error
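The shadowing problem described in the updated answer can be reproduced in isolation. In this sketch the helper convert_with_shadowed_name is hypothetical and exists only to keep the rebinding of str local to one function.

```python
import pandas as pd


def convert_with_shadowed_name(column):
    # Deliberately rebind the name `str` inside this function only.
    str = "myString"
    try:
        column.astype(str)  # pandas receives "myString", not the built-in type
        failed = False
    except (TypeError, ValueError):
        failed = True
    # The dtype given by name does not depend on what `str` currently refers to.
    return failed, column.astype("str")


df = pd.DataFrame({"colname": [1, 2, 3]})
failed, converted = convert_with_shadowed_name(df["colname"])
print(failed, converted.tolist())
```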

1. .map(repr) is very fast
If you want to convert values to strings in a column, consider .map(repr). For multiple columns, consider .applymap(str) (from pandas 2.1 onwards, applymap is deprecated in favor of DataFrame.map).
df['col_as_str'] = df['col'].map(repr)
# multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(str)
# or
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda col: col.map(repr))
In fact, a timeit test shows that map(repr) is 3 times faster than astype(str) (and is faster than any other method mentioned on this page). Even for multiple columns, this runtime difference still holds. The following is the runtime plot of various methods mentioned here.
astype(str) has very little overhead but for larger frames/columns, map/applymap outperforms it.
2. Don't convert to strings in the first place
There's very little reason to convert a numeric column into strings given pandas string methods are not optimized and often get outperformed by vanilla Python string methods. If not numeric, there are dedicated methods for those dtypes. For example, datetime columns should be converted to strings using pd.Series.dt.strftime().
One way numeric->string seems to be used is in a machine learning context where a numeric column needs to be treated as categorical. In that case, instead of converting to strings, consider other dedicated methods such as pd.get_dummies or sklearn.preprocessing.LabelEncoder or sklearn.preprocessing.OneHotEncoder to process your data instead.
3. Use rename to convert column names to specific types
The specific question in the OP is about converting column names to strings, which can be done by rename method:
df = total_rows.pivot_table(columns=['ColumnID'])
df.rename(columns=str).to_dict('records')
# [{'-1': 2, '3030096843': 1, '3030096845': 1}]
The code used to produce the above plots:
import numpy as np
import pandas as pd
from perfplot import plot
plot(
setup=lambda n: pd.Series(np.random.default_rng().integers(0, 100, size=n)),
kernels=[lambda s: s.astype(str), lambda s: s.astype("string"), lambda s: s.apply(str), lambda s: s.map(str), lambda s: s.map(repr)],
labels= ['col.astype(str)', 'col.astype("string")', 'col.apply(str)', 'col.map(str)', 'col.map(repr)'],
n_range=[2**k for k in range(4, 22)],
xlabel='Number of rows',
title='Converting a single column into string dtype',
equality_check=lambda x,y: np.all(x.eq(y)));
plot(
setup=lambda n: pd.DataFrame(np.random.default_rng().integers(0, 100, size=(n, 100))),
kernels=[lambda df: df.astype(str), lambda df: df.astype("string"), lambda df: df.applymap(str), lambda df: df.apply(lambda col: col.map(repr))],
labels= ['df.astype(str)', 'df.astype("string")', 'df.applymap(str)', 'df.apply(lambda col: col.map(repr))'],
n_range=[2**k for k in range(4, 18)],
xlabel='Number of rows in dataframe',
title='Converting every column of a 100-column dataframe to string dtype',
equality_check=lambda x,y: np.all(x.eq(y)));
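The rename approach from point 3 can be sketched end to end as follows. Two things here are assumptions for illustration: set_index followed by a transpose stands in for the pivot (so the sketch does not depend on aggregation defaults), and the second variant, which leaves -1 as an integer, is my reading of the output the question asks for.

```python
import pandas as pd

total_rows = pd.DataFrame(
    {"ColumnID": [-1, 3030096843, 3030096845], "RespondentCount": [2, 1, 1]}
)

# Stand-in for the pivot: one row, with the ColumnID values as column labels.
total_data = total_rows.set_index("ColumnID").T

# Every label as a string.
all_str = total_data.rename(columns=str).to_dict("records")[0]

# Only the positive IDs as strings, leaving -1 as an integer.
some_str = total_data.rename(
    columns=lambda label: str(label) if label > 0 else label
).to_dict("records")[0]

print(all_str)
print(some_str)
```

rename accepts any callable for columns, so the rule for which labels to convert can be as selective as needed.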

Using .apply() with a lambda conversion function also works in this case:
total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))
For entire dataframes you can use .applymap().
(but in any case probably .astype() is faster)

currently i do it like this
df_pg['store_id'] = df_pg['store_id'].astype('string')

Related

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows filled with different string values. Now I want to search the entire DataFrame for, let's say, the string "Airbag" and delete every row that doesn't contain it. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want it to, but it is way too slow. So I tried finding a way to do it with vectorization or a list comprehension, but I wasn't able to do it or find example code on the internet. So my question is: is it possible to speed this process up?
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
testing all columns:
Here is an alternative using a good old custom function. One could think that it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and 3x3000 dataframes with matches and no matches):
def has_match(series):
    # stop at the first cell that contains the term
    for s in series:
        if 'Airbag' in s:
            return True
    return False
df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
Timings vs number of rows (fixed at 3 columns):
Timings vs number of columns (fixed at 3,000 rows):
Ok I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)
print(df.iloc[master_index[1][0]])
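Another option in the same spirit, sketched here under the assumption that a substring match is wanted, is to loop over the columns (about 100) rather than the rows (up to 3000) and combine one vectorised str.contains test per column into a single row mask. Whether this is faster than the approaches above depends on the data, so it is worth timing on a real frame.

```python
import pandas as pd

df = pd.DataFrame({
    "col1": ["Airbag_101", "Distance_xy", "Sensor_2"],
    "col2": ["String1", "String2", "String3"],
    "col3": ["Tires", "Wheel_Airbag", "Antenna"],
})

# Start with "no row matches" and OR in one vectorised column test at a time.
mask = pd.Series(False, index=df.index)
for name in df.columns:
    mask = mask | df[name].astype(str).str.contains("Airbag", regex=False)

filtered = df[mask]
print(filtered)
```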

how to express dataframe operations using symbols?

Suppose I have a SymPy expression. It seems to me I can only substitute symbols with numbers. The question is: can I substitute them with something else, like a pandas Series? For example,
from sympy import Symbol, Function
a_sym = Symbol('a')
b_sym = Symbol('b')
sum_func_sym = Function('sum_func')
expression = sum_func_sym(a_sym+b_sym)
Is there a way for me to substitute a_sym and b_sym with pandas Series, replace sum_func_sym with the Series sum, and then calculate the result?
import pandas as pd
df = pd.DataFrame({'a': [1,2], 'b': [3,4]})
a = df.a
b = df.b
def sum_func(series):
    return series.sum()
When I do the substitution and replacement I get an error:
expression.subs(a_sym, a).subs(b_sym, b).replace(sum_func_sym, sum_func)
AttributeError: 'Add' object has no attribute 'sum'
Building upon this answer, I came up with the following implementation that seems to work for at least fairly simple use cases:
import pandas as pd
from sympy import Symbol, lambdify, symbols
from sympy.parsing.sympy_parser import parse_expr
df = pd.DataFrame({'a': range(5), 'b': range(5)})
my_vars = symbols('a b') # have to have same names as DataFrame columns
expr = parse_expr('a+sqrt(b)+1') # lowercase sqrt is SymPy's square-root function
# Create callable version of the expression
callable_obj = lambdify(my_vars, expr)
# Call the object, passing in the DataFrame columns as parameters
# Write the result in a new column of the dataframe
df['result'] = callable_obj(**{
    str(a): df[str(a)] # pass column as variable with the same name
    for a in expr.atoms() # for all atomic expressions...
    if isinstance(a, Symbol) # that are Symbols (not constants)
})
The output is (as expected):
0 1.000000
1 3.000000
2 4.414214
3 5.732051
4 7.000000
dtype: float64
I assume that you have a dataframe with many columns and you want to add two of them. However, the names of the columns to be added are variables, unknown beforehand. Here is a solution for this case. f-strings work in Python 3.6+; for other versions, please modify appropriately.
def sum(a, b):
    global df
    df[f'sum_of_{a}_and_{b}'] = df[a] + df[b]
    # for a more general function than sum:
    # df[f'f_of_{a}_and_{b}'] = df.apply(lambda x: f(x[a], x[b]), axis=1)
    # where f is the function to use instead of the sum
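For the original expression with the undefined sum_func, one possible route is lambdify with a custom modules entry that maps the function name to a Python callable. This is a sketch under assumptions rather than a definitive answer: it relies on lambdify accepting a dictionary of name-to-function mappings in modules, and the name evaluate is made up for the example.

```python
import pandas as pd
from sympy import Function, Symbol, lambdify

a_sym, b_sym = Symbol("a"), Symbol("b")
sum_func_sym = Function("sum_func")
expression = sum_func_sym(a_sym + b_sym)


def sum_func(series):
    # Plain Python implementation that the symbolic name is mapped to.
    return series.sum()


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# The dict tells lambdify what "sum_func" means; "numpy" handles everything else.
evaluate = lambdify(
    (a_sym, b_sym), expression, modules=[{"sum_func": sum_func}, "numpy"]
)
total = evaluate(df["a"], df["b"])
print(total)
```

Unlike subs, the Series never enter the symbolic expression; they are only passed to the generated function, which avoids the AttributeError from the question.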

How to pick the numeric columns in pd.Dataframe() [duplicate]

Let's say df is a pandas DataFrame.
I would like to find all columns of numeric type.
Something like:
isNumeric = is_numeric(df)
You could use the select_dtypes method of DataFrame. It takes two parameters, include and exclude. So isNumeric would look like:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)
Simple one-line answer to create a new dataframe with only numeric columns:
df.select_dtypes(include=np.number)
If you want the names of numeric columns:
df.select_dtypes(include=np.number).columns.tolist()
Complete code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': range(7, 10),
'B': np.random.rand(3),
'C': ['foo','bar','baz'],
'D': ['who','what','when']})
df
# A B C D
# 0 7 0.704021 foo who
# 1 8 0.264025 bar what
# 2 9 0.230671 baz when
df_numerics_only = df.select_dtypes(include=np.number)
df_numerics_only
# A B
# 0 7 0.704021
# 1 8 0.264025
# 2 9 0.230671
colnames_numerics_only = df.select_dtypes(include=np.number).columns.tolist()
colnames_numerics_only
# ['A', 'B']
You can use the undocumented function _get_numeric_data() to filter only numeric columns:
df._get_numeric_data()
Example:
In [32]: data
Out[32]:
A B
0 1 s
1 2 s
2 3 s
3 4 s
In [33]: data._get_numeric_data()
Out[33]:
A
0 1
1 2
2 3
3 4
Note that this is a "private method" (i.e., an implementation detail) and is subject to change or total removal in the future. Use with caution.
df.select_dtypes(exclude = ['object'])
Update:
df.select_dtypes(include= np.number)
or, with newer versions of pandas:
df.select_dtypes('number')
Simple one-liner:
df.select_dtypes('number').columns
The following code returns a list of the names of the numeric columns of a data set.
cnames=list(marketing_train.select_dtypes(exclude=['object']).columns)
Here marketing_train is my data set, select_dtypes() selects columns by data type using its include and exclude arguments, and columns fetches the column names of the result.
The output of the above code is:
['custAge',
'campaign',
'pdays',
'previous',
'emp.var.rate',
'cons.price.idx',
'cons.conf.idx',
'euribor3m',
'nr.employed',
'pmonths',
'pastEmail']
Here is another simple way to find the numeric columns in a pandas data frame:
numeric_clmns = df.dtypes[df.dtypes != "object"].index
We can include and exclude data types as per the requirement as below:
train.select_dtypes(include=None, exclude=None)
train.select_dtypes(include='number') #will include all the numeric types
From the select_dtypes docstring, as shown in Jupyter Notebook:
To select all numeric types, use np.number or 'number'.
To select strings you must use the object dtype, but note that this will return all object dtype columns.
See the NumPy dtype hierarchy: http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html
To select datetimes, use np.datetime64, 'datetime' or 'datetime64'.
To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'.
To select pandas categorical dtypes, use 'category'.
To select pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'.
Although this is an old subject, I think the following is easier than the other suggestions:
df[df.describe().columns]
By default, describe() only summarizes the numeric columns of a mixed-type DataFrame, so the columns of its output are the numeric ones.
Please see the code below:
if dataset.select_dtypes(include=[np.number]).shape[1] > 0:
    display(dataset.select_dtypes(include=[np.number]).describe())
# np.object was removed in NumPy 1.24; the built-in object selects the same columns
if dataset.select_dtypes(include=[object]).shape[1] > 0:
    display(dataset.select_dtypes(include=[object]).describe())
This way you can check whether the values are numeric (such as float and int) or strings. The second if statement checks for string values, which are stored with the object dtype.
Adapting this answer, you could do
df.loc[:, df.applymap(np.isreal).all(axis=0)]
Here, df.applymap(np.isreal) shows whether every cell in the data frame is numeric, and .all(axis=0) checks if all values in a column are True and returns a Series of Booleans that can be used to index the desired columns. (The original used df.ix, which was removed in pandas 1.0; df.loc does the same job here.)
A lot of the posted answers are inefficient. These answers either return/select a subset of the original dataframe (a needless copy) or perform needless computational statistics in the case of describe().
To just get the column names that are numeric, one can use a conditional list comprehension with the pd.api.types.is_numeric_dtype function:
numeric_cols = [col for col in df if pd.api.types.is_numeric_dtype(df[col])]
I'm not sure when this function was introduced.
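To make the comparison concrete, this minimal sketch runs the dtype-check comprehension next to select_dtypes on a small made-up frame and shows that both report the same column names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": range(7, 10),
    "B": np.random.rand(3),
    "C": ["foo", "bar", "baz"],
})

# Names only, decided from each column's dtype without building a sub-frame.
by_dtype_check = [col for col in df.columns if pd.api.types.is_numeric_dtype(df[col])]

# Same question answered through select_dtypes, which does build a sub-frame.
by_select = df.select_dtypes(include="number").columns.tolist()

print(by_dtype_check, by_select)
```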
import numpy as np
import pandas as pd
def is_type(df, baseType):
    test = [issubclass(np.dtype(d).type, baseType) for d in df.dtypes]
    return pd.DataFrame(data=test, index=df.columns, columns=["test"])
def is_float(df):
    # np.float was removed in NumPy 1.24; np.floating covers all float widths
    return is_type(df, np.floating)
def is_number(df):
    return is_type(df, np.number)
def is_integer(df):
    return is_type(df, np.integer)

How to row-wise concatenate several columns containing strings?

I have a specific series of datasets which come in the following general form:
import pandas as pd
import random
df = pd.DataFrame({'n': random.sample(range(1000), 3), 't0':['a', 'b', 'c'], 't1':['d','e','f'], 't2':['g','h','i'], 't3':['i','j', 'k']})
The number of tn columns (t0, t1, t2 ... tn) varies depending on the dataset, but is always <30.
My aim is to merge the content of the tn columns for each row so that I achieve this result (note that for readability I need to keep the whitespace between elements):
df['result'] = df.t0 +' '+df.t1+' '+df.t2+' '+ df.t3
So far so good. This code may be simple but it becomes clumsy and inflexible as soon as I receive another dataset, where the number of tn columns goes up. This is where my question comes in:
Is there any other syntax to merge the content across multiple columns? Something agnostic to the number columns, akin to:
df['result'] = ' '.join(df.ix[:,1:])
Basically, I want to achieve the same as the OP in the link below, but with whitespace between the strings:
Concatenate row-wise across specific columns of dataframe
The key to operating on columns (Series) of strings en masse is the Series.str accessor.
I can think of two .str methods to do what you want.
str.cat()
The first is str.cat. You have to start from a series, but you can pass a list of series (unfortunately you can't pass a dataframe) to concatenate with an optional separator. Using your example:
column_names = df.columns[1:] # skipping the first, numeric, column
series_list = [df[c] for c in column_names] # column_names already excludes 'n'
# concatenate:
df['result'] = series_list[0].str.cat(series_list[1:], sep=' ')
Or, in one line:
df['result'] = df[df.columns[1]].str.cat([df[c] for c in df.columns[2:]], sep=' ')
str.join()
The second is the .str.join() method, which works like the standard Python method str.join(), but for which you need to have a column (Series) of iterables, for example, a column of tuples, which we can get by applying tuple row-wise to a sub-dataframe of the columns you're interested in:
tuple_series = df[column_names].apply(tuple, axis=1)
df['result'] = tuple_series.str.join(' ')
Or, in one line:
df['result'] = df[df.columns[1:]].apply(tuple, axis=1).str.join(' ')
BTW, don't try the above with list instead of tuple. As of pandas-0.20.1, if the function passed into the Dataframe.apply() method returns a list and the returned list has the same number entries as the columns of the original (sub)dataframe, Dataframe.apply() returns a Dataframe instead of a Series.
Other than using apply to concatenate the strings, you can also use agg to do so.
df[df.columns[1:]].agg(' '.join, axis=1)
Out[118]:
0 a d g i
1 b e h j
2 c f i k
dtype: object
Here is a slightly alternative solution:
In [57]: df['result'] = df.filter(regex=r'^t').apply(lambda x: x.add(' ')).sum(axis=1).str.strip()
In [58]: df
Out[58]:
n t0 t1 t2 t3 result
0 92 a d g i a d g i
1 916 b e h j b e h j
2 363 c f i k c f i k
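Putting the pieces from this thread together, a column-count-agnostic version might look like the sketch below. The regular expression ^t\d+$ is an assumption about how the tn columns are named, and astype(str) is added only as a guard in case a cell is not already a string.

```python
import pandas as pd

df = pd.DataFrame({
    "n": [92, 916, 363],
    "t0": ["a", "b", "c"],
    "t1": ["d", "e", "f"],
    "t2": ["g", "h", "i"],
    "t3": ["i", "j", "k"],
})

# Pick up t0, t1, ... however many there are (the pattern is an assumption about the naming).
t_columns = df.filter(regex=r"^t\d+$")

# astype(str) guards against a stray non-string cell; join each row with single spaces.
df["result"] = t_columns.astype(str).agg(" ".join, axis=1)
print(df["result"].tolist())
```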

Creating an empty Pandas DataFrame, and then filling it

I'm starting from the pandas DataFrame documentation here: Introduction to data structures
I'd like to iteratively fill the DataFrame with values in a time series kind of calculation. I'd like to initialize the DataFrame with columns A, B, and timestamp rows, all 0 or all NaN.
I'd then add initial values and go over this data calculating the new row from the row before, say row[A][t] = row[A][t-1]+1 or so.
I'm currently using the code as below, but I feel it's kind of ugly and there must be a way to do this with a DataFrame directly or just a better way in general.
Note: I'm using Python 2.7.
import datetime as dt
import pandas as pd
import numpy as np
if __name__ == '__main__':
    base = dt.datetime.today().date()
    dates = [ base - dt.timedelta(days=x) for x in range(0,10) ]
    dates.sort()
    valdict = {}
    symbols = ['A','B', 'C']
    for symb in symbols:
        valdict[symb] = pd.Series( np.zeros( len(dates)), dates )
    for thedate in dates:
        if thedate > dates[0]:
            for symb in valdict:
                valdict[symb][thedate] = 1+valdict[symb][thedate - dt.timedelta(days=1)]
    print(valdict)
NEVER grow a DataFrame row-wise!
TLDR; (just read the bold text)
Most answers here will tell you how to create an empty DataFrame and fill it out, but no one will tell you that it is a bad thing to do.
Here is my advice: Accumulate data in a list, not a DataFrame.
Use a list to collect your data, then initialise a DataFrame when you are ready. Either a list-of-lists or list-of-dicts format will work, pd.DataFrame accepts both.
data = []
for row in some_function_that_yields_data():
    data.append(row)
df = pd.DataFrame(data)
pd.DataFrame converts the list of rows (where each row is a scalar value) into a DataFrame. If your function yields DataFrames instead, call pd.concat.
Pros of this approach:
It is always cheaper to append to a list and create a DataFrame in one go than it is to create an empty DataFrame (or one of NaNs) and append to it over and over again.
Lists also take up less memory and are a much lighter data structure to work with, append, and remove (if needed).
dtypes are automatically inferred (rather than assigning object to all of them).
A RangeIndex is automatically created for your data, instead of you having to take care to assign the correct index to the row you are appending at each iteration.
If you aren't convinced yet, this is also mentioned in the documentation:
Iteratively appending rows to a DataFrame can be more computationally
intensive than a single concatenate. A better solution is to append
those rows to a list and then concatenate the list with the original
DataFrame all at once.
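A runnable version of the accumulate-then-construct pattern is sketched below. The generator some_function_that_yields_data is a hypothetical stand-in for whatever produces the rows, and its contents are made up.

```python
import pandas as pd


def some_function_that_yields_data():
    # Hypothetical data source standing in for whatever produces the rows.
    for i in range(5):
        yield {"A": i, "B": i * 1.5, "C": f"row{i}"}


rows = []
for row in some_function_that_yields_data():
    rows.append(row)  # cheap: a list append, no DataFrame involved yet

df = pd.DataFrame(rows)  # one construction at the end; dtypes are inferred per column
print(df.dtypes)
```

Because the frame is built in one go, column A comes out as an integer column and B as a float column without any later fix-up.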
*** Update for pandas >= 1.4: append is now DEPRECATED! ***
As of pandas 1.4, append has now been deprecated! Use pd.concat instead. See the release notes
These options are horrible
append or concat inside a loop
Here is the biggest mistake I've seen from beginners:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True) # yuck
    # or similarly,
    # df = pd.concat([df, pd.DataFrame([{'A': a, 'B': b, 'C': c}])], ignore_index=True)
Memory is re-allocated for every append or concat operation you have. Couple this with a loop and you have a quadratic complexity operation.
The other mistake associated with df.append is that users tend to forget append is not an in-place function, so the result must be assigned back. You also have to worry about the dtypes:
df = pd.DataFrame(columns=['A', 'B', 'C'])
df = df.append({'A': 1, 'B': 12.3, 'C': 'xyz'}, ignore_index=True)
df.dtypes
A object # yuck!
B float64
C object
dtype: object
Dealing with object columns is never a good thing, because pandas cannot vectorize operations on those columns. You will need to do this to fix it:
df.infer_objects().dtypes
A int64
B float64
C object
dtype: object
loc inside a loop
I have also seen loc used to append to a DataFrame that was created empty:
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
As before, you have not pre-allocated the amount of memory you need each time, so the memory is re-grown each time you create a new row. It's just as bad as append, and even more ugly.
Empty DataFrame of NaNs
And then, there's creating a DataFrame of NaNs, and all the caveats associated therewith.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
df
A B C
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
It creates a DataFrame of object columns, like the others.
df.dtypes
A object # you DON'T want this
B object
C object
dtype: object
Appending still has all the issues as the methods above.
for i, (a, b, c) in enumerate(some_function_that_yields_data()):
    df.iloc[i] = [a, b, c]
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
Here's a couple of suggestions:
Use date_range for the index:
import datetime
import pandas as pd
import numpy as np
todays_date = datetime.datetime.now().date()
index = pd.date_range(todays_date-datetime.timedelta(10), periods=10, freq='D')
columns = ['A','B', 'C']
Note: we could create an empty DataFrame (with NaNs) simply by writing:
df_ = pd.DataFrame(index=index, columns=columns)
df_ = df_.fillna(0) # With 0s rather than NaNs
To do these type of calculations for the data, use a NumPy array:
data = np.array([np.arange(10)]*3).T
Hence we can create the DataFrame:
In [10]: df = pd.DataFrame(data, index=index, columns=columns)
In [11]: df
Out[11]:
A B C
2012-11-29 0 0 0
2012-11-30 1 1 1
2012-12-01 2 2 2
2012-12-02 3 3 3
2012-12-03 4 4 4
2012-12-04 5 5 5
2012-12-05 6 6 6
2012-12-06 7 7 7
2012-12-07 8 8 8
2012-12-08 9 9 9
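For the kind of recurrence mentioned in the question (each row computed from the previous one), the NumPy suggestion above could be sketched as follows. The "+ 1" rule and the fixed start date are placeholders, not the asker's real calculation.

```python
import numpy as np
import pandas as pd

index = pd.date_range("2012-11-29", periods=10, freq="D")
columns = ["A", "B", "C"]

# Pre-allocate the whole block once, then fill it row by row in NumPy.
data = np.zeros((len(index), len(columns)))
for t in range(1, len(index)):
    data[t] = data[t - 1] + 1  # stand-in recurrence: new row = previous row + 1

df = pd.DataFrame(data, index=index, columns=columns)
print(df.tail(3))
```

The loop touches only the NumPy array, and the DataFrame is created once at the end.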
If you simply want to create an empty data frame and fill it with some incoming data frames later, try this:
newDF = pd.DataFrame() # creates a new dataframe that's empty
newDF = newDF.append(oldDF, ignore_index = True) # ignoring index is optional
# try printing some data from newDF
print(newDF.head()) # again optional
In this example I am using this pandas doc to create a new data frame and then using append to write to the newDF with data from oldDF.
If I have to keep appending new data into this newDF from more than
one oldDFs, I just use a for loop to iterate over
pandas.DataFrame.append()
Note: append() has been deprecated since pandas 1.4.0 and was removed in pandas 2.0. Use pd.concat() instead.
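Following that note, the same "collect several incoming frames" task can be written with pd.concat. This is a minimal sketch in which old_frames is a made-up list standing in for the incoming oldDF objects.

```python
import pandas as pd

# Hypothetical incoming frames standing in for the "oldDF" objects above.
old_frames = [
    pd.DataFrame({"A": [1, 2], "B": [3, 4]}),
    pd.DataFrame({"A": [5], "B": [6]}),
    pd.DataFrame({"A": [7, 8], "B": [9, 10]}),
]

pieces = []
for old_df in old_frames:
    pieces.append(old_df)  # collect; do not concatenate inside the loop

new_df = pd.concat(pieces, ignore_index=True)  # one concatenation at the end
print(new_df)
```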
Initialize empty frame with column names
import pandas as pd
col_names = ['A', 'B', 'C']
my_df = pd.DataFrame(columns = col_names)
my_df
Add a new record to a frame
my_df.loc[len(my_df)] = [2, 4, 5]
You also might want to pass a dictionary:
my_dic = {'A':2, 'B':4, 'C':5}
my_df.loc[len(my_df)] = my_dic
Append another frame to your existing frame
col_names = ['A', 'B', 'C']
my_df2 = pd.DataFrame(columns = col_names)
my_df = my_df.append(my_df2)
Performance considerations
If you are adding rows inside a loop, consider the performance cost. For roughly the first 1000 records, "my_df.loc" performs better, but it gradually becomes slower as the number of records in the loop grows.
If you plan to do this inside a big loop (say 10M records or so), you are better off using a mixture of the two:
fill a temporary dataframe with iloc until its size reaches around 1000, then append it to the original dataframe and empty the temporary dataframe.
This would boost your performance by around 10 times.
Simply:
import numpy as np
import pandas as pd
df=pd.DataFrame(np.zeros([rows,columns]))
Then fill it.
Assume a dataframe with 19 rows
index=range(0,19)
index
columns=['A']
test = pd.DataFrame(index=index, columns=columns)
Keeping Column A as a constant
test['A']=10
Keeping column b as a variable given by a loop
for x in range(0,19):
    test.loc[[x], 'b'] = pd.Series([x], index = [x])
You can replace the first x in pd.Series([x], index = [x]) with any value
This is my way to make a dynamic dataframe from several lists with a loop
x = [1,2,3,4,5,6,7,8]
y = [22,12,34,22,65,24,12,11]
z = ['as','ss','wa', 'ss','er','fd','ga','mf']
names = ['Bob', 'Liz', 'chop']
The loop:
def dataF(x,y,z,names):
    res = []
    for t in zip(x,y,z):
        res.append(t)
    return pd.DataFrame(res,columns=names)
Result
dataF(x,y,z,names)
# import pandas library
import pandas as pd
# create a dataframe
my_df = pd.DataFrame({"A": ["shirt"], "B": [1200]})
# show the dataframe
print(my_df)
