Create Dataframe based on one column Pandas - python

Apologies if this is a repeat question, I searched SO for awhile and, as simple as a question that it is, I couldn't find a similar one. I am looking to simply create one data frame (5x3 in my case) based off of one column in my Pandas dataframe. I've tried both pd.DataFrame and pd.concat and neither have seemed to work. Example below:
df = pd.DataFrame([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15])
#using pd.DataFrame
table_data = {'Column1': df.iloc[0:5,0],
'Column2': df.iloc[5:10,0],
'Column3': df.iloc[10:15,0]}
pd.DataFrame(table_data)
#different method using pd.DataFrame
pd.DataFrame([df.iloc[0:5,0],
df.iloc[5:10,0],
df.iloc[10:15,0]],
columns = ['Column1', 'Column2', 'Column3'])
#using pd.concat
pd.concat([df.iloc[0:5,0], df.iloc[5:10,0], df.iloc[10:15,0]],
axis=1, keys=['Column1', 'Column2', 'Column3'])
Note that my actual starting data frame has more than just 1 column. The issues seem to be happening when I use indexing as opposed to simply hard coding the numbers that should be in each column. This seems like such a simple thing to do yet I can't seem to find anywhere how to solve it. Any help appreciated.

Like this:
In [591]: import numpy as np
In [585]: d = pd.DataFrame()
In [553]: df_split = np.array_split(df, 3)  ## Split df into 3 equal parts of 5 rows each
In [586]: for i in df_split:
     ...:     d = pd.concat([d, i.reset_index(drop=True)], axis=1)
     ...:
In [588]: d.columns = ['Col1', 'Col2', 'Col3']
In [589]: d
Out[589]:
Col1 Col2 Col3
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
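The dict and concat attempts in the question misbehave because pandas aligns Series on their index labels: the three slices carry labels 0 to 4, 5 to 9 and 10 to 14, so combining them gives 15 rows padded with NaN. A minimal sketch that sidesteps alignment by reshaping the raw values instead (column names are taken from the question; `order="F"` is my choice here so that the result is filled column by column):

```python
import pandas as pd

df = pd.DataFrame([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])

# Drop the index entirely by working on the raw values of the first column.
values = df.iloc[:, 0].to_numpy()

# order="F" fills column by column, so each run of 5 values becomes one column.
table = pd.DataFrame(values.reshape(5, 3, order="F"),
                     columns=["Column1", "Column2", "Column3"])
print(table)
```

This assumes the column length is an exact multiple of the number of rows wanted; otherwise `reshape` raises a ValueError.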

Related

pandas combine nested dataframes into one single dataframe

I have a dataframe, where in one column (we'll call it info) all the cells/rows contain another dataframe inside. I want to loop through all the rows in this column and literally stack the nested dataframes on top of each other, because they all have the same columns
How would I go about this?
You could try as follows:
import pandas as pd
length=5
# some dfs
nested_dfs = [pd.DataFrame({'a': [*range(length)],
'b': [*range(length)]}) for x in range(length)]
print(nested_dfs[0])
a b
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
# df with nested_dfs in info
df = pd.DataFrame({'info_col': nested_dfs})
# code to be implemented
lst_dfs = df['info_col'].values.tolist()
df_final = pd.concat(lst_dfs,axis=0, ignore_index=True)
df_final.tail()
a b
20 0 0
21 1 1
22 2 2
23 3 3
24 4 4
This method should be a bit faster than the solution offered by nandoquintana, which also works.
Incidentally, it is ill advised to name a df column info. This is because df.info is actually a function. E.g., normally df['col_name'].values.tolist() can also be written as df.col_name.values.tolist(). However, if you try this with df.info.values.tolist(), you will run into an error:
AttributeError: 'function' object has no attribute 'values'
You also risk overwriting the method if you assign to the column using attribute access, which is probably not what you want. E.g.:
print(type(df.info))
<class 'method'>
df.info=1
# column is unaffected, you just create an int variable
print(type(df.info))
<class 'int'>
# but:
df['info']=1
# your column now has all 1's
print(type(df['info']))
<class 'pandas.core.series.Series'>
This is the solution that I came up with, although it's not the fastest, which is why I am still leaving the question unanswered:
df1 = pd.DataFrame()
for frame in df['Info'].tolist():
    df1 = pd.concat([df1, frame], axis=0).reset_index(drop=True)
Our dataframe has three columns (col1, col2 and info).
In info, each row has a nested df as value.
import pandas as pd
nested_d1 = {'coln1': [11, 12], 'coln2': [13, 14]}
nested_df1 = pd.DataFrame(data=nested_d1)
nested_d2 = {'coln1': [15, 16], 'coln2': [17, 18]}
nested_df2 = pd.DataFrame(data=nested_d2)
d = {'col1': [1, 2], 'col2': [3, 4], 'info': [nested_df1, nested_df2]}
df = pd.DataFrame(data=d)
We could combine all nested dfs rows appending them to a list (as nested dfs schema is constant) and concatenating them later.
nested_dfs = []
for index, row in df.iterrows():
    nested_dfs.append(row['info'])
result = pd.concat(nested_dfs, sort=False).reset_index(drop=True)
print(result)
This would be the result:
coln1 coln2
0 11 13
1 12 14
2 15 17
3 16 18
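If it matters which outer row each stacked block came from, `pd.concat` can label the blocks through its `keys` and `names` parameters. A small sketch reusing the frames above (the level names `outer_row` and `inner_row` are made up for illustration):

```python
import pandas as pd

nested_df1 = pd.DataFrame({'coln1': [11, 12], 'coln2': [13, 14]})
nested_df2 = pd.DataFrame({'coln1': [15, 16], 'coln2': [17, 18]})
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
                   'info': [nested_df1, nested_df2]})

# keys= labels each block with the index of the outer row it came from;
# reset_index() then turns both index levels into ordinary columns.
result = (
    pd.concat(df['info'].tolist(), keys=df.index.tolist(),
              names=['outer_row', 'inner_row'])
    .reset_index()
)
print(result)
```

The `outer_row` column can then be used to join back to `col1` and `col2` if needed.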

Python : how to 'concatenate' 2 pd.DataFrame columns ? two columns into one

I have trouble with some pandas dataframes.
It's very simple: I have 4 columns, and I want to reshape them into 2...
For 'practical' reasons, I don't want to use 'header names', but I need to use 'index' (for the columns header names).
I have :
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
I want as a result :
df_res = pd.DataFrame({'NewName1': [1,2,3,4,5,6],'NewName2': [7,8,9,10,11,12]})
(in fact NewName1 doesn't matter, it can stay a or whatever the name...)
I tried with for loops, append, concat, but couldn't figure it out...
Any suggestions ?
Thanks for your help !
Bina
You can extract the desired columns and create a new pandas.DataFrame like so:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
first_col = np.concatenate((df.a.to_numpy(), df.b.to_numpy()))
second_col = np.concatenate((df.c.to_numpy(), df.d.to_numpy()))
df2 = pd.DataFrame({"NewName1": first_col, "NewName2": second_col})
>>> df2
NewName1 NewName2
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
This is probably not the most elegant solution, but I would isolate the two dataframes and then concatenate them. I needed to rename the column axis so that the four columns could be aligned correctly.
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
af = df[['a', 'c']]
bf = df[['b', 'd']]
frames = (
    af.rename({'a': 'NewName1', 'c': 'NewName2'}, axis=1),
    bf.rename({'b': 'NewName1', 'd': 'NewName2'}, axis=1)
)
out = pd.concat(frames)
[EDIT] Replying to the comment.
I'm not that familiar with indexing but this might be one solution. You could avoid column names by using .iloc. Replace the af, and bf frames above with these lines.
af = df.iloc[:, ::2]
bf = df.iloc[:, 1::2]
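Putting the positional idea together end to end, here is a sketch that never refers to the original header names. The position lists `[0, 2]` and `[1, 3]` and the output names are assumptions matching the 4-column example in the question:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6],
                   'c': [7, 8, 9], 'd': [10, 11, 12]})

new_names = ['NewName1', 'NewName2']
parts = []
for positions in ([0, 2], [1, 3]):
    # Select columns by position, then give every part the same labels
    # so that concat stacks them instead of aligning on different names.
    part = df.iloc[:, positions].copy()
    part.columns = new_names
    parts.append(part)

out = pd.concat(parts, ignore_index=True)
print(out)
```

Relabelling before `pd.concat` is the important step: without it, the four differently named columns would be kept side by side and padded with NaN.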

How to append comments as a row in pandas dataframe [duplicate]

How do I create an empty DataFrame, then add rows, one by one?
I created an empty DataFrame:
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
Then I can add a new row at the end and fill a single field with:
df = df._set_value(index=len(df), col='qty1', value=10.0)
It works for only one field at a time. What is a better way to add new row to df?
You can use df.loc[i], where the row with index i will be what you specify it to be in the dataframe.
>>> import pandas as pd
>>> from numpy.random import randint
>>> df = pd.DataFrame(columns=['lib', 'qty1', 'qty2'])
>>> for i in range(5):
...     df.loc[i] = ['name' + str(i)] + list(randint(10, size=2))
...
>>> df
lib qty1 qty2
0 name0 3 3
1 name1 2 4
2 name2 2 8
3 name3 2 1
4 name4 9 6
In case you can get all data for the data frame upfront, there is a much faster approach than appending to a data frame:
Create a list of dictionaries in which each dictionary corresponds to an input data row.
Create a data frame from this list.
I had a similar task for which appending to a data frame row by row took 30 min, and creating a data frame from a list of dictionaries completed within seconds.
rows_list = []
for row in input_rows:
    dict1 = {}
    # get input row in dictionary format
    # key = col_name
    dict1.update(blah..)
    rows_list.append(dict1)
df = pd.DataFrame(rows_list)
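A runnable version of the list-of-dictionaries pattern might look like the sketch below. The contents of `input_rows` are invented for illustration; the column names follow the question:

```python
import pandas as pd

# Hypothetical input: each item is one raw record to be turned into a row.
input_rows = [("name0", 3, 3), ("name1", 2, 4), ("name2", 2, 8)]

rows_list = []
for lib, qty1, qty2 in input_rows:
    # One dictionary per row; the keys are the column names.
    rows_list.append({"lib": lib, "qty1": qty1, "qty2": qty2})

# Build the DataFrame once, after all rows have been collected.
df = pd.DataFrame(rows_list)
print(df)
```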
In the case of adding a lot of rows to dataframe, I am interested in performance. So I tried the four most popular methods and checked their speed.
Performance
Using .append (NPE's answer)
Using .loc (fred's answer)
Using .loc with preallocating (FooBar's answer)
Using dict and create DataFrame in the end (ShikharDua's answer)
Runtime results (in seconds):
Approach               1000 rows  5000 rows  10 000 rows
.append                0.69       3.39       6.78
.loc without prealloc  0.74       3.90       8.35
.loc with prealloc     0.24       2.58       8.70
dict                   0.012      0.046      0.084
So I use addition through the dictionary for myself.
Code:
import pandas as pd
import numpy as np
import time
# del df1, df2, df3, df4  # only needed when re-running; raises NameError in a fresh session
numOfRows = 1000
# append
startTime = time.perf_counter()
df1 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows-4):
    df1 = df1.append( dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']), ignore_index=True)
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df1.shape)
# .loc w/o prealloc
startTime = time.perf_counter()
df2 = pd.DataFrame(np.random.randint(100, size=(5,5)), columns=['A', 'B', 'C', 'D', 'E'])
for i in range( 1,numOfRows):
    df2.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df2.shape)
# .loc with prealloc
df3 = pd.DataFrame(index=np.arange(0, numOfRows), columns=['A', 'B', 'C', 'D', 'E'] )
startTime = time.perf_counter()
for i in range( 1,numOfRows):
    df3.loc[i] = np.random.randint(100, size=(1,5))[0]
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df3.shape)
# dict
startTime = time.perf_counter()
row_list = []
for i in range (0,5):
    row_list.append(dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E']))
for i in range( 1,numOfRows-4):
    dict1 = dict( (a,np.random.randint(100)) for a in ['A','B','C','D','E'])
    row_list.append(dict1)
df4 = pd.DataFrame(row_list, columns=['A','B','C','D','E'])
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numOfRows))
print(df4.shape)
P.S.: I believe my realization isn't perfect, and maybe there is some optimization that could be done.
You could use pandas.concat(). For details and examples, see Merge, join, and concatenate.
For example:
def append_row(df, row):
    return pd.concat([
        df,
        pd.DataFrame([row], columns=row.index)]
    ).reset_index(drop=True)
df = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
new_row = pd.Series({'lib':'A', 'qty1':1, 'qty2': 2})
df = append_row(df, new_row)
NEVER grow a DataFrame!
Yes, people have already explained that you should NEVER grow a DataFrame, and that you should append your data to a list and convert it to a DataFrame once at the end. But do you understand why?
Here are the most important reasons, taken from my post here.
It is always cheaper/faster to append to a list and create a DataFrame in one go.
Lists take up less memory and are a much lighter data structure to work with, append, and remove.
dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make them object, which is bad.
An index is automatically created for you, instead of you having to take care to assign the correct index to the row you are appending.
This is The Right Way™ to accumulate your data
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])
df = pd.DataFrame(data, columns=['A', 'B', 'C'])
These options are horrible
append or concat inside a loop
append and concat aren't inherently bad in isolation. The
problem starts when you iteratively call them inside a loop - this
results in quadratic memory usage.
# Creates empty DataFrame and appends
df = pd.DataFrame(columns=['A', 'B', 'C'])
for a, b, c in some_function_that_yields_data():
    df = df.append({'A': a, 'B': b, 'C': c}, ignore_index=True)
    # This is equally bad:
    # df = pd.concat(
    #     [df, pd.Series({'A': a, 'B': b, 'C': c})],
# ignore_index=True)
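One way to read the quadratic claim: every append or concat call builds a new DataFrame, which means copying all rows accumulated so far. The sketch below is a back-of-the-envelope cost model, not a measurement of pandas itself, and the function name is made up for illustration:

```python
def rows_copied_when_growing(n_rows):
    """Count rows copied if every step rebuilds the frame from scratch."""
    copied = 0
    current_size = 0
    for _ in range(n_rows):
        current_size += 1
        # The old rows plus the new one all land in a freshly built frame.
        copied += current_size
    return copied


for n in (1_000, 5_000, 10_000):
    print(n, rows_copied_when_growing(n))
```

Under this model the total is n(n+1)/2, so ten times as many rows costs roughly a hundred times as much copying, whereas collecting rows in a list and building the DataFrame once handles each row a single time.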
Empty DataFrame of NaNs
Never create a DataFrame of NaNs as the columns are initialized with
object (slow, un-vectorizable dtype).
# Creates DataFrame of NaNs and overwrites values.
df = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
for a, b, c in some_function_that_yields_data():
    df.loc[len(df)] = [a, b, c]
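The dtype point can be checked directly. A small sketch comparing a preallocated frame with one built in a single step (the sample values are invented):

```python
import pandas as pd

# Preallocated frame: there is no data to infer from, so every column is object.
nan_frame = pd.DataFrame(columns=['A', 'B', 'C'], index=range(5))
print(nan_frame.dtypes)

# Built in one go from collected rows: dtypes are inferred per column.
data = [[1, 2.5, 'x'], [2, 3.5, 'y']]
built = pd.DataFrame(data, columns=['A', 'B', 'C'])
print(built.dtypes)
```

Here column A comes out as an integer dtype and column B as a float dtype, while all three columns of the preallocated frame are object.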
The Proof is in the Pudding
Timing these methods is the fastest way to see just how much they differ in terms of their memory and utility.
Benchmarking code for reference.
It's posts like this that remind me why I'm a part of this community. People understand the importance of teaching folks getting the right answer with the right code, not the right answer with wrong code. Now you might argue that it is not an issue to use loc or append if you're only adding a single row to your DataFrame. However, people often look to this question to add more than just one row - often the requirement is to iteratively add a row inside a loop using data that comes from a function (see related question). In that case it is important to understand that iteratively growing a DataFrame is not a good idea.
If you know the number of entries ex ante, you should preallocate the space by also providing the index (taking the data example from a different answer):
import pandas as pd
import numpy as np
# we know we're gonna have 5 rows of data
numberOfRows = 5
# create dataframe
df = pd.DataFrame(index=np.arange(0, numberOfRows), columns=('lib', 'qty1', 'qty2') )
# now fill it up row by row
for x in np.arange(0, numberOfRows):
    #loc or iloc both work here since the index is natural numbers
    df.loc[x] = [np.random.randint(-1,1) for n in range(3)]
In[23]: df
Out[23]:
lib qty1 qty2
0 -1 -1 -1
1 0 0 0
2 -1 0 -1
3 0 -1 0
4 -1 0 0
Speed comparison
In[30]: %timeit tryThis() # function wrapper for this answer
1000 loops, best of 3: 1.23 ms per loop
In[31]: %timeit tryOther() # function wrapper without index (see, for example, @fred)
100 loops, best of 3: 2.31 ms per loop
And - as from the comments - with a size of 6000, the speed difference becomes even larger:
Increasing the size of the array (12) and the number of rows (500) makes
the speed difference more striking: 313ms vs 2.29s
mycolumns = ['A', 'B']
df = pd.DataFrame(columns=mycolumns)
rows = [[1,2],[3,4],[5,6]]
for row in rows:
    df.loc[len(df)] = row
You can append a single row as a dictionary using the ignore_index option.
>>> f = pandas.DataFrame(data = {'Animal':['cow','horse'], 'Color':['blue', 'red']})
>>> f
Animal Color
0 cow blue
1 horse red
>>> f.append({'Animal':'mouse', 'Color':'black'}, ignore_index=True)
Animal Color
0 cow blue
1 horse red
2 mouse black
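DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions the call above raises an AttributeError. A sketch of the same single-row append written with pd.concat, reusing the example data:

```python
import pandas as pd

f = pd.DataFrame({'Animal': ['cow', 'horse'], 'Color': ['blue', 'red']})

# Wrap the dictionary in a list so that it becomes a one-row DataFrame.
new_row = pd.DataFrame([{'Animal': 'mouse', 'Color': 'black'}])

# ignore_index=True renumbers the result 0..n-1, as append(..., ignore_index=True) did.
f = pd.concat([f, new_row], ignore_index=True)
print(f)
```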
For efficient appending, see How to add an extra row to a pandas dataframe and Setting With Enlargement.
Add rows through loc/ix on non existing key index data. For example:
In [1]: se = pd.Series([1,2,3])
In [2]: se
Out[2]:
0 1
1 2
2 3
dtype: int64
In [3]: se[5] = 5.
In [4]: se
Out[4]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
Or:
In [1]: dfi = pd.DataFrame(np.arange(6).reshape(3,2),
.....: columns=['A','B'])
.....:
In [2]: dfi
Out[2]:
A B
0 0 1
1 2 3
2 4 5
In [3]: dfi.loc[:,'C'] = dfi.loc[:,'A']
In [4]: dfi
Out[4]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [5]: dfi.loc[3] = 5
In [6]: dfi
Out[6]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
For the sake of a Pythonic way:
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
res = res.append([{'qty1':10.0}], ignore_index=True)
print(res.head())
lib qty1 qty2
0 NaN 10.0 NaN
You can also build up a list of lists and convert it to a dataframe -
import pandas as pd
columns = ['i','double','square']
rows = []
for i in range(6):
    row = [i, i*2, i*i]
    rows.append(row)
df = pd.DataFrame(rows, columns=columns)
giving
i double square
0 0 0 0
1 1 2 1
2 2 4 4
3 3 6 9
4 4 8 16
5 5 10 25
If you always want to add a new row at the end, use this:
df.loc[len(df)] = ['name5', 9, 0]
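One caveat worth illustrating: df.loc[len(df)] only appends when the label len(df) is not already in the index, which holds for the default 0..n-1 index but not in general. A sketch with an invented frame whose labels start at 1:

```python
import pandas as pd

# Index labels start at 1, so the label len(df) == 3 is already taken.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[1, 2, 3])
df.loc[len(df)] = [99, 100]   # overwrites the row labelled 3 instead of appending
print(len(df))

# A label-independent alternative: concatenate a one-row frame.
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, index=[1, 2, 3])
new_row = pd.DataFrame([[99, 100]], columns=df2.columns)
df2 = pd.concat([df2, new_row], ignore_index=True)
print(len(df2))
```

The same silent overwrite can happen after rows have been dropped or filtered, unless the index is reset first.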
I figured out a simple and nice way:
>>> df
A B C
one 1 2 3
>>> df.loc["two"] = [4,5,6]
>>> df
A B C
one 1 2 3
two 4 5 6
Note the caveat with performance as noted in the comments.
This is not an answer to the OP question, but a toy example to illustrate ShikharDua's answer which I found very useful.
While this fragment is trivial, in the actual data I had 1,000s of rows, and many columns, and I wished to be able to group by different columns and then perform the statistics below for more than one target column. So having a reliable method for building the data frame one row at a time was a great convenience. Thank you ShikharDua!
import pandas as pd
BaseData = pd.DataFrame({ 'Customer' : ['Acme','Mega','Acme','Acme','Mega','Acme'],
'Territory' : ['West','East','South','West','East','South'],
'Product' : ['Econ','Luxe','Econ','Std','Std','Econ']})
BaseData
columns = ['Customer','Num Unique Products', 'List Unique Products']
rows_list=[]
for name, group in BaseData.groupby('Customer'):
    RecordtoAdd={} #initialise an empty dict
    RecordtoAdd.update({'Customer' : name}) #
    RecordtoAdd.update({'Num Unique Products' : len(pd.unique(group['Product']))})
    RecordtoAdd.update({'List Unique Products' : pd.unique(group['Product'])})
    rows_list.append(RecordtoAdd)
AnalysedData = pd.DataFrame(rows_list)
print('Base Data : \n',BaseData,'\n\n Analysed Data : \n',AnalysedData)
You can use a generator object to create a DataFrame, which is more memory efficient than a list.
num = 10
# Generator function to generate generator object
def numgen_func(num):
    for i in range(num):
        yield ('name_{}'.format(i), (i*i), (i*i*i))
# Generator expression to generate generator object (it can be consumed only once and cannot be reused)
numgen_expression = (('name_{}'.format(i), (i*i), (i*i*i)) for i in range(num) )
df = pd.DataFrame(data=numgen_func(num), columns=('lib', 'qty1', 'qty2'))
To add a row to an existing DataFrame, you can use the append method.
df = df.append([{ 'lib': "name_20", 'qty1': 20, 'qty2': 400 }])
Instead of a list of dictionaries as in ShikharDua's answer (row-based), we can also represent our table as a dictionary of lists (column-based), where each list stores one column in row-order, given we know our columns beforehand. At the end we construct our DataFrame once.
In both cases, the dictionary keys are always the column names. Row order is stored implicitly as order in a list. For c columns and n rows, this uses one dictionary of c lists, versus one list of n dictionaries. The list-of-dictionaries method has each dictionary storing all keys redundantly and requires creating a new dictionary for every row. Here we only append to lists, which overall is the same time complexity (adding entries to list and dictionary are both amortized constant time) but may have less overhead due to being a simple operation.
# Current data
data = {"Animal":["cow", "horse"], "Color":["blue", "red"]}
# Adding a new row (be careful to ensure every column gets another value)
data["Animal"].append("mouse")
data["Color"].append("black")
# At the end, construct our DataFrame
df = pd.DataFrame(data)
# Animal Color
# 0 cow blue
# 1 horse red
# 2 mouse black
Create a new record (data frame) and add to old_data_frame.
Pass a list of values and the corresponding column names to create a new_record (data_frame):
new_record = pd.DataFrame([[0, 'abcd', 0, 1, 123]], columns=['a', 'b', 'c', 'd', 'e'])
old_data_frame = pd.concat([old_data_frame, new_record])
Here is a way to add a row to a Pandas DataFrame (the new row ends up first, at index 0):
def add_row(df, row):
    df.loc[-1] = row
    df.index = df.index + 1
    return df.sort_index()
add_row(df, [1,2,3])
It can be used to insert/append a row in an empty or populated Pandas DataFrame.
If you want to add a row at the end, append it as a list:
valuestoappend = [val1, val2, val3]
res = res.append(pd.Series(valuestoappend, index = ['lib', 'qty1', 'qty2']), ignore_index = True)
Another way to do it (probably not very performant):
# add a row
def add_row(df, row):
    colnames = list(df.columns)
    ncol = len(colnames)
    assert ncol == len(row), "Length of row must be the same as width of DataFrame: %s" % row
    return df.append(pd.DataFrame([row], columns=colnames))
You can also enhance the DataFrame class like this:
import pandas as pd
def add_row(self, row):
    self.loc[len(self.index)] = row
pd.DataFrame.add_row = add_row
All you need is loc[df.shape[0]] or loc[len(df)]
# Assuming your df has 4 columns (str, int, str, bool)
df.loc[df.shape[0]] = ['col1Value', 100, 'col3Value', False]
or
df.loc[len(df)] = ['col1Value', 100, 'col3Value', False]
You can concatenate two DataFrames for this. I basically came across this problem to add a new row to an existing DataFrame with a character index (not numeric).
So, I put the data for the new row in a dict and the index in a list.
new_dict = {put input for new row here}
new_list = [put your index here]
new_df = pd.DataFrame(data=new_dict, index=new_list)
df = pd.concat([existing_df, new_df])
initial_data = {'lib': np.array([1,2,3,4]), 'qty1': [1,2,3,4], 'qty2': [1,2,3,4]}
df = pd.DataFrame(initial_data)
df
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
val_1 = [10]
val_2 = [14]
val_3 = [20]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
You can use a for loop to iterate through values or can add arrays of values.
val_1 = [10, 11, 12, 13]
val_2 = [14, 15, 16, 17]
val_3 = [20, 21, 22, 43]
df.append(pd.DataFrame({'lib': val_1, 'qty1': val_2, 'qty2': val_3}))
lib qty1 qty2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
0 10 14 20
1 11 15 21
2 12 16 22
3 13 17 43
Make it simple. By taking a list as input which will be appended as a row in the data-frame:
import pandas as pd
res = pd.DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    res_list = list(map(int, input().split()))
    res = res.append(pd.Series(res_list, index=['lib', 'qty1', 'qty2']), ignore_index=True)
pandas.DataFrame.append
DataFrame.append(self, other, ignore_index=False, verify_integrity=False, sort=False) → 'DataFrame'
Code
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
df.append(df2)
With ignore_index set to True:
df.append(df2, ignore_index=True)
If you have a data frame df and want to add a list new_list as a new row to df, you can simply do:
df.loc[len(df)] = new_list
If you want to add a new data frame new_df under data frame df, then you can use:
df.append(new_df)
We often see the construct df.loc[subscript] = … to assign to one DataFrame row. Mikhail_Sam posted benchmarks containing, among others, this construct as well as the method using dict and create DataFrame in the end. He found the latter to be the fastest by far.
But if we replace the df3.loc[i] = … (with preallocated DataFrame) in his code with df3.values[i] = …, the outcome changes significantly: that method then performs similarly to the one using dict. So we should more often consider using df.values[subscript] = …. However, note that .values takes a zero-based subscript, which may be different from the DataFrame.index.
Before going to add a row, we have to convert the dataframe to a dictionary. There you can see the keys as columns in the dataframe and the values of the columns are again stored in the dictionary, but there the key for every column is the index number in the dataframe.
That idea led me to write the code below.
df2 = df.to_dict()
values = ["s_101", "hyderabad", 10, 20, 16, 13, 15, 12, 12, 13, 25, 26, 25, 27, "good", "bad"] # This is the total row that we are going to add
i = 0
for x in df.columns: # Here df.columns gives us the main dictionary key
    df2[x][101] = values[i] # Here the 101 is our index number. It is also the key of the sub dictionary
    i += 1
If all data in your Dataframe has the same dtype you might use a NumPy array. You can write rows directly into the predefined array and convert it to a dataframe at the end.
It seems to be even faster than converting a list of dicts.
import time
import pandas as pd
import numpy as np
from string import ascii_uppercase
startTime = time.perf_counter()
numcols, numrows = 5, 10000
npdf = np.ones((numrows, numcols))
for row in range(numrows):
    npdf[row, 0:] = np.random.randint(0, 100, (1, numcols))
df5 = pd.DataFrame(npdf, columns=list(ascii_uppercase[:numcols]))
print('Elapsed time: {:6.3f} seconds for {:d} rows'.format(time.perf_counter() - startTime, numrows))
print(df5.shape)
This code snippet uses a list of dictionaries to update the data frame. It adds on to ShikharDua's and Mikhail_Sam's answers.
import pandas as pd
colour = ["red", "big", "tasty"]
fruits = ["apple", "banana", "cherry"]
dict1={}
feat_list=[]
for x in colour:
    for y in fruits:
        # print(x, y)
        dict1 = dict([('x',x),('y',y)])
        # print(f'dict 1 {dict1}')
        feat_list.append(dict1)
        # print(f'feat_list {feat_list}')
feat_df=pd.DataFrame(feat_list)
feat_df.to_csv('feat1.csv')

append dictionary to data frame

I have a function, which returns a dictionary like this:
{'truth': 185.179993, 'day1': 197.22307753038834, 'day2': 197.26118010160317, 'day3': 197.19846975345905, 'day4': 197.1490578795196, 'day5': 197.37179265011116}
I am trying to append this dictionary to a dataframe like so:
output = pd.DataFrame()
output.append(dictionary, ignore_index=True)
print(output.head())
Unfortunately, the printing of the dataframe results in an empty dataframe. Any ideas?
You don't assign the value to the result.
output = pd.DataFrame()
output = output.append(dictionary, ignore_index=True)
print(output.head())
The previous answer (user alex, answered Aug 9 2018 at 20:09) now triggers a warning saying that appending to a dataframe will be deprecated in a future version.
A way to do it is to transform the dictionary into a dataframe and then concatenate the dataframes:
output = pd.DataFrame()
df_dictionary = pd.DataFrame([dictionary])
output = pd.concat([output, df_dictionary], ignore_index=True)
print(output.head())
I always do it this way because this syntax is less confusing for me.
I believe concat method is recommended though.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>>df
col1 col2
0 1 3
1 2 4
d={'col1': 5, 'col2': 6}
df.loc[len(df)]=d
>>>df
col1 col2
0 1 3
1 2 4
2 5 6
Note that iloc method won't work this way.
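The difference can be seen in a short sketch: position-based `iloc` refuses to create a row that does not exist yet, while label-based `loc` enlarges the frame. The example reuses the small frame above, with a list in place of the dict:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

try:
    df.iloc[len(df)] = [5, 6]      # position 2 does not exist yet
    enlarged = True
except IndexError as exc:
    enlarged = False
    print("iloc refused to enlarge:", exc)

df.loc[len(df)] = [5, 6]           # label-based setting creates the row
print(df)
```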

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like:
what does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of list but there must be a better way. Any ideas ?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables don't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape; for the id column, use numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len (df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it might be removed (along with pd.wide_to_long).
A possible solution is merging all 3 functions into one, maybe melt, but that is not implemented yet. Maybe it will come in some new version of pandas; then my answer will be updated.
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note that when you call df.append() you need to specify ignore_index=True so that the appended rows get new index values rather than keeping their old ones.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
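The reshape attempt from the question can also be made to work directly in NumPy. The sketch below assumes the columns come in consecutive blocks of two (A, B, then A1, B1), as in the example; a plain reshape reads the values row by row, so the block axis has to be moved to the front first:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
                  columns=['A', 'B', 'A1', 'B1'])

arr = df.to_numpy()                        # shape (3, 4)
# View each row as consecutive blocks of 2 columns: (row, block, column).
blocks = arr.reshape(len(df), -1, 2)       # shape (3, 2, 2)
# Put the block axis first so that whole blocks are stacked under each other.
stacked = blocks.swapaxes(0, 1).reshape(-1, 2)

out = pd.DataFrame(stacked, columns=['A', 'B'])
print(out)
```

This only suits frames where all columns share a dtype, since `to_numpy()` on mixed columns falls back to object.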
