I am trying to append an empty row at the end of a dataframe, but I am unable to do so. I have also been trying to understand how pandas works with the append function, and I still don't get it.
Here's the code:
import pandas as pd
excel_names = ["ARMANI+EMPORIO+AR0143-book.xlsx"]
excels = [pd.ExcelFile(name) for name in excel_names]
frames = [x.parse(x.sheet_names[0], header=None,index_col=None).dropna(how='all') for x in excels]
for f in frames:
    f.append(0, float('NaN'))
    f.append(2, float('NaN'))
There are two columns and a random number of rows.
With print f in the for loop I get this:
0 1
0 Brand Name Emporio Armani
2 Model number AR0143
4 Part Number AR0143
6 Item Shape Rectangular
8 Dial Window Material Type Mineral
10 Display Type Analogue
12 Clasp Type Buckle
14 Case Material Stainless steel
16 Case Diameter 31 millimetres
18 Band Material Leather
20 Band Length Women's Standard
22 Band Colour Black
24 Dial Colour Black
26 Special Features second-hand
28 Movement Quartz
Add a new pandas.Series using pandas.DataFrame.append().
If you wish to specify the name (AKA the "index") of the new row, use:
df.append(pandas.Series(name='NameOfNewRow'))
If you don't wish to name the new row, use:
df.append(pandas.Series(), ignore_index=True)
where df is your pandas.DataFrame.
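Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On recent versions the same effect can be achieved with pd.concat; a minimal sketch, assuming a DataFrame df with a default integer index:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
# Build a one-row frame of NaNs with the same columns and concatenate it.
empty_row = pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)
df = pd.concat([df, empty_row], ignore_index=True)
print(df)
#      A    B
# 0  1.0  3.0
# 1  2.0  4.0
# 2  NaN  NaN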
You can add it by appending a Series to the dataframe as follows. I am assuming that by "blank" you mean you want to add a row containing only NaN.
You can first create a Series object with NaN values. Make sure you specify the columns when defining the Series object, via the index parameter.
Then you can append it to the DataFrame. Hope it helps!
from numpy import nan as Nan
import pandas as pd
>>> df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
... 'B': ['B0', 'B1', 'B2', 'B3'],
... 'C': ['C0', 'C1', 'C2', 'C3'],
... 'D': ['D0', 'D1', 'D2', 'D3']},
... index=[0, 1, 2, 3])
>>> s2 = pd.Series([Nan,Nan,Nan,Nan], index=['A', 'B', 'C', 'D'])
>>> result = df1.append(s2, ignore_index=True)
>>> result
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3
4 NaN NaN NaN NaN
You can add a new series, and name it at the same time. The name will be the index of the new row, and all the values will automatically be NaN.
df.append(pd.Series(name='Afterthought'))
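For instance, a small sketch (this relies on DataFrame.append, which was removed in pandas 2.0, so it only runs on older versions):
import pandas as pd

df = pd.DataFrame({'A': [1, 2]})
df = df.append(pd.Series(name='Afterthought', dtype='float64'))
print(df)   # adds a row labelled 'Afterthought' with NaN in every column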
Assuming df is your dataframe,
df_prime = pd.concat([df, pd.DataFrame([[np.nan] * df.shape[1]], columns=df.columns)], ignore_index=True)
where df_prime equals df with an additional last row of NaNs.
Note that pd.concat is slow, so if you need this functionality in a loop it's best to avoid it.
In that case, assuming your index is incremental, you can use
df.loc[df.iloc[-1].name + 1,:] = np.nan
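For example, a minimal sketch of that pattern, assuming a default RangeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# .name of the last row is its index label; adding 1 gives the next label.
df.loc[df.iloc[-1].name + 1, :] = np.nan
print(df)
#      a    b
# 0  1.0  3.0
# 1  2.0  4.0
# 2  NaN  NaN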
Append "empty" row to data frame and fill selected cells:
Generate empty data frame (no rows just columns a and b):
import pandas as pd
col_names = ["a","b"]
df = pd.DataFrame(columns = col_names)
Append empty row at the end of the data frame:
df = df.append(pd.Series(), ignore_index = True)
Now fill the empty cell at the end of the data frame (row len(df)-1) in column a:
df.loc[[len(df)-1],'a'] = 123
Result:
a b
0 123 NaN
And of course one can iterate over the rows and fill cells:
col_names = ["a","b"]
df = pd.DataFrame(columns = col_names)
for x in range(0,5):
    df = df.append(pd.Series(), ignore_index = True)
    df.loc[[len(df)-1],'a'] = 123
Result:
a b
0 123 NaN
1 123 NaN
2 123 NaN
3 123 NaN
4 123 NaN
The code below worked for me.
df.append(pd.Series([np.nan]), ignore_index = True)
Assuming your df.index is sorted you can use:
df.loc[df.index.max() + 1] = None
It handles different index and column types well.
[EDIT] It works with a pd.DatetimeIndex if there is a constant frequency; otherwise we must specify the new index exactly, e.g.:
df.loc[df.index.max() + pd.Timedelta(milliseconds=1)] = None
A longer example:
df = pd.DataFrame([[pd.Timestamp(12432423), 23, 'text_field']],
columns=["timestamp", "speed", "text"],
index=pd.DatetimeIndex(start='2111-11-11',freq='ms', periods=1))
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1 entries, 2111-11-11 to 2111-11-11
Freq: L
Data columns (total 3 columns):
timestamp 1 non-null datetime64[ns]
speed 1 non-null int64
text 1 non-null object
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 32.0+ bytes
df.loc[df.index.max() + 1] = None
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2 entries, 2111-11-11 00:00:00 to 2111-11-11 00:00:00.001000
Data columns (total 3 columns):
timestamp 1 non-null datetime64[ns]
speed 1 non-null float64
text 1 non-null object
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 64.0+ bytes
df.head()
timestamp speed text
2111-11-11 00:00:00.000 1970-01-01 00:00:00.012432423 23.0 text_field
2111-11-11 00:00:00.001 NaT NaN NaN
You can also use:
your_dataframe.insert(loc=0, value=np.nan, column="")
where loc is the position at which the new (all-NaN, empty-named) column is inserted. Note that insert adds a column, not a row.
@Dave Reikher's answer is the best solution.
df.loc[df.iloc[-1].name + 1,:] = np.nan
Here's a similar answer without the NumPy library
df.loc[len(df.index)] = ['' for x in df.columns.values.tolist()]
len(df.index) is the number of rows; for a default RangeIndex it is always one more than the last index label.
By using df.loc[len(df.index)] you are selecting the next index label (row) available.
df.iloc[-1].name + 1 evaluates to the same label used in df.loc[len(df.index)].
Instead of using NumPy, you can also use a Python list comprehension:
Create a list from the column names: df.columns.values.tolist()
Then build a list of empty strings '' with one entry per column:
['' for x in df.columns.values.tolist()]
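Putting it together, a small self-contained sketch of this approach:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
# len(df.index) is 2 here, i.e. the next free label for a default RangeIndex.
df.loc[len(df.index)] = ['' for x in df.columns.values.tolist()]
print(df)   # row 2 now holds an empty string in both columns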
Related
I am trying to make a function to spot the columns with "100" in the header and replace the values in these columns with NaN depending on multiple criteria.
I also want the function to return the value of the column "first_column" corresponding to the outlier.
For instance, let's say I have a df where I want to replace all numbers that are above 100 or below 0 with NaN values:
I start with this dataframe:
import pandas as pd
data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5'],
        }
df = pd.DataFrame(data)
print (df)
expected output:
IIUC, you can use filter and boolean indexing:
# get "100" columns and convert to integer
df2 = df.filter(like='100').astype(int)
# identify values <0 or >100
mask = (df2.lt(0)|df2.gt(100))
# mask them
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
# get rows with at least one match
out2 = df.loc[mask.any(1), ['first_column']+list(df.filter(like='100'))]
output 1:
first_column second_column third_100 fourth_100
0 product_name first_value 89 25
1 product_name2 second_value 9 NaN
2 product_name3 third_value NaN 5
output 2:
first_column third_100 fourth_100
1 product_name2 9 1568200
2 product_name3 589 5
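Put together as a self-contained sketch, reusing the question's sample data (with its quoting typos fixed):
import pandas as pd

data = {'first_column': ['product_name', 'product_name2', 'product_name3'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_100': ['25', '1568200', '5'],
        }
df = pd.DataFrame(data)

df2 = df.filter(like='100').astype(int)    # only the "100" columns, as integers
mask = (df2.lt(0) | df2.gt(100))           # True where a value is out of range
out1 = df.mask(mask.reindex(df.columns, axis=1, fill_value=False))
out2 = df.loc[mask.any(axis=1), ['first_column'] + list(df.filter(like='100'))]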
Consider the following sequence of operations:
1. Create a data frame with two columns, of types int64 and float64
2. Create a new frame by converting all columns to object
3. Inspect the new data frame
4. Persist the new data frame
5. Expect the second column to be persisted as shown in step 3, i.e. as strings, not as float64
Illustrated below:
# Step 1
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
# Step 2
df2 = df.astype(object)
# Step 3
df2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 a 4 non-null object
1 b 4 non-null object
dtypes: object(2)
memory usage: 192.0+ bytes
# NOTE notice how column `b` is rendered
df2
a b
0 3 1
1 2 500.43
2 1 256.13
3 0 5
# Step 4
df2.to_csv("/tmp/df2", index=False, sep="\t")
Now let us inspect the generated output:
$ cat df2
a b
3 1.0
2 500.43
1 256.13
0 5.0
Notice how column b is persisted: the decimal places are still present for round numbers even though the datatype is object. Why does this happen? What am I missing here?
I'm using Pandas 1.1.2 with Python 3.7.9.
I think 'object' is a NumPy/pandas dtype, not one of the Python data types.
If you run:
type(df2.iloc[0,1])
before step 4, you will get the float data type even though the column has already been changed to 'object'.
You can use:
df.to_csv("df.csv",float_format='%g', index=False, sep="\t")
instead of casting in step 2.
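For example, a minimal sketch of the float_format approach on the data from the question:
import pandas as pd

df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
# '%g' drops the trailing '.0' on round floats, so 1.0 and 5.0 are written as 1 and 5.
df.to_csv("df.csv", float_format='%g', index=False, sep="\t")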
I am not great with pandas and am still learning. I looked at a few solutions and thought: why not do an apply on the data before we send it to the csv file? Here's what I did to get the values printed as 1 and 5 instead of 1.0 and 5.0.
Case 1: the values in df are a mix of strings, floats and ints
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 's', 't'], 'b': [1, 500.43, 256.13, 5, 'txt']})
df2 = df.astype(object)
def convert(x):
    a = []
    for i in x.to_list():
        a.append(coerce(i))
    return pd.Series(a)
    #return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])

def coerce(y):
    try:
        p = float(y)
        q = int(y)
        if p != q:
            return str(p)
        else:
            return str(q)
    except:
        return str(y)
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
Output in the file will be:
a b
3 1
2 500.43
1 256.13
s 5
t txt
Case 2: all values in df are numeric (integers or floats)
import pandas as pd
df = pd.DataFrame.from_dict({'a': [3, 2, 1, 0], 'b': [1, 500.43, 256.13, 5]})
df2 = df.astype(object)
def convert(x):
    return pd.Series([str(int(i)) if int(i) == i else i for i in x.to_list()])
df2.apply(convert).to_csv("abc.txt", index=False, sep="\t")
The output is as follows:
a b
3 1
2 500.43
1 256.13
0 5
Here I am assuming all values in df2 are numeric. If it has a string value, then int(i) will fail.
I have a dataframe, and I set the index to a column of the dataframe. This creates a hierarchical column index. I want to flatten the columns to a single level. Similar to this question - Python Pandas - How to flatten a hierarchical index in columns, however, the columns do not overlap (i.e. 'id' is not at level 0 of the hierarchical index, and other columns are at level 1 of the index).
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
A B
id
101 3 x
102 5 y
Desired output is flattened columns, like this:
id A B
101 3 x
102 5 y
You are misinterpreting what you are seeing.
A B
id
101 3 x
102 5 y
This is not showing you a hierarchical column index. id is just the name of the row index; pandas prints it on its own line in order to show you the name of the index.
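You can verify this directly (a quick check on the question's frame):
import pandas as pd

df = pd.DataFrame([(101, 3, 'x'), (102, 5, 'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
print(df.columns)          # Index(['A', 'B'], dtype='object') -- a single level, no MultiIndex
print(df.columns.nlevels)  # 1
print(df.index.name)       # id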
The answer to your question depends on what you really want or need.
As the df is, you can dump it to a csv just the way you want:
print(df.to_csv(sep='\t'))
id A B
101 3 x
102 5 y
print(df.to_csv())
id,A,B
101,3,x
102,5,y
Or you can alter the df so that it displays the way you'd like
print(df.rename_axis(None))
A B
101 3 x
102 5 y
Please do not actually do this!
I'm only including it to demonstrate how the index names can be manipulated.
I could also keep the index as it is but manipulate both the column and row index names so it prints the way you would like.
print(df.rename_axis(None).rename_axis('id', axis=1))
id A B
101 3 x
102 5 y
But this has named the columns' index 'id', which makes no sense.
There will always be an index in your dataframes. If you don't set 'id' as the index, it will be at the same level as the other columns, and pandas will populate an increasing integer index for you starting from 0.
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
In[52]: df
Out[52]:
id A B
0 101 3 x
1 102 5 y
The index is there so you can slice the original dataframe, such as:
df.iloc[0]
Out[53]:
id 101
A 3
B x
Name: 0, dtype: object
So let's say you want 'id' as the index and also as a column, which is very redundant; you could do:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df['id'] = df.index
df
Out[55]:
A B id
id
101 3 x 101
102 5 y 102
With this you can slice by 'id', such as:
df.loc[101]
Out[57]:
A 3
B x
id 101
Name: 101, dtype: object
But it would give the same info as:
df = pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
df.set_index('id', inplace=True)
df.loc[101]
Out[58]:
A 3
B x
Name: 101, dtype: object
Given:
>>> df2=pd.DataFrame([(101,3,'x'), (102,5,'y')], columns=['id', 'A', 'B'])
>>> df2.set_index('id', inplace=True)
>>> df2
A B
id
101 3 x
102 5 y
For pretty printing, you can produce a copy of the DataFrame with the index reset and use .to_string:
>>> print df2.reset_index().to_string(index=False)
id A B
101 3 x
102 5 y
Then play around with the formatting options so that the output suits your needs:
>>> fmts=[lambda s: u"{:^5}".format(str(s).strip())]*3
>>> print df2.reset_index().to_string(index=False, formatters=fmts)
id A B
101 3 x
102 5 y
Consider the .dta file behind this zip file.
Here's the first row:
>>> df = pd.read_stata('cepr_org_2014.dta', convert_categoricals = False)
>>> df.iloc[0]
year 2014
month 1
minsamp 8
hhid 000936071123039
hhid2 91001
# [...]
>>> df.iloc[0]['wage4']
nan
I double-checked this using Stata, and it appears correct. So far, so good. Now I set up some columns that I want to keep and redo the exercise.
>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
convert_categoricals = False,
columns=columns+columns2)
>>> df.iloc[0]
wbho 1
age 65
female 0
wage4 1.7014118346e+38
ind_nber 101
year 2014
month 1
minsamp 8
hhid NaN
hhid2 NaN
fnlwgt 560.1073
Name: 0, dtype: object
After adding a list of columns to keep, pandas:
1. doesn't understand the missing value anymore; wage4 is large instead of NaN.
2. creates missing values for hhid and hhid2.
Why?
Footnote: first loading the dataset and then filtering using df[columns+columns2] works.
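That workaround looks like this (a sketch, assuming the same file as above):
import pandas as pd

columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']

# Read everything, then subset; dtypes and missing values come out as expected.
df = pd.read_stata('cepr_org_2014.dta', convert_categoricals=False)
df = df[columns + columns2]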
I traced this error to a bug in pandas. I've fixed the error in https://github.com/jbuyl/pandas/tree/fix-column-dtype-mixing and opened a pull request to merge in the fix, but feel free to check out my fork/branch.
Here are the results running your example:
>>> columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
>>> columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
>>> df = pd.read_stata('cepr_org_2014.dta',
... convert_categoricals = False,
... columns=columns+columns2)
>>> df.iloc[0]
wbho 1
age 65
female 0
wage4 nan
ind_nber NaN
year 2014
month 1
minsamp 8
hhid 000936071123039
hhid2 91001
fnlwgt 560.107
Name: 0, dtype: object
It appears to be a bug. In the source, in pandas/io/stata.py, in the _do_select_columns() method, the loop:
dtyplist = []
typlist = []
fmtlist = []
lbllist = []
matched = set()
for i, col in enumerate(data.columns):
    if col in column_set:
        matched.update([col])
        dtyplist.append(self.dtyplist[i])
        typlist.append(self.typlist[i])
        fmtlist.append(self.fmtlist[i])
        lbllist.append(self.lbllist[i])
messes up the order of the dtypes, which no longer matches the sequence in which the columns appear in column_set.
Compare the dtypes of df2 and df3 in this example:
In [1]:
import zipfile
z = zipfile.ZipFile('/Users/q6600sl/Downloads/cepr_org_2014.zip')
df= pd.read_stata(z.open('cepr_org_2014.dta'), convert_categoricals = False)
In [2]:
columns = ['wbho', 'age', 'female', 'wage4', 'ind_nber']
columns2 = ['year', 'month', 'minsamp', 'hhid', 'hhid2', 'fnlwgt']
In [3]:
df2 = pd.read_stata(z.open('cepr_org_2014.dta'),
convert_categoricals = False,
columns=columns+columns2)
In [4]:
df2.dtypes
Out[4]:
wbho int16
age int8
female int8
wage4 object
ind_nber object
year float32
month int8
minsamp int8
hhid float64
hhid2 float64
fnlwgt float32
dtype: object
In [5]:
df3 = df[columns+columns2]
In [6]:
df3.dtypes
Out[6]:
wbho int8
age int8
female int8
wage4 float32
ind_nber float64
year int16
month int8
minsamp int8
hhid object
hhid2 object
fnlwgt float32
dtype: object
Change it to:
dtyplist = []
typlist = []
fmtlist = []
lbllist = []
#matched = set()
for i in np.hstack([np.argwhere(data.columns==col) for col in columns]).ravel():
    # if col in column_set:
    #     matched.update([col])
    dtyplist.append(self.dtyplist[i])
    typlist.append(self.typlist[i])
    fmtlist.append(self.fmtlist[i])
    lbllist.append(self.lbllist[i])
fixes the problem.
(I don't know what matched is doing here; it appears never to be used later on.)
Each column of the DataFrame needs its values to be normalized according to the value of the first element in that column.
for timestamp, prices in data.iteritems():
    normalizedPrices = prices / prices[0]
    print normalizedPrices  # how do we update the DataFrame with this Series?
However, how do we update the DataFrame once we have created the normalized column of data? I believe that if we do prices = normalizedPrices we are merely acting on a copy/view of the DataFrame rather than on the original DataFrame itself.
It might be simplest to normalize the entire DataFrame in one go (and avoid looping over rows/columns altogether):
>>> df = pd.DataFrame({'a': [2, 4, 5], 'b': [3, 9, 4]}, dtype=np.float) # a DataFrame
>>> df
a b
0 2 3
1 4 9
2 5 4
>>> df = df.div(df.loc[0]) # normalise DataFrame and bind back to df
>>> df
a b
0 1.0 1.000000
1 2.0 3.000000
2 2.5 1.333333
Assign to data[col]:
for col in data:
    data[col] /= data[col].iloc[0]
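For example, a small sketch; it reproduces the output shown earlier in this thread:
import numpy as np
import pandas as pd

data = pd.DataFrame({'a': [2, 4, 5], 'b': [3, 9, 4]}, dtype=np.float64)
for col in data:
    data[col] /= data[col].iloc[0]   # divide each column by its first value, in place
print(data)
#      a         b
# 0  1.0  1.000000
# 1  2.0  3.000000
# 2  2.5  1.333333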
import numpy
data[0:] = data[0:].values/data[0:1].values