Creating new pandas dataframe from certain columns of existing dataframe - python

I have read a csv file into a pandas dataframe and want to do some simple manipulations on it. I cannot figure out how to create a new dataframe based on selected columns from my original dataframe. My attempt:
names = ['A','B','C','D']
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset['A','D']
I would like to create a new dataframe with the columns A and D from the original dataframe.

This is called subsetting: pass a list of column names inside []:
dataset = pandas.read_csv('file.csv', names=names)
new_dataset = dataset[['A','D']]
which is the same as:
new_dataset = dataset.loc[:, ['A','D']]
If you only need the filtered output, add the usecols parameter to read_csv:
new_dataset = pandas.read_csv('file.csv', names=names, usecols=['A','D'])
EDIT:
If you use only:
new_dataset = dataset[['A','D']]
and then perform some data manipulation, you will likely get:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
If you modify values in new_dataset later, you will find that the modifications do not propagate back to the original data (dataset), and that pandas warns you about it.
As EdChum pointed out, add copy() to remove the warning:
new_dataset = dataset[['A','D']].copy()
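A minimal sketch with made-up data showing the safe pattern:
import pandas as pd

dataset = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6], 'D': [7, 8]})
new_dataset = dataset[['A', 'D']].copy()
new_dataset['A'] = 0  # safe: modifies the independent copy, no warning
print(dataset['A'].tolist())  # original is untouched: [1, 2]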

You must pass a list of column names to select multiple columns. Otherwise, the key is interpreted as a MultiIndex tuple; df['A','D'] would only work if df.columns were a MultiIndex.
The most obvious way is df.loc[:, ['A', 'D']], but there are other ways (note how all of them take lists):
df1 = df.filter(items=['A', 'D'])
df1 = df.reindex(columns=['A', 'D'])
df1 = df.get(['A', 'D']).copy()
N.B. items is the first positional argument, so df.filter(['A', 'D']) also works.
Note that filter() and reindex() return a copy as well, so you don't need to worry about getting SettingWithCopyWarning later.


Appending dataframes with non-matching rows [duplicate]

I have an initial dataframe D. I extract two dataframes from it like this:
A = D[D.label == k]
B = D[D.label != k]
I want to combine A and B into one DataFrame. The order of the data is not important. However, when we sample A and B from D, they retain their indexes from D.
DEPRECATED: DataFrame.append and Series.append were deprecated in v1.4.0.
In older pandas versions, you could use append:
df_merged = df1.append(df2, ignore_index=True)
And to keep their indexes, set ignore_index=False.
Use pd.concat to join multiple dataframes:
df_merged = pd.concat([df1, df2], ignore_index=True, sort=False)
Merge across rows:
df_row_merged = pd.concat([df_a, df_b], ignore_index=True)
Merge across columns:
df_col_merged = pd.concat([df_a, df_b], axis=1)
If you're working with big data and need to concatenate multiple datasets, calling concat many times can become a performance bottleneck.
If you don't want to create a new df each time, you can instead aggregate the changes and call concat only once:
frames = [df_A, df_B] # Or perform operations on the DFs
result = pd.concat(frames)
This is pointed out in the pandas docs, under concatenating objects, at the bottom of the section:
Note: It is worth noting however, that concat (and therefore append)
makes a full copy of the data, and that constantly reusing this
function can create a significant performance hit. If you need to use
the operation over several datasets, use a list comprehension.
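A short sketch of the list-then-concat pattern (the file names are hypothetical):
import pandas as pd

# Accumulate the pieces in a plain list...
frames = [pd.read_csv(p) for p in ['part1.csv', 'part2.csv', 'part3.csv']]
# ...and call concat exactly once, instead of growing a DataFrame in a loop.
result = pd.concat(frames, ignore_index=True)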
If you want to update/replace the values of the first dataframe (df1) with the values of the second dataframe (df2), you can do it with the following steps:
Step 1: Set the index of the first dataframe (df1):
df1 = df1.set_index('id')
Step 2: Set the index of the second dataframe (df2):
df2 = df2.set_index('id')
Finally, update the first dataframe in place:
df1.update(df2)
Note that set_index returns a new dataframe by default, so the result must be assigned back (or pass inplace=True).
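A minimal end-to-end sketch with toy data (the 'id' column is an assumption):
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'Total': [10, 20, 30]})
df2 = pd.DataFrame({'id': [2, 3], 'Total': [99, 88]})

df1 = df1.set_index('id')
df2 = df2.set_index('id')
df1.update(df2)  # rows with id 2 and 3 now hold 99 and 88
print(df1)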
To join 2 pandas dataframes by column, using their indices as the join key, you can do this:
both = a.join(b)
And if you want to join multiple DataFrames, Series, or a mixture of them, by their index, just put them in a list, e.g.,:
everything = a.join([b, c, d])
See the pandas docs for DataFrame.join().
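A small sketch with two toy frames sharing an index:
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'y': [3, 4]}, index=['r1', 'r2'])
both = a.join(b)  # columns x and y side by side, aligned on the index
print(both)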
import pandas as pd

# collect Excel content into a list of dataframes
# (excel_files and excelAutoNamed are defined elsewhere)
data = []
for excel_file in excel_files:
    data.append(pd.read_excel(excel_file, engine="openpyxl"))
# concatenate the dataframes horizontally
df = pd.concat(data, axis=1)
# save the combined data to Excel
df.to_excel(excelAutoNamed, index=False)
You can try the above when you are appending horizontally. Hope this helps someone.
Use this code to attach two pandas DataFrames horizontally:
df3 = pd.concat([df1, df2], axis=1, ignore_index=True, sort=False)
You must specify the axis along which you intend to merge the two frames.

Insert a dataframe between dataframes via the nested append method

The following code works fine to insert one dataframe underneath the other using a nesting of the append method.
for sheet_name, df in Input_Data.items():
    df1 = df[126:236]
    df = df1.sort_index(ascending=False)
    Indexer = df.columns.tolist()
    df = [pd.concat([df[Indexer[0]], df[Indexer[num]]], axis=1) for num in [1, 2, 3, 4, 5, 6]]
    df = [df[num].astype(str).agg(','.join, axis=1) for num in [0, 1, 2, 3, 4, 5]]
    df = pd.DataFrame(df)
    df = df.loc[0].append(df.loc[1].append(df.loc[2].append(df.loc[3].append(df.loc[4].append(df.loc[5])))))
However, I need to add additional dataframes (one row, one column) in between the df.loc[i] pieces. As a first step, I tried to insert a dataframe at the top of df.loc[0] via
df=df_1st.append(df,ignore_index=True)
which yields the following error: cannot reindex from a duplicate axis
It seems my dataframe df has duplicate indices. Not sure how to proceed. Perhaps the nested method is not best approach?
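For reference, a minimal sketch (toy data) that reproduces the error mechanism and one common workaround; the exact message varies by pandas version:
import pandas as pd

s = pd.Series([1, 2], index=[0, 0])  # duplicate index labels
try:
    s.reindex([0, 1])  # any alignment step over a duplicate axis fails
except ValueError as e:
    print(e)  # e.g. "cannot reindex from a duplicate axis"

# One common workaround: discard the duplicated labels first
s = s.reset_index(drop=True)
print(s.index.tolist())  # [0, 1]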

Copy rows of same column to different index in a Pandas Dataframe. Duplicate data of one month to Another

I am trying to copy/replace data into a column of a dataframe.
When the index is the same, I can easily copy it.
For Example:
sampledata['Total']=actualdata['Total']
Both the line above and the line below work:
sampledata.loc[janStart:janEnd, 'Total'] = sampledata.loc[0:755, 'Total']
But when I try to copy the data from one dataframe to another at different indexes, or to a different index in the same dataframe, it doesn't work.
The following code doesn't work:
sampledata.loc[1417:2153, 'Total'] = sampledata.loc[0:743, 'Total']
I have also tried this:
actualdata.reset_index(drop=True, inplace=True)
#actualdata.index=sampledata.index
#sampledata.ignore_index = True
#actualdata.ignore_index = True
#actualdata.reindex_like(actualdata)
sampledata.loc[1417:2153, 'Total'] = actualdata.loc[0:743, 'Total']
The purpose of this code is to copy the use of electrical consumption from one month to another.
Any other methods that can be used are also welcome.
To copy a Series into another, the indexes must match.
A simple trick to get rid of the index of the copied Series is to extract its values, which converts it to a plain NumPy array with no index:
sampledata.loc[1417:2153, 'Total'] = sampledata.loc[0:736, 'Total'].values
The only requirement is that the sizes match; note that .loc slices are inclusive on both ends, so 1417:2153 and 0:736 both select 737 rows.
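A toy sketch of the trick (made-up numbers):
import pandas as pd

df = pd.DataFrame({'Total': range(10)})
# Copy rows 0..3 into rows 5..8; .values strips the index, so no alignment happens
df.loc[5:8, 'Total'] = df.loc[0:3, 'Total'].values
print(df['Total'].tolist())  # [0, 1, 2, 3, 4, 0, 1, 2, 3, 9]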

Put maximum of each column from a list of dataframe into new dataframe

I recently started working with pandas dataframes.
I have a list of dataframes called 'arr'.
Edit: All the dataframes in 'arr' have the same columns but different data.
Also, I have an empty dataframe 'ndf' which I need to fill in using the above list.
How do I iterate through 'arr' to fill in the max values of a column from 'arr' into a row in 'ndf'?
So, we'll have
Number of rows in ndf = Number of elements in arr
I'm looking for something like this:
columns=['time','Open','High','Low','Close']
ndf=DataFrame(columns=columns)
ndf['High']=arr[i].max(axis=0)
Based on your description, I assume a basic example of your data looks something like this:
import pandas as pd
data = [{'time': '2013-09-01', 'open': 249, 'high': 254, 'low': 249, 'close': 250},
        {'time': '2013-09-02', 'open': 249, 'high': 256, 'low': 248, 'close': 250}]
data2 = [{'time': '2013-09-01', 'open': 251, 'high': 253, 'low': 248, 'close': 250},
         {'time': '2013-09-02', 'open': 245, 'high': 251, 'low': 243, 'close': 247}]
df = pd.DataFrame(data)
df2 = pd.DataFrame(data2)
arr = [df, df2]
If that's the case, then you can simply iterate over the list of dataframes (via enumerate()) and the columns of each dataframe (via iteritems(); in pandas 2.x use items() instead, see http://pandas.pydata.org/pandas-docs/stable/basics.html#iteritems), populating each new row via a dictionary comprehension (see Create a dictionary with list comprehension in Python):
ndf = pd.DataFrame(columns=df.columns)
for i, df in enumerate(arr):
    ndf = ndf.append(pd.DataFrame(data={colName: max(colData) for colName, colData in df.iteritems()}, index=[i]))
If some of your dataframes have any additional columns, the resulting dataframe ndf will have NaN entries in the relevant places.
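In recent pandas, where both append() and iteritems() have been removed, an equivalent sketch builds one row of column-wise maxima per dataframe:
import pandas as pd

# Each df.max() is a Series of per-column maxima; a list of Series becomes rows
ndf = pd.DataFrame([df.max() for df in arr])
print(ndf)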

How to get pandas.DataFrame columns containing specific dtype

I'm using df.columns.values to make a list of column names, which I then iterate over to make charts, etc. But when I set this up, I overlooked the non-numeric columns in the df. Now, I'd much rather not simply drop those columns from the df (or a copy of it). Instead, I would like to find a slick way to eliminate them from the list of column names.
Now I have:
names = df.columns.values
what I'd like to get to is something that behaves like:
names = df.columns.values(column_type=float64)
Is there any slick way to do this? I suppose I could make a copy of the df, and drop those non-numeric columns before doing columns.values, but that strikes me as clunky.
Welcome any inputs/suggestions. Thanks.
Someone will possibly give you a better answer than this, but one thing I tend to do: if all my numeric data are int64 or float64 objects, you can create a dict of the column data types and then use it to build your list of columns.
So, for example, in a dataframe with columns of type float64, int64 and object, you can first look at the data types like so:
DF.dtypes
and if they conform to the standard whereby the non-numeric columns of data are all object types (as they are in my dataframes), then you can do the following to get a list of the numeric columns:
[key for key in dict(DF.dtypes) if dict(DF.dtypes)[key] in ['float64', 'int64']]
It's just a simple list comprehension, nothing fancy. Whether this works for you will depend on how you set up your dataframe.
dtypes is a Pandas Series.
That means it contains index & values attributes.
If you only need the column names:
headers = df.dtypes.index
It returns an Index object containing the column names of the df dataframe.
There's a feature added in 0.14.1, select_dtypes, which selects columns by dtype, given a list of dtypes to include or exclude.
For example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.random.randn(1000),
                   'b': range(1000),
                   'c': ['a'] * 1000,
                   'd': pd.date_range('2000-1-1', periods=1000)})
df.select_dtypes(['float64', 'int64'])
Out[129]:
a b
0 0.153070 0
1 0.887256 1
2 -1.456037 2
3 -1.147014 3
...
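select_dtypes also accepts an exclude list; for instance, to keep everything that is not object-typed:
df.select_dtypes(exclude=['object'])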
To get the column names from a pandas dataframe in Python 3:
Here I am creating a dataframe from a fileName.csv file:
>>> import pandas as pd
>>> df = pd.read_csv('fileName.csv')
>>> columnNames = list(df.head(0))
>>> print(columnNames)
You can also get the column names (along with their dtypes) from a pandas dataframe. Here I'll read a csv file from https://mlearn.ics.uci.edu/databases/autos/imports-85.data, but you have to define the header that contains the column names yourself.
import pandas as pd

url = "https://mlearn.ics.uci.edu/databases/autos/imports-85.data"
df = pd.read_csv(url, header=None)
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration", "num-of-doors", "body-style",
           "drive-wheels", "engine-location", "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
           "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower", "peak-rpm",
           "city-mpg", "highway-mpg", "price"]
df.columns = headers
print(df.columns)
