Create dataframe from specific column - python

I am trying to create a dataframe in Pandas from the AB column in my csv file. (AB is the 27th column).
I am using this line:
df = pd.read_csv(filename, error_bad_lines = False, usecols = [27])
... which is resulting in this error:
ValueError: Usecols do not match names.
I'm very new to Pandas; could someone point out what I'm doing wrong?

Here is a small demo:
CSV file (without a header, i.e. there are NO column names):
1,2,3,4,5,6,7,8,9,10
11,12,13,14,15,16,17,18,19,20
We are going to read only the 8th column:
In [1]: fn = r'D:\temp\.data\1.csv'
In [2]: df = pd.read_csv(fn, header=None, usecols=[7], names=['col8'])
In [3]: df
Out[3]:
   col8
0     8
1    18
PS: pay attention to header=None, usecols=[7], names=['col8'].
If you don't use header=None and names parameters, the first row will be used as a header:
In [6]: df = pd.read_csv(fn, usecols=[7])
In [7]: df
Out[7]:
    8
0  18
In [8]: df.columns
Out[8]: Index(['8'], dtype='object')
and if we try to read the last (10th) column like this:
In [9]: df = pd.read_csv(fn, usecols=[10])
... skipped ...
ValueError: Usecols do not match names.
because pandas counts columns starting from 0, so we have to do it this way:
In [12]: df = pd.read_csv(fn, usecols=[9], names=['col10'])
In [13]: df
Out[13]:
   col10
0     10
1     20

usecols can take the column names from your csv file as well as 0-based column positions. In your case it should be usecols=['AB'] rather than a position; and if you do use positions, remember that pandas counts from 0, so the 27th column is usecols=[26]. Asking for a position beyond the columns pandas actually finds is what raises the error stating that usecols do not match names.
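A minimal sketch of both forms (the filename is a placeholder for the asker's file):
import pandas as pd

filename = 'my_data.csv'  # placeholder for the asker's csv

# select the column by its header name:
df_by_name = pd.read_csv(filename, usecols=['AB'])

# or by 0-based position: if AB is the 27th column, its index is 26:
df_by_position = pd.read_csv(filename, usecols=[26])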

Related

Merge excel files with multiple sheets into one dataframe

I'm new to pandas and Python, and I'm trying to combine a lot of Excel files from a folder (each file contains two sheets) and then add only certain columns from those sheets to a new dataframe. Each file has the same set of columns and sheet names, but sometimes a different number of rows.
I'll show you what I did with an example with two files. Screenshots of the sheets:
[Screenshot: first sheet]
[Screenshot: second sheet]
Sheets from the second file have the same structure, but with different data in them.
Code:
import pandas as pd
import os

folder = [file for file in os.listdir('./test_folder/')]
consolidated = pd.DataFrame()
for file in folder:
    first = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['first']))
    second = pd.concat(pd.read_excel('./test_folder/' + file, sheet_name=['second']))
    # dropping unnecessary columns
    first_new = first.drop(['Col_K', 'Col_L', 'Col_M'], axis=1)
    second_new = second.drop(['Col_DD', 'Col_EE', 'Col_FF', 'Col_GG', 'Col_HH',
                              'Col_II', 'Col_JJ', 'Col_KK', 'Col_LL', 'Col_MM',
                              'Col_NN', 'Col_OO', 'Col_PP', 'Col_QQ', 'Col_RR',
                              'Col_SS', 'Col_TT'], axis=1)
    frames = [consolidated, second_new, first_new]
    consolidated = pd.concat(frames, axis=0)
consolidated.to_excel('all.xlsx', index=True)
So here is the result: [screenshot omitted]
And here's my desired result: [screenshot omitted]
So basically, I don't know how to get rid of these empty cells and align the two data frames with each other. Most likely there's some problem with the indexes of the DFs (first_new, second_new), but I don't know how to resolve it.
pd.concat() has an ignore_index parameter; with axis=1 it renumbers the labels along the concatenation axis (here, the columns). If your rows have differing indices across the individual frames, reset those indices first so the rows line up (see the sketch after the example below). If they share a common index (as in my example), you don't need to touch anything and can keep the column names.
Try:
pd.concat(frames, axis=1, ignore_index=True)
In [5]: df1 = pd.DataFrame({"A":2, "B":3}, index=[0, 1])
In [6]: df1
Out[6]:
   A  B
0  2  3
1  2  3
In [7]: df2 = pd.DataFrame({"AAA":22, "BBB":33}, index=[0, 1])
In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)
In [11]: df
Out[11]:
   0  1   2   3
0  2  3  22  33
1  2  3  22  33
In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)
In [13]: df
Out[13]:
   A  B  AAA  BBB
0  2  3   22   33
1  2  3   22   33
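If the row indices differ across the individual frames, the axis=1 concat aligns on those labels and leaves NaN holes; resetting the indices first makes the rows line up positionally. A minimal sketch with made-up frames (not the asker's data):
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[5, 6])  # different row labels

# aligning on the differing labels produces NaN holes:
print(pd.concat([df1, df2], axis=1))
#      A    B
# 0  1.0  NaN
# 1  2.0  NaN
# 5  NaN  3.0
# 6  NaN  4.0

# resetting both indices first lines the rows up positionally:
print(pd.concat([df1.reset_index(drop=True),
                 df2.reset_index(drop=True)], axis=1))
#    A  B
# 0  1  3
# 1  2  4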

Loading pandas table with column names and dtypes

I'm fairly new to using Pandas and I seem to be having some trouble loading a table from a text file.
Here's an example of what the data looks like:
# Header text
# Header text
# id col1 col2 col3 col4
0 0.44:66 0 1600 45.6e-3
1 0.25:7f 0 1600 52.1e-3
2 0.31:5e 0 1600 33.7e-3
...
2500 0.42:6f 0 1400 42.1e-3
# END
# Footer text
I am reading it in as follows:
import pandas as pd
with open(filename, 'rt') as f:
    df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python')
Then when I print(df.dtypes) I get the following:
# id int64
col1 object
col2 int64
col3 int64
col4 float64
dtype: object
This is fine, except for the # in the name of the first column. So I tried specifying the names:
df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python',
                   names=["id", "col1", "col2", "col3", "col4"])
but then print(df.dtypes) gives:
id object
col1 object
col2 object
col3 object
col4 object
dtype: object
So I tried specifying both names and dtypes:
df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python',
                   names=["id", "col1", "col2", "col3", "col4"],
                   dtype={"id": int, "col1": str, "col2": int,
                          "col3": int, "col4": float})
but this gives an error:
ValueError: Unable to convert column id to type <class 'int'>
What's wrong? How can I load the table with the column names I want and the appropriate dtypes?
A few comments.
Firstly, I don't understand why your code works at all, given that your columns appear to be separated by whitespace: you'd usually need an explicit separator such as sep=r'\s+' in the call to read_table or read_csv.
Secondly, you don't need to open the file first; you can just pass the filename to the pandas function: pd.read_table(filename, ...).
But to answer your question:
If you specify the column names explicitly with names=[...] and they don't match the header of the file, pandas assumes there is no header. You therefore have to skip one more row (skiprows=3); otherwise pandas treats the '# id ...' line as table data and falls back to dtype object (i.e. strings) for every column.
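A sketch of the combined fix, assuming the columns really are whitespace-separated as the sample suggests (the filename is a placeholder, and note the singular dtype keyword):
import pandas as pd

filename = 'data.txt'  # placeholder for the asker's file

# skiprows=3 also skips the '# id ...' line, so the explicit names and
# dtypes apply cleanly to the data rows; sep=r'\s+' splits on whitespace.
df = pd.read_table(filename, sep=r'\s+', skiprows=3, skipfooter=2,
                   engine='python',
                   names=['id', 'col1', 'col2', 'col3', 'col4'],
                   dtype={'id': int, 'col1': str, 'col2': int,
                          'col3': int, 'col4': float})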
I have found a workaround, but I am open to better solutions if they are out there.
I loaded the table without specifying names or dtypes, and then renamed the problematic column:
df = pd.read_table(f, skiprows=2, skipfooter=2, engine='python')
df.rename(columns={'# id':'id'}, inplace=True)
Then I used print(df.dtypes) to get the desired output:
id int64
col1 object
col2 int64
col3 int64
col4 float64
dtype: object
Use astype
df['id'] = df['id'].astype(int)
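Put together, the workaround looks roughly like this (a sketch; the astype call is only needed if the column still comes back as object):
import pandas as pd

filename = 'data.txt'  # placeholder for the asker's file

df = pd.read_table(filename, skiprows=2, skipfooter=2, engine='python')
df.rename(columns={'# id': 'id'}, inplace=True)
df['id'] = df['id'].astype(int)  # only needed if 'id' is object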

Pandas: Read CSV file with separate Year & Month columns, merge them and set as index column

I have a csv file that contains a column 'Year' (type: int64) e.g. 1958, and a column 'Month' (type: int64) e.g. 7 for July.
I would like to convert these two columns into one (format should be 'YYYY-MM') and set it as the index column.
So far I tried this:
data_two = pd.read_csv('data/archive.csv', sep=',', parse_dates=[['Year','Month']], date_parser=lambda x: pd.to_datetime(x, format="%Y%M"), index_col="date_time")
As the format you are requesting (YYYY-MM, i.e. %Y-%m; note that %M in your date_parser means minutes, not months) is not a full datetime representation, you could simply skip parsing dates and do this:
import pandas as pd
from io import StringIO

temp = u'''\
Year,Month,Col
1958,7,2
1991,6,4'''

# Read sample dataframe
df = pd.read_csv(StringIO(temp), sep=',')
# Set index
df = (df.set_index(df.Year.astype(str) + "-" + df.Month.astype(str).str.zfill(2))
        .drop(['Month', 'Year'], axis=1))
print(df)
Prints:
         Col
1958-07    2
1991-06    4
The alternative is to do this:
df = pd.read_csv(StringIO(temp),
                 parse_dates=[['Year', 'Month']],
                 index_col="Year_Month")
df.index = df.index.strftime("%Y-%m")
First, if you need a DatetimeIndex, set index_col to Year_Month:
import pandas as pd
from io import StringIO

temp = u"""Year,Month,Col
1958,7,2
1991,6,4"""
# after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 parse_dates=[['Year', 'Month']],
                 index_col="Year_Month")
print (df)
            Col
Year_Month
1958-07-01    2
1991-06-01    4
print (df.index)
DatetimeIndex(['1958-07-01', '1991-06-01'], dtype='datetime64[ns]',
              name='Year_Month', freq=None)
EDIT:
If you need a string index (YYYY-MM), first create a MultiIndex from both columns and then join the levels with a list comprehension:
import pandas as pd
from io import StringIO

temp = u"""Year,Month,Col
1958,7,2
1991,6,4"""
# after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),
                 index_col=['Year', 'Month'])
print (df)
            Col
Year Month
1958 7        2
1991 6        4
df.index = ['{}-{:02d}'.format(i,j) for i,j in df.index]
print (df)
         Col
1958-07    2
1991-06    4
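As a side note (not from the original answers), a monthly PeriodIndex gives the YYYY-MM representation while keeping an index you can still slice and shift like dates:
import pandas as pd
from io import StringIO

temp = u"""Year,Month,Col
1958,7,2
1991,6,4"""

df = pd.read_csv(StringIO(temp),
                 parse_dates=[['Year', 'Month']],
                 index_col='Year_Month')
# a monthly PeriodIndex prints as YYYY-MM but still behaves like dates:
df.index = df.index.to_period('M')
print(df)
#             Col
# Year_Month
# 1958-07       2
# 1991-06       4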

How to convert data of type Panda to Panda.Dataframe?

I have an object whose type is Pandas, and print(object) gives the output below:
print(type(recomen_total))
print(recomen_total)
Output is
<class 'pandas.core.frame.Pandas'>
Pandas(Index=12, instrument_1='XXXXXX', instrument_2='XXXX', trade_strategy='XXX', earliest_timestamp='2016-08-02T10:00:00+0530', latest_timestamp='2016-08-02T10:00:00+0530', xy_signal_count=1)
I want to convert this object to a pd.DataFrame. How can I do it?
I tried pd.DataFrame(object) and from_dict as well, but they throw errors.
Interestingly, it will not convert to a dataframe directly, but to a series. Once it is converted to a series, use the to_frame method of the Series to convert it to a DataFrame:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    print(pd.Series(row).to_frame())
Hope this helps!!
EDIT
In case you want to keep the column names, use the _asdict() method like this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]},
                  index=['a', 'b'])
for row in df.itertuples():
    d = dict(row._asdict())
    print(pd.Series(d).to_frame())
Output:
0
Index a
col1 1
col2 0.1
0
Index b
col1 2
col2 0.2
To create a new DataFrame from the itertuples() namedtuples you can use list() or Series too:
import pandas as pd

# source DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
# empty DataFrame
df_new_fromAppend = pd.DataFrame(columns=['x', 'y'], data=None)

for r in df.itertuples():
    # create new DataFrame from itertuples() via list() ([1:] skips the index):
    df_new_fromList = pd.DataFrame([list(r)[1:]], columns=['c', 'd'])
    # or create new DataFrame from itertuples() via Series
    # (drop(0) removes the index, T transposes column to row):
    df_new_fromSeries = pd.DataFrame(pd.Series(r).drop(0)).T
    # or use append-style .loc to insert the row into an existing DataFrame
    # ([1:] skips the index):
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)[1:]

print('df_new_fromList:')
print(df_new_fromList, '\n')
print('df_new_fromSeries:')
print(df_new_fromSeries, '\n')
print('df_new_fromAppend:')
print(df_new_fromAppend, '\n')
Output:
df_new_fromList:
c d
0 2 4
df_new_fromSeries:
1 2
0 2 4
df_new_fromAppend:
x y
0 1 3
1 2 4
To omit the index, use the param index=False (though I usually need the index for the iteration):
for r in df.itertuples(index=False):
    # the [1:] isn't needed here, for example:
    df_new_fromAppend.loc[df_new_fromAppend.shape[0]] = list(r)
The following works for me:
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]}, index=['a', 'b'])
for row in df.itertuples():
    row_as_df = pd.DataFrame.from_records([row], columns=row._fields)
    print(row_as_df)
The result is:
  Index  col1  col2
0     a     1   0.1
  Index  col1  col2
0     b     2   0.2
Sadly, AFAIU, there's no simple way to keep the column names without explicitly utilizing "protected attributes" such as _fields.
With some tweaks to @Igor's answer, I ended up with this satisfactory code, which preserves the column names and uses as little pandas-specific code as possible.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [0.1, 0.2]})
# Or initialize another dataframe above

# Get list of column names
column_names = df.columns.values.tolist()

filtered_rows = []
for row in df.itertuples(index=False):
    # Some code logic to filter rows
    filtered_rows.append(row)

# Convert the pandas.core.frame.Pandas namedtuples back to a DataFrame:
# combine the filtered rows into a single dataframe
concatenated_df = pd.DataFrame.from_records(filtered_rows, columns=column_names)
concatenated_df.to_csv("path_to_csv", index=False)
The result is a csv containing:
col1 col2
1 0.1
2 0.2
To convert a list of objects returned by Pandas .itertuples to a DataFrame, while preserving the column names:
# Example source DF
data = [['cheetah', 120], ['human', 44.72], ['dragonfly', 54]]
source_df = pd.DataFrame(data, columns=['animal', 'top_speed'])
      animal  top_speed
0    cheetah     120.00
1      human      44.72
2  dragonfly      54.00
Since pandas does not recommend building DataFrames by adding single rows in a for loop, we iterate first and build the DataFrame at the end:
WOW_THAT_IS_FAST = 50
list_ = list()
for animal in source_df.itertuples(index=False, name='animal'):
    if animal.top_speed > WOW_THAT_IS_FAST:
        list_.append(animal)
Now build the DF in a single command and without manually recreating the column names.
filtered_df = pd.DataFrame(list_)
      animal  top_speed
0    cheetah     120.00
2  dragonfly      54.00

How to read index data as string with pandas.read_csv()?

I'm trying to read a csv file into a DataFrame with pandas, and I want to read the index column as strings. However, since the index column contains only digits, pandas parses it as integers. How can I read it as strings?
Here are my csv file and code:
[sample.csv]
uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30
[code]
df = pd.read_csv('sample.csv', index_col="uid", dtype=float)
print(df.index.values)
The result: df.index is integer, not string:
>>> [1 2 3]
But I want to get df.index as string:
>>> ['01', '02', '03']
And an additional condition: the rest of the columns have to be numeric values, and there are too many of them to list by specific column names.
Pass the dtype param to specify the dtype:
In [159]:
import pandas as pd
import io
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
df = pd.read_csv(io.StringIO(t), dtype={'uid':str})
df.set_index('uid', inplace=True)
df.index
Out[159]:
Index(['01', '02', '03'], dtype='object', name='uid')
So in your case the following should work:
df = pd.read_csv('sample.csv', dtype={'uid':str})
df.set_index('uid', inplace=True)
The one-line equivalent doesn't work, due to a pandas bug (still outstanding at the time of writing) where the dtype param is ignored on cols that are to be treated as the index:
df = pd.read_csv('sample.csv', dtype={'uid':str}, index_col='uid')
You can do this dynamically if we assume the first column is the index column:
In [171]:
t="""uid,f1,f2,f3
01,0.1,1,10
02,0.2,2,20
03,0.3,3,30"""
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
df = pd.read_csv(io.StringIO(t), dtype=dtypes)
df.set_index('uid', inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 01 to 03
Data columns (total 3 columns):
f1 3 non-null float64
f2 3 non-null float64
f3 3 non-null float64
dtypes: float64(3)
memory usage: 96.0+ bytes
In [172]:
df.index
Out[172]:
Index(['01', '02', '03'], dtype='object', name='uid')
Here we read just the header row to get the column names:
cols = pd.read_csv(io.StringIO(t), nrows=1).columns.tolist()
we then generate a dict of the column names with the desired dtypes:
index_col_name = cols[0]
dtypes = dict(zip(cols[1:], [float]* len(cols[1:])))
dtypes[index_col_name] = str
We take the index name (assuming it's the first entry), create a dict mapping the remaining cols to float as the desired dtype, and add the index col with type str; you can then pass this dict as the dtype param to read_csv.
If the result is not a string, you have to convert it. Try:
result = [str(i) for i in result]
or in this case:
print([str(i) for i in df.index.values])
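Note that if the index was already parsed as integers, str(i) yields '1', '2', '3' and the leading zeros are gone, so they have to be restored explicitly. A small sketch, assuming two-digit uids as in the sample:
import pandas as pd

df = pd.read_csv('sample.csv', index_col='uid')  # index parsed as int64
# convert back to strings and restore the two-digit zero padding:
df.index = df.index.astype(str).str.zfill(2)
print(df.index)  # Index(['01', '02', '03'], dtype='object', name='uid')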
