Extracting specific columns from a pandas DataFrame - python

I'm trying to use Python to read my CSV file, extract specific columns into a pandas DataFrame, and display that DataFrame. However, instead of the DataFrame I receive Series([], dtype: object) as output. Below is the code that I'm working with.
My document consists of the following columns:
product, sub_product, issue, sub_issue, consumer_complaint_narrative, company_public_response, company, state, zipcode, tags, consumer_consent_provided, submitted_via, date_sent_to_company, company_response_to_consumer, timely_response, consumer_disputed?, complaint_id
I want to extract:
sub_product, issue, sub_issue, consumer_complaint_narrative
import pandas as pd
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df = df.stack(level=0)
df2 = df.filter(regex='[B-F]')
df[df2]

import pandas as pd

input_file = "C:\\....\\consumer_complaints.csv"
df = pd.read_csv(input_file)   # read_csv already returns a DataFrame
cols = [1, 2, 3, 4]
df = df[df.columns[cols]]
Here, specify the column numbers you want to select (in a DataFrame, columns start at index 0):
cols = []
You can also select columns by name. Just use the following line:
df = df[["Column Name", "Column Name2"]]
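Alternatively, pandas can restrict columns at read time with read_csv's usecols parameter, which avoids loading the rest of the file. A minimal sketch, using the column names from the question:

import pandas as pd

wanted = ['sub_product', 'issue', 'sub_issue', 'consumer_complaint_narrative']
df = pd.read_csv("C:\\....\\consumer_complaints.csv", usecols=wanted)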

A simple way to achieve this would be as follows:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df2 = df.loc[:, 'sub_product':'consumer_complaint_narrative']
Note that .loc slices columns by label, and both endpoints are included. Hope that helps.

This worked for me, using positional slicing with .iloc:
df = pd.read_csv("C:\\....\\consumer_complaints.csv")
df1 = df.iloc[:, n1:n2]
where n1 < n2 are column positions in the range (note that plain df[n1:n2] slices rows, not columns). E.g., if you want columns 3 and 4 (0-based, end of the slice exclusive), use
df1 = df.iloc[:, 3:5]
For the first column, use
df1 = df.iloc[:, 0]
Though I'm not sure how to select a discontinuous range of columns.
We can also use .iloc. Given data in dataset2:
dataset2.iloc[:3, [1, 2]]
will return the top 3 rows of the second and third columns (remember, numbering starts at 0).
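As for the open question about a discontinuous range of columns: pass an explicit list of positions to .iloc, or combine ranges with NumPy's np.r_. A minimal sketch (the positions are only examples):

import numpy as np

df1 = df.iloc[:, [0, 3, 5]]          # explicit non-contiguous positions
df1 = df.iloc[:, np.r_[1:3, 5:7]]    # columns 1-2 and 5-6 in one go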

Related

How to turn column headers into rows in order to plot in a chart?

This is what happens when I try df.T.plot; it seems to be pulling from the wrong dataframe:
df1 = open_res[['Name','6-Jun','16-Jun','26-Jun','6-Jul','16-Jul','26-Jul','5-Aug','15-Aug','4-Sep','14-Sep','24-Sep','30-Aug','4-Oct','14-Sep','24-Oct','3-Nov','13-Nov','23-Nov','3-Dec']]
df2 = df1.loc[df1['Name'] == 'Global']
df2
The data returned shows each date, in the format seen above, as a column header. How can I change this so that the dates can be plotted along the x axis?
The data is cleaned up because I just want the Global row.
You get that error because the first column is a string and your other columns are numeric, and when you transpose, everything is converted to string. Using some example data like yours:
import pandas as pd
import numpy as np

open_res = pd.DataFrame(np.random.uniform(0, 1, (2, 19)),
                        columns=['6-Jun','16-Jun','26-Jun','6-Jul','16-Jul','26-Jul','5-Aug','15-Aug',
                                 '4-Sep','14-Sep','24-Sep','30-Aug','4-Oct','14-Sep','24-Oct',
                                 '3-Nov','13-Nov','23-Nov','3-Dec'])
open_res['Name'] = ['Global', 'x']
df1 = open_res[['Name','6-Jun','16-Jun','26-Jun','6-Jul','16-Jul','26-Jul','5-Aug','15-Aug',
                '4-Sep','14-Sep','24-Sep','30-Aug','4-Oct','14-Sep','24-Oct',
                '3-Nov','13-Nov','23-Nov','3-Dec']]
df2 = df1.loc[df1['Name'] == 'Global']
If we transpose, everything becomes object:
df2.T.dtypes
0    object
dtype: object
Instead, you can do:
df2.set_index('Name').T.plot()
Setting 'Name' as the index first leaves only the numeric columns, so the transposed frame keeps numeric dtypes and plots with the dates along the x axis.
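A quick check, reusing the example data above, that the transposed frame now keeps its numeric dtype:

print(df2.set_index('Name').T.dtypes)
# Name
# Global    float64
# dtype: object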

Add new columns and new column names in python

I have a CSV file in the following format:
Date,Time,Open,High,Low,Close,Volume
09/22/2003,00:00,1024.5,1025.25,1015.75,1022.0,720382.0
09/23/2003,00:00,1022.0,1035.5,1019.25,1022.0,22441.0
10/22/2003,00:00,1035.0,1036.75,1024.25,1024.5,663229.0
I would like to add 20 new columns to this file; the value of each new column is synthetic, created simply by randomizing a set of numbers. It would be something like this:
import pandas as pd
from random import randrange

df = pd.read_csv('dataset.csv')
print(len(df))
input()
for i in range(len(df)):
    # Data that already exists
    date = df.values[i][0]
    time = df.values[i][1]
    open_value = df.values[i][2]
    high_value = df.values[i][3]
    low_value = df.values[i][4]
    close_value = df.values[i][5]
    volume = df.values[i][6]
    # This is the new data
    prediction_1 = randrange(3)
    prediction_2 = randrange(3)
    prediction_3 = randrange(3)
    prediction_4 = randrange(3)
    prediction_5 = randrange(3)
    prediction_6 = randrange(3)
    prediction_7 = randrange(3)
    prediction_8 = randrange(3)
    prediction_9 = randrange(3)
    prediction_10 = randrange(3)
    prediction_11 = randrange(3)
    prediction_12 = randrange(3)
    prediction_13 = randrange(3)
    prediction_14 = randrange(3)
    prediction_15 = randrange(3)
    prediction_16 = randrange(3)
    prediction_17 = randrange(3)
    prediction_18 = randrange(3)
    prediction_19 = randrange(3)
    prediction_20 = randrange(3)
    # How to concatenate these data row by row in a matrix?
    # How to add new column names and save the file?
I would like to concatenate them (the old plus the synthetic data), add 20 new column names 'synthetic1', 'synthetic2', ..., 'synthetic20' to the existing column names, and then save the resulting dataset in a new text file.
I could do that easily with NumPy, but here we have non-numeric data as well, so I don't know how to do it (or whether it is possible). Is it possible to do that with pandas or another library?
Here's a way you can do it:
import pandas as pd
import numpy as np

df = pd.read_csv('dataset.csv')
# the row count must match the number of rows in the existing df
n_row = len(df)
n_col = 20
f = pd.DataFrame(np.random.randint(100, size=(n_row, n_col)),
                 columns=['synthetic' + str(x) for x in range(1, n_col + 1)])
df = pd.concat([df, f], axis=1)   # axis=1 appends the new columns side by side
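To finish the task from the question and save the result to a new file (the output filename here is just an example):

df.to_csv('dataset_with_synthetic.csv', index=False)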

Wide to long returns empty output - Python dataframe

I have a dataframe which can be generated from the code given below:
df = pd.DataFrame({'person_id': [1, 2, 3],
                   'date1': ['12/31/2007', '11/25/2009', '10/06/2005'], 'val1': [2, 4, 6],
                   'date2': ['12/31/2017', '11/25/2019', '10/06/2015'], 'val2': [1, 3, 5],
                   'date3': ['12/31/2027', '11/25/2029', '10/06/2025'], 'val3': [7, 9, 11]})
I followed the solution below to convert it from wide to long:
pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', j='grp').sort_index(level=0)
Though this works with the sample data shown above, it doesn't work with my real data, which has more than 200 columns. Instead of person_id, my real data has subject_ID, with values like DC0001, DC0002, etc. Does i always have to be numeric? Instead of reshaping, it adds the stub values as new columns in my dataset and returns zero rows.
My real data might contain NAs as well, so do I have to fill them with default values for wide_to_long to work?
Can you please help me figure out the issue? Any other approach that achieves the same result would also be helpful.
Try adding the additional argument to the function that allows string suffixes:
pd.wide_to_long(......................., suffix='\w+')
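For illustration, a minimal sketch (with a hypothetical frame) of why suffix matters: by default wide_to_long only matches numeric suffixes, so stub names followed by letters need suffix='\w+':

wide = pd.DataFrame({'id': [1, 2], 'valone': [1, 2], 'valtwo': [3, 4]})
long_df = pd.wide_to_long(wide, stubnames=['val'], i='id', j='grp', suffix='\w+')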
The issue is with your column names: the numbers used to convert from wide to long need to be at the end of your column names, or you need to specify a suffix. I think the easiest solution is to create a function that accepts a regex and the dataframe.
import pandas as pd
import re

def change_names(df, regex):
    # Select one of the three column groups
    old_cols = df.filter(regex=regex).columns
    # Create a list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col)  # returns a list like ['1']
        # Make the new column name
        new_col = stub + num[0]
        new_cols.append(new_col)
    # Create a dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename the columns
    df.rename(columns=dd, inplace=True)
    return df
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')
print(tdf)
   person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
This is quite a late answer to this question, but I'm putting the solution here in case someone else finds it useful.
tdf = pd.DataFrame({'person_id': [1, 2, 3],
                    'h1date': ['12/31/2007', '11/25/2009', '10/06/2005'], 't1val': [2, 4, 6],
                    'h2date': ['12/31/2017', '11/25/2019', '10/06/2015'], 't2val': [1, 3, 5],
                    'h3date': ['12/31/2027', '11/25/2029', '10/06/2025'], 't3val': [7, 9, 11]})
## You can use m13op22's solution to rename your columns so that the numeric part
## is at the end of the column name. This is important.
tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                          'h2date': 'hdate2', 't2val': 'tval2',
                          'h3date': 'hdate3', 't3val': 'tval3'})
## Then use the non-numeric portion (in this example 'hdate', 'tval') as the
## stubnames. The mistake was using ['date', 'val'] as stubnames.
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
print(df)
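On the question of whether i always has to be numeric: it does not; any column that uniquely identifies the rows works. A quick sketch, reusing the renamed tdf above with string IDs like those in the question:

tdf['person_id'] = ['DC0001', 'DC0002', 'DC0003']
df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)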

How to obtain the content of a pandas multilevel index entry?

I set up a pandas DataFrame that, besides my data, stores the respective units, using a MultiIndex on the columns like this:
Name        Relative_Pressure  Volume_STP
Unit                        -       ccm/g
Description              p/p0
0                    0.042691     29.3601
1                    0.078319     30.3071
2                    0.129529     31.1643
3                    0.183355     31.8513
4                    0.233435     32.3972
5                    0.280847     32.8724
Now I can, for example, extract only the Volume_STP data with df.Volume_STP:
Unit          ccm/g
Description
0           29.3601
1           30.3071
2           31.1643
3           31.8513
4           32.3972
5           32.8724
With .values I can obtain a NumPy array of the data. But how can I get the stored unit? I can't figure out what I need to do to retrieve the stored ccm/g string.
EDIT: added an example of how the data frame is generated.
Let's say I have a string that looks like this:
Relative Volume # STP
Pressure
cc/g
4.26910e-02 29.3601
7.83190e-02 30.3071
1.29529e-01 31.1643
1.83355e-01 31.8513
2.33435e-01 32.3972
2.80847e-01 32.8724
3.34769e-01 33.4049
3.79123e-01 33.8401
I then use this function:
def read_result(contents, columns, units, descr):
    df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,
                     index_col=False, header=None)
    df.drop(df.index[-1], inplace=True)
    index = pd.MultiIndex.from_arrays((columns, units, descr))
    df.columns = index
    df.columns.names = ['Name', 'Unit', 'Description']
    df = df.apply(pd.to_numeric)
    return df
like this:
def isotherm(contents):
    columns = ['Relative_Pressure', 'Volume_STP']
    units = ['-', 'ccm/g']
    descr = ['p/p0', '']
    df = read_result(contents, columns, units, descr)
    return df
to generate the DataFrame at the beginning of my question.
As df has a MultiIndex as columns, df.Volume_STP is still a pandas DataFrame. So you can still access its columns attribute, and the relevant item will be at index 0, because the dataframe contains only one Series.
So you can extract the names this way:
print(df.Volume_STP.columns[0])
which should give: ('ccm/g', '')
In the end you extract the unit with .columns[0][0] and the description with .columns[0][1].
You can do something like this:
df.xs('Volume_STP', axis=1).columns.remove_unused_levels().get_level_values(0).tolist()[0]
Output:
'ccm/g'
This slices the 'Volume_STP' part out of the dataframe using xs, takes its columns, removes the unused parts of the column headers, and then gets the values for the topmost remaining level, which is the Unit. Convert to a list and select the first value.
A generic way of accessing values in a MultiIndex, whether on the index or the columns, is the get_level_values method.
In your example, try df.columns.get_level_values(1) to access the second level of the multi-level columns, 'Unit'. If you have already selected a column, say 'Volume_STP', then you have removed the top level; in that case, the units are in the 0th level.
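A minimal sketch of both approaches, assuming the df built by read_result above:

unit = df['Volume_STP'].columns.get_level_values(0)[0]   # 'ccm/g'
# map every Name to its Unit in one pass
units = dict(zip(df.columns.get_level_values('Name'),
                 df.columns.get_level_values('Unit')))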

Changing the dtype for specific columns in a pandas dataframe

I have a pandas dataframe which I have created from data stored in an XML file.
First the XML file is opened and parsed:
from lxml import etree  # assumption: lxml, since findall("//TrendData") is used

xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (used as column names) as keys and gives the position of the data in the XML file as values:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
headers = ['date'] + sortedKeys
I then create an empty pandas dataframe with the same number of rows as there are records in trendData, with the column headers set to headers, and then loop through the file, filling the dataframe:
df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for j in Parameters:
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype for each column is set to object, whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column cast to another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
Calling data.dtypes now should return the following:
date     int64
type    object
dtype: object
For multiple columns, run a for loop through the int64list you mentioned in your question.
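A minimal sketch of that loop, assuming int64list is built as in the question, and using pd.to_numeric rather than the deprecated convert_objects:

for col in int64list:
    df[col] = pd.to_numeric(df[col])  # cast each listed column to a numeric dtype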
For multiple columns you can also do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
