Read dataframe in pandas skipping first column to read time series data - python

The question is quite self-explanatory. Is there any way to read a CSV file of time-series data while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"

If you know the name of the column, just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)

df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where the column names are the keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I have done above to remove a column from the dataframe.
Done this way, you do not need to assign the result back to the dataframe, as in the other answer(s):
df = df.iloc[:, 1:]
Nor do you even need to specify inplace=True anywhere.

The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
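If you want to skip the first column at read time instead of dropping it afterwards, here is a minimal sketch (assuming the file has a header row; the header is read first so the column names need not be hard-coded):
import pandas as pd

# Read just the header row to discover the column names,
# then re-read the file keeping every column except the first.
header_cols = pd.read_csv("occupancyrates.csv", nrows=0).columns
df = pd.read_csv("occupancyrates.csv", usecols=header_cols[1:])
print(df.head())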

how to extract data from a cell with df into a new column with dict format pandas

The CSV is read into a df:
import pandas as pd
df = pd.read_csv('loves_1.csv')
In the column FuelPrices you'll see another df:
df1 = pd.DataFrame(df['FuelPrices'][0])
df1
So, how do I extract the values of LastPriceChangeDateTime and CashPrice as a key:value pair into a new column of the main df, for DIESEL only (df['diesel_price_change'])?
Eventually, I want to append to that column a dict of LastPriceChangeDateTime: CashPrice every time the price changes.
I tried to loop with a bunch of parameters, but it seems like something is messed up:
for index, row in df.iterrows():
    dfnew = pd.DataFrame(df['FuelPrices'][index])
    dfnew['price_change'] = dfnew.apply(lambda row: {row['LastPriceChangeDateTime']: row['CashPrice']}, axis=1)
    df['diesel_price_change'][index] = dfnew.apply(lambda x: y['price_change'] for y in x if y['ProductName'] == 'DIESEL')
I receive "'int' object is not iterable".
Unfortunately, the only way I found is to loop through it, but I still hope to find a pandas solution for it:
for index, row in df.iterrows():
    for row in df['FuelPrices'][index]:
        if row['ProductName'] == 'DIESEL':
            df['diesel_price_change'][index] = {row['LastPriceChangeDateTime']: row['CashPrice']}
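For what it's worth, here is a loop-free sketch of the same idea using apply (assuming each cell of FuelPrices holds a list of dicts as shown in the question; diesel_change is a hypothetical helper name):
def diesel_change(prices):
    # Return {LastPriceChangeDateTime: CashPrice} for the DIESEL entry, if any.
    for p in prices:
        if p['ProductName'] == 'DIESEL':
            return {p['LastPriceChangeDateTime']: p['CashPrice']}
    return None

df['diesel_price_change'] = df['FuelPrices'].apply(diesel_change)
This also avoids the chained-assignment pattern df['diesel_price_change'][index] = ..., which pandas warns about.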
Can you try this:
df['test_v1']=df['FuelPrices'].apply(lambda x: {x[0]['LastPriceChangeDateTime']:x[0]['CashPrice']})
If you are getting TypeError: string indices must be integers, use:
import ast
df['FuelPrices']=df['FuelPrices'].apply(ast.literal_eval)
df['test_v1']=df['FuelPrices'].apply(lambda x: {x[0]['LastPriceChangeDateTime']:x[0]['CashPrice']})

How to remove column with number as index name?

I have the following dataframe:
I tried to drop the data of the -1 column by using:
df = df.drop(columns=['-1'])
However, it is giving me the following error:
I was able to drop a column whose name is a string of letters using this same kind of script, but not one named with a number. What am I doing wrong?
You can check the real column names by converting them to a list:
print (df.columns.tolist())
I think you need to drop the number -1 instead of the string '-1':
df = df.drop(columns=[-1])
Or another solution with the same output:
df = df.drop(-1, axis=1)
EDIT:
If you need to select all columns except the first, use DataFrame.iloc to select by position: the first : means select all rows, and 1: means all columns except the first:
df = df.iloc[:, 1:]
If you are just trying to remove the first column, another approach that would be independent of the column name is this:
df = df[df.columns[1:]]
You can do it simply by using the following code.
First, check the name of the column:
df.columns
Then, if the output is like:
Index(['-1', '0'], dtype='object')
use the drop command to delete the column:
df.drop(['-1'], axis=1, inplace=True)
I guess this should help in the future as well.

Create a dictionary from pandas empty dataframe with only column names

I have a pandas dataframe with only two column names (a single row, which can also be considered the header). I want to make a dictionary out of this, with the first column name being the key and the second column name being the value. I already tried the to_dict() method, but it's not working as it's an empty dataframe.
Example
df = |Land|Norway|  ->  {'Land': 'Norway'}
I can change the pandas data frame to some other type and find my way around it, but this question is mostly to learn the best/different/efficient approach for this problem.
For now I have this as the solution:
dict(zip(a.iloc[0:0,0:1],a.iloc[0:0,1:2]))
Is there any other way to do this?
Here's a simple way: convert the columns to a list and the list to a dictionary:
import pandas as pd

def list_to_dict(a):
    # Pair up consecutive items: [k1, v1, k2, v2, ...] -> {k1: v1, k2: v2, ...}
    it = iter(a)
    ret_dict = dict(zip(it, it))
    return ret_dict

df = pd.DataFrame([], columns=['Land', 'Norway'])
dict_val = list_to_dict(df.columns.to_list())
dict_val  # {'Land': 'Norway'}
A very manual solution:
df = pd.DataFrame(columns=['Land', 'Norway'])
df = pd.DataFrame({df.columns[0]: df.columns[1]}, index=[0])
If you have any number of columns and you want each sequential pair to have this transformation, try:
df = pd.DataFrame(dict(zip(df.columns[::2], df.columns[1::2])), index=[0])
Note: You will get an error if your DataFrame does not have at least two columns.

how to remove a column from Pandas dataframe using Python?

I have the following code (Python 2.7):
import math
from concurrent.futures import ThreadPoolExecutor, as_completed
from csv import QUOTE_NONNUMERIC
from datetime import datetime

import pandas as pd

df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count() / page_size) + 1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)
This creates a compressed csv file from the data stream.
Now, I want to make sure that the columns in the file are the ones I expect (order does not matter), meaning that if for any reason the data stream contains extra columns, those columns will be removed. Note that I add a column of my own, called timestamp, to the data stream.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware of the del df['column_name'] option, but this doesn't work for me as I have no idea what the redundant column's name will be.
I'm looking for something like:
if col_name not in cols_list:
    del df[???]  # delete the column and its data
    print [???]  # print the name of the redundant column for the log
I think there are two approaches here:
1. Don't add the redundant columns to the df in the first place.
2. Remove the redundant columns after the df.append is finished.
I prefer the 1st option as it should give better performance (?)
One of my attempts was:
for i, data in enumerate(df):
    for col_name in cols_list:
        if col_name not in data.keys():
            del df[col_name]
but it doesn't work:
if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'
I'm not sure I should be enumerating over df itself.
If you want to make your attempt with the for loop work, try this:
for col_name in df.columns:
    if col_name not in cols_list:
        del df[col_name]
Removing the redundant column after the df.append is finished is quite simple:
df = df[cols_list]
As for the first suggestion, you could apply the statement described above before appending the data to the df. However, you should note that this requires a pandas DataFrame object, so you would probably need to transform data.result() to a pandas DataFrame first.
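A minimal sketch of that first approach (the shape of data.result() is an assumption from the question; reindex keeps exactly the allowed columns and fills any missing ones with NaN):
for data in as_completed(futh):
    page_df = pd.DataFrame(data.result())          # make sure it is a DataFrame
    page_df = page_df.reindex(columns=cols_list)   # keep only the allowed columns
    df = df.append(page_df, ignore_index=True)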
According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html, there is a parameter usecols, which is described as follows:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.
This is the answer to your problem.
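For example, a sketch of the callable form (note this applies at read time, so it only helps if the data comes from a CSV file; "orders.csv" is a hypothetical file name):
cols_list = ['order_id', 'customer_id', 'date', 'price']
# The callable form silently skips extra columns instead of raising,
# unlike passing a plain list of names.
df = pd.read_csv("orders.csv", usecols=lambda c: c in cols_list)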
I think you need an intersection with the list of column names and then to filter by subset with []:
cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]

Changing the dtype for specific columns in a pandas dataframe

I have a pandas dataframe which I have created from data stored in an xml file:
Initially the xml file is opened and parsed:
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a dictionary which lists all the data names (which are used as column names) as keys and gives the position of the data in the xml file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the dictionary; the actual one lists over 80 parameters.
The dictionary keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as the number of records in trendData, with the column headers set to headers, and then loop through the file, filling the dataframe:
import dateutil.parser
import numpy as np
import pandas as pd

df = pd.DataFrame(index=np.arange(0, len(trendData)), columns=headers)
for a, b in enumerate(trendData):
    result = {}
    result["date"] = dateutil.parser.parse(b.attrib['date'])
    for i, j in enumerate(Parameters):
        result[j] = b.findtext(Parameters[j])
    df.loc[a] = result
df = df.set_index('date')
This seems to work fine, but the problem is that the dtype of each column is set to 'object', whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine, but it is now deprecated.
I can also use, for example:
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names? I can create a list of the columns which should be integers using the following:
int64list = []
for q in sortedKeys:
    if q.startswith("AnalyzeParameters"):
        int64list.append(q)
but I can't find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
When now calling data.dtypes, it should return the following:
date int64
type object
dtype: object
For multiple columns, use a for loop to run through the int64list you mentioned in your question, as sketched below.
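A minimal sketch of that loop (int64list built as in the question):
for col in int64list:
    df[col] = pd.to_numeric(df[col])
Or, without an explicit loop, apply pd.to_numeric column-wise to the subset:
df[int64list] = df[int64list].apply(pd.to_numeric)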
For multiple columns you can also do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)
