I'm trying to read in a CSV file that has columns without headers. Currently, my solution is
df = pd.read_csv("test.csv")
df = df[[col for col in df.columns if 'Unnamed' not in col]]
This seems a little hacky, and would fail if the file contains columns with the word 'Unnamed' in them. Is there a better way to do this?
The usecols argument of the read_csv function accepts a callable function as input. If you provide a function that evaluates to False for your undesired column headers, then these columns are dropped.
func = lambda x: not x.startswith('Unnamed: ')
df = pd.read_csv('test.csv', usecols=func)
I guess that this solution is not really fundamentally different from your original solution though.
Maybe you could rename those columns first?
df = pd.read_csv("test.csv")
df.columns = df.columns.str.replace(r'^Unnamed:.*', '', regex=True)
df = df[[col for col in df.columns if col]]
Still pretty hacky, but at least this replaces only the strings which start with "Unnamed:" with '' before filtering them.
I have come across this question many times on the internet, but there aren't many answers out there except for a few like the following:
Cannot rename the first column in pandas DataFrame
I approached it using the following:
df = df.rename(columns={df.columns[0]: 'Column1'})
Is there a better or cleaner way of doing the rename of the first column of a pandas dataframe? Or any specific column number?
You're already using a cleaner way in pandas.
It is sad that:
df.columns[0] = 'Column1'
is impossible, because Index objects do not support mutable assignment. It would raise a TypeError.
You still could do iterable unpacking:
df.columns = ['Column1', *df.columns[1:]]
Or:
df = df.set_axis(['Column1', *df.columns[1:]], axis=1)
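If you need a specific column number rather than just the first one, the same rename pattern generalizes. A minimal sketch (the frame and the position n here are purely illustrative):
import pandas as pd
df = pd.DataFrame(columns=['a', 'b', 'c'])
n = 1  # zero-based position of the column to rename
df = df.rename(columns={df.columns[n]: 'Column2'})
print(df.columns.tolist())  # ['a', 'Column2', 'c']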
Not sure if it's cleaner, but a possible idea is to convert the columns to a list and set the new value by index:
df = pd.DataFrame(columns=[4,7,0,2])
arr = df.columns.tolist()
arr[0] = 'Column1'
df.columns = arr
print (df)
Empty DataFrame
Columns: [Column1, 7, 0, 2]
Index: []
I have a Pandas dataframe with a lot of columns looking like p_d_d_c0, p_d_d_c1, ... p_d_d_g1, p_d_d_g2, ....
df =
a b c p_d_d_c0 p_d_d_c1 p_d_d_c2 ... p_d_d_g0 p_d_d_g1 ...
All these columns, which conform to the regex, need to be selected and their datatypes changed from object to float. In particular, the columns that look like p_d_d_c* and p_d_d_g* are all object types, and I would like to change them to float types. Is there a way to select columns in bulk using a regular expression and change them to float types?
I tried the answer from here, but it takes a lot of time and memory as I have hundreds of these columns.
df[df.filter(regex=("p_d_d_.*")).columns]
I also tried:
df.select(lambda col: col.startswith('p_d_d_g'), axis=1)
But, it gives an error:
AttributeError: 'DataFrame' object has no attribute 'select'
My Pandas version is 1.0.1
So, how to select columns in bulk and change their data types using regex?
Try this:
import pandas as pd
# sample dataframe
df = pd.DataFrame(data={"co1":[1,2,3,4], "co22":[4,3,2,1], "co3":[2,3,2,4], "abc":[5,4,3,2]})
# select all columns which have co in it
floatcols = [col for col in df.columns if "co" in col]
for floatcol in floatcols:
    df[floatcol] = df[floatcol].astype(float)
From the same link, and with some astype magic.
column_vals = df.columns.map(lambda x: x.startswith("p_d_d_"))
train_temp = df.loc(axis=1)[column_vals]
train_temp = train_temp.astype(float)
EDIT:
To modify the original dataframe, do something like this:
column_vals = [x for x in df.columns if x.startswith("p_d_d_")]
df[column_vals] = df[column_vals].astype(float)
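If you want the selection to be driven by the regex itself rather than startswith, here is a sketch using DataFrame.filter (the toy frame below is only illustrative; the real column names come from the question):
import pandas as pd
df = pd.DataFrame({'a': [1], 'p_d_d_c0': ['1.5'], 'p_d_d_g0': ['2.5']})
cols = df.filter(regex=r'^p_d_d_').columns   # select columns by regex
df[cols] = df[cols].astype(float)            # cast only those columns
print(df.dtypes)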
The question is quite self-explanatory. Is there any way to read the csv file so that the time series data is read while skipping the first column?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
A dataframe is like a dictionary, where the column names are keys and the columns are the values. For example:
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I did above to remove a column from the data frame.
This way you do not need to assign it back to the dataframe like in the other answer(s).
df = df.iloc[:, 1:]
This way you don't even need to specify inplace=True anywhere.
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
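If you'd rather not load the first column at all, you can also skip it at read time with usecols. A sketch, assuming the file has a header row (occupancyrates.csv is the file from the question):
import pandas as pd
# read just the header row to learn the column names
header = pd.read_csv("occupancyrates.csv", nrows=0).columns
# then parse the file again, skipping the first column entirely
df = pd.read_csv("occupancyrates.csv", usecols=header[1:])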
I have the following code (Python 2.7):
df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
current_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = current_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)
This creates a compressed csv file from the data stream.
Now, I want to make sure that the column in the file are the ones I expect (order does not matter). Meaning that if for any reason the data stream contains more columns than this columns will be removed. Note that I add a column of my own to the data stream called timestamp.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware that there is the del df['column_name'] option, but this doesn't work for me as I have no idea what the name of the redundant column will be.
I'm looking for something like:
if col_name not in cols_list:
    del df[???]  # delete the column and its data
    print [???]  # print the name of the redundant column for the log
I think there are two approaches here:
not to add the redundant column to the df in the first place.
remove the redundant column after the df.append is finished.
I prefer the 1st option as it should give better performance (?)
One of my attempts was:
for i, data in enumerate(df):
    for col_name in cols_list:
        if col_name not in data.keys():
            del df[col_name]
but it doesn't work..
if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'
I'm not sure if I should be enumerating over df itself.
If you want to make your attempt with for loop works, try this:
for col_name in df.columns:
    if col_name not in cols_list:
        del df[col_name]
Removing the redundant column after the df.append is finished is quite simple:
df = df[cols_list]
As for the first suggestion, you could apply the statement described above before appending it to the df. However, you should note that this requires a pandas DataFrame object, so you would probably need to transform data.result() to a pandas DataFrame first.
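For the first suggestion, a rough sketch of how the per-page filtering could look (the helper name is mine, and it assumes data.result() can be turned into a DataFrame):
import pandas as pd
cols_list = ['order_id', 'customer_id', 'date', 'price']
def keep_allowed(result):
    # make sure we are working with a DataFrame
    chunk = pd.DataFrame(result)
    # log whatever is not in the allowed list
    for name in chunk.columns.difference(cols_list):
        print(name)
    # keep only the allowed columns that are actually present
    return chunk[chunk.columns.intersection(cols_list)]
Inside the question's loop this would be used as:
df = df.append(keep_allowed(data.result()), ignore_index=True)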
According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html there is a parameter 'usecols' which is described:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’,
‘baz’]. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True. An example of a valid callable argument would be lambda x:
x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in
much faster parsing time and lower memory usage.
This is the answer to your problem.
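For completeness, this is how the callable form would look if the data were being read from a CSV in the first place (the file name here is made up):
import pandas as pd
cols_list = ['order_id', 'customer_id', 'date', 'price']
# only columns whose names are in cols_list are parsed at all
df = pd.read_csv("report.csv", usecols=lambda name: name in cols_list)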
I think you need the intersection of the columns with your list of column names and then filter by the subset with []:
cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
I'm not sure how to reset index after dropna(). I have
df_all = df_all.dropna()
df_all.reset_index(drop=True)
but after running my code, row index skips steps. For example, it becomes 0,1,2,4,...
The code you've posted already does what you want, but does not do it "in place." Try adding inplace=True to reset_index() or else reassigning the result to df_all. Note that you can also use inplace=True with dropna(), so:
df_all.dropna(inplace=True)
df_all.reset_index(drop=True, inplace=True)
Does it all in place. Or,
df_all = df_all.dropna()
df_all = df_all.reset_index(drop=True)
to reassign df_all.
You can chain methods and write it as a one-liner:
df = df.dropna().reset_index(drop=True)
You can reset the index to default using set_axis() as well.
df = df.dropna()
df = df.set_axis(range(len(df)))
set_axis() is especially useful if you want to reset the index to something other than the default, because as long as the lengths match, you can change the index to literally anything with it. For example, you can change it to first row, second row etc.
df = df.dropna()
df = df.set_axis(['first row', 'second row'])