I have the following code (Python 2.7):
df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
for data in as_completed(futh):
df = df.append(data.result(), ignore_index=True)
cuttent_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = cuttent_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
compression='gzip',
quoting=QUOTE_NONNUMERIC)
This creates a compressed csv file from the data stream.
Now, I want to make sure that the column in the file are the ones I expect (order does not matter). Meaning that if for any reason the data stream contains more columns than this columns will be removed. Note that I add a column of my own to the data stream called timestamp.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware that there is del df['column_name'] option but this doesn't work for me as I have no idea what will be the redundant column name.
I'm looking for something like:
if col_name not it cols_list:
del df[???] #delete column and it's data.
print [???] #print the name of the redundant column for log
I think there are two approaches here:
not to add the redundant column to the df in the first place.
remove the redundant column after the df.append is finished.
I prefer the 1st option as it should be with better performance (?)
One of my attempts was:
for i, data in enumerate(df):
for col_name in cols_list:
if col_name not in data.keys():
del df[col_name ]
but it doesn't work..
if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'
I'm not sure I enumerate over df itself
If you want to make your attempt with for loop works, try this:
for col_name in df.columns:
if col_name not in cols_list:
del df[col_name]
Removing the redundant column after the df.append is finished is quite simple:
df = df[cols_list]
As for the first suggestion, you could apply the statement described above before appending it to the df. However, you should note that this requires a pandas DataFrame object, so you would probably need to transform the data.result() to a pandas Dataframe first.
According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html there is a parameter 'usecols' which is described:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or [‘foo’, ‘bar’,
‘baz’]. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True. An example of a valid callable argument would be lambda x:
x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in
much faster parsing time and lower memory usage.
This is the answer to your problem.
I think need intersection by list of column namess and then filter by subset with []:
cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
Related
I have a column named keywords in my pandas dataset. The values of the column are like this :
[jdhdhsn, cultuere, jdhdy]
I want my output to be
jdhdhsn, cultuere, jdhdy
Try this
keywords = [jdhdhsn, cultuere, jdhdy]
if(isinstance(keyword, list)):
output = ','.join(keywords)
else:
output = keywords[1:-1]
The column of your dataframe seems to be a list
Lists are formatted with brackets and each elements of that list's repr()
Pandas has built in functions for dealing with strings
df['column_name'].str let's you use each element in the column and apply a str function on them. Just like ', '.join(['foo', 'bar', 'baz'])
Thus df['column_name_str'] = df['column_name'].str.join(', ') will produce a new column with the formatting you're after.
You can also use the .apply to perform arbitrary lambda functions on a column, such as:
df['column_name'].apply(lambda row: ', '.join(row))
But since pandas has the .str built in this isn't needed for this example.
Try this
data = ["[jdhdhsn, cultuere, jdhdy]"]
df = pd.DataFrame(data, columns = ["keywords"])
new_df = df['keywords'].str[1:-1]
print(df)
print(new_df)
Question is quite self explanatory.Is there any way to read the csv file to read the time series data skipping first column.?
I tried this code:
df = pd.read_csv("occupancyrates.csv", delimiter = ',')
df = df[:,1:]
print(df)
But this is throwing an error:
"TypeError: unhashable type: 'slice'"
If you know the name of the column just do:
df = pd.read_csv("occupancyrates.csv") # no need to use the delimiter = ','
df = df.drop(['your_column_to_drop'], axis=1)
print(df)
df = pd.read_csv("occupancyrates.csv")
df.pop('column_name')
dataframe is like a dictionary, where column names are keys & values are the column items. For Ex
d = dict(a=1,b=2)
d.pop('a')
Now if you print d, the output will be
{'b': 2}
This is what I have done above to remove a column out of data frame.
By doing this way you need not to assign it back to dataframe like other answer(s)
df = df.iloc[:, 1:]
Or you don't even need to specify inplace=True anywhere
The simplest way to delete the first column should be:
del df[df.columns[0]]
or
df.pop(df.columns[0])
I have some data in text file that I am reading into Pandas. A simplified version of the txt read in is:
idx_level1|idx_level2|idx_level3|idx_level4|START_NODE|END_NODE|OtherData...
353386066294006|1142|2018-09-20T07:57:26Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:26Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:26Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:31Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:31Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:31Z|3|18260005359901|18260004567689|...
353386066294006|1142|2018-09-20T07:57:36Z|1|18260004567689|18260005575180|...
353386066294006|1142|2018-09-20T07:57:36Z|2|18260004567689|18260004240718|...
353386066294006|1142|2018-09-20T07:57:36Z|3|18260005359901|18260004567689|...
353386066736543|22|2018-04-17T07:08:23Z||||...
353386066736543|22|2018-04-17T07:08:24Z||||...
353386066736543|22|2018-04-17T07:08:25Z||||...
353386066736543|22|2018-04-17T07:08:26Z||||...
353386066736543|403|2018-07-02T16:55:07Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:07Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:07Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:07Z|6|18260004283163|18260006215338|...
353386066736543|403|2018-07-02T16:55:01Z|1|18260004580350|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|2|18260005235340|18260005141535|...
353386066736543|403|2018-07-02T16:55:01Z|3|18260005235340|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|4|18260006215338|18260005235340|...
353386066736543|403|2018-07-02T16:55:01Z|5|18260004483352|18260005945439|...
353386066736543|403|2018-07-02T16:55:01Z|6|18260004283163|18260006215338|...
And the code I use to read in is as follows:
mydata = pd.read_csv('/myloc/my_simple_data.txt', sep='|',
dtype={'idx_level1': 'int',
'idx_level2': 'int',
'idx_level3': 'str',
'idx_level4': 'float',
'START_NODE': 'str',
'END_NODE': 'str',
'OtherData...': 'str'},
parse_dates = ['idx_level3'],
index_col=['idx_level1','idx_level2','idx_level3','idx_level4'])
What I really want to do is have a seperate panadas DataFrames for each unique idx_level1 & idx_level2 value. So in the above example there would be 3 DataFrames pertaining to idx_level1|idx_level2 values of 353386066294006|1142, 353386066736543|22 & 353386066736543|403 respectively.
Is it possible to read in a text file like this and output each change in idx_level2 to a new Pandas DataFrame, maybe as part of some kind of loop? Alternatively, what would be the most efficient way to split mydata into DataFrame subsets, given that everything I have read suggests that it is inefficient to iterate through a DataFrame.
Read your dataframe as you are currently doing then groupby and use list comprehension:
group = mydata.groupby(level=[0,1])
dfs = [group.get_group(x) for x in group.groups]
you can call your dataframes by doing dfs[0] and so on
To specifically address your last paragraph, you could create a dict of dfs, based on unique values in the column using something like:
import copy
dict = {}
cols = df[column].unique()
for value in col_values:
key = 'df'+str(value)
dict[key] = copy.deepcopy(df)
dict[key] = dict[key][df[column] == value]
dict[key].reset_index(inplace = True, drop = True)
where column = idx_level2
Read the table as-it-is and use groupby, for instance:
data = pd.read_table('/myloc/my_simple_data.txt', sep='|')
groups = dict()
for group, subdf in data.groupby(data.columns[:2].tolist()):
groups[group] = subdf
Now you have all the sub-data frames in a dictionary whose keys are a tuple of the two indexers (eg: (353386066294006, 1142))
I am creating a dataframe from a CSV file. I have gone through the docs, multiple SO posts, links as I have just started Pandas but didn't get it. The CSV file has multiple columns with same names say a.
So after forming dataframe and when I do df['a'] which value will it return? It does not return all values.
Also only one of the values will have a string rest will be None. How can I get that column?
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
by default, all of your 'a' columns get named 'a.0'...'a.N' as specified above.
if you used mangle_dupe_cols=False, importing this csv would produce an error.
you can get all of your columns with
df.filter(like='a')
demonstration
from StringIO import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
That's what I usually do with my genes expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
create a list of the duplicated columns in my dataframe (refers to column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
if list_of_all_columns.count(column) > 1 and not column in duplicated_columns_list:
duplicated_columns_list.append(column)
duplicated_columns_list
Use the function .index() that helps me to find the first element that is duplicated on each iteration and underscore it:
for column in duplicated_columns_list:
list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop helps me to underscore all of the duplicated columns and now every column has a distinct name.
This specific code is relevant for columns that appear exactly 2 times, but it can be modified for columns that appear even more than 2 times in your dataframe.
Finally, rename your columns with the underscored elements:
df.columns = list_of_all_columns
That's it, I hope it helps :)
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind corresponds to the index of the column according how they are ordered in the df. You can find the indices using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
I have a pandas dataframe which I have created from data stored in an xml file:
Initially the xlm file is opened and parsed
xmlData = etree.parse(filename)
trendData = xmlData.findall("//TrendData")
I created a directory which lists all the data names (which are used as column names) as keys and gives the position of the data in the xml file:
Parameters = {"TreatmentUnit":("Worklist/AdminData/AdminValues/TreatmentUnit"),
"Modality":("Worklist/AdminData/AdminValues/Modality"),
"Energy":("Worklist/AdminData/AdminValues/Energy"),
"FieldSize":("Worklist/AdminData/AdminValues/Fieldsize"),
"SDD":("Worklist/AdminData/AdminValues/SDD"),
"Gantry":("Worklist/AdminData/AdminValues/Gantry"),
"Wedge":("Worklist/AdminData/AdminValues/Wedge"),
"MU":("Worklist/AdminData/AdminValues/MU"),
"My":("Worklist/AdminData/AdminValues/My"),
"AnalyzeParametersCAXMin":("Worklist/AdminData/AnalyzeParams/CAX/Min"),
"AnalyzeParametersCAXMax":("Worklist/AdminData/AnalyzeParams/CAX/Max"),
"AnalyzeParametersCAXTarget":("Worklist/AdminData/AnalyzeParams/CAX/Target"),
"AnalyzeParametersCAXNorm":("Worklist/AdminData/AnalyzeParams/CAX/Norm"),
....}
This is just a small part of the directory, the actual one list over 80 parameters
The directory keys are then sorted:
sortedKeys = list(sorted(Parameters.keys()))
A header is created for the pandas dataframe:
dateList=[]
dateList.append('date')
headers = dateList+sortedKeys
I then create an empty pandas dataframe with the same number of rows as the number of records in trendData and with the column headers set to 'headers' and then loop through the file filling the dataframe:
df = pd.DataFrame(index=np.arange(0,len(trendData)), columns=headers)
for a,b in enumerate(trendData):
result={}
result["date"] = dateutil.parser.parse(b.attrib['date'])
for i,j in enumerate(Parameters):
result[j] = b.findtext(Parameters[j])
df.loc[a]=(result)
df = df.set_index('date')
This seems to work fine but the problem is that the dtype for each colum is set to 'object' whereas most should be integers. It's possible to use:
df.convert_objects(convert_numeric=True)
and it works fine but is now depricated.
I can also use, for example, :
df.AnalyzeParametersBQFMax = pd.to_numeric(df.AnalyzeParametersBQFMax)
to convert individual columns. But is there a way of using pd.to_numeric with a list of column names. I can create a list of columns which should be integers using the following;
int64list=[]
for q in sortedKeys:
if q.startswith("AnalyzeParameters"):
int64list.append(q)
but cant find a way of passing this list to the function.
You can explicitly replace columns in a DataFrame with the same column just with another dtype.
Try this:
import pandas as pd
data = pd.DataFrame({'date':[2000, 2001, 2002, 2003], 'type':['A', 'B', 'A', 'C']})
data['date'] = data['date'].astype('int64')
when now calling data.dtypes it should return the following:
date int64
type object
dtype: object
for multiple columns use a for loop to run through the int64list you mentioned in your question.
for multiple columns you can do it this way:
cols = df.filter(like='AnalyzeParameters').columns.tolist()
df[cols] = df[cols].astype(np.int64)