I am creating a DataFrame from a CSV file. I have just started with pandas and have gone through the docs, multiple SO posts, and other links, but I still don't get it. The CSV file has multiple columns with the same name, say a.
After building the DataFrame, which values will df['a'] return? It does not return all of them.
Also, only one of the values will hold a string; the rest will be None. How can I get that column?
the relevant parameter is mangle_dupe_cols
from the docs
mangle_dupe_cols : boolean, default True
Duplicate columns will be specified as 'X.0'...'X.N', rather than 'X'...'X'
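By default, your duplicate 'a' columns are renamed to make them unique: in practice the first one keeps the name 'a' and the later ones become 'a.1'...'a.N'.
If you used mangle_dupe_cols=False, importing this CSV would produce an error (the parameter was later deprecated and removed in pandas 2.0).
You can get all of your 'a' columns with the call below; note that like='a' matches every column whose name contains 'a'.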
df.filter(like='a')
demonstration
from io import StringIO
import pandas as pd
txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""
df = pd.read_csv(StringIO(txt), skipinitialspace=True)
df
df.filter(like='a')
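A minimal sketch of the same demonstration for Python 3 and a recent pandas, showing the names the duplicated header actually receives. The regex passed to filter is my own addition (not from the answer above) to avoid also matching unrelated columns that merely contain the letter "a":

```python
from io import StringIO

import pandas as pd

txt = """a, a, a, b, c, d
1, 2, 3, 4, 5, 6
7, 8, 9, 10, 11, 12"""

df = pd.read_csv(StringIO(txt), skipinitialspace=True)

# Duplicate header names are made unique with a numeric suffix.
column_names = list(df.columns)
print(column_names)

# Select 'a' and its suffixed duplicates, but nothing else.
a_columns = df.filter(regex=r"^a(\.\d+)?$")
print(a_columns)
```

With this input, the three 'a' columns should come back as 'a', 'a.1' and 'a.2'.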
I had a similar issue, not due to reading from csv, but I had multiple df columns with the same name (in my case 'id'). I solved it by taking df.columns and resetting the column names using a list.
In : df.columns
Out:
Index(['success', 'created', 'id', 'errors', 'id'], dtype='object')
In : df.columns = ['success', 'created', 'id1', 'errors', 'id2']
In : df.columns
Out:
Index(['success', 'created', 'id1', 'errors', 'id2'], dtype='object')
From here, I was able to call 'id1' or 'id2' to get just the column I wanted.
This is what I usually do with my gene expression dataset, where the same gene name can occur more than once because of a slightly different genetic sequence of the same gene:
Create a list of the duplicated columns in the dataframe (that is, column names which appear more than once):
duplicated_columns_list = []
list_of_all_columns = list(df.columns)
for column in list_of_all_columns:
    if list_of_all_columns.count(column) > 1 and column not in duplicated_columns_list:
        duplicated_columns_list.append(column)
duplicated_columns_list
Use .index(), which returns the position of the first matching element, to rename each duplicate in turn by appending an underscore suffix:
for column in duplicated_columns_list:
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_1'
    list_of_all_columns[list_of_all_columns.index(column)] = column + '_2'
This for loop gives every duplicated column a suffixed name, so every column name is now distinct.
This specific code handles columns that appear exactly twice, but it can be modified for columns that appear more than twice in your dataframe.
Finally, assign the renamed list back to your columns:
df.columns = list_of_all_columns
That's it, I hope it helps :)
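As a rough sketch of the modification mentioned above for names that occur more than twice, the renaming can be driven by counters instead of two hard-coded .index() calls. The helper name suffix_duplicates and the sample gene names are hypothetical, not part of the original answer:

```python
from collections import Counter

import pandas as pd


def suffix_duplicates(columns):
    """Append _1, _2, ... to every name that occurs more than once."""
    totals = Counter(columns)   # how often each name occurs overall
    seen = Counter()            # how often each name has been seen so far
    renamed = []
    for name in columns:
        if totals[name] > 1:
            seen[name] += 1
            renamed.append(f"{name}_{seen[name]}")
        else:
            renamed.append(name)
    return renamed


# Hypothetical data: 'g1' appears three times, 'g2' once.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["g1", "g2", "g1", "g1"])
df.columns = suffix_duplicates(list(df.columns))
print(list(df.columns))
```

Names that occur only once are left untouched, which matches the behaviour of the loop above.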
Similarly to JDenman6 (and related to your question), I had two df columns with the same name (named 'id').
Hence, calling
df['id']
returns 2 columns.
You can use
df.iloc[:,ind]
where ind is the position of the column according to how the columns are ordered in the df. You can find the positions using:
indices = [i for i,x in enumerate(df.columns) if x == 'id']
where you replace 'id' with the name of the column you are searching for.
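A small sketch of the positional approach, using a made-up frame with two 'id' columns (the data is an assumption for illustration only):

```python
import pandas as pd

# Hypothetical frame in which the name 'id' is used twice.
df = pd.DataFrame([[1, "x", 2], [3, "y", 4]], columns=["id", "name", "id"])

# Label-based selection returns every column called 'id' as a DataFrame.
both = df["id"]

# Positions of the columns named 'id'.
indices = [i for i, x in enumerate(df.columns) if x == "id"]

# Positional selection picks exactly one of them as a Series.
second_id = df.iloc[:, indices[1]]
print(indices)
print(second_id)
```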
Related
I have a problem: I want to drop from my DataFrame all rows whose value in a given column ends with "99".
I tried creating a list:
filteredvalues = [x for x in df['XX'] if x.endswith('99')]
This list contains all the affected values, but how do I apply it to my DataFrame and drop those rows?
I tried a few things but nothing works.
Most recently I tried this:
df = df[df['XX'] not in filteredvalues]
Any help on this?
Use the .str attribute, with corresponding string methods, to select such items. Then use ~ to negate the result, and filter your dataframe with that:
df = df[~df['XX'].str.endswith('99')]
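A minimal runnable sketch of that filter on invented data. The na=False argument is my addition: it treats missing values as "does not end with 99", so the mask stays purely boolean and can be negated with ~:

```python
import pandas as pd

# Hypothetical data; 'XX' contains one missing value.
df = pd.DataFrame({"XX": ["A199", "B200", None, "C99"], "val": [1, 2, 3, 4]})

# True where the string ends with '99'; missing values count as False.
mask = df["XX"].str.endswith("99", na=False)

# Keep only the rows that do NOT end with '99'.
kept = df[~mask]
print(kept)
```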
I have a pandas df with 7000 rows * 7 columns, and a list (row_list) consisting of the values that I want to filter out from df.
What I want to do is filter out the rows of df that contain a corresponding value in the list.
This is what I got when I tried:
"Empty DataFrame
Columns: [A,B,C,D,E,F,G]
Index: []"
df = pd.read_csv('filename.csv')
df1 = pd.read_csv('filename1.csv', names = 'A')
row_list = []
for index, rows in df1.iterrows():
    my_list = [rows.A]
    row_list.append(my_list)
boolean_series = df.D.isin(row_list)
filtered_df = df[boolean_series]
print(filtered_df)
Replace
boolean_series = df.D.isin(row_list)
with
boolean_series = df.D.isin(df1.A)
and let us know the result. If it doesn't work, show a sample of df and df1.A.
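To illustrate the suggestion, here is a sketch on invented data. The loop in the question builds a list of one-element lists, which is not a flat list of values to compare against; passing the column itself (or a flat list) gives isin plain values to match. The column names D and A follow the question, everything else is assumed:

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files.
df = pd.DataFrame({"D": ["x", "y", "z"], "E": [1, 2, 3]})
df1 = pd.DataFrame({"A": ["x", "z"]})

# A flat list of values, instead of [['x'], ['z']].
row_list = df1["A"].tolist()

# Rows of df whose D value appears in the list.
matching = df[df["D"].isin(row_list)]

# Rows of df whose D value does NOT appear in the list ("filter out").
remaining = df[~df["D"].isin(row_list)]
print(matching)
print(remaining)
```

Which of the two results is wanted depends on whether "filter out" means keeping or discarding the matching rows.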
(1) generating separate dfs for each condition, concat, then dedup (slow)
(2) a custom function to annotate with bool column (default as False, then annotated True if condition is fulfilled), then filter based on that column
(3) keep a list of indices of all rows with your row_list values, then filter using iloc based on your indices list
Without an MRE, sample data, or a reason why your method didn't work, it's difficult to provide a more specific answer.
Hi, I am working with pandas to manipulate some lab data. I currently have a data frame with 5 columns.
The first three columns (Analyte, CAS NO(1), and Value) are in the correct order.
The last two columns (CAS NO(2) and Value(2)) are not.
Is there a way to align CAS NO(2) and Value(2) with the first three columns based on matching CAS numbers (i.e. CAS NO(2) = CAS NO(1))?
I am new to Python and pandas. Thank you for your help.
you can reorder the columns by reassigning the df variable as a slice of itself indexed on a list whose entries are the column names in question.
colidx = ['Analyte', 'CAS NO(1)', 'CAS NO(2)']
df = df[colidx]
It is better to provide input data in text format so we can copy-paste it. I understand your question like this: you need to sort the last two columns together, so that CAS NO(2) matches CAS NO(1).
Since CAS NO(2) = CAS NO(1), you then do not need a duplicated CAS NO(2) column, right?
Split off the last two columns, index them by CAS NO(2), convert the result to a dict, and use that dict to map the new values.
# Split off the last 2 columns, drop empty rows and index by CAS NO(2).
df_tmp = df[['CAS NO(2)', 'Value(2)']].dropna(subset=['CAS NO(2)'])
df_tmp = df_tmp.set_index('CAS NO(2)')
# Keep only the first 3 columns of the original dataframe
df = df[['Analyte', 'CAS NO(1)', 'Value(1)']].copy()
# Now copy CAS NO(1) to CAS NO(2)
df['CAS NO(2)'] = df['CAS NO(1)']
# Now create the Value(2) column on the original dataframe
df['Value(2)'] = df['CAS NO(1)'].map(df_tmp.to_dict()['Value(2)'])
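A self-contained sketch of this mapping idea on made-up values. The column names follow the question's wording and the numbers are invented. It assumes each CAS number appears at most once in CAS NO(2), since mapping through a lookup needs unique keys:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last two columns are in a different row order.
df = pd.DataFrame({
    "Analyte": ["Benzene", "Ethylbenzene", "MTBE"],
    "CAS NO(1)": ["71-43-2", "100-41-4", "1634-04-4"],
    "Value(1)": [np.nan, np.nan, np.nan],
    "CAS NO(2)": ["100-41-4", np.nan, "71-43-2"],
    "Value(2)": [18.0, np.nan, 5.0],
})

# Lookup from CAS number to Value(2), ignoring rows without a CAS NO(2).
lookup = df.dropna(subset=["CAS NO(2)"]).set_index("CAS NO(2)")["Value(2)"]

# Keep the correctly ordered columns and pull each value in by CAS number.
aligned = df[["Analyte", "CAS NO(1)", "Value(1)"]].copy()
aligned["Value(2)"] = aligned["CAS NO(1)"].map(lookup)
print(aligned)
```

Rows whose CAS number has no counterpart in CAS NO(2) end up with a missing Value(2).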
Try the following:
import pandas as pd
import numpy as np
#create an example of your table
list_CASNo1 = ['71-43-2', '100-41-4', np.nan, '1634-04-4']
list_Val1 = [np.nan]*len(list_CASNo1)
list_CASNo2 = [np.nan, np.nan, np.nan, '100-41-4']
list_Val2 = [np.nan, np.nan, np.nan, '18']
df = pd.DataFrame(zip(list_CASNo1, list_Val1, list_CASNo2, list_Val2), columns =['CASNo(1)','Value(1)','CAS NO(2)','Value(2)'], index = ['Benzene','Ethylbenzene','Gasoline Range Organics','Methyl-tert-butyl ether'])
#split the data to two dataframes
df1 = df[['CASNo(1)','Value(1)']]
df2 = df[['CAS NO(2)','Value(2)']]
#merge df2 to df1 based on the specified columns
#reset_index and set_index will take care
#that df_adjusted will have the same index names as df1
df_adjusted = df1.reset_index().merge(df2.dropna(),
                                      how = 'left',
                                      left_on = 'CASNo(1)',
                                      right_on = 'CAS NO(2)').set_index('index')
But be careful with duplicate values in the key columns: a CAS number that appears more than once will make the merge return extra rows instead of a one-to-one alignment.
I have the following code (Python 2.7):
df = pd.DataFrame()
pages = [i for i in range(1, int(math.ceil(reports.get_reports_count()/page_size)+1))]
with ThreadPoolExecutor(max_workers=len(pages)) as executor:
    futh = [executor.submit(reports.fill_dataframe, page) for page in pages]
    for data in as_completed(futh):
        df = df.append(data.result(), ignore_index=True)
cuttent_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df["timestamp"] = cuttent_time
df.columns = [c.lower().replace(' ', '_') for c in df.columns]
df = df.replace(r'\n', ' ', regex=True)
file_name = "{0}.csv.gz".format(tab_name)
df.to_csv(path_or_buf=file_name, index=False, encoding='utf-8',
          compression='gzip',
          quoting=QUOTE_NONNUMERIC)
This creates a compressed csv file from the data stream.
Now, I want to make sure that the columns in the file are the ones I expect (order does not matter). That is, if for any reason the data stream contains more columns, the extra columns should be removed. Note that I add a column of my own to the data stream, called timestamp.
The allowed columns are:
cols_list = ['order_id', 'customer_id', 'date', 'price']
I'm aware that there is a del df['column_name'] option, but this doesn't work for me as I have no idea what the redundant column name will be.
I'm looking for something like:
if col_name not in cols_list:
    del df[???] # delete the column and its data
    print [???] # print the name of the redundant column for the log
I think there are two approaches here:
1. Do not add the redundant columns to the df in the first place.
2. Remove the redundant columns after the df.append is finished.
I prefer the 1st option as it should perform better (?)
One of my attempts was:
for i, data in enumerate(df):
    for col_name in cols_list:
        if col_name not in data.keys():
            del df[col_name]
but it doesn't work:
if col_name not in data.keys(): AttributeError: 'str' object has no attribute 'keys'
I'm not sure I am enumerating over df itself.
If you want to make your attempt with a for loop work, try this:
for col_name in df.columns:
    if col_name not in cols_list:
        del df[col_name]
Removing the redundant column after the df.append is finished is quite simple:
df = df[cols_list]
As for the first suggestion, you could apply the statement described above before appending it to the df. However, you should note that this requires a pandas DataFrame object, so you would probably need to transform the data.result() to a pandas Dataframe first.
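A rough sketch of that first option, filtering each chunk before it is combined. It assumes, purely for illustration, that each data.result() is a list of row dictionaries; the helper name keep_allowed and the sample pages are hypothetical. It also uses pd.concat rather than df.append, which is no longer available in pandas 2.x:

```python
import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]


def keep_allowed(page_rows):
    """Turn one page of results into a DataFrame holding only allowed columns."""
    frame = pd.DataFrame(page_rows)
    return frame[frame.columns.intersection(cols_list)]


# Hypothetical stand-ins for the results returned by the worker threads.
pages = [
    [{"order_id": 1, "price": 2.0, "extra": "a"}],
    [{"order_id": 2, "price": 3.0}],
]

# Filter every page first, then combine the filtered pages once.
df = pd.concat([keep_allowed(rows) for rows in pages], ignore_index=True)
print(df)
```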
According to the Pandas documentation for the function read_csv at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html there is a parameter 'usecols' which is described:
usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or
strings that correspond to column names provided either by the user in
names or inferred from the document header row(s). For example, a
valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
0]. To instantiate a DataFrame from data with element order preserved
use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the
column names, returning names where the callable function evaluates to
True. An example of a valid callable argument would be lambda x:
x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in
much faster parsing time and lower memory usage.
This is the answer to your problem.
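For the case where the data really is read from a CSV file, a minimal sketch of the callable form of usecols could look like this. The CSV text is invented for illustration. The callable form is used here because, as far as I know, it simply skips names that are absent from the file, whereas a list of names is expected to match the file's columns:

```python
from io import StringIO

import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]

# Hypothetical CSV content with one unexpected column ('coupon').
csv_text = "order_id,customer_id,date,price,coupon\n1,7,2020-01-01,9.5,X\n"

# Only columns for which the callable returns True are parsed.
df = pd.read_csv(StringIO(csv_text), usecols=lambda name: name in cols_list)
print(list(df.columns))
```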
I think you need the intersection with the list of column names, and then to select that subset with []:
cols_list = ['order_id', 'customer_id', 'date', 'price']
cols = df.columns.intersection(cols_list)
df = df[cols]
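Building on the intersection idea, here is a sketch that also reports which columns were dropped, as the question asks for the log. The sample frame and the decision to keep the added timestamp column are assumptions on my part:

```python
import pandas as pd

cols_list = ["order_id", "customer_id", "date", "price"]

# Hypothetical frame with one unexpected column and the added timestamp.
df = pd.DataFrame({
    "order_id": [1],
    "price": [9.5],
    "coupon": ["X"],
    "timestamp": ["2020-01-01 00:00:00"],
})

# Assumption: the timestamp column added by the script should survive.
allowed = cols_list + ["timestamp"]

# Columns present in df but not allowed; print them for the log.
redundant = df.columns.difference(allowed)
print("dropping columns:", list(redundant))

# Keep only the allowed columns that are actually present.
df = df[df.columns.intersection(allowed)]
```

Unlike df[cols_list], this does not raise a KeyError when one of the expected columns is missing from the data.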
I have two dataframes. The first, named mergedcsv, has the format:
(screenshot: mergedcsv dataframe)
The second dataframe, named idgrp_df, has a dictionary-like format that holds, for each region ID, a list of corresponding string IDs.
(screenshot: idgrp_df dataframe - keys with lists)
For each row in mergedcsv (and the corresponding row in idgrp_df) I wish to select the columns within mergedcsv whose labels are in the list within idgrp_df for that row, then sum those particular values and add the output to a column within mergedcsv. The function will iterate through all rows in mergedcsv (582 rows x 600 columns).
My line of code to try to attempt this is:
mergedcsv['TotRegFlows'] = mergedcsv.groupby([idgrp_df],as_index=False).numbers.apply(lambda x: x.iat[0].sum())
It returns a ValueError: Grouper for class pandas.core.frame.DataFrame not 1-dimensional.
This relates to the input dataframe for the groupby. How can I access the list for each row as the input for the groupby?
So for example, for the first row in mergedcsv I wish to select the columns with labels F95RR04, F95RR06 and F95RR15 (reading from the list in the first row of idgrp_df). Sum the values in these columns for that row and insert the sum value into TotRegFlows column.
Any ideas as to how I can utilize the list would be very much appreciated.
Edits:
Many thanks IanS. Your solution is useful. After modifying the line of code based on this advice, I realised that (as suggested) the indices of the two dataframes are out of sync. I checked the indices: mergedcsv had None and idgrp_df has the 'REG_ID' column as its index. I set the mergedcsv index to 'REG_ID' as well. I then realised that mergedcsv has 582 rows (REG_ID is not unique) and idgrp_df has 220 rows (REG_ID is unique). I therefore think I am missing a groupby based on the REG_ID index in mergedcsv.
I have modified the code as follows:
mergedcsv.set_index('REG_ID', inplace=True)
print mergedcsv.index.name
print idgrp_df.index.name
mergedcsvgroup = mergedcsv.groupby('REG_ID')[mergedcsv.columns].apply(lambda y: y.tolist())
mergedcsvgroup['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum(), axis=1)
I get a KeyError: 'REG_ID'.
Any further recommendations are most welcome. Would it be more efficient to combine the groupby and apply into one line?
I am new to working with pandas and trying to build experience in python
Further amendments:
Without an index for mergedcsv:
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID').sum(), axis=1)
this throws a KeyError: (the label[0] is not in the [index], u 'occurred at index 0')
With an index for mergedcsv:
mergedcsv.set_index('REG_ID', inplace=True)
columnlist = list(mergedcsv.columns.values)
mergedcsv['TotRegFlows'] = mergedcsv.apply(lambda row: row[idgrp_df.loc[row.name]].groupby('REG_ID')[columnlist].transform().sum(), axis=1)
this throws a TypeError: ("unhashable type:'list'", u'occurred at index 7')
Or finally separating the groupby function:
columnlist = list(mergedcsv.columns.values)
mergedcsvgroup = mergedcsv.groupby('REG_ID')
mergedcsv['TotRegFlows'] = mergedcsvgroup.apply(lambda row: row[idgrp_df.loc[row.name]].sum())
This throws a TypeError: unhashable type: 'list'. The axis=1 argument is also not available with groupby apply.
Any ideas how I can use the lists with the apply function? I've explored tuples in the apply code but have not had any success.
Any suggestions much appreciated.
If I understand correctly, I have a simple solution with apply:
Setup
import pandas as pd
df = pd.DataFrame({'A': [1,2,3], 'B': [4,5,6], 'C': [7,8,9]})
lists = pd.Series([['A', 'B'], ['A', 'C'], ['C']])
Solution
I apply a lambda function that gets the list of columns to be summed from the lists series:
df.apply(lambda row: row[lists[row.name]].sum(), axis=1)
The trick is that, when iterating over rows (axis=1), row.name is the original index of the dataframe df. I use that to access the list from the lists series.
Notes
This solution assumes that both dataframes share the same index, which appears not to be the case in the screenshots you included. You have to address that.
Also, if idgrp_df is a dataframe and not a series, then you need to access its values with .loc.
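One way to handle the mismatched indices described in the question is sketched below. It assumes REG_ID stays an ordinary column in the wide table and that the lists are held in a Series indexed by unique REG_ID. The frame names, column labels and values are invented. Because the lookup uses each row's REG_ID rather than row.name, repeated REG_ID values in the wide table are not a problem and no groupby is needed:

```python
import pandas as pd

# Hypothetical wide table; REG_ID is not unique here.
flows = pd.DataFrame({
    "REG_ID": [10, 10, 20],
    "F1": [1, 2, 3],
    "F2": [4, 5, 6],
    "F3": [7, 8, 9],
})

# Hypothetical lookup: one list of column labels per unique REG_ID.
id_groups = pd.Series({10: ["F1", "F2"], 20: ["F3"]})


def row_total(row):
    """Sum the columns listed for this row's REG_ID."""
    cols = id_groups.loc[row["REG_ID"]]
    return row[cols].sum()


flows["TotRegFlows"] = flows.apply(row_total, axis=1)
print(flows)
```

If the lists live in a DataFrame column instead of a Series, the lookup line would need to select that column as well.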