I'm constructing a new DataFrame by concatenating the columns of other DataFrames, like so:
pairs = pd.concat([pos1['Close'], pos2['Close'], pos3['Close'], pos4['Close'], pos5['Close'],
pos6['Close'], pos7['Close']], axis=1)
I want to rename all of the columns of the pairs DataFrame to the symbols of the underlying securities. Is there a way to do this during the concat method call? Reading through the docs on the method here http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.concat.html didn't give me a solid answer.
You can achieve this in one go using the keys argument:
pairs = pd.concat([pos1['Close'], pos2['Close'], pos3['Close'], pos4['Close'], pos5['Close'], pos6['Close'], pos7['Close']],
axis=1, keys=['JPM', 'WFC', 'BAC', 'C', 'STI', 'PNC', 'CMA'])
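As a quick sanity check, here is a minimal sketch with two made-up Series (the order of keys must match the order of the objects passed to concat):
import pandas as pd

s1 = pd.Series([1.0, 2.0], name='Close')
s2 = pd.Series([3.0, 4.0], name='Close')

# keys= becomes the column labels of the concatenated frame
out = pd.concat([s1, s2], axis=1, keys=['JPM', 'WFC'])
print(out.columns.tolist())  # ['JPM', 'WFC']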
This is the approach I'm taking. Seems to fit all my requirements.
symbols = ['JPM', 'WFC', 'BAC', 'C', 'STI', 'PNC', 'CMA']
pairs.columns = symbols
Related
I have a dataframe with several columns that I am filtering, in hopes of using the output to further filter another dataframe. Ultimately I'd like to convert the groupby result to a list of values for that second filter, but I am having a hard time converting the SeriesGroupBy object to a list. I am using:
id_list = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].tolist()
I've tried reset_index(), to_frame(), and .values before tolist(), with no luck.
The error is:
'SeriesGroupBy' object has no attribute tolist
Expected output: simply a list of ids.
Try:
id_list = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].apply(list)
Also, I am a bit skeptical about how you are comparing df['date_diff'] with the groupby output.
EDIT: This might be useful for your intended purpose (s might be the output of your groupby):
s = pd.Series([['a','a','b'],['b','b','c','d'],['a','b','e']])
s.explode().unique().tolist()
['a', 'b', 'c', 'd', 'e']
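If a single flat, de-duplicated list of ids is all you need in the end, here is a sketch combining both ideas (the toy frame below just stands in for your df, with the column names taken from your snippet):
import pandas as pd

# toy frame standing in for your df
df = pd.DataFrame({'id': [1, 1, 2, 3],
                   'date_diff': pd.to_timedelta(['0 days', '0 days', '1 days', '0 days'])})

# one list of ids per group, then flatten and de-duplicate
grouped = df[df['date_diff'] == pd.Timedelta('0 days')].groupby('id')['id'].apply(list)
id_list = grouped.explode().unique().tolist()
print(id_list)  # [1, 3]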
I have two data frames, let's say A and B. A has the columns ['Name', 'Age', 'Mobile_number'] and B has the columns ['Cell_number', 'Blood_Group', 'Location'], with 'Mobile_number' and 'Cell_number' having common values. I want to join only the 'Location' column onto A based on the common values in 'Mobile_number' and 'Cell_number', so the final DataFrame would have the columns ['Name', 'Age', 'Mobile_number', 'Location'].
a = {'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33,43,22,45], 'Mobile_number':[332,554,234, 832]}
A = pd.DataFrame(a)
b = {'Cell_number': [832,554,123,333], 'Blood_group': ['O', 'A', 'AB', 'AB'], 'Location': ['TX', 'AZ', 'MO', 'MN']}
B = pd.DataFrame(b)
Please suggest how to do this. A colleague suggested using pandas' join, but I don't understand how.
Thank you for your time.
The way I see it, you want to merge a DataFrame with part of another DataFrame, based on a common column.
First you have to make sure the common columns share the same name:
B['Mobile_number'] = B['Cell_number']
Then create a DataFrame that contains only the relevant columns (the key column and the data column you want to bring over):
B1 = B[['Mobile_number', 'Location']]
Finally, you can merge them:
merged_df = pd.merge(A, B1, on='Mobile_number')
Note that this usage of pd.merge keeps only the rows whose Mobile_number value exists in both DataFrames (an inner join, which is the default).
You can look at the documentation of pd.merge to change how exactly the merge is done, what to include, etc.
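Putting it together with the sample frames from the question (a sketch; I use rename instead of copying the column, and how='left' on the assumption that you want to keep every row of A):
import pandas as pd

A = pd.DataFrame({'Name': ['Jake', 'Paul', 'Logan', 'King'], 'Age': [33, 43, 22, 45],
                  'Mobile_number': [332, 554, 234, 832]})
B = pd.DataFrame({'Cell_number': [832, 554, 123, 333], 'Blood_group': ['O', 'A', 'AB', 'AB'],
                  'Location': ['TX', 'AZ', 'MO', 'MN']})

# align the key names and keep only the key plus the column we want to bring over
B1 = B.rename(columns={'Cell_number': 'Mobile_number'})[['Mobile_number', 'Location']]

# left join keeps every row of A; numbers with no match get NaN in Location
merged_df = A.merge(B1, on='Mobile_number', how='left')
print(merged_df)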
I have the following code snippet (dataset: https://www.internationalgenome.org/data-portal/sample):
genome_data = pd.read_csv('../genome')
genome_data_columns = genome_data.columns
genPredict = genome_data[genome_data_columns[genome_data_columns != 'Geuvadis']]
This drops the column Geuvadis; is there a way I can exclude more than one column?
You could use DataFrame.drop like genome_data.drop(['Geuvadis', 'C2', ...], axis=1).
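Equivalently, recent pandas versions accept a columns keyword, which avoids spelling out axis=1 (the second column name below is just a placeholder):
genPredict = genome_data.drop(columns=['Geuvadis', 'some_other_column'])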
Is it ok for you to not read them in the first place?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
The ‘usecols’ option in read_csv lets you specify the columns of data you want to include in the DataFrame.
Venkatesh-PrasadRanganath's answer is the correct way to drop multiple columns.
But if you want to avoid reading data into memory which you're not going to use, genome_data = pd.read_csv('../genome', usecols=["only", "required", "columns"]) is the syntax to use.
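usecols also accepts a callable, which is handy when you know which columns you want to exclude rather than which to keep (the column names below are placeholders):
import pandas as pd

excluded = {'Geuvadis', 'another_unwanted_column'}  # columns you do NOT want loaded
genome_data = pd.read_csv('../genome', usecols=lambda c: c not in excluded)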
I think Venkatesh-PrasadRanganath's answer is better, but taking a similar approach to your attempt, this is how I would do it:
Identify all of the columns with columns.to_list().
Create a list of columns to be excluded.
Subtract the columns to be excluded from the full list with list(set() - set()).
Select the remaining columns.
genome_data = pd.read_csv('../genome')
all_genome_data_columns = genome_data.columns.to_list()
excluded_genome_data_columns = ['a', 'b', 'c'] #Type in the columns that you want to exclude here.
genome_data_columns = list(set(all_genome_data_columns) - set(excluded_genome_data_columns))
genPredict = genome_data[genome_data_columns]
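One caveat: set() does not preserve the original column order, so if order matters a list comprehension may be preferable (same variables as above):
genome_data_columns = [c for c in all_genome_data_columns
                       if c not in excluded_genome_data_columns]
genPredict = genome_data[genome_data_columns]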
I am using Pandas to select columns from a dataframe, olddf. Let's say the column names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
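A minimal check with a toy frame (the column names are invented to mirror the question):
import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c', 'startswith1', 'startswith2'])
filter_col = [col for col in list(olddf) if col.startswith('startswith')]

print(olddf[['a', 'b'] + filter_col].columns.tolist())
# ['a', 'b', 'startswith1', 'startswith2']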
If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a','b','d'],
'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column" and "string" and "pandas" and could find no previous answer. Is this the only way, or is there something better?
Old post, but may be interesting: an idea (which is destructive, but does the job if you want it quick and dirty) is to rename columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
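A non-destructive variant, if you'd rather leave the original frame untouched, is to rename into a new object (a sketch using the df2 from the question):
df2_clean = df2.rename(columns=lambda c: c.replace(' ', '_'))
df2_clean.data_2  # now reachable with dot notation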
I think the default way is to use bracket notation instead of dot notation.
import pandas as pd
df1 = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing it as an attribute, are more for convenience.
If you'd like to supply a spaced column name to a pandas method like assign, you can dictionarize your inputs and unpack them:
df.assign(**{'space column': (lambda x: x['space column2'])})
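For example, with a tiny frame (the names here are invented), the unpacked dictionary lets the spaced name act as the keyword:
import pandas as pd

df = pd.DataFrame({'space column2': [1, 2, 3]})
df = df.assign(**{'space column': lambda x: x['space column2'] * 2})
print(df.columns.tolist())  # ['space column2', 'space column']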
While the accepted answer works for column-specification when using dictionaries or []-selection, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
> df.assign("data 2" = lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
You can do it with df['Column Name']
If you want to apply filtering, that's also possible with column names containing spaces, e.g. filtering for NULL values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
(df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.