I have two lists: a list of table titles (title_df) and a list of contents (prediction_df) that should be arranged under those titles. I want to distribute the contents across the titles and build a table from the result.
title_df=['a','b']
prediction_df=['1','2','3','800800','802100','800905']
My table has three rows and two columns
Use numpy.reshape: 2 is the number of columns, -1 lets NumPy infer the number of rows from the data, and order='F' fills the values column by column. Then pass the result to the DataFrame constructor:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.reshape(prediction_df, (-1, 2), order='F'), columns=title_df)
print(df)
a b
0 1 800800
1 2 802100
2 3 800905
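To see why order='F' matters here, the following minimal sketch compares it with the default row-major order on the same data from the question. The names by_column and by_row are only illustrative.

```python
import numpy as np
import pandas as pd

title_df = ['a', 'b']
prediction_df = ['1', '2', '3', '800800', '802100', '800905']

# order='F' fills column by column: the first three values land under 'a'.
by_column = pd.DataFrame(
    np.reshape(prediction_df, (-1, 2), order='F'), columns=title_df
)

# The default order='C' fills row by row instead, interleaving the values.
by_row = pd.DataFrame(np.reshape(prediction_df, (-1, 2)), columns=title_df)

print(by_column)
print(by_row)
```

With the default order the values '1' and '2' end up in the same row, which is not the layout the question asks for.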
I am trying to drop pandas columns in the following way. I have a list of columns to drop, and this list will be used many times in my notebook. I also have 2 columns which are only referenced once.
drop_cols=['var1','var2']
df = df.drop(columns={'var0',drop_cols})
So basically, I want to drop all the columns in the list drop_cols, plus a hard-coded "var0" column, in one swoop. This gives an error. How do I resolve it?
df = df.drop(columns=drop_cols+['var0'])
From what I gather, you have a set of columns you wish to drop from several different dataframes, plus one extra column to drop from a particular dataframe. The command you used is close, but a set literal cannot contain a list (lists are unhashable), so {'var0', drop_cols} raises a TypeError. This is how I would approach the problem.
Given a Dataframe of the form:
V0 V1 V2 V3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
Define a function to merge the column names:
def mergeNames(spc_col, multi_cols):
    # Start with the single column, then append the shared list
    rslt = [spc_col]
    rslt.extend(multi_cols)
    return rslt
Then with
drop_cols = ['V1', 'V2']
df.drop(columns=mergeNames('V0', drop_cols))
yields:
V3
0 4
1 8
2 12
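As a self-contained sketch of the same idea, the example below rebuilds the sample frame and combines the single column with the reusable list by unpacking it into a new list. The name result is only illustrative, and drop_cols itself is left unchanged so it can be reused elsewhere.

```python
import pandas as pd

df = pd.DataFrame({
    'V0': [1, 5, 9],
    'V1': [2, 6, 10],
    'V2': [3, 7, 11],
    'V3': [4, 8, 12],
})
drop_cols = ['V1', 'V2']  # reusable list, not modified below

# Build a fresh list for this one call by unpacking drop_cols into it.
result = df.drop(columns=['V0', *drop_cols])
print(result)
```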
I have a dataset that looks like the following (as a DataFrame):
**_id** **paper_title** **references** **full_text**
1 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
2 XYZ [{'abc':'something','def':'something'},{'def':'something'},...many others] something
3 XYZ [{'abc':'something'},{'def':'something'},...many others] something
Expected:
**_id** **paper_title** **abc** **def** **full_text**
1 XYZ something something something
something something
.
.
(all the dictionaries in the list for this _id)
2 XYZ something something something
something something
.
.
(all the dictionaries in the list for this _id)
I have tried df['column_name'].apply(pd.Series).apply(pd.Series) to split the lists and dictionaries into DataFrame columns, but it didn't help because it did not split the dictionaries.
First row of my dataframe:
df.head(1)
Assuming each cell of your references column holds a list of dictionaries, each with one key:value pair under a key named 'reference':
print(df)
id paper_title references full_text
0 1 xyz [{'reference': 'description1'}, {'reference': ... some text
1 2 xyz [{'reference': 'descriptiona'}, {'reference': ... more text
2 3 xyz [{'reference': 'descriptioni'}, {'reference': ... even more text
Then you can use concat to separate out your references with their index:
df1 = pd.concat([pd.DataFrame(i) for i in df['references']], keys = df.index).reset_index(level=1,drop=True)
print(df1)
reference
0 description1
0 description2
0 description3
1 descriptiona
1 descriptionb
1 descriptionc
2 descriptioni
2 descriptionii
2 descriptioniii
Then use DataFrame.join to join the columns back together on their index:
df = df.drop('references', axis=1).join(df1).reset_index(drop=True)
print(df)
id paper_title full_text reference
0 1 xyz some text description1
1 1 xyz some text description2
2 1 xyz some text description3
3 2 xyz more text descriptiona
4 2 xyz more text descriptionb
5 2 xyz more text descriptionc
6 3 xyz even more text descriptioni
7 3 xyz even more text descriptionii
8 3 xyz even more text descriptioniii
After a lot of reading in the pandas documentation, I found that the explode method combined with apply(pd.Series) is the easiest way to do what I was asking for in the question.
Here is the Code:
df = df.explode('reference')
# explode turns each list element into its own row, repeating the other columns
df = df['reference'].apply(pd.Series).merge(df, left_index=True, right_index=True, how ='outer')
# Expand each dictionary into columns and merge with the exploded DataFrame on the index (an outer merge, like A ∪ B in set theory)
Sidenote: when merging, check for duplicated values first. explode repeats the original index label for every element, so the index is no longer unique and an index-based merge can match rows more than once.
I hope this helps anyone who has a DataFrame or Series whose columns hold lists of dictionaries and who wants to split the dictionary keys into new columns, with the values as rows.
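A minimal sketch of the explode approach that sidesteps the duplicated-index issue is shown below. The sample values ('a1', 'd1', and so on) are hypothetical stand-ins for the question's data, and the sketch assumes every row holds a non-empty list of dictionaries.

```python
import pandas as pd

# Hypothetical sample data shaped like the question's table.
df = pd.DataFrame({
    '_id': [1, 2],
    'paper_title': ['XYZ', 'XYZ'],
    'references': [
        [{'abc': 'a1', 'def': 'd1'}, {'def': 'd2'}],
        [{'abc': 'a3'}],
    ],
    'full_text': ['something', 'something'],
})

# One row per dictionary; reset the index so every row has a unique label.
exploded = df.explode('references').reset_index(drop=True)

# Expand each dictionary into columns; keys missing from a dict become NaN.
expanded = pd.DataFrame(exploded['references'].tolist())

# Both frames share the same 0..n-1 index, so a plain join lines them up.
result = exploded.drop(columns='references').join(expanded)
print(result)
```

Resetting the index before the join keeps the match one-to-one, so no rows are multiplied.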
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
columns = ['Relative_Pressure','Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True,index_col=False,header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However in the link these multiple index rows are generated right when the DataFrame is created. I would like to add e.g. rows for unit or descr to the columns.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example from_tuples and from_product. You can read more about MultiIndex in the documentation.
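As a rough sketch of those two alternatives, the example below assigns a from_tuples index to the same two-column frame and builds a separate from_product index. The 'min'/'max' labels and the STAT level name are hypothetical and only there to show the combinations.

```python
import pandas as pd

df = pd.DataFrame({'desc A': [1, 2, 3], 'desc B': [4, 5, 6]})

# from_tuples: one tuple per column, one entry per level.
df.columns = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'],
)
print(df)

# from_product: every combination of the level values (2 x 2 = 4 entries).
grid = pd.MultiIndex.from_product(
    [['A', 'B'], ['min', 'max']], names=['NAME', 'STAT']
)
print(list(grid))
```

from_tuples suits the case in the question, where each existing column gets its own unit or description. from_product only fits when the columns really are a full grid of combinations.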
Premise
I need to use a dictionary as a filter on a large dataframe, where the key-value pairs are values in different columns.
This dictionary is obtained from a separate dataframe, using dict(zip(df.id_col, df.rank_col)) so if a dictionary isn't the best way to go, that is open to change.
This is very similar to this question: Filter a pandas dataframe using values from a dict but fundamentally (I think) different because my dictionary contains column-paired values:
Example data
df_x = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3],
'B':[1,1,1,0,1,0,1,0,1], 'Rank':['1','2','3','1', '2','3','1','2','3'],'D':[1,2,3,4,5,6,7,8,9]})
filter_dict = {'1':'1', '2':'3', '3':'2'}
For this dataframe df_x I would want to be able to look at the filter dictionary and apply it to a set of columns, here id and Rank, so the dataframe is pared down to:
The actual source dataframe is approx 1M rows, and the dictionary is >100 key-value pairs.
Thanks for any help.
You can check with isin: convert each (id, Rank) row to a tuple of strings and test it against the dictionary's items.
df_x[df_x[['id','Rank']].astype(str).apply(tuple,1).isin(filter_dict.items())]
Out[182]:
id B Rank D
0 1 1 1 1
5 2 0 3 6
7 3 0 2 8
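An alternative sketch, not part of the original answer, turns the dictionary into a small lookup frame and uses an inner merge, which avoids the row-wise apply(tuple, 1) that can be slow on large frames. The names pairs and filtered are illustrative. Note that the merge returns a fresh 0..n-1 index rather than the original row labels.

```python
import pandas as pd

df_x = pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'B': [1, 1, 1, 0, 1, 0, 1, 0, 1],
    'Rank': ['1', '2', '3', '1', '2', '3', '1', '2', '3'],
    'D': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})
filter_dict = {'1': '1', '2': '3', '3': '2'}

# Turn the dictionary into a two-column lookup table.
pairs = pd.DataFrame(list(filter_dict.items()), columns=['id', 'Rank'])

# The dictionary keys are strings but df_x['id'] holds integers.
pairs['id'] = pairs['id'].astype(df_x['id'].dtype)

# The inner merge keeps only rows whose (id, Rank) pair is in the lookup table.
filtered = df_x.merge(pairs, on=['id', 'Rank'], how='inner')
print(filtered)
```

Since the dictionary is built from another dataframe in the first place, that dataframe's id and rank columns could serve as the lookup table directly.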
I was able to produce a pandas dataframe with identical column names.
Is this normal for a pandas dataframe?
How can I select only one of the two columns?
Does selecting by the shared name always return both columns of the dataframe?
Example given below:
# Producing a new empty pd dataset
dataset=pd.DataFrame()
# fill in a list with values to be added to the dataset later
cases=[1]*10
# Adding the list of values in the dataset, and naming the variable / column
dataset["id"]=cases
# making a list of columns as it is displayed below:
data_columns = ["id", "id"]
# Then, we call the pd dataframe using the defined column names:
dataset_new=dataset[data_columns]
# dataset_new
# It has as a result two columns with identical names.
# How can I process only one of the two dataset columns?
id id
0 1 1
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 1 1
7 1 1
You can use .iloc to access either column by position.
dataset_new.iloc[:,0]
or
dataset_new.iloc[:,1]
and of course you can rename your columns, just like you did when you set them both to 'id', using:
dataset_new.columns = ['id_1', 'id_2']
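If the goal is to keep only one copy of each repeated name, one possible sketch uses Index.duplicated to build a mask over the columns. The names is_repeat and deduped are illustrative.

```python
import pandas as pd

dataset = pd.DataFrame({'id': [1] * 10})
dataset_new = dataset[['id', 'id']]  # two columns, both named 'id'

# Boolean mask: False for the first 'id', True for every later repeat.
is_repeat = dataset_new.columns.duplicated()

# Keep only the first occurrence of each column name.
deduped = dataset_new.loc[:, ~is_repeat]
print(deduped.shape)
```

This selects by position under the hood, so it works even though a label lookup for 'id' would return both columns.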
df = pd.DataFrame()
lst = ['1', '2', '3']
df[0] = lst
df[1] = lst
df.rename(columns={0:'id'}, inplace=True)
df.rename(columns={1:'id'}, inplace=True)
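# After both renames no column is labelled 1, so df[[1]] raises a KeyError; select by position instead
print(df.iloc[:, [1]])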