pandas dataframe restructuring [duplicate] - python

This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 3 years ago.
I have a dataframe like so:
Input:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(3), 'b': np.arange(3)-1})
Desired output:
df_rearranged = pd.DataFrame({'data': [0,1,2,-1,0,1], 'origin': ['a', 'a', 'a', 'b', 'b', 'b']})
I have found a (hacky) way of doing this:
Attempt:
subset_1 = df[['a']]
subset_1['origin'] = 'a'
subset_1.rename(columns={'a':'data'}, inplace=True)
subset_2 = df[['b']]
subset_2['origin'] = 'b'
subset_2.rename(columns={'b':'data'}, inplace=True)
df_rearranged = subset_1.append(subset_2)
This works, but it quickly becomes impractical when I want to pool larger numbers of columns. Also, I feel that there should be a function in pandas that does this by default, but I am lacking the keywords to find it. Help is greatly appreciated!

Use DataFrame.melt, then reorder the columns with DataFrame.reindex:
df1 = df.melt(var_name='origin', value_name='data').reindex(['data','origin'], axis=1)
print (df1)
data origin
0 0 a
1 1 a
2 2 a
3 -1 b
4 0 b
5 1 b
Or use the DataFrame constructor with numpy.ravel (in column order) and numpy.repeat, which should perform better:
df1 = pd.DataFrame({'data':df.values.ravel(order='F'), 'origin':np.repeat(df.columns, len(df))})
print (df1)
data origin
0 0 a
1 1 a
2 2 a
3 -1 b
4 0 b
5 1 b
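Note that melt scales to any number of columns with no extra code; a small added sketch with a hypothetical wider frame (the extra column names are just for illustration):
import pandas as pd
import numpy as np

df_wide = pd.DataFrame({'a': range(3), 'b': np.arange(3) - 1, 'c': range(3), 'd': range(3)})
# one call pools all four columns: 12 rows, labelled by 'origin'
df_long = df_wide.melt(var_name='origin', value_name='data')[['data', 'origin']]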

Related

How to replace df.loc with df.reindex without KeyError

I have a huge dataframe which I get from a .csv file. After defining the columns, I only want to use the ones I need. With Python 3.8.1 it worked great, although it raised the "FutureWarning:
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative."
If I try to do the same in Python 3.10.x, I now get a KeyError: "['empty'] not in index".
To slice out / get rid of the columns I don't need, I use .loc like this:
df = df.loc[:, ['laenge','Timestamp', 'Nick']]
How can I get the same result with the .reindex function (or any other) without getting the KeyError?
Thanks
If you need only the columns that exist in the DataFrame, use numpy.intersect1d:
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
The same output can be had with DataFrame.reindex, followed by removing the all-missing columns it creates:
df = df.reindex(['laenge','Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
Sample:
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
print (df)
laenge col Nick
0 0 1 2
1 5 7 8
df = df[np.intersect1d(['laenge','Timestamp', 'Nick'], df.columns)]
print (df)
Nick laenge
0 2 0
1 8 5
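For reference, the reindex variant from above gives the same surviving columns on this sample, and it keeps the requested column order (note it would also drop a real column that happens to be entirely NaN):
df = pd.DataFrame({'laenge': [0,5], 'col': [1,7], 'Nick': [2,8]})
df = df.reindex(['laenge','Timestamp', 'Nick'], axis=1).dropna(how='all', axis=1)
print (df)
laenge Nick
0 0 2
1 5 8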
Use reindex:
df = pd.DataFrame({'A': [0], 'B': [1], 'C': [2]})
# A B C
# 0 0 1 2
df.reindex(['A', 'C', 'D'], axis=1)
output:
A C D
0 0 2 NaN
If you need to get only the common columns, you can use Index.intersection:
cols = ['A', 'C', 'E']
df[df.columns.intersection(cols)]
output:
A C
0 0 2

Filter Columns from Pandas Dataframe with given list when list elements may or may not be present as column

I have a huge dataframe and I need to filter out the columns from the dataframe if the columns are present in a given list.
For example,
df = pd.DataFrame([[1,2,3,4,5],[6,7,8,9,10]], columns=list('ABCDE'))
This is the dataframe.
A B C D E
0 1 2 3 4 5
1 6 7 8 9 10
I have a list.
fil_lst = ['A', 'D', 'F']
The list may contain column names that are not present in the dataframe. I need only the columns that are present in the dataframe.
I need the resulting dataframe like,
A D
0 1 4
1 6 9
I know it can be done with the help of list comprehension like,
new_df = df[[col for col in fil_lst if col in df.columns]]
But as I have a huge dataframe, it is better if I don't use this computationally expensive process.
Is it possible to vectorize this in any way?
Use Index.isin to test membership in the columns and DataFrame.loc to filter by columns; the : selects all rows, and the boolean mask selects the columns:
fil_lst = ['A', 'D', 'F']
df = df.loc[:, df.columns.isin(fil_lst)]
print(df)
A D
0 1 4
1 6 9
Or use Index.intersection:
fil_lst = ['A', 'D', 'F']
df = df[df.columns.intersection(fil_lst)]
print(df)
A D
0 1 4
1 6 9
If you are dealing with large lists, and the focus is on performance more than order of columns, you can use set intersection:
In [2944]: fil_lst = ['A', 'D', 'F']
In [2945]: col_list = df.columns.tolist()
In [2947]: df = df[list(set(col_list) & set(fil_lst))]
In [2947]: df
Out[2947]:
D A
0 4 1
1 9 6
EDIT: If order of columns is important, then do this:
In [2953]: df = df[sorted(set(col_list) & set(fil_lst), key = col_list.index)]
In [2953]: df
Out[2953]:
A D
0 1 4
1 6 9
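One more option, not from the answers above but worth noting: DataFrame.filter with the items parameter keeps only the labels that actually exist and preserves the order of the list:
fil_lst = ['A', 'D', 'F']
print(df.filter(items=fil_lst))
A D
0 1 4
1 6 9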

How to rename the rows in dataframe using pandas read (Python)?

I want to rename rows in a Python program (Spyder 3, Python 3.6). At this point I have something like this:
import pandas as pd
data = pd.read_csv(filepath, delim_whitespace = True, header = None)
Before that I wanted to rename my columns:
data.columns = ['A', 'B', 'C']
That gave me something like this:
A B C
0 1 n 1
1 1 H 0
2 2 He 1
3 3 Be 2
But now, I want to rename rows. I want:
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
How can I do it? The main idea is to rename every row created by pd.read_csv using the data in the B column. I tried something like this:
for rows in data:
    data.rename(index={0: df.loc[index, 'B'], 1: 'one'})
but it's not working.
Any ideas? Maybe just replace the data frame rows by column B? How?
I think you need set_index with rename_axis:
df1 = df.set_index('B', drop=False).rename_axis(None)
A solution with rename and a dictionary:
df1 = df.rename(dict(zip(df.index, df['B'])))
print (dict(zip(df.index, df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
If the index is the default RangeIndex, the solution can be:
df1 = df.rename(dict(enumerate(df['B'])))
print (dict(enumerate(df['B'])))
{0: 'n', 1: 'H', 2: 'He', 3: 'Be'}
Output:
print (df1)
A B C
n 1 n 1
H 1 H 0
He 2 He 1
Be 3 Be 2
EDIT:
If you don't want column B, a solution is read_csv with the index_col parameter:
import pandas as pd
temp=u"""1 n 1
1 H 0
2 He 1
3 Be 2"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
from io import StringIO
df = pd.read_csv(StringIO(temp), delim_whitespace=True, header=None, index_col=[1])
print (df)
0 2
1
n 1 1
H 1 0
He 2 1
Be 3 2
I normally rename the rows in my dataset by following these steps.
import pandas as pd
df = pd.read_csv("zzzz.csv")
# in a dataframe it is hard to change the names of the rows directly, so
df = df.transpose()
# this turns all the rows into columns
df.columns = ["", "", .....]
# make sure the length of this list matches the number of columns, i.e. don't skip any names
# once you are done renaming them:
df = df.transpose()
# we get our original dataset back with the changed row names
Just put the column names into the names parameter when reading:
import pandas as pd
df = pd.read_csv('filename.csv', names=["colname A", "colname B"])
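If you also want the row labels set while reading, a possible sketch combining both ideas (assuming the whitespace-separated file from the question; note that B then lives only in the index, not as a column):
import pandas as pd
df = pd.read_csv('filename.csv', delim_whitespace=True, header=None, names=['A', 'B', 'C'], index_col='B')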

reshape a pandas dataframe

suppose a dataframe like this one:
df = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12]], columns = ['A', 'B', 'A1', 'B1'])
I would like to have a dataframe which looks like this (the first set of columns stacked on top of the second):
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
What does not work:
new_rows = int(df.shape[1]/2) * df.shape[0]
new_cols = 2
df.values.reshape(new_rows, new_cols, order='F')
of course I could loop over the data and make a new list of list but there must be a better way. Any ideas ?
The pd.wide_to_long function is built almost exactly for this situation, where you have many of the same variable prefixes that end in a different digit suffix. The only difference here is that your first set of variables doesn't have a suffix, so you will need to rename your columns first.
The only issue with pd.wide_to_long is that it must have an identification variable, i, unlike melt. reset_index is used to create this uniquely identifying column, which is dropped later. I think this might get corrected in the future.
df1 = df.rename(columns={'A':'A1', 'B':'B1', 'A1':'A2', 'B1':'B2'}).reset_index()
pd.wide_to_long(df1, stubnames=['A', 'B'], i='index', j='id')\
.reset_index()[['A', 'B', 'id']]
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
You can use lreshape and, for the id column, numpy.repeat:
a = [col for col in df.columns if 'A' in col]
b = [col for col in df.columns if 'B' in col]
df1 = pd.lreshape(df, {'A' : a, 'B' : b})
df1['id'] = np.repeat(np.arange(len(df.columns) // 2), len(df.index)) + 1
print (df1)
A B id
0 1 2 1
1 5 6 1
2 9 10 1
3 3 4 2
4 7 8 2
5 11 12 2
EDIT:
lreshape is currently undocumented, and it is possible it will be removed (along with pd.wide_to_long).
A possible outcome is merging all three functions into one, maybe melt, but that is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
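If lreshape does disappear, a minimal alternative using only pd.concat (reusing the a and b lists built above) could look like this:
# pair up the A*/B* columns, stack the pairs, and tag each pair with an id
parts = []
for i, (ca, cb) in enumerate(zip(a, b), start=1):
    part = df[[ca, cb]].copy()
    part.columns = ['A', 'B']
    part['id'] = i
    parts.append(part)
df1 = pd.concat(parts, ignore_index=True)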
I solved this in 3 steps:
Make a new dataframe df2 holding only the data you want to be added to the initial dataframe df.
Delete the data from df that will be added below (and that was used to make df2).
Append df2 to df.
Like so:
# step 1: create new dataframe
df2 = df[['A1', 'B1']]
df2.columns = ['A', 'B']
# step 2: delete that data from original
df = df.drop(["A1", "B1"], axis=1)
# step 3: append
df = df.append(df2, ignore_index=True)
Note how when you do df.append() you need to specify ignore_index=True so the new rows get appended to the index rather than keeping their old index.
Your end result should be your original dataframe with the data rearranged like you wanted:
In [16]: df
Out[16]:
A B
0 1 2
1 5 6
2 9 10
3 3 4
4 7 8
5 11 12
Use pd.concat() like so:
#Split into separate tables
df_1 = df[['A', 'B']]
df_2 = df[['A1', 'B1']]
df_2.columns = ['A', 'B'] # Make column names line up
# Add the ID column
df_1 = df_1.assign(id=1)
df_2 = df_2.assign(id=2)
# Concatenate
pd.concat([df_1, df_2])
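For reference, the result of that concat should look roughly like this (the original index repeats because ignore_index was not passed):
A B id
0 1 2 1
1 5 6 1
2 9 10 1
0 3 4 2
1 7 8 2
2 11 12 2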

Merging and summing up several value_counts series in Pandas

I usually use value_counts() to get the number of occurrences of a value. However, I now deal with large database tables (which cannot be loaded fully into RAM) and query the data in chunks of one month.
Is there a way to store the result of value_counts() and merge it with / add it to the next results?
I want to count the number of user actions. Assume the following structure of
user-activity logs:
# month 1
id userId actionType
1 1 a
2 1 c
3 2 a
4 3 a
5 3 b
# month 2
id userId actionType
6 1 b
7 1 b
8 2 a
9 3 c
Using value_counts() on those produces:
# month 1
userId
1 2
2 1
3 2
# month 2
userId
1 2
2 1
3 1
Expected output:
# month 1+2
userId
1 4
2 2
3 3
Up until now, I have only found a method using groupby and sum:
# count user actions and store them in a new column
df1['count'] = df1.groupby(['userId'], sort=False)['id'].transform('count')
# drop columns that are not needed
df1 = df1[['userId', 'count']]
# drop rows that are not needed
df1 = df1.drop_duplicates(subset=['userId'])
# repeat
df2['count'] = df2.groupby(['userId'], sort=False)['id'].transform('count')
df2 = df2[['userId', 'count']]
df2 = df2.drop_duplicates(subset=['userId'])
# merge and sum up
print(pd.concat([df1,df2]).groupby(['userId'], sort=False).sum())
What is the pythonic / pandas' way of merging the information of several series' (and dataframes) efficiently?
Let me suggest "add" and specify a fill value of 0. This has an advantage over the previously suggested answer in that it will work when the two DataFrames have non-identical sets of unique keys.
# Create frames
df1 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'c', 'c', 'd'], 'a': [1, 1, 2, 3, 3, 5]})
df2 = pd.DataFrame(
{'User_id': ['a', 'a', 'b', 'b', 'c', 'c', 'c'], 'a': [1, 1, 2, 2, 3, 3, 4]})
Now add the two sets of value_counts(). The fill_value argument will handle any NaN values that would arise; in this example, the 'd' that appears in df1 but not in df2.
a = df1.User_id.value_counts()
b = df2.User_id.value_counts()
a.add(b,fill_value=0)
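For reference, with these frames the result should come out roughly as follows (as floats, because the alignment introduces a missing 'd' on one side before fill_value kicks in):
a 4.0
b 3.0
c 5.0
d 1.0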
You can sum the series generated by the value_counts method directly:
#create frames
df= pd.DataFrame({'User_id': ['a','a','b','c','c'],'a':[1,1,2,3,3]})
df1= pd.DataFrame({'User_id': ['a','a','b','b','c','c','c'],'a':[1,1,2,2,3,3,4]})
sum the series:
df.User_id.value_counts() + df1.User_id.value_counts()
output:
a 4
b 3
c 5
dtype: int64
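One caveat with the plain + (and the reason the add answer above uses fill_value=0): keys present in only one of the Series come out as NaN. A tiny illustrative example:
s1 = pd.Series({'a': 2, 'd': 1})
s2 = pd.Series({'a': 2, 'b': 2})
s1 + s2                   # b and d become NaN
s1.add(s2, fill_value=0)  # a 4.0, b 2.0, d 1.0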
This is known as "Split-Apply-Combine". It can be done in one line, using a lambda function, as follows.
1️⃣ paste this into your code:
df['total_for_this_label'] = df.groupby('label', as_index=False)['label'].transform(lambda x: x.count())
2️⃣ replace the three occurrences of label with the name of the column whose values you are counting (case-sensitive)
3️⃣ call df.head() to check it worked correctly
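Coming back to the original chunked setup, a minimal sketch of accumulating value_counts over monthly chunks; the file name and chunksize are assumptions for illustration:
import pandas as pd

total = pd.Series(dtype='float64')
# 'actions.csv' and the chunksize are hypothetical placeholders
for chunk in pd.read_csv('actions.csv', chunksize=100_000):
    total = total.add(chunk['userId'].value_counts(), fill_value=0)
total = total.astype('int64')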
