I have a pandas DataFrame which contains information in columns which I would like to extract into a new column.
It is best explained visually:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})
The table shows the initial DataFrame with the Number Type 1 and Number Type 2 columns.
I would like to extract the types and create a new Type column, refactoring the DataFrame accordingly.
Basically, the numbers are collapsed into a single Number column, and the types are extracted into a new Type column. The information in the Info column stays bound to the numbers (e.g. 2 and 3 share the same information, b).
What is the best way to do this in Pandas?
Use melt with dropna:
df = df.melt('Info', value_name='Number', var_name='Type').dropna(subset=['Number'])
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print(df)
Info Type Number
0 a 1 1
1 b 1 2
4 b 2 3
5 c 2 4
Another solution with set_index and stack:
df = df.set_index('Info').stack().rename_axis(('Info','Type')).reset_index(name='Number')
df['Type'] = df['Type'].str.extract(r'(\d+)')
df['Number'] = df['Number'].astype(int)
print(df)
Info Type Number
0 a 1 1
1 b 1 2
2 b 2 3
3 c 2 4
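A third option, since the repeated columns here share a common prefix, is pd.wide_to_long, which handles the stub/suffix split for you. A minimal sketch (same data as above; the trailing digit becomes the Type level automatically, as an integer):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Number Type 1': [1, 2, np.nan],
                   'Number Type 2': [np.nan, 3, 4],
                   'Info': list('abc')})

# 'Number Type' is the stub; the suffix after sep=' ' becomes the 'Type' index level
out = (pd.wide_to_long(df, stubnames='Number Type', i='Info', j='Type', sep=' ')
         .dropna()
         .rename(columns={'Number Type': 'Number'})
         .reset_index())
out['Number'] = out['Number'].astype(int)
print(out)
```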
Related
In a pandas dataframe there are 3 columns, including RiderID and Type. Type can be either A or B.
The data looks like:
RiderID Type AnotherCol
1 A some information
2 B some information
2 B some information
3 A some information
3 B some information
So rider 3 has both types. I want to reassign the Type of the last 2 records to 'Both':
RiderID Type AnotherCol
1 A some information
2 B some information
2 B some information
3 Both some information
3 Both some information
I can only think of getting a dataframe containing all RiderIDs that have 2 types:
temp = data.groupby(by='RiderID')[['Type']].nunique()
temp = temp[temp['Type'] == 2].reset_index()
and then right-joining temp with the original dataframe data on RiderID (followed by some filtering and field removal).
But I feel there must be a less complicated way to do it.
Use groupby.transform('nunique') and boolean indexing:
m = df.groupby('RiderID')['Type'].transform('nunique').gt(1)
df.loc[m, 'Type'] = 'Both'
Updated DataFrame:
RiderID Type AnotherCol
0 1 A some information
1 2 B some information
2 2 B some information
3 3 Both some information
4 3 Both some information
If you only have 2 Types, you can also use:
g = df['Type'].eq('A').groupby(df['RiderID'])
# if we have an A are there also non-A in the group?
m = g.transform('any') != g.transform('all')
df.loc[m, 'Type'] = 'Both'
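Put together as a runnable sketch, using made-up sample data matching the question's table:

```python
import pandas as pd

df = pd.DataFrame({'RiderID': [1, 2, 2, 3, 3],
                   'Type': ['A', 'B', 'B', 'A', 'B'],
                   'AnotherCol': ['some information'] * 5})

# mark every row belonging to a rider with more than one distinct Type
m = df.groupby('RiderID')['Type'].transform('nunique').gt(1)
df.loc[m, 'Type'] = 'Both'
print(df)
```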
I am trying to add additional index rows to an existing pandas dataframe after loading csv data into it.
So let's say I load my data like this:
from io import StringIO
import pandas as pd

columns = ['Relative_Pressure', 'Volume_STP']
df = pd.read_csv(StringIO(contents), skiprows=4, delim_whitespace=True, index_col=False, header=None)
df.columns = columns
where contents is a string in csv format. The resulting DataFrame might look something like this:
For clarity reasons I would now like to add additional index rows to the DataFrame as shown here:
However, in the linked example these multiple index rows are generated right when the DataFrame is created. I would like to add rows, e.g. for unit or descr, to the columns of an existing DataFrame.
How could I do this?
You can create a MultiIndex on the columns by specifically creating the index and then assigning it to the columns separately from reading in the data.
I'll use the example from the link you provided. The first method is to create the MultiIndex when you make the dataframe:
df = pd.DataFrame({('A',1,'desc A'):[1,2,3],('B',2,'desc B'):[4,5,6]})
df.columns.names=['NAME','LENGTH','DESCRIPTION']
df
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
As stated, this is not what you are after. Instead, you can make the dataframe (from your file for example) and then make the MultiIndex from a set of lists and then assign it to the columns:
df = pd.DataFrame({'desc A':[1,2,3], 'desc B':[4,5,6]})
# Output
desc A desc B
0 1 4
1 2 5
2 3 6
# Create a multiindex from lists
index = pd.MultiIndex.from_arrays((['A', 'B'], [1, 2], ['desc A', 'desc B']))
# Assign to the columns
df.columns = index
# Output
A B
1 2
desc A desc B
0 1 4
1 2 5
2 3 6
# Name the columns
df.columns.names = ['NAME','LENGTH','DESCRIPTION']
# Output
NAME A B
LENGTH 1 2
DESCRIPTION desc A desc B
0 1 4
1 2 5
2 3 6
There are other ways to construct a MultiIndex, for example, from_tuples and from_product. You can read more about Multi Indexes in the documentation.
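For example, the same three-level header can be built with from_tuples, one tuple per column (a sketch on the same toy data):

```python
import pandas as pd

df = pd.DataFrame({'desc A': [1, 2, 3], 'desc B': [4, 5, 6]})

# one tuple per column, one element per level
df.columns = pd.MultiIndex.from_tuples(
    [('A', 1, 'desc A'), ('B', 2, 'desc B')],
    names=['NAME', 'LENGTH', 'DESCRIPTION'])
print(df)
```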
I have two pandas dataframes with names df1 and df2 such that
df1: a b c d
1 2 3 4
5 6 7 8
and
df2: b c
12 13
I want the result be like
result: b c
2 3
6 7
Note that a, b, c, d are the column names of the DataFrame. The two DataFrames differ in both shape and values. I want to match the column names of df2 against those of df1 and select all rows of df1 for the columns whose headers match; df2 is only used to select specific columns of df1 while keeping all of its rows. I tried the code below, but it gives me an empty result.
df1.columns.intersection(df2.columns)
The above code does not give me my result, as it returns only the matching index headers with no values. I want code that takes my two dataframes as input and compares the column headers to make the selection; I don't want to hard-code column names.
I believe you need:
df = df1[df1.columns.intersection(df2.columns)]
Or, as @Zero pointed out in the comments (note that & as a set operation on Index objects is deprecated in newer pandas versions; intersection is preferred):
df = df1[df1.columns & df2.columns]
Or, use reindex
In [594]: df1.reindex(columns=df2.columns)
Out[594]:
b c
0 2 3
1 6 7
Or equivalently:
In [595]: df1.reindex(df2.columns, axis=1)
Out[595]:
b c
0 2 3
1 6 7
As an alternative to intersection, you can use isin; note that a boolean mask over the columns needs loc:
df = df1.loc[:, df1.columns.isin(df2.columns)]
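End to end, with the question's frames reconstructed as sample data:

```python
import pandas as pd

# hypothetical reconstruction of the question's two frames
df1 = pd.DataFrame({'a': [1, 5], 'b': [2, 6], 'c': [3, 7], 'd': [4, 8]})
df2 = pd.DataFrame({'b': [12], 'c': [13]})

# keep only df1's columns whose names also appear in df2, all rows preserved
result = df1[df1.columns.intersection(df2.columns)]
print(result)
```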
I have generated a pivot_table as follows. There are spaces in the table, and I would like to write it to a txt file without them. How can I get that?
chaoshidishi=pd.pivot_table(clsc,index="故障发生地市",values="工单号",aggfunc=len)
chaoshidishi=chaoshidishi.to_frame()
f=open(r'E:\gaotie\dishi.txt','w')
for row in chaoshidishi:
f.write(row[0]+row[1])
f.close()
Following up on @shanmuga's comment, you should be able to use to_csv() without first using to_frame().
First, here's some sample data that seems to reflect your setup:
import pandas as pd
group = ['a','a','b','c','c']
value = [1,2,3,4,5]
df = pd.DataFrame({'group':group,'value':value})
print(df)
group value
0 a 1
1 a 2
2 b 3
3 c 4
4 c 5
Now apply pivot_table():
df.pivot_table(columns='group', values='value', aggfunc=len)
group
a 2
b 1
c 2
Name: value, dtype: int64
You can save to file directly from this output. If you don't want to preserve index and column names, use header=None on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt'))
newdf = pd.read_csv('foo.txt', header=None)
print(newdf)
0 1
0 a 2
1 b 1
2 c 2
To preserve column and index names, use the header argument on save, and the index_col argument on load:
(df.pivot_table(columns='group', values='value', aggfunc=len)
.to_csv('foo.txt', header=True))
newdf = pd.read_csv('foo.txt', index_col='group')
print(newdf)
value
group
a 2
b 1
c 2
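Applied back to the question's setup, with a hypothetical two-city stand-in for the clsc data, the manual open/write/close loop collapses to a single to_csv call:

```python
import pandas as pd

# hypothetical stand-in for the question's clsc DataFrame
clsc = pd.DataFrame({'故障发生地市': ['北京', '北京', '上海'],
                     '工单号': [101, 102, 103]})

# count tickets per city, then write straight to a text file
chaoshidishi = pd.pivot_table(clsc, index='故障发生地市', values='工单号', aggfunc=len)
chaoshidishi.to_csv('dishi.txt', header=True)
```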
I have the following data frame:
I want to remove duplicate data in the WD column for rows that have the same drug_id.
For example, there are two "crying" entries in the WD column with the same drug_id = 32, so I want to remove one of the rows containing crying.
How can I do it? I know how to drop duplicate rows, but I do not know how to add this condition to this code:
df = df.apply(lambda x:x.drop_duplicates())
You can use drop_duplicates with subset parameter which optionally considers certain columns for duplicates:
df.drop_duplicates(subset = ["drug_id", "WD"])
If upper/lower case matters when considering duplicates, you could try:
df[~df[['drug_id', 'WD']].apply(lambda x: x.astype(str).str.lower()).duplicated()]
This converts both the drug_id and WD columns to lower-case strings (astype(str) guards against the numeric drug_id column), uses the duplicated() method to identify duplicated rows, and then uses the resulting boolean series to filter them out.
Example:
df = pd.DataFrame({"A": [1,1,2,2], "B":[1,2,3,4], "C":[1,1,2,3]})
df
# A B C
#0 1 1 1
#1 1 2 1
#2 2 3 2
#3 2 4 3
df.drop_duplicates(subset=['A', 'C'])
# A B C
#0 1 1 1
#2 2 3 2
#3 2 4 3
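The case-insensitive variant can be sketched the same way, with hypothetical data where the duplicates differ only in case:

```python
import pandas as pd

# hypothetical sample: the two 'crying' rows differ only in case
df = pd.DataFrame({'drug_id': [32, 32, 45],
                   'WD': ['Crying', 'crying', 'insomnia']})

# lower-case both key columns, then drop rows marked as duplicates
mask = ~df[['drug_id', 'WD']].apply(lambda x: x.astype(str).str.lower()).duplicated()
out = df[mask]
print(out)
```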