Split and join a column in dataframe - python

I have a dataframe with a single column that has the following structure:
Initial-column
--------------
abc:123-456
dsf:231-436
ghi:173-486
jkl:193-156
mnq:120-256
I want to obtain the following columns:
Index Name Coords
---- ---- ----
0 abc 123
1 abc 456
2 dsf 231
3 dsf 436
4 ghi 173
5 ghi 486
6 jkl 193
7 jkl 156
8 mnq 120
9 mnq 256
I am able to obtain Name and the coordinates, but the two coordinates end up in separate columns, and I do not know how to combine them into one. Any ideas?
df[['Name', 'CoordN']] = df['Initial-column'].str.split(':', expand=True)
df[['Coords0', 'Coords1']] = df['CoordN'].str.split('-', expand=True)

You can leverage pd.concat for that: stack the two coordinate columns, sort on the shared index with a stable sort so each row's first coordinate stays before its second, and join the names back:
out = (pd.concat([df.Coords0, df.Coords1]).rename('Coords').sort_index(kind='mergesort')
         .to_frame().join(df['Name']).reset_index(drop=True))
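For completeness, the whole transformation can also be done in one pass with str.split plus explode (available since pandas 0.25); a minimal sketch, reusing the question's column names:
import pandas as pd

df = pd.DataFrame({'Initial-column': ['abc:123-456', 'dsf:231-436',
                                      'ghi:173-486', 'jkl:193-156',
                                      'mnq:120-256']})

# Split off the name, split the coordinate pair into a list, then explode
# the list so each coordinate gets its own row.
parts = df['Initial-column'].str.split(':', expand=True)
out = (pd.DataFrame({'Name': parts[0], 'Coords': parts[1].str.split('-')})
         .explode('Coords')
         .reset_index(drop=True))
print(out)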

Related

Expand information in one dataframe and link it with data from another

Let's say I have a dataframe like this:
df1
Index Id
ABC [1227, 33234]
DEF [112, 323, 2223, 231239]
GHI [9238294, 213, 2398219]
And another one:
df2
Id variable
112 500
213 78073
323 10000000
1227 12
...
9238294 906
My goal is to expand df1['Id'] and connect each Id with its respective value from df2['variable'], so that I can compare the variable values within each Index of df1.
The data at hand is large.
What's the most efficient way to expand the information from df1 and attach the corresponding value from df2['variable']?
You can explode df1 and merge it with df2 on Id:
out = df1.explode('Id').astype({'Id':int}).merge(df2.astype({'Id':int}), on='Id')
Output:
index Id variable
0 ABC 1227 12
1 DEF 112 500
2 DEF 323 10000000
3 GHI 9238294 906
4 GHI 213 78073
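If df2 serves purely as a lookup table, mapping the values instead of merging may also work well; a minimal sketch under the same data assumptions (note that unmatched Ids become NaN rather than being dropped, unlike an inner merge):
import pandas as pd

df1 = pd.DataFrame({'Index': ['ABC', 'DEF', 'GHI'],
                    'Id': [[1227, 33234],
                           [112, 323, 2223, 231239],
                           [9238294, 213, 2398219]]})
df2 = pd.DataFrame({'Id': [112, 213, 323, 1227, 9238294],
                    'variable': [500, 78073, 10000000, 12, 906]})

out = df1.explode('Id').astype({'Id': int})
# Look each Id up in a Series indexed by Id.
out['variable'] = out['Id'].map(df2.set_index('Id')['variable'])
print(out)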

How to join concatenated values to new values in python

Hi, I'm new to Python and trying to understand joining.
I have two dataframes:
df1
OutputValues
12-99
22-99
264-99
12-323,138-431
4-21
12-123
df2
OldId NewId
99 191
84 84
323 84
59 59
431 59
208 59
60 59
58 59
325 59
390 59
324 59
564 564
123 564
21 21
I want to join these two based on the second half of the values in df1, i.e. the values after the hyphen; for example, 12-99 joins to OldId 99 in df2, and 4-21 to OldId 21.
The final output dataframe should carry the new values from df2 and look like this:
df3
OutputValues OutputValues2
12-99 12-191
22-99 22-191
264-99 264-191
12-323,138-431 12-323,138-59
4-21 4-21
12-123,4-325 12-564,4-59
As you can see, the first part of each pair is kept and joined with the new id in my desired final output dataframe df3: where there was 99 it is replaced with 191, 123 is replaced with 564, 325 with 59, etc.
How can I do this?
Let's extract both parts, map the last part then concatenate back:
s = df1.OutputValues.str.extractall(r'(\d+-)(\d+)')
df1['OutputValues2'] = (s[0] + s[1].map(df2.astype(str).set_index('OldId')['NewId'])
                        ).groupby(level=0).agg(','.join)
Output:
OutputValues OutputValues2
0 12-99 12-191
1 22-99 22-191
2 264-99 264-191
3 12-323,138-431 12-84,138-59
4 4-21 4-21
5 12-123 12-564
Update: It looks like a simple replace would also work, though it might fail in edge cases, since nothing stops a pattern like -99 from matching inside a longer id such as -990:
df1['OutputValues2'] = df1.OutputValues.replace(
    ('-' + df2.astype(str)).set_index('OldId')['NewId'], regex=True)
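If that edge case matters, one way to guard against it (a sketch, assuming the frames above) is to anchor each pattern at a word boundary:
# Build regex patterns anchored with \b so '-99' cannot match inside a
# longer id such as '-990'.
mapping = {rf'-{old}\b': f'-{new}'
           for old, new in zip(df2.OldId.astype(str), df2.NewId.astype(str))}
df1['OutputValues2'] = df1.OutputValues.replace(mapping, regex=True)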
Alternatively, step by step with explode and merge:
# Split on commas, explode so each pair gets its own row, split each pair
# on the hyphen, and join the pieces back to df1.
g = df1['OutputValues'].str.split(',').explode().str.split('-', expand=True).join(df1)
# Merge df2 with the exploded frame on the part after the hyphen.
df3 = df2.astype(str).merge(g, left_on='OldId', right_on=1)
# Create OutputValues2 and drop the columns that are no longer required.
df3 = df3.assign(OutputValues2=df3[0].str.cat(df3.NewId, sep='-')).drop(columns=['OldId', 'NewId', 0, 1])
# Re-aggregate the exploded rows per original OutputValues.
df3.groupby('OutputValues')['OutputValues2'].agg(','.join).reset_index()
OutputValues OutputValues2
0 12-123 12-564
1 12-323,138-431 12-84,138-59
2 12-99 12-191
3 22-99 22-191
4 264-99 264-191
5 4-21 4-21

Sort letters in ascending order ('a-z') in Python after using value_counts

I imported my data file, isolated the first letter of each word, and computed the word counts. My next step is to sort the letters in ascending order, 'a'-'z'. This is the code that I have right now:
import pandas as pd
df = pd.read_csv("text.txt", names=['FirstNames'])
df
df['FirstLetter'] = df['FirstNames'].str[:1]
df
df['FirstLetter'] = df['FirstLetter'].str.lower()
df
df['FirstLetter'].value_counts()
df
df2 = df['FirstLetter'].index.value_counts()
df2
Using .index.value_counts() wasn't working for me. It returned this output:
Out[72]:
2047 1
4647 1
541 1
4639 1
2592 1
545 1
4643 1
2596 1
549 1
2600 1
2612 1
553 1
4651 1
2604 1
557 1
4655 1
2608 1
561 1
2588 1
4635 1
..
How can I fix this?
You can use the sort_index() function. This should work:
df['FirstLetter'].value_counts().sort_index()
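Putting the whole pipeline together, a minimal sketch (assuming text.txt holds one name per line):
import pandas as pd

df = pd.read_csv("text.txt", names=['FirstNames'])
# First letter of each name, lower-cased so 'A' and 'a' count together.
df['FirstLetter'] = df['FirstNames'].str[:1].str.lower()
# value_counts indexes the result by letter; sort_index orders it a-z.
print(df['FirstLetter'].value_counts().sort_index())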

Turn the column headers into the first row and row headers into the first column in Pandas dataframe

I have a dataframe that looks like so:
123 345 456 789
987 876 765 543
... ... ... ...
But the top row and leftmost column are taken as headers when they are actually values. Is there any way to shift them down/right and replace them with the default index?
EDIT: I have already considered setting header=None, but it is not an option. The dataframe was created via read_excel, and many parts of the program already use .loc and the like, directly referencing the header names that are to be dropped.
For your case you can just shift the headers back into the data. Given the current dataframe:
345 456 789
123
987 876 765 543
the double transpose trick moves both the column headers and the index name back into the data:
df.reset_index().T.reset_index().T
Out:
0 1 2 3
index 123 345 456 789
0 987 876 765 543
But if you are reading the data from a CSV file, you can avoid taking the first row as a header in the first place (header=None):
pd.read_csv('data.csv', header=None)
Out:
0 1 2 3
0 123 345 456 789
1 987 876 765 543
Use the parameter index_col=[0]; by default the first row is converted to column names, so no extra parameter is necessary for that:
import pandas as pd
from io import StringIO

temp=u"""123;345;456;789
987;876;765;543"""
#after testing replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp), sep=";", index_col=[0])
print (df)
345 456 789
123
987 876 765 543
If the input data is a DataFrame with no header:
print (df)
0 1 2 3
0 123 345 456 789
1 987 876 765 543
#set first row to columns
df.columns = df.iloc[0]
#remove first row from data and remove columns name
df = df.iloc[1:].rename_axis(None, axis=1)
#set index by first column
df = df.set_index(df.columns[0])
print (df)
345 456 789
123
987 876 765 543
If all values in the data share the same type, it is possible to use numpy with indexing:
arr = df.values
df = pd.DataFrame(arr[1:,1:], index=arr[1:,0], columns=arr[0,1:])
df.index.name = arr[0,0]
print (df)
345 456 789
123
987 876 765 543
There seems to be an issue with the creation of the dataframe. How is the dataframe created? You can most likely solve your issue right at creation.
If that, however, is not an option, try the following:
pandas.DataFrame.reset_index() is what you want for the row labels. As for the column names, just add them as a regular row using pandas.DataFrame.append() with df.columns as an argument (where df is your dataframe), and rename the columns afterwards.
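A minimal sketch of that idea, with assumed sample values (pd.concat is used in place of DataFrame.append, which was removed in pandas 2.0):
import pandas as pd

# A frame whose headers are really data, as in the question.
df = pd.DataFrame([[987, 876, 765, 543]], columns=[123, 345, 456, 789])

# Prepend the header row as an ordinary data row, then replace the
# column names with default integer labels.
out = pd.concat([df.columns.to_frame().T, df], ignore_index=True)
out.columns = range(out.shape[1])
print(out)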

Ordering a dataframe using value_counts

I have a dataframe in which, under the column "component_id", component_ids repeat several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene, but under different names). I wanted to find out the frequency of each gene, since I want the top 20 most frequently hit genes, so I performed value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to how often each component_id occurs?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of a count column to sort the rows and then drop it afterwards, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
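For the second part of the question, keeping only the first occurrence of each component_id, a small sketch (assuming df_sorted from above):
# Keep the first row per component_id; thanks to the sort above, the
# remaining rows stay ordered by frequency.
first_only = df_sorted.drop_duplicates(subset='component_id', keep='first')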
