How to select all Dataframe columns with the same names? [duplicate] - python

This question already has answers here:
Find column whose name contains a specific string
(8 answers)
Closed 17 days ago.
I am creating a dataframe based on a csv import:
ID, attachment, attachment, comment, comment
1, lol.jpg, lmfao.png, 'Luigi',
2, cat.docx, , 'It's me', 'Mario'
Basically the number of 'attachment' and 'comment' columns corresponds to the row that has the largest number of attachments and comments.
Since I am exporting the CSV from a third party software, I do not know in advance how many attachments and comment columns there will be.
Importing this CSV with pd.read_csv creates the following dataframe:
   ID attachment attachment.1    comment comment.1
0   1    lol.jpg    lmfao.png    'Luigi'       NaN
1   2   cat.docx          NaN  'It's me'   'Mario'
Is there a simple way to select all attachment/comment columns?
Something like attachments_df = imported_df.attachment.all or comments_df = imported_df['comment']?
Thanks.

Use DataFrame.filter with a regex: ^ anchors the start of the column name, \.? optionally matches the literal dot, \d* matches the numeric suffix pandas appends to duplicate names, and $ anchors the end of the string:
attachments_df = imported_df.filter(regex=r'^attachment\.?\d*$')
comments_df = imported_df.filter(regex=r'^comment\.?\d*$')

Another possible solution:
attachments_df = imported_df.loc[:, imported_df.columns.str.startswith('attachment')]
comments_df = imported_df.loc[:, imported_df.columns.str.startswith('comment')]

You can also use the like argument of the filter function:
imported_df.filter(like='attach')
  attachment attachment.1
0    lol.jpg    lmfao.png
1   cat.docx          NaN
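All three selection methods can be checked on a small frame mimicking the dataframe built from the question's CSV (column names and values taken from the example above):

```python
import pandas as pd

# A small frame mimicking the dataframe built from the question's CSV
imported_df = pd.DataFrame({
    "ID": [1, 2],
    "attachment": ["lol.jpg", "cat.docx"],
    "attachment.1": ["lmfao.png", None],
    "comment": ["'Luigi'", "'It's me'"],
    "comment.1": [None, "'Mario'"],
})

# filter with a regex: the base name, then an optional ".N" suffix
attachments_df = imported_df.filter(regex=r"^attachment(\.\d+)?$")

# startswith-based selection picks the same kind of column group
comments_df = imported_df.loc[:, imported_df.columns.str.startswith("comment")]

print(list(attachments_df.columns))  # ['attachment', 'attachment.1']
print(list(comments_df.columns))     # ['comment', 'comment.1']
```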

Related

Pandas - How to split a string column into several columns, by the index of specific characters?

I want to extract the user ID from a string column called "filename", and create a new ID column,
based on the index of specific character in the original string.
Two examples for the string in "filename", with ID of 2 or 3 digits:
filename = ID100session1neg_emotions_rating.csv ---> ID = 100
filename = ID21session2neu_emotions_rating.csv ---> ID = 21
I tried this -
df['ID'] = df.filename.str[2:4]
but I couldn't find the end index of the ID for the slice per row (it's 3 or 4, depending on whether the ID has 2 or 3 digits).
Finding the index of "s" in each row of the dataframe would solve my problem.
The simple option didn't work for me -
s_index = df.filename.str.index("s")
(I also tried some split option, but I don't have a specific character such as comma, to split by)
Thanks a lot!
Sorry if it's a duplicate of a previous question.
I'd use regex with str.extract:
s_index = df.filename.str.extract(r"^ID(\d+)")
As integers:
s_index = df.filename.str.extract(r"^ID(\d+)").astype(int)
Regex101 explanation
An alternative to regex is to split on 'session' first and grab the first element, then split that on 'ID' and grab the last element:
df['ID'] = df.filename.str.split('session').str[0].str.split('ID').str[1]
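Both approaches can be sketched on the two example filenames from the question (extract uses expand=False here so a Series comes back instead of a one-column DataFrame):

```python
import pandas as pd

# The two example filenames given in the question
df = pd.DataFrame({"filename": ["ID100session1neg_emotions_rating.csv",
                                "ID21session2neu_emotions_rating.csv"]})

# Regex approach: capture the digits right after the leading "ID"
df["ID"] = df.filename.str.extract(r"^ID(\d+)", expand=False).astype(int)

# Split approach: text before "session", then text after "ID" (strings, not ints)
ids_via_split = df.filename.str.split("session").str[0].str.split("ID").str[1]

print(df["ID"].tolist())       # [100, 21]
print(ids_via_split.tolist())  # ['100', '21']
```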

Filter rows of a dataframe based on presence of specific character '+' in the column

I have data frame like this:
id : Name
0 : one
1 : one + two
2 : two
3 : two + three + four
I want to filter rows from this dataframe where name contains '+' and save it to another dataframe. I tried:
df[df.Name.str.contains("+")]
but I'm getting the error:
nothing to repeat at position 0
Any help would be appreciated... Thanks
Looking at the documentation of the str.contains method, it assumes that the string you are passing is a regexp by default.
Therefore, you can either escape the plus character: "\+" or pass the argument regex=False to the method:
df[df.Name.str.contains(r"\+")]
df[df.Name.str.contains("+", regex=False)]
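A quick check on the frame from the question shows both variants select the same rows:

```python
import pandas as pd

# The frame from the question
df = pd.DataFrame({"id": [0, 1, 2, 3],
                   "Name": ["one", "one + two", "two", "two + three + four"]})

# Escape the regex metacharacter '+' ...
plus_rows = df[df.Name.str.contains(r"\+")]
# ... or disable regex matching entirely
plus_rows_literal = df[df.Name.str.contains("+", regex=False)]

print(plus_rows.Name.tolist())  # ['one + two', 'two + three + four']
```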

Ignoring multiple commas while reading csv in pandas

I'm trying to read multiple files whose names start with 'site_'. For example, file names like site_1, site_a.
Each file has data like :
Login_id, Web
1,http://www.x1.com
2,http://www.x1.com,as.php
I need two columns in my pandas df: Login_id and Web.
I am facing an error when I try to read records like line 2.
df_0 = pd.read_csv('site_1',sep='|')
df_0[['Login_id, Web','URL']] = df_0['Login_id, Web'].str.split(',',expand=True)
I am facing the following error :
ValueError: Columns must be same length as key.
Please let me know where I am making a mistake, and any good approach to solve the problem. Thanks
Solution 1: use split with argument n=1 and expand=True.
result= df['Login_id, Web'].str.split(',', n=1, expand=True)
result.columns= ['Login_id', 'Web']
That results in a dataframe with two columns, so if you have more columns in your dataframe, you need to concat it with your original dataframe (that also applies to the next method).
EDIT Solution 2: there is a nicer regex-based solution using a pandas function:
result = df['Login_id, Web'].str.extract(r'^\s*(?P<Login_id>[^,]*),\s*(?P<URL>.*)', expand=True)
This splits the field and uses the names of the matching groups to create columns with their content. The output is:
  Login_id                       URL
0        1         http://www.x1.com
1        2  http://www.x1.com,as.php
Solution 3: conventional version with a regex:
You could do something customized, e.g with a regex:
import re
sp_re = re.compile(r'([^,]*),(.*)')
aux_series = df['Login_id, Web'].map(lambda val: sp_re.match(val).groups())
df['Login_id'] = aux_series.str[0]
df['URL'] = aux_series.str[1]
The result on your example data is:
                Login_id, Web Login_id                       URL
0         1,http://www.x1.com        1         http://www.x1.com
1  2,http://www.x1.com,as.php        2  http://www.x1.com,as.php
Now you could drop the column 'Login_id, Web'.
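Solution 1 can be verified directly on the two example records from the question; splitting only on the first comma keeps the extra comma inside the URL:

```python
import pandas as pd

# The two records from the question, including the stray extra comma
df = pd.DataFrame({"Login_id, Web": ["1,http://www.x1.com",
                                     "2,http://www.x1.com,as.php"]})

# n=1 splits only on the FIRST comma, so later commas stay in the URL
result = df["Login_id, Web"].str.split(",", n=1, expand=True)
result.columns = ["Login_id", "Web"]

print(result.Web.tolist())  # ['http://www.x1.com', 'http://www.x1.com,as.php']
```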

Using pandas cell as a key [duplicate]

This question already has an answer here:
Replace values in a pandas series via dictionary efficiently
(1 answer)
Closed 4 years ago.
I have a df where each row has a code that indicates a department.
On the other hand, I have a dictionary in which each code corresponds to a region name (a region is constituted of multiple departments).
I thought of a loop to put the value in a new column that indicates the region.
Here is what I thought would work:
for r in df:
    dep = r["dep"].astype(str)
    r["region"] = dep_dict.get(dep)
But the only thing I get is "string indices must be integers".
Do you know how I could make it work? Or should I take a totally different route (like joining)?
Thanks ! 🙏
df['region'] = df.dep.apply(lambda x: dep_dict[x])
Would this help?
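A minimal sketch of this lookup, using made-up department codes and region names (Series.map is a close alternative: unlike apply with dep_dict[x], it yields NaN for codes missing from the dictionary instead of raising KeyError):

```python
import pandas as pd

# Hypothetical department codes and a code -> region dictionary
df = pd.DataFrame({"dep": ["75", "13", "69"]})
dep_dict = {"75": "Ile-de-France", "13": "PACA", "69": "Auvergne-Rhone-Alpes"}

# map looks each code up in the dictionary, row by row
df["region"] = df.dep.map(dep_dict)

print(df.region.tolist())  # ['Ile-de-France', 'PACA', 'Auvergne-Rhone-Alpes']
```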

How to remove Semicolon from array in python?

I'm reading a csv file through pandas in Python and the last column also includes a trailing ';'. How can I remove it? Using ';' as the delimiter does not work.
Example :
0 -0.22693644;
1 -0.22602014;
2 0.37201694;
3 -0.27763826;
4 -0.5549711;
Name: Z-Axis, dtype: object
I would use parameter comment:
df = pd.read_csv(file, comment=';')
NOTE: this will work properly only for the last column, as everything starting from the comment character till the end of string will be ignored
PS: as a little bonus, pandas will treat such a column as numeric, not as a string.
Use str.rstrip:
df['Z-Axis'] = df['Z-Axis'].str.rstrip(";")
Another option:
df['Z-Axis'] = df['Z-Axis'].str[:-1]
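Both routes can be compared on an inline sample built from the question's data (StringIO stands in for the real file):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for the file: each value ends with a semicolon
raw = "Z-Axis\n-0.22693644;\n-0.22602014;\n0.37201694;\n"

# Option 1: treat ';' as a comment character, so pandas parses floats directly
df1 = pd.read_csv(StringIO(raw), comment=";")

# Option 2: read as strings, strip the trailing ';', then convert
df2 = pd.read_csv(StringIO(raw))
df2["Z-Axis"] = df2["Z-Axis"].str.rstrip(";").astype(float)

print(df1["Z-Axis"].tolist())  # [-0.22693644, -0.22602014, 0.37201694]
```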
