Conditional Extraction of Data from Pandas Dataframe

I have a simple DataFrame that looks like:
Names
0 Alexi Laiho
1 Jari Maenpaa
2 Kirk Hammett
3 Antti Kokko
4 Yngwie Malmsteen
5 Petri Lindroos
I want to retrieve records which only have more than 5 vowels in their names.
For this I made function:
def vowcount(sentence=[]):
    count=0
    vow='aeiouAEIOU'
    for i in sentence:
        for j in i:
            if j in vow:
                count+=1
    return count
How can I use this function to extract records from the DataFrame?
Please help me to understand how to use df.apply(map()) function on this Pandas Series and how to get the same using list comprehension if possible.

We can use a simple regex with str.lower and str.count, then filter with .query:
m = df['Names'].str.lower().str.count(r'[aeiou]')
df = df.query('@m > 5')
Or we can use re.I to ignore case:
import re
m = df['Names'].str.count(r'[aeiou]', flags=re.I)
df = df.query('@m > 5')
Output
Names
0 Alexi Laiho
1 Jari Maenpaa

Alternatively with findall:
import re
df[df.Names.str.findall('[aeiou]',flags=re.I).str.len().gt(5)]
Names
0 Alexi Laiho
1 Jari Maenpaa
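
Neither answer above calls the asker's own function. As a minimal sketch, assuming a simplified vowcount that takes a single string (the rewritten signature and the names via_apply and via_listcomp are illustrative, not from the question), the same filter can be written with Series.apply or with a list comprehension:

```python
import pandas as pd

df = pd.DataFrame({"Names": ["Alexi Laiho", "Jari Maenpaa", "Kirk Hammett",
                             "Antti Kokko", "Yngwie Malmsteen", "Petri Lindroos"]})

def vowcount(name):
    """Count the vowels in one string (simplified from the question's helper)."""
    return sum(ch in "aeiouAEIOU" for ch in name)

# Series.apply calls vowcount once per name and returns a Series of counts
via_apply = df[df["Names"].apply(vowcount) > 5]

# The same boolean mask built with a list comprehension
via_listcomp = df[[vowcount(name) > 5 for name in df["Names"]]]

print(via_apply)
```

Both forms run a Python-level loop, so on large frames the vectorized str.count approach is likely to be faster.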

Related

Multiple lambda outputs in string replacement using apply [Python]

I have a list of "states" over which I have to iterate:
states = ['antioquia', 'boyaca', 'cordoba', 'choco']
I have to go through one column of a pandas df and cut the string wherever the state text is found, so I tried:
df_copy['joined'].apply([(lambda x: x.replace(x,x[:-len(j)]) if x.endswith(j) and len(j) != 0 else x) for j in states])
And the result is:
Result wanted:
joined column is the input and the desired output is p_joined column
If possible, I would also like to find the state anywhere in the string, not only at the end, and remove it.
Thanks in advance for your help.

This will do what your question asks:
df_copy['p_joined'] = df_copy.joined.str.replace('(' + '|'.join(states) + ')$', '', regex=True)
Output:
joined p_joined
0 caldasantioquia caldas
1 santafeantioquia santafe
2 medelinantioquiamedelinantioquia medelinantioquiamedelin
3 yarumalantioquia yarumal
4 medelinantioquiamedelinantioquia medelinantioquiamedelin
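
The question also asks about removing a state that appears anywhere in the string. A minimal sketch, assuming the same states list and a small made-up df_copy (the column name no_state is hypothetical): dropping the trailing `$` anchor makes the pattern match every occurrence, and re.escape guards against state names that contain regex metacharacters.

```python
import re
import pandas as pd

states = ['antioquia', 'boyaca', 'cordoba', 'choco']
df_copy = pd.DataFrame({"joined": ["caldasantioquia",
                                   "santafeantioquia",
                                   "medelinantioquiamedelinantioquia"]})

# Alternation of the escaped state names, e.g. "antioquia|boyaca|..."
pattern = "|".join(re.escape(s) for s in states)

# Anchored with $: only a state at the very end of the string is removed
df_copy["p_joined"] = df_copy["joined"].str.replace(f"(?:{pattern})$", "", regex=True)

# No anchor: every occurrence of a state is removed
df_copy["no_state"] = df_copy["joined"].str.replace(pattern, "", regex=True)

print(df_copy)
```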

python dataframe count word occurrences

I have searched a lot here and couldn't find an answer.
I have a dataframe with a column "Description" which contains a long string.
I'm trying to count the number of occurrences of a specific word, "restaurant":
df['has_restaurants'] = 0
for index,text in enumerate(df['Description']):
    text = text.split()
    df['has_restaurants'][index] = (sum(map(lambda count : 1 if 'restaurant' in count else 0, text)))
Did the above and it works but it doesn't look like a good way to do it and it generates this "error" as well:
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
df['has_restaurants'][index] = (sum(map(lambda count : 1 if 'restaurant' in count else 0, text)))

You can simplify that by using the .str.count method. Consider the following simple example:
import pandas as pd
df = pd.DataFrame({"description":["ABC DEF GHI","ABC ABC ABC","XYZ XYZ XYZ"]})
df['ABC_count'] = df.description.str.count("ABC")
print(df)
output
description ABC_count
0 ABC DEF GHI 1
1 ABC ABC ABC 3
2 XYZ XYZ XYZ 0

You could use Python's native .count() method:
df['has_restaurants'] = 0
for index,text in enumerate(df['Description']):
    # .loc avoids the chained-assignment warning; assumes the default 0..n-1 index
    df.loc[index, 'has_restaurants'] = text.count('restaurant')
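
Putting the two answers together, a minimal sketch on made-up data (the sample descriptions are hypothetical): Series.str.count replaces the whole loop, so there is no element-wise assignment and no warning. Note that it counts substring matches, so "Restaurants" counts as one hit, and re.IGNORECASE is an optional addition not present in the question.

```python
import re
import pandas as pd

df = pd.DataFrame({"Description": ["A restaurant near two Restaurants",
                                   "no food here",
                                   "restaurant restaurant"]})

# One vectorized call: number of regex matches per row, ignoring case
df["has_restaurants"] = df["Description"].str.count("restaurant", flags=re.IGNORECASE)

print(df)
```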

How to Convert a text data into DataFrame

How can I convert the text data below into a pandas DataFrame?
(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)
I want to convert this to a DataFrame with 4 rows and 5 columns. For example, the first row contains the first element of each parenthesized group.
Thanks for your contribution.

Try this:
import pandas as pd
with open("file.txt") as f:
    file = f.read()
df = pd.DataFrame([{f"name{id}": val.replace("(", "").replace(")", "") for id, val in enumerate(row.split(",")) if val} for row in file.split()])

import re
import pandas as pd
with open('file.txt') as f:
    # one list of numbers per line; the output below assumes one tuple per line of file.txt
    data = [re.findall(r'([\-\d.]+)',data) for data in f.readlines()]
df = pd.DataFrame(data).T.astype(float)
Output:
0 1 2 3 4
0 -9.833343 -5.531373 -11.492390 -22.258020 -20.341879
1 -5.920631 -8.310108 -1.680536 -10.128438 -9.415762
2 -7.832280 -3.280625 -4.147730 -2.968883 -3.348587
3 5.553141 -6.860671 -3.541440 -2.705747 -7.654747

Your data is basically a tuple of tuples, so you can pass it as a list of tuples and build a DataFrame from it.
Your Sample Data:
text_data = ((-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665))
Result:
As you can see, the default display shows 6 decimal places while your data has up to 8, so you can set pd.options.display.float_format accordingly.
pd.options.display.float_format = '{:,.8f}'.format
To get your desired layout, transpose the DataFrame:
pd.DataFrame(list(text_data)).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665
OR
Alternatively, you can create the DataFrame directly from a tuple (or list) of tuples, as below.
data = (-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)
# data = [(-9.83334315,-5.92063135,-7.83228037,5.55314146),(-5.53137301,-8.31010785,-3.28062536,-6.86067081),(-11.49239039,-1.68053601,-4.14773043,-3.54143976),(-22.25802006,-10.12843806,-2.9688831,-2.70574665),(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
pd.DataFrame(data).T
0 1 2 3 4
0 -9.83334315 -5.53137301 -11.49239039 -22.25802006 -20.34187910
1 -5.92063135 -8.31010785 -1.68053601 -10.12843806 -9.41576250
2 -7.83228037 -3.28062536 -4.14773043 -2.96888310 -3.34858700
3 5.55314146 -6.86067081 -3.54143976 -2.70574665 -7.65474665

wrap the tuples as a list
data=[(-9.83334315,-5.92063135,-7.83228037,5.55314146),
(-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976),
(-22.25802006,-10.12843806,-2.9688831,-2.70574665),
(-20.3418791,-9.4157625,-3.348587,-7.65474665)]
df=pd.DataFrame(data, columns=['A','B','C','D'])
print(df)
output:
A B C D
0 -9.833343 -5.920631 -7.832280 5.553141
1 -5.531373 -8.310108 -3.280625 -6.860671
2 -11.492390 -1.680536 -4.147730 -3.541440
3 -22.258020 -10.128438 -2.968883 -2.705747
4 -20.341879 -9.415762 -3.348587 -7.654747
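
The answers above either read a file or assume the tuples are already Python objects. As a minimal sketch for the case where the data is still one text string (the variable names text and rows are illustrative), ast.literal_eval can parse it safely once it is wrapped in brackets, and a transpose gives the 4 rows by 5 columns the question asks for:

```python
import ast
import pandas as pd

text = """(-9.83334315,-5.92063135,-7.83228037,5.55314146), (-5.53137301,-8.31010785,-3.28062536,-6.86067081),
(-11.49239039,-1.68053601,-4.14773043,-3.54143976), (-22.25802006,-10.12843806,-2.9688831,-2.70574665), (-20.3418791,-9.4157625,-3.348587,-7.65474665)"""

# Wrapping in [...] turns the text into a valid list literal, even across line breaks
rows = ast.literal_eval("[" + text + "]")

# Each tuple becomes a row; transposing makes each tuple a column instead
df = pd.DataFrame(rows).T

print(df.shape)
```

literal_eval only accepts Python literals, so unlike eval it will not execute arbitrary code found in the text.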

How to split a string without a given delimiter in pandas

dfcolumn = [PUEF2CarmenXFc034DpEd, PUEF2BalulanFc034CamH, CARF1BalulanFc013Baca, ...]
My output should be:
dfnewcolumn1 = [PUEF2, PUEF2 , CARF1]
dfnewcolumn2 = [CarmenXFc034DpEd, BalulanFc034CamH, BalulanFc013Baca]

Assuming your split criterion is a fixed number of characters (e.g. 5 here), you can use:
df['dfnewcolumn1'] = df['dfcolumn'].str[:5]
df['dfnewcolumn2'] = df['dfcolumn'].str[5:]
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
If your split criterion is the first digit in the string, you can use:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Using the following modified original data with more test cases:
dfcolumn
0 PUEF2CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH
2 CARF1BalulanFc013Baca
3 CAF1BalulanFc013Baca
4 PUEFA2BalulanFc034CamH
Run code:
df[['dfnewcolumn1', 'dfnewcolumnX']] = df['dfcolumn'].str.split(r'(?<=\d)\D', n=1, expand=True)
df[['dfnewcolumnX', 'dfnewcolumn2']] = df['dfcolumn'].str.split(r'\D*\d', n=1, expand=True)
df = df.drop(columns='dfnewcolumnX')
Result:
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
3 CAF1BalulanFc013Baca CAF1 BalulanFc013Baca
4 PUEFA2BalulanFc034CamH PUEFA2 BalulanFc034CamH

Assuming your prefix consists of a sequence of letters followed by a sequence of digits, both of variable length, a regex split function can be constructed and applied to each cell.
Solution
import pandas as pd
import re
# data
df = pd.DataFrame()
df["dfcolumn"] = ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]
def f_split(s: str):
    """Split two part by regex"""
    # alphabet(s) followed by digit(s)
    o = re.match(r"^([A-Za-z]+\d+)(.*)$", s)
    # may add exception handling here if there is no match
    return o.group(1), o.group(2)
df[["dfnewcolumn1", "dfnewcolumn2"]] = df["dfcolumn"].apply(f_split).to_list()
Note the .to_list() to convert tuples into lists, which is required for the new column assignment to work.
Result
print(df)
dfcolumn dfnewcolumn1 dfnewcolumn2
0 PUEF2CarmenXFc034DpEd PUEF2 CarmenXFc034DpEd
1 PUEF2BalulanFc034CamH PUEF2 BalulanFc034CamH
2 CARF1BalulanFc013Baca CARF1 BalulanFc013Baca
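
The same regex can also be used without a Python-level function. A minimal sketch, assuming the same letters-then-digits prefix rule: Series.str.extract returns one column per capture group, and named groups become the column names directly. Rows that do not match the pattern come back as NaN rather than raising.

```python
import pandas as pd

df = pd.DataFrame({"dfcolumn": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH",
                                "CARF1BalulanFc013Baca", "CAF1BalulanFc013Baca",
                                "PUEFA2BalulanFc034CamH"]})

# Named groups: letters followed by digits, then the rest of the string
parts = df["dfcolumn"].str.extract(r"^(?P<dfnewcolumn1>[A-Za-z]+\d+)(?P<dfnewcolumn2>.*)$")

# Attach the two extracted columns to the original frame
df = df.join(parts)

print(df)
```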

How about this compact solution:
import pandas as pd
df = pd.DataFrame({"original": ["PUEF2CarmenXFc034DpEd", "PUEF2BalulanFc034CamH", "CARF1BalulanFc013Baca"]})
df2 = pd.DataFrame(df.original.str.split(r"(\d)", n=1).to_list(), columns=["part1", "separator", "part2"])
df2.part1 = df2.part1 + df2.separator.astype(str)
df2
part1 separator part2
0 PUEF2 2 CarmenXFc034DpEd
1 PUEF2 2 BalulanFc034CamH
2 CARF1 1 BalulanFc013Baca
I use:
- Series.str.split with a regex pattern and the n=1 keyword argument so that it splits only on the first match
- in the regex pattern, a group (the parentheses in (\d)) to capture the separating character
- to_list() to output the split as a list of lists
- the DataFrame constructor to build a new DataFrame from that list
- string concatenation of two columns

Python: how to filter pandas.Series with function without losing index association?

I have a pandas.DataFrame on which I'm iterating over the rows. On each row I need to filter out some non valuable values and keep the indexes association. This is where I'm at right now:
for i,row in df.iterrows():
    my_values = row["first_interesting_column":]
    # here I need to filter the 'my_values' Series based on a function
    # what I'm doing right now is using the built-in Python filter function, but what I get back is a list with no index anymore
    my_valuable_values = filter(lambda x: x != "-", my_values)
How can I do that?

Someone on IRC suggested the answer. Here it is:
w = my_values != "-" # creates a boolean Series marking what to include/exclude
my_valuable_values = my_values[w]
... which could also be shortened to ...
my_valuable_values = my_values[my_values != "-"]
... and, of course, to avoid one more step ...
row["first_interesting_column":][row["first_interesting_column":] != "-"]
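
The mask above uses a direct comparison, while the question asks about filtering with an arbitrary function. A minimal sketch, assuming a hypothetical predicate is_valuable and a made-up row: Series.map applies the function to each value and returns a boolean Series with the same index, which can then be used as the mask.

```python
import pandas as pd

row = pd.Series({"2009": "1", "2010": "-", "2011": "4", "2012": "-"})

def is_valuable(value):
    """Hypothetical predicate; any function returning True/False works here."""
    return value != "-"

# map() returns a boolean Series aligned with row's index
mask = row.map(is_valuable)

# Boolean indexing keeps the original labels of the surviving values
my_valuable_values = row[mask]

print(my_valuable_values)
```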

It is generally bad practice (and very slow) to iterate over rows. As @JohnE suggested, you want to use applymap.
If I understand your question, I think what you want to do is:
import numpy as np
import pandas as pd
from io import StringIO
datastring = StringIO("""\
2009 2010 2011 2012
1 4 - 4
3 - 2 3
4 - 8 7
""")
# columns are whitespace-separated
df = pd.read_table(datastring, sep=r'\s+')
# applymap is named DataFrame.map in pandas 2.1 and later
a = df[df.applymap(lambda x: x != '-')].astype(float).values
a[~np.isnan(a)]
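
The final step above returns a flat NumPy array, so the row and column labels are lost. A minimal sketch of one way to keep them without iterating, on a small made-up frame of strings (the name kept is illustrative): mask the unwanted cells with where, then stack into a Series indexed by (row, column).

```python
import pandas as pd

df = pd.DataFrame({"2009": ["1", "3"], "2010": ["4", "-"]})

# where() keeps cells that satisfy the condition and puts NaN elsewhere;
# stack() reshapes to a Series with a (row, column) MultiIndex;
# dropna() removes the masked cells explicitly
kept = df.where(df != "-").stack().dropna()

print(kept)
```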
