I am extracting a numeric value from a CSV column like:
column=[None, you earn 5%]
Ideally it would store 'None' as 0 and simply '5%' for the second entry.
I tried to extract the % with the following code, but it raises an error:
"TypeError: expected string or bytes-like object"
data.loc[(data['column'] == re.findall(r'([\w]+)', data['column'])), 'disc'] = re.findall(r'([0-9]+\%)',data['column'])
I also tried a for loop, but it didn't seem to help:
def fs(a):
    for i in a:
        if i == 'None':
            a[i] = 0
        else:
            a[i] = re.search(r'(?<=\().+?(?=\))', a[i])
If you have a DataFrame with a string column and you want to replace the string 'None' with 0 while keeping the numbers and %, then do (passing regex=True explicitly, since newer pandas versions no longer treat the pattern as a regular expression by default):
df.textColumn.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
Example:
import pandas as pd
df = pd.DataFrame({'n':[1,2,3,4], 'text':["None","you earn 5%", "this is 3.4%", "5.5"]})
df['text'] = df.text.str.replace("None", "0").str.replace("[^0-9.%]", "", regex=True)
df
n text
0 1 0
1 2 5%
2 3 3.4%
3 4 5.5
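If you eventually need actual numbers rather than display strings, a possible follow-up (my own addition, not part of the original answer) converts '5%' to 0.05 and leaves plain numbers alone:
cleaned = df['text']  # already 'None' -> '0' and stripped by the snippet above
is_pct = cleaned.str.endswith('%')
numbers = pd.to_numeric(cleaned.str.rstrip('%'))
df['value'] = numbers.where(~is_pct, numbers / 100)  # '5%' -> 0.05, '5.5' -> 5.5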
I use Python and I have a DataFrame of 35,000 rows. I need to change values in a loop, but it takes too much time.
PS: I have columns named succes_1, succes_2, succes_5, ... succes_120, so I get the name of the column from another loop, and the values depend on another column.
Example:
SK_1  SK_2  SK_5  ...  SK_120  Succes_1  Succes_2  ...  Succes_120
   1     0     1             0         1         0               0
   1     1     0             1         2         1               1
for i in range(len(data_jeux)):
    for d in range(len(succ_len)):
        ids = succ_len[d]
        if data_jeux['SK_%s' % ids][i] == 1:
            data_jeux.iloc[i]['Succes_%s' % ids] = 1 + i
I'm asking if there is a faster way to do this. I tried:
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns an error; maybe it doesn't accept a string index.
You can select the relevant columns and use a boolean mask to increment. It's not clear whether your columns are naturally ordered; if they aren't, you can use sorted with a custom key function, since plain string sorting puts '100' before '20'.
def splitter(x):
    return int(x.rsplit('_', maxsplit=1)[-1])

cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
# .loc won't accept a boolean DataFrame as a row indexer, so build a
# positional mask; sk_cols and succ_cols line up pairwise after sorting
mask = (df[sk_cols] == 1).to_numpy()
df[succ_cols] += mask.astype(int)
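To see why the custom key matters, a quick standalone check (the column names here are made up):
cols = ['SK_1', 'SK_20', 'SK_120', 'SK_5']
print(sorted(cols))                # ['SK_1', 'SK_120', 'SK_20', 'SK_5'] - lexicographic
print(sorted(cols, key=splitter))  # ['SK_1', 'SK_5', 'SK_20', 'SK_120'] - numeric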
First post here. I am trying to find the total count of values in an Excel file. After importing the file, I need to count all the values except 0, and wherever it finds a 0, make that cell blank.
df6 = df5.append(df5.ne(0).sum().rename('Final Value'))
I tried the above, but it's not working properly: it counts the column-name row as well, and I only need to count the float values.
Demo DataFrame:
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
GSM95473 0.08277 0.00874 0.00363 0.01877
GSM95474 0.09503 0.00592 0.00352 0
GSM95475 0.08486 0.00678 0.00386 0.01973
GSM95476 0.08105 0.00913 0.00306 0.01801
GSM95477 0.00000 0.00812 0.00428 0
GSM95478 0.07615 0.00777 0.00438 0.01799
GSM95479 0 0.00508 1 0
GSM95480 0.08499 0.00442 0.00298 0.01897
GSM95481 0.08893 0.00734 0.00204 0
0 1 2 3
ID_REF 1007_s 1053_a 117_at 121_at
These are the column names and index values, which need to be ignored when counting.
The output should look like this after counting:
Final 8 9 9 5
If you just need the count without changing the values in your DataFrame, you can apply a function to each cell of the DataFrame with the applymap method. First create a function to check for a float:
def floatcheck(value):
    if isinstance(value, float):
        return 1
    else:
        return 0
Then apply it to your dataframe:
df6 = df5.applymap(floatcheck)
This will create a dataframe with a 1 if the value is a float and a 0 if not. Then you can apply your sum method:
df7 = df6.append(df6.sum().rename("Final Value"))
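Note that DataFrame.append was removed in pandas 2.0 (and applymap was renamed to DataFrame.map in 2.1), so on current versions the same idea can be written like this (a sketch with made-up sample values):
import pandas as pd

def floatcheck(value):
    return 1 if isinstance(value, float) else 0

df5 = pd.DataFrame({'a': [0.1, 'ID', 0.2], 'b': [0.3, 'ID', 0.4]})
df6 = df5.applymap(floatcheck)  # on pandas >= 2.1, prefer df5.map(floatcheck)
df7 = pd.concat([df6, df6.sum().to_frame('Final Value').T])
print(df7)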
I was able to solve the issue, so here it is:
df5 = df4.append(pd.DataFrame(dict(((df4[1:] != 1) & (df4[1:] != 0)).sum()), index=['Final']))
df5.columns = df4.columns
went = df5.to_csv("output3.csv")
What I did was change the starting index so the first row (which was alphanumeric) wasn't counted, and then I just compared the values.
Thanks for your response.
Below is a CSV snippet with some dummy header rows, while the actual frame is anchored by beerId:
This work is an unpublished, copyrighted work and contains confidential information.
beer consumption
consumptiondate 7/24/2018
consumptionlab H1
numbeerssuccessful 40
numbeersfailed 0
totalnumbeers 40
consumptioncomplete TRUE
beerId Book
341027 Northern Light
This df = pd.read_csv(path_csv, header=8) code works, but the issue is that the header is not always at row 8, depending on the day. I cannot figure out how to use the lambda form described in the docs:
skiprows : list-like or integer or callable, default None
Line numbers to skip (0-indexed) or number of lines to skip (int) at
the start of the file.
If callable, the callable function will be evaluated against the row
indices, returning True if the row should be skipped and False
otherwise. An example of a valid callable argument would be lambda x:
x in [0, 2].
to find the index row of beerId
I think you need to preprocess the file first:
path_csv = 'file.csv'
with open(path_csv) as f:
    lines = f.readlines()

# get the positions of all lines starting with beerId
num = [i for i, l in enumerate(lines) if l.startswith("beerId")]
# if nothing was found fall back to 0, else take the first position minus 1
num = 0 if len(num) == 0 else num[0] - 1
print(num)
8
df = pd.read_csv(path_csv, header=num)
print (df)
beerId Book
0 341027 Northern Light
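The callable skiprows form from the quoted docs can also be used directly with the detected position (num computed as above): skipping every line before the header row makes the first kept line the header, which is equivalent to header=num.
df = pd.read_csv(path_csv, skiprows=lambda x: x < num)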
I have a made-up pandas series that I split on a delimiter:
s2 = pd.Series(['2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*'])
split = s2.str.split('*')
The general logic to parse this string:
- Asterisks are the delimiter
- Numbers immediately before asterisks identify the length of the following block
- Three indicators:
  - C indicates field names will follow
  - N indicates new field values will follow
  - O indicates old field values will follow
- Numbers immediately after indicators (tough because they sit next to the numbers before asterisks) identify how many field names or values will follow
The parsing logic and code below work on a single pandas series, so understanding them in detail matters less than understanding how to apply them to a DataFrame.
I calculate the number of fields in the string (in this case, the 3 in the second block which is C316):
number_of_fields = int(split[0][1][1:int(split[0][0])])
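To make that indexing concrete (a quick trace on the sample string, using my own temporary names): split[0][0] is '2', so the block header is the first two characters of split[0][1] ('C316'), i.e. 'C3', and the digit after the indicator is the field count:
tokens = s2.iloc[0].split('*')            # same as split[0]
header_len = int(tokens[0])               # 2
n_fields = int(tokens[1][1:header_len])   # 'C316'[1:2] -> '3' -> 3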
I apply a lot of list splitting to extract the results I need into three separate lists (field names, new values, and old values):
i = 2
string_length = int(split[0][1][int(split[0][0]):])
field_names_list = []
while i < number_of_fields + 2:
    field_name = split[0][i][0:string_length]
    field_names_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 3 + number_of_fields
string_length = int(split[0][2 + number_of_fields][string_length:])
new_values_list = []
while i < 3 + number_of_fields*2:
    field_name = split[0][i][0:string_length]
    new_values_list.append(field_name)
    string_length = int(split[0][i][string_length:])
    i += 1

i = 4 + number_of_fields*2
string_length = int(split[0][3 + number_of_fields*2][string_length:])
old_values_list = []
while i <= 3 + number_of_fields*3:
    old_value = split[0][i][0:string_length]
    old_values_list.append(old_value)
    if i == 3 + number_of_fields*3:
        string_length = 0
    else:
        string_length = int(split[0][i][string_length:])
    i += 1
I combine the lists into a df with three columns:
df = pd.DataFrame({
    'field_name': field_names_list,
    'new_value': new_values_list,
    'old_value': old_values_list,
})
field_name new_value old_value
0 first_field_name field value
1 second_field_name Y
2 third_field_name hello
How would I apply this same process to a df with multiple strings? The df would look like this:
row_id string
0 24 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
1 25 2*C316*first_field_name17*second_field_name16*third_field_name2*N311*field value1*Y5*hello2*O30*0*0*
I'm unsure how to maintain the row_id with the eventual columns. The end result should look like this:
row_id field_name new_value old_value
0 24 first_field_name field value
1 24 second_field_name Y
2 24 third_field_name hello
3 25 first_field_name field value
4 25 second_field_name Y
5 25 third_field_name hello
I know I can concatenate multiple dataframes, but that would come after maintaining the row_id. How do I keep the row_id with the corresponding values after a series of list slicing operations?
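One possible approach (a sketch under my own naming, not tested beyond the sample string): condense the block-walking above into a helper, run it per row, and tag each parsed frame with its row_id before concatenating:
import pandas as pd

def parse_string(raw):
    tokens = raw.split('*')
    take = int(tokens[0])                  # chars to read from the first block header
    chunks = []
    for tok in tokens[1:]:
        chunks.append(tok[:take])          # current chunk
        rest = tok[take:]                  # trailing digits give the next chunk's length
        take = int(rest) if rest else 0
    n = int(chunks[0][1:])                 # 'C3' -> 3 entries per block
    return pd.DataFrame({
        'field_name': chunks[1:1 + n],
        'new_value': chunks[2 + n:2 + 2*n],      # skip the 'N...' header chunk
        'old_value': chunks[3 + 2*n:3 + 3*n],    # skip the 'O...' header chunk
    })

parts = [
    parse_string(row.string).assign(row_id=row.row_id)
    for row in df.itertuples(index=False)
]
out = pd.concat(parts, ignore_index=True)
out = out[['row_id', 'field_name', 'new_value', 'old_value']]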
I'm trying to convert a column in my DataFrame to numbers. The input is email domains extracted from email addresses. Sample:
>>> data['emailDomain']
0 [gmail]
1 [gmail]
2 [gmail]
3 [aol]
4 [yahoo]
5 [yahoo]
I want to create a new column where if the domain is gmail or aol, the column entry would be a 1 and 0 otherwise.
I created a method which goes like this:
def convertToNumber(row):
    try:
        if row['emailDomain'] == '[gmail]':
            return 1
        elif row['emailDomain'] == '[aol]':
            return 1
        elif row['emailDomain'] == '[outlook]':
            return 1
        elif row['emailDomain'] == '[hotmail]':
            return 1
        elif row['emailDomain'] == '[yahoo]':
            return 1
        else:
            return 0
    except TypeError:
        print("TypeError")
and used it like:
data['validEmailDomain'] = data.apply(convertToNumber, axis=1)
However, my output column is 0 even when I know there are gmail and aol emails present in the input column.
Any idea what could be going wrong?
Also, I think this usage of conditional statements might not be the most efficient way to tackle this problem. Is there any other approach to getting this done?
You can use Series.isin:
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
data['emailDomain'].isin(providers)
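One caveat (my addition): if the column actually holds one-element lists, as the [gmail] display suggests, pull the string out first; .str[0] also works on list elements:
data['validEmailDomain'] = data['emailDomain'].str[0].isin(providers).astype(int)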
Searching for the provider
Instead of applying a regex to each email in each row, you can use the Series.str methods to process a whole column at a time:
pattern2 = '(?<=#)([^.]+)(?=\.)'
df['email'].str.extract(pattern2, expand=False)
So this becomes something like this:
pattern2 = '(?<=#)([^.]+)(?=\.)'
providers = {'gmail', 'aol', 'yahoo','hotmail', 'outlook'}
df = pd.DataFrame(data={'email': ['test.1#gmail.com', 'test.2#aol.com', 'test3#something.eu']})
provider_serie = df['email'].str.extract(pattern2, expand=False)
0 gmail
1 aol
2 something
Name: email, dtype: object
interested_providers = df['email'].str.extract(pattern2, expand=False).isin(providers)
0 True
1 True
2 False
Name: email, dtype: bool
If you really want 0s and 1s, you can add a .astype(int)
Your code would work if your series contained strings. More likely, though, the cells contain lists, in which case you need to extract the first element.
I would also utilise pd.Series.map instead of using any row-wise logic. Below is a complete example:
df = pd.DataFrame({'emailDomain': [['gmail'], ['gmail'], ['gmail'], ['aol'],
                                   ['yahoo'], ['yahoo'], ['else']]})
domains = {'gmail', 'aol', 'outlook', 'hotmail', 'yahoo'}
df['validEmailDomain'] = df['emailDomain'].map(lambda x: x[0]).isin(domains)\
.astype(int)
print(df)
# emailDomain validEmailDomain
# 0 [gmail] 1
# 1 [gmail] 1
# 2 [gmail] 1
# 3 [aol] 1
# 4 [yahoo] 1
# 5 [yahoo] 1
# 6 [else] 0
You could sum up the occurrence checks for every provider via a list comprehension and write the resulting list into data['validEmailDomain']:
import numpy as np

providers = ['gmail', 'aol', 'outlook', 'hotmail', 'yahoo']
data['validEmailDomain'] = [np.sum([p in e for p in providers]) for e in data['emailDomain'].values]