Extracting a numerical value from .csv files - python

I have a dataframe, with a column of pathnames. I can access these paths using:
for i, p in enumerate(df['path']):
I am now however looking to extract a value from each of these output files.
The csv file looks like:
# some values
# some values : some values
# some values : some values
# some values : some values
# some string : the value I want
# some string : some values
Is there a way of extracting this value and inserting it into my dataframe?
I believe regex would do the trick. I am just not sure of the exact way. I have some template code which looks like:
if re.match(r"something", p):
    df = pd.read_csv(p)
    df.iloc[i, value_column] = ...  # the value I want

Here is a solution to extract the value from the text/csv using the built-in str.split:
def get_value(string):
    array = string.split(": ")  # maybe without the whitespace
    return array[0] if len(array) == 1 else array[1]
get_value('some values')
# 'some values'
get_value('some string : the value I want')
# 'the value I want'
Alternatively, using regex
re.sub(r'.*\:\s*(.*)', r'\1', 'some values')
# 'some values'
re.sub(r'.*\:\s*(.*)', r'\1', 'some string : the value I want')
# 'the value I want'

I got help with this question when I asked it in a clearer context. For a line in the csv file:
m = re.match(r'# some string\s*:\s*([^\n]+)', line)
if m:
    number = m.group(1)
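To tie the pieces together, here is a minimal, self-contained sketch of the header scan; the label '# some string' stands in for the real prefix and the sample text is made up:

```python
import re

sample = """\
# some values
# some values : some values
# some string : the value I want
# some string : some values
"""

def extract_value(text, label='# some string'):
    # Return the part after ':' on the first line starting with `label`.
    for line in text.splitlines():
        m = re.match(re.escape(label) + r'\s*:\s*(.+)', line)
        if m:
            return m.group(1).strip()
    return None

print(extract_value(sample))  # the value I want
```

In the question's loop, text would be the contents of the file at p, and the returned string would go into df.iloc[i, value_column].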

Update values in a string with values from a pandas data frame

Given the following Data Frame:
df = pd.DataFrame({'term': ['analys', 'applic', 'architectur', 'assess', 'item', 'methodolog', 'research', 'rs', 'studi', 'suggest', 'test', 'tool', 'viewer', 'work'],
                   'newValue': [0.810419, 0.631963, 0.687348, 0.810554, 0.725366, 0.742715, 0.799152, 0.599030, 0.652112, 0.683228, 0.711307, 0.625563, 0.604190, 0.724763]})
df = df.set_index('term')
print(df)
             newValue
term
analys       0.810419
applic       0.631963
architectur  0.687348
assess       0.810554
item         0.725366
methodolog   0.742715
research     0.799152
rs           0.599030
studi        0.652112
suggest      0.683228
test         0.711307
tool         0.625563
viewer       0.604190
work         0.724763
I am trying to update values in this string behind each "^" with the values from the Data Frame.
(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5
Additionally, this should be done with regard to the corresponding word such that I get this:
(analysi analys^0.810419 studi^0.652112 framework methodolog^0.742715 architectur^0.687348)^0.0625 (recommend suggest^0.683228 rs^0.599030)^0.125 (system tool^0.625563 applic^0.631963)^0.25 (evalu assess^0.810554 test^0.711307)^0.5
Thanks in advance for helping!
The best way I could come up with does this in multiple stages.
First, take the old string and extract all the values that you want to replace. That can be done with a regular expression.
old_string = "(analysi analys^0.8046919107437134 studi^0.6034331321716309 framework methodolog^0.7360332608222961 architectur^0.6806665658950806)^0.0625 (recommend suggest^0.6603200435638428 rs^0.5923488140106201)^0.125 (system tool^0.6207902431488037 applic^0.610009491443634)^0.25 (evalu assess^0.7828741073608398 test^0.6444937586784363)^0.5"
pattern = re.compile(r"(\w+\^(0|[1-9]\d*)(\.\d+)?)")
# pattern.findall(old_string) returns a list of tuples,
# so we need to keep just the outer capturing group for each match.
matches = [m[0] for m in pattern.findall(old_string)]
print("Matches:", matches)
In the next part, we make two dictionaries. One is a dictionary of the prefix (word part, before ^) of the values to replace to the whole value. We use that to create the second dictionary, from the values to replace to the new values (from the dataframe).
prefix_dict = {}
for m in matches:
    pre, post = m.split('^')
    prefix_dict[pre] = m
print("Prefixes:", prefix_dict)
matches_dict = {}
for i, row in df.iterrows():  # df is the dataframe from the question
    if i in prefix_dict:
        old_val = prefix_dict[i]
        new_val = "%s^%s" % (i, row.newValue)
        matches_dict[old_val] = new_val
print("Matches dict:", matches_dict)
With that done, we can loop through the items in the old value > new value dictionary and replace all the old values in the input string.
new_string = old_string
for key, val in matches_dict.items():
    new_string = new_string.replace(key, val)
print("New string:", new_string)
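As a variation, the extract/map/replace stages can be collapsed into a single pass with re.sub and a replacement function. A sketch with a trimmed-down dataframe and string from the question; the pattern requires a word character right before the ^, so the parenthesised group weights like )^0.0625 are left alone:

```python
import re
import pandas as pd

df = pd.DataFrame({'term': ['analys', 'studi'],
                   'newValue': [0.810419, 0.652112]}).set_index('term')

old_string = "(analysi analys^0.8046919107437134 studi^0.6034331321716309)^0.0625"

def repl(m):
    term = m.group(1)
    if term in df.index:
        return "%s^%s" % (term, df.loc[term, 'newValue'])
    return m.group(0)  # unknown term: keep the old value

# \w+ immediately before ^ means ")^0.0625" never matches.
new_string = re.sub(r'(\w+)\^(\d+(?:\.\d+)?)', repl, old_string)
print(new_string)
```

This avoids building the two intermediate dictionaries, at the cost of one dataframe lookup per match.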

A loop that is supposed to write lines to a file isn't working

I have a very large file that looks like this:
[original file][1]
field number 7 (info) contains ~100 pairs of X=Y separated by ';'.
I first want to split all X=Y pairs.
Next I want to scan one pair at a time, and if X is one of 4 titles and Y is an int, I want to put them in a dictionary.
After finishing going through the pairs I want to check if the dictionary contains all 4 of my titles, and if so, I want to calculate something and write it into a new file.
This is the part of my code which is supposed to do that:
for row in reader:
    m = re.split(';', row[7])  # split the info field by ';'
    d = {}
    nl = []
    for c in m:  # for each pair, split by '='; if it is one of the 4 wanted fields and the value is an int, add it to a dict
        t = re.split('=', c)
        if (t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE') and type(t[1])==int:
            d[t[0]] = t[1]
    if 'AC_MALE' in d and 'AC_FEMALE' in d and 'AN_MALE' in d and 'AN_FEMALE' in d:  # if the dict contains all 4 wanted fields, make a new line for the final file
        total_ac = int(d['AC_MALE']) + int(d['AC_FEMALE'])
        total_an = int(d['AN_MALE']) + int(d['AN_FEMALE'])
        ac_an = total_ac/total_an
        nl.extend([row[0], row[1], row[3], row[4], total_ac, total_an, ac_an])
        writer.writerow(nl)
The code runs with no errors but isn't writing anything to the file.
Can someone figure out why?
Thanks!
type(t[1])==int is never true. t[1] is always a string, because you just split it from another string. It doesn't matter that the string may contain only digits and could be converted to an int.
Test if you can convert your string to an integer, and if that fails, just move on to the next. If it succeeds, add the value to your dictionary:
for c in m:
    t = re.split('=', c)
    if t[0]=='AC_MALE' or t[0]=='AC_FEMALE' or t[0]=='AN_MALE' or t[0]=='AN_FEMALE':
        try:
            d[t[0]] = int(t[1])
        except ValueError:
            # string could not be converted, so move on
            pass
Note that you don't need re.split() here; the standard str.split() method will do. You also don't need to test afterwards whether all keys are present in your dictionary: just test whether the dictionary contains 4 elements, i.e. has a length of 4. The key-name test can be simplified too:
for row in reader:
    d = {}
    for key_value in row[7].split(';'):
        key, value = key_value.split('=')
        if key in {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}:
            try:
                d[key] = int(value)
            except ValueError:
                pass
    if len(d) == 4:
        total_ac = d['AC_MALE'] + d['AC_FEMALE']
        total_an = d['AN_MALE'] + d['AN_FEMALE']
        ac_an = total_ac / total_an
        writer.writerow([
            row[0], row[1], row[3], row[4],
            total_ac, total_an, ac_an])
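To see the parsing in isolation, here is a sketch on a single made-up info field; the field names come from the question, the numbers are invented:

```python
info = "AC_MALE=10;AC_FEMALE=5;AN_MALE=100;AN_FEMALE=80;OTHER=x"
wanted = {'AC_MALE', 'AC_FEMALE', 'AN_MALE', 'AN_FEMALE'}

d = {}
for key_value in info.split(';'):
    key, value = key_value.split('=')
    if key in wanted:
        try:
            d[key] = int(value)
        except ValueError:
            pass  # value is not an int, skip the pair

if len(d) == 4:
    total_ac = d['AC_MALE'] + d['AC_FEMALE']   # 15
    total_an = d['AN_MALE'] + d['AN_FEMALE']   # 180
    print(total_ac, total_an)
```

One more caveat: the question's code runs under Python 2, where total_ac/total_an is integer division; float(total_ac)/total_an avoids silently getting 0.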

Using np.genfromtxt to read in data that contains arrays

So I am trying to read in some data which looks like this (this is just the first line):
1 14.4132966509 (-1.2936631396696465, 0.0077236319580324952, 0.066687939649724415) (-13.170491147387787, 0.0051387952329040587, 0.0527163312916894)
I'm attempting to read it in with np.genfromtxt using:
skirt_data = np.genfromtxt('skirt_data.dat', names = ['halo', 'IRX', 'beta', 'intercept'], delimiter = ' ', dtype = None)
But it's returning this:
ValueError: size of tuple must match number of fields.
My question is, how exactly do I load in the arrays that are within the data, so that I can pull out the first number in that array? Ultimately, I want to do something like this to look at the first value of the beta column:
skirt_data['beta'][1]
Thanks ahead of time!
If each line is the same, I would go with a custom parser.
You can split the line using str.split(sep, optional max splits)
So something along the lines of
names = ['halo', 'IRX', 'beta', 'intercept']  # the list from above
output = {}
with open('skirt_data.dat') as sfd:
    for i, line in enumerate(sfd.readlines()):
        skirt_name = names[i]
        first_col, second_col, rest = line.split(' ', 2)
        output[skirt_name] = int(first_col)
print output
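Since each line mixes scalars with parenthesised tuples, one way to flesh out the custom-parser idea is to grab the tuples with a regex first. A sketch on the sample line from the question; the keys are the column names from the genfromtxt call:

```python
import re

line = ("1 14.4132966509 "
        "(-1.2936631396696465, 0.0077236319580324952, 0.066687939649724415) "
        "(-13.170491147387787, 0.0051387952329040587, 0.0527163312916894)")

def parse_line(line):
    # Parenthesised groups become tuples of floats; the leading scalars stay flat.
    tuples = [tuple(float(x) for x in t.split(','))
              for t in re.findall(r'\(([^)]*)\)', line)]
    head = line.split('(', 1)[0].split()
    return {'halo': int(head[0]), 'IRX': float(head[1]),
            'beta': tuples[0], 'intercept': tuples[1]}

row = parse_line(line)
print(row['beta'][0])
```

Collect one such dict per line and you can do row['beta'][0] directly, which is what the question was after.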

How do I remove everything after a certain character in a value in a dictionary for all dictionaries in a group of dictionaries?

My goal is to remove all characters after a certain character in a value from a set of dictionaries.
I have imported a CSV file from my local machine and printed using the following code:
import csv
with open(r'C:\Users\xxxxx\Desktop\Aug_raw_Page.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        print row
I get a set of dictionaries that look like:
{'Pageviews_Aug': '145', 'URL': 'http://www.domain.com/#fbid=12345'}
For any dictionary that includes a value with #fbid, I am trying to remove #fbid and any characters that come after it.
I have tried:
for key, value in row.items():
    if key == 'URL' and '#' in value or 'fbid' in value:
        value.split('#')[0]
print row
Didn't work.
I don't think rstrip will work, as it removes only whitespace.
Fastest way I thought about is using rsplit()
out = text.rsplit('#fbid')[0]
Okay, so I'm guessing your problem isn't in removing the text that comes after the # but in getting to that string.
What is 'row'?
I'm guessing it's a dictionary with a single 'URL' key, am I wrong?
for key, value in row.items():
    if key == 'URL' and '#fbid' in value:
        print value.split('#')[0]
I don't quite get the whole format of your data.
If you want to edit a single variable in your dictionary, you don't have to iterate through all the items:
if 'URL' in row.keys():
    if '#fbid' in row['URL']:
        row['URL'] = row['URL'].rsplit('#fbid')[0]
That should work.
But I really think you should copy an example of your whole data (three items would suffice)
Use a regular expression:
>>> import re
>>> value = 'http://www.domain.com/#fbid=12345'
>>> re.sub(ur'#fbid.*','',value)
'http://www.domain.com/'
>>> value = 'http://www.domain.com/'
>>> re.sub(ur'#fbid.*','',value)
'http://www.domain.com/'
For your code, you could do something like this to get the answer in the same format as before:
import csv
import re

with open(r'C:\Users\xxxxx\Desktop\Aug_raw_Page.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        row['URL'] = re.sub(ur'#fbid.*', '', row['URL'])
        print row
Given your sample code, it looks like it doesn't work because you don't save the result of value.split('#')[0]. Do something like:
for key, value in row.items():
    if key == 'URL' and '#' in value or 'fbid' in value:
        new_value = value.split('#')[0]  # <-- here save the result of split in new_value
        row[key] = new_value             # <-- here update the dict row
print row  # instead of printing each time, print once at the end of the operation
This can be simplified to
if '#fbid' in row['URL']:
    row['URL'] = row['URL'].split('#fbid')[0]
because it only checks for one key.
example
>>> row={'Pageviews_Aug':'145', 'URL':'http://www.domain.com/#fbid=12345'}
>>> if "#fbid" in row["URL"]:
...     row["URL"] = row['URL'].split("#fbid")[0]
>>> row
{'Pageviews_Aug': '145', 'URL': 'http://www.domain.com/'}
>>>
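One more option worth mentioning: since the values are URLs and # starts the URL fragment, the standard library can strip it without a regex. In Python 3 that is urllib.parse.urldefrag (in the Python 2 used above, urlparse.urldefrag, which returns a plain tuple instead of a result object):

```python
from urllib.parse import urldefrag

row = {'Pageviews_Aug': '145', 'URL': 'http://www.domain.com/#fbid=12345'}
row['URL'] = urldefrag(row['URL']).url  # the fragment ('fbid=12345') is dropped
print(row)
```

Note that this drops any fragment, not only #fbid=... ones, which may or may not be what you want here.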

Python List: IndexError: list index out of range

When I try to print splited_data[1] I get the error IndexError: list index out of range; splited_data[0], on the other hand, works fine.
I want to insert data into MySQL: splited_data[0] holds my MySQL columns and splited_data[1] the column values. If splited_data[1] is empty, I want to insert an empty string into MySQL instead. How do I avoid this error? Please help me, thank you.
Here is my code, which otherwise works fine. I only get this error when splited_data[1] is empty.
def clean(data):
    data = data.replace('[[', '')
    data = data.replace(']]', '')
    data = data.replace(']', '')
    data = data.replace('[', '')
    data = data.replace('|', '')
    data = data.replace("''", '')
    data = data.replace("<br/>", ',')
    return data
for t in xml.findall('//{http://www.mediawiki.org/xml/export-0.5/}text'):
    m = re.search(r'(?ms).*?{{(Infobox film.*?)}}', t.text)
    if m:
        k = m.group(1)
        k.encode('utf-8')
        clean_data = clean(k)  # clean() is used to replace garbage data in the text
        filter_data = clean_data.splitlines(True)  # split the data into lines
        filter_data.pop(0)
        for index, item in enumerate(filter_data):
            splited_data = item.split(' = ', 1)
            print splited_data[0], splited_data[1]
            # splited_data[0] used as mysql column
            # splited_data[1] used as mysql values
Here is the splited_data output:
[u' music ', u'Jatin Sharma\n']
[u' cinematography', u'\n']
[u' released ', u'Film datedf=y201124']
split_data = item.partition('=')
# If there was an '=', then it is now in split_data[1],
# and the pieces you want are split_data[0] and split_data[2].
# Otherwise, split_data[0] is the whole string, and
# split_data[1] and split_data[2] are empty strings ('').
Try removing the whitespace on both sides of the equals sign, like this:
splited_data = item.split('=',1)
A list is contiguous, so you need to make sure its length is greater than your index before you try to access it.
'' if len(splited_data) < 2 else splited_data[1]
You could also check before you split:
if '=' in item:
    col, val = item.split('=', 1)
else:
    col, val = item, ''
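Applying partition to the sample lines from the question shows the idea in action; val comes back as an empty string exactly when the line has no ' = '. The last line here is invented to trigger the missing-separator case:

```python
lines = [' music  = Jatin Sharma\n',
         ' released  = Film datedf=y201124',
         ' starring\n']

rows = []
for item in lines:
    col, sep, val = item.partition(' = ')
    rows.append((col.strip(), val.strip()))  # val is '' when no ' = ' was found

print(rows)
```

Unlike split, partition always returns three parts, so the unpacking never raises IndexError, and the empty string is exactly what the question wants to store in MySQL.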
