I am reading some data from a dataframe column and manipulating each value that contains a "-". The manipulation includes splitting on the "-". However, I do not understand why each value in the resulting list contains an "\n*", for instance:
['2010\n1', '200\n2 450\n3', ..., '1239\n1000']
Here is a sample of my code:
splited = []
wantedList = []
val = str(x)  # x is the value read from the dataframe column
print(val)  # val does not appear to contain those special characters
if val.find('-') != -1:
    splited = val.split('-')
    wantedList.append(splited[0])
print(splited)  # the splited list contains those special characters
print(wantedList)  # wantedList contains those special characters
I guess this has to do with the way I created the list or the way I am appending to it.
Does anyone know why this happens?
There is nothing in your code that could automagically add a newline character at some random position within your strings. I'd say the characters are already in the string, but print isn't showing them as \n; it renders them as actual line breaks.
You can confirm that by printing the representation of the string:
print(repr(val))
If you want them out of your strings, you can remove them all with a simple str.replace.
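A minimal sketch of that check and cleanup. The value of `val` below is made-up sample data, not the asker's actual column content:

```python
# Hypothetical value with an embedded newline, similar to those in the question.
val = "2010\n1-rest"

print(val)        # the newline is rendered as a line break, so it is easy to miss
print(repr(val))  # repr() shows the escape sequence explicitly

# Remove every newline before splitting on "-".
cleaned = val.replace("\n", "")
first_part = cleaned.split("-")[0]
print(repr(first_part))
```

If the newline should act as a separator rather than vanish, replacing it with a space instead of an empty string is probably the safer choice.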
Related
I am developing a program to read through a CSV file and create a dictionary of information from it. Each line in the CSV is essentially a new dictionary entry with the delimited objects being the values.
As one subpart of the task, I need to extract an unknown number of numeric digits from within a string. I have a working version, but it does not seem very Pythonic.
An example string looks like this:
variable = Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]
variable is the string's name in the Python code and represents the variable name within a MODBUS. I want to extract just the digits prior to the .WORD_type[0], which relate to the number of bytes the string is packed into.
Here is my working code; note that it is nested within a for statement iterating through the lines in the CSV. var_length and var_type are some of the keys, i.e. {"var_length": var_length}
if re.search(".+_ST[0-9]{1,2}\\.WORD_type.+", variable):
    var_type = "string"
    temp = re.split("\\.", variable)
    temp = re.split("_", temp[2])
    temp = temp[-1]
    var_length = int(str.lstrip(temp, "ST")) / 2
You could maybe try using matching groups like so:
import re
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"
matches = re.match(r".+_ST(\d+)\.WORD_type.+", variable)
if matches:
    print(matches[1])
matches[0] has the full match and matches[1] contains the matched group.
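Putting the matching group together with the question's var_type and var_length could look like the sketch below. It assumes the question's "/ 2" is meant to produce a whole number, so it uses integer division; that is a guess about the intent, not something the question states:

```python
import re

# Sample name copied from the question (spelling kept as-is).
variable = "Applicaiton.Module_Name.VAR_NAME_ST12.WORD_type[0]"

# Capture the digits between "_ST" and ".WORD_type".
match = re.search(r"_ST(\d+)\.WORD_type", variable)
if match:
    var_type = "string"
    # Integer division: an assumption about what "/ 2" should yield.
    var_length = int(match.group(1)) // 2
    print(var_type, var_length)
```

Using re.search with a shorter pattern avoids the leading and trailing ".+", since search already scans the whole string.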
I want to work with data from a csv in Python. I'm looking to make each column a separate string, and I am wondering if there is a way to loop through this process so that I don't have to specify the name of each string individually (as the naming conventions are very similar).
For a number of the csv columns, I am using the following code:
dot_title=str(row[0]).lower()
onet_title=str(row[1]).lower()
For [2]-[11], I would like each string to be named the same but numbered. I.e., row[2] would become a string called onet_reported_1, row[3] would be onet_reported_2, row[4] would be onet_reported_3... etc., all the way through to row[12].
Is there a way of doing this with a loop, instead of simply defining onet_reported_1, _2, _3, _4 etc. individually?
Thanks in advance!
So, first some clarity.
A string is a data type. In Python, you create a string by surrounding some text in either single or double quotes.
"This is a string"
'So is this. It can have number characters: 123. Or any characters: !##$'
Strings are values that can be assigned to a variable. So you use a string by giving it a name:
my_string = "This is a string"
another_string = "One more of these"
You can do different kinds of operations on strings, like joining them with the + operator:
new_string = my_string + another_string
And you can create lists of strings:
list_of_strings = [new_string, my_string, another_string]
which looks like ["This is a stringOne more of these", "This is a string", "One more of these"].
To create multiple strings in a loop, you'll need a place to store them. A list is a good candidate:
list_of_strings = []
for i in range(1, 11):
    list_of_strings.append("onet_reported_" + str(i))
But I think what you want is to name the variables "onet_reported_x" so that you end up with something equivalent to:
onet_reported_1 = row[2]
onet_reported_2 = row[3]
and so forth, without having to type out all that redundant code. That's a good instinct. One nice way to do this kind of thing is to create a dictionary where the keys are the string names that you want and the values are the matching row entries. Note that i has to be converted with str() before it can be concatenated to a string, and that onet_reported_i maps to row[i + 1]. You can do this in a loop:
onet_dict = {}
for i in range(1, 11):
    onet_dict["onet_reported_" + str(i)] = row[i + 1]
or with a dictionary comprehension:
onet_dict = {"onet_reported_" + str(i): row[i + 1] for i in range(1, 11)}
Both will give you the same result. Now you have a collection of strings with the names you want as the keys of the dict, mapped to the row values you want them associated with. To use them, instead of referring directly to the name onet_reported_x, you have to access the value from the dict like:
# Adding some other value to onet_reported_5. I'm assuming the values are numbers.
onet_dict["onet_reported_5"] += 20457
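A self-contained sketch of the dictionary approach, using the mapping described in the question (row[2] becomes onet_reported_1). The contents of `row` are invented for illustration:

```python
# Hypothetical CSV row: two titles followed by ten reported values.
row = ["Dot Title", "Onet Title"] + ["Value_%d" % n for n in range(1, 11)]

dot_title = str(row[0]).lower()
onet_title = str(row[1]).lower()

# onet_reported_1 comes from row[2], onet_reported_2 from row[3], and so on.
onet_dict = {
    "onet_reported_" + str(i): str(row[i + 1]).lower()
    for i in range(1, 11)
}

print(onet_dict["onet_reported_1"])
print(len(onet_dict))
```

If the names are only ever needed in order, a plain list such as `[str(v).lower() for v in row[2:12]]` would likely do the same job with less ceremony.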
I am new to Python (and programming, in general) and hoping to see if someone can help me. I am trying to automate a task that I am currently doing manually but is no longer feasible. I want to find and write all strings between two given strings. For example, if starting and ending strings are XYZ-DF 000010 and XYZ-DF 000014, the desired output should be XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012; XYZ-DF 000013; XYZ-DF 000014. The prefix and numbers (and their padding) are not always the same. For example, next starting and ending strings in the list could be ABC_XY00000001 and ABC_XY00000123. The prefix and padding for any pair of starting and ending strings, though, will always be the same.
I think I need to separate the prefix (which can include letters, spaces, underscores, hyphens, etc.) from the numbers, remove the padding from the numbers, increment the number by 1 from the starting number to the ending number for every pair of starting and ending strings in a second loop, and then finally build the output by concatenation.
So far this is what I have:
First, I read the 2 columns that contain a list of starting and ending strings in a csv into lists using pandas:
columns = ['Beg', 'End']
data = pd.read_csv('C:/Downloads/test.csv', names=columns, header = None)
begs = data.Beg.tolist()
ends= data.End.tolist()
Next, I loop over "begs" and "ends" using the zip function.
for beg, end in zip(begs,ends):
Inside the loop, I want to iterate over each string in begs and ends (one pair at a time) and perform the following operations on them:
1) Use regex to separate the characters (including alphabets, spaces, underscore, hyphen etc.) from the numbers (including padding) for each of the strings one at a time.
start = re.match(r"([a-z-_ ]+)([0-9]+)", beg, re.I) #Let's assume first starting string in the begs list is "XYZ-DF 000010" from my example above
prefix = start.group(1) #Should yield "XYZ-DF "
start_num = start.group(2) #Should yield "000010"
padding = (len(start_num)) #Yields 6
start_num_stripped = start_num.lstrip("0") #Yields 10
end = re.match(r"([a-z-_ ]+)([0-9]+)", end, re.I) #Let's assume first ending string in the ends list is "XYZ-DF 000014" from my example above
end_num = end.group(2) #Yields 000014
end_num_stripped = end_num.lstrip("0") #Yields 14
2) After these operations, run a nested while loop from start_num_stripped until end_num_stripped
output_string = ""
while start_num_stripped <= end_num_stripped:
    output_string = output_string+prefix+start_num_stripped.zfill(padding)+"; "
    start_num_stripped += 1
Finally, how do I write the output_string for each pair of starting and ending strings to a csv file that contains 3 columns containing the starting string, ending string, and their output string? An example of an output in csv format is given below (newline after each row is for clarity and not needed in the output).
"Starting String", "Ending String", "Output String"
"ABCD-00001","ABCD-00003","ABCD-00001; ABCD-00002; ABCD-00003"
"XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010; XYZ-DF 000011; XYZ-DF 000012"
"BBB_CC0000008","BBB_CC0000014","BBB_CC0000008; BBB_CC0000009; BBB_CC0000010; BBB_CC0000011; BBB_CC0000012; BBB_CC0000013; BBB_CC0000014"
You could find the longest trailing numeric suffix using a regular expression. Then simply iterate numbers from start to end appending them (with leading zeros) to the common prefix:
import re
startString = "XYZ-DF 000010"
endString = "XYZ-DF 000012"
suffixLen = len(re.findall("[0-9]*$",startString)[0])
start = int("1"+startString[-suffixLen:])
end = int("1"+endString[-suffixLen:])
result = [ startString[:-suffixLen]+str(n)[1:] for n in range(start,end+1) ]
csvLine = '"' + '","'.join([ startString,endString,";".join(result) ]) + '"'
print(csvLine) # "XYZ-DF 000010","XYZ-DF 000012","XYZ-DF 000010;XYZ-DF 000011;XYZ-DF 000012"
Note: using int("1" + suffix) causes numbers in the range to always have 1 more digit than the length of the suffix (1xxxxx). This makes it easy to get the leading zeroes by simply dropping the first character after turning them back into strings str(n)[1:]
Note2: I'm not familiar with pandas but I'm pretty sure it has a way to write a csv directly from the result list rather than formatting it manually as I did here in csvLine.
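As a sketch of the full round trip without pandas, the standard csv module can take care of the quoting. The helper name `expand_range`, the sample pairs, and the in-memory buffer (standing in for a real output file) are all assumptions made for illustration; this version uses zfill for the padding rather than the int("1" + suffix) trick above:

```python
import csv
import io
import re


def expand_range(beg, end):
    """Return every label from beg to end inclusive, keeping the zero padding."""
    # Split each string into a prefix and its trailing run of digits.
    prefix, start_digits = re.match(r"(.*?)(\d+)$", beg).groups()
    end_digits = re.match(r"(.*?)(\d+)$", end).group(2)
    width = len(start_digits)
    return [
        prefix + str(n).zfill(width)
        for n in range(int(start_digits), int(end_digits) + 1)
    ]


# Hypothetical input pairs, taken from the examples in the question.
pairs = [("ABCD-00001", "ABCD-00003"), ("XYZ-DF 000010", "XYZ-DF 000012")]

# StringIO stands in for a file opened with open(path, "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)
writer.writerow(["Starting String", "Ending String", "Output String"])
for beg, end in pairs:
    writer.writerow([beg, end, "; ".join(expand_range(beg, end))])

print(buffer.getvalue())
```

The sketch assumes every string ends in at least one digit; a string without a numeric suffix would make re.match return None and raise an AttributeError.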
I have a list of strings that look something like this:
"['id', 'thing: 1\nother: 2\n']"
"['notid', 'thing: 1\nother: 2\n']"
I would now like to read the value of 'other' out of each of them.
I did this by reading the character at a fixed position, but since that position varies, I wondered whether I could read relative to a certain character such as a comma, i.e. read the character x positions after the comma. How would I do that?
Assuming that "other: " is always present in your strings, you can use it as a separator and split by it:
s = 'thing: 1\nother: 2'
_,number = s.split('other: ')
number
#'2'
(Use int(number) to convert the number-like string to an actual number.) If you are not sure whether "other: " is present, enclose the above code in a try-except statement.
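When "other: " may be missing, a regular expression is one alternative to wrapping the split in try-except. This is a different technique from the split shown above, and the function name `read_other` and the third sample string are hypothetical:

```python
import re


def read_other(text):
    """Return the integer after 'other: ', or None when it is missing."""
    match = re.search(r"other:\s*(\d+)", text)
    return int(match.group(1)) if match else None


samples = [
    "['id', 'thing: 1\nother: 2\n']",
    "['notid', 'thing: 1\nother: 2\n']",
    "['broken', 'thing: 1\n']",  # hypothetical entry without 'other'
]
print([read_other(s) for s in samples])
```

Because the pattern anchors on the label rather than on a position, it keeps working when the text before "other: " changes length.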
I am parsing a large text file that has key-value pairs separated by '='. I need to split these key-value pairs into a dictionary. I was simply going to split by '='. However, I noticed that some of the values contain the equals sign character. When a value contains the equals sign character, it always seems to be wrapped in parentheses.
Question: How can I split by the equals sign only when the equals sign is not between two parentheses?
Example data:
PowSup=PS1(type=Emerson,fw=v.03.05.00)
Desired output:
{'PowSup': 'PS1(type=Emerson,fw=v.03.05.00)'}
UPDATE: The data does not seem to have any nested parentheses. (Hopefully that remains true in the future)
UPDATE 2: The key doesn't ever seem to contain an equals sign either.
UPDATE 3: The full requirements are much more complicated and at this point I am stuck so I have opened up a new question here: Python parse output of mixed format text file to key value pair dictionaries
You could try partition('=') to split at the first instance:
'PowSup=PS1(type=Emerson,fw=v.03.05.00)'.partition('=')[0:3:2]
mydict = dict()
for line in file:
    k, v = line.split('=', 1)
    mydict[k] = v
Simple solution using str.index() function:
s = "PowSup=PS1(type=Emerson,fw=v.03.05.00)"
pos = s.index('=') # detecting the first position of `=` character
print({s[:pos]: s[pos+1:]})
The output:
{'PowSup': 'PS1(type=Emerson,fw=v.03.05.00)'}
You can limit the split() operation to a single split (the first =):
>>> x = "PowSup=PS1(type=Emerson,fw=v.03.05.00)"
>>> x.split('=', 1)
['PowSup', 'PS1(type=Emerson,fw=v.03.05.00)']
You can then use these values to populate your dict:
>>> x = "PowSup=PS1(type=Emerson,fw=v.03.05.00)"
>>> key, value = x.split('=', 1)
>>> out = {}
>>> out[key] = value
>>> out
{'PowSup': 'PS1(type=Emerson,fw=v.03.05.00)'}
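The approaches above can be combined into a small helper for a whole file. This is a sketch under stated assumptions: the function name `parse_pairs` and the sample lines are hypothetical, and lines without any '=' are simply skipped, which may or may not suit the real data:

```python
def parse_pairs(lines):
    """Build a dict from 'key=value' lines, splitting only on the first '='."""
    result = {}
    for line in lines:
        # partition() never raises; sep is empty when no '=' was found.
        key, sep, value = line.rstrip("\n").partition("=")
        if sep:
            result[key] = value
    return result


sample_lines = [
    "PowSup=PS1(type=Emerson,fw=v.03.05.00)\n",
    "no separator here\n",  # hypothetical malformed line
]
print(parse_pairs(sample_lines))
```

Stripping the trailing newline matters when iterating over a file object, since otherwise each value would end in "\n".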