python string as variable reporting nan - python

I extract a name from the content of one document with the code below:
find_name = re.search(r'^[^\d]*', clean_content)
Name = find_name.group(0)
NameUp = Name.upper()
Which works fine... it equals DAN STEPP as needed.
I then open up an excel file:
data1 = pd.read_excel(config.Excel1)
I pass it into a data frame and pick up the headers; all of this works:
df = pd.DataFrame(data1)
header = df.iloc[0]
Now, when I run the search below, it erroneously returns nan:
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
My NameUp variable equals DAN STEPP when I print and test it, so it does contain the correct value. However, when I use the variable in the search above, I get nan.
When I replace NameUp with the literal "DAN STEPP", i.e. '.str.contains("DAN STEPP")', the row is found. Any thoughts on this?

Would you mind doing repr(NameUp)? It's slightly different from str(NameUp) in that it will print exactly what's in the string. Besides that I have no idea what to make of
row_numberd1 = df[df['Member Name'].str.contains(NameUp)].index.min()
I don't use pandas, but that's a lot happening in one line. I would check each step individually to see what's wrong. Since you said it misbehaves with the NameUp variable, I would take apart df['Member Name'].str.contains(NameUp) to see what it returns, and make sure that is consistent with your testing. Have you tried any other names or values?
TL;DR: if the variable does not work and typing the string in manually does, one of two things is happening. Either the two strings differ in some minor way, or the two tests are not actually running the same process.
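One way the two strings could differ is invisible whitespace. This is only a guess at the cause, not something the question confirms: the pattern ^[^\d]* captures everything up to the first digit, including any space or newline just before it. If that leaves a trailing space in NameUp, nothing in the column would match, and index.min() on an empty selection gives nan, which would be consistent with the symptom. The sketch below uses made-up document text and a plain list standing in for the 'Member Name' column.

```python
import re

# Hypothetical document text: the name is followed by a space and then digits.
clean_content = "Dan Stepp 4021 Oak Street"

find_name = re.search(r'^[^\d]*', clean_content)
name_up = find_name.group(0).upper()

# repr() makes invisible characters (trailing spaces, newlines) visible.
print(repr(name_up))

# Strip surrounding whitespace before using the value as a search term.
name_clean = name_up.strip()

# Hypothetical stand-in for the 'Member Name' column.
member_names = ["JANE DOE", "DAN STEPP"]
hits_raw = [n for n in member_names if name_up in n]
hits_clean = [n for n in member_names if name_clean in n]
print(hits_raw, hits_clean)
```

If that turns out to be the cause, the pandas equivalent would be along the lines of df['Member Name'].str.contains(NameUp.strip(), regex=False, na=False), where regex=False treats the name as literal text and na=False counts empty cells as non-matches.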

Related

Problem transforming a variable in logs, python

I am using Python. I would like to create a new column which is the log transformation of column 'lights1992'.
I am using the following code:
log_lights1992 = np.log(lights1992)
I obtain the following error:
I have tried two things: 1) converting the column 'lights1992' to numeric, and 2) adding 1 to each value.
city_join['lights1992'] = pd.to_numeric(city_join['lights1992'])
city_join["lights1992"] = city_join["lights1992"] + 1
However, neither of these has worked. The column 'lights1992' is of type float64. Do you know what the problem could be?
Edit:
The column 'lights1992' comes from running zonal statistics on a raster 'junk1992'; maybe this has an effect.
zs1 = zonal_stats(city_join, junk1992, stats=['mean'], nodata=np.nan)
city_join['lights1992'] = [x['mean'] for x in zs1]
The traceback states:
'DatasetReader' object has no attribute 'log'.
Did you re-assign numpy to something else at some point? I can't find much about 'DatasetReader'; is that a custom class?
EDIT:
I think you need to pass the whole column, because your edit doesn't show a variable named 'lights1992' that holds the column,
so instead of:
np.log(lights1992)
can you try passing in the Dataframe's column to log?:
np.log(city_join['lights1992'])
2ND EDIT:
Since you've reported back that it works I'll dive into the why a little bit.
In your original statement you called the log function and gave it an argument, then you assigned the result to a variable name:
log_lights1992 = np.log(lights1992)
The problem here is that when you give Python text without any quotes, it treats it as a variable name. (See how you have log_lights1992 on the left of the equals sign? You wanted to assign the result of the operation on the right-hand side to the variable name log_lights1992.) In this case the name lights1992 did not refer to the DataFrame column. Judging by the traceback, it was bound to some other object (a 'DatasetReader'), which numpy cannot take the log of.
So there were two ways to make it work. Either what I said earlier:
Instead of giving it a variable name, you give .log the column of the city_join dataframe directly (that's what city_join["lights1992"] is).
Or
You assign the value of that column to the variable name first then you pass it in to .log, like this:
lights1992 = city_join["lights1992"]
log_lights1992 = np.log(lights1992)
Hope that clears it up for you!
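A minimal sketch of the working pattern, using a made-up city_join frame rather than the real zonal-statistics output. It also uses np.log1p, which computes log(1 + x) in one step; that is a swapped-in alternative to adding 1 by hand as the question tried, not what the original code did. The new column name log_lights1992 is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real city_join DataFrame.
city_join = pd.DataFrame({"lights1992": [0.0, 1.5, 63.0]})

# Pass the column itself, not a bare name that may be bound to something else.
# log1p(x) = log(1 + x), so zero values map to 0 instead of -inf.
city_join["log_lights1992"] = np.log1p(city_join["lights1992"])

print(city_join)
```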

Iterating over array and slicing or making changes in Python

I'm about to pull my hair out on this. I'm not sure why the index in my array is not being implemented in the second column.
I created this array - project_information :
project_information.append([proj_id,project_text])
When I print this out, I get the rows and columns. It contains about 40 rows.
When I iterate through it to print out the contents, everything comes out fine. I am using this:
for i in range(0,len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print (project_text)
The project_text column contains text, while the project_id column contains integers. It prints out perfectly, and the index changes for both project_id and project_text.
However, I need to use the project_text in a different way, and I am really struggling with this. I need to slice the text to a shorter text for reuse. To do this, I tried:
for i in range(0,len(project_information)):
    project_id = project_information[i][0]
    project_text = project_information[i][1]
    print(project_id)
    print (project_text)
    if len(project_text) > 5000:
        trunc_proj_text = project_text[:1000]
    else:
        trunc_proj_text = project_text
    print (project_id)
    print(trunc_proj_text)
The problem I'm having here is that though the project_id column is being iterated through properly, the project_text is not. What I am getting is just the text in the first row for the project_text, sliced, and repeated for as many times as the length of the array.
I have tried different ways, and also a while loop, but it is still not working.
I've also looked at these answers for reference - Slicing,indexing and iterating over 2D Numpy arrays,Efficient iteration over slice in Python, iteration over list slices, and I can't seem to see how they can be applied to my problem.
I'm not well-versed in using Numpy, so is this something that it could help with? I'm well aware this might be simple and I'm missing it because I've been working on various aspects of this project for the past weeks, so I would appreciate a bit of consideration in this.
Thanks in advance.
The problem was with the input list, so the slicing code above does in fact work. The code that creates the input list has now been fixed. The original version was concatenating the strings for each entry, so the project_texts looked different at the end but all started with the same text. That was hard to see when viewing the output on a console.
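For reference, the same loop can be written without index arithmetic. This is a sketch with made-up rows; MAX_LEN is a hypothetical name, and it uses a single length limit rather than the separate 5000/1000 values in the question. Slicing a string that is already shorter than the limit returns it unchanged, so no if/else is needed.

```python
# Hypothetical sample rows: [project_id, project_text]
project_information = [
    [1, "alpha " * 3],
    [2, "beta " * 400],   # 2000 characters
]

MAX_LEN = 1000  # assumed limit for the shortened text

truncated = []
for project_id, project_text in project_information:
    # Slicing past the end of a short string is safe and returns the whole string.
    trunc_proj_text = project_text[:MAX_LEN]
    truncated.append((project_id, trunc_proj_text))

for project_id, trunc_proj_text in truncated:
    print(project_id, len(trunc_proj_text))
```

Printing the length (or repr of the last few characters) rather than the full text also makes it easier to spot rows that share a long common beginning.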

avoid writing df['column'] twice when doing df['column'] = df['column']

I don't even know how to phrase this but is there a way in Python to reference the text before the equals without having to actually write it again?
** EDIT - I'm using python3 in Jupyter
I seem to spend half my life writing:
df['column'] = df['column'].some_changes
Is there a way to tell Python that I'm referencing the part before the equals sign?
For example, I would write the following, where <% is just to represent the reference to the text before the = (df['column'])
df['column'] = <%.replace(np.nan)
You are looking for in-place methods.
I believe you can pass inplace=True as an argument to many methods in pandas,
so it would be something like
df['column'].replace(np.nan, inplace=True)
edit
You could also do
df["computed_column"] = df["original_column"].many_operations
so you still have access to the original data down the line.
And do all the needed operations at once instead of saving each step.
One of the advantages of inplace not being the default is if you are doing a batch of operations and it fails midway your data is not mangled.
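Two hedged alternatives that avoid repeating df['column'] on both sides, shown on a made-up frame. Recent pandas releases warn that calling an inplace method on a selected column may not update the parent DataFrame, so this sketch calls fillna on the DataFrame itself with a per-column dict, and then uses assign for a computed update. The fill value 0.0 and the doubling step are placeholders, not part of the original question.

```python
import numpy as np
import pandas as pd

# Hypothetical data; "column" is the one being cleaned.
df = pd.DataFrame({"column": [1.0, np.nan, 3.0], "other": ["a", "b", "c"]})

# Fill missing values in one column, in place, at the DataFrame level.
df.fillna({"column": 0.0}, inplace=True)

# assign() takes the column name once; the lambda receives the current frame.
df = df.assign(column=lambda d: d["column"] * 2)

print(df)
```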

Delete a Portion of a CSV Cell in Python

I have recently stumbled upon a task utilizing some CSV files that are, to say the least, very poorly organized, with one cell containing what should be multiple separate columns. I would like to use this data in a Python script but want to know if it is possible to delete a portion of the row (all of it after a certain point) then write that to a dictionary.
Although I can't show the exact contents of the CSV, it looks like this:
useful. useless useless useless useless
I understand that this will most likely require either a regular expression or an endswith statement, but doing all of that to a CSV file is beyond me. Also, the period written after useful on the CSV should be removed as well, and is not a typo.
If you know the character you want to split on you can use this simple method:
good_data = bad_data.split(".")[0]
good_data = good_data.strip() # remove excess whitespace at start and end
This method will always work: split returns a list that always has at least one entry (the full string, when the separator is absent). Using index, by contrast, raises a ValueError when the separator is missing.
You can also limit the number of splits, if necessary, using split(".", N).
https://docs.python.org/2/library/stdtypes.html#str.split
>>> "good.bad.ugly".split(".", 1)
['good', 'bad.ugly']
>>> "nothing bad".split(".")
['nothing bad']
>>> stuff = "useful useless"
>>> stuff = stuff[:stuff.index(".")]
ValueError: substring not found
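The same idea can be wrapped in a small helper and applied to many cells at once. This is a sketch on made-up cell values; keep_before is a hypothetical name, and it uses str.partition, which like split never raises when the separator is absent.

```python
def keep_before(cell, sep="."):
    """Return the text before the first separator, stripped of whitespace."""
    head, _, _ = cell.partition(sep)  # head is the whole string if sep is absent
    return head.strip()

# Hypothetical cell contents, one per row.
rows = ["useful. useless useless", "nothing bad", "  padded . junk"]

# Build a dictionary keyed by row number.
cleaned = {i: keep_before(cell) for i, cell in enumerate(rows)}
print(cleaned)
```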
Actual Answer
OK, then notice that you can use indexing on strings just like you do on lists. That is, "this is a very long string but we only want the first 4 letters"[:4] gives "this". If we knew the index of the dot, we could get exactly what you want that way. Strings have the index method for precisely that. So in total you do:
stuff = "useful. useless useless useless useless"
stuff = stuff[:stuff.index(".")]
Now stuff is very useful :).
If we are talking about a file containing multiple lines like that, you could do this for each line: cut the line at the first dot, split what remains at each comma, and put everything in a dictionary.
data = {}
with open("./test.txt") as f:
    for i, line in enumerate(f):
        # keep only the text before the first "."; safe even if a line has no "."
        csv_line = line.split(".", 1)[0].strip()
        for j, col in enumerate(csv_line.split(",")):
            data[(i, j)] = col  # key is (row number, column number)
How one would do this
Notice that most people would not want to do it by hand. It is a common task to work on tabled data and there is a library called pandas for that. Maybe it would be a good idea to familiarise yourself a bit more with python before you dive into pandas though. I think a good point to start is this. Using pandas your task would look like this
import pandas as pd
pd.read_csv("./test.txt", comment=".")
giving you what is called a dataframe.
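A self-contained variant of that call, reading from an in-memory string instead of ./test.txt so it can be run as-is. The sample text is made up. header=None is an addition here: without it, pandas would treat the first line as column names rather than data.

```python
import io
import pandas as pd

# Hypothetical file contents: everything after the first "." on each line is junk.
text = "useful. useless useless\nalso good. junk junk\n"

# comment="." tells the parser to ignore the rest of a line from the first "." on.
df = pd.read_csv(io.StringIO(text), comment=".", header=None)

print(df)
```

Note that this only works when "." never appears inside the data you want to keep.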

How can I split integers from string line?

How can I split out the confirmed, death and recovered values? I want to add them to different lists. I tried the isdigit method to find the values in each line. I also tried split('":'), thinking I could pick out the value after '":'. Neither of these works.
https://api.covid19api.com/total/dayone/country/us
I added every line from this page to textlist.
I have edited the question for other users. My problem is solved, thank you.
The list actually contains a string. You need to parse it and then iterate over it to access the required values from it.
import json
main_list = ['.....']
data_points = json.loads(main_list[0])  # parse the JSON string into Python objects
confirmed = []
for single_data_point in data_points:
    confirmed.append(single_data_point["Confirmed"])  # each entry is a dict
print(confirmed)
A similar approach can be taken for any other values needed.
Edit:
On a closer look at your source, it seems the initial data is not valid JSON to begin with. Some issues I noticed:
Each object that has a Country value is missing its closing }. This is the bigger issue and needs to be resolved first.
From the 2nd object onwards, each country object has a ' before it starts. That should not be there either.
I suggest you look at how you are initially parsing/creating the list.
Since you gave the valid source of your data it becomes pretty simple:
import urllib.request
import json
data = json.load(urllib.request.urlopen("https://api.covid19api.com/total/dayone/country/turkey"))
confirmed=[]
deaths=[]
recovered=[]
for dataline in data:
    confirmed.append(dataline["Confirmed"])
    deaths.append(dataline["Deaths"])
    recovered.append(dataline["Recovered"])
print ("Confirmed:",confirmed)
print ("Deaths:", deaths)
print ("Recovered:",recovered)
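To try the parsing step without a network call, the same logic can be run on an inline JSON string. The records below are made-up values that only mimic the field names used above; they are not real figures from the API.

```python
import json

# Hypothetical sample with the same keys as the answer above; values are invented.
raw = """
[
  {"Country": "Example", "Confirmed": 10, "Deaths": 1, "Recovered": 2},
  {"Country": "Example", "Confirmed": 25, "Deaths": 3, "Recovered": 7}
]
"""

records = json.loads(raw)  # a list of dicts

# One list per field, in the same order as the records.
confirmed = [r["Confirmed"] for r in records]
deaths = [r["Deaths"] for r in records]
recovered = [r["Recovered"] for r in records]

print("Confirmed:", confirmed)
print("Deaths:", deaths)
print("Recovered:", recovered)
```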
