Plot latitude/longitude and drop rows with wrong data - python

Hello, I need help or a clue with my data frame.
I have 319k rows with two columns named 'Latitude' and 'Longtitude'. As a sanity check, I grouped and counted the rows: https://i.stack.imgur.com/IqCka.png , https://i.stack.imgur.com/gg9v0.png
I need to make a scatter plot, but unfortunately I'm very, very new to Python, and I don't know how to find the correct long and lat data in rows without empty records or wrong data like -1.0000 (screenshot). Lat and long for Boston (MA) are 42... and -72...
I think my plotting code is good, but I can't correctly filter my data for it:
For seaborn, sns.stripplot(x='Latitude', y='Longtitude', data=MojaBaza) -> for now, I've got:
https://i.stack.imgur.com/RL1t1.png
For matplotlib, plt.scatter(x=MojaBaza['Longtitude'], y=MojaBaza['Latitude']) -> and for this instruction, I've got "'value' must be an instance of str or bytes, not a float"
Sorry if my question is stupid, but I really don't know how to handle it.
Greetings

The problem was the data type.
The solution is:
MojaBaza['Latitude'] = MojaBaza['Latitude'].astype('float')
MojaBaza['Longitude'] = MojaBaza['Longitude'].astype('float')
In the next step:
Filtr1 = MojaBaza['Latitude'] > 40
Filtr2 = MojaBaza['Longitude'] < -70
Lokacja = MojaBaza[Filtr1 & Filtr2]
and we got it.
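For completeness, here is a consolidated sketch of the whole pipeline. It swaps astype for pd.to_numeric with errors='coerce' (a different technique than above) so that empty or malformed records become NaN and can be dropped, and it assumes the 'Longtitude' spelling from the question's original columns:

import pandas as pd
import matplotlib.pyplot as plt

# Coerce to numeric; empty or malformed entries become NaN instead of raising
MojaBaza['Latitude'] = pd.to_numeric(MojaBaza['Latitude'], errors='coerce')
MojaBaza['Longtitude'] = pd.to_numeric(MojaBaza['Longtitude'], errors='coerce')

# Keep only rows in a plausible range for the Boston area, then drop NaNs
Lokacja = MojaBaza[(MojaBaza['Latitude'] > 40) & (MojaBaza['Longtitude'] < -70)].dropna()

plt.scatter(x=Lokacja['Longtitude'], y=Lokacja['Latitude'], s=1)
plt.show()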

Related

.... (bunch of dots) on elements of dataframe in pandas

I have links in a column of a data frame in pandas. Whenever I try to iterate through that column (links) and get some text data, the following happens.
Suppose df is the data frame:
for i in df:
    for line in urllib.request.urlopen(i):
        decoded_line = line.decode("utf-8")
        print(decoded_line)
If I run the above code, it shows an error.
Then, when I printed that column, I saw that the column elements (links) end with a bunch of dots...
After searching a little, I did:
pd.options.display.max_colwidth = 100
And it worked fine.
But I am curious how changing the "display column width" resolves my issue.
As far as I understood, when I was working with pd.options.display.max_colwidth = 50, the 'i' in the for loop was taking some portion of the links with a bunch of dots at the end (why? how does the display width change the values actually taken by 'i'?), and now that I have changed the display column width to 100 with pd.options.display.max_colwidth = 100, it is taking the whole link. But why?
Does pd.options.display.max_colwidth change only the displayed column width, or does it also affect the actual value?
Please help.
Thank you in advance.
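As a sketch of what is probably going on (the 'links' column name here is made up for illustration): max_colwidth only affects pandas' printed representation, so the dots appear when values are taken from a string rendering of the frame rather than from the column itself. Iterating the column values directly always yields the full strings:

import pandas as pd

df = pd.DataFrame({'links': ['https://example.com/' + 'x' * 80]})
pd.options.display.max_colwidth = 50
print(df)                   # the link is shown truncated with '...'
print(df['links'].iloc[0])  # the stored value is still the full URL

# Iterating the column (not the frame's printed form) is unaffected
# by any display option:
for url in df['links']:
    print(len(url))         # full length, no dots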

Changing North/South latitude values in Python

In Python I'm dealing with a couple of large CSVs containing geographical data in different kinds of formats for latitude and longitude. I settled on converting them to decimal degrees. My issue is that some files are already formatted this way, but with a direction (N, S, E, W) attached at the end of each individual coordinate. Also, the south and west coordinates are not yet negative, and they should be when in decimal degrees.
I was initially using regex to filter these directions out, but I can't figure out a way to attach a negative sign to south and west coordinates before dropping the letters. I am using pandas to read the CSV in.
Example coordinates:
Latitude, Longitude
30.112342N, 10.678982W
20.443459S, 30.678997E
import pandas as pd
df = pd.read_csv("mydataset.csv")
if df['Latitude'].str.endswith('S'):
    df.Latitude = -float(df['Latitude'].str.strip('S'))
else:
    df.Latitude = float(df['Latitude'].str.strip('N'))
Depending on how I tweak it, I get different errors, the most common being:
AttributeError: 'Latitude' object has no attribute 'strip'.
I've tried changing the dtype to string, among other methods, with no luck. I can filter out the directions with regular expressions, but can't discern what the direction was to change to negative if necessary. Any help is appreciated.
Look into .apply(). The value df['Latitude'] is the full column (a Series), so you can't treat it like a single string for this sort of operation.
Instead, do something like this:
def fix_latitude(x):
    """Work on individual latitude value x."""
    if x.endswith('S'):
        x = -float(x.strip('S'))
    else:
        x = float(x.strip('N'))
    return x
df['fixedLatitude'] = df.Latitude.apply(fix_latitude)
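A vectorized alternative, as a sketch (it assumes every value ends in exactly one of the letters N or S, as in the sample data):

# Map the trailing hemisphere letter to a sign, strip it, and convert
sign = df['Latitude'].str[-1].map({'N': 1, 'S': -1})
df['fixedLatitude'] = sign * df['Latitude'].str[:-1].astype(float)

The same pattern with {'E': 1, 'W': -1} handles the Longitude column.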

How to use the & or 'and' operation to get correct data when I am getting NaN for all values

I am trying to manipulate some data, but as I am not really firm with Python, I am here to ask this question.
test_val_set = merged_data2.loc[merged_data2.reject_yn=='Y'] & merged_data2.loc[merged_data2.reject_yn=='N'].iloc[70000:,:]
Like the above, I am trying to get these two conditions combined for my final desired information. But when I do, it gives me the correct raw x column, but the data point values are all NaN.
When I use them separately, as below,
a = merged_data2.loc[merged_data2.reject_yn=='Y']
b = merged_data2.loc[merged_data2.reject_yn=='N'].iloc[70000:,:]
they both work just fine, with the correct values.
How can I use this '&'? Or is there another way around it?
Thank you, people, in advance.
Based on your description, you can concatenate the two DataFrames back together. The & operator performs an element-wise logical AND after aligning the two selections on their index; since a row cannot have reject_yn equal to both 'Y' and 'N', the two selections share no rows, and the aligned result is all NaN. Concatenation is what you want instead:
a = merged_data2.loc[merged_data2.reject_yn=='Y']
b = merged_data2.loc[merged_data2.reject_yn=='N'].iloc[70000:,:]
df=pd.concat([a,b])
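If you do want a single boolean expression rather than concat, combine row masks with | (element-wise OR) on the same frame; a sketch (note it does not reproduce the .iloc[70000:,:] slice, so concat remains the simpler route here):

# Keep rows where reject_yn is either 'Y' or 'N'
mask = (merged_data2['reject_yn'] == 'Y') | (merged_data2['reject_yn'] == 'N')
test_val_set = merged_data2[mask]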

Detecting and removing GPS coordinates of "0.0000, 0.00000" (non-fixed) data using Python

I am doing a vehicle monitoring process with raw data files.
As of now, some cleaning up was done before the issue surfaced. As I have inconsistent data, it causes some problems for me. The data includes "Model (v, w, x, y and z), Timestamp, Latitude, Longitude and Mode (0, 2, 4, 8)".
The objective of this process is to calculate distance and duration, with cleaning of the data.
I have successfully calculated the duration using the timestamp with respect to both Model and Mode. I have also successfully calculated the distance between rows using the coordinates and the haversine formula. HERE COMES THE PROBLEM:
So I can only successfully calculate the distance between rows if both Lat & Long are present and in the right format (e.g. 1.035436, 103.234623). Received data can have empty fields, which causes an error. This error was solved by identifying the empty fields and removing those lines (as without lat/long, the data is useless):
mydataset = mydataset[mydataset['Mode'].notnull()] #for removing empty mode
mydataset = mydataset[mydataset['Latitude'].notnull()] #for removing empty latitude
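An equivalent one-liner for the two notnull filters above, as a sketch using the question's column names:

# Drop rows where either Mode or Latitude is missing
mydataset = mydataset.dropna(subset=['Mode', 'Latitude'])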
But some lat/long values are received as 0.00000000, 0.00000000, and I would like to remove rows with these numbers. Some methods have been tried, but they don't work. I've tried identifying the 0 and removing it using:
mydataset = mydataset[(mydataset[['Latitude','Longitude']] != 0).all(axis=1)]
and
mydataset = mydataset[(mydataset.Latitude != 0).any()]
Due to confidential data and code, I cannot provide much, but I would like to know why the above two methods do not work and, if possible, get advice on how to tackle this problem.
Thank you! Much appreciated, and thank you for your time!
Some fake data are as shown below:
,Model,Timestamp,Longitude,Latitude,Mode
0,x,1970-01-19 01:29:17.058,103.235623,1.045436,0
1,x,1970-01-19 01:29:22.058,0.00000000,0.00000000,0 #Would like to remove this row
2,x,1970-01-19 01:29:27.058,103.234813,1.038436,2
3,x,1970-01-19 01:29:32.058,103.235623,1.039436,2
4,x,1970-01-19 01:29:38.058,103.234123,1.036436,0
5,x,1970-01-19 01:29:38.058,,,0 #removed via the code above
I am not sure if I understood correctly; do you need something like this?
Sample df
Lat,Long
55.6,22.06
0.00000000,0.00000000
56.056,22.10
df1 = df[df[['Lat','Long']] != 0].dropna(how='any').reset_index(drop=True)
print(df1)
Lat Long
0 55.600 22.06
1 56.056 22.10
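If the original two attempts returned nothing useful, one likely reason (an assumption, since the real data is confidential) is that the coordinates were read in as strings, in which case comparing them to the number 0 never matches. Coercing to numeric first makes the filter behave, roughly like this, using the question's mydataset and column names:

# Assumption: Latitude/Longitude may be strings; coerce them to floats
mydataset['Latitude'] = pd.to_numeric(mydataset['Latitude'], errors='coerce')
mydataset['Longitude'] = pd.to_numeric(mydataset['Longitude'], errors='coerce')
# Now the zero comparison works as intended
mydataset = mydataset[(mydataset[['Latitude', 'Longitude']] != 0).all(axis=1)]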

pandas groupby is returning two groups for the same unique id

I have a large pandas dataframe, where I am running groupby operations.
CHROM    POS    Data01    Data02    ......
1        ....................
1        ...................
2        ..................
2        ............
scaf_9   .............
scaf_9   ............
So, I am doing:
my_data_grouped = my_data.groupby('CHROM')
for chr_, data in my_data_grouped:
    # do something with chr_
    # write something from that chr_ data
Everything is fine with small data and with data where there is no string-type CHROM, i.e. scaf_9. But with very large data that includes scaf_9, I am getting two groups for the same value 2. There really isn't an error message, and it is not affecting the computation. The issue is when I write the data by group to a file: I am getting two groups of 2 (split unequally).
It is becoming very hard for me to trace back the origin of this problem, since there is no error message and with small data it works well. My only assumptions are:
Is there a certain limit on the number of lines in the total dataframe vs. the grouped dataframe that the pandas module can handle? What is the fix to this problem?
Among all the 2 values, most are treated as integer objects and some (in the later part, close to scaf_9) as string objects. Is this possible?
Sorry, I am only making assumptions here, and it is becoming impossible for me to know the origin of the problem.
Post edit:
I have also tried running sort_values(['CHROM']) before doing the groupby, but the problem still persists.
Any possible fix to the issue?
Thanks,
In my opinion there is a data problem, most likely some whitespace, so pandas treats the values as distinct keys and processes them as separate groups.
The solution should be to remove trailing whitespace first:
df.index = df.index.astype(str).str.strip()
You can also check the unique string values of the index:
a = df.index[df.index.map(type) == str].unique().tolist()
If the first column is not the index:
df['CHROM'] = df['CHROM'].astype(str).str.strip()
a = df.loc[df['CHROM'].map(type) == str, 'CHROM'].unique().tolist()
EDIT:
The final solution was simpler - casting to str, like:
df['CHROM'] = df['CHROM'].astype(str)
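A minimal reproduction of the symptom, as a sketch with made-up data: when the same label is stored both as the integer 2 and the string '2', groupby sees two distinct keys until everything is cast to str.

import pandas as pd

df = pd.DataFrame({'CHROM': [1, 2, '2', 'scaf_9'], 'POS': [10, 20, 30, 40]})
print(df.groupby('CHROM', sort=False).size())  # 2 and '2' form separate groups

df['CHROM'] = df['CHROM'].astype(str)
print(df.groupby('CHROM', sort=False).size())  # now a single '2' group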
