I have a table samplecol; here is a sample of its contents:
vessel_hash | status | station | speed | latitude | longitude | course | heading | timestamp | the_geom
--------------+--------+---------+-------+-------------+-------------+--------+---------+--------------------------+----------------------------------------------------
103079215239 | 99 | 841 | 5 | -5.41844510 | 36.12160900 | 314 | 511 | 2016-06-12T06:31:04.000Z | 0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0
103079215239 | 99 | 3008 | 0 | -5.41778710 | 36.12144900 | 117 | 511 | 2016-06-12T06:43:27.000Z | 0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0
103079215239 | 99 | 841 | 17 | -5.42236900 | 36.12356900 | 259 | 511 | 2016-06-12T06:50:27.000Z | 0101000020E610000054E6E61BD10F42407C60C77F81B015C0
103079215239 | 99 | 841 | 17 | -5.41781710 | 36.12147900 | 230 | 511 | 2016-06-12T06:27:03.000Z | 0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0
103079215239 | 99 | 841 | 61 | -5.42201900 | 36.13256100 | 157 | 511 | 2016-06-12T06:08:04.000Z | 0101000020E6100000CFDC43C2F71042409929ADBF25B015C0
103079215239 | 99 | 841 | 9 | -5.41834020 | 36.12225000 | 359 | 511 | 2016-06-12T06:33:03.000Z | 0101000020E6100000CFF753E3A50F42408D68965F61AC15C0
I try to fetch all points inside a polygon with:
poisInpolygon = """SELECT col.vessel_hash,col.longitude,col.latitude,
ST_Contains(ST_GeomFromEWKT('SRID=4326; POLYGON((-15.0292969 47.6357836,-15.2050781 47.5172007,-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836,-15.0292969 47.6357836))'),
ST_GeomFromEWKT(col.the_geom)) FROM samplecol As col;"""
The output is:
(103079215291L, Decimal('40.87123100'), Decimal('29.24107000'), False)
(103079215291L, Decimal('40.86702000'), Decimal('29.23967000'), False)
(103079215291L, Decimal('40.87208200'), Decimal('29.22113000'), False)
(103079215291L, Decimal('40.86973200'), Decimal('29.23963000'), False)
(103079215291L, Decimal('40.87770800'), Decimal('29.20229900'), False)
I can't figure out what the False in the results means. Is this the correct way, or am I doing something wrong?
Also, does this query use the index on the field the_geom?
The query returns false because all points from your sample are outside of the given polygon: your points lie somewhere in the northeast of Tanzania, while your polygon covers southern Europe and northern Africa.
To test your query, I added another point somewhere in Málaga, which is inside your polygon, and it returned true just as expected (it is the last geometry in the INSERT statement, given as EWKT). This is the script:
CREATE TEMPORARY TABLE t (the_geom GEOMETRY);
INSERT INTO t VALUES ('0101000020E61000001BF33AE2900F424090AF4EDF7CAC15C0'),
('0101000020E6100000E2900DA48B0F424042C3AC61D0AB15C0'),
('0101000020E610000054E6E61BD10F42407C60C77F81B015C0'),
('0101000020E61000004D13B69F8C0F424097D6F03ED8AB15C0'),
('0101000020E6100000CFDC43C2F71042409929ADBF25B015C0'),
('0101000020E6100000CFF753E3A50F42408D68965F61AC15C0'),
(ST_GeomFromEWKT('SRID=4326;POINT(-4.4427 36.7233)'));
And here is your query:
db=# SELECT
ST_Contains(ST_GeomFromEWKT('SRID=4326; POLYGON((-15.0292969 47.6357836,-15.2050781 47.5172007,-16.2597656 29.3821751, 35.0683594 26.1159859, 38.0566406 47.6357836,-15.0292969 47.6357836))'),
ST_GeomFromEWKT(col.the_geom))
FROM t As col;
st_contains
-------------
f
f
f
f
f
f
t
(7 rows)
Btw: storing the same coordinates as GEOMETRY and as NUMERIC is totally redundant. You might want to get rid of the columns latitude and longitude and extract their values with ST_X and ST_Y on demand.
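For what it's worth, regarding the index question: ST_Contains can only benefit from a GiST index on the_geom when it appears as a filter in the WHERE clause; evaluated in the SELECT list as above, it is simply computed for every row. Here is a minimal sketch (not part of the original answer) of fetching only the points inside the polygon, assuming a psycopg2 connection conn and the samplecol table from the question:
# Filter in the WHERE clause and read the coordinates from the geometry itself,
# so the separate latitude/longitude columns are not needed. A spatial index,
# e.g. CREATE INDEX ON samplecol USING GIST (the_geom), can then be used for
# the bounding-box part of ST_Contains.
pois_in_polygon = """
    SELECT col.vessel_hash,
           ST_X(col.the_geom) AS longitude,
           ST_Y(col.the_geom) AS latitude
    FROM samplecol AS col
    WHERE ST_Contains(
        ST_GeomFromEWKT('SRID=4326;POLYGON((-15.0292969 47.6357836,-15.2050781 47.5172007,-16.2597656 29.3821751,35.0683594 26.1159859,38.0566406 47.6357836,-15.0292969 47.6357836))'),
        col.the_geom);
"""
cur = conn.cursor()
cur.execute(pois_in_polygon)
rows = cur.fetchall()   # only the rows whose point lies inside the polygon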
https://docs.google.com/document/d/1qqhVYhuwQsR2GOkpcTwhLvX5QUBCj5tv7LYqXAzB2UE/edit?usp=sharing
The above document shows the output of BeautifulSoup after HTML parsing. It is the response from an API POST request; on the website, it renders as a table.
Can anyone tell what the data format is, and why find() and find_all() are not working with it?
# Import libs
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json

# Read the codes from a local CSV file on my computer
# (assuming one code per row in code.csv)
dp_codes = pd.read_csv("code.csv", header=None)[0]

# URL of the API endpoint
url = "https://www.somewebsite.com"

# Query the API once per code
for i in dp_codes:
    # Form data passed in the request body
    formdata = {"objid": str(i)}
    response = requests.request("POST", url, data=formdata, timeout=1500)
    out = response.content
    soup = BeautifulSoup(out, "html.parser")
    data = json.loads(soup.text)          # renamed to avoid shadowing the json module
    df = pd.DataFrame(data["form"])
    df.to_csv(str(i) + ".csv")
Can anyone tell what the data format is...
It looks like either JSON or stringified JSON, which you probably realized, since you're using json.loads. I don't think parsing with BeautifulSoup before parsing the JSON is necessary at all, but I can't be sure without knowing what response.content looks like; in fact, if response.json() works, even json.loads becomes unnecessary.
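For instance, if the endpoint does return JSON, the body of your loop could likely be reduced to something like this (a sketch; 'form' is the key seen in the linked document, and url/formdata are the variables from your code):
response = requests.post(url, data=formdata, timeout=1500)
payload = response.json()        # parse the JSON body directly
formHtml = payload['form']       # the HTML string that renders as a table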
...the output of BeautifulSoup after HTML parsing...
...and why find() and find_all() are not working with it.
There's not much point to using BeautifulSoup (which is for HTML parsing, as you yourself noted!) unless the input is in an HTML/lxml/XML format. Otherwise, it tends to just be parsed as a document with a single NavigableString (and that's likely what happened here); so then it looks [to bs4] like there's nothing to find.
Anyway, I downloaded the document as a txt and then read it and extracted the one value [which is a HTML string] with
docContents = open('Unknow Data Type.txt', mode='r', encoding='utf-8-sig').read()
formHtml = json.loads(docContents)['form']
(The encoding took a little bit of trial and error to figure out, but I expect that step will be unnecessary for you as you have the raw contents.)
After that, formHtml can be parsed like any HTML string with BeautifulSoup; since it's just tables, you can even use pandas.read_html directly (see the sketch after the output below), but since you asked about find and find_all, I tried this little example:
formSoup = BeautifulSoup(formHtml, 'html.parser')
for t in formSoup.find_all('table'):
    print('+' * 100)
    for r in t.find_all('tr'):
        cols = [c.text for c in r.find_all(['td', 'th'])]
        cols = [f'{c[:10].strip():^12}' for c in cols]  # just formatting
        print(f'| {" | ".join(cols)} |')
    print('+' * 100)
It prints the tables as output:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| DISTRICT: | KASARGOD | LOCAL BODY | G14001-Kum |
| WARD: | 001-ANNADU | POLLING ST | 002-G J B |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
| Serial No | Name | Guardian's | New House | House Name | Gender / A | ID Card No |
| 1 | | Adru | 1 | | M / 55 | KL/01/002/ |
| 2 | | Battya | 1 | | F / 41 | KL/01/002/ |
| 3 | | | MOIDEEN KU | 289 | C.H.NAGAR | F / 22 | SECID15757 |
| 566 | | MOHAMMED K | 296 | ANNADKA HO | M / 49 | SECID15400 |
| 567 | | MOIDDEEN K | 296 | ANNADKA HO | F / 40 | SECID15400 |
| 568 | | MOHAMMED K | 296 | MUNDRAKOLA | M / 36 | SECID15400 |
| 569 | | RADHA | 381 | MACHAVU HO | M / 23 | SECID15400 |
| 570 | | SHIVAPPA S | 576 | UJJANTHODY | F / 47 | ZII0819813 |
| 571 | | SURESHA K | 826 | KARUVALTHA | F / 33 | JWQ1718857 |
| കൂട്ടിച്ചേ |
| 572 | DIVYA K | SUNDARA K | 182 | BHANDARA V | F / 24 | ZII0767137 |
| 573 | KUNHAMMA | ACHU BELCH | 185 | PODIPALLA | F / 84 | KL/01/002/ |
| 574 | SUJATHA M | KESHAVAN K | 186 | PODIPALLA | F / 48 | JWQ1687797 |
| 575 | SARATH M | SUJATHA M | 186 | PODIPALLA | M / 25 | SECID4BCFE |
| 576 | SAJITH K | SUJATHA M | 186 | PODIPPALLA | M / 21 | ZII3300043 |
| തിരുത്തലുക |
| ഇല്ല |
| ഒഴിവാക്കലു |
| ഇല്ല |
| |
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
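As mentioned above, because the form is just tables, pandas.read_html can load them straight into DataFrames. A minimal sketch (not in the original answer), reusing formHtml from the earlier snippet:
import pandas as pd
from io import StringIO

# read_html returns one DataFrame per <table> element in the HTML string
tables = pd.read_html(StringIO(formHtml))
for i, table in enumerate(tables):
    print(i, table.shape)   # inspect how many rows/columns each table has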
My question is whether there is a more direct or simpler approach to flattening the data than what I describe below, implemented in PySpark/Hive.
I have a dataset in one table which looks like below:
rdd = sc.parallelize([("123","000"),("456","123"),("789","456"),("111","000"),("999","888")])
df = rdd.toDF(["active_acct","inactive_acct"])
df.createOrReplaceTempView("temp_main_active_accts")
temp_main_active_accts_pd = df.toPandas()   # pandas copy of the table, used by the UDF below
df.show()
| active_acct | inactive_acct |
| 123 | 000 |
| 456 | 123 |
| 789 | 456 |
| 111 | 000 |
| 999 | 888 |
I am expecting the final output to be like:
| Current_active | all_old_active |
| 789 | 456,123,000 |
| 111 | 000 |
| 999 | 888 |
This means 789 is the current active record, and the records 456, 123, 000 were each active at one time or another, which is why you can see the recursive links in the main table.
I had to get to the latest record, i.e. 789, so that I can find the link to the previous credit cards. I have a query which gets the latest account used and returns records like:
active_accts = spark.sql("""select active_acct, inactive_acct from temp_main_active_accts
    where active_acct not in (select t1.active_acct from temp_main_active_accts t1
    join temp_main_active_accts t2 on t1.active_acct = t2.inactive_acct)""")
active_accts.show()
| active_acct | inactive_acct |
| 789 | 456 |
| 111 | 000 |
| 999 | 888 |
Below is the logic to flatten the records with a UDF, but the problem is that it takes a lot of time to run. I am looking for a simpler way to do this, whether SQL or PySpark based, so that I can avoid the UDF implementation (one UDF-free idea is sketched after the code below).
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Column names follow the sample data above (active_acct / inactive_acct)
old_acc_list = ""

def first_itr(active_acc):
    global old_acc_list
    # Look up the row whose active account matches the account we are following
    qry = """active_acct == '{0}'""".format(active_acc)
    active_acc_pd = temp_main_active_accts_pd.query(qry)
    active_acc_pd = active_acc_pd.drop_duplicates()
    active_acc_pd = active_acc_pd.reset_index(drop=True)
    active_acc_cnt = active_acc_pd.size
    if active_acc_cnt > 0:
        # Follow the link to the previously active account and recurse
        inactive_acc = active_acc_pd['inactive_acct'].astype(str)[0]
        old_acc_list += "," + str(inactive_acc)
        first_itr(inactive_acc)
    else:
        old_acc_list = old_acc_list.lstrip(",")
    return old_acc_list

def flatten_row(active_acc):
    # Reset the accumulator for every row before walking its chain
    global old_acc_list
    old_acc_list = ""
    return first_itr(active_acc)

extract_old_acc_udf = udf(flatten_row, StringType())
df_final = active_accts.withColumn("all_old_accs", extract_old_acc_udf(col("active_acct")))
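Not part of the original post, but one UDF-free idea: since the link table is already pulled into pandas (temp_main_active_accts_pd), you can build a plain dict of active -> inactive links and walk each chain once on the driver, assuming there are no cycles and the link table stays driver-sized:
links = dict(zip(temp_main_active_accts_pd['active_acct'],
                 temp_main_active_accts_pd['inactive_acct']))

def walk_chain(acct):
    # Follow active -> inactive links until the chain ends
    old = []
    nxt = links.get(acct)
    while nxt is not None:
        old.append(nxt)
        nxt = links.get(nxt)
    return ",".join(old)

# Roots are active accounts that never appear on the inactive side
roots = links.keys() - set(links.values())
flattened = [(acct, walk_chain(acct)) for acct in sorted(roots)]
spark.createDataFrame(flattened, ["Current_active", "all_old_active"]).show()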
The data frame contains a column 'value' which has some hidden characters.
When I write the data frame to PostgreSQL, I get the error below:
ValueError: A string literal cannot contain NUL (0x00) characters.
I somehow found the cause of the error. Refer to the table below (note the missing column value):
| | datetime | mc | tagname | value | quality |
|-------|--------------------------|----|---------|------------|---------|
| 19229 | 16-12-2021 02:31:29.083 | L | VIN | | 192 |
| 19230 | 16-12-2021 02:35:28.257 | L | VIN | C4A 173026 | 192 |
I checked the length of the string; it was the same 10 characters as in the rows below:
df.value.str.len()
Requirement:
I want to replace that empty area with the text 'miss'. I tried different methods in pandas, but I'm not able to do it:
df['value'] = df['value'].str.replace(r"[\"\',]", '')
df.replace('\'','', regex=True, inplace=True)
Expected output:
| | datetime | mc | tagname | value | quality |
|-------|--------------------------|----|---------|------------|---------|
| 19229 | 16-12-2021 02:31:29.083 | L | VIN | miss | 192 |
| 19230 | 16-12-2021 02:35:28.257 | L | VIN | C4A 173026 | 192 |
Try this:
df['value'] = df['value'].str.replace(r'[\x00-\x19]', '', regex=True).replace('', 'miss')
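A quick way to check the idea on a toy frame (hypothetical values; the hidden characters are simulated here with NUL bytes):
import pandas as pd

# First 'value' is ten hidden NUL characters, as in the question
df = pd.DataFrame({'value': ['\x00' * 10, 'C4A 173026']})

df['value'] = (df['value']
               .str.replace(r'[\x00-\x19]', '', regex=True)  # strip control characters
               .replace('', 'miss'))                         # then fill the now-empty cells

print(df['value'].tolist())   # ['miss', 'C4A 173026']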
I have a MultiIndex Pandas DataFrame like so:
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
| | VECTOR | SEGMENTS | OVERALL | INDIVIDUAL |
| | | | TIP X | TIP Y | CURVATURE | TIP X | TIP Y | CURVATURE |
| 0 | (TOP, TOP) | 2 | 3.24 | 1.309 | 44 | 1.62 | 0.6545 | 22 |
| 1 | (TOP, BOTTOM) | 2 | 3.495 | 0.679 | 22 | 1.7475 | 0.3395 | 11 |
| 2 | (BOTTOM, TOP) | 2 | 3.495 | -0.679 | -22 | 1.7475 | -0.3395 | -11 |
| 3 | (BOTTOM, BOTTOM) | 2 | 3.24 | -1.309 | -44 | 1.62 | -0.6545 | -22 |
+---+------------------+----------+---------+--------+-----------+------------+---------+-----------+
How can I drop duplicates based on all columns contained under 'OVERALL' or 'INDIVIDUAL'? So if I choose 'INDIVIDUAL' to drop duplicates from, the values of TIP X, TIP Y, and CURVATURE under INDIVIDUAL must all match for a row to count as a duplicate?
And further, as you can see from the table, rows 1 and 2 are duplicates that are simply mirrored about the x-axis. These must also be dropped.
Also, can I center the OVERALL and INDIVIDUAL headings?
EDIT: frame.drop_duplicates(subset=['INDIVIDUAL'], inplace=True) produces KeyError: Index(['INDIVIDUAL'], dtype='object')
You can pass pandas .drop_duplicates a subset of tuples for multi-indexed columns:
df.drop_duplicates(subset=[
('INDIVIDUAL', 'TIP X'),
('INDIVIDUAL', 'TIP Y'),
('INDIVIDUAL', 'CURVATURE')
])
Or, if your row indices are unique, you could use the following approach that saves some typing:
df.loc[df['INDIVIDUAL'].drop_duplicates().index]
Update:
As you suggested in the comments, if you want to do operations on the dataframe you can do that in-line:
df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
Or for non-pandas functions, you can use .transform:
df.loc[df['INDIVIDUAL'].transform(np.abs).drop_duplicates().index]
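To see that this also handles the mirrored rows from the question, here is a small self-contained example (VECTOR and SEGMENTS are left out for brevity; values are taken from the table above):
import pandas as pd

cols = pd.MultiIndex.from_tuples([
    ('OVERALL', 'TIP X'), ('OVERALL', 'TIP Y'), ('OVERALL', 'CURVATURE'),
    ('INDIVIDUAL', 'TIP X'), ('INDIVIDUAL', 'TIP Y'), ('INDIVIDUAL', 'CURVATURE')])
df = pd.DataFrame([
    [3.24,  1.309,  44,  1.62,    0.6545,  22],
    [3.495, 0.679,  22,  1.7475,  0.3395,  11],
    [3.495, -0.679, -22, 1.7475, -0.3395, -11],
    [3.24,  -1.309, -44, 1.62,   -0.6545, -22]], columns=cols)

# Compare absolute values under INDIVIDUAL so mirrored rows count as duplicates
deduped = df.loc[df['INDIVIDUAL'].abs().drop_duplicates().index]
print(deduped)   # keeps rows 0 and 1 only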
I have a massive CSV (1.4gb, over 1MM rows) of stock market data that I will process using R.
The table looks roughly like this. For each ticker, there are thousands of rows of data.
+--------+------+-------+------+------+
| Ticker | Open | Close | High | Low |
+--------+------+-------+------+------+
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| A | 32 | 23 | 43 | 344 |
| A | 121 | 121 | 212 | 2434 |
| B | 32 | 23 | 43 | 344 |
+--------+------+-------+------+------+
To make processing and testing easier, I'm breaking this colossus into smaller files using the script mentioned in this question: How do I slice a single CSV file into several smaller ones grouped by a field?
The script would output files such as data_a.csv, data_b.csv, etc.
But, I would also like to create index.csv which simply lists all the unique stock ticker names.
E.g.
+---------+
| Ticker |
+---------+
| A |
| B |
| C |
| D |
| ... |
+---------+
Can anybody recommend an efficient way of doing this in R or Python, when handling a huge filesize?
You could loop through each file, grabbing the index of each and creating a set union of all indices.
import glob
import pandas as pd

tickers = set()
for csvfile in glob.glob('*.csv'):
    # index_col=0 assumes the ticker is the first column; use header=None or header=0, however your data is set up
    data = pd.read_csv(csvfile, index_col=0, header=None)
    tickers.update(data.index.tolist())

pd.Series(list(tickers)).to_csv('index.csv', index=False)
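If you would rather build index.csv straight from the original 1.4 GB file without splitting it first, you could also read it in chunks; a sketch, assuming the big file is named data.csv and has a Ticker column as in the example:
import pandas as pd

tickers = set()
# Stream the file in manageable pieces instead of loading it all at once
for chunk in pd.read_csv('data.csv', usecols=['Ticker'], chunksize=500_000):
    tickers.update(chunk['Ticker'].unique())

pd.Series(sorted(tickers), name='Ticker').to_csv('index.csv', index=False)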
You can retrieve the index from the file names:
(index <- data.frame(Ticker = toupper(gsub("^.*_(.*)\\.csv",
"\\1",
list.files()))))
## Ticker
## 1 A
## 2 B
write.csv(index, "index.csv")
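For completeness, the same file-name trick in Python, assuming the split files follow the data_<ticker>.csv pattern produced by the linked script:
import csv
import glob
import os

# Extract the ticker part of each file name, e.g. data_a.csv -> A
tickers = sorted(os.path.basename(f)[len('data_'):-len('.csv')].upper()
                 for f in glob.glob('data_*.csv'))

with open('index.csv', 'w', newline='') as fh:
    writer = csv.writer(fh)
    writer.writerow(['Ticker'])
    writer.writerows([t] for t in tickers)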