Integrating two different dataframes using a "contains" condition and creating a new dataframe - python

First data frame looks like below:
OSIED   geometry
257005  POLYGON ((311852.712 178933.993, 312106.023 17...
017049  POLYGON ((272943.107 137755.159, 272647.627 13...
017032  POLYGON ((276637.425 146141.397, 276601.509 14...
Second data frame looks like below:
small_area                                           Median_BER
217099001/217112002/217112003/2570052005/217112...  212.9
047041005/047041001/2570051004/047041002/047041...  271.3
157041002/157041004/157041003/157041001/157129...   222.5
I need to search col1 of df1 within col1 of df2 using a "contains" condition.
If it matches/has the string, then fetch the corresponding values from both df1 and df2.
I tried merge, df.get and str.contains.
str.contains works, but I am unable to fetch the other records.
Output should look like this:
OSIED   geometry                                            small_area                                        Median_BER
257005  POLYGON ((311852.712 178933.993, 312106.023 17...  217099001/217112002/217112003/2570052005/217112  212.9
017049  POLYGON ((272943.107 137755.159, 272647.627 13...  047041005/047041001/2570051004/047041002/047041  222.5

Playing around with some code, I was able to generate the following (here df2 is the second frame, with the small_area and Median_BER columns):
small_area_oseid_df = pd.DataFrame(
    [
        {'OSIED': oseid[:6], 'Median_BER': row['Median_BER']}
        for row in df2.to_dict(orient='records')
        for oseid in row['small_area'].split('/')
    ]
)
Then you can join this table with the first table on the OSIED key. Note that the split explodes the rows: small_area_oseid_df will contain one row per slash-separated element of each small_area string, so its size depends on how many elements each row splits into.
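A minimal end-to-end sketch of that explode-then-merge approach, using made-up sample rows in place of the real frames (the column names come from the question; the data values here are illustrative only):

import pandas as pd

# Hypothetical stand-ins for the two frames described above.
df1 = pd.DataFrame({
    'OSIED': ['257005', '017049'],
    'geometry': ['POLYGON ((311852.712 178933.993, ...))',
                 'POLYGON ((272943.107 137755.159, ...))'],
})
df2 = pd.DataFrame({
    'small_area': ['217099001/217112002/2570052005',
                   '0170491004/047041001/047041005'],
    'Median_BER': [212.9, 271.3],
})

# Explode each small_area into the 6-character OSIED prefixes of its parts...
small_area_oseid_df = pd.DataFrame(
    [
        {'OSIED': oseid[:6],
         'small_area': row['small_area'],
         'Median_BER': row['Median_BER']}
        for row in df2.to_dict(orient='records')
        for oseid in row['small_area'].split('/')
    ]
)

# ...then an ordinary merge recovers the "contains" relationship.
result = df1.merge(small_area_oseid_df, on='OSIED', how='inner')
print(result[['OSIED', 'geometry', 'small_area', 'Median_BER']])

Since several elements of one small_area string can share the same 6-character prefix, you may want to drop_duplicates on the merged result.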

Related

Split and create data from a column to many columns

I have a pandas data frame in which the values of one of its columns look like this:
print(VCF['INFO'].iloc[0])
Results (sorry, I can't copy and paste this data as I am working from a cluster without an internet connection)
I need to create new columns with the names END, SVTYPE and SVLEN, with their info as the values of those columns. Following the example, this would be:
      END  SVTYPE       SVLEN
224015456     DEL  -223224913
The rest of the info contained in the INFO column I do not need so far.
The information contained in this column is huge, but as far as I can read there are no more something=value pairs, as you can see in the picture.
Simply use .str.extract:
extracted = df['INFO'].str.extract('END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);')
Output:
>>> extracted
         END SVTYPE       SVLEN
0  224015456    DEL  -223224913
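As a self-contained check of that pattern (the INFO string below is made up to match the shape described in the question):

import pandas as pd

# Hypothetical INFO value in the something=value;... shape described above.
df = pd.DataFrame({'INFO': ['END=224015456;SVTYPE=DEL;SVLEN=-223224913;OTHER=x;']})

extracted = df['INFO'].str.extract(
    'END=(?P<END>.+?);SVTYPE=(?P<SVTYPE>.+?);SVLEN=(?P<SVLEN>.+?);'
)
print(extracted)
#          END SVTYPE       SVLEN
# 0  224015456    DEL  -223224913

The named groups in the pattern become the column names of the extracted frame, which is why no explicit rename step is needed.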

Create list of dictionary items from lists

I am working on a project that involves going through two columns of latitude and longitude values. If the lat/long in one pair of columns are blank, then I need to figure out which pair of lat/long values in another two columns are (geographically) closest to those in the destination. The dataframe looks like this:
origin_lat | origin_lon  | destination_lat | destination_lon
-------------------------------------------------------------
20.291326  | -155.838488 | 25.145242       | -98.491404
25.611236  | -80.551706  | 25.646763       | -81.466360
26.897654  | -75.867564  | nan             | nan
I am trying to build two dictionaries, one with the origin lat and long, and the other with the destination lat and long, in this format:
tmplist = [{'origin_lat': 39.7612992, 'origin_lon': -86.1519681},
           {'origin_lat': 39.762241,  'origin_lon': -86.158436},
           {'origin_lat': 39.7622292, 'origin_lon': -86.1578917}]
What I want to do is for every row where the destination lat/lon are blank, compare the origin lat/lon in the same row to a dictionary of all the non-nan destination lat/lon values, then print the geographically closest lat/lon from the dictionary of destination lat/lon to the row in place of the nan values. I've been playing around with creating lists of dictionary objects but can't seem to build a dictionary in the correct format. Any help would be appreciated!
If df is your pandas.DataFrame, you can generate the requested dictionaries by iterating through the rows of df:
origin_dicts = [{'origin_lat': row['origin_lat'], 'origin_lon': row['origin_lon']} for _, row in df.iterrows()]
and analogously for destination_dicts.
Remark: if the only reason for creating the dictionaries is the calculation of values replacing the nan-entries, it might be easier to do this directly on the data frame, e.g.
df['destination_lon'] = df.apply(find_closest_lon, axis=1)
df['destination_lat'] = df.apply(find_closest_lat, axis=1)
where find_closest_lon and find_closest_lat are functions receiving a data frame row as an argument and having access to the values of the origin columns of the data frame.
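A minimal sketch of what such a helper could look like, assuming plain squared Euclidean distance on lat/lon is an acceptable stand-in for geographic distance (for real distances, swap in a haversine formula); find_closest and known are hypothetical names, not part of the question:

import numpy as np

def find_closest(row, known):
    """Return the (lat, lon) pair in `known` nearest to this row's origin,
    or the row's own destination pair if it is already filled in."""
    if not np.isnan(row['destination_lat']):
        return row['destination_lat'], row['destination_lon']
    dists = [(row['origin_lat'] - lat) ** 2 + (row['origin_lon'] - lon) ** 2
             for lat, lon in known]
    return known[int(np.argmin(dists))]

# All non-nan destination pairs, as plain (lat, lon) tuples.
known = list(df.loc[df['destination_lat'].notna(),
                    ['destination_lat', 'destination_lon']]
               .itertuples(index=False, name=None))

df[['destination_lat', 'destination_lon']] = [
    find_closest(row, known) for _, row in df.iterrows()
]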
The format that you want is the built-in 'records' format:
df[['origin_lat','origin_lon']].to_dict(orient = 'records')
produces
[{'origin_lat': 20.291326, 'origin_lon': -155.83848799999998},
{'origin_lat': 25.611235999999998, 'origin_lon': -80.55170600000001},
{'origin_lat': 26.897654, 'origin_lon': -75.867564}]
and of course you can equally have
df[['destination_lat','destination_lon']].to_dict(orient = 'records')
But I agree with @ctenar that you do not need to generate dictionaries for your ultimate task; pandas provides enough functionality for that.

How to calculate sum of specific column based on more than 2 complex conditions in python dataframe

So basically what I wanted to figure out is: is there a way of calculating 'batsman_runs' (not visible in the image, but yes, there is such a column) per 'match_id' for different 'batsman', and then storing the results as a dictionary or a list, or just printing the values?
The following link is a snapshot of the dataset
https://i.stack.imgur.com/zVWSh.jpg
Assuming you have imported numpy:
result = your_df['batsman_runs'].to_numpy() / your_df['match_id'].to_numpy()
result will be a numpy array, which holds all the values of the 'batsman_runs' column divided by all the respective values of the 'match_id' column.
You can try something like this, since you said you have a column called batsman_runs:
df = df.groupby(by=['match_id','batsman'])['batsman_runs'].sum()
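If you then want those per-match, per-batsman totals as a dictionary, Series.to_dict gives you (match_id, batsman) tuples as keys; a small sketch with made-up rows:

import pandas as pd

# Made-up rows in the shape the question describes.
df = pd.DataFrame({
    'match_id':     [1, 1, 1, 2],
    'batsman':      ['A', 'A', 'B', 'A'],
    'batsman_runs': [4, 6, 1, 2],
})

totals = df.groupby(by=['match_id', 'batsman'])['batsman_runs'].sum()
print(totals.to_dict())
# {(1, 'A'): 10, (1, 'B'): 1, (2, 'A'): 2}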

For loop does not stop

for lat, lng, value in zip(location_saopaulo_df['geolocation_lat'],
                           location_saopaulo_df['geolocation_lng'],
                           location_saopaulo_df['municipality']):
    coordinates = (lat, lng)
    items = rg.search(coordinates)
    value = items[0]['admin2']
I am trying to iterate over three columns of the dataframe: get the latitude and longitude values from the first two columns, use them to look up the address, then add the city name to the last column I mentioned, which is an empty column consisting of NaN values.
However, my for loop does not stop. I would be grateful if you could tell me why it doesn't stop, or suggest a better way to do what I'm trying to do.
Thank you in advance.
If rg is reverse_geocoder, there is a better way to query several coordinates at once than looping. Try this:
res = rg.search(tuple(zip(location_saopaulo_df['geolocation_lat'],
                          location_saopaulo_df['geolocation_lng'])))
Then extract just the admin2 value, for example by constructing a dataframe:
df_ = pd.DataFrame(res)
and see what it looks like. You may be able to perform a merge or index alignment to put it back into your original dataframe location_saopaulo_df.
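A sketch of the full batched flow under that assumption (rg being reverse_geocoder, whose results come back in input order and include an 'admin2' field):

import pandas as pd
import reverse_geocoder as rg

coords = tuple(zip(location_saopaulo_df['geolocation_lat'],
                   location_saopaulo_df['geolocation_lng']))
res = rg.search(coords)          # one batched query instead of a per-row loop

res_df = pd.DataFrame(res)       # columns include 'admin1', 'admin2', ...
location_saopaulo_df['municipality'] = res_df['admin2'].to_numpy()

Using .to_numpy() sidesteps index alignment between the freshly built res_df and location_saopaulo_df, which may not share an index.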

Union Row inside Row PySpark Dataframe

I want to convert my Dataframe, which has Rows nested inside Rows, to flat rows, like this:
My dataframe:
[Row(Autorzc=u'S', Cd=u'00000012793', ClassCli=u'A', Op=Row(CEP=u'04661904', CaracEspecial=u'S', Venc=Row(v110=u'1', v120=u'2'))),
 Row(Autorzc=u'S', Cd=u'00000012794', ClassCli=u'A', Op=Row(CEP=u'04661904', CaracEspecial=u'S', Venc=Row(v110=u'1', v120=u'2')))]
and I want to transform to this:
[Row(Autorzc=u'S', Cd=u'00000012793', ClassCli=u'A', CEP=u'04661904', CaracEspecial=u'S', v110=u'1', v120=u'2'),
 Row(Autorzc=u'S', Cd=u'00000012794', ClassCli=u'A', CEP=u'04661904', CaracEspecial=u'S', v110=u'1', v120=u'2')]
Any suggestion?
You can do a simple select operation and your columns will be renamed accordingly.
final = initial.select("Autorzc", "Cd", "ClassCli", "Op.CEP",
                       "Op.CaracEspecial", "Op.Venc.v110", "Op.Venc.v120")
print(final.first())
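If you want explicit control over the flattened column names, the same select can be written with col(...).alias(...); a sketch, assuming initial is the nested dataframe from the question:

from pyspark.sql.functions import col

final = initial.select(
    "Autorzc", "Cd", "ClassCli",
    col("Op.CEP").alias("CEP"),
    col("Op.CaracEspecial").alias("CaracEspecial"),
    col("Op.Venc.v110").alias("v110"),
    col("Op.Venc.v120").alias("v120"),
)
print(final.first())

Selecting a nested field by string already names the resulting column after its last component (e.g. "Op.Venc.v110" becomes v110), so the aliases here mainly document intent.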
