In the output posted below you will see that this code takes the Location column (a Series) and places it in a dataframe. The first, second, and third lines of the nested for loop then take the first index of each column and create a new dataframe to attach to the first one. What I have been trying to do is loop through, going up one index on each pass, and attach a new dataframe of repetitive data each time. However, when I try to print it, only the first dataframe and the last repetitive dataframe from the loop are printed. What I am actually after is one huge dataframe that attaches a repetitive-index dataframe for every index from 0 to 17. I have updated this to show the repetitiveness that I am looking for, but in a truncated way. I hope this helps. Thanks!
Here is the input
for j in range(0, 18, 1):
    for i in range(0, 18, 1):
        df['Rep Loc'] = str(df['Location'][j:j+1])
        df['Rep Lat'] = float(df['Latitude'][j:j+1])
        df['Rep Long'] = float(df['Longitude'][j:j+1])
        break
    print(df)
Here is the output
                                             Location   Latitude  Longitude
0   Letsholathebe II Rd, Maun, North-West District... -19.989491  23.397709
1                       North-West District, Botswana -19.389353  23.267951
2       Silobela, Kwekwe, Midlands Province, Zimbabwe -18.993930  29.147992
3   Mosi-Oa-Tunya, Livingstone, Southern Province,... -17.910147  25.861904
4   Parkway Drive, Victoria Falls, Matabeleland No... -17.909231  25.827019
5          A33, Kasane, North-West District, Botswana -17.795057  25.197270
6                       T1, Southern Province, Zambia -17.040664  26.608454
7   Sikoongo Road, Siavonga, Southern Province, Za... -16.536204  28.708753
8                 New Kasama, Lusaka Province, Zambia -15.471934  28.398588
9   Simon Mwansa Kapwepwe Avenue, Avondale, Lusaka... -15.386244  28.397111
10              Lusaka, Lusaka Province, 1010, Zambia -15.416697  28.281381
11  Chigwirizano Road, Rhodes Park, Lusaka, Lusaka... -15.401848  28.302248
12                T2, Kabwe, Central Province, Zambia -14.420744  28.462169
13   Kabushi Road, Ndola, Copperbelt Province, Zambia -12.997968  28.608536
14  Dr Aggrey Avenue, Mishenshi, Kitwe, Copperbelt... -12.797684  28.199061
15  President Avenue, Kalulushi, Copperbelt Provin... -12.833375  28.108370
16  Eglise Methodiste Unie, Avenue Mantola, Mawawa... -11.699407  27.500234
17  Avenue Babemba, Kolwezi, Lwalaba, Katanga, Lua... -10.698109  25.503816
                                             Rep Loc    Rep Lat   Rep Long
0  0    Letsholathebe II Rd, Maun, North-West Dis... -19.989491  23.397709
1  0    Letsholathebe II Rd, Maun, North-West Dis... -19.989491  23.397709
2  0    Letsholathebe II Rd, Maun, North-West Dis... -19.989491  23.397709
                                             Rep Loc    Rep Lat   Rep Long
0  1    North-West District, Botswana\nName: Loca... -19.389353  23.267951
1  1    North-West District, Botswana\nName: Loca... -19.389353  23.267951
2  1    North-West District, Botswana\nName: Loca... -19.389353  23.267951
                                             Rep Loc    Rep Lat   Rep Long
0  2    Silobela, Kwekwe, Midlands Province, Zimb...  -18.99393  29.147992
1  2    Silobela, Kwekwe, Midlands Province, Zimb...  -18.99393  29.147992
                                             Rep Loc    Rep Lat   Rep Long
0  3    Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147  25.861904
1  3    Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147  25.861904
2  3    Mosi-Oa-Tunya, Livingstone, Southern Prov... -17.910147  25.861904
                                             Rep Loc    Rep Lat   Rep Long
0  4    Parkway Drive, Victoria Falls, Matabelela... -17.909231  25.827019
1  4    Parkway Drive, Victoria Falls, Matabelela... -17.909231  25.827019
2  4    Parkway Drive, Victoria Falls, Matabelela... -17.909231  25.827019
                                             Rep Loc    Rep Lat   Rep Long
0  5    A33, Kasane, North-West District, Botswan... -17.795057   25.19727
1  5    A33, Kasane, North-West District, Botswan... -17.795057   25.19727
2  5    A33, Kasane, North-West District, Botswan... -17.795057   25.19727
Good practice when asking questions is to provide an example of what you want your output to look like. However, this is my best guess at what you want.
pd.concat({i: d.shift(i) for i in range(18)}, axis=1)
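If the goal is literally one long dataframe in which every source row (index 0 through 17) is repeated a fixed number of times, rather than a separate print per index, a minimal sketch along those lines (using a small made-up frame standing in for the question's geocoded data; d in the one-liner above presumably stands for the question's df) would be:
import pandas as pd

# Small stand-in for the question's 18-row dataframe.
df = pd.DataFrame({
    'Location': ['Maun, Botswana', 'Kasane, Botswana', 'Livingstone, Zambia'],
    'Latitude': [-19.989491, -17.795057, -17.910147],
    'Longitude': [23.397709, 25.197270, 25.861904],
})

# Repeat every row three times and stack the pieces into one frame,
# keyed by the original index, instead of printing one frame per index.
repeated = pd.concat(
    {j: df.iloc[[j] * 3].reset_index(drop=True) for j in range(len(df))},
    names=['orig_index', 'rep'],
)
print(repeated)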
I have a table with ids and locations they have been to.
id  Location
1   Maryland
1   Iowa
2   Maryland
2   Texas
3   Georgia
3   Iowa
4   Maryland
4   Iowa
5   Maryland
5   Iowa
5   Texas
I'd like to perform a query that would allow me to get the number of ids per combination.
In this example table, the output would be -
Maryland, Iowa - 2
Maryland, Texas - 1
Georgia, Iowa - 1
Maryland, Iowa, Texas - 1
My original thought was to add the ASCII values of the distinct locations of each id, and see how many have each value, and what the combinations are that correspond to the value. I was not able to do that as SQL server would not let me cast an nvarchar as a numeric data type. Is there any other way I could use SQL to get the number of devices per combination? Using python to get the number of ids per combination is also acceptable, however, SQL is preferred.
If you want to solve this in SQL and you are running SQL Server 2017 or later, you can use a CTE to aggregate the locations for each id using STRING_AGG, and then count the occurrences of each aggregated string:
WITH all_locations AS (
SELECT STRING_AGG(Location, ', ') WITHIN GROUP (ORDER BY Location) AS aloc
FROM locations
GROUP BY id
)
SELECT aloc, COUNT(*) AS cnt
FROM all_locations
GROUP BY aloc
ORDER BY cnt, aloc
Output:
aloc cnt
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Maryland, Texas 1
Iowa, Maryland 2
Note that I have applied an ordering to the STRING_AGG to ensure that someone who visits Maryland and then Iowa is treated the same way as someone who visits Iowa and then Maryland. If this is not the desired behaviour, simply delete the WITHIN GROUP clause.
Demo on dbfiddle
Use groupby + agg + value_counts:
new_df = df.groupby('id')['Location'].agg(list).str.join(', ').value_counts().reset_index()
Output:
>>> new_df
index Location
0 Maryland, Iowa 2
1 Maryland, Texas 1
2 Georgia, Iowa 1
3 Maryland, Iowa, Texas 1
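On pandas 2.0 and later, value_counts() names its result count and keeps the original column name on the index, so the reset_index columns come out as Location and count instead of index and Location. A small version-independent variant that pins the column names explicitly (just a sketch, not part of the original answer):
new_df = (
    df.groupby('id')['Location']
      .agg(list)
      .str.join(', ')
      .value_counts()
      .rename_axis('Location')       # name for the combination column
      .reset_index(name='count')     # name for the count column
)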
Let us do groupby with join then value_counts
df.groupby('id')['Location'].agg(', '.join).value_counts()
Out[938]:
join
Maryland, Iowa 2
Georgia, Iowa 1
Maryland, Iowa, Texas 1
Maryland, Texas 1
dtype: int64
Use a frozenset to aggregate to ensure having unique groups:
df.groupby('id')['Location'].agg(frozenset).value_counts()
Output:
(Maryland, Iowa) 2
(Texas, Maryland) 1
(Georgia, Iowa) 1
(Texas, Maryland, Iowa) 1
Name: Location, dtype: int64
Or a sorted string join:
df.groupby('id')['Location'].agg(lambda x: ', '.join(sorted(x))).value_counts()
Output:
Iowa, Maryland 2
Maryland, Texas 1
Georgia, Iowa 1
Iowa, Maryland, Texas 1
Name: Location, dtype: int64
I have a dataframe as follows :
Neighborhood,City,State,Country
Westside,Boston,MA,USA
South District,New York,NY,USA
Business Town,,OR,USA
Shopping District,,Wellington,New Zealand
Big Mountain,,,Australia
Now I want to go over pairs of non-empty columns (C0,C1), (C1,C2), (C2,C3) and create a dataframe that looks like the one below. However, if C1 is empty or null, then pair C0 with C2, and so on.
Root         Child
OR           Business Town
USA          OR
New Zealand  Wellington
Wellington   Shopping District
Boston       Westside
MA           Boston
USA          MA
New York     South District
NY           New York
USA          NY
Australia    Big Mountain
Here is one way using shift after stack
s = df.stack().iloc[::-1]
yourdf = pd.DataFrame({'Root': s.groupby(level=0).shift().values, 'Child': s.values}).dropna()
yourdf
Out[62]:
Root Child
1 Australia Big Mountain
3 New Zealand Wellington
4 Wellington Shopping District
6 USA OR
7 OR Business Town
9 USA NY
10 NY New York
11 New York South District
13 USA MA
14 MA Boston
15 Boston Westside
Using a comprehension (and a few other things):
pd.DataFrame([
t for _, g in df.stack().groupby(level=0)
for t in zip(g.iloc[1:], g)
], columns=['Root', 'Child'])
Root Child
0 Boston Westside
1 MA Boston
2 USA MA
3 New York South District
4 NY New York
5 USA NY
6 OR Business Town
7 USA OR
8 Wellington Shopping District
9 New Zealand Wellington
10 Australia Big Mountain
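Both answers skip the blank City/State cells because stack() drops NaN values by default, so a missing column never shows up as a Root or Child; each remaining value is then simply paired with its right-hand neighbour (the neighbour becomes Root, the value becomes Child). A minimal illustration using just the last row of the question's data:
import pandas as pd
import numpy as np

row = pd.DataFrame(
    [['Big Mountain', np.nan, np.nan, 'Australia']],
    columns=['Neighborhood', 'City', 'State', 'Country'],
)
print(row.stack())
# 0  Neighborhood    Big Mountain
#    Country            Australia
# dtype: object
Note that the NaN-dropping default of stack() is changing in newer pandas (the future_stack implementation keeps NaN), so depending on your version you may need an explicit .dropna() after stacking.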
I would like to add a dictionary to a list, which contains several other dictionaries.
I have a list of ten top travel cities:
City Country Population Area
0 Buenos Aires Argentina 2891000 4758
1 Toronto Canada 2800000 2731571
2 Pyeongchang South Korea 2581000 3194
3 Marakesh Morocco 928850 200
4 Albuquerque New Mexico 559277 491
5 Los Cabos Mexico 287651 3750
6 Greenville USA 84554 68
7 Archipelago Sea Finland 60000 8300
8 Walla Walla Valley USA 32237 33
9 Salina Island Italy 4000 27
10 Solta Croatia 1700 59
11 Iguazu Falls Argentina 0 672
I imported the excel with pandas:
import pandas as pd
travel_df = pd.read_excel('./cities.xlsx')
print(travel_df)
cities = travel_df.to_dict('records')
print(cities)
variables = list(cities[0].keys())
I would like to add a 12th element to the end of the list but don't know how to do so:
beijing = {"City" : "Beijing", "Country" : "China", "Population" : "24000000", "Area" : "6490" }
print(beijing)
Try appending the new row to the DataFrame you read.
travel_df.append(beijing, ignore_index=True)
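Two side notes, since the question actually builds a list of dicts from the DataFrame: appending to that list is a plain list operation, and DataFrame.append was removed in pandas 2.0, where pd.concat is the replacement. A short sketch of both:
# Add the new city to the list of dicts produced by to_dict('records').
cities.append(beijing)

# On pandas 2.0+, DataFrame.append no longer exists; concat is the equivalent.
travel_df = pd.concat([travel_df, pd.DataFrame([beijing])], ignore_index=True)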
Say that I have a dataframe that looks something like this
date location year
0 1908-09-17 Fort Myer, Virginia 1908
1 1909-09-07 Juvisy-sur-Orge, France 1909
2 1912-07-12 Atlantic City, New Jersey 1912
3 1913-08-06 Victoria, British Columbia, Canada 1912
I want to use pandas groupby function to create an output that shows the total number of incidents by year but also keep the location column that will display one of the locations that year. Any which one works. So it would look something like this:
total location
year
1908 1 Fort Myer, Virginia
1909 1 Juvisy-sur-Orge, France
1912 2 Atlantic City, New Jersey
Can this be done without doing funky joining? The furthest I can get is using the normal groupby
df = df.groupby(['year']).count()
But that only gives me something like this
      date  location
year
1908     1         1
1909     1         1
1912     2         2
How can I display one of the locations in this dataframe?
You can use groupby.agg and use 'first' to extract the first location in each group:
res = df.groupby('year')['location'].agg(['first', 'count'])
print(res)
# first count
# year
# 1908 Fort Myer, Virginia 1
# 1909 Juvisy-sur-Orge, France 1
# 1912 Atlantic City, New Jersey 2
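To match the column names from the question exactly, the result can be renamed and reordered (an optional step, not part of the original answer):
res = res.rename(columns={'count': 'total', 'first': 'location'})[['total', 'location']]
print(res)
#       total                   location
# year
# 1908      1        Fort Myer, Virginia
# 1909      1    Juvisy-sur-Orge, France
# 1912      2  Atlantic City, New Jersey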
I have this dataframe:
iata airport city state country lat \
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Colorado Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
I am trying to get the number of airports per state. For example if I have the function:
def f(dataframe, state):
    return result
Where state would be a state abbreviation, such as 'MA'. I am trying to group the dataframe by the input variable, such as state ('MA') to then get the number of airports per state.
When I use:
df.groupby(state)['airport'].value_counts()
or
df.groupby(state)['airport'].value_counts()/df['airport'].count()
or
df.groupby(['state'] == state)['airport'].value_counts()/df['airport'].count()
The last two are regarding the conditional probability that a selected airport will be in that state.
It throws a KeyError: 'MA', which I think is due to the input variable not being recognized as a column, but as a value in the column.
Is there a way to get the number of airports per state?
I would use Pandas's nunique to get the number of airports per state. The code is easier to read and remember.
To illustrate my point, I modified the dataset as follows, such that Florida has two more fictional airports (three in total):
iata airport city state country lat
0 00M Thigpen Bay Springs MS USA 31.953765
1 00R Livingston Municipal Livingston TX USA 30.685861
2 00V Meadow Lake Springs CO USA 38.945749
3 01G Perry-Warsaw Perry NY USA 42.741347
4 01J Hilliard Airpark Hilliard FL USA 30.688012
5 f234 Weirdviller Chilliard FL USA 30.788012
6 23r2 Johnson Billiard FL USA 30.888012
Then, we write:
df.groupby('state').iata.nunique()
to get the following results:
state
CO 1
MS 1
TX 1
FL 3
NY 1
Name: iata, dtype: int64
Hope this helps.
Assuming each record is an airport throughout, you can just count the records for each state / country combination:
df.groupby(['country','state']).size()
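If a wide layout is easier to scan, the same counts can be pivoted into a country-by-state table (purely a presentation tweak on the line above):
df.groupby(['country', 'state']).size().unstack(fill_value=0)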
You can rewrite this as an explicit groupby apply:
In [11]: df.groupby("state")["airport"].apply(lambda x: x.value_counts() / len(x))
Out[11]:
state
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
or store the groupby and reuse it (probably this is faster):
In [21]: g = df.groupby("state")["airport"]
In [22]: g.value_counts() / g.size()
Out[22]:
state airport
CO Meadow Lake 1.0
FL Hilliard Airpark 1.0
MS Thigpen 1.0
NY Perry-Warsaw 1.0
TX Livingston Municipal 1.0
Name: airport, dtype: float64
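For what it's worth, value_counts also accepts a normalize flag on grouped data, which gives the same within-state proportions a little more directly (assuming a reasonably recent pandas version):
df.groupby("state")["airport"].value_counts(normalize=True)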
This seemed to work the way I intended, with all your help. Here state is an input in the form of a state abbreviation ('MA'), and the snippet returns the probability of a randomly selected airport belonging to that state. Wrapped up as a function (the name is just illustrative):
def airport_probability(df, state):
    a = df.groupby('state').iata.nunique()
    s = a.sum()
    result = a[state] / s
    return result
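As a usage sketch, calling it on the five-row sample from the question gives 0.2 for any of the listed states, since each state there has exactly one distinct airport:
print(airport_probability(df, 'FL'))
# 0.2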