This is the look of my DataFrame:
StateAb GivenNm Surname PartyNm PartyAb ElectedOrder
35 WA Joe BULLOCK Australian Labor Party ALP 2
36 WA Michaelia CASH Liberal LP 3
37 WA Linda REYNOLDS Liberal LP 4
38 WA Wayne DROPULICH Australian Sports Party SPRT 5
39 WA Scott LUDLAM The Greens (WA) GRN 6
and I want to get a list of senators whose surname is more than 9 characters long.
So I think the code should be like this:
df[len(df.Surname) > 9]
but this raises a KeyError. Where did I go wrong?
The correct way to filter a DataFrame based on the length of strings in a column is
df[df['Surname'].str.len() > 9]
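df['Surname'].str.len() creates a Series with the length of each surname, and df[df['Surname'].str.len() > 9] keeps only the rows where that length is greater than 9. Your version, len(df.Surname), measures the Series itself (how many rows it has), so the comparison yields a single True or False, which pandas then looks up as a column label. That lookup is what raises the KeyError.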
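A minimal sketch of the difference, using a small stand-in for the senators table. Only the Surname column is reproduced, and the last row (FEATHERSTONE) is a made-up entry added so that the filter has something to return:

```python
import pandas as pd

# Stand-in for the senators table; the last surname is hypothetical.
df = pd.DataFrame(
    {"Surname": ["BULLOCK", "CASH", "REYNOLDS", "DROPULICH", "LUDLAM", "FEATHERSTONE"]}
)

# len() on the Series gives the number of rows, a single integer.
row_count = len(df["Surname"])

# .str.len() gives one length per row, so the comparison is a boolean Series.
lengths = df["Surname"].str.len()
mask = lengths > 9

# Indexing with the boolean Series keeps only the rows where the mask is True.
long_names = df[mask]
print(long_names)
```

The boolean mask has one entry per row, which is what DataFrame indexing needs in order to filter rows.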
Have a look at Python's built-in filter function. It does what you want if your data is a plain list of dicts rather than a DataFrame.
df = [
{"Surname": "Bullock-ish"},
{"Surname": "Cash"},
{"Surname": "Reynolds"},
]
longnames = list(filter(lambda s: len(s["Surname"]) > 9, df))
print(longnames)
>>[{'Surname': 'Bullock-ish'}]
Sytse
I have this DataFrame, and I want to extract the cities into a separate column. As you can see, the format is not consistent, and the city can be anywhere in the row. How can I extract only the cities into a new column?
Hint: these are German cities. Maybe I could find a dictionary that lists all German cities and somehow compare it with my dataset?
Here is a dictionary of German cities: https://gist.github.com/embayer/772c442419999fa52ca1
Dataframe
Adresse
0 Karlstr 10, 10 B, 30,; 04916 Hamburg
1 München Dorfstr. 28-55, 22555
2 Marnstraße. Berlin 12, 45666 Berlin
3 Musterstr, 24855 Dresden
... ...
850 Muster Hausweg 11, Hannover, 56668
851 Mariestr. 4, 48669 Nürnberg
852 Hilden Weederstr 33-55, 56889
853 Pt-gaanen-Str. 2, 45883 Potsdam
Output
Cities
0 Hamburg
1 München
2 Berlin
3 Dresden
... ...
850 Hannover
851 Nürnberg
852 Hilden
853 Potsdam
You could extract all the cities from the dictionary you provided into a list (I assume it's the 'stadt' key), and then use str.findall on your column:
cities_ = [city['stadt'] for city in cities]
df.Adresse.str.findall(r'|'.join(cities_))
>>>
0 [Karlstr, Hamburg]
1 []
2 []
3 []
4 []
5 []
6 []
7 []
8 []
Name: Adresse, dtype: object
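A self-contained sketch of the same regex idea, under stated assumptions: the cities list below is a short hypothetical stand-in for the 'stadt' values from the linked file, and the sample addresses are the first four rows from the question. Each name is escaped and wrapped in word boundaries so that only whole words match, and str.extract keeps the first match per row:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {
        "Adresse": [
            "Karlstr 10, 10 B, 30,; 04916 Hamburg",
            "München Dorfstr. 28-55, 22555",
            "Marnstraße. Berlin 12, 45666 Berlin",
            "Musterstr, 24855 Dresden",
        ]
    }
)

# Hypothetical short list; in practice this would come from the cities file.
cities = ["Hamburg", "München", "Berlin", "Dresden"]

# Escape each name and require word boundaries so partial words do not match.
pattern = r"\b(" + "|".join(re.escape(c) for c in cities) + r")\b"

# expand=False returns a Series holding the first match in each row (NaN if none).
df["Cities"] = df["Adresse"].str.extract(pattern, expand=False)
print(df)
```

Word boundaries keep "Berlin" from matching inside a longer word such as "Berliner", but a street that is literally named after a city would still match, so the result is worth spot-checking.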
You can simply use str.extract, since all the city names are wrapped in a pair of double asterisks.
df["cities"] = df["Adresse"].str.extract(r'\*\*(\w+)\*\*')
Since it seems the stars are not present in your file, you can do it differently.
Take the dictionary of cities from the file you linked (called cities here), but keep only the unique city names by putting them in a set.
german_cities = set(map(lambda x: x['stadt'], cities))
Then we split the address string of each row and look each piece up in the set of German cities.
apply passes each address string to the function as its first argument, so we only need to hand over the set of German cities through args.
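def lookup_cities(string, cities):
    # Drop commas, then split the address into space-separated tokens.
    splits = string.replace(",", "").split(" ")
    for s in splits:
        # Return the first token that is a known city.
        if s in cities:
            return s
    return "NaN"
df["Adresse"].apply(lookup_cities, args=(german_cities,))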
Now, if you find any "NaN", then either a city in your document has a typo or there are several ways to write it; you'll have to investigate that yourself.
P.S.: I had to remove all the spaces in the cities file, otherwise the names wouldn't match. It was just a matter of using find and replace all in my editor.
You can use a regular expression to extract the city names, as they are marked with **:
import re
import pandas as pd
df = pd.DataFrame({"Adresse": ["Karlstr 10, 10 B, 30,; 04916 **Hamburg**", "**München** Dorfstr. 28-55, 22555", "Marnstraße. Berlin 12, 45666 **Berlin**", "Musterstr, 24855 **Dresden**"]})
df['Cities'] = [re.findall(r".*\*\*(.*)\*\*", address)[0] for address in df['Adresse']]
This results in:
df
Adresse Cities
0 Karlstr 10, 10 B, 30,; 04916 **Hamburg** Hamburg
1 **München** Dorfstr. 28-55, 22555 München
2 Marnstraße. Berlin 12, 45666 **Berlin** Berlin
3 Musterstr, 24855 **Dresden** Dresden
Hello, I have this Pandas code (see below), but it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the values in the 'Age' column. Say I want to increase every value in 'Age' by 1; how do I do that? (The same goes for Number of Children.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
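The error comes from 'Age' + 1, which tries to add an integer to the column name. Apply the arithmetic to the column instead; going by your tables, that is +12 for Age and +1 for Number of Children. Try (with df being your DataFrame):
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1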
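A small runnable sketch of column arithmetic, assuming a hand-built frame in place of the Excel file (the name people and the three rows are only for illustration). The key point is that the number is added to the selected column, not to the column label inside the brackets:

```python
import pandas as pd

# Illustrative stand-in for the data read from the Excel file.
people = pd.DataFrame(
    {
        "First Name": ["Kimberly", "Victor", "Adrian"],
        "Age": [24, 23, 23],
        "Number of Children": [1, 5, 1],
    }
)

# Select the column first, then add; the result is assigned back to the column.
people["Age"] = people["Age"] + 1

# The in-place operator does the same thing for another column.
people["Number of Children"] += 1

print(people)
```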
I want to count how often each city appears in the City column of a pandas DataFrame. Each value in the City column is a list. How would I get the city frequencies, like:
Lahore 3
Karachi 2
Sydney 1
etc.
Sample dataframe:
Name Age City
a jack 34 [Sydney,Delhi]
b Riti 31 [Lahore,Delhi]
c Aadi 16 [New York, Karachi, Lahore]
d Mohit 32 [Peshawar,Delhi, Karachi]
Thank you
Let us try explode + value_counts
out = df.City.explode().value_counts()
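A runnable sketch with the sample data from the question, assuming the City cells are real Python lists. If they are strings such as "[Sydney,Delhi]", they would need to be parsed into lists first:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["jack", "Riti", "Aadi", "Mohit"],
        "Age": [34, 31, 16, 32],
        "City": [
            ["Sydney", "Delhi"],
            ["Lahore", "Delhi"],
            ["New York", "Karachi", "Lahore"],
            ["Peshawar", "Delhi", "Karachi"],
        ],
    },
    index=["a", "b", "c", "d"],
)

# explode() turns each list element into its own row (the index label repeats),
# and value_counts() then counts how often each city occurs.
out = df["City"].explode().value_counts()
print(out)
```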
I am trying to parse the table located here using the Pandas read_html function. I was able to parse the table, but the Capacity column came back as NaN, and I am not sure why. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url='Above url'
df1=pd.read_html(wiki_url,index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df = df[0]
print(df.head())
Image Stadium City State \
0 NaN Aggie Memorial Stadium Las Cruces NM
1 NaN Alamodome San Antonio TX
2 NaN Alaska Airlines Field at Husky Stadium Seattle WA
3 NaN Albertsons Stadium Boise ID
4 NaN Allen E. Paulson Stadium Statesboro GA
Team Conference Capacity \
0 New Mexico State Independent 30,343[1]
1 UTSA C-USA 65000
2 Washington Pac-12 70,500[2]
3 Boise State Mountain West 36,387[3]
4 Georgia Southern Sun Belt 25000
.............................
.............................
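To remove anything inside square brackets, use the following (regex=True is needed on pandas 2.0 and later, where str.replace treats the pattern as a literal string by default):
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)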
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Hope this helps.
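If the capacities are needed as numbers for further analysis, the cleanup can be taken one step further. This is a sketch on a hand-typed Series that copies the five values shown above, and it assumes every cell is a string; the names capacity and cleaned are only for illustration:

```python
import pandas as pd

# Hand-typed copy of the first five Capacity values, all as strings.
capacity = pd.Series(["30,343[1]", "65000", "70,500[2]", "36,387[3]", "25000"])

cleaned = (
    capacity
    .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers like [1]
    .str.replace(",", "", regex=False)        # drop thousands separators
    .astype(int)                              # convert the digit strings to integers
)
print(cleaned)
```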
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (if they have footnotes) and NaN otherwise.
You may want to look into alternative ways of fetching the data, or scrape it yourself using BeautifulSoup, since Pandas is looking at, and therefore returning, the wrong data.
The answer posted by #anky_91 was correct. I wanted to try another approach without using regex; below is my solution.
df4=pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums',header=[0],flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" used by #anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object
population = pd.DataFrame({'village': pd.Series([15,4,1,2], index=['boys','girls','men','women']),
'town': pd.Series([20,36,26,28], index=['boys','girls', 'men', 'women'])})
Output:
---- town village
boys 20 15
girls 36 4
men 26 1
women 28 2
For every index label in the DataFrame above, I want the value to be the minimum of the values at the previous two index labels.
For example, I expect the value for men in town to be 20, since it is the smaller of (36, 20).
I tried df.shift(2).cummin(axis=0), but that didn't work.
Expected_output:
---- town village
boys NaN NaN
girls NaN NaN
men 20 4
women 26 1
As #Zero said (posting it here so you can mark this as answered), you can use:
population.shift(1).rolling(2).min()
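A runnable sketch of why this works, using the data from the question. shift(1) moves every value down one row, so a window of size 2 ending at a given row covers exactly the two previous rows of the original frame:

```python
import pandas as pd

population = pd.DataFrame(
    {"village": [15, 4, 1, 2], "town": [20, 36, 26, 28]},
    index=["boys", "girls", "men", "women"],
)

# After shift(1), the row "men" holds the values of "girls" and the row
# "girls" holds the values of "boys".
shifted = population.shift(1)

# A rolling window of 2 over the shifted frame takes the minimum of the two
# previous original rows; the first two rows have no full window and stay NaN.
result = shifted.rolling(2).min()
print(result)
```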