I have a two-column CSV:
Name, Sport
Abraham, Soccer
Adam, Basketball
Adam, Soccer
John, Soccer
Jacob, Tennis
Jacob, Soccer
What is the simplest way to convert this into an XLS or CSV file that, when opened in MS Excel, looks something like:
         Basketball  Soccer  Tennis
Abraham              X
Adam     X           X
John                 X
Jacob                X       X
I would consider pandas a suitable package for this kind of application. The centerpiece of pandas is the dataframe object (df), which is in essence a table of your data. CSV files can be read into pandas using read_csv:
import pandas as pd
df = pd.read_csv('filename.csv')
In [3]:df
Out[3]:
      Name       Sport
0  Abraham      Soccer
1     Adam  Basketball
2     Adam      Soccer
3     John      Soccer
4    Jacob      Tennis
5    Jacob      Soccer
There is a pandas method, crosstab, that does what you want as simply as:
table = pd.crosstab(df['Name'], df['Sport'])
In [4]:table
Out[4]:
Sport    Basketball  Soccer  Tennis
Name
Abraham           0       1       0
Adam              1       1       0
Jacob             0       1       1
John              0       1       0
Then you can convert back to a csv file with
table.to_csv('filename.csv')
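If you want the X-style layout from the question instead of 0/1 counts, a minimal sketch (this assumes each name/sport pair occurs at most once, so the counts are only 0 or 1; the file name is a placeholder):

# map zero counts to blanks and ones to 'X' before writing out
table.replace({0: '', 1: 'X'}).to_csv('sports_matrix.csv')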
my df looks like this:

category    text
--------    ----
soccer      soccer game is good
soccer      soccer game
basketball  game basketball
basketball  game
volleyball  sport volleyball sport
What I want to do is group by category and then list the words by their frequency:
category    text        frequency
--------    ----        ---------
soccer      soccer      2
            game        2
            is          1
            good        1
basketball  game        2
            basketball  1
volleyball  sport       2
            volleyball  1
What did I do? I grouped all the text together:

df.groupby(['category'])['text'].sum()

Now all the text for each category is concatenated into a single row, but I do not know how to build a frequency table from the word counts.
Could someone please help me?
#Method 1:
You can use Series.str.split with explode and then groupby.value_counts:

# split each text into words, give each word its own row,
# then count the words within each category
(df.assign(text=df['text'].str.split())
   .explode("text")
   .groupby("category", sort=False)['text']
   .value_counts())
category    text
soccer      game          2
            soccer        2
            good          1
            is            1
basketball  game          2
            basketball    1
volleyball  sport         2
            volleyball    1
Name: text, dtype: int64
#Method 2:
For older versions of pandas, use np.concatenate and Index.repeat with df.join (there are other methods listed here):

import numpy as np

# flatten the split words and repeat each row's index once per word,
# then join the words back onto the category column before counting
s = df['text'].str.split()
(df[['category']].join(pd.Series(np.concatenate(s),
                                 index=df.index.repeat(s.str.len()),
                                 name='text'))
   .groupby("category", sort=False)['text'].value_counts())
#Method 3: using MultiLabelBinarizer from sklearn (note: the binarizer records the presence of each word per row rather than its count, so a word repeated within a single row, like sport in the volleyball row, counts once)
from sklearn.preprocessing import MultiLabelBinarizer
import numpy as np

s = df['text'].str.split()
mlb = MultiLabelBinarizer()
mlb.fit(s)

# one indicator column per word, summed within each category
out = pd.DataFrame(mlb.transform(s), columns=mlb.classes_).groupby(df['category']).sum()
out.replace(0, np.nan).stack().astype(int)
category
basketball  basketball    1
            game          2
soccer      game          2
            good          1
            is            1
            soccer        2
volleyball  sport         1
            volleyball    1
dtype: int32
value_counts is the right way; it is usable inside a groupby too, after splitting the text into words.
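A minimal sketch of that comment, assuming the same category/text columns as above:

# one word per row, then per-category word counts
(df.assign(word=df['text'].str.split())
   .explode('word')
   .groupby('category')['word']
   .value_counts())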
Basically, I have a table like the following:
Name    Sport   Frequency
Jonas   Soccer  3
Jonas   Tennis  5
Jonas   Boxing  4
Mathew  Soccer  2
Mathew  Tennis  1
John    Boxing  2
John    Boxing  3
John    Soccer  1
Let's say this is a standard table that I will transform into a pandas DataFrame, using the groupby function just like this:
table = df.groupby(['Name'])
After the dataframe is created, I want to delete all the rows where the frequency of any sport other than Soccer is greater than that name's Soccer frequency.
So I need to apply the following conditions:
1. Identify where Soccer is present; and then
2. If so, identify whether any other sport is present; and finally
3. Delete the rows where the sport is anything other than Soccer and its frequency is greater than the Soccer frequency associated with that name (used in the groupby function).
So, the output would be something like:
Name    Sport   Frequency
Jonas   Soccer  3
Mathew  Soccer  2
Mathew  Tennis  1
John    Soccer  1
Thank you for your support
This is one way to go about it, by iterating through the groups:
pd.concat(
    [
        # temp holds the group's Soccer frequency; bfill/ffill spread it to
        # every row so each sport can be compared against it
        value.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
             .bfill()
             .ffill()
             .query("Frequency <= temp")
             .drop('temp', axis=1)
        for key, value in df.groupby("Name")
    ]
)
     Name   Sport  Frequency
7    John  Soccer          1
0   Jonas  Soccer          3
3  Mathew  Soccer          2
4  Mathew  Tennis          1
You could also create a categorical type for the Sport column, sort the dataframe, then filter:
sport_dtype = pd.api.types.CategoricalDtype(categories=df.Sport.unique(), ordered=True)
df = df.astype({"Sport": sport_dtype})

(
    df.sort_values(["Name", "Sport"], ascending=[False, True])
      .assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
      .ffill()
      .query("Frequency <= temp")
      .drop('temp', axis=1)
)
     Name   Sport  Frequency
3  Mathew  Soccer          2
4  Mathew  Tennis          1
0   Jonas  Soccer          3
7    John  Soccer          1
Note that this works because Soccer is the first entry in the Sport column; if it is not, you have to reorder the categories so that Soccer comes first.
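A short sketch of that reordering, in case Soccer is not already the first unique value:

# put "Soccer" first and keep the remaining sports in their original order
cats = ['Soccer'] + [s for s in df.Sport.unique() if s != 'Soccer']
sport_dtype = pd.api.types.CategoricalDtype(categories=cats, ordered=True)
df = df.astype({"Sport": sport_dtype})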
Another option is to get the index of rows that meet our criteria and use it to filter the dataframe:
index = (
    df.assign(temp=lambda x: x.loc[x.Sport == "Soccer", "Frequency"])
      .groupby("Name")
      .pipe(lambda x: x.ffill().bfill())
      .query("Frequency <= temp")
      .index
)

df.loc[index]
     Name   Sport  Frequency
0   Jonas  Soccer          3
3  Mathew  Soccer          2
4  Mathew  Tennis          1
7    John  Soccer          1
A bit surprised that I lost the grouping index though.
UPDATE: Gave this some thought; this may be a simpler solution: find the rows where the sport is Soccer, or where the group's mean of the Soccer indicator is greater than or equal to 0.5. The mean ensures that Soccer rows are not outnumbered by the other sports.
(df.assign(temp=df.Sport == "Soccer",
           temp2=lambda x: x.groupby("Name").temp.transform("mean"))
   .query('Sport == "Soccer" or temp2 >= 0.5')
   .iloc[:, :3]
)
I want to read a file that has a partial header, i.e. some columns have names and some do not. I want to read the file as it is, keeping the names of the columns that already have names and leaving the rest as they are. Is there any clean way to do that in pandas?
The short answer to your question is no, since pandas dataframes cannot have more than one empty column name. If you try to import a .csv file with multiple empty column names, you won't get the behavior you expect: pandas will fill in the empty column names with Unnamed: 0, Unnamed: 1, and so on (or possibly something else if you have a space in place of the column name in the .csv file).
For example, this .csv file, with the names of columns 0, 3, 4, and 5 removed...
,Doe,120 jefferson st.,,,
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
...will get imported in the following way:
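For reference, a plain read is enough to reproduce this (addresses.csv is a hypothetical file name):

import pandas as pd

df = pd.read_csv('addresses.csv')
print(df)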
              Unnamed: 0       Doe                 120 jefferson st.   Unnamed: 3 Unnamed: 4  Unnamed: 5
0                   Jack  McGinnis                      220 hobo Av.        Phila         PA        9119
1          John "Da Man"    Repici                 120 Jefferson St.    Riverside         NJ        8075
2                Stephen     Tyler  7452 Terrace "At the Plaza" road     SomeTown         SD       91234
3                    NaN  Blankman                               NaN     SomeTown         SD         298
4  Joan "the bone", Anne       Jet               9th, at Terrace plc  Desert City         CO         123
If, for example, you have missing column names for columns 1 and 2, you will have this structure after reading the file normally with pandas:
df.head()
Unnamed: 0 Unnamed: 1 col3 col4 col5
0 .. ..
1 .. ..
After reading the df, you can rename the unnamed columns as below (rename returns a new dataframe, so assign it back):

df = df.rename(columns={'Unnamed: 0': 'Col1', 'Unnamed: 1': 'Col2'})
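Alternatively, if you know the full set of column names up front, you can supply them all while reading and override the partial header row entirely (the names below are placeholders):

# header=0 consumes the partial header row; names= replaces it completely
df = pd.read_csv('filename.csv', header=0, names=['Col1', 'Col2', 'col3', 'col4', 'col5'])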
I have a fairly large dataset that I would like to split into separate excel files based on the names in column A ("Agent" column in the example provided below). I've provided a rough example of what this data-set looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1

Agent        Business Name    Business ID  Revenue
John Doe     Bobs Ice Cream   12234        $400
John Doe     Car Repair       445848       $2331
John Doe     Corner Store     243123       $213
John Doe     Cool Taco Stand  2141244      $8912
Jane Doe     Fresh Ice Cream  9271499      $2143
Jane Doe     Breezy Air       0123801      $3412
Steve Smith  Big Golf Range   12938192     $9912
Steve Smith  Iron Gyms        1231233      $4133
Steve Smith  Tims Tires       82489233     $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.
I would loop over the groups of names, then save each group to its own Excel file:

s = df.groupby('Agent')
for name, group in s:
    # use .xlsx: modern pandas writes it via openpyxl, while .xls support has been removed
    group.to_excel(f"{name}.xlsx")
Use list comprehension with groupby on the Agent column:

dfs = [d for _, d in df.groupby('Agent')]
for d in dfs:
    print(d, '\n')
Output

      Agent    Business Name  Business ID Revenue
4  Jane Doe  Fresh Ice Cream      9271499   $2143
5  Jane Doe       Breezy Air       123801   $3412

      Agent    Business Name  Business ID Revenue
0  John Doe   Bobs Ice Cream        12234    $400
1  John Doe       Car Repair       445848   $2331
2  John Doe     Corner Store       243123    $213
3  John Doe  Cool Taco Stand      2141244   $8912

         Agent   Business Name  Business ID Revenue
6  Steve Smith  Big Golf Range     12938192   $9912
7  Steve Smith       Iron Gyms      1231233   $4133
8  Steve Smith      Tims Tires     82489233    $781
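To actually write each of these frames to its own workbook, a short follow-up sketch (assumes an Excel writer engine such as openpyxl is installed):

for d in dfs:
    # every row in a group shares the same agent, so take it from the first row
    agent = d['Agent'].iloc[0]
    d.to_excel(f"{agent}.xlsx", index=False)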
Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:

import pandas as pd

# make up some data
ex1 = pd.DataFrame([['A', 1], ['A', 2], ['B', 3], ['B', 4]], columns=['letter', 'number'])

# iterate over the grouped data and export the data frames to Excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe, to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)
Use the unique values in the column to subset the data and write it to csv using the name:
import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
If you need Excel:

import pandas as pd

for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")
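A variation, if you would rather have a single workbook with one sheet per agent (a sketch; note Excel caps sheet names at 31 characters):

import pandas as pd

with pd.ExcelWriter("agents.xlsx") as writer:
    for name, group in df.groupby('Agent'):
        group.to_excel(writer, sheet_name=str(name)[:31], index=False)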
I am trying to parse the table located here using the pandas read_html function. I was able to parse the table; however, the Capacity column came back as NaN. I am not sure what the reason could be. I would like to parse the entire table and use it for further research, so any help is appreciated. Below is my code so far:
wiki_url = 'Above url'
df1 = pd.read_html(wiki_url, index_col=0)
Try something like this (include flavor as bs4):
df = pd.read_html(r'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums', header=[0], flavor='bs4')
df = df[0]
print(df.head())
  Image                                 Stadium         City State  \
0   NaN                  Aggie Memorial Stadium   Las Cruces    NM
1   NaN                               Alamodome  San Antonio    TX
2   NaN  Alaska Airlines Field at Husky Stadium      Seattle    WA
3   NaN                      Albertsons Stadium        Boise    ID
4   NaN                Allen E. Paulson Stadium   Statesboro    GA

               Team     Conference   Capacity  \
0  New Mexico State    Independent  30,343[1]
1              UTSA          C-USA      65000
2        Washington         Pac-12  70,500[2]
3       Boise State  Mountain West  36,387[3]
4  Georgia Southern       Sun Belt      25000
.............................
To remove anything inside square brackets, use:

# regex=True is needed in newer pandas, where the default changed to literal matching
df.Capacity = df.Capacity.str.replace(r"\[.*\]", "", regex=True)
print(df.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
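If you also want the column to be numeric, a follow-up sketch that strips the thousands separators as well (errors='coerce' turns anything unparseable into NaN):

df.Capacity = pd.to_numeric(df.Capacity.str.replace(',', '', regex=False), errors='coerce')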
Hope this helps.
Pandas is only able to get the superscript (for whatever reason) rather than the actual value. If you print all of df1 and check the Capacity column, you will see that some of the values are [1], [2], etc. (where they have footnotes) and NaN otherwise.
You may want to look into alternatives for fetching the data, or scrape the data yourself using BeautifulSoup, since pandas is picking up, and therefore returning, the wrong data.
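A rough sketch of that BeautifulSoup route (assumes requests and bs4 are installed; "wikitable" is the CSS class Wikipedia uses for such tables):

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
table = soup.find('table', class_='wikitable')

# print the raw cell text of the first few data rows
for row in table.find_all('tr')[1:4]:
    print([cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])])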
The answer posted by @anky_91 was correct. I wanted to try another approach without using regex. Below is my solution.
df4 = pd.read_html('https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums', header=[0], flavor='bs4')
df4 = df4[0]
The solution was to take out the "r" prefix presented by @anky_91 in line 1 and line 4.
print(df4.Capacity.head())
0 30,343
1 65000
2 70,500
3 36,387
4 25000
Name: Capacity, dtype: object