Loading a random sample from CSV with pandas - python

I have a CSV of the format
Team, Player
What I want to do is apply a filter to the field Team, then take a random subset of 3 players from EACH team.
So for instance, my CSV looks like :
Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
...
I want to end up with an XLS consisting of 3 random players from each team, and only 1 or 2 in the case where there is less than 3 e.g,
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
I started out using XLRD, my original post is here.
I am now trying to use Pandas as I believe this will be more flexible into the future.
So, in psuedocode what I want to do is :
foreach(team in csv)
print random 3 players + team they are assigned to
I've been looking through Pandas and trying to find the best approach to doing this, but I can't find anything similar to what I want to do (it's a difficult thing to Google!). Here's my attempt so far :
import pandas as pd
from collections import defaultdict
import csv as csv
columns = defaultdict(list) # each value in each column is appended to a list
with open('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv') as f:
reader = csv.DictReader(f) # read rows into a dictionary format
for row in reader: # read a row as {column1: value1, column2: value2,...}
print(row)
#for (k,v) in row.items(): # go over each column name and value
# columns[k].append(v) # append the value into the appropriate list
# based on column name k
So I have commented out the last two lines as I am not really sure if I am needed. I now each row being printed, so I just need to select a random 3 rows per each football team (or 1 or 2 in the case where there are less).
How can I accomplish this ? Any tips/tricks?
Thanks.

First use the better optimised read_csv:
import pandas as pd
df = pd.read_csv('DataFrame')
Now as a random example, use a lambda to get a random subset by randomizing the dataframe (replace 'x' with LivFC for example):
In []
df= pd.DataFrame()
df['x'] = np.arange(0, 10, 1)
df['y'] = np.arange(0, 10, 1)
df['x'] = df['x'].astype(str)
df['y'] = df['y'].astype(str)
df['x'].ix[np.random.random_integers(0, len(df), 10)][:3]
Out [382]:
0 0
3 3
7 7
Name: x, dtype: object
This will make you more familiar with pandas, however starting with version 0.16.x, there is now a DataFrame.sample method built-in:
df = pandas.DataFrame(data)
# Randomly sample 70% of your dataframe
df_0.7 = df.sample(frac=0.7)
# Randomly sample 7 elements from your dataframe
df_7 = df.sample(n=7)
For either approach above, you can get the rest of the rows by doing:
df_rest = df.loc[~df.index.isin(df_0.7.index)]

Related

Splitting column by multiple custom delimiters in Python

I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Where each two-letter code preceding each bubbled section ( ) is the title of the desired column, and are the same in every row. The only data that changes is what is inside the bubbles. I want the data to look like:
pn
io
ta
pt
cn
cs
2021
302
Yes
Blue
John
Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got an error, error: missing ), unterminated subpattern at position 2, assuming it has something to do with regular expressions.
Is there an easy way to split these ? Thanks.
Use extract with named capturing groups (see here):
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:
import re
rows = [{r[0]:r[1] for r in re.findall(r'(\w{2})\((.+)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
df[col] = subtable[col].values
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
David
Try with extractall:
names = df["Creative"].str.extractall("(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall("\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall('([^\(]+)\(([^\)]+)\)', txt))
df = pd.DataFrame([data[1]], columns=data[0])

How to read a file containing more than one record type within?

I have a .csv file that contains 3 types of records, each with different quantity of columns.
I know the structure of each record type and that the rows are always of type1 first, then type2 and type 3 at the end, but I don't know how many rows of each record type there are.
The first 4 characters of each row define the record type of that row.
CSV Example:
typ1,John,Smith,40,M,Single
typ1,Harry,Potter,22,M,Married
typ1,Eva,Adams,35,F,Single
typ2,2020,08,16,A
typ2,2020,09,02,A
typ3,Chevrolet,FC101TT,2017
typ3,Toyota,CE972SY,2004
How can I read It with Pandas? It doesn't matter if I have to read one record type each time.
Thanks!!
Here it is a pandas solution.
First we must read the csv file in a way that pandas keeps the entires lines in one cell each. We do that by simply using a wrong separator, such as the 'at' symbol '#'. It can be whatever we want, since we guarantee it won't ever appear in our data file.
wrong_sep = '#'
right_sep = ','
df = pd.read_csv('my_file.csv', sep=wrong_sep).iloc[:, 0]
The .iloc[:, 0] is used as a quick way to convert a DataFrame into a Series.
Then we use a loop to select the rows that belong to each data structure based on their starting characters. Now we use the "right separator" (probably a comma ',') to split the desired data into real DataFrames.
starters = ['typ1', 'typ2', 'typ3']
detected_dfs = dict()
for start in starters:
_df = df[df.str.startswith(start)].str.split(right_sep, expand=True)
detected_dfs[start] = _df
And here you go. If we print the resulting DataFrames, we get:
0 1 2 3 4 5
0 typ1 Harry Potter 22 M Married
1 typ1 Eva Adams 35 F Single
0 1 2 3 4
2 typ2 2020 08 16 A
3 typ2 2020 09 02 A
0 1 2 3
4 typ3 Chevrolet FC101TT 2017
5 typ3 Toyota CE972SY 2004
Let me know if it helped you!
Not Pandas:
from collections import defaultdict
filename2 = 'Types.txt'
with open(filename2) as dataLines:
nL = dataLines.read().splitlines()
defDList = defaultdict(list)
subs = ['typ1','typ2','typ3']
dataReadLines = [defDList[i].append(j) for i in subs for j in nL if i in j]
# dataReadLines = [i for i in nL]
print(defDList)
Output:
defaultdict(<class 'list'>, {'typ1': ['typ1,John,Smith,40,M,Single', 'typ1,Harry,Potter,22,M,Married', 'typ1,Eva,Adams,35,F,Single'], 'typ2': ['typ2,2020,08,16,A', 'typ2,2020,09,02,A'], 'typ3': ['typ3,Chevrolet,FC101TT,2017', 'typ3,Toyota,CE972SY,2004']})
You can make use of the skiprows parameter of pandas read_csv method to skip the rows you are not interested in for a particular record type. The following gives you a dictionary dfs of dataframes for each type. An advantage is that records of the same types don't necessarily have to be adjacent to each other in the csv file.
For larger files you might want to adjust the code such that the file is only read once instead of twice.
import pandas as pd
from collections import defaultdict
indices = defaultdict(list)
types = ['typ1', 'typ2', 'typ3']
filename = 'test.csv'
with open(filename) as csv:
for idx, line in enumerate(csv.readlines()):
for typ in types:
if line.startswith(typ):
indices[typ].append(idx)
dfs = {typ: pd.read_csv(filename, header=None,
skiprows=lambda x: x not in indices[typ])
for typ in types}
Read the file as a CSV file using the CSV reader. The reader fortunately does not care about line formats:
import csv
with open("yourfile.csv") as infile:
data = list(csv.reader(infile))
Collect the rows with the same first element and build a dataframe of them:
import pandas as pd
from itertools import groupby
dfs = [pd.DataFrame(v) for _,v in groupby(data, lambda x: x[0])]
You've got a list of three dataframes (or as many as necessary).
dfs[1]
# 0 1 2 3 4
#0 typ2 2020 08 16 A
#1 typ2 2020 09 02 A

How to join columns in CSV files using Pandas in Python

I have a CSV file that looks something like this:
# data.csv (this line is not there in the file)
Names, Age, Names
John, 5, Jane
Rian, 29, Rath
And when I read it through Pandas in Python I get something like this:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
And the output of the program is:
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
Is there any way to get:
Names Age
0 John 5
1 Rian 29
2 Jane
3 Rath
First, I'd suggest having unique names for each column. Either go into the csv file and change the name of a column header or do so in pandas.
Using 'Names2' as the header of the column with the second occurence of the same column name, try this:
Starting from
datalist = [['John', 5, 'Jane'], ['Rian', 29, 'Rath']]
df = pd.DataFrame(datalist, columns=['Names', 'Age', 'Names2'])
We have
Names Age Names
0 John 5 Jane
1 Rian 29 Rath
So, use:
dff = pd.concat([df['Names'].append(df['Names2'])
.reset_index(drop=True),
df.iloc[:,1]], ignore_index=True, axis=1)
.fillna('').rename(columns=dict(enumerate(['Names', 'Ages'])))
to get your desired result.
From the inside out:
df.append combines the columns.
pd.concat( ... ) combines the results of df.append with the rest of the dataframe.
To discover what the other commands do, I suggest removing them one-by-one and looking at the results.
Please forgive the formating of dff. I'm trying to make everything clear from an educational perspective.
Adjust indents so the code will compile.
You can use:
usecols which helps to read only selected columns.
low_memory is used so that we Internally process the file in chunks.
import pandas as pd
data = pd.read_csv("data.csv", usecols = ['Names','Age'], low_memory = False))
print(data)
Please have unique column name in your csv

pandas row manipulation - If startwith keyword found - append row to end of previous row

I have a question regarding text file handling. My text file prints as one column. The column has data scattered throughout the rows and visually looks great & somewhat uniform however, still just one column. Ultimately, I'd like to append the row where the keyword is found to the end of the top previous row until data is one long row. Then I'll use str.split() to cut up sections into columns as I need.
In Excel (code below-Top) I took this same text file and removed headers, aligned left, and performed searches for keywords. When found, Excel has a nice feature called offset where you can place or append the cell value basically anywhere using this offset(x,y).value from the active-cell start position. Once done, I would delete the row. This allowed my to get the data into a tabular column format that I could work with.
What I Need:
The below Python code will cycle down through each row looking for the keyword 'Address:'. This part of the code works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is. I can not find a way to get the active row number into a variable so I can use in place of the word [index] for the active row. Or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd
from pandas import DataFrame, Series
file = {'Test': ['Last Name: Nobody','First Name: Tommy','Address: 1234 West Juniper St.','Fav
Toy', 'Notes','Time Slot' ] }
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
if line.startswith('Address:'):
df.loc[[index-1],:].values = df.loc[index-1].values + ' ' + df.loc[index].values
Line above does not work with index statement
else:
pass
# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values # copies row 2 at the end of row 1,
# works with static row numbers only
# df.drop([2,0], inplace=True) # Deletes row from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the entire series vectorization approach but still stuck trying loops that I'm semi familiar with. If there is a way to achieve this please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot

Find the most string value in a whole dataframe / Pandas

I have a dataframe with 4 columns each containing actor names.
The actors are present in several columns and I want to find the actor or actress most present in all the dataframe.
I used mode and but it doesn't work, it gives me the most present actor in each column
I would strongly advise you to use the Counter class in python. Thereby, you can simply add whole rows and columns into the object. The code would look like this:
import pandas as pd
from collections import Counter
# Artifically creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
# Creating counter
counter = Counter()
# inserting the whole row into the counter
for _, row in df.iterrows():
counter.update(row)
print("counter object:")
print(counter)
# We show the two most common actors
for actor, occurences in counter.most_common(2):
print("Actor {} occured {} times".format(actor, occurences))
The output would look like this:
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occured 4 times
Actor Morgan Freeman occured 3 times
The counter object solves your problem quite fast but be aware that the counter.update-function expects lists. You should not update with pure strings. If you do it like this, your counter counts the single chars.
Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using #Ofi91 setup:
# Artifically creating DataFrame
actors = [
["Will Smith","Johnny Depp","Johnny Depp","Johnny Depp"],
["Will Smith","Morgan Freeman","Morgan Freeman","Morgan Freeman"],
["Will Smith","Mila Kunis","Mila Kunis","Mila Kunis"],
["Will Smith","Charlie Sheen","Charlie Sheen","Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find most number of appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
Let's consider your data frame to be like this
First we stack all columns to 1 column.
Use the below code to achieve that
df1 = pd.DataFrame(df.stack().reset_index(drop=True))
Now, take the value_counts of the actors column using the code
df2 = df1['actors'].value_counts().sort_values(ascending = False)
Here you go, the resulting data frame has the actor name and the number of occurrences in the data frame.
Happy Analysis!!!

Categories

Resources