I am new to Python/pandas and have a DataFrame with two columns: one holds song details (a list-like string) and the other a density value. I am looking to split the contents of the song-details column (a Series) into multiple columns. I'd appreciate your input on this.
This is my current DataFrame content:
Songdetails Density
0 ["'t Hof Van Commerce", "Chance", "SORETGR12AB... 4.445323
1 ["-123min.", "Try", "SOERGVA12A6D4FEC55"] 3.854437
2 ["10_000 Maniacs", "Please Forgive Us (LP Vers... 3.579846
3 ["1200 Micrograms", "ECSTACY", "SOKYOEA12AB018... 5.503980
4 ["13 Cats", "Please Give Me Something", "SOYLO... 2.964401
5 ["16 Bit Lolitas", "Tim Likes Breaks (intermez... 5.564306
6 ["23 Skidoo", "100 Dark", "SOTACCS12AB0185B85"] 5.572990
7 ["2econd Class Citizen", "For This We'll Find ... 3.756746
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
Desired output is SONG, ARTIST, SONG ID, DENSITY, i.e. the song details split into separate columns.
For example, for the sample row
SONG DETAILS DENSITY
8 ["2tall", "Demonstration", "SOYYQZR12A8C144F9D"] 5.472524
SONG ARTIST SONG ID DENSITY
2tall Demonstration SOYYQZR12A8C144F9D 5.472524
Thanks
The following worked for me:
In [275]:
pd.DataFrame(data = list(df['Song details'].values), columns = ['Song', 'Artist', 'Song Id'])
Out[275]:
Song Artist Song Id
0 2tall Demonstration SOYYQZR12A8C144F9D
1 2tall Demonstration SOYYQZR12A8C144F9D
For your data, please try:
pd.DataFrame(data=list(df['Songdetails'].values), columns=['SONG', 'ARTIST', 'SONG ID'])
Thank you, I had to do an insert of the Density column into the new data frame and was able to achieve what I needed:
df2 = pd.DataFrame(series.apply(lambda x: pd.Series(x.split(','))))
df2.insert(3, 'Density', finaldf['Density'])
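For reference, here is a compact, runnable sketch of that same flow under one assumption: that Songdetails holds comma-separated strings, as in the asker's follow-up (if it holds actual lists, the list(df['Songdetails'].values) approach from the answer above applies directly). The one-row frame below is a hypothetical stand-in:

import pandas as pd

# Hypothetical stand-in for the asker's finaldf
finaldf = pd.DataFrame({
    'Songdetails': ['2tall,Demonstration,SOYYQZR12A8C144F9D'],
    'Density': [5.472524],
})

# Split each string into three columns, then re-attach Density
df2 = finaldf['Songdetails'].str.split(',', expand=True)
df2.columns = ['SONG', 'ARTIST', 'SONG ID']
df2['DENSITY'] = finaldf['Density']
print(df2)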
I have a CSV file with a lot of rows and a different number of columns per row.
How can I group the data by column count and show it in different frames?
The CSV file has the following data:
1 OLEG US FRANCE BIG
1 OLEG FR 18
1 NATA 18
Because each row has a different number of columns, I have to group the rows by column count and show 3 frames so that I can set a header for each:
FR1:
ID NAME STATE COUNTRY HOBBY
1 OLEG US FRANCE BIG
FR2:
ID NAME COUNTRY AGE
1 OLEG FR 18
FR3:
ID NAME AGE
1 NATA 18
In other words, I need to group the rows by count of columns and show them in different DataFrames.
Since pandas doesn't allow different column lengths within one DataFrame, just don't use it to import your data. Your goal is to create three separate DataFrames, so first import the data as lists, then deal with the different lengths.
One way to solve this is to read the data with csv.reader and create the DataFrames with list comprehensions, together with a condition on the length of each list.
import csv
import pandas as pd

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

df1 = pd.DataFrame([item for item in data if len(item) == 3], columns='ID NAME AGE'.split())
df2 = pd.DataFrame([item for item in data if len(item) == 4], columns='ID NAME COUNTRY AGE'.split())
df3 = pd.DataFrame([item for item in data if len(item) == 5], columns='ID NAME STATE COUNTRY HOBBY'.split())
print(df1, df2, df3, sep='\n\n')
ID NAME AGE
0 1 NATA 18
ID NAME COUNTRY AGE
0 1 OLEG FR 18
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
If you would need to hardcode too many lines for the same step (e.g. too many DataFrames), you should consider using a loop to create them and store each DataFrame as a key/value pair in a dictionary.
EDIT
Here is a slightly optimized way of creating those DataFrames. I think you can't get around defining the lists of columns you want to use for the separate DataFrames, so you need to know which column-count variations occur in your data (unless you want to create those DataFrames without naming the columns).
col_list = [['ID', 'NAME', 'AGE'],
            ['ID', 'NAME', 'COUNTRY', 'AGE'],
            ['ID', 'NAME', 'STATE', 'COUNTRY', 'HOBBY']]

with open('input.csv', 'r') as f:
    reader = csv.reader(f, delimiter=' ')
    data = list(reader)

dict_of_dfs = {}
for cols in col_list:
    dict_of_dfs[f'df_{len(cols)}'] = pd.DataFrame(
        [item for item in data if len(item) == len(cols)], columns=cols)

for key, val in dict_of_dfs.items():
    print(f'{key=}: \n {val} \n')
key='df_3':
ID NAME AGE
0 1 NATA 18
key='df_4':
ID NAME COUNTRY AGE
0 1 OLEG FR 18
key='df_5':
ID NAME STATE COUNTRY HOBBY
0 1 OLEG US FRANCE BIG
Now you don't have separate variables for your DataFrames; instead you have them in a dictionary under descriptive keys. (I named each DataFrame after its number of columns, so df_3 is the DataFrame with three columns.)
If you need to import the data with pandas, you could have a look at this post.
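If you do want a pandas-only import, one plausible route (an assumption on my part, not necessarily what the linked post does) is to hand read_csv more column slots than any row uses, so short rows get padded with NaN, and then group by the count of non-NaN fields:

import pandas as pd

# Assumes no row in input.csv has more than five fields
raw = pd.read_csv('input.csv', sep=' ', header=None, names=range(5))

# Group rows by how many real (non-NaN) fields they have,
# dropping the all-NaN padding columns per group
for n, frame in raw.groupby(raw.notna().sum(axis=1)):
    print(f'{n} columns:', frame.dropna(axis=1, how='all'), sep='\n')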
I have a csv list of keywords in this format:
75410,Sportart
75419,Ballsport
75428,Basketball
76207,Atomenergie
76212,Atomkraftwerk
76223,Wiederaufarbeitung
76225,Atomlager
67869,Werbewirtschaft
I read the values using pandas and create a table in this format:
DF: name
id
75410 Sportart
75419 Ballsport
75428 Basketball
76207 Atomenergie
76212 Atomkraftwerk
... ...
251450 Tag und Nacht
241473 Kollektivverhalten
270930 Indigene Völker
261949 Wirtschaft und Politik
282512 Impfen
Using the name, I want to delete the whole row, e.g. 'Sportart' deletes the first row.
I want to check this against values from my wordList array; I store them as strings in a list.
What did I miss? Using the code below, I receive a '(value) not in axis' error.
df = pd.read_csv("labels.csv", header=None, index_col=0)
df.index.name = "id"
df.columns = ["name"]
print('DF: ', df)

df.drop(labels=wordList, axis=0, inplace=True)
pd_frame = pd.DataFrame(df)
cleaned_pd_frame = pd_frame.query('name != {}'.format(wordList))
I succeeded in hiding them with query(), but I want to remove them entirely.
You can use a helper function, index_to_drop below, to take in a name and filter its index out:
index_to_drop = lambda name: df.index[df['name']==name]
Then you can drop "Sportart" like:
df.drop(index_to_drop('Sportart'), inplace=True)
print(df)
Output:
id name
1 75419 Ballsport
2 75428 Basketball
3 76207 Atomenergie
4 76212 Atomkraftwerk
5 251450 Tag und Nacht
6 241473 Kollektivverhalten
7 270930 Indigene Völker
8 261949 Wirtschaft und Politik
9 282512 Impfen
That being said, this is a rather convoluted way to drop a row. The same outcome can be obtained much more simply with a boolean mask:
df = df[df['name']!='Sportart']
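And for the original wordList case, the same pattern extends to isin (a sketch, assuming wordList is the asker's list of name strings):

# Keep only rows whose name is not in wordList
df = df[~df['name'].isin(wordList)]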
I have a dataframe with 4 columns each containing actor names.
The same actors appear in several columns, and I want to find the actor or actress most present in the whole dataframe.
I used mode, but it doesn't do what I want: it gives me the most frequent actor in each column separately.
I would strongly advise you to use the Counter class from Python's collections module. With it, you can simply feed whole rows or columns into the object. The code would look like this:
import pandas as pd
from collections import Counter

# Artificially creating a DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)

# Creating the counter
counter = Counter()

# Inserting each whole row into the counter
for _, row in df.iterrows():
    counter.update(row)

print("counter object:")
print(counter)

# Show the two most common actors
for actor, occurrences in counter.most_common(2):
    print("Actor {} occurred {} times".format(actor, occurrences))
The output would look like this:
counter object:
Counter({'Will Smith': 4, 'Morgan Freeman': 3, 'Johnny Depp': 3, 'Mila Kunis': 3, 'Charlie Sheen': 3})
Actor Will Smith occurred 4 times
Actor Morgan Freeman occurred 3 times
The Counter object solves your problem quite fast, but be aware that counter.update expects an iterable of items. You should not update with a plain string; if you do, the counter counts the individual characters.
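To see that pitfall in isolation, a quick standalone check:

from collections import Counter

c = Counter()
c.update("Will Smith")       # iterates over the string: counts characters
print(c.most_common(2))      # [('i', 2), ('l', 2)]

c2 = Counter()
c2.update(["Will Smith"])    # one-element list: counts the full name once
print(c2)                    # Counter({'Will Smith': 1})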
Use stack and value_counts to get the entire list of actors/actresses:
df.stack().value_counts()
Using @Ofi91's setup:
# Artificially creating a DataFrame
actors = [
    ["Will Smith", "Johnny Depp", "Johnny Depp", "Johnny Depp"],
    ["Will Smith", "Morgan Freeman", "Morgan Freeman", "Morgan Freeman"],
    ["Will Smith", "Mila Kunis", "Mila Kunis", "Mila Kunis"],
    ["Will Smith", "Charlie Sheen", "Charlie Sheen", "Charlie Sheen"],
]
df = pd.DataFrame(actors)
df.stack().value_counts()
Output:
Will Smith 4
Morgan Freeman 3
Johnny Depp 3
Charlie Sheen 3
Mila Kunis 3
dtype: int64
To find the actor with the most appearances:
df.stack().value_counts().idxmax()
Output:
'Will Smith'
Let's consider your data frame to be like the one above.
First, we stack all columns into one column. Use the code below to achieve that:
df1 = df.stack().reset_index(drop=True).to_frame('actors')
Now take the value_counts of the actors column:
df2 = df1['actors'].value_counts().sort_values(ascending=False)
Here you go: the resulting data frame has each actor's name and the number of occurrences in the data frame.
Happy Analysis!!!
I have a dataframe with data in a format similar to this:
song lyric tokenized_lyrics
0 Song 1 Look at her face, it's a wonderful face [look , at , her ,face, it's a wonderful, face ]
1 Song 2 Some lyrics of the song taken [Some, lyrics ,of, the, song, taken]
I want to count the number of words in the lyrics per song, with output like:
song count
song 1 8
song 2 6
I tried the aggregate function but it is not yielding the correct result.
Code I tried:
df.groupby(['song']).agg(
    word_count=pd.NamedAgg(column='text', aggfunc='count')
)
How can I achieve the desired result?
I couldn't copy tokenized_lyrics as a list (it came in as a string), so I tokenized the lyric column myself, assuming the delimiter is whitespace:
df['token_count'] = df.lyric.str.replace(',','').str.split().str.len()
df.filter(['song','token_count'])
song token_count
0 Song 1 8
1 Song 2 6
Note that you can just apply .str.len() to the tokenized lyrics to get your count: since each entry is a list, it will count the individual items.
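As a small standalone check of that string-based route, using the question's raw lyric strings:

import pandas as pd

df = pd.DataFrame({
    'song': ['Song 1', 'Song 2'],
    'lyric': ["Look at her face, it's a wonderful face",
              "Some lyrics of the song taken"],
})

# Strip commas, split on whitespace, count the pieces
df['token_count'] = df.lyric.str.replace(',', '').str.split().str.len()
print(df.filter(['song', 'token_count']))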
Use Series.str.len to count the values and, if there are duplicated song values, aggregate with sum:
df1 = (df.assign(count=df['tokenized_lyrics'].str.len())
         .groupby('song', as_index=False)['count'].sum())
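For completeness, a runnable version of that idea on the sample data, assuming tokenized_lyrics already holds real lists (not strings):

import pandas as pd

df = pd.DataFrame({
    'song': ['Song 1', 'Song 2'],
    'tokenized_lyrics': [
        ['look', 'at', 'her', 'face', "it's", 'a', 'wonderful', 'face'],
        ['Some', 'lyrics', 'of', 'the', 'song', 'taken'],
    ],
})

# .str.len() on a Series of lists returns each list's length
df1 = (df.assign(count=df['tokenized_lyrics'].str.len())
         .groupby('song', as_index=False)['count'].sum())
print(df1)
#      song  count
# 0  Song 1      8
# 1  Song 2      6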
I am looping through a list of 103 FourSquare URLs to find "Coffee Shops."
I can create a DataFrame for each URL and print each DataFrame as I loop through the list (sample output at bottom).
I cannot figure out how to append the DataFrame for each URL into a single DataFrame as I loop through the list. My goal is to compile a single DataFrame from the DataFrames I am printing.
x = 0
while x < 103:
    results = requests.get(URLs[x]).json()

    def get_category_type(row):
        try:
            categories_list = row['categories']
        except:
            categories_list = row['venue.categories']
        if len(categories_list) == 0:
            return None
        else:
            return categories_list[0]['name']

    venues = results['response']['groups'][0]['items']
    nearby_venues = json_normalize(venues)  # flatten JSON

    # filter columns
    filtered_columns = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
    nearby_venues = nearby_venues.loc[:, filtered_columns]

    # filter the category for each row
    nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)

    # clean columns
    nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]

    dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
    print(x, '!!!', dfven, '\n')
    x = x + 1
Here is some output (I do get complete results):
0 !!! name categories lat lng
5 Tim Hortons Coffee Shop 43.80200 -79.198169
8 Tim Hortons / Esso Coffee Shop 43.80166 -79.199133
1 !!! Empty DataFrame
Columns: [name, categories, lat, lng]
Index: []
2 !!! name categories lat lng
5 Starbucks Coffee Shop 43.770367 -79.186313
18 Tim Hortons Coffee Shop 43.769591 -79.187081
3 !!! name categories lat lng
0 Starbucks Coffee Shop 43.770037 -79.221156
4 Country Style Coffee Shop 43.773716 -79.207027
I apologize if this is bad form or a breach of etiquette, but I solved my problem and figured I should post. Perhaps making the effort to state the problem for Stack Overflow helped me solve it?
First I learned how to ignore empty DataFrames:
dfven = nearby_venues.loc[nearby_venues['categories'] == 'Coffee Shop']
if not dfven.empty:
Once I added this check, my printed output was a clean series of identically formatted data frames, so appending them into one data frame was easy. I created a data frame at the beginning of my code (merge = pd.DataFrame()) and then added this line where I had been printing:
merge = merge.append(dfven)
Now my output is perfect.
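One caveat worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same pattern is usually written by collecting the frames in a list and concatenating once after the loop. A sketch with stand-in frames:

import pandas as pd

# Stand-ins for the per-URL results the loop would produce
results = [
    pd.DataFrame({'name': ['Tim Hortons'], 'categories': ['Coffee Shop']}),
    pd.DataFrame(columns=['name', 'categories']),  # an empty result
]

frames = [dfven for dfven in results if not dfven.empty]
merge = pd.concat(frames, ignore_index=True)  # one concat instead of repeated appends
print(merge)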