I have a large file (>500 rows) with multiple data points for each unique item in the list, something like:
cheese    weight   location
gouda     1.4      AL
gouda     2        TX
gouda     1.2      CA
cheddar   5.3      AL
cheddar   6        MN
cheddar   2        WA
Havarti   4        CA
Havarti   4.2      AL
I want to make data frames for each cheese to store the relevant data.
I have this:
import pandas as pd

main_cheese_file = pd.read_csv('CheeseMaster.csv')
# unique cheese names become the dictionary keys
cut_the_cheese = main_cheese_file.cheese.unique()
melted = {elem: pd.DataFrame() for elem in cut_the_cheese}
for key in melted:  # iterate over the keys (a dict has no .slice() method)
    melted[key] = main_cheese_file[main_cheese_file.cheese == key]
to split it up on the unique values I want.
What I want to do with it is make df's that can be exported for each cheese with the cheese name as the file name.
So far I can force it with
melted['Cheddar'].to_csv('Cheddar.csv')
and get the Cheddars...
but I don't want to have to know and type out each type of cheese on the list of 500 rows...
Is there a way to add this to my loop?
You can just iterate over a groupby object; each iteration yields the group key (here, the cheese name) and the sub-frame of its rows:
import pandas as pd

df = pd.read_csv('CheeseMaster.csv')
for k, v in df.groupby('cheese'):
    v.to_csv(f'{k}.csv', index=False)
I have a big dataframe like:
product price serial category department origin
0 cookies 4 2345 breakfast food V
1 paper 0.5 4556 stationery work V
2 spoon 2 9843 kitchen household M
I want to convert to dict, but I just want an output like:
{serial: 2345}{serial: 4556}{serial: 9843} and {origin: V}{origin: V}{origin: M}
where the key is the column name and the value is the cell value.
Now, I've tried df.to_dict('values') and selected dic['origin'], which returns:
{0: V}{1:V}{2:M}
I've also tried df.to_dict('records'), but it gives me:
{product: cookies, price: 4, serial:2345, category: breakfast, department:food, origin:V}
and I don't know how to select only 'origin' or 'serial'.
You can do something like:
serial_dict = df[['serial']].to_dict('records')
origin_dict = df[['origin']].to_dict('records')
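Selecting a single column before converting keeps only that key in each record; a quick sketch of the expected output, using toy data matching the question:

import pandas as pd

df = pd.DataFrame({'serial': [2345, 4556, 9843], 'origin': ['V', 'V', 'M']})
print(df[['serial']].to_dict('records'))
# [{'serial': 2345}, {'serial': 4556}, {'serial': 9843}]
print(df[['origin']].to_dict('records'))
# [{'origin': 'V'}, {'origin': 'V'}, {'origin': 'M'}]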
This is what I'm trying to accomplish with my code: I have a current csv file with tennis player names, and I want to add new players to it once they show up in the rankings. My script goes through the rankings and creates an array, then imports the names from the csv file. It is supposed to see which names are not in the latter, and then extract online data for those names. Then, I just want the new rows to be appended at the end of that old CSV file.

My issue is that the new row is being indexed with the player's name rather than following the index of the old file. Any ideas why that's happening? Also, why is an unnamed column being added?
def get_all_players():
    # imports names of players currently in the atp rankings
    current_atp_ranking = check_atp_rankings()
    current_player_list = current_atp_ranking['Player']
    # clean up names in case of white spaces
    for i in range(0, len(current_player_list)):
        current_player_list[i] = current_player_list[i].strip()
    # reads the main file and makes a dataframe out of it
    current_file = 'ATP_stats_new.csv'
    df = pd.read_csv(current_file)
    # gets all the names within the main file to see which current ones aren't there
    names_on_file = list(df['Player'])
    # cleans up in case of any white spaces
    for i in range(0, len(names_on_file)):
        names_on_file[i] = names_on_file[i].strip()
    # Removing Nadal for testing purposes
    names_on_file.remove("Rafael Nadal")
    # creating a list of players in current_player_list but not in names_on_file
    new_player_list = [x for x in current_player_list if x not in names_on_file]
    # loop through new_player_list
    for player in new_player_list:
        # delay to avoid stopping
        time.sleep(2)
        # finding the player's atp link for profile based on their name
        atp_link = current_atp_ranking.loc[current_atp_ranking['Player'] == player, 'ATP_Link']
        atp_link = atp_link.iloc[0]
        # make a basic dictionary with just the player's name and link
        player_dict = [{'Name': player, 'ATP_Link': atp_link}]
        # enter the new dictionary into the existing main file
        df.append(player_dict, ignore_index=True)
    # print dataframe to see how it looks before exporting
    print(df)
    # export dataframe into current file
    df.to_csv(current_file)
This is what the file looks like at first:
Unnamed: 0 Player ... Coach Turned_Pro
0 0 Novak Djokovic ... NaN NaN
1 1 Rafael Nadal ... Carlos Moya, Francisco Roig 2001.0
2 2 Roger Federer ... Ivan Ljubicic, Severin Luthi 1998.0
3 3 Daniil Medvedev ... NaN NaN
4 4 Dominic Thiem ... NaN NaN
... ... ... ... ... ...
1976 1976 Brian Bencic ... NaN NaN
1977 1977 Boruch Skierkier ... NaN NaN
1978 1978 Majed Kilani ... NaN NaN
1979 1979 Quentin Gueydan ... NaN NaN
1980 1980 Preston Brown ... NaN NaN
And this is what the new row looks like:
1977 1977.0 ... NaN
1978 1978.0 ... NaN
1979 1979.0 ... NaN
1980 1980.0 ... NaN
Rafael Nadal NaN ... 2001
There are critical parts of your code missing that are necessary to answer the question precisely. Two thoughts based on what you posted:
Importing Your CSV File
Your previous csv file was probably saved with the index. Make sure the first column of the csv file does not contain the dataframe's index from the last time you saved it. When you save, do the following:
file.to_csv('file.csv', index=False)
When you load the file like this:
pandas.read_csv('file.csv')
it will automatically assign the index numbers and there won't be a duplicate column.
Misordering of Columns
I am not sure what info, and in what order, atp_link is pulling in. From what you provided, it looks like it is returning two columns: "Coach" and "Turned_Pro".
I would recommend creating a list (not a dict) for each new player after you pull the info from atp_link. So if you are adding Nadal, his info list would look like this:
info_list = ['Rafael Nadal', '','2001']
Then you append the list to the dataframe like this:
df.loc[len(df),:] = info_list
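Putting both suggestions together, a minimal sketch (the three-column frame is an assumption for illustration; the real file has more columns, and info_list must match them in length and order):

import pandas as pd

# Toy frame with three of the question's columns, just for illustration
df = pd.DataFrame({'Player': ['Novak Djokovic'], 'Coach': [''], 'Turned_Pro': ['']})

info_list = ['Rafael Nadal', 'Carlos Moya, Francisco Roig', '2001']
df.loc[len(df), :] = info_list               # appended under the next integer index
df.to_csv('ATP_stats_new.csv', index=False)  # no index column written to disk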
Hope this helps.
I am using PySpark (Python 3, Spark 2.1.0) and I have a list of lists, such as:
lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
The sublists have different lengths. Now I would like to create a DataFrame from this list, where the columns come from the first element of each pair (i.e. 'FILE', 'NAME', 'SURNAME', 'BIRTHDATE', 'NATIONALITY') and the data comes from the second element.
As you can see, the second sublist does not have the 'BIRTHDATE' attribute; I need the DataFrame to fill that column with NaN or a blank in its place.
Also, I need DataFrame to be like this:
FILE NAME SURNAME BIRTHDATE NATIONALITY
----------------------------------------------------
123.xml ANA LÓPEZ 05-05-2000 ESP
458.xml JUAN PÉREZ NaN ESP
789.xml PEDRO CASTRO 07-07-2007 ESP
The data from the sublists has to land in the matching columns.
I have tried this code, but the result doesn't look like the table I want:
dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
d = dictOfWords
tabla = pd.DataFrame(dict([ (k,pd.Series(v)) for k,v in dictOfWords.items() ]))
tabla_final = tabla.transpose()
tabla_final
Also, I have done this:
dictOfWords = { i : lista_archivos[i] for i in range(0, len(lista_archivos) ) }
print(dictOfWords)
tabla = pd.DataFrame.from_dict(dictOfWords, orient='index')
tabla
And the result is not good.
I would like a pandas DataFrame and a Spark DataFrame, if possible.
Thanks!!
The following should work in your case:
In [5]: lista_archivos = [[['FILE','123.xml'],['NAME','ANA'],['SURNAME','LÓPEZ'],
...: ['BIRTHDATE','05-05-2000'],['NATIONALITY','ESP']], [['FILE','458.xml'],
...: ['NAME','JUAN'],['SURNAME','PÉREZ'],['NATIONALITY','ESP']], [['FILE','789.xml'],
...: ['NAME','PEDRO'],['SURNAME','CASTRO'],['BIRTHDATE','07-07-2007'],['NATIONALITY','ESP']]]
In [6]: pd.DataFrame(list(map(dict, lista_archivos)))
Out[6]:
BIRTHDATE FILE NAME NATIONALITY SURNAME
0 05-05-2000 123.xml ANA ESP LÓPEZ
1 NaN 458.xml JUAN ESP PÉREZ
2 07-07-2007 789.xml PEDRO ESP CASTRO
Essentially, you convert your sublists to dict objects, and feed a list of those to the data-frame constructor. The data-frame constructor works with list-of-dicts very naturally.
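For the Spark side the asker mentioned, a minimal sketch, assuming a SparkSession is available (as in a PySpark shell) and reusing lista_archivos from above:

from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame(list(map(dict, lista_archivos)))
# Spark cannot infer a type for a column that mixes strings with NaN floats,
# so replace NaN with None before handing the frame to Spark
sdf = spark.createDataFrame(pdf.where(pdf.notnull(), None))
sdf.show()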
I have a CSV of the format
Team, Player
What I want to do is apply a filter to the field Team, then take a random subset of 3 players from EACH team.
So for instance, my CSV looks like:
Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
...
I want to end up with an XLS consisting of 3 random players from each team, or only 1 or 2 in the case where a team has fewer than 3, e.g.:
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
I started out using xlrd; my original post is here.
I am now trying to use Pandas as I believe this will be more flexible into the future.
So, in pseudocode, what I want to do is:
foreach(team in csv)
print random 3 players + team they are assigned to
I've been looking through Pandas and trying to find the best approach to doing this, but I can't find anything similar to what I want to do (it's a difficult thing to Google!). Here's my attempt so far:
import pandas as pd
from collections import defaultdict
import csv as csv

columns = defaultdict(list)  # each value in each column is appended to a list

with open('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv') as f:
    reader = csv.DictReader(f)  # read rows into a dictionary format
    for row in reader:          # read a row as {column1: value1, column2: value2, ...}
        print(row)
        # for (k, v) in row.items():  # go over each column name and value
        #     columns[k].append(v)    # append the value into the appropriate list
        #                             # based on column name k
So I have commented out the last two lines, as I am not really sure if they are needed. I now have each row being printed, so I just need to select 3 random rows for each football team (or 1 or 2 in the case where there are fewer).
How can I accomplish this ? Any tips/tricks?
Thanks.
First, use the better-optimised read_csv to load your file:

import pandas as pd

df = pd.read_csv('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv')
Now, as an example, get a random subset by sampling random row positions (replace 'x' with your own column, e.g. LivFC):

In [382]:
import numpy as np

df = pd.DataFrame()
df['x'] = np.arange(0, 10, 1)
df['y'] = np.arange(0, 10, 1)
df['x'] = df['x'].astype(str)
df['y'] = df['y'].astype(str)
df['x'].iloc[np.random.randint(0, len(df), 10)][:3]

Out[382]:
0    0
3    3
7    7
Name: x, dtype: object
This will make you more familiar with pandas; however, starting with version 0.16.x there is a built-in DataFrame.sample method:

df = pandas.DataFrame(data)

# Randomly sample 70% of your dataframe
df_frac = df.sample(frac=0.7)

# Randomly sample 7 elements from your dataframe
df_7 = df.sample(n=7)

For either approach above, you can get the rest of the rows by doing:

df_rest = df.loc[~df.index.isin(df_frac.index)]
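Neither snippet above groups by team, though. A minimal sketch of the per-team sampling the question asks for, assuming the header is exactly "Team, Player" (skipinitialspace strips the space after the comma) and that an Excel writer such as xlwt is installed for the .xls export:

import pandas as pd

df = pd.read_csv('C:\\Users\\ADMIN\\Desktop\\CSV_1.csv', skipinitialspace=True)

# up to 3 random players per team; smaller teams keep all their players
picks = (df.groupby('Team', group_keys=False)
           .apply(lambda g: g.sample(n=min(3, len(g)))))

picks.to_excel('random_players.xls', index=False)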
Hi, I'm learning data science and am trying to build a list of big data companies from a list of companies in various industries.
I have a list of row numbers for the big data companies, named comp_rows.
Now I'm trying to make a new dataframe with the filtered companies based on those row numbers. Here I need to add rows to an existing dataframe, but I got an error. Could someone help?
My dataframe looks like this:
company_url company tag_line product data
0 https://angel.co/billguard BillGuard The fastest smartest way to track your spendin... BillGuard is a personal finance security app t... New York City · Financial Services · Security ...
1 https://angel.co/tradesparq Tradesparq The world's largest social network for global ... Tradesparq is Alibaba.com meets LinkedIn. Trad... Shanghai · B2B · Marketplaces · Big Data · Soc...
2 https://angel.co/sidewalk Sidewalk Hoovers (D&B) for the social era Sidewalk helps companies close more sales to s... New York City · Lead Generation · Big Data · S...
3 https://angel.co/pangia Pangia The Internet of Things Platform: Big data mana... We collect and manage data from sensors embedd... San Francisco · SaaS · Clean Technology · Big ...
4 https://angel.co/thinknum Thinknum Financial Data Analysis Thinknum is a powerful web platform to value c... New York City · Enterprise Software · Financia...
My code is below:
bigdata_comp = DataFrame(data=None, columns=['company_url', 'company', 'tag_line', 'product', 'data'])

for count, item in enumerate(data.iterrows()):
    for number in comp_rows:
        if int(count) == int(number):
            bigdata_comp.append(item)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-234-1e4ea9bd9faa> in <module>()
4 for number in comp_rows:
5 if int(count) == int(number):
----> 6 bigdata_comp.append(item)
7
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.pyc in append(self, other, ignore_index, verify_integrity)
3814 from pandas.tools.merge import concat
3815 if isinstance(other, (list, tuple)):
-> 3816 to_concat = [self] + other
3817 else:
3818 to_concat = [self, other]
TypeError: can only concatenate list (not "tuple") to list
It seems you are trying to filter an existing dataframe based on indices (which are stored in your variable called comp_rows). You can do this without using loops by using loc, as shown below:
In [1161]: df1.head()
Out[1161]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
We will get the rows with indices 'a','b' and 'c', for all columns:
In [1162]: df1.loc[['a','b','c'],:]
Out[1162]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
You can read more about it here.
About your code:
1.
You do not need to iterate through a list to see if an item is present in it:
Use the in operator. For example -
In [1199]: 1 in [1,2,3,4,5]
Out[1199]: True
so, instead of

for number in comp_rows:
    if int(count) == int(number):

do this:

if count in comp_rows:
2.
pandas append does not happen in-place. You have to store the result into another variable. See here.
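For example, a minimal sketch of reassigning the result:

import pandas as pd

df = pd.DataFrame({'A': [1, 2]})
row = pd.DataFrame({'A': [3]})

df = df.append(row, ignore_index=True)  # reassign: append returns a new frame
# equivalent in modern pandas (append was removed in 2.0):
# df = pd.concat([df, row], ignore_index=True)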
3.
Appending one row at a time is a slow way to do what you want.
Instead, save each row that you want to add into a list of lists, make a dataframe of it, and append it to the target dataframe in one go. Something like this:
temp = []
for count, item in enumerate(df1.loc[['a','b','c'],:].iterrows()):
    # if count in comp_rows:
    temp.append(list(item[1]))
In [1233]: temp
Out[1233]:
[[1.9350940285526077,
-0.16057932637141861,
-0.17345827000000605,
0.43326722021644282],
[1.66963201034217,
-1.1308932586268696,
-1.2103527446031515,
0.82213753819050794],
[0.49462218161377397,
1.0140133740187862,
0.2156547595968879,
1.0451391564351897]]
In [1236]: df2 = df1.append(pd.DataFrame(temp, columns=['A','B','C','D']))
In [1237]: df2
Out[1237]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
f -0.872135 2.938475 -0.099367 -1.472519
0 1.935094 -0.160579 -0.173458 0.433267
1 1.669632 -1.130893 -1.210353 0.822138
2 0.494622 1.014013 0.215655 1.045139
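A side note, hedged for newer environments: DataFrame.append was removed in pandas 2.0, so the same one-shot append is now written with concat:

import pandas as pd

# same result as df1.append(...) above; new rows keep their own 0..n-1 labels
df2 = pd.concat([df1, pd.DataFrame(temp, columns=['A', 'B', 'C', 'D'])])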
Replace the following line:

for count, item in enumerate(data.iterrows()):

with

for count, (index, item) in enumerate(data.iterrows()):

or even simply:

for count, item in data.iterrows():

(iterrows yields (index, row) pairs, so in the last form count is the row's index label, which only matches the positional count when the frame has a default 0..n-1 index).