Hi I'm learning data science and am trying to make a big data company list from a list with companies in various industries.
I have a list of row numbers for big data companies, named comp_rows.
Now, I'm trying to make a new dataframe with the filtered companies based on the row numbers. Here I need to add rows to an existing dataframe but I got an error. Could someone help?
My dataframe looks like this:
company_url company tag_line product data
0 https://angel.co/billguard BillGuard The fastest smartest way to track your spendin... BillGuard is a personal finance security app t... New York City · Financial Services · Security ...
1 https://angel.co/tradesparq Tradesparq The world's largest social network for global ... Tradesparq is Alibaba.com meets LinkedIn. Trad... Shanghai · B2B · Marketplaces · Big Data · Soc...
2 https://angel.co/sidewalk Sidewalk Hoovers (D&B) for the social era Sidewalk helps companies close more sales to s... New York City · Lead Generation · Big Data · S...
3 https://angel.co/pangia Pangia The Internet of Things Platform: Big data mana... We collect and manage data from sensors embedd... San Francisco · SaaS · Clean Technology · Big ...
4 https://angel.co/thinknum Thinknum Financial Data Analysis Thinknum is a powerful web platform to value c... New York City · Enterprise Software · Financia...
My code is below:
bigdata_comp = DataFrame(data=None,columns=['company_url','company','tag_line','product','data'])
for count, item in enumerate(data.iterrows()):
    for number in comp_rows:
        if int(count) == int(number):
            bigdata_comp.append(item)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-234-1e4ea9bd9faa> in <module>()
4 for number in comp_rows:
5 if int(count) == int(number):
----> 6 bigdata_comp.append(item)
7
/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/frame.pyc in append(self, other, ignore_index, verify_integrity)
3814 from pandas.tools.merge import concat
3815 if isinstance(other, (list, tuple)):
-> 3816 to_concat = [self] + other
3817 else:
3818 to_concat = [self, other]
TypeError: can only concatenate list (not "tuple") to list
It seems you are trying to filter an existing dataframe based on indices (which are stored in your variable called comp_rows). You can do this without loops by using loc, as shown below:
In [1161]: df1.head()
Out[1161]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
We will get the rows with indices 'a','b' and 'c', for all columns:
In [1162]: df1.loc[['a','b','c'],:]
Out[1162]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
You can read more about it here.
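Applied to the question, the same idea might look like the sketch below. It assumes comp_rows holds positional row numbers (0, 1, 2, ...), in which case iloc (selection by position) is the counterpart of loc (selection by index label). The small data frame and the comp_rows values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's dataframe (values shortened).
data = pd.DataFrame({
    "company": ["BillGuard", "Tradesparq", "Sidewalk", "Pangia", "Thinknum"],
    "data": ["Financial Services", "Big Data", "Big Data", "SaaS", "Enterprise Software"],
})

# Assumed row numbers of the big data companies.
comp_rows = [1, 2]

# iloc selects rows by position; no loop or append is needed.
bigdata_comp = data.iloc[comp_rows]
print(bigdata_comp)
```

If comp_rows held index labels instead of positions, data.loc[comp_rows] would be the equivalent call.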
About your code:
1. You do not need to iterate through a list to see if an item is present in it. Use the in operator. For example:
In [1199]: 1 in [1,2,3,4,5]
Out[1199]: True
So, instead of
for number in comp_rows:
    if int(count) == int(number):
do this
if count in comp_rows:
2. pandas append does not happen in-place. You have to store the result in a variable. See here.
3. Appending one row at a time is a slow way to do what you want.
Instead, save each row that you want to add into a list of lists, make a dataframe of it and append it to the target dataframe in one go. Something like this:
temp = []
for count, item in enumerate(df1.loc[['a','b','c'],:].iterrows()):
    # if count in comp_rows:
    temp.append( list(item[1]))
## -- End pasted text --
In [1233]: temp
Out[1233]:
[[1.9350940285526077,
-0.16057932637141861,
-0.17345827000000605,
0.43326722021644282],
[1.66963201034217,
-1.1308932586268696,
-1.2103527446031515,
0.82213753819050794],
[0.49462218161377397,
1.0140133740187862,
0.2156547595968879,
1.0451391564351897]]
In [1236]: df2 = df1.append(pd.DataFrame(temp, columns=['A','B','C','D']))
In [1237]: df2
Out[1237]:
A B C D
a 1.935094 -0.160579 -0.173458 0.433267
b 1.669632 -1.130893 -1.210353 0.822138
c 0.494622 1.014013 0.215655 1.045139
d -0.628889 0.223170 -0.616019 -0.264982
e -0.823133 0.385790 -0.654533 0.582255
f -0.872135 2.938475 -0.099367 -1.472519
0 1.935094 -0.160579 -0.173458 0.433267
1 1.669632 -1.130893 -1.210353 0.822138
2 0.494622 1.014013 0.215655 1.045139
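One caveat for newer installations: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so on recent versions the same "collect first, combine once" pattern is usually written with pd.concat. A minimal sketch, using a made-up df1 with only two columns:

```python
import pandas as pd

# Made-up frame standing in for df1.
df1 = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

# Collect the wanted rows first ...
temp = [list(row) for _, row in df1.loc[["a", "b"], :].iterrows()]

# ... then build one frame and concatenate once. concat returns a new
# object (df1 is left unchanged), so the result has to be assigned.
extra = pd.DataFrame(temp, columns=df1.columns)
df2 = pd.concat([df1, extra])
print(df2)
```

As with append, the new rows keep their own default integer index unless ignore_index=True is passed.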
Replace the following line:
for count, item in enumerate(data.iterrows()):
by
for count, (index, item) in enumerate(data.iterrows()):
or even simply as
for count, item in data.iterrows():
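The reason this matters: iterrows() yields (index, row) tuples, so in the original loop item was a tuple, and that tuple is what append tried to concatenate to a list in the traceback. A small sketch with a made-up two-row frame shows the shape of each item:

```python
import pandas as pd

# Made-up frame with a default integer index.
data = pd.DataFrame({"company": ["BillGuard", "Tradesparq"]})

# Each item from iterrows() is a tuple: (index label, row as a Series).
first = next(data.iterrows())
print(type(first))

# Unpacking separates the label from the row itself.
index, row = first
print(index, row["company"])
```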
user_id  partner_name  order_sequence
2        Star Bucks    1
2        KFC           2
2        MCD           3
6        Coffee Store  1
6        MCD           2
9        KFC           1
I am trying to figure out which two-restaurant combinations occur most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array that has [[Star Bucks, KFC], [KFC, MCD]].
However, each time the user_id changes, for instance, in lines 3 and 4, the code should skip adding this combination.
Also, if a user has only one entry in the table, for instance, the user with user_id 9, then this user should not be added to the list because he did not visit two or more restaurants.
The final result I am expecting for this table is:
[[Star Bucks, KFC], [KFC,MCD], [Coffee Store, MCD]]
I have written the following lines of code but so far, I am unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx,x in enumerate(df['order_sequence']):
    if x!=1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx+1])
        arr2.append(arr1)
You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
    .map(lambda l: list(zip(l, l[1:])))
    .sum()
)
with the same result.
You might have to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
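Since the stated goal is to find which combinations occur most often, the list of pairs can then be counted. The sketch below rebuilds the sample table and feeds the pairs to collections.Counter; the counting step is an addition for illustration, not part of the answer above.

```python
from collections import Counter

import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "user_id": [2, 2, 2, 6, 6, 9],
    "partner_name": ["Star Bucks", "KFC", "MCD", "Coffee Store", "MCD", "KFC"],
    "order_sequence": [1, 2, 3, 1, 2, 1],
})
df = df.sort_values(["user_id", "order_sequence"])

# Consecutive pairs within each user; users with one visit yield no pair.
pairs = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]

# Count how often each (first, second) pair occurs.
pair_counts = Counter(pairs)
print(pair_counts.most_common(3))
```

In this tiny sample every pair occurs once, so the ranking only becomes meaningful on the full data.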
I am a rookie. Collectively, I have about two weeks of experience with any sort of computer code.
I created a dictionary with some baseball player names, their batting order, and the position they play. I'm trying to get it to print out in columns with "order", "name", and "position" as headings, with the order numbers, positions, and names under those. Think spreadsheet kind of layout (stackoverflow won't let me format the way I want to in here).
order name position
1 A Baddoo LF
2 J Schoop 1B
3 R Grossman DH
I'm new here, so apparently you have to click the link to see what I wrote. Dodgy, I know...
As you can see, I have tried 257 times to get this thing to work. I've consulted the google, a python book, and other sources to no avail.
Here is the working code
for order in lineup["order"]:
    order -= 1
    position = lineup["position"][order]
    name = lineup["name"][order]
    print(order + 1, name, position)
Your Dictionary consists of 3 lists - if you want entry 0 you will have to access entry 0 in each list separately.
This would be an easier way to use a dictionary
players = {1: ["LF", "A Baddoo"],
           2: ["1B", "J Schoop"]}
for player in players:
    print(players[player])
Hope this helps
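If you would rather keep the three-list dictionary and still get aligned columns, one possible approach is zip together with f-string width specifiers. This is only a sketch: the dictionary is shortened to the three players shown in the question, and the column widths (6 and 12) are arbitrary choices.

```python
# Shortened version of the lineup dictionary from the question.
lineup = {
    "order": [1, 2, 3],
    "name": ["A Baddoo", "J Schoop", "R Grossman"],
    "position": ["LF", "1B", "DH"],
}

# Header row: left-align each heading in a fixed-width column.
lines = [f"{'order':<6}{'name':<12}{'position'}"]

# zip walks the three lists in step, one player per iteration.
for order, name, position in zip(lineup["order"], lineup["name"], lineup["position"]):
    lines.append(f"{order:<6}{name:<12}{position}")

print("\n".join(lines))
```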
Pandas is good for creating tables like the one you describe. You can convert a dictionary into a dataframe/table. The only catch is that all the columns need to be the same length, which is not the case here. The columns order, position, and name each have 9 items in their list, while pitcher has only 1. So what we can do is pull out the pitcher key using pop. This leaves you with a dictionary with just the 3 columns mentioned above, plus a list with your 1 item for pitcher.
import pandas as pd
lineup = {'order': [1,2,3,4,5,6,7,8,9],
'position': ['LF', '1B', 'RF','DH','3B','SS','2B','C','CF'],
'name': ['A Baddoo','J Schoop','R Grossman','M Cabrera','J Candelario', 'H Castro','W Castro','D Garneau','D Hill'],
'pitcher':['Matt Manning']}
pitching = lineup.pop('pitcher')
starting_lineup = pd.DataFrame(lineup)
print("Today's Detroit Tigers starting lineup is: \n", starting_lineup.to_string(index=False), "\nPitcher: ", pitching[0])
Output:
Today's Detroit Tigers starting lineup is:
order position name
1 LF A Baddoo
2 1B J Schoop
3 RF R Grossman
4 DH M Cabrera
5 3B J Candelario
6 SS H Castro
7 2B W Castro
8 C D Garneau
9 CF D Hill
Pitcher: Matt Manning
How do I print out entries in a df using a keyword search? I have a legislative database I'm running a list of climate keywords against:
climate_key_words = ['climate','gas','coal','greenhouse','carbon monoxide','carbon',\
'carbon dioxide','education',\
'gas tax','regulation']
Here's my for loop:
for bill in df.title:
    for word in climate_key_words:
        if word in bill:
            print(bill)
            print(word)
            print(df.state)
            print('------------')
When it prints, df.state forces everything to print funky:
24313 AK
24314 AK
24315 AK
24316 AK
24317 AK
Name: state, Length: 24318, dtype: object
------------
Relating to limitations on food regulations at farms, farmers' markets, and cottage food production operations.
regulation
But when print(df.state) is absent, it looks much nicer:
------------
Higher education; providing for the protection of certain expressive activities.
education
------------
Schools; allowing a school district board of education to amend certain policy to stock inhalers. Effective date. Emergency.
education
------------
How can I include df.state (and other values) and have them printed only once?
Ideally, my output should look like this:
###bill
###corresponding title
###corresponding state
print(df.state) is going to print out the column/field 'state'. You presumably want the state associated with that row of the dataframe?
So I would suggest tweaking your approach slightly and doing something like:
for row in range(dataframe.shape[0]):  # for each row in the dataframe
    for word in keywords:
        # 'bill', 'state' and 'title' stand for your actual column names
        if word in dataframe.iloc[row]['bill']:
            print(dataframe.iloc[row]['bill'])  # access values in the df by row, then column
            print(dataframe.iloc[row]['state'])
            print(dataframe.iloc[row]['title'])
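A variant of the same idea, sketched with the names from the question (df, climate_key_words, and the title and state columns), iterates over rows with itertuples so each row's own state is available next to its title. The two rows and the state codes below are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the legislative database.
df = pd.DataFrame({
    "title": [
        "Higher education; providing for the protection of certain expressive activities.",
        "Relating to road maintenance funding.",
    ],
    "state": ["OK", "AK"],
})
climate_key_words = ["climate", "education", "regulation"]

matches = []
for row in df.itertuples(index=False):
    # Keep every keyword found in this row's title.
    found = [word for word in climate_key_words if word in row.title]
    if found:
        matches.append((row.title, found, row.state))
        print(row.title)
        print(found)
        print(row.state)  # one value for this row, not the whole column
        print("------------")
```

Collecting the keywords per row also means a bill that matches several keywords is printed once rather than once per keyword.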
Update:
Now I am getting the error below. I think the original error occurred because I needed two square brackets in the code for the DataFrame to be recognized as a DataFrame rather than a Series. Now I have this:
AttributeError: 'list' object has no attribute 'str'
Updated Code
## Create list of Privileged accounts
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values) # joining list for comparison
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF.columns=[['first_name']].str.contains(pattern)
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['PrivilegedAccess'].map({True: 'True', False: 'False'})
ORIGINAL Question:
I'm just wondering how one might overcome the below error.
TypeError: only integer scalar arrays can be converted to a scalar index
I am using the below code on a pandas dataframe (126 rows × 6 columns)
What I am trying to do is create a new column "PrivilegedAccess" and in this column I want to write "True" if any of the names in the first_names column match the ones outlined in the "Search_for_These_values" list and "False" if they don't
SAMPLE DATA:
uid last_name first_name language role email_address department
0 121 Chad Diagnostics English Team Lead Michael.chad#gmail.com Data Scientist
1 253 Montegu Paulo Spanish CIO Paulo.Montegu#gmail.com Marketing
2 545 Mitchel Susan English Team Lead Susan.Mitchel#gmail.com Data Scientist
3 555 Vuvko Matia Polish Marketing Lead Matia.Vuvko#gmail.com Marketing
4 568 Sisk Ivan English Supply Chain Lead Ivan.Sisk#gmail.com Supply Chain
5 475 Andrea Patrice Spanish Sales Graduate Patrice.Andrea#gmail.com Sales
6 365 Akkinapalli Cherifa French Supply Chain Assistance Cherifa.Akkinapalli#gmail.com Supply Chain
CODE:
## Create list of Privileged accounts
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values) # joining list for comparison
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['first_name'].str.contains(pattern)
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['PrivilegedAccess'].map({True: 'Yes', False: 'No'})
Many thanks
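A note on the AttributeError in the update: [['first_name']] on its own is a plain Python list, not a column selection, so it has no .str accessor (and that line also assigns to .columns). The cause of the original TypeError cannot be determined from what is posted. As a sketch of the intended technique only, the code below runs str.contains on a single-bracket column selection, using three rows modelled on the sample data; na=False is an added assumption so that missing names count as non-matches.

```python
import pandas as pd

# A few rows modelled on the sample data.
PrivilegedAccounts_DF = pd.DataFrame({
    "uid": [121, 253, 545],
    "first_name": ["Diagnostics", "Paulo", "Susan"],
})

Search_for_These_values = ["Privileged", "Diagnostics", "SYS", "service account"]
pattern = "|".join(Search_for_These_values)  # regex alternation of the search terms

# Single brackets return a Series, which is what exposes the .str accessor.
mask = PrivilegedAccounts_DF["first_name"].str.contains(pattern, na=False)
PrivilegedAccounts_DF["PrivilegedAccess"] = mask.map({True: "True", False: "False"})
print(PrivilegedAccounts_DF)
```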
Hi I'm learning data science and am making a big data company list with some attributes.
Currently my dataframe, data, looks like this.
company_url company tag_line product data
0 https://angel.co/billguard BillGuard The fastest smartest way to track your spendin... BillGuard is a personal finance security app t... New York City · Financial Services · Security ...
1 https://angel.co/tradesparq Tradesparq The world's largest social network for global ... Tradesparq is Alibaba.com meets LinkedIn. Trad... Shanghai · B2B · Marketplaces · Big Data · Soc...
2 https://angel.co/sidewalk Sidewalk Hoovers (D&B) for the social era Sidewalk helps companies close more sales to s... New York City · Lead Generation · Big Data · S...
3 https://angel.co/pangia Pangia The Internet of Things Platform: Big data mana... We collect and manage data from sensors embedd... San Francisco · SaaS · Clean Technology · Big ...
4 https://angel.co/thinknum Thinknum Financial Data Analysis Thinknum is a powerful web platform to value c... New York City · Enterprise Software · Financia...
I want to sort the "data" column with certain keywords such as "big data" and make a new dataframe with the rows.
I was thinking to first find the fitting rows and then, put them into a list and sort the dataframe, data, based on the rows list but I got an error for the first part.
My code:
comp_rows = []
a = ['Data','Analytics','Machine Learning','Deep','Mining']
for count, item in enumerate(data.data):
    if any(x in item for x in a):
        comp_rows.append(count)
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-174-afeee7d3d179> in <module>()
3
4 for count, item in enumerate(data.data):
----> 5 if any(x in item for x in a):
6 comp_rows.append(count)
<ipython-input-174-afeee7d3d179> in <genexpr>((x,))
3
4 for count, item in enumerate(data.data):
----> 5 if any(x in item for x in a):
6 comp_rows.append(count)
TypeError: argument of type 'float' is not iterable
Could someone help me on this?
It would work if every value in data.data were a string, but a float was found there (most likely a missing value, NaN). Try replacing
if any(x in item for x in a):
with
if any(x in str(item) for x in a):
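An alternative that avoids the loop is the vectorized str.contains, where na=False makes missing values count as non-matches. This is a sketch under assumptions: the three-row frame is made up, and joining the keywords with "|" works here only because none of them contains regex special characters.

```python
import pandas as pd

# Made-up column with a missing value (NaN is a float).
data = pd.DataFrame({
    "company": ["BillGuard", "Tradesparq", "NoTags"],
    "data": ["New York City · Financial Services", "Shanghai · B2B · Big Data", float("nan")],
})
a = ["Data", "Analytics", "Machine Learning", "Deep", "Mining"]

# na=False treats missing values as non-matches instead of propagating NaN.
mask = data["data"].str.contains("|".join(a), na=False)

# Filtered dataframe and, if still needed, the matching row labels.
bigdata_comp = data[mask]
comp_rows = list(mask[mask].index)
print(comp_rows)
```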