I love you all. First time with Python: I am reading in a CSV with 10842 cities and counting how many occurrences there are of each. When I print to the terminal, it outputs the first 29 cities, prints ..., and then prints rows 10813 - 10842. This is the code:
import pandas as pd
df = pd.read_csv('Csz.csv')
s = df['City'].value_counts().rename('Total_City')
df = df.join(s, on='City')
print(df)
I'm a bit lost on how to get all of them to print, and hopefully after will figure out how to remove duplicates. Thank you for your help!
Put this in your code right after the imports:
pd.options.display.max_rows = 999
See the docs for the full explanation:
https://pandas.pydata.org/pandas-docs/stable/options.html
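If you'd rather not raise the display limit, to_string() renders every row, and drop_duplicates handles the follow-up question about removing duplicates. A minimal sketch with made-up city data (the real Csz.csv isn't shown here):

```python
import pandas as pd

# Hypothetical stand-in for the real Csz.csv.
df = pd.DataFrame({'City': ['Austin', 'Boston', 'Austin', 'Chicago', 'Boston', 'Austin']})

# Attach the per-city totals, as in the question.
s = df['City'].value_counts().rename('Total_City')
df = df.join(s, on='City')

# to_string() prints every row regardless of display.max_rows.
print(df.to_string())

# Dropping duplicate cities leaves one row per city with its total.
unique_cities = df.drop_duplicates(subset='City')
print(unique_cities.to_string())
```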
Related
I have the following code, but when I execute it, it prints age_groups_list 17 times. Any idea why?
import pandas as pd
file = pd.read_csv(r"file location")
age_groups_list = []
for var in file[1:]:
    age = file.iloc[:, 10]
    age_groups_list.append(age)
print(age_groups_list)
The idea is that I have a CSV file with 16,000+ rows and 20 columns. I am picking the age group from index 10, adding it to a list, and then printing the list. However, the list prints 17 times (the image showed the end of the printed output).
Any idea what I am doing wrong here?
Thanks
file.iloc[:, 10] already gives you all the data you need; the loop is useless. What you see is actually a list of lists (one copy of the column appended per loop iteration).
Change it to this:
import pandas as pd
file = pd.read_csv(r"file location")
age_groups_list = file.iloc[:, 10]
print(age_groups_list)
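If you genuinely need a plain Python list rather than a Series (the variable name suggests you do), .tolist() does the conversion. A small sketch with a made-up 20-column frame standing in for the real file:

```python
import pandas as pd

# Hypothetical stand-in for the real 20-column CSV.
file = pd.DataFrame({i: range(5) for i in range(20)})
file[10] = ['18-25', '26-35', '18-25', '36-45', '26-35']

# Column at positional index 10, as a Series...
age_groups = file.iloc[:, 10]

# ...and as a plain Python list, if that is what you actually need.
age_groups_list = age_groups.tolist()
print(age_groups_list)
```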
So, I was working on the Titanic dataset to extract the title (Mr, Ms, Mrs) from the Name column of a DataFrame (df). It has 1309 rows.
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind+1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
This piece of code gives the following output:
nan 1309
It should have stopped at ind=1308, but it goes one step further even though nothing tells it to.
What could be the flaw here? Is it because I am using 1-based indexing of the DataFrame?
If so, what can be done to prevent such behaviour?
I am new to this platform, so please ask for clarification in case of any discrepancies.
Here is a short example:
import numpy as np
import pandas as pd
dict1 = {'Name':['Hey, Mr.','Hello, Ms.','Hi, Mrs,','Welcome, Master.','Yes, Mr.'],'ind':[1,2,3,4,5]}
df = pd.DataFrame(data = dict1)
df.set_index('ind')
for ind, name in enumerate(df['Name']):
    if type(name) == str:
        inf = name.find(', ') + 2
        df.loc[ind+1, 'Title'] = name[inf:name.find('.')]
    else:
        print(name, ind)
print(df['Title'])
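For what it's worth, a likely culprit in the example above is that df.set_index('ind') returns a new frame rather than modifying df in place, so the loop still writes by ind+1 against the default 0-based index; on the last pass, .loc silently enlarges the frame with a NaN row, and re-running the loop then trips over that extra row. A sketch of a loop-free alternative that sidesteps the off-by-one entirely, using str.extract on the question's sample names:

```python
import pandas as pd

dict1 = {'Name': ['Hey, Mr.', 'Hello, Ms.', 'Hi, Mrs.', 'Welcome, Master.', 'Yes, Mr.'],
         'ind': [1, 2, 3, 4, 5]}
# Note: set_index returns a copy, so assign the result back.
df = pd.DataFrame(data=dict1).set_index('ind')

# Vectorized extraction: everything between ', ' and the next '.'
df['Title'] = df['Name'].str.extract(r', (.*?)\.', expand=False)
print(df['Title'])
```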
Most answers to questions that looked relevant weren't actually helpful, and the people answering couldn't work out why their suggestions didn't work. I have tried pretty much every str() and .to_string variation I could find.
Anyway, I've been trying to get the data in a file paired up and to omit data I can't pair. I believe I've paired things together, but there's no way for me to verify this other than seeing a column of True or False values.
import pandas as pd
# read file
with open('TastyTrades.csv', 'r') as trade_history:
    trade_reader = pd.read_csv('TastyTrades.csv')
# sort out for options only
options_frame = trade_reader.loc[(trade_reader['Instrument Type'] == 'Equity Option')]
# resort data
date_frame = options_frame.sort_values(by=['Symbol', 'Date', 'Action'], ascending=True)
# pair BTO to STC
BTO = date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])
STO = date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])
# bringing both frames as one
pairs = [BTO, STO]
# return readable data
result = pd.concat(pairs).astype(str)
# write to new file
result.to_csv('new_taste.csv')
This code brings me:
,Action
101,True
75,True
102,False
76,False
95,False
97,True
98,True
38,True
174,True
166,True
I am trying to get the data back to the readable format:
Date,Type,Action,Symbol,Instrument Type,Description,Value,Quantity,Average Price,Commissions,Fees,Multiplier,Underlying Symbol,Expiration Date,Strike Price,Call or Put
2020-02-14T15:49:12-0500,Trade,SELL_TO_OPEN,TGT 200327C00127000,Equity Option,Sold 1 TGT 03/27/20 Call 127.00 # 1.33,133,1,133,-1,-0.15,100,TGT,3/27/2020,127,CALL
2020-02-14T15:49:11-0500,Trade,SELL_TO_OPEN,TGT 200327P00107000,Equity Option,Sold 1 TGT 03/27/20 Put 107.00 # 1.80,180,1,180,-1,-0.15,100,TGT,3/27/2020,107,PUT
2020-02-14T15:49:11-0500,Trade,BUY_TO_OPEN,TGT 200327C00128000,Equity Option,Bought 1 TGT 03/27/20 Call 128.00 # 1.17,-117,1,-117,-1,-0.14,100,TGT,3/27/2020,128,CALL
Here BTO and STO only hold the result of the isin condition (True or False).
So rewrite those two lines of yours as below:
BTO = date_frame[date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])]
STO = date_frame[date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])]
This will give all the columns to BTO and STO, and then you can merge these two DataFrames.
Hope this helps.
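To make the difference concrete, here is a minimal sketch with made-up rows: a bare isin call yields a boolean mask, while indexing the frame with that mask keeps every column of the matching rows.

```python
import pandas as pd

# Hypothetical miniature of the trade history.
date_frame = pd.DataFrame({
    'Action': ['BUY_TO_OPEN', 'SELL_TO_OPEN', 'SELL_TO_CLOSE', 'BUY_TO_CLOSE'],
    'Symbol': ['TGT 200327C00127000'] * 4,
})

# A bare .isin() call returns a boolean mask (True/False per row)...
mask = date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])

# ...while indexing with that mask returns the matching rows, all columns intact.
BTO = date_frame[date_frame['Action'].isin(['BUY_TO_OPEN', 'SELL_TO_CLOSE'])]
STO = date_frame[date_frame['Action'].isin(['SELL_TO_OPEN', 'BUY_TO_CLOSE'])]

result = pd.concat([BTO, STO])
print(result)
```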
Check the working code below. I tried it and got the expected result: all the rows. I also tried without converting to str, and that gave me the same result. Try printing result and see what it shows.
BTO = quotes[quotes['Action'].isin(['BTO', 'STC'])]
STO = quotes[quotes['Action'].isin(['STO', 'BTC'])]
frames = [BTO,STO]
result = pd.concat(frames).astype(str)
result.to_csv('new_taste.csv')
So I have a CSV file of users which is in the format:
"Lastname, Firstname account_last_used_date"
I've tried the dateutil parser, but it says this list is an invalid string. I need to keep the names and the dates together. I've also tried datetime, but I'm having issues with "datetime not defined". I'm very new to Python, so forgive me if I've missed an easy solution.
import re
from datetime import date
with open("5cUserReport.csv", "r") as dilly:
    li = [x.replace("\n", "") for x in dilly]
li2 = [x.replace(",", "") for x in li]
for x in li2:
    match = re.search(r"\d{2}-\d{2}-\d{4}", x)
    date = datetime.strptime(match.group(), "%d-%m-%Y").x()
    print(date)
The end goal is I need to check if the date the user last logged in is longer than 4 months. Honestly, any help here is massively welcome!
The CSV format is:
am_testuser1 02/12/2017 08:42:48
am_testuser11 13/10/2017 17:44:16
am_testuser20 27/10/2017 16:31:07
am_testuser5 23/08/2017 09:42:41
am_testuser50 21/10/2017 15:38:12
Edit: edited the answer based on the given CSV.
You could do something like this with pandas:
import pandas as pd
colnames = ['Lastname, Firstname', 'Date', 'Time']
df = pd.read_csv('5cUserReport.csv', delim_whitespace=True, skiprows=1, names=colnames, parse_dates={'account_last_used_date': [1, 2]}, dayfirst=True)
more_than_4_months_ago = df[df['account_last_used_date'] < (pd.to_datetime('now') - pd.DateOffset(months=4))]
print(more_than_4_months_ago)
The DataFrame more_than_4_months_ago will give you a subset of all records, based on if the account_last_used_date is more than 4 months ago.
This is based on the given format, although I doubt that this is your actual format, since the given usernames don't really match 'Lastname, Firstname'.
Lastname, Firstname account_last_used_date
am_testuser1 02/12/2017 08:42:48
am_testuser11 13/10/2018 17:44:16
am_testuser20 27/10/2017 16:31:07
am_testuser5 23/08/2018 09:42:41
am_testuser50 21/10/2017 15:38:12
(I edited 2 lines to 2018, so that the test actually shows that it works).
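For anyone without the file on disk, here is an equivalent sketch using inline data, with an explicit pd.to_datetime format string instead of the parse_dates dict (which newer pandas versions deprecate):

```python
import pandas as pd
from io import StringIO

# Inline stand-in for 5cUserReport.csv (hypothetical rows, same layout).
raw = """Lastname, Firstname account_last_used_date
am_testuser1 02/12/2017 08:42:48
am_testuser11 13/10/2018 17:44:16
"""

colnames = ['user', 'Date', 'Time']
df = pd.read_csv(StringIO(raw), sep=r'\s+', skiprows=1, names=colnames)

# Combine the date and time columns and parse them day-first.
df['account_last_used_date'] = pd.to_datetime(
    df['Date'] + ' ' + df['Time'], format='%d/%m/%Y %H:%M:%S')

cutoff = pd.Timestamp.now() - pd.DateOffset(months=4)
stale = df[df['account_last_used_date'] < cutoff]
print(stale)
```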
For a project for my lab, I'm analyzing Twitter data. The tweets we've captured all have the word 'sex' in them; that's the keyword we filtered the TwitterStreamer to capture on.
I converted the CSV where all of the tweet data (JSON metatags) is housed into a pandas DataFrame and saved the 'text' column to isolate the tweet text.
import pandas as pd
import csv
df = pd.read_csv('tweets_hiv.csv')
saved_column4 = df.text
print saved_column4
Out comes the correct output:
0 Some example tweet text
1 Oh hey look more tweet text #things I hate #stuff
...a bunch more lines
Name: text, Length: 8540, dtype: object
But, when I try this
from textblob import TextBlob
tweetstr = str(saved_column4)
tweets = TextBlob(tweetstr).upper()
print tweets.words.count('sex', case_sensitive=False)
My output is 22.
There should be AT LEAST as many incidences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not configuring right around a dtype:object?
I'm not entirely sure this is methodologically correct as far as language processing goes, but using join will give you the count you need.
import pandas as pd
from textblob import TextBlob
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweetstr = " ".join(tweets.tolist())
tweetsb = TextBlob(tweetstr).upper()
print tweetsb.words.count('sex', case_sensitive=False)
# 1000
If you just need the count without necessarily using TextBlob, then just do:
import pandas as pd
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
sex_tweets = tweets.str.contains('sex', case=False)
print sex_tweets.sum()
# 1000
You can get a TypeError in the first snippet if one of your elements is not a string; this is really a join issue. A simple test can be done using the following snippet:
# tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweets = pd.Series(x for x in range(1000))
tweetstr = " ".join(tweets.tolist())
Which gives the following result:
Traceback (most recent call last):
File "F:\test.py", line 6, in <module>
tweetstr = " ".join(tweets.tolist())
TypeError: sequence item 0: expected string, numpy.int64 found
A simple workaround is to convert x in the list comprehension into a string before using join, like so:
tweets = pd.Series(str(x) for x in range(1000))
Or you can be more explicit and create a list first, map the str function to it, and then use join.
tweetlist = tweets.tolist()
tweetstr = map(str, tweetlist)
tweetstr = " ".join(tweetstr)
The CSV conversion is not the problem! When you use str() on a column of a DataFrame (that is, a Series), it makes a "print-friendly" output of the Series, which means cutting out the majority of the data, and just displaying the first few and the last few. Here is a transcript of an IPython session that will probably illustrate the issue better:
In [1]: import pandas as pd
In [2]: blah = pd.Series('tweet %d' % n for n in range(1000))
In [3]: blah
Out[3]:
0 tweet 0
1 tweet 1
... (output continues from 1 to 29)
29 tweet 29
... (OUTPUT SKIPS HERE)
970 tweet 970
... (output continues from 970 to 998)
998 tweet 998
999 tweet 999
dtype: object
In [4]: blahstr = str(blah)
In [5]: blahstr.count('tweet')
Out[5]: 60
So, since the output of the str() operation cuts off my data (and might even truncate column values if I had used longer strings), I don't get 1000, I get 60.
If you want to do it your way (combine everything back into a single string and work with it that way), there's no point in using a library like Pandas. Pandas gives you better ways:
Working With a Series of Strings
Pandas has tools for working with a Series that contains strings. Here is a tutorial-like page about it, and here is the full string handling API documentation. In particular, for finding the number of uses of the word "sex", you could do something like this (assuming df is a DataFrame, and text is the column containing the tweets):
import re
counts = df['text'].str.count('sex', re.IGNORECASE)
counts should be a Series containing the number of occurrences of "sex" in each tweet. counts.sum() would give you the total number of usages, which should hopefully be more than 1000.
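As a quick sanity check on that approach, with a throwaway frame standing in for the tweet data:

```python
import re
import pandas as pd

# Hypothetical frame with a 'text' column like the tweet data.
df = pd.DataFrame({'text': ['Sex ed matters', 'no match here', 'sex sex sex']})

# Per-row occurrence counts, case-insensitive via the flags argument.
counts = df['text'].str.count('sex', re.IGNORECASE)
total = counts.sum()
print(total)
```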