I have a dataframe which contains topic and keyword columns, as shown below:
topic keyword
0 ['player', 'team', 'word_finder_unscrambler', ...
1 ['weather', 'forecast', 'sale', 'philadelphia'...
2 ['name', 'state', 'park', 'health', 'dog', 'ce...
3 ['game', 'flight', 'play', 'game_live', 'play_...
4 ['dictionary', 'clue', 'san_diego', 'professor...
I need to create a separate text file for each topic, named topic1.txt, topic2.txt, ..., topic20.txt, and each topic file should contain the strings from the keyword column, one per line, something like this:
topic1.txt file should contain:
player
team
word_finder_unscrambler
etc
For each row of the DataFrame, create a new txt file named after the topic column value plus 1:
import csv
for t, k in zip(df['topic'], df['keyword']):
    with open(f"topic{t + 1}.txt", "w") as f:
        wr = csv.writer(f, delimiter="\n")
        wr.writerow(k)
EDIT: Because the keyword column contains strings rather than lists, use:
import csv, ast
for t, k in zip(df['topic'], df['keyword']):
    with open(f"topic{t + 1}.txt", "w") as f:
        wr = csv.writer(f, delimiter="\n")
        wr.writerow(ast.literal_eval(k))
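For comparison, the same files can be written without the csv module at all. A minimal sketch, using a plain dict of lists as a stand-in for the DataFrame columns (the keyword values are string-encoded lists, as in the EDIT):

```python
import ast

# Stand-in for the DataFrame columns from the question;
# keyword holds string representations of lists.
df = {
    'topic': [0, 1],
    'keyword': ["['player', 'team']", "['weather', 'forecast']"],
}

for t, k in zip(df['topic'], df['keyword']):
    # Parse the string back into a list, then join with newlines.
    with open(f"topic{t + 1}.txt", "w") as f:
        f.write("\n".join(ast.literal_eval(k)))
```

This sidesteps the csv writer entirely, which can be simpler when each line is just one keyword.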
I have a text file that contains information about customers for a bus company booking system.
The file is laid out as such:
id, name, customer discount, total money spent
E.g. part of the file is:
C1, James, 0, 100
C2, Lily, 0, 30
I want to import this information to a list in Python, but I only need the ids and names.
I've tried a few different ways of importing the information, but I can only import the whole file to a list and even then it always comes out like this:
[['C1', 'James', '0', '100'], ['C2', 'Lily', '0', '30']]
And I don't even know how to begin separating the items so that I can just have the id and name in the list.
Since your text file contains comma-separated values, the csv module will likely be most useful.
import csv
with open('data.txt', 'r') as fh:
    header = [h.strip() for h in next(fh).split(',')]  # remove spaces and use the header fields as dictionary keys
    reader = csv.DictReader(fh, fieldnames=header)  # read the row contents, assigning names to the fields
    for row in reader:
        print(row['id'], row['name'])
C1 James
C2 Lily
The useful part of the csv module here is that reading the file as a dictionary assigns the column names to each row's fields, making it easy to select the columns you want by name, such as row['id'] and row['name'].
Also, since you mentioned you want to "just have the id and name in the list", first create an empty list, then append each row's items to it like so:
import csv
id_name = [] # list to store ids, names
with open('data.txt', 'r') as fh:
    header = [h.strip() for h in next(fh).split(',')]
    reader = csv.DictReader(fh, fieldnames=header)
    for row in reader:
        # print(row['id'], row['name'])
        id_name.append([row['id'], row['name']])
print(id_name)  # print the resulting list
[['C1', ' James'], ['C2', ' Lily']]
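An alternative sketch without DictReader, using a plain csv.reader and slicing out the first two fields. Note the .strip() calls, which remove the leading spaces visible in the output above (' James', ' Lily'). The first lines here just create a dummy data.txt matching the question's layout:

```python
import csv

# Dummy stand-in for the question's data.txt file.
with open('data.txt', 'w') as fh:
    fh.write("id, name, customer discount, total money spent\n")
    fh.write("C1, James, 0, 100\n")
    fh.write("C2, Lily, 0, 30\n")

with open('data.txt') as fh:
    next(fh)  # skip the header line
    # Keep only the first two fields of each row, stripped of spaces.
    id_name = [[field.strip() for field in row[:2]]
               for row in csv.reader(fh)]

print(id_name)  # [['C1', 'James'], ['C2', 'Lily']]
```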
I have a CSV file hello.csv with cik numbers in the third column. I also have a second file cik.csv that maps each cik number (in column 1) to a related company (in column 4), and I want a list of the related companies for the cik numbers in hello.csv.
I tried it with a loop:
import csv

with open('hello.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list1 = list(readCSV)
    b = -1
    for j in list1:
        b = b + 1
        if b > 0:
            cik = j[2]
with open('cik.csv', 'r', encoding="latin-1") as csvfile:
    readCSV = csv.reader(csvfile, delimiter=',')
    list2 = list(readCSV)
I don't know how to find my cik in the csv file cik.csv and get the related company. Can I use pandas for this?
Use pandas to read in the two .csv files and map the respective values:
import pandas as pd
## create some dummy data
hello_csv="""
a,b,cik_numbers,d
'test',1,12, 5
'var', 6, 2, 0.1
"""
cik_csv="""
cik_numbers,b,c,related_companies
12,1,12, 'Apple'
13,6,20, 'Microsoft'
2,1,712,'Google'
"""
## note: you would rather give this a path to your csv files
# like: df_hello=pd.read_csv('/the/path/to/hello.csv')
from io import StringIO
df_hello = pd.read_csv(StringIO(hello_csv))
df_cik = pd.read_csv(StringIO(cik_csv))
## and add a new column to df_hello based on a mapping of cik_numbers
df_hello['related_companies'] = df_hello['cik_numbers'].map(df_cik.set_index('cik_numbers')['related_companies'])
print(df_hello)
yields:
a b cik_numbers d related_companies
0 'test' 1 12 5.0 'Apple'
1 'var' 6 2 0.1 'Google'
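For reference, the same lookup can be sketched with the csv module alone: build a dict from cik.csv mapping cik number to company, then look up each cik from hello.csv in it. The string literals here are dummy stand-ins for the two files from the question:

```python
import csv
import io

# Dummy stand-ins: cik.csv (cik in column 1, company in column 4)
# and hello.csv (cik in column 3).
cik_csv = "12,1,12,Apple\n13,6,20,Microsoft\n2,1,712,Google\n"
hello_csv = "test,1,12,5\nvar,6,2,0.1\n"

# Build a cik -> company lookup table from cik.csv.
companies = {row[0]: row[3] for row in csv.reader(io.StringIO(cik_csv))}

# Collect the related company for each cik in hello.csv.
related = [companies.get(row[2]) for row in csv.reader(io.StringIO(hello_csv))]
print(related)  # ['Apple', 'Google']
```

A dict lookup like this is essentially what the pandas .map() call above does internally.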
Question:
How can I split a list into two sublists where the elements are separated by a tab in the element?
Context:
I want to read a .txt file delimited by tabs into a Pandas DataFrame. The files look something like:
Column1 \t 123
Column2 \t
Column3 \t text
Meaning that each line has one column followed by one tab and then one value of the column (sometimes no value).
My idea was to read the file and save each line as an element of a list, then split the list into two keeping the first part before the tab as one list and the second part after the tab as another. Then build my dataframe from there.
for file in txt_files:  # iterate over all files
    f = open(file)  # open each file individually
    lines = f.readlines()  # read each line as an element into a list
    f.close()
    # make sublists: columns and values
You can read your files into a dataframe like this:
import pandas as pd
# Empty list to store dataframe rows
df_rows = []
# Read all text files
for tf in text_files:
    # For each file
    with open(tf) as f:
        # Empty dictionary to store column names and values
        df_dict = {}
        # For each line
        for line in f:
            # Split by tab, stripping the trailing newline from the value
            k, v = line.rstrip('\n').split('\t')
            # Column name as key, value as value
            df_dict[k] = v
    # Add the dictionary to the list
    df_rows.append(df_dict)
# Build a dataframe from the list of dictionaries
df = pd.DataFrame(df_rows)
# Preview the dataframe
df.head()
If I understand correctly, you can just transpose the dataframe read_csv will give you with delimiter='\t'.
Demo:
>>> from io import StringIO
>>> import pandas as pd
>>>
>>> file = StringIO('''Column1\t123
...: Column2\t
...: Column3\ttext''')
>>>
>>> df = pd.read_csv(file, delimiter='\t', index_col=0, header=None).T
>>> df
0 Column1 Column2 Column3
1 123 NaN text
(If your delimiter is really ' \t ' then use delimiter=' \t ' and engine='python').
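The split-into-two-lists idea from the question can also be sketched directly in plain Python. Here lines stands in for the raw lines read from one file:

```python
# Dummy stand-in for the lines read from one file.
lines = ["Column1\t123\n", "Column2\t\n", "Column3\ttext\n"]

# Split each line at the tab into (column, value) pairs...
pairs = [line.rstrip('\n').split('\t') for line in lines]
# ...then unzip the pairs into two parallel lists.
columns, values = (list(t) for t in zip(*pairs))

print(columns)  # ['Column1', 'Column2', 'Column3']
print(values)   # ['123', '', 'text']
```

The zip(*pairs) idiom transposes a list of pairs into two sequences, which is exactly the "keep the part before the tab as one list and the part after as another" step.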
Intro Python question: I am working on a program that counts the number of politicians in each political party for each session of the U.S. Congress. I'm starting from a .csv with biographical data, and wish to export my political party membership count as a new .csv. This is what I'm doing:
import pandas as pd
read = pd.read_csv('30.csv', delimiter = ';', names = ['Name', 'Years', 'Position', 'Party', 'State', 'Congress'])
party_count = read.groupby('Party').size()
with open('parties.csv', 'a') as f:
    party_count.to_csv(f, header=False)
This updates my .csv to read as follows:
'Year','Party','Count'
'American Party',1
'Democrat',162
'Independent Democrat',3
'Party',1
'Whig',145
I next need to include the date under my first column ('Year'). This is contained in the 'Congress' column in my first .csv. What do I need to add to my final line of code to make this work?
Here is a snippet from the original .csv file I am drawing from:
'Name';'Years';'Position';'Party';'State';'Congress'
'ABBOTT, Amos';'1786-1868';'Representative';'Whig';'MA';'1847'
'ADAMS, Green';'1812-1884';'Representative';'Whig';'KY';'1847'
'ADAMS, John Quincy';'1767-1848';'Representative';'Whig';'MA';'1847'
You can merge the counts of Party back into your original dataframe with:
party_count = df.groupby('Party').size().reset_index(name='Count')
df = df.merge(party_count, on='Party', how='left')
Once you have the counts, you can select your data. For example, if you need [Congress, Party, Count], you can use:
out_df = df[['Congress', 'Party', 'Count']].drop_duplicates()
out_df.columns = ['Year', 'Party', 'Count']
Here, out_df is the dataframe you can write to your my.csv file:
out_df.to_csv('my.csv', index=False)
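An alternative that skips the merge entirely: group on both columns at once, so the year is kept alongside each count. A minimal sketch, with a small DataFrame standing in for the question's 30.csv:

```python
import pandas as pd

# Dummy rows shaped like the question's file.
read = pd.DataFrame({
    'Party': ['Whig', 'Whig', 'Democrat'],
    'Congress': ['1847', '1847', '1847'],
})

# Counting per (Congress, Party) pair keeps the year in the result.
party_count = (read.groupby(['Congress', 'Party']).size()
                   .reset_index(name='Count')
                   .rename(columns={'Congress': 'Year'}))
print(party_count)
```

Since every row of one input file shares the same Congress value, this gives the same counts as grouping by Party alone, with the Year column attached for free.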
I have a CSV file that I need to read and process in python.
The CSV file contains tabular values as follows:
*aa
1 foo1 foo_bar1
2 foo2 foo_bar2
*bb
1.22 bla1 blabla1 blablabla22
1.33 bla2 ' ' blablabla33
Here aa and bb are the names of each table. Wherever table names occur, the name is preceded by a * and the rows below it are the rows of that table.
Note that each table can have:
a different number of columns as well as rows.
There can also be empty columns representing missing values. I would like to keep them as ' ' after reading in.
However, we know exactly which tables are present in the csv file (i.e. the table names)
I need to read in the csv file and assign a table's entire content to one variable. I can think of a brute force way of doing this. However, since python has a csv module with read write operations, is there any built in functionality that could make this easier or more efficient for me?
Note: One of the major problems I've faced so far is that after reading in the csv file using csv.reader(), I see that aa's rows have additional empty columns. I believe this is because of the mismatch between the number of aa's and bb's columns. I also want to get rid of these additional empty columns without deleting the empty columns that represent actual missing values.
The cleanest way is to separate the tables before feeding each group to the csv reader. Here is a rough cut to get you started:
from itertools import takewhile
import csv
# Instead of *s*, you can use an open file object here
s = '''\
*aa
1,foo1,foo_bar1
2,foo2,foo_bar2
*bb
1.22,bla1,blabla1,blablabla22
1.33,bla2, ,blablabla33
'''.splitlines()
it = iter(s)
next(it)  # skip the first '*aa' marker line
for table in ['aa', 'bb']:
    print(f'\nTable: {table}')
    for row in csv.reader(takewhile(lambda r: not r.startswith('*'), it)):
        print(row)
This produces:
Table: aa
['1', 'foo1', 'foo_bar1']
['2', 'foo2', 'foo_bar2']
Table: bb
['1.22', 'bla1', 'blabla1', 'blablabla22']
['1.33', 'bla2', ' ', 'blablabla33']
You could parse your csv file like so, checking whether the first value starts with a '*', and build a dict from it.
import csv
from collections import defaultdict
import pprint
csv_data = defaultdict(list)
with open('data.csv', 'r') as csv_file:
    # filter empty lines
    csv_reader = csv.reader(filter(lambda l: l.strip(',\n'), csv_file))
    header = None
    for row in csv_reader:
        if row[0].startswith('*'):
            header = row[0]
        else:
            # additional row processing if needed
            csv_data[header].append(row)
pprint.pprint(csv_data)
# Output
defaultdict(<class 'list'>,
            {'*aa': [['1', ' foo1', 'foo_bar1', ''],
                     ['2', ' foo2', 'foo_bar2', '']],
             '*bb': [['1.22', ' bla1', 'blabla1', 'blablabla22'],
                     ['1.33', ' bla2', ' ', 'blablabla33']]})
If you want to remove the excess elements from a table due to another being larger, one option is
csv_data[header].append(row[:col_nums[header]])
where, as you mentioned, you know how many columns each table should have:
col_nums = {'*aa': 3, '*bb': 4}
defaultdict(<class 'list'>,
            {'*aa': [['1', ' foo1', 'foo_bar1'],
                     ['2', ' foo2', 'foo_bar2']],
             '*bb': [['1.22', ' bla1', 'blabla1', 'blablabla22'],
                     ['1.33', ' bla2', ' ', 'blablabla33']]})
If I misread it and you only know the max number of columns rather than the number of columns for each table, then you could instead do:
def trim_row(row):
    # Walk backwards from the end until the first non-empty item,
    # then cut off the trailing empty items.
    for i, item in enumerate(reversed(row)):
        if item:
            break
    return row[:len(row) - i]

# use it like so
csv_data[header].append(trim_row(row))
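As a self-contained check of the trimming idea, here is one way to write the helper plus a couple of sample calls. Note that ' ' (a single space) is truthy, so the placeholders that represent genuine missing values survive the trim:

```python
def trim_row(row):
    # Walk backwards from the end until the first non-empty item,
    # then cut off the trailing empty items.
    for i, item in enumerate(reversed(row)):
        if item:
            break
    return row[:len(row) - i]

# Trailing '' padding from the csv reader is removed...
print(trim_row(['1', 'foo1', 'foo_bar1', '', '']))  # ['1', 'foo1', 'foo_bar1']
# ...but a ' ' placeholder inside the row is kept intact.
print(trim_row(['1.33', 'bla2', ' ', 'blablabla33']))  # unchanged
```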
Have you considered using pandas?
import pandas as pd
df = pd.read_csv('foo.csv', sep=r'\s+', header=None)  # if there are table headings, remove header=None
You do not need to add any line to the top of the file.
This reads files with different numbers of rows and columns into a dataframe, and you can perform all sorts of actions on it. For example, empty elements are represented by NaN, which means Not a Number. You can replace them with ' ' just by writing:
df = df.fillna(' ')
To fit your use case, from what I understand, you have multiple tables in the same csv file, so try this:
df = pd.read_csv("foo.csv", header=None, names=range(4))  # range over the max number of columns
table_names = ["*aa", "*bb", "*cc", ...]
groups = df[0].isin(table_names).cumsum()
tables = {g.iloc[0, 0]: g.iloc[1:] for k, g in df.groupby(groups)}
This will create a dictionary of tables, with the table name as key and the table itself as value.
for k, v in tables.items():
    print("table:", k)
    print(v)
    print()
You can find more details in the documentation.